Understanding Spoken Language using Statistical and Computational Methods Steven Greenberg...
Understanding Spoken Language using
Statistical and Computational Methods
Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/~steveng (contains electronic versions of papers and links to data)
Patterns of Speech Sounds in Unscripted Communication - Production, Perception, Phonology. Akademie Sankelmark, October 8-11, 2000
OR ….
How I Learned to Stop Worrying and Use
The Canonical Form
Disclaimer
I am a Phonetician - NOT!
(many thanks for the invite)
No Scientist is an Island …
IMPORTANT COLLEAGUES
PHONETIC TRANSCRIPTION OF SPONTANEOUS SPEECH (SWITCHBOARD)
Candace Cardinal, Rachel Coulston, Dan Ellis, Eric Fosler, Joy Hollenback, John Ohala, Colleen Richey
STATISTICAL ANALYSIS OF PRONUNCIATION VARIATION
Eric Fosler, Leah Hitchcock, Joy Hollenback
ARTICULATORY-ACOUSTIC BASIS OF CONSONANT RECOGNITION
Leah Hitchcock, Rosaria Silipo
AUTOMATIC PHONETIC TRANSCRIPTION OF SPONTANEOUS SPEECH
Shawn Chang, Lokendra Shastri
Germane Publications
http://www.icsi.berkeley.edu/~steveng
STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING
Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the International Congress of Phonetic Sciences, San Francisco.
Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech. Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.
Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.
Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.
Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. Proc. Intern. Conf. Spoken Lang. (ICSLP), Philadelphia, pp. S24-27.
PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY
Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8.
Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. Proceedings of Eurospeech, Budapest.
AUTOMATIC PHONETIC TRANSCRIPTION AND SEGMENTATION
Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proc. Int. Conf. Spoken Lang. Proc., Beijing.
Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable detection and segmentation using temporal flow neural networks. Proceedings of the International Congress of Phonetic Sciences, San Francisco, pp. 1721-1724.
Prologue
Language - The Traditional Perspective
The “classical” view of spoken language posits a quasi-arbitrary relation between the lower and higher tiers of linguistic organization
Phonetic orthography
Language - A Syllable-Centric Perspective
A more empirical perspective of spoken language focuses on the syllable as the interface between “sound” and “meaning”
Within this framework the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic
Take Home Messages
• SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES
– Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus)
• THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL
– Onsets are pronounced in canonical (i.e., dictionary) fashion 80-90% of the time
– Nuclei and codas are expressed canonically only 60% of the time
– Nuclei tend to be realized as vowels different from the canonical form
– Codas are often deleted entirely
– Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position
– Therefore, it is important to model spoken language at the syllabic level
• THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY
– It may be unrealistic to assume that any phonetic transcription based exclusively on segments (such as the IPA) is truly capable of capturing the important phonetic detail of spontaneous material
Take Home Messages
• PHONETIC PROPERTIES OF SPONTANEOUS SPEECH REFLECT INFORMATION CONTENT
Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech, in the Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany .
Road Map
• PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH
– Provides the basis for the statistical analyses of spontaneous material
• A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF:
– Phonetic segments
– Words
– Syllables
– Articulatory-acoustic features
• PERCEPTUAL EVIDENCE
– The articulatory-acoustic basis of consonant recognition
– Not all articulatory-acoustic features are created equal - place-of-articulation cues appear to be most important for consonant recognition
• COMPUTATIONAL METHODS
– Automatic methods for phonetic transcription based on articulatory-acoustic features
– These are the most likely means of generating sufficient empirical data with which to rigorously test hypotheses germane to spoken language
Phonetic Transcription of Spontaneous (American) English
Phonetic Transcription of Spontaneous English
• TELEPHONE DIALOGUES OF 5-10 MINUTES DURATION - SWITCHBOARD
• AMOUNT OF MATERIAL MANUALLY TRANSCRIBED
– 3 hours labeled at the phone level and segmented at the syllabic level (this material was later phonetically segmented by automatic methods)
– 1 hour labeled and segmented at the phonetic-segment level
• DIVERSITY OF MATERIAL TRANSCRIBED
– Spans speech of both genders (ca. 50/50%) reflecting a wide range of American dialectal variation (6 regions + “army brat”), speaking rate and voice quality
• TRANSCRIBED BY WHOM?
– 7 undergraduates and 1 graduate student, all enrolled at UC-Berkeley. Most of the corpus was transcribed by three individuals out of the original eight
– Supervised by Steven Greenberg and John Ohala
• TRANSCRIPTION SYSTEM
– A variant of Arpabet, with phonetic diacritics such as: _gl, _cr, _fr, _n, _vl, _vd
• HOW LONG DOES TRANSCRIPTION TAKE? (Don’t Ask!)
– 388 times real time for labeling and segmentation at the phonetic-segment level
– 150 times real time for labeling phonetic segments and segmenting syllables
• HOW WAS LABELING AND SEGMENTATION PERFORMED?
– Using a display of the signal waveform, spectrogram, word transcription and “forced alignments” (estimates of phones and boundaries) + audio (listening at multiple time scales - phone, word, utterance) on Sun workstations
• DATA AVAILABLE AT - http://www.icsi.berkeley.edu/real/stp
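To give a feel for the "Don't Ask!" figures, the sketch below multiplies the hours of material by the quoted real-time factors. It assumes that each factor applies uniformly to the whole of the corresponding portion of the corpus, which the slide does not state; the names `MATERIAL` and `labeling_hours` are illustrative only.

```python
# Hours of speech at each labeling depth, paired with the real-time factor
# quoted on the slide. Assumption: each factor applies to the full portion.
MATERIAL = {
    "phone labels + syllable segmentation": (3.0, 150),
    "phone labels + phone segmentation": (1.0, 388),
}

def labeling_hours(speech_hours, real_time_factor):
    """Person-hours needed to transcribe `speech_hours` of audio."""
    return speech_hours * real_time_factor

effort = {name: labeling_hours(hours, rtf) for name, (hours, rtf) in MATERIAL.items()}
total = sum(effort.values())

for name, hours in effort.items():
    print(f"{name}: {hours:.0f} person-hours")
print(f"total: {total:.0f} person-hours")
```

Under these assumptions the four hours of speech correspond to several hundred person-hours of manual work, which is the motivation for the automatic methods discussed later in the talk.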
A Brief Tour of Pronunciation Variation in
Spontaneous American English
Cumulative Word Frequency in English
The 10 most common words account for 27% of the corpus
The 100 most common words account for 67% of the corpus
The 1000 most common words account for 92% of the corpus
Thus, most informal dialogues are composed of a relatively small number of common words.
However, it is the infrequent words that typically provide the precision and detail required for complex information transfer
[Figure: cumulative word frequency, marked at 27%, 67% and 92%]
Computed from the Switchboard corpus (American English telephone dialogues)
Focus on 100 most common words
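Coverage figures of this kind can be computed from any tokenized transcript. The following is a minimal sketch, not the procedure used for the talk; the function name `cumulative_coverage` and the toy transcript are assumptions, and the toy text is not Switchboard data.

```python
from collections import Counter

def cumulative_coverage(tokens, top_ns=(10, 100, 1000)):
    """Return {n: fraction of tokens covered by the n most frequent word types}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    ranked = [count for _, count in counts.most_common()]
    return {n: sum(ranked[:n]) / total for n in top_ns}

# Toy transcript standing in for a corpus (NOT Switchboard data).
toy = "i know and i think that you know that i know and so on".split()
coverage = cumulative_coverage(toy, top_ns=(1, 3))

for n, share in coverage.items():
    print(f"top {n} word types cover {100 * share:.0f}% of tokens")
```

On a real corpus the same function would be called with the default `top_ns` to obtain the 10/100/1000-word coverage quoted above.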
How Many Pronunciations of “And”?
 N  Pronunciation       N  Pronunciation
82  ae n                3  eh
63  eh n                2  ae n dcl
45  ix n                2  ae
35  ax n                2  ax m
34  en                  2  ax n d
30  n                   2  ae eh n dcl d
20  ae n dcl d          2  eh n dcl d
17  ih n                2  ax nx
17  q ae n              2  q ae ae n
11  ae n d              2  q ix n
 7  q eh n              2  ix n dcl d
 7  ae nx               2  ih
 6  ae ae n             2  eh eh n
 6  ah n                2  q eh nx
 5  eh nx               2  ix d n
 4  uh n                1  eh m
 4  ix nx               1  ax n dcl d
 4  q ae n dcl d        1  aw n
 3  eh n d              1  ae q
 3  q ae nx             1  eh dcl
How Many Pronunciations of “And”? (continued)
 N  Pronunciation       N  Pronunciation
 1  ah nx               1  m
 1  ae n t              1  ae ae n d
 1  eh d                1  nx
 1  ah n dcl d          1  q ae ae n
 1  ey ih n dcl         1  q ae ae n dcl d
 1  ae ix n             1  q ae eh n dcl d
 1  ae nx ax            1  q ae ih n
 1  ax ng               1  aa n
 1  ay n                1  q ae n d
 1  ih ah n d           1  ? nx
 1  ae hh               1  q ae n q
 1  ih ng               1  eh n m
 1  ix                  1  q eh en dcl
 1  ae n d dcl          1  eh ng
 1  ix dcl d            1  q eh n q
 1  ae eh n             1  em
 1  hh n                1  q eh ow m
 1  ix n t              1  q ih n
 1  ae ax n dcl d       1  q ix en
 1  iy eh n             1  er
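The tallies for "and" can be turned into the "most common pronunciation" share reported in the top-100 table. The sketch below uses only the six most frequent variants listed above and takes N = 521 from the row for "and" in that table; the helper name `most_common_share` is hypothetical.

```python
# The six most frequent variants of "and" from the tally above (Arpabet-style labels).
AND_VARIANTS = {"ae n": 82, "eh n": 63, "ix n": 45, "ax n": 35, "en": 34, "n": 30}
AND_TOKENS = 521  # N listed for "and" in the top-100 word table

def most_common_share(variants, total_tokens):
    """Return the most frequent variant and its percentage of all tokens."""
    pron, count = max(variants.items(), key=lambda item: item[1])
    return pron, 100.0 * count / total_tokens

pron, share = most_common_share(AND_VARIANTS, AND_TOKENS)
print(f"most common pronunciation: [{pron}], {share:.0f}% of tokens")
```

The result agrees with the 16% listed for "and" in the table, so even the single most common form accounts for only a small minority of the tokens.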
How Many Different Pronunciations?
N = number of tokens; #Pron = number of distinct pronunciations; MCP% = most common pronunciation as a percentage of the total
Rank  Word        N  #Pron  MCP%  Most common pronunciation
   1  I         649     53    53  ay
   2  and       521     87    16  ae n
   3  the       475     76    27  dh ax
   4  you       406     68    20  y ix
   5  that      328    117    11  dh ae
   6  a         319     28    64  ax
   7  to        288     66    14  tcl t uw
   8  know      249     34    56  n ow
   9  of        242     44    21  ax v
  10  it        240     49    22  ih
  11  yeah      203     48    43  y ae
  12  in        178     22    45  ih n
  13  they      152     28    60  dh ey
  14  do        131     30    54  dcl d uw
  15  so        130     14    74  s ow
  16  but       123     45    12  bcl b ah tcl t
  17  is        120     24    50  ih z
  18  like      119     19    46  l ay kcl k
  19  have      116     22    54  hh ae v
  20  was       111     24    23  w ah z
  21  we        108     13    83  w iy
  22  it's      101     14    20  ih tcl s
  23  just      101     34    17  jh ix s
  24  on         98     18    49  aa n
  25  or         94     23    36  er
  26  not        92     24    24  m aa q
  27  think      92     23    32  th ih ng kcl k
  28  for        87     19    46  f er
  29  well       84     49    23  w eh l
  30  what       82     40    14  w ah dx
  31  about      77     46    12  ax bcl b aw
  32  all        74     27    24  ao l
  33  that's     74     19    16  dh eh s
  34  oh         74     17    61  ow
  35  really     71     25    45  r ih l iy
  36  one        69      8    78  w ah n
  37  are        68     19    42  er
  38  I'm        67      9    26  q aa m
  39  right      61     21    28  r ay
  40  uh         60     16    41  ah
  41  them       60     18    23  ax m
  42  at         59     36     8  ae dx
  43  there      58     28    22  dh eh r
  44  my         58      9    66  m ay
  45  mean       56     10    58  m iy n
  46  don't      56     21    14  dx ow
  47  no         55      8    77  n ow
  48  with       55     20    35  w ih th
  49  if         55     18    41  ih f
  50  when       54     18    31  w eh n
  51  can        54     28    15  kcl k ae n
  52  then       51     19    38  dh eh n
  53  be         50     11    76  bcl b iy
  54  as         49     16    18  ae z
  55  out        47     19    22  ae dx
  56  kind       47     17    21  kcl k ax nx
  57  because    46     31    15  kcl k ax z
  58  people     45     21    44  pcl p iy pcl l el
  59  go         45      5    83  gcl g ow
  60  got        45     32    15  gcl g aa
  61  this       44     11    47  dh ih s
  62  some       43      4    48  s ah m
  63  would      41     16    29  w ih dcl
  64  things     41     15    52  th ih ng z
  65  now        39     11    69  n aw
  66  lot        39      9    47  l aa dx
  67  had        39     19    24  hh ae dcl
  68  how        39     11    53  hh aw
  69  good       38     13    27  gcl g uh dcl
  70  get        38     20    13  gcl g eh dx
  71  see        37      6    80  s iy
  72  from       36     10    28  f r ah m
  73  he         36      7    39  iy
  74  me         35      5    87  m iy
  75  don't      35     21    14  dx ow
  76  their      33     19    25  dh eh r
  77  more       32     11    56  m ao r
  78  it's       31     14    20  ih tcl s
  79  that's     31     20    16  dh eh s
  80  too        31      6    60  tcl t uw
  81  okay       31     17    45  ow kcl k ey
  82  very       30     11    36  v eh r iy
  83  up         30     11    34  ah pcl p
  84  been       30     11    51  bcl b ih n
  85  guess      29      8    42  gcl g eh s
  86  time       29      8    62  tcl t ay m
  87  going      29     21    13  gcl g ow ih ng
  88  into       28     20    14  ih n tcl t uw
  89  those      27     12    42  dh ow z
  90  here       27     11    25  hh iy er
  91  did        27     13    23  dcl d ih dx
  92  work       25      8    66  w er kcl k
  93  other      25     14    26  ah dh er
  94  an         25     12    28  ax n
  95  I've       25      7    46  ay v
  96  thing      24      9    52  th ih ng
  97  even       24      7    40  iy v ix n
  98  our        23      9    33  aa r
  99  any        23     11    23  ix n iy
 100  we're      23      8    25  w ey r
English is (sort of) like Chinese ….
81% of the word tokens are monosyllabic
Of the 100 most common words, 90 are one syllable in length
Only 22% of the words in the lexicon are one syllable long
Hence, there is a decided preference for monosyllabic words in informal discourse
95% of the words contain just ONE or TWO syllables ….
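The contrast between 81% of tokens and 22% of the lexicon is the difference between a token-weighted and a type-weighted proportion. The sketch below illustrates the two calculations on a toy lexicon; the words, syllable counts and frequencies are hypothetical and chosen only to show the effect.

```python
def monosyllabic_shares(lexicon):
    """lexicon maps word -> (syllable_count, token_frequency).

    Returns (share of word types, share of word tokens) that are monosyllabic.
    """
    types_total = len(lexicon)
    tokens_total = sum(freq for _, freq in lexicon.values())
    mono_types = sum(1 for syllables, _ in lexicon.values() if syllables == 1)
    mono_tokens = sum(freq for syllables, freq in lexicon.values() if syllables == 1)
    return mono_types / types_total, mono_tokens / tokens_total

# Hypothetical counts, for illustration only.
toy = {
    "the": (1, 50),
    "know": (1, 30),
    "really": (2, 10),
    "people": (2, 6),
    "conversation": (4, 3),
    "telephone": (3, 1),
}
type_share, token_share = monosyllabic_shares(toy)
print(f"monosyllabic types: {100 * type_share:.0f}%, tokens: {100 * token_share:.0f}%")
```

Because the short words are also the frequent ones, the token share is far higher than the type share, which is the pattern the slide reports.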
Syllable and Word Frequencies are Similar
Words and syllables exhibit similar distributions over the 300 most common elements, accounting for 80% of the corpus
The similarity of their distributions is a consequence of most words consisting of just a single syllable
Word frequency as a function of word rank approximates a 1/f distribution, particularly after rank-order 10
Word Frequency in Spontaneous English
Log word frequency is approximately linearly related to log rank order in the corpus (i.e., the 10th most common word occurs ca. 10 times more frequently than the 100th most common word, etc.)
Computed from the Switchboard corpus (American English telephone dialogues)
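The 1/f claim can be spot-checked against the top-100 word table. The sketch below estimates the slope of log frequency against log rank from just two rows of that table (rank 10, "it", N = 240; rank 100, "we're", N = 23). A two-point estimate is a rough check rather than a fit, and the function name `zipf_exponent` is an assumption.

```python
import math

def zipf_exponent(rank_a, freq_a, rank_b, freq_b):
    """Magnitude of the slope of log(frequency) vs. log(rank) between two points."""
    return math.log(freq_a / freq_b) / math.log(rank_b / rank_a)

# Counts taken from the top-100 table: rank 10 ("it") and rank 100 ("we're").
alpha = zipf_exponent(10, 240, 100, 23)
print(f"estimated exponent: {alpha:.2f}")
```

An exponent close to 1 corresponds to frequency falling off roughly as 1/rank, consistent with the statement that the 10th word is about ten times as frequent as the 100th.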
Information Affects Pronunciation
The faster the speaking rate, the more likely it is that the pronunciation deviates from canonical
However, the effect is much more pronounced for the 100 most common words than for more infrequent words
From Fosler-Lussier, Greenberg and Morgan (1999); Greenberg and Fosler-Lussier (2000)
English Syllable Structure is (sort of) Like Japanese
[Figure: bar chart by syllable type (CV, CVC, VC, V); vertical axis 0-50; legend: Pronunciation, Corpus]
87% of the pronunciations are simple syllabic forms
84% of the canonical corpus is composed of simple syllabic forms
n = 103,054
Most syllables are simple in form (no consonant clusters)
There are many “complex” syllable forms (consonant clusters), but all occur relatively infrequently
Complex Syllables are Important, Though
Thus, despite English’s reputation for complex syllabic forms, only ca. 15% of the syllable tokens are actually complex
Complex codas are not realized in actual pronunciation as frequently as in their canonical representation
Complex onsets tend to preserve their canonical representation in actual pronunciation
n = 17,760
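The "ca. 15%" figure can be reproduced from the two sample sizes shown on the slides, if one assumes that n = 103,054 counts the simple syllable tokens and n = 17,760 the complex ones. The slides do not say so explicitly, but the two values sum to the n = 120,814 shown on the syllable-position slide. The constant names below are illustrative.

```python
# Assumed meaning of the two "n =" annotations (not stated on the slides).
SIMPLE_TOKENS = 103_054   # syllable tokens with simple forms
COMPLEX_TOKENS = 17_760   # syllable tokens with complex forms

total = SIMPLE_TOKENS + COMPLEX_TOKENS
complex_share = 100.0 * COMPLEX_TOKENS / total

print(f"total syllable tokens: {total:,}")
print(f"complex share: {complex_share:.1f}%")
```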
Syllable-Centric Pronunciation
“Cat” [k ae t]: [k] = onset, [ae] = nucleus, [t] = coda
Onsets are pronounced canonically far more often than nuclei or codas
Codas tend to be pronounced canonically more frequently in formal speech than in spontaneous dialogues
[Figure: percent canonically pronounced by syllable position; spontaneous speech vs. read sentences; n = 120,814]
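The "percent canonically pronounced" measure can be sketched as a per-position comparison between the dictionary form and the transcribed form of each syllable. The code below is an illustration under simplifying assumptions (one label per position, exact string match); the function name and the four toy tokens of "cat" are hypothetical, not corpus data.

```python
from collections import defaultdict

POSITIONS = ("onset", "nucleus", "coda")

def percent_canonical(syllables):
    """syllables: iterable of (canonical, realized) dicts keyed by position.

    A position absent from the canonical form is skipped; a position absent
    from the realized form counts as a deviation (deletion).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for canonical, realized in syllables:
        for pos in POSITIONS:
            if pos not in canonical:
                continue
            totals[pos] += 1
            if realized.get(pos) == canonical[pos]:
                hits[pos] += 1
    return {pos: 100.0 * hits[pos] / totals[pos] for pos in POSITIONS if totals[pos]}

CAT = {"onset": "k", "nucleus": "ae", "coda": "t"}
# Hypothetical realizations of "cat", for illustration only.
toy = [
    (CAT, {"onset": "k", "nucleus": "ae", "coda": "t"}),   # fully canonical
    (CAT, {"onset": "k", "nucleus": "ae"}),                # coda deleted
    (CAT, {"onset": "k", "nucleus": "eh", "coda": "dx"}),  # nucleus and coda changed
    (CAT, {"onset": "k", "nucleus": "ae", "coda": "t"}),   # fully canonical
]
result = percent_canonical(toy)
print(result)
```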
Complex Onsets are Highly Canonical
Complex onsets are pronounced more canonically than simple onsets despite the greater potential for deviation from the standard pronunciation
[Figure: percent canonically pronounced (axis 70-100) by syllable onset type, Simple (C) vs. Complex (CC(C)); STP (spontaneous speech) vs. TIMIT (read sentences)]
Speaking Style Affects Codas
Codas are much more likely to be realized canonically in formal than in spontaneous speech
[Figure: percent canonically pronounced by syllable coda type]
Onsets (but not Codas) Affect Nuclei
The presence of a syllable onset has a substantial impact on the realization of the nucleus
[Figure: percent canonically pronounced (axis 50-70) for all nuclei and for nuclei with onset, without onset, with coda and without coda; STP vs. TIMIT]
Syllable-Centric Feature Analysis
• Place of articulation deviates most in nucleus position
• Manner of articulation deviates most in onset and coda position
• Voicing deviates most in coda position
Phonetic deviation along a SINGLE feature
Place deviates very little from canonical form in the onset and coda. It is a STABLE AF in these positions
Place is VERY unstable in nucleus position
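A feature-level deviation analysis of this kind can be sketched by mapping each phone to articulatory-acoustic features and counting, per syllable position, which features differ between the canonical and the realized phone. The feature table below is a heavily reduced assumption covering four Arpabet-style labels, not the inventory used in the talk, and deletions or insertions are not handled.

```python
from collections import Counter

# Heavily reduced feature table: an assumption for illustration,
# not the articulatory-acoustic feature inventory used in the talk.
FEATURES = {
    "t":  {"place": "alveolar", "manner": "stop", "voicing": "voiceless"},
    "d":  {"place": "alveolar", "manner": "stop", "voicing": "voiced"},
    "dx": {"place": "alveolar", "manner": "flap", "voicing": "voiced"},
    "k":  {"place": "velar",    "manner": "stop", "voicing": "voiceless"},
}

def deviating_features(canonical_phone, realized_phone):
    """Names of the features on which the realized phone differs from canonical."""
    canonical = FEATURES[canonical_phone]
    realized = FEATURES[realized_phone]
    return sorted(name for name in canonical if canonical[name] != realized[name])

def deviation_counts(observations):
    """observations: iterable of (position, canonical_phone, realized_phone)."""
    counts = Counter()
    for position, canonical_phone, realized_phone in observations:
        for feature in deviating_features(canonical_phone, realized_phone):
            counts[(position, feature)] += 1
    return counts

# Hypothetical observations, for illustration only.
toy = [
    ("onset", "t", "t"),
    ("onset", "k", "k"),
    ("coda", "t", "d"),    # voicing differs
    ("coda", "t", "dx"),   # manner and voicing differ
]
counts = deviation_counts(toy)
print(dict(counts))
```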
Articulatory PLACE Feature Analysis
• Place of articulation is a “dominant” feature in nucleus position only
• Drives the feature deviation in the nucleus for manner and rounding
Phonetic deviation across SEVERAL features
Place “carries” manner and rounding in the nucleus
Articulatory MANNER Feature Analysis
• Manner of articulation is a “dominant” feature in onset and coda position
• Drives the feature deviation in onsets and codas for place and voicing
Manner is less stable in the coda than in the onset
Manner drives place and voicing deviations in the onset and coda
Phonetic deviation across SEVERAL features
Articulatory VOICING Feature Analysis
• Voicing is a subordinate feature in all syllable positions
• Its deviation pattern is controlled by manner in onset and coda positions
Voicing is unstable in coda position and is dominated by manner
Phonetic deviation across SEVERAL features
LIP-ROUNDING Feature Analysis
• Lip-rounding is a subordinate feature
• Its deviation pattern is driven by the place feature in nucleus position
Rounding is stable everywhere except in the nucleus where its deviation pattern is driven by place
Phonetic deviation across SEVERAL features
Perceptual Evidence for the
Importance of Place (and Manner) of Articulation Features
Spectral Slit Paradigm
[Figures: consonant recognition with 1, 2, 3, 4 and 5 spectral slits]
Correlation - AFs/Consonant Recognition
Consonant recognition is almost perfectly correlated with place of articulation performance
This correlation suggests that the place feature is based on cues distributed across the entire speech bandwidth, in contrast to other features
Manner is also highly correlated with consonant recognition, voicing and rounding less so
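The kind of correlation referred to here can be computed as a Pearson coefficient between per-condition consonant recognition scores and per-condition feature scores. The sketch below shows the calculation only: the score lists are placeholder values invented for illustration, not results from the slit experiments, and the talk does not specify which correlation statistic was used.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Placeholder percent-correct scores per listening condition
# (illustrative only, NOT data from the talk).
consonant_scores = [20.0, 40.0, 60.0, 80.0]
place_scores = [22.0, 43.0, 58.0, 81.0]

r = pearson_r(consonant_scores, place_scores)
print(f"r = {r:.3f}")
```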
Automatic Phonetic Transcription of Spontaneous Speech
Automatic Phonetic Transcription• MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC
ALIGNMENT DATA TO TRAIN NEW SYSTEMS– These materials are highly inaccurate (35-50% incorrect labeling of phonetic
segments and an average of 32 ms (40% of the mean phone duration) off in terms of segment boundaries
• IT IS BOTH TIME-CONSUMING AND EXPENSIVE TO MANUALLY LABEL AND SEGMENT PHONETIC MATERIAL FOR SPONTANEOUS CORPORA– Manual labeling and segmentation typically requires 150-400 times real time to
perform
• WE HAVE DEVELOPED AN AUTOMATIC LABELING OF PHONETIC SEGMENTS (ALPS) SYSTEM TO PROVIDE TRAINING MATERIALS FOR ASR AND TO ENABLE RAPID DEPLOYMENT TO NEW CORPORA AND FOREIGN LANGUAGE MATERIAL– Such material will be extremely useful for developing pronunciation
models and new algorithms for ASR
Automatic Phonetic Transcription• MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC
ALIGNMENT DATA TO TRAIN NEW SYSTEMS– These materials are highly inaccurate (35-50% incorrect labeling of phonetic
segments and an average of 32 ms (40% of the mean phone duration) off in terms of segment boundaries
• IT IS BOTH TIME-CONSUMING AND EXPENSIVE TO MANUALLY LABEL AND SEGMENT PHONETIC MATERIAL FOR SPONTANEOUS CORPORA– Manual labeling and segmentation typically requires 150-400 times real time to
perform
• WE HAVE DEVELOPED AN AUTOMATIC LABELING OF PHONETIC SEGMENTS (ALPS) SYSTEM TO PROVIDE TRAINING MATERIALS FOR ASR AND TO ENABLE RAPID DEPLOYMENT TO NEW CORPORA AND FOREIGN LANGUAGE MATERIAL– Such material will be extremely useful for developing pronunciation
models and new algorithms for ASR
• THE ALPS SYSTEM CURRENTLY LABELS SPONTANEOUS MATERIALS (OGI Numbers Corpus) WITH ca. 83% ACCURACY
Automatic Phonetic Transcription
• MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC ALIGNMENT DATA TO TRAIN NEW SYSTEMS
– These materials are highly inaccurate: 35-50% of phonetic segments are incorrectly labeled, and segment boundaries are off by an average of 32 ms (40% of the mean phone duration)
• IT IS BOTH TIME-CONSUMING AND EXPENSIVE TO MANUALLY LABEL AND SEGMENT PHONETIC MATERIAL FOR SPONTANEOUS CORPORA
– Manual labeling and segmentation typically requires 150-400 times real time to perform
• WE HAVE DEVELOPED AN AUTOMATIC LABELING OF PHONETIC SEGMENTS (ALPS) SYSTEM TO PROVIDE TRAINING MATERIALS FOR ASR AND TO ENABLE RAPID DEPLOYMENT TO NEW CORPORA AND FOREIGN LANGUAGE MATERIAL
– Such material will be extremely useful for developing pronunciation models and new algorithms for ASR
• THE ALPS SYSTEM CURRENTLY LABELS SPONTANEOUS MATERIALS (OGI Numbers Corpus) WITH ca. 83% ACCURACY
– The algorithms used are capable of achieving ca. 93% accuracy with only minor changes to the models
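The figures on this slide imply a couple of quantities that are easy to work out. The sketch below does the arithmetic. The four-hour corpus size is an assumption borrowed for illustration from the Switchboard material mentioned in the summary, not a statement about how long that labeling actually took.

```python
# Figures quoted on the slide.
boundary_error_ms = 32.0            # average boundary offset of automatic alignment
error_fraction_of_phone = 0.40      # the same offset as a fraction of mean phone duration
realtime_factor_range = (150, 400)  # manual labeling cost, in multiples of real time

# Implied mean phone duration: 32 ms / 0.40.
mean_phone_ms = boundary_error_ms / error_fraction_of_phone

# Illustrative assumption: a corpus of four hours of speech.
corpus_hours = 4.0
labeling_hours = tuple(factor * corpus_hours for factor in realtime_factor_range)

print(f"implied mean phone duration: {mean_phone_ms:.0f} ms")
print(f"manual labeling effort: {labeling_hours[0]:.0f} to {labeling_hours[1]:.0f} hours")
```

Under these assumptions, a boundary error of 32 ms corresponds to a mean phone duration of about 80 ms, and hand-labeling four hours of speech would take roughly 600 to 1600 hours.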
Phonetic Feature Classification System
Spectro-Temporal Profile (STeP)
• STePs provide a simple, accurate means of delineating the acoustic properties associated with phonetic features and segments
Vocalic
Spectro-Temporal Profile (STeP)
• STePs incorporate information about the instantaneous modulation spectrum distributed across the (tonotopic) frequency axis and can be used for training neural networks
Fricative
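To make the notion of a modulation spectrum per frequency band concrete, here is a rough sketch. It is not the STeP computation itself: it takes a whole-segment DFT of each band's energy envelope rather than an instantaneous estimate, and the band names, frame rate and synthetic envelopes are all assumptions made for illustration.

```python
import cmath
import math

def modulation_spectrum(envelope, frame_rate_hz):
    """Magnitude DFT of a mean-removed band-energy envelope.

    Returns (modulation_frequency_hz, magnitude) pairs up to the Nyquist rate.
    """
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [value - mean for value in envelope]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        spectrum.append((k * frame_rate_hz / n, abs(coeff) / n))
    return spectrum

def peak_modulation_hz(envelope, frame_rate_hz):
    """Modulation frequency with the largest magnitude."""
    return max(modulation_spectrum(envelope, frame_rate_hz), key=lambda p: p[1])[0]

# Synthetic envelopes: one second at 100 frames/s, modulated at 4 Hz and 8 Hz.
frame_rate_hz = 100
n_frames = 100
band_envelopes = {
    "low band": [1.0 + 0.5 * math.sin(2 * math.pi * 4 * t / frame_rate_hz)
                 for t in range(n_frames)],
    "high band": [1.0 + 0.5 * math.sin(2 * math.pi * 8 * t / frame_rate_hz)
                  for t in range(n_frames)],
}

# One peak modulation frequency per band, i.e. a profile along the frequency axis.
peaks = {name: peak_modulation_hz(env, frame_rate_hz)
         for name, env in band_envelopes.items()}
print(peaks)
```

Stacking such per-band spectra along the tonotopic axis gives a two-dimensional pattern of the kind that could serve as neural network input.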
Label Accuracy per Frame
• Frames away from the boundary are labeled very accurately
Sample Transcription Output
• The automatic system performs very similarly to manual transcription in terms of both labels and segmentation
– 11 ms average concordance in segmentation
– 83% concordance with respect to phonetic labels
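One plausible way to score concordance between a manual and an automatic transcription is sketched below. It assumes segments are given as `(label, start_ms, end_ms)` tuples, that labels are compared frame by frame at 10 ms resolution, and that both transcriptions contain the same number of segments when boundaries are compared. The example word and its timings are hypothetical, and the slides do not state that the reported figures were computed this way.

```python
def frame_labels(segments, frame_ms=10):
    """Expand (label, start_ms, end_ms) segments into one label per frame."""
    labels = []
    for label, start_ms, end_ms in segments:
        labels.extend([label] * ((end_ms - start_ms) // frame_ms))
    return labels

def label_concordance(reference, hypothesis, frame_ms=10):
    """Fraction of frames on which the two transcriptions carry the same label."""
    ref = frame_labels(reference, frame_ms)
    hyp = frame_labels(hypothesis, frame_ms)
    n = min(len(ref), len(hyp))
    if n == 0:
        raise ValueError("no frames to compare")
    return sum(1 for r, h in zip(ref, hyp) if r == h) / n

def mean_boundary_offset_ms(reference, hypothesis):
    """Mean absolute offset between corresponding internal segment boundaries."""
    if len(reference) != len(hypothesis) or len(reference) < 2:
        raise ValueError("need the same number of segments (at least 2) in both")
    offsets = [abs(r[2] - h[2]) for r, h in zip(reference[:-1], hypothesis[:-1])]
    return sum(offsets) / len(offsets)

# Hypothetical transcriptions of one token of "seven" (illustrative only).
manual = [("s", 0, 100), ("eh", 100, 180), ("v", 180, 240),
          ("ax", 240, 300), ("n", 300, 400)]
automatic = [("s", 0, 110), ("eh", 110, 180), ("v", 180, 250),
             ("ih", 250, 300), ("n", 300, 400)]

print(label_concordance(manual, automatic))
print(mean_boundary_offset_ms(manual, automatic))
```

Frame-level scoring weights long segments more heavily than short ones. A segment-level comparison would need an explicit alignment step to handle insertions and deletions.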
In Conclusion ….
Grand Summary
• SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES
– Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus)
– Automatic methods will eventually supply badly needed data for more complete analyses and evaluation
• THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL
– Onsets are pronounced in canonical (i.e., dictionary) fashion 85-90% of the time
– Nuclei and codas are expressed canonically only 60% of the time
– Nuclei tend to be realized as vowels different from the canonical form
– Codas are often deleted entirely
– Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position
– Therefore, it is important to model spoken language at the syllabic level
• THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY
– It may be unrealistic to assume that any phonetic transcription based exclusively on segments (such as the IPA) is truly capable of capturing the important phonetic detail of spontaneous material
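The per-position percentages above come from comparing each transcribed phone with its dictionary form. A minimal sketch of that tally follows, assuming each observation is a `(syllable_position, canonical_phone, realized_phone)` tuple with `None` marking a deleted segment. The observations are hypothetical examples, not Switchboard data.

```python
from collections import Counter

def canonical_rates(observations):
    """Proportion of segments realized canonically, per syllable position."""
    totals = Counter()
    canonical = Counter()
    for position, canonical_phone, realized_phone in observations:
        totals[position] += 1
        if realized_phone == canonical_phone:
            canonical[position] += 1
    return {position: canonical[position] / totals[position] for position in totals}

# Hypothetical observations: two tokens of "that" and one of "and" (illustrative only).
observations = [
    ("onset", "dh", "dh"), ("nucleus", "ae", "ax"), ("coda", "t", None),
    ("onset", "dh", "dh"), ("nucleus", "ae", "ae"), ("coda", "t", "t"),
    ("nucleus", "ae", "ix"), ("coda", "n", "n"), ("coda", "d", None),
]

rates = canonical_rates(observations)
for position in ("onset", "nucleus", "coda"):
    print(f"{position}: {rates[position]:.2f}")
```

Separating substitutions from deletions (a realized phone of `None`) would additionally show whether a position deviates mainly by changing its segment or by losing it.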
That’s All, Folks
Many Thanks for Your Time and Attention
Temporal View of Language
Linguistic Automatic Speech Recognition
• CHARACTERIZE SPOKEN LANGUAGE WITH GREAT PRECISION
– Currently, manual transcription is the only means by which to collect detailed data pertaining to spoken language. Computational methods are currently being developed to perform transcription automatically in order to provide an abundance of data for statistical characterization of spontaneous discourse.
• USE THIS KNOWLEDGE TO DEVELOP COMPUTATIONAL TECHNIQUES TAILORED TO THE PROPERTIES OF THE SPEECH DOMAIN
– A detailed knowledge of spoken language is essential for deriving a computational framework for ASR. The phonetic properties of speech are structured in different ways depending on the location within the syllable, word and phrase. Such knowledge is currently under-utilized by mainstream ASR.
• FOCUS ON LOWER TIERS OF SPOKEN LANGUAGE FOR THE PRESENT
– It is fashionable to emphasize the importance of “language” models (i.e., word co-occurrence properties) in ASR. However, most of the problems lie in the acoustic-phonetic front end and therefore this domain should be attacked first.
• USE KNOWLEDGE OF HOW HUMAN LISTENERS UNDERSTAND SPOKEN LANGUAGE TO GUIDE DEVELOPMENT OF ASR ALGORITHMS
– Current ASR acoustic models are not based on perceptual capabilities of human listeners, but on a distorted representation of what is important in hearing. It is important to perform intelligibility experiments to ascertain the identity of the truly important components of the speech signal and use this knowledge to develop robust acoustic front-end models for ASR.
Linguistic ASR Research @ ICSI
• PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY
– Human listening experiments identifying the specific properties crucial for understanding spoken language
• MODULATION-SPECTRUM-BASED AUTOMATIC SPEECH RECOGNITION
– Using auditory-based algorithms (linked to the syllable) for reliable ASR in background noise and reverberation
• SYLLABLE-BASED AUTOMATIC SPEECH RECOGNITION
– Development of a syllable-based decoder for ASR
• STATISTICAL PROPERTIES OF SPONTANEOUS SPEECH
– Detailed and comprehensive statistical analyses of the Switchboard corpus pertaining to phonetic, prosodic and lexical properties, used for developing pronunciation models (among other things)
• AUTOMATIC PHONETIC LABELING AND SEGMENTATION
– Development of (the first) automatic phonetic transcription system using articulatory-acoustic features (e.g., voicing, manner, place, etc.)
• AUTOMATIC LABELING OF PROSODIC STRESS
– Development of (the first) automatic system for labeling prosodic stress in English
• AUTOMATIC SPEECH RECOGNITION DIAGNOSTIC EVALUATION
– Detailed and comprehensive analyses of Switchboard-corpus ASR systems in order to identify factors associated with word error
Linguistic ASR at ICSI
• SENIOR PERSONNEL
– Steven Greenberg - Linguistic ASR, Spoken Language Statistics, Speech Perception
– Lokendra Shastri - Neural Network Design, Higher-level Language & Neural Processing
• GRADUATE STUDENTS
– Shawn Chang - ANN-based ASR, Automatic Phonetic Transcription & Segmentation
– Michael Shire - Temporal & Multi-Stream Approaches to Automatic Speech Recognition
– Mirjam Wester - Pronunciation Modeling in Automatic Speech Recognition
• UNDERGRADUATE STUDENTS
– Micah Farrer - Database Development for ASR Analysis
– Leah Hitchcock - Statistics of Pronunciation and Prosody of Spoken Language
• TECHNICAL STAFF
– Joy Hollenback - Statistical Analyses, Data Collection and Maintenance
• ASSOCIATES AT ICSI
– Hynek Hermansky, Nelson Morgan, Liz Shriberg and Andreas Stolcke
• ASSOCIATES AT LOCATIONS OTHER THAN ICSI
– Takayuki Arai (Sophia University, Tokyo) - Speech Perception, Signal Processing
– Les Atlas (University of Washington, Seattle) - Acoustic Signal Processing
– Ken Grant (Walter Reed Army Medical Center) - Audio-visual Speech Processing
– David Poeppel (University of Maryland) - Brain Mechanisms of Language
– Tim Roberts (UC-San Francisco Medical Center) - Brain Imaging of Language Processes
– Christoph Schreiner (UCSF) - Auditory Cortex and Its Relation to Speech Processing
– Lloyd Watts (Applied Neurosystems) - Auditory Modeling from Cochlea to Cortex
• CURRENT FUNDING
– National Security Agency - Automatic Transcription of Phonetic and Prosodic Elements
– National Science Foundation - Syllable-based ASR, Speech Perception, Statistics of Speech
Linguistic ASR at ICSI (continued)
• FORMER ICSI POST-DOCTORAL FELLOWS
– Takayuki Arai - Sophia University, Tokyo
– Dan Ellis - Columbia University (as of September 1, 2000)
– Eric Fosler - Bell Laboratories, Lucent Technologies
– Rosaria Silipo - Nuance Communications
• FORMER ICSI GRADUATE STUDENTS
– Jeff Bilmes - University of Washington, Seattle
– Eric Fosler - Bell Laboratories, Lucent Technologies
– Brian Kingsbury - IBM, Yorktown Heights
– Katrin Kirchhoff - University of Washington, Seattle
– Nikki Mirghafori - Nuance Communications
– Su-Lin Wu - Nuance Communications
• FORMER ICSI UNDERGRADUATE STUDENTS
– Candace Cardinal - Nuance Communications
– Rachel Coulston - University of California, San Diego
– Colleen Richey - Stanford University
Publications - Linguistic ASR
AUTOMATIC SPEECH RECOGNITION DIAGNOSTIC EVALUATION
Greenberg, S., Chang, S. and Hollenback, J. (2000) An introduction to the diagnostic evaluation of the Switchboard-corpus automatic speech recognition systems. Proceedings of the NIST Speech Transcription Workshop, College Park.
Greenberg, S. and Chang, S. (2000) Linguistic dissection of Switchboard-corpus automatic recognition systems. Proceedings of the ICSI Workshop on Automatic Speech Recognition: Challenges for the New Millennium, Paris.
AUTOMATIC PHONETIC TRANSCRIPTION AND SEGMENTATION
Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proc. Int. Conf. Spoken Lang. Proc., Beijing.
Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable detection and segmentation using temporal flow neural networks. Proceedings of the International Congress of Phonetic Sciences, San Francisco, pp. 1721-1724.
STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING
Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the International Congress of Phonetic Sciences, San Francisco.
Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech, in the Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.
Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation, Speech Communication, 29, 159-176.
Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.
Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus, in Proc. Intern. Conf. Spoken Lang. (ICSLP), Philadelphia, pp. S24-27.
AUTOMATIC LABELING OF PROSODIC STRESS IN SPONTANEOUS SPEECH
Silipo, R. and Greenberg, S. (1999) Automatic transcription of prosodic stress for spontaneous English discourse. Proceedings of the International Congress of Phonetic Sciences, San Francisco.
Silipo, R. and Greenberg, S. (2000) Prosodic stress revisited: Reassessing the role of fundamental frequency. Proceedings of the NIST Speech Transcription Workshop, College Park.
Publications - Linguistic ASR (continued)
MODULATION-SPECTRUM-BASED AUTOMATIC SPEECH RECOGNITION
Greenberg, S. and Kingsbury, B. (1997) The modulation spectrogram: In pursuit of an invariant representation of speech, in ICASSP-97, IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, pp. 1647-1650.
Kingsbury, B., Morgan, N. and Greenberg, S. (1999) The modulation-filtered spectrogram: A noise-robust speech representation, in Proceedings of the Workshop on Robust Methods for Speech Recognition in Adverse Conditions, Tampere, Finland.
Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Robust speech recognition using the modulation spectrogram, Speech Communication, 25, 117-132.
SYLLABLE-BASED AUTOMATIC SPEECH RECOGNITION
Wu, S.-L., Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Incorporating information from syllable-length time scales into automatic speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 721-724.
Wu, S.-L., Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Performance improvements through combining phone- and syllable-length information in automatic speech recognition, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 854-857.
PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY GERMANE TO ASR
Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony, IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.
Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.
Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8.
Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations, Proceedings of Eurospeech, Budapest.
Syllable Frequency - Spontaneous English
The distribution of syllable frequency in spontaneous speech differs markedly from that in dictionaries
Word frequency as a function of word rank approximates a 1/f distribution, particularly after rank-order 10
Word Frequency in Spontaneous English
Word frequency is approximately inversely proportional to rank order in the corpus, which appears as a straight line on log-log axes (i.e., the 10th most common word occurs ca. 10 times more frequently than the 100th most common word, etc.)
Computed from the Switchboard corpus (American English telephone dialogues)
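The rank-frequency relation described above can be checked on any tokenized corpus by fitting a line to log frequency against log rank. The sketch below uses a synthetic list of ideal 1/rank frequencies and a toy token list. Neither is Switchboard data, and the function names are illustrative.

```python
import math
from collections import Counter

def rank_frequency(words):
    """Word counts sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(words).values(), reverse=True)

def loglog_slope(frequencies):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    if len(frequencies) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log10(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log10(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Toy token list, only to show the counting step.
toy_counts = rank_frequency("i i i you you uh".split())

# Synthetic ideal 1/rank frequencies for ranks 1..100 (not corpus data).
ideal = [1000.0 / rank for rank in range(1, 101)]
slope = loglog_slope(ideal)
ratio_10th_to_100th = ideal[9] / ideal[99]

print(toy_counts, round(slope, 3), round(ratio_10th_to_100th, 3))
```

A slope near -1 corresponds to the 1/f pattern. For real corpora the fit is usually restricted to ranks beyond the first few, consistent with the remark about rank order 10.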
The Intricate Web of Research