Phonetic Dissection of Switchboard-Corpus Automatic Speech Recognition Systems
Steven Greenberg and Shuangyu Chang
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
{steveng, shawnc}@icsi.berkeley.edu
http://www.icsi.berkeley.edu/~steveng
Large Vocabulary Continuous Speech Recognition Workshop Maritime Institute of Technology, Linthicum Heights, MD, May 4, 2001
• PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
– Many different analyses (to follow) support this conclusion
– Consonants appear to be more important than vowels
• SYLLABLE STRUCTURE IS ALSO AN IMPORTANT FACTOR FOR ACCURATE RECOGNITION
– The pattern of errors differs across the syllable (onset, nucleus, coda) and exhibits consistent patterns difficult to discern with other units of analysis
• STRESS-ACCENT MAY PLAY AN IMPORTANT ROLE, PARTICULARLY FOR UNDERSTANDING THE NATURE OF WORD-DELETION ERRORS
– Relation among stress-accent, syllable structure, vocalic identity and length
• THE NATURE OF PRONUNCIATION MODELS AND THEIR RELATION TO LEXICAL REPRESENTATIONS IS A POTENTIALLY KEY FACTOR
– The unit of lexical representation (phones, articulatory features, etc.) is probably of the utmost importance for optimizing ASR performance
• FUTURE PROGRESS IN ASR SYSTEM DEVELOPMENT IS LIKELY TO DEPEND ON DEEP INSIGHT INTO THE NATURE OF SPOKEN LANGUAGE
Take Home Messages
• DESCRIPTION OF THE CORPUS MATERIALS FOR THE 2000 AND 2001 EVALUATIONS
– 2000 – Brief (2-17 s) utterances spoken by hundreds of different speakers. No relation to competitive evaluation
– 2001 – A subset of the competitive evaluation
• BRIEF OVERVIEW OF THE ANALYSIS REGIME COMMON TO THE 2000 AND 2001 PHONETIC EVALUATIONS
– File formats, time-mediated alignment, statistical analysis of the corpora, etc.
– Details are contained in “Linguistic Dissection …..” (in workshop notebook) and in “An Introduction ….” (NIST Speech Transcription Workshop, 2000)
• ANALYSES AND PATTERNS COMMON TO BOTH 2000 and 2001 EVALUATIONS
– Syllable structure, phonetic segments, articulatory-acoustic features. Details pertaining to the 2000 evaluation are in the papers cited above
• PHONETIC CONFUSION MATRICES FOR THE 2001 EVALUATION
• FUTURE ANALYSIS PLANNED FOR THIS SPRING WHEN REMAINING 2001 SUBMISSIONS ARRIVE
– Relationship between phonetic classification, pronunciation and language models
Structure of the Presentation
• SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
– Switchboard contains informal telephone dialogues
– 54 minutes of material that was previously phonetically transcribed (by highly trained phonetics students from UC-Berkeley)
– All of this material was hand-segmented at either the phonetic-segment or syllabic level by the transcribers
– The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material. This automatic segmentation was manually verified.
• THE PHONETIC SYMBOL SET and STP TRANSCRIPTIONS USED IN THE CURRENT PROJECT ARE AVAILABLE ON THE PHONEVAL WEB SITE:
http://www.icsi.berkeley.edu/real/phoneval
• THE ORIGINAL FOUR HOURS OF TRANSCRIPTION MATERIAL ARE AVAILABLE AT:
http://www.icsi.berkeley.edu/real/stp
Evaluation Material - 2000
Evaluation Material Details - 2000
[Figure: number of utterances by subjective difficulty (V_Easy, Easy, Medium, Hard, V_Hard)]
[Figure: number of utterances by dialect region (S_Mid, N_Mid, N_East, West, South, NYC, Other)]
• 581 DIFFERENT SPEAKERS
• AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS
• BROAD DISTRIBUTION OF UTTERANCE DURATIONS
– 2-4 sec - 40%, 4-8 sec - 50%, 8-17 sec - 10%
• COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN SWITCHBOARD
• A WIDE RANGE OF DISCUSSION TOPICS
• VARIABILITY IN DIFFICULTY (VERY EASY TO VERY HARD)
• SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
– Seventy-four minutes of material phonetically labeled by five highly trained phonetics students from UC-Berkeley plus S. Greenberg
– The material was hand-segmented at the syllabic level by the transcribers
– The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained originally on 72 minutes of hand-segmented Switchboard material (similar to the process performed the previous year)
• THE PHONETIC SYMBOL SET and STP TRANSCRIPTIONS USED ARE AVAILABLE ON THE PHONEVAL WEB SITE:
http://www.icsi.berkeley.edu/real/phoneval
Evaluation Material - 2001
Evaluation Material Details - 2001
• A SUBSET OF THE HUB-5 COMPETITIVE EVALUATION CORPUS
– A representative selection from the evaluation set, including an even distribution of data from the three main recording conditions (cellular and 2 land-line conditions)
• 21 SEPARATE CONVERSATIONS (2 speakers per conversation)
• 42 DIFFERENT SPEAKERS
• A TOTAL OF 74 MINUTES OF SPOKEN LANGUAGE MATERIAL – (including FILLED PAUSES, JUNCTURES, etc.)
• AVERAGE LENGTH OF SPEECH PER SPEAKER – 106 seconds
• RANGE OF LENGTH PER SPEAKER – 48 s (least) to 226 s (most)
• STANDARD DEVIATION – 38 s
• APPROXIMATELY ONE-THIRD OF THE MATERIAL FROM CELL PHONES
• EIGHT SITES PARTICIPATED IN THE EVALUATION
– All eight provided material for the unconstrained-recognition phase
– Six sites also provided sufficient forced-alignment-recognition material (i.e., phone/word labels and segmentation given the word transcript for each utterance) for a detailed analysis
• AT&T (forced-alignment recognition incomplete, not analyzed)
• Bolt, Beranek and Newman
• Cambridge University
• Dragon (forced-alignment recognition incomplete, not analyzed)
• Johns Hopkins University
• Mississippi State University
• SRI International
• University of Washington
Evaluation Sites - 2000
• SEVEN SITES ARE PARTICIPATING IN THE EVALUATION
– Unconstrained-recognition phase – 6 Sites
– Forced-alignment – 7 Sites
– Phone classification confidence scores – 5 Sites
– Variable condition recognition – 2 Sites
– Phone strings to words – 1 Site
• AT&T
• Bolt, Beranek and Newman
• IBM
• Johns Hopkins University
• Mississippi State University
• Philips
• SRI International
Evaluation Sites - 2001
• However … NOT ALL OF THE MATERIAL REQUIRED TO PERFORM THE ANALYSES HAS MATERIALIZED
– The tables below summarize the commitments and currently usable data (certain data arrived in not-quite-ready-for-prime-time form)
Evaluation Data Status - 2001
Commitments
Current (usable data)
SITE RECOGNITION FORCED-ALIGNMENT PHONE CONFIDENCE VARIABLE RECOGNITION PHONES-TO-WORDS
AT&T
BBN X X
IBM X
JHU X X X
MSU X X
Philips X X X
SRI X X
Parameter Key
START - Begin time (in seconds) of phone
DUR - Duration (in sec) of phone
PHN - Hypothesized phone ID
WORD - Hypothesized Word ID
Format is for all 674 files in the evaluation set
(Example courtesy of MSU)
Initial Recognition File - Example
UTT-ID CH Start DUR PHN WORD
2001_0016 B 0 0.1 sil !SENT_START
2001_0016 B 0.1 0.06 l
2001_0016 B 0.16 0.05 ay
2001_0016 B 0.21 0.07 k LIKE
2001_0016 B 0.28 0.04 ih
2001_0016 B 0.32 0.05 n IN
2001_0016 B 0.37 0.21 ao
2001_0016 B 0.58 0.08 g
2001_0016 B 0.66 0.08 ax
2001_0016 B 0.74 0.03 s
2001_0016 B 0.77 0.04 t
2001_0016 B 0.81 0.01 sp AUGUST
2001_0016 B 0.82 0.03 w
2001_0016 B 0.85 0.03 eh
2001_0016 B 0.88 0.04 n WHEN
2001_0016 B 0.92 0.05 eh
2001_0016 B 0.97 0.03 v
2001_0016 B 1 0.03 r
2001_0016 B 1.03 0.05 iy
2001_0016 B 1.08 0.06 b
2001_0016 B 1.14 0.05 aa
2001_0016 B 1.19 0.03 d
2001_0016 B 1.22 0.03 iy EVERYBODY
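The recognition-file layout above can be read back into word-level spans. The following is a minimal sketch, not the evaluation's own tooling. It assumes whitespace-separated fields in the order shown (UTT-ID, CH, Start, DUR, PHN, optional WORD), no header row in the input, and that the WORD label sits on the last phone of each word, as it appears to in the example. The names `WordSpan` and `group_phones_into_words` are hypothetical.

```python
from typing import List, NamedTuple


class WordSpan(NamedTuple):
    word: str
    phones: List[str]
    start: float
    end: float


def group_phones_into_words(lines):
    """Group phone rows into words.

    Assumption (inferred from the example only): the word label is
    attached to the final phone of the word, and rows without a sixth
    field belong to the word that is still being accumulated.
    """
    spans = []
    phones = []
    word_start = None
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        start, dur, phone = float(fields[2]), float(fields[3]), fields[4]
        if word_start is None:
            word_start = start
        phones.append(phone)
        if len(fields) >= 6:  # word label present: close the current word
            spans.append(WordSpan(fields[5], phones, word_start, round(start + dur, 3)))
            phones, word_start = [], None
    return spans


# First rows of the example above (header row omitted).
EXAMPLE = """\
2001_0016 B 0 0.1 sil !SENT_START
2001_0016 B 0.1 0.06 l
2001_0016 B 0.16 0.05 ay
2001_0016 B 0.21 0.07 k LIKE
2001_0016 B 0.28 0.04 ih
2001_0016 B 0.32 0.05 n IN
""".splitlines()

for span in group_phones_into_words(EXAMPLE):
    print(span.word, span.phones, span.start, span.end)
```

Grouping at this stage keeps the phone and word tiers linked, which is what the later word-centric and phone-centric analyses rely on.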
• EACH SUBMISSION SITE USED A (QUASI) CUSTOM PHONE SET
– Most of the phone sets are available on the PHONEVAL web site
• THE SITES’ PHONE SETS WERE MAPPED TO A COMMON “REFERENCE” PHONE SET
– The reference phone set is based on the ICSI Switchboard transcription material (STP), but is adapted to match the less granular symbol sets used by the submission sites
– The set of mapping conventions relating to the STP (and reference) sets is also available on the PHONEVAL web site
• THE REFERENCE PHONE SET WAS ALSO MAPPED TO THE SUBMISSION SITE PHONE SETS
– This reverse mapping was done in order to ensure that variants of a phone were given due “credit” in the scoring procedure
– For example, [em] (syllabic nasal) is mapped to [ix] + [m]; the vowel [ix] maps in certain instances to both [ih] and [ax], depending on the specifics of the phone set
Phone Mapping Procedure
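The reverse mapping described on this slide can be illustrated with a small sketch. Only the two examples given on the slide are encoded ([em] expanding to [ix] + [m], and [ix] being acceptable as [ih] or [ax]). The table names `EXPANSIONS` and `VARIANTS` and both functions are hypothetical; the actual mapping conventions are the ones posted on the PHONEVAL web site.

```python
# Hypothetical tables holding only the examples stated on the slide.
EXPANSIONS = {"em": ["ix", "m"]}           # reference phone -> site-level sequence
VARIANTS = {"ix": {"ix", "ih", "ax"}}      # reference phone -> acceptable site phones


def expand_reference(phones):
    """Rewrite a reference phone sequence into the coarser site inventory."""
    expanded = []
    for phone in phones:
        expanded.extend(EXPANSIONS.get(phone, [phone]))
    return expanded


def gets_credit(ref_phone, hyp_phone):
    """True if the hypothesized phone counts as a variant of the reference phone."""
    return hyp_phone in VARIANTS.get(ref_phone, {ref_phone})


print(expand_reference(["b", "aa", "dx", "em"]))
print(gets_credit("ix", "ih"), gets_credit("m", "n"))
```

Keeping the expansion and the variant check as separate tables mirrors the two cases on the slide: one changes the length of the sequence, the other only widens what counts as a match.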
• TWO METHODS WERE USED FOR THE 2001 EVALUATION
– The “UNCOMPENSATED” form is the same as last year’s scoring method. Only common phone ambiguities (such as [ix], [ih], [ah], [ax], etc.) are allowed
– The “TRANSCRIPTION-COMPENSATED” form allows for certain phones commonly confused among human transcribers to be scored as “correct,” even though they would otherwise be scored as “wrong”
– The compensated form of transcription lowers the phone “error” by ca. 10-20%
• TIME-MEDIATED SCORING WAS OF TWO VARIETIES
– A “STRICT” form is identical to that used in last year’s evaluation. There is a severe penalty for deviations from time boundaries for words and phones
– A “LENIENT” form allows for a much looser fit between time markers associated with words and phones. A weighting of 0.15 (relative to the STRICT form) was used (by modifying the penalty algorithm in SC-Lite). The 0.15 weight reduced the number of phone “errors” by ca. 20% without a significant decline in false-positive responses
Phone Scoring Procedures - 2001
Visualization of a 3-D Confusion Matrix
• When the matrix is sparsely coded, as below, it is more efficient to view the pattern as if squashed against a brick wall (see below)
• The diagonal is plotted in a linear plane
[Figure: 3-D confusion matrix, vertical axis 0 to 1, over the consonant segments B D G P T K DX JH CH S SH Z ZH F TH V DH M N NX NG L R W Y HH OTH]
Interlabeler Agreement (74%) - 3 Transcribers
• Highest for consonants (especially the stops)
• Lowest for vowels (particularly the lax monophthongs)
[Figure: proportion concordance by phonetic segment; panels: Consonants, Vowels]
Numbers refer to the concordance diagonal in the confusion matrices
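The "concordance diagonal" plotted in these figures can be computed from a confusion matrix as the share of each reference segment's tokens that received the same label. The sketch below assumes a nested-dictionary layout (reference label, then hypothesized label, then count). The counts are invented purely for illustration and are not taken from the evaluation.

```python
def concordance_diagonal(confusions):
    """Return, per reference segment, diagonal count / row total."""
    diagonal = {}
    for ref, row in confusions.items():
        total = sum(row.values())
        if total:  # skip segments that never occur
            diagonal[ref] = row.get(ref, 0) / total
    return diagonal


# Illustrative counts only (not evaluation data).
toy_counts = {
    "t": {"t": 8, "d": 2},
    "ih": {"ih": 5, "ix": 3, "ax": 2},
}

print(concordance_diagonal(toy_counts))
```

Plotting only these diagonal values is what allows a sparse three-dimensional matrix to be read as a single bar chart per segment.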
Interlabeler Disagreement Patterns - 2001
• INTERLABELER DISAGREEMENT PATTERNS WERE DERIVED FROM THE 2000 EVALUATION MATERIAL
– Several minutes of material transcribed in common by 3 transcribers were analyzed (2 from 1996-1997 STP, 1 from 2001 STP)
• THE FOLLOWING PATTERNS WERE OBSERVED IN THE INTERLABELER DISAGREEMENT ANALYSIS
• Consonants
– Stop and nasal consonants exhibit a small amount of disagreement
– Fricatives exhibit slightly higher amounts of disagreement
– Liquids show a moderate amount of disagreement
• Vowels
– Lax monophthongs exhibit a high amount of disagreement
– Diphthongs show a relatively small amount of disagreement
– Tense, low monophthongs show relatively little disagreement (except for [ao], probably a dialect issue)
• Overall Transcriber Agreement was 70%
Interlabeler Disagreement Patterns - 2001
• FROM SUCH PATTERNS THE FOLLOWING FORMS OF TOLERANCES WERE ALLOWED IN “TRANSCRIPTION COMPENSATED” SCORING:
Segment   UNcompensated     Compensated
[d]       [d]               [d] [dx]
[k]       [k]               [k]
[s]       [s]               [s] [z]
[n]       [n]               [n] [nx] [ng] [en]
[r]       [r]               [r] [axr] [er]
[iy]      [iy]              [iy] [ix] [ih]
[ao]      [ao]              [ao] [aa] [ow]
[ax]      [ax]              [ax] [ah] [aa] [ix]
[ix]      [ix] [ih] [ax]    [ix] [ih] [iy] [ax]
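The effect of the two tolerance columns can be sketched as a lookup applied to already-aligned (reference, hypothesis) phone pairs. The tolerance sets below are copied from the table above. The alignment step itself (done with SC-Lite in the evaluation) is not reproduced, and the four aligned pairs are made up for illustration. `substitution_rate` is a hypothetical helper name.

```python
UNCOMPENSATED = {
    "d": {"d"}, "k": {"k"}, "s": {"s"}, "n": {"n"}, "r": {"r"},
    "iy": {"iy"}, "ao": {"ao"}, "ax": {"ax"}, "ix": {"ix", "ih", "ax"},
}
COMPENSATED = {
    "d": {"d", "dx"}, "k": {"k"}, "s": {"s", "z"},
    "n": {"n", "nx", "ng", "en"}, "r": {"r", "axr", "er"},
    "iy": {"iy", "ix", "ih"}, "ao": {"ao", "aa", "ow"},
    "ax": {"ax", "ah", "aa", "ix"}, "ix": {"ix", "ih", "iy", "ax"},
}


def substitution_rate(pairs, tolerance):
    """Fraction of aligned pairs whose hypothesis falls outside the tolerance set."""
    if not pairs:
        raise ValueError("no aligned pairs to score")
    wrong = sum(1 for ref, hyp in pairs if hyp not in tolerance.get(ref, {ref}))
    return wrong / len(pairs)


# Made-up aligned pairs, for illustration only.
toy_pairs = [("d", "dx"), ("s", "z"), ("k", "g"), ("iy", "iy")]

print(substitution_rate(toy_pairs, UNCOMPENSATED))  # 0.75
print(substitution_rate(toy_pairs, COMPENSATED))    # 0.25
```

Segments missing from a table fall back to exact matching, so the same function serves both scoring forms by swapping the tolerance table.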
Transcription Compensation Affects Phone Error
• COMPENSATING FOR TRANSCRIPTION CONFUSION PATTERNS LOWERS THE PHONE “ERROR” APPRECIABLY FOR MOST SITES
[Figure: phone error rate (0 to 0.5) by site (SRI, JHU, BBN, IBM, MSU), transcription uncompensated vs. transcription compensated; STRICT time mediation]
Transcription Compensation Affects Phone Error
• COMPENSATING FOR TRANSCRIPTION CONFUSION PATTERNS LOWERS THE PHONE “ERROR” APPRECIABLY FOR MOST SITES
[Figure: phone error rate (0 to 0.5) by site (SRI, JHU, BBN, IBM, MSU), transcription uncompensated vs. transcription compensated; LENIENT time mediation]
Generation of Evaluation Data - 1
• EACH SITE’S MATERIAL WAS PROCESSED THROUGH SC-LITE TO OBTAIN A WORD-ERROR SCORE AND ERROR ANALYSIS (IN TERMS OF ERROR TYPE)
CTM File Format for Word Scoring
SOURCE UTID SIDE START DUR WORD ERTYP
REFERENCE 2001-B-0016 B 0 0.11 ? N
HYPOTHESIS 2001-B-0016 B *** *** *** N
R 2001-B-0016 B 0.11 0.18 LIKE C
H 2001-B-0016 B 0.1 0.18 LIKE C
R 2001-B-0016 B 0.29 0.08 IN C
H 2001-B-0016 B 0.28 0.09 IN C
R 2001-B-0016 B 0.37 0.48 AUGUST C
H 2001-B-0016 B 0.37 0.45 AUGUST C
R 2001-B-0016 B 0.85 0.07 WHEN C
H 2001-B-0016 B 0.82 0.1 WHEN C
R 2001-B-0016 B 0.92 0.44 EVERYBODY_IS S
H 2001-B-0016 B 0.92 0.33 EVERYBODY S
R 2001-B-0016 B *** *** *** I
H 2001-B-0016 B 1.25 0.1 IS I
R 2001-B-0016 B 1.36 0.15 ON C
H 2001-B-0016 B 1.35 0.15 ON C
… … … … … … …
ERROR KEY
C = CORRECT, I = INSERTION, N = NULL ERROR, S = SUBSTITUTION
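The per-word error codes in the CTM output lend themselves to a simple tally. The sketch below uses the conventional word-error definition, (substitutions + deletions + insertions) divided by the number of reference words. Two things are assumptions rather than facts from the slide: that deletions would be coded "D" (the key above lists only C, I, N and S), and that "N" rows are excluded from the reference-word count.

```python
from collections import Counter


def word_error_summary(error_codes):
    """Tally error codes (one per aligned reference/hypothesis pair)."""
    counts = Counter(error_codes)
    # Reference words are those that were correct, substituted or deleted.
    # "D" for deletion is an assumed code; "N" (null) rows are ignored.
    ref_words = counts["C"] + counts["S"] + counts["D"]
    errors = counts["S"] + counts["D"] + counts["I"]
    rate = errors / ref_words if ref_words else 0.0
    return {"counts": dict(counts), "ref_words": ref_words, "wer": rate}


# Codes of the eight aligned pairs shown in the CTM example above.
example_codes = ["N", "C", "C", "C", "C", "S", "I", "C"]
print(word_error_summary(example_codes))
```

For the fragment shown, one substitution and one insertion against six reference words give a rate of one third; the real scores are of course computed over the whole evaluation set.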
Generation of Evaluation Data - 2
• LEXICAL PROPERTIES
– Lexical Identity
– Unigram Frequency
– Number of Syllables in Word
– Number of Phones in Word
– Word Duration
– Speaking Rate
– Prosodic Prominence
– Energy Level
– Lexical Compounds
– Non-Words
– Word Position in Utterance
• SYLLABLE PROPERTIES
– Syllable Structure
– Syllable Duration
– Syllable Energy
– Prosodic Prominence
– Prosodic Context
Summary of Corpus Acoustic Properties
• PHONE PROPERTIES
– Phonetic Identity
– Phone Frequency
– Position within the Word
– Position within the Syllable
– Phone Duration
– Speaking Rate
– Phonetic Context
– Contiguous Phones Correct
– Contiguous Phones Wrong
– Phone Segmentation
– Articulatory Features
– Articulatory Feature Distance
– Phone Confusion Matrices
• OTHER PROPERTIES
– Speaker (Dialect, Gender)
– Utterance Difficulty
– Utterance Energy
– Utterance Duration
Word- and Phone-Centric “Big Lists”
ERR REFWORD HYPWORD UTID WORDPOS WORDFREQ WRDENG MRATE SYLRATE ETC.
N ? *** 2001-B-0016 0 -6.02 0.92 5.05 6.56 …
C LIKE LIKE 2001-B-0016 0.06 -2.1522 1.04 5.05 6.56 …
C IN IN 2001-B-0016 0.11 -1.9295 0.97 5.05 6.56 …
C AUGUST AUGUST 2001-B-0016 0.17 -4.6678 1.1 5.05 6.56 …
C WHEN WHEN 2001-B-0016 0.22 -2.5432 0.97 5.05 6.56 …
C EVERYBODY'S EVERYBODY'S 2001-B-0016 0.28 -4.3253 1.02 5.05 6.56 …
C ON ON 2001-B-0016 0.33 -2.3138 0.97 5.05 6.56 …
C VACATION VACATION 2001-B-0016 0.39 -3.9967 0.95 5.05 6.56 …
C OR OR 2001-B-0016 0.44 -2.3202 0.84 5.05 6.56 …
C SOMETHING SOMETHING 2001-B-0016 0.5 -2.7438 0.81 5.05 6.56 …
C WE WE 2001-B-0016 0.56 -2.1082 0.88 5.05 6.56 …
C CAN CAN 2001-B-0016 0.61 -2.611 0.75 5.05 6.56 …
C DRESS DRESS 2001-B-0016 0.67 -4.0399 0.9 5.05 6.56 …
C A A 2001-B-0016 0.72 -1.6723 0.85 5.05 6.56 …
C LITTLE LITTLE 2001-B-0016 0.78 -2.7814 0.91 5.05 6.56 …
C MORE MORE 2001-B-0016 0.83 -2.7027 0.85 5.05 6.56 …
C CASUAL CASUAL 2001-B-0016 0.89 -4.6678 0.94 5.05 6.56 …
I *** !SILENCE 2001-B-0016 0.94 -6.02 0.6 5.05 6.56 …
N H# *** 2005-B-0077 0 -6.02 0.6 4.44 7 …
N ? *** 2005-B-0077 0.06 -6.02 0.92 4.44 7 …
C YEAH YEAH 2005-B-0077 0.12 -1.9361 0.99 4.44 7 …
C JUST JUST 2005-B-0077 0.18 -2.1809 0.94 4.44 7 …
C BECAUSE BECAUSE 2005-B-0077 0.24 -2.4782 1.09 4.44 7 …
… … … … … … … … … …
• THE “BIG LISTS” CONTAIN SUMMARY INFORMATION ON 55-65 SEPARATE PARAMETERS ASSOCIATED WITH PHONES, SYLLABLES, WORDS, UTTERANCES AND SPEAKERS, SYNCHRONIZED TO EITHER THE WORD (THIS SLIDE) OR THE PHONE
Generation of Evaluation Data - 3
Phoneval-2000 Web Site
RECOGNITION FILES
• Converted Submissions: ATT, BBN, JHU, MSU, SRI, WASH
• Word Level Recognition Errors: ATT, CU, BBN, JHU, MSU, SRI, WASH
• Phone Error (Free Recognition): ATT, BBN, JHU, MSU, WASH
• Word Recognition Phone Mapping: ATT, BBN, JHU, MSU, WASH
BIG LISTS
• Word-Centric: ATT, CU, BBN, JHU, MSU, SRI, WASH
• Phone-Centric: ATT, BBN, JHU, MSU, WASH
• Phonetic Confusion Matrices: ATT, BBN, JHU, MSU, WASH
FORCED ALIGNMENT FILES
• Forced Alignment Files: BBN, JHU, MSU, WASH
• Word-Level Alignment Errors: BBN, CU, JHU, MSU, SRI, WASH
• Phone Error (Forced Alignment): CU, BBN, JHU, MSU, SRI, WASH
• Alignment Word-Phone Mapping: BBN, JHU, MSU, WASH
BIG LISTS
• Word-Centric: BBN, CU, JHU, MSU, SRI, WASH
• Phone-Centric: BBN, JHU, MSU, WASH
• Phonetic Confusion Matrices: BBN, JHU, MSU, WASH
• Description of the STP Phone Set
• STP Transcription Material
– Phone-Word Reference
– Syllable-Word Reference
• Phone Mapping for Each Site: ATT, BBN, JHU, MSU, WASH
– STP-to-Reference Map
– STP Phone-to-Articulatory-Feature Map
http://www.icsi.berkeley.edu/real/phoneval
A Syllable-Centric Perspective
In this presentation we will “drill down” from the lexical to the phonetic tiers by way of the syllable, the phone and articulatory-acoustic features
[Figure: tiers of analysis, labeled Words, Articulatory-Acoustic Features, Phonetic segment, Stress-accent]
• THE FOLLOWING SLIDES PROVIDE DETAILS ABOUT THE COARSE WORD AND PHONE SCORES FOR THE 2000 AND 2001 EVALUATIONS
• ALTHOUGH THE WORD AND PHONE SCORES ARE ROUGHLY COMPARABLE ACROSS YEARS (FOR ANALOGOUS CONDITIONS) THE 2001 EVALUATION HAS FOUR TIMES THE NUMBER OF SCORING CONDITIONS (FOR PHONES) BASED ON THE “LENIENT” vs. “STRICT” TIME-MEDIATION AND THE COMPENSATED vs. UNCOMPENSATED TRANSCRIPTION SCORING
Coarse Word and Phone Recognition
Word Recognition Error (2000)
[Figure: word error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: ATT, BBN, CU, DRAGON, JHU, MSU, SRI, WASH]
• WORD ERROR RATES VARY BETWEEN 27% AND 43%
– Substitutions are the major source of word errors
• The effect of stress is most concentrated among word-deletion errors
Prosodic Stress & Word Error Rate (2000)
Data represent averages across all eight ASR systems
[Figure: word error by prosodic stress level; legend: Unstressed, Intermediate Stress, Fully Stressed]
Syllable Structure & Word Error Rate (2000)
• Vowel-initial forms show the greatest error
• Polysyllabic forms exhibit the lowest error
Data are averaged across all eight sites
C = Consonant, V = Vowel
• VOWEL-INITIAL forms exhibit the HIGHEST error
• POLYSYLLABLES have the LOWEST error rate
Syllable Structure & Word Error Rate (2000)
Word Recognition Error (2001)
[Figure: word error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, IBM, JHU, MSU, SRI; STRICT time mediation]
• WORD ERROR RATES VARY BETWEEN 33% AND 49%
– Substitutions are the major source of word errors
Word Recognition Error (2001)
[Figure: word error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, IBM, JHU, MSU, SRI; LENIENT time mediation]
• WORD ERROR RATES VARY BETWEEN 31% AND 44%
– Substitutions are the major source of word errors
• NOT YET
• PROSODIC LABELING OF THIS MATERIAL REQUIRED FIRST
• ANALYSIS SCHEDULED FOR JUNE, 2001
Prosodic Stress & Word Error Rate (2001)
Syllable Structure & Word Error Rate (2001)
• Vowel-initial forms show the greatest error
• Polysyllabic forms exhibit the lowest error, except for CVCV forms (probably due to forms such as “gonna,” etc.)
Data are averaged across all five sites
• VOWEL-INITIAL forms exhibit the HIGHEST error
• POLYSYLLABLES have the LOWEST error rate
Syllable Structure & Word Error Rate (2001)
Are Word and Phone Errors Related? (2000)
• COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
– The correlation between the two parameters is 0.78
[Figure: phone error and word error rates (0 to 0.6) by submission site: JHU, ATT, CU, SRI, DRAG, MSU, UW, BBN]
Pronunciation Models? The differential error rate is probably related to the use of either pronunciation or language models (or both)
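A correlation of the kind quoted for the 2000 data (0.78 between word and phone error across sites) can be computed with the standard Pearson formula. The sketch below assumes that Pearson's r is the statistic meant, which the slide does not state. The per-site values are placeholders invented for illustration, not the evaluation results.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-site error rates (made up; one entry per site).
phone_error = [0.40, 0.45, 0.50, 0.55]
word_error = [0.30, 0.36, 0.35, 0.42]
print(round(pearson(phone_error, word_error), 2))
```

With only a handful of sites the estimate is sensitive to any single system, which is one reason the slides treat the relationship as suggestive rather than conclusive.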
Are Word and Phone Errors Related? (2001)
• COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
[Figure: phone error and word error rates (0 to 0.5) by site: SRI, JHU, BBN, IBM, MSU; transcription uncompensated, STRICT time mediation]
Pronunciation Model?
Are Word and Phone Errors Related? (2001)
• COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
[Figure: phone error and word error rates (0 to 0.5) by site: SRI, JHU, BBN, IBM, MSU; transcription uncompensated, LENIENT time mediation]
Pronunciation Model?
Phonetic - Pronunciation Mismatch
• THERE ARE A FAR GREATER NUMBER OF PRONUNCIATIONS IN THE TRANSCRIPTION MATERIALS THAN IN THE ASR LEXICONS
• GIVEN THAT MOST WORDS ARE CORRECTLY RECOGNIZED, THIS RESULT IMPLIES THAT PHONETIC CLASSIFICATION IN ASR SYSTEMS IS, BY NECESSITY, HIGHLY AGRANULAR
• THUS, UNUSUAL PRONUNCIATIONS ARE UNLIKELY TO BE DECODED CORRECTLY
• THE COARSE NATURE OF THE PRONUNCIATION MODELS ALSO MAKES IT DIFFICULT TO FINE-TUNE THE RELATION BETWEEN THE PHONETIC CLASSIFIER AND PRONUNCIATION MODEL COMPONENTS
Pronunciation Variation in ASR Lexicons
• MOST WORDS IN THE ASR LEXICONS HAVE A SINGLE PRONUNCIATION
• EXCEPTIONS ARE HIGHLY FREQUENT WORDS (SUCH AS “THE” AND “AND”), WHICH HAVE 2 OR 3 PRONUNCIATION VARIANTS. NO WORD HAS MORE THAN 5 PRONUNCIATION VARIANTS (AT LEAST NOT IN THE PHONETIC OUTPUT PROVIDED TO ICSI FOR THE EVALUATION)
Pronunciation Variation in Switchboard (2001)
• THERE ARE DOZENS OF DIFFERENT PRONUNCIATIONS FOR THE 100 MOST FREQUENT WORDS IN THE PHONETIC EVALUATION MATERIAL
WORD      INSTANCES  #PRON    WORD      INSTANCES  #PRON
I         588        79       THAT'S    84         39
AND       430        76       IF        82         23
THE       408        59       ON        82         23
YOU       317        54       THINK     82         19
A         285        54       WE        82         10
THAT      229        66       OR        77         24
TO        223        47       BE        73         12
KNOW      211        23       NOT       70         15
IN        209        41       WHAT      70         18
IT        208        54       MY        69         10
OF        198        56       I'M       67         18
LIKE      170        25       WELL      61         21
YEAH      165        22       WITH      57         27
HAVE      135        38       ARE       55         20
THEY      128        23       THERE     54         15
IT'S      122        39       MEAN      52         9
BUT       113        22       AT        51         23
DON'T     112        42       PEOPLE    49         20
SO        107        16       THEY'RE   49         19
UH        107        16       THIS      49         15
IS        97         21       UP        49         15
WAS       95         34       AS        48         21
FOR       91         20       GET       48         12
DO        90         26       REALLY    48         18
JUST      88         26       LOT       47         17
Pronunciation Variation in Switchboard (2001)
• THERE ARE DOZENS OF DIFFERENT PRONUNCIATIONS FOR THE 100 MOST FREQUENT WORDS IN THE PHONETIC EVALUATION MATERIAL
WORD      INSTANCES  #PRON    WORD       INSTANCES  #PRON
WOULD     47         10       GUESS      30         5
ALL       46         11       THEM       30         10
ONE       46         8        TOO        30         6
TIME      44         11       GOT        29         11
OUT       41         16       I'VE       29         15
HE        40         12       ME         29         4
NO        39         8        OKAY       29         12
ABOUT     38         26       SOME       29         5
RIGHT     38         10       WHO        29         11
THEN      38         10       ANY        28         12
WORK      38         7        THERE'S    28         21
BECAUSE   37         32       WERE       28         9
KIND      37         13       HAS        27         13
WHEN      37         13       MORE       27         9
NOW       36         9        CAN        26         11
YOU'RE    36         18       GONNA      26         19
ACTUALLY  35         20       SOMETHING  26         18
FROM      34         12       PRETTY     25         11
HAD       34         7        YOUR       25         12
GOOD      33         7        COULD      24         7
HE'S      33         9        GO         24         5
WHERE     33         13       SHE        24         6
BEEN      31         11       EVEN       23         18
DID       31         11       OUR        23         10
HERE      31         9        THINGS     23         8
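The INSTANCES and #PRON columns can be derived from a phonetically transcribed word list by counting tokens and distinct phone sequences per word. The following is a sketch under the assumption that each token is available as a (word, phone list) pair. The four tokens below are invented to show the mechanics and are not drawn from the corpus.

```python
from collections import defaultdict


def pronunciation_variants(tokens):
    """Map each word to (number of instances, number of distinct pronunciations)."""
    instances = defaultdict(int)
    variants = defaultdict(set)
    for word, phones in tokens:
        instances[word] += 1
        variants[word].add(tuple(phones))  # tuples are hashable, lists are not
    return {word: (instances[word], len(variants[word])) for word in instances}


# Invented tokens for illustration only.
toy_tokens = [
    ("AND", ["ae", "n", "d"]),
    ("AND", ["ax", "n"]),
    ("AND", ["ae", "n", "d"]),
    ("THE", ["dh", "ax"]),
]

for word, (count, n_pron) in pronunciation_variants(toy_tokens).items():
    print(word, count, n_pron)
```

Counting exact phone sequences is the strictest notion of a "different pronunciation"; collapsing near-equivalent phones first (as in the compensated scoring) would yield smaller counts.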
Phone Error and Word Length (2000)
Data are averaged across all eight sites
• For CORRECT words, only one phone (on average) is misclassified
– Implication – short words are highly tolerant of phone “errors”
• For INCORRECT words, phone errors increase linearly with word length
Phone Error and Word Length (2001)
Data are averaged across all five sites
• For CORRECT words, only one phone (on average) is misclassified
– Implication – short words are highly tolerant of phone “errors”
• For INCORRECT words, phone errors increase linearly with word length
Phone Error - Forced Alignment (2000)
[Figure: phone error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, CU, JHU, MSU, SRI, WASH]
AT&T, Dragon did not provide a complete set of forced alignments
• PHONE ERROR RATES VARY BETWEEN 35% AND 49%
– This, despite having the word transcript!!!
Phone Error - Forced Alignment (2001)
[Figure: phone error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, JHU, MSU, SRI; STRICT time mediation, transcription UNcompensated]
• PHONE ERROR RATES VARY BETWEEN 40% AND 50%
– Same picture for 2001. Suggests a potential mismatch between lexical and phonetic representations
Phone Error - Forced Alignment (2001)
[Figure: phone error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, JHU, MSU, SRI; LENIENT time mediation, transcription UNcompensated]
• PHONE ERROR RATES VARY BETWEEN 30% AND 44%
– Still a poor match between phonetic transcripts and lexical representations
Phone Error - Forced Alignment (2001)
[Figure: phone error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, JHU, MSU, SRI; STRICT time mediation, transcription compensated]
• PHONE ERROR RATES VARY BETWEEN 32% AND 38%
– Still a lack of concordance with a tolerant scoring method
Phone Error - Forced Alignment (2001)
[Figure: phone error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, JHU, MSU, SRI; LENIENT time mediation, transcription compensated]
• PHONE ERROR RATES VARY BETWEEN 23% AND 29%
– With the most tolerant scoring there is still some lack of concordance
Phonetic Confusion Matrix - CVC Syllables
• Onset consonants tend to be highly concordant with transcription
• Coda consonants are slightly less concordant, particularly some fricatives
[Figure: proportion concordance by phonetic segment (STOPS, FRICATIVES, NASALS, APPROXIMANTS); two CVC panels; forced alignment]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusions - CCVC, CVCC Syllables
• Certain fricatives are problematic in CVCC coda position
• Redo this figure and others - no wrong words, compare CVC, CVC etc.
[Figure: proportion concordance by phonetic segment (STOPS, FRICATIVES, NASALS, APPROXIMANTS); panels: CCVC, CVCC; forced alignment]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusions - CV and CVC Nuclei
• Diphthongs and tense, low monophthongs tend to be concordant
• Lax monophthongs tend to be less concordant (cf. Stress-accent paper)
[Figure: proportion concordance by phonetic segment; panels: CVC, CV; forced alignment]
Numbers refer to the concordance diagonal in the confusion matrices
Phone Error - Unconstrained Recognition (2000)
[Figure: phone error rate (0 to 0.6) by error type (TOTAL, SUB, DEL, INS) for each site: ATT, BBN, CU, DRAGON, JHU, MSU, SRI, WASH]
• PHONE ERROR RATES VARY BETWEEN 39% AND 55%
– Phone error is only slightly greater than for forced alignments
Phone Error - Unconstrained Recognition (2001)
[Figure: phone error rate (0 to 0.6) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, IBM, JHU, MSU, SRI; STRICT time mediation, transcription uncompensated]
• PHONE ERROR RATES VARY BETWEEN 44% AND 55%
– Results similar to 2000 evaluation
Condition most analogous to 2000 evaluation
Phone Error - Unconstrained Recognition (2001)
[Figure: phone error rate (0 to 0.6) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, IBM, JHU, MSU, SRI; LENIENT time mediation, transcription uncompensated]
• PHONE ERROR RATES VARY BETWEEN 38% AND 48%
– Relaxing time-mediation brings down the error slightly
Phone Error - Unconstrained Recognition (2001)
[Figure: phone error rate (0 to 0.6) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, IBM, JHU, MSU, SRI; STRICT time mediation, transcription compensated]
• PHONE ERROR RATES VARY BETWEEN 25% AND 39%
– Transcription compensation also brings down the error
Phone Error - Unconstrained Recognition (2001)
[Figure: phone error rate (0 to 0.6) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, IBM, JHU, MSU, SRI; LENIENT time mediation, transcription compensated]
• PHONE ERROR RATES VARY BETWEEN 27% AND 38%
– Phone errors decline somewhat more with lax scoring
Phonetic Confusion Matrix - CV Onsets
• ARROWS pinpoint problem segments
• AFFRICATES and FRICATIVES are problematic in CV onset position
• [d] is also problematic
[Figure: proportion concordance by phonetic segment (STOPS, FRICATIVES, NASALS, APPROXIMANTS); panels: Correct Words, Wrong Words; unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusion Matrix - CVC Onsets
• Fricatives and affricates are problematic in CVC onset position
[Figure: proportion concordance by phonetic segment (STOPS, FRICATIVES, NASALS, APPROXIMANTS); panels: Correct Words, Wrong Words; unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusion Matrix - CCVC Onsets
• Certain fricatives are particularly problematic in CCVC onset position
[Figure: proportion concordance by phonetic segment (STOPS, FRICATIVES, NASALS, APPROXIMANTS); panels: Correct Words, Wrong Words; unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusion Matrix - CVC Codas
• Fricatives are particularly problematic in CVC coda position
• Certain stops are also problematic in CVC coda position
[Figure: proportion concordance by phonetic segment (STOPS, FRICATIVES, NASALS, APPROXIMANTS); panels: Correct Words, Wrong Words; unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusion Matrix - CVCC Codas
• Certain fricatives are problematic in CVCC coda position
• [d] is also problematic in CVCC coda position
[Figure: proportion concordance by phonetic segment (STOPS, FRICATIVES, NASALS, APPROXIMANTS); panels: Correct Words, Wrong Words; unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusion Matrix - CVC Nuclei
• Certain vowels are a problem in CVC nucleus position
• Note that the level of concordance is much lower for vowels than for consonants (in onset or coda position), even for correct words
[Figure: proportion concordance by phonetic segment; panels: Correct Words, Wrong Words; unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Phonetic Confusion Matrix - CV Nuclei
• Diphthongs and low, tense vowels are more concordant with the transcription than the lax monophthongs – cf. Stress-accent paper
[Figure: proportion concordance by phonetic segment; panels: Correct Words, Wrong Words; unconstrained recognition]
Numbers refer to the concordance diagonal in the confusion matrices
Consonantal Onsets and AF Errors (2000)
• Syllable onsets are intolerant of AF errors in CORRECT words
• Place and manner AF errors are particularly high in INCORRECT onsets
Data are averaged across all eight sites
Consonantal Onsets and AF Errors (2001)
• Syllable onsets are intolerant of AF errors, particularly place, in CORRECT words
• Place and manner AF errors are particularly high in INCORRECT onsets
• Syllable structure does not have the same effect as in the 2000 analysis
Data are averaged across all five sites
Consonantal Codas and AF Errors (2000)
• Syllable codas exhibit a slightly higher tolerance for error than onsets
• There is a high degree of AF error for wrong words
Data are averaged across all eight sites
Consonantal Codas and AF Errors (2001)
• Syllable codas exhibit a slightly higher tolerance for error than onsets
• There is a high degree of AF error for wrong words
Data are averaged across all five sites
Vocalic Nuclei and AF Errors (2000)
• Nuclei exhibit a much higher tolerance for error than onsets & codas
• There are many more errors than among syllabic onsets & codas
Data are averaged across all eight sites
Vocalic Nuclei and AF Errors (2001)
• Nuclei exhibit a much higher tolerance for error than onsets & codas, particularly for height and front/back
• There are many more errors than among syllabic onsets & codas
Data are averaged across all five sites
• WITH THE ARRIVAL OF THE REMAINING FORCED-ALIGNMENT AND UNCONSTRAINED RECOGNITION DATA
– It will be possible to investigate the relative contribution of the phonetic classification, pronunciation and language models to recognition performance
– In order to do this, it is necessary to obtain unconstrained recognition, forced alignment and phone-confidence material from each site (to the extent possible) [the phone confidence metric is problematic]
• CUSTOMIZED ANALYSES FOR INDIVIDUAL SITES
– SRI has different versions of their system (with & w/o adaptation, etc.)
– AT&T will use phone strings from ICSI transcription material
– Individual diagnostics for each site (are there significant differences for specific parameters?)
• MOST OF THE DATA FOR THE 2001 EVALUATION WILL BE POSTED ON THE PHONEVAL WEB SITE SHORTLY
• WEB-BASED ORACLE DATABASE APPLICATION IS NEAR COMPLETION
– Will enable searches over the web of the Phoneval corpus and be able to graph the results (this is the tricky part, given the ugly nature of Oracle Web DB…)
• A PAPER DESCRIBING THE FULL SET OF ANALYSES WILL BE AVAILABLE AT THE END OF JUNE (2001)
Into the (Near) Future …
• PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
– Many different analyses support this conclusion
– Consonants appear to be more important than vowels
• SYLLABLE STRUCTURE IS ALSO AN IMPORTANT FACTOR FOR ACCURATE RECOGNITION
– The pattern of errors differs across the syllable (onset, nucleus, coda) and exhibits consistent patterns difficult to discern with other units of analysis
• STRESS-ACCENT MAY PLAY AN IMPORTANT ROLE, PARTICULARLY FOR UNDERSTANDING THE NATURE OF WORD-DELETION ERRORS
– Relation among stress-accent, syllable structure, vocalic identity and length
• THE NATURE OF PRONUNCIATION MODELS AND THEIR RELATION TO LEXICAL REPRESENTATIONS IS A POTENTIALLY KEY FACTOR
– The unit of lexical representation (phones, articulatory features, etc.) is probably of the utmost importance for optimizing ASR performance
• FUTURE PROGRESS IN ASR SYSTEM DEVELOPMENT IS LIKELY TO DEPEND ON DEEP INSIGHT INTO THE NATURE OF SPOKEN LANGUAGE
Summary and Conclusions