
Mandarin Chinese Speech Recognition

Mandarin Chinese

  Tonal language (inflection matters!)
    1st tone: High, constant pitch (like saying “aaah”)
    2nd tone: Rising pitch (“Huh?”)
    3rd tone: Low pitch (“ugh”)
    4th tone: High pitch with a rapid descent (“No!”)
    “5th tone”: Neutral, used for de-emphasized syllables
  Monosyllabic language
    Each character represents a single base syllable and tone
    Most words consist of 1, 2, or 4 characters
  Heavily contextual language

Mandarin Chinese and Speech Processing

  Acoustic representations of Chinese syllables
    Structural form: (consonant) + vowel + (consonant)
  Phone sets
    Initial/final phones [1]
      e.g. shi, ge, zi = (sh + ib), (g + e), (z + if)
      Initial phones: unvoiced, 1 phone
      Final phones: voiced (tone 1-5), can consist of multiple phones
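The initial/final decomposition can be sketched as a prefix match against the pinyin consonant inventory. This is an illustration only: `split_syllable` and `PINYIN_INITIALS` are hypothetical names, the finals are left as plain pinyin letters, and the phone labels used in [1] (such as "ib" and "if") are not reproduced.

```python
# Hypothetical helper: split a toneless pinyin syllable into initial and final.
# Assumption: the standard pinyin consonant initials; y/w glides are ignored.
PINYIN_INITIALS = [
    "zh", "ch", "sh",  # two-letter initials must be tried first
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
]


def split_syllable(syllable):
    """Return (initial, final); the initial is "" for vowel-initial syllables."""
    s = syllable.lower()
    for initial in PINYIN_INITIALS:
        if s.startswith(initial) and len(s) > len(initial):
            return initial, s[len(initial):]
    return "", s


for syl in ["shi", "ge", "zi", "an"]:
    print(syl, split_syllable(syl))
```

A real recognizer would map each final to one or more tonal phones; that mapping depends on the phone set and is not attempted here.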


  Strong tonal recognition is crucial to distinguish between homonyms [3] (especially without context)
  Creating tone models is difficult
    Discontinuities exist in the F0 contour between voiced and unvoiced regions
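One common workaround for the F0 discontinuity is to interpolate across unvoiced frames so that the pitch stream is defined everywhere. The sketch below assumes unvoiced frames are marked with 0.0 and uses plain linear interpolation; `interpolate_f0` is a hypothetical helper, not the smoothing method of any cited paper.

```python
def interpolate_f0(f0):
    """Fill unvoiced frames (marked 0.0) by linear interpolation.

    Leading and trailing unvoiced frames copy the nearest voiced value.
    """
    voiced = [i for i, v in enumerate(f0) if v > 0]
    if not voiced:
        return list(f0)  # nothing to anchor the interpolation on
    out = list(f0)
    # Extend the first and last voiced values over the edges.
    for i in range(voiced[0]):
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = f0[voiced[-1]]
    # Interpolate across each interior unvoiced gap.
    for left, right in zip(voiced, voiced[1:]):
        gap = right - left
        for i in range(left + 1, right):
            frac = (i - left) / gap
            out[i] = f0[left] + frac * (f0[right] - f0[left])
    return out


contour = [0.0, 200.0, 210.0, 0.0, 0.0, 180.0, 0.0]  # made-up F0 values in Hz
print(interpolate_f0(contour))
```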

Prosody

  Prosody: “the rhythmic and intonational aspect of language” [2]
  Embedded Tone Modeling [4]
  Explicit Tone Modeling [4]

Tone Modeling

  Embedded Tone Modeling
    Tonal acoustic units are joined with spectral features at each frame [4]
  Explicit Tone Modeling
    Tone recognition is completed independently and combined after post-processing [4]

Tone Modeling

  Pitch, energy, and duration (prosody) combined with lexical and syntactic features improve tonal labeling
  Coarticulation
    Variations in syllables can cause variations in tone:
      Bu4 + Dui4 = Bu2 Dui4 (wrong)
      Ni3 + Hao3 = Ni2 Hao3 (hello)
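The two tone changes listed above can be written as context rules over (syllable, tone) pairs. This is a minimal sketch covering only those two cases; `apply_sandhi` is a hypothetical name and the rule set is far from complete.

```python
def apply_sandhi(syllables):
    """Apply two simplified tone-change rules to [(pinyin, tone), ...].

    Rule 1: a 3rd tone directly before another 3rd tone becomes a 2nd tone.
    Rule 2: "bu" (4th tone) directly before a 4th tone becomes a 2nd tone.
    Tones are read from the input (citation) forms, so runs of three or
    more 3rd tones are only roughly approximated.
    """
    result = []
    for i, (syl, tone) in enumerate(syllables):
        next_tone = syllables[i + 1][1] if i + 1 < len(syllables) else None
        if tone == 3 and next_tone == 3:
            tone = 2
        elif syl == "bu" and tone == 4 and next_tone == 4:
            tone = 2
        result.append((syl, tone))
    return result


print(apply_sandhi([("ni", 3), ("hao", 3)]))   # [('ni', 2), ('hao', 3)]
print(apply_sandhi([("bu", 4), ("dui", 4)]))   # [('bu', 2), ('dui', 4)]
```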

Embedded Tone Modeling: Two Stream Modeling

Ni, Liu, Xu

  Spectral stream: MFCCs (Mel-frequency cepstral coefficients)
    Describe vocal tract information
    Distinctive for phones (short time duration)
  Pitch/tone stream: requires smoothing
    Describes vibrations of the vocal cords
    Independent of spectral features
    d/dt(pitch), aka tone, and d^2/dt^2(pitch) are added
    Embedded in an entire syllable
    Affected by coarticulation (requires a longer time window), i.e. tone sandhi (context dependency)
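The first and second time derivatives of pitch are usually estimated per frame with a regression over neighboring frames. The sketch below uses the widely used delta-regression formula with a window of 2 frames and replicated edges; the window size, the edge handling, and the name `delta` are assumptions, not details taken from the slides.

```python
def delta(values, window=2):
    """Regression-based delta over +/- `window` frames, edges replicated."""
    n = len(values)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        num = 0.0
        for k in range(1, window + 1):
            ahead = values[min(t + k, n - 1)]
            behind = values[max(t - k, 0)]
            num += k * (ahead - behind)
        out.append(num / denom)
    return out


pitch = [100.0, 102.0, 104.0, 106.0, 108.0, 110.0]  # made-up smoothed F0 in Hz
d1 = delta(pitch)  # estimate of d/dt(pitch)
d2 = delta(d1)     # estimate of d^2/dt^2(pitch)
pitch_stream = list(zip(pitch, d1, d2))  # one 3-dimensional vector per frame
print(pitch_stream[2])
```

In the interior of a linear contour the delta equals the slope per frame; near the edges the replicated padding pulls it toward zero.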

Embedded Tone Modeling: Two Stream Modeling [4]

  Tonal identification features
    F0
    Energy
    Duration
    Coarticulation (cont. speech)
  Initially use 2-stream embedded model followed by explicit modeling during lattice rescoring (alignment?)
  Explicit tone modeling uses max. entropy framework [4] (discriminative model)
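For orientation, a maximum entropy classifier is conventionally written as a log-linear model. The notation here is assumed for illustration and is not taken from [4]: t is a candidate tone, x the observed context, f_i a feature function, and \lambda_i its learned weight.

```latex
P(t \mid x) = \frac{\exp\left(\sum_{i} \lambda_i f_i(x, t)\right)}
                   {\sum_{t'} \exp\left(\sum_{i} \lambda_i f_i(x, t')\right)}
```

The denominator sums over all candidate tones, so the scores form a probability distribution that can be combined with other scores when rescoring.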

Explicit Tone Modeling [4]

No.  Feature Description                                                                             # of Features
1    Duration of current, previous, and following syllables                                          3
2    Previous syllable is or is not sp                                                               1
3    Slope and intercept of F0 contour of current syllable, its delta, and delta-delta               6
4    Statistical parameters of pitch and log-energy of current syllable (i.e. max, min, mean, etc.)  10
5    Normalized max and mean of pitch and energy in each syllable in the context window              12
6    Location of current syllable within word                                                        1
7    Tones of preceding and following syllables                                                      2
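The slope and intercept in feature 3 can be obtained with an ordinary least-squares line fit over the frames of a syllable. The sketch below is one plausible reading; `fit_line` is a hypothetical helper and the frame index is assumed to be the time axis. Applying it to the F0 contour, its delta, and its delta-delta would presumably give the 6 values counted in the table.

```python
def fit_line(values):
    """Least-squares fit of values[i] ~ slope * i + intercept."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two frames")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


f0 = [200.0, 196.0, 192.0, 188.0]  # made-up falling contour in Hz
slope, intercept = fit_line(f0)
print(slope, intercept)
```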

Other Work: Chang, Zhou, Di, Huang, & Lee [1]

  3 Methods
    Powerful language model (no tone modeling)
      CER = 7.32%
    Embedded 2 stream
      Tone stream + feature stream
      CER = 6.43%
    Embedded 1 stream
      Developed pitch extractor
      Pitch track added to feature vector
      CER = 6.03%
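CER (character error rate) is normally the character-level edit distance between the reference and the recognizer output, divided by the reference length. A minimal sketch, with hypothetical function names and a made-up example sentence:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character in a six-character reference.
print(cer("今天天气很好", "今天天汽很好"))
```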

Other Work: Qian, Soong [3]

  F0 contour smoothing
  Multi-Space Distribution (MSD)
    Models 2 prob. spaces
      Unvoiced: discrete
      Voiced (F0 contour): continuous
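The two-space idea can be sketched as an observation likelihood with a discrete branch for unvoiced frames and a Gaussian branch for voiced ones. This is a simplified single-Gaussian illustration; the function names and all parameter values are made up, and the model in [3] is richer than this.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def msd_likelihood(f0, w_unvoiced, w_voiced, mean, var):
    """Observation likelihood for one state with two spaces.

    f0 is None for an unvoiced frame (zero-dimensional, discrete space) and
    a float for a voiced frame (one-dimensional, continuous space).
    """
    if f0 is None:
        return w_unvoiced
    return w_voiced * gaussian_pdf(f0, mean, var)


# Made-up parameters for a mostly voiced state; the space weights sum to 1.
params = dict(w_unvoiced=0.1, w_voiced=0.9, mean=200.0, var=400.0)
print(msd_likelihood(None, **params))
print(msd_likelihood(210.0, **params))
```

Because unvoiced frames get their own probability mass, the F0 stream needs no artificial value in unvoiced regions.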

Other Work: Lamel, Gauvain, Le, Oparin, Meng [6]

  Multi-Layer Perceptron features
    Combined with MFCCs and pitch features
  Compare language models
    N-gram: back-off language model
    Neural network language model
  Language model adaptation

Other Work: O. Kalinli [7]

  Replace prosodic features with biologically inspired auditory attention cues
    Cochlear filtering, inner hair cell, etc.
    Other features are extracted from the auditory spectrum
      Intensity
      Frequency contrast
      Temporal contrast
      Orientation (phase)

Other Work: Qian, Xu, Soong [8]

  Cross-lingual voice transformation
    Phonetic mapping between languages
    Difficult for Mandarin and English
      Very different prosodic features

References

[1] Eric Chang, Jianlai Zhou, Shuo Di, Chao Huang, & Kai-Fu Lee, “Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones”

[2] Merriam-Webster Dictionary, http://www.merriam-webster.com/

[3] Yao Qian & Frank Soong, “A Multispace Distribution (MSD) and Two Stream Tone Modeling Approach to Mandarin Speech Recognition”, Science Direct, 2009

[4] Chongjia Ni, Wenju Liu, & Bo Xu, “Improved Large Vocabulary Mandarin Speech Recognition Using Prosodic and Lexical Information in Maximum Entropy Framework”

[5] Yi Liu & Pascale Fung, “Pronunciation Modeling for Spontaneous Mandarin Speech Recognition”, International Journal of Speech Technology, 2004

[6] Lori Lamel, J.L. Gauvain, V.B. Le, I. Oparin, S. Meng, “Improved Models for Mandarin Speech to Text Transcription”, ICASSP, 2011

[7] O. Kalinli, “Tone and Pitch Accent Classification Using Auditory Attention Cues”, ICASSP, 2011
