Underspecified feature models for pronunciation variation in ASR


Eric Fosler-Lussier, The Ohio State University

Speech & Language Technologies Lab

ITRW - Speech Recognition & Intrinsic Variation

20 May 2006


Fill in the blanks

• 3, 6, __, 12, 15, __, 21, 24
• A B C __ E F __ H
• You’re going to Toulouse? Drink a bottle of _____ for me!
• What’s the red object?

We’re very good at filling in the blanks when we have context!




Filling in the blanks: missing data

• Missing data approaches have been used to integrate over noisy acoustics

[Figure: time-frequency panels, frequency (Hz, 50 to 8000) against time (s, 0.5 to 3.5): (a) clean utterance; (b) mixture (SNR 0 dB); (c) segregated voiced utterance; (d) segregated whole utterance; (e) utterance segregated from IBM. Wang & Hu 06]




Decode this!

(brackets indicate options)

s iy n y {ah,ax,axr,er} {l,r} {eh,ih,iy} s er ch {ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d}




Decode this!


s iy n y {ah,ax,axr,er} → senior
{l,r} {eh,ih,iy} s er ch → research
{ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d} → associate
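The decoding exercise above can be sketched as a lookup in which each bracketed slot is a set of acceptable phones. This is only an illustration: the toy `LEXICON`, its pronunciations, and the function names are hypothetical and not taken from any dictionary or from the talk.

```python
# Toy lexicon (hypothetical entries, for illustration only).
LEXICON = {
    "senior": ["s", "iy", "n", "y", "er"],
    "research": ["r", "iy", "s", "er", "ch"],
    "resurge": ["r", "iy", "s", "er", "jh"],
    "associate": ["ax", "s", "ow", "sh", "iy", "ey", "t"],
}

def parse(spec):
    """Turn 's iy {l,r}' into a list of sets of allowed phones."""
    slots = []
    for token in spec.split():
        if token.startswith("{") and token.endswith("}"):
            slots.append(set(token[1:-1].split(",")))
        else:
            slots.append({token})
    return slots

def matches(slots, phones):
    """A pronunciation matches if every phone falls inside its slot's option set."""
    return len(slots) == len(phones) and all(p in s for p, s in zip(phones, slots))

def decode(spec):
    """Return every lexicon word compatible with the underspecified phone string."""
    slots = parse(spec)
    return [word for word, phones in LEXICON.items() if matches(slots, phones)]

print(decode("s iy n y {ah,ax,axr,er}"))
print(decode("{l,r} {eh,ih,iy} s er ch"))
print(decode("{ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d}"))
```

Even with several slots left open, each string is compatible with only one word in this small lexicon, which is the point of the exercise.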








Decode this!





associate

dictionary pronunciation

as marked by transcribers (Buckeye Corpus of Speech)








What do these tasks have in common?

• Recovering from erroneous information?
– Context plays a big role in helping “clean up”
• Recovering from incomplete information!
– We should be treating pronunciation variation as a missing data problem
• Integrate over “missing” phonological features
– How much information do you need to decode words?
• Particularly taking into account the context of the word, syllabic context of phones, etc.
• Information theory problem
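One way to read "integrate over missing phonological features" is as marginalization: a feature that was not observed is summed out of the phone likelihood. The sketch below assumes features are independent given the phone, which the talk does not claim; the phone models and all probabilities are made up for illustration.

```python
# Hypothetical per-phone feature distributions P(value | phone); numbers are made up.
PHONE_MODELS = {
    "t": {"manner": {"stop": 0.9, "fricative": 0.1},
          "voicing": {"voiced": 0.2, "voiceless": 0.8}},
    "d": {"manner": {"stop": 0.9, "fricative": 0.1},
          "voicing": {"voiced": 0.8, "voiceless": 0.2}},
    "s": {"manner": {"stop": 0.1, "fricative": 0.9},
          "voicing": {"voiced": 0.1, "voiceless": 0.9}},
}

def likelihood(phone, observed):
    """P(observed features | phone), summing out any feature that is missing.

    Assumes features are independent given the phone, so a missing feature
    contributes sum_v P(v | phone) = 1 and drops out of the product.
    """
    score = 1.0
    for feature, dist in PHONE_MODELS[phone].items():
        if feature in observed:
            score *= dist[observed[feature]]
        else:
            score *= sum(dist.values())
    return score

def posterior(observed):
    """Normalize the likelihoods over phones (uniform prior assumed)."""
    scores = {p: likelihood(p, observed) for p in PHONE_MODELS}
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()}

print(posterior({"manner": "stop"}))                          # voicing missing
print(posterior({"manner": "stop", "voicing": "voiceless"}))  # fully observed
```

With voicing missing, /t/ and /d/ stay equally likely; whether that ambiguity matters depends on whether the word hypotheses in context differ on voicing.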




Outline

• Problems with phonetic representations of variation
– Potential advantages of phonological features
• Re-examining the role of phonetic transcription
• Phonological feature approaches to ASR
– Feature attribute detection
– Feature combination methods
– Learning to (dis-)trust features
• A challenge for the future




“The Case Against The Phoneme”
Homage to Ostendorf (ASRU 99)

• Four major indications that phonetic modeling of variation is not appropriate:





– Lack of progress on spontaneous speech WER
• McAllaster et al (98): 50% improvement possible
• Finke & Waibel (97): 6% WER reduction





– Independence of decisions in phone-based models
• When pronunciation variation is modeled on a phone-by-phone level, unusual baseforms are often created
• Word-based learning fails to generalize across words

Riley et al 98




– Lack of granularity
• Triphone contexts mean a symbolic change in phone can affect 9 HMM states (min 90 msec)
• Much variation is already handled by triphone context

Jurafsky et al 01
Saraçlar et al 00
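The "9 HMM states (min 90 msec)" figure can be checked with a small count. The sketch assumes 3 states per triphone model and a 10 ms frame shift with at least one frame per state; both are common settings but are assumptions here, as are the function names.

```python
STATES_PER_PHONE = 3   # assumption: 3-state HMM per triphone
FRAME_MS = 10          # assumption: 10 ms frame shift, one frame minimum per state

def triphones(phones):
    """Expand a phone string into left-center+right triphone labels."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}" for i in range(1, len(padded) - 1)]

def affected(canonical, variant):
    """Count triphone models, states, and minimum duration touched by a substitution."""
    changed = [(a, b) for a, b in zip(triphones(canonical), triphones(variant)) if a != b]
    n_states = len(changed) * STATES_PER_PHONE
    return changed, n_states, n_states * FRAME_MS

# "toot" with uw fronted to ux: one symbol changes, three triphones change.
changed, n_states, min_ms = affected(["t", "uw", "t"], ["t", "ux", "t"])
for before, after in changed:
    print(before, "->", after)
print(n_states, "states,", min_ms, "ms minimum")
```

The single substitution rewrites the center triphone and both neighbors, so three models of three states each are swapped out.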




– Difficulty in transcription
• Phonetic transcription is expensive and time-consuming
• Many decisions are difficult for transcribers to make




Using phonological features

• Finer granularity
– Some phonological changes don’t result in canonical phones for a language
• English: uw can sometimes be fronted (toot)
• Common enough: TIMIT introduced a special phone (ux)
• Symbol change loses all commonality between phones (uw->ux)
– Handling odd phonological effects
• Phone deletions: many “deletions” really leave small traces of coarticulation on neighboring segments
• E.g. vowel nasalization with nasal deletion
• Features may provide basis for cross-lingual recognition
• International Phonetic Alphabet




Issues with phonological features

• Interlingua: “high vowels in English are not the same as high vowels in Japanese”
– Richard Wright, lunch Wednesday, ICASSP 2006
• Concept of “independent directions” false
– Correlation of feature values
– Distances no longer Euclidean among feature dimensions
• Dealing with feature spreading
• Even more difficulty in transcription
– (but: Karen Livescu’s group, JHU workshop 2006)
• Articulatory vs. acoustic features
– No two definitions are exactly the same (see Richard’s talk)




Phonetic transcription

• There have been a number of efforts to transcribe speech phonetically
– American English
• TIMIT (4 hr read speech)
• Switchboard (4 hr spontaneous speech)
• Buckeye Corpus (40 hr spontaneous speech), http://buckeyecorpus.osu.edu
• ASR researchers have found it difficult to utilize phonetic transcriptions directly

Riley et al 99



ASR & Phonetic Transcription

• Saraclar & Khudanpur (04) examined the means of acoustic models where canonical phone /x/ was transcribed as [y], over all pairs x:y
– Compared means of x:y to x:x, y:y
– Data showed that x:y means often fell between x:x and y:y, sometimes closer to x:x
• Another view: data from Buckeye Corpus
– /ae/ is sometimes transcribed as [eh]
– Examined 80 vowels from one speaker
• Formant frequencies from center of vowel
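The comparison of x:y means against x:x and y:y means can be sketched as a projection onto the line joining the two canonical means. This is not the procedure used in the cited study, only one simple way to ask "where between the two does the x:y mean fall?". The (F1, F2) values below are invented for illustration and are not Buckeye measurements.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def position_between(m_xx, m_yy, m_xy):
    """Project m_xy onto the line from m_xx to m_yy.

    Returns 0.0 at the x:x mean and 1.0 at the y:y mean; a value strictly
    between 0 and 1 means the x:y mean lies between the two.
    """
    direction = [b - a for a, b in zip(m_xx, m_yy)]
    offset = [c - a for a, c in zip(m_xx, m_xy)]
    dot = sum(d * o for d, o in zip(direction, offset))
    return dot / sum(d * d for d in direction)

# Invented (F1, F2) tokens in Hz, for illustration only.
ae_as_ae = [[700.0, 1700.0], [720.0, 1750.0]]   # /ae/ transcribed [ae]
eh_as_eh = [[550.0, 1850.0], [570.0, 1900.0]]   # /eh/ transcribed [eh]
ae_as_eh = [[650.0, 1760.0], [670.0, 1790.0]]   # /ae/ transcribed [eh]

t = position_between(mean(ae_as_ae), mean(eh_as_eh), mean(ae_as_eh))
print(round(t, 3))
```

A value below 0.5 would correspond to the "sometimes closer to x:x" case on the slide.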




[Figure: vowel formant plot for /ae/ tokens, with annotated regions: higher than eh; opposite side of ae from eh; mixed ae/eh; ae territory]



Can you trust transcription?

• Perceptual marking ≠ acoustic measurement
– Can’t take transcription at face value
• What are the transcribers trying to tell us?
– This phone doesn’t sound like a canonical phone
– Perhaps we can look at commonalities across canonical/transcribed phone
• ae:eh -> front vowel (& not high?)
• Phonological features may help us represent transcription differences.
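The "commonalities across canonical/transcribed phone" idea can be sketched as an intersection of feature bundles. The feature table below is a simplified illustration; a real system would use its own phone-to-feature definitions, and the function names are hypothetical.

```python
# Illustrative feature values; not a definitive feature inventory.
FEATURES = {
    "ae": {"class": "vowel", "height": "low", "frontness": "front", "round": "unrounded"},
    "eh": {"class": "vowel", "height": "mid", "frontness": "front", "round": "unrounded"},
    "iy": {"class": "vowel", "height": "high", "frontness": "front", "round": "unrounded"},
    "uw": {"class": "vowel", "height": "high", "frontness": "back", "round": "rounded"},
}

def common_features(canonical, transcribed):
    """Keep only the features on which the two phones agree."""
    a, b = FEATURES[canonical], FEATURES[transcribed]
    return {f: v for f, v in a.items() if b.get(f) == v}

def compatible(phone, spec):
    """True if the phone does not contradict any feature kept in spec."""
    return all(FEATURES[phone].get(f) == v for f, v in spec.items())

spec = common_features("ae", "eh")
print(spec)                      # height is left unspecified
print(compatible("iy", spec))    # a high front vowel also fits
print(compatible("uw", spec))    # a back rounded vowel does not
```

Because plain intersection leaves height entirely open, /iy/ also satisfies the result, which is why a constraint like "not high" may be needed on top of it.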




Variation in single-phone changes

• Compared canonical vs. transcribed consonants with single-phone substitutions in Switchboard, Buckeye
– Differences in manner, place, voicing counted

Manner  Place  Voicing  SWB %  BCS %
                        42.1   41.5
                         7.3   13.8
                        39.7   27.1
                         8.2   12.5
                         1.4    1.5
                         0.0    1.1
                         0.7    2.1

• Single dimension common
• Manner, voicing variants more common than place




Recent approaches to feature modeling in ASR

• Since the 90’s there has been increased interest in phonological feature modeling
– Deng et al (92 ff), Kirchhoff (96 ff)
• Current directions of research
– Approaches for detecting phonological features from data
– Methods of combining phonological features
– Knowing when to ignore information




Feature detection methods

• Frame-level decisions
– Most common: artificial neural network methods
• Input: various flavors of spectral/cepstral representations
• Output: estimating posterior P(feature|acoustics) on a per-frame level
– Recent competitor: support vector machines
• Typically used for binary decision problems
• Segmental-level decisions: integrate over time
– HMM detectors
– Hybrid ANN/Dynamic Bayesian Network
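A per-frame posterior estimate P(feature|acoustics) has the shape sketched below: a score per feature value, normalized with a softmax. A single linear layer stands in for the ANN here, and the weights, biases, and input frame are arbitrary placeholders rather than trained values.

```python
import math

MANNERS = ["stop", "fricative", "nasal", "vowel"]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def frame_posteriors(frame, weights, biases):
    """One linear layer plus softmax: P(manner | frame) for a single frame."""
    scores = [sum(w * x for w, x in zip(row, frame)) + b
              for row, b in zip(weights, biases)]
    return dict(zip(MANNERS, softmax(scores)))

# Placeholder parameters (untrained) and a 3-dimensional stand-in for a cepstral frame.
WEIGHTS = [[1.0, -0.5, 0.2], [-0.3, 0.8, 0.1], [0.2, 0.1, -0.4], [0.0, 0.3, 0.9]]
BIASES = [0.0, 0.0, 0.0, 0.0]
frame = [0.5, -1.2, 0.3]

posteriors = frame_posteriors(frame, WEIGHTS, BIASES)
print(posteriors)
```

The outputs are bounded in [0, 1] and sum to one across the values of the feature, which is what later stages rely on when they consume them.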




Binary vs. n-ary features

• Features can be described as either binary or n-ary if they can contrast
– Binary: /t/ : +stop -fricative …
– N-ary: /t/ : manner=stop
• No real conclusion on which is better
– Binary more matched to SVM learning
– N-ary allows for discrimination among classes
• Should a segment be allowed to be +stop +fricative?
– Anecdotally (our lab) we find n-ary features slightly better
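The difference between the two encodings can be made concrete with a small conversion sketch. The manner inventory and function names are hypothetical; the point is only that a binary bundle can express combinations that an n-ary value cannot.

```python
MANNER_VALUES = ["stop", "fricative", "nasal"]

def nary_to_binary(manner):
    """manner=stop becomes +stop -fricative -nasal."""
    return {m: (m == manner) for m in MANNER_VALUES}

def binary_to_nary(flags):
    """Return the single active manner, or None if there is no n-ary equivalent."""
    active = [m for m in MANNER_VALUES if flags.get(m)]
    return active[0] if len(active) == 1 else None

t_binary = nary_to_binary("stop")
print(t_binary)
print(binary_to_nary(t_binary))

# +stop +fricative is expressible with binary features but has no single n-ary value.
both = {"stop": True, "fricative": True, "nasal": False}
print(binary_to_nary(both))
```

An n-ary detector forces the values to compete, while independent binary detectors leave open whether a bundle such as +stop +fricative should be accepted.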




Hierarchical representations

• Phonological features are not truly independent
– Chang et al (01): Place prediction improves if manner is known
• ANN predicts P(place=x|manner=y,X) vs P(place=x|X)
• Suggests need for hierarchical detectors
– Rajamanohar & Fosler-Lussier (05): Cascading errors make chained decisions worse
• Better to jointly model P(place=x,manner=y|X), or even derive P(place=x|X) from phone probabilities
– Frankel et al (04): Hierarchy can be integrated as additional dependencies in DBN
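Deriving P(place=x|X) from phone probabilities amounts to summing the phone posteriors of all phones that share that place. The sketch below uses a small stop-consonant subset and invented posterior values; the names are hypothetical.

```python
# Place of articulation for a small subset of English stops.
PLACE_OF = {"p": "labial", "b": "labial",
            "t": "alveolar", "d": "alveolar",
            "k": "velar", "g": "velar"}

def place_posteriors(phone_posteriors):
    """P(place=x | X) = sum of P(phone=q | X) over phones q whose place is x."""
    out = {}
    for phone, prob in phone_posteriors.items():
        place = PLACE_OF[phone]
        out[place] = out.get(place, 0.0) + prob
    return out

# Invented phone posteriors for one frame.
frame = {"p": 0.05, "b": 0.05, "t": 0.40, "d": 0.30, "k": 0.10, "g": 0.10}
print(place_posteriors(frame))
```

Here the frame is ambiguous between /t/ and /d/ at the phone level, yet the place estimate is confidently alveolar, with no chained manner decision to propagate errors.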




Combining features into higher-level structures

• Once you have (frame-level) estimates of phonological features, need to combine
– Temporal integration: Markov structures
– Phonetic spatial integration: combining into higher-level units (phones, syllables, words)
• Differences in methodologies:
– spatial first, then temporal
– joint/factored spatio-temporal integration
– phone-level temporal integration with spatial rescoring




Combining features into higher-level structures

• Tandem ANN/HMM Systems
– ANN feature posterior estimates are used as replacements for MFCCs for Mixture of Gaussians HMM system
– We find decorrelation of features (via PCA) necessary to keep models well conditioned
• Lattice rescoring with Landmarks
– Maximum entropy models for local word discrimination
– SVMs used as local features for MaxEnt model.
• Dynamic Bayesian Models
– Model asynchrony as a hidden variable
– SVM outputs used as observations of features

Launay et al 02
Hasegawa-Johnson et al 05
Livescu 05



Combining features into higher-level structures

• Conditional random fields
– CRFs jointly model spatio-temporal integration
– Probability expressed in terms of indicator functions s (state), t (transition)
• Usually binary in NLP applications
– Frame-level ANN posteriors are bounded
• Probabilities can serve as observation feature functions
– s_stop(/t/, x, i) = P(manner=stop | x_i)


$$P(y \mid x) \propto \exp\left( \sum_{i} \left( \sum_{j} \lambda_j \, s_j(y_i, x, i) + \sum_{k} \mu_k \, t_k(y_{i-1}, y_i, x, i) \right) \right)$$

Morris & Fosler-Lussier 06
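The CRF probability can be sketched for a tiny label set by scoring a label sequence and normalizing over all alternatives by brute force. State feature functions are the frame posteriors gated by the label, as in s_stop(/t/, x, i) above; transition functions are indicators on label pairs. The weights, posteriors, and names below are invented, and a real system would use dynamic programming rather than enumeration.

```python
import math
from itertools import product

LABELS = ["t", "s"]

def state_score(label, frame, lam):
    """Sum over j of lambda_j * s_j(y_i, x, i), with s_j a manner posterior."""
    return sum(lam[(label, manner)] * p for manner, p in frame.items())

def sequence_score(labels, frames, lam, mu):
    """Unnormalized log score: state terms plus transition terms over all frames."""
    total = 0.0
    for i, (y, frame) in enumerate(zip(labels, frames)):
        total += state_score(y, frame, lam)
        if i > 0:
            total += mu[(labels[i - 1], y)]   # t_k is an indicator on the label pair
    return total

def sequence_probability(labels, frames, lam, mu):
    """P(y | x), normalized by enumerating every label sequence (tiny cases only)."""
    z = sum(math.exp(sequence_score(c, frames, lam, mu))
            for c in product(LABELS, repeat=len(frames)))
    return math.exp(sequence_score(labels, frames, lam, mu)) / z

# Invented weights and frame-level manner posteriors.
lam = {("t", "stop"): 2.0, ("t", "fricative"): -1.0,
       ("s", "stop"): -1.0, ("s", "fricative"): 2.0}
mu = {(a, b): (0.5 if a == b else 0.0) for a in LABELS for b in LABELS}
frames = [{"stop": 0.9, "fricative": 0.1},
          {"stop": 0.8, "fricative": 0.2},
          {"stop": 0.2, "fricative": 0.8}]

best = max(product(LABELS, repeat=len(frames)),
           key=lambda c: sequence_probability(c, frames, lam, mu))
print(best, round(sequence_probability(best, frames, lam, mu), 3))
```

A weight near zero for a (label, feature) pair means that feature is effectively ignored for that label, which is how the learned parameters can leave a phone partially underspecified.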



Conditional Random Fields

+ CRFs make no independence assumptions about input
  – Posteriors can be used directly without decorrelation
  – Can combine features, phones, …
  – No assumption of temporal independence
+ Entire label sequence is modeled jointly
  – Monophone feature CRF phone recog. similar to triphone HMM
+ Learning parameters (λ, μ) determines importance of feature/phone relationships
  – Implicit model of partial phonological underspecification
– Slow to train

$$P(y \mid x) \propto \exp\left( \sum_{i} \left( \sum_{j} \lambda_j \, s_j(y_i, x, i) + \sum_{k} \mu_k \, t_k(y_{i-1}, y_i, x, i) \right) \right)$$




Underspecification

• All of these models learn what phonological information is important in higher-level processing
– Ignoring “canonical” feature definitions for phone is a form of underspecification
– Traditional underspecification: some features are undefined for a particular phone
– Weighted models: partial underspecification
• When can you ignore phonetic information?
– Crucially, when it doesn’t help you disambiguate between word hypotheses




Underspecification

• Example: unstressed syllables tend to show more phonetic variation than stressed syllables
– Experiment: reduce phonetic representation for unstressed syllables to manner class
– Allowing recognizer to choose best representation (phone/manner) during training (WSJ0):
• Minor degradation for clean speech (9.9 vs. 9.1 WER)
• Larger improvement in 10dB car noise (15.8 vs 13.0 WER)
• Moral: we don’t need to have exact phonetic representation to decode words
– But we may need to integrate more higher-level knowledge

Fosler-Lussier et al 05
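Reducing unstressed syllables to manner class can be sketched as a rewrite of the pronunciation. This shows only the mapping step, not the training-time choice between phone and manner representations described on the slide. The manner labels, the syllabification, and the stress marking of "associate" are assumptions made for illustration.

```python
# Broad manner classes for the phones used below (illustrative label names).
MANNER_CLASS = {
    "s": "FRIC", "sh": "FRIC", "t": "STOP", "d": "STOP", "n": "NASAL",
    "ax": "VOWEL", "ow": "VOWEL", "iy": "VOWEL", "ey": "VOWEL",
}

def underspecify(syllables):
    """syllables: list of (phones, stressed) pairs.

    Phones in stressed syllables are kept; phones in unstressed syllables
    are replaced by their manner class.
    """
    out = []
    for phones, stressed in syllables:
        out.extend(phones if stressed else [MANNER_CLASS[p] for p in phones])
    return out

# Hypothetical syllabification and stress marking for "associate".
associate = [(["ax"], False), (["s", "ow"], True),
             (["sh", "iy"], False), (["ey", "t"], False)]
print(underspecify(associate))
```

The reduced form accepts any fricative plus vowel in the third syllable, so the s/sh/z/zh variation seen in the transcriptions no longer needs its own pronunciation variants.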



Vision for the Future

• Acoustic-phonetic variation is difficult
– Still significant cause of errors in ASR
• Underspecified models give a new way of looking at the problem
– Rather than the “change x to y” model
• Challenge for the field:
– Current techniques for accent modeling, intrinsic pronunciation variation separate
– Can we build a model that handles both?




Conclusions

• We have come quite a distance since 1999
– New methods for phonological feature detection
– New methods for feature integration
– New ways of thinking about variation: underspecification
• Still have a long way to go
– Integrating more knowledge sources
• Stress, prosody, word confusability
– Solving the pronunciation adaptation problem in a general way




Fin



An example feature grid

            go        to        washington
            g    ow   t    uw   w    aa   sh   ix   ng   t    ax   n
CLASS:      OBS  VOW  OBS  VOW  SON  VOW  OBS  VOW  SON  OBS  VOW  SON
VOICED:     VCD  VCD  VLS  VCD  VCD  VCD  VLS  VCD  VCD  VLS  VCD  VCD
CMANNER:    SP   -    SP   -    AT   -    FE   -    NL   SP   -    NL
CPLACE:     VR   -    AR   -    LB   -    PL   -    VR   AR   -    AR
VHEIGHT:    -    MD   -    HH   -    LW   -    HH   -    -    MD   -
VFRONTNESS: -    BK   -    BK   -    BK   -    CL   -    -    CL   -
VROUND:     -    RD   -    RD   -    ND   -    ND   -    -    ND   -
VTENSE:     -    TE   -    TE   -    TE   -    LX   -    -    LX   -

