Underspecified feature models for pronunciation variation in ASR


Transcript of Underspecified feature models for pronunciation variation in ASR

Page 1: Underspecified feature models for pronunciation variation in ASR

Underspecified feature models for pronunciation variation in ASR

Eric Fosler-Lussier, The Ohio State University

Speech & Language Technologies Lab

ITRW - Speech Recognition & Intrinsic Variation

20 May 2006

Page 2: Underspecified feature models for pronunciation variation in ASR


Fill in the blanks

• 3, 6, __, 12, 15, __, 21, 24
• A B C __ E F __ H
• You’re going to Toulouse? Drink a bottle of _____ for me!
• What’s the red object?

We’re very good at filling in the blanks when we have context!


Page 3: Underspecified feature models for pronunciation variation in ASR


Filling in the blanks: missing data

• Missing data approaches have been used to integrate over noisy acoustics

[Figure: spectrograms, frequency (50–8000 Hz) vs. time (0.5–3.5 s): (a) clean utterance; (b) mixture (SNR 0 dB); (c) segregated voiced utterance; (d) segregated whole utterance; (e) utterance segregated from IBM]

Wang & Hu 06


Page 4: Underspecified feature models for pronunciation variation in ASR


Decode this!

(brackets indicate options)

s iy n y {ah,ax,axr,er} {l,r} {eh,ih,iy} s er ch {ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d}
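To give a feel for the search problem this poses, here is a minimal sketch (standard library only; the token list covers just the first stretch of the string above) that enumerates the phone sequences a decoder would have to entertain:

```python
# Minimal sketch: expand the bracketed options above into all candidate
# phone strings. Standard library only.
from itertools import product

def expand(tokens):
    """tokens: phones (str) or option lists -> all candidate phone strings."""
    slots = [t if isinstance(t, list) else [t] for t in tokens]
    return [" ".join(seq) for seq in product(*slots)]

tokens = ["s", "iy", "n", "y", ["ah", "ax", "axr", "er"],
          ["l", "r"], ["eh", "ih", "iy"], "s", "er", "ch"]
candidates = expand(tokens)
print(len(candidates))   # 4 * 2 * 3 = 24 candidates for this prefix alone
```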


Page 5: Underspecified feature models for pronunciation variation in ASR


Decode this!

(brackets indicate options)

s iy n y {ah,ax,axr,er} → senior
{l,r} {eh,ih,iy} s er ch → research
{ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d} → associate


Page 6: Underspecified feature models for pronunciation variation in ASR


Decode this!

(brackets indicate options)

s iy n y {ah,ax,axr,er} → senior
{l,r} {eh,ih,iy} s er ch → research
{ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d} → associate

dictionary pronunciation


Page 7: Underspecified feature models for pronunciation variation in ASR


Decode this!

(brackets indicate options)

s iy n y {ah,ax,axr,er} → senior
{l,r} {eh,ih,iy} s er ch → research
{ah,ax} s ow {s,sh,z,zh} {eh,ih,iy} {eh,ey} {t,d} → associate

dictionary pronunciation, as marked by transcribers (Buckeye Corpus of Speech)


Page 8: Underspecified feature models for pronunciation variation in ASR


What do these tasks have in common?

• Recovering from erroneous information?
– Context plays a big role in helping “clean up”


Page 9: Underspecified feature models for pronunciation variation in ASR


What do these tasks have in common?

• Recovering from erroneous information?
– Context plays a big role in helping “clean up”
• Recovering from incomplete information!
– We should be treating pronunciation variation as a missing data problem
• Integrate over “missing” phonological features
– How much information do you need to decode words?
• Particularly taking into account the context of the word, the syllabic context of phones, etc.
• Information theory problem


Page 10: Underspecified feature models for pronunciation variation in ASR


Outline

• Problems with phonetic representations of variation
– Potential advantages of phonological features
• Re-examining the role of phonetic transcription
• Phonological feature approaches to ASR
– Feature attribute detection
– Feature combination methods
– Learning to (dis-)trust features
• A challenge for the future


Page 11: Underspecified feature models for pronunciation variation in ASR


“The Case Against The Phoneme” (homage to Ostendorf, ASRU 99)

• Four major indications that phonetic modeling of variation is not appropriate:


Page 12: Underspecified feature models for pronunciation variation in ASR


“The Case Against The Phoneme” (homage to Ostendorf, ASRU 99)

• Four major indications that phonetic modeling of variation is not appropriate:
– Lack of progress on spontaneous speech WER
• McAllaster et al (98): 50% improvement possible
• Finke & Waibel (97): 6% WER reduction


Page 13: Underspecified feature models for pronunciation variation in ASR


“The Case Against The Phoneme” (homage to Ostendorf, ASRU 99)

• Four major indications that phonetic modeling of variation is not appropriate:
– Lack of progress on spontaneous speech WER
– Independence of decisions in phone-based models
• When pronunciation variation is modeled on a phone-by-phone level, unusual baseforms are often created
• Word-based learning fails to generalize across words


Riley et al 98

Page 14: Underspecified feature models for pronunciation variation in ASR


“The Case Against The Phoneme” (homage to Ostendorf, ASRU 99)

• Four major indications that phonetic modeling of variation is not appropriate:
– Lack of progress on spontaneous speech WER
– Independence of decisions in phone-based models
– Lack of granularity
• Triphone contexts mean a symbolic change in a phone can affect 9 HMM states (min 90 msec)
• Much variation is already handled by triphone context


Jurafsky et al 01

Saraçlar et al 00

Page 15: Underspecified feature models for pronunciation variation in ASR


“The Case Against The Phoneme” (homage to Ostendorf, ASRU 99)

• Four major indications that phonetic modeling of variation is not appropriate:
– Lack of progress on spontaneous speech WER
– Independence of decisions in phone-based models
– Lack of granularity
– Difficulty in transcription
• Phonetic transcription is expensive and time consuming
• Many decisions are difficult for transcribers to make


Page 16: Underspecified feature models for pronunciation variation in ASR


Using phonological features

• Finer granularity
– Some phonological changes don’t result in canonical phones for a language
• English: uw can sometimes be fronted (toot)
• Common enough that TIMIT introduced a special phone (ux)
• A symbol change loses all commonality between the phones (uw -> ux)
– Handling odd phonological effects
• Phone deletions: many “deletions” really leave small traces of coarticulation on neighboring segments
• E.g. vowel nasalization with nasal deletion
• Features may provide a basis for cross-lingual recognition
– International Phonetic Alphabet


Page 17: Underspecified feature models for pronunciation variation in ASR


Issues with phonological features

• Interlingua: “high vowels in English are not the same as high vowels in Japanese”
– Richard Wright, lunch Wednesday, ICASSP 2006
• The concept of “independent directions” is false
– Correlation of feature values
– Distances among feature dimensions are no longer Euclidean
• Dealing with feature spreading
• Even more difficulty in transcription
– (but: Karen Livescu’s group, JHU workshop 2006)
• Articulatory vs. acoustic features
– No two definitions are exactly the same (see Richard’s talk)


Page 18: Underspecified feature models for pronunciation variation in ASR


Phonetic transcription

• There have been a number of efforts to transcribe speech phonetically
– American English:
• TIMIT (4 hr read speech)
• Switchboard (4 hr spontaneous speech)
• Buckeye Corpus (40 hr spontaneous speech): http://buckeyecorpus.osu.edu
• ASR researchers have found it difficult to utilize phonetic transcriptions directly

Riley et al 99

Page 19: Underspecified feature models for pronunciation variation in ASR


ASR & Phonetic Transcription

• Saraclar & Khudanpur (04) examined the means of acoustic models where canonical phone /x/ was transcribed as [y], over all pairs x:y
– Compared the mean of x:y to those of x:x and y:y
– The data showed that x:y means often fell between x:x and y:y, sometimes closer to x:x (toy illustration below)
• Another view: data from the Buckeye Corpus
– /ae/ is sometimes transcribed as [eh]
– Examined 80 vowels from one speaker
• Formant frequencies from the center of the vowel
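A toy numerical sketch of the Saraclar & Khudanpur comparison; the formant values below are invented for illustration, not their data:

```python
# Toy sketch: is the mean of /x/-transcribed-as-[y] tokens closer to the
# x:x mean or the y:y mean? Values are invented (F1, F2 in Hz).
import numpy as np

mean_xx = np.array([650.0, 1700.0])   # /ae/ tokens transcribed as [ae]
mean_yy = np.array([550.0, 1850.0])   # /eh/ tokens transcribed as [eh]
mean_xy = np.array([620.0, 1740.0])   # /ae/ tokens transcribed as [eh]

d_x = np.linalg.norm(mean_xy - mean_xx)   # distance to x:x
d_y = np.linalg.norm(mean_xy - mean_yy)   # distance to y:y
print(f"x:y mean is {d_x:.0f} Hz from x:x, {d_y:.0f} Hz from y:y")
# -> the in-between pattern the slide describes, here closer to x:x
```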


Page 20: Underspecified feature models for pronunciation variation in ASR


Page 21: Underspecified feature models for pronunciation variation in ASR


[Figure: formant measurements of /ae/ tokens transcribed as [eh], with regions annotated “higher than eh”, “opposite side of ae from eh”, “mixed ae/eh”, and “ae territory”]

Page 22: Underspecified feature models for pronunciation variation in ASR


Can you trust transcription?

• Perceptual marking ≠ acoustic measurement
– We can’t take transcription at face value
• What are the transcribers trying to tell us?
– This phone doesn’t sound like a canonical phone
– Perhaps we can look at commonalities across canonical/transcribed phone pairs
• ae:eh -> front vowel (& not high?)
• Phonological features may help us represent transcription differences.


Page 23: Underspecified feature models for pronunciation variation in ASR


Variation in single-phone changes

• Compared canonical vs. transcribed consonants with single-phone substitutions in Switchboard (SWB) and the Buckeye Corpus (BCS)
– Differences in manner, place, and voicing were counted

[Table: percentage of substitutions by which dimensions changed (the slide’s manner/place/voicing check-mark columns did not survive extraction); SWB % / BCS % per row: 42.1/41.5, 7.3/13.8, 39.7/27.1, 8.2/12.5, 1.4/1.5, 0.0/1.1, 0.7/2.1]

– Single-dimension changes are common
– Manner and voicing variants are more common than place variants


Page 24: Underspecified feature models for pronunciation variation in ASR


Recent approaches to feature modeling in ASR

• Since the 90’s there has been increased interest in phonological feature modeling
– Deng et al (92 ff), Kirchhoff (96 ff)
• Current directions of research:
– Approaches for detecting phonological features from data
– Methods of combining phonological features
– Knowing when to ignore information


Page 25: Underspecified feature models for pronunciation variation in ASR


Feature detection methods

• Frame-level decisions
– Most common: artificial neural network methods (sketch below)
• Input: various flavors of spectral/cepstral representations
• Output: estimating the posterior P(feature|acoustics) on a per-frame level
– Recent competitor: support vector machines
• Typically used for binary decision problems
• Segmental-level decisions: integrate over time
– HMM detectors
– Hybrid ANN/Dynamic Bayesian Network detectors
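A minimal sketch of such a frame-level detector, assuming PyTorch; the 39-dimensional input and layer sizes are illustrative choices, not the configuration of any cited system:

```python
# Minimal sketch of a frame-level phonological feature detector.
# Dimensions and architecture are illustrative assumptions.
import torch
import torch.nn as nn

class FrameFeatureDetector(nn.Module):
    """One acoustic frame in, posterior over an n-ary feature out."""
    def __init__(self, n_acoustic=39, n_hidden=256, n_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_acoustic, n_hidden),
            nn.Sigmoid(),                    # classic MLP nonlinearity
            nn.Linear(n_hidden, n_classes),  # e.g. 5 manner classes
        )

    def forward(self, frames):
        # frames: (time, n_acoustic) -> per-frame posteriors P(feature|x_i)
        return torch.softmax(self.net(frames), dim=-1)

detector = FrameFeatureDetector()
frames = torch.randn(100, 39)    # stand-in for 100 MFCC frames
posteriors = detector(frames)    # shape (100, 5); each row sums to 1
```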


Page 26: Underspecified feature models for pronunciation variation in ASR


Binary vs. n-ary features

• Features can be described as binary or n-ary where they contrast
– Binary: /t/ : +stop -fricative …
– N-ary: /t/ : manner=stop
• No real conclusion on which is better (sketch of both below)
– Binary is better matched to SVM learning
– N-ary allows for discrimination among classes
• Should a segment be allowed to be +stop +fricative?
– Anecdotally (in our lab) we find n-ary features slightly better
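A minimal sketch of the two encodings; the tiny feature tables are illustrative assumptions, not a full inventory:

```python
# Minimal sketch contrasting the two encodings for /t/.
# Binary: several yes/no attributes (e.g. one SVM per attribute).
BINARY = {"t": {"stop": True, "fricative": False, "voiced": False}}
# N-ary: one multi-class value per dimension (discriminates within class).
NARY = {"t": {"manner": "stop", "place": "alveolar", "voicing": "voiceless"}}

def nary_to_binary(feats):
    """Expand an n-ary description into binary attribute flags."""
    return {f"{dim}={val}": True for dim, val in feats.items()}

print(nary_to_binary(NARY["t"]))
# {'manner=stop': True, 'place=alveolar': True, 'voicing=voiceless': True}
# Note: a "+stop +fricative" segment is expressible in the binary scheme
# but not in the n-ary one, which forces a single manner value.
```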


Page 27: Underspecified feature models for pronunciation variation in ASR


Hierarchical representations

• Phonological features are not truly independent
– Chang et al (01): place prediction improves if manner is known
• ANN predicts P(place=x|manner=y,X) vs. P(place=x|X)
• Suggests the need for hierarchical detectors
– Rajamanohar & Fosler-Lussier (05): cascading errors make chained decisions worse
• Better to jointly model P(place=x, manner=y|X), or even derive P(place=x|X) from phone probabilities (sketch below)
– Frankel et al (04): the hierarchy can be integrated as additional dependencies in a DBN
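A minimal sketch of that last idea, deriving P(place|X) from phone posteriors by marginalization; the four-phone inventory and place table are illustrative assumptions:

```python
# Minimal sketch: P(place = x | X) = sum of P(phone | X) over phones
# whose place is x. Tiny inventory for illustration only.
import numpy as np

PHONES = ["p", "t", "k", "aa"]
PLACE = {"p": "labial", "t": "alveolar", "k": "velar", "aa": "none"}

def place_posterior(phone_post):
    """Collapse a phone posterior (aligned with PHONES) onto place classes."""
    out = {}
    for prob, phone in zip(phone_post, PHONES):
        out[PLACE[phone]] = out.get(PLACE[phone], 0.0) + float(prob)
    return out

print(place_posterior(np.array([0.1, 0.6, 0.2, 0.1])))
# {'labial': 0.1, 'alveolar': 0.6, 'velar': 0.2, 'none': 0.1}
```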


Page 28: Underspecified feature models for pronunciation variation in ASR


Combining features into higher-level structures

• Once you have (frame-level) estimates of phonological features, they need to be combined
– Temporal integration: Markov structures
– Phonetic spatial integration: combining into higher-level units (phones, syllables, words)
• Differences in methodologies:
– spatial first, then temporal
– joint/factored spatio-temporal integration
– phone-level temporal integration with spatial rescoring


Page 29: Underspecified feature models for pronunciation variation in ASR


Combining features into higher-level structures

• Tandem ANN/HMM systems (Launay et al 02) (sketch below)
– ANN feature posterior estimates are used as replacements for MFCCs in a Gaussian-mixture HMM system
– We find decorrelation of the features (via PCA) necessary to keep the models well conditioned
• Lattice rescoring with landmarks (Hasegawa-Johnson et al 05)
– Maximum entropy models for local word discrimination
– SVMs used as local features for the MaxEnt model
• Dynamic Bayesian models (Livescu 05)
– Model asynchrony as a hidden variable
– SVM outputs used as observations of features
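A minimal sketch of the tandem front end, assuming scikit-learn's PCA; the log-domain step and output dimensionality are common choices, not necessarily those of the cited systems:

```python
# Minimal sketch of a tandem front end: log the ANN posteriors,
# decorrelate with PCA, hand the result to a Gaussian-mixture HMM.
import numpy as np
from sklearn.decomposition import PCA

def tandem_features(posteriors, n_components=24, eps=1e-8):
    """posteriors: (frames, n_features) ANN outputs in (0, 1)."""
    logged = np.log(posteriors + eps)   # Gaussians fit log posteriors better
    pca = PCA(n_components=n_components, whiten=True)
    return pca.fit_transform(logged)    # decorrelated, well-conditioned

fake_post = np.random.dirichlet(np.ones(44), size=1000)  # 1000 fake frames
features = tandem_features(fake_post)   # (1000, 24), HMM-ready features
```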

Page 30: Underspecified feature models for pronunciation variation in ASR


Combining features into higher-level structures

• Conditional random fields
– CRFs jointly model spatio-temporal integration
– Probability is expressed in terms of indicator functions s (state) and t (transition)
• Usually binary in NLP applications
– Frame-level ANN posteriors are bounded
• The probabilities can serve as observation feature functions
– s_stop(/t/, x, i) = P(manner=stop | x_i)


P(y \mid x) \propto \exp\Big( \sum_i \Big[ \sum_j \lambda_j\, s_j(y_i, x, i) + \sum_k \mu_k\, t_k(y_{i-1}, y_i, x, i) \Big] \Big)

Morris & Fosler-Lussier 06

Page 31: Underspecified feature models for pronunciation variation in ASR


Conditional Random Fields

+ CRFs make no independence assumptions about the input
– Posteriors can be used directly without decorrelation
– Can combine features, phones, …
– No assumption of temporal independence
+ The entire label sequence is modeled jointly
– Monophone feature CRF phone recognition is similar to a triphone HMM
+ Learning the parameters (λ, μ) determines the importance of feature/phone relationships
– Implicit model of partial phonological underspecification
– Slow to train

P(y \mid x) \propto \exp\Big( \sum_i \Big[ \sum_j \lambda_j\, s_j(y_i, x, i) + \sum_k \mu_k\, t_k(y_{i-1}, y_i, x, i) \Big] \Big)
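A minimal sketch of the score inside the exponent above, with ANN posteriors as the state functions s_j and simple indicator transitions t_k; dimensions are toy values, not the Morris & Fosler-Lussier implementation:

```python
# Minimal sketch of the unnormalized CRF log-score from the equation above:
# state terms lambda_j * s_j(y_i, x, i) plus transition terms
# mu_k * t_k(y_{i-1}, y_i, x, i). Toy dimensions, illustrative only.
import numpy as np

def crf_score(y, posteriors, lam, mu):
    """y: label ids (len T); posteriors: (T, L) ANN outputs used as s_j;
    lam: (L,) state weights; mu: (L, L) transition weights."""
    score = 0.0
    for i, label in enumerate(y):
        score += lam[label] * posteriors[i, label]   # state feature term
        if i > 0:
            score += mu[y[i - 1], label]             # transition term
    return score   # P(y|x) = exp(score) / Z, with Z summing over all y

T, L = 5, 3
post = np.random.dirichlet(np.ones(L), size=T)
print(crf_score([0, 0, 1, 2, 2], post, np.ones(L), np.zeros((L, L))))
```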


Page 32: Underspecified feature models for pronunciation variation in ASR


Underspecification

• All of these models learn what phonological information is important in higher-level processing
– Ignoring the “canonical” feature definitions for a phone is a form of underspecification
– Traditional underspecification: some features are undefined for a particular phone
– Weighted models: partial underspecification
• When can you ignore phonetic information?
– Crucially, when it doesn’t help you disambiguate between word hypotheses


Page 33: Underspecified feature models for pronunciation variation in ASR


Underspecification

• Example: unstressed syllables tend to show more phonetic variation than stressed syllables
– Experiment: reduce the phonetic representation of unstressed syllables to manner classes (sketch below)
– Allow the recognizer to choose the best representation (phone/manner) during training (WSJ0):
• Minor degradation in clean speech (9.9 vs. 9.1 WER)
• Larger improvement in 10 dB car noise (15.8 vs. 13.0 WER)
• Moral: we don’t need an exact phonetic representation to decode words
– But we may need to integrate more higher-level knowledge

Fosler-Lussier et al 05
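A minimal sketch of the representation change in this experiment, with a hypothetical phone-to-manner table and stress marks, not the actual WSJ0 lexicon:

```python
# Minimal sketch: keep stressed syllables phonetic, collapse unstressed
# syllables to manner classes. Phone->manner table is a toy assumption.
MANNER = {"t": "stop", "d": "stop", "s": "fric", "sh": "fric",
          "n": "nasal", "ng": "nasal", "ax": "vowel", "ix": "vowel"}

def underspecify(syllables):
    """syllables: list of (phones, stressed) pairs -> mixed representation."""
    out = []
    for phones, stressed in syllables:
        if stressed:
            out.extend(phones)                     # full phonetic detail
        else:
            out.extend(MANNER[p] for p in phones)  # manner class only
    return out

# "washington"-like toy: stressed first syllable, reduced last two
print(underspecify([(["w", "aa", "sh"], True),
                    (["ix", "ng"], False),
                    (["t", "ax", "n"], False)]))
# ['w', 'aa', 'sh', 'vowel', 'nasal', 'stop', 'vowel', 'nasal']
```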

Page 34: Underspecified feature models for pronunciation variation in ASR


Vision for the Future

• Acoustic-phonetic variation is difficult
– It is still a significant cause of errors in ASR
• Underspecified models give a new way of looking at the problem
– Rather than the “change x to y” model
• Challenge for the field:
– Current techniques treat accent modeling and intrinsic pronunciation variation separately
– Can we build a model that handles both?


Page 35: Underspecified feature models for pronunciation variation in ASR


Conclusions

• We have come quite a distance since 1999
– New methods for phonological feature detection
– New methods for feature integration
– New ways of thinking about variation: underspecification
• We still have a long way to go
– Integrating more knowledge sources
• Stress, prosody, word confusability
– Solving the pronunciation adaptation problem in a general way


Page 36: Underspecified feature models for pronunciation variation in ASR


Fin

Page 37: Underspecified feature models for pronunciation variation in ASR


An example feature grid

             g    ow   t    uw   w    aa   sh   ix   ng   t    ax   n
CLASS:       OBS  VOW  OBS  VOW  SON  VOW  OBS  VOW  SON  OBS  VOW  SON
VOICED:      VCD  -    VLS  -    VCD  -    VLS  -    VCD  VLS  -    VCD
CMANNER:     SP   -    SP   -    AT   -    FE   -    NL   SP   -    NL
CPLACE:      VR   -    AR   -    LB   -    PL   -    VR   AR   -    AR
VHEIGHT:     -    MD   -    HH   -    LW   -    HH   -    -    MD   -
VFRONTNESS:  -    BK   -    BK   -    BK   -    CL   -    -    CL   -
VROUND:      -    RD   -    RD   -    ND   -    ND   -    -    ND   -
VTENSE:      -    TE   -    TE   -    TE   -    LX   -    -    LX   -

(g ow t uw w aa sh ix ng t ax n = “go to washington”)
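A minimal sketch of generating such a grid from a phone-to-feature lookup; the two-phone table below is an illustrative assumption, not a full inventory:

```python
# Minimal sketch: build a feature grid (rows = dimensions, columns =
# phones) from a lookup table; '-' marks dimensions a phone leaves
# unspecified. Only two phones are filled in here, as an example.
FEATURES = {
    "g":  {"CLASS": "OBS", "VOICED": "VCD", "CMANNER": "SP", "CPLACE": "VR"},
    "ow": {"CLASS": "VOW", "VHEIGHT": "MD", "VFRONTNESS": "BK",
           "VROUND": "RD", "VTENSE": "TE"},
}
DIMS = ["CLASS", "VOICED", "CMANNER", "CPLACE",
        "VHEIGHT", "VFRONTNESS", "VROUND", "VTENSE"]

def feature_grid(phones):
    return {dim: [FEATURES[p].get(dim, "-") for p in phones] for dim in DIMS}

for dim, row in feature_grid(["g", "ow"]).items():
    print(f"{dim + ':':13s}" + "  ".join(f"{v:3s}" for v in row))
```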
