Articulatory Feature-Based Speech Recognition
[Title-slide figure: a dynamic Bayesian network with a word variable, index variables ind1-ind3, underlying feature variables U1-U3, surface feature variables S1-S3, and inter-stream synchronization variables sync1,2 and sync2,3]
JHU WS06 Final team presentation
August 17, 2006
Project Participants

Team members:
Karen Livescu (MIT) Arthur Kantor (UIUC) Özgür Çetin (ICSI) Partha Lal (Edinburgh) Mark Hasegawa-Johnson (UIUC) Lisa Yung (JHU) Simon King (Edinburgh) Ari Bezman (Dartmouth) Nash Borges (DoD, JHU) Stephen Dawson-Haggerty (Harvard) Chris Bartels (UW) Bronwyn Woods (Swarthmore)
Advisors/satellite members:
Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Trevor Darrell (MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel (Edinburgh/ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie (Elizacorp, Emerson), Mathew Magimai (ICSI), Daryush Mehta (MIT), Kate Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT), Amar Subramanya (UW)
Why are we here?

Why articulatory feature-based ASR?
- Improved modeling of co-articulation
- Potential savings in training data
- Compatibility with more recent theories of phonology (autosegmental phonology, articulatory phonology)
- Application to audio-visual and multilingual ASR
- Improved ASR performance with feature-based observation models in some conditions [e.g. Kirchhoff '02, Soltau et al. '02]
- Improved lexical access in experiments with oracle feature transcriptions [Livescu & Glass '04, Livescu '05]
Why now?
- A number of sites are working on complementary aspects of this idea: U. Edinburgh (King et al.), UIUC (Hasegawa-Johnson et al.), MIT (Livescu et al.)
- Recently developed tools (e.g. GMTK) allow systematic exploration of the model space
Definitions: Pronunciation and observation modeling
w = "makes sense..."
q = [ m m m ey1 ey1 ey2 k1 k1 k1 k2 k2 s ... ]
o = (the acoustic observations; shown as a spectrogram on the slide)

P(w): language model
P(q|w): pronunciation model
P(o|q): observation model
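In equation form, these three models combine in the standard decoding rule (stated here for reference; $p(o)$ is constant and drops out of the maximization):

$$\hat{w} \;=\; \arg\max_{w} P(w \mid o) \;=\; \arg\max_{w} \sum_{q} P(w)\,P(q \mid w)\,p(o \mid q)$$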
Feature set for observation modeling
Feature  Values
pl1      LAB, LAB-DEN, DEN, ALV, POST-ALV, VEL, GLO, RHO, LAT, NONE, SIL
dg1      VOW, APP, FLAP, FRIC, CLO, SIL
nas      +, -
glo      ST, IRR, VOI, VL, ASP, A+VO
rd       +, -
vow      aa, ae, ah, ao, aw1, aw2, ax, axr, ay1, ay2, eh, el, em, en, er, ey1, ey2, ih, ix, iy, ow1, ow2, oy1, oy2, uh, uw, ux, N/A
SVitchboard
Data: SVitchboard - Small Vocabulary Switchboard
- SVitchboard [King, Bartels & Bilmes, 2005] is a collection of small-vocabulary tasks extracted from Switchboard 1
- Closed vocabulary: no OOV issues
- Various tasks of increasing vocabulary size: 10, … 500 words
- Pre-defined train/validation/test sets and 5-fold cross-validation scheme
- Utterance fragments extracted from SWB 1, always surrounded by silence
- Word alignments available (MSState)
- Whole word HMM baselines already built
- SVitchboard = SVB
SVitchboard: amount of data
Vocabulary size  Utterances  Word tokens  Duration (total, hours)  Duration (speech, hours)
10               6775        7792         3.2                      0.9
25               9778        13324        4.7                      1.4
50               12442       20914        6.2                      1.9
100              14602       28611        7.5                      2.5
250              18933       51950        10.5                     4.0
500              23670       89420        14.6                     6.4
SVitchboard: example utterances
10 word task:
  oh right oh really so well the

500 word task:
  oh how funny
  oh no i feel like they need a big home a nice place where someone can have the time to play with them and things but i can't give them up
  oh oh i know it's like the end of the world i know i love mine too
SVitchboard: isn’t it too easy (or too hard)?
No (no).
Results on the 500 word task test set using a recent SRI system:
- SVitchboard data was included in the training set for this system
- The SRI system has a 50k vocabulary
- The system was not tuned to SVB in any way

First pass: 42.4% WER
After adaptation: 26.8% WER
SVitchboard: what is the point of a 10 word task?
- Originally designed for debugging purposes
- However, results on the 10 and 500 word tasks obtained in this workshop show good correlation between the WERs on the two tasks:

[Figure: scatter plot of WER (%) on the 500 word task (roughly 50-85%) against WER (%) on the 10 word task (roughly 15-29%), showing the correlation]
SVitchboard: pre-existing baseline word error rates
Whole word HMMs trained on SVitchboard (results from [King, Bartels & Bilmes, 2005]):
- Built with HTK
- Use MFCC observations

Vocabulary  Full validation set  Test set
10 word     20.2                 20.8
500 word    69.8                 70.8
SVitchboard: experimental technique
- We only performed task 1 of SVitchboard (the first of the 5 cross-fold sets); the training set is known as "ABC", the validation set as "D", and the test set as "E"
- SVitchboard defines cross-validation sets, but these were too big for the very large number of experiments we ran
- We mainly used a fixed, randomly chosen 500-utterance subset of "D", which we call the small validation set
- All validation set results reported today are on this set, unless stated otherwise
SVitchboard: experimental technique
- SVitchboard includes word alignments; we found that using these made training significantly faster, and gave improved results in most cases
- Word alignments are only ever used during training
- The results below are for a monophone HMM with PLP observations:

Word alignments?  Validation set  Test set
without           65.1            67.7
with              62.1            65.0
SVitchboard: workshop baseline word error rates
Monophone HMMs trained on SVitchboard, with PLP observations:

Vocabulary  Small validation set  Full validation set  Test set
10 word     16.7                  18.7                 19.6
500 word    62.1                  -                    65.0
SVitchboard: workshop baseline word error rates
Triphone HMMs trained on SVitchboard, with PLP observations, 500 word task only (the GMTK system was trained without word alignments):

System          Small validation set  Validation set  Test set
HTK             -                     56.4            61.2
GMTK / gmtkTie  56.1                  -               59.2
SVitchboard: baseline word error rates summary
Test set word error rates:

Model          10 word  500 word
Whole word     20.8     70.8
Monophone      19.6     65.0
HTK triphone   -        61.2
GMTK triphone  -        59.2
gmtkTie
- General parameter clustering and tying tool for GMTK, written for this workshop
- Currently most developed parts:
  - Decision-tree clustering of Gaussians, using the same technique as HTK
  - Bottom-up agglomerative clustering
- Decision-tree tying was tested in this workshop on various observation models using Gaussians:
  - Conventional triphone models
  - Tandem models, including with factored observation streams
  - Feature-based models
- Can tie based on the values of any variables in the graph, not just the phone state (e.g. feature values)
gmtkTie
gmtkTie is more general than HTK's HHEd:
- HTK asks questions about previous/next phone identity, and clusters states only within the same phone
- gmtkTie can ask user-supplied questions about user-supplied features: no assumptions about states, triphones, or anything else
- gmtkTie clusters user-defined groups of parameters, not just states
- gmtkTie can compute cluster sizes and centroids in many different ways

The GMTK/gmtkTie triphone system built in this workshop is at least as good as the HTK system.
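To make the bottom-up agglomerative mode concrete, here is a minimal Python sketch of agglomeratively merging diagonal Gaussians; the distance measure (symmetric KL divergence) and all names are illustrative assumptions, not gmtkTie's actual code:

```python
import numpy as np

def sym_kl_diag(m1, v1, m2, v2):
    """Symmetric KL divergence between two diagonal Gaussians (mean, var arrays)."""
    kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m1 - m2) ** 2) / v1 - 1.0)
    return kl12 + kl21

def agglomerate(means, variances, weights, n_clusters):
    """Greedily merge the closest pair of Gaussians until n_clusters remain."""
    groups = list(zip(means, variances, weights))
    while len(groups) > n_clusters:
        best, pair = np.inf, None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = sym_kl_diag(groups[i][0], groups[i][1],
                                groups[j][0], groups[j][1])
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        (m1, v1, w1), (m2, v2, w2) = groups[i], groups[j]
        w = w1 + w2
        m = (w1 * m1 + w2 * m2) / w                             # moment-matched mean
        v = (w1 * (v1 + m1**2) + w2 * (v2 + m2**2)) / w - m**2  # and variance
        del groups[j]                                           # j > i, so delete j first
        groups[i] = (m, v, w)
    return groups
```

The merged centroid here is the moment-matched combination of the two Gaussians; as noted above, gmtkTie can compute cluster sizes and centroids in many different ways.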
gmtkTie: conclusions
- It works!
- Triphone performance at least as good as HTK
- Can cluster arbitrary groups of parameters, asking questions about any feature the user can supply
- Later in this presentation, we will see an example of separately clustering the Gaussians for two observation streams
- Opens up new possibilities for clustering; much to explore:
  - Building different decision trees for various factorings of the acoustic observation vector
  - Asking questions about other contextual factors
Hybrid models
Hybrid models: introduction
Motivation:
- We want to use a feature-based representation
- In previous work, we have successfully recovered feature values from continuous speech using neural networks (MLPs)
- MLPs alone are just frame-by-frame classifiers; some "back end" model is needed to decode their output into words

Ways to use such classifiers:
- Hybrid models
- Tandem observations
Hybrid models: introduction
- Conventional HMMs generate observations via a likelihood p(O|state) or p(O|class), using a mixture of Gaussians
- Hybrid models use another classifier (typically an MLP) to obtain the posterior P(class|O)
- Dividing by the prior gives a scaled likelihood, which can be used directly in the HMM: no Gaussians required
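In symbols, with $p(o)$ dropping out because it is constant across classes during decoding:

$$p(o \mid q) \;=\; \frac{P(q \mid o)\,p(o)}{P(q)} \;\propto\; \frac{P(q \mid o)}{P(q)}$$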
Hybrid models: introduction
Advantages of hybrid models include:
- The classifier can easily be trained discriminatively
- Once trained, MLPs compute P(class|O) relatively fast
- MLPs can use a long window of acoustic input frames
- MLPs don't require the input feature distribution to have diagonal covariance (e.g. they can use filterbank outputs from computational auditory scene analysis front-ends)
Hybrid models: standard method

Standard phone-based hybrid:
- Train an MLP to classify phonemes, frame by frame
- Decode the MLP output using simple HMMs for smoothing (transition probabilities are easily derived from phone duration statistics; they don't even need to be trained)

Feature-based hybrid:
- Use ANNs to classify articulatory features instead of phones: 8 MLPs, classifying pl1, dg1, etc., frame by frame
- One of the motivations for using features is that it should be easier to build a multi-lingual / cross-language system this way
Hybrid models: our method
Hybrid models: using feature-classifying MLPs
[Model diagram: a phoneState variable with arcs to articulatory feature variables dg1, pl1, rd, ...; each p(feature | phoneState), e.g. p(dg1 | phoneState), is a non-deterministic CPT (learned); the MLPs provide "virtual evidence" on each feature variable through an observed dummy variable]
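A minimal numpy sketch of how the soft MLP outputs act as virtual evidence on the feature variables (illustrative only; GMTK's virtual-evidence mechanism is configured through its own file formats):

```python
import numpy as np

def frame_state_scores(cpts, mlp_posteriors):
    """Unnormalized per-frame score over phone states.

    cpts[i]: learned CPT P(feature_i | phone state), shape (n_states, n_vals_i).
    mlp_posteriors[i]: MLP output for feature i at this frame, shape (n_vals_i,).
    """
    scores = np.ones(cpts[0].shape[0])
    for cpt, ve in zip(cpts, mlp_posteriors):
        # Marginalize each feature against its virtual evidence; the streams
        # multiply because features are conditionally independent given the state.
        scores *= cpt @ ve
    return scores
```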
Hybrid models: training the MLPs
- We use MLPs to classify speech into AFs, frame by frame
- Targets for training must be obtained; these are derived from phone labels obtained by forced alignment using the SRI recogniser (this is less than ideal, but embedded training might help; results later)
- MLPs were trained by Joe Frankel (Edinburgh/ICSI) & Mathew Magimai (ICSI)
- Standard feedforward MLPs, trained using Quicknet
- Input to the nets is a 9-frame window of PLPs (with VTLN and per-speaker mean and variance normalisation)
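A small numpy sketch of assembling that input window; the 39-dimensional PLP frames and edge padding are assumptions consistent with the 351-unit input layer shown later (9 x 39 = 351):

```python
import numpy as np

def stack_context(plp, left=4, right=4):
    """plp: (n_frames, 39) PLP array. Returns (n_frames, 351) windowed input."""
    padded = np.pad(plp, ((left, right), (0, 0)), mode="edge")
    # Column block i holds the frame at offset (i - left) from the center frame.
    windows = [padded[i : i + plp.shape[0]] for i in range(left + right + 1)]
    return np.concatenate(windows, axis=1)
```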
Hybrid models: training the MLPs
Two versions of the MLPs were initially trained:
- Fisher: trained on all of Fisher but not on any data from Switchboard 1
- SVitchboard: trained only on the training set of SVB

The Fisher nets performed better, so they were used in all hybrid experiments.
Hybrid models: MLP details
MLP architecture is: input units x hidden units x output units

Feature  MLP architecture
glo      351 x 1400 x 4
dg1      351 x 1600 x 6
nas      351 x 1200 x 3
pl1      351 x 1900 x 10
rou      351 x 1200 x 3
vow      351 x 2400 x 23
fro      351 x 1700 x 7
ht       351 x 1800 x 8
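For illustration, one of these nets (the glo MLP, 351 x 1400 x 4) written as a modern PyTorch module; the originals were trained with Quicknet, and the sigmoid hidden layer and softmax output are assumptions in line with standard Quicknet-style MLPs:

```python
import torch
import torch.nn as nn

glo_net = nn.Sequential(
    nn.Linear(351, 1400),   # 9-frame PLP window in
    nn.Sigmoid(),
    nn.Linear(1400, 4),     # one output per glo value
)

x = torch.randn(32, 351)                        # a dummy batch of windowed frames
posteriors = torch.softmax(glo_net(x), dim=-1)  # frame-level feature posteriors
```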
Hybrid models: MLP overall accuracies
- Frame-level accuracies for the MLPs trained on Fisher
- Accuracy computed with respect to the SVB test set; silence frames are excluded from this calculation
- More detailed analysis coming up later…

Feature  Accuracy (%)
glo      85.3
dg1      73.6
nas      92.7
pl1      72.6
rou      84.7
vow      65.6
fro      69.2
ht       68.0
Hybrid models: experiments
- Using MLPs trained on Fisher with the original phone-derived targets vs. using MLPs retrained on SVB data that has been aligned using one of our models
- Hybrid model vs. hybrid model plus PLP observations
Hybrid models: experiments – basic model
- The basic model is trained on activations from the original (Fisher-trained) MLPs
- The only parameters in this DBN are the conditional probability tables (CPTs) describing how each feature depends on the phone state
- Embedded training (see the sketch after the table):
  1. Use the model to realign the SVB data (500 word task)
  2. Starting from the Fisher-trained nets, retrain on these new targets
  3. Retrain the DBN on the new net activations

[Model diagram: phoneState with feature variables dg1, pl1, rd, ...]

Model                      Small validation set  Test set
Hybrid                     26.0                  30.1
Hybrid, embedded training  23.1                  24.3
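The embedded training loop from this slide, as a Python sketch; every callable below is a placeholder for a workshop tool (GMTK alignment/training, Quicknet retraining), not a real API:

```python
def embedded_training(realign, retrain_mlp, retrain_dbn, mlps, data, n_iters=1):
    """realign: produce frame-level AF targets with the current model;
    retrain_mlp: warm-start a Fisher-trained net on those targets;
    retrain_dbn: re-estimate the CPTs on the new net activations."""
    for _ in range(n_iters):
        targets = realign(data)                                   # realign SVB
        mlps = [retrain_mlp(net, data, targets) for net in mlps]  # retrain nets
        retrain_dbn(data, mlps)                                   # new CPTs
    return mlps
```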
Hybrid models: 500 word results
Model                      Small validation set
hybrid                     66.6
hybrid, embedded training  62.6
Hybrid models: adding in PLPs
- To improve accuracy, we combined the "pure" hybrid model with a standard monophone model
- The contribution of the PLPs can (and must) be weighted: we used a global weight on each of the 8 virtual evidences, and a fixed weight of 1.0 on the PLPs
- Weight tuning worked best if done both during training and decoding
- Computationally expensive: many different systems must be trained and cross-validated
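One way to write the resulting per-frame score, assuming (as described above) a single global exponent $\lambda$ shared by the eight virtual-evidence streams and a fixed weight of 1.0 on the PLP stream:

$$\mathrm{score}(q) \;=\; p(o_{\mathrm{PLP}} \mid q)\,\prod_{i=1}^{8} b_i(q)^{\lambda}$$

where $b_i(q)$ is the virtual evidence supplied by the $i$-th feature MLP for state $q$.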
Hybrid models: adding PLPs
[Model diagram: as before, phoneState with feature variables dg1, pl1, rd, ..., each with a learned non-deterministic CPT p(feature | phoneState) and MLP likelihoods supplied as virtual evidence through an observed dummy variable (implemented via virtual evidence in GMTK), plus a PLP observation attached to phoneState]
Hybrid models: weighting virtual evidence vs PLP
[Figure: word error rate (%) as a function of the weight on the virtual evidence (0 to 1.6); WER varies between roughly 16.5% and 19%]
Hybrid models: experiments – basic model + PLP

- The basic model is augmented with PLP observations, generated from mixtures of Gaussians initialised from a conventional monophone model
- A big improvement over the hybrid-only model
- A small improvement over the PLP-only monophone model

[Model diagram: phoneState with feature variables dg1, pl1, rd, ... and a PLP observation]

Model         Small validation set  Test set
Hybrid        26.0                  30.1
PLP only      16.9                  20.0
Hybrid + PLP  16.2                  19.6
Hybrid experiments: conclusions
- Hybrid models perform reasonably well, but not yet as well as conventional models
- But they have fewer parameters to be trained, so they may be a viable approach for small databases: train the MLPs on a large database (e.g. Fisher), then train the hybrid model on the small database; cross-language??
- Embedded training gives good improvements for the "pure" hybrid model
- Hybrid models augmented with PLPs perform better than the baseline PLP-only models, but the improvement is only small
- The best way to use the MLPs trained on Fisher might be to construct tandem observation vectors…
Using MLPs to transfer knowledge from larger databases
Scenario:
- We need to build a system for a domain/accent/language for which we have only a small amount of data
- We have lots of data from other domains/accents/languages

Method:
- Train an MLP on the large database
- Use it in either a hybrid or a tandem system in the target domain
Using MLPs to transfer knowledge from larger databases
- Articulatory features: it is plausible that MLPs trained as AF classifiers could be more accent/language independent than phone classifiers
- The tandem results coming up shortly will show that, across very similar domains (Fisher & SVB), AF nets perform as well as or better than phone nets
Hybrid models vs Tandem observations
Standard hybrid:
- Train an MLP to classify phonemes, frame by frame
- Decode the MLP output using simple HMMs (transition probabilities are easily derived from phone duration statistics; they don't even need to be trained)

Standard tandem:
- Instead of using the MLP output to directly obtain the likelihood, just use it as a feature vector, after some transformations (e.g. taking logs) and dimensionality reduction
- Append the resulting features to standard features, e.g. PLPs or MFCCs
- Use this vector as the observation for a standard HMM with a mixture-of-Gaussians observation model
- Currently used in state-of-the-art systems such as SRI's
but first a look at structural modifications . . .
Adding dependencies
- Keep the set of random variables constant; take an existing model and augment it with edges to improve performance on a particular task
- Choose edges greedily, or using a discriminative metric like the Explaining Away Residue [EAR, Bilmes 1998]:

  EAR(X, Y) = I(X; Y | Q) - I(X; Y)

- One goal is to compare these approaches on our models
- For instance, let X = the word random variable, Y = the degree 1 random variable, and Q = {"right", "another word", "silence"}
- The EAR provides an indication of how valuable it is to model X and Y jointly (a sketch of the computation follows)
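A small numpy sketch of computing EAR from an estimated joint distribution (the table `joint` holding P(X, Y, Q) is an assumed input, e.g. from normalized co-occurrence counts):

```python
import numpy as np

def mutual_info(pxy):
    """I(X; Y) from a joint distribution table P(X, Y)."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask]))

def ear(joint):
    """EAR(X, Y) = I(X; Y | Q) - I(X; Y), with joint = P(X, Y, Q)."""
    i_xy = mutual_info(joint.sum(axis=2))  # marginalize out Q
    i_xy_given_q = 0.0
    for k in range(joint.shape[2]):        # I(X; Y | Q) = sum_q P(q) I(X; Y | Q=q)
        pq = joint[:, :, k].sum()
        if pq > 0:
            i_xy_given_q += pq * mutual_info(joint[:, :, k] / pq)
    return i_xy_given_q - i_xy
```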
Models
[Model diagrams: (left) monophone hybrid: a word variable above phoneState, with feature variables dg1, pl1, rou, ... receiving MLP likelihoods as virtual evidence (implemented via virtual evidence in GMTK); (right) monophone hybrid + PLP: the same model with a PLP observation attached to phoneState]
Which edges?
[Model diagram: the monophone hybrid, with candidate edges from the feature variables to the word variable]

- Learn connections from the classifier outputs to the word
- Intuition: the word will be able to use the classifier output to correct mistakes being made elsewhere in the model
Results (10 word monophone hybrid)
- Baseline monophone hybrid: 26.0% WER on CV, 30.0% on test
- Choosing the edge with the best CV score: ROU (25.1%); test result: 29.7% WER
- Choosing the two single best edges on CV: VOW + ROU; test result: 29.9%
- Choosing the edge with the highest EAR: GLO; test result: 30.1%
- Choosing the highest EAR between MLP features: DG1 ↦ PL1; test result: 31.6% (CV: 26.0%)
- In the monophone + PLP model, the best result is obtained with the original model
In Conclusion
- The EAR measure would not have chosen the best possible edges
- These models may already be optimized
- Once PLPs are added to the model, changing the structure has little effect
Tandem observation models
Introduction
- Tandem is a method for using the predictions of an MLP as observation vectors in generative models, e.g. HMMs
- Extensively used in the ICSI/SRI systems: 10-20% improvement for English, Arabic, and Mandarin
- Most previous work used phone MLPs for deriving tandem features (e.g., Hermansky et al. '00, Morgan et al. '05)
- We explore tandem features based on articulatory MLPs, similar to the approach in Kirchhoff '99
- Questions:
  - Are articulatory tandems better than the phonetic ones?
  - Are factored observation models for the tandem and acoustic (e.g. PLP) observations better than the observation concatenation approach?
Tandem Processing Steps
- MLP posteriors are processed to make them Gaussian-like
- There are 8 articulatory MLPs; their outputs are joined together at the input (64 dims)
- PCA reduces the dimensionality to 26 (95% of the total variance)
- This 26-dimensional vector is used as the acoustic observation in an HMM or some other model
- The tandem features are usually used in combination with a standard feature, e.g. PLP

[Pipeline diagram: MLP outputs → logarithm → principal component analysis → speaker mean/variance normalization → tandem feature]
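A numpy/scikit-learn sketch of these steps with the dimensions from this slide; the flooring constant and the exact placement of the per-speaker normalization are assumptions, not the precise ICSI/SRI recipe:

```python
import numpy as np
from sklearn.decomposition import PCA

def tandem_features(mlp_outputs, n_components=26):
    """mlp_outputs: list of 8 per-frame posterior arrays (n_frames, n_i),
    with the n_i summing to 64. Returns (n_frames, 26) tandem features."""
    joined = np.concatenate(mlp_outputs, axis=1)   # 64 dims
    logged = np.log(joined + 1e-10)                # make Gaussian-like
    reduced = PCA(n_components=n_components).fit_transform(logged)
    # Mean/variance normalization (per speaker in the real recipe; done
    # over the whole array here for brevity).
    return (reduced - reduced.mean(0)) / (reduced.std(0) + 1e-10)
```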
Tandem Observation Models

- Feature concatenation: simply append the tandems to the PLPs
  - All of the standard modeling methods are applicable to this meta observation vector (e.g., MLLR, MMIE, and HLDA)
- Factored models: the tandem and PLP distributions are factored at the HMM state output distributions
  - Potentially more efficient use of free parameters, especially if the streams are conditionally independent
  - Can use, e.g., separate triphone clusters for each observation

[Diagrams: concatenated observations: the state generates the joint (PLP, Tandem) vector; factored observations: the state generates the PLP and Tandem streams through separate distributions, p(X, Y | Q) = p(X | Q) p(Y | Q)]
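A scipy sketch of the factored state output distribution: separate Gaussians for the PLP and tandem streams whose log-likelihoods simply add. Single Gaussians are used for brevity; the actual models use mixtures (and, in the triphone experiments below, separate decision-tree clusters per stream):

```python
from scipy.stats import multivariate_normal

def factored_loglik(x_plp, y_tandem, plp_params, tandem_params):
    """log p(X, Y | Q) = log p(X | Q) + log p(Y | Q).
    Each *_params is a (mean, cov) pair for the given HMM state Q."""
    return (multivariate_normal.logpdf(x_plp, *plp_params)
            + multivariate_normal.logpdf(y_tandem, *tandem_params))
```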
Articulatory vs. Phone Tandems
Model                             Test WER (%)
PLP                               67.7
PLP/Phone Tandem (SVBD)           63.0
PLP/Articulatory Tandem (SVBD)    62.3
PLP/Articulatory Tandem (Fisher)  59.7

- Monophones on the 500 vocabulary task w/o alignments; feature-concatenated PLP/tandem models
- All tandem systems are significantly better than PLP alone
- Articulatory tandems are as good as phone tandems
- Articulatory tandems from MLPs trained on Fisher (1776 hrs) outperform those from MLPs trained on SVB (3 hrs)
Concatenation vs. Factoring
Model                       Task  Test WER (%)
PLP                         10    24.5
PLP / Tandem Concatenation  10    21.1
PLP x Tandem Factoring      10    19.7
PLP                         500   67.7
PLP / Tandem Concatenation  500   59.7
PLP x Tandem Factoring      500   59.1

- Monophone models w/o alignments
- All tandem results are significant over the PLP baseline
- Consistent improvements from factoring; statistically significant on the 500 task
Triphone Experiments
Model                       # of Clusters  Test WER (%)
PLP                         477            59.2
PLP / Tandem Concatenation  880            55.0
PLP x Tandem Factoring      467x641        53.8

- 500 vocabulary task w/o alignments
- PLP x Tandem factoring uses separate decision trees for PLP and Tandem, as well as factored pdf's
- A significant improvement from factoring over the feature concatenation approach
- All pairs of results are statistically significant
Summary
- Tandem features w/ PLPs outperform PLPs alone for both monophones and triphones: 8-13% relative improvements (statistically significant)
- Articulatory tandems are as good as phone tandems
  - Further comparisons w/ phone MLPs trained on Fisher
- Factored models look promising (significant results on the 500 vocabulary task)
  - Further experiments w/ tying, initialization
  - Judiciously selected dependencies between the factored vectors, instead of complete independence
Manual feature transcriptions
- Main transcription guideline: the output should correspond to what we would like our AF classifiers to detect
- Details:
  - 2 transcribers: a phonetician (Lisa Lavoie) and a PhD student in a speech group (Xuemin Chi)
  - 78 SVitchboard utterances, plus 9 utterances from the Switchboard Transcription Project for comparison
  - Multipass transcription using WaveSurfer (KTH): 1st pass phone-feature hybrid, 2nd pass all-feature, 3rd pass discussion and error-correction
- Some basic statistics:
  - Overall speed ~1000 x real time
  - High inter-transcriber agreement (93% avg. agreement, 85% avg. string accuracy)
- First use to date of human-labeled articulatory feature data for classifier/recognizer testing
GMTKtoWavesurfer Debugging/Visualization Tool
Input:
- Per-utterance files containing Viterbi-decoded variables
- List of variables
- Optional map between integer values and labels
- Optional reference transcriptions for comparison

Output:
- Per-utterance, per-feature WaveSurfer (KTH) transcription files
- WaveSurfer configuration for viewing the decoded variables, and optionally comparing to a reference transcription

General debugging/visualization for any GMTK model
Summary
Analysis:
- Improved forced AF alignments obtained using MSState word alignments combined with new AF-based models
- MLP performance analysis shows that retrained classifiers move closer to human alignments, farther from forced phonetic alignments

Data:
- Manual transcriptions
- PLPs and MLP outputs for all of SVitchboard
- New, improved SVitchboard baselines (monophone & triphone)

Tools:
- gmtkTie
- Viterbi path analysis tool
- Site-independent parallel GMTK training and decoding scripts
Acknowledgments

Advisors/satellite members: Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Trevor Darrell (MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel (Edinburgh/ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie (Elizacorp, Emerson), Mathew Magimai (ICSI), Daryush Mehta (MIT), Florian Metze (Deutsche Telekom), Kate Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT), Amar Subramanya (UW)

NSF, DARPA, DoD

Fred Jelinek, Laura Graham, Sanjeev Khudanpur, Jason Eisner, CLSP

Support staff at:
- SSLI Lab, U. Washington
- IFP, U. Illinois, Urbana-Champaign
- CSTR, U. Edinburgh
- ICSI
- SRI