By the Novel Approaches team, With site leaders: Nelson Morgan, ICSI Hynek Hermansky, OGI Dan Ellis,...

By the Novel Approaches team, with site leaders:

Nelson Morgan, ICSI
Hynek Hermansky, OGI
Dan Ellis, Columbia
Kemal Sönmez, SRI
Mari Ostendorf, UW
Hervé Bourlard, IDIAP/EPFL
George Doddington, NA-sayer

“Pushing the Envelope”

A six month report

Overview

Nelson Morgan, ICSI

The Current Cast of Characters

• ICSI: Morgan, Q. Zhu, B. Chen, G. Doddington

• UW: M. Ostendorf, Ö. Çetin

• OGI: H. Hermansky, S. Sivadas, P. Jain

• Columbia: D. Ellis, M. Athineos

• SRI: K. Sönmez

• IDIAP: H. Bourlard, J. Ajmera, V. Tyagi

Rethinking Acoustic Processing for ASR

• Escape dependence on spectral envelope

• Use multiple front-ends across time/freq

• Modify statistical models to accommodate new front-ends

• Design optimal combination schemes for multiple models

Task 1: Pushing the Envelope (aside)

• Problem: Spectral envelope is a fragile information carrier

OLD: 10 ms frame → estimate of sound identity

• Solution: Probabilities from multiple time-frequency patches

PROPOSED: i-th, k-th, ..., n-th estimates from patches of up to 1 s → information fusion → estimate of sound identity

[Figure: OLD and PROPOSED schemes drawn along a time axis]

Task 2: Beyond Frames…

• Problem: Features & models interact; new features may require different models

OLD: short-term features → conventional HMM

• Solution: Advanced features require advanced models, free of fixed-frame-rate paradigm

PROPOSED: advanced features → multi-rate, dynamic-scale classifier

Today’s presentation

• Infrastructure: training, testing, software

• Initial Experiments: pilot studies

• Directions: where we’re headed

Infrastructure

Kemal Sönmez, SRI
(SRI/UW/ICSI effort)

Initial Experimental Paradigm

• Focus on a small task to facilitate exploratory work (later move to CTS)

• Choose a task where LM is fixed & plays a minor role (to focus on acoustics)

• Use mismatched train/test data:
    - To avoid tuning to the task
    - To facilitate later move to CTS

• Task: OGI Numbers / Train: swbd+macrophone

Hub5 “Short” Training Set

• Composition (total ~60 hours):

Corpus         Hours (male)   Hours (female)
callhome       2.8            13.8
switchboard*   5.9            4.3
credit-card    6.7            7.1
macrophone     12.4           5.8

* Subset of SWB-1 hand-checked at SRI for accuracy of transcriptions and segmentations

• WER 2-4% higher vs. full 250+ hour training

Reduced UW Training Set

• A reduced training set to shorten experiment turn-around time

• Choose training utterances with per-frame likelihood scores close to the training set average

• 1/4 of the original training set

• Statistics (gender, data set constituencies) are similar to those of the full training set

• For OGI Numbers, no significant WER sacrifice in the baseline HMM system (worse for Hub 5)

Data set constituencies:

Training set   macrophone   callhome   credit-card   other switchboard   male/female
“short”        32%          32%        12%           24%                 45/55%
Reduced (UW)   38%          28%        12%           22%                 48/52%
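The selection rule on this slide (keep the utterances whose per-frame likelihood is close to the training set average) can be sketched as below. This is only an illustration under stated assumptions, not the UW procedure itself: the function name `select_reduced_set`, the toy scores, and the choice to average per-utterance scores rather than pool all frames are hypothetical.

```python
def select_reduced_set(scores, fraction=0.25):
    """Pick the utterances whose per-frame log-likelihood is closest to the set average.

    scores: dict mapping utterance id -> (total log-likelihood, number of frames).
    Assumption: the "training set average" is the mean of per-utterance scores.
    """
    per_frame = {utt: ll / n for utt, (ll, n) in scores.items()}
    mean = sum(per_frame.values()) / len(per_frame)
    # Rank utterances by distance from the average, closest first
    ranked = sorted(per_frame, key=lambda utt: abs(per_frame[utt] - mean))
    keep = max(1, round(fraction * len(ranked)))
    return ranked[:keep]


# Hypothetical scores, for illustration only
toy_scores = {
    "utt_a": (-620.0, 100),   # -6.2 per frame
    "utt_b": (-1500.0, 200),  # -7.5 per frame
    "utt_c": (-710.0, 100),   # -7.1 per frame
    "utt_d": (-2700.0, 300),  # -9.0 per frame
}
selected = select_reduced_set(toy_scores, fraction=0.25)
print(selected)
```

With these toy numbers the average is about -7.45 per frame, so a 1/4 selection keeps only the utterance nearest to that value.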

Development Test Sets

• A “Core-Subset” of OGI’s Numbers95 corpus: telephone speech of people reciting addresses, telephone numbers, zip codes, or other miscellaneous items

• “Core-Subset” or “CS” consists of utterances that were phonetically hand-transcribed, intelligible, and contained only numbers

• Vocabulary Size: 32 words (digits + eleven, twelve… twenty… hundred… thousand, etc.)

Data Set Name                   Total Utterances   Total Words   Duration (hours)
Numbers95-CS Cross Validation   357                1353          ~0.2
Numbers95-CS Development        1206               4673          ~0.6
Numbers95-CS Test               1227               4757          ~0.6

Statistical Modeling Tools

• HTK (Hidden Markov Model Toolkit) for establishing an HMM baseline, debugging

• GMTK (Graphical Models Toolkit) for implementing advanced models with multiple feature/state streams
    - Allows direct dependencies across streams
    - Not limited by single-rate, single-stream paradigm
    - Rapid model specification/training/testing

• SRI Decipher system for providing lattices to rescore (later in CTS expts)

• Neural network tools from ICSI for posterior probability estimation, other statistical software from IDIAP

Baseline SRI Recognizer for the numbers task

• Bottom-up state-clustered Gaussian mixture HMMs for acoustic modeling

• Acoustic adaptation to speakers using affine mean and variance transforms [Not used for numbers]

• Vocal-tract length normalization using maximum likelihood estimation [Not helpful for numbers]

• Progressive search with lattice recognition and N-best rescoring [To be used in later work]

• Bigram LM

Initial Experiments

Barry Chen, ICSI
Hynek Hermansky, OHSU (OGI)
Özgür Çetin, UW

Goals of Initial Experiments

• Establish performance baselines
    - HMM + standard features (MFCC, PLP)
    - HMM + current best from ICSI/OGI

• Develop infrastructure for new models
    - GMTK for multi-stream & multi-rate features
    - Novel features based on large timespans
    - Novel features based on temporal fine structure

• Provide fodder for future error analysis

ICSI Baseline Experiments

• PLP based - SRI system

• “Tandem” PLP-based ANN + SRI system

• Initial combination approach

Development Baseline: Gender Independent PLP System

Word and sentence error rates on the Numbers95-CS Test Set:

Training Set                                    Word Error Rate   Sentence Error Rate
Full “Short” Hub5 (85k utterances, ~64.9 hrs)   3.4%              10.2%
UW Reduced Hub5 (20k utterances, ~18.8 hrs)     3.8%              11.4%

Phonetically Trained Neural Net

• Multi-Layer Perceptron (input, hidden, and output layer)

• Trained using the error back-propagation technique; outputs interpreted as posterior probabilities of target classes

• Training Targets: 47 mono-phone targets from forced alignment using SRI Eval 2002 system

• Training Utterances: UW Reduced Hub5 Set

• Training Features: PLP12+e+d+dd, mean & variance normalized on per-conversation side basis

• MLP Topology:
    - 9 Frame Context Window (4 frames in past + current frame + 4 frames in future)
    - 351 Input Units, 1500 Hidden Units, and 47 Output Units
    - Total Number of Parameters: ~600k
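The topology figures can be checked with a few lines of arithmetic. The sketch below assumes that "PLP12+e" means 12 cepstra plus energy (13 base features, tripled by the deltas and double deltas) and that each hidden and output unit carries one bias; neither detail is stated on the slide.

```python
# Assumed feature layout: 12 PLP cepstra + energy = 13, plus deltas and double deltas
base_features = 12 + 1
features_per_frame = base_features * 3          # 39
context_frames = 4 + 1 + 4                      # 9-frame window

n_in = context_frames * features_per_frame      # input units
n_hidden, n_out = 1500, 47

weights = n_in * n_hidden + n_hidden * n_out    # two fully connected layers
biases = n_hidden + n_out                       # assumption: one bias per unit
total = weights + biases

print(n_in, weights, total)
```

Under these assumptions the input size comes out at 351 and the parameter total just under 600k, consistent with the numbers quoted above.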

Baseline ICSI Tandem

• Outputs of Neural Net before final softmax non-linearity used as inputs to PCA

• PCA without dimensionality reduction

• 4.1% Word and 11.7% Sentence Error Rate on Numbers95-CS test set

Baseline ICSI Tandem+PLP

• PLP stream concatenated with neural net posteriors stream

• PCA reduces dimensionality of posteriors stream to 16 (keeping 95% of overall variance)

• 3.3% Word and 9.5% Sentence Error Rate on Numbers95-CS test set
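A rough sketch of the Tandem+PLP feature construction follows: PCA is fitted on the network outputs, the fewest components reaching 95% of the variance are kept, and the result is concatenated with the PLP stream. The helper names `pca_fit` and `tandem_plus_plp` and the random toy data are assumptions for illustration; on the toy data the number of retained components will not be 16.

```python
import numpy as np


def pca_fit(x, variance_to_keep=0.95):
    """Return (mean, projection) keeping the fewest components that reach the target."""
    mean = x.mean(axis=0)
    cov = np.cov(x - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                 # re-sort to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, variance_to_keep) + 1)
    return mean, eigvecs[:, :k]


def tandem_plus_plp(plp, net_outputs, mean, projection):
    """Project the network outputs with PCA and append them to the PLP frames."""
    reduced = (net_outputs - mean) @ projection
    return np.hstack([plp, reduced])


rng = np.random.default_rng(0)
n_frames = 500
plp = rng.normal(size=(n_frames, 39))                 # toy stand-in for PLP features
# Toy stand-in for 47 network outputs: 5 strong latent directions plus small noise
latent = rng.normal(size=(n_frames, 5))
net_outputs = latent @ rng.normal(size=(5, 47)) + 0.01 * rng.normal(size=(n_frames, 47))

mean, projection = pca_fit(net_outputs)
features = tandem_plus_plp(plp, net_outputs, mean, projection)
print(projection.shape, features.shape)
```

Setting `variance_to_keep=1.0` keeps every component, which corresponds to PCA used only for decorrelation, without dimensionality reduction.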

Word and String Error Rates on Numbers95-CS Test Set

OGI Experiments: New Features in EARS

• Develop on home-grown ASR system (phoneme-based HTK)

• Pass the most promising to ICSI for running in SRI LVCSR system

• So far new features match the performance of the baseline PLP features but do not exceed it
    - advantage seen in combination with the baseline

Looking to the human auditory system for design inspiration

• Psychophysics
    - Components within certain frequency range (several critical bands) interact [e.g. frequency masking]
    - Components within certain time span (a few hundreds of ms) interact [e.g. temporal masking]

• Physiology
    - 2-D (time-frequency) matched filters for activity in auditory cortex [cortical receptive fields]

TRAP-based HMM-NN hybrid ASR

[Figure: block diagram. Mean & variance normalized, Hamming windowed critical band trajectory (101 point input) → Multilayer Perceptrons (MLP) → posterior probabilities of phonemes → search for the best match]
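The TRAP input named in the diagram (a 101-point critical-band trajectory, mean and variance normalized, then Hamming windowed) can be sketched as below. The function name `trap_vector`, the per-segment normalization, and the 10 ms frame step (so that 101 points span roughly 1 s) are assumptions for illustration, not details confirmed by the slide.

```python
import numpy as np


def trap_vector(band_energy, center, half_span=50):
    """Build a (2 * half_span + 1)-point temporal pattern for one critical band."""
    start, stop = center - half_span, center + half_span + 1
    if start < 0 or stop > len(band_energy):
        raise ValueError("not enough context around the centre frame")
    segment = np.asarray(band_energy[start:stop], dtype=float)
    # Assumption: normalise over the segment itself (zero mean, unit variance)
    segment = (segment - segment.mean()) / (segment.std() + 1e-12)
    # Taper the trajectory so that frames far from the centre contribute less
    return segment * np.hamming(segment.size)


rng = np.random.default_rng(0)
log_energy = rng.normal(size=300)   # toy log energies for one band, one value per frame
trap = trap_vector(log_energy, center=150)
print(trap.shape)
```

One such vector per critical band would then be passed to that band's MLP.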

Feature estimation from linearly transformed temporal patterns

[Figure: block diagram. Transform and MLP stages feeding a TANDEM HMM ASR system; labeled “? ? ?”]

Preliminary TANDEM/TRAP results (OGI-HTK)

WER % on OGI numbers, training on UW reduced training set, monophone models:

System             WER %
BASELINE           4.5
TANDEM             4.1
TANDEM with TRAP   3.9

Features from more than one critical-band temporal trajectory

Studying KLT-derived basis functions, we observe:

[Figure labels: average, frequency derivative, cosine transform]

UW Baseline Experiments

• Constructed an HTK-based HMM system that is competitive with the SRI system

• Replicated the HMM system in GMTK

• Move on to models which integrate information from multiple sources in a principled manner:
    - Multiple feature streams (multi-stream models)
    - Different time scales (multi-rate models)

• Focus on statistical models, not on feature extraction

HTK HMM Baseline

• An HTK-based standard HMM system:
    - 3-state triphones with decision-tree clustering
    - Mixture of diagonal Gaussians as state output distributions
    - No adaptation, fixed LM

• Dimensions explored:
    - Front-end: PLP vs. MFCC, VTLN
    - Gender dependent vs. independent modeling

• Conclusions:
    - No significant performance differences
    - Decided on PLPs, no VTLN, gender-independent models for simplicity

HMM Baselines (cont.)

• Replicated HTK baseline with equivalent results in GMTK:

Tool   WER % (dev)   WER % (test)
HTK    3.7           3.2
GMTK   3.7           3.0

• To reduce experiment turn-around time, wanted to reduce the training set

• For HMMs and Numbers95, 3/4 of the training data can be safely ignored:

Training set      WER % (dev)   WER % (test)
Full “short”      3.7           3.2
1/4 (“reduced”)   3.4           3.4

Multi-stream Models

• Information fusion from multiple streams of features

• Partially asynchronous state sequences

[Figure: STATE TOPOLOGY (states of stream X vs. states of stream Y) and GRAPHICAL MODEL (state seq. of stream X over feature stream X; state seq. of stream Y over feature stream Y)]

Model                     WER % (dev)   WER % (test)
HMM (PLP)                 3.9           4.2
multi-stream (PLP+MFCC)

Temporal envelope features (Columbia)

• Temporal fine structure is lost (deliberately) in STFT features:

[Figure: waveform and spectrogram of mpgr1-sx419, 0.65 to 0.9 s (time / sec), with 10 ms windows marked]

• Need a compact, parametric description...

Frequency-Domain Linear Prediction (FDLP)

• Extend LPC with LP model of spectrum:

TD-LP: $y[n] = \sum_i a_i \, y[n-i]$

(DFT)

FD-LP: $Y[k] = \sum_i b_i \, Y[k-i]$

• ‘Poles’ represent temporal peaks:

[Figure: mpgr1-sx419, 0.65 to 0.9 s: TDLPC env (60 poles / 300 ms)]

• Features ~ pole bandwidth, ‘frequency’
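The idea that poles of a frequency-domain LP model mark temporal peaks can be illustrated with a small sketch. It is not the Columbia implementation: it uses a DCT in place of the DFT named on the slide (an assumption, chosen so the transformed sequence stays real), the autocorrelation method with a plain linear solve, and a hypothetical function name `fdlp_envelope`. In the code the predictor coefficients follow the convention c[k] ≈ -Σ a[i] c[k-i], so the slide's b_i corresponds to -a[i].

```python
import numpy as np


def fdlp_envelope(x, order):
    """Approximate the temporal envelope of x with an all-pole model fitted to its DCT."""
    x = np.asarray(x, dtype=float)
    n = x.size
    t = np.arange(n)
    # DCT-II written out explicitly: c[k] = sum_t x[t] * cos(pi * (2t + 1) * k / (2n))
    c = np.cos(np.pi * np.outer(np.arange(n), 2 * t + 1) / (2 * n)) @ x
    # Autocorrelation of the DCT sequence for lags 0..order
    r = np.array([c[: n - lag] @ c[lag:] for lag in range(order + 1)])
    r[0] *= 1.0 + 1e-9                                   # tiny diagonal loading
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], -r[1:])                  # Yule-Walker equations
    a_full = np.concatenate(([1.0], a))
    gain = r @ a_full                                    # prediction error energy
    # Evaluate 1 / |A|^2 at the angles that correspond to the sample times
    omega = np.pi * (2 * t + 1) / (2 * n)
    A = np.exp(-1j * np.outer(omega, np.arange(order + 1))) @ a_full
    return gain / np.abs(A) ** 2


signal = np.zeros(400)
signal[120] = 1.0                                        # a single click at sample 120
env = fdlp_envelope(signal, order=2)
peak = int(np.argmax(env))
print(peak)
```

A click in time becomes a cosine in the DCT domain, so a two-pole model places its pole pair at the angle that maps back to the click position; the envelope peak therefore falls at or next to sample 120.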

Preliminary FDLP Results

• Distribution of pole magnitudes for different phone classes (in 4 bands):

[Figure: histograms for /ah/ and /p/ in four bands (0-500 Hz, 500-1000 Hz, 1-2 kHz, 2-4 kHz); horizontal axis: -log(1 - pole magnitude)]

• NN Classifier Frame Accuracies:

Features       Frame accuracy
plp12N         57.0%
plp12N+FDLP4   58.2%

Directions

Dan Ellis, Columbia
(SRI/UW/Columbia work)

Nelson Morgan, ICSI
(OGI/IDIAP/ICSI work + summary)

Multi-rate Models (UW)

[Figure: graphical model. A coarse state chain over long-term features and a fine state chain over short-term features, linked by cross-scale dependencies (example)]

• Integrate acoustic information from different time scales

• Account for dependencies across scales

• Better robustness against time- and/or frequency-localized interferences

• Reduced redundancy gives better confidence estimates

SRI Directions

• Task 1: Signal-adaptive weighting of time-frequency patches
    - Basis-entropy based representation
    - Matching pursuit search for optimal weighting of patches
    - Optimality based on minimum entropy criterion

• Task 2: Graphical models of patch combinations
    - Tiling-driven dependency modeling
    - GM combines across patch selections
    - Optimality based on information in representation

Data-derived phonetic features (Columbia)

• Find a set of independent attributes to account for phonetic (lexical) distinctions
    - phones replaced by feature streams

• Will require new pronunciation models
    - asynchronous feature transitions (no phones)
    - mapping from phonetics (for unseen words)

Joint work with Eric Fosler-Lussier

ICA for feature bases

• PCA finds decorrelated bases; ICA finds independent bases

• Lexically-sufficient ICA basis set?

[Figure: test/dr1/faks0/sa2. Basis vectors (frequency / Bark) and their activations over time, with phone labels: d ow n ae s m iy t ix k eh r iy ix n oy l iy r ae g l ay k dh ae tcl]

OGI Directions: Targets in sub-bands

• Initially context-independent and band-specific phonemes

• Gradually shifted to 6 band-specific broad phonetic classes (stops, fricatives, nasals, vowels, silence, flaps)

• Moving towards band-independent speech classes (vocalic-like, fricative-like, plosive-like, ???)

More than one temporal pattern?

[Figure: block diagram. Mean & variance normalized, Hamming windowed critical band trajectory (101 dim) → KLT1 … KLTn → MLP → MLP]

Pre-processing by 2-D operators with subsequent TRAP-TANDEM

Operators on the time-frequency plane (axes: frequency, time):

differentiate f, average t:
 1  2  1
 0  0  0
-1 -2 -1

differentiate t, average f:
-1  0  1
-2  0  2
-1  0  1

diff upwards, av downwards:
 0  1  2
-1  0  1
-2 -1  0

diff downwards, av upwards:
-2 -1  0
-1  0  1
 0  1  2
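As an illustration of how such 3x3 operators act on a time-frequency plane, the sketch below slides each kernel over a toy array. The helper `apply_operator`, the orientation (rows as frequency, columns as time), and the pairing of the two diagonal labels with their matrices (taken from the order in which they appear) are assumptions.

```python
import numpy as np

# Rows index frequency, columns index time (assumed orientation)
OPERATORS = {
    "differentiate f, average t": np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]]),
    "differentiate t, average f": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]),
    "diff upwards, av downwards": np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]]),
    "diff downwards, av upwards": np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]]),
}


def apply_operator(spec, kernel):
    """Slide a 3x3 kernel over a 2-D array (correlation, valid region only)."""
    rows, cols = spec.shape
    out = np.zeros((rows - 2, cols - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * spec[i:i + rows - 2, j:j + cols - 2]
    return out


# Toy plane with 8 bands and 10 frames whose value grows linearly with time only
spec = np.tile(np.arange(10.0), (8, 1))
d_time = apply_operator(spec, OPERATORS["differentiate t, average f"])
d_freq = apply_operator(spec, OPERATORS["differentiate f, average t"])
print(d_time[0, 0], d_freq[0, 0])
```

On this toy input the time-differentiating operator responds with a constant value while the frequency-differentiating one gives zero, which matches the labels of the first two operators.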

IDIAP Directions: Phase AutoCorrelation Features

Traditional features: autocorrelation based. Very sensitive to additive noise, other variations.

Phase AutoCorrelation (PAC): if $R[k],\ k = 0, 1, \ldots, N-1$ represents autocorrelation coeffs derived from a frame of length $N$, the PACs are:

$$P[k] = \cos^{-1}\left(\frac{R[k]}{R[0]}\right), \qquad R[0] = \text{frame energy}$$
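A minimal sketch of the PAC computation, taking the arccosine of each autocorrelation coefficient normalized by the frame energy. The function name `phase_autocorrelation` and the use of a plain (non-circular) autocorrelation are assumptions; the slide does not say which autocorrelation estimate is used.

```python
import math


def phase_autocorrelation(frame):
    """P[k] = arccos(R[k] / R[0]) for k = 0..N-1, with R the frame autocorrelation."""
    n = len(frame)
    # Assumption: linear autocorrelation, R[k] = sum_t frame[t] * frame[t + k]
    r = [sum(frame[t] * frame[t + k] for t in range(n - k)) for k in range(n)]
    if r[0] == 0.0:
        raise ValueError("frame has zero energy")
    # Clamp the ratio to [-1, 1] to guard against rounding before taking arccos
    return [math.acos(max(-1.0, min(1.0, rk / r[0]))) for rk in r]


frame = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25]
pac = phase_autocorrelation(frame)
pac_scaled = phase_autocorrelation([3.0 * s for s in frame])
print([round(p, 3) for p in pac])
```

Because every coefficient is divided by the frame energy, rescaling the frame leaves the PAC values unchanged, which the `pac_scaled` call illustrates.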

Entropy Based Multi-Stream Combination

• Combination of evidence from more than one expert to improve performance

• Entropy as a measure of confidence

• Experts having low entropy are more reliable than experts having high entropy

• Inverse entropy weighting criterion

• Relationship between entropy of the resulting (recombined) classifier and recognition rate
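The inverse entropy weighting criterion can be sketched for a single frame as below. The function names, the entropy floor, and the choice of a weighted sum of posteriors (rather than, say, a weighted product) are assumptions for illustration.

```python
import math


def entropy(posteriors):
    """Shannon entropy (natural log) of one posterior vector."""
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)


def inverse_entropy_combine(streams, floor=1e-6):
    """Weight each expert's posterior vector by 1 / entropy, then renormalise the weights."""
    inverse = [1.0 / max(entropy(p), floor) for p in streams]
    total = sum(inverse)
    weights = [w / total for w in inverse]
    n_classes = len(streams[0])
    # Assumption: combine by a weighted sum of the experts' posteriors
    combined = [sum(w * p[c] for w, p in zip(weights, streams)) for c in range(n_classes)]
    return weights, combined


confident = [0.90, 0.05, 0.05]      # low-entropy expert
uncertain = [0.40, 0.35, 0.25]      # high-entropy expert
weights, combined = inverse_entropy_combine([confident, uncertain])
print([round(w, 3) for w in weights], [round(p, 3) for p in combined])
```

The low-entropy expert receives the larger weight, so the combined posterior stays close to its estimate.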

ICSI Directions: Posterior Combination Framework

• Combination of Several Discriminative Probability Streams

Improvement of the Combo Infrastructure

• Improve basic features:

Add prosodic features: voicing level, energy continuity,

Improve PLP by further removing the pitch difference among speakers.

• Tandem

Different targets, different training features. E.g.: word boundary.

• Improve TRAP (OGI)

• Combination

Entropy based, accuracy based stream weighting or stream selection.

New types of tandem features

[Figure: Input feature → NN processing → Target posterior; output label: Possible word/syllable boundary]

Input feature:
• Traditional or improved PLP
• Spectral continuity
• Voicing, voicing continuity
• Formant continuity feature
• …more

Target posterior:
• Phonemes
• Word/syllable boundary
• Broad phoneme classes
• Manner / place / articulation… etc

Data Driven Subword Unit Generation (IDIAP/ICSI)

• Motivation: Phoneme-based units may not be optimal for ASR.

• Approach (based on speaker segmentation method):
    1. Initial segmentation: large number of clusters
    2. Is thresholdless BIC-like merging criterion met?
        - Yes: merge, re-segment, and re-estimate, then test again
        - No: stop

Summary

• Staff and tools in place to proceed with core experiments

• Pilot experiments provided coherent substrate for cooperation between 6 sites

• Future directions for individual sites are all over the map, which is what we want

• Possible exploration of collaborations w/MS in this meeting
