By the Novel Approaches team, with site leaders:
Nelson Morgan, ICSI
Hynek Hermansky, OGI
Dan Ellis, Columbia
Kemal Sönmez, SRI
Mari Ostendorf, UW
Hervé Bourlard, IDIAP/EPFL
George Doddington, NA-sayer
“Pushing the Envelope”
A six month report
Overview
Nelson Morgan, ICSI
The Current Cast of Characters
• ICSI: Morgan, Q. Zhu, B. Chen, G. Doddington
• UW: M. Ostendorf, Ö. Çetin
• OGI: H. Hermansky, S. Sivadas, P. Jain
• Columbia: D. Ellis, M. Athineos
• SRI: K. Sönmez
• IDIAP: H. Bourlard, J. Ajmera, V. Tyagi
Rethinking Acoustic Processing for ASR
• Escape dependence on spectral envelope
• Use multiple front-ends across time/freq
• Modify statistical models to accommodate new front-ends
• Design optimal combination schemes for multiple models
Task 1: Pushing the Envelope (aside)
• Problem: Spectral envelope is a fragile information carrier
  OLD: 10 ms frame → estimate of sound identity
• Solution: Probabilities from multiple time-frequency patches
  PROPOSED: ith, kth, ..., nth estimates (patches of up to 1 s) → information fusion → estimate of sound identity
Task 2: Beyond Frames…
• Problem: Features & models interact; new features may require different models
  OLD: short-term features → conventional HMM
• Solution: Advanced features require advanced models, free of fixed-frame-rate paradigm
  PROPOSED: advanced features → multi-rate, dynamic-scale classifier
Today’s presentation
• Infrastructure: training, testing, software
• Initial Experiments: pilot studies
• Directions: where we’re headed
Infrastructure
Kemal Sönmez, SRI (SRI/UW/ICSI effort)
Initial Experimental Paradigm
• Focus on a small task to facilitate exploratory work (later move to CTS)
• Choose a task where LM is fixed & plays a minor role (to focus on acoustics)
• Use mismatched train/test data:
  To avoid tuning to the task
  To facilitate later move to CTS
• Task: OGI numbers/ Train: swbd+macrophone
Hub5 “Short” Training Set
• Composition (total ~ 60 hours)
  Corpus          Male (hours)   Female (hours)
  callhome         2.8           13.8
  switchboard*     5.9            4.3
  credit-card      6.7            7.1
  macrophone      12.4            5.8
  * subset of SWB-1 hand-checked at SRI for accuracy of transcriptions and segmentations
• WER 2-4% higher vs. full 250+ hour training
Reduced UW Training Set
• A reduced training set to shorten expt. turn-around time
• Choose training utterances with per-frame likelihood scores close to the training set average
• 1/4th of the original training set
• Statistics (gender, data set constituencies) are similar to those of the full training set.
• For OGI Numbers, no significant WER sacrifice in the baseline HMM system (worse for Hub 5).
Data set constituencies and male/female split:
  Training set    macrophone   callhome   credit-card   other switchboard   male/female
  “short”         32%          32%        12%           24%                 45/55%
  Reduced (UW)    38%          28%        12%           22%                 48/52%
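The selection rule on this slide (keep utterances whose per-frame likelihood score is close to the training-set average) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the UW procedure: the function name `select_reduced_set`, the toy scores, and the choice to rank by absolute distance from the mean and keep the closest quarter are all hypothetical.

```python
import numpy as np

def select_reduced_set(per_frame_loglik, fraction=0.25):
    """Return sorted indices of the utterances whose per-frame log-likelihood
    lies closest to the set average (hypothetical selection rule)."""
    scores = np.asarray(per_frame_loglik, dtype=float)
    distance = np.abs(scores - scores.mean())
    n_keep = max(1, int(round(fraction * len(scores))))
    # Stable sort so that ties are resolved in corpus order.
    keep = np.argsort(distance, kind="stable")[:n_keep]
    return np.sort(keep)

# Made-up per-frame log-likelihoods for 8 utterances.
toy_scores = [-62.0, -58.5, -60.1, -71.3, -59.8, -60.4, -55.0, -64.9]
reduced = select_reduced_set(toy_scores)
```

With these toy scores the average is -61.5, so the two utterances nearest to it are kept. Checking that gender and corpus proportions stay similar, as the slide reports, would be a separate step.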
Development Test Sets
• A “Core-Subset” of OGI’s Numbers 95 corpus: telephone speech of people reciting addresses, telephone numbers, zip codes, or other miscellaneous items
• “Core-Subset” or “CS” consists of utterances that were phonetically hand-transcribed, intelligible, and contained only numbers
• Vocabulary Size: 32 words (digits + eleven, twelve… twenty… hundred…thousand, etc.)
  Data Set Name                   Total Utterances   Total Words   Duration (hours)
  Numbers95-CS Cross Validation   357                1353          ~0.2
  Numbers95-CS Development        1206               4673          ~0.6
  Numbers95-CS Test               1227               4757          ~0.6
Statistical Modeling Tools
• HTK (Hidden Markov Toolkit) for establishing an HMM baseline, debugging
• GMTK (Graphical Models Toolkit) for implementing advanced models with multiple feature/state streams
  Allows direct dependencies across streams
  Not limited by single-rate, single-stream paradigm
  Rapid model specification/training/testing
• SRI Decipher system for providing lattices to rescore (later in CTS expts)
• Neural network tools from ICSI for posterior probability estimation, other statistical software from IDIAP
Baseline SRI Recognizer for the numbers task
• Bottom-up state-clustered Gaussian mixture HMMs for acoustic modeling
• Acoustic adaptation to speakers using affine mean and variance transforms [Not used for numbers]
• Vocal-tract length normalization using maximum likelihood estimation [Not helpful for numbers]
• Progressive search with lattice recognition and N-best rescoring [To be used in later work]
• Bigram LM
Initial Experiments
Barry Chen, ICSI
Hynek Hermansky, OHSU (OGI)
Özgür Çetin, UW
Goals of Initial Experiments
• Establish performance baselines
  HMM + standard features (MFCC, PLP)
  HMM + current best from ICSI/OGI
• Develop infrastructure for new models
  GMTK for multi-stream & multi-rate features
  Novel features based on large timespans
  Novel features based on temporal fine structure
• Provide fodder for future error analysis
ICSI Baseline experiments
• PLP based - SRI system
• “Tandem” PLP-based ANN + SRI system
• Initial combination approach
Development Baseline: Gender Independent PLP System
  Training Set                                    Word, Sentence Error Rate on Numbers95-CS Test Set
  Full “Short” Hub5 (85k utterances, ~64.9 hrs)   3.4%, 10.2%
  UW Reduced Hub5 (20k utterances, ~18.8 hrs)     3.8%, 11.4%
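For readers unfamiliar with the two figures in each cell, word error rate and sentence error rate are conventionally computed as sketched below. This is a generic illustration with made-up transcripts, not the scoring tool used in these experiments; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions
    needed to turn the word list `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def error_rates(pairs):
    """Return (word error rate, sentence error rate) for (reference, hypothesis) pairs."""
    word_errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    ref_words = sum(len(r.split()) for r, _ in pairs)
    sentence_errors = sum(r.split() != h.split() for r, h in pairs)
    return word_errors / ref_words, sentence_errors / len(pairs)

# Made-up reference/hypothesis pairs.
toy_pairs = [("one two three", "one two three"),
             ("four five", "four nine"),
             ("six seven eight", "six eight")]
wer, ser = error_rates(toy_pairs)
```

Here two of the eight reference words are wrong (one substitution, one deletion) and two of the three sentences contain an error.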
Phonetically Trained Neural Net
• Multi-Layer Perceptron (input, hidden, and output layer)
• Trained Using Error-Backpropagation Technique: outputs interpreted as posterior probabilities of target classes
• Training Targets: 47 mono-phone targets from forced alignment using SRI Eval 2002 system
• Training Utterances: UW Reduced Hub5 Set
• Training Features: PLP12+e+d+dd, mean & variance normalized on per-conversation side basis
• MLP Topology:
  9 Frame Context Window (4 frames in past + current frame + 4 frames in future)
  351 Input Units, 1500 Hidden Units, and 47 Output Units
  Total Number of Parameters: ~600k
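The topology figures can be checked with a little arithmetic. The sketch below assumes that "PLP12+e+d+dd" means 13 static values (12 PLP coefficients plus energy) with deltas and double deltas, i.e. 39 features per frame, and that each layer has one bias per unit; both are assumptions, not statements from the slide.

```python
context_frames = 9          # 4 past + current + 4 future
features_per_frame = 39     # assumed: (12 PLP + energy) x (static, delta, double delta)
n_in = context_frames * features_per_frame
n_hidden, n_out = 1500, 47

weights = n_in * n_hidden + n_hidden * n_out
biases = n_hidden + n_out   # assumed: one bias per hidden and output unit
total_parameters = weights + biases
```

Under these assumptions the input size comes out at 351 units and the total at just under 600k parameters, consistent with the slide.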
Baseline ICSI Tandem
• Outputs of Neural Net before final softmax non-linearity used as inputs to PCA
• PCA without dimensionality reduction
• 4.1% Word and 11.7% Sentence Error Rate on Numbers95-CS test set
Baseline ICSI Tandem+PLP
• PLP Stream concatenated with neural net posteriors stream
• PCA reduces dimensionality of posteriors stream to 16 (keeping 95% of overall variance)
• 3.3% Word and 9.5% Sentence Error Rate on Numbers95-CS test set
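The tandem processing on these two slides (PCA on the pre-softmax net outputs, optionally truncated to the components that keep 95% of the variance, then concatenation with PLP) might look roughly like this. It is a sketch on random stand-in data: the helper `fit_pca`, the array names, and the synthetic inputs are hypothetical, and the real system's PCA estimation details are not given in the slides.

```python
import numpy as np

def fit_pca(x, variance_to_keep=None):
    """Return (mean, projection). Keeps all components when
    variance_to_keep is None, else the smallest leading set reaching it."""
    mean = x.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(x - mean, rowvar=False))
    order = np.argsort(eigval)[::-1]            # largest variance first
    eigval, eigvec = eigval[order], eigvec[:, order]
    if variance_to_keep is None:
        n_keep = x.shape[1]
    else:
        cumulative = np.cumsum(eigval) / eigval.sum()
        n_keep = min(int(np.searchsorted(cumulative, variance_to_keep)) + 1, x.shape[1])
    return mean, eigvec[:, :n_keep]

rng = np.random.default_rng(0)
# Stand-ins for real data: 500 frames of 47 pre-softmax outputs and 39 PLP features.
pre_softmax = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 47)) + 0.01 * rng.normal(size=(500, 47))
plp = rng.normal(size=(500, 39))

# Tandem alone: decorrelate, no dimensionality reduction.
mean_full, proj_full = fit_pca(pre_softmax)
tandem = (pre_softmax - mean_full) @ proj_full

# Tandem+PLP: keep 95% of the variance, then append to the PLP stream.
mean_red, proj_red = fit_pca(pre_softmax, variance_to_keep=0.95)
tandem_plus_plp = np.hstack([plp, (pre_softmax - mean_red) @ proj_red])
```

On the real posteriors stream the slide reports that the 95% criterion leaves 16 dimensions; the toy data here is built from five latent factors, so it reduces further.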
Word and String Error Rates on Numbers95-CS Test Set
OGI Experiments: New Features in EARS
• Develop on home-grown ASR system (phoneme-based HTK)
• Pass the most promising to ICSI for running in SRI LVCSR system
• So far new features match the performance of the baseline PLP features but do not exceed it
  advantage seen in combination with the baseline
Looking to the human auditory system for design inspiration
• Psychophysics
  Components within certain frequency range (several critical bands) interact [e.g. frequency masking]
  Components within certain time span (a few hundreds of ms) interact [e.g. temporal masking]
• Physiology
  2-D (time-frequency) matched filters for activity in auditory cortex [cortical receptive fields]
TRAP-based HMM-NN hybrid ASR
[Figure: mean & variance normalized, Hamming windowed critical band trajectory (101 point input) → Multilayer Perceptrons (MLP) → posterior probabilities of phonemes → search for the best match]
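The input to each band-level MLP in the figure (a 101-point critical-band trajectory, mean and variance normalized, then Hamming windowed) can be sketched as below. The function name and the random stand-in energies are hypothetical; treating each point as one 10 ms frame, so that the window spans about 1 s, is an assumption.

```python
import numpy as np

def trap_input(band_energy, center, length=101):
    """Cut a `length`-point trajectory of one critical band centred on frame
    `center`, normalise its mean and variance, and apply a Hamming window."""
    half = length // 2
    if center < half or center + half >= len(band_energy):
        raise ValueError("not enough context around the centre frame")
    segment = np.asarray(band_energy[center - half:center + half + 1], dtype=float)
    segment = (segment - segment.mean()) / (segment.std() + 1e-12)
    return segment * np.hamming(length)

rng = np.random.default_rng(0)
toy_band_energy = rng.normal(size=300)   # stand-in for log energies of one critical band
pattern = trap_input(toy_band_energy, center=150)
```

One such vector per critical band would be fed to that band's MLP.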
Feature estimation from linearly transformed temporal patterns
[Figure: transform → MLP (per band) → TANDEM → HMM ASR; choice of transform marked "? ? ?"]
Preliminary TANDEM/TRAP results (OGI-HTK)
WER% on OGI numbers, training on UW reduced training set,monophone models
BASELINE 4.5
TANDEM 4.1
TANDEM with TRAP 3.9
Features from more than one critical-band temporal trajectory
[Figure: average + frequency derivative; cosine transform]
Studying KLT-derived basis functions, we observe:
UW Baseline Experiments
• Constructed an HTK-based HMM system that is competitive with the SRI system
• Replicated the HMM system in GMTK
• Move on to models which integrate information from multiple sources in a principled manner:
  Multiple feature streams (multi-stream models)
  Different time scales (multi-rate models)
• Focus on statistical models not on feature extraction
HTK HMM Baseline
• An HTK-based standard HMM system:
  • 3 state triphones with decision-tree clustering,
  • Mixture of diagonal Gaussians as state output dists.,
  • No adaptation, fixed LM.
• Dimensions explored:
  • Front-end: PLP vs. MFCC, VTLN
  • Gender dependent vs. independent modeling
• Conclusions:
  • No significant performance differences
  • Decided on PLPs, no VTLN, gender-independent models for simplicity
HMM Baselines (cont.)
• Replicated HTK baseline with equivalent results in GMTK
  WER %
  tool    dev    test
  HTK     3.7    3.2
  GMTK    3.7    3.0
• To reduce experiment turn-around time, wanted to reduce the training set
• For HMMs and Numbers95, 3/4th of the training data can be safely ignored:
  WER %
  Training set         dev    test
  Full “short”         3.7    3.2
  1/4th (“reduced”)    3.4    3.4
Multi-stream Models
• Information fusion from multiple streams of features
• Partially asynchronous state sequences
[Figure: STATE TOPOLOGY (states of stream X vs. states of stream Y) and GRAPHICAL MODEL (state seq. of stream X, state seq. of stream Y, feature stream X, feature stream Y)]
  WER %
  model                      dev    test
  HMM (PLP)                  3.9    4.2
  multi-stream (PLP+MFCC)
Temporal envelope features (Columbia)
• Temporal fine structure is lost (deliberately) in STFT features:
[Figure: mpgr1-sx419, 0.65-0.9 s: waveform and spectrogram (0-8000 Hz) with 10 ms windows; x-axis time / sec]
• Need a compact, parametric description...
Frequency-Domain Linear Prediction (FDLP)
• Extend LPC with LP model of spectrum
  TD-LP: $y[n] = \sum_i a_i\, y[n-i]$
  (DFT)
  FD-LP: $Y[k] = \sum_i b_i\, Y[k-i]$
• ‘Poles’ represent temporal peaks:
[Figure: mpgr1-sx419, 0.65-0.9 s: TDLPC env (60 poles / 300 ms)]
• Features ~ pole bandwidth, ‘frequency’
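The idea that linear prediction applied to a spectrum yields an all-pole model of the temporal envelope can be illustrated on a toy signal. The sketch below is not the Columbia implementation: it assumes a DCT-II as the frequency-domain transform, autocorrelation-method LPC with light diagonal loading, and a synthetic tone burst; all names are hypothetical.

```python
import numpy as np

def lpc(sequence, order):
    """Autocorrelation-method linear prediction: coefficients a_1..a_p such
    that sequence[k] is approximated by sum_i a_i * sequence[k - i]."""
    x = np.asarray(sequence, dtype=float)
    r = np.array([x[: len(x) - m] @ x[m:] for m in range(order + 1)])
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[lags] + 1e-6 * r[0] * np.eye(order)   # diagonal loading for stability
    return np.linalg.solve(R, r[1:])

def fdlp_envelope(signal, order):
    """All-pole estimate of the temporal envelope, one value per sample."""
    x = np.asarray(signal, dtype=float)
    n = np.arange(len(x))
    # DCT-II written as a matrix so that only NumPy is needed (fine for short segments).
    dct_matrix = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * len(x)))
    coeffs = lpc(dct_matrix @ x, order)
    # Evaluate 1 / |A|^2 on the grid where sample n maps to angle pi*(2n+1)/(2N).
    omega = np.pi * (2 * n + 1) / (2 * len(x))
    A = 1.0 - np.exp(-1j * np.outer(omega, np.arange(1, order + 1))) @ coeffs
    return 1.0 / np.abs(A) ** 2

# Synthetic 400-sample segment with one tone burst centred on sample 250.
n = np.arange(400)
burst = np.exp(-0.5 * ((n - 250) / 10.0) ** 2) * np.cos(2 * np.pi * 0.2 * n)
envelope = fdlp_envelope(burst, order=6)
peak = int(np.argmax(envelope))
```

The pole pair of the fitted model sits at the angle corresponding to the burst position, so the envelope estimate peaks near sample 250; pole angle and radius are the kind of 'frequency' and bandwidth features the slide refers to.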
Preliminary FDLP Results
• Distribution of pole magnitudes for different phone classes (in 4 bands):
[Figure: histograms of -log(1-|pole|) for /ah/ and /p/ in the 0-500 Hz, 500-1000 Hz, 1-2 kHz, and 2-4 kHz bands]
• NN Classifier Frame Accuracies:
  plp12N          57.0%
  plp12N+FDLP4    58.2%
Directions
Dan Ellis, Columbia (SRI/UW/Columbia work)
Nelson Morgan, ICSI (OGI/IDIAP/ICSI work + summary)
Multi-rate Models (UW)
[Figure: coarse state chain over long-term features and fine state chain over short-term features, linked by cross-scale dependencies (example)]
• Integrate acoustic information from different time scales
• Account for dependencies across scales
• Better robustness against time- and/or frequency-localized interferences
• Reduced redundancy gives better confidence estimates
SRI Directions
• Task 1: Signal-adaptive weighting of time-frequency patches
Basis-entropy based representation
Matching pursuit search for optimal weighting of patches
Optimality based on minimum entropy criterion
• Task 2: Graphical models of patch combinations
Tiling-driven dependency modeling
GM combines across patch selections
Optimality based on information in representation
Data-derived phonetic features (Columbia)
• Find a set of independent attributes to account for phonetic (lexical) distinctions
  phones replaced by feature streams
• Will require new pronunciation models
  asynchronous feature transitions (no phones)
  mapping from phonetics (for unseen words)
Joint work with Eric Fosler-Lussier
ICA for feature bases
• PCA finds decorrelated bases; ICA finds independent bases
[Figure: test/dr1/faks0/sa2. Panels: basis vectors; frequency / Bark vs. time / labels (d ow n ae s m iy t ix k eh r iy ix n oy l iy r ae g l ay k dh ae tcl)]
• Lexically-sufficient ICA basis set?
OGI Directions: Targets in sub-bands
• Initially context-independent and band-specific phonemes
• Gradually shifted to band-specific 6 broad phonetic classes (stops, fricatives, nasals, vowels, silence, flaps)
• Moving towards band-independent speech classes (vocalic-like, fricative-like, plosive-like, ???)
More than one temporal pattern?
[Figure: mean & variance normalized, Hamming windowed critical band trajectory (101 dim) → KLT1 ... KLTn → MLP → MLP]
Pre-processing by 2-D operators with subsequent TRAP-TANDEM
(rows: frequency, columns: time)
   1  2  1      -1  0  1       0  1  2      -2 -1  0
   0  0  0      -2  0  2      -1  0  1      -1  0  1
  -1 -2 -1      -1  0  1      -2 -1  0       0  1  2
  differentiate f, average t | differentiate t, average f | diff upwards, av downwards | diff downwards, av upwards
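Applying these four 3x3 operators to a time-frequency representation amounts to a small 2-D correlation. The sketch below assumes a spectrogram stored with frequency along rows and time along columns, "valid" borders, and correlation (no kernel flip); the dictionary keys follow the slide's labels, everything else is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

OPERATORS = {
    "differentiate f, average t": np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]]),
    "differentiate t, average f": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]),
    "diff upwards, av downwards": np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]]),
    "diff downwards, av upwards": np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]]),
}

def apply_operator(spectrogram, kernel):
    """Correlate a (frequency x time) array with a small kernel, 'valid' region only."""
    windows = sliding_window_view(np.asarray(spectrogram, dtype=float), kernel.shape)
    return np.einsum("ijuv,uv->ij", windows, kernel)

# Toy spectrogram: 6 bands x 10 frames, energy rising by 1 per frame in every band.
ramp = np.tile(np.arange(10.0), (6, 1))
d_time = apply_operator(ramp, OPERATORS["differentiate t, average f"])
d_freq = apply_operator(ramp, OPERATORS["differentiate f, average t"])
```

On this ramp the time-differentiating operator responds with a constant value and the frequency-differentiating one with zero, as expected for a pattern that changes only along time. The filtered output would then feed the TRAP-TANDEM chain.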
IDIAP Directions: Phase AutoCorrelation Features
Traditional Features: Autocorrelation based. Very sensitive to additive noise, other variations.
Phase AutoCorrelation (PAC):
if $R[k],\ k = 0, 1, \ldots, N-1$ represents autocorrelation coeffs derived from a frame of length $N$, PACs:
$P[k] = \cos^{-1}\!\left(\frac{R[k]}{R[0]}\right)$, where $R[0]$ is the frame energy.
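A direct reading of the PAC definition can be written in a few lines. The sketch assumes a circular autocorrelation (computed via the FFT), which keeps $|R[k]| \le R[0]$ so the arccosine is well defined; the slide does not say which autocorrelation estimate is used, so treat that choice and the function name as assumptions.

```python
import numpy as np

def phase_autocorrelation(frame):
    """P[k] = arccos(R[k] / R[0]) with R the circular autocorrelation of the frame.
    Assumes the frame is not all zeros."""
    x = np.asarray(frame, dtype=float)
    r = np.fft.ifft(np.abs(np.fft.fft(x)) ** 2).real   # circular autocorrelation
    ratio = np.clip(r / r[0], -1.0, 1.0)               # guard against rounding
    return np.arccos(ratio)

# Toy frame: 4 full cycles of a cosine in 64 samples.
n = np.arange(64)
toy_frame = np.cos(2 * np.pi * 4 * n / 64)
pac = phase_autocorrelation(toy_frame)
```

For this toy frame the normalized autocorrelation is itself a cosine of the lag, so the PAC values are the corresponding angles: zero at lag 0, a quarter turn at lag 4, and a half turn at lag 8.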
Entropy Based Multi-Stream Combination
• Combination of evidences from more than one expert to improve performance
• Entropy as a measure of confidence
• Experts having low entropy are more reliable as compared to experts having high entropy
• Inverse entropy weighting criterion
• Relationship between entropy of the resulting (recombined) classifier and recognition rate
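One way the inverse entropy weighting criterion could be realized is a per-frame weighted sum of the experts' posteriors, with each weight proportional to the reciprocal of that expert's entropy. This is a minimal sketch under that assumption (the slide does not specify the combination rule); the function names and toy posteriors are hypothetical.

```python
import numpy as np

def entropy(posteriors):
    """Entropy (in nats) along the last axis."""
    p = np.clip(posteriors, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def inverse_entropy_combine(stream_posteriors):
    """Combine an array of shape (streams, frames, classes) into
    (frames, classes) using per-frame inverse-entropy weights."""
    post = np.asarray(stream_posteriors, dtype=float)
    inverse = 1.0 / (entropy(post) + 1e-12)                # (streams, frames)
    weights = inverse / inverse.sum(axis=0, keepdims=True)
    combined = (weights[..., None] * post).sum(axis=0)
    return combined, weights

# Two toy experts, one frame, three classes: one confident, one nearly uniform.
toy_posteriors = [[[0.90, 0.05, 0.05]],
                  [[0.34, 0.33, 0.33]]]
combined, weights = inverse_entropy_combine(toy_posteriors)
```

The confident (low-entropy) expert receives the larger weight, and because the weights sum to one per frame the combined vector is still a probability distribution.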
ICSI Directions: Posterior Combination Framework
• Combination of Several Discriminative Probability Streams
Improvement of the Combo Infrastructure
• Improve basic features:
Add prosodic features: voicing level, energy continuity,
Improve PLP by further removing the pitch difference among speakers.
• Tandem
Different targets, different training features. E.g.: word boundary.
• Improve TRAP (OGI)
• Combination
Entropy based, accuracy based stream weighting or stream selection.
New types of tandem features: Possible word/syllable boundary
[Figure: Input feature → NN Processing → Target posterior]
Input feature:
• Traditional or improved PLP
• Spectral continuity
• Voicing, voicing continuity
• Formant continuity feature
• …more
Target posterior:
• Phonemes
• Word/syllable boundary
• Broad phoneme classes
• Manner / place / articulation… etc
Data Driven Subword Unit Generation (IDIAP/ICSI)
• Motivation: Phoneme-based units may not be optimal for ASR.
• Approach (based on speaker segmentation method):
  1. Initial segmentation: large number of clusters
  2. Is thresholdless BIC-like merging criterion met?
     Yes: Merge, re-segment, and re-estimate, then test the criterion again
     No: Stop
Summary
• Staff and tools in place to proceed with core experiments
• Pilot experiments provided coherent substrate for cooperation between 6 sites
• Future directions for individual sites are all over the map, which is what we want
• Possible exploration of collaborations w/MS in this meeting