T.J. Watson Research Center, Human Language Technologies
12/1/2003
EARS Progress Update: Improved MPE, Inline Lattice Rescoring, Fast Decoding, Gaussianization & Fisher experiments
Dan Povey, George Saon, Lidia Mangu, Brian Kingsbury & Geoffrey Zweig
Part 1: Improved MPE
Previous discriminative training setup: Implicit Lattice MMI
• Used unigram decoding graph and fast decoding to generate state-level “posteriors” (actually relative likelihoods: delta between best path using the state and best path overall)
• Posteriors used directly (without forward-backward) to accumulate “denominator” statistics
• Numerator statistics accumulated as for ML training, with full forward-backward
• Fairly effective but not “MMI/MPE standard”
Current discriminative training setup (for standard MMI)
• Creating lattices with unigram scores on links
• Forward-backward on lattices (using fixed state sequence) to get occupation probabilities; use same lattices on multiple iterations
• Creating num + den stats in a consistent way
• Use slower training speed (E=2, not 1) and more iterations
• Also implemented MPE
Experimental conditions
• Same as for RT’03 evaluation
• 274 hours of Switchboard training data
• Training + test data adapted using FMLLR transform [from ML system]
• 60-dim PLPs, VTLN, no MLLR
Basic MMI results (eval’00)
With word-internal phone context, 142K Gaussians
System         ML      Iter-1  Iter-2  Iter-3  Iter-4
Old MMI, E=1   23.5%   22.7%   22.2%
New MMI, E=2   23.5%   22.5%   21.7%   20.9%   20.8%
1.4% more improvement (2.7% total) with this setup
MPE results (eval’00)
System    ML      Iter-1  Iter-2  Iter-3  Iter-4  Iter-5
MMI       23.5%   22.5%   21.7%   20.9%   20.8%
MPE       23.5%   22.2%   21.5%*  21.3%*
MPE+MMI   23.5%   21.8%   21.3%   20.9%   20.5%   20.3%
• Standard MPE is not as good as MMI with this setup
• “MPE+MMI”, which is MPE with I-smoothing to the MMI update (not ML), gives 0.5% absolute over MMI
* Conditions differ, treat with caution.
MPE+MMI continued
• “MPE+MMI” involves storing 4 sets of statistics rather than 3: num, den, ml and now also mmi-den. 33% more storage, no extra computation.
• Do standard MMI update using ml and mmi-den stats; use resulting mean & var in place of ML mean & var in I-smoothing.
• (Note: I-smoothing is a kind of gradual backoff to a more robust estimate of mean & variance.)
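The MPE+MMI update above can be sketched for a single one-dimensional Gaussian. This is a minimal illustration, not the authors' implementation: it assumes the usual extended Baum-Welch form of the update, a smoothing constant D set to E times the denominator occupancy, and an I-smoothing weight tau of 50. The names `Stats`, `ebw_update` and `mpe_mmi_update`, and both of those constants, are assumptions that do not come from the slides. Variance flooring and the checks normally used to keep variances positive are omitted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Stats:
    occ: float  # sum of occupation probabilities
    x: float    # occupancy-weighted sum of observations
    xx: float   # occupancy-weighted sum of squared observations


def ebw_update(pos: Stats, neg: Stats, old_mean: float, old_var: float,
               d: float, tau: float = 0.0,
               prior_mean: float = 0.0, prior_var: float = 1.0) -> Tuple[float, float]:
    """Extended Baum-Welch style mean/variance update (assumed form).

    `tau` points of a prior Gaussian (prior_mean, prior_var) are added to the
    positive statistics; this is the I-smoothing backoff.
    """
    denom = pos.occ - neg.occ + d + tau
    mean = (pos.x - neg.x + d * old_mean + tau * prior_mean) / denom
    second_moment = (pos.xx - neg.xx
                     + d * (old_var + old_mean ** 2)
                     + tau * (prior_var + prior_mean ** 2)) / denom
    return mean, second_moment - mean ** 2


def mpe_mmi_update(num: Stats, den: Stats, ml: Stats, mmi_den: Stats,
                   old_mean: float, old_var: float,
                   e: float = 2.0, tau: float = 50.0) -> Tuple[float, float]:
    """MPE update whose I-smoothing prior is the MMI estimate, not the ML one."""
    # Step 1: standard MMI update from the ml and mmi-den statistics.
    prior_mean, prior_var = ebw_update(ml, mmi_den, old_mean, old_var,
                                       d=e * mmi_den.occ)
    # Step 2: MPE update from num and den, backing off to the MMI estimate.
    return ebw_update(num, den, old_mean, old_var, d=e * den.occ,
                      tau=tau, prior_mean=prior_mean, prior_var=prior_var)
```

Setting `tau` to zero in this sketch removes the backoff entirely, and passing empty `mmi_den` statistics makes the prior collapse to the plain ML estimate, which corresponds to standard I-smoothing.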
Probability scaling in MPE
• MPE training leads to an excess of deletions.
• Based on previous experience, this can be due to a probability scale that is too extreme.
• Changing the probability scale from 1/18 to 1/10 gave a ~0.3% win.
• 1/10 used as scale on all MPE experiments with left-context (see later).
Fast MMI
• Work presented by Bill Byrne at Eurospeech’03 showed improved results from MMI where the correctly recognized data was excluded*
• Achieve a similar effect without hard decisions, by canceling num & den stats
• I.e., if a state has nonzero occupation probabilities for both numerator and denominator at time t, cancel the shared part so only one is positive.
• Gives results as good as or better than the baseline, with half the iterations. Use E=2 as before.
* “Lattice segmentation and Minimum Bayes Risk Discriminative Training”, Vlasios Doumpiotis et al., Eurospeech 2003
System     ML      Iter-1  Iter-2  Iter-3  Iter-4
MMI        23.5%   22.5%   21.7%   20.9%   20.8%
Fast MMI   23.5%   21.2%   20.7%   21.2%
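The cancellation of numerator and denominator statistics can be sketched as follows. This is only an illustration of the idea on this slide: per-frame occupation probabilities are held in plain dictionaries keyed by a state name, and the function names are hypothetical.

```python
from typing import Dict, Tuple


def cancel_shared_occupancy(num_occ: float, den_occ: float) -> Tuple[float, float]:
    """Subtract the shared part so that at most one of the two stays positive."""
    shared = min(num_occ, den_occ)
    return num_occ - shared, den_occ - shared


def cancel_frame(num: Dict[str, float],
                 den: Dict[str, float]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Apply the cancellation to every state active at one time frame t."""
    out_num: Dict[str, float] = {}
    out_den: Dict[str, float] = {}
    for state in set(num) | set(den):
        n, d = cancel_shared_occupancy(num.get(state, 0.0), den.get(state, 0.0))
        if n > 0.0:
            out_num[state] = n  # state is more likely in the numerator
        if d > 0.0:
            out_den[state] = d  # state is more likely in the denominator
    return out_num, out_den
```

A state that is equally likely in both (correctly recognized data) contributes nothing after cancellation, which is the soft counterpart of excluding that data outright.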
MMI+MPE with cross-word (left) phone context
• Similar size system (about 160K vs 142K), with cross-word context
• Results shown here connect word-traces into lattices indiscriminately (ignoring constraints of context)
• There is an additional win possible from using context constraints (~0.2%)
RT’00 ML Iter-1 Iter-2 Iter-3 Iter-4
Old MMI* 22.0% 20.8%
Fast MMI 22.0% 20.0% 19.9%
MPE 22.0% 20.5% 20.2% 20.0%
MPE+MMI 22.0% 20.5% 19.8% 19.4% 19.5%
*I.e. last year, different setup
MMI and MPE with cross-word context... on RT’03
RT’03 ML Iter-1 Iter-2 Iter-3 Iter-4
Old MMI* 22.0% 29.8%
Fast MMI 30.9% 29.9%
MPE 30.9% 29.7% 29.6% 29.5%
MPE+MMI 30.9% 29.1%
• The new MMI setup (including ‘fast MMI’) is no better than old MMI
• About 1.8% improvement on RT’03 from MPE+MMI; MPE alone gives 1.4% improvement.
• Those numbers are 2.5% and 2.0% on RT’00
• Comparison with MPE results in Cambridge’s 28-mix system (~170K Gaussians) from 2002:
• Most comparable number is 2.2% improvement (30.4% to 28.2%) on dev01sub using FMLLR (“constrained MLLR”) and F-SAT training (*)
(*) “Automatic transcription of conversational telephone speech”, T. Hain et al., submitted to IEEE Transactions on Speech & Audio Processing
Part 2: Inline Lattice Rescoring
Language model rescoring – some preliminary work
• Very large LMs help, e.g. moving from a typical to a huge (unpruned) LM can help by 0.8% (*)
• Very hard to build static decoding graphs for huge LMs
• Good to be able to efficiently rescore lattices with a different LM
• Also useful for adaptive language modeling
• … adaptive language modeling gives us ~1% on “superhuman” test set, and 0.2% on RT’03 (+)
* “Large LM”, Nikolai Duta & Richard Schwartz (BBN), presentation at 2003 EARS meeting, IDIAP, Martigny
+ “Experiments on adaptive LM”, Lidia Mangu & Geoff Zweig (IBM), ibid.
Lattice rescoring algorithm
• Taking a lattice and applying a 3 or 4-gram LM involves expanding lattice nodes
• This algorithm can take very large amounts of time for some lattices
• Can be solved by heavy pruning, but this is undesirable if LMs are quite different.
• Developed lattice LM-rescoring algorithm.
• Finds the best path through a lattice given a different LM (*)
* (We are working on a modified algorithm that will generate rescored lattices)
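Lattice rescoring algorithm (cont’d)
• Each word-instance in lattice has k tokens (e.g. k=3)
• Each token has a partial word history ending in the current word, and a traceback to the best predecessor token
[Figure: example lattice with word-instances WHEN, WHY, THE, CAT and CAP, each labeled with its tokens (word history, score). WHEN: (WHEN, -101). WHY: (WHY, -101). THE: (WHEN THE, -205), (WHY THE, -210). CAT: (WHY THE CAT, -345), (THE CAT, -310). CAP: (CAP, -310).]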
Lattice rescoring algorithm (cont’d)
For each word-instance in lattice from left to right…
• …for each token in each predecessor word-instance…
  – …add current word to that token’s word-history and work out LM & acoustic costs;
  – …delete word left-context until the word-history exists in the LM as an LM context;
  – …form a new token pointing back to predecessor token;
  – …and add token to the current word-instance’s list of tokens.
• Always ensure that no two tokens with the same word-history exist (delete the least likely one)…
• …and always keep only the k most likely tokens.
Finally, trace back from most likely token at end of utterance.
• All done within decoder
• Highly efficient
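The token-passing steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the decoder's code: the lattice is a list of word-instances already in left-to-right (topological) order, scores are log-scores where higher is better, and the LM is supplied as a callable plus a set of histories that exist as LM contexts. All class and function names, and the toy bigram scores in `toy_example`, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

History = Tuple[str, ...]


@dataclass
class Token:
    history: History         # partial word history ending in the current word
    score: float             # accumulated log-score (higher is better)
    word: str
    prev: Optional["Token"]  # traceback to the best predecessor token


@dataclass
class WordInstance:
    word: str
    acoustic: float                                  # acoustic log-score of this instance
    preds: List[int] = field(default_factory=list)   # indices of predecessor instances
    tokens: List[Token] = field(default_factory=list)


def rescore_best_path(lattice: List[WordInstance],
                      lm_logprob: Callable[[History, str], float],
                      lm_contexts: Set[History],
                      k: int = 3) -> Tuple[List[str], float]:
    used_as_pred = set()
    for node in lattice:
        used_as_pred.update(node.preds)
        # A word-instance with no predecessors starts from an empty history.
        sources = ([t for p in node.preds for t in lattice[p].tokens]
                   if node.preds else [None])
        best: Dict[History, Token] = {}
        for tok in sources:
            hist = tok.history if tok else ()
            score = ((tok.score if tok else 0.0) + node.acoustic
                     + lm_logprob(hist, node.word))
            new_hist = hist + (node.word,)
            # Delete left context until the history exists as an LM context.
            while new_hist and new_hist not in lm_contexts:
                new_hist = new_hist[1:]
            # Never keep two tokens with the same word history.
            if new_hist not in best or score > best[new_hist].score:
                best[new_hist] = Token(new_hist, score, node.word, tok)
        # Keep only the k most likely tokens.
        node.tokens = sorted(best.values(), key=lambda t: t.score, reverse=True)[:k]
    finals = [t for i, n in enumerate(lattice) if i not in used_as_pred
              for t in n.tokens]
    tok = max(finals, key=lambda t: t.score)
    total, words = tok.score, []
    while tok is not None:  # trace back from the most likely final token
        words.append(tok.word)
        tok = tok.prev
    return words[::-1], total


def toy_example() -> Tuple[List[str], float]:
    bigram = {("WHEN", "THE"): -5.0, ("WHY", "THE"): -10.0,
              ("THE", "CAT"): -5.0, ("THE", "CAP"): -20.0}

    def lm_logprob(hist: History, word: str) -> float:
        return bigram.get((hist[-1], word), -30.0) if hist else -1.0

    contexts = {(w,) for w in ("WHEN", "WHY", "THE", "CAT", "CAP")}
    lattice = [WordInstance("WHEN", -100.0), WordInstance("WHY", -100.0),
               WordInstance("THE", -100.0, [0, 1]),
               WordInstance("CAT", -100.0, [2]),
               WordInstance("CAP", -100.0, [2])]
    return rescore_best_path(lattice, lm_logprob, contexts, k=3)
```

With the toy bigram scores, `toy_example()` returns the path WHEN THE CAT. Because the toy LM only has one-word contexts, the two tokens arriving at THE collapse to the same history and only the better one survives; a real 3-gram or 4-gram LM would keep longer histories apart.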
Lattice rescoring algorithm – experiments
To verify that it works…
• Took the 4-gram LM used for the RT’03 evaluation and pruned it 13-fold
• Built a decoding graph, and rescored with original LM
• Testing on RT’03, MPE-trained system with Gaussianization
Condition                                             WER (RT’03)
Big LM (132 MB)                                       28.5%
Tiny LM (10 MB)                                       31.7%
Tiny LM + rescoring, k=3                              28.5%
Tiny LM + rescoring, k=2                              28.6%
Tiny LM + rescoring, k=3, backwards traces only (*)   30.1%
* See next slide
Note: all experiments actually include an n-1 word history in each token, even when not necessary. This should decrease the accuracy of the algorithm for a given k.
Lattice rescoring algorithm – forward vs backward
Lattice generation algorithm:
• Both alpha and beta likelihoods are available to the algorithm
• Whenever a word-end state likelihood is within delta of the best path…
• Trace back until a word-beginning state whose best predecessor is a word-end is reached
• ...and create a “word trace.”
• Join all these word traces to form a lattice (using graph connectivity constraints)
• Equivalent to Julian Odell’s algorithm (with n=infinity)
BUT we also add “forwards” traces, based on tracing forward from word beginning to word end.
• Time-symmetric with backtraces.
• There are fewer forwards traces (due to graph topology)
• Adding forwards traces is important (0.6% hit from removing them)
• I don’t believe there is much effect on lattice oracle WER.
• … it is the alignments of word-sequences that are affected.
Part 3: Progress in Fast Decoding
RT’03 Sub-realtime Architecture
Improvements in Fast Decoding
• Switched from rank pruning to running beam pruning
• Hypotheses are pruned early on based on a running max estimate during successor expansion, then pruned again after the final max
[Figure: states vs. time for frames t and t+1. The max is updated as successors are expanded, each new hypothesis is pruned against the current max minus the beam, and hypotheses that survive are pruned again at the end.]
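Running beam pruning as described on this slide can be sketched for a single time step. This is an assumption-laden illustration rather than the decoder itself: scores are log-likelihoods (higher is better), the active set is a dictionary from state to score, and `successors` is a hypothetical callback returning (next state, transition log-score) pairs.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

State = Hashable


def expand_with_running_beam(
        active: Dict[State, float],
        successors: Callable[[State], Iterable[Tuple[State, float]]],
        beam: float) -> Dict[State, float]:
    """Expand one frame, pruning early against a running max, then again at the end."""
    running_max = float("-inf")
    nxt: Dict[State, float] = {}
    for state, score in active.items():
        for next_state, log_prob in successors(state):
            cand = score + log_prob
            # Early prune: the running max can only grow, so anything already
            # outside the beam would also fail the final pruning pass.
            if cand < running_max - beam:
                continue
            if cand > nxt.get(next_state, float("-inf")):
                nxt[next_state] = cand
            if cand > running_max:
                running_max = cand
    # Final prune against the true maximum for this frame.
    return {s: v for s, v in nxt.items() if v >= running_max - beam}
```

Because the running maximum never exceeds the final maximum, the early test discards only hypotheses that the final pass would discard anyway, so the surviving set is the same as with pruning at the end alone while fewer hypotheses are stored along the way.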
Runtime vs. WER: Beam and rank pruning
Resulted in a 10% decoding speed-up without loss in accuracy
Reducing the memory requirements
• Run-time memory reduction by storing minimum traceback information for Viterbi word sequence recovery
• Previously we stored information for full state-level alignment
• Now we store only information for word-level alignment
  – Alpha entry has accumulated cost and pointer to originating word token
  – Two alpha vectors for “flip-flop”
  – Permanent word-level tokens created only at active word-ends
• No penalty in speed, and dynamic memory reduced by two orders of magnitude
Part 4: Feature-Space Gaussianization
Feature space Gaussianization [Saon et al. 04]
Idea: transform each dimension non-linearly such that it becomes Gaussian distributed
Motivations:
• Perform speaker adaptation with non-linear transforms
• Natural form of non-linear speaker adaptive training (SAT)
• Effort of modeling output distribution with GMMs is reduced
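Transform is given by the inverse Gaussian CDF applied to the empirical CDF:
$$y_i = \Phi^{-1}\left(\frac{\operatorname{rank}(x_i)}{N}\right)$$
where $\Phi$ is the standard Gaussian CDF and $N$ is the number of samples.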
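A minimal sketch of this transform for one feature dimension, using only the Python standard library. One detail is an assumption rather than something stated on the slide: the rank is shifted by 0.5 before dividing by N, because a rank of N would map to a CDF value of exactly 1, where the inverse Gaussian CDF is infinite. Ties are broken by input order, and the function name is hypothetical.

```python
from statistics import NormalDist
from typing import List, Sequence


def gaussianize(values: Sequence[float]) -> List[float]:
    """Map each value to the standard-normal quantile of its empirical CDF."""
    n = len(values)
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    # Indices of the inputs sorted by value; position in this list gives the rank.
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # Assumed offset of 0.5 keeps the argument strictly inside (0, 1).
        out[i] = std_normal.inv_cdf((rank - 0.5) / n)
    return out
```

The mapping is monotonic, so the ordering of the data is preserved while heavy tails or skew in the original distribution are removed.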
Feature Space Gaussianization, Pictorially
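[Figure: inverse Gaussian CDF (mean 0, variance 1). Horizontal axis: old data values, as percentiles. Vertical axis: new data values, absolute. The 16th, 50th and 84th percentiles map to -1, 0 and +1, so +/- 1 standard deviation covers 68% of the data.]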
An actual transform
Feature space Gaussianization: WER
Results on RT’03 at the SAT level (no MLLR):
System                 ML      MPE
Baseline (FMLLR-SAT)   30.9%   29.1%
Gaussianized           30.5%   28.5%
Part 5: Experiments with Fisher Data
Acoustic Training Data
• Training set size based on aligned frames only.
• Total is 829 hours of speech; 468 hours excluding Fisher.
• Training vocabulary includes 61K tokens.
• First experiments with Fisher 1-4. Iteration likely to improve results.
Corpus             # frames   # hours
Fisher 1-4         130M       361
SWB-1              98.6M      274
IBM Voicemail      37.9M      105
BBN CTRANS         20.5M      57
SWB Cellular       6.4M       18
Callhome English   4.9M       14
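The hours column can be cross-checked from the frame counts. The sketch below assumes 100 frames per second (a 10 ms frame shift), which the slide does not state; under that assumption the rounded values reproduce the table and its 829-hour total.

```python
FRAMES_PER_SECOND = 100  # assumption: 10 ms frame shift, not stated on the slide

# Aligned frame counts in millions, copied from the table.
FRAMES_MILLIONS = {
    "Fisher 1-4": 130.0,
    "SWB-1": 98.6,
    "IBM Voicemail": 37.9,
    "BBN CTRANS": 20.5,
    "SWB Cellular": 6.4,
    "Callhome English": 4.9,
}


def frames_to_hours(frames_millions: float) -> float:
    """Convert a frame count given in millions to hours of speech."""
    return frames_millions * 1e6 / FRAMES_PER_SECOND / 3600.0


HOURS = {name: round(frames_to_hours(f)) for name, f in FRAMES_MILLIONS.items()}
TOTAL_HOURS = sum(HOURS.values())
NON_FISHER_HOURS = TOTAL_HOURS - HOURS["Fisher 1-4"]

for name, hours in HOURS.items():
    print(f"{name:18s}{hours:5d} h")
print(f"{'Total':18s}{TOTAL_HOURS:5d} h")
print(f"{'Excluding Fisher':18s}{NON_FISHER_HOURS:5d} h")
```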
Effect of new Fisher data on WER
• Systems are PLP VTLN SAT, 60-dim. LDA+MLLT features
• One-shot decoding on IBM 2003 RT-03 LM (interpolated 4gm) for RT-03; generic interpolated 3gm for Superhuman.
• Fisher data in AM only, not LM
System            RT-03 Switchboard   RT-03 Fisher   RT-03 overall   IBM Superhuman
2002 System       34.1                26.0           30.2            36.8
All data          32.7                25.1           29.1            36.7
All less Fisher   33.2                25.7           29.6            36.2
All less VM       32.2                25.4           28.9            36.8
Summary
Discriminative training
• New MPE 0.7% better than old MMI on RT03
• Used MMI estimate rather than ML estimate for I-smoothing with MPE (consistently gives about 0.4% improvement over standard MPE)
LM rescoring
• 10x reduction in static graph size: 132M to 10M
• Useful for rescoring with adaptive LMs
Fast decoding
• 10% speedup: incremental application of absolute pruning threshold
Gaussianization
• 0.6% improvement on top of MPE
• Useful on a variety of tasks (e.g. C&C in cars)
Fisher data
• 1.3% improvement over last year without it (AM only)
• Not useful in a broader context