
T.J. Watson Research Center, Human Language Technologies

12/1/2003

EARS Progress Update: Improved MPE, Inline Lattice Rescoring, Fast Decoding, Gaussianization & Fisher Experiments

Dan Povey, George Saon, Lidia Mangu, Brian Kingsbury & Geoffrey Zweig


Part 1: Improved MPE


Previous discriminative training setup – Implicit Lattice MMI

• Used a unigram decoding graph and fast decoding to generate state-level “posteriors” (actually relative likelihoods: the delta between the best path using the state and the best path overall)
• Posteriors used directly (without forward-backward) to accumulate “denominator” statistics
• Numerator statistics accumulated as for ML training, with full forward-backward
• Fairly effective but not “MMI/MPE standard”


Current discriminative training setup (for standard MMI)

• Creating lattices with unigram scores on links
• Forward-backward on lattices (using a fixed state sequence) to get occupation probabilities; the same lattices are used on multiple iterations
• Creating num + den stats in a consistent way
• Use slower training speed (E=2, not 1) and more iterations
• Also implemented MPE
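For concreteness, a minimal sketch of the lattice forward-backward step above, assuming a toy arc-list lattice (topologically sorted, each arc carrying a combined acoustic + unigram LM log-score); the function and field names are illustrative, not the actual training code:

    import math
    from collections import defaultdict

    def logadd(a, b):
        """Stable log(exp(a) + exp(b))."""
        if a == -math.inf:
            return b
        if b == -math.inf:
            return a
        m = max(a, b)
        return m + math.log1p(math.exp(min(a, b) - m))

    def arc_posteriors(arcs, start, end, scale=1.0 / 18):
        """Forward-backward over a lattice given as arcs (from_node, to_node, log_score).

        Arcs are assumed topologically sorted by from_node; log_score is the
        combined acoustic + unigram LM score, and `scale` is the probability
        scale applied before summing.  Returns arc index -> occupation
        probability, which would weight the denominator statistics of the
        Gaussians along that arc (the state sequence within an arc is fixed).
        """
        alpha = defaultdict(lambda: -math.inf)
        beta = defaultdict(lambda: -math.inf)
        alpha[start] = 0.0
        beta[end] = 0.0
        for (u, v, s) in arcs:                    # forward pass
            alpha[v] = logadd(alpha[v], alpha[u] + scale * s)
        for (u, v, s) in reversed(arcs):          # backward pass
            beta[u] = logadd(beta[u], beta[v] + scale * s)
        total = alpha[end]
        return {i: math.exp(alpha[u] + scale * s + beta[v] - total)
                for i, (u, v, s) in enumerate(arcs)}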


Experimental conditions

• Same as for the RT’03 evaluation
• 274 hours of Switchboard training data
• Training + test data adapted using an FMLLR transform [from the ML system]
• 60-dim PLPs, VTLN, no MLLR


Basic MMI results (eval’00)

With word-internal phone context, 142K Gaussians

               ML      Iter-1   Iter-2   Iter-3   Iter-4
Old MMI, E=1   23.5%   22.7%    22.2%
New MMI, E=2   23.5%   22.5%    21.7%    20.9%    20.8%

1.4% more improvement (2.7% total) with this setup


MPE results (eval’00)

           ML      Iter-1   Iter-2   Iter-3   Iter-4   Iter-5
MMI        23.5%   22.5%    21.7%    20.9%    20.8%
MPE        23.5%   22.2%    21.5%*   21.3%*
MPE+MMI    23.5%   21.8%    21.3%    20.9%    20.5%    20.3%

• Standard MPE is not as good as MMI with this setup
• “MPE+MMI”, which is MPE with I-smoothing to the MMI update (not ML), gives 0.5% absolute over MMI

* Conditions differ, treat with caution.


MPE+MMI continued

• “MPE+MMI” involves storing 4 sets of statistics rather than 3: num, den, ml, and now also mmi-den – 33% more storage, no extra computation.

• Do a standard MMI update using the ml and mmi-den stats, and use the resulting mean & variance in place of the ML mean & variance in I-smoothing.

• (Note: I-smoothing is a kind of gradual backoff to a more robust estimate of the mean & variance.)
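To make the update concrete, here is a hedged sketch of an extended Baum-Welch step with I-smoothing towards an MMI estimate, for a single diagonal Gaussian; the statistics layout, the choice of the smoothing constant D, and the values of tau and E are simplifying assumptions, not the exact recipe used in these experiments:

    import numpy as np

    def ebw_update(num, den, ml, mmi_den, mu_old, var_old, tau=50.0, E=2.0):
        """One EBW update for a diagonal Gaussian ("MPE+MMI"-style I-smoothing).

        Each stats dict holds 'gamma' (occupation count), 'x' (sum of features)
        and 'x2' (sum of squared features).  num/den are the MPE numerator and
        denominator stats; ml and mmi_den are the extra stats used to form the
        MMI estimate that I-smoothing backs off to.
        """
        # MMI estimate from the ml (numerator) and mmi-den stats.
        D_mmi = E * mmi_den['gamma']
        d = ml['gamma'] - mmi_den['gamma'] + D_mmi
        mu_mmi = (ml['x'] - mmi_den['x'] + D_mmi * mu_old) / d
        var_mmi = (ml['x2'] - mmi_den['x2'] + D_mmi * (var_old + mu_old ** 2)) / d - mu_mmi ** 2

        # I-smoothing: add tau "virtual" counts of the MMI estimate to the MPE numerator.
        g_num = num['gamma'] + tau
        x_num = num['x'] + tau * mu_mmi
        x2_num = num['x2'] + tau * (var_mmi + mu_mmi ** 2)

        # MPE update; D is chosen here simply as E times the denominator count.
        D = E * den['gamma']
        d = g_num - den['gamma'] + D
        mu_new = (x_num - den['x'] + D * mu_old) / d
        var_new = (x2_num - den['x2'] + D * (var_old + mu_old ** 2)) / d - mu_new ** 2
        return mu_new, np.maximum(var_new, 1e-4)   # crude variance floor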

Probability scaling in MPE

• MPE training leads to an excess of deletions. Based on previous experience, this can be due to a probability scale that is too extreme.
• Changing the probability scale from 1/18 to 1/10 gave a ~0.3% win.
• 1/10 is used as the scale in all MPE experiments with left-context (see later).


Fast MMI

Work presented by Bill Byrne at Eurospeech’03 showed improved results from MMI where the correctly recognized data was excluded*

We achieve a similar effect without hard decisions, by canceling num & den stats (a minimal sketch appears after the table below).

I.e., if a state has nonzero occupation probabilities for both the numerator and the denominator at time t, cancel the shared part so that at most one of them remains positive.

Gives results as good as or better than the baseline, with half the iterations. Use E=2 as before.

* “Lattice segmentation and Minimum Bayes Risk Discriminative Training”, Vlasios Doumpiotis et al., Eurospeech 2003

           ML      Iter-1   Iter-2   Iter-3   Iter-4
MMI        23.5%   22.5%    21.7%    20.9%    20.8%
Fast MMI   23.5%   21.2%    20.7%    21.2%
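A minimal sketch of the cancellation step for one state at one time frame (the surrounding stats accumulation is omitted and the function name is illustrative):

    def cancel_stats(gamma_num, gamma_den):
        """Cancel the shared part of the numerator and denominator occupation
        probabilities so that at most one of them stays positive."""
        shared = min(gamma_num, gamma_den)
        return gamma_num - shared, gamma_den - shared

    # e.g. cancel_stats(0.9, 0.6) -> (0.3, 0.0): only the excess numerator
    # occupancy contributes, so frames the model already gets right add little.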


MMI+MPE with cross-word (left) phone context

• Similar size system (about 160K vs 142K Gaussians), with cross-word context
• Results shown here connect word-traces into lattices indiscriminately (ignoring constraints of context)
• There is an additional win possible from using context constraints (~0.2%)

RT’00      ML      Iter-1   Iter-2   Iter-3   Iter-4
Old MMI*   22.0%   20.8%
Fast MMI   22.0%   20.0%    19.9%
MPE        22.0%   20.5%    20.2%    20.0%
MPE+MMI    22.0%   20.5%    19.8%    19.4%    19.5%

* I.e., last year, different setup


MMI and MPE with cross-word context, on RT’03

RT’03      ML      Iter-1   Iter-2   Iter-3   Iter-4
Old MMI*   22.0%   29.8%
Fast MMI   30.9%   29.9%
MPE        30.9%   29.7%    29.6%    29.5%
MPE+MMI    30.9%   29.1%

• The new MMI setup (including “fast MMI”) is no better than the old MMI
• About 1.8% improvement on RT’03 from MPE+MMI; MPE alone gives 1.4% improvement. Those numbers are 2.5% and 2.0% on RT’00
• Comparison with the MPE results for Cambridge’s 28-mix system (~170K Gaussians) from 2002: the most comparable number is a 2.2% improvement (30.4% to 28.2%) on dev01sub using FMLLR (“constrained MLLR”) and F-SAT training (*)

* “Automatic transcription of conversational telephone speech”, T. Hain et al., submitted to IEEE Transactions on Speech & Audio Processing


Part 2: Inline Lattice Rescoring


Language model rescoring – some preliminary work

• Very large LMs help, e.g. moving from a typical LM to a huge (unpruned) one can help by 0.8% (*)
• Very hard to build static decoding graphs for huge LMs
• Good to be able to efficiently rescore lattices with a different LM
• Also useful for adaptive language modeling … adaptive language modeling gives us ~1% on the “superhuman” test set, and 0.2% on RT’03 (+)

* “Large LM”, Nikolai Duta & Richard Schwartz (BBN), presentation at the 2003 EARS meeting, IDIAP, Martigny
+ “Experiments on adaptive LM”, Lidia Mangu & Geoff Zweig (IBM), ibid.


Lattice rescoring algorithm

• Taking a lattice and applying a 3- or 4-gram LM involves expanding lattice nodes
• This algorithm can take very large amounts of time for some lattices
• Can be solved by heavy pruning, but this is undesirable if the LMs are quite different
• Developed a lattice LM-rescoring algorithm that finds the best path through a lattice given a different LM (*)

*(We are working on a modified algorithm that will generate rescored lattices)


Lattice rescoring algorithm (cont’d)

• Each word-instance in the lattice has k tokens (e.g. k=3)
• Each token has a partial word history ending in the current word, and a traceback to the best predecessor token

[Figure: a small example lattice with word-instances WHEN, WHY, THE, CAT and CAP. Each word-instance carries tokens pairing a partial word history with its cost, e.g. “WHY, -101” and “WHEN, -101” on the first words; “WHEN THE, -205” and “WHY THE, -210” on THE; “CAP, -310” on CAP; “WHY THE CAT, -345” and “THE CAT, -310” on CAT.]


Lattice rescoring algorithm (cont’d)

• For each word-instance in the lattice, from left to right…
  … for each token in each predecessor word-instance:
    … add the current word to that token’s word-history and work out the LM & acoustic costs;
    … delete word left-context until the word-history exists in the LM as an LM context;
    … form a new token pointing back to the predecessor token;
    … and add the token to the current word-instance’s list of tokens.
• Always ensure that no two tokens with the same word-history exist (delete the less likely one)
• … and always keep only the k most likely tokens.
• Finally, trace back from the most likely token at the end of the utterance.
• All done within the decoder
• Highly efficient
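A hedged sketch of the token-passing loop just described; the word-instance fields and the lm.score / lm.truncate interface are assumptions made for illustration, not the actual decoder data structures:

    def rescore_lattice(word_instances, lm, k=3):
        """Best path through a lattice under a new LM, by token passing.

        `word_instances` is a list in topological (left-to-right) order; each
        has .word, .acoustic_cost, .preds (indices of predecessor instances,
        empty if sentence-initial) and .is_final.  `lm.score(history, word)`
        is assumed to return a log-probability and `lm.truncate(history)` to
        drop left context until the history exists as an LM context.
        """
        tokens = [[] for _ in word_instances]
        for i, wi in enumerate(word_instances):
            cands = {}   # word-history -> best token ending in wi.word
            prev = [t for p in wi.preds for t in tokens[p]] or [{'cost': 0.0, 'hist': ()}]
            for t in prev:
                cost = t['cost'] + wi.acoustic_cost - lm.score(t['hist'], wi.word)
                hist = lm.truncate(t['hist'] + (wi.word,))
                # never keep two tokens with the same word-history
                if hist not in cands or cost < cands[hist]['cost']:
                    cands[hist] = {'cost': cost, 'hist': hist, 'word': wi.word, 'back': t}
            # keep only the k most likely tokens for this word-instance
            tokens[i] = sorted(cands.values(), key=lambda t: t['cost'])[:k]
        # trace back from the most likely token at the end of the utterance
        finals = [t for i, wi in enumerate(word_instances) if wi.is_final for t in tokens[i]]
        best, words = min(finals, key=lambda t: t['cost']), []
        while 'word' in best:
            words.append(best['word'])
            best = best['back']
        return list(reversed(words))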


Lattice rescoring algorithm – experiments

To verify that it works:
• Took the 4-gram LM used for the RT’03 evaluation and pruned it 13-fold
• Built a decoding graph from the pruned LM, and rescored with the original LM
• Testing on RT’03, MPE-trained system with Gaussianization

                                                      WER (RT’03)
Big LM (132 MB)                                       28.5%
Tiny LM (10 MB)                                       31.7%
Tiny LM + rescoring, k=3                              28.5%
Tiny LM + rescoring, k=2                              28.6%
Tiny LM + rescoring, k=3, backwards traces only (*)   30.1%

* See next slide. Note: all experiments actually include an (n-1)-word history in each token, even when not necessary; this should decrease the accuracy of the algorithm for a given k.


Lattice rescoring algorithm – forward vs backward

Lattice generation algorithm:
• Both alpha and beta likelihoods are available to the algorithm
• Whenever a word-end state likelihood is within delta of the best path…
• …trace back until a word-beginning state whose best predecessor is a word-end is reached…
• …and create a “word trace”
• Join all these word traces to form a lattice (using graph connectivity constraints)
• Equivalent to Julian Odell’s algorithm (with n = infinity)
• BUT we also add “forwards” traces, based on tracing forward from word beginning to word end; time-symmetric with backtraces
• There are fewer forwards traces (due to graph topology)
• Adding forwards traces is important (0.6% hit from removing them)
• I don’t believe there is much effect on lattice oracle WER… it is the alignments of word-sequences that are affected


Part 3: Progress in Fast Decoding


RT’03 Sub-realtime Architecture


Improvements in Fast Decoding

• Switched from rank pruning to running beam pruning
• Hypotheses are pruned early, based on a running max estimate, during successor expansion, then pruned again after the final max is known

[Figure: states expanding from time t to t+1. The running max is updated as each successor is scored; hypotheses are pruned against the current max-beam during expansion, and some are pruned again at the end once the final max is known.]
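A minimal sketch of one frame of expansion with running beam pruning, under assumed data structures (costs are negative log-likelihoods, so the running “max” over likelihoods is a running minimum over costs):

    def expand_frame(active, successors_of, beam):
        """Expand hypotheses from time t to t+1 with running beam pruning.

        `active` maps state -> cost at time t; `successors_of(state)` yields
        (next_state, transition_cost) pairs.  Hypotheses are pruned against a
        running best-cost estimate while successors are being expanded, and
        pruned once more after the final best cost is known.
        """
        next_costs = {}
        running_best = float('inf')
        for state, cost in active.items():
            if cost > running_best + beam:           # early prune of the source hypothesis
                continue
            for nxt, trans in successors_of(state):
                c = cost + trans
                if c > running_best + beam:          # prune during successor expansion
                    continue
                if c < next_costs.get(nxt, float('inf')):
                    next_costs[nxt] = c
                    running_best = min(running_best, c)   # update the running estimate
        if not next_costs:
            return {}
        final_best = min(next_costs.values())        # final prune once the true best is known
        return {s: c for s, c in next_costs.items() if c <= final_best + beam}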


Runtime vs. WER: Beam and rank pruning

Resulted in a 10% decoding speed-up without loss in accuracy


Reducing the memory requirements

• Run-time memory reduction by storing the minimum traceback information needed for Viterbi word-sequence recovery
• Previously we stored information for a full state-level alignment
• Now we store only information for a word-level alignment:
  – Alpha entry has an accumulated cost and a pointer to the originating word token
  – Two alpha vectors for “flip-flop”
  – Permanent word-level tokens created only at active word-ends
• No penalty in speed, and dynamic memory is reduced by two orders of magnitude
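A sketch of the reduced bookkeeping, with assumed type names: permanent tokens exist only at word ends, and each per-state alpha entry carries just an accumulated cost plus a pointer to the word token it came from (two such alpha vectors are reused in flip-flop fashion between frames):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class WordToken:
        """Permanent token, created only at an active word-end."""
        word: str
        end_frame: int
        cost: float
        prev: Optional["WordToken"]        # back pointer to the previous word token

    @dataclass
    class AlphaEntry:
        """Per-state entry in the flip-flop alpha vectors: no state-level
        traceback, just the cost and the originating word token."""
        cost: float
        origin: Optional[WordToken]

    def backtrace(final_token: Optional[WordToken]):
        """Recover the Viterbi word sequence from the word-level tokens alone."""
        words = []
        while final_token is not None:
            words.append(final_token.word)
            final_token = final_token.prev
        return list(reversed(words))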


Part 4: Feature-Space Gaussianization


Feature space Gaussianization [Saon et al. 04]

Idea: transform each dimension non-linearly such that it becomes Gaussian distributed

Motivations:
• Perform speaker adaptation with non-linear transforms
• Natural form of non-linear speaker adaptive training (SAT)
• Effort of modeling the output distribution with GMMs is reduced

The transform is given by the inverse Gaussian CDF applied to the empirical CDF:

    y_i = \Phi^{-1}\!\left( \frac{\mathrm{rank}(x_i)}{N} \right)

where N is the number of data points used to estimate the empirical CDF.
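A minimal sketch of the per-dimension transform, assuming an (N, d) array of frames for one speaker or cluster; the half-rank offset is just one common way of keeping the empirical CDF strictly inside (0, 1):

    import numpy as np
    from statistics import NormalDist

    def gaussianize(features):
        """Map each feature dimension through its empirical CDF and then the
        inverse standard-normal CDF, so every dimension becomes ~N(0, 1)."""
        inv_cdf = NormalDist().inv_cdf
        N, d = features.shape
        out = np.empty_like(features, dtype=float)
        for j in range(d):
            ranks = np.argsort(np.argsort(features[:, j]))   # 0-based ranks
            cdf = (ranks + 0.5) / N                          # empirical CDF in (0, 1)
            out[:, j] = [inv_cdf(p) for p in cdf]
        return out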


Feature Space Gaussianization, Pictorially

[Figure: the inverse Gaussian CDF (mean 0, variance 1), mapping old data values (percentiles) to new data values (absolute): the 50th percentile maps to 0 and roughly the 84th percentile to +1 (one standard deviation); ±1 standard deviation covers 68% of the data.]


An actual transform


Feature space Gaussianization: WER

Results on RT’03 at the SAT level (no MLLR):

                       ML      MPE
Baseline (FMLLR-SAT)   30.9%   29.1%
Gaussianized           30.5%   28.5%


Part 5: Experiments with Fisher Data


Acoustic Training Data

Training set size based on aligned frames only.

Total is 829 hours of speech; 486 hours excluding Fisher.

Training vocabulary includes 61K tokens.

First experiments with Fisher 1-4. Iteration likely to improve results.

Corpus             # frames   # hours
Fisher 1-4         130M       361
SWB-1              98.6M      274
IBM Voicemail      37.9M      105
BBN CTRANS         20.5M      57
SWB Cellular       6.4M       18
Callhome English   4.9M       14
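As a quick consistency check on the table (assuming the usual 10 ms frame shift), the frame counts line up with the hour figures:

    def frames_to_hours(n_frames, frame_shift_s=0.01):
        return n_frames * frame_shift_s / 3600.0

    print(round(frames_to_hours(130e6)))    # ~361 hours (Fisher 1-4)
    print(round(frames_to_hours(98.6e6)))   # ~274 hours (SWB-1)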


Effect of new Fisher data on WER

• Systems are PLP, VTLN, SAT, 60-dim LDA+MLLT features
• One-shot decoding with the IBM 2003 RT-03 LM (interpolated 4-gram) for RT-03; a generic interpolated 3-gram for Superhuman
• Fisher data in the AM only – not the LM

                   RT-03          RT-03    RT-03     IBM
                   Switchboard    Fisher   overall   Superhuman
2002 System        34.1           26.0     30.2      36.8
All data           32.7           25.1     29.1      36.7
All less Fisher    33.2           25.7     29.6      36.2
All less VM        32.2           25.4     28.9      36.8


Summary

Discriminative training
• New MPE is 0.7% better than old MMI on RT’03
• Used the MMI estimate rather than the ML estimate for I-smoothing with MPE (consistently gives about 0.4% improvement over standard MPE)

LM rescoring
• 10x reduction in static graph size – 132 MB → 10 MB
• Useful for rescoring with adaptive LMs

Fast decoding
• 10% speedup – incremental application of an absolute pruning threshold

Gaussianization
• 0.6% improvement on top of MPE
• Useful on a variety of tasks (e.g. C&C in cars)

Fisher data
• 1.3% improvement over last year without it (AM only)
• Not useful in a broader context