Page 1

M4 speech recognition

University of Sheffield

Vincent Wan, Martin Karafiát

Page 2

The Recogniser

Frontend

First pass: best-first decoding (Ducoder)
  • Word-internal triphone models
  • Trigram language model (SRILM)
  • MLLR adaptation (HTK)
  • Produces n-best lattices and first-pass recognition output

Second pass: lattice rescoring by time-synchronous decoding (HTK)
  • Cross-word triphone models
  • MLLR adaptation (HTK)
  • Produces final recognition output
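The two-pass design above re-ranks first-pass hypotheses with stronger models. A minimal sketch of the underlying idea: re-rank an n-best list by a log-linear combination of acoustic score, language-model score, and word penalty. The hypotheses, the `toy_lm` function, and the weights are all illustrative, not the actual Ducoder/HTK setup.

```python
# Re-ranking an n-best list: each first-pass hypothesis carries an
# acoustic log-score; a second-pass language model adds a new log-score,
# and the two are combined log-linearly before picking the best.
# Hypotheses, the toy LM, and the weights below are all illustrative.

def rescore_nbest(nbest, lm_logprob, lm_weight=10.0, word_penalty=-0.5):
    """Re-rank (words, acoustic_logprob) pairs with a new language model."""
    rescored = []
    for words, ac_logprob in nbest:
        total = (ac_logprob
                 + lm_weight * lm_logprob(words)
                 + word_penalty * len(words))
        rescored.append((total, words))
    rescored.sort(reverse=True)  # highest combined score first
    return [words for _, words in rescored]

def toy_lm(words):
    # Stand-in language model that simply prefers shorter hypotheses.
    return -0.1 * len(words)

nbest = [(["the", "cat", "sat"], -120.0),
         (["the", "cats", "at"], -119.5),
         (["the", "cat", "sat", "down"], -125.0)]

best = rescore_nbest(nbest, toy_lm)[0]
```

In a real system the lattice (rather than a flat n-best list) is rescored, which avoids the n-best truncation noted on the next slide.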

Page 3

System limitations

• N-best list rescoring not optimal

• Adaptation must be performed on two sets of acoustic models

• Many more hyper-parameters to tune manually

• SRILM is not efficient on very large language models (greater than 10^9 words)
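For context, a trigram language model of the kind SRILM builds can be sketched as a count-based model with simple linear interpolation. This is an illustrative toy class only; SRILM's actual smoothing (e.g. Good-Turing or Kneser-Ney discounting) and its compact n-gram storage are what make the scale issue above non-trivial.

```python
from collections import defaultdict

# Count-based trigram language model with simple linear interpolation.
# Purely illustrative: SRILM uses more refined smoothing and storage.

class TrigramLM:
    def __init__(self, lambdas=(0.6, 0.3, 0.1)):
        self.l3, self.l2, self.l1 = lambdas  # interpolation weights
        self.uni = defaultdict(int)   # count(w)
        self.bi = defaultdict(int)    # count(w1, w2)
        self.tri = defaultdict(int)   # count(w1, w2, w3)
        self.total = 0                # total token count

    def train(self, sentences):
        for sent in sentences:
            words = ["<s>", "<s>"] + sent + ["</s>"]
            for w in words:
                self.uni[w] += 1
            for i in range(1, len(words)):
                self.bi[(words[i - 1], words[i])] += 1
            for i in range(2, len(words)):
                self.tri[(words[i - 2], words[i - 1], words[i])] += 1
            self.total += len(words)

    def prob(self, w1, w2, w3):
        """Interpolated P(w3 | w1, w2)."""
        p3 = self.tri[(w1, w2, w3)] / self.bi[(w1, w2)] if self.bi[(w1, w2)] else 0.0
        p2 = self.bi[(w2, w3)] / self.uni[w2] if self.uni[w2] else 0.0
        p1 = self.uni[w3] / self.total if self.total else 0.0
        return self.l3 * p3 + self.l2 * p2 + self.l1 * p1

# Toy training data (illustrative only).
lm = TrigramLM()
lm.train([["the", "meeting", "starts"], ["the", "meeting", "ends"]])
```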

Page 4

Advances since last meeting

• Models trained on two databases
  – SWITCHBOARD recogniser
    • Acoustic & language models trained on 200 hours of speech
  – ICSI meetings recogniser
    • Acoustic models trained on 40 hours of speech
    • Language model is a combination of SWB and ICSI

• Improvements mainly affect the Switchboard models

• 16 kHz sampling rate used throughout

Page 5

Advances since last meeting

• Adaptation of word-internal context-dependent models

• Unified the phone sets and pronunciation dictionaries
  – Improved the pronunciation dictionary for Switchboard
  – Now using the ICSI dictionary with missing pronunciations imported from the ISIP dictionary

• Better handling of multiple pronunciations during acoustic model training

• General bug fixes

Page 6

Results overview

% word error rates

Test set | SWB trn | ICSI trn | SWB trn + ICSI adpt | ICSI trn + ICSI adpt | ICSI trn + M4 adpt | SWB trn + M4 adpt | SWB trn + ICSI adpt + M4 adpt
SWB      | 55.05   |          | 45.41               |                      |                    |                    |
ICSI     |         | 52.36    | 53.99               | 49.27                |                    |                    |
M4       |         |          |                     |                      | 73.47 * / 79.17 †  |                    | 84.67 * / 81.27 †

* Results from lapel mics    † Results from beam former
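The figures above are word error rates: the word-level edit distance (substitutions + insertions + deletions) between reference and hypothesis transcripts, normalised by the reference length. A minimal sketch of the standard computation:

```python
# Word error rate: word-level Levenshtein distance between reference and
# hypothesis, divided by the number of reference words, as a percentage.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                        # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the mat")` counts one substitution and one deletion over three reference words.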

Page 7

Results: adaptation vs. direct training on ICSI

                                                        ICSI trained | SWB trained + ICSI adapted
Monophone models *                                             73.37 | 78.89
Context-dependent word-internal models *                       66.08 | 70.59
Lattice rescoring (none or spkr-independent adaptation)        52.34 | 53.99
Lattice rescoring (speaker adaptation)                         49.27 | 51.18

% word error rates

* Results from Ducoder using all pruning

Page 8

Acoustic model adaptation issue

• Acoustic models are presently not very adaptive
  – Better MLLR code required (next slide)
  – More training data required

• Need to make better use of the combined ICSI/SWB training data for M4
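For reference, MLLR adapts the Gaussian mean vectors of the acoustic models with a shared affine transform, mu_new = A·mu + b, estimated on the adaptation data. The sketch below only applies a given transform to show the mechanism; the transform values are made up, and the estimation step (the maximum-likelihood part handled by HTK) is omitted.

```python
# MLLR adapts HMM Gaussian means with a shared affine transform:
#     mu_new = A @ mu + b
# One transform (or a small tree of regression-class transforms) is
# estimated from the adaptation data and applied to every Gaussian it
# covers.  This sketch only APPLIES a given transform; estimating A and
# b from adaptation statistics is the hard part and is omitted here.

def apply_mllr(means, A, b):
    """Apply mu_new = A*mu + b to a list of mean vectors (plain lists)."""
    adapted = []
    for mu in means:
        mu_new = [sum(A[i][j] * mu[j] for j in range(len(mu))) + b[i]
                  for i in range(len(b))]
        adapted.append(mu_new)
    return adapted

# Toy 2-dimensional example with an illustrative (made-up) transform.
A = [[1.0, 0.1],
     [0.0, 0.9]]
b = [0.5, -0.2]
means = [[1.0, 2.0], [3.0, 4.0]]
adapted = apply_mllr(means, A, b)
```

Because one transform moves many Gaussians at once, MLLR can adapt with little data, but the gain depends heavily on how well the transform is estimated, which motivates the "better MLLR code" point above.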

Page 9

Other news

• The next version of HTK’s adaptation code will be made available to M4 before the official public release.

• Sheffield to acquire HTK LVCSR decoder
  – Licensing issues to be resolved
  – May be able to make binaries available to M4 partners