Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech...
Transcript of Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech...
![Page 1: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/1.jpg)
SpeechRecognitionHUNG-YI LEE ๆๅฎๆฏ
![Page 2: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/2.jpg)
LAS: ๅฐฑๆฏ seq2seq
CTC: decoder ๆฏ linear classifier ็ seq2seq
RNA: ่ผธๅ ฅไธๅๆฑ่ฅฟๅฐฑ่ฆ่ผธๅบไธๅๆฑ่ฅฟ็ seq2seq
RNN-T: ่ผธๅ ฅไธๅๆฑ่ฅฟๅฏไปฅ่ผธๅบๅคๅๆฑ่ฅฟ็ seq2seq
Neural Transducer: ๆฏๆฌก่ผธๅ ฅไธๅ window ็ RNN-T
MoCha: window ็งปๅไผธ็ธฎ่ชๅฆ็ Neural Transducer
Last Time
![Page 3: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/3.jpg)
Two Points of Views
Source of image: ๆ็ณๅฑฑ่ๅธซใๆธไฝ่ช้ณ่็ๆฆ่ซใ
Seq-to-seq HMM
![Page 4: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/4.jpg)
Hidden Markov Model (HMM)
SpeechRecognition
speech text
X Y
Yโ = ๐๐๐maxY
๐ Y|๐
= ๐๐๐maxY
๐ ๐|Y ๐ Y
๐ ๐
= ๐๐๐maxY
๐ ๐|Y ๐ Y
๐ ๐|Y : Acoustic Model
๐ Y : Language Model
HMM
Decode
![Page 5: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/5.jpg)
HMM
hh w aa t d uw y uw th ih ng k
what do you think
t-d+uw1 t-d+uw2 t-d+uw3
โฆโฆ t-d+uw d-uw+y uw-y+uw y-uw+th โฆโฆ
d-uw+y1 d-uw+y2 d-uw+y3
Phoneme:
Tri-phone:
State:
๐ ๐|Y ๐ ๐|๐
A token sequence Y corresponds to a sequence of states S
![Page 6: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/6.jpg)
HMM๐ ๐|Y ๐ ๐|๐
A sentence Y corresponds to a sequence of states S
๐ ๐ ๐
Start End
๐ ๐
![Page 7: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/7.jpg)
HMM๐ ๐|Y ๐ ๐|๐
A sentence Y corresponds to a sequence of states S
Transition Probability
๐ ๐|๐๐ ๐
Emission Probability
t-d+uw1
d-uw+y3
P(x|โt-d+uw1โ)
P(x|โd-uw+y3โ)
Gaussian Mixture Model (GMM)
Probability from one state to another
![Page 8: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/8.jpg)
HMM โ Emission Probability
โข Too many states โฆโฆ
P(x|โt-d+uw3โ)P(x|โd-uw+y3โ)
Tied-state
Same Addresspointer pointer
็ตๆฅตๅๆ : Subspace GMM [Povey, et al., ICASSPโ10]
(Geoffrey Hinton also published deep learning for ASR in the same conference)
[Mohamed , et al., ICASSPโ10]
![Page 9: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/9.jpg)
P๐ ๐|๐ =?
transition
๐ ๐|๐
๐ ๐emission(GMM)
โโ๐๐๐๐๐(๐)
๐ ๐|โ
alignment
which state generates which vector
โ1 = ๐๐๐๐๐๐
๐ ๐ ๐
โ2 = ๐๐๐๐๐๐
โ = ๐๐๐๐๐๐
๐ ๐|โ1
๐ ๐|โ2
โ = ๐๐๐๐๐
๐ ๐ ๐ ๐ ๐ ๐๐ ๐|๐ ๐ ๐|๐ ๐ ๐|๐ ๐ ๐|๐ ๐ ๐|๐
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐ ๐ฅ1|๐ ๐ ๐ฅ2|๐ ๐ ๐ฅ3|๐ ๐ ๐ฅ4|๐ ๐ ๐ฅ5|๐ ๐ ๐ฅ6|๐
![Page 10: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/10.jpg)
How to use Deep Learning?
![Page 11: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/11.jpg)
Method 1: Tandem
โฆโฆ โฆโฆ
๐ฅ๐
Size of output layer= No. of states DNN
โฆโฆ
โฆโฆ
Last hidden layer or bottleneck layer are also possible.
New acoustic feature for HMM
๐(๐|๐ฅ๐) ๐(๐|๐ฅ๐) ๐(๐|๐ฅ๐)
State classifier
![Page 12: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/12.jpg)
Method 2: DNN-HMM Hybrid
DNN
๐ ๐|๐ฅโฆโฆ
๐ฅ
๐ ๐ฅ|๐
๐ ๐ฅ|๐ =๐ ๐ฅ, ๐
๐ ๐=๐ ๐|๐ฅ ๐ ๐ฅ
๐ ๐Count from training data
CNN, LSTM โฆ
DNN output
![Page 13: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/13.jpg)
How to train a state classifier?
Train HMM-GMM model
state sequence: { a , b , c }
Acoustic features:
Utterance + Label(without alignment)
Utterance + Label(aligned)
state sequence:
Acoustic features:
a a a b b c c
![Page 14: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/14.jpg)
How to train a state classifier?
Train HMM-GMM model
Utterance + Label(without alignment)
Utterance + Label(aligned)
DNN1
![Page 15: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/15.jpg)
How to train a state classifier?
Utterance + Label(without alignment)
Utterance + Label(aligned)
DNN2DNN1realignment
![Page 16: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/16.jpg)
Human Parity!
โข ๅพฎ่ป่ช้ณ่พจ่ญๆ่ก็ช็ ด้ๅคง้็จ็ข๏ผๅฐ่ฉฑ่พจ่ญ่ฝๅ้ไบบ้กๆฐดๆบ๏ผ(2016.10)
โข https://www.bnext.com.tw/article/41414/bn-2016-10-19-020437-216
โข IBM vs Microsoft: 'Human parity' speech recognition record changes hands again (2017.03)
โข http://www.zdnet.com/article/ibm-vs-microsoft-human-parity-speech-recognition-record-changes-hands-again/
Machine 5.9% v.s. Human 5.9%
Machine 5.5% v.s. Human 5.1%
[Yu, et al., INTERSPEECHโ16]
[Saon, et al., INTERSPEECHโ17]
![Page 17: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/17.jpg)
Very Deep
[Yu, et al., INTERSPEECHโ16]
![Page 18: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/18.jpg)
Back to End-to-end
![Page 19: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/19.jpg)
LAS
๐ ๐|X =?
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐
๐ฅ4๐ฅ3๐ฅ2๐ฅ1
โข LAS directly computes ๐ ๐|X
๐ ๐
๐ ๐|X = ๐ ๐|๐ ๐ ๐|๐, ๐ โฆ๐ง0
๐0
๐ง1
Size V
๐1
๐ง1 ๐ง2
a
๐ ๐ ๐ ๐
๐2
๐ง3
b
๐ ๐ธ๐๐
Beam Search
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Decoding:
Training:
![Page 20: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/20.jpg)
CTC, RNN-T
๐ ๐|X =?
๐ฅ4๐ฅ3๐ฅ2๐ฅ1
โข LAS directly computes ๐ ๐|X
๐ ๐
๐ ๐|X = ๐ ๐|๐ ๐ ๐|๐, ๐ โฆ
โข CTC and RNN-T need alignment
Encoder
โ4โ3โ2โ1
โ = ๐ ๐ ๐ ๐๐ โ|๐
๐ ๐ ๐ ๐๐ ๐ ๐ ๐
P Y|๐ =
โโ๐๐๐๐๐ ๐
๐ โ|๐
โ ๐ ๐
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐Beam Search
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Decoding:
Training:
![Page 21: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/21.jpg)
HMM, CTC, RNN-T
P๐ ๐|๐ =
โโ๐๐๐๐๐ ๐
๐ ๐|โ
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐
P๐ Y|๐ =
โโ๐๐๐๐๐ ๐
๐ โ|๐
HMM CTC, RNN-T
๐P๐ ๐|๐
๐๐=?
1. Enumerate all the possible alignments
2. How to sum over all the alignments
3. Training:
4. Testing (Inference, decoding):
![Page 22: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/22.jpg)
HMM, CTC, RNN-T
๐ ๐|๐ =
โโ๐๐๐๐๐ ๐
๐ ๐|โ
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐
๐ Y|๐ =
โโ๐๐๐๐๐ ๐
๐ โ|๐
HMM CTC, RNN-T
๐P๐ ๐|๐
๐๐=?
1. Enumerate all the possible alignments
2. How to sum over all the alignments
3. Training:
4. Testing (Inference, decoding):
![Page 23: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/23.jpg)
All the alignments
HMM
CTC
RNN-T
LAS
ไฝ ๅๅจๅฟไป้บผโบ
c a t c c a a a t c a a a a t
c ๐ a a t t
add ๐
duplicate to length ๐
๐ c a ๐ t ๐
add ๐ x ๐
c ๐ ๐ ๐ a ๐ ๐ t ๐
c ๐ ๐ a ๐ ๐ t ๐ ๐
๐ = 6
SpeechRecognition
speech text
c a t
๐ = 3
โฆ
duplicatec a t
to length ๐
โฆ
c a t โฆโฆ
๐ โ โ ๐๐๐๐๐(๐)
![Page 24: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/24.jpg)
HMM c a t c c a a a t c a a a a tduplicate to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
nexttoken
duplicate
For n = 1 to ๐
output the n-th token ๐ก๐ times
constraint: ๐ก1 + ๐ก2 +โฏ๐ก๐ = ๐, ๐ก๐ > 0Trellis Graph
c c c
a
a
t
aa
t
t t
![Page 25: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/25.jpg)
HMM c a t c c a a a t c a a a a tduplicate to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
nexttoken
duplicate
For n = 1 to ๐
output the n-th token ๐ก๐ times
constraint: ๐ก1 + ๐ก2 +โฏ๐ก๐ = ๐, ๐ก๐ > 0Trellis Graph
c c c c cc
![Page 26: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/26.jpg)
For n = 1 to ๐
output the n-th token ๐ก๐ times
constraint:
output โ๐โ ๐๐ times
๐ก๐ > 0 ๐๐ โฅ 0
output โ๐โ ๐0 times
๐ก1 + ๐ก2 +โฏ๐ก๐ +
๐0 + ๐1 +โฏ๐๐ = ๐
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
![Page 27: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/27.jpg)
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐
c
๐
a
๐
t
๐
next token
duplicate
insert ๐
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
(๐ can be skipped)
c
![Page 28: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/28.jpg)
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐
c
๐
a
๐
t
๐
duplicate ๐
next token
cannot skip any token
๐
![Page 29: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/29.jpg)
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐
c
๐
a
๐
t
๐
next token
duplicate
insert ๐
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
duplicate
next token
![Page 30: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/30.jpg)
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐
c
๐
a
๐
t
๐
cc
a
t
๐
a
๐
t
๐๐
c
๐
![Page 31: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/31.jpg)
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐
c
๐
a
๐
t
๐
c
a
t
๐๐
c c ca
t
c
๐
![Page 32: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/32.jpg)
CTC c ๐ a a t t
add ๐
๐ c a ๐ t ๐duplicate
c a t
to length ๐
โฆ
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
๐
s
๐
e
๐
e
๐
next token
duplicate
insert ๐
โฆ ee โฆ โ e
Exception: when the next token is the same token
c
๐
![Page 33: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/33.jpg)
RNN-Tadd ๐ x ๐
c ๐ ๐ ๐ a ๐ ๐ t ๐
c ๐ ๐ a ๐ ๐ t ๐ ๐c a t โฆโฆ
For n = 1 to ๐
output the n-th token 1 times
constraint:
output โ๐โ ๐๐ times
output โ๐โ ๐0 times
๐0 + ๐1 +โฏ๐๐ = ๐
c a t
Put some ๐ (option)
Put some ๐ (option)
Put some ๐ (option)
Put some ๐
at least once
๐๐ > 0
๐๐ โฅ 0 for n = 1 to ๐ โ 1
![Page 34: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/34.jpg)
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
RNN-Tadd ๐ x ๐
c ๐ ๐ ๐ a ๐ ๐ t ๐
c ๐ ๐ a ๐ ๐ t ๐ ๐c a t โฆโฆ
output token
Insert ๐
๐
๐
c
![Page 35: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/35.jpg)
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
output token
Insert ๐
RNN-Tadd ๐ x ๐
c ๐ ๐ ๐ a ๐ ๐ t ๐
c ๐ ๐ a ๐ ๐ t ๐ ๐c a t โฆโฆ
๐c
๐ ๐๐
๐ก๐
๐ก๐ ๐
c
a
๐ ๐ ๐ ๐ ๐ ๐
![Page 36: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/36.jpg)
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
output token
Insert ๐
RNN-Tadd ๐ x ๐
c ๐ ๐ ๐ a ๐ ๐ t ๐
c ๐ ๐ a ๐ ๐ t ๐ ๐c a t โฆโฆ
๐
๐ ๐ ๐ ๐ ๐
๐ก
c
a
๐
![Page 37: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/37.jpg)
c a tStart End
c a tStart End
๐ ๐ ๐ ๐
c a tStart End
๐ ๐ ๐ ๐
HMM
CTC
RNN-T
not gen not gen not gen
![Page 38: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/38.jpg)
1. Enumerate all the possible alignments
HMM, CTC, RNN-T
๐ ๐|Y =
โโ๐๐๐๐๐ ๐
๐ ๐|โ
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐
๐ Y|๐ =
โโ๐๐๐๐๐ ๐
๐ โ|๐
HMM CTC, RNN-T
๐P๐ ๐|๐
๐๐=?
2. How to sum over all the alignments
3. Training:
4. Testing (Inference, decoding):
![Page 39: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/39.jpg)
This part is challenging.
![Page 40: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/40.jpg)
Score Computation
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
output token
Insert ๐
๐c
๐ ๐๐
๐๐ก๐ ๐
โ = ๐ c ๐ ๐ a ๐ t ๐ ๐
๐ โ|๐
= ๐ ๐|๐
ร ๐ ๐|๐, ๐
ร ๐ ๐|๐, ๐๐ โฆโฆ
![Page 41: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/41.jpg)
๐
โ1 โ2 โ2 โ3 โ4 โ4 โ5 โ5 โ6
๐1,0
๐
๐2,0
๐
๐2,1
๐
๐3,1
๐
๐4,1
๐
๐4,2
๐ก
๐5,2
๐
๐5,3
๐
๐6,3
c a t<BOS>
๐0 ๐1 ๐2 ๐3
๐0 ๐0 ๐1 ๐1 ๐1 ๐2 ๐2 ๐3๐3
โ = ๐ c ๐ ๐ a ๐ t ๐ ๐
![Page 42: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/42.jpg)
Score Computation
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐1,0 ๐
๐2,0 ๐
๐2,1 ๐ ๐3,1 ๐๐4,1 ๐
๐4,2 ๐๐5,2 ๐ก
๐5,3 ๐ ๐6,3 ๐
โ = ๐ c ๐ ๐ a ๐ t ๐ ๐
๐ โ|๐
![Page 43: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/43.jpg)
Score Computation
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
c<B
OS>
a
๐ 0๐ 1
๐ 2
Because ๐ is not considered!
๐4,2 ๐
๐4,2 ๐๐4,2
![Page 44: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/44.jpg)
๐
โ1 โ2 โ2 โ3 โ4 โ4 โ5 โ5 โ6
๐1,0
๐
๐2,0
๐
๐2,1
๐
๐3,1
๐
๐4,1
๐
๐4,2
๐ก
๐5,2
๐
๐5,3
๐
๐6,3
c a t<BOS>
๐0 ๐1 ๐2 ๐3
๐0 ๐0 ๐1 ๐1 ๐1 ๐2 ๐2 ๐3๐3
![Page 45: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/45.jpg)
๐
โ4 โ5 โ5 โ6
๐ ๐ ๐ ๐ ๐
๐4,2
๐ก
๐5,2
๐
๐5,3
๐
๐6,3
c a t<BOS>
๐0 ๐1 ๐2 ๐3
๐2 ๐2 ๐3๐3๐ ๐๐ ๐ ๐
๐๐ ๐ ๐๐
![Page 46: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/46.jpg)
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐ผ4,2
๐ผ4,2 = ๐ผ4,1๐4,1 ๐ + ๐ผ3,2๐3,2 ๐
๐ผ4,1
๐ผ3,2
generate โaโ
read ๐ฅ4
generate โ๐ โ,
๐ผ๐,๐:the summation of the scores of all the alignmentsthat read i-th acoustic features and output j-th tokens
๐ผ๐,๐
โโ๐๐๐๐๐ ๐
๐ โ|๐
![Page 47: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/47.jpg)
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐ผ4,2 = ๐ผ4,1๐4,1 ๐ + ๐ผ3,2๐3,2 ๐
๐ผ๐,๐:the summation of the scores of all the alignmentsthat read i-th acoustic features and output j-th tokens
You can compute summation of the scores of all the alignments.
![Page 48: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/48.jpg)
1. Enumerate all the possible alignments
HMM, CTC, RNN-T
P๐ ๐|Y =
โโ๐๐๐๐๐ ๐
๐ ๐|โ
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐
P๐ Y|๐ =
โโ๐๐๐๐๐ ๐
๐ โ|๐
HMM CTC, RNN-T
๐P๐ ๐|๐
๐๐=?
2. How to sum over all the alignments
3. Training:
4. Testing (Inference, decoding):
![Page 49: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/49.jpg)
Training
๐ c ๐ ๐ a ๐ t ๐ ๐
๐1,0 ๐ ๐2,0 ๐ ๐2,1 ๐ ๐3,1 ๐ ๐4,1 ๐ ๐4,2 ๐ ๐5,2 ๐ก ๐5,3 ๐ ๐6,3 ๐
๐โ = ๐๐๐max๐
๐๐๐๐ ๐|๐
๐ ๐|๐ =
โ
๐ โ|๐
๐๐ ๐|๐
๐๐=?
![Page 50: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/50.jpg)
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐4,1 ๐
๐3,2 ๐
๐1,0 ๐ ๐2,0 ๐ ๐2,1 ๐ ๐3,1 ๐ ๐4,1 ๐ ๐4,2 ๐ ๐5,2 ๐ก ๐5,3 ๐ ๐6,3 ๐
๐ ๐|๐ =
โ
๐ โ|๐
Each arrow is a component in ๐ ๐|๐ =
โ
๐ โ|๐
![Page 51: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/51.jpg)
Training
๐ c ๐ ๐ a ๐ t ๐ ๐
๐1,0 ๐ ๐2,0 ๐ ๐2,1 ๐ ๐3,1 ๐ ๐4,1 ๐ ๐4,2 ๐ ๐5,2 ๐ก ๐5,3 ๐ ๐6,3 ๐
๐๐ ๐|๐
๐๐=?
๐โ = ๐๐๐max๐
๐๐๐๐ ๐|๐
๐ ๐|๐ =
โ
๐ โ|๐
๐
๐4,1 ๐
๐3,2 ๐
โฆโฆ๐ ๐|๐
๐๐4,1 ๐
๐๐
๐๐ ๐|๐
๐๐4,1 ๐+๐๐3,2 ๐
๐๐
๐๐ ๐|๐
๐๐3,2 ๐+โฏ
![Page 52: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/52.jpg)
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐
๐4,1 ๐
๐3,2 ๐
โฆโฆ๐ ๐|๐
๐๐ ๐|๐
๐๐=?
Each arrow is a component
๐๐4,1 ๐
๐๐
๐๐ ๐|๐
๐๐4,1 ๐+๐๐3,2 ๐
๐๐
๐๐ ๐|๐
๐๐3,2 ๐+โฏ
๐4,1 ๐
๐3,2 ๐
![Page 53: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/53.jpg)
๐1,0 ๐
โ1 โ2 โ2 โ3 โ4
๐1,0
๐2,0 ๐
๐2,0
๐2,1 ๐
๐2,1
๐3,1 ๐
๐3,1
๐4,1 ๐
๐4,1
c<BOS>
๐0 ๐1
๐0 ๐0 ๐1 ๐1 ๐1
๐๐4,1 ๐
๐๐=?
Backpropagation (through time)
To encoder
![Page 54: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/54.jpg)
๐ ๐|๐ =
โ ๐ค๐๐กโ ๐4,1 ๐
๐ โ|๐ +
โ ๐ค๐๐กโ๐๐ข๐ก ๐4,1 ๐
๐ โ|๐
=
โ ๐ค๐๐กโ ๐4,1 ๐
๐ โ|๐
๐4,1 ๐
๐๐ ๐|๐
๐๐=?
๐๐4,1 ๐
๐๐
๐๐ ๐|๐
๐๐4,1 ๐+๐๐3,2 ๐
๐๐
๐๐ ๐|๐
๐๐3,2 ๐+โฏ
๐4,1 ๐ ร ๐๐กโ๐๐
๐๐ ๐|๐
๐๐4,1 ๐
=1
๐4,1 ๐
โ ๐ค๐๐กโ ๐4,1 ๐
๐ โ|๐
=
โ ๐ค๐๐กโ ๐4,1 ๐
๐๐กโ๐๐
![Page 55: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/55.jpg)
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐ฝ4,2
๐ฝ4,2 = ๐ฝ4,3๐4,2 ๐ก + ๐ฝ5,2๐4,2 ๐
generate โtโ
read ๐ฅ5
generate โ๐โ
๐ฝ๐,๐:the summation of the score of all the alignmentsstaring from i-th acoustic features and j-th tokens
๐ฝ5,2
๐ฝ4,3
![Page 56: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/56.jpg)
๐ฅ1 ๐ฅ2 ๐ฅ3 ๐ฅ4 ๐ฅ5 ๐ฅ6
c
a
t
๐ฝ4,2
๐ผ4,1
๐4,1 ๐
๐๐ ๐|๐
๐๐4,1 ๐=
1
๐4,1 ๐
๐ ๐ค๐๐กโ ๐4,1 ๐
๐ โ|๐ ๐ผ4,1 ๐4,1 ๐ ๐ฝ4,2
๐๐ ๐|๐
๐๐=?
๐๐4,1 ๐
๐๐
๐๐ ๐|๐
๐๐4,1 ๐+๐๐3,2 ๐
๐๐
๐๐ ๐|๐
๐๐3,2 ๐+โฏ๐ผ4,1๐ฝ4,2
![Page 57: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/57.jpg)
1. Enumerate all the possible alignments
HMM, CTC, RNN-T
P๐ ๐|Y =
โโ๐๐๐๐๐ ๐
๐ ๐|โ
๐โ = ๐๐๐max๐
๐๐๐P๐ ๐|๐
Yโ = ๐๐๐maxY
๐๐๐๐ Y|๐
P๐ Y|๐ =
โโ๐๐๐๐๐ ๐
๐ โ|๐
HMM CTC, RNN-T
๐P๐ ๐|๐
๐๐=?
2. How to sum over all the alignments
3. Training:
4. Testing (Inference, decoding):
![Page 58: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/58.jpg)
Yโ = ๐๐๐maxY
๐๐๐P Y|๐
= ๐๐๐maxY
๐๐๐
โโ๐๐๐๐๐ ๐
๐ โ|๐
โ ๐๐๐maxY
maxโโ๐๐๐๐๐ ๐
๐๐๐๐ โ|๐
maxโโ๐๐๐๐๐ ๐
๐ โ|๐็ๆณ
โโ = ๐๐๐ maxโ
๐๐๐๐ โ|๐
Yโ = ๐๐๐๐๐โ1 โโ
Decoding
โ = ๐ c ๐ ๐ a ๐ t ๐ ๐ โฆ
๐ โ|๐ = ๐ โ1|๐ ๐ โ2|๐, โ1 ๐ โ3|๐, โ1, โ2 โฆ
โ1 โ2 โ3
็พๅฏฆ
![Page 59: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/59.jpg)
โโ = ๐๐๐ maxโ
๐๐๐๐ โ|๐
๐
โ1 โ2 โ2 โ3 โ4 โ4 โ5 โ5 โ6
๐1,0
๐
๐2,0
๐
๐2,1
๐
๐3,1
๐
๐4,1
๐
๐4,2
๐ก
๐5,2
๐
๐5,3
๐
๐6,3
c a t<BOS>
๐0 ๐1 ๐2 ๐3
๐0 ๐0 ๐1 ๐1 ๐1 ๐2 ๐2 ๐3๐3
max
โโ
Beam Search for better
approximation
![Page 60: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/60.jpg)
Summary
LAS CTC RNN-T
Decoder dependent independent dependent
Alignment not explicit (soft alignment)
Yes Yes
Training just train it sum over alignment
sum over alignment
On-line No Yes Yes
![Page 61: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdfย ยท Recognition speech text X Y Y โ= ๐๐max Y Y| = ๐๐max Y |Y Y = ๐๐max Y |Y Y |Y: Acoustic](https://reader035.fdocuments.net/reader035/viewer/2022070912/5fb3fa9a1dca4d353a387767/html5/thumbnails/61.jpg)
Reference
โข [Yu, et al., INTERSPEECHโ16] Dong Yu, Wayne Xiong, Jasha Droppo, Andreas Stolcke , Guoli Ye, Jinyu Li , Geoffrey Zweig, Deep Convolutional Neural Networks with Layer-wise Context Expansion and Attention, INTERSPEECH, 2016
โข [Saon, et al., INTERSPEECHโ17] George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, BhuvanaRamabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall, English Conversational Telephone Speech Recognition by Humans and Machines, INTERSPEECH, 2017
โข [Povey, et al., ICASSPโ10] Daniel Povey, Lukas Burget, Mohit Agarwal, Pinar Akyazi, Kai Feng, Arnab Ghoshal, Ondrej Glembek, Nagendra Kumar Goel, Martin Karafiat, Ariya Rastrow, Richard C. Rose, Petr Schwarz, Samuel Thomas, Subspace Gaussian Mixture Models for speech recognition, ICASSP, 2010
โข [Mohamed , et al., ICASSPโ10] Abdel-rahman Mohamed and Geoffrey Hinton, Phone recognition using Restricted Boltzmann Machines, ICASSP, 2010