Deep learning for (more than) speech recognition
IndabaX Western Cape, UCT, Apr. 2018
Herman Kamper
E&E Engineering, Stellenbosch University
http://www.kamperh.com/
Success in automatic speech recognition (ASR)
[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]
Talk outline
1. State-of-the-art automatic speech recognition (ASR)
2. Examples of non-ASR speech processing (the first rant)
3. Examples of local work (a second rant)
State-of-the-art speech recognition
Supervised speech recognition
i had to think of some example speech
since speech recognition is really cool
Feature extraction for speech processing
[Figure: the waveform is cut into overlapping 25 ms frames with a 10 ms shift; each frame is mapped to a D-dimensional feature vector x^(t) = (x_1, x_2, ..., x_D), giving the sequence X = (x^(1), x^(2), x^(3), x^(4)).]
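The framing scheme above can be sketched in a few lines. This is a minimal illustration (my own NumPy code, not from the talk); the 16 kHz sample rate is an assumption, chosen to match the 0–8000 Hz spectrograms shown later.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Return a (num_frames, frame_len) array of overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    num_frames = 1 + (len(signal) - frame_len) // shift
    return np.stack([
        signal[i * shift : i * shift + frame_len] for i in range(num_frames)
    ])

# One second of audio at 16 kHz gives 98 overlapping frames:
x = np.zeros(16000)
frames = frame_signal(x)
print(frames.shape)  # (98, 400)
```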
Feature extraction for speech processing
[Figure: two spectrograms of the example utterance, frequency (0–8000 Hz) against time (0–1 s).]
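A spectrogram like those in the figure can be computed from the overlapping frames of the previous slide. A minimal sketch (my own code, not from the talk), assuming 25 ms frames of 16 kHz audio; the random frames stand in for real speech:

```python
import numpy as np

# Stand-in for real 25 ms frames at 16 kHz (400 samples each):
rng = np.random.default_rng(0)
frames = rng.normal(size=(98, 400))

def spectrogram(frames):
    """Log-magnitude spectrum of each Hamming-windowed frame."""
    window = np.hamming(frames.shape[1])
    spec = np.abs(np.fft.rfft(frames * window, axis=1))
    return np.log(spec + 1e-10)

S = spectrogram(frames)
print(S.shape)  # (98, 201): 98 frames, frequency bins from 0 Hz up to 8000 Hz
```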
Name these networks
Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Name these networks
[Figure: a feedforward network mapping an input x^(i) to an output y^(i).]
Name these networks
Image: http://deeplearning.net/tutorial/lenet.html
Name these networks
p(x^(1), x^(2), ..., x^(N) | [ih])
Hidden Markov models (HMMs)

W* = argmax_W P(W = w^(1), w^(2), ..., w^(M) | X = x^(1), x^(2), ..., x^(N))
   = argmax_W P(W | X)
   = argmax_W Σ_U P(W, U | X)                [ "without" = /w ih th aw t/ ]
   = argmax_W Σ_U p(W, U, X) / p(X)
   = argmax_W Σ_U p(X | W, U) P(U | W) P(W)
   ≈ argmax_W max_U p(X | W, U) P(U | W) P(W)
   ≈ argmax_W max_U p(X | U) P(U | W) P(W)

p(X | U): acoustic model; P(U | W): pronunciation dictionary; P(W): language model
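The final approximation above can be illustrated with a toy example (my own, not from the talk): pick the word sequence W maximising log p(X|U) + log P(U|W) + log P(W), folding in the max over pronunciations U. All candidates and scores below are made up for illustration.

```python
import math

# Hypothetical scores for two candidate transcriptions of the same audio:
candidates = {
    "without a doubt": {
        # pronunciation U -> (log p(X|U) acoustic, log P(U|W) dictionary)
        "w ih th aw t | ah | d aw t": (-120.3, math.log(1.0)),
    },
    "with out a doubt": {
        "w ih th | aw t | ah | d aw t": (-125.9, math.log(1.0)),
    },
}
# Hypothetical language model scores log P(W):
log_lm = {"without a doubt": math.log(0.02), "with out a doubt": math.log(0.001)}

def score(words):
    # max over pronunciations U of log p(X|U) + log P(U|W), plus log P(W)
    best_u = max(ac + pron for ac, pron in candidates[words].values())
    return best_u + log_lm[words]

best = max(candidates, key=score)
print(best)  # without a doubt
```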
Hidden Markov models (HMMs)

[Figure: an HMM modelling p(X | [ih]), generating the sequence X = (x^(1), ..., x^(6)) of D-dimensional feature vectors.]

Speech recognition was performed by combining an acoustic model (thousands of HMM states) with a pronunciation dictionary and a language model in a (very big) decoder network (a finite-state machine).
![Page 30: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/30.jpg)
Hidden Markov models (HMMs)p(X|[ih])
x1
x2
...xD
x(1)
x1
x2
...xD
x(2)
x1
x2
...xD
x(3)
x1
x2
...xD
x(4)
X
x1
x2
...xD
x(5)
x1
x2
...xD
x(6)
Speech recognition was performed by combining acoustic model(thousands of HMM states) with pronunciation dictionary and languagemodel in (very big) decoder network (finite state machine).
13 / 40
![Page 31: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/31.jpg)
Back to today: End-to-end speech recognition
[Chan et al., arXiv’15]
Why did we talk about HMMs?
• Could we use a standard feedforward deep neural network (DNN) for ASR?
• Idea: Use the HMM to obtain frame alignments for the DNN!
• Hybrid model: DNN-HMM
• Can be seen as representation learning trained jointly with a classifier
[Figure: a DNN mapping an input frame x^(i) to an output y^(i) over HMM states s1, s2, ..., s9000.]
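The hybrid idea can be sketched as a feedforward network that maps one acoustic frame x^(i) to a distribution over HMM states s1..sK. This is my own minimal illustration, not the talk's implementation; the 39-dimensional input, 512 hidden units, and K = 9000 states (echoing the slide) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 39, 512, 9000  # input dim, hidden units, number of HMM states

# Randomly initialised weights (training on HMM frame alignments omitted):
W1, b1 = rng.normal(0, 0.01, (H, D)), np.zeros(H)
W2, b2 = rng.normal(0, 0.01, (K, H)), np.zeros(K)

def dnn_state_posteriors(x):
    """One frame in, a softmax distribution over the K HMM states out."""
    h = np.maximum(0, W1 @ x + b1)   # ReLU hidden layer
    logits = W2 @ h + b2
    p = np.exp(logits - logits.max())
    return p / p.sum()

x = rng.normal(size=D)               # one 39-dimensional feature frame
p = dnn_state_posteriors(x)
print(p.shape, round(p.sum(), 6))    # (9000,) 1.0
```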
What about convolutional neural networks?
[Figure: two spectrograms of the example utterance, frequency (0–8000 Hz) against time (0–1 s).]
Is end-to-end the best?
• End-to-end models are easier to implement [1]
• But, do they give state-of-the-art performance?
• What do you think CLDNN-HMM [2] stands for?

[1] https://github.com/espnet/espnet   [2] [Sainath et al., ICASSP'15]
Summary: Speech recognition is important, but. . .
• A very important engineering endeavour: information access, illiteracy, assistance for the disabled
• But it is more: speech and language make us human
• Engineering decisions can tell us something about how we perceive the world: we saw how structure helps in speech recognition models
• And studies of how we perceive the world can tell us something about better engineering decisions
Rant 1: Do we always need/have ASR?
Examples of non-ASR speech processing
What if we do not have supervision?
• Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
• Data: 2000 hours transcribed speech audio; ∼350M/560M words text
• Can we do this for all 7000 languages spoken in the world?
• Many of these languages are endangered and unwritten
Example 1: Query-by-example search

Spoken query:

A useful speech system, not requiring any transcribed speech

[Jansen and Van Durme, Interspeech'12]
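One way such a system can work, sketched in my own simplified code (not the cited paper's method): slide the spoken query over a search utterance and score each window with dynamic time warping (DTW) over feature frames, with no transcriptions needed. The 13-dimensional frames and window placements are illustrative.

```python
import numpy as np

def dtw_cost(query, segment):
    """Normalised DTW alignment cost between two (frames, dims) arrays."""
    n, m = len(query), len(segment)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(query[i - 1] - segment[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

rng = np.random.default_rng(1)
query = rng.normal(size=(20, 13))          # 20 frames of 13-dim features
utterance = np.concatenate(
    [rng.normal(size=(30, 13)), query, rng.normal(size=(30, 13))]
)  # the query occurs at frames 30-49

# Score 20-frame windows every 5 frames; the true match aligns best:
costs = [dtw_cost(query, utterance[s:s + 20]) for s in range(0, 61, 5)]
print(int(np.argmin(costs)) * 5)  # 30
```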
Example 2: Linguistic and cultural documentation
http://www.stevenbird.net/
![Page 60: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/60.jpg)
Example 2: Linguistic and cultural documentation
http://www.stevenbird.net/26 / 40
Example 3: Teaching robots to understand speech

[Janssens and Renkens, 2014]; [Renkens et al., SLT’14]
Rant 2: Taking inspiration from humans
Examples of local work
Supervised speech recognition

i had to think of some example speech
since speech recognition is really cool

Can we acquire language from audio alone?
Full-coverage segmentation and clustering
Unsupervised segmental Bayesian model

[Diagram] Speech waveform → acoustic frames y1:M (via fa(·)) → embeddings xi = fe(yt1:t2) (via fe(·)) → Bayesian Gaussian mixture model p(xi | h−), with acoustic modelling and word segmentation as the two interacting components.
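The acoustic-modelling component above clusters embeddings with a Bayesian GMM, where each xi is sampled conditioned on all the other embeddings and assignments h−. A toy sketch with strong simplifying assumptions (known spherical covariance, symmetric Dirichlet prior, zero-mean Gaussian prior on component means; synthetic data, not the model of the talk):

```python
# Toy Bayesian GMM via Gibbs sampling: resample each embedding's cluster
# assignment from p(zi | xi, h-), where h- is everything except point i.
# Simplifications: fixed spherical variance, conjugate Gaussian mean prior.

import numpy as np

rng = np.random.default_rng(1)

# Synthetic "embeddings": two well-separated word types, 30 tokens each.
X = np.vstack([rng.normal(-3, 0.3, size=(30, 2)),
               rng.normal(+3, 0.3, size=(30, 2))])
N, D = X.shape
K, alpha, sigma2, prior_var = 4, 1.0, 0.3 ** 2, 10.0

z = rng.integers(0, K, size=N)             # initial assignments
for _ in range(50):                         # Gibbs sweeps
    for i in range(N):
        z[i] = -1                           # remove xi: the rest is h-
        log_p = np.empty(K)
        for k in range(K):
            members = X[z == k]
            n_k = len(members)
            # Posterior predictive of component k (conjugate, spherical).
            post_var = 1.0 / (n_k / sigma2 + 1.0 / prior_var)
            post_mean = post_var * members.sum(axis=0) / sigma2
            pred_var = post_var + sigma2
            log_p[k] = (np.log(n_k + alpha / K)
                        - 0.5 * np.sum((X[i] - post_mean) ** 2) / pred_var
                        - 0.5 * D * np.log(pred_var))
        p = np.exp(log_p - log_p.max())
        z[i] = rng.choice(K, p=p / p.sum())

print("clusters used:", len(set(z.tolist())))
```

After burn-in the two synthetic word types end up in disjoint clusters, which is the discovery behaviour the slide's p(xi | h−) term provides.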
Listen to discovered clusters

• Small-vocabulary cluster 45 [audio]

• Large-vocabulary English cluster 1214 [audio]

• Large-vocabulary Xitsonga cluster 629 [audio]
Arrival
Using images for grounding language

Consider images paired with unlabelled spoken captions: [audio]
Map images and speech into common space

[Diagram] A vision network (VGG) maps the image to yvis, and a speech network (conv → max → feedfwd over input X) maps the spoken caption to yspch; the two embeddings are compared with a distance d(yvis, yspch).

[Harwath et al., NIPS’16]
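Such a common space is typically trained so that matched image/speech pairs end up closer than mismatched pairs. A hedged numpy sketch of that training signal (a margin-based ranking loss; the batch construction and dimensions are illustrative, not taken from the paper):

```python
# Margin ranking loss over a batch of matched (image, speech) embedding
# pairs: matched pairs sit on the diagonal of the pairwise-distance
# matrix, every off-diagonal entry is a mismatched "impostor" pair.

import numpy as np

def ranking_loss(y_vis, y_spch, margin=1.0):
    """Hinge ranking loss for a batch of matched (image, speech) pairs."""
    # d[i, j] = distance between image i and speech caption j.
    d = np.linalg.norm(y_vis[:, None, :] - y_spch[None, :, :], axis=2)
    pos = np.diag(d)                        # matched-pair distances
    loss = np.maximum(0.0, margin + pos[:, None] - d)
    np.fill_diagonal(loss, 0.0)             # don't penalise matched pairs
    return loss.mean()

rng = np.random.default_rng(0)
aligned = rng.normal(size=(4, 8))
good = ranking_loss(aligned, aligned)       # perfectly aligned spaces
bad = ranking_loss(aligned, rng.normal(size=(4, 8)))
print(good, bad)
```

Aligned embeddings incur (near-)zero loss, while a mismatched speech space is penalised, so minimising this loss pulls the two modalities into one space.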
Visually grounded keyword spotting

Keyword  Example of matched utterance                         Type
beach    a boy in a yellow shirt is walking on a beach . . .  correct
behind   a surfer does a flip on a wave                       mistake
bike     a dirt biker flies through the air                   variant
boys     two children play soccer in the park                 semantic
large    . . . a rocky cliff overlooking a body of water      semantic
play     children playing in a ball pit                       variant
sitting  two people are seated at a table with drinks         semantic
yellow   a tan dog jumping over a red and blue toy            mistake
young    a little girl on a kid swing                         semantic

[Kamper et al., Interspeech’17]
Summary and conclusion
What did we chat about today?
• Supervised speech recognition: From HMMs all the way to CLDNNs
• Structure is still important in speech recognition
• Saw three examples of models that do not require ASR
• Looked at local work taking inspiration from humans
What’s next (specifically for us)?

• Still many, many unsolved core machine learning problems in unsupervised and low-resource speech processing

• Building speech search systems for (South) African languages

• Can some of these approaches be used in other machine learning domains? E.g. can vision tell us something about speech?

• What can we learn about language acquisition in humans?

• Language acquisition in robots

• Main take-away: look at machine learning problems from different perspectives and angles
http://www.kamperh.com/
https://github.com/kamperh
Backup slides
Acoustic word embeddings (AWE)

[Diagram] Variable-length segments Y1 and Y2 are mapped by fe(·) to fixed-dimensional acoustic word embeddings x1 = fe(Y1) and x2 = fe(Y2), with x ∈ RD.

[Levin et al., ASRU’13]
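A minimal sketch of one classic embedding function fe: plain downsampling, where any variable-length frame sequence is reduced to a fixed number of evenly spaced frames and flattened (used as a simple baseline in this line of work; the frame counts and dimensions below are illustrative):

```python
# Downsampling acoustic word embedding: keep k evenly spaced frames of a
# (frames, dims) segment and flatten, so every segment maps to R^(k*dims)
# regardless of its duration.

import numpy as np

def downsample_embed(Y, k=10):
    """Map a (frames, dims) segment to a fixed (k * dims,) embedding."""
    idx = np.linspace(0, len(Y) - 1, k).round().astype(int)
    return Y[idx].reshape(-1)

rng = np.random.default_rng(0)
Y1 = rng.normal(size=(37, 13))   # two segments of different duration
Y2 = rng.normal(size=(81, 13))
x1, x2 = downsample_embed(Y1), downsample_embed(Y2)
print(x1.shape, x2.shape)  # → (130,) (130,)
```

The point is only that both embeddings have the same dimensionality, so segments of different duration become directly comparable with an ordinary vector distance.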
Word similarity Siamese CNN

Use the idea of Siamese networks [Bromley et al., PatRec’93]: the same network f embeds two segments as x1 = f(Y1) and x2 = f(Y2), which are compared with a distance loss l(x1, x2).

[Kamper et al., ICASSP’15]
Retrieval in common (semantic) space

Both yvis and yspch are points y ∈ RD in the same D-dimensional space, so matching images and speech becomes a nearest-neighbour search.

[Harwath et al., NIPS’16]
Word prediction from images and speech

[Diagram] A vision network (VGG) assigns word probabilities to the image (e.g. hat 0.85, man 0.8, shirt 0.9), giving soft targets yvis. A speech network (conv → max → feedfwd) maps the spoken caption X to f(X), and a loss L compares f(X) to yvis. Here f(X) ∈ RW is a vector of word probabilities, i.e. a spoken bag-of-words (BoW) classifier.

[Kamper et al., Interspeech’17]
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 115: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/115.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 116: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/116.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 117: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/117.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 118: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/118.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 119: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/119.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40