Deep learning for (more than) speech recognition
IndabaX Western Cape, UCT, Apr. 2018
Herman Kamper
E&E Engineering, Stellenbosch University
http://www.kamperh.com/
Success in automatic speech recognition (ASR)
[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]
Talk outline
1. State-of-the-art automatic speech recognition (ASR)
2. Examples of non-ASR speech processing (the first rant)
3. Examples of local work (a second rant)
State-of-the-art speech recognition
Supervised speech recognition
i had to think of some example speech
since speech recognition is really cool
Feature extraction for speech processing
[Figure: the waveform is cut into overlapping 25 ms frames with a 10 ms shift; each frame is mapped to a D-dimensional feature vector x^(t) = (x_1, x_2, ..., x_D), giving the sequence X = (x^(1), x^(2), x^(3), x^(4)).]
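The framing scheme above can be sketched in a few lines. This is a minimal illustration (my own NumPy code, not from the talk); the 16 kHz sample rate is an assumption, chosen to match the 0–8000 Hz spectrograms shown later.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Return a (num_frames, frame_len) array of overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    num_frames = 1 + (len(signal) - frame_len) // shift
    return np.stack([
        signal[i * shift : i * shift + frame_len] for i in range(num_frames)
    ])

# One second of audio at 16 kHz gives 98 overlapping frames:
x = np.zeros(16000)
frames = frame_signal(x)
print(frames.shape)  # (98, 400)
```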
Feature extraction for speech processing
[Figure: two spectrograms of the example utterance, frequency (0–8000 Hz) against time (0–1 s).]
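A spectrogram like those in the figure can be computed from the overlapping frames of the previous slide. A minimal sketch (my own code, not from the talk), assuming 25 ms frames of 16 kHz audio; the random frames stand in for real speech:

```python
import numpy as np

# Stand-in for real 25 ms frames at 16 kHz (400 samples each):
rng = np.random.default_rng(0)
frames = rng.normal(size=(98, 400))

def spectrogram(frames):
    """Log-magnitude spectrum of each Hamming-windowed frame."""
    window = np.hamming(frames.shape[1])
    spec = np.abs(np.fft.rfft(frames * window, axis=1))
    return np.log(spec + 1e-10)

S = spectrogram(frames)
print(S.shape)  # (98, 201): 98 frames, frequency bins from 0 Hz up to 8000 Hz
```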
Name these networks
Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Name these networks
[Figure: a feedforward network mapping an input x^(i) to an output y^(i).]
Name these networks
Image: http://deeplearning.net/tutorial/lenet.html
Name these networks
p(x^(1), x^(2), ..., x^(N) | [ih])
Hidden Markov models (HMMs)

W* = argmax_W P(W = w^(1), w^(2), ..., w^(M) | X = x^(1), x^(2), ..., x^(N))
   = argmax_W P(W | X)
   = argmax_W Σ_U P(W, U | X)                [ "without" = /w ih th aw t/ ]
   = argmax_W Σ_U p(W, U, X) / p(X)
   = argmax_W Σ_U p(X | W, U) P(U | W) P(W)
   ≈ argmax_W max_U p(X | W, U) P(U | W) P(W)
   ≈ argmax_W max_U p(X | U) P(U | W) P(W)

p(X | U): acoustic model; P(U | W): pronunciation dictionary; P(W): language model
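The final approximation above can be illustrated with a toy example (my own, not from the talk): pick the word sequence W maximising log p(X|U) + log P(U|W) + log P(W), folding in the max over pronunciations U. All candidates and scores below are made up for illustration.

```python
import math

# Hypothetical scores for two candidate transcriptions of the same audio:
candidates = {
    "without a doubt": {
        # pronunciation U -> (log p(X|U) acoustic, log P(U|W) dictionary)
        "w ih th aw t | ah | d aw t": (-120.3, math.log(1.0)),
    },
    "with out a doubt": {
        "w ih th | aw t | ah | d aw t": (-125.9, math.log(1.0)),
    },
}
# Hypothetical language model scores log P(W):
log_lm = {"without a doubt": math.log(0.02), "with out a doubt": math.log(0.001)}

def score(words):
    # max over pronunciations U of log p(X|U) + log P(U|W), plus log P(W)
    best_u = max(ac + pron for ac, pron in candidates[words].values())
    return best_u + log_lm[words]

best = max(candidates, key=score)
print(best)  # without a doubt
```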
Hidden Markov models (HMMs)

[Figure: an HMM modelling p(X | [ih]), generating the sequence X = (x^(1), ..., x^(6)) of D-dimensional feature vectors.]

Speech recognition was performed by combining an acoustic model (thousands of HMM states) with a pronunciation dictionary and a language model in a (very big) decoder network (a finite-state machine).
![Page 30: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/30.jpg)
Hidden Markov models (HMMs)p(X|[ih])
x1
x2
...xD
x(1)
x1
x2
...xD
x(2)
x1
x2
...xD
x(3)
x1
x2
...xD
x(4)
X
x1
x2
...xD
x(5)
x1
x2
...xD
x(6)
Speech recognition was performed by combining acoustic model(thousands of HMM states) with pronunciation dictionary and languagemodel in (very big) decoder network (finite state machine).
13 / 40
![Page 31: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/31.jpg)
Back to today: End-to-end speech recognition
[Chan et al., arXiv’15]
Why did we talk about HMMs?
• Could we use a standard feedforward deep neural network (DNN) for ASR?
• Idea: Use the HMM to obtain frame alignments for the DNN!
• Hybrid model: DNN-HMM
• Can be seen as representation learning trained jointly with a classifier
[Figure: a DNN mapping an input frame x^(i) to an output y^(i) over HMM states s1, s2, ..., s9000.]
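The hybrid idea can be sketched as a feedforward network that maps one acoustic frame x^(i) to a distribution over HMM states s1..sK. This is my own minimal illustration, not the talk's implementation; the 39-dimensional input, 512 hidden units, and K = 9000 states (echoing the slide) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 39, 512, 9000  # input dim, hidden units, number of HMM states

# Randomly initialised weights (training on HMM frame alignments omitted):
W1, b1 = rng.normal(0, 0.01, (H, D)), np.zeros(H)
W2, b2 = rng.normal(0, 0.01, (K, H)), np.zeros(K)

def dnn_state_posteriors(x):
    """One frame in, a softmax distribution over the K HMM states out."""
    h = np.maximum(0, W1 @ x + b1)   # ReLU hidden layer
    logits = W2 @ h + b2
    p = np.exp(logits - logits.max())
    return p / p.sum()

x = rng.normal(size=D)               # one 39-dimensional feature frame
p = dnn_state_posteriors(x)
print(p.shape, round(p.sum(), 6))    # (9000,) 1.0
```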
What about convolutional neural networks?
[Figure: two spectrograms of the example utterance, frequency (0–8000 Hz) against time (0–1 s).]
Is end-to-end the best?
• End-to-end models are easier to implement [1]
• But, do they give state-of-the-art performance?
• What do you think CLDNN-HMM [2] stands for?

[1] https://github.com/espnet/espnet   [2] [Sainath et al., ICASSP'15]
Summary: Speech recognition is important, but. . .
• A very important engineering endeavour: information access, illiteracy, assistance for the disabled
• But it is more: speech and language make us human
• Engineering decisions can tell us something about how we perceive the world: we saw how structure helps in speech recognition models
• And studies of how we perceive the world can tell us something about better engineering decisions
Rant 1: Do we always need/have ASR?
Examples of non-ASR speech processing
What if we do not have supervision?
• Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
• Data: 2000 hours transcribed speech audio; ∼350M/560M words text
• Can we do this for all 7000 languages spoken in the world?
• Many of these languages are endangered and unwritten
Example 1: Query-by-example search

Spoken query:

A useful speech system, not requiring any transcribed speech

[Jansen and Van Durme, Interspeech'12]
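One way such a system can work, sketched in my own simplified code (not the cited paper's method): slide the spoken query over a search utterance and score each window with dynamic time warping (DTW) over feature frames, with no transcriptions needed. The 13-dimensional frames and window placements are illustrative.

```python
import numpy as np

def dtw_cost(query, segment):
    """Normalised DTW alignment cost between two (frames, dims) arrays."""
    n, m = len(query), len(segment)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(query[i - 1] - segment[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

rng = np.random.default_rng(1)
query = rng.normal(size=(20, 13))          # 20 frames of 13-dim features
utterance = np.concatenate(
    [rng.normal(size=(30, 13)), query, rng.normal(size=(30, 13))]
)  # the query occurs at frames 30-49

# Score 20-frame windows every 5 frames; the true match aligns best:
costs = [dtw_cost(query, utterance[s:s + 20]) for s in range(0, 61, 5)]
print(int(np.argmin(costs)) * 5)  # 30
```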
Example 2: Linguistic and cultural documentation
http://www.stevenbird.net/
![Page 60: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/60.jpg)
Example 2: Linguistic and cultural documentation
http://www.stevenbird.net/26 / 40
Example 3: Teaching robots to understand speech

[Janssens and Renkens, 2014]; [Renkens et al., SLT’14]
Rant 2: Taking inspiration from humans
Examples of local work
Supervised speech recognition

i had to think of some example speech
since speech recognition is really cool

Can we acquire language from audio alone?
Full-coverage segmentation and clustering
Unsupervised segmental Bayesian model

[Diagram] Speech waveform → acoustic frames y1:M (via fa(·)) → embeddings xi = fe(yt1:t2) (via fe(·)) → Bayesian Gaussian mixture model p(xi | h−), with acoustic modelling and word segmentation as the two interacting components.
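The acoustic-modelling component above clusters embeddings with a Bayesian GMM, where each xi is sampled conditioned on all the other embeddings and assignments h−. A toy sketch with strong simplifying assumptions (known spherical covariance, symmetric Dirichlet prior, zero-mean Gaussian prior on component means; synthetic data, not the model of the talk):

```python
# Toy Bayesian GMM via Gibbs sampling: resample each embedding's cluster
# assignment from p(zi | xi, h-), where h- is everything except point i.
# Simplifications: fixed spherical variance, conjugate Gaussian mean prior.

import numpy as np

rng = np.random.default_rng(1)

# Synthetic "embeddings": two well-separated word types, 30 tokens each.
X = np.vstack([rng.normal(-3, 0.3, size=(30, 2)),
               rng.normal(+3, 0.3, size=(30, 2))])
N, D = X.shape
K, alpha, sigma2, prior_var = 4, 1.0, 0.3 ** 2, 10.0

z = rng.integers(0, K, size=N)             # initial assignments
for _ in range(50):                         # Gibbs sweeps
    for i in range(N):
        z[i] = -1                           # remove xi: the rest is h-
        log_p = np.empty(K)
        for k in range(K):
            members = X[z == k]
            n_k = len(members)
            # Posterior predictive of component k (conjugate, spherical).
            post_var = 1.0 / (n_k / sigma2 + 1.0 / prior_var)
            post_mean = post_var * members.sum(axis=0) / sigma2
            pred_var = post_var + sigma2
            log_p[k] = (np.log(n_k + alpha / K)
                        - 0.5 * np.sum((X[i] - post_mean) ** 2) / pred_var
                        - 0.5 * D * np.log(pred_var))
        p = np.exp(log_p - log_p.max())
        z[i] = rng.choice(K, p=p / p.sum())

print("clusters used:", len(set(z.tolist())))
```

After burn-in the two synthetic word types end up in disjoint clusters, which is the discovery behaviour the slide's p(xi | h−) term provides.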
Listen to discovered clusters

• Small-vocabulary cluster 45 [audio]

• Large-vocabulary English cluster 1214 [audio]

• Large-vocabulary Xitsonga cluster 629 [audio]
Arrival
Using images for grounding language

Consider images paired with unlabelled spoken captions: [audio]
Map images and speech into common space

[Diagram] A vision network (VGG) maps the image to yvis, and a speech network (conv → max → feedfwd over input X) maps the spoken caption to yspch; the two embeddings are compared with a distance d(yvis, yspch).

[Harwath et al., NIPS’16]
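Such a common space is typically trained so that matched image/speech pairs end up closer than mismatched pairs. A hedged numpy sketch of that training signal (a margin-based ranking loss; the batch construction and dimensions are illustrative, not taken from the paper):

```python
# Margin ranking loss over a batch of matched (image, speech) embedding
# pairs: matched pairs sit on the diagonal of the pairwise-distance
# matrix, every off-diagonal entry is a mismatched "impostor" pair.

import numpy as np

def ranking_loss(y_vis, y_spch, margin=1.0):
    """Hinge ranking loss for a batch of matched (image, speech) pairs."""
    # d[i, j] = distance between image i and speech caption j.
    d = np.linalg.norm(y_vis[:, None, :] - y_spch[None, :, :], axis=2)
    pos = np.diag(d)                        # matched-pair distances
    loss = np.maximum(0.0, margin + pos[:, None] - d)
    np.fill_diagonal(loss, 0.0)             # don't penalise matched pairs
    return loss.mean()

rng = np.random.default_rng(0)
aligned = rng.normal(size=(4, 8))
good = ranking_loss(aligned, aligned)       # perfectly aligned spaces
bad = ranking_loss(aligned, rng.normal(size=(4, 8)))
print(good, bad)
```

Aligned embeddings incur (near-)zero loss, while a mismatched speech space is penalised, so minimising this loss pulls the two modalities into one space.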
Visually grounded keyword spotting

Keyword  Example of matched utterance                         Type
beach    a boy in a yellow shirt is walking on a beach . . .  correct
behind   a surfer does a flip on a wave                       mistake
bike     a dirt biker flies through the air                   variant
boys     two children play soccer in the park                 semantic
large    . . . a rocky cliff overlooking a body of water      semantic
play     children playing in a ball pit                       variant
sitting  two people are seated at a table with drinks         semantic
yellow   a tan dog jumping over a red and blue toy            mistake
young    a little girl on a kid swing                         semantic

[Kamper et al., Interspeech’17]
Summary and conclusion
What did we chat about today?
• Supervised speech recognition: From HMMs all the way to CLDNNs
• Structure is still important in speech recognition
• Saw three examples of models that do not require ASR
• Looked at local work taking inspiration from humans
What’s next (specifically for us)?

• Still many, many unsolved core machine learning problems in unsupervised and low-resource speech processing

• Building speech search systems for (South) African languages

• Can some of these approaches be used in other machine learning domains? E.g. can vision tell us something about speech?

• What can we learn about language acquisition in humans?

• Language acquisition in robots

• Main take-away: look at machine learning problems from different perspectives and angles
http://www.kamperh.com/
https://github.com/kamperh
Backup slides
Acoustic word embeddings (AWE)

[Diagram] Variable-length segments Y1 and Y2 are mapped by fe(·) to fixed-dimensional acoustic word embeddings x1 = fe(Y1) and x2 = fe(Y2), with x ∈ RD.

[Levin et al., ASRU’13]
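A minimal sketch of one classic embedding function fe: plain downsampling, where any variable-length frame sequence is reduced to a fixed number of evenly spaced frames and flattened (used as a simple baseline in this line of work; the frame counts and dimensions below are illustrative):

```python
# Downsampling acoustic word embedding: keep k evenly spaced frames of a
# (frames, dims) segment and flatten, so every segment maps to R^(k*dims)
# regardless of its duration.

import numpy as np

def downsample_embed(Y, k=10):
    """Map a (frames, dims) segment to a fixed (k * dims,) embedding."""
    idx = np.linspace(0, len(Y) - 1, k).round().astype(int)
    return Y[idx].reshape(-1)

rng = np.random.default_rng(0)
Y1 = rng.normal(size=(37, 13))   # two segments of different duration
Y2 = rng.normal(size=(81, 13))
x1, x2 = downsample_embed(Y1), downsample_embed(Y2)
print(x1.shape, x2.shape)  # → (130,) (130,)
```

The point is only that both embeddings have the same dimensionality, so segments of different duration become directly comparable with an ordinary vector distance.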
Word similarity Siamese CNN

Use the idea of Siamese networks [Bromley et al., PatRec’93]: the same network f embeds two segments as x1 = f(Y1) and x2 = f(Y2), which are compared with a distance loss l(x1, x2).

[Kamper et al., ICASSP’15]
Retrieval in common (semantic) space

Both yvis and yspch are points y ∈ RD in the same D-dimensional space, so matching images and speech becomes a nearest-neighbour search.

[Harwath et al., NIPS’16]
Word prediction from images and speech

[Diagram] A vision network (VGG) assigns word probabilities to the image (e.g. hat 0.85, man 0.8, shirt 0.9), giving soft targets yvis. A speech network (conv → max → feedfwd) maps the spoken caption X to f(X), and a loss L compares f(X) to yvis. Here f(X) ∈ RW is a vector of word probabilities, i.e. a spoken bag-of-words (BoW) classifier.

[Kamper et al., Interspeech’17]
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 115: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/115.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 116: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/116.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 117: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/117.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 118: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/118.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40
![Page 119: Deep learning for (more than) speech recognitionSpeech recognition was performed by combining acoustic model (thousands of HMM states) with pronunciation dictionary and language model](https://reader030.fdocuments.net/reader030/viewer/2022040308/5f09e7797e708231d4290f6b/html5/thumbnails/119.jpg)
Word prediction from images and speech
VGG
hat man
shirtyvis
VGG
hat man
shirtyvis
0.85 0.8 0.9
X
VGG
hat man
shirtyvis f(X)
max
conv
max
feedfwd
X
VGG
hat man
shirtyvis f(X)Loss
max
conv
max
feedfwd
L
X
f(X)
max
conv
max
feedfwd
X
f(X)
max
conv
max
feedfwdma
nhat
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
X
f(X)
max
conv
max
feedfwdma
nhat
f(X) ∈ RW is vector ofword probabilities
I.e., a spoken bag-of-words(BoW) classifier
[Kamper et al., Interspeech’17]46 / 40