Using Recurrent Neural Networks for Automatic Speech Recognition — Delft University of Technology, Faculty of Information Technology and Systems, Knowledge-Based Systems
Slide 1
Using Recurrent Neural Networks for Automatic Speech Recognition
Delft University of Technology, Faculty of Information Technology and Systems, Knowledge-Based Systems
Daan Nollen
8 January 1998
Slide 2
“Seven-month-old babies already discern rules in language use”

• Abstract rules versus statistical probabilities
• Example: “meter”
  – “Er staat 30 graden op de meter” (“The gauge reads 30 degrees”)
  – “De breedte van de weg is hier 2 meter” (“The road is 2 metres wide here”)
• Context is very important
  – sentence (grammar)
  – word (syntax)
  – phoneme
Slide 3
Contents

• Problem definition
• Automatic Speech Recognition (ASR)
• Recnet ASR
• Phoneme postprocessing
• Word postprocessing
• Conclusions and recommendations
Slide 4
Problem definition: create an ASR using only ANNs

• Study Recnet ASR and train the recogniser
• Design and implement an ANN workbench
• Design and implement an ANN phoneme postprocessor
• Design and implement an ANN word postprocessor
Slide 5
Contents

• Problem definition
• Automatic Speech Recognition (ASR)
• Recnet ASR
• Phoneme postprocessing
• Word postprocessing
• Conclusions and recommendations
Slide 6
Automatic Speech Recognition (ASR)

• ASR consists of 4 phases:

A/D conversion → Preprocessing → Recognition → Postprocessing
Slide 7
Contents

• Problem definition
• Automatic Speech Recognition (ASR)
• Recnet ASR
• Phoneme postprocessing
• Word postprocessing
• Conclusions and recommendations
Slide 8
Recnet ASR

• TIMIT speech corpus: US-American samples
• Hamming windows of 32 ms with 16 ms overlap
• Preprocessor based on the fast Fourier transform (FFT)
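This preprocessing step can be sketched as follows, assuming TIMIT's 16 kHz sampling rate (so a 32 ms window is 512 samples and a 16 ms hop is 256 samples). Recnet's actual feature extraction may include further steps beyond the raw FFT magnitudes; this is only a minimal illustration:

```python
import numpy as np

def preprocess(waveform, sample_rate=16000):
    """Split audio into 32 ms Hamming-windowed frames with 16 ms overlap
    and return the FFT magnitude spectrum of each frame."""
    frame_len = int(0.032 * sample_rate)   # 32 ms -> 512 samples at 16 kHz
    hop = int(0.016 * sample_rate)         # 16 ms overlap -> 256-sample hop
    window = np.hamming(frame_len)
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, hop)]
    return np.array([np.abs(np.fft.rfft(f * window)) for f in frames])

t = np.arange(16000) / 16000.0                     # one second of toy audio
spectra = preprocess(np.sin(2 * np.pi * 440 * t))  # a 440 Hz test tone
# each row is one frame's magnitude spectrum; the tone peaks
# near bin 440 / (16000 / 512) = 14
```

The overlap means every 16 ms of speech is covered by two windows, which smooths the frame-to-frame feature trajectory fed into the recogniser.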
Slide 9
Recnet Recogniser

• u(t): external input (output of the preprocessor)
• y(t+1): external output (phoneme probabilities)
• x(t): internal input
• x(t+1): internal output

x(0) = 0
x(1) = f(u(0), x(0))
x(2) = f(u(1), x(1)) = f(u(1), f(u(0), x(0)))
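The recurrence x(t+1) = f(u(t), x(t)) can be sketched as a single Elman-style state update. The sizes and random weights below are illustrative placeholders, not Recnet's actual architecture; only the 62 phoneme outputs come from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_state, n_out = 16, 32, 62                # 62 phoneme outputs; other sizes illustrative
W_u = rng.normal(size=(n_state, n_in)) * 0.1     # external input u(t) -> state
W_x = rng.normal(size=(n_state, n_state)) * 0.1  # recurrent state feedback x(t)
W_y = rng.normal(size=(n_out, n_state)) * 0.1    # state -> phoneme scores

def f(u_t, x_t):
    """One recurrent step: x(t+1) = f(u(t), x(t))."""
    return np.tanh(W_u @ u_t + W_x @ x_t)

def output(x_next):
    """y(t+1): softmax turns scores into phoneme probabilities."""
    s = W_y @ x_next
    e = np.exp(s - s.max())
    return e / e.sum()

x = np.zeros(n_state)                  # x(0) = 0
for u_t in rng.normal(size=(3, n_in)):
    x = f(u_t, x)                      # x(t+1) depends on the whole input history
y = output(x)                          # phoneme probability vector, sums to 1
```

Unrolling the loop reproduces the expansion on the slide: after two steps the state is f(u(1), f(u(0), x(0))), so the output depends on all past inputs.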
Slide 10
Performance of the Recnet Recogniser

• Output of Recnet is a probability vector, e.g.
  /eh/ = 0.1, /ih/ = 0.5, /ao/ = 0.3, /ae/ = 0.1
• 65% of phonemes correctly labelled
• Network trained on a parallel nCUBE2
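Labelling a frame then amounts to taking the most probable entry of the output vector; for the example vector on this slide that is /ih/:

```python
# Recnet's output for one frame, as on the slide
probs = {"/eh/": 0.1, "/ih/": 0.5, "/ao/": 0.3, "/ae/": 0.1}

# the frame label is the phoneme with the highest probability
label = max(probs, key=probs.get)
print(label)  # -> /ih/
```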
Slide 11
Training Recnet on the nCUBE2

• Over 78 billion floating-point operations per training cycle
• Training time on a 144 MHz Sun: 700 hours (1 month)
• Processing time on the nCUBE2 (32 processors): 70 hours
Slide 12
Contents

• Problem definition
• Automatic Speech Recognition (ASR)
• Recnet ASR
• Phoneme postprocessing
• Word postprocessing
• Conclusions and recommendations
Slide 13
Recnet Postprocessor

• Perfect recogniser:
  /h/ /h/ /h/ /eh/ /eh/ /eh/ /eh/ /l/ /l/ /ow/ /ow/ /ow/  →  /h/ /eh/ /l/ /ow/
• Recnet recogniser:
  /h/ /k/ /h/ /eh/ /ah/ /eh/ /eh/ /l/ /l/ /ow/ /ow/ /h/
Slide 14
Hidden Markov Models (HMMs)

• Model stochastic processes
• Maximum a posteriori state sequence determined by

  $Q^* = \arg\max_{Q} \prod_{t=1}^{T} \Pr(q_t \mid q_{t-1})\, p(u_t \mid q_t)$

• Most likely path found by the Viterbi algorithm

[Figure: two-state HMM with states S1 and S2, transition probabilities a11, a12, a21, a22, and per-state emission probabilities p(a), p(b), …, p(n).]
Slide 15
Recnet Postprocessing HMM

• Phonemes represented by 62 states
• Transitions trained on correct TIMIT labels
• The output of the Recnet phoneme recogniser provides the output (emission) probabilities
• Very suitable for the smoothing task

[Figure: two-state fragment with states /eh/ and /ih/, transition probabilities a11, a12, a21, a22, and per-state emission probabilities p(/eh/) = y_/eh/, p(/ih/) = y_/ih/, p(/ao/) = y_/ao/.]
Slide 16
Scoring a Postprocessor

• The number of frames is fixed, but the number of phonemes is not
• Three types of errors:
  – insertion    /a/ /c/ /b/ /b/ /a/ /c/ /b/
  – deletion     /a/ /a/ /a/ /a/ /a/ ...
  – substitution /a/ /a/ /c/ /c/ /a/ /c/
• Optimal alignment:
  /a/ /a/ /b/ /b/ /a/ /b/
  /a/ /c/ /b/ /b/ /a/ /c/ /b/
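Scoring by optimal alignment can be sketched as a standard minimum-edit-distance alignment with a backtrace that classifies each error (a hypothetical implementation; the thesis's exact scoring algorithm may differ). Applied to the two sequences above, it finds one insertion and one substitution:

```python
def score(ref, hyp):
    """Align hyp to ref with minimum edit distance and count
    (insertions, deletions, substitutions)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                                  # all deletions
    for j in range(m + 1):
        d[0][j] = j                                  # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # backtrace to classify the errors
    i, j, ins, dele, subs = n, m, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ins, j = ins + 1, j - 1                  # extra symbol in hyp
        else:
            dele, i = dele + 1, i - 1                # missing symbol in hyp
    return ins, dele, subs

ref = ["a", "a", "b", "b", "a", "b"]        # perfect sequence
hyp = ["a", "c", "b", "b", "a", "c", "b"]   # recognised sequence
print(score(ref, hyp))  # -> (1, 0, 1): one insertion, one substitution
```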
Slide 17
Scoring the Recnet Postprocessing HMM

• No postprocessing vs the Recnet HMM
• Removing repetitions
• Applying the scoring algorithm

|              | Nothing | Recnet HMM |
|--------------|---------|------------|
| Correct      | 79.9%   | 72.8%      |
| Insertion    | 46.1%   | 3.7%       |
| Deletion     | 17.6%   | 21.2%      |
| Substitution | 2.5%    | 6.0%       |
| Total errors | 66.2%   | 30.9%      |
Slide 18
Elman RNN Phoneme Postprocessor (1)

• Context helps to smooth the signal
• Input: probability vectors
• Output: most likely phoneme
Slide 19
RNN Phoneme Postprocessor (2)

• Calculates a conditional probability:

  $\text{phon}_{\text{most likely}} = \arg\max_{0 \le i \le 61} p(\text{phon}_i \mid \text{input}, \text{context})$

• The probability changes through time
Slide 20
Training the RNN Phoneme Postprocessor (1)

• The highest probability determines the region
• 35% fall in an incorrect region
• Training dilemma:
  – on perfect data, the context is not used
  – on Recnet output, 35% of the training data is in error
• Mixing the training set is needed
Slide 21
Training the RNN Phoneme Postprocessor (2)

• Two mixing methods applied
  – I : training set = norm(α · Recnet + (1 − α) · perfect probabilities)
  – II: training set = norm(Recnet with p(phn_correct) = 1)

|              | Method I | Method II | Recnet HMM |
|--------------|----------|-----------|------------|
| Correct      | 74.8%    | 75.1%     | 72.8%      |
| Insertion    | 23.5%    | 21.9%     | 3.7%       |
| Deletion     | 21.4%    | 21.1%     | 21.2%      |
| Substitution | 3.8%     | 3.8%      | 6.0%       |
| Total errors | 48.7%    | 46.8%     | 30.9%      |
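Mixing method I can be sketched as a convex combination of the Recnet output with the perfect one-hot target, renormalised. The mixing coefficient α and the example numbers below are illustrative; the value used in the thesis is not given here:

```python
import numpy as np

def mix_targets(recnet_probs, correct_idx, alpha=0.5):
    """Method I: target = norm(alpha * Recnet output + (1 - alpha) * perfect)."""
    perfect = np.zeros_like(recnet_probs)
    perfect[correct_idx] = 1.0                  # one-hot vector of the correct phoneme
    mixed = alpha * recnet_probs + (1 - alpha) * perfect
    return mixed / mixed.sum()                  # renormalise to a probability vector

recnet = np.array([0.1, 0.5, 0.3, 0.1])         # example Recnet output for one frame
target = mix_targets(recnet, correct_idx=2)     # suppose index 2 is the correct phoneme
# -> [0.05, 0.25, 0.65, 0.05]
```

The mix keeps the correct phoneme dominant in the target while preserving some of the error structure the postprocessor will actually see at test time.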
Slide 22
Conclusion: RNN Phoneme Postprocessor

• Reduces insertion errors by 50%
• Unfair competition with the HMM:
  – the Elman RNN uses only previous frames (context)
  – the HMM uses both preceding and succeeding frames
Slide 23
Example: HMM Postprocessing

Two states /a/ and /b/; self-transitions 0.8, cross-transitions 0.2; initial state /a/ (π_a = 1, π_b = 0).

Frame 1: p(/a/) = 0.1, p(/b/) = 0.9
p(/a/, /a/) = 1 · 0.8 · 0.1 = 0.08
p(/a/, /b/) = 1 · 0.2 · 0.9 = 0.18

Frame 2: p(/a/) = 0.9, p(/b/) = 0.1
p(/a/, /a/, /a/) = 0.08 · 0.8 · 0.9 = 0.0576
p(/a/, /a/, /b/) = 0.08 · 0.2 · 0.1 = 0.0016
p(/a/, /b/, /a/) = 0.18 · 0.2 · 0.9 = 0.0324
p(/a/, /b/, /b/) = 0.18 · 0.8 · 0.1 = 0.0144
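The path probabilities above can be reproduced by brute-force enumeration of all state sequences of this two-state model (a sketch; the transition and emission values are taken from the slide):

```python
from itertools import product

trans = {("a", "a"): 0.8, ("a", "b"): 0.2,
         ("b", "a"): 0.2, ("b", "b"): 0.8}
frames = [{"a": 0.1, "b": 0.9},   # frame 1: Recnet output used as emission probs
          {"a": 0.9, "b": 0.1}]   # frame 2

def path_prob(path, start="a"):
    """Probability of one state path: product of transition * emission per frame."""
    p, prev = 1.0, start
    for state, frame in zip(path, frames):
        p *= trans[(prev, state)] * frame[state]
        prev = state
    return p

probs = {path: path_prob(path) for path in product("ab", repeat=2)}
best = max(probs, key=probs.get)   # -> ("a", "a"), i.e. the path /a/ /a/ /a/
```

Smoothing shows up directly: frame 1 favours /b/, but the transition structure makes the all-/a/ path the most probable overall.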
Slide 24
Conclusion: RNN Phoneme Postprocessor

• Reduces insertion errors by 50%
• Unfair competition with the HMM:
  – the Elman RNN uses only previous frames (context)
  – the HMM uses both preceding and succeeding frames
• The PPRNN works in real time
Slide 25
Contents

• Problem definition
• Automatic Speech Recognition (ASR)
• Recnet ASR
• Phoneme postprocessing
• Word postprocessing
• Conclusions and recommendations
Slide 26
Word Postprocessing

• The phoneme postprocessor outputs a continuous stream of phonemes
• Segmentation is needed to convert phonemes into words
• An Elman Phoneme Prediction RNN (PPRNN) can segment the stream and correct errors
Slide 27
Elman Phoneme Prediction RNN (PPRNN)

• Network trained to predict the next phoneme in a continuous stream
• Squared error $S_e$ determined by

  $S_e = \sum_{i=0}^{61} (y_i - z_i)^2$
Slide 28
Testing the PPRNN

Test utterance: “hh iy - hv ae dcl - p l ah n jh dcl d - ih n tcl t ix - dh ix - dcl d aa r kcl k - w uh dcl d z - bcl b iy aa n dcl d”

[Figure: squared prediction error (“Distance”, 0–3) plotted for each phoneme of the utterance above.]
Slide 29
Performance of the PPRNN Parser

• Two error types:
  – insertion error: “helloworld” parsed as “he-llo-world”
  – deletion error: “hello-world” parsed as “helloworld”
• Performance of the parser on the complete test set:
  – parsing errors: 22.9%
Slide 30
Error Detection using the PPRNN

• The prediction forms a vector in phoneme space
• Squared error to the closest corner point: $S_e$; squared error to the most likely phoneme: $S_{e,\mathrm{most\ likely}}$
• An error is indicated by

  $E_{\mathrm{indication}} = \frac{S_e}{S_{e,\mathrm{most\ likely}}}$
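One plausible reading of this indicator, sketched under explicit assumptions: corner points of the phoneme space are one-hot vectors, the numerator measures the prediction's distance to the phoneme actually observed next, and the denominator its distance to the network's own most likely prediction. The function name and example values are illustrative:

```python
import numpy as np

def error_indication(prediction, observed_idx):
    """Ratio of squared errors: distance to the observed phoneme's corner
    over distance to the predicted (most likely) corner. A ratio near 1
    means the observation matches the prediction; large values flag a
    likely recognition error."""
    def sq_err(corner_idx):
        corner = np.zeros_like(prediction)
        corner[corner_idx] = 1.0           # one-hot corner of phoneme space
        return float(np.sum((prediction - corner) ** 2))
    return sq_err(observed_idx) / sq_err(int(prediction.argmax()))

pred = np.array([0.1, 0.7, 0.1, 0.1])  # network strongly predicts phoneme 1
ok = error_indication(pred, 1)          # observed phoneme agrees -> ratio 1.0
bad = error_indication(pred, 3)         # disagreement -> ratio well above 1
```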
Slide 31
Error Correction using the PPRNN

[Figure: phoneme stream Phn(x), Phn(x+1), Phn(x+2) compared with the predictions Phn_ml(x+1), Phn_ml(x+2).]

• Insertion error
• Deletion error
• Substitution error
Slide 32
Performance of Correction with the PPRNN

• Total error rate reduced by 8.6 percentage points

|              | Recnet HMM | HMM + PPRNN |
|--------------|------------|-------------|
| Correct      | 72.9%      | 80.7%       |
| Insertion    | 4.1%       | 3.3%        |
| Deletion     | 20.9%      | 14.8%       |
| Substitution | 6.2%       | 4.5%        |
| Total errors | 31.2%      | 22.6%       |
Slide 33
Contents

• Problem definition
• Automatic Speech Recognition (ASR)
• Recnet ASR
• Phoneme postprocessing
• Word postprocessing
• Conclusions and recommendations
Slide 34
Conclusions

It is possible to create an ANN-based ASR.

• Recnet: documented and trained
• RecBench implemented
• ANN phoneme postprocessor: promising performance
• ANN word postprocessor: parsing 80% correct; error rate reduced by 9%
Slide 35
Recommendations

• ANN phoneme postprocessor:
  – different mixing techniques
  – increase the frame rate
  – more advanced scoring algorithm
• ANN word postprocessor:
  – results of an increase in vocabulary
• Phoneme-to-word conversion:
  – auto-associative ANN
• Hybrid Time-Delay ANN / RNN
Slide 36
A/D Conversion

• Sampling the signal “Hello world”:

1000101001001100010010101100010010111011110110101101
Slide 37
Preprocessing

• Splitting the signal into frames
• Extracting features

1000101001001100010010101100010010111011110110101101
→ 101101 | 010110 | 111101 | 111100
Slide 38
Recognition

• Classification of frames:

101101 | 010110 | 111101 | 111100
→ H E L L
Slide 39
Postprocessing

• Reconstruction of the signal:

H E L L → “Hello world”
Slide 40
Training Recnet on the nCUBE2

• Training an RNN is computationally intensive
• Trained using backpropagation through time
• nCUBE2: hypercube architecture
• 32 processors used during training

[Figure: hypercube topology with numbered nodes 0–3.]
Slide 41
Training Results on the nCUBE2

• Training time on the nCUBE2: 50 hours
• Performance of the trained Recnet: 65.0% correctly classified phonemes

[Figure: training curves over 60 cycles (y-axis 0–1.2): alpha, energy, senergy, ncorrect, sncorrect.]
Slide 42
Viterbi Algorithm

• A posteriori most likely path
• Recursive algorithm, with O(t) the observation at time t:

  $\delta_j(0) = \pi_j$
  $\delta_j(t) = \max_{1 \le i \le N} \left[\delta_i(t-1)\, a_{ij}\right] b_j(O(t))$

[Figure: trellis of states S1 … SN over time steps 0 … T.]
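The recursion can be implemented directly; this is a minimal sketch, applied here to the two-state /a/ and /b/ example from the HMM postprocessing slide, where it selects the path /a/ /a/ /a/ with probability 0.0576:

```python
def viterbi(states, start, trans, emissions):
    """delta_j(t) = max_i [delta_i(t-1) * a_ij] * b_j(O(t)), with backpointers."""
    delta = {s: (1.0 if s == start else 0.0) for s in states}  # delta_j(0) = pi_j
    back = []
    for obs in emissions:
        new, ptr = {}, {}
        for j in states:
            i_best = max(states, key=lambda i: delta[i] * trans[(i, j)])
            new[j] = delta[i_best] * trans[(i_best, j)] * obs[j]
            ptr[j] = i_best
        back.append(ptr)
        delta = new
    last = max(delta, key=delta.get)   # most likely final state
    path = [last]
    for ptr in reversed(back):         # follow backpointers to t = 0
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[last]

trans = {("a", "a"): 0.8, ("a", "b"): 0.2, ("b", "a"): 0.2, ("b", "b"): 0.8}
emissions = [{"a": 0.1, "b": 0.9}, {"a": 0.9, "b": 0.1}]
path, p = viterbi("ab", "a", trans, emissions)
```

Unlike brute-force enumeration, the max in each column keeps only the best path into each state, so the cost grows linearly in T instead of exponentially.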
Slide 43
Elman Character Prediction RNN (2)

• Trained on words: “Many years a boy and girl lived by the sea. They played happily”
• During training, words are picked randomly from the vocabulary
• During parsing, the squared error $S_e$ is determined by

  $S_e = \sum_{i=0}^{61} (y_i - z_i)^2$

  with $y_i$ the correct output of node i and $z_i$ the actual output of node i
• $S_e$ is expected to decrease within a word
Slide 44
Elman Character prediction RNN (3)