What does speech “look” like
to an Automatic Speech Recognition System?
Jiang [email protected] Engineering Department
Suppose you may use only 39 values to represent a 1-second segment of speech…
What are “speech features?” What are good features?
◦ Discriminative
◦ Compact, to avoid the “Curse of Dimensionality”
How do we extract features from speech?
From both the time and frequency domains…
For each frame of pre-recorded speech, we extract features that compress its spectrum.
Frequency Domain Features : “Static” Features
[Figure: basis vectors BV0, BV1, and BV2. Amplitude vs. frequency (kHz), 0 to 8 kHz.]
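As a rough illustration of compressing one frame's spectrum into a few "static" values, the sketch below projects a frame's log magnitude spectrum onto a handful of basis vectors. The slides do not say which basis BV0, BV1, BV2 come from; cosine (DCT-style) basis vectors are assumed here, and the function names, the 128-sample frame, and the choice of 13 coefficients are all hypothetical.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitude of the DFT for bins 0..N/2 (naive O(N^2) DFT, fine for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    return spectrum

def cosine_basis(index, length):
    """Assumed basis vector number `index`: a cosine sampled at `length` points."""
    return [math.cos(math.pi * index * (m + 0.5) / length) for m in range(length)]

def static_features(frame, num_coeffs=13):
    """Compress one frame's log spectrum into `num_coeffs` numbers."""
    log_spec = [math.log(mag + 1e-10) for mag in magnitude_spectrum(frame)]
    length = len(log_spec)
    feats = []
    for i in range(num_coeffs):
        basis = cosine_basis(i, length)
        # Each feature is the (scaled) inner product of the spectrum with one basis vector.
        feats.append(sum(s * b for s, b in zip(log_spec, basis)) / length)
    return feats

# Synthetic test frame: a 1 kHz sine sampled at 16 kHz (stands in for real speech).
sample_rate = 16000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(128)]
features = static_features(frame)
print(len(features))
```

Low-index basis vectors capture the broad spectral shape and higher ones add finer detail, which is why keeping only the first few coefficients acts as compression.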
[Figure: basis vectors BV0, BV1, and BV2. Amplitude vs. time (ms), -60 to 60 ms.]
Time Domain Features: Trajectories of Static Spectral Features over Time
Recent research has shown that spectral trajectories over time also play an important role in ASR.
Thus, we also want to let the computer see what happens over time, around the center of each static feature.
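One common way to describe what happens over time around each frame is a "delta" feature: the local slope of each static feature, fitted over a few neighbouring frames. The sketch below uses the standard regression formula for deltas; this is an assumption for illustration, since the slides do not say how their trajectory features are computed, and the name `delta` and the window size are hypothetical.

```python
def delta(frames, half_window=2):
    """Slope of each feature dimension around every frame.

    frames: list of feature vectors, one per time frame.
    Frames beyond the edges are replaced by the nearest valid frame.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, half_window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, half_window + 1):
                ahead = frames[min(t + n, num_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

# Toy static features: dimension 0 rises by 1 per frame, dimension 1 is constant.
static = [[float(t), 2.0] for t in range(10)]
deltas = delta(static)

# Stack static values and their trajectories into one vector per frame.
stacked = [s + d for s, d in zip(static, deltas)]
print(deltas[5], len(stacked[5]))
```

Applying `delta` to its own output would give second-order ("delta-delta") trajectories, which can be stacked the same way.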
So finally, to the ASR system, the speech features look like…
[Figure: two spectrograms, “Original Spectrogram” and “Rebuilt Spectrogram”. Frequency (Hz), 0 to 8000, vs. time (sec), 0.5 to 1.5.]
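To give a feel for how a spectrum can be "rebuilt" from a small set of features, the sketch below compresses a toy spectral slice with a cosine transform, keeps only the first few coefficients, and inverts the transform. This is a minimal sketch assuming a DCT-style representation; the slides do not state how the rebuilt spectrogram was produced, and the toy spectrum and the choice of 13 coefficients are hypothetical.

```python
import math

def dct(values):
    """Unnormalised DCT-II of a list of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (m + 0.5) / n) for m, v in enumerate(values))
        for k in range(n)
    ]

def inverse_dct(coeffs, length):
    """Rebuild `length` points from the first len(coeffs) DCT-II coefficients."""
    out = []
    for m in range(length):
        total = coeffs[0] / length
        for k in range(1, len(coeffs)):
            total += 2.0 / length * coeffs[k] * math.cos(math.pi * k * (m + 0.5) / length)
        out.append(total)
    return out

def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Toy spectral slice with two peaks (stands in for one column of a spectrogram).
spectrum = [
    math.exp(-((m - 10) / 4.0) ** 2) + 0.5 * math.exp(-((m - 40) / 6.0) ** 2)
    for m in range(64)
]
coeffs = dct(spectrum)
rebuilt_full = inverse_dct(coeffs, 64)     # all 64 coefficients: exact up to rounding
rebuilt_13 = inverse_dct(coeffs[:13], 64)  # 13 coefficients: a smoothed version

print(max_error(spectrum, rebuilt_full), max_error(spectrum, rebuilt_13))
```

The truncated version loses fine detail but keeps the overall shape, which is the trade-off a compact feature set makes.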
Other Features for Future Study…
Pitch Contour:
◦ Strongly related to “tones”
◦ Very popular feature type for tonal languages: Mandarin, Cantonese, some Korean dialects, etc.
Perceptual Features:
◦ Analyze the speech signal according to how the human auditory system “perceptually” processes sound
◦ Frequency resolution and time resolution both depend on frequency and time.
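One widely used way to make frequency resolution depend on frequency is to warp the frequency axis with the mel scale before analysis. The sketch below uses the common formula mel = 2595 · log10(1 + f/700); the mel scale is an assumption chosen for illustration, since the slides do not name a particular perceptual scale, and the sketch covers only the frequency side, not time resolution.

```python
import math

def hz_to_mel(hz):
    """Convert frequency in Hz to mels (common 2595 * log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Ten band edges spaced evenly on the mel axis between 0 and 8000 Hz.
top_mel = hz_to_mel(8000.0)
edges_hz = [mel_to_hz(top_mel * i / 9) for i in range(10)]

# Band widths in Hz: narrow at low frequencies, wide at high frequencies.
widths_hz = [hi - lo for lo, hi in zip(edges_hz, edges_hz[1:])]
print([round(w) for w in widths_hz])
```

Equal steps in mels translate into ever wider steps in Hz, so low frequencies are resolved finely and high frequencies coarsely.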
[Figure: three panels. “Speech waveform”: amplitude vs. time (seconds). “Pitch”: frequency (Hz) vs. time (seconds). “Spectrogram”: frequency (Hz) vs. time.]
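A pitch contour like the one plotted above can be approximated frame by frame with a simple autocorrelation search: the lag at which a frame best matches a shifted copy of itself gives the pitch period. This is a minimal sketch under stated assumptions, not the method used for the slides; the function name, the 60 to 400 Hz search range, and the synthetic test signal are all hypothetical.

```python
import math

def estimate_pitch(frame, sample_rate, min_hz=60.0, max_hz=400.0):
    """Return a pitch estimate in Hz for one frame, or 0.0 if the frame is silent."""
    if sum(x * x for x in frame) == 0.0:
        return 0.0
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag: large when the frame repeats every `lag` samples.
        score = sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0

# Synthetic voiced frame: 200 Hz fundamental plus a weaker 400 Hz harmonic, 50 ms at 8 kHz.
sample_rate = 8000
frame = [
    math.sin(2 * math.pi * 200 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 400 * t / sample_rate)
    for t in range(400)
]
pitch = estimate_pitch(frame, sample_rate)
print(pitch)
```

Running this on successive frames of a recording would yield one value per frame, which together trace out a contour. Real speech would also need a voiced/unvoiced decision, which this sketch omits.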
Our Speech Lab
◦ Dr. Montri Karnjanadecha (Forced Alignment)
◦ Chandra Vootkuri (Ph.D.) (Landmark Theory)
◦ Brian Wong (M.S.) (Freq. Non-linearity)
◦ Andrew Hwang (M.S.) (Feature Transform)
Our Current Projects
Project 1: To create an open source multi-language audio database for spoken language processing applications.
Project 2: To understand tonal languages.
Background you need..
◦ Signal processing (A/D)
◦ Probability theory, pattern recognition, and machine learning
◦ Understanding of the human auditory system, linguistics, or musicality will be a bonus!
Speech Research is interesting!
Tons of interesting applications!!
Talk to a public computer in your real life…
◦ Not just the Microsoft speech-to-text software on your PC..
Plus:
◦ Pronunciation therapy
◦ Singing voice processing
◦ Hearing aids
Thank you! Questions?