What does speech “look” like to an Automatic Speech Recognition System?

Jiang [email protected] Engineering Department

Suppose you are allowed only 39 values to represent a 1-second speech segment…

What are “speech features?” What are good features?

◦ Discriminative

◦ “Curse of Dimensionality”

How do we extract features from speech?

From both time and frequency domain…

For each frame of pre-recorded speech, we extract features that compress its spectrum.

Frequency Domain Features : “Static” Features
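As a rough illustration of compressing one frame's spectrum into a few static features, here is a minimal sketch. It assumes cosine-shaped basis vectors over frequency, a Hamming window, a 64-sample frame and 13 coefficients; none of these choices is stated in the slides, and the function names are hypothetical.

```python
import cmath
import math

def log_magnitude_spectrum(frame):
    """Log magnitude spectrum of one frame (naive DFT, bins 0..N/2)."""
    n = len(frame)
    # Hamming window to reduce spectral leakage at the frame edges.
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(math.log(abs(acc) + 1e-10))  # small floor avoids log(0)
    return spectrum

def cosine_basis(num_vectors, num_bins):
    """Assumed basis: vector j is cos(pi * j * (m + 0.5) / num_bins) over bins m."""
    return [[math.cos(math.pi * j * (m + 0.5) / num_bins) for m in range(num_bins)]
            for j in range(num_vectors)]

def static_features(frame, num_features=13):
    """Project the frame's log spectrum onto the first few basis vectors."""
    log_spec = log_magnitude_spectrum(frame)
    basis = cosine_basis(num_features, len(log_spec))
    return [sum(b * s for b, s in zip(vec, log_spec)) / len(log_spec)
            for vec in basis]

# Hypothetical test frame: 64 samples of a 1 kHz tone sampled at 16 kHz.
frame = [math.sin(2 * math.pi * 1000 * t / 16000) for t in range(64)]
features = static_features(frame)
```

Here 33 spectral bins are reduced to 13 numbers. Because basis vector 0 is constant, the first feature is simply the average log-spectral level; higher-numbered vectors capture progressively finer spectral shape.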

[Figure: basis vectors BV0, BV1 and BV2, amplitude versus frequency (kHz)]

[Figure: basis vectors BV0, BV1 and BV2, amplitude versus time (ms)]

Time Domain Features : Trajectories of Static Spectral Features over Time

Recent research has shown that spectral trajectories over time also play an important role in ASR.

Thus, we also want to let computers see what happens over time, in a window centered on each static feature.
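One way to summarize such a trajectory is to project each static feature's track over a window of frames onto a few basis vectors over time. The sketch below assumes cosine-shaped time basis vectors, a 12-frame window, 13 static features and 3 time basis vectors (chosen so that the result has 39 values); these are illustrative assumptions, and the function names are hypothetical.

```python
import math

def time_basis(num_vectors, num_frames):
    """Assumed basis: vector j is cos(pi * j * (t + 0.5) / num_frames) over frames t."""
    return [[math.cos(math.pi * j * (t + 0.5) / num_frames) for t in range(num_frames)]
            for j in range(num_vectors)]

def trajectory_features(static_block, num_time_vectors=3):
    """Summarize a window of static-feature frames.

    static_block is a list of frames, each a list of static features.
    Returns num_static * num_time_vectors values, grouped by static feature.
    """
    num_frames = len(static_block)
    num_static = len(static_block[0])
    basis = time_basis(num_time_vectors, num_frames)
    out = []
    for d in range(num_static):
        track = [frame[d] for frame in static_block]  # feature d over time
        for vec in basis:
            out.append(sum(b * x for b, x in zip(vec, track)) / num_frames)
    return out

# Hypothetical window: 12 frames, each holding the same 13 static features.
block = [[float(d + 1) for d in range(13)] for _ in range(12)]
segment = trajectory_features(block)
```

For a feature that does not change over the window, only the first time coefficient is non-zero (it equals the feature's average); the second and third coefficients respond to slope and curvature of the trajectory.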


So finally, to the ASR system, the speech features look like…

[Figure: Original Spectrogram, frequency (Hz, 0 to 8000) versus time (sec, 0.5 to 1.5)]

[Figure: Rebuilt Spectrogram, frequency (Hz, 0 to 8000) versus time (sec, 0.5 to 1.5)]
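To see why a spectrogram rebuilt from a handful of features looks like a smoothed version of the original, the sketch below compresses a synthetic log spectrum with a truncated cosine expansion and then inverts it. The cosine basis, the 64 bins, the 13 retained coefficients and the synthetic spectrum are all assumptions made for illustration.

```python
import math

def compress(log_spec, num_coeffs):
    """Keep the first num_coeffs cosine-expansion coefficients of a log spectrum."""
    m = len(log_spec)
    return [sum(s * math.cos(math.pi * j * (i + 0.5) / m)
                for i, s in enumerate(log_spec)) / m
            for j in range(num_coeffs)]

def rebuild(coeffs, num_bins):
    """Invert compress(); exact when all num_bins coefficients are kept."""
    out = []
    for i in range(num_bins):
        value = coeffs[0]
        for j in range(1, len(coeffs)):
            value += 2.0 * coeffs[j] * math.cos(math.pi * j * (i + 0.5) / num_bins)
        out.append(value)
    return out

# Hypothetical log spectrum: a smooth envelope plus a small, fast ripple.
log_spec = [math.exp(-((i - 20) / 10.0) ** 2)
            + 0.05 * math.cos(math.pi * 60 * (i + 0.5) / 64)
            for i in range(64)]

smooth = rebuild(compress(log_spec, 13), 64)   # 13 values: envelope only
exact = rebuild(compress(log_spec, 64), 64)    # all 64 values: lossless

err_smooth = max(abs(a - b) for a, b in zip(smooth, log_spec))
err_exact = max(abs(a - b) for a, b in zip(exact, log_spec))
```

With 13 coefficients the broad envelope survives and the fine ripple is discarded, so `err_smooth` is small but not zero, while keeping every coefficient reproduces the input up to rounding.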

Other Features for Future Study…

Pitch Contour:

◦ Strongly related to “tones”

◦ Very popular feature type for tonal languages: Mandarin, Cantonese, some Korean dialects, etc.

“Perceptual” Features:

◦ Analyze the speech signal the way the human auditory system “perceptually” processes sound

◦ Frequency resolution and time resolution both depend on frequency and time
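A pitch contour is a sequence of per-frame pitch estimates. The slides do not say how pitch is tracked, so the sketch below swaps in a plain autocorrelation peak picker as one simple possibility; the search range of 60 to 400 Hz, the 8 kHz sample rate and the function name are assumptions.

```python
import math

def estimate_pitch(frame, sample_rate, min_hz=60.0, max_hz=400.0):
    """Return the pitch (Hz) at the strongest autocorrelation peak in the search range."""
    mean = sum(frame) / len(frame)
    centered = [x - mean for x in frame]        # remove any DC offset
    min_lag = int(sample_rate / max_hz)          # shortest period considered
    max_lag = int(sample_rate / min_hz)          # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(centered[i] * centered[i + lag]
                    for i in range(len(centered) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Hypothetical voiced frame: 50 ms of a 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(400)]
pitch = estimate_pitch(frame, sample_rate)
```

Running the estimator on successive frames and plotting the results against time gives a contour. Real speech would also need a voiced/unvoiced decision and some smoothing, which this sketch leaves out.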

[Figure: Speech waveform, amplitude versus time (seconds)]

[Figure: Pitch, frequency (Hz, 0 to 300) versus time (seconds)]

[Figure: Spectrogram, frequency (Hz, 0 to 8000) versus time]

Our Speech Lab

◦ Dr. Montri Karnjanadecha (Forced Alignment)

◦ Chandra Vootkuri (Ph.D.) (Landmark Theory)

◦ Brian Wong (M.S.) (Freq. Non-linearity)

◦ Andrew Hwang (M.S.) (Feature Transform)

Our Current Projects

Project 1: To create an open-source multi-language audio database for spoken language processing applications.

Project 2: To understand tonal languages.

Background you need..

◦ Signal processing (A/D)

◦ Probability theory, pattern recognition and machine learning

◦ An understanding of the human auditory system, linguistics, or musicality will be a bonus!

Speech Research is interesting!

◦ Pronunciation therapy

◦ Singing voice processing

◦ Hearing aids

Tons of interesting applications!!

Talk to a public computer in your real life…

◦ Not just the Microsoft speech-to-text software on your PC..

Thank you! Questions?