Page 1

Speech and Language Processing

Chapter 9 of SLP: Automatic Speech Recognition (I)

Jurafsky and Martin

Page 2

Outline for ASR

ASR Architecture
The Noisy Channel Model

Five easy pieces of an ASR system:
1) Language Model
2) Lexicon/Pronunciation Model (HMM)
3) Feature Extraction
4) Acoustic Model
5) Decoder

Training
Evaluation

Page 3

Speech Recognition

Applications of Speech Recognition (ASR):
Dictation
Telephone-based information (directions, air travel, banking, etc.)
Hands-free (in car)
Speaker Identification
Language Identification
Second language ('L2') (accent reduction)
Audio archive searching

Page 4

LVCSR

Large Vocabulary Continuous Speech Recognition

~20,000-64,000 words
Speaker independent (vs. speaker-dependent)
Continuous speech (vs. isolated-word)

Page 5

Current error rates

Task                        Vocabulary   Error Rate (%)
Digits                      11           0.5
WSJ read speech             5K           3
WSJ read speech             20K          3
Broadcast news              64,000+      10
Conversational Telephone    64,000+      20

Ballpark numbers; exact numbers depend very much on the specific corpus

Page 6

HSR versus ASR

Conclusions:
Machines are about 5 times worse than humans
The gap increases with noisy speech
These numbers are rough; take them with a grain of salt

Task                Vocab   ASR    Human SR
Continuous digits   11      0.5    0.009
WSJ 1995 clean      5K      3      0.9
WSJ 1995 w/noise    5K      9      1.1
SWBD 2004           65K     20     4

Page 7

Why is conversational speech harder?

A piece of an utterance without context

The same utterance with more context

Page 8

LVCSR Design Intuition

• Build a statistical model of the speech-to-words process

• Collect lots and lots of speech, and transcribe all the words.

• Train the model on the labeled speech

• Paradigm: Supervised Machine Learning + Search

Page 9

Speech Recognition Architecture

Page 10

The Noisy Channel Model

Search through space of all possible sentences.

Pick the one that is most probable given the waveform.

Page 11

The Noisy Channel Model (II)

What is the most likely sentence out of all sentences in the language L given some acoustic input O?

Treat acoustic input O as sequence of individual observations O = o1,o2,o3,…,ot

Define a sentence as a sequence of words: W = w1,w2,w3,…,wn

Page 12

Noisy Channel Model (III)

Probabilistic implication: pick the sentence W with the highest probability:

\hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid O)

We can use Bayes' rule to rewrite this:

\hat{W} = \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}

Since the denominator is the same for every candidate sentence W, we can ignore it for the argmax:

\hat{W} = \operatorname*{argmax}_{W \in L} P(O \mid W)\,P(W)

Page 13

Noisy channel model

\hat{W} = \operatorname*{argmax}_{W \in L} \underbrace{P(O \mid W)}_{\text{likelihood}}\;\underbrace{P(W)}_{\text{prior}}

Page 14

The noisy channel model

Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)
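A minimal Python sketch of this idea, assuming hypothetical acoustic_model and language_model scoring functions that return log-probabilities (real systems search with a decoder over HMM states rather than enumerating sentences):

    import math

    def noisy_channel_decode(observations, candidate_sentences,
                             acoustic_model, language_model):
        """Pick the sentence W maximizing log P(O|W) + log P(W).
        acoustic_model(O, W) and language_model(W) are placeholders
        assumed to return log-probabilities."""
        best_sentence, best_score = None, -math.inf
        for sentence in candidate_sentences:
            score = acoustic_model(observations, sentence) + language_model(sentence)
            if score > best_score:
                best_sentence, best_score = sentence, score
        return best_sentence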

Page 15

Speech Architecture meets Noisy Channel

Page 16

Architecture: Five easy pieces (only 3-4 for today)

HMMs, Lexicons, and Pronunciation
Feature extraction
Acoustic Modeling
Decoding
Language Modeling (seen this already)

Page 17

Lexicon

A list of words
Each one with a pronunciation in terms of phones
We get these from an on-line pronunciation dictionary
CMU dictionary: 127K words

http://www.speech.cs.cmu.edu/cgi-bin/cmudict

We’ll represent the lexicon as an HMM
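As a toy illustration, a few lexicon entries in this style, written as a Python dictionary with ARPAbet phones (stress marks omitted; entries are for illustration only):

    # Each word maps to its pronunciation as a list of phones.
    lexicon = {
        "six":  ["s", "ih", "k", "s"],
        "one":  ["w", "ah", "n"],
        "two":  ["t", "uw"],
        "zero": ["z", "ih", "r", "ow"],
    }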

Page 18

HMMs for speech: the word “six”

Page 19

Phones are not homogeneous!

(Figure: spectrogram from 0.48 to 0.94 s, 0-5000 Hz, showing the phones [ay] and [k])

Page 20

Each phone has 3 subphones

Page 21

Resulting HMM word model for “six” with its subphones

Page 22

HMM for the digit recognition task

Page 23

Detecting Phones

Two stages:
Feature extraction
Basically a slice of a spectrogram
Phone classification
Using a GMM classifier

Page 24

Discrete Representation of Signal

Represent the continuous signal in discrete form.

Thanks to Bryan Pellom for this slide

Page 25

Digitizing the signal (A-D)

Sampling: measuring the amplitude of the signal at time t
16,000 Hz (samples/sec) for microphone ("wideband") speech
8,000 Hz (samples/sec) for telephone speech
Why?
– Need at least 2 samples per cycle
– The maximum measurable frequency is half the sampling rate
– Human speech is < 10,000 Hz, so we need at most 20 kHz
– Telephone speech is filtered at 4 kHz, so 8 kHz is enough

Page 26

Digitizing Speech (II)

Quantization: representing the real value of each amplitude as an integer
8-bit (-128 to 127) or 16-bit (-32768 to 32767)

Formats:
16-bit PCM
8-bit mu-law; log compression (see the sketch below)

Byte order: LSB (Intel) vs. MSB (Sun, Apple)

Headers:
Raw (no header)
Microsoft wav (40-byte header)
Sun .au
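A minimal sketch of the mu-law log compression mentioned above, assuming samples are normalized to [-1, 1] (mu = 255 for 8-bit telephony):

    import numpy as np

    def mu_law_compress(x, mu=255):
        """Mu-law companding: compresses the dynamic range so that 8 bits
        cover quiet and loud samples more evenly than linear quantization."""
        return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

    def mu_law_expand(y, mu=255):
        """Inverse mapping back to (approximately) linear amplitude."""
        return np.sign(y) * (np.power(1.0 + mu, np.abs(y)) - 1.0) / mu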

Page 27

Discrete Representation of Signals

Byte swapping: little-endian vs. big-endian

Some audio formats have headers
Headers contain meta-information such as sampling rate and recording conditions
A raw file has no header
Examples of headered formats: Microsoft wav, NIST sphere

A nice sound manipulation tool: sox
Change sampling rate
Convert speech formats
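For example (with hypothetical file names), the command sox input.wav -r 8000 out8k.wav resamples input.wav to 8,000 Hz, and sox input.au output.wav converts a Sun .au file to Microsoft wav.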

Page 28

MFCC: Mel-Frequency Cepstral Coefficients

Page 29

Pre-Emphasis

Pre-emphasis: boosting the energy in the high frequencies

Q: Why do this?
A: The spectrum for voiced segments has more energy at lower frequencies than at higher frequencies. This is called spectral tilt
Spectral tilt is caused by the nature of the glottal pulse
Boosting high-frequency energy gives more info to the Acoustic Model
Improves phone recognition performance
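A minimal sketch of a typical pre-emphasis filter, y[n] = x[n] - a*x[n-1], using the coefficient 0.97 given later in these slides:

    import numpy as np

    def pre_emphasis(signal, coeff=0.97):
        """Boost high-frequency energy with a first-order high-pass filter:
        y[n] = x[n] - coeff * x[n-1]."""
        return np.append(signal[0], signal[1:] - coeff * signal[:-1])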

Page 30

Example of pre-emphasis

Before and after pre-emphasis
Spectral slice from the vowel [aa]

Page 31

MFCC process: windowing

Page 32

Windowing

Why divide the speech signal into successive overlapping frames?
Speech is not a stationary signal; we want information about a small enough region that the spectral information is a useful cue.

Frames
Frame size: typically 10-25 ms
Frame shift: the length of time between successive frames, typically 5-10 ms
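A minimal sketch of the framing step, assuming a 1-D NumPy array of samples and the frame size/shift values above:

    import numpy as np

    def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
        """Slice the signal into overlapping frames; the tail is zero-padded
        so the last partial frame is kept."""
        frame_len = int(sample_rate * frame_ms / 1000)
        frame_shift = int(sample_rate * shift_ms / 1000)
        n_frames = 1 + max(0, int(np.ceil((len(signal) - frame_len) / frame_shift)))
        pad = n_frames * frame_shift + frame_len - len(signal)
        padded = np.append(signal, np.zeros(pad))
        return np.stack([padded[i * frame_shift: i * frame_shift + frame_len]
                         for i in range(n_frames)])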

Page 33

MFCC process: windowing

Page 34

Common window shapes

Rectangular window:

Hamming window
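The standard definitions, for a window of length L, are:

w[n] = \begin{cases} 1 & 0 \le n \le L-1 \\ 0 & \text{otherwise} \end{cases} \qquad \text{(rectangular)}

w[n] = \begin{cases} 0.54 - 0.46\cos\!\left(\frac{2\pi n}{L-1}\right) & 0 \le n \le L-1 \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Hamming)}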

Page 35

Discrete Fourier Transform

Input: Windowed signal x[n]…x[m]

Output: For each of N discrete frequency bands, a complex number X[k] representing the magnitude and phase of that frequency component in the original signal
This is the Discrete Fourier Transform (DFT)

Standard algorithm for computing the DFT: the Fast Fourier Transform (FFT), with complexity N log N
In general, choose N = 512 or 1024
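A minimal sketch of this step with NumPy (window one frame, take the FFT, keep the magnitude of each of the N/2 + 1 bands):

    import numpy as np

    def magnitude_spectrum(frame, n_fft=512):
        """Magnitude spectrum of one frame: apply a Hamming window,
        take the FFT, and discard phase."""
        windowed = frame * np.hamming(len(frame))
        return np.abs(np.fft.rfft(windowed, n=n_fft))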

Page 36

Discrete Fourier Transform computing a spectrum

A 25 ms Hamming-windowed signal from [iy]
And its spectrum as computed by the DFT (plus other smoothing)

Page 37

Mel-scale

Human hearing is not equally sensitive to all frequency bands

Less sensitive at higher frequencies, roughly > 1000 Hz

I.e. human perception of frequency is non-linear:

Page 38

Mel-scale

A mel is a unit of pitch
Pairs of sounds perceptually equidistant in pitch are separated by an equal number of mels

Mel-scale is approximately linear below 1 kHz and logarithmic above 1 kHz

Definition:
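A commonly used definition (equivalently 2595 log10(1 + f/700)):

\text{mel}(f) = 1127\,\ln\!\left(1 + \frac{f}{700}\right)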

Page 39

Mel Filter Bank Processing

Mel filter bank
Uniformly spaced below 1 kHz
Logarithmically spaced above 1 kHz

Page 40

Log energy computation

Log of the square magnitude of the output of the mel filterbank

Why log? The logarithm compresses the dynamic range of values
Human response to signal level is logarithmic
– Humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes
Makes frequency estimates less sensitive to slight variations in input (power variation due to the speaker's mouth moving closer to the mike)

Why square? Phase information is not helpful in speech

Page 41

The Cepstrum

One way to think about this: separating the source and the filter
The speech waveform is created by
A glottal source waveform
Which passes through a vocal tract that, because of its shape, has a particular filtering characteristic

Articulatory facts:
The vocal cord vibrations create harmonics
The mouth is an amplifier
Depending on the shape of the oral cavity, some harmonics are amplified more than others

Page 42

Vocal Fold Vibration

UCLA Phonetics Lab Demo

Page 43

George Miller figure

Page 44

We care about the filter not the source

Most characteristics of the source
F0
Details of the glottal pulse
don't matter for phone detection

What we care about is the filter
The exact position of the articulators in the oral tract

So we want a way to separate these
And use only the filter function

Page 45

The Cepstrum

The spectrum of the log of the spectrum

(Figure: the spectrum, the log spectrum, and the spectrum of the log spectrum)
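A minimal sketch of this computation for one frame (log magnitude spectrum, then inverse DFT):

    import numpy as np

    def real_cepstrum(frame, n_fft=512):
        """Cepstrum of one frame: log magnitude spectrum, then inverse DFT.
        Low-index coefficients describe the vocal-tract filter; the glottal
        source (F0) shows up as a peak at higher indices."""
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft))
        return np.fft.irfft(np.log(spectrum + 1e-10))  # small floor avoids log(0)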

Page 46

Thinking about the Cepstrum

Pictures from John Coleman (2005)

Page 47

Mel Frequency cepstrum

The cepstrum requires Fourier analysis
But we're going from frequency space back to time
So we actually apply the inverse DFT

Details for signal processing gurus: Since the log power spectrum is real and symmetric, inverse DFT reduces to a Discrete Cosine Transform (DCT)

Page 48

Another advantage of the Cepstrum

DCT produces highly uncorrelated features

We'll see when we get to acoustic modeling that these will be much easier to model than the spectrum
They can simply be modeled by linear combinations of Gaussian density functions with diagonal covariance matrices

In general we’ll just use the first 12 cepstral coefficients (we don’t want the later ones which have e.g. the F0 spike)
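Putting the last two slides together, a minimal sketch of this step, assuming log_mel_energies is a 1-D array of log filterbank outputs for one frame:

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_from_filterbank(log_mel_energies, num_ceps=12):
        """DCT of the log mel filterbank energies; keep the first 12
        coefficients and drop the higher ones (which carry F0 detail)."""
        return dct(np.asarray(log_mel_energies), type=2, norm='ortho')[:num_ceps]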

Page 49

Dynamic Cepstral Coefficients

The cepstral coefficients do not capture energy

So we add an energy feature

Also, we know that the speech signal is not constant (slope of formants, change from stop burst to release).

So we want to add the changes in features (the slopes).

We call these delta features

We also add double-delta acceleration features
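A minimal sketch of adding delta and double-delta features, assuming features is a (frames x coefficients) NumPy array and using a simple first-difference approximation:

    import numpy as np

    def add_deltas(features):
        """Append delta (first difference) and double-delta (second
        difference) features along the coefficient axis."""
        delta = np.diff(features, axis=0, prepend=features[:1])
        delta2 = np.diff(delta, axis=0, prepend=delta[:1])
        return np.concatenate([features, delta, delta2], axis=1)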

Page 50

Typical MFCC features

Window size: 25 ms
Window shift: 10 ms
Pre-emphasis coefficient: 0.97
MFCC:
12 MFCC (mel frequency cepstral coefficients)
1 energy feature
12 delta MFCC features
12 double-delta MFCC features
1 delta energy feature
1 double-delta energy feature

Total: 39-dimensional features

Page 51

Why is MFCC so popular?

Efficient to compute

Incorporates a perceptual Mel frequency scale

Separates the source and filter

IDFT (DCT) decorrelates the features
Improves the validity of the diagonal covariance assumption in HMM modeling

Alternative: PLP

Page 52

Next Time: Acoustic Modeling (= Phone detection)

Given a 39-dimensional vector corresponding to the observation of one frame oi

And given a phone q we want to detect

Compute p(oi | q)

Most popular method:
GMM (Gaussian mixture models)
Other methods:
Neural nets, CRFs, SVMs, etc.
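As a preview, a minimal sketch of the GMM case: computing log p(o|q) for one frame under a diagonal-covariance mixture whose weights, means, and variances are assumed to be already trained:

    import numpy as np

    def gmm_log_likelihood(o, weights, means, variances):
        """o: 39-dim feature vector; weights: (M,); means, variances: (M, 39).
        Returns log p(o | q) = log sum_m w_m * N(o; mu_m, diag(var_m))."""
        log_gauss = -0.5 * (np.log(2 * np.pi * variances)
                            + (o - means) ** 2 / variances).sum(axis=1)
        return np.logaddexp.reduce(np.log(weights) + log_gauss)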

Page 53

Summary

ASR Architecture
The Noisy Channel Model

Five easy pieces of an ASR system:
1) Language Model
2) Lexicon/Pronunciation Model (HMM)
3) Feature Extraction
4) Acoustic Model
5) Decoder

Training
Evaluation
