Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the...
![Page 1: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/1.jpg)
Features Extraction
![Page 2: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/2.jpg)
![Page 3: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/3.jpg)
9/26/2017 3
Why do we need feature extraction?
• The acoustic speech signal varies over time, so two waveforms cannot be compared directly.
Example: two instances of the /a:/ vowel spoken in isolation, with a time interval of less than 1 second between repetitions:
![Page 4: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/4.jpg)
What Are Features?
• Feature = a measure of a property of the speech waveform
• Reasons for feature extraction:
– Redundant and harmful information is removed
– Reduced computation time
– Easier modeling of the feature distribution
• Speech has many “natural” (acoustic-phonetic) features:
– Fundamental frequency (F0), formant frequencies, formant bandwidths, spectral tilt, intensity, phone durations, articulation, etc.
• Not-so-natural features:
– Cepstrum, linear predictive coefficients, line spectral frequencies, vocal tract area function, delta and double-delta coefficients, etc.
![Page 5: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/5.jpg)
[Figure: one utterance shown at two levels. Segmental: phone sequence ɐd ʃ m o ʃ ɐ j tʰ ɐ m l e n. Supra-segmental: duration profiles #1 and #2, amplitude profiles #1 and #2, pitch profiles #1 and #2.]
![Page 6: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/6.jpg)
Speech Events
Segmental Supra-segmental
![Page 7: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/7.jpg)
Supra-segmental features and Prosody
Intonation, pauses, duration, and stress are together called
prosodic or supra-segmental features and may be
considered the melody, rhythm, and emphasis of
speech at the perceptual level.
The prosody of a sentence is important for naturalness
and for conveying the correct meaning of a sentence.
![Page 8: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/8.jpg)
• Peaks denote dominant frequency components in the speech signal
• Peaks are referred to as formants
• Formants carry the identity of the sound
![Page 9: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/9.jpg)
Parameter / Feature Classification
Frequency Domain Parameters
• Filter Bank Analysis
• Short-term spectral analysis
• Cepstral Transform Coefficients (CC)
• Formant Parameters
• MFCC, Delta MFCC, Delta-Delta MFCC
Time Domain Parameters
• LPC
• Shape Parameters
Time-Frequency Domain Parameters
• Perceptual Linear Prediction (PLP)
• Wavelet Analysis
![Page 10: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/10.jpg)
Filter Bank Analysis
![Page 11: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/11.jpg)
Complete Filter Bank Analysis Model
![Page 12: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/12.jpg)
How to determine filter band ranges
• Uniform filter banks
• Log frequency banks
• Mel filter bands
![Page 13: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/13.jpg)
Uniform Filter Banks
• Uniform filter banks
– Bandwidth B = sampling frequency (Fs) / number of banks (N)
– For example, Fs = 10 kHz and N = 20 give B = 500 Hz
– Simple to implement but not very useful
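[Figure: uniform filter bank. Filters 1, 2, 3, 4, 5, ..., Q along the frequency axis with band edges at 500 Hz, 1 kHz, 1.5 kHz, 2 kHz, 2.5 kHz, 3 kHz, ...; filter outputs v1, v2, v3, ...]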
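The bandwidth rule above can be checked with a few lines of Python. This is a minimal sketch that follows the slide's rule B = Fs / N literally; the function name `uniform_bands` is made up for illustration.

```python
def uniform_bands(fs_hz, n_banks):
    """Return the bandwidth and the (low, high) edges of each band,
    using the slide's rule B = Fs / N."""
    bandwidth = fs_hz / n_banks
    edges = [(i * bandwidth, (i + 1) * bandwidth) for i in range(n_banks)]
    return bandwidth, edges

# Slide example: Fs = 10 kHz, N = 20
bandwidth, edges = uniform_bands(10_000, 20)
print(bandwidth)   # 500.0
print(edges[:3])   # [(0.0, 500.0), (500.0, 1000.0), (1000.0, 1500.0)]
```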
![Page 14: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/14.jpg)
Non-uniform filter banks: Log frequency
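• Log-frequency scale: close to the frequency resolution of the human ear

| | filter 1 | filter 2 | filter 3 | filter 4 |
|---|---|---|---|---|
| Center freq. (Hz) | 300 | 600 | 1200 | 2400 |
| Bandwidth (Hz) | 200 | 400 | 800 | 1600 |

[Figure: log-frequency filter bank. Band edges at 200, 400, 800, 1600, 3200 Hz along the frequency axis; filter outputs v1, v2, v3, ...]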
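The table values follow from doubling the band edges from one filter to the next. The sketch below reproduces them, assuming each band spans one octave starting at 200 Hz; the helper name `log_bands` is hypothetical.

```python
def log_bands(first_low_hz, n_filters):
    """Octave-wide bands: each band's upper edge is twice its lower edge."""
    bands = []
    low = first_low_hz
    for _ in range(n_filters):
        high = 2 * low
        bands.append({"low": low, "high": high,
                      "center": (low + high) / 2, "bandwidth": high - low})
        low = high
    return bands

bands = log_bands(200, 4)
centers = [b["center"] for b in bands]       # 300, 600, 1200, 2400
bandwidths = [b["bandwidth"] for b in bands]  # 200, 400, 800, 1600
```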
![Page 15: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/15.jpg)
Mel filter bands
• Frequencies below 1 kHz have narrower bands (spaced on a linear scale)
• Higher frequencies have wider bands (spaced on a log scale)
• More filters below 1 kHz
• Fewer filters above 1 kHz
[Figure: mel filter bands and filter outputs]
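To see how mel spacing widens the bands as frequency increases, the sketch below places filter centers uniformly on a mel axis. The conversion formula mel = 2595 log10(1 + f/700) is a commonly used approximation and is an assumption here, since the slides do not state one; the function names are illustrative.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Assumed mel formula (common approximation, not given on the slide).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_low_hz, f_high_hz):
    """Centers equally spaced in mel, returned in Hz."""
    mel_points = np.linspace(hz_to_mel(f_low_hz), hz_to_mel(f_high_hz), n_filters + 2)
    return mel_to_hz(mel_points)[1:-1]

centers_hz = mel_center_frequencies(20, 0.0, 4000.0)
spacing_hz = np.diff(centers_hz)   # grows with frequency: wider bands higher up
```

With this formula 1000 Hz maps to roughly 1000 mel, and the spacing between adjacent centers increases monotonically, which matches the narrow-below, wide-above picture on the slide.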
![Page 16: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/16.jpg)
$$s[n] = h[n] * e[n]$$
DFT of s[n]:
$$S[k] = H[k]\,E[k]$$
$$\log(S[k]) = \log(H[k]) + \log(E[k])$$
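A small numerical check of these three relations is sketched below. The sequences `h` and `e` are random stand-ins, not speech, and the log is taken on magnitudes so the comparison stays real-valued; the DFT length is chosen at least as long as the convolution so that circular and linear convolution agree.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(8)     # stand-in for the vocal-tract response h[n]
e = rng.standard_normal(32)    # stand-in for the excitation e[n]
s = np.convolve(h, e)          # s[n] = h[n] * e[n], length 39

n_fft = 64                     # >= len(s), so no circular wrap-around
S = np.fft.fft(s, n_fft)
H = np.fft.fft(h, n_fft)
E = np.fft.fft(e, n_fft)

# Convolution in time becomes a product in frequency ...
product_ok = np.allclose(S, H * E)
# ... and the product becomes a sum after the logarithm.
log_ok = np.allclose(np.log(np.abs(S)), np.log(np.abs(H)) + np.log(np.abs(E)))
```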
![Page 17: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/17.jpg)
What do we want to extract? The spectral envelope
• Formants and a smooth curve connecting them
• This smooth curve is referred to as the spectral envelope
![Page 18: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/18.jpg)
Cepstral analysis
• Homomorphic speech processing
– Speech is modelled as the output of a linear, time-varying system (linear time-invariant (LTI) over short segments) excited by either quasi-periodic pulses or random noise.
– The problem of speech analysis is to estimate the parameters of the speech model and to measure their variations with time.
– Since the excitation and impulse response of an LTI system are combined in a convolutional manner, the problem of speech analysis can also be viewed as a problem of separating the components of a convolution, called “deconvolution”.
![Page 19: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/19.jpg)
The principle of superposition for conventional linear systems:
If signals fall in non-overlapping frequency bands then they are separable
x[n] = x1[n] + x2[n]
X1(ω) = ℱ{x1[n]}, with X1(ω) confined to [0, π/2],
X2(ω) = ℱ{x2[n]}, with X2(ω) confined to [π/2, π]
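As an illustration of this separability, the sketch below adds two cosines that sit in the lower and upper half of the band and recovers each with an ideal frequency mask, which is a linear operation. The specific frequencies and the 256-sample length are arbitrary choices for the example.

```python
import numpy as np

N = 256
n = np.arange(N)
x1 = np.cos(2 * np.pi * 16 * n / N)   # omega = pi/8, inside [0, pi/2]
x2 = np.cos(2 * np.pi * 96 * n / N)   # omega = 3*pi/4, inside [pi/2, pi]
x = x1 + x2

X = np.fft.rfft(x)                    # bins 0..N/2 cover omega in [0, pi]
k_cut = N // 4                        # bin index of omega = pi/2

low = X.copy()
low[k_cut:] = 0                       # keep [0, pi/2)
high = X.copy()
high[:k_cut] = 0                      # keep [pi/2, pi]

x1_rec = np.fft.irfft(low, N)
x2_rec = np.fft.irfft(high, N)
```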
![Page 20: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/20.jpg)
Principles of Homomorphic Processing
The importance of homomorphic systems for speech processing lies in their capability of transforming nonlinearly combined signals to additively combined signals so that linear filtering can be performed on them.
Homomorphic systems can be expressed as a cascade of three homomorphic sub-systems referred to as the canonic representation:
[Figure: homomorphic system H in canonic form: x[n] → D∗ (stage I, ∗ to +) → x̂[n] → L (stage II, + to +) → ŷ[n] → D∗⁻¹ (stage III, + to ∗) → y[n]]
Homomorphic Systems for Convolution
![Page 21: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/21.jpg)
Canonic Representation of a Homomorphic System
[Figure: the three stages drawn separately: (I) x[n] → D∗ → x̂[n]; (II) x̂[n] → L → ŷ[n]; (III) ŷ[n] → D∗⁻¹ → y[n]]
I. The first system takes inputs combined by convolution and transforms them into additive outputs
II. The second system is a conventional linear system
III. The third system is the inverse of the first: it takes additive inputs and transforms them into convolution outputs
![Page 22: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/22.jpg)
The characteristic system for homomorphic deconvolution
![Page 23: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/23.jpg)
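Cepstral analysis
Observation: if two signals are convolved, their z-transforms multiply:
$$x[n] = x_1[n] * x_2[n] \;\Rightarrow\; X(z) = X_1(z)\,X_2(z)$$
taking the logarithm of X(z), then
$$\log X(z) = \log X_1(z) + \log X_2(z)$$
in the cepstral domain
$$\hat{x}[n] = \hat{x}_1[n] + \hat{x}_2[n]$$
• So, the two convolved signals are additive in the cepstral domain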
![Page 24: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/24.jpg)
![Page 25: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/25.jpg)
Cepstral analysis
Real cepstrum c[n] is the even part of $\hat{x}[n]$
![Page 26: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/26.jpg)
• Relationship of the complex cepstrum $\hat{x}[n]$ to the real cepstrum c[n]:
– If x[n] is real then:
• |X(ω)| is real and even, and thus log[|X(ω)|] is real and even
• ∠X(ω) is odd, and hence
$$\hat{x}[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log[X(\omega)]\, e^{j\omega n}\, d\omega$$
is referred to as the complex cepstrum.
• The even component of the complex cepstrum, c[n], is referred to as the real cepstrum:
$$c[n] = \frac{\hat{x}[n] + \hat{x}[-n]}{2}$$
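The real cepstrum can be computed without the phase term, directly from log|X(ω)|. The sketch below does this with a DFT and checks the even symmetry c[n] = c[−n] stated above (indices taken modulo the DFT length). The input is random noise used only as a test signal, and `real_cepstrum` is an illustrative name.

```python
import numpy as np

def real_cepstrum(x, n_fft=None):
    """Inverse DFT of the log magnitude spectrum of a real sequence."""
    X = np.fft.fft(x, n_fft)
    return np.fft.ifft(np.log(np.abs(X))).real

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
c = real_cepstrum(x, 128)

# c[n] should equal c[N - n], the DFT equivalent of c[n] = c[-n].
is_even = np.allclose(c[1:], c[1:][::-1])
```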
![Page 27: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/27.jpg)
Homomorphic Filtering
• In the cepstral domain:
– Pseudo-time → quefrency
– Low quefrency → slowly varying components
– High quefrency → fast varying components
• Removal of unwanted components (i.e., filtering) can be attempted in the cepstral domain (on the signal $\hat{x}[n]$, in which case filtering is referred to as liftering).
• When the complex cepstrum of h[n] resides in a quefrency interval shorter than a pitch period, the two components can be separated from each other.
![Page 28: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/28.jpg)
Homomorphic Filtering
• If log[X(ω)]
– is viewed as a “time signal”
– consisting of low-frequency and high-frequency contributions,
– then this signal can be separated with a high-pass/low-pass filter.
• One implementation of the low-pass filter:
[Figure: x[n] = h[n] ∗ p[n] → D∗ (∗ to +) → x̂[n] → lifter l[n] (+ to +) → ŷ[n] → D∗⁻¹ (+ to ∗) → y[n]]
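A sketch of low-quefrency liftering on a synthetic signal follows. Everything numeric here is an assumption made for illustration: the excitation p[n] is a decaying pulse train with a period of 80 samples, h[n] is a two-pole resonance, and the lifter cutoff is 30 samples. The real cepstrum is used instead of the complex cepstrum, so only the log-magnitude envelope is recovered.

```python
import numpy as np

P, M, beta = 80, 12, 0.7       # assumed pitch period, pulse count, pulse decay
r, theta = 0.9, 0.3 * np.pi    # assumed pole radius and angle of the resonance
a1, a2 = 2 * r * np.cos(theta), -r * r

# Excitation p[n]: one pulse every P samples, followed by room for the tail.
p = np.zeros(P * (M - 1) + 1 + 400)
p[: P * (M - 1) + 1 : P] = beta ** np.arange(M)

# x[n] = h[n] * p[n], computed recursively: x[n] = p[n] + a1 x[n-1] + a2 x[n-2]
x = np.zeros_like(p)
for n in range(len(p)):
    x[n] = p[n]
    if n >= 1:
        x[n] += a1 * x[n - 1]
    if n >= 2:
        x[n] += a2 * x[n - 2]

N = 2048
c = np.fft.ifft(np.log(np.abs(np.fft.fft(x, N)))).real   # real cepstrum

cutoff = 30                                # keep quefrencies |n| < cutoff
lifter = np.zeros(N)
lifter[:cutoff] = 1.0
lifter[-(cutoff - 1):] = 1.0               # mirrored negative quefrencies
envelope = np.fft.fft(c * lifter).real     # smoothed log-magnitude spectrum

# The excitation shows up as a peak at high quefrency, near the pitch period.
pitch_quefrency = cutoff + int(np.argmax(c[cutoff : N // 2]))

# Reference: log-magnitude response of the resonance alone.
true_log_h = -np.log(np.abs(np.fft.fft([1.0, -a1, -a2], N)))
```

The low-quefrency part approximates the log response of h[n], while the peak above the cutoff marks the pitch period, which is the separation the slide describes.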
![Page 29: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/29.jpg)
Homomorphic Filtering
• Alternate view of the “liftering” operation: a filtering operation L(ω) applied in the log-spectral domain
• Interchange of the time and frequency domains by viewing the frequency-domain signal log[X(ω)] as a time signal to be filtered. ⇒
– The “cepstrum” can be thought of as the spectrum of log[X(ω)]
– The time axis of $\hat{x}[n]$ is referred to as “quefrency”
– The filter l[n] is referred to as the “lifter”
[Figure: x[n] = h[n] ∗ p[n] → F → log → X̂(ω) → F⁻¹ → x̂[n] → l[n] → ŷ[n] → F → Ŷ(ω) → exp → F⁻¹ → y[n]; equivalently, L(ω) acts on X̂(ω) to give Ŷ(ω)]
![Page 30: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/30.jpg)
![Page 31: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/31.jpg)
Basic Speech Processing Steps for Frequency Parameters
Basic Signal Processing Block:
1. Signal: S(t) = x[n]
2. Pre-emphasis: y[n] = x[n] − α · x[n−1] (where α = 0.95)
3. Framing
4. Windowing: x_w(n) = x(n) · W(n), with one of the following windows W(n), 0 ≤ n ≤ N−1:
– Hamming: 0.54 − 0.46 cos(2πn/(N−1))
– Hanning: 0.5 (1 − cos(2πn/(N−1)))
– Cosine: sin(πn/(N−1))
5. DFT
6. Power Spectrum
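The block can be sketched in NumPy as follows. The pre-emphasis coefficient and the Hamming formula come from the slide; the frame length (400 samples), hop (160 samples), FFT size (512) and 16 kHz sampling rate are assumptions chosen for the example, and the function names are illustrative.

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    """y[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len, hop):
    """Slice x into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def hamming(N):
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def power_spectrum(frames, n_fft):
    """Window each frame, take the DFT, and square the magnitude."""
    windowed = frames * hamming(frames.shape[1])
    return np.abs(np.fft.rfft(windowed, n_fft)) ** 2

fs = 16_000                                   # assumed sampling rate
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)               # 1 s test tone instead of speech
frames = frame_signal(pre_emphasis(x), frame_len=400, hop=160)
spec = power_spectrum(frames, n_fft=512)      # shape: (n_frames, 257)
```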
![Page 32: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/32.jpg)
Cepstral Transform Coefficients (CC)
$$\text{Cepstrum} = \mathrm{IDFT}\big(\log\big(\mathrm{DFT}(S(n))\big)\big)$$
[Figure: Speech → Basic Signal Processing Block → Log → IDFT → Cepstrum]
![Page 33: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/33.jpg)
![Page 34: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/34.jpg)
![Page 35: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/35.jpg)
![Page 36: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/36.jpg)
LPC Cepstrum
The LPC vector is defined by [a0, a1, a2, ..., ap] and the CC vector is defined by [c0, c1, c2, ..., cp, ..., cn−1].
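One common way to obtain the CC vector from the LPC vector is a recursion on the coefficients. The sketch below assumes the all-pole model H(z) = 1 / (1 − Σ a_k z^−k) with unit gain, so that c0 = 0; other texts use the opposite sign for a_k, in which case the signs change accordingly. The function name and argument layout are illustrative.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """a[k-1] holds a_k for k = 1..p, for H(z) = 1 / (1 - sum_k a_k z^-k).
    Returns c[0..n_ceps]; c[0] = 0 because the gain is assumed to be 1."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        # c_n = a_n + sum_k (k/n) c_k a_{n-k}, using only terms with n-k <= p
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c

# Single pole at z = 0.5: the cepstrum is known in closed form, c_n = 0.5**n / n.
c = lpc_to_cepstrum([0.5], 6)
expected = np.array([0.0] + [0.5 ** n / n for n in range(1, 7)])
```

Note that the recursion yields more cepstral coefficients than there are LPC coefficients, which matches the vector [c0, ..., cp, ..., cn−1] above extending past index p.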
![Page 37: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/37.jpg)
Mel Frequency Cepstral Coefficients (MFCC)
MFCCs are the most widely used parameters in speech technology development.
MFCCs are computed from the speech signal using the following three steps:
1. Compute the FFT power spectrum of the speech signal
2. Apply a mel-spaced filter bank to the power spectrum to get energies
3. Compute the discrete cosine transform (DCT) of the log filter-bank energies to get uncorrelated MFCCs
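The three steps can be sketched for a single frame as below. This is an illustrative outline, not a reference implementation: the mel formula, the triangular filter shape, the 26 filters, the 13 output coefficients, the 512-point FFT and the unnormalized DCT-II are all assumptions, since the slides do not fix them.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)      # assumed mel formula

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centers equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):                   # rising slope
            fb[i, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):                   # falling slope
            fb[i, k] = (hi - k) / (hi - mid)
    return fb

def dct_ii(x, n_out):
    """Unnormalized DCT-II, first n_out coefficients."""
    n = len(x)
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    return (np.cos(np.pi * k * (2 * m + 1) / (2 * n)) * x[None, :]).sum(axis=1)

def mfcc_frame(frame, fs, n_fft=512, n_filters=26, n_ceps=13):
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2             # step 1
    energies = mel_filterbank(n_filters, n_fft, fs) @ power    # step 2
    log_e = np.log(np.maximum(energies, 1e-10))                # floor avoids log(0)
    return dct_ii(log_e, n_ceps)                               # step 3

fs = 16_000
frame = np.sin(2 * np.pi * 440 * np.arange(400) / fs) * np.hamming(400)
coeffs = mfcc_frame(frame, fs)
fb = mel_filterbank(26, 512, fs)
```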
![Page 38: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/38.jpg)
Basic Signal Processing Block
Block diagram of Extracting a sequence of 39-dimensional MFCC feature vectors
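A 39-dimensional vector is typically formed from 13 static coefficients plus their first and second time derivatives (delta and delta-delta). That composition is an assumption here, as the block diagram itself is not reproduced in the text. The sketch uses a regression-style delta over ±2 frames with edge padding; the `delta` helper and the window width are illustrative choices.

```python
import numpy as np

def delta(features, width=2):
    """Regression delta along time: sum_n n (c[t+n] - c[t-n]) / (2 sum_n n^2)."""
    features = np.asarray(features, dtype=float)
    T = len(features)
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features)
    for n in range(1, width + 1):
        out += n * (padded[width + n : width + n + T]
                    - padded[width - n : width - n + T])
    return out / denom

# Toy input: 50 frames of 13 coefficients that rise by 1 per frame.
static = np.tile(np.arange(50, dtype=float)[:, None], (1, 13))
d1 = delta(static)                      # delta
d2 = delta(d1)                          # delta-delta
full = np.hstack([static, d1, d2])      # shape (50, 39)
```

For the linear ramp used here the delta is 1 away from the edges, which is a quick sanity check that the formula estimates the slope.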
![Page 39: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/39.jpg)
![Page 40: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/40.jpg)
Mel Filter bank
![Page 41: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/41.jpg)
![Page 42: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/42.jpg)
$$\hat{S}(l) = \sum_{k=L_l}^{U_l} S(k)\, M_l(k)$$
M_l(k), the filter weighting function, can be normalized:
$$\hat{S}(l) = \frac{1}{M_l} \sum_{k=L_l}^{U_l} S(k)\, M_l(k), \qquad M_l = \sum_{k=L_l}^{U_l} M_l(k)$$
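The effect of the normalization can be seen on a flat spectrum: without it the band energy depends on the filter's total weight, with it the output equals the spectrum level. The sketch below uses a made-up triangular weighting function over bins 10 to 20; `band_energy` and its arguments are illustrative names.

```python
import numpy as np

def band_energy(S, M, lower, upper, normalize=True):
    """Weighted sum of S(k) over k = lower..upper, optionally divided by sum of M(k)."""
    k = np.arange(lower, upper + 1)
    weighted = float(np.sum(S[k] * M[k]))
    if normalize:
        weighted /= float(np.sum(M[k]))
    return weighted

S = np.full(64, 3.0)                                  # flat power spectrum
M = np.zeros(64)
M[10:21] = 1.0 - np.abs(np.arange(-5, 6)) / 6.0       # triangular weights, sum = 6

raw = band_energy(S, M, 10, 20, normalize=False)      # 18.0
norm = band_energy(S, M, 10, 20)                      # 3.0
```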
![Page 43: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/43.jpg)
![Page 44: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/44.jpg)
Why the DCT?
• The signal is real with mirror symmetry
• The IFFT requires complex arithmetic
• The DCT does NOT
• The DCT implements the same function as the FFT more efficiently by taking advantage of the redundancy in a real signal.
• The DCT is more efficient computationally
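The link between the two transforms can be checked numerically: the DCT-II of a real sequence equals, up to a known phase factor and a factor of 2, the DFT of that sequence extended with its mirror image. The sketch below verifies this for a random vector; the length 16 and the unnormalized DCT-II convention are assumptions of the example.

```python
import numpy as np

N = 16
rng = np.random.default_rng(0)
y = rng.standard_normal(N)            # stand-in for log filter-bank energies
k = np.arange(N)
n = np.arange(N)

# Direct DCT-II: real arithmetic only.
dct = (np.cos(np.pi * k[:, None] * (2 * n[None, :] + 1) / (2 * N)) * y).sum(axis=1)

# Same result via a 2N-point FFT of the mirror-symmetric extension.
z = np.concatenate([y, y[::-1]])
Z = np.fft.fft(z)[:N]
from_fft = 0.5 * np.exp(-1j * np.pi * k / (2 * N)) * Z
```

The FFT route works on twice as many points and in complex arithmetic, yet its imaginary part is zero, which is the redundancy the DCT avoids.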
![Page 45: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/45.jpg)
![Page 46: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/46.jpg)
Perceptual Linear Prediction
• PLP parameters are the coefficients that result from standard all-pole modeling, or linear predictive analysis, of a specially modified, short-term speech spectrum.
• In PLP the speech spectrum is modified by a set of transformations that are based on models of the human auditory system.
• The spectral resolution of human hearing is roughly linear up to 800 or 1000 Hz, but it decreases with increasing frequency above this linear range.
![Page 47: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/47.jpg)
Perceptually motivated analyses
• Critical-band spectral resolution: PLP incorporates critical-band spectral resolution into its spectrum estimate by remapping the frequency axis to the Bark scale and integrating the energy in the critical bands to produce a critical-band spectrum approximation.
• Equal-loudness pre-emphasis: At conversational speech levels, human hearing is more sensitive to the middle frequency range of the audible spectrum. PLP incorporates the effect of this phenomenon by multiplying the critical-band spectrum by an equal-loudness curve that suppresses both the low- and high-frequency regions relative to the midrange from 400 to 1200 Hz.
• Intensity-loudness power law: There is a nonlinear relationship between the intensity of sound and the perceived loudness. PLP approximates the power law of hearing by using a cube-root amplitude compression of the loudness-equalized critical-band spectrum estimate.
![Page 48: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/48.jpg)
Perceptual LPC (Hermansky, J. Acoust. Soc. Am., 1990)
• First, warp the spectrum to a Bark scale:
• The filters, Hb(k), are uniformly spaced in Bark frequency. Their amplitudes are scaled by the equal-loudness contour (an estimate of how loud each frequency sounds):
![Page 49: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/49.jpg)
Perceptual LPC
• Second, compute the cube root of the power spectrum:
– The cube root replaces the logarithm that would be used in MFCC
– The loudness of a tone is proportional to the cube root of its power
$$Y(b) = S(b)^{0.33}$$
• Third, inverse Fourier transform to find the “Perceptual Autocorrelation”:
![Page 50: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/50.jpg)
Perceptual LPC
• Fourth, use the Normal Equations to find the Perceptual LPC (PLP) coefficients:
• Fifth, use the LPC cepstral recursion to find the Perceptual LPC Cepstrum (PLPCC):
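Steps two to four can be sketched as follows, under the assumption that the input is a one-sided power spectrum that has already been Bark-warped and equal-loudness weighted (step one is not implemented here). The predictor convention x̂[n] = Σ a_k x[n−k] and the function name `plp_like_lpc` are choices made for this example.

```python
import numpy as np

def plp_like_lpc(power_spectrum, order, exponent=0.33):
    """Compress the spectrum, invert it to an autocorrelation, and solve
    the normal equations R a = r for the predictor coefficients a_1..a_p."""
    compressed = np.asarray(power_spectrum, dtype=float) ** exponent   # step 2
    r = np.fft.irfft(compressed)                                       # step 3
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])                        # step 4

# Sanity check without compression: the spectrum of a first-order all-pole
# model with pole 0.5 should give back a_1 = 0.5 and a_2 = 0.
omega = np.linspace(0.0, np.pi, 257)          # one-sided grid for a 512-point DFT
spectrum = 1.0 / np.abs(1.0 - 0.5 * np.exp(-1j * omega)) ** 2
a_plain = plp_like_lpc(spectrum, order=2, exponent=1.0)
a_cube = plp_like_lpc(spectrum, order=2)      # with cube-root compression
```

With the cube root applied the coefficients differ from the plain LPC solution, because the model is fitted to the compressed spectrum rather than to the original one.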
![Page 51: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/51.jpg)
![Page 52: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/52.jpg)
RASTA (RelAtive SpecTrAl)
• The rate of change of nonlinguistic components of speech and background noise environments often lies outside the typical rate of change of vocal-tract shapes in conversational speech
• Hearing is relatively insensitive to slowly varying stimuli
• The basic idea of RASTA filtering is to exploit these phenomena by suppressing constant and slowly varying elements in each spectral component of the short-term auditory-like spectrum prior to computation of the linear prediction coefficients
![Page 53: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/53.jpg)
RASTA (RelAtive SpecTrAl) (Hermansky, IEEE Trans. Speech and Audio Proc., 1994)
• Modulation-filtering of the cepstrum is equivalent to modulation-filtering of the log spectrum:
$$c_t^{*}[m] = \sum_k h_k\, c_{t-k}[m]$$
• RASTA is a particular kind of modulation filter:
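A minimal sketch of modulation filtering, applied along the time index t of each cepstral coefficient, is given below. The FIR taps are illustrative: they sum to zero so that constant components are removed, but they are not taken from the slide, and the recursive (pole) part that a RASTA filter also contains is left out.

```python
import numpy as np

def modulation_filter(cepstra, h):
    """cepstra: array of shape (T, M), one row per frame t.
    Returns c*_t[m] = sum_k h[k] * c_{t-k}[m], treating frames before t = 0 as zero."""
    cepstra = np.asarray(cepstra, dtype=float)
    T = cepstra.shape[0]
    out = np.zeros_like(cepstra)
    for k, hk in enumerate(h):
        out[k:] += hk * cepstra[: T - k]
    return out

h = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])   # illustrative zero-sum taps

constant = np.full((30, 3), 5.0)                  # e.g. a fixed channel offset
ramp = np.tile(np.arange(30.0)[:, None], (1, 3))  # trajectory with slope 1 per frame

filtered_constant = modulation_filter(constant, h)
filtered_ramp = modulation_filter(ramp, h)
```

After the first few frames the constant trajectory is suppressed entirely, while the ramp produces a constant output equal to its slope, so slowly varying offsets are attenuated and changes over time are kept.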
![Page 54: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/54.jpg)
Time Domain Methods in Speech Processing
![Page 55: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/55.jpg)
Fundamental Assumptions
• Properties of the speech signal change relatively slowly with time (5-10 sounds per second)
• Uncertainty in short/long time measurements and estimates
– Over very short (5-20 ms) intervals
• Uncertainty due to small amount of data, varying pitch and amplitude
– Over medium-length intervals (20-100 ms)
• Uncertainty due to changes in sound quality, transitions between sounds, rapid transients in speech
– Over long intervals (100-500 ms)
• Uncertainty due to large amount of sound changes
![Page 56: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/56.jpg)
![Page 57: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/57.jpg)
![Page 58: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/58.jpg)
Time-domain processing
• Time-domain parameters
– Short-time energy
– Short-time average magnitude
– Short-time zero crossing rate
– Short-time autocorrelation
– Short-time average magnitude difference
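Three of these parameters are sketched below for a single frame. The definitions used (sum of squares, sum of absolute values, and mean absolute difference at each lag) are standard textbook forms assumed here, since the slides list only the names; the test signal is a sine with a period of 50 samples.

```python
import numpy as np

def short_time_energy(frame):
    return float(np.sum(frame ** 2))

def short_time_magnitude(frame):
    return float(np.sum(np.abs(frame)))

def amdf(frame, max_lag):
    """Average magnitude difference for lags 1..max_lag."""
    N = len(frame)
    return np.array([np.mean(np.abs(frame[: N - k] - frame[k:]))
                     for k in range(1, max_lag + 1)])

frame = np.sin(2 * np.pi * np.arange(400) / 50)   # 8 full periods of 50 samples
energy = short_time_energy(frame)                 # 400 / 2 = 200 for a unit sine
magnitude = short_time_magnitude(frame)
d = amdf(frame, max_lag=80)
period = int(np.argmin(d)) + 1                    # lag with the deepest AMDF dip
```

The AMDF dips to nearly zero when the lag matches the period, which is why it is used as a cheaper alternative to autocorrelation for pitch estimation.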
![Page 59: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/59.jpg)
![Page 60: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/60.jpg)
![Page 61: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/61.jpg)
![Page 62: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/62.jpg)
![Page 63: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/63.jpg)
![Page 64: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/64.jpg)
Zero Crossing
• The number of times unvoiced speech crosses the zero line is significantly higher than that of voiced speech.
• Gender of the speaker can also have an effect on zero crossing.
• Small pitch weighting can be used to weight the decision threshold.
![Page 65: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/65.jpg)
![Page 66: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/66.jpg)
![Page 67: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/67.jpg)
![Page 68: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/68.jpg)
![Page 69: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/69.jpg)
![Page 70: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/70.jpg)
![Page 71: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/71.jpg)
![Page 72: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/72.jpg)
Autocorrelation is the cross-correlation of a signal with itself.
The maximum similarity occurs at a time shift of zero. In theory, another maximum occurs when the time shift of the signal equals the fundamental period.
Autocorrelation Technique
![Page 73: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/73.jpg)
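By definition, the autocorrelation is

$$R[k] = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{n=-N}^{N} x[n] \cdot x[n+k], \quad 0 \le k \le K_0$$

Properties of autocorrelation:
1. $R[k] = R[-k]$
2. $R[k]$ is maximum at $k = 0$

For a segment of $N$ samples:

$$R[k] = \frac{1}{N} \sum_{n=0}^{N-k-1} x[n] \cdot x[n+k], \quad 0 \le k \le K_0$$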
Ch4. pitch, v3.a 20
Autocorrelation function
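The two properties can be checked numerically with a small sketch of the finite-segment estimate, R[k] = (1/N) · Σ x[n]·x[n+k] over the samples where both indices fall inside the segment. The test signal (100 Hz plus a weaker 300 Hz component at an assumed 8 kHz sampling rate), the lag limit `max_lag` standing in for K0, and the helper name `autocorr` are all choices made for this illustration.

```python
import math

def autocorr(x, k):
    """R[k] = (1/N) * sum of x[n] * x[n+k] over all n with both indices valid."""
    n_total = len(x)
    lo, hi = max(0, -k), min(n_total, n_total - k)
    return sum(x[n] * x[n + k] for n in range(lo, hi)) / n_total

fs = 8000   # assumed sampling rate in Hz
x = [math.sin(2 * math.pi * 100 * n / fs)
     + 0.3 * math.sin(2 * math.pi * 300 * n / fs)
     for n in range(400)]

max_lag = 100   # plays the role of K0
lags = range(-max_lag, max_lag + 1)
r = {k: autocorr(x, k) for k in lags}

# Property 1: R[k] = R[-k]
is_symmetric = all(math.isclose(r[k], r[-k]) for k in range(max_lag + 1))
# Property 2: R[k] is maximum at k = 0
peak_lag = max(lags, key=lambda k: r[k])
print(is_symmetric, peak_lag)
```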
![Page 74: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/74.jpg)
When a segment of a signal is correlated with itself, the distance (Lag_time_in_samples) between the positions of the maximum and the second maximum is defined as the fundamental period (pitch) of the signal.
Figure labels: T0, T0, T0 (the fundamental period, marked between successive peaks)
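A minimal sketch of pitch estimation along these lines is given below. It locates the second maximum by searching only lags that correspond to a plausible pitch range, which skips the peak at lag zero. The 70 to 400 Hz search range, the 8 kHz sampling rate, the synthetic 125 Hz test signal and the names `autocorr` and `pitch_by_autocorr` are assumptions made for this example, not values from the slides.

```python
import math

def autocorr(x, k):
    """Finite-segment estimate R[k] = (1/N) * sum_{n=0}^{N-k-1} x[n] * x[n+k]."""
    n_total = len(x)
    return sum(x[n] * x[n + k] for n in range(n_total - k)) / n_total

def pitch_by_autocorr(x, fs, f_min=70.0, f_max=400.0):
    """Return the pitch in Hz from the lag of the largest R[k] in the search range."""
    k_min = int(fs / f_max)   # shortest period considered, in samples
    k_max = int(fs / f_min)   # longest period considered, in samples
    best_lag = max(range(k_min, k_max + 1), key=lambda k: autocorr(x, k))
    return fs / best_lag

fs = 8000    # assumed sampling rate in Hz
f0 = 125.0   # true pitch of the test signal (period T0 = 64 samples)
x = [math.sin(2 * math.pi * f0 * n / fs)
     + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / fs)
     for n in range(512)]

estimate = pitch_by_autocorr(x, fs)
print(estimate)
```

Because the lag is an integer number of samples, the estimate is quantized to fs divided by an integer; finer resolution would need interpolation around the peak.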
![Page 75: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/75.jpg)
It is an alternative to the autocorrelation function. It computes the difference between the signal and a time-shifted version of itself.
While autocorrelation has peaks at maximum similarity, the average magnitude difference function has valleys there.
Average Magnitude Difference Function (AMDF)
$$D[k] = \frac{1}{N} \sum_{n=0}^{N-k-1} \left| x[n] - x[n+k] \right|, \quad 0 \le k \le K_0$$
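As a rough sketch, the AMDF D[k] = (1/N) · Σ |x[n] − x[n+k]| can be used for pitch estimation by looking for the deepest valley instead of the highest peak. The 70 to 400 Hz search range, the 8 kHz sampling rate, the synthetic test signal and the names `amdf` and `pitch_by_amdf` are assumptions made for this illustration.

```python
import math

def amdf(x, k):
    """D[k] = (1/N) * sum_{n=0}^{N-k-1} |x[n] - x[n+k]|."""
    n_total = len(x)
    return sum(abs(x[n] - x[n + k]) for n in range(n_total - k)) / n_total

def pitch_by_amdf(x, fs, f_min=70.0, f_max=400.0):
    """Return the pitch in Hz from the lag of the deepest AMDF valley."""
    k_min = int(fs / f_max)   # shortest period considered, in samples
    k_max = int(fs / f_min)   # longest period considered, in samples
    best_lag = min(range(k_min, k_max + 1), key=lambda k: amdf(x, k))
    return fs / best_lag

fs = 8000    # assumed sampling rate in Hz
f0 = 125.0   # true pitch of the test signal (period of 64 samples)
x = [math.sin(2 * math.pi * f0 * n / fs)
     + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / fs)
     for n in range(512)]

estimate = pitch_by_amdf(x, fs)
print(estimate)
```

For a perfectly periodic signal the valley at two periods is as deep as the one at a single period, so the search range here is chosen to contain only one period of the test signal. The AMDF needs only subtractions and absolute values, with no multiplications inside the sum.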
![Page 76: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/76.jpg)
![Page 77: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/77.jpg)
Speech/Non-speech Detection
![Page 78: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/78.jpg)
![Page 79: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/79.jpg)
![Page 80: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/80.jpg)
![Page 81: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/81.jpg)
![Page 82: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/82.jpg)
![Page 83: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/83.jpg)
![Page 84: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/84.jpg)
![Page 85: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/85.jpg)
![Page 86: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/86.jpg)
![Page 87: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/87.jpg)
Frequency-domain Processing
• Spectrogram – short-time Fourier analysis
– a two-dimensional waveform (amplitude/time) is converted into a three-dimensional pattern (amplitude/frequency/time)
• Wideband spectrogram:
– analyzed on 15 ms sections of the waveform with a step of 1 ms
– voiced regions show vertical striations due to the periodicity of the time waveform (each vertical line represents a pulse of the vocal folds), while unvoiced regions are solid/random, or ‘snowy’
• Narrowband spectrogram:
– analyzed on 50 ms sections of the waveform with a step of 1 ms
– the pitch harmonics of voiced intervals appear as horizontal lines
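The two analysis settings can be compared with a small short-time Fourier sketch. It uses only the standard library and a direct DFT, so it is slow and meant for illustration only. The Hamming window, the 8 kHz sampling rate, the 1000 Hz test tone and the function name `spectrogram` are assumptions; the slides specify only the section lengths (15 ms and 50 ms) and the 1 ms step.

```python
import cmath
import math

def spectrogram(x, fs, window_ms, step_ms=1.0):
    """Return (frames, bin_spacing_hz); each frame holds |X[m]| for m = 0..N/2."""
    n_win = round(fs * window_ms / 1000.0)    # samples per analysis section
    n_step = round(fs * step_ms / 1000.0)     # samples between sections
    # Hamming window (an assumed choice; the slides do not name one).
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_win - 1))
              for n in range(n_win)]
    # Table of exp(-j*2*pi*i/N), indexed by (m*n) mod N in the DFT below.
    twiddle = [cmath.exp(-2j * math.pi * i / n_win) for i in range(n_win)]
    frames = []
    for start in range(0, len(x) - n_win + 1, n_step):
        seg = [x[start + n] * window[n] for n in range(n_win)]
        frames.append([
            abs(sum(seg[n] * twiddle[(m * n) % n_win] for n in range(n_win)))
            for m in range(n_win // 2 + 1)
        ])
    return frames, fs / n_win

fs = 8000   # assumed sampling rate in Hz
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(480)]   # 60 ms tone

wide, wide_df = spectrogram(x, fs, window_ms=15.0)       # wideband setting
narrow, narrow_df = spectrogram(x, fs, window_ms=50.0)   # narrowband setting

wide_peak = max(range(len(wide[0])), key=lambda m: wide[0][m])
narrow_peak = max(range(len(narrow[0])), key=lambda m: narrow[0][m])
print(len(wide), wide_df, wide_peak * wide_df)
print(len(narrow), narrow_df, narrow_peak * narrow_df)
```

The longer section gives a finer frequency spacing (20 Hz against roughly 67 Hz here) at the cost of time resolution, which is the trade-off behind the horizontal pitch lines in the narrowband display and the vertical striations in the wideband one.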
![Page 88: Features Extraction · 2020. 6. 9. · – Reduced computation time – Easier modeling of the feature distribution • Speech has many “natural” (Acoustic-phonetic) features:](https://reader036.fdocuments.net/reader036/viewer/2022070223/6145ba8f07bb162e665fdf7e/html5/thumbnails/88.jpg)
Figure: waveform, wideband spectrogram and narrowband spectrogram, with formants F1, F2 and F3 marked
Frequency-domain Processing