Speech Enhancement EE 516 Spring 2009
![Page 1: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/1.jpg)
Speech Enhancement
EE 516 Spring 2009
Alex Acero
![Page 2: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/2.jpg)
Outline
• A model of the acoustical environment
• Simple things first!
• Microphones
• Echo cancellation
• Microphone arrays
• Single channel noise suppression
![Page 3: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/3.jpg)
Additive noise
• Stationary noise: properties don’t change over time
– White noise x[n]
• Flat power spectrum: $S_{xx}(f) = q$
• Samples are uncorrelated: $R_{xx}[n] = q\,\delta[n]$
– White Gaussian noise
• Pdf is Gaussian (see chapter 10)
– Typical noise is colored
• Pink noise: low-pass in nature
• Non-stationary: properties change over time
– Babble noise
– Cocktail party effect
![Page 4: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/4.jpg)
Reverberation
• Impulse response of an average office
[Figure: room impulse response versus time (samples), 0 to 2000 samples]
$h[n] = \sum_{k=0}^{\infty} \frac{\rho^k}{r_k}\,\delta[n - T_k] = \sum_{k=0}^{\infty} \frac{\rho^k}{c\,T_k}\,\delta[n - T_k]$
![Page 5: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/5.jpg)
Model of the Environment
[Figure: block diagram in which x[m] passes through the filter h[m] and the noise n[m] is added to give y[m]]
$y[m] = x[m] * h[m] + n[m]$
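The model y[m] = x[m] * h[m] + n[m] can be sketched in a few lines. This is only an illustration: the three-tap channel and the helper names `convolve` and `corrupt` are assumptions made for the example, not part of the slides.

```python
import random

def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y

def corrupt(x, h, noise_std, seed=0):
    """Return y[m] = x[m] * h[m] + n[m], with white Gaussian n[m]."""
    rng = random.Random(seed)
    filtered = convolve(x, h)
    return [v + rng.gauss(0.0, noise_std) for v in filtered]

x = [1.0, 0.0, 0.0, 0.0]   # unit impulse standing in for clean speech
h = [1.0, 0.5, 0.25]       # hypothetical three-tap channel
y_clean = corrupt(x, h, noise_std=0.0)
y_noisy = corrupt(x, h, noise_std=0.1)
```

With `noise_std=0.0` the output is just the channel impulse response, which makes the convolution part easy to check in isolation.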
![Page 6: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/6.jpg)
Outline
• A model of the acoustical environment
• Simple things first!
• Microphones
• Echo cancellation
• Microphone arrays
• Single channel noise suppression
![Page 7: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/7.jpg)
Cepstral Mean Normalization
Compute mean of cepstrum
$\bar{\mathbf{x}} = \frac{1}{T}\sum_{t=0}^{T-1} \mathbf{x}_t$
And subtract it from input
$\hat{\mathbf{x}}_t = \mathbf{x}_t - \bar{\mathbf{x}}$
CMN is robust to channel distortion
Normalizes average vocal tract or short filters
Average must include > 2 sec of speech
[Figure: word error rate (%) versus SNR (dB) at 10, 15, 20 and 30 dB; curves: No CMN, CMN-2]
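A minimal sketch of cepstral mean normalization, assuming the cepstra are stored as a list of frames (each frame a list of coefficients). The function name and the toy data are illustrative only. A constant offset added to every frame, which is how a short channel filter appears in the cepstral domain, is removed by the mean subtraction.

```python
def cepstral_mean_normalization(frames):
    """Subtract the per-coefficient mean over all T frames."""
    T = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / T for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

frames = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
channel_offset = [0.5, -1.0]   # hypothetical constant channel term
shifted = [[c + o for c, o in zip(f, channel_offset)] for f in frames]

normalized = cepstral_mean_normalization(frames)
normalized_shifted = cepstral_mean_normalization(shifted)
```

Both `normalized` and `normalized_shifted` hold the same values up to rounding, since the offset only changes the mean.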
![Page 8: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/8.jpg)
RASTA
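• CMN subtracts the output of a low-pass filter with rectangular window (the mean)
$\hat{\mathbf{x}}_t = \mathbf{x}_t - \frac{1}{T}\sum_{\tau=0}^{T-1} \mathbf{x}_\tau$
• Can use other low-pass filters too
• RASTA filter is band-pass
$H(z) = 0.1\, z^{4}\, \frac{2 + z^{-1} - z^{-3} - 2z^{-4}}{1 - 0.98\, z^{-1}}$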
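The RASTA transfer function H(z) = 0.1 z^4 (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1) can be run as a difference equation over the trajectory of one cepstral coefficient. The sketch below is a causal variant that drops the z^4 advance, so its output lags by four frames; that simplification and the function name are assumptions of this example.

```python
def rasta_filter(trajectory):
    """Causal RASTA filtering of one coefficient trajectory.

    y[t] = 0.98*y[t-1] + 0.1*(2x[t] + x[t-1] - x[t-3] - 2x[t-4])
    """
    b = [0.2, 0.1, 0.0, -0.1, -0.2]   # numerator taps: 0.1 * (2, 1, 0, -1, -2)
    out = []
    prev = 0.0
    for t in range(len(trajectory)):
        acc = 0.98 * prev
        for k, bk in enumerate(b):
            if t - k >= 0:            # samples before the start are taken as zero
                acc += bk * trajectory[t - k]
        out.append(acc)
        prev = acc
    return out

constant_offset = [1.0] * 600         # a constant channel term in the cepstrum
filtered = rasta_filter(constant_offset)
```

Because the numerator taps sum to zero, a constant input decays towards zero after the initial transient, which is the band-pass behaviour that removes slowly varying channel effects.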
![Page 9: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/9.jpg)
Retrain with noisy data
• Mismatches between training and testing are bad for pattern recognition systems
• Retrain with noisy data
• Approximation: add noise to clean data and retrain
[Figure: word error rate (%) versus SNR (dB), 0 to 30 dB; curves: Mismatched, Matched (Noisy)]
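Adding noise to clean data at a chosen SNR can be sketched as below. The helper names, the sinusoid standing in for speech, and the 10 dB target are assumptions made for the example.

```python
import math
import random

def mean_power(signal):
    """Average power of a finite sequence."""
    return sum(s * s for s in signal) / len(signal)

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio is snr_db, then add it."""
    target_noise_power = mean_power(clean) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [c + gain * v for c, v in zip(clean, noise)]

rng = random.Random(0)
clean = [math.sin(2 * math.pi * 0.01 * n) for n in range(1000)]   # stand-in for speech
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy = add_noise_at_snr(clean, noise, snr_db=10.0)

# Check the SNR actually obtained from the added component.
residual = [y - c for y, c in zip(noisy, clean)]
measured_snr_db = 10.0 * math.log10(mean_power(clean) / mean_power(residual))
```

Repeating this over several noise recordings and SNR values would give the kind of artificially corrupted training set the slide refers to.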
![Page 10: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/10.jpg)
Multi-condition training
• Very hard to predict exactly the type of noise we’ll encounter at test time
• Too expensive to retrain the system for each noise condition
• Train system offline with several noise types and levels
[Figure: word error rate (%) versus SNR (dB), 5 to 30 dB; curves: Matched Noise, Multistyle]
![Page 11: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/11.jpg)
Outline
• A model of the acoustical environment
• Simple things first!
• Microphones
• Echo cancellation
• Microphone arrays
• Single channel noise suppression
![Page 12: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/12.jpg)
Condenser Microphone
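[Figure: condenser microphone (labels b, b, h) and its equivalent circuit: source v(t) with impedance Z_M, load R_L, and preamplifier with gain G]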
![Page 13: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/13.jpg)
Omnidirectional microphones
• Polar response
[Figure: polar response plot; diagram labels: Diaphragm, Mic opening]
![Page 14: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/14.jpg)
Bidirectional microphones
[Figure: bidirectional microphone with the speech sound wave from the front and the noise sound wave from the side; geometry with the source at distance r and points (d, 0) and (–d, 0) at distances r1 and r2; polar response plot]
![Page 15: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/15.jpg)
Bidirectional microphones
• Bidirectional microphone with d = 1 cm at 0°
• Solid line corresponds to far field conditions (d/r = 0.02) and the dotted line to near field conditions (d/r = 0.5)
[Figure: difference in air pressure (dB) versus frequency (Hz), 100 Hz to 10 kHz]
![Page 16: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/16.jpg)
Unidirectional microphones
[Figure: unidirectional microphone polar response, with the speech sound wave from the front and the noise sound wave from the side]
![Page 17: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/17.jpg)
Dynamic microphones
[Figure: dynamic microphone with diaphragm, coil, magnet, and output voltage]
![Page 18: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/18.jpg)
Outline
• A model of the acoustical environment
• Simple things first!
• Microphones
• Echo cancellation
• Microphone arrays
• Single channel noise suppression
![Page 19: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/19.jpg)
Acoustic Echo cancellation
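$ERLE\,(\mathrm{dB}) = 10\log_{10}\frac{E\{d^2[n]\}}{E\{(d[n]-\hat{d}[n])^2\}}$
[Figure: block diagram of acoustic echo cancellation with an adaptive filter in parallel with the acoustic path H; labels: loudspeaker x[n], r[n], speech signal s[n], local noise v[n], microphone d[n], estimate $\hat{d}[n]$, output e[n]]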
![Page 20: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/20.jpg)
Line echo cancellation
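$ERLE\,(\mathrm{dB}) = 10\log_{10}\frac{E\{d^2[n]\}}{E\{(d[n]-\hat{d}[n])^2\}}$
[Figure: block diagram of line echo cancellation with an adaptive filter in parallel with the hybrid circuit H; labels: Speaker A x[n], r[n], Speaker B s[n], noise v[n], d[n], estimate $\hat{d}[n]$, output e[n]]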
![Page 21: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/21.jpg)
Least Mean Squares (LMS)
• Given input
$\mathbf{X}[n] = \{x[n], x[n-1], \ldots, x[n-L+1]\}$
• Estimate output
$y[n] = \sum_{k=0}^{L-1} w_k[n]\, x[n-k] = \mathbf{W}^T[n]\,\mathbf{X}[n]$
• Compute error
$e[n] = d[n] - y[n]$
• Update filter
$\mathbf{W}[n+1] = \mathbf{W}[n] + \mu\, e[n]\,\mathbf{X}[n]$
• Need to tune step size $\mu$
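The four LMS steps (form the input vector, estimate the output, compute the error, update the weights) can be sketched as follows. The three-tap "unknown" system, the step size of 0.01, and the variable names are assumptions chosen for this illustration.

```python
import random

def lms(x, d, L, step_size):
    """LMS adaptive filter: returns final weights and the error signal."""
    w = [0.0] * L
    errors = []
    for n in range(len(x)):
        # Input vector X[n] = {x[n], x[n-1], ..., x[n-L+1]}, zero before the start.
        X = [x[n - k] if n - k >= 0 else 0.0 for k in range(L)]
        y = sum(wk * xk for wk, xk in zip(w, X))                 # estimate output
        e = d[n] - y                                             # compute error
        w = [wk + step_size * e * xk for wk, xk in zip(w, X)]    # update filter
        errors.append(e)
    return w, errors

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
true_h = [0.6, -0.3, 0.1]   # hypothetical system to identify
d = [sum(true_h[k] * x[n - k] for k in range(3) if n - k >= 0)
     for n in range(len(x))]

w, errors = lms(x, d, L=3, step_size=0.01)
```

In this noise-free setting the weights approach `true_h`. A larger step size converges faster but becomes unstable once it is too large for the input power, which is why the step size needs tuning.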
![Page 22: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/22.jpg)
Normalized LMS
• Make step size adaptive to ensure convergence
$\mu[n] = \frac{\mu}{L\,\hat{\sigma}_x^2[n]}$
• Where we track the input energy
$\hat{\sigma}_x^2[n] = (1-\beta)\,\hat{\sigma}_x^2[n-1] + \beta\, x^2[n]$
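A sketch of normalized LMS used as an echo canceller, with the step size divided by L times a recursively tracked input energy, and the result scored with ERLE. The echo path, the noise level, the constants (0.2 for the base step, 0.05 for the energy smoothing) and the small `eps` guard against division by zero are all assumptions of this example.

```python
import math
import random

def nlms(x, d, L, base_step=0.2, beta=0.05, eps=1e-8):
    """Normalized LMS: returns final weights and the echo estimate d_hat."""
    w = [0.0] * L
    sigma2 = 0.0
    d_hat = []
    for n in range(len(x)):
        X = [x[n - k] if n - k >= 0 else 0.0 for k in range(L)]
        sigma2 = (1.0 - beta) * sigma2 + beta * x[n] ** 2   # track input energy
        y = sum(wk * xk for wk, xk in zip(w, X))
        e = d[n] - y
        step = base_step / (L * sigma2 + eps)               # adaptive step size
        w = [wk + step * e * xk for wk, xk in zip(w, X)]
        d_hat.append(y)
    return w, d_hat

def erle_db(d, d_hat):
    """ERLE (dB) = 10 log10( E{d^2} / E{(d - d_hat)^2} ), using sample averages."""
    num = sum(v * v for v in d)
    den = sum((v - vh) ** 2 for v, vh in zip(d, d_hat))
    return 10.0 * math.log10(num / den)

rng = random.Random(1)
x = [10.0 * rng.gauss(0.0, 1.0) for _ in range(5000)]   # loud far-end signal
echo_path = [0.6, -0.3, 0.1]                            # hypothetical echo path
d = [sum(echo_path[k] * x[n - k] for k in range(3) if n - k >= 0)
     + rng.gauss(0.0, 0.01)                             # small local noise
     for n in range(len(x))]

w, d_hat = nlms(x, d, L=3)
erle = erle_db(d[-1000:], d_hat[-1000:])                # measured after convergence
```

The normalization makes the behaviour largely independent of the input level: scaling `x` by 10, as here, does not require retuning the base step.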
![Page 23: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/23.jpg)
Recursive Least Squares (RLS)
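• Newton Raphson
$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}$
[Figure: Newton-Raphson iteration on f(x), moving from x0 to x1]
• New weights
$\mathbf{w}_{i+1} = \mathbf{w}_i - \mu[n]\left(\nabla^2 e(\mathbf{w}_i)\right)^{-1}\nabla e(\mathbf{w}_i)$
$\nabla^2 e(\mathbf{w}_i) = \mathbf{R}[n] = E\{\mathbf{x}[n]\,\mathbf{x}^T[n]\}$
$\mathbf{R}[n] = \lambda\,\mathbf{R}[n-1] + \mathbf{x}[n]\,\mathbf{x}^T[n]$
• Faster convergence, but more CPU intensive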
![Page 24: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/24.jpg)
Outline
• A model of the acoustical environment
• Simple things first!
• Microphones
• Echo cancellation
• Microphone arrays
• Single channel noise suppression
![Page 25: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/25.jpg)
Microphone arrays: delay & sum
5 microphones spaced 5 cm apart. Source located at 5 m.
Angle 0°
[Figure: four polar response plots at 400 Hz, 880 Hz, 4400 Hz and 8000 Hz]
[Figure: linear array of microphones M-2, M-1, M0, M1, M2 and source S]
$y[n] = \frac{1}{N}\sum_{i=0}^{N-1} y_i[n - i\,a\sin\theta]$
$\hat{\theta} = \arg\max_{\theta} \sum_{n} \left(\frac{1}{N}\sum_{i=0}^{N-1} y_i[n - i\,a\sin(\theta)]\right)^2$
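A delay-and-sum beamformer with integer sample delays can be sketched as below. Here `tau` stands for the per-microphone delay a·sin θ rounded to whole samples, and the sign convention y_i[n − i·tau] is an assumption of this sketch, as are the synthetic source and the five-channel setup. Steering is done by picking the delay that maximizes the output energy.

```python
import math

def delay_and_sum(channels, tau):
    """y[n] = (1/N) * sum_i y_i[n - i*tau], with tau an integer number of samples."""
    N = len(channels)
    length = len(channels[0])
    out = []
    for n in range(length):
        acc = 0.0
        for i, ch in enumerate(channels):
            idx = n - i * tau
            if 0 <= idx < length:     # samples outside the record count as zero
                acc += ch[idx]
        out.append(acc / N)
    return out

def estimate_delay(channels, candidates):
    """Pick the steering delay whose beamformer output has the largest energy."""
    return max(candidates,
               key=lambda tau: sum(v * v for v in delay_and_sum(channels, tau)))

length = 200
source = [math.sin(2 * math.pi * 0.05 * n) * math.exp(-((n - 100) / 20.0) ** 2)
          for n in range(length)]
true_tau = 2
# Synthetic array: microphone i observes the source advanced by i*true_tau samples.
channels = [[source[n + i * true_tau] if 0 <= n + i * true_tau < length else 0.0
             for n in range(length)]
            for i in range(5)]

best_tau = estimate_delay(channels, range(-4, 5))
aligned = delay_and_sum(channels, best_tau)
```

When the steering delay matches the true one, the channels add coherently and `aligned` reproduces the source; for other delays the channels partially cancel, which is what shapes the polar patterns at different frequencies.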
![Page 26: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/26.jpg)
Microphone arrays: delay & sum
5 microphones spaced 5 cm apart. Source located at 5 m.
Angle 30°
[Figure: four polar response plots at 400 Hz, 880 Hz, 4400 Hz and 8000 Hz]
[Figure: linear array of microphones M-2, M-1, M0, M1, M2 and source S]
$y[n] = \frac{1}{N}\sum_{i=0}^{N-1} y_i[n - i\,a\sin\theta]$
$\hat{\theta} = \arg\max_{\theta} \sum_{n} \left(\frac{1}{N}\sum_{i=0}^{N-1} y_i[n - i\,a\sin(\theta)]\right)^2$
![Page 27: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/27.jpg)
WITTY: Who Is Talking To You?
$Y(f) = X(f) + V(f)$
$B(f) = H(f)X(f) + G(f)V(f) + W(f)$
![Page 28: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/28.jpg)
Bone microphone for noise robust ASR
• Conventional microphones are sensitive to noise
• Bone microphones are more noise resistant, but distort the signal
• Not enough data to retrain recognizer with bone microphone
• Fusion between acoustic microphone and bone microphone
![Page 29: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/29.jpg)
Acoustic Microphone
![Page 30: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/30.jpg)
Bone Microphone
![Page 31: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/31.jpg)
Microphone fusion
![Page 32: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/32.jpg)
Relationship between acoustic mic and bone mic
Acoustic
Contact
![Page 33: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/33.jpg)
Relationship between acoustic mic and bone mic
![Page 34: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/34.jpg)
WITTY: Who is talking to you?
![Page 35: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/35.jpg)
Blind source separation
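• Linear mixing: $\mathbf{y}[n] = \mathbf{G}\,\mathbf{x}[n]$
• Estimate filter $\mathbf{H} = \mathbf{G}^{-1}$
• Separate signals: $\mathbf{x}[n] = \mathbf{H}\,\mathbf{y}[n]$
• Using assumption signals are independent
$p_{\mathbf{y}}(\mathbf{y}[n]) = |\mathbf{H}|\, p_{\mathbf{x}}(\mathbf{H}\mathbf{y}[n])$
$p_{\mathbf{y}}(\mathbf{y}[0], \mathbf{y}[1], \ldots, \mathbf{y}[N-1]) = \prod_{n=0}^{N-1} p_{\mathbf{y}}(\mathbf{y}[n]) = |\mathbf{H}|^{N} \prod_{n=0}^{N-1} p_{\mathbf{x}}(\mathbf{H}\mathbf{y}[n])$
• Do gradient descent:
$\mathbf{H}_{n+1} = \mathbf{H}_n + \mu\left[(\mathbf{H}_n^T)^{-1} + \phi(\mathbf{H}_n\mathbf{y}[n])\,(\mathbf{y}[n])^T\right]$, with $\phi(\mathbf{x}) = \partial \ln p_{\mathbf{x}}(\mathbf{x}) / \partial \mathbf{x}$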
![Page 36: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/36.jpg)
Blind source separation
Idea: Estimate filters h11[n] and h12[n] that maximize the likelihood of z1[n] under an HMM.
Approximate the HMM by a Gaussian Mixture Model with LPC parameters => EM algorithm with a linear set of equations
[Figure: two-channel separation network in which y1[n] and y2[n] pass through the filters h11[n], h12[n], h21[n] and h22[n] and are summed to give z1[n] and z2[n]]
![Page 37: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/37.jpg)
Outline
• A model of the acoustical environment
• Simple things first!
• Microphones
• Echo cancellation
• Microphone arrays
• Single channel noise suppression
![Page 38: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/38.jpg)
Spectral subtraction
Corrupted signal
$y[m] = x[m] + n[m]$
Power spectrum
$|Y(f)|^2 = |X(f)|^2 + |N(f)|^2 + 2\,|X(f)|\,|N(f)|\cos\theta$
but
$E\{\cos\theta\} = 0$
So
$|Y(f)|^2 = |X(f)|^2 + |N(f)|^2$
Estimate noise power spectrum from M noise-only frames
$|\hat{N}(f)|^2 = \frac{1}{M}\sum_{i=0}^{M-1} |Y_i(f)|^2$
Estimate clean power spectrum as
$|\hat{X}(f)|^2 = |Y(f)|^2 - |\hat{N}(f)|^2 = |Y(f)|^2\left(1 - \frac{1}{SNR(f)}\right)$
with
$SNR(f) = \frac{|Y(f)|^2}{|\hat{N}(f)|^2}$
![Page 39: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/39.jpg)
Spectral subtraction
Keep original phase
$\hat{X}(f) = Y(f)\, H_{ss}(f)$
Ensure it’s positive
$H_{ss}(f) = \sqrt{\max\left(1 - \frac{1}{SNR(f)},\; a\right)}$
[Figure: gain (dB) versus instantaneous SNR (dB), from -5 to 20 dB; curves: spectral subtraction, magnitude subtraction, oversubtraction]
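A sketch of the spectral subtraction gain, working directly on per-bin power values |Y(f)|². It assumes the magnitude gain H_ss(f) = sqrt(max(1 − 1/SNR(f), a)) with SNR(f) = |Y(f)|² / |N̂(f)|², so that applying it to Y(f) keeps the noisy phase. The floor value 0.01 and the toy numbers are assumptions of the example.

```python
import math

def estimate_noise_power(noise_frames):
    """Average |Y_i(f)|^2 over M noise-only frames, one value per frequency bin."""
    M = len(noise_frames)
    bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / M for k in range(bins)]

def spectral_subtraction_gain(noisy_power, noise_power, floor=0.01):
    """Per-bin gain sqrt(max(1 - 1/SNR, floor)), with SNR = |Y|^2 / |N_hat|^2."""
    gains = []
    for py, pn in zip(noisy_power, noise_power):
        inv_snr = pn / py if py > 0.0 else float("inf")
        gains.append(math.sqrt(max(1.0 - inv_snr, floor)))
    return gains

noise_frames = [[1.0, 2.0], [3.0, 2.0]]     # power spectra of two noise-only frames
noise_power = estimate_noise_power(noise_frames)
noisy_power = [8.0, 2.0]                    # |Y(f)|^2 for one noisy frame
gains = spectral_subtraction_gain(noisy_power, noise_power)

# Clean power estimate: |X_hat|^2 = |Y|^2 * H_ss^2
clean_power = [py * g * g for py, g in zip(noisy_power, gains)]
```

In the first bin the estimate is 8 − 2 = 6; in the second bin the subtraction would give zero, so the floor takes over and keeps the estimate positive.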
![Page 40: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/40.jpg)
Aurora2
• ETSI STQ group
• TIDigits
• Added noise at SNRs: -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB
• Set A: subway, babble, car, exhibition
• Set B: restaurant, airport, street, station
• Set C: one noise from set A and one noise from set B
• Aurora 3 recorded in car (no digital mixing!)
• Aurora 4 for large vocabulary
• Advanced Front-End (AFE) standard (2001) uses a variant of spectral subtraction
![Page 41: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/41.jpg)
Aurora 2 (Clean training)
Using SPLICE algorithm
| SNR | A: Subway | A: Babble | A: Car | A: Exhibition | A: Average | B: Restaurant | B: Street | B: Airport | B: Station | B: Average | C: Subway M | C: Street M | C: Average | Overall Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clean | | | | | | | | | | | | | | |
| 20 dB | 98.16 | 98.52 | 98.72 | 98.27 | 98.42 | 98.65 | 97.58 | 98.81 | 98.7 | 98.44 | 98.34 | 98.04 | 98.19 | 98.38 |
| 15 dB | 96.65 | 97.64 | 98.09 | 96.61 | 97.25 | 97.88 | 96.89 | 97.97 | 97.84 | 97.65 | 96.81 | 96.4 | 96.61 | 97.28 |
| 10 dB | 93.77 | 94.68 | 95.71 | 93.09 | 94.31 | 94.75 | 93.44 | 95.85 | 94.6 | 94.66 | 93.18 | 91.23 | 92.21 | 94.03 |
| 5 dB | 87.47 | 84.46 | 88.46 | 85.53 | 86.48 | 85.08 | 83.71 | 87.03 | 84.94 | 85.19 | 84.31 | 80.35 | 82.33 | 85.13 |
| 0 dB | 65.92 | 57.13 | 63.67 | 63.78 | 62.63 | 59.72 | 57.83 | 63.11 | 57.42 | 59.52 | 59.23 | 52.9 | 56.07 | 60.07 |
| -5 dB | | | | | | | | | | | | | | |
| Average | 88.39 | 86.49 | 88.93 | 87.46 | 87.82 | 87.22 | 85.89 | 88.55 | 86.70 | 87.09 | 86.37 | 83.78 | 85.08 | 86.98 |

| SNR | A: Subway | A: Babble | A: Car | A: Exhibition | A: Average | B: Restaurant | B: Street | B: Airport | B: Station | B: Average | C: Subway M | C: Street M | C: Average | Overall Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clean | | | | | | | | | | | | | | |
| 20 dB | 37.63% | 84.97% | 50.58% | 52.08% | 56.31% | 86.51% | 43.19% | 87.29% | 75.38% | 73.09% | 74.62% | 59.75% | 67.19% | 65.20% |
| 15 dB | 48.54% | 91.01% | 80.82% | 57.41% | 69.45% | 91.08% | 73.07% | 91.17% | 86.79% | 85.53% | 75.89% | 67.54% | 71.71% | 76.33% |
| 10 dB | 70.72% | 89.48% | 87.00% | 71.61% | 79.70% | 88.39% | 80.05% | 91.01% | 86.40% | 86.46% | 73.87% | 65.70% | 69.79% | 80.42% |
| 5 dB | 73.81% | 78.77% | 82.49% | 73.77% | 77.21% | 78.37% | 73.53% | 81.38% | 79.11% | 78.10% | 67.80% | 61.31% | 64.56% | 75.04% |
| 0 dB | 53.94% | 52.74% | 57.53% | 55.80% | 55.00% | 54.76% | 48.67% | 56.90% | 51.85% | 53.05% | 45.33% | 38.90% | 42.12% | 51.64% |
| -5 dB | | | | | | | | | | | | | | |
| Average | 61.96% | 73.03% | 71.90% | 63.75% | 68.48% | 73.03% | 63.33% | 75.52% | 70.02% | 70.83% | 59.73% | 52.14% | 55.93% | 67.39% |
![Page 42: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/42.jpg)
Aurora 2 (multi-condition training)
Using SPLICE algorithm
| SNR | A: Subway | A: Babble | A: Car | A: Exhibition | A: Average | B: Restaurant | B: Street | B: Airport | B: Station | B: Average | C: Subway M | C: Street M | C: Average | Overall Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clean | | | | | | | | | | | | | | |
| 20 dB | 98.53 | 98.64 | 98.51 | 98.64 | 98.58 | 98.46 | 97.91 | 98.6 | 98.58 | 98.39 | 98.4 | 98.25 | 98.33 | 98.45 |
| 15 dB | 97.64 | 98.07 | 98.33 | 97.69 | 97.93 | 97.79 | 97.49 | 97.44 | 97.47 | 97.55 | 97.88 | 97.16 | 97.52 | 97.70 |
| 10 dB | 95.98 | 96.37 | 96.84 | 95.65 | 96.21 | 95.27 | 94.41 | 95.11 | 95.12 | 94.98 | 95.79 | 93.8 | 94.80 | 95.43 |
| 5 dB | 92.08 | 88.94 | 92.78 | 90.25 | 91.01 | 87.63 | 88.06 | 88.16 | 87.04 | 87.72 | 90.97 | 85.85 | 88.41 | 89.18 |
| 0 dB | 78.02 | 65.57 | 76.83 | 74.42 | 73.71 | 65.37 | 68.23 | 69.49 | 65.57 | 67.17 | 72.67 | 65.42 | 69.05 | 70.16 |
| -5 dB | | | | | | | | | | | | | | |
| Average | 92.45 | 89.52 | 92.66 | 91.33 | 91.49 | 88.90 | 89.22 | 89.76 | 88.76 | 89.16 | 91.14 | 88.10 | 89.62 | 90.18 |

| SNR | A: Subway | A: Babble | A: Car | A: Exhibition | A: Average | B: Restaurant | B: Street | B: Airport | B: Station | B: Average | C: Subway M | C: Street M | C: Average | Overall Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clean | | | | | | | | | | | | | | |
| 20 dB | 38.49% | 40.09% | 24.37% | 47.49% | 37.61% | 50.80% | 13.64% | 45.31% | 52.51% | 40.56% | 40.74% | 49.28% | 45.01% | 40.27% |
| 15 dB | 33.14% | 34.80% | 30.13% | 30.63% | 32.17% | 52.98% | 31.98% | 34.02% | 43.40% | 40.59% | 41.92% | 36.47% | 39.19% | 36.95% |
| 10 dB | 27.70% | 23.09% | 25.82% | 26.15% | 25.69% | 41.17% | 1.06% | 27.12% | 31.56% | 25.23% | 36.79% | 17.33% | 27.06% | 25.78% |
| 5 dB | 31.96% | 11.16% | 40.82% | 21.37% | 26.33% | 24.85% | 17.03% | 13.89% | 21.36% | 19.28% | 48.66% | 19.00% | 33.83% | 25.01% |
| 0 dB | 33.60% | 9.04% | 50.24% | 28.23% | 30.27% | 14.93% | 17.82% | 12.55% | 21.54% | 16.71% | 48.61% | 24.10% | 36.35% | 26.06% |
| -5 dB | | | | | | | | | | | | | | |
| Average | 32.85% | 13.01% | 45.52% | 27.57% | 30.15% | 24.04% | 16.83% | 17.14% | 24.99% | 21.05% | 47.14% | 24.13% | 36.01% | 27.87% |
![Page 43: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/43.jpg)
Wiener Filtering
• Find linear estimate of clean signal
$y[n] = x[n] + v[n]$
$\hat{x}[n] = \sum_{m} h[m]\, y[n-m]$
• MMSE (Minimum Mean Squared Error): minimize
$E\left\{\left(x[n] - \sum_{m} h[m]\, y[n-m]\right)^2\right\}$
• Wiener-Hopf equation
$R_{xy}[l] = \sum_{m} h[m]\, R_{yy}[l-m]$
with $R_{xy}[l] = \sum_{m} x[m]\, y[m-l]$ and $R_{yy}[l] = \sum_{m} y[m]\, y[m-l]$
• In Freq domain
$H(f) = \frac{S_{xy}(f)}{S_{yy}(f)}$
• If noise and signal are uncorrelated
$H(f) = \frac{S_{xx}(f)}{S_{xx}(f) + S_{vv}(f)}$
![Page 44: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/44.jpg)
Wiener Filtering
• Find linear estimate of clean signal
$y[n] = x[n] + v[n]$
$\hat{x}[n] = \sum_{m} h[m]\, y[n-m]$
• If noise and signal are uncorrelated
$H(f) = \frac{S_{xx}(f)}{S_{yy}(f)} = \frac{S_{yy}(f) - S_{vv}(f)}{S_{yy}(f)} = 1 - \frac{1}{SNR(f)}$
• With
$SNR(f) = \frac{S_{yy}(f)}{S_{vv}(f)}$
• Compare with Spectral Subtraction
$H_{ss}(f) = \sqrt{\max\left(1 - \frac{1}{SNR(f)},\; a\right)}$
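The two forms of the Wiener gain, S_xx/(S_xx + S_vv) and 1 − 1/SNR(f) with SNR(f) = S_yy/S_vv, can be checked against each other numerically. The per-bin spectra below are made-up values for illustration, and the function names are not from the slides.

```python
def wiener_gain(s_xx, s_vv):
    """H(f) = S_xx(f) / (S_xx(f) + S_vv(f)) for uncorrelated signal and noise."""
    return [sx / (sx + sv) for sx, sv in zip(s_xx, s_vv)]

def wiener_gain_from_snr(s_yy, s_vv):
    """Same gain written as 1 - 1/SNR(f), with SNR(f) = S_yy(f) / S_vv(f)."""
    return [1.0 - 1.0 / (sy / sv) for sy, sv in zip(s_yy, s_vv)]

s_xx = [9.0, 1.0, 0.25]    # hypothetical clean-speech power per bin
s_vv = [1.0, 1.0, 1.0]     # hypothetical noise power per bin
s_yy = [sx + sv for sx, sv in zip(s_xx, s_vv)]

gain_a = wiener_gain(s_xx, s_vv)
gain_b = wiener_gain_from_snr(s_yy, s_vv)
```

Both lists come out as 0.9, 0.5 and 0.2: bins dominated by speech pass almost unchanged, while bins dominated by noise are attenuated.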
![Page 45: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/45.jpg)
Spectral Subtraction
[Figure: word error rate (%) versus SNR (dB), 0 to 30 dB; curves: Clean Speech Training, Spectral Subtraction, Matched Noisy Training]
![Page 46: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/46.jpg)
Vector Taylor Series (VTS)
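• Acero, Moreno
$y[m] = x[m] * h[m] + n[m]$
• The power spectrum, on the average
$|Y(f_i)|^2 = |X(f_i)|^2\,|H(f_i)|^2 + |N(f_i)|^2$
• Taking logs
$\ln|Y(f_i)|^2 = \ln|X(f_i)|^2 + \ln|H(f_i)|^2 + \ln\left(1 + \exp\left(\ln|N(f_i)|^2 - \ln|X(f_i)|^2 - \ln|H(f_i)|^2\right)\right)$
• Cepstrum is DCT (matrix C) of log power spectrum
$\mathbf{y} = \mathbf{x} + \mathbf{h} + \mathbf{g}(\mathbf{n} - \mathbf{x} - \mathbf{h})$
$\mathbf{g}(\mathbf{z}) = \mathbf{C}\,\ln\left(1 + e^{\mathbf{C}^{-1}\mathbf{z}}\right)$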
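The log-domain form can be checked numerically for a single frequency bin: ln(|X|²|H|² + |N|²) should equal ln|X|² + ln|H|² + ln(1 + exp(ln|N|² − ln|X|² − ln|H|²)). The function name and the power values below are arbitrary choices for this example.

```python
import math

def log_corrupted_power(ln_x, ln_h, ln_n):
    """ln|Y|^2 computed from log powers of speech, channel and noise."""
    return ln_x + ln_h + math.log1p(math.exp(ln_n - ln_x - ln_h))

x_pow, h_pow, n_pow = 4.0, 0.5, 1.0            # |X|^2, |H|^2, |N|^2 for one bin
direct = math.log(x_pow * h_pow + n_pow)       # ln(|X|^2 |H|^2 + |N|^2)
via_logs = log_corrupted_power(math.log(x_pow), math.log(h_pow), math.log(n_pow))
```

Both values equal ln 3 here. The last term is the nonlinearity g(·) that the Taylor series expansion later linearizes.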
![Page 47: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/47.jpg)
Vector Taylor Series (VTS)
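• x, h, and n are Gaussian random vectors with means $\boldsymbol{\mu}_x$, $\boldsymbol{\mu}_h$, and $\boldsymbol{\mu}_n$ and covariance matrices $\boldsymbol{\Sigma}_x$, $\boldsymbol{\Sigma}_h$, and $\boldsymbol{\Sigma}_n$
• Expand y in first-order Taylor series
$\mathbf{y} \approx \boldsymbol{\mu}_x + \boldsymbol{\mu}_h + \mathbf{g}(\boldsymbol{\mu}_n - \boldsymbol{\mu}_x - \boldsymbol{\mu}_h) + \mathbf{A}(\mathbf{x} - \boldsymbol{\mu}_x) + \mathbf{A}(\mathbf{h} - \boldsymbol{\mu}_h) + (\mathbf{I} - \mathbf{A})(\mathbf{n} - \boldsymbol{\mu}_n)$
$\mathbf{A} = \mathbf{C}\mathbf{F}\mathbf{C}^{-1}$, where $\mathbf{F}$ is a diagonal matrix whose diagonal is $\mathbf{f}(\boldsymbol{\mu}_n - \boldsymbol{\mu}_x - \boldsymbol{\mu}_h)$
$\mathbf{f}(\boldsymbol{\mu}) = \frac{1}{1 + e^{\mathbf{C}^{-1}\boldsymbol{\mu}}}$
$\boldsymbol{\mu}_y \approx \boldsymbol{\mu}_x + \boldsymbol{\mu}_h + \mathbf{g}(\boldsymbol{\mu}_n - \boldsymbol{\mu}_x - \boldsymbol{\mu}_h)$
$\boldsymbol{\Sigma}_y \approx \mathbf{A}\boldsymbol{\Sigma}_x\mathbf{A}^T + \mathbf{A}\boldsymbol{\Sigma}_h\mathbf{A}^T + (\mathbf{I} - \mathbf{A})\boldsymbol{\Sigma}_n(\mathbf{I} - \mathbf{A})^T$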
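A scalar sketch of the first-order VTS approximation, assuming a single log-spectral bin so that C is the identity and A reduces to the scalar f(μ_n − μ_x − μ_h). The function names and the example means and variances are assumptions chosen to make the arithmetic easy to follow.

```python
import math

def g(z):
    """Scalar form of g(z) = ln(1 + e^z)."""
    return math.log1p(math.exp(z))

def vts_first_order(mu_x, mu_h, mu_n, var_x, var_h, var_n):
    """Mean and variance of y under the first-order Taylor expansion (C = identity)."""
    z = mu_n - mu_x - mu_h
    A = 1.0 / (1.0 + math.exp(z))                 # A = f(mu_n - mu_x - mu_h)
    mu_y = mu_x + mu_h + g(z)
    var_y = A * A * var_x + A * A * var_h + (1.0 - A) ** 2 * var_n
    return mu_y, var_y, A

# Speech and noise at the same level, no channel: z = 0, so A = 0.5.
mu_y, var_y, A = vts_first_order(mu_x=2.0, mu_h=0.0, mu_n=2.0,
                                 var_x=4.0, var_h=0.0, var_n=1.0)
```

With equal speech and noise levels the corrupted mean is raised by ln 2 and the variance is 0.25·4 + 0.25·1 = 1.25. As the speech level rises well above the noise, A tends to 1 and y follows x.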
![Page 48: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/48.jpg)
Vector Taylor Series
• Distribution of corrupted log-spectra
• Noise with mean of 0 dB and std dev of 2 dB
• Speech with mean of 25 dB
• Monte Carlo simulation
• Std dev: 25 dB, 10 dB, 5 dB
[Figure: three histograms of the corrupted log-spectrum, one per speech standard deviation]
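A Monte Carlo simulation of this kind can be sketched as follows, assuming speech and noise log-powers are Gaussian in dB and combine additively in the power domain. The sample count, the seed, and the function name are arbitrary choices for the example.

```python
import math
import random

def simulate_corrupted_db(speech_std_db, num=20000, seed=0):
    """Draw speech and noise levels in dB and return (speech, corrupted) samples."""
    rng = random.Random(seed)
    speech, corrupted = [], []
    for _ in range(num):
        x_db = rng.gauss(25.0, speech_std_db)    # speech: mean 25 dB
        n_db = rng.gauss(0.0, 2.0)               # noise: mean 0 dB, std dev 2 dB
        y_db = 10.0 * math.log10(10.0 ** (x_db / 10.0) + 10.0 ** (n_db / 10.0))
        speech.append(x_db)
        corrupted.append(y_db)
    return speech, corrupted

results = {std: simulate_corrupted_db(std) for std in (25.0, 10.0, 5.0)}
```

A histogram of each `corrupted` list gives the distribution of the noisy log-spectrum. The corrupted level can never fall below the speech level, so low-energy speech samples pile up near the noise level and the distribution stops looking Gaussian when the speech variance is large.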
![Page 49: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/49.jpg)
Phase matters
Corrupted signal
$y[m] = x[m] + n[m]$
Spectrum
$|Y(f)|^2 = |X(f)|^2 + |N(f)|^2 + 2\,|X(f)|\,|N(f)|\cos\theta$
But $|Y(f)|^2 = |X(f)|^2 + |N(f)|^2$ is only an approximation
$E\{\cos\theta\} = 0$, yet $2\,|X_t(f)|\,|N_t(f)|\cos\theta_t \neq 0$ for an individual frame $t$
[Figure: two scatter plots, both axes from -6 to 2]
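The cross term 2|X||N|cos θ can be examined with complex numbers for a single frequency bin. In this sketch the speech component is fixed and the noise phase is drawn uniformly at random; the magnitudes, seed, and function name are assumptions of the example.

```python
import cmath
import math
import random

def cross_term(X, N):
    """2|X||N|cos(theta), with theta the phase difference between X and N."""
    theta = cmath.phase(X) - cmath.phase(N)
    return 2.0 * abs(X) * abs(N) * math.cos(theta)

X = cmath.rect(2.0, 0.3)                 # speech component of one bin
N_fixed = cmath.rect(1.0, 1.1)           # one particular noise realization
lhs = abs(X + N_fixed) ** 2
rhs = abs(X) ** 2 + abs(N_fixed) ** 2 + cross_term(X, N_fixed)

rng = random.Random(0)
terms = [cross_term(X, cmath.rect(1.0, rng.uniform(-math.pi, math.pi)))
         for _ in range(10000)]
mean_term = sum(terms) / len(terms)
largest_term = max(abs(t) for t in terms)
```

`lhs` and `rhs` agree exactly for every frame, whereas `mean_term` is close to zero only because the individual terms, which can be as large as 2|X||N|, cancel on average.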
![Page 50: Speech Enhancement EE 516 Spring 2009](https://reader036.fdocuments.net/reader036/viewer/2022062323/5681580f550346895dc57cfb/html5/thumbnails/50.jpg)
Non-stationary noise
• Speech/Noise decomposition (Varga et al.)
Observations
Speech HMM
Noise HMM