Blind Separation Algorithm for Audio Signal Based on Genetic Algorithm and Neural Network
Blind Audio Source Separation based on Independent ... · ... Blind Audio Source Separation based...
Transcript of Blind Audio Source Separation based on Independent ... · ... Blind Audio Source Separation based...
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 1
Shoji Makino, Hiroshi Sawada, Shoji Makino, Hiroshi Sawada, and Shoko Arakiand Shoko Araki
NTT Communication Science Laboratories, Kyoto, Japan
Blind Audio Source Separation Blind Audio Source Separation based on based on
Independent Component AnalysisIndependent Component Analysis
Participants: 247Oversea: 11921 Countries
April 1-4, 2003 Nara, JapanGeneral Chair: Shun-ichi Amari(RIKEN) Organizing Chair: Shoji Makino(NTT)
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 2
April 15April 15--20, 2007 Hawaii, USA20, 2007 Hawaii, USA
Tutorial Tutorial onon
Audio Source Separation Audio Source Separation based on based on
Independent Component AnalysisIndependent Component Analysis
Organizers: Shoji Makino
Hiroshi Sawada
ShouShou--TokuToku--TaishiTaishi
Could separate ten speeches.
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 3
At a Cocktail PartyAt a Cocktail Party
?
Reverberation
moreclearly!
Morning!
Hello!!Mixing
When two people talk to a computerWhen two people talk to a computer
Hello!
Morning!
Hello!Hello!
Hello!
Morning!
Morning!
Morning!Morning!Morning!Hello!
Hello!
I cannot understand!
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 4
?
Reverberation
moreclearly!
Morning!
Hello!!
Target signal must be extracted from mixed signals (source separation)
Mixing
Task of Blind Source SeparationTask of Blind Source Separation
Hello!
Morning!
Hello!Hello!
Hello!
Morning!
Morning!
Morning!
Speech recognition(Robots, Car navi. …) People(TV conferences, Hearing aids, …)
Morning!Morning!Hello!Hello!
Observed signal XSource signal S Separated signal Y
+
+
Hello!
Morning!
Separation system WEstimate S1 and S2using only X1 and X2
Y1
Y2
Hello!
+
+
Morning!
S1
S2
H11
H21
H12
H22
Model of Blind Source SeparationModel of Blind Source Separation
Mixing system HDelayAttenuationReverberation
Hello!Morning!
Hello!Morning!
X1
X2
W21
W12
W22
W11
Source SIndependent
BlindBlind
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 5
Blind Source Separation using Blind Source Separation using Independent Component AnalysisIndependent Component Analysis
H W
Hello!
Morning!
Observed signals X1and X2 are correlated
Extract Y1 and Y2 so that they are mutually independent
Source signals S1 and S2 are statistically independent
No need for information on the source signals or mixing system(location or room acoustics) ⇒ Blind Source Separation
No need for information on the source signals or mixing system(location or room acoustics) ⇒ Blind Source SeparationNo need for information on the source signals or mixing system(location or room acoustics) ⇒ Blind Source Separation
BlindBlind Source Separation using Source Separation using Independent Component AnalysisIndependent Component Analysis
Unsupervised Learning by ICAUnsupervised Learning by ICA
H11
H21
H22
S1
S2
Mixing system H
Y1W11
W12
W21
W22
H12
X1
X2 Y2
Separation system W
Estimate W so that Y1and Y2 become mutually independent
Source signals S1 and S2 are statistically independent
Observed signal XSource signal S Separated signal Y
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 6
Background TheoryBackground Theory
- Minimization of Mutual Information(Minimization of Kullback-Leibler Divergence)
All solutions areidentical
- Maximization of Non-Gaussianity
- Maximization of Likelihood
Background TheoryBackground Theory
All solutions are identical
),()(),( 21
2
121 YYHYHYYI
ii −=∑
=
Minimization of Mutual Information
Maximization of Non-Gaussianity
Maximization of Likelihood
MarginalEntropy
JointEntropy
MutualInformation
H(・):Entropy
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 7
Background TheoryBackground Theory
- Maximization of Non-Gaussianity
• Make the output pdf away from Gaussian
Central Limit TheoremCentral Limit Theorem
Mix independent components ⇒ Gaussian
Find independent component ⇒ Non-Gaussian
maximization of negentropymaximization of |kurtosis|
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 8
Wave forms
Am
plitu
des
Wave forms
Histograms
Amplitude
Gaussian
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 9
Central Limit TheoremCentral Limit Theorem
Mix independent components ⇒ Gaussian
Find independent component ⇒ Non-Gaussian
Non-Gaussianity measuresNegentropy
Kurtosis
Maximization of NegeMaximization of Negentropiesntropies
00.0120.0250.0870.225Negentropy N
1.411.401.391.331.19Entropy H
Gaussian16821# sources N
)()()( gauss yHxHyN −=
∑=
=n
i ii p
pyH1
1log)(
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 10
Maximization of KurtosisMaximization of Kurtosis
00.390.701.82.1Kurtosis
Gaussian16821N
224 })|{|(3}|{|)( yEyEykurt −=
Background TheoryBackground Theory
- Minimization of Mutual Information(Minimization of Kullback-Leibler Divergence)
• Make the output “decorrelated”
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 11
H(Y2)H(Y1)
H(Y1,Y2)
),()(),( 21
2
121 YYHYHYYI
ii −=∑
=
MarginalEntropy
JointEntropy
MutualInformation
Mutual InformationMutual Information
H(・):Entropy
H(Y2)H(Y1)
H(Y1,Y2)
I(Y1,Y2)Mutual
Information
),()(),( 21
2
121 YYHYHYYI
ii −=∑
=
Mutual InformationMutual Information
H(・):EntropyMarginalEntropy
JointEntropy
MutualInformation
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 12
H(Y2)H(Y1)I(Y1,Y2)=0
Mutual Information
),()(),( 21
2
121 YYHYHYYI
ii −=∑
=
H(Y1,Y2)
Minimization of Mutual InformationMinimization of Mutual Information
H(・):EntropyMarginalEntropy
JointEntropy
MutualInformation
),()(),( 21
2
121 YYHYHYYI
ii −=∑
=
MarginalEntropy
JointEntropy
MutualInformation
∫∫ += 22
211
1 )(1log)(
)(1log)( dY
YpYpdY
YpYp
∫− 2121
21 ),(1log),( dYdY
YYpYYp
∫= 2121
2121 )()(
),(log),( dYdYYpYp
YYpYYp
Kullbuck-LeiblerDivergence
Minimization of Mutual InformationMinimization of Mutual Information
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 13
∫= 2121
212121 )()(
),(log),(),( dYdYYpYp
YYpYYpYYI
MutualInformation
Kullbuck-LeiblerDivergence
- Search for W which minimizes Mutual Information I- Gradient of W can be derived by differenciating I
with W, using y = Wx;
WWW
W T2,1 )( −
∂∂
−∝ΔYYI ( )W(Y)YI ][E Tφ−=
YY(Y)
d)(logd p−=φwhere
Minimization of Mutual InformationMinimization of Mutual Information
How can we separate speech?How can we separate speech?
Diagonalize RY
⎥⎦
⎤⎢⎣
⎡⟩⟨⟩⟨⟩⟨⟩⟨
=2212
2111
)()()()(YYYYYYYY
RY φφφφ
S H X WY1
Y2⟨⋅⟩ averaging operator
)(⋅φ activation function
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 14
0)( 12 =⟩⟨ YYφ0)( 21 =⟩⟨ YYφ
Mutual independence
At the Convergence PointAt the Convergence Point
⎥⎦
⎤⎢⎣
⎡*00*
Average amplitude of Y
111)( cYY =⟩⟨φ 222 )( cYY =⟩⟨φ⎥⎦
⎤⎢⎣
⎡
2
1
**c
c
4 equations for 4 unknowns Wij
1
11 d
)(logd)(Y
YpY −=φ
iii WWW Δ+=+1
ii WYYcYY
YYYYcW ⎥
⎦
⎤⎢⎣
⎡⟩⟨−⟩⟨
⟩⟨⟩⟨−=Δ
22212
21111
)()()()(
φφφφ
μ 0
⎥⎦
⎤⎢⎣
⎡⟩⟨⟩⟨⟩⟨⟩⟨
=2212
2111
)()()()(YYYYYYYY
RY φφφφ
+
+
Y1
Y2
X1
X2
W
ICA Learning RuleICA Learning Rule
Update W so that Y1 and Y2 become mutually independent
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 15
)tanh()( 11 YY =φ
Higher Order Statistics (HOS)
0)( 21 =⟩⟨ YYφ
0)( 21 =⟩⟨ YYφ
Second Second Order StatisticsOrder Statistics vs.vs. Higher Higher Order StatisticsOrder Statistics
Second Order Statistics (SOS)
11)( YY =φ 021 =⟩⟨ YY
0)152
3 2
51
31
1 =⟩+−⟨ YYYY L
for multipletime blocks
nonstationarysources
Tailorexpansion
0)tanh( 21 =⟩⟨ YY
Convolutive mixtureHij are FIR filters > 1000 taps- sounds in a room
Instantaneous vs. ConvolutiveInstantaneous vs. Convolutive
Instantaneous mixtureHij are scalars- sounds with mixer- images- wireless communication signals- fMRI and EEG signals
+
+
Hello!
Morning!
S1
S2
H11
H21
H12
H22
Well studied, good results
Difficult problem, relatively new
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 16
Mixing Filters and Separation FiltersMixing Filters and Separation Filters
Y1
Y2
S1
S2
X1
X2
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
H11
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
H21
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
H12
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
H22
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
W11
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
W22
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
W21
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
W12
Time Domain
Y1
Y2
S1
S2
X1
X2100 200 300 400 500 600 700 800 900 1000
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
W21
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
W22
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
W11
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
W12
Mixing Filters and Separation FiltersMixing Filters and Separation Filters
Time Domain in a Matrix
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
H12
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
H11
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
H21
100 200 300 400 500 600 700 800 900 1000−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
H22
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 17
Time
Time-domain signal
TimeFr
eque
ncy
Frequency-domain time-series signal
Short Time DFT (Spectrogram)Short Time DFT (Spectrogram)
Convolutive mixture in time domain
Multiple instantaneous mixtures in frequency domain
Frequency-domain time-series signal
Freq
uenc
y
Time
4000
0
Observed signal X2
Apply instantaneous ICA approach to each frequency bin
4000
0
Freq
uenc
y
Time
Observed signal X1
Frequency
Frequency Domain BSSFrequency Domain BSS
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 18
Y1
Y2
S1
S2
X1
X2
W11(f)W11(f)W11(f)W11(f)W11(f)
W11(f)W11(f)W11(f)W11(f)W21(f)
W11(f)W11(f)W11(f)W11(f)W12(f)
W11(f)W11(f)W11(f)W11(f)W22(f)
Frequency
Frequency
Frequency
Frequency
Mixing Filters and Separation FiltersMixing Filters and Separation Filters
FrequencyDomainW11(f)W11(f)W11(f)W11(f)H11(f)
W11(f)W11(f)W11(f)W11(f)H21(f)
W11(f)W11(f)W11(f)W11(f)H12(f)
W11(f)W11(f)W11(f)W11(f)H22(f)
Frequency
Frequency
Frequency
Frequency
Y1
Y2
S1
S2
X1
W11(f) W12(f)
W21(f) W22(f)W11(f) W12(f)
W21(f) W22(f)W11(f) W12(f)
W21(f) W22(f)W11(f) W12(f)
W21(f) W22(f)W11(f) W12(f)
W21(f) W22(f)W11(f) W12(f)
W21(f) W22(f)
Frequency
X2
Mixing Filters and Separation FiltersMixing Filters and Separation Filters
FrequencyDomain in many Matrices
W11(f) W12(f)
W21(f) W22(f)W11(f) W12(f)
W21(f) W22(f)W11(f) W12(f)
W21(f) W22(f)W11(f) W12(f)
W21(f) W22(f)W11(f) W12(f)
W21(f) W22(f)H11(f) H12(f)
H21(f) H22(f)
Frequency
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 19
)(ty)(tx
Mixedsignals
Separatedsignals
Separationfilters
Flow of Frequency Domain BSSFlow of Frequency Domain BSSTime domain
ScalingPermutation Spectralsmoothing
Frequency domain
),( tfXSTFT
)( fWICA
separation matrix
)( fWIDFT
)(lw )(lw
Permutation Spectralsmoothing
Physical Interpretation of BSSPhysical Interpretation of BSS
BSS = Two sets of ABF
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 20
Adaptive Beamformer (ABF)Adaptive Beamformer (ABF)
•• StrategyStrategyMinimize the output when only a jammer is active but a target is not active
(b)
•• AssumptionsAssumptionsDirection and absence period of a target is known
Target S1
Jammer S2
C1S1
0Target S1C1S1
⎥⎦
⎤⎢⎣
⎡=⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡c
cHHHH
WWWW
2
1
2221
1211
2221
1211
00
Two Sets of ABFsTwo Sets of ABFs
0
C1S1S1
S2
(a) ABF for target S1and jammer S2
C2S2
0
S1
S2
(b) ABF for target S2and jammer S1
⎥⎦
⎤⎢⎣
⎡=⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡c
cHHHH
WWWW
2
1
2221
1211
2221
1211
00⎥⎦
⎤⎢⎣
⎡=⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡c
cHHHH
WWWW
2
1
2221
1211
2221
1211
00
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 21
Blind Source Separation (BSS)Blind Source Separation (BSS)
•• StrategyStrategyMinimize the SOSSOS or HOS of the outputs
•• AssumptionsAssumptionsTwo sources are mutually independent
Hello!
Morning!
0
C1S1
0
C2S2
C1S1
C2S2
)()(),()()(
)(),()(),(
*22
*12
*2111⎥⎦
⎤⎢⎣
⎡=
=
=
∗
∗∗
∗
YYYYYYYY
E
kkk
S ωωωωωωωωω
WHΛHWWRWR XY
[ ][ ] [ ] [ ] [ ]( ){ }
0
)(22
22
11221
221
=+++= ∗∗∗∗∗∗
∗
SEbdSEacSSEbcSSEadYYE
( )
• After convergence, the off-diagonal component is
Diagonalization of in BSSDiagonalization of in BSS
= =
0 0(If S1 and S2 are ideally independent)
),( kY ωR
),( kY ωR• The BSS strategy works to diagonalize
00
dc b 0= = a
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 22
=
where
⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡=
⎥⎥⎦
⎤
⎢⎢⎣
⎡
HHHH
WWWW
dac
b
2221
1211
2221
1211a bc d
BSS SolutionsBSS Solutions
⎥⎦
⎤⎢⎣
⎡=⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡c
cHHHH
WWWW
2
1
2221
1211
2221
1211
00
CASE 1: a=c1, c=0, b=0, d=c2
⎥⎦
⎤⎢⎣
⎡=⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡0
02
1
2221
1211
2221
1211
cc
HHHH
WWWW
CASE 2: a=0, c=c1, b=c2, d=0
(Permutation solution)
⎥⎦
⎤⎢⎣
⎡=⎥⎦⎤
⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡ccHH
HHWWWW
212221
1211
2221
1211 00CASE 3:
⎥⎦
⎤⎢⎣
⎡=⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡00
21
2221
1211
2221
1211 ccHHHH
WWWWCASE 4: ( ) 0|| ≠fH
( ) 0≠fH ji
do not appearwe assumeQ
S1
S2
Y1
Y2
Same as ABF
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 23
Adaptive Beamformers0
C1S1
S2
S1
0
C2S2
BSS
Equivalence between BSS and ABFEquivalence between BSS and ABF
0
if the independent assumption ideally holdsBSS and ABF are equivalentif the independent assumption ideally holds
ST
ST
Physical Understanding of BSSPhysical Understanding of BSS
Directivity Patterns
TR = 300 ms TR = 0 ms
TR = 300 msTR = 0 ms
BSS
ABF
BSS, ABF ⇒ Spatial notch to jammer direction
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 24
BSS of Three SpeechesBSS of Three Speeches
3 sources3 sources××3 sensors BSS3 sensors BSS
Directivity Patterns
TR = 130 ms
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 25
3 sources3 sources××3 sensors BSS3 sensors BSS
Directivity Patterns
TR = 130 ms
3 sources3 sources××3 sensors BSS3 sensors BSS
Directivity Patterns
TR = 130 ms
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 26
4 cm
Spatial AliasingSpatial Aliasing
2λ<d
λ : wave length of the highest frequency
When microphone spacing d is too wide…
dd
Spatial aliasing does not occer when
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 27
4 cm30 cm
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 28
Rec
ogni
tion
rate
0%
20%
40%
60%
80%
100%
妨害音:新聞読み上げ 妨害音:DVD
ブラインド源分離前
ブラインド源分離後
After
Before
Before
After
Jammer : Speech Music + Speech
Effect on Speech RecognitionEffect on Speech Recognition
Permutation AlignmentVarious approaches and methodsMethods based on clustering
Bin-wise separated signalsaccording to their activities
Time difference of arrival (TDOA)estimated from basis vectors
Permutation ≈ ClusteringMembership assignment is restricted
to a permutation in each frequency bin
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 29
Dependence Across FrequenciesMeaningful audio source has some structure
Common silence period Common onset and offsetHarmonics
Mutual dependence of separated signals across frequencies
Permutation Alignment (Spatial Information)
Beamforming approach Directivity patterns calculated with Direction of arrival (DOA) estimated & clusteredArray geometry information needed
DOA
Time difference of arrival (TDOA)
Estimated from basis vectorsNo need for array information
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 30
• Mixtures
• ICA
• The inverse of separation matrix
• Decomposition of mixture
Basis VectorsBasis Vectors
basis vector
• Basis vector• represents the frequency responses
from source to all sensors• Implies information on the source location
Basis VectorsBasis Vectors
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 31
3-source 3-microphone Case
MicrophonesOn edges of 4cm triangle
Loudspeakers
45°
120°
210°Distance:
0.5, 1.1, 1.7 m
• Reverberation time: RT60 = 130, 200 ms
DOA: Definition and Example3-dim unit-norm vector
6 speakers and 8 microphones
ICA2007 Keynote – Blind Audio Source Separation based on Independent Component AnalysisShoji Makino, Hiroshi Sawada, and Shoko Araki (NTT Communication Science Laboratories) 32
ReferencesICA and BSS books
[Lee, 1998], [Haykin, 2000], [Hyvarinen et al., 2001], [Cichocki and Amari, 2002]
ICA algorithms
• Information-maximization approach [Bell and Sejnowski, 1995]• Maximum likelihood (ML) estimation [Cardoso, 1997]• Natural gradient [Amari et al., 1996], [Cichocki and Amari, 2002]• Equivariance property [Cardoso and Laheld, 1996]• FastICA [Hyvarinen et al., 2001]
Time-domain approach to convolutive BSS
[Amari et al., 1997], [Kawamoto et al., 1998], [Matsuoka and Nakashima, 2001],[Douglas and Sun, 2003], [Buchner et al., 2004], [Takatani et al., 2004], [Douglas et al., 2005],[Aichner et al., 2006]
Frequency-domain approach to convolutive BSS
[Smaragdis, 1998], [Parra and Spence, 2000], [Schobben and Sommen, 2002], [Murata et al., 2001],[Anemuller and Kollmeier, 2000], [Mitianoudis and Davies, 2003], [Asano et al., 2003],[Saruwatari et al., 2003], [Ikram and Morgan, 2005], [Sawada et al., 2004], [Mukai et al., 2006],[Sawada et al., 2006], [Hiroe, 2006], [Kim et al., 2007], [Lee et al., 2006], [Sawada et al., 2007]
Approaches to permutation alignment
• Making separation matrices smooth in the frequency domain [Smaragdis, 1998],[Parra and Spence, 2000], [Schobben and Sommen, 2002], [Buchner et al., 2004]
• Beamforming approach and estimating direction-of-arrival (DOA) [Saruwatari et al., 2003],[Ikram and Morgan, 2005], [Sawada et al., 2004], [Mukai et al., 2006], [Sawada et al., 2006]
• Correlation of envelopes [Murata et al., 2001], [Anemuller and Kollmeier, 2000],[Sawada et al., 2004]
• Nonstationary time-varying scale parameter [Mitianoudis and Davies, 2003]• Multivariate density function [Hiroe, 2006], [Kim et al., 2007], [Lee et al., 2006]• Dominance measure [Sawada et al., 2007]
Time-frequency masking approach to BSS
[Aoki et al., 2001], [Rickard et al., 2001], [Bofill, 2003], [Yilmaz and Rickard, 2004],[Roman et al., 2003], [Araki et al., 2004], [Araki et al., 2005], [Kolossa and Orglmeister, 2004],[Sawada et al., 2006]
Blind dereverberation for speech signals
[Nakatani et al., 2007], [Delcroix et al., 2007], [Kinoshita et al., 2006]
Scaling adjustment to microphone observations
[Cardoso, 1998], [Murata et al., 2001], [Matsuoka and Nakashima, 2001], [Takatani et al., 2004]
Linear estimation
[Kailath et al., 2000]
Time difference of arrival (TDOA)
[Knapp and Carter, 1976], [Omologo and Svaizer, 1997], [DiBiase et al., 2001], [Chen et al., 2004]
K-means clustering
[Duda et al., 2000]
REFERENCES
[Aichner et al., 2006] Aichner, R., Buchner, H., Yan, F., and Kellermann, W. (2006). A real-time blind source separationscheme and its application to reverberant and noisy acoustic environments. Signal Process., 86(6):1260–1277.
[Amari et al., 1996] Amari, S., Cichocki, A., and Yang, H. H. (1996). A new learning algorithm for blind signalseparation. In Advances in Neural Information Processing Systems, volume 8, pages 757–763. The MIT Press.
[Amari et al., 1997] Amari, S., Douglas, S., Cichocki, A., and Yang, H. (1997). Multichannel blind deconvolution andequalization using the natural gradient. In Proc. IEEE Workshop on Signal Processing Advances in WirelessCommunications, pages 101–104.
[Anemuller and Kollmeier, 2000] Anemuller, J. and Kollmeier, B. (2000). Amplitude modulation decorrelation forconvolutive blind source separation. In Proc. ICA 2000, pages 215–220.
[Aoki et al., 2001] Aoki, M., Okamoto, M., Aoki, S., Matsui, H., Sakurai, T., and Kaneda, Y. (2001). Sound sourcesegregation based on estimating incident angle of each frequency component of input signals acquired by multiplemicrophones. Acoustical Science and Technology, 22(2):149–157.
[Araki et al., 2004] Araki, S., Makino, S., Blin, A., Mukai, R., and Sawada, H. (2004). Underdetermined blind separationfor speech in real environments with sparseness and ICA. In Proc. ICASSP 2004, volume III, pages 881–884.
[Araki et al., 2005] Araki, S., Makino, S., Sawada, H., and Mukai, R. (2005). Reducing musical noise by a fine-shiftoverlap-add method applied to source separation using a time-frequency mask. In Proc. ICASSP 2005, volume III,pages 81–84.
[Asano et al., 2003] Asano, F., Ikeda, S., Ogawa, M., Asoh, H., and Kitawaki, N. (2003). Combined approach of arrayprocessing and independent component analysis for blind separation of acoustic signals. IEEE Trans. Speech AudioProcessing, 11(3):204–215.
[Bell and Sejnowski, 1995] Bell, A. and Sejnowski, T. (1995). An information-maximization approach to blind separationand blind deconvolution. Neural Computation, 7(6):1129–1159.
[Bofill, 2003] Bofill, P. (2003). Underdetermined blind separation of delayed sound sources in the frequency domain.Neurocomputing, 55:627–641.
[Buchner et al., 2004] Buchner, H., Aichner, R., and Kellermann, W. (2004). Blind source separation for convolutivemixtures: A unified treatment. In Huang, Y. and Benesty, J., editors, Audio Signal Processing for Next-GenerationMultimedia Communication Systems, pages 255–293. Kluwer Academic Publishers.
[Cardoso, 1997] Cardoso, J.-F. (1997). Infomax and maximum likelihood for blind source separation. IEEE SignalProcessing Letters, 4(4):112–114.
[Cardoso, 1998] Cardoso, J.-F. (1998). Multidimensional independent component analysis. In Proc. ICASSP 1998,volume 4, pages 1941–1944.
[Cardoso and Laheld, 1996] Cardoso, J.-F. and Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Trans.Signal Processing, 44(12):3017–3030.
[Chen et al., 2004] Chen, J., Huang, Y., and Benesty, J. (2004). Time delay estimation. In Huang, Y. and Benesty, J.,editors, Audio Signal Processing, pages 197–227. Kluwer Academic Publishers.
[Cichocki and Amari, 2002] Cichocki, A. and Amari, S. (2002). Adaptive Blind Signal and Image Processing. JohnWiley & Sons.
[Delcroix et al., 2007] Delcroix, M., Hikichi, T., and Miyoshi, M. (2007). Precise dereverberation using multi-channellinear prediction. IEEE Trans. Audio, Speech and Language Processing, 15(2):430–440.
[DiBiase et al., 2001] DiBiase, J. H., Silverman, H. F., and Brandstein, M. S. (2001). Robust localization in reverberantrooms. In Brandstein, M. and Ward, D., editors, Microphone Arrays, pages 157–180. Springer.
[Douglas et al., 2005] Douglas, S. C., Sawada, H., and Makino, S. (2005). A spatio-temporal FastICA algorithm forseparating convolutive mixtures. In Proc. ICASSP 2005, volume V, pages 165–168.
[Douglas and Sun, 2003] Douglas, S. C. and Sun, X. (2003). Convolutive blind separation of speech mixtures using thenatural gradient. Speech Communication, 39:65–78.
[Duda et al., 2000] Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification. Wiley Interscience, 2ndedition.
[Haykin, 2000] Haykin, S., editor (2000). Unsupervised Adaptive Filtering (Volume I: Blind Source Separation). JohnWiley & Sons.
[Hiroe, 2006] Hiroe, A. (2006). Solution of permutation problem in frequency domain ICA using multivariate probabilitydensity functions. In Proc. ICA 2006 (LNCS 3889), pages 601–608. Springer.
[Hyvarinen et al., 2001] Hyvarinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. John Wiley& Sons.
[Ikram and Morgan, 2005] Ikram, M. Z. and Morgan, D. R. (2005). Permutation inconsistency in blind speech separation:Investigation and solutions. IEEE Trans. Speech Audio Processing, 13(1):1–13.
[Kailath et al., 2000] Kailath, T., Sayed, A. H., and Hassibi, B. (2000). Linear Estimation. Prentice Hall.[Kawamoto et al., 1998] Kawamoto, M., Matsuoka, K., and Ohnishi, N. (1998). A method of blind separation for
convolved non-stationary signals. Neurocomputing, 22:157–171.[Kim et al., 2007] Kim, T., Attias, H. T., Lee, S.-Y., and Lee, T.-W. (2007). Blind source separation exploiting
higher-order frequency dependencies. IEEE Trans. Audio, Speech and Language Processing, pages 70–79.[Kinoshita et al., 2006] Kinoshita, K., Nakatani, T., and Miyoshi, M. (2006). Spectral subtraction steered by multi-step
forward linear prediction for single channel speech dereverberation. In Proc. ICASSP 2006, volume I, pages 817–820.[Knapp and Carter, 1976] Knapp, C. H. and Carter, G. C. (1976). The generalized correlation method for estimation of
time delay. IEEE Trans. Acoustic, Speech and Signal Processing, 24(4):320–327.[Kolossa and Orglmeister, 2004] Kolossa, D. and Orglmeister, R. (2004). Nonlinear postprocessing for blind speech
separation. In Proc. ICA 2004 (LNCS 3195), pages 832–839.[Lee et al., 2006] Lee, I., Kim, T., and Lee, T.-W. (2006). Complex FastIVA: A robust maximum likelihood approach of
MICA for convolutive BSS. In Proc. ICA 2006 (LNCS 3889), pages 625–632. Springer.[Lee, 1998] Lee, T. W. (1998). Independent Component Analysis - Theory and Applications. Kluwer Academic Publishers.[Matsuoka and Nakashima, 2001] Matsuoka, K. and Nakashima, S. (2001). Minimal distortion principle for blind source
separation. In Proc. ICA 2001, pages 722–727.[Mitianoudis and Davies, 2003] Mitianoudis, N. and Davies, M. (2003). Audio source separation of convolutive mixtures.
IEEE Trans. Speech and Audio Processing, 11(5):489–497.[Mukai et al., 2006] Mukai, R., Sawada, H., Araki, S., and Makino, S. (2006). Frequency-domain blind source separation
of many speech signals using near-field and far-field models. EURASIP Journal on Applied Signal Processing,2006:Article ID 83683, 13 pages.
[Murata et al., 2001] Murata, N., Ikeda, S., and Ziehe, A. (2001). An approach to blind source separation based ontemporal structure of speech signals. Neurocomputing, 41(1-4):1–24.
[Nakatani et al., 2007] Nakatani, T., Kinoshita, K., and Miyoshi, M. (2007). Harmonicity-based blind dereverberation forsingle-channel speech signals. IEEE Trans. Audio, Speech and Language Processing, 15(1):80–95.
[Omologo and Svaizer, 1997] Omologo, M. and Svaizer, P. (1997). Use of the crosspower-spectrum phase in acousticevent location. IEEE Trans. Speech Audio Processing, 5(3):288–292.
[Parra and Spence, 2000] Parra, L. and Spence, C. (2000). Convolutive blind separation of non-stationary sources. IEEETrans. Speech Audio Processing, 8(3):320–327.
[Rickard et al., 2001] Rickard, S., Balan, R., and Rosca, J. (2001). Real-time time-frequency based blind sourceseparation. In Proc. ICA2001, pages 651–656.
[Roman et al., 2003] Roman, N., Wang, D., and Brown, G. J. (2003). Speech segregation based on sound localization.Journal of Acousitical Society of America, 114(4):2236–2252.
[Saruwatari et al., 2003] Saruwatari, H., Kurita, S., Takeda, K., Itakura, F., Nishikawa, T., and Shikano, K. (2003). Blindsource separation combining independent component analysis and beamforming. EURASIP Journal on Applied SignalProcessing, 2003(11):1135–1146.
[Sawada et al., 2007] Sawada, H., Araki, S., and Makino, S. (2007). Measuring dependence of bin-wise separated signalsfor permutation alignment in frequency-domain BSS. In Proc. ISCAS 2007. (in press).
[Sawada et al., 2006] Sawada, H., Araki, S., Mukai, R., and Makino, S. (2006). Blind extraction of dominant targetsources using ICA and time-frequency masking. IEEE Trans. Audio, Speech and Language Processing, pages2165–2173.
[Sawada et al., 2004] Sawada, H., Mukai, R., Araki, S., and Makino, S. (2004). A robust and precise method for solvingthe permutation problem of frequency-domain blind source separation. IEEE Trans. Speech Audio Processing,12(5):530–538.
[Schobben and Sommen, 2002] Schobben, L. and Sommen, W. (2002). A frequency domain blind signal separationmethod based on decorrelation. IEEE Trans. Signal Processing, 50(8):1855–1865.
[Smaragdis, 1998] Smaragdis, P. (1998). Blind separation of convolved mixtures in the frequency domain.Neurocomputing, 22:21–34.
[Takatani et al., 2004] Takatani, T., Nishikawa, T., Saruwatari, H., and Shikano, K. (2004). High-fidelity blind separationof acoustic signals using SIMO-model-based independent component analysis. IEICE Trans. Fundamentals,E87-A(8):2063–2072.
[Yilmaz and Rickard, 2004] Yilmaz, O. and Rickard, S. (2004). Blind separation of speech mixtures via time-frequencymasking. IEEE Trans. Signal Processing, 52(7):1830–1847.