Blind Audio Source Separation Pipeline and Algorithm Evaluation

Wisam Reid, Kai-Chieh Huang & Doron Roberts-Kedes

Abstract—This report outlines the various methods and experiments employed by its authors in their search for algorithmic solutions to Blind Audio Source Separation (BSS). In the context of music, advancing BSS would lead to improvements in music information retrieval, computer music composition, spatial audio, and audio engineering. An understanding and evaluation of the advantages and disadvantages of different BSS algorithms will be beneficial for the further use of these algorithms in different contexts. This report discusses three Blind Audio Source Separation algorithms: GMM, NMF, and ICA, and evaluates their performance based on human perception of audio signals.

Index Terms—Audio Signal Processing, Blind Source Separation, Bark Coefficient Analysis, Non-negative Matrix Factorization, Gaussian Mixture Model, Critical Band Smoothing


1 PROJECT BACKGROUND

Blind Source Separation (BSS) is the separation of a set of source signals from a set of mixed signals without the aid of information (or with very little information) about the source signals or the mixing process. One way of categorizing these algorithms is to divide them into time domain and frequency domain approaches. An example of a time domain approach is Independent Component Analysis, which works with input data containing both positive and negative values. On the other hand, algorithms such as Non-negative Matrix Factorization work in the frequency domain, where the input data is the magnitude of the spectrogram, which can only be positive. Although BSS is an active research area where new techniques are continuously being developed, there is little literature studying the characteristics of and differences between BSS algorithms, or offering an objective way of measuring the performance of a BSS algorithm in the context of human perception. Thus, it is interesting to study the differences between BSS algorithms and compare them by evaluating their separation results. In this report, we study three major BSS algorithms: the Gaussian Mixture Model (GMM), Non-negative Matrix Factorization (NMF), and Independent Component Analysis (ICA), and examine their performance using a perceptually relevant criterion for measuring error, thus enabling regression on the parameters of our model.

2 PIPELINE

With the goal of conveniently comparing the performance of each algorithm used in this paper, we designed a pipeline structure as shown in Fig 1. The performance of supervised and unsupervised BSS algorithms, for both under-determined and determined systems, was compared using this process. Using this structure, we've implemented a system that automatically generates mixings and feeds them into the different algorithms to evaluate and compare their performance.

Fig. 1. Pipeline for Performing Algorithm Evaluation

3 NMF

Source separation can be viewed as a matrix factorization problem, where the source mixture is modeled as a matrix containing its spectrogram representation. Spectrograms are commonly used to visualize the time-varying spectral density of audio and other time domain signals [1]. Audio signals can therefore be fully represented by a matrix whose rows, columns, and element values correspond, respectively, to the vertical axis $f$ (representing frequency), the horizontal axis $t$ (representing time), and the intensity or color of each point in the spectrogram image (indicating the amplitude of a particular frequency at a particular time). The spectrogram of a signal $x(t)$ can be estimated by computing the squared magnitude of the short-time Fourier transform (STFT) of the signal, and likewise, $x(t)$ can be recovered from the spectrogram through the inverse short-time Fourier transform (ISTFT) after processing the signal in the spectral domain [2].
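As a concrete illustration, the following round trip (a minimal sketch assuming scipy and numpy, with an illustrative test tone and window size, not the code used in this project) estimates a spectrogram from the STFT and recovers the signal with the ISTFT:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 44100                                   # sample rate in Hz (illustrative)
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)              # a one-second 440 Hz test tone

# STFT: rows of X index frequency f, columns index time t
f, frames, X = stft(x, fs=fs, nperseg=2048, noverlap=1536)
V = np.abs(X) ** 2                           # spectrogram = squared magnitude

# After spectral-domain processing, the time signal is recovered via ISTFT
# (keeping the complex STFT here, so the round trip is near-exact)
_, x_rec = istft(X, fs=fs, nperseg=2048, noverlap=1536)
print(np.max(np.abs(x - x_rec[:len(x)])))    # reconstruction error is tiny
```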

3.1 Modeling Source Separation as an NMF

Adopting this view, it follows that source separation could be achieved by factorizing spectrogram data as a mixture of prototypical spectra. There are many commonly practiced matrix factorization techniques, such as Singular Value Decomposition (SVD), Eigenvalue Decomposition, QR Decomposition (QR), and Lower-Upper Decomposition (LU). NMF, however, is a matrix factorization that assumes everything is non-negative, giving the technique an advantage when processing magnitude spectrograms. As an added advantage, non-negativity avoids destructive interference, guaranteeing that estimated sources must cumulatively add during resynthesis. In general, NMF decomposes a matrix as a product of two or more matrices as follows [3]:

1) $V \in \mathbb{R}^{F \times T}_{+}$, the original non-negative data

2) $W \in \mathbb{R}^{F \times K}_{+}$, the matrix of basis vectors, or dictionary elements

3) $H \in \mathbb{R}^{K \times T}_{+}$, the matrix of activations, weights, or gains

in the form

$$V \approx WH$$

Typically $K < F < T$, and $K$ is chosen such that $FK + KT \ll FT$, hence reducing dimensionality [4]. In the context of source separation, spectrogram data is modeled as $V$. The columns of $V$ are approximated as a weighted sum (or mixture) of basis vectors $W$ representing prototypical spectra, with $H$ representing time onsets or envelopes. NMF is underlaid by a well-defined statistical model of superimposed Gaussian components and is equivalent to maximum likelihood estimation of variance parameters. NMF can accommodate regularization constraints on the factors through Bayesian priors, in particular inverse-gamma and gamma Markov chain priors, and estimation can be carried out using generalized expectation-maximization. This can also be solved as a minimization of $D$, where $D$ is a measure of divergence. In the literature, factorization is usually framed as the optimization problem [4]

$$\min_{W,H \geq 0} D(V \,|\, WH)$$

commonly solved using the Euclidean distance or the generalized Kullback-Leibler (KL) divergence, defined as

$$d_{KL}(x|y) = x \log\frac{x}{y} - x + y$$

giving

$$\min_{W,H \geq 0} \; \sum_{f,t} V_{ft} \log\frac{V_{ft}}{(WH)_{ft}} - V_{ft} + (WH)_{ft}$$
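For reference, this objective can be computed directly; a small helper, assuming numpy, with an epsilon guarding log(0):

```python
import numpy as np

def kl_divergence(V, W, H, eps=1e-12):
    """Generalized Kullback-Leibler divergence between V and its model WH."""
    WH = W @ H
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)

# Illustrative shapes: F frequency bins, T frames, K components (K < F < T)
F, T, K = 64, 200, 4
rng = np.random.default_rng(0)
V = rng.random((F, T))
W, H = rng.random((F, K)), rng.random((K, T))
print(kl_divergence(V, W, H))
```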

The former is convex in $W$ and $H$ separately, but is not convex in both simultaneously. NMF does not always give an intuitive decomposition; however, explicitly controlling the sparseness and smoothness of the representation leads to representations that are parts-based and match the intuitive features of the data [5]. A deeper intuition is needed for how regularization techniques relate to the performance of these algorithms on audio data.

Fig. 2. A closer look at Non-Negative Matrix Factorization

Euclidean and KL divergence are both derived from a greater class of $\beta$-divergences. While it should be noted that the derivative of $d_\beta(x|y)$ with respect to $y$ is continuous in $\beta$, the KL divergence and the Euclidean distance are defined as ($\beta = 1$) and ($\beta = 2$) respectively. This is noteworthy since factorizations obtained with $\beta > 0$ rely more heavily on the largest data values, and less precision is to be expected in the estimation of the low-power components. This makes KL-NMF especially suitable for the decomposition of audio spectra, which typically exhibit exponential power decrease along frequency $f$ and also usually comprise low-power transient components, such as note attacks, together with higher-power components, such as the tonal parts of sustained notes [6]. Majorization-minimization (MM) can be performed using block coordinate descent, where $H$ is optimized for a fixed $W$, then $W$ is optimized for a fixed $H$, and this is repeated until convergence. Since a closed-form solution is intractable, the objective is bounded using Jensen's inequality, introducing weights $\lambda_{ftk}$ with $\sum_k \lambda_{ftk} = 1$, which gives, up to terms constant in $W$ and $H$,

$$D(V \,|\, WH) \;\leq\; \sum_{f,t} \Big( -V_{ft} \sum_k \lambda_{ftk} \log \frac{W_{fk} H_{kt}}{\lambda_{ftk}} \;+\; \sum_k W_{fk} H_{kt} \Big)$$

Choosing $\lambda_{ftk} = \frac{W_{fk} H_{kt}}{\sum_{k'} W_{fk'} H_{k't}}$, evaluated at the current iterates as suggested in [7], MM updates can be derived: majorization is achieved by calculating $\lambda_{ftk}$, and minimization is achieved by solving

$$\min_{W,H \geq 0} \; \sum_{f,t} \Big( -V_{ft} \sum_k \lambda_{ftk} \log (W_{fk} H_{kt}) \;+\; \sum_k W_{fk} H_{kt} \Big)$$
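Carrying out this alternating majorize-then-minimize step in closed form yields the familiar multiplicative updates for KL-NMF. The sketch below (assuming numpy; an illustration rather than the project's code) implements them:

```python
import numpy as np

def nmf_kl(V, K, n_iter=200, eps=1e-12, seed=0):
    """KL-NMF via the multiplicative (MM) updates; V is F x T, non-negative."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        # Majorize at the current W, H (the lambda weights above), then
        # minimize: each block update has a closed multiplicative form.
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0, keepdims=True).T + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
    return W, H
```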

As mentioned earlier, the MM estimation is equivalent to a generalized EM estimation, with the added benefit of accommodating regularization constraints on the factors through Markov chain priors. This is intuitive since the EM algorithm is a special case of MM, where the minorizing function is the expected conditional log likelihood. This approach stems from the fact that only $V_{ft}$ is observed, while the full model involves the unobserved component variables $k$. EM is used to fit parameters that maximize the likelihood of the data; maximizing over $p(k|t)$ and $p(f|k)$ gives EM updates where the E-step involves calculating

$$P(k|f,t) = \frac{P(k|t)\,P(f|k)}{\sum_k P(k|t)\,P(f|k)}$$

and an M-step maximizing

$$\sum_{f,t} V_{ft} \sum_k P(k|f,t)\,\log\big(P(k|t)\,P(f|k)\big)$$
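Equivalently, in this probabilistic view the same computation can be run as explicit E- and M-steps. The sketch below (assuming numpy; $P(f|k)$ plays the role of $W$, $P(k|t)$ the role of $H$, and the per-frame scaling is absorbed into the factors) is one way to write it:

```python
import numpy as np

def em_nmf(V, K, n_iter=100, eps=1e-12, seed=0):
    """EM for the latent-component model V_ft ~ sum_k P(k|t) P(f|k)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    Pf_k = rng.random((F, K))
    Pf_k /= Pf_k.sum(axis=0, keepdims=True)              # P(f|k), columns sum to 1
    Pk_t = rng.random((K, T))
    Pk_t /= Pk_t.sum(axis=0, keepdims=True)              # P(k|t), columns sum to 1
    for _ in range(n_iter):
        # E-step: posterior P(k|f,t) proportional to P(k|t) P(f|k)
        joint = Pf_k[:, :, None] * Pk_t[None, :, :]      # shape (F, K, T)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M-step: reweight by the observed energy V_ft, then renormalize
        weighted = V[:, None, :] * post                   # shape (F, K, T)
        Pf_k = weighted.sum(axis=2)
        Pf_k /= Pf_k.sum(axis=0, keepdims=True) + eps
        Pk_t = weighted.sum(axis=0)
        Pk_t /= Pk_t.sum(axis=0, keepdims=True) + eps
    return Pf_k, Pk_t
```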

Unfortunately, the number of parameters that need to be estimated is $FK + KT$; in such a high-dimensional setting it is useful to impose additional structure. This can be done using priors and regularization. Priors can encode structural assumptions, like sparsity; commonly the posterior distribution is summarized using the posterior mode (MAP). The other way to impose structure is through regularization, by adding another term to the objective function

$$\min_{W,H \geq 0} \; D(V \,\|\, WH) + \lambda\,\Omega(H)$$

where $\Omega$ encodes the desired structure and $\lambda$ controls its strength [7]. As discussed earlier, sparsity and smoothness are good choices for $\Omega(H)$; these structures are useful when encoding the transient features of common audio sources, and a sparsity-penalized update is sketched below. In addition, an interesting area of future work could be the inclusion of GMM-derived structure through clustering. The beginnings of this research are discussed further in this report.
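For example, with the sparsity penalty $\Omega(H) = \sum_{k,t} H_{kt}$, the strength $\lambda$ simply enters the denominator of the multiplicative $H$ update. A sketch, assuming numpy and the KL-NMF updates shown earlier:

```python
import numpy as np

def update_H_sparse(V, W, H, lam=0.1, eps=1e-12):
    """One H update for min D_KL(V || WH) + lam * sum(H): the L1 penalty
    adds lam to the denominator of the standard multiplicative update."""
    WH = W @ H + eps
    return H * (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + lam + eps)
```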

3.2 Separating Sources

In the current research, NMF is used to estimate $W$ and $H$, which are in turn used to derive masking filters $M_s$, as seen in the signal flow in Fig 2. Here audio source mixtures are first represented as spectrograms by an STFT algorithm, then factorized into $W$ and $H$ in order to derive masking filters used to extract estimated sources $X_s$, where $X_s = M_s \circ V$ and $\circ$ is the Hadamard product (an element-wise multiplication of the matrices). These estimates are then synthesized into time domain signals via an ISTFT algorithm, the original phase components $\angle V$ are added back into the estimates, and the results are passed down to the next stage of the pipeline for evaluation. Unsupervised, partially supervised, and fully supervised algorithms were evaluated; using this method, NMF's separation and source estimation performance was compared to the ICA and GMM algorithms against the performance criteria outlined later in this report.
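The report does not spell out the exact form of the masks $M_s$; a common choice consistent with the flow above is a Wiener-style soft mask built from each source's share of the factorized model. A sketch, assuming numpy/scipy and a known grouping of dictionary components into sources:

```python
import numpy as np
from scipy.signal import istft

def separate(X_mix, W, H, groups, fs, nperseg=2048, noverlap=1536, eps=1e-12):
    """X_mix: complex STFT of the mixture; groups: list of component-index
    lists, one per source. Returns time-domain source estimates."""
    V_hat = W @ H + eps                          # full factorized model
    phase = np.exp(1j * np.angle(X_mix))         # original phase, added back
    estimates = []
    for idx in groups:
        M_s = (W[:, idx] @ H[idx, :]) / V_hat    # soft mask for source s
        X_s = M_s * np.abs(X_mix) * phase        # Hadamard product with |V|
        _, x_s = istft(X_s, fs=fs, nperseg=nperseg, noverlap=noverlap)
        estimates.append(x_s)
    return estimates
```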

Fig. 3. Decomposing Spectrograms

4 GMM

$N$ sources are separated from a mixed signal by fitting a Gaussian Mixture Model (GMM) with $N$ components on the signal's magnitude spectrogram. Each bin in the original spectrogram is assigned to one of the $N$ GMM components, and spectrograms containing the bins assigned to each GMM component are inverted to produce estimates of the source signals. The spectrogram of the mixed signal was generated using a short-time Fourier transform (STFT), computed with a 2048-sample Kaiser window with a beta value of 18, a hop size of 64 samples, and zero-padding by a factor of 2. These parameters were chosen to minimize spectral leakage while providing extremely high resolution in both the time and frequency domains. The topmost graph in Fig 4 shows the spectrogram of a mixed signal, where the magnitude is on a decibel scale. All data points with a magnitude less than -40 dB were discarded; this threshold value was chosen experimentally as the value that most effectively disambiguated between silence and acoustic events.
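A sketch of this preprocessing, assuming scipy; the -40 dB floor presumes the spectrogram has been normalized so that 0 dB is full scale, which is our assumption rather than a detail given in the text:

```python
import numpy as np
from scipy.signal import stft

def spectrogram_points(x, fs):
    """STFT with a 2048-sample Kaiser (beta=18) window, 64-sample hop, and 2x
    zero-padding; returns thresholded (time, frequency) points for the GMM."""
    f, t, X = stft(x, fs=fs, window=('kaiser', 18), nperseg=2048,
                   noverlap=2048 - 64, nfft=4096)
    mag_db = 20 * np.log10(np.abs(X) + 1e-12)
    fi, ti = np.nonzero(mag_db > -40.0)          # keep bins above the floor
    points = np.column_stack([t[ti], f[fi]])     # (time, frequency) features
    return points, (fi, ti), X
```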

Fig. 4. GMM Source Separation Process


4.1 Clustering

After thresholding the data based on magnitude, the magnitude feature was discarded from the data used to fit the GMM. Fitting a GMM on data without magnitude consistently outperformed a GMM fit on data with magnitude included. This is not surprising, since the time and frequency of a bin are much more indicative of source signal membership than its magnitude. The GMM was fit to the data with a number of components equal to the number of source signals to be estimated. Each component was fit with a full (non-diagonal) covariance matrix, independently of the other components. A full covariance matrix reflects the highly unpredictable nature of the covariance of time and frequency in audio sources, while the independence of each component's covariance matrix reflects the tendency for some audio events to be highly concentrated in time or frequency while others are more dispersed. Initial values for the components were selected using the k-means algorithm.
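A sketch of this fitting step, assuming scikit-learn's GaussianMixture (which supports full covariance and k-means initialization directly):

```python
from sklearn.mixture import GaussianMixture

def fit_gmm(points, n_sources):
    """points: (n, 2) array of (time, frequency) bins; magnitude is discarded
    before fitting, as described above."""
    gmm = GaussianMixture(n_components=n_sources,
                          covariance_type='full',   # non-diagonal covariance
                          init_params='kmeans')     # k-means initialization
    gmm.fit(points)
    return gmm
```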

4.2 Source Estimation

After fitting the GMM to the data, each bin in the positive-frequency portion of the original mixed signal spectrogram underwent hard assignment to the component that maximized the posterior probability. The middle graph in Fig 4 shows the result of this assignment after fitting a GMM using both frequency and time information. A new spectrogram was generated for each component, consisting only of the bins in the original spectrogram that were assigned to that component. Finally, the bottom of Fig 4 shows the spectrograms of two estimated sources: the first having predominantly low, narrow frequency content, and the second having predominantly high, disperse frequency content. Each of these spectrograms was inverted using an inverse short-time Fourier transform to obtain estimates of the original source signals. The evaluation results for this algorithm are discussed at the end of this paper.
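A sketch of this assignment-and-inversion step, continuing from the two sketches above:

```python
import numpy as np
from scipy.signal import istft

def estimate_sources(gmm, X, points, point_index, fs):
    """X: complex mixture STFT; point_index: (fi, ti) arrays for kept bins."""
    labels = gmm.predict(points)                 # hard assignment: argmax posterior
    fi, ti = point_index
    estimates = []
    for k in range(gmm.n_components):
        keep = labels == k
        Xk = np.zeros_like(X)
        Xk[fi[keep], ti[keep]] = X[fi[keep], ti[keep]]   # bins assigned to k
        _, xk = istft(Xk, fs=fs, window=('kaiser', 18),
                      nperseg=2048, noverlap=2048 - 64, nfft=4096)
        estimates.append(xk)
    return estimates
```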

5 PERFORMANCE EVALUATION

In order to fully understand and compare the performance of each algorithm used in this paper, it is important to have an objective way of measuring an estimated source against its true source. Several methods commonly used to evaluate audio quality, such as PEAQ [8], are tailored to measure audio codec performance and are consequently not ideal for evaluating audio source separation algorithms. One set of metrics used to measure performance in the BSS literature that is particularly suited to studying the characteristics of different BSS algorithms comprises the Source to Distortion Ratio (SDR), Source to Interference Ratio (SIR), and Sources to Artifact Ratio (SAR) [9]. These ratios are derived by decomposing the estimated source into a true source component plus error terms corresponding to the interference of other sources and to artifacts introduced by the algorithm, and then calculating the relative energy of these components in the time domain. This evaluation technique hence has the advantage of providing information on how well a particular BSS algorithm suppresses the interference of other sources or the artifacts introduced while performing the separation. However, the relationship between human perception of the quality of separated results and these metrics is not well established. Accordingly, we propose an improved performance evaluation approach that relates human perception of audio signals to these metrics, based on modifying the method proposed in [9].

5.1 Critical Band Smoothing

Since human hearing is only sensitive to spectral features wider in frequency than a critical bandwidth, we can model how humans perceive an audio signal by blurring spectral features smaller than a critical bandwidth using a critical band smoothing procedure [10]. In this paper, the equivalent rectangular bandwidth (ERB) scale [11] is used for determining the critical band. The ERB number and frequency in kHz are related by the equations

$$b_E = 21.4 \log_{10}(4.37 f + 1.0) \quad \text{and} \quad f = (10^{b_E / 21.4} - 1.0)/4.37$$

With this relationship in hand, we can smooth the spectral features of an audio signal by replacing the magnitude of each frequency bin with its average magnitude across one critical bandwidth, using the calculation below:

$$P(\omega, \Delta) = \sum_{\varphi = f(b(\omega) - \frac{\Delta}{2})}^{f(b(\omega) + \frac{\Delta}{2})} |H(\varphi)|^2 \qquad (1)$$

where we use $\Delta = 0.5$ in this paper. The effect of critical band smoothing can be understood through the graph in Fig 5.
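A sketch of this smoothing, assuming numpy; the averaging interprets Eq. (1) as a mean over the $\Delta$-ERB window, matching the prose above:

```python
import numpy as np

def hz_to_erb(f_hz):
    """ERB number b_E from frequency in Hz (the text's f is in kHz)."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_to_hz(b):
    return (10.0 ** (b / 21.4) - 1.0) / 4.37 * 1000.0

def critical_band_smooth(power, freqs_hz, delta=0.5):
    """Replace each bin's power |H|^2 by its mean over a delta-ERB-wide band."""
    b = hz_to_erb(freqs_hz)
    lo = erb_to_hz(b - delta / 2.0)            # lower band edge per bin
    hi = erb_to_hz(b + delta / 2.0)            # upper band edge per bin
    smoothed = np.empty_like(power)
    for i in range(len(freqs_hz)):
        band = (freqs_hz >= lo[i]) & (freqs_hz <= hi[i])
        smoothed[i] = power[band].mean()
    return smoothed
```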

Fig. 5. Critical Band Smoothed Spectrum Example

5.2 Proposed Performance Criteria

In this article, a new performance criterion is designed to study the performance of different BSS algorithms in the context of human spectral perception. For the evaluation of the BSS algorithms in this paper, the estimated source is decomposed into three parts as discussed previously in [9], but instead of measuring the energy ratios in the time domain, we examine the energy ratios in the critical band smoothed frequency domain to account for human hearing perception. Specifically, the critical band smoothed spectrum of the estimated source is decomposed as $S_{estimate} = S_{true} + S_{interfere} + S_{artifact}$, where $S_{estimate}$ is the critical band smoothed spectrum of the estimated source, $S_{true}$ is the projection of $S_{estimate}$ onto the critical band smoothed spectrum of the targeted true source, $S_{interfere}$ is the sum of the projections of $S_{estimate}$ onto all other true sources (excluding the targeted true source), and finally $S_{artifact}$ is calculated as $S_{estimate} - (S_{true} + S_{interfere})$, which stands for the additional error, or artifacts, introduced by the BSS algorithm. The three metrics SDR, SIR, and SAR are defined in this paper as follows:

$$SDR = 10 \log_{10}\left(\frac{\|S_{true}\|^2}{\|S_{interfere} + S_{artifact}\|^2}\right) \qquad (2)$$

$$SIR = 10 \log_{10}\left(\frac{\|S_{true}\|^2}{\|S_{interfere}\|^2}\right) \qquad (3)$$

$$SAR = 10 \log_{10}\left(\frac{\|S_{true} + S_{interfere}\|^2}{\|S_{artifact}\|^2}\right) \qquad (4)$$

where $S_{true}$, $S_{interfere}$, and $S_{artifact}$ are the critical band smoothed spectra defined previously. It is critical to note that SAR stands for Sources to Artifact Ratio, and its numerator $\|S_{true} + S_{interfere}\|^2$ is the total energy of all the sources present in the estimated source. This arrangement makes SAR independent of SIR and yields a robust and accurate evaluation of the artifacts caused by the BSS algorithm. As a demonstration, an evaluation test using the newly established performance measurement, in which an estimated source was generated by adding progressively more white noise to a true source, is presented in Fig 6.
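A sketch of the proposed metrics, assuming numpy and treating "projection" as the orthogonal projection of one smoothed spectrum onto another:

```python
import numpy as np

def project(x, y):
    """Orthogonal projection of x onto y (both 1-D spectra)."""
    return (x @ y) / (y @ y) * y

def bss_metrics(s_est, s_target, s_others):
    """All inputs are critical-band-smoothed spectra; returns (SDR, SIR, SAR)
    in dB, following Eqs. (2)-(4)."""
    s_true = project(s_est, s_target)
    s_interfere = sum(project(s_est, s) for s in s_others)
    s_artifact = s_est - (s_true + s_interfere)
    sdr = 10 * np.log10(np.sum(s_true**2) / np.sum((s_interfere + s_artifact)**2))
    sir = 10 * np.log10(np.sum(s_true**2) / np.sum(s_interfere**2))
    sar = 10 * np.log10(np.sum((s_true + s_interfere)**2) / np.sum(s_artifact**2))
    return sdr, sir, sar
```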

Fig. 6. Performance Evaluation Demonstration

6 RESULTS & CONCLUSION

Through the algorithm evaluation pipeline, we have successfully generated the SDR, SIR, and SAR comparison among GMM, NMF, and ICA by testing on the same mixing generated from two sources: a bass and a drum audio clip. As mentioned in the performance evaluation section, SDR measures the general accuracy of the estimated source with respect to its targeted true source. SIR, on the other hand, measures how well the algorithm prevents other sources from interfering with the estimated source during separation. Finally, SAR evaluates how well the algorithm avoids artifacts. As is clear in Fig 7, fitting GMMs to spectrograms of mixed audio signals yielded the highest SIR measured, but yielded the lowest SDR except for unsupervised non-negative matrix factorization. These results are likely due to the sharp division in the frequency domain between source signal estimates created by the hard assignment of spectral data points to GMM components. The results of GMM in Fig 4 show a typical cutoff in the frequency domain. The abrupt cutoff in frequency prevents source signals in different frequency bands from interfering with the source being estimated; this suppression leads to GMM's high SIR value. However, a negative consequence of the sharp frequency cutoff is that signal content on the wrong side of the cutoff is excluded from the source estimation entirely, which leads to a low SDR value. On the contrary, Fast ICA, ICA, and NMF have similar SDR and SAR performance, the highest among all the algorithms we evaluated. There is, however, a major difference between ICA and NMF: ICA performs well in a determined system when there are sufficient mixing examples, while NMF performs well in an under-determined system but requires a supervised learning process on true source examples. Furthermore, we also evaluated the performance of the unsupervised and partially supervised versions of NMF. As we can see from the results in Fig 7, the performance of NMF drops according to the degree to which it is unsupervised.

Fig. 7. Fast ICA, ICA, GMM, Supervised NMF, Unsupervised NMF, and Partially-Supervised NMF


REFERENCES

[1] B. Boashash. Time Frequency Analysis. Elsevier Science, 2003.

[2] E. Jacobsen and R. Lyons. The sliding DFT. IEEE Signal Processing Magazine, 20(2):74-80, March 2003.

[3] P. Smaragdis and J. C. Brown. Non-negative matrix factorization for polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 177-180, October 2003.

[4] C. Fevotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 21(3):793-830, March 2009.

[5] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. CoRR, cs.LG/0408058, 2004.

[6] C. Fevotte and J. Idier. Algorithms for nonnegative matrix factorization with the beta-divergence. CoRR, abs/1010.1763, 2010.

[7] P. Smaragdis and J. C. Brown. Non-negative matrix factorization for polyphonic music transcription. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 177-180, October 19-22, 2003.

[8] T. Thiede, W. Treurniet, R. Bitto, C. Schmidmer, T. Sporer, J. Beerends, C. Colomes, M. Keyhl, G. Stoll, K. Brandenburg, and B. Feiten. PEAQ: the ITU standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society, 48(1/2):3-29, January/February 2000.

[9] E. Vincent, R. Gribonval, and C. Fevotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462-1469, July 2006.

[10] J. S. Abel and D. P. Berners. Signal processing techniques for digital audio effects. Pages 273-280, Spring 2011.

[11] B. C. J. Moore and B. R. Glasberg. A revision of Zwicker's loudness model. Acta Acustica, 82:335-345, 1996.