
Proc. of the 17th Int. Conference on Digital Audio Effects (DAFx-14), Erlangen, Germany, September 1-5, 2014

MUSIC-CONTENT-ADAPTIVE ROBUST PRINCIPAL COMPONENT ANALYSIS FOR A SEMANTICALLY CONSISTENT SEPARATION OF FOREGROUND AND BACKGROUND IN MUSIC AUDIO SIGNALS

Hélène Papadopoulos

Laboratoire des Signaux et Systèmes, UMR 8506, CNRS-SUPELEC-Univ. Paris-Sud, France
helene.papadopoulos[at]lss.supelec.fr

Daniel P.W. Ellis

LabROSA, Columbia University

dpwe[at]ee.columbia.edu

ABSTRACT

Robust Principal Component Analysis (RPCA) is a technique to decompose signals into sparse and low-rank components, and has recently drawn the attention of the MIR field for the problem of separating leading vocals from accompaniment, with appealing results obtained on small excerpts of music. However, the performance of the method drops when processing entire music tracks. We present an adaptive formulation of RPCA that incorporates music content information to guide the decomposition. Experiments on a set of complete music tracks of various genres show that the proposed algorithm is able to better process entire pieces of music that may exhibit large variations in the music content, and compares favorably with the state-of-the-art.

1. INTRODUCTION

In the general context of processing high-dimensional data, a recurrent problem consists in extracting specific information from a massive amount of related or unrelated information. Examples include recovering documents with specific topics from a collection of Web text documents [1] or detecting moving objects from camera recordings for video surveillance purposes [2]. Among numerous existing methods, the technique of Robust Principal Component Analysis (RPCA) [3, 4] has recently drawn a lot of attention. All the above-mentioned problems can be formulated as separating some foreground components (the keywords in Web data, the moving objects in video) from an underlying background (the background corpus topic in Web data, the stable environment in video), which can be respectively modeled as a sparse plus a low-rank contribution.

RPCA has been used extensively in the field of image processing (e.g., image segmentation [5], visual pattern correspondence [6], surveillance video processing [7], batch image alignment [8], etc.). However, its application in Music Information Retrieval (MIR) is much more recent. Existing applications in audio include audio classification, as in [9] where audio segments from video sound files are classified into classes (applause and laughter occurrences); [10] addresses the problem of refining available social tags obtained through social tagging websites to maximize their quality. The main application of the RPCA framework in music focuses on the task of separating a foreground component, usually the singing voice, from a background accompaniment in monaural polyphonic recordings, i.e., when only one channel of recording is available. This scenario is the primary focus of this paper.

The singing voice is a complex and important music signal attribute that has been much studied in MIR. Its separation is essential for many applications, such as singer identification [11], melody transcription [12], or query by humming [13]. We refer the reader to [14] for a recent review of singing voice separation methods. Recently, approaches that take advantage of repetition in the signal have emerged. These approaches assume that the background accompaniment has a repetitive musical structure, in contrast to the vocal signal whose repetitions, if any, occur only at a much larger timescale [15, 16, 17]. In [15], a simple method for separating music and voice is proposed, based on the extraction of the underlying repeating musical structure using binary time-frequency masking (REPET algorithm). The method assumes that there are no variations in the background and is thus limited to short excerpts. In [16], the method is generalized to permit the processing of complete musical tracks by relying on the assumption of local spectral periodicity. Moreover, artifacts are reduced by using soft masks. Inspired by these approaches, [17] proposes a model for singing voice separation based on repetition, but without using the hypothesis of local periodicity. The background musical accompaniment at a given frame is identified using the nearest neighbor frames in the whole mixture spectrogram.

Most recently, RPCA has emerged as a promising approach to singing voice separation, based on the idea that the repetitive musical accompaniment may lie in a low-rank subspace, while the singing voice is relatively sparse in the time-frequency domain [18]. The voice and the accompaniment are separated by decomposing the Short-Time Fourier Transform (STFT) magnitude (i.e., spectrogram) into sparse and low-rank components. When tested on short audio excerpts from the MIR-1K dataset¹, RPCA shows improvement over two state-of-the-art approaches [19, 15]. The decomposition is improved in [20] by adding a regularization term to incorporate a prior tendency towards harmonicity in the low-rank component, reflecting the fact that background voices can be described as a harmonic series of sinusoids at multiples of a fundamental frequency. A post-processing step is applied to the sparse component of the decomposition to eliminate the percussive sounds. [21] addresses the problem of jointly finding a sparse approximation of a varying component (e.g., the singing voice) and a repeating background (e.g., the musical accompaniment) in the same redundant dictionary. In parallel with the RPCA idea of [3], the mixture is decomposed into a sum of two components: a structured sparse matrix and an unstructured sparse matrix. Structured sparsity is enforced using mixed norms, along

¹ The MIR-1K dataset [19] is a set of 1000 short excerpts (4-13 s) extracted from 110 Chinese karaoke pop songs, where accompaniment and the singing voices are separately recorded. See https://sites.google.com/site/unvoicedsoundseparation/mir-1k.


with a greedy Matching Pursuit algorithm [22]. The model is evaluated on short popular music excerpts from the Beach Boys. [23] proposes a non-negative variant of RPCA, termed robust low-rank non-negative matrix factorization (RNMF). In this approach, the low-rank model is represented as a non-negative linear combination of non-negative basis vectors. The proposed framework allows incorporating unsupervised, semi- and fully-supervised learning, with supervised training drastically improving the results of the separation. Other related works, including [24, 25], address singing voice separation based on low-rank representations alone, but are beyond the scope of this article.

While RPCA performs well on the ~10 s clips of MIR-1K, the full-length Beach Boys examples of [14] give much less satisfying results. When dealing with whole recordings, the musical background may include significant changes in instrumentation and dynamics, which may rival the variation in the foreground and hence its rank in the spectrogram representation. Further, the foreground may vary in its complexity (e.g., solo voice followed by a duet) and may be unevenly distributed throughout the piece (e.g., entire segments with background only). Thus, the best way to apply RPCA to separate complete music pieces remains an open question.

In this article, we explore an adaptive version of RPCA (A-RPCA) that is able to handle complex music signals by taking into account the intrinsic musical content. We aim to adjust the task through the incorporation of domain knowledge that guides the decomposition towards results that are physically and musically meaningful. Time-frequency representations of music audio may be structured in several ways according to their content. For instance, the frequency axis can be segmented into regions corresponding to the spectral range of each instrument of the mixture. In the singing separation scenario, coefficients that are not in the singing voice spectral band should not be selected in the sparse layer. In the time dimension, music audio signals can generally be organized into a hierarchy of segments at different scales, each with its own semantic function (bar, phrase, entire section, etc.) and each having specific characteristics in terms of instrumentation, leading voice, etc. Importantly, as the segments become shorter, we expect the accompaniment to span less variation, and thus the rank of the background to reduce.

We will show a way for this music content information to be incorporated in the decomposition to allow an accurate processing of entire music tracks. More specifically, we incorporate voice activity information as a cue to separate the leading voice from the background. Music pieces can be segmented into vocal segments (where the leading voice is present) and background segments (that can be purely instrumental or may contain backing voices). Finding vocal segments (voicing detection [26]) is a subject that has received significant attention within MIR [26, 27, 28, 29]. The decomposition into sparse and low-rank components should be coherent with the semantic structure of the piece: the sparse (foreground) component should be denser in sections containing the leading voice, while portions of the sparse matrix corresponding to non-singing segments should ideally be null. Thus, while the technique remains the same as [18] at the lowest level, we consider the problem of segmenting a longer track into suitable pieces, and how to locally adapt the parameters of the decomposition by incorporating prior information.

2. ROBUST PRINCIPAL COMPONENT ANALYSIS VIA PRINCIPAL COMPONENT PURSUIT

In [3], Candès et al. show that, under very broad conditions, a data matrix D ∈ ℝ^{m×n} can be exactly and uniquely decomposed into a low-rank component A and a sparse component E via a convex program called Principal Component Pursuit (RPCA-PCP), given by

$$\min_{A,E} \; \|A\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad D = A + E \qquad (1)$$

where λ > 0 is a regularization parameter that trades between the rank of A and the sparsity of E. The nuclear norm ‖·‖_* (the sum of singular values) is used as a surrogate for the rank of A [30], and the ℓ1 norm ‖·‖_1 (the sum of absolute values of the matrix entries) is an effective surrogate for the ℓ0 pseudo-norm, the number of non-zero entries in the matrix [31, 32].

The Augmented Lagrange Multiplier method (ALM) and its practical variant, the Alternating Direction Method of Multipliers (ADM), have been proposed as efficient optimization schemes to solve this problem [33, 34, 35]. ALM works by minimizing the augmented Lagrangian function of (1):

$$\mathcal{L}(A, E, Y, \mu) = \|A\|_* + \lambda\|E\|_1 + \langle Y,\, A + E - D\rangle + \frac{\mu}{2}\,\|A + E - D\|_F^2 \qquad (2)$$

where Y ∈ ℝ^{m×n} is the Lagrange multiplier of the linear constraint that allows removing the equality constraint, μ > 0 is a penalty parameter for the violation of the linear constraint, ⟨·,·⟩ denotes the standard trace inner product, and ‖·‖_F is the Frobenius norm². ALM [34] is an iterative scheme that works by repeatedly minimizing A and E simultaneously. In contrast, ADM splits the minimization of (2) into two smaller and easier subproblems, with A and E minimized sequentially:

$$A^{k+1} = \arg\min_{A} \mathcal{L}(A, E^k, Y^k, \mu^k) \qquad (3a)$$

$$E^{k+1} = \arg\min_{E} \mathcal{L}(A^{k+1}, E, Y^k, \mu^k) \qquad (3b)$$

Both subproblems (3a) and (3b) are shrinkage problems that have closed-form solutions, which we briefly present here. We refer the reader to [34, 35] for more details. For convenience, we introduce the scalar soft-thresholding (shrinkage) operator S_ε[x]:

$$\mathcal{S}_\epsilon[x] = \operatorname{sgn}(x)\cdot\max(|x| - \epsilon,\, 0) = \begin{cases} x - \epsilon & \text{if } x > \epsilon \\ x + \epsilon & \text{if } x < -\epsilon \\ 0 & \text{otherwise} \end{cases}$$

where x ∈ ℝ and ε > 0. This operator can be extended to matrices by applying it element-wise.
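As a minimal illustration (our own NumPy sketch; the paper itself provides no code), the element-wise matrix extension of S_ε is:

```python
import numpy as np

def soft_threshold(X, eps):
    """Element-wise soft-thresholding (shrinkage) operator S_eps[X]."""
    return np.sign(X) * np.maximum(np.abs(X) - eps, 0.0)
```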

Problem (3a) is equivalent to

$$A^{k+1} = \arg\min_{A} \left\{ \|A\|_* + \frac{\mu^k}{2}\, \Big\|A - \Big(D - E^k + \frac{1}{\mu^k} Y^k\Big)\Big\|_F^2 \right\} \qquad (4)$$

which has, according to [36], a closed-form solution given by

$$A^{k+1} = U\, \mathcal{S}_{1/\mu^k}[\Sigma]\, V^T$$

² The Frobenius norm of a matrix A is defined as $\|A\|_F = \sqrt{\sum_{ij} A_{ij}^2}$.


where U ∈ ℝ^{m×r}, V ∈ ℝ^{n×r} and Σ ∈ ℝ^{r×r} are obtained via the singular value decomposition $(U, \Sigma, V) = \mathrm{SVD}(D - E^k + Y^k/\mu^k)$.
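In code, this A-update is a singular value thresholding step; a sketch (ours, using the SVD of NumPy):

```python
import numpy as np

def update_A(D, E_k, Y_k, mu_k):
    """A-update of Eq. (4): singular value thresholding of D - E^k + Y^k / mu^k."""
    U, s, Vt = np.linalg.svd(D - E_k + Y_k / mu_k, full_matrices=False)
    s = np.maximum(s - 1.0 / mu_k, 0.0)  # S_{1/mu^k} applied to the singular values
    return (U * s) @ Vt                  # equivalent to U @ diag(s) @ Vt
```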

Problem (3b) can be written as

$$E^{k+1} = \arg\min_{E} \left\{ \lambda\|E\|_1 + \frac{\mu^k}{2}\, \Big\|E - \Big(D - A^{k+1} + \frac{1}{\mu^k} Y^k\Big)\Big\|_F^2 \right\} \qquad (5)$$

whose solution is given by the least-absolute shrinkage and selection operator (Lasso) [37], a method also known in the signal processing community as basis pursuit denoising [38]:

$$E^{k+1} = \mathcal{S}_{\lambda/\mu^k}\Big[D - A^{k+1} + \frac{Y^k}{\mu^k}\Big]$$

In other words, denoting $G^E = D - A^{k+1} + \frac{Y^k}{\mu^k}$,

$$\forall i \in [1, m],\ \forall j \in [1, n]: \quad E^{k+1}_{ij} = \operatorname{sgn}(G^E_{ij}) \cdot \max\Big(|G^E_{ij}| - \frac{\lambda}{\mu^k},\, 0\Big)$$
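Using the soft_threshold helper above, the E-update reduces to a single shrinkage (again an illustrative sketch):

```python
def update_E(D, A_next, Y_k, mu_k, lam):
    """E-update of Eq. (5): element-wise shrinkage of G^E = D - A^{k+1} + Y^k / mu^k."""
    G_E = D - A_next + Y_k / mu_k
    return soft_threshold(G_E, lam / mu_k)
```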

3. ADAPTIVE RPCA (A-RPCA)

As discussed in Section 1, in a given song the foreground vocals typically exhibit a clustered distribution in the time-frequency plane, relating to the semantic structure of the piece that alternates between vocal and non-vocal (background) segments. This structure should be reflected in the decomposition: frames belonging to singing voice-inactive segments should result in zero-valued columns in E.

The balance between the sparse and low-rank contributions is set by the value of the regularization parameter λ. The voice separation quality with respect to the value of λ for the Pink Noise Party song "Their Shallow Singularity" is illustrated in Fig. 1. As we can observe, the best λ differs depending on whether we process the entire song or restrict processing to just the singing voice-active parts. Because the separation for the background part is monotonically better as λ increases, the difference between the optimum λ values indicates that the global separation quality is a compromise between the singing voice and the background part.

Figure 1: Variation of the estimated singing voice NSDR (see definition in Section 4) as a function of the value of λ (from 0.5 to 100), under two situations: (•) NSDR when only the singing voice-active parts of the separated signal are processed; (*) NSDR when the entire signal is processed.

Figure 2: Waveform of the separated voice for various values of λ, for the song "Is This Love" by Bob Marley. From top to bottom: clean voice, λ = λ1, 2λ1, 5λ1, 10λ1.

In the theoretical formulation of RPCA-PCP [3], there is no single value of λ that works for separating sparse from low-rank components in all conditions. They recommend λ = max(m, n)^{-1/2}, but also note that the decomposition can be improved by choosing λ in light of prior knowledge about the solution. In practice, we have found that the decomposition of music audio is very sensitive to the choice of λ, with frequently no single value able to achieve a satisfying separation between voice and instrumental parts across a whole recording. This is illustrated in Fig. 2, which shows the waveforms of the resynthesized separated voice obtained with the RPCA-PCP formulation for various λ. For λ = λ1 = 1/√max(m, n) and λ2 = 2λ1, around t = 115 s (dashed rectangle) there is a non-zero contribution in the voice layer but no actual lead vocal. This is eliminated with λ = 5λ1, 10λ1, but at the expense of a very poor quality voice estimate: the resulting signal consists of percussive sounds and higher harmonics of the instruments, and does not resemble the voice. Note that similar observations have been made in the context of video surveillance [39].

To address the problem of variations in λ, we propose an adaptive variant of RPCA, consisting of a weighted decomposition that incorporates prior information about the music content. Specifically, voice activity information is used as a cue to adjust the regularization parameter through the entire analyzed piece in step (3b), and therefore better match the balance between sparse and low-rank contributions to the actual music content. This idea is related to previous theoretical work [40, 41, 42], but to our knowledge its application in the framework of RPCA is new.

We consider a time segmentation of the magnitude spectrogram into N_block consecutive (non-overlapping) blocks of vocal / non-vocal (background accompaniment) segments. We can represent the magnitude spectrogram as a concatenation of column-blocks D = [D_1, D_2, ..., D_{N_block}], the sparse layer as E = [E_1, ..., E_{N_block}], and G^E = [G^E_1, ..., G^E_{N_block}].

We can minimize the objective function with respect to each column-block separately. To guide the separation, we aim at setting a different value of λ_l, l ∈ [1, N_block], for each block, according to the voice activity side information. For each block, the problem is equivalent to Eq. (5), and accordingly the solution to the resulting problem

$$E^{k+1}_l = \arg\min_{E_l} \left\{ \lambda_l \|E_l\|_1 + \frac{\mu^k}{2}\, \|E_l - G^E_l\|_F^2 \right\}$$

is given by

$$E^{k+1}_l = \mathcal{S}_{\lambda_l/\mu^k}\big[G^E_l\big] \qquad (6)$$


Algorithm 1: Adaptive RPCA (A-RPCA)

Input: spectrogram D; blocks; λ; λ_1, ..., λ_{N_block}
Output: E, A
Initialization: Y^0 = D / J(D), where J(D) = max(‖D‖_2, λ^{-1}‖D‖_∞); E^0 = 0; μ^0 > 0; ρ > 1; k = 0
while not converged do
    update A:
        (U, Σ, V) = SVD(D - E^k + Y^k/μ^k)
        A^{k+1} = U S_{1/μ^k}[Σ] V^T
    update E:
        for each block l do
            E^{k+1}_l = S_{λ_l/μ^k}[D_l - A^{k+1}_l + Y^k_l/μ^k]
        end for
        E^{k+1} = [E^{k+1}_1, E^{k+1}_2, ..., E^{k+1}_{N_block}]
    update Y, μ:
        Y^{k+1} = Y^k - μ^k (A^{k+1} + E^{k+1} - D)
        μ^{k+1} = ρ · μ^k
    k = k + 1
end while
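A compact NumPy rendering of Algorithm 1 might look as follows. This is an illustrative sketch rather than the authors' implementation: the convergence test and the values of μ^0 and ρ are common ALM heuristics that the paper leaves unspecified.

```python
import numpy as np

def a_rpca(D, bounds, lambdas, lam, rho=1.5, tol=1e-7, max_iter=500):
    """Adaptive RPCA sketch. Frames bounds[l]:bounds[l+1] form block l with
    regularization lambdas[l]; lam is the global lambda used in J(D)."""
    J = max(np.linalg.norm(D, 2), np.abs(D).max() / lam)  # J(D)
    Y = D / J                                             # Y^0 = D / J(D)
    E = np.zeros_like(D)                                  # E^0 = 0
    mu = 1.25 / np.linalg.norm(D, 2)                      # heuristic mu^0 (assumption)
    for _ in range(max_iter):
        # update A: singular value thresholding of D - E^k + Y^k / mu^k
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # update E: block-wise shrinkage with block-specific lambda_l
        G = D - A + Y / mu
        for l, lam_l in enumerate(lambdas):
            lo, hi = bounds[l], bounds[l + 1]
            E[:, lo:hi] = np.sign(G[:, lo:hi]) * np.maximum(
                np.abs(G[:, lo:hi]) - lam_l / mu, 0.0)
        # update Y and mu
        Y = Y - mu * (A + E - D)
        mu *= rho
        # stop when the relative constraint residual is small (assumed criterion)
        if np.linalg.norm(D - A - E, 'fro') < tol * np.linalg.norm(D, 'fro'):
            break
    return A, E
```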

Denote by λ_v the constant value of the regularization parameter λ used in the basic formulation of RPCA for voice separation [18]. To guide the separation in the A-RPCA formulation, we assign to each block a value λ_l in accordance with the considered prior music structure information. Using a large λ_l in blocks without leading voice will favor retaining non-zero coefficients in the accompaniment layer. Denoting by Ω_V the set of time frames that contain voice, the values of λ_l are set as

$$\forall l \in [1, N_{block}]: \quad \lambda_l = \begin{cases} \lambda_v & \text{if } E_l \subset \Omega_V \\ \lambda_{nv} & \text{otherwise} \end{cases} \qquad (7)$$

with λ_nv > λ_v, to enhance the sparsity of E when no vocal activity is detected. Note that, instead of two distinct values of λ_l, further improvements could be obtained by tuning λ_l more precisely to suit the segment characteristics. For instance, vibrato information could be used to quantify the amount of voice in the mixture within each block, and to set a specific regularization parameter accordingly. The update rules of the A-RPCA algorithm are detailed in Algorithm 1.
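For concreteness, a per-frame voicing decision (for example, the MELODIA output discussed below, mapped onto STFT frames) can be turned into the block boundaries and λ_l values consumed by the sketch above; `voiced` is a hypothetical boolean vector of our own devising:

```python
import numpy as np

def block_lambdas(voiced, lambda_v, lambda_nv):
    """Eq. (7): split frames into maximal runs of constant voicing and assign
    lambda_v to vocal blocks and lambda_nv > lambda_v to background blocks."""
    v = np.asarray(voiced, dtype=int)
    change = np.flatnonzero(np.diff(v)) + 1           # frames where the label flips
    bounds = np.concatenate(([0], change, [v.size]))  # non-overlapping block bounds
    lams = [lambda_v if v[lo] else lambda_nv for lo in bounds[:-1]]
    return bounds, lams
```

With λ_v = 1/√max(m, n) and λ_nv = 5λ_v, this reproduces the A-RPCA_GT setting used in Section 4.1.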

In Section 4, we investigate the results of adaptive RPCA with both exact (ground-truth) and estimated vocal activity information. For estimating vocal activity information, we use the voicing detection step of the melody extraction algorithm implemented in the MELODIA Melody Extraction vamp plug-in³, as it is freely available for people to download and use. We refer the reader to [26] and references therein for other voicing detection algorithms. The algorithm for the automatic extraction of the main melody from polyphonic music recordings implemented in MELODIA is a salience-based model that is described in [43]. It is based on the creation and characterization of pitch contours, grouped using auditory streaming cues, and includes a voice detection step that indicates when the melody is present; we use this melody location as an indicator of leading voice activity. Note that while melody can sometimes be carried by other instruments, in the evaluation dataset of Section 4 it is mainly singing.

³ http://mtg.upf.edu/technologies/melodia

4. EVALUATION

In this section, we present the results of our approach, evaluated on a database of complete music tracks of various genres. We compare the proposed adaptive method with the baseline method [18], as well as another state-of-the-art method [16]. Sound examples discussed in the article can be found at http://papadopoulosellisdafx14.blogspot.fr.

4.1. Parameters, Dataset and Evaluation Criteria

To evaluate the proposed approach, we have constructed a database of 12 complete music tracks of various genres, with separated vocal and accompaniment files, as well as mixture versions formed as the sum of the vocal and accompaniment files. The tracks, listed in Tab. 1, were created from multitracks mixed in Audacity⁴, then exported with or without the vocal or accompaniment lines.

Following previous work [18, 44, 15], the separations are evaluated with metrics from the BSS-EVAL toolbox [45], which provides a framework for the evaluation of source separation algorithms when the original sources are available for comparison. Three ratios are considered for both sources: Source-to-Distortion (SDR), Source-to-Interference (SIR), and Source-to-Artifacts (SAR). In addition, we measure the improvement in SDR between the mixture d and the estimated resynthesized singing voice ê by the Normalized SDR (NSDR, also known as SDR improvement, SDRI), defined for the voice as NSDR(ê, e, d) = SDR(ê, e) - SDR(d, e), where e is the original clean singing voice. The same measure is used for the evaluation of the background. Each measure is computed globally on the whole track, but also locally according to the segmentation into vocal/non-vocal segments. Higher values of the metrics indicate better separation.
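As a sketch, NSDR can be computed with the mir_eval reimplementation of the BSS-EVAL metrics (an assumption on our part; the paper uses the original BSS-EVAL toolbox [45], so exact figures may differ slightly):

```python
import numpy as np
import mir_eval

def voice_nsdr(est_voice, clean_voice, mixture):
    """NSDR(e_hat, e, d) = SDR(e_hat, e) - SDR(d, e) for the singing voice."""
    ref = clean_voice[np.newaxis, :]
    sdr_est, _, _, _ = mir_eval.separation.bss_eval_sources(ref, est_voice[np.newaxis, :])
    sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(ref, mixture[np.newaxis, :])
    return sdr_est[0] - sdr_mix[0]
```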

We compare the results of A-RPCA, with musically-informed adaptive λ, against the baseline RPCA method [18], with fixed λ, using the same parameter settings: in the analysis stage, the STFT of each mixture is computed using a window length of 1024 samples with 75% overlap, at a sampling rate of 11.5 kHz. No post-processing (such as masking) is added. After spectrogram decomposition, the signals are reconstructed using the inverse STFT and the phase of the original signal.
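For reference, this analysis/resynthesis chain could be realized with librosa (our assumption; the paper does not name a toolkit, and 'mixture.wav' is a placeholder):

```python
import numpy as np
import librosa

y, fs = librosa.load('mixture.wav', sr=11500)    # mixture resampled to 11.5 kHz
S = librosa.stft(y, n_fft=1024, hop_length=256)  # 1024-sample window, 75% overlap
D, phase = np.abs(S), np.angle(S)
# ... decompose D into low-rank A and sparse E with (A-)RPCA ...
# voice = librosa.istft(E * np.exp(1j * phase), hop_length=256)
# background = librosa.istft(A * np.exp(1j * phase), hop_length=256)
```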

The parameter λ is set to 1/√max(m, n) in the baseline method. Two different versions of the proposed A-RPCA algorithm are evaluated. First, A-RPCA with exact voice activity information, using manually annotated ground truth (A-RPCA_GT), with λ_l = λ for singing voice regions and λ_l = 5λ for background-only regions. In the other configuration, estimated voice activity location is used (A-RPCA_est), with the same settings for the λ_l.

We also compare our approach with the REPET state-of-the-art algorithm based on repeating pattern discovery and binary time-frequency masking [16]. Note that we use for comparison the version of REPET that is designed for processing complete musical tracks (as opposed to the original one introduced in [15]). This method includes a simple low-pass-filtering post-processing step [46] that consists in removing all frequencies below 100 Hz from the vocal signal and adding these components back into the background layer. We further apply this post-processing step to our model before comparison with the REPET algorithm.
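On the magnitude spectrograms, this post-processing amounts to moving the rows of the sparse layer below 100 Hz into the background layer; a sketch under the STFT settings above:

```python
import numpy as np

def lowfreq_postprocess(E, A, fs=11500, n_fft=1024, cutoff=100.0):
    """Transfer all sparse-layer (voice) content below cutoff Hz into the
    background layer, following the post-processing of [46]."""
    k = int(np.ceil(cutoff * n_fft / fs))  # first STFT bin at or above the cutoff
    E, A = E.copy(), A.copy()
    A[:k, :] += E[:k, :]
    E[:k, :] = 0.0
    return E, A
```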

Paired-sample t-tests at the 5% significance level are performed to determine whether there is statistical significance in the results between various configurations.

⁴ http://audacity.sourceforge.net


Table 1: Sound excerpts used for the evaluation. %back: proportion of background (no leading voice) segments (in % of the whole excerpt duration); Rec and FA: voicing detection Recall and False Alarm rates (in %).

Name | %back | Rec | FA
1 - Beatles, Sgt Pepper's Lonely Hearts Club Band | 49.3 | 74.74 | 45.56
2 - Beatles, With A Little Help From My Friends | 13.5 | 70.10 | 14.71
3 - Beatles, She's Leaving Home | 24.6 | 77.52 | 30.17
4 - Beatles, A Day in The Life | 35.6 | 61.30 | 63.96
5-6 - Puccini, piece for soprano and piano | 24.7 | 47.90 | 27.04
7 - Pink Noise Party, Their Shallow Singularity | 42.1 | 64.15 | 61.83
8 - Bob Marley, Is This Love | 37.2 | 66.22 | 36.84
9 - Doobie Brothers, Long Train Running | 65.6 | 84.12 | 58.51
10 - Marvin Gaye, Heard it Through The Grapevine | 30.2 | 79.22 | 17.90
11 - The Eagles, Take it Easy | 35.5 | 78.68 | 30.20
12 - The Police, Message in a Bottle | 24.9 | 73.90 | 20.44

4.2. Results and Discussion

Results of the separation for the sparse (singing voice) and low-rank (background accompaniment) layers are presented in Tables 2, 3, 4 and 5. To have a better insight into the results, we present measures computed both on the entire song and on the singing voice-active part only, which is obtained by concatenating all segments labeled as vocal segments in the ground truth.

• Global separation results. As we can see from Tables 2 and 3, using a musically-informed adaptive regularization parameter improves the results of the separation, both for the background and the leading voice components. Note that the larger the proportion of purely-instrumental segments in a piece (see Tab. 1), the larger the improvement in the results (see in particular pieces 1, 7, 8 and 9), which is consistent with the goal of the proposed method. Statistical tests show that the improvement in the results is significant.

As discussed in Section 3, the quality of the separation with the baseline method [18] depends on the value of the regularization parameter. Moreover, the value that leads to the best separation quality differs from one music excerpt to another. Thus, when automatically processing a collection of music tracks, the choice of this value results from a trade-off. We report here results obtained with the typical choice λ_v = 1/√max(m, n) in Eq. (7). Note that for a given value of λ_v in the baseline method, the separation can always be further improved by the A-RPCA algorithm using a regularization parameter that is adapted to the music content, based on prior music structure information: in all experiments, for a given constant value λ_v in the baseline method, setting λ_nv > λ_v in Eq. (7) improves the results.

For the singing voice layer, improved SDR (better overall separation performance) and SIR (better capability of removing music interference from the singing voice) with A-RPCA are obtained at the price of introducing more artifacts in the estimated voice (lower SAR_voice). Listening tests reveal that in some segments processed by A-RPCA, as for instance segment [1'00''-1'15''] in Fig. 3, one can hear some high-frequency isolated coefficients superimposed on the separated voice. This drawback could be reduced by including harmonicity priors in the sparse component of RPCA, as proposed in [20]. This performance trade-off is commonly encountered in music/voice separation [14, 47]. However, we can notice that all three measures are significantly improved with A-RPCA for the background layer.

• Ground truth versus estimated voice activity location. Imperfect voice activity location information still allows an improvement, although to a lesser extent than with ground-truth voice activity information. In Table 1, we report the accuracy results of the voicing detection step. Similarly to the measures used for melody

Figure 3: Separated voice for various values of λ for the Pink Noise Party song Their Shallow Singularity. From top to bottom: clean voice; constant λ1 = 1/√max(m, n); constant λ = 5λ1; adaptive λ = (λ1, 5λ1).

detection in [48, 12], we consider the Voicing Recall Rate, defined as the proportion of frames labeled voiced in the ground truth that are estimated as voiced frames by the algorithm, and the Voicing False Alarm Rate, defined as the proportion of frames labeled as unvoiced in the ground truth that are mistakenly estimated to be voiced by the algorithm. The decrease in the results mainly comes from background segments classified as vocal segments. However, statistical tests show that the improvement in the results between RPCA and A-RPCA_est is still significant.

• Local separation results. It is interesting to note that using an adaptive regularization parameter in a unified analysis of the whole piece is different from separately analyzing the successive vocal/non-vocal segments with different but constant values of λ (see for instance the dashed rectangle areas in Fig. 3).

• Analysis of the results on vocal segments. We expect the separation on background-only parts of the song to be improved with the A-RPCA algorithm. Indeed, the side information directly indicates these regions, where the foreground (sparse) components should be avoided; this can be clearly seen in Fig. 3. However, the improvements under the proposed model are not limited to non-vocal regions only. Results measured on the vocal segments alone indicate that, by using the adaptive algorithm, the voice is also better estimated, as shown in Table 3. The improvement over RPCA is statistically significant, both when using ground truth and estimated voice activity location information. This indicates that side information helps not only to better determine the background-only segments, but also enables improved recovery of the singing voice, presumably because the low-rank background model is a better match to the actual background.

Side information could have been added as a pre- or post-processing step to the RPCA algorithm. The adaptive RPCA algorithm presents advantages over such approaches. To analyze this, we compare the


Table 2: SDR, SIR, SAR (in dB) and NSDR results for the voice (Voice) and background (Back) layers, computed across the whole song for all models, averaged across all the songs. RPCA is the baseline system; A-RPCA_GT is the adaptive version using ground truth voice activity information; A-RPCA_est uses estimated voice activity.

Entire song | RPCA | A-RPCA_GT | A-RPCA_est
Voice SDR (dB) | -4.66 | -2.16 | -3.18
Voice SIR (dB) | -3.86 | 0.74 | -0.46
Voice SAR (dB) | 8.99 | 4.81 | 3.94
Voice NSDR | 1.70 | 4.20 | 3.18
Back SDR (dB) | 4.14 | 6.52 | 6.08
Back SIR (dB) | 11.48 | 13.30 | 12.07
Back SAR (dB) | 5.51 | 8.03 | 7.83
Back NSDR | -2.35 | 0.03 | -0.41

Table 3: SDR, SIR, SAR (in dB) and NSDR results for the voice (Voice) and background (Back) layers, computed across the vocal segments only, for all models, averaged across all the songs. RPCA is the baseline system; A-RPCA_GT is the adaptive version using ground truth voice activity information; A-RPCA_est uses estimated voice activity.

Vocal segments | RPCA | A-RPCA_GT | A-RPCA_est
Voice SDR (dB) | -3.19 | -2.00 | -1.96
Voice SIR (dB) | -2.33 | -0.39 | 0.74
Voice SAR (dB) | 9.44 | 7.27 | 4.64
Voice NSDR | 1.67 | 2.85 | 2.90
Back SDR (dB) | 3.63 | 5.18 | 5.28
Back SIR (dB) | 9.95 | 10.64 | 10.41
Back SAR (dB) | 5.39 | 7.32 | 7.54
Back NSDR | -1.37 | 0.18 | 0.29

Table 4: SDR, SIR, SAR (in dB) and NSDR results for the voice (Voice) and background (Back) layers, computed across the whole song for all models, averaged across all the songs. RPCA is the baseline system; A-RPCA_GT is the adaptive version using ground truth voice activity information; A-RPCA_est uses estimated voice activity. Low-pass filtering post-processing is applied. REPET is the comparison algorithm [16].

Entire song | RPCA | A-RPCA_GT | A-RPCA_est | REPET
Voice SDR (dB) | -2.76 | -0.72 | -2.11 | -2.20
Voice SIR (dB) | -0.17 | 4.03 | 2.22 | 1.34
Voice SAR (dB) | 4.33 | 3.33 | 2.32 | 3.19
Voice NSDR | 3.60 | 5.64 | 4.25 | 4.16
Back SDR (dB) | 5.16 | 7.61 | 6.81 | 5.01
Back SIR (dB) | 14.53 | 14.49 | 12.99 | 16.83
Back SAR (dB) | 5.96 | 9.02 | 8.44 | 5.47
Back NSDR | -1.32 | 1.12 | 0.33 | -1.48

Table 5: SDR, SIR, SAR (in dB) and NSDR results for the voice (Voice) and background (Back) layers, computed across the vocal segments only, for all models, averaged across all the songs. RPCA is the baseline system; A-RPCA_GT is the adaptive version using ground truth voice activity information; A-RPCA_est uses estimated voice activity. Low-pass filtering post-processing is applied. REPET is the comparison algorithm [16].

Vocal segments only | RPCA | A-RPCA_GT | A-RPCA_est | REPET
Voice SDR (dB) | -1.25 | -0.53 | -0.83 | -0.70
Voice SIR (dB) | 1.49 | 3.04 | 3.62 | 3.02
Voice SAR (dB) | 5.02 | 4.46 | 3.12 | 4.02
Voice NSDR | 3.60 | 4.32 | 4.02 | 4.15
Back SDR (dB) | 4.85 | 6.03 | 6.11 | 4.80
Back SIR (dB) | 13.07 | 12.38 | 11.41 | 15.33
Back SAR (dB) | 5.91 | 7.69 | 8.20 | 5.41
Back NSDR | -0.14 | 1.03 | 1.11 | -0.20

A-RPCA algorithm with two variants of RPCA incorporating side information, either as a pre- or a post-processing step:

• RPCA_OV pre: Only the concatenation of segments classified as vocal is processed by RPCA (the singing voice estimate being set to zero in the remaining non-vocal segments).

• RPCA_OV post: The whole song is processed by RPCA, and non-zero coefficients estimated as belonging to the voice layer in non-vocal segments are transferred to the background layer.

Results of the decomposition computed across the vocal segments only are presented in Table 6. Note that the RPCA_OV post results reduce to the RPCA results in Table 3, since they are computed on vocal segments only. There is no statistical difference between the estimated voice obtained by processing the whole song with RPCA and processing the vocal segments only. Results are significantly better using the A-RPCA algorithm than using RPCA_OV pre and RPCA_OV post. This is illustrated in Figure 4, which shows an example of the decomposition on an excerpt of the Doobie Brothers song Long Train Running, composed of a non-vocal followed by a vocal segment. We can see that there are misclassified partials in the voice spectrogram obtained with the baseline RPCA that are removed with A-RPCA. Moreover, the gap in the singing voice around frame 50 (breathing) is cleaner in the case of A-RPCA than in the case of RPCA. Listening tests confirm that the background is better attenuated in the voice layer when using A-RPCA.

Table 6: SDR, SIR, SAR (in dB) and NSDR results for the voice (Voice) and background (Back) layers, computed across the vocal segments only, averaged across all the songs. RPCA_OV post uses the baseline system and sets the voice estimate to zero in background-only segments; RPCA_OV pre processes only the voice segments with the baseline model; A-RPCA_GT is the adaptive version using ground truth voice activity information; A-RPCA_est uses estimated voice activity.

 | RPCA_OV post | RPCA_OV pre | A-RPCA_GT | A-RPCA_est
Voice SDR | -3.19 | -3.28 | -2.00 | -1.96
Voice SIR | -2.33 | -2.31 | 3.62 | 0.74
Voice SAR | 9.44 | 8.97 | 7.27 | 4.64
Voice NSDR | 1.67 | 1.57 | 2.85 | 2.90
Back SDR | 3.63 | 3.72 | 5.18 | 5.28
Back SIR | 9.95 | 9.22 | 10.64 | 10.41
Back SAR | 5.39 | 5.85 | 7.32 | 7.54
Back NSDR | -1.37 | -1.28 | 0.18 | 0.29

• Comparison with the state-of-the-art. As we can see from Table 4, the results obtained with the RPCA baseline method are not better than those obtained with the REPET algorithm. On the contrary, the REPET algorithm is significantly outperformed by the A-RPCA algorithm when using ground truth voice activity information, both for the sparse and low-rank layers. However, note that when using estimated voice activity information, the


Figure 4: [Top figure] Example decomposition on an excerpt of the Doobie Brothers song Long Train Running, and [Bottom figure] zoom between frames [525-580] (dashed rectangle in the top figure). For each figure, the top pane shows the part between 0 and 500 Hz of the spectrogram of the original signal. The clean singing voice appears in the second pane. The separated singing voice obtained with the baseline model (RPCA), with the baseline model when restricting the analysis to singing voice-active segments only (RPCA_OV pre), and with the proposed A-RPCA model are represented in panes 3 to 5. For comparison, the sixth pane shows the results obtained with REPET [16].

difference in the results between REPET and A-RPCA is not statistically significant for the sparse layer. If we look closer at the results, it is interesting to note that the voice estimation improvement by A-RPCA_GT over REPET mainly comes from the non-vocal parts, where the voice estimate is favored to be null. Indeed, Table 5 indicates that the voice estimates on vocal segments obtained with A-RPCA_GT and REPET are similar. This is illustrated by the two last panes in the [bottom] Figure 4, which show similar spectrograms of the voice estimates obtained with the A-RPCA and REPET algorithms on the vocal part of the excerpt.

5. CONCLUSION

We have explored an adaptive version of the RPCA technique that allows the processing of entire pieces of music, including local variations in the music structure. Music content information is incorporated in the decomposition to guide the selection of coefficients in the sparse and low-rank layers according to the semantic structure of the piece. This motivates the choice of using a regularization parameter that is informed by musical cues. Results indicate that, with the proposed algorithm, not only are the background segments better discriminated, but also the singing voice is better estimated in vocal segments, presumably because the low-rank background model is a better match to the actual background. The method could be extended with other criteria (singer identification, vibrato saliency, etc.). It could also be improved by incorporating additional information to set the regularization parameters differently for each track, to better accommodate the varying contrast of foreground and background. The idea of an adaptive decomposition could also be improved with a more complex formulation of RPCA that incorporates additional constraints [20], or a learned dictionary [49].

6. REFERENCES

[1] K. Min, Z. Zhang, J. Wright, and Y. Ma, "Decomposing background topics from keywords by principal component pursuit," in CIKM, 2010.

[2] S. Brutzer, B. Hoferlin, and G. Heidemann, "Evaluation of background subtraction techniques for video surveillance," in CCVPR, 2011, pp. 1937-1944.

[3] E.J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis," Journal of the ACM, vol. 58, no. 3, Article 11, 2011.

[4] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky, "Sparse and low-rank matrix decompositions," in Sysid, 2009.

[5] B. Cheng, G. Liu, J. Wang, Z. Huang, and S. Yan, "Multi-task low-rank affinity pursuit for image segmentation," in ICCV, 2011, pp. 2439-2446.

[6] Z. Zeng, T.H. Chan, K. Jia, and D. Xu, "Finding correspondence from multiple images via sparse and low-rank decomposition," in ECCV, 2012, pp. 325-339.

[7] F. Yang, H. Jiang, Z. Shen, W. Deng, and D.N. Metaxas, "Adaptive low rank and sparse decomposition of video using compressive sensing," CoRR, vol. abs/1302.1610, 2013.

[8] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, "RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2233-2246, 2012.

[9] Z. Shi, J. Han, T. Zheng, and S. Deng, "Online learning for classification of low-rank representation features and its applications in audio segment classification," CoRR, vol. abs/1112.4243, 2011.

[10] Y.H. Yang, D. Bogdanov, P. Herrera, and M. Sordo, "Music retagging using label propagation and robust principal component analysis," in WWW, New York, NY, USA, 2012, pp. 869-876.

[11] W. Cai, Q. Li, and X. Guan, "Automatic singer identification based on auditory features," 2011.


[12] J. Salamon, E. Gómez, D.P.W. Ellis, and G. Richard, "Melody extraction from polyphonic music signals: Approaches, applications, and challenges," IEEE Signal Process. Mag., 2013.

[13] R.B. Dannenberg, W.P. Birmingham, B. Pardo, N. Hu, C. Meek, and G. Tzanetakis, "A comparative evaluation of search techniques for query-by-humming using the MUSART testbed," J. Am. Soc. Inf. Sci. Technol., vol. 58, no. 5, pp. 687-701, 2007.

[14] B. Zhu, W. Li, R. Li, and X. Xue, "Multi-stage non-negative matrix factorization for monaural singing voice separation," IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 10, pp. 2096-2107, 2013.

[15] Z. Rafii and B. Pardo, "A simple music/voice separation method based on the extraction of the repeating musical structure," in ICASSP, 2011.

[16] A. Liutkus, Z. Rafii, R. Badeau, B. Pardo, and G. Richard, "Adaptive filtering for music/voice separation exploiting the repeating musical structure," in ICASSP, 2012.

[17] D. FitzGerald, "Vocal separation using nearest neighbours and median filtering," in ISSC, 2012.

[18] P.S. Huang, S.D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing voice separation from monaural recordings using robust principal component analysis," in ICASSP, 2012.

[19] C.L. Hsu and J.S.R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 2, pp. 310-319, 2010.

[20] Y.H. Yang, "On sparse and low-rank matrix decomposition for singing voice separation," in MM, 2012, pp. 757-760.

[21] M. Moussallam, G. Richard, and L. Daudet, "Audio source separation informed by redundancy with greedy multiscale decompositions," in EUSIPCO, 2012, pp. 2644-2648.

[22] S.G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3397-3415, 1993.

[23] P. Sprechmann, A. Bronstein, and G. Sapiro, "Real-time online singing voice separation from monaural recordings using robust low rank modeling," in ISMIR, 2012.

[24] A. Lefèvre, F. Glineur, and P.A. Absil, "A nuclear-norm based convex formulation for informed source separation," in ESANN, 2013.

[25] Y.H. Yang, "Low-rank representation of both singing voice and music accompaniment via learned dictionaries," in ISMIR, 2013.

[26] J. Salamon, Melody Extraction from Polyphonic Music Signals, Ph.D. thesis, Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Spain, 2013.

[27] A.L. Berenzweig and D.P.W. Ellis, "Locating singing voice segments within music signals," in WASPAA, 2001, pp. 119-122.

[28] T.L. Nwe and Y. Wang, "Automatic detection of vocal segments in popular songs," in Proc. ISMIR, 2004, pp. 138-145.

[29] L. Feng, A.B. Nielsen, and L.K. Hansen, "Vocal segment classification in popular music," in ISMIR, 2008, pp. 121-126.

[30] M. Fazel, Matrix Rank Minimization with Applications, Ph.D. thesis, Dept. of Elec. Eng., Stanford Univ., 2002.

[31] B. Recht, M. Fazel, and P.A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Rev., vol. 52, no. 3, pp. 471-501, 2010.

[32] E.J. Candès and B. Recht, "Exact matrix completion via convex optimization," Found. Comput. Math., vol. 9, no. 6, pp. 717-772, 2009.

[33] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma, "Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix," Tech. Rep. UILU-ENG-09-2214, UIUC, 2009.

[34] Z. Lin, M. Chen, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," Tech. Rep. UILU-ENG-09-2215, UIUC, 2009.

[35] X. Yuan and J. Yang, "Sparse and low-rank matrix decomposition via alternating direction methods," Preprint, pp. 1-11, 2009.

[36] J.F. Cai, E.J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. on Optimization, vol. 20, no. 4, pp. 1956-1982, 2010.

[37] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Series B, vol. 58, no. 1, pp. 267-288, 1996.

[38] S.S. Chen, D.L. Donoho, and M.A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, pp. 33-61, 1998.

[39] Z. Gao, L.F. Cheong, and M. Shan, Block-Sparse RPCA for Consistent Foreground Detection, vol. 7576 of Lecture Notes in Computer Science, pp. 690-703, Springer Berlin Heidelberg, 2012.

[40] Y. Grandvalet, "Least absolute shrinkage is equivalent to quadratic penalization," in ICANN 98, L. Niklasson, M. Boden, and T. Ziemke, Eds., Perspectives in Neural Computing, pp. 201-206, Springer London, 1998.

[41] H. Zou, "The adaptive lasso and its oracle properties," J. Am. Statist. Assoc., vol. 101, no. 476, pp. 1418-1429, 2006.

[42] D. Angelosante and G. Giannakis, "RLS-weighted lasso for adaptive estimation of sparse signals," in ICASSP, 2009, pp. 3245-3248.

[43] J. Salamon and E. Gómez, "Melody extraction from polyphonic music signals using pitch contour characteristics," IEEE Trans. Audio, Speech, Language Process., vol. 20, pp. 1759-1770, 2012.

[44] J.L. Durrieu, G. Richard, B. David, and C. Fevotte, "Source/filter model for unsupervised main melody extraction from polyphonic audio signals," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 3, pp. 564-575, March 2010.

[45] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 4, pp. 1462-1469, 2006.

[46] D. FitzGerald and M. Gainza, "Single channel vocal separation using median filtering and factorisation techniques," ISAST Transactions on Electronic and Signal Processing, vol. 4, no. 1, pp. 62-73, 2010.

[47] Z. Rafii, F. Germain, D.L. Sun, and G.J. Mysore, "Combining modeling of singing voice and background music for automatic separation of musical mixtures," in ISMIR, 2013.

[48] G.E. Poliner, D.P.W. Ellis, F. Ehmann, E. Gómez, S. Streich, and B. Ong, "Melody transcription from music audio: Approaches and evaluation," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 4, pp. 1247-1256, 2007.

[49] Z. Chen and D.P.W. Ellis, "Speech enhancement by sparse, low-rank, and dictionary spectrogram decomposition," in WASPAA, 2013.

DAFX-8

  • 1 Introduction
  • 2 Robust Principal Component Analysis via Principal Component Pursuit
  • 3 Adaptive RPCA (A-RPCA)
  • 4 Evaluation
    • 41 Parameters Dataset and Evaluation Criteria
    • 42 Results and Discussion
      • 5 Conclusion
      • 6 References

Proc of the 17th Int Conference on Digital Audio Effects (DAFx-14) Erlangen Germany September 1-5 2014

with a greedy Matching Pursuit algorithm [22] The model is eval-uated on short popular music excerpts from the Beach Boys [23]proposes a non-negative variant of RPCA termed robust low-ranknon-negative matrix factorization (RNMF) In this approach thelow-rank model is represented as a non-negative linear combina-tion of non-negative basis vectors The proposed frameworkal-lows incorporating unsupervised semi- and fully-supervised learn-ing with supervised training drastically improving the results ofthe separation Other related works including [24 25] addresssinging voice separation based on low-rank representations alonebut are beyond the scope of this article

While RPCA performs well on thesim10 sec clips of MIR-1Kthe full-length Beach Boys examples of [14] give much less sat-isfying results When dealing with whole recordings the musi-cal background may include significant changes in instrumentationand dynamics which may rival the variation in the foreground andhence its rank in the spectrogram representation Furtherfore-ground may vary in its complexity (eg solo voice followedby aduet) and may be unevenly distributed throughout the piece (egentire segments with background only) Thus the best way toapply RPCA to separatecompletemusic pieces remains an openquestion

In this article we explore an adaptive version of RPCA (A-RPCA) that is able to handle complex music signals by takinginto account the intrinsic musical content We aim to adjustthetask through the incorporation of domain knowledge that guidesthe decomposition towards results that are physically and musi-cally meaningful Time-frequency representations of music audiomay be structured in several ways according to their content Forinstance the frequency axis can be segmented into regions corre-sponding to the spectral range of each instrument of the mixtureIn the singing separation scenario coefficients that are not in thesinging voice spectral band should not be selected in the sparselayer In the time dimension music audio signals can generallybe organized into a hierarchy of segments at different scales eachwith its own semantic function (bar phrase entire sectionetc)and each having specific characteristics in terms of instrumen-tation leading voice etc Importantly as the segments becomeshorter we expect the accompaniment to span less variation andthus the rank of the background to reduce

We will show a way for this music content information to beincorporated in the decomposition to allow an accurate processingof entiremusic tracks More specifically we incorporate voice ac-tivity information as a cue to separate the leading voice from thebackground Music pieces can be segmented into vocal segments(where the leading voice is present) and background segments (thatcan be purely instrumental or may contain backing voices) Find-ing vocal segments (voicing detection [26]) is a subject that hasreceived significant attention within MIR [26 27 28 29] The de-composition into sparse and low-rank components should be co-herent with the semantic structure of the piece the sparse (fore-ground) component should be denser in sections containing theleading voice while portions of the sparse matrix corresponding tonon-singing segments should ideally be null Thus while the tech-nique remains the same as [18] at the lowest level we consider theproblem of segmenting a longer track into suitable pieces and howto locally adapt the parameters of the decomposition by incorpo-rating prior information

2 ROBUST PRINCIPAL COMPONENT ANALYSIS VIAPRINCIPAL COMPONENT PURSUIT

In [3] Candegraveset al show that under very broad conditions a datamatrix D isin R

mtimesn can be exactly and uniquely decomposed intoa low-rank componentA and a sparse componentE via a convexprogram calledPrincipal Component Pursuit(RPCA-PCP) givenby

minAE

Alowast + λE1 st D = A + E (1)

whereλ gt 0 is a regularization parameter that trades between therank ofA and the sparsity ofE The nuclear normmiddotlowast ndash the sumof singular values ndash is used as surrogate for the rank ofA [30] andthe ℓ1 norm middot1 (sum of absolute values of the matrix entries)is an effective surrogate for theℓ0 pseudo-norm the number ofnon-zero entries in the matrix [31 32]

The Augmented Lagrange Multiplier Method (ALM) and itspractical variant the Alternating Direction Method of Multipliers(ADM) have been proposed as efficient optimization schemestosolve this problem [33 34 35] ALM works by minimizing theaugmented Lagrangian function of (1)

L(A E Y micro) = Alowast+λE1+〈Y A+EminusD〉+micro

2A+EminusD2

F

(2)whereY isin R

mtimesn is the Lagrange multiplier of the linear con-straint that allows removing the equality constraintmicro gt 0 is apenalty parameter for the violation of the linear constraint 〈middot middot〉denotes the standard trace inner product andmiddotF is the Frobeniusnorm2 ALM [34] is an iterative scheme that works by repeatedlyminimizing A andE simultaneously In contrast ADM splits theminimization of (2) into two smaller and easier subproblems withA andE minimized sequentially

Ak+1 = argminA

L(AEk Y k microk) (3a)

Ek+1 = argminE

L(Ak+1 E Y k microk) (3b)

Both subproblems (3a) and (3b) are shrinkage problems that haveclosed-form solutions that we briefly present here We referthereader to [34 35] for more details For convenience we introducethe scalar soft-thresholding (shrinkage) operatorSǫ[x]

Sǫ[x] = sgn(x) middot max(|x| minus ǫ 0) =

8

lt

x minus ǫ if x gt ǫx + ǫ if x lt minusǫ

0 otherwise

wherex isin R andǫ gt 0 This operator can be extended to matricesby applying it element-wise

Problem (3a) is equivalent to

Ak+1 = minA

Alowast +microk

2A minus (D minus Ek +

1

microkY k)2

F

ff

(4)

which has, according to [36], a closed-form solution given by

$$A^{k+1} = U \, S_{1/\mu^k}[\Sigma] \, V^T$$

where $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$, and $\Sigma \in \mathbb{R}^{r \times r}$ are obtained via the singular value decomposition $(U, \Sigma, V) = \mathrm{SVD}\left(D - E^k + \frac{Y^k}{\mu^k}\right)$.
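A corresponding sketch of this singular value thresholding step, reusing soft_threshold from above (again a hypothetical helper name, assuming a dense SVD of the spectrogram-sized matrix is affordable):

```python
def svd_threshold(M, tau):
    """Singular value thresholding: U S_tau[Sigma] V^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # scale the columns of U by the shrunk singular values, then project back
    return (U * soft_threshold(s, tau)) @ Vt
```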

Problem (3b) can be written as

$$E^{k+1} = \arg\min_E \left\{ \lambda\|E\|_1 + \frac{\mu^k}{2} \left\| E - \left( D - A^{k+1} + \frac{1}{\mu^k} Y^k \right) \right\|_F^2 \right\} \qquad (5)$$

whose solution is given by the least-absolute shrinkage and selection operator (Lasso) [37], a method also known in the signal processing community as basis pursuit denoising [38]:

$$E^{k+1} = S_{\lambda/\mu^k}\left[ D - A^{k+1} + \frac{Y^k}{\mu^k} \right]$$

In other words, denoting $G^E = D - A^{k+1} + \frac{Y^k}{\mu^k}$,

$$\forall i \in [1, m], \; \forall j \in [1, n], \quad E^{k+1}_{ij} = \operatorname{sgn}(G^E_{ij}) \cdot \max\left(|G^E_{ij}| - \frac{\lambda}{\mu^k}, 0\right)$$

3. ADAPTIVE RPCA (A-RPCA)

As discussed in Section 1, in a given song the foreground vocals typically exhibit a clustered distribution in the time-frequency plane, relating to the semantic structure of the piece that alternates between vocal and non-vocal (background) segments. This structure should be reflected in the decomposition: frames belonging to singing voice-inactive segments should result in zero-valued columns in $E$.

The balance between the sparse and low-rank contributions is set by the value of the regularization parameter $\lambda$. The voice separation quality with respect to the value of $\lambda$ for the Pink Noise Party song Their Shallow Singularity is illustrated in Fig. 1. As we can observe, the best $\lambda$ differs depending on whether we process the entire song or restrict processing to just the singing voice-active parts. Because the separation for the background part is monotonically better as $\lambda$ increases, the difference between the optimum $\lambda$ values indicates that the global separation quality is a compromise between the singing voice and the background part.

Figure 1: Variation of the estimated singing voice NSDR (see definition in Section 4) according to the value of $\lambda$, under two situations: (•) NSDR when only the singing voice-active parts of the separated signal are processed; (∗) NSDR when the entire signal is processed.

Figure 2: Waveform of the separated voice for various values of $\lambda$ for the song Is This Love by Bob Marley. From top to bottom: clean voice, $\lambda = \lambda_1$, $2\lambda_1$, $5\lambda_1$, $10\lambda_1$.

In the theoretical formulation of RPCA-PCP [3], there is no single value of $\lambda$ that works for separating sparse from low-rank components in all conditions. They recommend $\lambda = \max(m, n)^{-1/2}$, but also note that the decomposition can be improved by choosing $\lambda$ in light of prior knowledge about the solution. In practice, we have found that the decomposition of music audio is very sensitive to the choice of $\lambda$, with frequently no single value able to achieve a satisfying separation between voice and instrumental parts across a whole recording. This is illustrated in Fig. 2, which shows the waveforms of the resynthesized separated voice obtained with the RPCA-PCP formulation for various $\lambda$. For $\lambda = \lambda_1 = 1/\sqrt{\max(m, n)}$ and $\lambda_2 = 2\lambda_1$, around t = 115 s (dashed rectangle) there is a non-zero contribution in the voice layer but no actual lead vocal. This is eliminated with $\lambda = 5\lambda_1, 10\lambda_1$, but at the expense of a very poor quality voice estimate: the resulting signal consists of percussive sounds and higher harmonics of the instruments and does not resemble the voice. Note that similar observations have been made in the context of video surveillance [39].

To address the problem of variations in $\lambda$, we propose an adaptive variant of RPCA, consisting of a weighted decomposition that incorporates prior information about the music content. Specifically, voice activity information is used as a cue to adjust the regularization parameter through the entire analyzed piece in the (3b) step, and therefore to better match the balance between sparse and low-rank contributions to the actual music content. This idea is related to previous theoretical work [40, 41, 42], but to our knowledge its application in the framework of RPCA is new.

We consider a time segmentation of the magnitude spectrogram into $N_{\mathrm{block}}$ consecutive (non-overlapping) blocks of vocal / non-vocal (background accompaniment) segments. We can represent the magnitude spectrogram as a concatenation of column-blocks $D = [D_1, D_2, \cdots, D_{N_{\mathrm{block}}}]$, the sparse layer as $E = [E_1, \cdots, E_{N_{\mathrm{block}}}]$, and $G^E = [G^E_1, \cdots, G^E_{N_{\mathrm{block}}}]$.

We can minimize the objective function with respect to each column-block separately. To guide the separation, we aim at setting a different value of $\lambda_l$, $l \in [1, N_{\mathrm{block}}]$, for each block according to the voice activity side information. For each block the problem is equivalent to Eq. (5), and accordingly the solution to the resulting problem

$$E^{k+1}_l = \arg\min_{E_l} \left\{ \lambda_l \|E_l\|_1 + \frac{\mu^k}{2} \|E_l - G^E_l\|_F^2 \right\}$$

is given by

$$E^{k+1}_l = S_{\lambda_l/\mu^k}[G^E_l] \qquad (6)$$
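To make the block structure concrete, here is a small sketch (our own helper, not from the paper) that turns a frame-level voicing mask into the consecutive vocal/non-vocal column-blocks $D_l$ used above:

```python
def blocks_from_mask(vocal_mask):
    """Group a boolean per-frame voicing mask into consecutive
    (start, end, is_vocal) column-blocks; `end` is exclusive."""
    blocks, start = [], 0
    for t in range(1, len(vocal_mask)):
        if vocal_mask[t] != vocal_mask[t - 1]:
            blocks.append((start, t, bool(vocal_mask[t - 1])))
            start = t
    blocks.append((start, len(vocal_mask), bool(vocal_mask[-1])))
    return blocks
```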


Algorithm 1: Adaptive RPCA (A-RPCA)

Input: spectrogram $D$, blocks, $\lambda = \{\lambda_1, \ldots, \lambda_{N_{\mathrm{block}}}\}$
Output: $E$, $A$
Initialization: $Y^0 = D / J(D)$, where $J(D) = \max(\|D\|_2, \lambda^{-1}\|D\|_\infty)$; $E^0 = 0$; $\mu^0 > 0$; $\rho > 1$; $k = 0$
while not converged do
    update $A$: $(U, \Sigma, V) = \mathrm{SVD}(D - E^k + Y^k/\mu^k)$; $A^{k+1} = U S_{1/\mu^k}[\Sigma] V^T$
    update $E$: for each block $l$ do
        $E^{k+1}_l = S_{\lambda_l/\mu^k}[D_l - A^{k+1}_l + Y^k_l/\mu^k]$
    end for
    $E^{k+1} = [E^{k+1}_1, E^{k+1}_2, \cdots, E^{k+1}_{N_{\mathrm{block}}}]$
    update $Y$, $\mu$: $Y^{k+1} = Y^k - \mu^k(A^{k+1} + E^{k+1} - D)$; $\mu^{k+1} = \rho \cdot \mu^k$
    $k = k + 1$
end while
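As an illustration, a NumPy sketch of these update rules, reusing soft_threshold and svd_threshold from above. The dual initialization follows $Y^0 = D/J(D)$; the particular choices of $\mu^0$, of the $\lambda$ used inside $J(D)$, and of the convergence test on the residual are our own assumptions, not specified by the paper:

```python
def a_rpca(D, blocks, lambdas, mu=None, rho=1.5, tol=1e-7, max_iter=500):
    """A-RPCA by ADM: per-block lambda_l in the E-update (cf. Algorithm 1).

    D        magnitude spectrogram (m x n)
    blocks   list of (start, end) frame ranges of the N_block segments
    lambdas  one regularization weight lambda_l per block
    """
    lam = min(lambdas)  # assumed choice of lambda for the initialization J(D)
    J = max(np.linalg.norm(D, 2), np.abs(D).max() / lam)
    Y = D / J                                    # Y^0 = D / J(D)
    E = np.zeros_like(D)                         # E^0 = 0
    if mu is None:
        mu = 1.25 / np.linalg.norm(D, 2)         # a common mu^0 > 0 in ALM codes
    for _ in range(max_iter):
        # update A: singular value thresholding of D - E^k + Y^k / mu^k
        A = svd_threshold(D - E + Y / mu, 1.0 / mu)
        # update E: block-wise soft-thresholding with lambda_l / mu^k
        for (s, e), lam_l in zip(blocks, lambdas):
            G = D[:, s:e] - A[:, s:e] + Y[:, s:e] / mu
            E[:, s:e] = soft_threshold(G, lam_l / mu)
        # update Y, mu
        R = D - A - E
        Y = Y + mu * R
        mu = rho * mu
        if np.linalg.norm(R) <= tol * np.linalg.norm(D):  # assumed stopping rule
            break
    return A, E
```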

Denote $\lambda_v$ the constant value of the regularization parameter $\lambda$ used in the basic formulation of RPCA for voice separation [18]. To guide the separation in the A-RPCA formulation, we assign to each block a value $\lambda_l$ in accordance with the considered prior music structure information. Using a large $\lambda_l$ in blocks without leading voice will favor retaining non-zero coefficients in the accompaniment layer. Denoting by $\Omega_V$ the set of time frames that contain voice, the values of $\lambda_l$ are set as

$$\forall l \in [1, N_{\mathrm{block}}], \quad \lambda_l = \begin{cases} \lambda_v & \text{if } E_l \subset \Omega_V \\ \lambda_{nv} & \text{otherwise} \end{cases} \qquad (7)$$

with $\lambda_{nv} > \lambda_v$ to enhance sparsity of $E$ when no vocal activity is detected. Note that instead of two distinct values of $\lambda_l$, further improvements could be obtained by tuning $\lambda_l$ more precisely to suit the segment characteristics. For instance, vibrato information could be used to quantify the amount of voice in the mixture within each block and to set a specific regularization parameter accordingly. The update rules of the A-RPCA algorithm are detailed in Algorithm 1.
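Concretely, under the Section 4 settings ($\lambda_v = 1/\sqrt{\max(m,n)}$ and $\lambda_{nv} = 5\lambda_v$), the per-block weights can be derived from a detector's frame-level output along these lines; vocal_mask (a boolean per-frame array) and the helper blocks_from_mask from the sketch above are our own assumptions:

```python
import numpy as np

lam_v = 1.0 / np.sqrt(max(D.shape))   # lambda_v, as used in Section 4
lam_nv = 5.0 * lam_v                  # lambda_nv > lambda_v; factor 5 from Section 4
segs = blocks_from_mask(vocal_mask)   # vocal_mask: per-frame voicing decisions (assumed given)
blocks = [(s, e) for s, e, _ in segs]
lambdas = [lam_v if is_vocal else lam_nv for _, _, is_vocal in segs]
A, E = a_rpca(D, blocks, lambdas)     # D: magnitude spectrogram
```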

In Section 4 we investigate the results of adaptive RPCA with both exact (ground-truth) and estimated vocal activity information. For estimating vocal activity information, we use the voicing detection step of the melody extraction algorithm implemented in the MELODIA Melody Extraction vamp plug-in³, as it is freely available for people to download and use. We refer the reader to [26] and references therein for other voicing detection algorithms. The algorithm for the automatic extraction of the main melody from polyphonic music recordings implemented in MELODIA is a salience-based model that is described in [43]. It is based on the creation and characterization of pitch contours, grouped using auditory streaming cues, and includes a voice detection step that indicates when the melody is present; we use this melody location as an indicator of leading voice activity. Note that while melody can sometimes be carried by other instruments, in the evaluation dataset of Section 4 it is mainly singing.

³ http://mtg.upf.edu/technologies/melodia

4. EVALUATION

In this section we present the results of our approach, evaluated on a database of complete music tracks of various genres. We compare the proposed adaptive method with the baseline method [18], as well as another state-of-the-art method [16]. Sound examples discussed in the article can be found at http://papadopoulosellisdafx14.blogspot.fr.

4.1. Parameters, Dataset and Evaluation Criteria

To evaluate the proposed approach, we have constructed a database of 12 complete music tracks of various genres, with separated vocal and accompaniment files, as well as mixture versions formed as the sum of the vocal and accompaniment files. The tracks, listed in Tab. 1, were created from multitracks, mixed in Audacity⁴, then exported with or without the vocal or accompaniment lines.

Following previous work [18, 44, 15], the separations are evaluated with metrics from the BSS-EVAL toolbox [45], which provides a framework for the evaluation of source separation algorithms when the original sources are available for comparison. Three ratios are considered for both sources: Source-to-Distortion (SDR), Source-to-Interference (SIR), and Source-to-Artifacts (SAR). In addition, we measure the improvement in SDR between the mixture $d$ and the estimated resynthesized singing voice $\hat{e}$ by the Normalized SDR (NSDR, also known as SDR improvement, SDRI), defined for the voice as $\mathrm{NSDR}(\hat{e}, e, d) = \mathrm{SDR}(\hat{e}, e) - \mathrm{SDR}(d, e)$, where $e$ is the original clean singing voice. The same measure is used for the evaluation of the background. Each measure is computed globally on the whole track, but also locally according to the segmentation into vocal/non-vocal segments. Higher values of the metrics indicate better separation.
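The original metrics come from the MATLAB BSS-EVAL toolbox; as a rough sketch of the NSDR computation, the Python package mir_eval (a substitute we introduce here, not used by the authors) exposes the same BSS-EVAL measures:

```python
import numpy as np
import mir_eval.separation

def nsdr(est_voice, clean_voice, mixture):
    """NSDR(e_hat, e, d) = SDR(e_hat, e) - SDR(d, e), all signals 1-D arrays."""
    ref = clean_voice[np.newaxis, :]
    sdr_est, _, _, _ = mir_eval.separation.bss_eval_sources(
        ref, est_voice[np.newaxis, :])
    sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(
        ref, mixture[np.newaxis, :])
    return sdr_est[0] - sdr_mix[0]
```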

We compare the results of A-RPCA, with musically-informed adaptive $\lambda$, and the baseline RPCA method [18], with fixed $\lambda$, using the same parameter settings: in the analysis stage, the STFT of each mixture is computed using a window length of 1024 samples with 75% overlap, at a sampling rate of 11.5 kHz. No post-processing (such as masking) is added. After spectrogram decomposition, the signals are reconstructed using the inverse STFT and the phase of the original signal.
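Tying the sketches together, the analysis/resynthesis chain under these settings could look as follows with SciPy; the window type and the exact sample rate (read as 11.5 kHz from the text) are our assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

FS, WIN = 11500, 1024            # assumed 11.5 kHz rate; 1024-sample window
HOP = WIN // 4                   # 75% overlap

def separate(x, blocks, lambdas):
    """Decompose a mixture x and resynthesize both layers with the mixture phase."""
    _, _, X = stft(x, fs=FS, nperseg=WIN, noverlap=WIN - HOP)
    D, phase = np.abs(X), np.exp(1j * np.angle(X))
    A, E = a_rpca(D, blocks, lambdas)          # sketch from Section 3
    _, background = istft(A * phase, fs=FS, nperseg=WIN, noverlap=WIN - HOP)
    _, voice = istft(E * phase, fs=FS, nperseg=WIN, noverlap=WIN - HOP)
    return voice, background
```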

The parameter $\lambda$ is set to $1/\sqrt{\max(m, n)}$ in the baseline method. Two different versions of the proposed A-RPCA algorithm are evaluated. First, A-RPCA with exact voice activity information, using manually annotated ground truth (A-RPCA_GT), with $\lambda_l = \lambda$ for singing voice regions and $\lambda_l = 5\lambda$ for background-only regions. In the other configuration, estimated voice activity location is used (A-RPCA_est), with the same settings for the $\lambda_l$.

We also compare our approach with the state-of-the-art REPET algorithm, based on repeating pattern discovery and binary time-frequency masking [16]. Note that we use for comparison the version of REPET that is designed for processing complete musical tracks (as opposed to the original one introduced in [15]). This method includes a simple low-pass-filtering post-processing step [46] that consists in removing all frequencies below 100 Hz from the vocal signal and adding these components back into the background layer. We also apply this post-processing step to our model before comparison with the REPET algorithm.
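A sketch of this post-processing on the magnitude layers (our own helper; freqs is assumed to be the STFT bin-frequency vector, e.g. the first output of scipy.signal.stft):

```python
def move_low_frequencies(E, A, freqs, cutoff=100.0):
    """Transfer all voice-layer content below `cutoff` Hz to the background layer."""
    E, A = E.copy(), A.copy()
    low = freqs < cutoff          # boolean mask over frequency bins
    A[low, :] += E[low, :]        # add the low-frequency components back
    E[low, :] = 0.0               # remove them from the vocal estimate
    return E, A
```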

Paired-sample t-tests at the 5% significance level are performed to determine whether there is statistical significance in the results between the various configurations.

⁴ http://audacity.sourceforge.net


Table 1: Sound excerpts used for the evaluation. %back: proportion of background (no leading voice) segments (in % of the whole excerpt duration); Recall (Rec) and False Alarm (FA): voicing detection rates.

Name | %back | Rec | FA
1 - Beatles, Sgt Pepper's Lonely Hearts Club Band | 49.3 | 74.74 | 45.56
2 - Beatles, With A Little Help From My Friends | 13.5 | 70.10 | 14.71
3 - Beatles, She's Leaving Home | 24.6 | 77.52 | 30.17
4 - Beatles, A Day in The Life | 35.6 | 61.30 | 63.96
5-6 - Puccini, piece for soprano and piano | 24.7 | 47.90 | 27.04
7 - Pink Noise Party, Their Shallow Singularity | 42.1 | 64.15 | 61.83
8 - Bob Marley, Is This Love | 37.2 | 66.22 | 36.84
9 - Doobie Brothers, Long Train Running | 65.6 | 84.12 | 58.51
10 - Marvin Gaye, Heard it Through The Grapevine | 30.2 | 79.22 | 17.90
11 - The Eagles, Take it Easy | 35.5 | 78.68 | 30.20
12 - The Police, Message in a Bottle | 24.9 | 73.90 | 20.44

4.2. Results and Discussion

Results of the separation for the sparse (singing voice) and low-rank (background accompaniment) layers are presented in Tables 2, 3, 4 and 5. To give better insight into the results, we present measures computed both on the entire song and on the singing voice-active part only, the latter obtained by concatenating all segments labeled as vocal segments in the ground truth.

• Global separation results. As we can see from Tables 2 and 3, using a musically-informed adaptive regularization parameter improves the results of the separation both for the background and the leading voice components. Note that the larger the proportion of purely-instrumental segments in a piece (see Tab. 1), the larger the improvement (see in particular pieces 1, 7, 8 and 9), which is consistent with the goal of the proposed method. Statistical tests show that the improvement in the results is significant.

As discussed in Section 3, the quality of the separation with the baseline method [18] depends on the value of the regularization parameter. Moreover, the value that leads to the best separation quality differs from one music excerpt to another. Thus, when automatically processing a collection of music tracks, the choice of this value results from a trade-off. We report here results obtained with the typical choice $\lambda_v = 1/\sqrt{\max(m, n)}$ in Eq. (7). Note that for a given value of $\lambda_v$ in the baseline method, the separation can always be further improved by the A-RPCA algorithm using a regularization parameter that is adapted to the music content based on prior music structure information: in all experiments, for a given constant value $\lambda_v$ in the baseline method, setting $\lambda_{nv} > \lambda_v$ in Eq. (7) improves the results.

For the singing voice layer, improved SDR (better overall separation performance) and SIR (better capability of removing music interference from the singing voice) with A-RPCA are obtained at the price of introducing more artifacts in the estimated voice (lower SAR_voice). Listening tests reveal that in some segments processed by A-RPCA, as for instance segment [1′00′′–1′15′′] in Fig. 3, one can hear some isolated high-frequency coefficients superimposed on the separated voice. This drawback could be reduced by including harmonicity priors in the sparse component of RPCA, as proposed in [20]. This performance trade-off is commonly encountered in music/voice separation [14, 47]. However, we can notice that all three measures are significantly improved with A-RPCA for the background layer.

• Ground truth versus estimated voice activity location. Imperfect voice activity location information still allows an improvement, although to a lesser extent than with ground-truth voice activity information. In Table 1 we report the accuracy of the voicing detection step.

Figure 3: Separated voice for various values of $\lambda$ for the Pink Noise Party song Their Shallow Singularity. From top to bottom: clean voice; constant $\lambda_1 = 1/\sqrt{\max(m, n)}$; constant $\lambda = 5\lambda_1$; adaptive $\lambda = (\lambda_1, 5\lambda_1)$.

Similarly to the measures used for melody detection in [48, 12], we consider the Voicing Recall Rate, defined as the proportion of frames labeled voiced in the ground truth that are estimated as voiced frames by the algorithm, and the Voicing False Alarm Rate, defined as the proportion of frames labeled as unvoiced in the ground truth that are mistakenly estimated to be voiced by the algorithm. The decrease in the results mainly comes from background segments classified as vocal segments. However, statistical tests show that the improvement in the results between RPCA and A-RPCA_est is still significant.

• Local separation results. It is interesting to note that using an adaptive regularization parameter in a unified analysis of the whole piece is different from separately analyzing the successive vocal/non-vocal segments with different but constant values of $\lambda$ (see for instance the dashed rectangle areas in Fig. 3).

• Analysis of the results on vocal segments. We expect the separation on background-only parts of the song to be improved with the A-RPCA algorithm. Indeed, the side information directly indicates these regions, where the foreground (sparse) components should be avoided; this can be clearly seen in Fig. 3. However, the improvements under the proposed model are not limited to non-vocal regions only. Results measured on the vocal segments alone indicate that, by using the adaptive algorithm, the voice is also better estimated, as shown in Table 3. The improvement over RPCA is statistically significant both when using ground-truth and estimated voice activity location information. This indicates that side information helps not only to better determine the background-only segments, but also enables improved recovery of the singing voice, presumably because the low-rank background model is a better match to the actual background.

Side information could alternatively have been added as a pre- or post-processing step to the RPCA algorithm. The adaptive RPCA algorithm presents advantages over such approaches. To analyze this, we compare the


Table 2: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background layer (Back), computed across the whole song, for all models, averaged across all the songs. RPCA is the baseline system, A-RPCA_GT is the adaptive version using ground-truth voice activity information, and A-RPCA_est uses estimated voice activity.

Entire song | RPCA | A-RPCA_GT | A-RPCA_est
Voice SDR (dB) | -4.66 | -2.16 | -3.18
Voice SIR (dB) | -3.86 | 0.74 | -0.46
Voice SAR (dB) | 8.99 | 4.81 | 3.94
Voice NSDR | 1.70 | 4.20 | 3.18
Back SDR (dB) | 4.14 | 6.52 | 6.08
Back SIR (dB) | 11.48 | 13.30 | 12.07
Back SAR (dB) | 5.51 | 8.03 | 7.83
Back NSDR | -2.35 | 0.03 | -0.41

Table 3: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background layer (Back), computed across the vocal segments only, for all models, averaged across all the songs. RPCA is the baseline system, A-RPCA_GT is the adaptive version using ground-truth voice activity information, and A-RPCA_est uses estimated voice activity.

Vocal segments | RPCA | A-RPCA_GT | A-RPCA_est
Voice SDR (dB) | -3.19 | -2.00 | -1.96
Voice SIR (dB) | -2.33 | -0.39 | 0.74
Voice SAR (dB) | 9.44 | 7.27 | 4.64
Voice NSDR | 1.67 | 2.85 | 2.90
Back SDR (dB) | 3.63 | 5.18 | 5.28
Back SIR (dB) | 9.95 | 10.64 | 10.41
Back SAR (dB) | 5.39 | 7.32 | 7.54
Back NSDR | -1.37 | 0.18 | 0.29

Table 4: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background layer (Back), computed across the whole song, for all models, averaged across all the songs. RPCA is the baseline system, A-RPCA_GT is the adaptive version using ground-truth voice activity information, and A-RPCA_est uses estimated voice activity. Low-pass filtering post-processing is applied. REPET is the comparison algorithm [16].

Entire song | RPCA | A-RPCA_GT | A-RPCA_est | REPET
Voice SDR (dB) | -2.76 | -0.72 | -2.11 | -2.20
Voice SIR (dB) | -0.17 | 4.03 | 2.22 | 1.34
Voice SAR (dB) | 4.33 | 3.33 | 2.32 | 3.19
Voice NSDR | 3.60 | 5.64 | 4.25 | 4.16
Back SDR (dB) | 5.16 | 7.61 | 6.81 | 5.01
Back SIR (dB) | 14.53 | 14.49 | 12.99 | 16.83
Back SAR (dB) | 5.96 | 9.02 | 8.44 | 5.47
Back NSDR | -1.32 | 1.12 | 0.33 | -1.48

Table 5: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background layer (Back), computed across the vocal segments only, for all models, averaged across all the songs. RPCA is the baseline system, A-RPCA_GT is the adaptive version using ground-truth voice activity information, and A-RPCA_est uses estimated voice activity. Low-pass filtering post-processing is applied. REPET is the comparison algorithm [16].

Vocal segments only | RPCA | A-RPCA_GT | A-RPCA_est | REPET
Voice SDR (dB) | -1.25 | -0.53 | -0.83 | -0.70
Voice SIR (dB) | 1.49 | 3.04 | 3.62 | 3.02
Voice SAR (dB) | 5.02 | 4.46 | 3.12 | 4.02
Voice NSDR | 3.60 | 4.32 | 4.02 | 4.15
Back SDR (dB) | 4.85 | 6.03 | 6.11 | 4.80
Back SIR (dB) | 13.07 | 12.38 | 11.41 | 15.33
Back SAR (dB) | 5.91 | 7.69 | 8.20 | 5.41
Back NSDR | -0.14 | 1.03 | 1.11 | -0.20

A-RPCA algorithm with two variants of RPCA incorporating side information, either as a pre- or a post-processing step:

• RPCA_OV pre: only the concatenation of segments classified as vocal is processed by RPCA (the singing voice estimate being set to zero in the remaining non-vocal segments).

• RPCA_OV post: the whole song is processed by RPCA, and non-zero coefficients estimated as belonging to the voice layer in non-vocal segments are transferred to the background layer.

Results of the decomposition computed across the vocal segments only are presented in Table 6. Note that the RPCA_OV post results reduce to the RPCA results in Table 3, since they are computed on vocal segments only. There is no statistical difference between the estimated voice obtained by processing the whole song with RPCA and processing the vocal segments only. Results are significantly better using the A-RPCA algorithm than using RPCA_OV pre and RPCA_OV post. This is illustrated in Figure 4, which shows an example of the decomposition on an excerpt of the Doobie Brothers song Long Train Running, composed of a non-vocal followed by a vocal segment. We can see that there are misclassified partials in the voice spectrogram obtained with the baseline RPCA that are removed with A-RPCA. Moreover, the gap in the singing voice around frame 50 (breathing) is cleaner in the case of A-RPCA than in the case of RPCA. Listening tests confirm that the background is better attenuated in the voice layer when using A-RPCA.

Table 6: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background layer (Back), computed across the vocal segments only, averaged across all the songs. RPCA_OV post uses the baseline system and sets the voice estimate to zero in background-only segments; RPCA_OV pre processes only the voice segments with the baseline model; A-RPCA_GT is the adaptive version using ground-truth voice activity information, and A-RPCA_est uses estimated voice activity.

 | RPCA_OV post | RPCA_OV pre | A-RPCA_GT | A-RPCA_est
Voice SDR | -3.19 | -3.28 | -2.00 | -1.96
Voice SIR | -2.33 | -2.31 | 3.62 | 0.74
Voice SAR | 9.44 | 8.97 | 7.27 | 4.64
Voice NSDR | 1.67 | 1.57 | 2.85 | 2.90
Back SDR | 3.63 | 3.72 | 5.18 | 5.28
Back SIR | 9.95 | 9.22 | 10.64 | 10.41
Back SAR | 5.39 | 5.85 | 7.32 | 7.54
Back NSDR | -1.37 | -1.28 | 0.18 | 0.29

• Comparison with the state of the art. As we can see from Table 4, the results obtained with the RPCA baseline method are not better than those obtained with the REPET algorithm. On the contrary, the REPET algorithm is significantly outperformed by the A-RPCA algorithm when using ground-truth voice activity information, both for the sparse and low-rank layers.


Figure 4: [Top] Example decomposition on an excerpt of the Doobie Brothers song Long Train Running; [Bottom] zoom between frames [525–580] (dashed rectangle in the top figure). For each figure, the top pane shows the part between 0 and 500 Hz of the spectrogram of the original signal. The clean singing voice appears in the second pane. The separated singing voice obtained with the baseline model (RPCA), with the baseline model when restricting the analysis to singing voice-active segments only (RPCA_OV pre), and with the proposed A-RPCA model are represented in panes 3 to 5. For comparison, the sixth pane shows the results obtained with REPET [16].

However, note that when using estimated voice activity information, the difference in the results between REPET and A-RPCA is not statistically significant for the sparse layer. If we look closer at the results, it is interesting to note that the voice estimation improvement by A-RPCA_GT over REPET mainly comes from the non-vocal parts, where the voice estimate is favored to be null. Indeed, Table 5 indicates that the voice estimates on vocal segments obtained with A-RPCA_GT and REPET are similar. This is illustrated by the two last panes of Figure 4 [bottom], which show similar spectrograms of the voice estimates obtained with the A-RPCA and REPET algorithms on the vocal part of the excerpt.

5. CONCLUSION

We have explored an adaptive version of the RPCA technique that allows the processing of entire pieces of music, including local variations in the music structure. Music content information is incorporated in the decomposition to guide the selection of coefficients in the sparse and low-rank layers according to the semantic structure of the piece. This motivates the choice of a regularization parameter that is informed by musical cues. Results indicate that with the proposed algorithm, not only are the background segments better discriminated, but the singing voice is also better estimated in vocal segments, presumably because the low-rank background model is a better match to the actual background. The method could be extended with other criteria (singer identification, vibrato saliency, etc.). It could also be improved by incorporating additional information to set the regularization parameters differently for each track, to better accommodate the varying contrast of foreground and background. The idea of an adaptive decomposition could also be improved with a more complex formulation of RPCA that incorporates additional constraints [20] or a learned dictionary [49].

6. REFERENCES

[1] K. Min, Z. Zhang, J. Wright, and Y. Ma, "Decomposing background topics from keywords by principal component pursuit," in CIKM, 2010.
[2] S. Brutzer, B. Hoferlin, and G. Heidemann, "Evaluation of background subtraction techniques for video surveillance," in CCVPR, 2011, pp. 1937–1944.
[3] E.J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?," Journal of the ACM, vol. 58, no. 3, Article 11, 2011.
[4] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky, "Sparse and low-rank matrix decompositions," in Sysid, 2009.
[5] B. Cheng, G. Liu, J. Wang, Z. Huang, and S. Yan, "Multi-task low-rank affinity pursuit for image segmentation," in ICCV, 2011, pp. 2439–2446.
[6] Z. Zeng, T.H. Chan, K. Jia, and D. Xu, "Finding correspondence from multiple images via sparse and low-rank decomposition," in ECCV, 2012, pp. 325–339.
[7] F. Yang, H. Jiang, Z. Shen, W. Deng, and D.N. Metaxas, "Adaptive low rank and sparse decomposition of video using compressive sensing," CoRR, vol. abs/1302.1610, 2013.
[8] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, "RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2233–2246, 2012.
[9] Z. Shi, J. Han, T. Zheng, and S. Deng, "Online learning for classification of low-rank representation features and its applications in audio segment classification," CoRR, vol. abs/1112.4243, 2011.
[10] Y.H. Yang, D. Bogdanov, P. Herrera, and M. Sordo, "Music retagging using label propagation and robust principal component analysis," in WWW, New York, NY, USA, 2012, pp. 869–876.
[11] W. Cai, Q. Li, and X. Guan, "Automatic singer identification based on auditory features," 2011.
[12] J. Salamon, E. Gómez, D.P.W. Ellis, and G. Richard, "Melody extraction from polyphonic music signals: Approaches, applications and challenges," IEEE Signal Process. Mag., 2013.
[13] R.B. Dannenberg, W.P. Birmingham, B. Pardo, N. Hu, C. Meek, and G. Tzanetakis, "A comparative evaluation of search techniques for query-by-humming using the MUSART testbed," J. Am. Soc. Inf. Sci. Technol., vol. 58, no. 5, pp. 687–701, 2007.
[14] B. Zhu, W. Li, R. Li, and X. Xue, "Multi-stage non-negative matrix factorization for monaural singing voice separation," IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 10, pp. 2096–2107, 2013.
[15] Z. Rafii and B. Pardo, "A simple music/voice separation method based on the extraction of the repeating musical structure," in ICASSP, 2011.
[16] A. Liutkus, Z. Rafii, R. Badeau, B. Pardo, and G. Richard, "Adaptive filtering for music/voice separation exploiting the repeating musical structure," in ICASSP, 2012.
[17] D. FitzGerald, "Vocal separation using nearest neighbours and median filtering," in ISSC, 2012.
[18] P.S. Huang, S.D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing voice separation from monaural recordings using robust principal component analysis," in ICASSP, 2012.
[19] C.L. Hsu and J.S.R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 2, pp. 310–319, 2010.
[20] Y.H. Yang, "On sparse and low-rank matrix decomposition for singing voice separation," in MM, 2012, pp. 757–760.
[21] M. Moussallam, G. Richard, and L. Daudet, "Audio source separation informed by redundancy with greedy multiscale decompositions," in EUSIPCO, 2012, pp. 2644–2648.
[22] S.G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3397–3415, 1993.
[23] P. Sprechmann, A. Bronstein, and G. Sapiro, "Real-time online singing voice separation from monaural recordings using robust low rank modeling," in ISMIR, 2012.
[24] A. Lefèvre, F. Glineur, and P.A. Absil, "A nuclear-norm based convex formulation for informed source separation," in ESANN, 2013.
[25] Y.H. Yang, "Low-rank representation of both singing voice and music accompaniment via learned dictionaries," in ISMIR, 2013.
[26] J. Salamon, Melody Extraction from Polyphonic Music Signals, Ph.D. thesis, Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Spain, 2013.
[27] A.L. Berenzweig and D.P.W. Ellis, "Locating singing voice segments within music signals," in WASPAA, 2001, pp. 119–122.
[28] T.L. Nwe and Y. Wang, "Automatic detection of vocal segments in popular songs," in Proc. ISMIR, 2004, pp. 138–145.
[29] L. Feng, A.B. Nielsen, and L.K. Hansen, "Vocal segment classification in popular music," in ISMIR, 2008, pp. 121–126.
[30] M. Fazel, Matrix Rank Minimization with Applications, Ph.D. thesis, Dept. of Elec. Eng., Stanford Univ., 2002.
[31] B. Recht, M. Fazel, and P.A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Rev., vol. 52, no. 3, pp. 471–501, 2010.
[32] E.J. Candès and B. Recht, "Exact matrix completion via convex optimization," Found. Comput. Math., vol. 9, no. 6, pp. 717–772, 2009.
[33] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma, "Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix," Tech. Rep. UILU-ENG-09-2214, UIUC, 2009.
[34] Z. Lin, M. Chen, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," Tech. Rep. UILU-ENG-09-2215, UIUC, 2009.
[35] X. Yuan and J. Yang, "Sparse and low-rank matrix decomposition via alternating direction methods," Preprint, pp. 1–11, 2009.
[36] J.F. Cai, E.J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
[37] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Series B, vol. 58, no. 1, pp. 267–288, 1996.
[38] S.S. Chen, D.L. Donoho, and M.A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, pp. 33–61, 1998.
[39] Z. Gao, L.F. Cheong, and M. Shan, Block-Sparse RPCA for Consistent Foreground Detection, vol. 7576 of Lecture Notes in Computer Science, pp. 690–703, Springer Berlin Heidelberg, 2012.
[40] Y. Grandvalet, "Least absolute shrinkage is equivalent to quadratic penalization," in ICANN 98, L. Niklasson, M. Boden, and T. Ziemke, Eds., Perspectives in Neural Computing, pp. 201–206, Springer London, 1998.
[41] H. Zou, "The adaptive lasso and its oracle properties," J. Am. Statist. Assoc., vol. 101, no. 476, pp. 1418–1429, 2006.
[42] D. Angelosante and G. Giannakis, "RLS-weighted lasso for adaptive estimation of sparse signals," in ICASSP, 2009, pp. 3245–3248.
[43] J. Salamon and E. Gómez, "Melody extraction from polyphonic music signals using pitch contour characteristics," IEEE Trans. Audio, Speech, Language Process., vol. 20, pp. 1759–1770, 2012.
[44] J.L. Durrieu, G. Richard, B. David, and C. Fevotte, "Source/filter model for unsupervised main melody extraction from polyphonic audio signals," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 3, pp. 564–575, March 2010.
[45] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 4, pp. 1462–1469, 2006.
[46] D. FitzGerald and M. Gainza, "Single channel vocal separation using median filtering and factorisation techniques," ISAST Transactions on Electronic and Signal Processing, vol. 4, no. 1, pp. 62–73, 2010.
[47] Z. Rafii, F. Germain, D.L. Sun, and G.J. Mysore, "Combining modeling of singing voice and background music for automatic separation of musical mixtures," in ISMIR, 2013.
[48] G.E. Poliner, D.P.W. Ellis, F. Ehmann, E. Gómez, S. Streich, and B. Ong, "Melody transcription from music audio: Approaches and evaluation," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 4, pp. 1247–1256, 2007.
[49] Z. Chen and D.P.W. Ellis, "Speech enhancement by sparse, low-rank, and dictionary spectrogram decomposition," in WASPAA, 2013.


[23] P Sprechmann A Bronstein and G Sapiro ldquoReal-timeon-line singing voice separation from monaural recordings usingrobust low rank modelingrdquo inISMIR 2012

[24] A Lefeacutevre F Glineur and PA Absil ldquoA nuclear-normbased convex formulation for informed source separationrdquoin ESANN 2013

[25] YH Yang ldquoLow-rank representation of both singing voiceand music accompaniment via learned dictionariesrdquo inIS-MIR 2013

[26] J SalamonMelody Extraction from Polyphonic Music Sig-nals PhD thesis Department of Information and Commu-nication Technologies Universitat Pompeu Fabra BarcelonaSpain 2013

[27] AL Berenzweig and DPW Ellis ldquoLocating singing voicesegments within music signalsrdquo inWASPAA 2001 pp 119ndash122

[28] TL Nwe and Y Wang ldquoAutomatic detection of vocal seg-ments in popular songsrdquo inProc ISMIR 2004 pp 138ndash145

[29] L Feng AB Nielsen and LK Hansen ldquoVocal segmentclassification in popular musicrdquo inISMIR 2008 pp 121ndash126

[30] M Fazel Matrix Rank Minimization with ApplicationsPhD thesis Dept of Elec Eng Stanford Univ 2002

[31] B Recht M Fazel and PA Parrilo ldquoGuaranteed minimum-rank solutions of linear matrix equations via nuclear normminimizationrdquo SIAM Rev vol 52 no 3 pp 471ndash501 2010

[32] EJ Candegraves and B Recht ldquoExact matrix completion via con-vex optimizationrdquo Found Comput Math vol 9 no 6 pp717ndash772 2009

[33] Z Lin A Ganesh J Wright L Wu M Chen and Y MaldquoFast convex optimization algorithms for exact recovery ofa corrupted low-rank matrixrdquo Tech Rep UILU-ENG-09-2214 UIUC Tech Rep 2009

[34] Z Lin M Chen and Y Ma ldquoThe augmented lagrange mul-tiplier method for exact recovery of corrupted low-rank ma-tricesrdquo Tech Rep UILU-ENG-09-2215 UIUC 2009

[35] Xiaoming Yuan and Junfeng Yang ldquoSparse and low-rankmatrix decomposition via alternating direction methodsrdquoPreprint pp 1ndash11 2009

[36] JF Cai EJ Candegraves and Z Shen ldquoA singular value thresh-olding algorithm for matrix completionrdquoSIAM J on Opti-mization vol 20 no 4 pp 1956ndash1982 2010

[37] R Tibshirani ldquoRegression shrinkage and selection via thelassordquo J R Stat Soc Series B vol 58 no 1 pp 267ndash2881996

[38] S Chen L David D Donoho and M Saunders ldquoAtomicdecomposition by basis pursuitrdquoSIAM Journal on ScientificComputing vol 20 pp 33ndash61 1998

[39] Z Gao LF Cheong and M ShanBlock-Sparse RPCA forConsistent Foreground Detection vol 7576 ofLecture Notesin Computer Science pp 690ndash703 Springer Berlin Heidel-berg 2012

[40] Y Grandvalet ldquoLeast absolute shrinkage is equivalent toquadratic penalizationrdquo inICANN 98 L Niklasson M Bo-den and T Ziemke Eds Perspectives in Neural Computingpp 201ndash206 Springer London 1998

[41] H Zou ldquoThe adaptive lasso and its oracle propertiesrdquoJ AmStatist Assoc vol 101 no 476 pp 1418ndash1429 2006

[42] D Angelosante and G Giannakis ldquoRls-weighted lasso foradaptive estimation of sparse signalsrdquo inICASSP 2009 pp3245ndash3248

[43] J Salamon and E Goacutemez ldquoMelody extraction from poly-phonic music signals using pitch contour characteristicsrdquoIEEE Trans Audio Speech Language Process vol 20 pp1759ndash1770 2012

[44] JL Durrieu G Richard B David and C Fevotte rdquoIEEETrans Audio Speech Language Process vol 18 no 3 pp564ndash575 March 2010

[45] E Vincent R Gribonval and C Fevotte ldquoPerformancemea-surement in blind audio source separationrdquoIEEE Trans Au-dio Speech Language Process vol 14 no 4 pp 1462ndash1469 2006

[46] D FitzGerald and M Gainza ldquoSingle channel vocal sepa-ration using median filtering and factorisation techniquesrdquoISAST Transactions on Electronic and Signal Processingvol 4 no 1 pp 62ndash73 2010

[47] Z Rafii F Germain DL Sun and GJ Mysore ldquoCom-bining modeling of singing voice and background music forautomatic separation of musical mixturesrdquo inISMIR 2013

[48] G E Poliner D P W Ellis F Ehmann E Goacutemez S Stre-ich and B Ong ldquoMelody transcription from music audioApproaches and evaluationrdquoIEEE Trans Audio SpeechLanguage Process vol 15 no 4 pp 1247ndash1256 2007

[49] Z Chen and DPW Ellis ldquoSpeech enhancement by sparselow-rank and dictionary spectrogram decompositionrdquo inWASPAA 2013

DAFX-8

  • 1 Introduction
  • 2 Robust Principal Component Analysis via Principal Component Pursuit
  • 3 Adaptive RPCA (A-RPCA)
  • 4 Evaluation
    • 41 Parameters Dataset and Evaluation Criteria
    • 42 Results and Discussion
      • 5 Conclusion
      • 6 References

Proc of the 17th Int Conference on Digital Audio Effects (DAFx-14) Erlangen Germany September 1-5 2014

Algorithm 1: Adaptive RPCA (A-RPCA)

Input: spectrogram D; block regularization weights λ_1, ..., λ_{N_block}
Output: E, A
Initialization: Y^0 = D / J(D), where J(D) = max(‖D‖_2, λ^{−1}‖D‖_∞); E^0 = 0; μ_0 > 0; ρ > 1; k = 0
while not converged do
    // update A
    (U, Σ, V) = SVD(D − E^k + Y^k / μ_k)
    A^{k+1} = U S_{1/μ_k}[Σ] V^T
    // update E
    for each block l do
        λ = λ_l
        E_l^{k+1} = S_{λ_l/μ_k}[D_l − A_l^{k+1} + Y_l^k / μ_k]
    end for
    E^{k+1} = [E_1^{k+1}, E_2^{k+1}, ..., E_{N_block}^{k+1}]
    // update Y, μ
    Y^{k+1} = Y^k − μ_k (A^{k+1} + E^{k+1} − D)
    μ_{k+1} = ρ · μ_k
    k = k + 1
end while
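
The following is a minimal NumPy sketch of this update loop, not the authors' implementation: the soft_threshold helper, the block_slices argument (a list of column slices, one per block), the choice of μ_0, and the convergence test are our own illustrative assumptions.

```python
import numpy as np

def soft_threshold(X, tau):
    # Shrinkage operator S_tau[x] = sign(x) * max(|x| - tau, 0), elementwise.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def a_rpca(D, block_slices, lambdas, rho=1.5, tol=1e-7, max_iter=500):
    """Adaptive RPCA via inexact ALM: D ~ A (low-rank) + E (sparse), with a
    per-block sparsity weight lambda_l applied to the columns in block l."""
    lam = min(lambdas)
    # Y^0 = D / J(D), with J(D) = max(||D||_2, ||D||_inf / lambda)
    Y = D / max(np.linalg.norm(D, 2), np.max(np.abs(D)) / lam)
    E = np.zeros_like(D)
    mu = 1.25 / np.linalg.norm(D, 2)  # illustrative mu_0; the paper only requires mu_0 > 0
    d_norm = np.linalg.norm(D, 'fro')
    for _ in range(max_iter):
        # A-update: singular value thresholding of D - E + Y/mu
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * soft_threshold(s, 1.0 / mu)) @ Vt
        # E-update: blockwise soft thresholding with the block's own lambda_l
        R = D - A + Y / mu
        for sl, lam_l in zip(block_slices, lambdas):
            E[:, sl] = soft_threshold(R[:, sl], lam_l / mu)
        # Dual and penalty updates
        Y = Y + mu * (D - A - E)
        mu *= rho
        if np.linalg.norm(D - A - E, 'fro') / d_norm < tol:
            break
    return A, E
```

Here block_slices would partition the spectrogram frames into the blocks of the segmentation, and lambdas holds the corresponding λ_l values chosen as described next.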

Denote λ_v the constant value of the regularization parameter λ used in the basic formulation of RPCA for voice separation [18]. To guide the separation in the A-RPCA formulation, we assign to each block a value λ_l in accordance with the considered prior music structure information. Using a large λ_l in blocks without leading voice will favor retaining non-zero coefficients in the accompaniment layer. Denoting by Ω_V the set of time frames that contain voice, the values of λ_l are set as

∀ l ∈ [1, N_block]:
    λ_l = λ_v    if E_l ⊂ Ω_V,
    λ_l = λ_nv   otherwise,          (7)

with λ_nv > λ_v to enhance sparsity of E when no vocal activity is detected. Note that instead of two distinct values of λ_l, further improvements could be obtained by tuning λ_l more precisely to suit the segment characteristics. For instance, vibrato information could be used to quantify the amount of voice in the mixture within each block and to set a specific regularization parameter accordingly. The update rules of the A-RPCA algorithm are detailed in Algorithm 1.
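
As a small illustration, the two-valued assignment of Eq. (7) can be built directly from block-level voice-activity labels. block_is_vocal is a hypothetical boolean list (one entry per block); the factor 5 matches the λ_l = 5λ setting used for background-only regions in the evaluation below.

```python
import numpy as np

# Hypothetical inputs: D is the magnitude spectrogram; block_is_vocal[l] says
# whether block l intersects the voiced frame set Omega_V.
lam_v = 1.0 / np.sqrt(max(D.shape))   # baseline lambda from [18]
lam_nv = 5.0 * lam_v                  # lambda_nv > lambda_v, per Eq. (7)
lambdas = [lam_v if v else lam_nv for v in block_is_vocal]
```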

In Section 4 we investigate the results of adaptive RPCA with both exact (ground-truth) and estimated vocal activity information. For estimating vocal activity information, we use the voicing detection step of the melody extraction algorithm implemented in the MELODIA Melody Extraction vamp plug-in³, as it is freely available. We refer the reader to [26] and references therein for other voicing detection algorithms. The algorithm for the automatic extraction of the main melody from polyphonic music recordings implemented in MELODIA is a salience-based model described in [43]. It is based on the creation and characterization of pitch contours, grouped using auditory streaming cues, and includes a voice detection step that indicates when the melody is present; we use this melody location as an indicator of leading voice activity. Note that while the melody can sometimes be carried by other instruments, in the evaluation dataset of Section 4 it is mainly singing.

3 http://mtg.upf.edu/technologies/melodia

4. EVALUATION

In this section we present the results of our approach, evaluated on a database of complete music tracks of various genres. We compare the proposed adaptive method with the baseline method [18] as well as another state-of-the-art method [16]. Sound examples discussed in the article can be found at http://papadopoulosellisdafx14.blogspot.fr

4.1. Parameters, Dataset and Evaluation Criteria

To evaluate the proposed approach, we have constructed a database of 12 complete music tracks of various genres, with separated vocal and accompaniment files, as well as mixture versions formed as the sum of the vocal and accompaniment files. The tracks, listed in Tab. 1, were created from multitracks mixed in Audacity⁴, then exported with or without the vocal or accompaniment lines.

Following previous work [18, 44, 15], the separations are evaluated with metrics from the BSS-EVAL toolbox [45], which provides a framework for the evaluation of source separation algorithms when the original sources are available for comparison. Three ratios are considered for both sources: Source-to-Distortion (SDR), Source-to-Interference (SIR) and Source-to-Artifacts (SAR). In addition, we measure the improvement in SDR between the mixture d and the estimated resynthesized singing voice ê by the Normalized SDR (NSDR, also known as SDR improvement, SDRI), defined for the voice as NSDR(ê, e, d) = SDR(ê, e) − SDR(d, e), where e is the original clean singing voice. The same measure is used for the evaluation of the background. Each measure is computed globally on the whole track, but also locally according to the segmentation into vocal/non-vocal segments. Higher values of the metrics indicate better separation.
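
As an illustration, NSDR can be computed on top of any BSS-EVAL implementation. The sketch below assumes the mir_eval Python port (the paper itself uses the MATLAB BSS-EVAL toolbox [45]), single-channel signals of equal length, and hypothetical argument names.

```python
import numpy as np
import mir_eval.separation  # assumption: the mir_eval port of BSS-EVAL is available

def nsdr(voice_est, voice_ref, mixture):
    """NSDR(e_hat, e, d) = SDR(e_hat, e) - SDR(d, e): the SDR gain of the
    separated voice over simply taking the mixture as the voice estimate."""
    ref = voice_ref[np.newaxis, :]  # shape (1, n_samples)
    sdr_est, _, _, _ = mir_eval.separation.bss_eval_sources(ref, voice_est[np.newaxis, :])
    sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(ref, mixture[np.newaxis, :])
    return sdr_est[0] - sdr_mix[0]
```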

We compare the results of A-RPCA, with its musically-informed adaptive λ, against the baseline RPCA method [18] with fixed λ, using the same parameter settings: in the analysis stage, the STFT of each mixture is computed using a window length of 1024 samples with 75% overlap, at a sampling rate of 11.5 kHz. No post-processing (such as masking) is added. After spectrogram decomposition, the signals are reconstructed using the inverse STFT and the phase of the original signal.
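
A minimal sketch of that pipeline, assuming librosa for the STFT/ISTFT and reusing the a_rpca function sketched above; the file name, the 11.5 kHz rate read off the text, and the hop of 256 samples (75% overlap of a 1024-sample window) are our reconstruction, not verified settings.

```python
import numpy as np
import librosa  # assumption: librosa handles audio I/O and the STFT

y, sr = librosa.load("mixture.wav", sr=11500)    # hypothetical file and rate
S = librosa.stft(y, n_fft=1024, hop_length=256)  # 1024-sample window, 75% overlap
mag, phase = np.abs(S), np.angle(S)

A, E = a_rpca(mag, block_slices, lambdas)        # sketch from Algorithm 1 above

# Resynthesize each layer with the mixture phase; no masking post-processing.
voice = librosa.istft(E * np.exp(1j * phase), hop_length=256)
background = librosa.istft(A * np.exp(1j * phase), hop_length=256)
```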

The parameter λ is set to 1/√(max(m, n)) in the baseline method. Two different versions of the proposed A-RPCA algorithm are evaluated. First, A-RPCA with exact voice activity information using manually annotated ground truth (A-RPCA_GT), with λ_l = λ for singing voice regions and λ_l = 5λ for background-only regions. In the other configuration, estimated voice activity location is used (A-RPCA_est), with the same settings for the λ_l.

We also compare our approach with the REPET state-of-the-art algorithm based on repeating pattern discovery and binary time-frequency masking [16]. Note that we use for comparison the version of REPET that is designed for processing complete musical tracks (as opposed to the original one introduced in [15]). This method includes a simple low-pass filtering post-processing step [46] that consists in removing all frequencies below 100 Hz from the vocal signal and adding these components back into the background layer. We further apply this post-processing step to our model before comparison with the REPET algorithm.

Paired-sample t-tests at the 5% significance level are performed to determine whether there is statistical significance in the results between various configurations.
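
With SciPy, such a paired test over per-song metric values might look like the following; the arrays are placeholders, purely for illustration.

```python
import numpy as np
from scipy import stats

# Placeholder per-song SDR values (one entry per song), for illustration only.
sdr_rpca = np.array([-4.7, -3.2, -5.1, -4.0])   # baseline RPCA
sdr_arpca = np.array([-2.3, -1.9, -3.4, -2.8])  # A-RPCA

t_stat, p_value = stats.ttest_rel(sdr_arpca, sdr_rpca)  # paired-sample t-test
significant = p_value < 0.05                            # 5% significance level
```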

4 http://audacity.sourceforge.net


Table 1: Sound excerpts used for the evaluation. %back: proportion of background (no leading voice) segments (in % of the whole excerpt duration); Recall (Rec) and False Alarm (FA) voicing detection rates (%).

| Name                                               | %back | Rec   | FA    |
|----------------------------------------------------|-------|-------|-------|
| 1 - Beatles, Sgt Pepper's Lonely Hearts Club Band  | 49.3  | 74.74 | 45.56 |
| 2 - Beatles, With A Little Help From My Friends    | 13.5  | 70.10 | 14.71 |
| 3 - Beatles, She's Leaving Home                    | 24.6  | 77.52 | 30.17 |
| 4 - Beatles, A Day in The Life                     | 35.6  | 61.30 | 63.96 |
| 5/6 - Puccini, piece for soprano and piano         | 24.7  | 47.90 | 27.04 |
| 7 - Pink Noise Party, Their Shallow Singularity    | 42.1  | 64.15 | 61.83 |
| 8 - Bob Marley, Is This Love                       | 37.2  | 66.22 | 36.84 |
| 9 - Doobie Brothers, Long Train Running            | 65.6  | 84.12 | 58.51 |
| 10 - Marvin Gaye, Heard it Through The Grapevine   | 30.2  | 79.22 | 17.90 |
| 11 - The Eagles, Take it Easy                      | 35.5  | 78.68 | 30.20 |
| 12 - The Police, Message in a Bottle               | 24.9  | 73.90 | 20.44 |

4.2. Results and Discussion

Results of the separation for the sparse (singing voice) and low-rank (background accompaniment) layers are presented in Tables 2, 3, 4 and 5. To get a better insight into the results, we present measures computed both on the entire song and on the singing-voice-active part only, obtained by concatenating all segments labeled as vocal in the ground truth.

• Global separation results. As we can see from Tables 2 and 3, using a musically-informed adaptive regularization parameter improves the separation results, both for the background and the leading voice components. Note that the larger the proportion of purely-instrumental segments in a piece (see Tab. 1), the larger the improvement (see in particular pieces 1, 7, 8 and 9), which is consistent with the goal of the proposed method. Statistical tests show that the improvement in the results is significant.

As discussed in Section 3, the quality of the separation with the baseline method [18] depends on the value of the regularization parameter. Moreover, the value that leads to the best separation quality differs from one music excerpt to another. Thus, when automatically processing a collection of music tracks, the choice of this value results from a trade-off. We report here results obtained with the typical choice λ_v = 1/√(max(m, n)) in Eq. (7). Note that for a given value of λ_v in the baseline method, the separation can always be further improved by the A-RPCA algorithm using a regularization parameter that is adapted to the music content based on prior music structure information: in all experiments, for a given constant value λ_v in the baseline method, setting λ_nv > λ_v in Eq. (7) improves the results.

For the singing voice layer, improved SDR (better overall separation performance) and SIR (better capability of removing music interference from the singing voice) with A-RPCA are obtained at the price of introducing more artifacts in the estimated voice (lower SAR_voice). Listening tests reveal that in some segments processed by A-RPCA, as for instance segment [1'00"–1'15"] in Fig. 3, one can hear some isolated high-frequency coefficients superimposed on the separated voice. This drawback could be reduced by including harmonicity priors in the sparse component of RPCA, as proposed in [20]. This performance trade-off is commonly encountered in music/voice separation [14, 47]. However, we can notice that all three measures are significantly improved with A-RPCA for the background layer.

• Ground truth versus estimated voice activity location. Imperfect voice activity location information still allows an improvement, although to a lesser extent than ground-truth voice activity information. In Table 1 we report the accuracy of the voicing detection step. Similarly to the measures used for melody

Figure 3: Separated voice for various values of λ for the Pink Noise Party song Their Shallow Singularity. From top to bottom: clean voice; constant λ1 = 1/√(max(m, n)); constant λ = 5λ1; adaptive λ = (λ1, 5λ1).

detection in [48, 12], we consider the Voicing Recall Rate, defined as the proportion of frames labeled voiced in the ground truth that are estimated as voiced by the algorithm, and the Voicing False Alarm Rate, defined as the proportion of frames labeled unvoiced in the ground truth that are mistakenly estimated as voiced by the algorithm. The decrease in the results mainly comes from background segments classified as vocal segments. However, statistical tests show that the improvement between RPCA and A-RPCA_est is still significant.
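
These two rates reduce to simple frame counts. A minimal sketch, assuming boolean per-frame labels (gt_voiced and est_voiced are hypothetical names):

```python
import numpy as np

def voicing_scores(gt_voiced, est_voiced):
    """Voicing Recall and False Alarm rates from boolean frame labels."""
    recall = np.mean(est_voiced[gt_voiced])        # ground-truth voiced frames detected as voiced
    false_alarm = np.mean(est_voiced[~gt_voiced])  # ground-truth unvoiced frames flagged as voiced
    return recall, false_alarm
```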

• Local separation results. It is interesting to note that using an adaptive regularization parameter in a unified analysis of the whole piece is different from separately analyzing the successive vocal/non-vocal segments with different but constant values of λ (see for instance the dashed rectangle areas in Fig. 3).

• Analysis of the results on vocal segments. We expect the separation on background-only parts of the song to be improved with the A-RPCA algorithm: indeed, the side information directly indicates these regions, where foreground (sparse) components should be avoided; this can be clearly seen in Fig. 3. However, the improvements under the proposed model are not limited to non-vocal regions. Results measured on the vocal segments alone indicate that with the adaptive algorithm the voice is also better estimated, as shown in Table 3. The improvement over RPCA is statistically significant both when using ground truth and estimated voice activity location information. This indicates that side information not only helps to better determine the background-only segments, but also enables improved recovery of the singing voice, presumably because the low-rank background model is a better match to the actual background.

Side information could have been added as a pre- or post-processing step to the RPCA algorithm. The adaptive RPCA algorithm presents advantages over such approaches. To analyze this, we compare the


Table 2: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background layer (Back), computed across the whole song for all models, averaged across all the songs. RPCA is the baseline system; A-RPCA_GT is the adaptive version using ground-truth voice activity information; A-RPCA_est uses estimated voice activity.

| Entire song |          | RPCA  | A-RPCA_GT | A-RPCA_est |
|-------------|----------|-------|-----------|------------|
| Voice       | SDR (dB) | -4.66 | -2.16     | -3.18      |
|             | SIR (dB) | -3.86 | 0.74      | -0.46      |
|             | SAR (dB) | 8.99  | 4.81      | 3.94       |
|             | NSDR     | 1.70  | 4.20      | 3.18       |
| Back        | SDR (dB) | 4.14  | 6.52      | 6.08       |
|             | SIR (dB) | 11.48 | 13.30     | 12.07      |
|             | SAR (dB) | 5.51  | 8.03      | 7.83       |
|             | NSDR     | -2.35 | 0.03      | -0.41      |

Table 3: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background layer (Back), computed across the vocal segments only for all models, averaged across all the songs. RPCA is the baseline system; A-RPCA_GT is the adaptive version using ground-truth voice activity information; A-RPCA_est uses estimated voice activity.

| Vocal segments |          | RPCA  | A-RPCA_GT | A-RPCA_est |
|----------------|----------|-------|-----------|------------|
| Voice          | SDR (dB) | -3.19 | -2.00     | -1.96      |
|                | SIR (dB) | -2.33 | -0.39     | 0.74       |
|                | SAR (dB) | 9.44  | 7.27      | 4.64       |
|                | NSDR     | 1.67  | 2.85      | 2.90       |
| Back           | SDR (dB) | 3.63  | 5.18      | 5.28       |
|                | SIR (dB) | 9.95  | 10.64     | 10.41      |
|                | SAR (dB) | 5.39  | 7.32      | 7.54       |
|                | NSDR     | -1.37 | 0.18      | 0.29       |

Table 4: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background layer (Back), computed across the whole song for all models, averaged across all the songs. RPCA is the baseline system; A-RPCA_GT is the adaptive version using ground-truth voice activity information; A-RPCA_est uses estimated voice activity. Low-pass filtering post-processing is applied; REPET is the comparison algorithm [16].

| Entire song |          | RPCA  | A-RPCA_GT | A-RPCA_est | REPET |
|-------------|----------|-------|-----------|------------|-------|
| Voice       | SDR (dB) | -2.76 | -0.72     | -2.11      | -2.20 |
|             | SIR (dB) | -0.17 | 4.03      | 2.22       | 1.34  |
|             | SAR (dB) | 4.33  | 3.33      | 2.32       | 3.19  |
|             | NSDR     | 3.60  | 5.64      | 4.25       | 4.16  |
| Back        | SDR (dB) | 5.16  | 7.61      | 6.81       | 5.01  |
|             | SIR (dB) | 14.53 | 14.49     | 12.99      | 16.83 |
|             | SAR (dB) | 5.96  | 9.02      | 8.44       | 5.47  |
|             | NSDR     | -1.32 | 1.12      | 0.33       | -1.48 |

Table 5: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background layer (Back), computed across the vocal segments only for all models, averaged across all the songs. RPCA is the baseline system; A-RPCA_GT is the adaptive version using ground-truth voice activity information; A-RPCA_est uses estimated voice activity. Low-pass filtering post-processing is applied; REPET is the comparison algorithm [16].

| Vocal segments only |          | RPCA  | A-RPCA_GT | A-RPCA_est | REPET |
|---------------------|----------|-------|-----------|------------|-------|
| Voice               | SDR (dB) | -1.25 | -0.53     | -0.83      | -0.70 |
|                     | SIR (dB) | 1.49  | 3.04      | 3.62       | 3.02  |
|                     | SAR (dB) | 5.02  | 4.46      | 3.12       | 4.02  |
|                     | NSDR     | 3.60  | 4.32      | 4.02       | 4.15  |
| Back                | SDR (dB) | 4.85  | 6.03      | 6.11       | 4.80  |
|                     | SIR (dB) | 13.07 | 12.38     | 11.41      | 15.33 |
|                     | SAR (dB) | 5.91  | 7.69      | 8.20       | 5.41  |
|                     | NSDR     | -0.14 | 1.03      | 1.11       | -0.20 |

A-RPCA algorithm with two variants of RPCA incorporating side information, either as a pre- or a post-processing step (the post variant is sketched after the list):

• RPCA_OV pre: Only the concatenation of segments classified as vocal is processed by RPCA (the singing voice estimate being set to zero in the remaining non-vocal segments).

• RPCA_OV post: The whole song is processed by RPCA, and non-zero coefficients estimated as belonging to the voice layer in non-vocal segments are transferred to the background layer.
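
As an illustration, the post-processing variant amounts to a few lines on top of a plain RPCA decomposition; block_slices and block_is_vocal are the hypothetical names used in the sketches above.

```python
import numpy as np

# RPCA_OV post (sketch): run plain RPCA on the whole song, then move any
# voice-layer energy found in non-vocal segments back to the background layer.
# Plain RPCA is A-RPCA with a single block and the baseline lambda:
A, E = a_rpca(mag, [slice(None)], [1.0 / np.sqrt(max(mag.shape))])
for sl, is_vocal in zip(block_slices, block_is_vocal):
    if not is_vocal:
        A[:, sl] += E[:, sl]  # transfer coefficients to the background layer
        E[:, sl] = 0.0        # null voice estimate in background-only segments
```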

Results of the decomposition computed across the vocal segments only are presented in Table 6. Note that the RPCA_OV post results reduce to the RPCA results in Table 3, since they are computed on vocal segments only. There is no statistical difference between the estimated voice obtained by processing the whole song with RPCA and by processing the vocal segments only. Results are significantly better using the A-RPCA algorithm than using RPCA_OV pre and RPCA_OV post. This is illustrated in Figure 4, which shows an example of the decomposition on an excerpt of the Doobie Brothers song Long Train Running, composed of a non-vocal followed by a vocal segment. We can see that misclassified partials in the voice spectrogram obtained with the baseline RPCA are removed with A-RPCA. Moreover, the gap in the singing voice around frame 50 (a breath) is cleaner with A-RPCA than with RPCA. Listening tests confirm that the background is better attenuated in the voice layer when using A-RPCA.

Table 6: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background layer (Back), computed across the vocal segments only, averaged across all the songs. RPCA_OV post uses the baseline system and sets the voice estimate to zero in background-only segments; RPCA_OV pre processes only the vocal segments with the baseline model; A-RPCA_GT is the adaptive version using ground-truth voice activity information; A-RPCA_est uses estimated voice activity.

| Vocal segments only |      | RPCA_OV post | RPCA_OV pre | A-RPCA_GT | A-RPCA_est |
|---------------------|------|--------------|-------------|-----------|------------|
| Voice               | SDR  | -3.19        | -3.28       | -2.00     | -1.96      |
|                     | SIR  | -2.33        | -2.31       | 3.62      | 0.74       |
|                     | SAR  | 9.44         | 8.97        | 7.27      | 4.64       |
|                     | NSDR | 1.67         | 1.57        | 2.85      | 2.90       |
| Back                | SDR  | 3.63         | 3.72        | 5.18      | 5.28       |
|                     | SIR  | 9.95         | 9.22        | 10.64     | 10.41      |
|                     | SAR  | 5.39         | 5.85        | 7.32      | 7.54       |
|                     | NSDR | -1.37        | -1.28       | 0.18      | 0.29       |

• Comparison with the state-of-the-art. As we can see from Table 4, the results obtained with the RPCA baseline method are not better than those obtained with the REPET algorithm. On the contrary, the REPET algorithm is significantly outperformed by the A-RPCA algorithm when using ground-truth voice activity information, both for the sparse and low-rank layers. However, note that when using estimated voice activity information, the difference in the results between REPET and A-RPCA is not statistically significant for the sparse layer. Looking closer at the results, it is interesting to note that the voice estimation improvement of A-RPCA_GT over REPET mainly comes from the non-vocal parts, where the voice estimate is favored to be null. Indeed, Table 5 indicates that the voice estimates on vocal segments obtained with A-RPCA_GT and REPET are similar. This is illustrated by the two last panes of Figure 4 (bottom), which show similar spectrograms of the voice estimates obtained with the A-RPCA and REPET algorithms on the vocal part of the excerpt.

Figure 4: [Top] Example decomposition on an excerpt of the Doobie Brothers song Long Train Running, and [Bottom] zoom on frames [525–580] (dashed rectangle in the top figure). For each figure, the top pane shows the 0–500 Hz part of the spectrogram of the original signal. The clean singing voice appears in the second pane. The separated singing voice obtained with the baseline model (RPCA), with the baseline model restricted to singing-voice-active segments only (RPCA_OV pre), and with the proposed A-RPCA model are shown in panes 3 to 5. For comparison, the sixth pane shows the results obtained with REPET [16].

5. CONCLUSION

We have explored an adaptive version of the RPCA technique that allows the processing of entire pieces of music, including local variations in the music structure. Music content information is incorporated in the decomposition to guide the selection of coefficients in the sparse and low-rank layers according to the semantic structure of the piece; this motivates the choice of a regularization parameter informed by musical cues. Results indicate that with the proposed algorithm not only are the background segments better discriminated, but the singing voice is also better estimated in vocal segments, presumably because the low-rank background model is a better match to the actual background. The method could be extended with other criteria (singer identification, vibrato saliency, etc.). It could also be improved by incorporating additional information to set the regularization parameters differently for each track, to better accommodate the varying contrast of foreground and background. The idea of an adaptive decomposition could also be improved with a more complex formulation of RPCA that incorporates additional constraints [20] or a learned dictionary [49].

6. REFERENCES

[1] K. Min, Z. Zhang, J. Wright, and Y. Ma, "Decomposing background topics from keywords by principal component pursuit," in CIKM, 2010.

[2] S. Brutzer, B. Höferlin, and G. Heidemann, "Evaluation of background subtraction techniques for video surveillance," in CVPR, 2011, pp. 1937–1944.

[3] E.J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?," Journal of the ACM, vol. 58, no. 3, Article 11, 2011.

[4] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky, "Sparse and low-rank matrix decompositions," in SYSID, 2009.

[5] B. Cheng, G. Liu, J. Wang, Z. Huang, and S. Yan, "Multi-task low-rank affinity pursuit for image segmentation," in ICCV, 2011, pp. 2439–2446.

[6] Z. Zeng, T.H. Chan, K. Jia, and D. Xu, "Finding correspondence from multiple images via sparse and low-rank decomposition," in ECCV, 2012, pp. 325–339.

[7] F. Yang, H. Jiang, Z. Shen, W. Deng, and D.N. Metaxas, "Adaptive low rank and sparse decomposition of video using compressive sensing," CoRR, vol. abs/1302.1610, 2013.

[8] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, "RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2233–2246, 2012.

[9] Z. Shi, J. Han, T. Zheng, and S. Deng, "Online learning for classification of low-rank representation features and its applications in audio segment classification," CoRR, vol. abs/1112.4243, 2011.

[10] Y.H. Yang, D. Bogdanov, P. Herrera, and M. Sordo, "Music retagging using label propagation and robust principal component analysis," in WWW, New York, NY, USA, 2012, pp. 869–876.

[11] W. Cai, Q. Li, and X. Guan, "Automatic singer identification based on auditory features," 2011.

[12] J. Salamon, E. Gómez, D.P.W. Ellis, and G. Richard, "Melody extraction from polyphonic music signals: Approaches, applications and challenges," IEEE Signal Process. Mag., 2013.

[13] R.B. Dannenberg, W.P. Birmingham, B. Pardo, N. Hu, C. Meek, and G. Tzanetakis, "A comparative evaluation of search techniques for query-by-humming using the MUSART testbed," J. Am. Soc. Inf. Sci. Technol., vol. 58, no. 5, pp. 687–701, 2007.

[14] B. Zhu, W. Li, R. Li, and X. Xue, "Multi-stage non-negative matrix factorization for monaural singing voice separation," IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 10, pp. 2096–2107, 2013.

[15] Z. Rafii and B. Pardo, "A simple music/voice separation method based on the extraction of the repeating musical structure," in ICASSP, 2011.

[16] A. Liutkus, Z. Rafii, R. Badeau, B. Pardo, and G. Richard, "Adaptive filtering for music/voice separation exploiting the repeating musical structure," in ICASSP, 2012.

[17] D. FitzGerald, "Vocal separation using nearest neighbours and median filtering," in ISSC, 2012.

[18] P.S. Huang, S.D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing voice separation from monaural recordings using robust principal component analysis," in ICASSP, 2012.

[19] C.L. Hsu and J.S.R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 2, pp. 310–319, 2010.

[20] Y.H. Yang, "On sparse and low-rank matrix decomposition for singing voice separation," in MM, 2012, pp. 757–760.

[21] M. Moussallam, G. Richard, and L. Daudet, "Audio source separation informed by redundancy with greedy multiscale decompositions," in EUSIPCO, 2012, pp. 2644–2648.

[22] S.G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3397–3415, 1993.

[23] P. Sprechmann, A. Bronstein, and G. Sapiro, "Real-time online singing voice separation from monaural recordings using robust low-rank modeling," in ISMIR, 2012.

[24] A. Lefèvre, F. Glineur, and P.A. Absil, "A nuclear-norm based convex formulation for informed source separation," in ESANN, 2013.

[25] Y.H. Yang, "Low-rank representation of both singing voice and music accompaniment via learned dictionaries," in ISMIR, 2013.

[26] J. Salamon, Melody Extraction from Polyphonic Music Signals, Ph.D. thesis, Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Spain, 2013.

[27] A.L. Berenzweig and D.P.W. Ellis, "Locating singing voice segments within music signals," in WASPAA, 2001, pp. 119–122.

[28] T.L. Nwe and Y. Wang, "Automatic detection of vocal segments in popular songs," in ISMIR, 2004, pp. 138–145.

[29] L. Feng, A.B. Nielsen, and L.K. Hansen, "Vocal segment classification in popular music," in ISMIR, 2008, pp. 121–126.

[30] M. Fazel, Matrix Rank Minimization with Applications, Ph.D. thesis, Dept. of Elec. Eng., Stanford Univ., 2002.

[31] B. Recht, M. Fazel, and P.A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Rev., vol. 52, no. 3, pp. 471–501, 2010.

[32] E.J. Candès and B. Recht, "Exact matrix completion via convex optimization," Found. Comput. Math., vol. 9, no. 6, pp. 717–772, 2009.

[33] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma, "Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix," Tech. Rep. UILU-ENG-09-2214, UIUC, 2009.

[34] Z. Lin, M. Chen, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," Tech. Rep. UILU-ENG-09-2215, UIUC, 2009.

[35] X. Yuan and J. Yang, "Sparse and low-rank matrix decomposition via alternating direction methods," Preprint, pp. 1–11, 2009.

[36] J.F. Cai, E.J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.

[37] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Series B, vol. 58, no. 1, pp. 267–288, 1996.

[38] S. Chen, D.L. Donoho, and M. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, pp. 33–61, 1998.

[39] Z. Gao, L.F. Cheong, and M. Shan, Block-Sparse RPCA for Consistent Foreground Detection, vol. 7576 of Lecture Notes in Computer Science, pp. 690–703, Springer, Berlin/Heidelberg, 2012.

[40] Y. Grandvalet, "Least absolute shrinkage is equivalent to quadratic penalization," in ICANN 98, L. Niklasson, M. Boden, and T. Ziemke, Eds., Perspectives in Neural Computing, pp. 201–206, Springer, London, 1998.

[41] H. Zou, "The adaptive lasso and its oracle properties," J. Am. Statist. Assoc., vol. 101, no. 476, pp. 1418–1429, 2006.

[42] D. Angelosante and G. Giannakis, "RLS-weighted lasso for adaptive estimation of sparse signals," in ICASSP, 2009, pp. 3245–3248.

[43] J. Salamon and E. Gómez, "Melody extraction from polyphonic music signals using pitch contour characteristics," IEEE Trans. Audio, Speech, Language Process., vol. 20, pp. 1759–1770, 2012.

[44] J.L. Durrieu, G. Richard, B. David, and C. Févotte, "Source/filter model for unsupervised main melody extraction from polyphonic audio signals," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 3, pp. 564–575, March 2010.

[45] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 4, pp. 1462–1469, 2006.

[46] D. FitzGerald and M. Gainza, "Single channel vocal separation using median filtering and factorisation techniques," ISAST Transactions on Electronic and Signal Processing, vol. 4, no. 1, pp. 62–73, 2010.

[47] Z. Rafii, F. Germain, D.L. Sun, and G.J. Mysore, "Combining modeling of singing voice and background music for automatic separation of musical mixtures," in ISMIR, 2013.

[48] G.E. Poliner, D.P.W. Ellis, F. Ehmann, E. Gómez, S. Streich, and B. Ong, "Melody transcription from music audio: Approaches and evaluation," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 4, pp. 1247–1256, 2007.

[49] Z. Chen and D.P.W. Ellis, "Speech enhancement by sparse, low-rank, and dictionary spectrogram decomposition," in WASPAA, 2013.

DAFX-8

  • 1 Introduction
  • 2 Robust Principal Component Analysis via Principal Component Pursuit
  • 3 Adaptive RPCA (A-RPCA)
  • 4 Evaluation
    • 41 Parameters Dataset and Evaluation Criteria
    • 42 Results and Discussion
      • 5 Conclusion
      • 6 References

Proc of the 17th Int Conference on Digital Audio Effects (DAFx-14) Erlangen Germany September 1-5 2014

Table 1 Sound excerpts used for the evaluationback proportion of background (no leading voice) segments (in of the whole excerptduration) RecallRecand False AlarmFAvoicing detection rate

Name back Rec FA Name back Rec FA1- BeatlesSgt Pepperrsquos Lonely Hearts Club Band 493 7474 4556 8 - Bob MarleyIs This Love 372 6622 36842 - BeatlesWith A Little Help From My Friends 135 7010 1471 9 - Doobie BrothersLong Train Running 656 8412 58513 - BeatlesShersquos Leaving Home 246 7752 3017 10 -Marvin GayeHeard it Through The Grapevine 302 7922 17904 - BeatlesA Day in The Life 356 6130 6396 11 -The EaglesTake it Easy 355 7868 302056 -Puccinipiece for soprano and piano 247 4790 2704 12 -The PoliceMessage in aBottle 249 7390 20447 - Pink Noise PartyTheir Shallow Singularity 421 6415 6183

42 Results and Discussion

Results of the separation for the sparse (singing voice) andlow-rank (background accompaniment) layers are presented in Tables2 3 4 and 5 To have a better insight of the results we presentmeasures computed both on the entire song and on the singingvoice-active part only that is obtained by concatenating all seg-ments labeled as vocal segments in the ground truth

bull Global separation results As we can see from Tables 2 and3 using a musically-informed adaptive regularization parameterallows improving the results of the separation both for the back-ground and the leading voice components Note that the larger theproportion of purely-instrumental segments in a piece (seeTab 1)the larger the results improvement (see in particular pieces 1 7 8and 9) which is consistent with the goal of the proposed methodStatistical tests show that the improvement in the results is signifi-cant

As discussed in Section 3 the quality of the separation withthe baseline method [18] depends on the value of the regulariza-tion parameter Moreover the value that leads to the best separa-tion quality differs from one music excerpt to another Thus whenprocessing automatically a collection of music tracks thechoice ofthis value results from a trade-off We report here results obtainedwith the typical choiceλv = 1

p

max(m n) in Eq (7) Note thatfor a given value ofλv in the baseline method the separation canalways be further improved by the A-RPCA algorithm using a reg-ularization parameter that is adapted to the music content based onprior music structure information in all experiments fora givenconstant valueλv in the baseline method settingλnv gt λv in Eq(7) improves the results

For the singing voice layer improved SDR (better overall sep-aration performance) and SIR (better capability of removing musicinterferences from the singing voice) with A-RPCA are obtainedat the price of introducing more artifacts in the estimated voice(lower SARvoice) Listening tests reveal that in some segmentsprocessed by A-RPCA as for instance segment[1prime00primeprime minus 1prime15primeprime]in Fig 3 one can hear some high frequency isolated coefficientssuperimposed to the separated voice This drawback could bere-duced by including harmonicity priors in the sparse component ofRPCA as proposed in [20] This performance trade-off is com-monly encountered in musicvoice separation [14 47] Howeverwe can notice that all three measures are significantly improvedwith A-RPCA for the background layer

bull Ground truth versus estimated voice activity location Im-perfect voice activity location information still allows an improve-ment although to a lesser extent than with ground-truth voice ac-tivity information In table 1 we report the accuracy results of thevoicing detection step Similarly to the measures used for melody

Figure 3 Separated voice for various values ofλ for thePink NoiseParty songTheir Shallow Singularity From top to bottom cleanvoice constantλ1 = 1

p

max(m n) constantλ = 5lowastλ1 adap-tive λ = (λ1 5 lowast λ1)

detection in [48 12] we consider theVoicing Recall Rate definedas the proportion of frames labeled voiced in the ground truth thatare estimated as voiced frames by the algorithm and theVoicingFalse Alarm Rate defined as the proportion of frames labeled asunvoiced in the ground truth that are mistakenly estimated to bevoiced by the algorithm The decrease in the results mainly comesfrom background segments classified as vocal segments Howeverstatistical tests show that the improvement in the results betweenRPCA and A-RPCA_est is still significant

bull Local separation results It is interesting to note that usingan adaptive regularization parameter in a unified analysis of thewhole piece is different from separately analyzing the successivevocalnon-vocal segments with different but constant values ofλ(see for instance the dashed rectangles areas in Fig 3)

bull Analysis of the results on vocal segments We expect the sep-aration on background-only parts of the song to be improved withthe A-RPCA algorithm Indeed the side information directlyin-dicates these regions where the foreground (sparse) componentsshould be avoided this can be clearly seen in Fig 3 However theimprovements under the proposed model are not limited to non-vocal regions only Results measured on the vocal segments aloneindicate that by using the adaptive algorithm the voice is also bet-ter estimated as shown in Table 3 The improvement over RPCAis statistically significant both when using ground truth and esti-mated voice activity location information This indicatesthat sideinformation helps not only to better determine the background onlysegments but also enables improved recovery of the singingvoicepresumably because the low-rank background model is a bettermatch to the actual background

Side information could have be added as a pre- or post-processingstep to the RPCA algorithm The adaptive-RPCA algorithm presentsadvantages over such approaches To analyze this we compare the

DAFX-5

Proc of the 17th Int Conference on Digital Audio Effects (DAFx-14) Erlangen Germany September 1-5 2014

Table 2 SDR SIR and SAR (in dB) and NSDR results for the voice(Voice) and background layer (Back) computed across the wholesong for all models averaged across all the songs RPCA is the base-line system A-RPCA_GT is the adaptive version using groundtruthvoice activity information and A-RPCA_est uses estimatedvoice ac-tivity

Entire songRPCA A-RPCA_GT A-RPCA_est

Voice

SDR (dB) -466 -216 -318SIR (dB) -386 074 -046SAR (dB) 899 481 394

NSDR 170 420 318

Back

SDR (dB) 414 652 608SIR (dB) 1148 1330 1207SAR (dB) 551 803 783

NSDR -235 003 -041

Table 3 SDR SIR and SAR (in dB) and NSDR results for the voice(Voice) and background layer (Back) computed across the vocal seg-ments only for all models averaged across all the songs RPCAis the baseline system A-RPCA_GT is the adaptive version usingground truth voice activity information and A-RPCA_est uses esti-mated voice activity

Vocal segmentsRPCA A-RPCA_GT A-RPCA_est

Voice

SDR (dB) -319 -200 -196SIR (dB) -233 -039 074SAR (dB) 944 727 464

NSDR 167 285 290

Back

SDR (dB) 363 518 528SIR (dB) 995 1064 1041SAR (dB) 539 732 754

NSDR -137 018 029

Table 4 SDR SIR and SAR (in dB) and NSDR results for the voice(Voice) and background layer (Back) computed across the wholesong for all models averaged across all the songs RPCA is the base-line system A-RPCA_GT is the adaptive version using groundtruthvoice activity information and A-RPCA_est uses estimatedvoice ac-tivity Low-pass filtering post-processing is applied REPET is thecomparison algorithm [16]

Entire songRPCA A-RPCA_GT A-RPCA_est REPET

Voice

SDR (dB) -276 -072 -211 -220SIR (dB) -017 403 222 134SAR (dB) 433 333 232 319

NSDR 360 564 425 416

Back

SDR (dB) 516 761 681 501SIR (dB) 1453 1449 1299 1683SAR (dB) 596 902 844 547

NSDR -132 112 033 -148

Table 5 SDR SIR and SAR (in dB) and NSDR results for the voice(Voice) and background layer (Back) computed across the vocal seg-ments only for all models averaged across all the songs RPCAis the baseline system A-RPCA_GT is the adaptive version usingground truth voice activity information and A-RPCA_est uses esti-mated voice activity Low-pass filtering post-processing is appliedREPET is the comparison algorithm [16]

Vocal segments onlyRPCA A-RPCA_GT A-RPCA_est REPET

Voice

SDR (dB) -125 -053 -083 -070SIR (dB) 149 304 362 302SAR (dB) 502 446 312 402

NSDR 360 432 402 415

Back

SDR (dB) 485 603 611 480SIR (dB) 1307 1238 1141 1533SAR (dB) 591 769 820 541

NSDR -014 103 111 -020

A-RPCA algorithm with two variants of RPCA incorporating sideinformation either as a pre- or a post-processing step

bull RPCA_OV pre Only the concatenation of segments clas-sified as vocal is processed by RPCA (the singing voiceestimate being set to zero in the remaining non-vocal seg-ments)

bull RPCA_OV post The whole song is processed by RPCAand non-zeros coefficients estimated as belonging to thevoice layer in non-vocal segments are transferred to thebackground layer

Results of the decomposition computed across the vocal seg-ments only are presented in Table 6 Note that the RPCA_OV post

results reduce to the RPCA results in Table 3 since they are com-puted on vocal segments only There is no statistical difference be-tween the estimated voice obtained by processing with RPCA thewhole song and the vocal segments only Results are significantlybetter using the A-RPCA algorithm than using RPCA_OV pre andRPCA_OV post This is illustrated in Figure 4 which shows anexample of the decomposition on an excerpt of theDoobie Broth-ers songLong Train Runningcomposed by a non-vocal followedby a vocal segment We can see that there are misclassified partialsin the voice spectrogram obtained with the baseline RPCA that areremoved with A-RPCA Moreover the gap in the singing voicearound frame 50 (breathing) is cleaner in the case of A-RPCA thanin the case of RPCA Listening tests confirm that the backgroundis better attenuated in the voice layer when using A-RPCA

Table 6 SDR SIR and SAR (in dB) and NSDR resultsfor the voice (Voice) and background layer (Back) com-puted across the vocal segments only averaged across all thesongs RPCA_OV post is when using the baseline system andset the voice estimate to zero in background-only segmentsRPCA_OV pre is when processing only the voice segments withthe baseline model A-RPCA_GT is the adaptive version usingground truth voice activity information and A-RPCA_est uses es-timated voice activity

RPCA_OV post RPCA_OV pre A-RPCA_GT A-RPCA_est

Voice

SDR -319 -328 -200 -196SIR -233 -231 362 074SAR 944 897 727 464

NSDR 167 157 285 290

Back

SDR 363 372 518 528SIR 995 922 1064 1041SAR 539 585 732 754

NSDR -137 -128 018 029

bull Comparison with the state-of-the-art As we can see from Ta-ble 4 the results obtained with the RPCA baseline method arenotbetter than those obtained with the REPET algorithm On the con-trary the REPET algorithm is significantly outperformed bytheA-RPCA algorithm when using ground truth voice activity infor-mation both for the sparse and low-rank layers However notethat when using estimated voice activity information the differ-

DAFX-6

Proc of the 17th Int Conference on Digital Audio Effects (DAFx-14) Erlangen Germany September 1-5 2014

Figure 4 [Top Figure] Example decomposition on an excerpt ofthe Doobie BrotherssongLong Train Runningand [Bottom Fig-ure] zoom between frames [525-580] (dashed rectangle in theTopFigure) For each figure the top pane shows the part between0 and500Hz of the spectrogram of the original signal The clean sign-ing voice appears in the second pane The separated signing voiceobtained with baseline model (RPCA) with the baseline modelwhen restricting the analysis to singing voice-active segments only(RPCA_OV pre) and with the proposed A-RPCA model are rep-resented in panes 3 to 5 For comparison the sixth pane showstheresults obtained with REPET [16]

ence in the results between REPET and A-RPCA is not statisticallysignificant for the sparse layer If we look closer at the results itis interesting to note that the voice estimation improvement by A-RPCA_GT over REPET mainly comes from the non-vocal partswhere the voice estimated is favored to be null Indeed Table 5indicate that the voice estimates on vocal segments obtained withA-RPCA_GT and REPET are similar This is illustrated by thetwo last panes in the [bottom] Figure 4 which show similar spec-trograms of the voice estimates obtained with the A-RPCA andREPET algorithms on the vocal part of the excerpt

5 CONCLUSION

We have explored an adaptive version of the RPCA technique thatallows the processing of entire pieces of music including localvariations in the music structure Music content information isincorporated in the decomposition to guide the selection ofcoeffi-cients in the sparse and low-rank layers according to the semanticstructure of the piece This motivates the choice of using a regu-larization parameter that is informed by musical cues Results in-dicate that with the proposed algorithm not only the backgroundsegments are better discriminated but also that the singing voice isbetter estimated in vocal segments presumably because thelow-rank background model is a better match to the actual backgroundThe method could be extended with other criteria (singer identi-fication vibrato saliency etc) It could also be improvedby in-corporating additional information to set differently theregulariza-tion parameters foreachtrack to better accommodate the varyingcontrast of foreground and background The idea of an adaptivedecomposition could also be improved with a more complex for-mulation of RPCA that incorporates additional constraints[20] ora learned dictionary [49]

6 REFERENCES

[1] K Min Z Zhang J Wright and Y Ma ldquoDecomposingbackground topics from keywords by principal componentpursuitrdquo inCIKM 2010

[2] S Brutzer B Hoferlin and G Heidemann ldquoEvaluation ofbackground subtraction techniques for video surveillancerdquoin CCVPR 2011 pp 1937ndash1944

[3] EJ Candegraves X Li and J Ma Y andb Wright ldquoRobustprincipal component analysisrdquoJournal of the ACM vol58 no 3 Article 11 2011

[4] V Chandrasekaran S Sanghavi P Parrilo and A Will-sky ldquoSparse and low-rank matrix decompositionsrdquo inSysid2009

[5] B Cheng G Liu J Wang Z Huang and S Yan ldquoMulti-task low-rank affinity pursuit for image segmentationrdquo inICCV 2011 pp 2439ndash2446

[6] Z Zeng TH Chan K Jia and D Xu ldquoFinding correspon-dence from multiple images via sparse and low-rank decom-positionrdquo inECCV 2012 pp 325ndash339

[7] F Yang H Jiang Z Shen W Deng and DN MetaxasldquoAdaptive low rank and sparse decomposition of video usingcompressive sensingrdquoCoRR vol abs13021610 2013

[8] Y Peng A Ganesh J Wright and Y Xu W andMa ldquoRaslRobust alignment by sparse and low-rank decomposition forlinearly correlated imagesrdquoIEEE Trans Pattern Anal MachIntell vol 34 no 11 pp 2233ndash2246 2012

[9] Z Shi J Han T Zheng and S Deng ldquoOnline learningfor classification of low-rank representation features anditsapplications in audio segment classificationrdquoCoRR volabs11124243 2011

[10] YH Yang D Bogdanov P Herrera and M Sordo ldquoMusicretagging using label propagation and robust principal com-ponent analysisrdquo inWWW New York NY USA 2012 pp869ndash876

[11] W Cai Q Li and X Guan ldquoAutomatic singer identificationbased on auditory featuresrdquo 2011

DAFX-7

Proc of the 17th Int Conference on Digital Audio Effects (DAFx-14) Erlangen Germany September 1-5 2014

[12] J Salamon E Goacutemez DPW Ellis and G RichardldquoMelody extraction from polyphonic music signals Ap-proaches applications and challengesrdquoIEEE Signal Pro-cess Mag 2013

[13] RB Dannenberg WP Birmingham B Pardo N HuC Meek and G Tzanetakis ldquoA comparative evaluation ofsearch techniques for query-by-humming using the musarttestbedrdquo J Am Soc Inf Sci Technol vol 58 no 5 pp687ndash701 2007

[14] B Zhu W Li R Li and X Xue ldquoMulti-stage non-negativematrix factorization for monaural singing voice separationrdquoIEEE Trans Audio Speech Language Process vol 21 no10 pp 2096ndash2107 2013

[15] Z Rafii and B Pardo ldquoA simple musicvoice separationmethod based on the extraction of the repeating musicalstructurerdquo inICASSP 2011

[16] A Liutkus Z Rafii R Badeau B Pardo and G RichardldquoAdaptive filtering for musicvoice separation exploitingtherepeating musical structurerdquo inICASSP 2012

[17] D FitzGerald ldquoVocal separation using nearest neighboursand median filteringrdquo inISSC 2012

[18] PS Huang SD Chen P Smaragdis and M Hasegawa-Johnson ldquoSinging voice separation from monaural record-ings using robust principal component analysisrdquo inICASSP2012

[19] CL Hsu and JSR Jang ldquoOn the improvement of singingvoice separation for monaural recordings using the mir-1kdatasetrdquo IEEE Trans Audio Speech Language Processvol 18 no 2 pp 310ndash319 2010

[20] YH Yang ldquoOn sparse and low-rank matrix decompositionfor singing voice separationrdquo inMM 2012 pp 757ndash760

[21] M Moussallam G Richard and L Daudet ldquoAudio sourceseparation informed by redundancy with greedy multiscaledecompositionsrdquo inEUSIPCO 2012 pp 2644ndash2648


Table 2: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background (Back) layers, computed across the whole song, for all models, averaged across all songs. RPCA is the baseline system, A-RPCA_GT is the adaptive version using ground-truth voice activity information, and A-RPCA_est uses estimated voice activity.

Entire song            RPCA   A-RPCA_GT   A-RPCA_est
Voice   SDR (dB)      -4.66       -2.16        -3.18
        SIR (dB)      -3.86        0.74        -0.46
        SAR (dB)       8.99        4.81         3.94
        NSDR           1.70        4.20         3.18
Back    SDR (dB)       4.14        6.52         6.08
        SIR (dB)      11.48       13.30        12.07
        SAR (dB)       5.51        8.03         7.83
        NSDR          -2.35        0.03        -0.41
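The separation quality measures reported in Tables 2-6 are the BSS Eval metrics of [45]. As an illustration only (this is not the authors' evaluation code), the following sketch shows how such scores can be obtained with the open-source mir_eval package; the NSDR convention assumed here is the SDR gain of the estimate over using the unprocessed mixture as the estimate, and the function name is ours.

    # Hedged sketch: BSS Eval metrics via mir_eval (not the authors' code).
    # Assumes mono reference/estimate signals of equal length; NSDR is assumed
    # to be the SDR improvement over taking the raw mixture as the estimate.
    import numpy as np
    import mir_eval

    def separation_scores(ref_voice, ref_back, est_voice, est_back, mixture):
        refs = np.vstack([ref_voice, ref_back])
        ests = np.vstack([est_voice, est_back])
        sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
            refs, ests, compute_permutation=False)
        # Baseline SDR: the mixture used directly as the estimate of each source.
        sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(
            refs, np.vstack([mixture, mixture]), compute_permutation=False)
        return {"SDR": sdr, "SIR": sir, "SAR": sar, "NSDR": sdr - sdr_mix}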

Table 3: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background (Back) layers, computed across the vocal segments only, for all models, averaged across all songs. RPCA is the baseline system, A-RPCA_GT is the adaptive version using ground-truth voice activity information, and A-RPCA_est uses estimated voice activity.

Vocal segments         RPCA   A-RPCA_GT   A-RPCA_est
Voice   SDR (dB)      -3.19       -2.00        -1.96
        SIR (dB)      -2.33       -0.39         0.74
        SAR (dB)       9.44        7.27         4.64
        NSDR           1.67        2.85         2.90
Back    SDR (dB)       3.63        5.18         5.28
        SIR (dB)       9.95       10.64        10.41
        SAR (dB)       5.39        7.32         7.54
        NSDR          -1.37        0.18         0.29

Table 4: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background (Back) layers, computed across the whole song, for all models, averaged across all songs. RPCA is the baseline system, A-RPCA_GT is the adaptive version using ground-truth voice activity information, and A-RPCA_est uses estimated voice activity. Low-pass filtering post-processing is applied. REPET is the comparison algorithm [16].

Entire song            RPCA   A-RPCA_GT   A-RPCA_est   REPET
Voice   SDR (dB)      -2.76       -0.72        -2.11   -2.20
        SIR (dB)      -0.17        4.03         2.22    1.34
        SAR (dB)       4.33        3.33         2.32    3.19
        NSDR           3.60        5.64         4.25    4.16
Back    SDR (dB)       5.16        7.61         6.81    5.01
        SIR (dB)      14.53       14.49        12.99   16.83
        SAR (dB)       5.96        9.02         8.44    5.47
        NSDR          -1.32        1.12         0.33   -1.48

Table 5: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background (Back) layers, computed across the vocal segments only, for all models, averaged across all songs. RPCA is the baseline system, A-RPCA_GT is the adaptive version using ground-truth voice activity information, and A-RPCA_est uses estimated voice activity. Low-pass filtering post-processing is applied. REPET is the comparison algorithm [16].

Vocal segments only    RPCA   A-RPCA_GT   A-RPCA_est   REPET
Voice   SDR (dB)      -1.25       -0.53        -0.83   -0.70
        SIR (dB)       1.49        3.04         3.62    3.02
        SAR (dB)       5.02        4.46         3.12    4.02
        NSDR           3.60        4.32         4.02    4.15
Back    SDR (dB)       4.85        6.03         6.11    4.80
        SIR (dB)      13.07       12.38        11.41   15.33
        SAR (dB)       5.91        7.69         8.20    5.41
        NSDR          -0.14        1.03         1.11   -0.20

We also compare the A-RPCA algorithm with two variants of RPCA that incorporate side information either as a pre- or a post-processing step:

• RPCA_OV pre: Only the concatenation of segments classified as vocal is processed by RPCA (the singing voice estimate being set to zero in the remaining non-vocal segments).

• RPCA_OV post: The whole song is processed by RPCA, and non-zero coefficients estimated as belonging to the voice layer in non-vocal segments are transferred to the background layer (see the sketch below).
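The following minimal sketch shows one way these two oracle variants can be realized around a generic rpca() routine that returns a low-rank and a sparse magnitude spectrogram; rpca, S_mag and vocal_frames are placeholder names under our own assumptions, not the authors' implementation.

    import numpy as np

    def rpca_ov_pre(S_mag, vocal_frames, rpca):
        # RPCA_OV pre: decompose only the concatenated vocal frames; the
        # voice estimate is zero in the remaining (non-vocal) frames.
        # vocal_frames is a boolean vector, one entry per spectrogram frame.
        low_rank, sparse = S_mag.copy(), np.zeros_like(S_mag)
        L, S = rpca(S_mag[:, vocal_frames])
        low_rank[:, vocal_frames] = L
        sparse[:, vocal_frames] = S
        return low_rank, sparse

    def rpca_ov_post(S_mag, vocal_frames, rpca):
        # RPCA_OV post: decompose the whole song, then transfer voice-layer
        # coefficients found in non-vocal frames to the background layer.
        low_rank, sparse = rpca(S_mag)
        low_rank[:, ~vocal_frames] += sparse[:, ~vocal_frames]
        sparse[:, ~vocal_frames] = 0.0
        return low_rank, sparse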

Results of the decomposition computed across the vocal segments only are presented in Table 6. Note that the RPCA_OV post results reduce to the RPCA results in Table 3, since they are computed on vocal segments only. There is no statistically significant difference between the voice estimates obtained by processing the whole song with RPCA and by processing the vocal segments only. Results are significantly better with the A-RPCA algorithm than with RPCA_OV pre and RPCA_OV post. This is illustrated in Figure 4, which shows an example of the decomposition on an excerpt of the Doobie Brothers song Long Train Running composed of a non-vocal segment followed by a vocal segment. There are misclassified partials in the voice spectrogram obtained with the baseline RPCA that are removed with A-RPCA. Moreover, the gap in the singing voice around frame 50 (a breath) is cleaner with A-RPCA than with RPCA. Listening tests confirm that the background is better attenuated in the voice layer when using A-RPCA.

Table 6: SDR, SIR and SAR (in dB) and NSDR results for the voice (Voice) and background (Back) layers, computed across the vocal segments only, averaged across all songs. RPCA_OV post uses the baseline system and sets the voice estimate to zero in background-only segments; RPCA_OV pre processes only the vocal segments with the baseline model. A-RPCA_GT is the adaptive version using ground-truth voice activity information, and A-RPCA_est uses estimated voice activity.

                  RPCA_OV post   RPCA_OV pre   A-RPCA_GT   A-RPCA_est
Voice   SDR (dB)        -3.19         -3.28        -2.00        -1.96
        SIR (dB)        -2.33         -2.31         3.62         0.74
        SAR (dB)         9.44          8.97         7.27         4.64
        NSDR             1.67          1.57         2.85         2.90
Back    SDR (dB)         3.63          3.72         5.18         5.28
        SIR (dB)         9.95          9.22        10.64        10.41
        SAR (dB)         5.39          5.85         7.32         7.54
        NSDR            -1.37         -1.28         0.18         0.29

• Comparison with the state-of-the-art: As can be seen from Table 4, the results obtained with the RPCA baseline method are not better than those obtained with the REPET algorithm. On the contrary, the REPET algorithm is significantly outperformed by the A-RPCA algorithm when using ground-truth voice activity information, both for the sparse and the low-rank layers. However, note that when using estimated voice activity information, the difference in the results between REPET and A-RPCA is not statistically significant for the sparse layer.


Figure 4: [Top] Example decomposition on an excerpt of the Doobie Brothers song Long Train Running, and [Bottom] zoom between frames 525-580 (dashed rectangle in the top figure). For each figure, the top pane shows the 0-500 Hz band of the spectrogram of the original signal. The clean singing voice appears in the second pane. The separated singing voice obtained with the baseline model (RPCA), with the baseline model restricted to singing-voice-active segments only (RPCA_OV pre), and with the proposed A-RPCA model are shown in panes 3 to 5. For comparison, the sixth pane shows the result obtained with REPET [16].

If we look closer at the results, it is interesting to note that the voice estimation improvement of A-RPCA_GT over REPET mainly comes from the non-vocal parts, where the voice estimate is encouraged to be null. Indeed, Table 5 indicates that the voice estimates on vocal segments obtained with A-RPCA_GT and REPET are similar. This is illustrated by the two last panes of Figure 4 [bottom], which show similar spectrograms for the voice estimates obtained with the A-RPCA and REPET algorithms on the vocal part of the excerpt.

5. CONCLUSION

We have explored an adaptive version of the RPCA technique that allows the processing of entire pieces of music, including local variations in the music structure. Music content information is incorporated in the decomposition to guide the selection of coefficients in the sparse and low-rank layers according to the semantic structure of the piece; this motivates the choice of a regularization parameter informed by musical cues. Results indicate that with the proposed algorithm not only are the background segments better discriminated, but the singing voice is also better estimated in vocal segments, presumably because the low-rank background model is a better match to the actual background. The method could be extended with other criteria (singer identification, vibrato saliency, etc.). It could also be improved by incorporating additional information to set the regularization parameters differently for each track, to better accommodate the varying contrast of foreground and background. The idea of an adaptive decomposition could also be refined with a more complex formulation of RPCA that incorporates additional constraints [20] or a learned dictionary [49].
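To make the mechanism concrete, the sketch below implements a plain RPCA decomposition by the inexact augmented Lagrange multiplier method [34], modified so that the l1 penalty may vary per spectrogram frame; raising lam_frames on frames believed to be non-vocal pushes their energy into the low-rank background. This is our own simplified illustration of a frame-weighted decomposition under stated assumptions, not the A-RPCA code.

    import numpy as np

    def weighted_rpca(M, lam_frames, tol=1e-7, max_iter=500):
        # Solve  min ||L||_* + sum_t lam_frames[t] * ||S[:, t]||_1
        #        s.t.  L + S = M
        # by inexact ALM; lam_frames holds one sparsity weight per frame.
        norm_M = np.linalg.norm(M, 'fro')
        mu = 1.25 / np.linalg.norm(M, 2)     # common step-size heuristic
        Y = np.zeros_like(M)                 # Lagrange multipliers
        L = np.zeros_like(M)
        S = np.zeros_like(M)
        for _ in range(max_iter):
            # Low-rank update: singular value thresholding at level 1/mu.
            U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
            L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
            # Sparse update: frame-dependent soft thresholding.
            G = M - L + Y / mu
            S = np.sign(G) * np.maximum(np.abs(G) - lam_frames[None, :] / mu, 0.0)
            # Dual ascent on the constraint L + S = M, with stopping test.
            R = M - L - S
            Y += mu * R
            mu = min(mu * 1.5, 1e7)
            if np.linalg.norm(R, 'fro') <= tol * norm_M:
                break
        return L, S

With a constant lam_frames this reduces to standard RPCA; a voice activity estimate can then be injected by scaling the weights up on non-vocal frames (the specific weighting rule of A-RPCA is the one described in Section 3).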

6. REFERENCES

[1] K. Min, Z. Zhang, J. Wright, and Y. Ma, "Decomposing background topics from keywords by principal component pursuit," in CIKM, 2010.

[2] S. Brutzer, B. Höferlin, and G. Heidemann, "Evaluation of background subtraction techniques for video surveillance," in CVPR, 2011, pp. 1937-1944.

[3] E.J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?," Journal of the ACM, vol. 58, no. 3, Article 11, 2011.

[4] V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky, "Sparse and low-rank matrix decompositions," in SYSID, 2009.

[5] B. Cheng, G. Liu, J. Wang, Z. Huang, and S. Yan, "Multi-task low-rank affinity pursuit for image segmentation," in ICCV, 2011, pp. 2439-2446.

[6] Z. Zeng, T.H. Chan, K. Jia, and D. Xu, "Finding correspondence from multiple images via sparse and low-rank decomposition," in ECCV, 2012, pp. 325-339.

[7] F. Yang, H. Jiang, Z. Shen, W. Deng, and D.N. Metaxas, "Adaptive low rank and sparse decomposition of video using compressive sensing," CoRR, vol. abs/1302.1610, 2013.

[8] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, "RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2233-2246, 2012.

[9] Z. Shi, J. Han, T. Zheng, and S. Deng, "Online learning for classification of low-rank representation features and its applications in audio segment classification," CoRR, vol. abs/1112.4243, 2011.

[10] Y.H. Yang, D. Bogdanov, P. Herrera, and M. Sordo, "Music retagging using label propagation and robust principal component analysis," in WWW, New York, NY, USA, 2012, pp. 869-876.

[11] W. Cai, Q. Li, and X. Guan, "Automatic singer identification based on auditory features," 2011.

[12] J. Salamon, E. Gómez, D.P.W. Ellis, and G. Richard, "Melody extraction from polyphonic music signals: Approaches, applications and challenges," IEEE Signal Process. Mag., 2013.

[13] R.B. Dannenberg, W.P. Birmingham, B. Pardo, N. Hu, C. Meek, and G. Tzanetakis, "A comparative evaluation of search techniques for query-by-humming using the MUSART testbed," J. Am. Soc. Inf. Sci. Technol., vol. 58, no. 5, pp. 687-701, 2007.

[14] B. Zhu, W. Li, R. Li, and X. Xue, "Multi-stage non-negative matrix factorization for monaural singing voice separation," IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 10, pp. 2096-2107, 2013.

[15] Z. Rafii and B. Pardo, "A simple music/voice separation method based on the extraction of the repeating musical structure," in ICASSP, 2011.

[16] A. Liutkus, Z. Rafii, R. Badeau, B. Pardo, and G. Richard, "Adaptive filtering for music/voice separation exploiting the repeating musical structure," in ICASSP, 2012.

[17] D. FitzGerald, "Vocal separation using nearest neighbours and median filtering," in ISSC, 2012.

[18] P.S. Huang, S.D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing voice separation from monaural recordings using robust principal component analysis," in ICASSP, 2012.

[19] C.L. Hsu and J.S.R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 2, pp. 310-319, 2010.

[20] Y.H. Yang, "On sparse and low-rank matrix decomposition for singing voice separation," in ACM Multimedia, 2012, pp. 757-760.

[21] M. Moussallam, G. Richard, and L. Daudet, "Audio source separation informed by redundancy with greedy multiscale decompositions," in EUSIPCO, 2012, pp. 2644-2648.

[22] S.G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3397-3415, 1993.

[23] P. Sprechmann, A. Bronstein, and G. Sapiro, "Real-time online singing voice separation from monaural recordings using robust low-rank modeling," in ISMIR, 2012.

[24] A. Lefèvre, F. Glineur, and P.A. Absil, "A nuclear-norm based convex formulation for informed source separation," in ESANN, 2013.

[25] Y.H. Yang, "Low-rank representation of both singing voice and music accompaniment via learned dictionaries," in ISMIR, 2013.

[26] J. Salamon, Melody Extraction from Polyphonic Music Signals, Ph.D. thesis, Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Spain, 2013.

[27] A.L. Berenzweig and D.P.W. Ellis, "Locating singing voice segments within music signals," in WASPAA, 2001, pp. 119-122.

[28] T.L. Nwe and Y. Wang, "Automatic detection of vocal segments in popular songs," in ISMIR, 2004, pp. 138-145.

[29] L. Feng, A.B. Nielsen, and L.K. Hansen, "Vocal segment classification in popular music," in ISMIR, 2008, pp. 121-126.

[30] M. Fazel, Matrix Rank Minimization with Applications, Ph.D. thesis, Dept. of Elec. Eng., Stanford Univ., 2002.

[31] B. Recht, M. Fazel, and P.A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Rev., vol. 52, no. 3, pp. 471-501, 2010.

[32] E.J. Candès and B. Recht, "Exact matrix completion via convex optimization," Found. Comput. Math., vol. 9, no. 6, pp. 717-772, 2009.

[33] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma, "Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix," Tech. Rep. UILU-ENG-09-2214, UIUC, 2009.

[34] Z. Lin, M. Chen, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," Tech. Rep. UILU-ENG-09-2215, UIUC, 2009.

[35] X. Yuan and J. Yang, "Sparse and low-rank matrix decomposition via alternating direction methods," Preprint, pp. 1-11, 2009.

[36] J.F. Cai, E.J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. on Optimization, vol. 20, no. 4, pp. 1956-1982, 2010.

[37] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Series B, vol. 58, no. 1, pp. 267-288, 1996.

[38] S.S. Chen, D.L. Donoho, and M.A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, pp. 33-61, 1998.

[39] Z. Gao, L.F. Cheong, and M. Shan, Block-Sparse RPCA for Consistent Foreground Detection, vol. 7576 of Lecture Notes in Computer Science, pp. 690-703, Springer, Berlin Heidelberg, 2012.

[40] Y. Grandvalet, "Least absolute shrinkage is equivalent to quadratic penalization," in ICANN 98, L. Niklasson, M. Boden, and T. Ziemke, Eds., Perspectives in Neural Computing, pp. 201-206, Springer, London, 1998.

[41] H. Zou, "The adaptive lasso and its oracle properties," J. Am. Statist. Assoc., vol. 101, no. 476, pp. 1418-1429, 2006.

[42] D. Angelosante and G. Giannakis, "RLS-weighted lasso for adaptive estimation of sparse signals," in ICASSP, 2009, pp. 3245-3248.

[43] J. Salamon and E. Gómez, "Melody extraction from polyphonic music signals using pitch contour characteristics," IEEE Trans. Audio, Speech, Language Process., vol. 20, pp. 1759-1770, 2012.

[44] J.L. Durrieu, G. Richard, B. David, and C. Févotte, "Source/filter model for unsupervised main melody extraction from polyphonic audio signals," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 3, pp. 564-575, March 2010.

[45] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 4, pp. 1462-1469, 2006.

[46] D. FitzGerald and M. Gainza, "Single channel vocal separation using median filtering and factorisation techniques," ISAST Transactions on Electronic and Signal Processing, vol. 4, no. 1, pp. 62-73, 2010.

[47] Z. Rafii, F. Germain, D.L. Sun, and G.J. Mysore, "Combining modeling of singing voice and background music for automatic separation of musical mixtures," in ISMIR, 2013.

[48] G.E. Poliner, D.P.W. Ellis, F. Ehmann, E. Gómez, S. Streich, and B. Ong, "Melody transcription from music audio: Approaches and evaluation," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 4, pp. 1247-1256, 2007.

[49] Z. Chen and D.P.W. Ellis, "Speech enhancement by sparse, low-rank, and dictionary spectrogram decomposition," in WASPAA, 2013.
