
Research Article

Score-Informed Source Separation for Multichannel Orchestral Recordings

Marius Miron, Julio J. Carabias-Orti, Juan J. Bosch, Emilia Gómez, and Jordi Janer

Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain

Correspondence should be addressed to Marius Miron; marius.miron@upf.edu

Received 24 June 2016; Accepted 3 November 2016

Academic Editor: Alexander Petrovsky

Copyright © 2016 Marius Miron et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper proposes a system for score-informed audio source separation for multichannel orchestral recordings. The orchestral music repertoire relies on the existence of scores. Thus, a reliable separation requires a good alignment of the score with the audio of the performance. To that extent, automatic score alignment methods are reliable when allowing a tolerance window around the actual onset and offset. Moreover, several factors increase the difficulty of our task: a high reverberant image, large ensembles having rich polyphony, and a large variety of instruments recorded within a distant-microphone setup. To solve these problems, we design context-specific methods such as the refinement of score-following output in order to obtain a more precise alignment. Moreover, we extend a close-microphone separation framework to deal with the distant-microphone orchestral recordings. Then, we propose the first open evaluation dataset in this musical context, including annotations of the notes played by multiple instruments from an orchestral ensemble. The evaluation aims at analyzing the interactions of important parts of the separation framework on the quality of separation. Results show that we are able to align the original score with the audio of the performance and separate the sources corresponding to the instrument sections.

1. Introduction

Western classical music is a centuries-old heritage traditionally driven by well-established practices. For instance, large orchestral ensembles are commonly tied to a physically closed place, the concert hall. In addition, Western classical music is bounded by established customs related to the types of instruments played, the presence of a score, the aesthetic guidance of a conductor, and compositions spanning a large time frame. Our work is conducted within the PHENICX project [1], which aims at enriching the concert experience through technology. Specifically, this paper aims at adapting and extending score-informed audio source separation to the inherent complexity of orchestral music. This scenario involves challenges like changes in dynamics and tempo, a large variety of instruments, high reverberance, and simultaneous melodic lines, but also opportunities such as multichannel recordings.

Score-informed source separation systems depend on the accuracy of different parts which are not necessarily integrated in the same parametric model. For instance, they rely on a score alignment framework that yields a coarsely aligned score [2–5] or, in a multichannel scenario, they compute a panning matrix to assess the weight of each instrument in each channel [6, 7]. To account for that, we adapt and improve the parts of the system within the complex scenario of orchestral music. Furthermore, we are interested in establishing a methodology for this task for future research, and we propose a dataset in order to objectively assess the contribution of each part of the separation framework to the quality of separation.

1.1. Relation to Previous Work. Audio source separation is a challenging task when sources corresponding to different instrument sections are strongly correlated in time and frequency [8]. Without any previous knowledge, it is difficult to separate two sections which play, for instance, consonant notes simultaneously. One way to approach this problem is to introduce into the separation framework information about the characteristics of the signals, such as a well-aligned score [2–5, 9, 10]. Furthermore, previous research relates the accuracy of the alignment to the quality of source separation [4, 11]. For Western classical music, a correctly aligned score yields the exact time where each instrument is playing. Thus, an important step in score-informed source separation is obtaining a correctly aligned score, which can be done automatically with an audio-to-score alignment system.

Audio-to-score alignment deals with the alignment of a symbolic representation, such as the score, with the audio of the rendition. In a live scenario, this task deals with following the musical score while listening to the live performance, and it is known as score-following. To our knowledge, with the exception of [12], audio-to-score alignment systems have not been rigorously tested in the context of orchestral music. However, with respect to classical music, limited experimental scenarios comprising Bach chorales played by four instruments have been discussed in [4, 13]. Furthermore, in [14] a subset of the RWC classical music database [15] is used for training and testing, though no details are given regarding the instrumental complexity of the pieces. Moreover, a Beethoven orchestral piece from the same database is tested in [16], obtaining lower accuracy than the other evaluated pieces. These results point out the complexity of an orchestral scenario, underlined in [12]. In particular, a larger number of instruments and many instruments playing concurrently different melody lines [17] prove to be a more difficult problem than tracking a limited number of instruments, as in pop music or, for instance, string quartets and piano pieces. Although in this paper we do not propose a new system for orchestral music audio-to-score alignment, the task being complex and extensive in itself, we are interested in analyzing the relation between this task and the quality of score-informed source separation in such a complex scenario.

Besides the score, state-of-the-art source separation methods take into account characteristics of the audio signal which can be integrated into the system, thus achieving better results. For instance, the system can learn timbre models for each of the instruments [7, 18]. Moreover, it can rely on the assumption that the family of featured instruments is known and that their spectral characteristics are useful to discriminate between different sections playing simultaneously when the harmonics of the notes overlap. In a more difficult case, when neither the score nor the timbre of the instrument is available, temporal continuity and frequency sparsity [19] help in distributing the energy between sources in a musically meaningful way. Furthermore, an initial pitch detection can improve the results [6, 20–22] if the method assumes a predominant source. However, our scenario assumes multichannel recordings with distant microphones, in contrast to the close-microphone approach in [6], and we cannot assume that a source is predominant. As a matter of fact, previous methods deal with a limited case, the separation between a small number of harmonic instruments or piano [3, 4, 6, 18], leaving the case of orchestral music as an open issue. In this paper, we investigate a scenario characterized by large reverberation halls [23], a large number of musicians in each section, a large diversity of instruments played, abrupt tempo changes, and many concurrent melody lines, often within the same instrument section.

Regarding the techniques used for source separation, matrix decomposition has become increasingly popular in recent years [2, 3, 6, 18–20]. Nonnegative matrix factorization (NMF) is a particular case of decomposition which restricts the values of the factor matrices to be nonnegative. The first-factor matrix can be seen as a dictionary representing spectral templates. For audio signals, a dictionary is learned for each of the sources and stored into the basis matrix as a set of spectral templates. The second-factor matrix holds the temporal activations, or the weights, of the templates. Then, the resulting factorized spectrogram is calculated as a linear combination of the template vectors with a set of weight vectors forming the activation matrix. This representation allows for parametric models such as the source-filter model [3, 18, 20] or the multiexcitation model [9], which can easily capture important traits of harmonic instruments and help separate between them, as is the case with orchestral music. The multiexcitation model has been evaluated in a restricted scenario of Bach chorales played by a quartet [4] and, for this particular database, has been extended in the scope of close-microphone recordings [6] and score-informed source separation [11]. From a source separation point of view, in this article we extend and evaluate the work in [6, 11, 18] for orchestral music.

In order to obtain a better separation with any NMF parametric model, the sparseness of the gains matrix is increased by initializing it with time and frequency information obtained from the score [3–5, 24]. The values between the time frames where a note template is not activated are set to zero and will remain this way during factorization, allowing for the energy from the spectrogram to be redistributed between the notes and the instruments which actually play during that interval. A better alignment leads to better gains initialization and better separation. Nonetheless, audio-to-score alignment mainly fixes global misalignments, which are due to tempo variations, and does not deal with local misalignments [21]. To account for local misalignments, score-informed source separation systems include onset and offset information into the parametric model [3, 5, 24] or use image processing in order to refine the gains matrix so that it closely matches the actual time and frequency boundaries of the played notes [11]. Conversely, local misalignments can be fixed explicitly [25–27]. To our knowledge, none of these techniques have been explored for orchestral music, although there is scope for testing their usefulness if we take into account several factors such as the synchronization of musicians in large ensembles, concurrent melody lines, and reverberation. Furthermore, the alignment systems are monaural. However, in our case the separation is done on multichannel recordings, and the delays between the sources and microphones might yield local misalignments. Hence, towards a better separation and more precise alignment, we propose a robust method to refine the output of a score alignment system with respect to each audio channel of the multichannel audio.

In the proposed distant-microphone scenario, we normally do not have microphones close to a particular instrument or soloist and, moreover, in an underdetermined case, the number of sources surpasses the number of microphones. Therefore, recording the sound of an entire section also captures interference from other instruments and the reverberation of the concert hall. To that extent, our task is different from interference reduction in close-microphone recordings [6, 7, 23], these approaches being evaluated for pop concerts [23] or quartets [6]. Additionally, we do not target blind source separation as the previous systems [6, 7, 23]. Subsequently, we adapt and improve the systems in [6, 11] by using information from all the channels, similar to parallel factor analysis (PARAFAC) in [28, 29].

[Figure 1: The diagram representing the flow of operations in the system. The multichannel audio $x_i(f,t)$ and the score feed the score alignment stage; the instrument spectral templates $b_{jn}(f)$, the panning matrix estimation $m_{ij}$, and the gains initialization $g_{jn}(t)$ feed the NMF parameter estimation, followed by gains refinement and separated source estimation $s_{j'}(t)$.]

With respect to the evaluation, to our knowledge this is the first time score-informed source separation is objectively evaluated on such a complex scenario. An objective evaluation provides a more precise estimation of the contribution of each part of the framework and their influence on the separation. Additionally, it establishes a methodology for future research and eases research reproducibility. We annotated the database proposed in [30], comprising four pieces of orchestral music recorded in an anechoic room, in order to obtain a score which is perfectly aligned with the anechoic recordings. Then, using the Roomsim software [31], we simulate a concert hall in order to obtain realistic multitrack recordings.

1.2. Applications. The proposed framework for score-informed source separation has been used to separate recordings by various orchestras. The recordings are processed automatically and stored in the multimodal repository Repovizz [32]. The repository serves the data through its API for several applications. The first application is called instrument emphasis or orchestra focus and allows for emphasizing a particular instrument over the full orchestra. The second application relates to the spatialization of the separated musical sources in virtual reality scenarios and is commonly known as Acoustic Rendering. Third, we propose an application for estimating the spatial locations of the instruments on the stage. All three applications are detailed in Section 7.

1.3. Outline. We introduce the architecture of the framework with its main parts in Section 2. Then, in Section 3, we give an outline of the baseline source separation system. Furthermore, we present the extensions of the baseline system: the initialization of the gains with score information (Section 4) and the note refinement (Section 4.1). Additionally, the proposed extension to the multichannel case is introduced in Section 5. We present the dataset and the evaluation procedures and discuss the results in Section 6. The demos and applications are described in Section 7.

2. Proposed Approach Overview

The diagram of the proposed framework is presented in Figure 1. The baseline system relies on training spectral templates for the instruments we aim to separate (Section 3.3). Then, we compute the spectrograms associated with the multichannel audio. The spectrograms, along with the score of the piece, are used to align the score to the audio. From the aligned score we derive the gains matrix that serves as an input for the NMF parameter estimation stage (Section 4), along with the learned spectral templates. Furthermore, the gains and the spectrogram are used to calculate a panning matrix (Section 3.1), which yields the contribution of each instrument in each channel. After the parameter estimation stage (Section 3.2), the gains are refined in order to improve the separation (Section 4.1). Then, the spectrograms of the separated sources are estimated using Wiener filtering (Section 3.5).

For the score alignment step, we use the system in [13], which aligns the score to a chosen microphone and achieved the best results in the MIREX score-following challenge (http://www.music-ir.org/mirex/wiki/2015:Real-time_Audio_to_Score_Alignment_(a.k.a_Score_Following)_Results). However, other state-of-the-art alignment systems can be used at this step, since our final goal is to refine a given score with respect to each channel in order to minimize the errors in separation (Section 5.2). Accounting for that, we extend the model proposed by [6] and the gains refinement for monaural recordings in [11] to the case of score-informed multichannel source separation in the more complex scenario of orchestral music.


3. Baseline Method for Multichannel Source Separation

According to the baseline model in [6], the short-term complex valued Fourier transform (STFT) in time frame $t$ and frequency $f$ for channel $i = 1, \ldots, I$, where $I$ is the total number of channels, is expressed as

$$x_i(f,t) \approx \hat{x}_i(f,t) = \sum_{j=1}^{J} m_{ij} s_j(f,t), \quad (1)$$

where $s_j(f,t)$ represents the estimation of the complex valued STFT computed for the source $j = 1, \ldots, J$, with $J$ the total number of sources. Note that in this paper we consider as a source or instrument one or more instruments of the same kind (e.g., a section of violins). Additionally, $m_{ij}$ is a mixing matrix of size $I \times J$ that accounts for the contribution of source $j$ to channel $i$. In addition, we denote $x_i(f,t)$ as the magnitude spectrogram and $m_{ij}$ as the real-valued panning matrix.

Under the NMF model described in [18], each source $s_j(f,t)$ is factored as the product of two matrices: $g_{jn}(t)$, the matrix which holds the gains or activations of the basis function corresponding to pitch $n$ at frame $t$, and $b_{jn}(f)$, the matrix which holds the bases, where $n = 1, \ldots, N$ spans the pitch range for instrument $j$. Hence, source $j$ is modeled as

$$s_j(f,t) \approx \sum_{n=1}^{N} b_{jn}(f)\, g_{jn}(t). \quad (2)$$

The model represents a pitch for each source $j$ as a single template stored in the basis matrix $b_{jn}(f)$. The temporal activation of a template (e.g., onset and offset times for a note) is modeled using the gains matrix $g_{jn}(t)$. Under harmonicity constraints [18], the NMF model for the basis matrix is defined as

$$b_{jn}(f) = \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)), \quad (3)$$

where $h = 1, \ldots, H$ indexes the harmonics, $a_{jn}(h)$ is the amplitude of harmonic $h$ for note $n$ and instrument $j$, $f_0(n)$ is the fundamental frequency of note $n$, and $G(f)$ is the magnitude spectrum of the window function; the spectrum of a harmonic component at frequency $h f_0(n)$ is approximated by $G(f - h f_0(n))$.

Considering the model given in (3), the initial equation (2) for the computation of the magnitude spectrogram of source $j$ is expressed as

$$s_j(f,t) \approx \sum_{n=1}^{N} g_{jn}(t) \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)), \quad (4)$$

and (1) for the factorization of the magnitude spectrogram of channel $i$ is rewritten as

$$\hat{x}_i(f,t) = \sum_{j=1}^{J} m_{ij} \sum_{n=1}^{N} g_{jn}(t) \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)). \quad (5)$$
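To make the parametric model concrete, the following sketch synthesizes the multichannel spectrogram estimate of (5) from the panning matrix, the gains, and the harmonic amplitudes. It is a minimal illustration, not the authors' implementation: the array shapes, the function names, and the Gaussian approximation of the window spectrum $G(f)$ are all assumptions.

```python
import numpy as np

def window_spectrum(f, sigma=1.0):
    # Magnitude spectrum of the analysis window, approximated here by
    # a narrow Gaussian centered at 0 (an assumption; the paper uses
    # the actual window transform G(f)).
    return np.exp(-0.5 * (f / sigma) ** 2)

def model_spectrogram(m, g, a, f0, freqs):
    """Synthesize x_hat[i, f, t] following Eq. (5).

    m  : (I, J)    panning matrix
    g  : (J, N, T) gains per source, pitch, and frame
    a  : (J, N, H) harmonic amplitudes per source and pitch
    f0 : (N,)      fundamental frequency of each pitch (Hz)
    freqs : (F,)   center frequency of each bin (Hz)
    """
    H = a.shape[2]
    J, N, F = a.shape[0], a.shape[1], len(freqs)
    # b[j, n, f] = sum_h a[j, n, h] * G(f - h * f0[n])   -- Eq. (3)
    b = np.zeros((J, N, F))
    for h in range(1, H + 1):
        dist = freqs[None, :] - h * f0[:, None]          # (N, F) distances
        b += a[:, :, h - 1][:, :, None] * window_spectrum(dist)[None]
    # s[j, f, t] = sum_n b[j, n, f] * g[j, n, t]          -- Eq. (4)
    s = np.einsum('jnf,jnt->jft', b, g)
    # x_hat[i, f, t] = sum_j m[i, j] * s[j, f, t]         -- Eq. (5)
    return np.einsum('ij,jft->ift', m, s)
```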

3.1. Panning Matrix Estimation. The panning matrix gives the contribution of each instrument in each channel and, as seen in (1), directly influences the separation of the sources. The panning matrix is estimated by calculating an overlapping mask which discriminates the time-frequency zones for which the partials of a source are not overlapped with the partials of other sources. Then, using the overlapping mask, a panning coefficient is computed for each pair of sources at each channel. The estimation algorithm in the baseline framework is described in [6].

3.2. Augmented NMF for Parameter Estimation. According to [33], the parameters of the NMF model are estimated by minimizing a cost function which measures the reconstruction error between the observed $x_i(f,t)$ and the estimated $\hat{x}_i(f,t)$. For flexibility reasons, we use the beta-divergence cost function [34], which allows for modeling popular cost functions for different values of $\beta$, such as the Euclidean (EUC) distance ($\beta = 2$), the Kullback-Leibler (KL) divergence ($\beta = 1$), and the Itakura-Saito (IS) divergence ($\beta = 0$).

The minimization procedure assures that the distance between $x_i(f,t)$ and $\hat{x}_i(f,t)$ does not increase with each iteration, while accounting for the nonnegativity of the bases and the gains. By these means, the magnitude spectrogram of a source is explained solely by additive reconstruction.

3.3. Timbre-Informed Signal Model. An advantage of the harmonic model is that templates can be learned for various instruments if the appropriate training data is available. The RWC instrument database [15] offers recordings of solo instruments playing isolated notes along their whole corresponding pitch range. The method in [6] uses these recordings, along with the ground truth annotations, to learn instrument spectral templates for each note of each instrument. More details on the training procedure can be found in the original paper [6].

Once the basis functions $b_{jn}(f)$ corresponding to the spectral templates are learned, they are used at the factorization stage in any orchestral setup which contains the targeted instruments. Thus, after training, the bases $b_{jn}(f)$ are kept fixed, while the gains $g_{jn}(t)$ are estimated during the factorization procedure.

3.4. Gains Estimation. The factorization procedure to estimate the gains $g_{jn}(t)$ considers the previously computed panning matrix $m_{ij}$ and the learned bases $b_{jn}(f)$ from the training stage. Consequently, we have the following update rule:

$$g_{jn}(t) \longleftarrow g_{jn}(t)\, \frac{\sum_{f,i} m_{ij}\, b_{jn}(f)\, x_i(f,t)\, \hat{x}_i(f,t)^{\beta-2}}{\sum_{f,i} m_{ij}\, b_{jn}(f)\, \hat{x}_i(f,t)^{\beta-1}}. \quad (6)$$
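A direct transcription of the multiplicative update (6) might look as follows, assuming the same array shapes as in the earlier sketch; the small eps guard against division by zero is an implementation detail, not part of the paper's formulation.

```python
import numpy as np

def update_gains(g, b, m, x, x_hat, beta=1.3):
    """One multiplicative update of the gains, following Eq. (6).

    g     : (J, N, T) current gains
    b     : (J, N, F) fixed basis functions
    m     : (I, J)    fixed panning matrix
    x     : (I, F, T) observed magnitude spectrograms
    x_hat : (I, F, T) current model estimate
    """
    eps = 1e-12  # numerical guard (implementation detail)
    num = np.einsum('ij,jnf,ift->jnt', m, b, x * x_hat ** (beta - 2))
    den = np.einsum('ij,jnf,ift->jnt', m, b, x_hat ** (beta - 1))
    # entries initialized to zero stay zero under a multiplicative update
    return g * num / (den + eps)
```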

3.5. From the Estimated Gains to the Separated Signals. The reconstruction of the sources is done by estimating the complex amplitude for each time-frequency bin. In the case of binary separation, a cell is entirely associated with a single source. However, when having many instruments, as in orchestral music, it is more advantageous to redistribute the energy proportionally over all sources, as in the Wiener filtering method [23].

Algorithm 1 (gain estimation method):
(1) Initialize $b_{jn}(f)$ with the values learned in Section 3.3.
(2) Initialize the gains $g_{jn}(t)$ with score information.
(3) Initialize the mixing matrix $m_{ij}$ with the values learned in Section 3.1.
(4) Update the gains using equation (6).
(5) Repeat Step (4) until the algorithm converges (or the maximum number of iterations is reached).

This model allows for estimating each separated source $s_j(t)$ from mixture $x_i(t)$ using a generalized time-frequency Wiener filter over the short-time Fourier transform (STFT) domain, as in [3, 34].

Let $\alpha_{j'}$ be the Wiener filter of source $j'$, representing the relative energy contribution of the predominant source with respect to the energy of the multichannel mixed signal $x_i(t)$ at channel $i$:

$$\alpha_{j'}(t,f) = \frac{|A_{ij'}|^2\, |s_{j'}(f,t)|^2}{\sum_{j} |A_{ij}|^2\, |s_j(f,t)|^2}. \quad (7)$$

Then, the corresponding spectrogram of source $j'$ is estimated as

$$\hat{x}_{j'}(f,t) = \frac{\alpha_{j'}(t,f)}{|A_{ij'}|^2}\, x_i(f,t). \quad (8)$$

The estimated source $s_{j'}(t)$ is computed with the inverse overlap-add STFT of $\hat{x}_{j'}(f,t)$.

The estimated source magnitude spectrogram is computed using the gains $g_{jn}(t)$ estimated in Section 3.4 and the fixed basis functions $b_{jn}(f)$ learned in Section 3.3: $\hat{x}_j(f,t) = \sum_n g_{jn}(t)\, b_{jn}(f)$. Then, if we replace $s_j(f,t)$ with $\hat{x}_j(f,t)$ and consider the mixing matrix coefficients computed in Section 3.1, we can calculate the Wiener mask from (7):

$$\alpha_{j'}(t,f) = \frac{m_{ij'}^2\, \hat{x}_{j'}(f,t)^2}{\sum_{j} m_{ij}^2\, \hat{x}_j(f,t)^2}. \quad (9)$$

Using (8), we can apply the Wiener mask to the multichannel signal spectrogram, thus obtaining $\hat{x}_{j'}(f,t)$, the estimated predominant source spectrogram. Finally, we use the phase information from the original mixture signal $x_i(t)$ and, through the inverse overlap-add STFT, we obtain the estimated predominant source $s_{j'}(t)$.
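The mask-and-filter step of (8)-(9) can be sketched as below; it builds the model magnitudes from the fixed bases and estimated gains, masks the complex STFT of the chosen channel (thus keeping its phase), and leaves the final inverse overlap-add STFT to any standard implementation. Names and shapes are illustrative assumptions.

```python
import numpy as np

def wiener_separate(g, b, m, X, i):
    """Estimate complex source spectrograms from channel i, Eqs. (8)-(9).

    g : (J, N, T) estimated gains      b : (J, N, F) fixed bases
    m : (I, J)    panning matrix       X : (F, T) complex STFT of channel i
    Returns (J, F, T): one masked complex spectrogram per source, to be
    inverted with a standard inverse overlap-add STFT.
    """
    eps = 1e-12
    x_hat = np.einsum('jnf,jnt->jft', b, g)       # model magnitudes per source
    num = (m[i][:, None, None] * x_hat) ** 2      # m_ij^2 * x_hat_j^2
    mask = num / (num.sum(axis=0) + eps)          # Wiener mask, Eq. (9)
    return mask * X[None, :, :]                   # masking keeps the phase of x_i
```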

4. Gains Initialization with Score Information

In the baseline method [6], the gains are initialized following a transcription stage. In our case, the automatic alignment system yields a score which offers a representation analogous to the one obtained by the transcription. To that extent, the output of the alignment is used to initialize the gains for the NMF-based methods for score-informed source separation.

Although the alignment algorithm aims at fixing global misalignments, it does not account for local misalignments. In the case of score-informed source separation, having a better aligned score leads to better separation [11], since it increases the sparseness of the gains matrix by setting to zero the activations for the time frames in which a note is not played (e.g., the corresponding spectral template of the note in the basis matrix is not activated outside this time boundary). However, in a real-case scenario, the initialization of the gains derived from the MIDI score must take into account the local misalignments. This has traditionally been done by setting a tolerance window around the onsets and offsets [3, 5, 24] or by refining the gains after a number of NMF iterations and then reestimating the gains [11]. While the former integrates note refinement into the parametric model, the latter detects contours in the gains using image processing heuristics and explicitly associates them with meaningful entities such as notes. In this paper, we present two methods for note refinement: in Section 4.1 we detail the method in [11], which is used as a baseline for our framework, and in Section 5.2 we adapt and improve this baseline for the multichannel case.

On these terms, if we account for errors of up to $d$ frames in the audio-to-score alignment, we need to increase the time interval around the onset and the offset of a MIDI note when we initialize the gains. Thus, the values in $g_{jn}(t)$ for instrument $j$ and the pitch corresponding to a MIDI note $n$ are set to 1 for the frames where the MIDI note is played, as well as for the neighboring $d$ frames. The other values in $g_{jn}(t)$ are set to 0 and do not change during computation, while the values set to 1 evolve according to the energy distributed between the instruments.
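A possible initialization routine following this scheme is sketched below; the note list format and the hop size are assumptions.

```python
import numpy as np

def init_gains(notes, J, N, T, d, hop=0.011):
    """Initialize g[j, n, t] from an aligned score (Section 4).

    notes : iterable of (j, n, onset_sec, offset_sec) tuples -- an
            assumed format for the aligned MIDI notes
    d     : tolerance in frames added around each note to absorb
            local misalignments
    """
    g = np.zeros((J, N, T))
    for j, n, onset, offset in notes:
        t0 = max(0, int(round(onset / hop)) - d)
        t1 = min(T, int(round(offset / hop)) + d)
        g[j, n, t0:t1] = 1.0  # frames allowed to carry energy
    return g                  # zeros stay zero under multiplicative updates
```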

Having initialized the gains, the classical augmented NMF factorization is applied to estimate the gains corresponding to each source $j$ in the mixture. The process is detailed in Algorithm 1.

4.1. Note Refinement. The note refinement method in [11] aims at associating the values in the gains matrix $g_{jn}(t)$ with notes. Therefore, it is applied after a certain number of iterations of Algorithm 1, when the gains matrix yields a meaningful distribution of the energy between instruments.

The method is applied on each note separately, with the scope of refining the gains associated with the targeted note. The gains matrix $g_{jn}(t)$ can be understood as a greyscale image, with each element in the matrix representing a pixel in the image. A set of image processing heuristics is deployed to detect shapes and contours in this image, commonly known as blobs [35, p. 248]. As a result, each blob is associated with a single note, giving the onset and offset times for the note and its frequency contour. This representation further increases the sparsity of the gains $g_{jn}(t)$, yielding less interference and better separation.

[Figure 2: After NMF, the resulting time gains $g_{jn}(t)$ (a) are split into submatrices around each note (b), which are binarized and used to detect blobs for each note [11].]

As seen in Figure 2, the method considers an image patch around the pixels corresponding to the pitch of the note and to the onset and offset of the note given by the alignment stage, plus the additional $d$ frames accounting for the local misalignments. In fact, the method works with the same submatrices of $g_{jn}(t)$ which were set to 1 according to their corresponding MIDI note during the gains initialization stage, as explained in Section 3.4. Therefore, for a given note $k = 1, \ldots, K_j$, we process a submatrix $g^k_{jn}(t)$ of the gains matrix $g_{jn}(t)$, where $K_j$ is the total number of notes for instrument $j$.

The steps of the method in [11], image preprocessing, binarization, and blob selection, are explained in Sections 4.1.1, 4.1.2, and 4.1.3.

4.1.1. Image Preprocessing. The preprocessing stage ensures, through smoothing, that there are no energy discontinuities within an image patch. Furthermore, it gives more weight to the pixels situated closer to the central bin of the blob, in order to eliminate interference from the neighboring notes (close in time and frequency) while still preserving vibratos or transitions between notes.

First, we convolve each row of the submatrix $g^k_{jn}(t)$ with a smoothing Gaussian filter [35, p. 86]. We choose a one-dimensional Gaussian filter

$$w(t) = \frac{1}{\sqrt{2\pi}\,\phi}\, e^{-t^2 / (2\phi^2)}, \quad (10)$$

where $t$ is the time axis and $\phi$ is the standard deviation. Thus, each row vector of $g^k_{jn}(t)$ is convolved with $w(t)$, and the result is truncated in order to preserve the dimensions of the initial matrix by removing the mirrored frames.

Second, we penalize values in $g^k_{jn}(t)$ which are further away from the central bin by multiplying each column vector of this matrix with a one-dimensional Gaussian centered at the central frequency bin, represented by the vector $v(n)$:

$$v(n) = \frac{1}{\sqrt{2\pi}\,\nu}\, e^{-(n-\kappa)^2 / (2\nu^2)}, \quad (11)$$

where $n$ is the frequency axis, $\kappa$ is the position of the central frequency bin, and $\nu$ is the standard deviation. The values of the parameters above are given in Section 6.4.2 as a part of the evaluation setup.
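The two-step preprocessing of (10)-(11) maps naturally onto a 1-D Gaussian smoothing per row followed by a Gaussian weighting per column. The sketch below drops the normalization constants, which is harmless here because the subsequent binarization threshold (the patch mean) is scale-invariant; scipy's gaussian_filter1d is used in place of an explicit convolution.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def preprocess_patch(patch, phi=3.0, kappa=None, nu=4.0):
    """Smooth and weight one note submatrix (Section 4.1.1).

    patch : (bins, frames) slice of the gains around one note
    phi   : std of the temporal Gaussian of Eq. (10)
    kappa : central frequency bin of Eq. (11); defaults to the middle row
    nu    : std of the frequency Gaussian of Eq. (11)
    """
    if kappa is None:
        kappa = patch.shape[0] // 2
    # Eq. (10): Gaussian smoothing along time, row by row
    smoothed = gaussian_filter1d(patch, sigma=phi, axis=1, mode='mirror')
    # Eq. (11): penalize rows far from the central frequency bin
    n = np.arange(patch.shape[0])
    v = np.exp(-0.5 * ((n - kappa) / nu) ** 2)
    return smoothed * v[:, None]
```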

4.1.2. Image Binarization. Image binarization sets to zero the elements of the matrix $g^k_{jn}(t)$ which are lower than a threshold and to one the elements larger than the threshold. This involves deriving a binary submatrix $\bar{g}^k_{jn}(t)$ associated with note $k$:

$$\bar{g}^k_{jn}(t) = \begin{cases} 0, & \text{if } g^k_{jn}(t) < \operatorname{mean}\left(g^k_{jn}(t)\right), \\ 1, & \text{if } g^k_{jn}(t) \geq \operatorname{mean}\left(g^k_{jn}(t)\right). \end{cases} \quad (12)$$

4.1.3. Blob Selection. First, we detect blobs in each binary submatrix $\bar{g}^k_{jn}(t)$ using the connectivity rules described in [35, p. 248] and [27]. Second, from the detected blob candidates, we determine the best blob for each note, in a similar way to [11]. We assign a value to each blob depending on its area and on its overlap with the blobs corresponding to adjacent notes, which will help us penalize the overlap between blobs of adjacent notes.

As a first step, we penalize the parts of the blobs which overlap in time with other blobs from the different notes $k-1$, $k$, $k+1$. This is done by weighting each element in $\bar{g}^k_{jn}(t)$ with a factor $\gamma$ depending on the amount of overlap with blobs from adjacent notes. The resulting score matrix has the following expression:

$$\tilde{g}^k_{jn}(t) = \begin{cases} \gamma \cdot \bar{g}^k_{jn}(t), & \text{if } \bar{g}^k_{jn}(t) \wedge \bar{g}^{k-1}_{jn}(t) = 1, \\ \gamma \cdot \bar{g}^k_{jn}(t), & \text{if } \bar{g}^k_{jn}(t) \wedge \bar{g}^{k+1}_{jn}(t) = 1, \\ \bar{g}^k_{jn}(t), & \text{otherwise}, \end{cases} \quad (13)$$

where $\gamma$ is a value in the interval $[0, 1]$.

Then, we compute a score for each note and for each blob associated with the note by summing up the elements in the score matrix $\tilde{g}^k_{jn}(t)$ which are considered to be part of a blob. The best blob candidate is the one with the highest score, and further on it is associated with the note, its boundaries giving the note onset and offset.
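Blob detection and selection can be prototyped with connected-component labeling, as in the sketch below, where (12)-(13) are applied on one patch. The adjacent notes' blob masks are assumed to be given on the same grid, and gamma=0.5 is an arbitrary example value; this is a simplification of the actual method in [11].

```python
import numpy as np
from scipy.ndimage import label

def select_best_blob(patch, prev_mask=None, next_mask=None, gamma=0.5):
    """Binarize a preprocessed patch and keep its best blob (Section 4.1.3).

    prev_mask / next_mask : binary blob masks of the adjacent notes,
    assumed to share the (bins, frames) grid of `patch`.
    """
    binary = (patch >= patch.mean()).astype(float)   # Eq. (12)
    score = binary.copy()
    for adj in (prev_mask, next_mask):               # Eq. (13): penalize overlap
        if adj is not None:
            score[np.logical_and(binary == 1, adj == 1)] *= gamma
    labels, n_blobs = label(binary)                  # connected components
    best_mask, best_score = None, -np.inf
    for lb in range(1, n_blobs + 1):
        blob = labels == lb
        blob_score = score[blob].sum()               # area minus penalties
        if blob_score > best_score:
            best_mask, best_score = blob, blob_score
    return best_mask  # boolean mask; its time extent gives onset/offset
```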

4.2. Gains Reinitialization and Recomputation. Having associated a blob with each note (Section 4.1.3), we discard obtrusive energy from the gains matrix $g_{jn}(t)$ by eliminating the pixels corresponding to the blobs which were not selected, making the matrix sparser. Furthermore, the energy excluded from an instrument's gains is redistributed to other instruments, contributing to a better source separation. Thus, the gains are reinitialized with the information obtained from the corresponding blobs, and we can repeat the factorization in Algorithm 1 to recompute $g_{jn}(t)$. Note that the energy which is excluded by note refinement is set to zero in $g_{jn}(t)$ and will remain zero during the factorization.

In order to refine the gains $g_{jn}(t)$, we define a set of matrices $p^k_{jn}(t)$, derived from the matrices corresponding to the best blobs, which contain 1 only for the elements associated with the best blob and 0 otherwise. We rebuild the gains matrix $g_{jn}(t)$ from the set of submatrices $p^k_{jn}(t)$: for the corresponding bins $n$ and time frames $t$ of note $k$, we initialize the values in $g_{jn}(t)$ with the values in $p^k_{jn}(t)$. Then, we reiterate the gains estimation with Algorithm 1. Furthermore, we obtain the spectrograms of the separated sources with the method described in Section 3.5.

5. PARAFAC Model for Multichannel Gains Estimation

Parallel factor analysis (PARAFAC) methods [28, 29] are mostly used under the nonnegative tensor factorization paradigm. By these means, the NMF model is extended to work with 3-valence tensors, where each slice of the tensor represents the spectrogram of a channel. Another approach is to stack up the spectrograms of all channels in a single matrix [36] and perform a joint estimation of the spectrograms of the sources in all channels. Hence, we extend the NMF model in Section 3 to jointly estimate the gains matrices in all the channels.

5.1. Multichannel Gains Estimation. The algorithm described in Section 3 estimates the gains $g_{nj}$ for source $j$ with respect to a single channel $i$, determined as the row of column $j$ of the panning matrix where element $m_{ij}$ has the maximum value. However, we argue that a better estimation can benefit from the information in all channels. To this extent, we can further include update rules for other parameters, such as the mixing matrix $m_{ij}$, which were otherwise kept fixed in Section 3, because the factorization algorithm estimates the parameters jointly for all the channels.

We propose to integrate information from all channels by concatenating their corresponding spectrogram matrices on the time axis, as in

$$x(f,t) = \left[ x_1(f,t)\;\; x_2(f,t)\;\; \cdots\;\; x_I(f,t) \right]. \quad (14)$$

We are interested in jointly estimating the gains $g_{nj}$ of source $j$ in all the channels. Consequently, we concatenate in time the gains corresponding to each channel $i$, for $i = 1, \ldots, I$, where $I$ is the total number of channels, as seen in (15). The new gains are initialized with identical score information obtained from the alignment stage. However, during the estimation of the gains for channel $i$, the new gains $g^i_{nj}(t)$ evolve accordingly, taking into account the corresponding spectrogram $x_i(f,t)$. Moreover, during the gains refinement stage, each gain is refined separately with respect to each channel:

$$g_{nj}(t) = \left[ g^1_{nj}(t)\;\; g^2_{nj}(t)\;\; \cdots\;\; g^I_{nj}(t) \right]. \quad (15)$$

In (5) we describe the factorization model for the estimated spectrogram, considering the mixing matrix, the bases, and the gains. Since we estimate a set of $I$ gains for each source $j = 1, \ldots, J$, this results in $J$ estimations of the spectrograms corresponding to all the channels $i = 1, \ldots, I$, as seen in

$$\hat{x}^j_i(f,t) = \sum_{j=1}^{J} m_{ij} \sum_{n=1}^{N} g^i_{jn}(t) \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)). \quad (16)$$

Each iteration of the factorization algorithm yields additional information regarding the distribution of energy between each instrument and each channel. Therefore, we can include in the factorization update rules for the mixing matrix $m_{ij}$, as in (17). By updating the mixing parameters at each factorization step, we can obtain a better estimation for $\hat{x}^j_i(f,t)$:

$$m_{ij} \longleftarrow m_{ij}\, \frac{\sum_{f,t} b_{jn}(f)\, g_{nj}(t)\, x_i(f,t)\, \hat{x}_i(f,t)^{\beta-2}}{\sum_{f,t} b_{jn}(f)\, g_{nj}(t)\, \hat{x}_i(f,t)^{\beta-1}}. \quad (17)$$

Considering the above, the new rules to estimate the parameters are described in Algorithm 2.

Algorithm 2 (gain estimation method):
(1) Initialize $b_{jn}(f)$ with the values learned in Section 3.3.
(2) Initialize the gains $g_{jn}(t)$ with score information.
(3) Initialize the panning matrix $m_{ij}$ with the values learned in Section 3.1.
(4) Update the gains using equation (6).
(5) Update the panning matrix using equation (17).
(6) Repeat Steps (4) and (5) until the algorithm converges (or the maximum number of iterations is reached).
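Algorithm 2 can be sketched as a joint loop over the updates (6) and (17). For brevity, this sketch keeps a single shared gains tensor rather than the per-channel concatenation of (15); extending it with a channel index for the gains follows the same pattern. All names and shapes are assumptions.

```python
import numpy as np

def estimate_parameters(x, b, g, m, beta=1.3, n_iter=10):
    """Joint estimation of gains and panning matrix (Algorithm 2 sketch).

    x : (I, F, T) observed magnitudes; b : (J, N, F) fixed bases;
    g : (J, N, T) gains; m : (I, J) panning matrix.
    Score-informed zeros in g persist through the multiplicative updates.
    """
    eps = 1e-12
    for _ in range(n_iter):
        x_hat = np.einsum('ij,jnf,jnt->ift', m, b, g) + eps
        # Eq. (6): multiplicative update of the gains
        g_num = np.einsum('ij,jnf,ift->jnt', m, b, x * x_hat ** (beta - 2))
        g_den = np.einsum('ij,jnf,ift->jnt', m, b, x_hat ** (beta - 1))
        g = g * g_num / (g_den + eps)
        x_hat = np.einsum('ij,jnf,jnt->ift', m, b, g) + eps
        # Eq. (17): multiplicative update of the panning matrix
        m_num = np.einsum('jnf,jnt,ift->ij', b, g, x * x_hat ** (beta - 2))
        m_den = np.einsum('jnf,jnt,ift->ij', b, g, x_hat ** (beta - 1))
        m = m * m_num / (m_den + eps)
    return g, m
```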

Note that the current model does not estimate the phases for each channel. In order to reconstruct source $j$, the model in Section 3 uses the phase of the signal corresponding to the channel $i$ where the panning matrix has its maximum value, as described in Section 3.5. Thus, in order to reconstruct the original signals, we can rely solely on the gains estimated in the single channel $i$, in a similar way to the baseline method.

5.2. Multichannel Gains Refinement. As presented in Section 5.1, for a given source we obtain an estimation of the gains corresponding to each channel. Therefore, we can apply the note refinement heuristics in a similar manner to Section 4.1 for each of the gains $[g^1_{nj}(t), \ldots, g^I_{nj}(t)]$. Then, we can average out the estimations over all the channels, making the blob detection more robust to the variances between the channels:

$$g'_{nj}(t) = \frac{\sum_{i=1}^{I} g^i_{nj}(t)}{I}. \quad (18)$$

Having computed the mean over all channels as in (18), for each note $k = 1, \ldots, K_j$ we process the submatrix $g^k_{jn}(t)$ of the new gains matrix $g'_{jn}(t)$, where $K_j$ is the total number of notes for an instrument $j$. Specifically, we apply the same steps, preprocessing (Section 4.1.1), binarization (Section 4.1.2), and blob selection (Section 4.1.3), to each matrix $g^k_{jn}(t)$, and we obtain a binary matrix $p^k_{jn}(t)$ having 1s for the elements corresponding to the best blob and 0s for the rest.

Our hypothesis is that averaging out the gains between all channels makes blob detection more robust. However, when performing the averaging, we do not account for the delays between the channels. In order to compute the delay for a given channel, we can compute the best blob separately with the method in Section 4.1 and compare it with the one calculated from the averaged estimation ($p^k_{jn}(t)$). This step is equivalent to comparing the onset times of the two best blobs of the two estimations. Subtracting these onset times, we get the delay between the averaged estimation and the one obtained for a channel, and we can correct this in matrix $p^k_{jn}(t)$. Accordingly, we zero-pad the beginning of $p^k_{jn}(t)$ with the amount of zeros corresponding to the delay, or we remove the trailing zeros for a negative delay.
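A sketch of the averaging of (18) and of the subsequent per-channel delay correction follows; it assumes both blob masks are non-empty boolean matrices on the same (bins x frames) grid, and all names are illustrative.

```python
import numpy as np

def average_gains(g_multi):
    """Eq. (18): average the per-channel gains estimates; g_multi: (I, N, T)."""
    return g_multi.mean(axis=0)

def align_mask_to_channel(p_avg, p_chan):
    """Shift the averaged best-blob mask by a channel's delay (Section 5.2).

    The delay is the onset difference between the channel's own best blob
    (p_chan) and the averaged one (p_avg).
    """
    onset_avg = np.flatnonzero(p_avg.any(axis=0))[0]
    onset_chan = np.flatnonzero(p_chan.any(axis=0))[0]
    delay = onset_chan - onset_avg
    T = p_avg.shape[1]
    if delay > 0:   # channel lags: zero-pad the beginning, keep the length
        return np.pad(p_avg, ((0, 0), (delay, 0)))[:, :T]
    if delay < 0:   # channel leads: drop leading frames, pad the end
        return np.pad(p_avg[:, -delay:], ((0, 0), (0, -delay)))
    return p_avg
```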

6. Materials and Evaluation

6.1. Dataset. The audio material used for evaluation was presented by Pätynen et al. [30] and consists of four passages of symphonic music from the Classical and Romantic periods. This work presented a set of anechoic recordings for each of the instruments, which were then synchronized between them so that they could later be combined into a mix of the orchestra. The musicians played in an anechoic chamber and, in order to be synchronous with the rest of the instruments, they followed a video featuring a conductor and a pianist playing each of the four pieces. Note that the benefit of having isolated recordings comes at the expense of ignoring the interactions between musicians, which commonly affect intonation and time-synchronization [37].

The four pieces differ in terms of the number of instruments per instrument class, style, dynamics, and size. The first passage is a soprano aria of Donna Elvira from the opera Don Giovanni by W. A. Mozart (1756–1791), corresponding to the Classical period and traditionally played by a small group of musicians. The second passage is from L. van Beethoven's (1770–1827) Symphony no. 7, featuring big chords and string crescendo. The chords and pauses make the reverberation tail of a concert hall clearly audible. The third passage is from Bruckner's (1824–1896) Symphony no. 8 and represents the late Romantic period. It features large dynamics and a large orchestra size. Finally, G. Mahler's Symphony no. 1, also featuring a large orchestra, is another example of late Romanticism. The piece has a more complex texture than the one by Bruckner. Furthermore, according to the musicians who recorded the dataset, the last two pieces were also more difficult to play and record [30].

In order to keep the evaluation setup consistent between the four pieces, we focus on the following instruments: violin, viola, cello, double bass, oboe, flute, clarinet, horn, trumpet, and bassoon. All tracks from a single instrument were joined into a single track for each of the pieces.

For the selected instruments, we list the differences between the four pieces in Table 1. Note that in the original dataset the violins are separated into two groups. However, for brevity of evaluation, and because in our separation framework we do not consider sources sharing the same instrument templates, we decided to merge the violins into a single group. Note that the pieces by Mahler and Bruckner have divisi in the groups of violins, which implies a larger number of instruments playing different melody lines simultaneously. This results in a scenario which is more challenging for source separation.

We created a ground truth score by hand annotating the notes played by the instruments. In order to facilitate this process, we first gathered the scores in MIDI format and automatically computed a global audio-score alignment using the method in [13], which has won the MIREX score-following challenge in the past years. Then, we locally aligned the notes of each instrument by manually correcting the onsets and offsets to fit the audio. This was performed using Sonic Visualiser, with the guidance of the spectrogram and the monophonic pitch estimation [38] computed for each of the isolated instruments. The annotation was performed by two of the authors, who cross-checked the work of their peer. Note that this dataset and the proposed annotation are useful not only for our particular task but also for the evaluation of multiple pitch estimation and automatic transcription algorithms in large orchestral settings, a context which has not been considered so far in the literature. The annotations can be found at the associated page (http://mtg.upf.edu/download/datasets/phenicx-anechoic).

Table 1: Anechoic dataset [30] characteristics.

Piece      Duration    Period     Instrument sections  Number of tracks  Max tracks/instrument
Mozart     3 min 47 s  Classical  8                    10                2
Beethoven  3 min 11 s  Classical  10                   20                4
Mahler     2 min 12 s  Romantic   10                   30                4
Bruckner   1 min 27 s  Romantic   10                   39                12

[Figure 3: The steps to create the multichannel recordings dataset: the anechoic recordings and the score annotations are used to produce denoised recordings, from which the multichannel recordings are generated.]

During the recording process detailed in [30], the gain of the microphone amplifiers was fixed to the same value for the whole process, which reduced the dynamic range of the recordings of the quieter instruments. This led to noisier recordings for most of the instruments. In Section 6.2 we describe the score-informed denoising procedure we applied to each track. From the denoised isolated recordings, we then used Roomsim to create a multichannel image, as detailed in Section 6.3. The steps necessary to pass from the anechoic recordings to the multichannel dataset are represented in Figure 3. The original files can be obtained from the Acoustics Group at Aalto University (http://research.cs.aalto.fi/acoustics/). For the denoising algorithm, please refer to http://mtg.upf.edu/download/datasets/phenicx-anechoic.

6.2. Dataset Denoising. The noise-related problems in the dataset were presented in [30]. We remove the noise in the recordings with the score-informed method in [39], which relies on learned noise spectral patterns. The main difference is that we rely on a manually annotated score, while in [39] the score is assumed to be misaligned, so further regularization is included to ensure that only certain note combinations in the score occur.

The annotated score yields the time intervals where an instrument is not playing. Thus, the noise pattern is learned only within those intervals. In this way, the method assures that the desired noise, which is a part of the actual sound of the instrument, is preserved in the denoised recording.

The algorithm takes each anechoic recording of a given instrument and removes the noise for the time intervals where the instrument is playing, while setting to zero the frames where the instrument is not playing.

[Figure 4: The sources and the receivers (microphones) in the simulated room. The instrument sections (violins, violas, cellos, double basses, flutes, oboes, clarinets, bassoons, horns, trumpets) are placed on stage as in a typical orchestral setup, with microphones at positions labeled C, L, R, V1, V2, TR, HRN, WWL, and WWR.]

6.3. Dataset Spatialization. To simulate a large reverberant hall, we use the software Roomsim [31]. We define a configuration file which specifies the characteristics of the hall and, for each of the microphones, their position relative to each of the sources. The simulated room has dimensions similar to those of the Royal Concertgebouw in Amsterdam, one of the partners in the PHENICX project, and represents a setup in which we tested our framework. The simulated room's width, length, and height are 28 m, 40 m, and 12 m. The absorption coefficients are specified in Table 2.

The positions of the sources and microphones in the room are common for orchestral concerts (Figure 4). A configuration file is created for each microphone, containing its coordinates (e.g., (14, 17, 4) for the center microphone). Then, each source is defined through polar coordinates relative to the microphone (e.g., (11.4455, −95.1944, −15.1954): radius, azimuth, and elevation for the bassoon relative to the center microphone). We selected all the microphones to be cardioid, in order to match the realistic setup of the Concertgebouw Hall.
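The source entries in the configuration file are polar coordinates relative to each microphone; the conversion is plain geometry, sketched below. The axis and sign conventions here are assumptions and should be checked against the Roomsim manual.

```python
import numpy as np

def polar_relative(mic, src):
    """Radius (m), azimuth and elevation (degrees) of a source relative
    to a microphone, as listed in the Roomsim configuration.  The axis
    and sign conventions are assumed, not taken from Roomsim itself.
    """
    d = np.asarray(src, dtype=float) - np.asarray(mic, dtype=float)
    radius = float(np.linalg.norm(d))
    azimuth = float(np.degrees(np.arctan2(d[1], d[0])))
    elevation = float(np.degrees(np.arcsin(d[2] / radius)))
    return radius, azimuth, elevation

# e.g., a hypothetical source position relative to the center
# microphone at (14, 17, 4)
print(polar_relative((14, 17, 4), (12.0, 5.0, 1.0)))
```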


Table 2: Room surface absorption coefficients.

Standard measurement frequencies (Hz)      125   250   500   1000  2000  4000
Absorption of wall in x = 0 plane          0.4   0.3   0.3   0.3   0.2   0.1
Absorption of wall in x = Lx plane         0.4   0.45  0.35  0.35  0.45  0.3
Absorption of wall in y = 0 plane          0.4   0.45  0.35  0.35  0.45  0.3
Absorption of wall in y = Ly plane         0.4   0.45  0.35  0.35  0.45  0.3
Absorption of floor, i.e., z = 0 plane     0.5   0.6   0.7   0.8   0.8   0.9
Absorption of ceiling, i.e., z = Lz plane  0.4   0.45  0.35  0.35  0.45  0.3

Using the configuration file and the anechoic audio files corresponding to the isolated sources, Roomsim generates the audio files for each of the microphones, along with the impulse responses for each pair of instruments and microphones. The impulse responses and the anechoic signals are used during the evaluation to obtain the ground truth spatial image of the sources in the corresponding microphone. Additionally, we plot the reverberation time RT60 [31] of the simulated room across frequencies in Figure 5.

[Figure 5: The reverberation time versus frequency for the simulated room.]

We need to adapt the ground truth annotations to the audio generated with Roomsim, as the original annotations were done on the isolated audio files. Roomsim creates the audio for a given microphone by convolving each of the sources with the corresponding impulse response and then summing up the results of the convolutions. We compute a delay for each pair of microphones and instruments by taking the position of the maximum value in the associated impulse response vector. Then, we generate a score for each of the pairs by adding the corresponding delay to the note onsets. Additionally, since the offset time depends on the reverberation and on the frequencies of the notes, we add 0.8 s to each note offset to account for the reverberation, besides the added delay.
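This score adaptation reduces to reading one delay per instrument-microphone pair off the impulse response and shifting the note times, for example:

```python
import numpy as np

def adapt_score(onsets, offsets, ir, sr=44100, tail=0.8):
    """Shift note times for one instrument-microphone pair (Section 6.3).

    ir   : impulse response vector of the pair (array of samples)
    tail : seconds added to each offset for the reverberation tail
    """
    delay = np.argmax(ir) / sr  # position of the IR maximum, in seconds
    new_onsets = [t + delay for t in onsets]
    new_offsets = [t + delay + tail for t in offsets]
    return new_onsets, new_offsets
```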

6.4. Evaluation Methodology

6.4.1. Parameter Selection. In this paper, we use a low-level spectral representation of the audio data, which is generated from a windowed FFT of the signal. We use a Hanning window with a size of 92 ms and a hop size of 11 ms.

Here, a logarithmic frequency discretization is adopted. Furthermore, two time-frequency resolutions are used. First, to estimate the instrument models and the panning matrix, a single-semitone resolution is proposed. In particular, we implement this time-frequency representation by integrating the STFT bins corresponding to the same semitone. Second, for the separation task, a higher resolution of 1/4 of a semitone is used, which has proven to achieve better separation results [6]. This time-frequency representation is obtained by integrating the STFT bins corresponding to 1/4 of a semitone. Note that, in the separation stage, the learned basis functions $b_{jn}(f)$ are adapted to the 1/4-semitone resolution by replicating the basis value at each semitone to the 4 samples of the 1/4-semitone resolution that belong to this semitone. For the image preprocessing, we pick for the first Gaussian $\phi = 3$ as the standard deviation, and for the second Gaussian $\kappa = 4$ as the position of the central frequency bin and $\nu = 4$ as the standard deviation, corresponding to one semitone.
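The semitone-integrated representation can be sketched by mapping each FFT bin to a log-frequency bin and summing, as below; the reference frequency (A0 here) and the bin mapping are assumed choices.

```python
import numpy as np

def log_frequency_spectrogram(S, freqs, fmin=27.5, bins_per_semitone=1):
    """Integrate STFT bins into a log-frequency grid (Section 6.4.1).

    S     : (F, T) magnitude spectrogram    freqs : (F,) bin centers in Hz
    fmin  : reference frequency (A0 here -- an assumed choice)
    Use bins_per_semitone=1 for the models/panning and 4 for separation.
    """
    steps = 12 * bins_per_semitone  # log-frequency bins per octave
    idx = np.full(len(freqs), -1, dtype=int)
    valid = freqs >= fmin
    idx[valid] = np.floor(steps * np.log2(freqs[valid] / fmin)).astype(int)
    out = np.zeros((idx.max() + 1, S.shape[1]))
    for k in range(out.shape[0]):
        rows = idx == k
        if rows.any():
            out[k] = S[rows].sum(axis=0)  # integrate bins of the same band
    return out
```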

We picked 10 iterations for the NMF, and we set the beta-divergence distortion to $\beta = 1.3$, as in [6, 11].

6.4.2. Evaluation Setup. We perform three different kinds of evaluation: audio-to-score alignment, panning matrix estimation, and score-informed source separation. Note that for the alignment we evaluate the state-of-the-art system in [13]. This method does not align notes but combinations of notes in the score (a.k.a. states). Here, the alignment is performed with respect to a single audio channel, corresponding to the microphone situated at the center of the stage. On the other hand, the offsets are estimated by shifting the original duration of each note in the score [13] or by assigning the offset time as the onset of the next state. We denote these two cases as INT and NEX, respectively.

Regarding the initialization of the separation frameworkwe can use the raw output of the alignment system Howeveras stated in Section 4 and [3 5 24] a better option is to extendthe onsets and offsets along a tolerancewindow to account forthe errors of the alignment system and for the delays betweencenter channel (on which the alignment is performed) andthe other channels and for the possible errors in the alignmentitself Thus we test two hypotheses regarding the tolerancewindow for the possible errors In the first case we extendthe boundaries with 03 s for onsets and 06 s for offsets (T1)and in the second with 06 s for onsets and 1 s for offsets (T2)Note that the value for the onset times of 03 s is not arbitrarybut the usual threshold for onsets in the score-following in


Table 3: Score information used for the initialization of score-informed source separation.

Tolerance window size              Offset estimation
T1: onsets 0.3 s, offsets 0.6 s    INT: interpolation of the offset time
T2: onsets 0.6 s, offsets 0.9 s    NEX: the offset is the onset of the next note

Two different tolerance windows were tested to account for the complexity of this novel scenario. The tolerance window is slightly larger for offsets, due to the reverberation time and because the ending of a note is not as clear as its onset. A summary of the score information used to initialize the source separation framework is given in Table 3.

We label the test case corresponding to the initialization with the raw output of the alignment system as Ali. Conversely, the test case corresponding to the tolerance window initialization is labeled Ext. Furthermore, within the tolerance window, we can refine the note onsets and offsets with the methods in Section 4.1 (Ref1) and Section 5.2 (Ref2), resulting in two other test cases. Since method Ref1 can only refine the score to a single channel, its results are solely computed with respect to that channel. For the multichannel refinement Ref2, we report the results of the alignment of each instrument with respect to each microphone. A graphic of the initialization of the framework with the four test cases listed above (Ali, Ext, Ref1, and Ref2), along with the ground truth score initialization (GT), is found in Figure 7; we present the separation results for these cases in Section 6.5.3.

In order to evaluate the panning matrix estimation stage, we compute an ideal panning matrix based on the impulse responses generated by Roomsim during the creation of the multichannel audio (see Section 6.3). The ideal panning matrix gives the ideal contribution of each instrument in each channel, and it is computed by searching for the maximum in the impulse response vector corresponding to each instrument-channel pair, as in

$$m_{\text{ideal}}(i, j) = \max_{t} \, \text{IR}(i, j)(t), \quad (19)$$

where $\text{IR}(i, j)(t)$ is the impulse response of source $i$ in channel $j$. By comparing the estimated matrix $m(i, j)$ with the ideal one, $m_{\text{ideal}}(i, j)$, we can determine whether the algorithm picked a wrong channel for separation.
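A minimal sketch of (19), assuming the impulse responses are stacked in a single array (the array layout is our assumption):

```python
import numpy as np

def ideal_panning_matrix(irs):
    """irs: array of shape (n_sources, n_channels, n_samples) holding the
    Roomsim impulse responses IR(i, j)(t).
    Returns m_ideal(i, j) as in (19): the maximum of each IR vector."""
    return irs.max(axis=2)
```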

6.4.3. Evaluation Metrics. For score alignment, we are interested in a measure which relates to source separation and accounts for the audio frames which are correctly detected, rather than an alignment rate computed per note onset as found in [13]. Thus, we evaluate the alignment at the frame level rather than at the note level. A similar reasoning on the evaluation of score alignment is found in [4].

We consider 0.011 s, the size of a frame, as the temporal granularity for this measure. Then, a frame of a musical note is considered a true positive (tp) if it is found in the ground truth score and in the aligned score within the exact time boundaries. The same frame is labeled a false positive (fp) if it is found only in the aligned score, and a false negative (fn) if it is found only in the ground truth score. Since the gains initialization is done with score information (see Section 4), lost frames (recall) and incorrectly detected frames (precision) impact the performance of the source separation algorithm. Precision is defined as $p = \text{tp}/(\text{tp} + \text{fp})$ and recall as $r = \text{tp}/(\text{tp} + \text{fn})$. Additionally, we compute the harmonic mean of precision and recall to obtain the $F$-measure as $F = 2 \cdot (p \cdot r)/(p + r)$.
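For illustration, the frame-level metrics could be computed as follows (a sketch assuming the notes have been expanded into sets of (note, frame) pairs at the 0.011 s granularity; names are hypothetical):

```python
def frame_metrics(gt_frames, aligned_frames):
    """Frame-level precision, recall, and F-measure.
    gt_frames, aligned_frames: sets of (note, frame_index) pairs."""
    tp = len(gt_frames & aligned_frames)   # frames in both scores
    fp = len(aligned_frames - gt_frames)   # only in the aligned score
    fn = len(gt_frames - aligned_frames)   # only in the ground truth
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```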

The source separation evaluation framework and metrics employed are described in [41, 42]. Correspondingly, we use the Source to Distortion Ratio (SDR), the Source to Interference Ratio (SIR), the source Image to Spatial distortion Ratio (ISR), and the Source to Artifacts Ratio (SAR). While SDR measures the overall quality of the separation and ISR the spatial reconstruction of the source, SIR relates to the rejection of interferences and SAR to the absence of forbidden distortions and artifacts.

The evaluation of source separation is a computationally intensive process. Additionally, processing the long audio files in the dataset would require a large amount of memory for the matrix calculations. To reduce the memory requirements, the evaluation is performed on blocks of 30 s, with 1 s overlap to allow for continuation.
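A block layout along these lines could be generated as in the following sketch (hypothetical helper; boundaries are expressed in samples):

```python
def blocks(n_samples, sr, block_dur=30.0, overlap=1.0):
    """Yield (start, end) sample indices for 30 s evaluation
    blocks with 1 s overlap, as described above."""
    hop = int((block_dur - overlap) * sr)
    size = int(block_dur * sr)
    for start in range(0, n_samples, hop):
        yield start, min(start + size, n_samples)
```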

6.5. Results

6.5.1. Score Alignment. We evaluate the output of the alignment (Ali), along with the estimation of the note offsets (INT and NEX), in terms of $F$-measure (see Section 6.4.3), precision, and recall, ranging from 0 to 1. Additionally, we evaluate the optimal size for extending the note boundaries along the onsets and offsets (T1 and T2) for the refinement methods Ref1 and Ref2 and for the baseline Ext. Since there are many differences between the pieces, we report the results individually per piece, in Table 4.

Methods Ref1 and Ref2 depend on a binarization threshold which determines how much energy is set to zero. A lower threshold results in the consolidation of larger blobs in the blob detection. In [11], this threshold is set to 0.5 for a dataset of monaural recordings of Bach chorales played by four instruments. However, we are facing a multichannel scenario where capturing the reverberation is important, especially when we consider that offsets were annotated with a low energy threshold. Thus, we are interested in losing the least energy possible, and we set lower values for the threshold: 0.3 and 0.1. Consequently, when analyzing the results, the lower threshold achieves better performance in terms of $F$-measure for Ref1 (0.67 and 0.72) and for Ref2 (0.71 and 0.75).

According to Table 4, extending note offsets (NEX) rather than interpolating them (INT) gives lower recall in all pieces, and the method leads to losing more frames, which cannot be recovered even by extending the offset times in T2. NEX T2 always yields a lower recall when compared to INT T2 (e.g., $r = 0.85$ compared to $r = 0.97$ for Mozart).

The output of the alignment system, Ali, is not a good option to initialize the gains of the source separation system: it has a high precision and a very low recall (e.g., the case INT Ali has $p = 0.95$ and $r = 0.39$, compared to case INT Ext, which has $p = 0.84$ and $r = 0.84$, for Beethoven).


Table 4: Alignment evaluated in terms of F-measure, precision, and recall, ranging from 0 to 1.

                  Mozart             Beethoven          Mahler             Bruckner
                  F     p     r      F     p     r      F     p     r      F     p     r
INT  Ali          0.69  0.93  0.55   0.56  0.95  0.39   0.61  0.85  0.47   0.60  0.94  0.32
     T1  Ref1     0.77  0.77  0.76   0.79  0.81  0.77   0.63  0.74  0.54   0.72  0.73  0.70
         Ref2     0.83  0.79  0.88   0.82  0.82  0.81   0.67  0.77  0.60   0.81  0.77  0.85
         Ext      0.82  0.73  0.94   0.84  0.84  0.84   0.77  0.69  0.87   0.86  0.78  0.96
     T2  Ref1     0.77  0.77  0.76   0.76  0.75  0.77   0.63  0.74  0.54   0.72  0.70  0.71
         Ref2     0.83  0.78  0.88   0.82  0.76  0.87   0.67  0.77  0.59   0.79  0.73  0.86
         Ext      0.72  0.57  0.97   0.79  0.70  0.92   0.69  0.55  0.93   0.77  0.63  0.98
NEX  Ali          0.49  0.94  0.33   0.51  0.89  0.36   0.42  0.90  0.27   0.48  0.96  0.44
     T1  Ref1     0.70  0.77  0.64   0.72  0.79  0.66   0.63  0.74  0.54   0.68  0.72  0.64
         Ref2     0.73  0.79  0.68   0.71  0.79  0.65   0.66  0.77  0.58   0.73  0.74  0.72
         Ext      0.73  0.77  0.68   0.71  0.80  0.65   0.69  0.75  0.64   0.76  0.79  0.72
     T2  Ref1     0.74  0.78  0.71   0.72  0.74  0.70   0.63  0.74  0.54   0.72  0.79  0.72
         Ref2     0.80  0.80  0.80   0.75  0.75  0.75   0.67  0.77  0.59   0.79  0.73  0.86
         Ext      0.73  0.65  0.85   0.73  0.69  0.77   0.72  0.64  0.82   0.79  0.69  0.91


Table 5: Instruments for which the closest microphone was incorrectly determined, for different score information (GT, Ali, T1, T2, INT, and NEX) and the two room setups.

             GT         INT Ali                  INT T1            INT T2                   NEX Ali                  NEX T1    NEX T2
Setup 1      clarinet   clarinet, double bass    clarinet, flute   clarinet, flute, horn    clarinet, double bass    (none)    clarinet, flute
Setup 2      cello      cello, flute             cello, flute      cello, flute             bassoon                  flute     cello, flute

For the case of Beethoven, the output of the alignment is particularly poor compared to the other pieces. However, by extending the boundaries (Ext) and applying note refinement (Ref1 or Ref2), we are able to increase the recall and match the performance achieved on the other pieces.

When comparing the sizes of the tolerance window for the onsets and offsets, we observe that the alignment is more accurate at detecting the onsets within 0.3 s and the offsets within 0.6 s. In Table 4, T1 achieves better results than T2 (e.g., $F = 0.77$ for T1 compared to $F = 0.69$ for T2, on Mahler). Relying on a large window retrieves more frames but also significantly damages the precision. However, when considering source separation, we might want to lose as little information as possible. It is in this case that the refinement methods Ref1 and Ref2 show their importance: when facing larger time boundaries, as with T2, Ref1 and especially Ref2 are able to reduce the errors, achieving better precision with a minimal loss in recall.

The refinement Ref1 performs worse than Ref2, the multichannel refinement (e.g., $F = 0.72$ compared to $F = 0.81$ for Bruckner, INT T1). Note that, in its original version [11], Ref1 assumed monophony within a source, as it was tested on the simpler case of the Bach10 dataset [4]. To that extent, it relied on a graph computation to determine the best distribution of blobs. However, due to the increased polyphony within an instrument section (e.g., violins playing divisi), with simultaneous melodic lines, we disabled this feature; in this case, Ref1 has lower recall, as it loses more frames. On the other hand, Ref2 is more robust because it computes a blob estimation per channel; averaging out these estimations yields better results.

The refinement works worse for the more complex pieces (Mahler and Bruckner) than for the simpler pieces (Mozart and Beethoven). Increasing the polyphony within a source and the number of instruments having many interleaving melodic lines (a less sparse score) also makes the task more difficult.

6.5.2. Panning Matrix. Correctly estimating the panning matrix is an important step in the proposed method, since Wiener filtering is performed on the channel where the instrument has the most energy. If the algorithm picks a different channel for this step, we can find more interference between instruments in the separated audio files.

As described in Section 3.1, the estimation of the panning matrix depends on the number of nonoverlapping partials of the notes found in the score and on their alignment with the audio. To that extent, the more nonoverlapping partials we have, the more robust the estimation.

Initially, we experimented with computing the panning matrix separately for each piece. In the case of Bruckner, the piece is simply too short, and there are too few nonoverlapping partials to yield a good estimation, resulting in errors in the panning matrix. Since the instrument setup is the same for the Bruckner, Beethoven, and Mahler pieces (10 sources in the same positions on the stage), we decided to jointly estimate the matrix for the concatenated audio pieces and the associated scores. We denote as Setup 1 the Mozart piece, played by 8 sources, and as Setup 2 the Beethoven, Mahler, and Bruckner pieces, played by 10 sources.

Since the panning matrix is computed using the score, different score information can yield very different estimates of the panning matrix. To that extent, we evaluate the influence of audio-to-score alignment, namely, the cases INT, NEX, Ali, T1, and T2, and the initialization with the ground truth score information, GT.

In Table 5, we list the instruments for which the algorithm picked the wrong channel. Note that, in the room setup generated with Roomsim, most of the instruments in Table 5 are placed close to other sources from the same family of instruments, for example, cello and double bass, or flute with clarinet, bassoon, and oboe. In such cases, the algorithm makes more mistakes when selecting the correct channel on which to perform source separation.

In the column GT of Table 5, we can see that having a perfectly aligned score yields fewer errors when estimating the panning matrix. Conversely, in a real-life scenario, we cannot rely on a hand-annotated score. In this case, for all columns of Table 5 excluding GT, the best estimation is obtained by the combination of NEX and T1: taking the offset time as the onset of the next note and then extending the score with a smaller window.

Furthermore, we compute the SDR values for the instruments in Table 5, column GT (clarinet and cello), for the cases in which the separation is done in the correct channel and in the estimated channel. For Setup 1, the channel for clarinet is wrongly assigned to WWL (woodwinds left), the correct one being WWR (woodwinds right), when we have a perfectly aligned score (GT). However, the microphones WWL and WWR are very close (see Figure 4), they do not capture significant energy from other instrument sections, and the SDR difference is less than 0.01 dB. In contrast, in Setup 2, the cello is wrongly separated in the WWL channel, and the SDR difference between this audio and the audio separated in the correct channel is around -1.1 dB for each of the three pieces.

6.5.3. Source Separation. We use the evaluation metrics described in Section 6.4.3. Since there is a lot of variability between the four pieces, it is more informative to present the results per piece rather than aggregating them.


Figure 6: Results in terms of SDR, SIR, ISR, and SAR (dB) for the instruments (bassoon, cello, clarinet, double bass, flute, horn, oboe, trumpet, viola, and violin) and the pieces (Mozart, Beethoven, Mahler, and Bruckner) in the dataset.


First, we analyze the separation results per instrument in an ideal case. We assume that the best results for score-informed source separation are obtained in the case of a perfectly aligned score (GT). Furthermore, for this case, we calculate the separation in the correct channel for all the instruments, since in Section 6.5.2 we saw that picking a wrong channel can be detrimental. We present the results as a bar plot in Figure 6.

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6, we can see that the more complex the piece, the more difficult it is to achieve a good separation. For instance, note that cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for cello in Mozart compared to -5 dB in Mahler). Cello and double bass are close to each other in both setups, and similarly clarinet and flute, so we expect interference between them. Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in the lower SIR values for double bass (5.5 dB SIR for Mozart, but -1.8, -0.1, and -0.2 dB SIR for the others) and for flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask, the multichannel spectrogram is multiplied by the panning matrix. Thus, wrong values in this matrix can yield wrong amplitude values in the resulting signals.

This is the case for the trumpet, which is allocated a close microphone in the current setup and for which we expect a good separation. However, the trumpet achieves a poor ISR (5.5, 1, and -1 dB) while having a good separation in terms of SIR and SAR. Similarly, other instruments, such as cello, double bass, flute, and viola, face the same problem, particularly for the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.


Figure 7: The test cases for the initialization of score-informed source separation for the submatrix $p^{s}_{jn}(t)$ of a note $s$: ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).

A low SDR is obtained in the case of Mahler, which is related to the poor alignment results obtained for this piece. As seen in Table 4, for the INT case, the $F$-measure is almost 8% lower for Mahler than for the other pieces, mainly because of the bad precision.

The results are considerably worse for the double bass in the more complex pieces by Beethoven (-9.1 dB SDR), Mahler (-8.5 dB SDR), and Bruckner (-4.7 dB SDR); for the further analysis, we consider it an outlier and exclude it.

Second, we want to evaluate the usefulness of note refinement in source separation. As seen in Section 4, the gains for NMF separation are initialized with score information or with a refined score, as described in Section 4.2. A summary of the different initialization options is shown in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization with the ground truth annotations (Figure 7 (GT)), the direct output of the score alignment system (Figure 7 (Ali)), the common practice of NMF gains initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7 (Ext)), and the refinement approaches (Figure 7 (Ref1 and Ref2)). Note that Ref1 is the refinement done with the method in Section 4.1, and Ref2 with the multichannel method described in Section 5.2.

We test the difference between the binarization thresholds, 0.5 and 0.1, used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives an $F$-value of 0.0796 and a $p$ value of 0.7778, which shows no significant difference between the two binarization thresholds.

The results for the five initializations, GT, Ref1, Ref2, Ext, and Ali, are presented in Figure 8 for each of the four pieces. Note that, for Ref1, Ref2, and Ext, we aggregate information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate between the instruments, with the piece by Mahler having the worst results and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors, such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues we described in Section 6.1, can increase the difficulty of separation, up to the point that Ref1, Ref2, and Ext yield only a minimal improvement. To that extent, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps improve the separation, increasing the SDR by 1-1.5 dB or more for all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase the SDR for most of the pieces, with the exception of the piece by Mahler. This is due to an increase in SIR and a decrease of the interferences in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext; for this piece, the difference in SIR is around 2 dB. Then, for Beethoven, Ref1 and Ref2 gain 0.5 dB in terms of SDR when compared to Ext, and 1.5 dB in SIR. For Bruckner, solely Ref2 has a higher SDR; however, SIR increases by 1.5 dB for both Ref1 and Ref2. Note that not only do Ref1 and Ref2 refine the time boundaries of the notes, but the refinement also happens in frequency, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the estimation of the note offsets (INT and NEX) and of the tolerance window sizes (T1 and T2), which account for errors in the alignment. Note that, in this case, we do not include the refinement in the results; we evaluate only the case Ext, leaving out the refinement in order to isolate the influence of T1 and T2. The results are presented in Figure 9 and show that the best results are obtained for the interpolation of the offsets, INT. This relates to the results presented in Section 6.5.1. Similarly to the analysis regarding the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to draw a conclusion on which strategy is better for the initialization, as the error bars for the ground truth overlap with those of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gains estimation, as proposed in Section 5.1, and the single-channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR and obtained an $F$-value of 1.712 and a $p$-value of 0.1908. Hence, there is no significant difference between single-channel and multichannel gain estimation when we do not perform postprocessing of the gains with the refinement. However, although the new update rules do not help in the multichannel case by themselves, they allow us to better refine the gains: in this case, we aggregate information over all the channels, and blob detection is more robust, even to variations of the binarization threshold. On account of that, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available online (http://repovizz.upf.edu/phenicx/anechoic_multi).

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation is Instrument Emphasis, which aims at processing multitrack orchestral recordings.


Figure 8: Results in terms of SDR, SIR, SAR, and ISR (dB) for the NMF gains initialization in the different test cases (GT, Ref1, Ref2, Ext, and Ali), for the pieces by Mozart, Beethoven, Mahler, and Bruckner.

Once we have the separated tracks, we can emphasize a particular instrument over the full orchestra downmix recording. Our workflow consists in processing multichannel orchestral recordings for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of other sections, obtaining an enhanced signal for the selected instrument.

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user when uploading media content and for presenting the results to the user. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks.

After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks. The separation quality is not degraded if the blocks have a sufficient duration; in our case, we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx) and are integrated into the PHENICX prototype (http://phenicx.com).

7.2. Acoustic Rendering. The second application for the separated material targets augmented or virtual reality scenarios, where we apply spatialization to the separated musical sources. Acoustic Rendering aims at recreating acoustically the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage relative to the listener.

We have considered binaural synthesis the most suitable spatial audio technique for this application. Humans locate the direction of incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound will arrive with a different intensity and at different time instances at both ears.


Figure 9: Results in terms of SDR (dB) for the combination of the different offset estimation methods (INT and NEX) and the different sizes of the tolerance window for note onsets and offsets (T1 and T2), for the pieces by Mozart, Beethoven, Mahler, and Bruckner. GT is the ground truth alignment and A (Ali) is the raw output of the score alignment.

The idea behind binaural synthesis is to artificially generate these cues in order to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype, we have used the noncommercial Unity3D plugin for binaural synthesis provided by 3DCeption (https://twobigears.com/index.php).

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker, https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of the musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances), and for the instrument visualization and Acoustic Rendering use cases introduced above.

As inputs for the source localization method, we need the multichannel recordings and the approximate positions of the microphones on stage. In concert halls, the recording setup typically consists of a grid structure of hanging overhead microphones. The positions of the overhead microphones are therefore kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects, such as acoustic reflections and noise, make this process difficult, and it is currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes the refined score alignment obtained prior to the signal separation (see Section 4.1). It follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then, the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible locations of a sound source, based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate a list of time delay values corresponding to all note onsets in the score and take the maximum of the histogram. In our experiment, we have a time resolution of 2.8 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play in isolation, as SRP-PHAT [45] does, and it can therefore be used in complex scenarios.
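As an illustration of the delay estimation for one instrument and one microphone pair, the histogram step could be sketched as follows (hypothetical names; we assume the refined onset times of the same notes at both microphones are available):

```python
import numpy as np

def pair_delay(onsets_mic_a, onsets_mic_b, resolution=0.0028):
    """Estimate the time difference of arrival for one instrument and one
    microphone pair as the mode of the per-note onset delay histogram."""
    deltas = np.asarray(onsets_mic_a) - np.asarray(onsets_mic_b)
    bins = np.arange(deltas.min(), deltas.max() + 2 * resolution, resolution)
    hist, edges = np.histogram(deltas, bins=bins)
    return edges[np.argmax(hist)]    # most frequent delay value
```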

8. Outlook

In this paper, we proposed a framework for score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then, we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings has been objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment, which improved the separation for three of the four pieces in the dataset when compared to the other two initialization options for our framework: the raw output of the score alignment and the tolerance window on which the alignment relies.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems in recovering the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done on a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach for source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone with which to separate a source is the closest one, as determined by the localization method.


When looking at the separation across the instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies between the pieces and the instruments, and future research could provide more insight into this problem. Note that an increasing degree of consonance was related to a more difficult case for source separation [17]. Hence, we can expect a worse separation for instruments which harmonize or accompany other sections, as is the case of viola, cello, and double bass in some pieces. Future research could give more insight into the relation between the musical characteristics of the pieces (e.g., tonality and texture) and the source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow for other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework, related to Instrument Emphasis or Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustin Martorell for his insights on the musicology side and for correcting the scores, and Oscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference, 2013.
[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.
[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.
[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal on Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.
[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.
[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodriguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.
[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.
[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.
[9] F. J. Rodriguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.
[10] F. J. Rodriguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.
[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, October 2015.
[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.
[13] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, 2015.
[14] O. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.
[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.
[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, IEEE, Taipei, Taiwan, April 2009.
[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.
[18] F. J. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.
[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.
[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.
[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.
[22] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.
[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.
[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.
[25] B. Niedermayer, Accurate audio-to-score alignment-data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.
[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.
[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.
[28] D. Fitzgerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.
[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Málaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.
[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.
[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.
[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.
[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.
[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.
[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.
[38] M. Mauch and S. Dixon, "PYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.
[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.
[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.
[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.
[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.
[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.
[45] A. Martí Guerola, Multichannel audio processing for speaker localization, separation and enhancement, Universitat Politècnica de València, 2013.



accuracy of the alignment to the quality of source separation [4, 11]. For Western classical music, a correctly aligned score yields the exact time where each instrument is playing. Thus, an important step in score-informed source separation is obtaining a correctly aligned score, which can be done automatically with an audio-to-score alignment system.

Audio-to-score alignment deals with the alignment of a symbolic representation, such as the score, with the audio of the rendition. In a live scenario, this task deals with following the musical score while listening to the live performance, and it is known as score-following. To our knowledge, with the exception of [12], audio-to-score alignment systems have not been rigorously tested in the context of orchestral music. However, with respect to classical music, limited experimental scenarios comprising Bach chorales played by four instruments have been discussed in [4, 13]. Furthermore, in [14], a subset of the RWC classical music database [15] is used for training and testing, though no details are given regarding the instrumental complexity of the pieces. Moreover, a Beethoven orchestral piece from the same database is tested in [16], obtaining lower accuracy than the other evaluated pieces. These results point out the complexity of the orchestral scenario, underlined in [12]. Particularly, a larger number of instruments, with many instruments playing concurrently different melody lines [17], proves to be a more difficult problem than tracking a limited number of instruments, as in pop music or, for instance, string quartets and piano pieces. Although in this paper we do not propose a new system for orchestral music audio-to-score alignment, the task being complex and extensive in itself, we are interested in analyzing the relation between this task and the quality of score-informed source separation in such a complex scenario.

Besides the score, state-of-the-art source separation methods take into account characteristics of the audio signal, which can be integrated into the system, thus achieving better results. For instance, the system can learn timbre models for each of the instruments [7, 18]. Moreover, it can rely on the assumption that the family of featured instruments is known and that their spectral characteristics are useful to discriminate between different sections playing simultaneously when the harmonics of the notes overlap. In a more difficult case, when neither the score nor the timbre of the instrument is available, temporal continuity and frequency sparsity [19] help in distributing the energy between sources in a musically meaningful way. Furthermore, an initial pitch detection can improve the results [6, 20–22] if the method assumes a predominant source. However, our scenario assumes multichannel recordings with distant microphones, in contrast to the close-microphone approach in [6], and we cannot assume that a source is predominant. As a matter of fact, previous methods deal with a limited case: the separation between a small number of harmonic instruments or piano [3, 4, 6, 18], leaving the case of orchestral music as an open issue. In this paper, we investigate a scenario characterized by highly reverberant halls [23], a large number of musicians in each section, a large diversity of instruments played, abrupt tempo changes, and many concurrent melody lines, often within the same instrument section.

Regarding the techniques used for source separation, matrix decomposition has become increasingly popular in recent years [2, 3, 6, 18–20]. Nonnegative matrix factorization (NMF) is a particular case of decomposition which restricts the values of the factor matrices to be nonnegative. The first factor matrix can be seen as a dictionary representing spectral templates. For audio signals, a dictionary is learned for each of the sources and stored into the basis matrix as a set of spectral templates. The second factor matrix holds the temporal activations, or the weights, of the templates. Then, the resulting factorized spectrogram is calculated as a linear combination of the template vectors with a set of weight vectors forming the activations matrix. This representation allows for parametric models, such as the source-filter model [3, 18, 20] or the multiexcitation model [9], which can easily capture important traits of harmonic instruments and help separate between them, as is the case with orchestral music. The multiexcitation model has been evaluated in a restricted scenario of Bach chorales played by a quartet [4] and, for this particular database, has been extended in the scope of close-microphone recordings [6] and score-informed source separation [11]. From a source separation point of view, in this article we extend and evaluate the work in [6, 11, 18] for orchestral music.

In order to obtain a better separation with any NMF parametric model, the sparseness of the gains matrix is increased by initializing it with time and frequency information obtained from the score [3–5, 24]. The values between the time frames where a note template is not activated are set to zero and will remain this way during factorization, allowing the energy from the spectrogram to be redistributed between the notes and the instruments which actually play during that interval. A better alignment leads to a better gains initialization and a better separation. Nonetheless, audio-to-score alignment mainly fixes global misalignments, which are due to tempo variations, and does not deal with local misalignments [21]. To account for local misalignments, score-informed source separation systems include onset and offset information in the parametric model [3, 5, 24] or use image processing in order to refine the gains matrix so that it closely matches the actual time and frequency boundaries of the played notes [11]. Conversely, local misalignments can be fixed explicitly [25–27]. To our knowledge, none of these techniques have been explored for orchestral music, although there is scope for testing their usefulness if we take into account several factors, such as the synchronization of musicians in large ensembles, concurrent melody lines, and reverberation. Furthermore, the alignment systems are monaural. However, in our case, the separation is done on multichannel recordings, and the delays between the sources and microphones might yield local misalignments. Hence, towards a better separation and a more precise alignment, we propose a robust method to refine the output of a score alignment system with respect to each audio channel of the multichannel audio.
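To illustrate the score-informed gains initialization described above, a minimal sketch could look as follows (hypothetical helper; the 0.3 s/0.6 s defaults correspond to the T1 tolerance window used in the evaluation):

```python
import numpy as np

def init_gains(notes, n_pitches, n_frames, frame_dur, tol_on=0.3, tol_off=0.6):
    """Initialize the NMF gains of one instrument from the aligned score:
    entries outside the (extended) note boundaries stay zero and remain
    zero during the multiplicative factorization updates.
    notes: list of (pitch_index, onset_s, offset_s)."""
    g = np.zeros((n_pitches, n_frames))
    for pitch, onset, offset in notes:
        t0 = max(0, int((onset - tol_on) / frame_dur))
        t1 = min(n_frames, int((offset + tol_off) / frame_dur) + 1)
        g[pitch, t0:t1] = 1.0   # nonzero entries are free to evolve
    return g
```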

In the proposed distant-microphone scenario, we normally do not have microphones close to a particular instrument or soloist and, moreover, we face an underdetermined case: the number of sources surpasses the number of microphones.


Figure 1: The diagram representing the flow of operations in the system: the multichannel audio $x_i(f, t)$ and the score feed the time-frequency representation and the score alignment; these drive the panning matrix estimation ($m_{ij}$) and the gains initialization ($g_{jn}(t)$); NMF parameter estimation with the instrument spectral templates $b_{jn}(f)$ is followed by gains refinement and separated source estimation ($s'_j(t)$).

Therefore, recording the sound of an entire section also captures interference from other instruments and the reverberation of the concert hall. To that extent, our task is different from interference reduction in close-microphone recordings [6, 7, 23], these approaches being evaluated for pop concerts [23] or quartets [6]. Additionally, we do not target blind source separation as the previous systems [6, 7, 23] do. Subsequently, we adapt and improve the systems in [6, 11] by using information from all the channels, similarly to parallel factor analysis (PARAFAC) in [28, 29].

With respect to the evaluation, to our knowledge this is the first time score-informed source separation is objectively evaluated in such a complex scenario. An objective evaluation provides a more precise estimation of the contribution of each part of the framework and of its influence on the separation. Additionally, it establishes a methodology for future research and eases research reproducibility. We annotated the database proposed in [30], comprising four pieces of orchestral music recorded in an anechoic room, in order to obtain a score which is perfectly aligned with the anechoic recordings. Then, using the Roomsim software [31], we simulate a concert hall in order to obtain realistic multitrack recordings.

1.2. Applications. The proposed framework for score-informed source separation has been used to separate recordings by various orchestras. The recordings are processed automatically and stored in the multimodal repository Repovizz [32]. The repository serves the data through its API for several applications. The first application is called Instrument Emphasis, or orchestra focus, and allows for emphasizing a particular instrument over the full orchestra. The second application relates to the spatialization of the separated musical sources in virtual reality scenarios, and it is commonly known as Acoustic Rendering. Third, we propose an application for estimating the spatial locations of the instruments on the stage. All three applications are detailed in Section 7.

1.3. Outline. We introduce the architecture of the framework with its main parts in Section 2. Then, in Section 3, we give an outline of the baseline source separation system. Furthermore, we present the extensions of the baseline system: the initialization of the gains with score information (Section 4) and the note refinement (Section 4.1). Additionally, the proposed extension to the multichannel case is introduced in Section 5. We present the dataset and the evaluation procedures and discuss the results in Section 6. The demos and applications are described in Section 7.

2. Proposed Approach Overview

The diagram of the proposed framework is presented in Figure 1. The baseline system relies on training spectral templates for the instruments we aim to separate (Section 3.3). Then, we compute the spectrograms associated with the multichannel audio. The spectrograms, along with the score of the piece, are used to align the score to the audio. From the aligned score, we derive the gains matrix that serves as input for the NMF parameter estimation stage (Section 4), along with the learned spectral templates. Furthermore, the gains and the spectrogram are used to calculate a panning matrix (Section 3.1), which yields the contribution of each instrument in each channel. After the parameter estimation stage (Section 3.2), the gains are refined in order to improve the separation (Section 4.1). Then, the spectrograms of the separated sources are estimated using Wiener filtering (Section 3.5).

For the score alignment step, we use the system in [13], which aligns the score to a chosen microphone and achieved the best results in the MIREX score-following challenge (http://www.music-ir.org/mirex/wiki/2015:Real-time_Audio_to_Score_Alignment_(aka_Score_Following)_Results). However, other state-of-the-art alignment systems can be used at this step, since our final goal is to refine a given score with respect to each channel in order to minimize the errors in separation (Section 5.2). Accounting for that, we extend the model proposed in [6] and the gains refinement for monaural recordings in [11] to the case of score-informed multichannel source separation in the more complex scenario of orchestral music.


3. Baseline Method for Multichannel Source Separation

According to the baseline model in [6], the short-term complex valued Fourier transform (STFT) in time frame $t$ and frequency $f$ for channel $i = 1, \dots, I$, where $I$ is the total number of channels, is expressed as

$$x_i(f, t) \approx \hat{x}_i(f, t) = \sum_{j=1}^{J} m_{ij} s_j(f, t), \quad (1)$$

where $s_j(f, t)$ represents the estimation of the complex valued STFT computed for source $j = 1, \dots, J$, with $J$ the total number of sources. Note that, in this paper, we consider as a source, or instrument, one or more instruments of the same kind (e.g., a section of violins). Additionally, $m_{ij}$ is a mixing matrix of size $I \times J$ that accounts for the contribution of source $j$ to channel $i$. In addition, we denote $x_i(f, t)$ as the magnitude spectrogram and $m_{ij}$ as the real-valued panning matrix.

Under the NMF model described in [18], each source $s_j(f, t)$ is factored as a product of two matrices: $g_{jn}(t)$, the matrix which holds the gains, or activations, of the basis function corresponding to pitch $n$ at frame $t$, and $b_{jn}(f)$, the matrix which holds the bases, where $n = 1, \dots, N$ is defined as the pitch range for instrument $j$. Hence, source $j$ is modeled as

$$s_j(f, t) \approx \sum_{n=1}^{N} b_{jn}(f) \, g_{jn}(t). \quad (2)$$

The model represents a pitch for each source $j$ as a single template stored in the basis matrix $b_{jn}(f)$. The temporal activation of a template (e.g., onset and offset times for a note) is modeled using the gains matrix $g_{jn}(t)$. Under harmonicity constraints [18], the NMF model for the basis matrix is defined as

$$b_{jn}(f) = \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)), \qquad (3)$$

where $h = 1, \dots, H$ indexes the harmonics, $a_{jn}(h)$ is the amplitude of harmonic $h$ for note $n$ and instrument $j$, $f_0(n)$ is the fundamental frequency of note $n$, and $G(f)$ is the magnitude spectrum of the window function, so that the spectrum of a harmonic component at frequency $h f_0(n)$ is approximated by $G(f - h f_0(n))$.

Considering the model given in (3), the initial equation (2) for the computation of the magnitude spectrogram of source $j$ is expressed as

$$s_j(f,t) \approx \sum_{n=1}^{N} g_{jn}(t) \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)), \qquad (4)$$

and (1) for the factorization of the magnitude spectrogram of channel $i$ is rewritten as

$$\hat{x}_i(f,t) = \sum_{j=1}^{J} m_{ij} \sum_{n=1}^{N} g_{jn}(t) \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)). \qquad (5)$$
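To make the notation concrete, the following NumPy sketch builds the bases of (3), the per-source spectrograms of (4), and the channel estimates of (5). All shapes, the Gaussian window shape, and the function names are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def window_shape(F, width=2.0):
    # Magnitude spectrum G(f) of the analysis window, centered at bin 0
    # (assumed Gaussian here; the paper does not prescribe a shape).
    return np.exp(-0.5 * (np.arange(F) / width) ** 2)

def harmonic_bases(a, f0_bin, F):
    # b[j, n, f] = sum_h a[j, n, h] * G(f - h * f0_bin[n]), cf. equation (3).
    J, N, H = a.shape
    G = window_shape(F)
    b = np.zeros((J, N, F))
    for j in range(J):
        for n in range(N):
            for h in range(1, H + 1):
                center = int(h * f0_bin[n])
                if center >= F:
                    break
                dist = np.abs(np.arange(F) - center)        # |f - h f0(n)|
                b[j, n] += a[j, n, h - 1] * G[np.minimum(dist, F - 1)]
    return b

def estimate_channels(m, g, b):
    # s[j, f, t] per equation (4), then x_hat[i, f, t] per equation (5).
    s = np.einsum('jnf,jnt->jft', b, g)
    return np.einsum('ij,jft->ift', m, s)
```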

3.1. Panning Matrix Estimation. The panning matrix gives the contribution of each instrument to each channel and, as seen in (1), directly influences the separation of the sources. The panning matrix is estimated by calculating an overlapping mask which discriminates the time-frequency zones in which the partials of a source are not overlapped with the partials of other sources. Then, using the overlapping mask, a panning coefficient is computed for each pair of sources at each channel. The estimation algorithm in the baseline framework is described in [6].

3.2. Augmented NMF for Parameter Estimation. According to [33], the parameters of the NMF model are estimated by minimizing a cost function which measures the reconstruction error between the observed $x_i(f,t)$ and the estimated $\hat{x}_i(f,t)$. For flexibility, we use the beta-divergence cost function [34], which models popular cost functions for different values of $\beta$, such as the Euclidean (EUC) distance ($\beta = 2$), the Kullback-Leibler (KL) divergence ($\beta = 1$), and the Itakura-Saito (IS) divergence ($\beta = 0$).

The minimization procedure ensures that the distance between $x_i(f,t)$ and $\hat{x}_i(f,t)$ does not increase with each iteration, while preserving the nonnegativity of the bases and the gains. By these means, the magnitude spectrogram of a source is explained solely by additive reconstruction.

3.3. Timbre-Informed Signal Model. An advantage of the harmonic model is that templates can be learned for various instruments if appropriate training data is available. The RWC instrument database [15] offers recordings of solo instruments playing isolated notes along their whole corresponding pitch range. The method in [6] uses these recordings, along with the ground truth annotations, to learn instrument spectral templates for each note of each instrument. More details on the training procedure can be found in the original paper [6].

Once the basis functions $b_{jn}(f)$ corresponding to the spectral templates are learned, they can be used at the factorization stage in any orchestral setup which contains the targeted instruments. Thus, after training, the bases $b_{jn}(f)$ are kept fixed, while the gains $g_{jn}(t)$ are estimated during the factorization procedure.

3.4. Gains Estimation. The factorization procedure to estimate the gains $g_{jn}(t)$ considers the previously computed panning matrix $m_{ij}$ and the bases $b_{jn}(f)$ learned during the training stage. Consequently, we have the following update rule:

$$g_{jn}(t) \leftarrow g_{jn}(t)\, \frac{\sum_{f,i} m_{ij}\, b_{jn}(f)\, x_i(f,t)\, \hat{x}_i(f,t)^{\beta-2}}{\sum_{f,i} m_{ij}\, b_{jn}(f)\, \hat{x}_i(f,t)^{\beta-1}}. \qquad (6)$$
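As a hedged illustration, the multiplicative update of (6) can be written in a few lines of NumPy under the tensor shapes sketched above; the `eps` guard against division by zero is our addition.

```python
import numpy as np

def update_gains(g, b, m, x, x_hat, beta=1.3, eps=1e-12):
    # Equation (6): g[j, n, t] is scaled by a ratio of two sums over
    # channels i and frequencies f. x and x_hat are (I, F, T), b is
    # (J, N, F), m is (I, J). x_hat must be recomputed from (5) after
    # every update so that the divergence keeps decreasing.
    xh = np.maximum(x_hat, eps)
    num = np.einsum('ij,jnf,ift->jnt', m, b, x * xh ** (beta - 2))
    den = np.einsum('ij,jnf,ift->jnt', m, b, xh ** (beta - 1))
    return g * num / np.maximum(den, eps)
```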

3.5. From the Estimated Gains to the Separated Signals. The reconstruction of the sources is done by estimating the complex amplitude for each time-frequency bin. In the case of binary separation, a cell is entirely associated with a single source. However, when having many instruments, as in orchestral music, it is more advantageous to redistribute the energy proportionally over all sources, as in the Wiener filtering method [23].

(1) Initialize $b_{jn}(f)$ with the values learned in Section 3.3.
(2) Initialize the gains $g_{jn}(t)$ with score information.
(3) Initialize the mixing matrix $m_{ij}$ with the values learned in Section 3.1.
(4) Update the gains using equation (6).
(5) Repeat Step (4) until the algorithm converges (or a maximum number of iterations is reached).

Algorithm 1: Gain estimation method.

This model allows for estimating each separated source $s_j(t)$ from the mixture $x_i(t)$ using a generalized time-frequency Wiener filter over the short-time Fourier transform (STFT) domain, as in [3, 34].

Let $\alpha_{j'}$ be the Wiener filter of source $j'$, representing the relative energy contribution of the predominant source with respect to the energy of the multichannel mixed signal $x_i(t)$ at channel $i$:

$$\alpha_{j'}(t,f) = \frac{|A_{ij'}|^2\, |s_{j'}(f,t)|^2}{\sum_{j} |A_{ij}|^2\, |s_j(f,t)|^2}. \qquad (7)$$

Then the corresponding spectrogram of source $j'$ is estimated as

$$\hat{s}_{j'}(f,t) = \frac{\alpha_{j'}(t,f)}{|A_{ij'}|^2}\, x_i(f,t). \qquad (8)$$

The estimated source $\hat{s}_{j'}(t)$ is computed with the inverse overlap-add STFT of $\hat{s}_{j'}(f,t)$.

The estimated source magnitude spectrogram is computed using the gains $g_{jn}(t)$ estimated in Section 3.4 and the fixed basis functions $b_{jn}(f)$ learned in Section 3.3: $\hat{s}_j(f,t) = \sum_{n} g_{jn}(t)\, b_{jn}(f)$. Then, if we replace $s_j(f,t)$ with $\hat{s}_j(f,t)$ and consider the mixing matrix coefficients computed in Section 3.1, we can calculate the Wiener mask from (7) as

$$\alpha_{j'}(t,f) = \frac{m_{ij'}^2\, \hat{s}_{j'}(f,t)^2}{\sum_{j} m_{ij}^2\, \hat{s}_j(f,t)^2}. \qquad (9)$$

Using (8), we can apply the Wiener mask to the multichannel signal spectrogram, thus obtaining $\hat{s}_{j'}(f,t)$, the estimated predominant source spectrogram. Finally, we use the phase information from the original mixture signal $x_i(t)$ and, through the inverse overlap-add STFT, we obtain the estimated predominant source $\hat{s}_{j'}(t)$.
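A minimal sketch of this masking step, combining (9) with the estimated magnitude spectrograms, is given below; shapes and names are illustrative, and the inverse STFT is left to any standard overlap-add implementation.

```python
import numpy as np

def wiener_separate(X_i, m_i, s_hat, j_sel, eps=1e-12):
    # X_i: (F, T) complex STFT of channel i; m_i: (J,) panning coefficients
    # of the sources in that channel; s_hat: (J, F, T) magnitude estimates
    # (gains times bases). Returns the masked complex spectrogram of source
    # j_sel, which keeps the phase of the mixture, cf. equation (9).
    den = np.einsum('j,jft->ft', m_i ** 2, s_hat ** 2)
    alpha = (m_i[j_sel] ** 2) * s_hat[j_sel] ** 2 / np.maximum(den, eps)
    return alpha * X_i
```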

4. Gains Initialization with Score Information

In the baseline method [6], the gains are initialized following a transcription stage. In our case, the automatic alignment system yields a score which offers a representation analogous to the one obtained by transcription. To that extent, the output of the alignment is used to initialize the gains for the NMF based methods for score-informed source separation.

Although the alignment algorithm aims at fixing global misalignments, it does not account for local misalignments. In the case of score-informed source separation, having a better aligned score leads to better separation [11], since it increases the sparseness of the gains matrix by setting to zero the activations for the time frames in which a note is not played (i.e., the corresponding spectral template of the note in the basis matrix is not activated outside this time boundary). However, in a real-case scenario, the initialization of the gains derived from the MIDI score must take into account the local misalignments. This has traditionally been done by setting a tolerance window around the onsets and offsets [3, 5, 24] or by refining the gains after a number of NMF iterations and then reestimating the gains [11]. While the former integrates note refinement into the parametric model, the latter detects contours in the gains using image processing heuristics and explicitly associates them with meaningful entities such as notes. In this paper we present two methods for note refinement: in Section 4.1 we detail the method in [11], which is used as a baseline for our framework, and in Section 5.2 we adapt and improve this baseline for the multichannel case.

On these terms, if we account for errors of up to $d$ frames in the audio-to-score alignment, we need to increase the time interval around the onset and the offset of a MIDI note when we initialize the gains. Thus, the values in $g_{jn}(t)$ for instrument $j$ and the pitch corresponding to MIDI note $n$ are set to 1 for the frames where the MIDI note is played, as well as for the neighboring $d$ frames. The other values in $g_{jn}(t)$ are set to 0 and do not change during computation, while the values set to 1 evolve according to the energy distributed between the instruments.
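A possible initialization routine for this scheme is sketched below; the note-list format is a hypothetical encoding, not the one used by the authors.

```python
import numpy as np

def init_gains(notes, J, N, T, d):
    # notes: iterable of (instrument j, pitch n, onset_frame, offset_frame).
    # Frames inside [onset - d, offset + d] are set to 1 and evolve during
    # the NMF updates; everything else stays 0, and the multiplicative
    # updates never reactivate a zero gain.
    g = np.zeros((J, N, T))
    for j, n, on, off in notes:
        g[j, n, max(0, on - d):min(T, off + d + 1)] = 1.0
    return g
```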

Having initialized the gains, the classical augmented NMF factorization is applied to estimate the gains corresponding to each source $j$ in the mixture. The process is detailed in Algorithm 1.

4.1. Note Refinement. The note refinement method in [11] aims at associating the values in the gains matrix $g_{jn}(t)$ with notes. Therefore, it is applied after a certain number of iterations of Algorithm 1, when the gains matrix yields a meaningful distribution of the energy between the instruments.

The method is applied on each note separately, with the scope of refining the gains associated with the targeted note. The gains matrix $g_{jn}(t)$ can be understood as a greyscale image, with each element of the matrix representing a pixel of the image. A set of image processing heuristics is deployed to detect shapes and contours in this image, commonly known as blobs [35, p. 248].

Figure 2: After NMF, the resulting gains (a) are split in submatrices (b) and used to detect blobs [11].

As a result, each blob is associated with a single note, giving the onset and offset times for the note and its frequency contour. This representation further increases the sparsity of the gains $g_{jn}(t)$, yielding less interference and better separation.

As seen in Figure 2, the method considers an image patch around the pixels corresponding to the pitch of the note, with the onset and offset of the note given by the alignment stage plus the additional $d$ frames accounting for the local misalignments. In fact, the method works with the same submatrices of $g_{jn}(t)$ which were set to 1 according to their corresponding MIDI note during the gains initialization stage, as explained in Section 4. Therefore, for a given note $k = 1, \dots, K_j$, we process the submatrix $g^k_{jn}(t)$ of the gains matrix $g_{jn}(t)$, where $K_j$ is the total number of notes for instrument $j$.

The steps of the method in [11], image preprocessing, binarization, and blob selection, are explained in Sections 4.1.1, 4.1.2, and 4.1.3, respectively.

4.1.1. Image Preprocessing. The preprocessing stage ensures, through smoothing, that there are no energy discontinuities within an image patch. Furthermore, it gives more weight to the pixels situated closer to the central bin of the blob, in order to eliminate interference from neighboring notes (close in time and frequency) while still preserving vibratos or transitions between notes.

First, we convolve each row of the submatrix $g^k_{jn}(t)$ with a smoothing Gaussian filter [35, p. 86]. We choose a one-dimensional Gaussian:

$$w(t) = \frac{1}{\sqrt{2\pi}\,\phi}\, e^{-t^2/(2\phi^2)}, \qquad (10)$$

where $t$ is the time axis and $\phi$ is the standard deviation. Thus, each row vector of $g^k_{jn}(t)$ is convolved with $w(t)$, and the result is truncated in order to preserve the dimensions of the initial matrix by removing the mirrored frames.

Second, we penalize values in $g^k_{jn}(t)$ which are further away from the central bin by multiplying each column vector of this matrix with a one-dimensional Gaussian centered at the central frequency bin, represented by the vector $v(n)$:

$$v(n) = \frac{1}{\sqrt{2\pi}\,\nu}\, e^{-(n-\kappa)^2/(2\nu^2)}, \qquad (11)$$

where $n$ is the frequency axis, $\kappa$ is the position of the central frequency bin, and $\nu$ is the standard deviation. The values of these parameters are given in Section 6.4.1 as a part of the evaluation setup.

4.1.2. Image Binarization. Image binarization sets to zero the elements of the matrix $g^k_{jn}(t)$ which are lower than a threshold and to one the elements larger than the threshold. This yields a binary submatrix $\bar{g}^k_{jn}(t)$ associated with note $k$:

$$\bar{g}^k_{jn}(t) = \begin{cases} 0 & \text{if } g^k_{jn}(t) < \operatorname{mean}\bigl(g^k_{jn}(t)\bigr), \\ 1 & \text{if } g^k_{jn}(t) \geq \operatorname{mean}\bigl(g^k_{jn}(t)\bigr). \end{cases} \qquad (12)$$
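The preprocessing of (10)-(11) and the binarization of (12) for one note patch can be sketched with SciPy as follows, using the parameter values later reported in Section 6.4.1; the function and argument names are ours, and the normalization constants of the Gaussians are dropped since only relative weights matter here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def preprocess_and_binarize(g_note, kappa, phi=3.0, nu=4.0):
    # g_note: (bins, frames) patch of the gains around one note;
    # kappa: row index of the note's central frequency bin.
    smooth = gaussian_filter1d(g_note, sigma=phi, axis=1, mode='mirror')  # (10)
    n = np.arange(g_note.shape[0])
    v = np.exp(-0.5 * ((n - kappa) / nu) ** 2)     # (11), up to normalization
    weighted = smooth * v[:, None]
    return (weighted >= weighted.mean()).astype(np.uint8)                 # (12)
```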

4.1.3. Blob Selection. First, we detect blobs in each binary submatrix $\bar{g}^k_{jn}(t)$, using the connectivity rules described in [35, p. 248] and [27]. Second, from the detected blob candidates, we determine the best blob for each note, in a similar way to [11]. We assign a value to each blob depending on its area and on its overlap with the blobs corresponding to adjacent notes, which helps us penalize the overlap between blobs of adjacent notes.

As a first step, we penalize the parts of the blobs which overlap in time with blobs from the adjacent notes $k-1$ and $k+1$. This is done by weighting each element of $\bar{g}^k_{jn}(t)$ with a factor $\gamma$ depending on the amount of overlap with blobs from the adjacent notes. The resulting score matrix has the following expression:

$$\tilde{g}^k_{jn}(t) = \begin{cases} \gamma \cdot \bar{g}^k_{jn}(t) & \text{if } \bar{g}^k_{jn}(t) \wedge \bar{g}^{k-1}_{jn}(t) = 1, \\ \gamma \cdot \bar{g}^k_{jn}(t) & \text{if } \bar{g}^k_{jn}(t) \wedge \bar{g}^{k+1}_{jn}(t) = 1, \\ \bar{g}^k_{jn}(t) & \text{otherwise}, \end{cases} \qquad (13)$$

where $\gamma$ is a value in the interval $[0, 1]$.

Then we compute a score for each note $k$ and for each blob associated with the note, by summing up the elements of the score matrix $\tilde{g}^k_{jn}(t)$ which are part of the blob. The best blob candidate is the one with the highest score; it is associated with the note, and its boundaries give the note onset and offset.
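A sketch of this selection using connected components is given below; `scipy.ndimage.label` stands in for the connectivity rules of [35] and [27], and `gamma` and the neighbor-overlap masks are illustrative inputs rather than the authors' exact interface.

```python
import numpy as np
from scipy.ndimage import label

def best_blob(binary_note, overlap_prev, overlap_next, gamma=0.5):
    # binary_note: (bins, frames) binarized patch of one note;
    # overlap_prev / overlap_next: boolean (frames,) masks of the columns
    # covered by the best blobs of notes k-1 and k+1.
    score = binary_note.astype(float)
    score[:, overlap_prev | overlap_next] *= gamma   # penalty of equation (13)
    labels, n_blobs = label(binary_note)             # connected components
    best, best_score = np.zeros_like(binary_note), -np.inf
    for c in range(1, n_blobs + 1):
        blob = labels == c
        s = score[blob].sum()                        # blob score
        if s > best_score:
            best_score, best = s, blob.astype(binary_note.dtype)
    return best
```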

4.2. Gains Reinitialization and Recomputation. Having associated a blob with each note (Section 4.1.3), we discard obtrusive energy from the gains matrix $g_{jn}(t)$ by eliminating the pixels corresponding to the blobs which were not selected, making the matrix sparser. Furthermore, the energy excluded from an instrument's gains is redistributed to the other instruments, contributing to a better source separation. Thus, the gains are reinitialized with the information obtained from the corresponding blobs, and we can repeat the factorization in Algorithm 1 to recompute $g_{jn}(t)$. Note that the energy excluded by note refinement is set to zero in $g_{jn}(t)$ and will remain zero during the factorization.

In order to refine the gains $g_{jn}(t)$, we define a set of matrices $p^k_{jn}(t)$, derived from the matrices corresponding to the best blobs, which contain 1 only for the elements associated with the best blob and 0 otherwise. We rebuild the gains matrix $g_{jn}(t)$ from the set of submatrices $p^k_{jn}(t)$: for the corresponding bins $n$ and time frames $t$ of note $k$, we initialize the values in $g_{jn}(t)$ with the values in $p^k_{jn}(t)$. Then we reiterate the gains estimation with Algorithm 1. Finally, we obtain the spectrograms of the separated sources with the method described in Section 3.5.

5. PARAFAC Model for Multichannel Gains Estimation

Parallel factor analysis (PARAFAC) methods [28, 29] are mostly used under the nonnegative tensor factorization paradigm. By these means, the NMF model is extended to work with 3-valence tensors, where each slice of the tensor represents the spectrogram of a channel. Another approach is to stack up the spectrograms of all channels in a single matrix [36] and to perform a joint estimation of the spectrograms of the sources in all channels. Hence, we extend the NMF model in Section 3 to jointly estimate the gains matrices in all the channels.

5.1. Multichannel Gains Estimation. The algorithm described in Section 3 estimates the gains $g_{jn}(t)$ for source $j$ with respect to a single channel $i$, determined as the row of column $j$ of the panning matrix in which the element $m_{ij}$ has the maximum value. However, we argue that a better estimation can benefit from the information in all channels. To this extent, we can further include update rules for other parameters, such as the mixing matrix $m_{ij}$, which were otherwise kept fixed in Section 3, so that the factorization algorithm estimates the parameters jointly for all the channels.

We propose to integrate the information from all channels by concatenating their corresponding spectrogram matrices along the time axis, as in

$$x(f,t) = \left[\, x_1(f,t)\;\; x_2(f,t)\;\; \cdots\;\; x_I(f,t) \,\right]. \qquad (14)$$

We are interested in jointly estimating the gains of source $j$ in all the channels. Consequently, we concatenate in time the gains corresponding to each channel $i$, for $i = 1, \dots, I$, where $I$ is the total number of channels, as seen in (15). The new gains are initialized with identical score information, obtained from the alignment stage. However, during the estimation, the gains for channel $i$, $g^i_{jn}(t)$, evolve according to the corresponding spectrogram $x_i(f,t)$. Moreover, during the gains refinement stage, each gain is refined separately with respect to each channel:

$$g_{jn}(t) = \left[\, g^1_{jn}(t)\;\; g^2_{jn}(t)\;\; \cdots\;\; g^I_{jn}(t) \,\right]. \qquad (15)$$

In (5) we describe the factorization model for the estimated spectrogram, considering the mixing matrix, the bases, and the gains. Since we estimate a set of $I$ gains for each source $j = 1, \dots, J$, this results in $J$ estimations of the spectrograms corresponding to all the channels $i = 1, \dots, I$, as seen in

$$\hat{x}^j_i(f,t) = \sum_{j=1}^{J} m_{ij} \sum_{n=1}^{N} g^i_{jn}(t) \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)). \qquad (16)$$

Each iteration of the factorization algorithm yields additional information on the distribution of energy between each instrument and each channel. Therefore, we can include in the factorization an update rule for the mixing matrix $m_{ij}$, as in (17). By updating the mixing parameters at each factorization step, we can obtain a better estimation of $\hat{x}^j_i(f,t)$:

$$m_{ij} \leftarrow m_{ij}\, \frac{\sum_{f,t} b_{jn}(f)\, g^i_{jn}(t)\, x_i(f,t)\, \hat{x}_i(f,t)^{\beta-2}}{\sum_{f,t} b_{jn}(f)\, g^i_{jn}(t)\, \hat{x}_i(f,t)^{\beta-1}}. \qquad (17)$$
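Under the same illustrative shapes as before, the update of (17) mirrors the gains update; in this sketch the sums run over frequency, time, and pitch, which is an assumption on the indices left implicit in the rule.

```python
import numpy as np

def update_mixing(m, b, g, x, x_hat, beta=1.3, eps=1e-12):
    # b: (J, N, F); g: (I, J, N, T) per-channel gains as in (15);
    # x, x_hat: (I, F, T). Multiplicative update of equation (17).
    xh = np.maximum(x_hat, eps)
    num = np.einsum('jnf,ijnt,ift->ij', b, g, x * xh ** (beta - 2))
    den = np.einsum('jnf,ijnt,ift->ij', b, g, xh ** (beta - 1))
    return m * num / np.maximum(den, eps)
```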

Considering the above, the new rules to estimate the parameters are described in Algorithm 2.

(1) Initialize $b_{jn}(f)$ with the values learned in Section 3.3.
(2) Initialize the gains $g_{jn}(t)$ with score information.
(3) Initialize the panning matrix $m_{ij}$ with the values learned in Section 3.1.
(4) Update the gains using equation (6).
(5) Update the panning matrix using equation (17).
(6) Repeat Steps (4)-(5) until the algorithm converges (or a maximum number of iterations is reached).

Algorithm 2: Gain estimation method.

Note that the current model does not estimate the phases for each channel. In order to reconstruct source $j$, the model in Section 3 uses the phase of the signal corresponding to the channel $i$ in which the source has the maximum value in the panning matrix, as described in Section 3.5. Thus, in order to reconstruct the original signals, we solely rely on the gains estimated for that single channel $i$, in a similar way to the baseline method.

5.2. Multichannel Gains Refinement. As presented in Section 5.1, for a given source we obtain an estimation of the gains corresponding to each channel. Therefore, we can apply the note refinement heuristics, in a similar manner to Section 4.1, to each of the gains $[g^1_{jn}(t), \dots, g^I_{jn}(t)]$. Then we can average out the estimations over all the channels, making the blob detection more robust to the variance between the channels:

$$g'_{jn}(t) = \frac{\sum_{i=1}^{I} g^i_{jn}(t)}{I}. \qquad (18)$$

Having computed the mean over all channels as in (18), for each note $k = 1, \dots, K_j$ we process the submatrix $g^k_{jn}(t)$ of the new gains matrix $g'_{jn}(t)$, where $K_j$ is the total number of notes for instrument $j$. Specifically, we apply the same steps, preprocessing (Section 4.1.1), binarization (Section 4.1.2), and blob selection (Section 4.1.3), to each matrix $g^k_{jn}(t)$, and we obtain a binary matrix $p^k_{jn}(t)$ having 1s for the elements corresponding to the best blob and 0s for the rest.

Our hypothesis is that averaging out the gains over all channels makes blob detection more robust. However, when performing the averaging, we do not account for the delays between the channels. In order to compute the delay for a given channel, we can compute the best blob separately with the method in Section 4.1 and compare it with the one calculated from the averaged estimation ($p^k_{jn}(t)$). This step is equivalent to comparing the onset times of the two best blobs of the two estimations. Subtracting these onset times, we get the delay between the averaged estimation and the one obtained for the channel, and we can correct this in the matrix $p^k_{jn}(t)$: we zero-pad the beginning of $p^k_{jn}(t)$ with the number of zeros corresponding to the delay, or we remove the trailing zeros for a negative delay.
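The averaging of (18) and the delay correction might look as follows; `best_blob_onset` stands for the blob machinery of Section 4.1.3 and is a hypothetical helper, so this is a sketch rather than the authors' exact procedure.

```python
import numpy as np

def shift_mask(mask, delay):
    # Zero-pad the start for a positive delay, crop for a negative one.
    out = np.zeros_like(mask)
    if delay >= 0:
        out[:, delay:] = mask[:, :mask.shape[1] - delay]
    else:
        out[:, :delay] = mask[:, -delay:]
    return out

def refine_multichannel(g_channels, best_blob_onset):
    # g_channels: (I, bins, frames) per-channel gains of one note.
    g_mean = g_channels.mean(axis=0)                 # equation (18)
    onset_avg = best_blob_onset(g_mean)
    # re-align the averaged estimation to each channel's own timing
    return np.stack([shift_mask(g_mean, best_blob_onset(g_i) - onset_avg)
                     for g_i in g_channels])
```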

6. Materials and Evaluation

6.1. Dataset. The audio material used for evaluation was presented by Pätynen et al. [30] and consists of four passages of symphonic music from the Classical and Romantic periods. This work presented a set of anechoic recordings for each of the instruments, which were then synchronized between them so that they could later be combined into a mix of the orchestra. The musicians played in an anechoic chamber and, in order to be synchronous with the rest of the instruments, they followed a video featuring a conductor and a pianist playing each of the four pieces. Note that the benefit of having isolated recordings comes at the expense of ignoring the interactions between musicians, which commonly affect intonation and time-synchronization [37].

The four pieces differ in the number of instruments per instrument class, in style, dynamics, and size. The first passage is a soprano aria of Donna Elvira from the opera Don Giovanni by W. A. Mozart (1756-1791), corresponding to the Classical period and traditionally played by a small group of musicians. The second passage is from L. van Beethoven's (1770-1827) Symphony no. 7, featuring big chords and string crescendos. The chords and pauses make the reverberation tail of a concert hall clearly audible. The third passage is from Bruckner's (1824-1896) Symphony no. 8 and represents the late Romantic period. It features large dynamics and a large orchestra. Finally, G. Mahler's Symphony no. 1, also featuring a large orchestra, is another example of late Romanticism. The piece has a more complex texture than the one by Bruckner. Furthermore, according to the musicians who recorded the dataset, the last two pieces were also more difficult to play and record [30].

In order to keep the evaluation setup consistent between the four pieces, we focus on the following instruments: violin, viola, cello, double bass, oboe, flute, clarinet, horn, trumpet, and bassoon. All tracks of a single instrument were joined into a single track for each of the pieces.

For the selected instruments, we list the differences between the four pieces in Table 1. Note that in the original dataset the violins are separated into two groups. However, for brevity of evaluation, and because in our separation framework we do not consider sources sharing the same instrument templates, we decided to merge the violins into a single group. Note that the pieces by Mahler and Bruckner have divisi in the groups of violins, which implies a larger number of instruments playing different melody lines simultaneously. This results in a scenario which is more challenging for source separation.

We created a ground truth score by hand-annotating the notes played by the instruments. In order to facilitate this process, we first gathered the scores in MIDI format and automatically computed a global audio-to-score alignment using the method from [13], which has won the MIREX score-following challenge in the past years.

Table 1: Anechoic dataset [30] characteristics.

Piece      Duration     Period     Instrument sections   Number of tracks   Max tracks/instrument
Mozart     3 min 47 s   Classical  8                     10                 2
Beethoven  3 min 11 s   Classical  10                    20                 4
Mahler     2 min 12 s   Romantic   10                    30                 4
Bruckner   1 min 27 s   Romantic   10                    39                 12

Figure 3: The steps to create the multichannel recordings dataset: the anechoic recordings are denoised with the help of the score annotations and then spatialized into multichannel recordings.

Then we locally aligned the notes of each instrument by manually correcting the onsets and offsets to fit the audio. This was performed using Sonic Visualiser, with the guidance of the spectrogram and of a monophonic pitch estimation [38] computed for each of the isolated instruments. The annotation was performed by two of the authors, who cross-checked the work of their peer. Note that this dataset and the proposed annotation are useful not only for our particular task but also for the evaluation of multiple pitch estimation and automatic transcription algorithms in large orchestral settings, a context which has not been considered so far in the literature. The annotations can be found at the associated page (http://mtg.upf.edu/download/datasets/phenicx-anechoic).

During the recording process, detailed in [30], the gain of the microphone amplifiers was fixed to the same value for the whole process, which reduced the dynamic range of the recordings of the quieter instruments. This led to noisier recordings for most of the instruments. In Section 6.2 we describe the score-informed denoising procedure we applied to each track. From the denoised isolated recordings, we then used Roomsim to create a multichannel image, as detailed in Section 6.3. The steps necessary to pass from the anechoic recordings to the multichannel dataset are represented in Figure 3. The original files can be obtained from the Acoustics Group at Aalto University (http://research.cs.aalto.fi/acoustics/). For the denoising algorithm, please refer to http://mtg.upf.edu/download/datasets/phenicx-anechoic.

6.2. Dataset Denoising. The noise-related problems in the dataset were presented in [30]. We remove the noise in the recordings with the score-informed method in [39], which relies on learned noise spectral patterns. The main difference is that we rely on a manually annotated score, while in [39] the score is assumed to be misaligned, so further regularization is included to ensure that only certain note combinations in the score occur.

The annotated score yields the time intervals in which an instrument is not playing; thus, the noise pattern is learned only within those intervals. In this way, the method ensures that the desired noise, which is a part of the actual sound of the instrument, is preserved in the denoised recording.

Figure 4: The sources and the receivers (microphones) in the simulated room.

The algorithm takes each anechoic recording of a given instrument and removes the noise in the time intervals where the instrument is playing, while setting to zero the frames in which the instrument is not playing.

6.3. Dataset Spatialization. To simulate a large reverberant hall, we use the software Roomsim [31]. We define a configuration file which specifies the characteristics of the hall and, for each of the microphones, their position relative to each of the sources. The simulated room has dimensions similar to those of the Royal Concertgebouw in Amsterdam, one of the partners in the PHENICX project, and represents a setup in which we tested our framework. The simulated room's width, length, and height are 28 m, 40 m, and 12 m. The absorption coefficients are specified in Table 2.

The positions of the sources and microphones in the room are common for orchestral concerts (Figure 4). A configuration file is created for each microphone, containing its coordinates (e.g., (14, 17, 4) for the center microphone). Then each source is defined through polar coordinates relative to the microphone (e.g., (11.4455, -95.1944, -15.1954): radius, azimuth, and elevation for the bassoon relative to the center microphone). We selected all the microphones to be cardioid, in order to match the realistic setup of the Concertgebouw Hall.

Table 2: Room surface absorption coefficients.

Standard measurement frequencies (Hz)        125   250   500   1000  2000  4000
Absorption of wall in x = 0 plane            0.4   0.3   0.3   0.3   0.2   0.1
Absorption of wall in x = Lx plane           0.4   0.45  0.35  0.35  0.45  0.3
Absorption of wall in y = 0 plane            0.4   0.45  0.35  0.35  0.45  0.3
Absorption of wall in y = Ly plane           0.4   0.45  0.35  0.35  0.45  0.3
Absorption of floor, i.e., z = 0 plane       0.5   0.6   0.7   0.8   0.8   0.9
Absorption of ceiling, i.e., z = Lz plane    0.4   0.45  0.35  0.35  0.45  0.3

Figure 5: The reverberation time (RT60) versus frequency for the simulated room.

Using the configuration file and the anechoic audio files corresponding to the isolated sources, Roomsim generates the audio files for each of the microphones, along with the impulse responses for each pair of instruments and microphones. The impulse responses and the anechoic signals are used during the evaluation to obtain the ground truth spatial image of the sources in the corresponding microphone. Additionally, we plot in Figure 5 the reverberation time RT60 [31] computed by Roomsim across frequency.

We need to adapt the ground truth annotations to the audio generated with Roomsim, as the original annotations were made on the isolated audio files. Roomsim creates the audio for a given microphone by convolving each of the sources with the corresponding impulse response and then summing up the results of the convolutions. We compute a delay for each pair of microphones and instruments by taking the position of the maximum value in the associated impulse response vector. Then we generate a score for each of the pairs by adding the corresponding delay to the note onsets. Additionally, since the offset time depends on the reverberation and on the frequencies of the notes, we add 0.8 s to each note offset, besides the added delay, to account for the reverberation.
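This adaptation can be sketched as below; the note-list format and the helper name are assumptions, not the authors' code.

```python
import numpy as np

def adapt_score(notes, ir, fs, reverb_pad=0.8):
    # notes: list of (onset_s, offset_s) for one instrument; ir: impulse
    # response between that instrument and the microphone; fs: sample rate.
    # The delay is the position of the maximum of the impulse response,
    # and 0.8 s is added to the offsets to account for the reverberation.
    delay = np.argmax(np.abs(ir)) / fs
    return [(on + delay, off + delay + reverb_pad) for on, off in notes]
```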

6.4. Evaluation Methodology

6.4.1. Parameter Selection. In this paper we use a low-level spectral representation of the audio data, generated from a windowed FFT of the signal. We use a Hanning window with a size of 92 ms and a hop size of 11 ms.

A logarithmic frequency discretization is adopted, and two time-frequency resolutions are used. First, to estimate the instrument models and the panning matrix, a single-semitone resolution is used; we implement this time-frequency representation by integrating the STFT bins corresponding to the same semitone. Second, for the separation task, a higher resolution of 1/4 of a semitone is used, which has proven to achieve better separation results [6]; this representation is obtained by integrating the STFT bins corresponding to 1/4 of a semitone. Note that in the separation stage the learned basis functions $b_{jn}(f)$ are adapted to the 1/4-of-a-semitone resolution by replicating the basis value at each semitone to the 4 samples of the 1/4-of-a-semitone resolution that belong to this semitone. For the image preprocessing stage, we pick $\phi = 3$ as the standard deviation of the first Gaussian, and, for the second Gaussian, $\kappa = 4$ as the position of the central frequency bin and $\nu = 4$ as the standard deviation, corresponding to one semitone.
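The semitone integration can be implemented as a binary mapping matrix applied to the magnitude STFT; `fmin` (A0) and the eight-octave range below are our assumptions, since the paper does not state them.

```python
import numpy as np

def semitone_map(n_bins, fs, n_fft, fmin=27.5, bands_per_semitone=1):
    # Returns a (n_bands, n_bins) 0/1 matrix B such that B @ |X| integrates
    # the STFT bins falling into each 1/bands_per_semitone band.
    freqs = np.arange(n_bins) * fs / n_fft
    n_bands = 12 * bands_per_semitone * 8
    band = np.full(n_bins, -1)
    pos = freqs > 0
    band[pos] = np.floor(12 * bands_per_semitone
                         * np.log2(freqs[pos] / fmin)).astype(int)
    B = np.zeros((n_bands, n_bins))
    valid = (band >= 0) & (band < n_bands)
    B[band[valid], np.flatnonzero(valid)] = 1.0
    return B
```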

We picked 10 iterations for the NMF, and we set the beta-divergence distortion to $\beta = 1.3$, as in [6, 11].

6.4.2. Evaluation Setup. We perform three different kinds of evaluations: audio-to-score alignment, panning matrix estimation, and score-informed source separation. Note that for the alignment we evaluate the state-of-the-art system in [13]. This method does not align single notes but combinations of notes in the score (aka states). Here, the alignment is performed with respect to a single audio channel, corresponding to the microphone situated at the center of the stage. The offsets are estimated either by shifting the original duration of each note in the score [13] or by assigning the offset time as the onset of the next state. We denote these two cases as INT and NEX, respectively.

Regarding the initialization of the separation framework, we can use the raw output of the alignment system. However, as stated in Section 4 and in [3, 5, 24], a better option is to extend the onsets and offsets along a tolerance window, to account for the errors of the alignment system, for the delays between the center channel (on which the alignment is performed) and the other channels, and for possible errors in the alignment itself. Thus, we test two hypotheses regarding the tolerance window. In the first case we extend the boundaries by 0.3 s for onsets and 0.6 s for offsets (T1), and in the second by 0.6 s for onsets and 1 s for offsets (T2). Note that the value of 0.3 s for the onset times is not arbitrary but is the usual threshold for onsets in the MIREX evaluation of real-time score-following [40].

Table 3: Score information used for the initialization of score-informed source separation.

Tolerance window size              Offset estimation
T1: onsets 0.3 s, offsets 0.6 s    INT: interpolation of the offset time
T2: onsets 0.6 s, offsets 0.9 s    NEX: the offset is the onset of the next note

Two different tolerance windows were tested to account for the complexity of this novel scenario. The tolerance window is slightly larger for offsets, due to the reverberation time and because the ending of a note is not as clear as its onset. A summary of the score information used to initialize the source separation framework is found in Table 3.

We label the test case corresponding to the initialization with the raw output of the alignment system as Ali. Conversely, the test case corresponding to the tolerance window initialization is labeled as Ext. Furthermore, within the tolerance window we can refine the note onsets and offsets with the methods in Section 4.1 (Ref1) and Section 5.2 (Ref2), resulting in two more test cases. Since method Ref1 can only refine the score with respect to a single channel, its results are solely computed with respect to that channel. For the multichannel refinement Ref2, we report the results of the alignment of each instrument with respect to each microphone. A graphic of the initialization of the framework with the four test cases listed above (Ali, Ext, Ref1, and Ref2), along with the ground truth score initialization (GT), is found in Figure 7, where we also present the results for these cases in terms of source separation.

In order to evaluate the panning matrix estimation stage, we compute an ideal panning matrix based on the impulse responses generated by Roomsim during the creation of the multichannel audio (see Section 6.3). The ideal panning matrix gives the ideal contribution of each instrument to each channel, and it is computed by searching for the maximum in the impulse response vector corresponding to each instrument-channel pair, as in

$$m_{\mathrm{ideal}}(i,j) = \max_t \bigl(\mathrm{IR}(i,j)(t)\bigr), \qquad (19)$$

where $\mathrm{IR}(i,j)(t)$ is the impulse response of source $i$ in channel $j$. By comparing the estimated matrix $m(i,j)$ with the ideal one, $m_{\mathrm{ideal}}(i,j)$, we can determine whether the algorithm picked a wrong channel for separation.
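For instance, the check can be sketched as follows, with `irs` holding the Roomsim impulse responses; the shapes are illustrative.

```python
import numpy as np

def wrong_channel_sources(m_est, irs):
    # irs: (n_sources, n_channels, length) impulse responses; equation (19)
    # takes the per-pair maximum. Returns the sources whose strongest
    # estimated channel differs from the ideal one.
    m_ideal = np.abs(irs).max(axis=2)
    return np.flatnonzero(m_est.argmax(axis=1) != m_ideal.argmax(axis=1))
```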

6.4.3. Evaluation Metrics. For score alignment, we are interested in a measure which relates to source separation and accounts for the audio frames which are correctly detected, rather than in an alignment rate computed per note onset, as found in [13]. Thus, we evaluate the alignment at the frame level rather than at the note level. A similar reasoning on the evaluation of score alignment is found in [4].

We consider 0.011 s as the temporal granularity of this measure and the size of a frame. A frame of a musical note is considered a true positive (tp) if it is found in the ground truth score and in the aligned score within the exact time boundaries. The same frame is labeled as a false positive (fp) if it is found only in the aligned score, and as a false negative (fn) if it is found only in the ground truth score. Since the gains initialization is done with score information (see Section 4), lost frames (recall) and incorrectly detected frames (precision) impact the performance of the source separation algorithm. Precision is defined as $p = \mathrm{tp}/(\mathrm{tp} + \mathrm{fp})$ and recall as $r = \mathrm{tp}/(\mathrm{tp} + \mathrm{fn})$. Additionally, we compute the harmonic mean of precision and recall to obtain the $F$-measure: $F = 2 \cdot (p \cdot r)/(p + r)$.
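These formulas translate directly into code; the frame-set encoding below is a hypothetical choice.

```python
def frame_f_measure(gt_frames, aligned_frames):
    # gt_frames / aligned_frames: sets of (instrument, pitch, frame) tuples,
    # one frame spanning 0.011 s.
    tp = len(gt_frames & aligned_frames)
    fp = len(aligned_frames - gt_frames)
    fn = len(gt_frames - aligned_frames)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```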

The source separation evaluation framework and the metrics employed are described in [41, 42]. Correspondingly, we use the Source to Distortion Ratio (SDR), the Image to Spatial distortion Ratio (ISR), the Source to Interference Ratio (SIR), and the Source to Artifacts Ratio (SAR). While SDR measures the overall quality of the separation and ISR the spatial reconstruction of the source, SIR relates to the rejection of interferences and SAR to the absence of forbidden distortions and artifacts.

The evaluation of source separation is a computationally intensive process. Additionally, processing the long audio files in the dataset would require a large amount of memory for the matrix calculations. To reduce the memory requirements, the evaluation is performed in blocks of 30 s, with 1 s of overlap to allow for continuation.

6.5. Results

6.5.1. Score Alignment. We evaluate the output of the alignment (Ali), along with the two estimations of the note offsets (INT and NEX), in terms of $F$-measure, precision, and recall (see Section 6.4.3), ranging from 0 to 1. Additionally, we evaluate the optimal size for extending the note boundaries along the onsets and offsets (T1 and T2) for the refinement methods Ref1 and Ref2 and for the baseline Ext. Since there are many differences between the pieces, we report the results individually per piece in Table 4.

Methods Ref1 and Ref2 depend on a binarization threshold which determines how much energy is set to zero. A lower threshold results in the consolidation of larger blobs during blob detection. In [11] this threshold is set to 0.5, for a dataset of monaural recordings of Bach chorales played by four instruments. However, we are facing a multichannel scenario in which capturing the reverberation is important, especially considering that the offsets were annotated with a low energy threshold. Thus, we are interested in losing as little energy as possible, and we set lower values for the threshold: 0.3 and 0.1. Consequently, when analyzing the results, the lower threshold achieves better performance in terms of $F$-measure, both for Ref1 (0.67 versus 0.72) and for Ref2 (0.71 versus 0.75).

According to Table 4, extending the note offsets (NEX) rather than interpolating them (INT) gives a lower recall in all pieces, and this method leads to losing more frames, which cannot be recovered even by extending the offset times as in T2: NEX T2 always yields a lower recall than INT T2 (e.g., $r = 0.85$ compared to $r = 0.97$ for Mozart).

The output of the alignment system, Ali, is not a good option to initialize the gains of the source separation system: it has a high precision and a very low recall (e.g., for Beethoven, the case INT Ali has $p = 0.95$ and $r = 0.39$, compared to $p = 0.84$ and $r = 0.84$ for the case INT Ext).


Table 4: Alignment evaluated in terms of F-measure, precision, and recall, ranging from 0 to 1.

               Mozart            Beethoven         Mahler            Bruckner
               F     p     r     F     p     r     F     p     r     F     p     r
INT
  Ali          0.69  0.93  0.55  0.56  0.95  0.39  0.61  0.85  0.47  0.60  0.94  0.32
  T1  Ref1     0.77  0.77  0.76  0.79  0.81  0.77  0.63  0.74  0.54  0.72  0.73  0.70
      Ref2     0.83  0.79  0.88  0.82  0.82  0.81  0.67  0.77  0.60  0.81  0.77  0.85
      Ext      0.82  0.73  0.94  0.84  0.84  0.84  0.77  0.69  0.87  0.86  0.78  0.96
  T2  Ref1     0.77  0.77  0.76  0.76  0.75  0.77  0.63  0.74  0.54  0.72  0.70  0.71
      Ref2     0.83  0.78  0.88  0.82  0.76  0.87  0.67  0.77  0.59  0.79  0.73  0.86
      Ext      0.72  0.57  0.97  0.79  0.70  0.92  0.69  0.55  0.93  0.77  0.63  0.98
NEX
  Ali          0.49  0.94  0.33  0.51  0.89  0.36  0.42  0.90  0.27  0.48  0.96  0.44
  T1  Ref1     0.70  0.77  0.64  0.72  0.79  0.66  0.63  0.74  0.54  0.68  0.72  0.64
      Ref2     0.73  0.79  0.68  0.71  0.79  0.65  0.66  0.77  0.58  0.73  0.74  0.72
      Ext      0.73  0.77  0.68  0.71  0.80  0.65  0.69  0.75  0.64  0.76  0.79  0.72
  T2  Ref1     0.74  0.78  0.71  0.72  0.74  0.70  0.63  0.74  0.54  0.72  0.79  0.72
      Ref2     0.80  0.80  0.80  0.75  0.75  0.75  0.67  0.77  0.59  0.79  0.73  0.86
      Ext      0.73  0.65  0.85  0.73  0.69  0.77  0.72  0.64  0.82  0.79  0.69  0.91

Table 5: Instruments for which the closest microphone was incorrectly determined, for different score information (GT, Ali, T1, T2, INT, and NEX) and the two room setups.

         GT        INT Ali                INT T1           INT T2                 NEX Ali                NEX T1    NEX T2
Setup 1  Clarinet  Clarinet, double bass  Clarinet, flute  Clarinet, flute, horn  Clarinet, double bass  —         Clarinet, flute
Setup 2  Cello     Cello, flute           Cello, flute     Cello, flute           Bassoon                Flute     Cello, flute

For the case of Beethoven, the output of the alignment is particularly poor compared to the other pieces. However, by extending the boundaries (Ext) and applying note refinement (Ref1 or Ref2), we are able to increase the recall and match the performance on the other pieces.

When comparing the sizes of the tolerance window for the onsets and offsets, we observe that the alignment is more accurate when detecting the onsets within 0.3 s and the offsets within 0.6 s. In Table 4, T1 achieves better results than T2 (e.g., $F = 0.77$ for T1 compared to $F = 0.69$ for T2, for Mahler). Relying on a large window retrieves more frames but also significantly damages the precision. However, for source separation we might want to lose as little information as possible. It is in this special case that the refinement methods Ref1 and Ref2 show their importance. When facing larger time boundaries, as in T2, Ref1 and especially Ref2 are able to reduce the errors, achieving better precision with a minimal loss in recall.

The refinement Ref1 performs worse than Ref2, the multichannel refinement (e.g., $F = 0.72$ compared to $F = 0.81$ for Bruckner, INT T1). Note that in its original version [11], Ref1 assumed monophony within a source, as it was tested in the simple case of the Bach10 dataset [4]. To that extent, it relied on a graph computation to determine the best distribution of blobs. However, due to the increased polyphony within an instrument (e.g., violins playing divisi), with simultaneous melodic lines, we disabled this feature; in this case Ref1 has a lower recall and loses more frames. On the other hand, Ref2 is more robust because it computes a blob estimation per channel, and averaging out these estimations yields better results.

The refinement works worse for the more complex pieces (Mahler and Bruckner) than for the simpler ones (Mozart and Beethoven). Increasing the polyphony within a source and the number of instruments, with many interleaving melodic lines and a less sparse score, makes the task more difficult.

6.5.2. Panning Matrix. Estimating the panning matrix correctly is an important step in the proposed method, since Wiener filtering is performed on the channel in which the instrument has the most energy. If the algorithm picks a different channel for this step, we can find more interference between instruments in the separated audio files.

As described in Section 3.1, the estimation of the panning matrix depends on the number of nonoverlapping partials of the notes found in the score and on their alignment with the audio. To that extent, the more nonoverlapping partials we have, the more robust the estimation.

Initially, we experimented with computing the panning matrix separately for each piece. In the case of Bruckner, the piece is simply too short, and there are too few nonoverlapping partials to yield a good estimation, which results in errors in the panning matrix. Since the instrument setup is the same for the Bruckner, Beethoven, and Mahler pieces (10 sources in the same positions on the stage), we decided to jointly estimate the matrix for the concatenated audio pieces and the associated scores. We denote as Setup 1 the Mozart piece, played by 8 sources, and as Setup 2 the Beethoven, Mahler, and Bruckner pieces, played by 10 sources.

Since the panning matrix is computed using the score, different score information can yield a very different estimation of the panning matrix. To that extent, we evaluate the influence of the audio-to-score alignment, namely, the cases INT, NEX, Ali, T1, and T2, as well as the initialization with the ground truth score information, GT.

In Table 5 we list the instruments for which the algorithm picked the wrong channel. Note that in the room setup generated with Roomsim, most of the instruments in Table 5 are placed close to other sources from the same family of instruments, for example, cello next to double bass, and flute next to clarinet, bassoon, and oboe. In such cases the algorithm makes more mistakes when selecting the correct channel on which to perform the separation.

In the column GT of Table 5, we can see that having a perfectly aligned score yields fewer errors when estimating the panning matrix. Conversely, in a real-life scenario we cannot rely on a hand-annotated score. In that case (all columns of Table 5 excluding GT), the best estimation is obtained by the combination of NEX and T1: taking the offset time as the onset of the next note and then extending the score with the smaller tolerance window.

Furthermore, we compute the SDR values for the instruments in Table 5, column GT (clarinet and cello), for separation performed in the correct channel versus the estimated channel. For Setup 1, the channel for the clarinet is wrongly set to WWL (woodwinds left), the correct one being WWR (woodwinds right), even with a perfectly aligned score (GT). However, the microphones WWL and WWR are very close (see Figure 4) and do not capture significant energy from other instrument sections, so the SDR difference is less than 0.01 dB. In Setup 2, however, the cello is wrongly separated in the WWL channel, and the SDR difference between this audio and the audio separated in the correct channel is around -1.1 dB for each of the three pieces.

6.5.3. Source Separation. We use the evaluation metrics described in Section 6.4.3. Since there is a lot of variability between the four pieces, it is more informative to present the results per piece rather than aggregating them.

Figure 6: Results in terms of SDR, SIR, SAR, and ISR for the instruments and the songs in the dataset.

First, we analyze the separation results per instrument in an ideal case. We assume that the best results for score-informed source separation are obtained in the case of a perfectly aligned score (GT). Furthermore, for this case we calculate the separation in the correct channel for all the instruments, since in Section 6.5.2 we saw that picking a wrong channel can be detrimental. We present the results as a bar plot in Figure 6.

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6 we can see that the more complex the piece, the more difficult it is to achieve a good separation. For instance, cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for the cello in Mozart compared to -5 dB in Mahler). Cello and double bass are close to each other in both setups, and similarly flute and clarinet, so we expect interference between them. Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in the lower SIR values for the double bass (5.5 dB SIR for Mozart, but -1.8, -0.1, and -0.2 dB SIR for the other pieces) and for the flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask, the multichannel spectrogram is multiplied with the panning matrix. Thus, wrong values in this matrix can yield wrong amplitude values in the resulting signals.

This is the case for the trumpet, which is allocated a close microphone in the current setup and for which we therefore expect a good separation. However, the trumpet achieves a poor ISR (5.5, 1, and -1 dB) while having a good separation in terms of SIR and SAR. Similarly, other instruments, such as cello, double bass, flute, and viola, face the same problem, particularly for the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.

Figure 7: The test cases for the initialization of score-informed source separation, for the submatrix $p^s_{jn}(t)$ of note $s$: ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).

A low SDR is obtained in the case of Mahler, which relates to the poor alignment results obtained for this piece. As seen in Table 4, for the INT case the $F$-measure is almost 8% lower for Mahler than for the other pieces, mainly because of the bad precision.

The results are considerably worse for the double bass on the more complex pieces of Beethoven (-9.1 dB SDR), Mahler (-8.5 dB SDR), and Bruckner (-4.7 dB SDR); for further analysis we consider it an outlier and exclude it from the analysis.

Second, we want to evaluate the usefulness of note refinement in source separation. As seen in Section 4, the gains for the NMF separation are initialized with score information, or with a refined score as described in Section 4.2. A summary of the different initialization options is given in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization with the ground truth annotations (Figure 7 (GT)); the direct output of the score alignment system (Figure 7 (Ali)); the common practice of NMF gains initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7 (Ext)); and the refinement approaches (Figure 7 (Ref1 and Ref2)). Note that Ref1 is the refinement done with the method in Section 4.1, and Ref2 with the multichannel method described in Section 5.2.

We test the difference between the binarization thresholds 0.5 and 0.1 used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives an $F$-value of 0.0796 and a $p$ value of 0.7778, which shows no significant difference between the two binarization thresholds.

The results for the five initializations, GT, Ref1, Ref2, Ext, and Ali, are presented in Figure 8 for each of the four pieces. Note that for Ref1, Ref2, and Ext we aggregate the information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate between the instruments, with the piece by Mahler having the worst results and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors, such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues described in Section 6.1, can increase the difficulty of the separation up to the point that Ref1, Ref2, and Ext bring only a minimal improvement. To that extent, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps improve the separation, increasing the SDR by 1-1.5 dB or more in all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase the SDR for most of the pieces, with the exception of the piece by Mahler. This is due to an increase in SIR and a decrease of the interferences in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext, and the difference in SIR is around 2 dB. Then, for Beethoven, Ref1 and Ref2 gain 0.5 dB in SDR when compared to Ext, and 1.5 dB in SIR. For Bruckner, only Ref2 has a higher SDR; however, SIR increases by 1.5 dB for both Ref1 and Ref2. Note that Ref1 and Ref2 refine not only the time boundaries of the notes but also their frequency extent, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the estimation of the note offsets (INT and NEX) and of the tolerance window sizes (T1 and T2), which account for errors in the alignment. Note that in this case we do not include the refinement in the results: we evaluate only the case Ext, in order to isolate the influence of T1 and T2. The results are presented in Figure 9 and show that the best results are obtained when interpolating the offsets (INT). This relates to the results presented in Section 6.5.1. Similarly to the analysis regarding the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to draw a conclusion on which strategy is better for the initialization, as the error bars for the ground truth overlap with the ones of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gains estimation proposed in Section 5.1 and the single-channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR and obtained an $F$-value of 1.712 and a $p$-value of 0.1908. Hence, there is no significant difference between single-channel and multichannel gain estimation when no postprocessing of the gains with note refinement is performed. However, even though the new update rules do not help by themselves in the multichannel case, they allow us to better refine the gains: information is aggregated over all the channels, and blob detection becomes more robust, even to variations of the binarization threshold. On account of that, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available online (http://repovizz.upf.edu/phenicx/anechoic_multi).

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation is Instrument Emphasis, which aims at processing multitrack orchestral recordings.


Figure 8: Results in terms of SDR, SIR, SAR, and ISR (dB) for the NMF gains initialization in different test cases (GT, Ref1, Ref2, Ext, and Ali), for the pieces by Mozart, Beethoven, Mahler, and Bruckner.

Once we have the separated tracks, it allows emphasizing a particular instrument over the full orchestra downmix recording. Our workflow consists in processing a multichannel orchestral recording for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of other sections, obtaining an enhanced signal for the selected instrument.

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user in the uploading of media content and presents the results to the user. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks. After the

separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks. The separation quality is not degraded if the blocks have sufficient duration. In our case, we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx) and are integrated into the PHENICX prototype (http://phenicx.com).
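A minimal sketch of this block-wise processing, assuming a hypothetical separate_block function that wraps the separation of one block (the function name and the 1-minute block size are illustrative):

    import numpy as np

    BLOCK_SECONDS = 60  # 1-minute blocks, as used in our deployment

    def separate_in_blocks(audio, sample_rate, separate_block):
        """Split a long recording into blocks, separate each block,
        and concatenate the per-instrument results."""
        block_len = BLOCK_SECONDS * sample_rate
        outputs = {}  # instrument name -> list of separated blocks
        for start in range(0, len(audio), block_len):
            block = audio[start:start + block_len]
            separated = separate_block(block)  # dict: instrument -> signal
            for instrument, signal in separated.items():
                outputs.setdefault(instrument, []).append(signal)
        # Concatenate the blocks of each instrument into full-length tracks.
        return {name: np.concatenate(blocks) for name, blocks in outputs.items()}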

7.2. Acoustic Rendering. The second application for the separated material is augmented or virtual reality scenarios, where we apply a spatialization of the separated musical sources. Acoustic Rendering aims at recreating acoustically the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage and of the listener.

We have considered binaural synthesis as the most suitable spatial audio technique for this application. Humans locate the direction of incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound will arrive with a different intensity and at different time instances at both ears.


Figure 9: Results in terms of SDR (dB) for the combination between different offset estimation methods (INT and NEX) and different sizes of the tolerance window for note onsets and offsets (T1 and T2), for the pieces by Mozart, Beethoven, Mahler, and Bruckner; GT is the ground truth alignment and A is the output of the score alignment.

The idea behind binaural synthesis is to artificially generate these cues, to be able to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype, we have used the noncommercial Unity3D plugin for binaural synthesis provided by 3DCeption (https://twobigears.com/index.php).

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker, https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances), for the instrument visualization and Acoustic Rendering use-cases introduced above.

As inputs for the source localization method, we need the multichannel recordings and the approximate position of the microphones on stage. In concert halls, the recording setup typically consists of a grid structure of hanging overhead microphones. The position of the overhead microphones is therefore kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, it being currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes the refined score alignment obtained prior to the signal separation (see Section 4.1). It follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then, the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible location of a sound source based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate a list of time delay values corresponding to all note onsets in the score and take the maximum of the histogram. In our experiment, we have a time resolution of 28 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play isolated, as SRP-PHAT [45] does, and it can therefore be used in complex scenarios.
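A minimal sketch of this delay estimation, assuming hypothetical arrays holding the per-note onset times of one instrument as observed at two microphones (the names and the resolution value are illustrative):

    import numpy as np

    def estimate_delay(onsets_mic_a, onsets_mic_b, resolution=0.028):
        """Estimate the per-instrument time delay between two microphones
        from the per-note onset times (in seconds) observed at each mic."""
        # One delay candidate per annotated note onset.
        candidates = np.asarray(onsets_mic_b) - np.asarray(onsets_mic_a)
        # Quantize candidates to the analysis resolution and build a histogram.
        edges = np.arange(candidates.min(),
                          candidates.max() + 2 * resolution, resolution)
        hist, edges = np.histogram(candidates, bins=edges)
        # The most populated bin gives the TDOA estimate for this mic pair.
        best = np.argmax(hist)
        return 0.5 * (edges[best] + edges[best + 1])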

8. Outlook

In this paper, we proposed a framework for score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1), and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then, we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings is objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment, which improved the separation for three of the four pieces in the dataset when compared to two other initialization options for our framework: the raw output of the score alignment, and the tolerance window which the alignment relies on.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems in recovering the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done on a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach for source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone to separate a source is the closest one determined by the localization method.


When looking at separation across the instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies with the pieces and the instruments, and future research could provide more insight on this problem. Note that an increasing degree of consonance was related to a more difficult case for source separation [17]. Hence, we could expect a worse separation for instruments which are harmonizing or accompanying other sections, as is the case of viola, cello, and double bass in some pieces. Future research could give more insight on the relation between the musical characteristics of the pieces (e.g., tonality and texture) and source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us to denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow for other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework related to Instrument Emphasis or Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustin Martorell for his insights on the musicology side and for correcting the scores, and Oscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference 2013.

[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.

[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.

[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal on Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.

[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.

[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodriguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.

[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.

[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.

[9] F. J. Rodriguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.

[10] F. J. Rodriguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.

[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, October 2015.

[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the Concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.

[13] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, 2015.

[14] O. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.

[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.

[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, IEEE, Taipei, Taiwan, April 2009.

[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.

[18] F. J. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.

[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.

[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.

[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.

[22] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.

[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.

[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.

[25] B. Niedermayer, Accurate audio-to-score alignment-data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.

[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.

[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.

[28] D. Fitzgerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.

[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Málaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.

[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.

[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.

[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.

[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.

[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.

[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.

[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.

[38] M. Mauch and S. Dixon, "PYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.

[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.

[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.

[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.

[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.

[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.

[45] A. Martí Guerola, Multichannel audio processing for speaker localization, separation and enhancement, Universitat Politècnica de València, 2013.



Figure 1: The diagram representing the flow of operations in the system: the multichannel audio $x_i(f,t)$ and the score are used for score alignment and panning matrix estimation; the aligned score initializes the gains $g_{jn}(t)$, which, together with the instrument spectral templates $b_{jn}(f)$ and the panning matrix $m_{ij}$, enter the NMF parameter estimation, followed by gains refinement and separated source estimation $s_{j'}(t)$.

the number of sources surpasses the number of microphones. Therefore, recording the sound of an entire section also captures interference from other instruments and the reverberation of the concert hall. To that extent, our task is different from interference reduction in close-microphone recordings [6, 7, 23], these approaches being evaluated for pop concerts [23] or quartets [6]. Additionally, we do not target a blind source separation case as the previous systems [6, 7, 23] do. Subsequently, we adapt and improve the systems in [6, 11] by using information from all the channels, similar to parallel factor analysis (PARAFAC) in [28, 29].

With respect to the evaluation, to our knowledge this is the first time score-informed source separation is objectively evaluated in such a complex scenario. An objective evaluation provides a more precise estimation of the contribution of each part of the framework and its influence on the separation. Additionally, it establishes a methodology for future research and eases research reproducibility. We annotated the database proposed in [30], comprising four pieces of orchestral music recorded in an anechoic room, in order to obtain a score which is perfectly aligned with the anechoic recordings. Then, using the Roomsim software [31], we simulate a concert hall in order to obtain realistic multitrack recordings.

1.2. Applications. The proposed framework for score-informed source separation has been used to separate recordings by various orchestras. The recordings are processed automatically and stored in the multimodal repository Repovizz [32]. The repository serves the data through its API for several applications. The first application is called Instrument Emphasis or orchestra focus and allows for emphasizing a particular instrument over the full orchestra. The second application relates to the spatialization of the separated musical sources in virtual reality scenarios and is commonly known as Acoustic Rendering. Third, we propose an application for estimating the spatial locations of the instruments on the stage. All three applications are detailed in Section 7.

1.3. Outline. We introduce the architecture of the framework with its main parts in Section 2. Then, in Section 3, we give an outline of the baseline source separation system. Furthermore, we present the extensions of the baseline system: the initialization of the gains with score information (Section 4) and the note refinement (Section 4.1). Additionally, the proposed extension to the multichannel case is introduced in Section 5. We present the dataset and the evaluation procedures and discuss the results in Section 6. The demos and applications are described in Section 7.

2. Proposed Approach Overview

The diagram of the proposed framework is presented in Figure 1. The baseline system relies on training spectral templates for the instruments we aim to separate (Section 3.3). Then, we compute the spectrograms associated with the multichannel audio. The spectrograms, along with the score of the piece, are used to align the score to the audio. From the aligned score we derive a gains matrix that serves as an input for the NMF parameter estimation stage (Section 4), along with the learned spectral templates. Furthermore, the gains and the spectrogram are used to calculate a panning matrix (Section 3.1), which yields the contribution of each instrument in each channel. After the parameter estimation stage (Section 3.2), the gains are refined in order to improve the separation (Section 4.1). Then, the spectrograms of the separated sources are estimated using Wiener filtering (Section 3.5).

For the score alignment step, we use the system in [13], which aligns the score to a chosen microphone and achieved the best results in the MIREX score-following challenge (http://www.music-ir.org/mirex/wiki/2015:Real-time_Audio_to_Score_Alignment_(a.k.a_Score_Following)_Results). However, other state-of-the-art alignment systems can be used at this step, since our final goal is to refine a given score with respect to each channel in order to minimize the errors in separation (Section 5.2). Accounting for that, we extend the model proposed by [6] and the gains refinement for monaural recordings in [11] to the case of score-informed multichannel source separation in the more complex scenario of orchestral music.


3. Baseline Method for Multichannel Source Separation

According to the baseline model in [6], the short-term complex valued Fourier transform (STFT) at time frame $t$ and frequency $f$ for channel $i = 1, \dots, I$, where $I$ is the total number of channels, is expressed as

$$x_i(f,t) \approx \hat{x}_i(f,t) = \sum_{j=1}^{J} m_{ij}\, s_j(f,t), \quad (1)$$

where $s_j(f,t)$ represents the estimation of the complex valued STFT computed for source $j = 1, \dots, J$, with $J$ the total number of sources. Note that in this paper we consider as a source or instrument one or more instruments of the same kind (e.g., a section of violins). Additionally, $m_{ij}$ is a mixing matrix of size $I \times J$ that accounts for the contribution of source $j$ to channel $i$. In addition, we denote by $x_i(f,t)$ the magnitude spectrogram and by $m_{ij}$ the real-valued panning matrix.

Under the NMF model described in [18], each source $s_j(f,t)$ is factored as a product of two matrices: $g_{jn}(t)$, the matrix which holds the gains or activations of the basis function corresponding to pitch $n$ at frame $t$, and $b_{jn}(f)$, the matrix which holds the bases, where $n = 1, \dots, N$ is defined as the pitch range for instrument $j$. Hence, source $j$ is modeled as

$$s_j(f,t) \approx \sum_{n=1}^{N} b_{jn}(f)\, g_{jn}(t). \quad (2)$$

The model represents a pitch for each source $j$ as a single template stored in the basis matrix $b_{jn}(f)$. The temporal activation of a template (e.g., onset and offset times for a note) is modeled using the gains matrix $g_{jn}(t)$. Under harmonicity constraints [18], the NMF model for the basis matrix is defined as

$$b_{jn}(f) = \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)), \quad (3)$$

where $h = 1, \dots, H$ indexes the harmonics, $a_{jn}(h)$ is the amplitude of harmonic $h$ for note $n$ and instrument $j$, $f_0(n)$ is the fundamental frequency of note $n$, and $G(f)$ is the magnitude spectrum of the window function; the spectrum of a harmonic component at frequency $h f_0(n)$ is approximated by $G(f - h f_0(n))$.

Considering the model given in (3), the initial equation (2) for the computation of the magnitude spectrogram of source $j$ is expressed as

$$s_j(f,t) \approx \sum_{n=1}^{N} g_{jn}(t) \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)), \quad (4)$$

and (1) for the factorization of the magnitude spectrogram of channel $i$ is rewritten as

$$\hat{x}_i(f,t) = \sum_{j=1}^{J} m_{ij} \sum_{n=1}^{N} g_{jn}(t) \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)). \quad (5)$$

3.1. Panning Matrix Estimation. The panning matrix gives the contribution of each instrument in each channel and, as seen in (1), directly influences the separation of the sources. The panning matrix is estimated by calculating an overlapping mask which discriminates the time-frequency zones in which the partials of a source are not overlapped with the partials of other sources. Then, using the overlapping mask, a panning coefficient is computed for each pair of sources at each channel. The estimation algorithm in the baseline framework is described in [6].

3.2. Augmented NMF for Parameter Estimation. According to [33], the parameters of the NMF model are estimated by minimizing a cost function which measures the reconstruction error between the observed $x_i(f,t)$ and the estimated $\hat{x}_i(f,t)$. For flexibility reasons, we use the beta-divergence cost function [34], which allows for modeling popular cost functions for different values of $\beta$, such as the Euclidean (EUC) distance ($\beta = 2$), the Kullback-Leibler (KL) divergence ($\beta = 1$), and the Itakura-Saito (IS) divergence ($\beta = 0$).

The minimization procedure assures that the distance between $x_i(f,t)$ and $\hat{x}_i(f,t)$ does not increase with each iteration, while accounting for the nonnegativity of the bases and the gains. By these means, the magnitude spectrogram of a source is explained solely by additive reconstruction.

3.3. Timbre-Informed Signal Model. An advantage of the harmonic model is that templates can be learned for various instruments if the appropriate training data is available. The RWC instrument database [15] offers recordings of solo instruments playing isolated notes along their whole corresponding pitch range. The method in [6] uses these recordings, along with the ground truth annotations, to learn instrument spectral templates for each note of each instrument. More details on the training procedure can be found in the original paper [6].

Once the basis functions $b_{jn}(f)$ corresponding to the spectral templates are learned, they can be used at the factorization stage in any orchestral setup which contains the targeted instruments. Thus, after training, the bases $b_{jn}(f)$ are kept fixed, while the gains $g_{jn}(t)$ are estimated during the factorization procedure.

3.4. Gains Estimation. The factorization procedure to estimate the gains $g_{jn}(t)$ considers the previously computed panning matrix $m_{ij}$ and the bases $b_{jn}(f)$ learned during the training stage. Consequently, we have the following update rule:

$$g_{jn}(t) \leftarrow g_{jn}(t)\, \frac{\sum_{f,i} m_{ij}\, b_{jn}(f)\, x_i(f,t)\, \hat{x}_i(f,t)^{\beta-2}}{\sum_{f,i} m_{ij}\, b_{jn}(f)\, \hat{x}_i(f,t)^{\beta-1}}. \quad (6)$$

3.5. From the Estimated Gains to the Separated Signals. The reconstruction of the sources is done by estimating the complex amplitude for each time-frequency bin. In the case of binary separation, a cell is entirely associated with a single source. However, when many instruments play simultaneously, as in orchestral music, it is more advantageous to redistribute


(1) Initialize $b_{jn}(f)$ with the values learned in Section 3.3.
(2) Initialize the gains $g_{jn}(t)$ with score information.
(3) Initialize the mixing matrix $m_{ij}$ with the values learned in Section 3.1.
(4) Update the gains using equation (6).
(5) Repeat Step (4) until the algorithm converges (or the maximum number of iterations is reached).

Algorithm 1: Gain estimation method.

energy proportionally over all sources, as in the Wiener filtering method [23].

This model allows for estimating each separated source $s_j(t)$ from mixture $x_i(t)$ using a generalized time-frequency Wiener filter over the short-time Fourier transform (STFT) domain, as in [3, 34].

Let $\alpha_{j'}$ be the Wiener filter of source $j'$, representing the relative energy contribution of the predominant source with respect to the energy of the multichannel mixed signal $x_i(t)$ at channel $i$:

$$\alpha_{j'}(t,f) = \frac{|A_{ij'}|^2\, |s_{j'}(f,t)|^2}{\sum_{j} |A_{ij}|^2\, |s_j(f,t)|^2}. \quad (7)$$

Then, the corresponding spectrogram of source $j'$ is estimated as

$$\hat{s}_{j'}(f,t) = \frac{\alpha_{j'}(t,f)}{|A_{ij'}|^2}\, x_i(f,t). \quad (8)$$

The estimated source $\hat{s}_{j'}(t)$ is computed with the inverse overlap-add STFT of $\hat{s}_{j'}(f,t)$.

The estimated source magnitude spectrogram is computed using the gains $g_{jn}(t)$ estimated in Section 3.4 and the fixed basis functions $b_{jn}(f)$ learned in Section 3.3: $\hat{s}_j(f,t) = \sum_n g_{jn}(t)\, b_{jn}(f)$. Then, if we replace $s_j(f,t)$ with $\hat{s}_j(f,t)$, and if we consider the mixing matrix coefficients computed in Section 3.1, we can calculate the Wiener mask from (7):

$$\alpha_{j'}(t,f) = \frac{m_{ij'}^2\, \hat{s}_{j'}(f,t)^2}{\sum_{j} m_{ij}^2\, \hat{s}_j(f,t)^2}. \quad (9)$$

Using (8), we can apply the Wiener mask to the multichannel signal spectrogram, thus obtaining $\hat{s}_{j'}(f,t)$, the estimated predominant source spectrogram. Finally, we use the phase information from the original mixture signal $x_i(t)$ and, through inverse overlap-add STFT, we obtain the estimated predominant source $\hat{s}_{j'}(t)$.

4. Gains Initialization with Score Information

In the baseline method [6], the gains are initialized following a transcription stage. In our case, the automatic alignment system yields a score which offers a representation analogous to the one obtained by transcription. To that extent, the output of the alignment is used to initialize the gains for the NMF based methods for score-informed source separation.

Although the alignment algorithm aims at fixing global misalignments, it does not account for local misalignments. In the case of score-informed source separation, having a better aligned score leads to better separation [11], since it increases the sparseness of the gains matrix by setting to zero the activations for the time frames in which a note is not played (i.e., the corresponding spectral template of the note in the basis matrix is not activated outside this time boundary). However, in a real-case scenario, the initialization of the gains derived from the MIDI score must take into account the local misalignments. This has traditionally been done by setting a tolerance window around the onsets and offsets [3, 5, 24], or by refining the gains after a number of NMF iterations and then reestimating them [11]. While the former integrates note refinement into the parametric model, the latter detects contours in the gains using image processing heuristics and explicitly associates them with meaningful entities such as notes. In this paper we present two methods for note refinement: in Section 4.1 we detail the method in [11], which is used as a baseline for our framework, and in Section 5.2 we adapt and improve this baseline for the multichannel case.

On these terms, if we account for errors of up to $d$ frames in the audio-to-score alignment, we need to increase the time interval around the onset and the offset of a MIDI note when we initialize the gains. Thus, the values in $g_{jn}(t)$ for instrument $j$ and pitch corresponding to a MIDI note $n$ are set to 1 for the frames in which the MIDI note is played, as well as for the neighboring $d$ frames. The other values in $g_{jn}(t)$ are set to 0 and do not change during the computation, while the values set to 1 evolve according to the energy distributed between the instruments.
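A minimal sketch of this initialization, assuming a hypothetical list of aligned notes given as (pitch_index, onset_frame, offset_frame) triples for one instrument (this structure and the parameter values are illustrative):

    import numpy as np

    def init_gains(notes, n_pitches, n_frames, d):
        """Initialize the gains matrix g_jn(t) for one instrument from the
        aligned score, with a tolerance of d frames around onsets/offsets."""
        g = np.zeros((n_pitches, n_frames))
        for pitch, onset, offset in notes:
            start = max(0, onset - d)
            end = min(n_frames, offset + d)
            g[pitch, start:end] = 1.0  # only these cells may receive energy
        return g

    # Example: two notes of a hypothetical instrument, 5-frame tolerance.
    g = init_gains([(60, 100, 180), (62, 200, 260)],
                   n_pitches=128, n_frames=1000, d=5)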

Having initialized the gains, the classical augmented NMF factorization is applied to estimate the gains corresponding to each source $j$ in the mixture. The process is detailed in Algorithm 1.

4.1. Note Refinement. The note refinement method in [11] aims at associating the values in the gains matrix $g_{jn}(t)$ with notes. Therefore, it is applied after a certain number of iterations of Algorithm 1, when the gains matrix yields a meaningful distribution of the energy between instruments.

The method is applied to each note separately, with the scope of refining the gains associated with the targeted note. The gains matrix $g_{jn}(t)$ can be understood as a greyscale image, with each element in the matrix representing a pixel in the image. A set of image processing heuristics is deployed to detect shapes and contours in this image, commonly known


Figure 2: After NMF, the resulting time gains $g_{jn}(t)$ (a) are split into submatrices (b) and used to detect blobs [11].

as blobs [35, p. 248]. As a result, each blob is associated with a single note, giving the onset and offset times for the note and its frequency contour. This representation further increases the sparsity of the gains $g_{jn}(t)$, yielding less interference and better separation.

As seen in Figure 2, the method considers an image patch around the pixels corresponding to the pitch of the note, and to the onset and offset of the note given by the alignment stage, plus the additional $d$ frames accounting for the local misalignments. In fact, the method works with the same submatrices of $g_{jn}(t)$ which were set to 1 according to their corresponding MIDI note during the gains initialization stage, as explained in Section 3.4. Therefore, for a given note $k = 1, \dots, K_j$, we process a submatrix $g^k_{jn}(t)$ of the gains matrix $g_{jn}(t)$, where $K_j$ is the total number of notes for instrument $j$.

The steps of the method in [11], image preprocessing, binarization, and blob selection, are explained in Sections 4.1.1, 4.1.2, and 4.1.3.

4.1.1. Image Preprocessing. The preprocessing stage ensures, through smoothing, that there are no energy discontinuities within an image patch. Furthermore, it gives more weight to the pixels situated closer to the central bin of the blob, in order to eliminate interference from neighboring notes (close in time and frequency), while still preserving vibratos or transitions between notes.

First, we convolve each row of the submatrix $g^k_{jn}(t)$ with a smoothing Gaussian filter [35, p. 86]. We choose a one-dimensional Gaussian filter:

$$w(t) = \frac{1}{\sqrt{2\pi}\,\phi}\, e^{-t^2 / (2\phi^2)}, \quad (10)$$

where $t$ is the time axis and $\phi$ is the standard deviation. Thus, each row vector of $g^k_{jn}(t)$ is convolved with $w(t)$, and the result is truncated in order to preserve the dimensions of the initial matrix by removing the mirrored frames.

Second, we penalize values in $g^k_{jn}(t)$ which are further away from the central bin by multiplying each column vector of this matrix with a one-dimensional Gaussian centered at the central frequency bin, represented by the vector $v(n)$:

$$v(n) = \frac{1}{\sqrt{2\pi}\,\nu}\, e^{-(n-\kappa)^2 / (2\nu^2)}, \quad (11)$$

where $n$ is the frequency axis, $\kappa$ is the position of the central frequency bin, and $\nu$ is the standard deviation. The values of the parameters above are given in Section 6.4.2 as a part of the evaluation setup.

4.1.2. Image Binarization. Image binarization sets to zero the elements of the matrix $g^k_{jn}(t)$ which are lower than a threshold, and to one the elements larger than the threshold. This yields a binary submatrix $\bar{g}^k_{jn}(t)$ associated with note $k$:

$$\bar{g}^k_{jn}(t) = \begin{cases} 0, & \text{if } g^k_{jn}(t) < \operatorname{mean}(g^k_{jn}(t)), \\ 1, & \text{if } g^k_{jn}(t) \ge \operatorname{mean}(g^k_{jn}(t)). \end{cases} \quad (12)$$

4.1.3. Blob Selection. First, we detect blobs in each binary submatrix $\bar{g}^k_{jn}(t)$, using the connectivity rules described in [35, p. 248] and [27]. Second, from the detected blob candidates, we determine the best blob for each note, in a similar way to [11]. We assign a value to each blob depending on its area and on the overlap with the blobs corresponding to adjacent notes, which will help us penalize the overlap between blobs of adjacent notes.

As a first step, we penalize the parts of the blobs which overlap in time with blobs from the adjacent notes $k-1$ and $k+1$. This is done by weighting each element in $\bar{g}^k_{jn}(t)$ with a factor $\gamma$, depending on the amount of overlap with blobs from adjacent notes. The resulting score matrix has the following expression:

$$\tilde{g}^k_{jn}(t) = \begin{cases} \gamma \cdot \bar{g}^k_{jn}(t), & \text{if } \bar{g}^k_{jn}(t) \wedge \bar{g}^{k-1}_{jn}(t) = 1, \\ \gamma \cdot \bar{g}^k_{jn}(t), & \text{if } \bar{g}^k_{jn}(t) \wedge \bar{g}^{k+1}_{jn}(t) = 1, \\ \bar{g}^k_{jn}(t), & \text{otherwise}, \end{cases} \quad (13)$$

where $\gamma$ is a value in the interval $[0,1]$.

Then, we compute a score for each note and for each blob associated with the note, by summing up the elements in the score matrix $\tilde{g}^k_{jn}(t)$ which are part of the blob. The best blob candidate is the one with the highest score, and it is further associated with the note, its boundaries giving the note onset and offset.

4.2. Gains Reinitialization and Recomputation. Having associated a blob with each note (Section 4.1.3), we discard obtrusive energy from the gains matrix $g_{jn}(t)$ by eliminating the pixels corresponding to the blobs which were not selected, making the matrix sparser. Furthermore, the energy excluded from an instrument's gains is redistributed to the other instruments, contributing to a better source separation. Thus, the gains are reinitialized with the information obtained from the corresponding blobs, and we can repeat the factorization in Algorithm 1 to recompute $g_{jn}(t)$. Note that the energy which is excluded by note refinement is set to zero in $g_{jn}(t)$ and will remain zero during the factorization.

In order to refine the gains $g_{jn}(t)$, we define a set of matrices $p^k_{jn}(t)$, derived from the matrices corresponding to the best blobs, which contain 1 only for the elements associated with the best blob and 0 otherwise. We rebuild the gains matrix $g_{jn}(t)$ from the set of submatrices $p^k_{jn}(t)$: for the corresponding bins $n$ and time frames $t$ of note $k$, we initialize the values in $g_{jn}(t)$ with the values in $p^k_{jn}(t)$. Then, we reiterate the gains estimation with Algorithm 1. Furthermore, we obtain the spectrograms of the separated sources with the method described in Section 3.5.

5. PARAFAC Model for Multichannel Gains Estimation

Parallel factor analysis methods (PARAFAC) [28, 29] are mostly used under the nonnegative tensor factorization paradigm. By these means, the NMF model is extended to work with 3-valence tensors, where each slice of the tensor represents the spectrogram of a channel. Another approach is to stack up the spectrograms of all channels in a single matrix [36] and perform a joint estimation of the spectrograms of the sources in all channels. Hence, we extend the NMF model in Section 3 to jointly estimate the gains matrices in all the channels.

5.1. Multichannel Gains Estimation. The algorithm described in Section 3 estimates the gains $g_{nj}$ of source $j$ with respect to a single channel $i$, determined as the row corresponding to the maximum value in column $j$ of the panning matrix. However, we argue that a better estimation can benefit from the information in all channels. To this extent, we can further include update rules for other parameters, such as the mixing matrix $m_{ij}$, which were otherwise kept fixed in Section 3, because the factorization algorithm estimates the parameters jointly for all the channels.

We propose to integrate information from all channels by concatenating their corresponding spectrogram matrices along the time axis, as in

$$x(f,t) = \left[ x_1(f,t)\;\; x_2(f,t)\; \cdots\; x_I(f,t) \right]. \quad (14)$$

We are interested in jointly estimating the gains $g_{nj}$ of source $j$ in all the channels. Consequently, we concatenate in time the gains corresponding to each channel $i$, for $i = 1, \dots, I$, where $I$ is the total number of channels, as seen in (15). The new gains are initialized with identical score information obtained from the alignment stage. However, during the estimation, the gains for channel $i$, $g^i_{nj}(t)$, evolve according to the corresponding spectrogram $x_i(f,t)$. Moreover, during the gains refinement stage, each gain is refined separately with respect to each channel:

$$g_{nj}(t) = \left[ g^1_{nj}(t)\;\; g^2_{nj}(t)\; \cdots\; g^I_{nj}(t) \right]. \quad (15)$$

In (5) we describe the factorization model for the estimated spectrogram, considering the mixing matrix, the bases, and the gains. Since we estimate a set of $I$ gains for each source $j = 1, \dots, J$, this results in $J$ estimations of the spectrograms corresponding to all the channels $i = 1, \dots, I$, as seen in

$$\hat{x}^j_i(f,t) = \sum_{j=1}^{J} m_{ij} \sum_{n=1}^{N} g^i_{jn}(t) \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)). \quad (16)$$

Each iteration of the factorization algorithm yields additional information regarding the distribution of energy between each instrument and each channel. Therefore, we can include in the factorization update rules for the mixing matrix $m_{ij}$, as in (17). By updating the mixing parameters at each factorization step, we can obtain a better estimation of $\hat{x}^j_i(f,t)$:

$$m_{ij} \leftarrow m_{ij}\, \frac{\sum_{f,t} b_{jn}(f)\, g_{nj}(t)\, x_i(f,t)\, \hat{x}_i(f,t)^{\beta-2}}{\sum_{f,t} b_{jn}(f)\, g_{nj}(t)\, \hat{x}_i(f,t)^{\beta-1}}. \quad (17)$$

Considering the above, the new rules to estimate the parameters are described in Algorithm 2.


(1) Initialize $b_{jn}(f)$ with the values learned in Section 3.3.
(2) Initialize the gains $g_{jn}(t)$ with score information.
(3) Initialize the panning matrix $m_{ij}$ with the values learned in Section 3.1.
(4) Update the gains using equation (6).
(5) Update the panning matrix using equation (17).
(6) Repeat Steps (4)-(5) until the algorithm converges (or the maximum number of iterations is reached).

Algorithm 2: Gain estimation method.

Note that the current model does not estimate the phases for each channel. In order to reconstruct source $j$, the model in Section 3 uses the phase of the signal corresponding to the channel $i$ where the panning matrix has its maximum value, as described in Section 3.5. Thus, in order to reconstruct the original signals, we can rely solely on the gains estimated in that single channel $i$, in a similar way to the baseline method.

5.2. Multichannel Gains Refinement. As presented in Section 5.1, for a given source we obtain an estimation of the gains corresponding to each channel. Therefore, we can apply the note refinement heuristics in a similar manner to Section 4.1 for each of the gains $[g^1_{nj}(t), \dots, g^I_{nj}(t)]$. Then, we can average out the estimations over all the channels, making the blob detection more robust to the variance between the channels:

$$g'_{nj}(t) = \frac{\sum_{i=1}^{I} g^i_{nj}(t)}{I}. \quad (18)$$

Having computed the mean over all channels as in (18), for each note $k = 1, \dots, K_j$ we process a submatrix $g^k_{jn}(t)$ of the new gains matrix $g'_{jn}(t)$, where $K_j$ is the total number of notes for instrument $j$. Specifically, we apply the same steps, preprocessing (Section 4.1.1), binarization (Section 4.1.2), and blob selection (Section 4.1.3), to each matrix $g^k_{jn}(t)$, and we obtain a binary matrix $p^k_{jn}(t)$ having 1s for the elements corresponding to the best blob and 0s for the rest.

Our hypothesis is that averaging out the gains over all channels makes blob detection more robust. However, when performing the averaging, we do not account for the delays between the channels. In order to compute the delay for a given channel, we can compute the best blob separately with the method in Section 4.1 (matrix $\bar{g}^k_{jn}(t)$) and compare it with the one calculated from the averaged estimation ($p^k_{jn}(t)$). This step is equivalent to comparing the onset times of the two best blobs of the two estimations. Subtracting these onset times, we get the delay between the averaged estimation and the one obtained for a channel, and we can correct this in matrix $p^k_{jn}(t)$. Accordingly, we zero-pad the beginning of $p^k_{jn}(t)$ with the number of zeros corresponding to the delay, or we remove the trailing zeros for a negative delay.

6. Materials and Evaluation

6.1. Dataset. The audio material used for evaluation was presented by Pätynen et al. [30] and consists of four passages of symphonic music from the Classical and Romantic periods. This work presented a set of anechoic recordings for each of the instruments, which were then synchronized between them so that they could later be combined into a mix of the orchestra. The musicians played in an anechoic chamber and, in order to be synchronous with the rest of the instruments, they followed a video featuring a conductor and a pianist playing each of the four pieces. Note that the benefit of having isolated recordings comes at the expense of ignoring the interactions between musicians, which commonly affect intonation and time-synchronization [37].

The four pieces differ in terms of the number of instruments per instrument class, style, dynamics, and size. The first passage is a soprano aria of Donna Elvira from the opera Don Giovanni by W. A. Mozart (1756-1791), corresponding to the Classical period and traditionally played by a small group of musicians. The second passage is from L. van Beethoven's (1770-1827) Symphony no. 7, featuring big chords and string crescendi. The chords and pauses make the reverberation tail of a concert hall clearly audible. The third passage is from Bruckner's (1824-1896) Symphony no. 8 and represents the late Romantic period. It features large dynamics and a large orchestra. Finally, G. Mahler's Symphony no. 1, also featuring a large orchestra, is another example of late Romanticism. The piece has a more complex texture than the one by Bruckner. Furthermore, according to the musicians who recorded the dataset, the last two pieces were also more difficult to play and record [30].

In order to keep the evaluation setup consistent between the four pieces, we focus on the following instruments: violin, viola, cello, double bass, oboe, flute, clarinet, horn, trumpet, and bassoon. All tracks from a single instrument were joined into a single track for each of the pieces.

For the selected instruments, we list the differences between the four pieces in Table 1. Note that in the original dataset the violins are separated into two groups. However, for brevity of evaluation, and because our separation framework does not consider sources sharing the same instrument templates, we decided to merge the violins into a single group. Note that the pieces by Mahler and Bruckner have a divisi in the groups of violins, which implies a larger number of instruments playing different melody lines simultaneously. This results in a scenario which is more challenging for source separation.

We created a ground truth score by manually annotating the notes played by the instruments. In order to facilitate this process, we first gathered the scores in MIDI format and automatically computed a global audio-score alignment using the method from [13], which has won the MIREX


Table 1: Anechoic dataset [30] characteristics.

Piece       Duration     Period     Instrument sections   Number of tracks   Max tracks/instrument
Mozart      3 min 47 s   Classical  8                     10                 2
Beethoven   3 min 11 s   Classical  10                    20                 4
Mahler      2 min 12 s   Romantic   10                    30                 4
Bruckner    1 min 27 s   Romantic   10                    39                 12

Figure 3: The steps to create the multichannel recordings dataset: the anechoic recordings and the score annotations yield the denoised recordings, from which the multichannel recordings are generated.

score-following challenge in the past years. Then, we locally aligned the notes of each instrument by manually correcting the onsets and offsets to fit the audio. This was performed using Sonic Visualiser, with the guidance of the spectrogram and of the monophonic pitch estimation [38] computed for each of the isolated instruments. The annotation was performed by two of the authors, who cross-checked the work of their peer. Note that this dataset and the proposed annotations are useful not only for our particular task, but also for the evaluation of multiple pitch estimation and automatic transcription algorithms in large orchestral settings, a context which has not been considered so far in the literature. The annotations can be found at the associated page (http://mtg.upf.edu/download/datasets/phenicx-anechoic).

During the recording process detailed in [30], the gain of the microphone amplifiers was fixed to the same value for the whole process, which reduced the dynamic range of the recordings of the quieter instruments. This led to noisier recordings for most of the instruments. In Section 6.2 we describe the score-informed denoising procedure we applied to each track. From the denoised isolated recordings, we then used Roomsim to create a multichannel image, as detailed in Section 6.3. The steps necessary to pass from the anechoic recordings to the multichannel dataset are represented in Figure 3. The original files can be obtained from the Acoustics Group at Aalto University (http://research.cs.aalto.fi/acoustics). For the denoising algorithm, please refer to http://mtg.upf.edu/download/datasets/phenicx-anechoic.

6.2. Dataset Denoising. The noise-related problems in the dataset were presented in [30]. We remove the noise in the recordings with the score-informed method in [39], which relies on learned noise spectral patterns. The main difference is that we rely on a manually annotated score, while in [39] the score is assumed to be misaligned, so further regularization is included to ensure that only certain note combinations in the score occur.

The annotated score yields the time interval where an instrument is not playing. Thus, the noise pattern is learned only within that interval. In this way, the method assures that the desired noise, which is a part of the actual sound of the instrument, is preserved in the denoised recording.

Figure 4: The sources and the receivers (microphones) in the simulated room: the instrument sections (violins, violas, cellos, double basses, flutes, oboes, clarinets, bassoons, horns, and trumpets) and the microphones (V1, V2, CL, RVL, TR, HRN, WWL, and WWR).

The algorithm takes each anechoic recording of a given instrument and removes the noise for the time interval where an instrument is playing, while setting to zero the frames where an instrument is not playing.

6.3. Dataset Spatialization. To simulate a large reverberant hall, we use the software Roomsim [31]. We define a configuration file which specifies the characteristics of the hall and, for each of the microphones, their position relative to each of the sources. The simulated room has similar dimensions to the Royal Concertgebouw in Amsterdam, one of the partners in the PHENICX project, and represents a setup in which we tested our framework. The simulated room's width, length, and height are 28 m, 40 m, and 12 m. The absorption coefficients are specified in Table 2.

The positions of the sources and microphones in the room are common for orchestral concerts (Figure 4). A configuration file is created for each microphone, which contains its coordinates (e.g., (14, 17, 4) for the center microphone). Then each source is defined through polar coordinates relative to the microphone (e.g., (11.4455, −95.1944, −15.1954), radius, azimuth, and elevation, for the bassoon relative to the center microphone). We selected all the microphones to be cardioid, in order to match the realistic setup of the Concertgebouw Hall.
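As an illustration, the sketch below derives such polar coordinates from Cartesian positions; the helper name and the source position are hypothetical and not part of Roomsim, only the center microphone coordinates come from the text.

```python
import numpy as np

def polar_relative_to_mic(src_xyz, mic_xyz):
    # Radius (m), azimuth and elevation (degrees) of a source
    # relative to a microphone, both given as Cartesian (x, y, z).
    d = np.asarray(src_xyz, dtype=float) - np.asarray(mic_xyz, dtype=float)
    radius = np.linalg.norm(d)
    azimuth = np.degrees(np.arctan2(d[1], d[0]))
    elevation = np.degrees(np.arcsin(d[2] / radius))
    return radius, azimuth, elevation

# Center microphone at (14, 17, 4); the source position is made up.
print(polar_relative_to_mic((4.0, 15.5, 1.5), (14.0, 17.0, 4.0)))
```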


Table 2: Room surface absorption coefficients.

Standard measurement frequencies (Hz)       125   250   500   1000  2000  4000
Absorption of wall in x = 0 plane           0.4   0.3   0.3   0.3   0.2   0.1
Absorption of wall in x = Lx plane          0.4   0.45  0.35  0.35  0.45  0.3
Absorption of wall in y = 0 plane           0.4   0.45  0.35  0.35  0.45  0.3
Absorption of wall in y = Ly plane          0.4   0.45  0.35  0.35  0.45  0.3
Absorption of floor, i.e., z = 0 plane      0.5   0.6   0.7   0.8   0.8   0.9
Absorption of ceiling, i.e., z = Lz plane   0.4   0.45  0.35  0.35  0.45  0.3

Figure 5: The reverberation time (RT60) versus frequency for the simulated room.

Using the configuration file and the anechoic audio files corresponding to the isolated sources, Roomsim generates the audio files for each of the microphones, along with the impulse responses for each pair of instruments and microphones. The impulse responses and the anechoic signals are used during the evaluation to obtain the ground truth spatial image of the sources in the corresponding microphone. Additionally, we plot the reverberation time (RT60) [31] computed by Roomsim across frequencies in Figure 5.

We need to adapt the ground truth annotations to the audio generated with Roomsim, as the original annotations were done on the isolated audio files. Roomsim creates the audio for a given microphone by convolving each of the sources with the corresponding impulse response and then summing up the results of the convolution. We compute a delay for each pair of microphones and instruments by taking the position of the maximum value in the associated impulse response vector. Then we generate a score for each of the pairs by adding the corresponding delay to the note onsets. Additionally, since the offset time depends on the reverberation and the frequencies of the notes, we add 0.8 s to each note offset to account for the reverberation, besides the added delay.
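A minimal sketch of this adaptation, assuming the impulse response of each instrument-microphone pair is available as an array sampled at fs; the (onset, offset) note representation in seconds is our own:

```python
import numpy as np

def pair_delay(ir, fs):
    # Delay (s) of one instrument-microphone pair: position of the
    # maximum value in the impulse response vector.
    return np.argmax(ir) / float(fs)

def adapt_score(notes, delay, reverb_tail=0.8):
    # Shift onsets/offsets by the pair delay and extend each offset
    # by 0.8 s to account for the reverberation tail.
    return [(on + delay, off + delay + reverb_tail) for on, off in notes]
```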

6.4. Evaluation Methodology

6.4.1. Parameter Selection. In this paper we use a low-level spectral representation of the audio data, which is generated from a windowed FFT of the signal. We use a Hanning window with a size of 92 ms and a hop size of 11 ms.

Here, a logarithmic frequency discretization is adopted. Furthermore, two time-frequency resolutions are used. First, to estimate the instrument models and the panning matrix, a single-semitone resolution is proposed. In particular, we implement the time-frequency representation by integrating the STFT bins corresponding to the same semitone. Second, for the separation task, a higher resolution of 1/4 of a semitone is used, which has proven to achieve better separation results [6]. The time-frequency representation is obtained by integrating the STFT bins corresponding to 1/4 of a semitone. Note that in the separation stage the learned basis functions b_{jn}(f) are adapted to the 1/4-semitone resolution by replicating the basis at each semitone to the 4 samples of the 1/4-semitone resolution that belong to this semitone. For image binarization, we pick for the first Gaussian φ = 3 as the standard deviation, and for the second Gaussian κ = 4 as the position of the central frequency bin and ν = 4 as the standard deviation corresponding to one semitone.
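A sketch of this integration, assuming a magnitude STFT X whose rows are FFT bins; fmin, the 88-semitone range, and the function name are our choices, with bins_per_semitone = 1 or 4 selecting the two resolutions:

```python
import numpy as np

def log_spectrogram(X, fs, n_fft, fmin=27.5, n_semitones=88, bins_per_semitone=1):
    # Integrate STFT bins into logarithmically spaced bands
    # (one band per semitone or per 1/4 of a semitone).
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    n_bands = n_semitones * bins_per_semitone
    edges = fmin * 2.0 ** (np.arange(n_bands + 1) / (12.0 * bins_per_semitone))
    out = np.zeros((n_bands, X.shape[1]))
    for k in range(n_bands):
        sel = (freqs >= edges[k]) & (freqs < edges[k + 1])
        out[k] = X[sel].sum(axis=0)
    return out
```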

We picked 10 iterations for the NMF, and we set the beta-divergence distortion β = 1.3, as in [6, 11].

6.4.2. Evaluation Setup. We perform three different kinds of evaluation: audio-to-score alignment, panning matrix estimation, and score-informed source separation. Note that for the alignment we evaluate the state-of-the-art system in [13]. This method does not align single notes but combinations of notes in the score (aka states). Here, the alignment is performed with respect to a single audio channel, corresponding to the microphone situated in the center of the stage. On the other hand, the offsets are estimated by shifting the original duration for each note in the score [13] or by assigning the offset time as the onset of the next state. We denote these two cases as INT and NEX, respectively.

Regarding the initialization of the separation framework, we can use the raw output of the alignment system. However, as stated in Section 4 and [3, 5, 24], a better option is to extend the onsets and offsets along a tolerance window, to account for the errors of the alignment system itself and for the delays between the center channel (on which the alignment is performed) and the other channels. Thus, we test two hypotheses regarding the tolerance window for the possible errors. In the first case we extend the boundaries by 0.3 s for onsets and 0.6 s for offsets (T1), and in the second by 0.6 s for onsets and 1 s for offsets (T2). Note that the value of 0.3 s for the onset times is not arbitrary but the usual threshold for onsets in the MIREX evaluation of real-time score-following [40].

Table 3: Score information used for the initialization of score-informed source separation.

Tolerance window size              Offset estimation
T1: onsets 0.3 s, offsets 0.6 s    INT: interpolation of the offset time
T2: onsets 0.6 s, offsets 0.9 s    NEX: the offset is the onset of the next note

Two different tolerance windows were tested to account for the complexity of this novel scenario. The tolerance window is slightly larger for offsets, due to the reverberation time and because the ending of a note is not as clear as its onset. A summary of the score information used to initialize the source separation framework is found in Table 3.
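A sketch of this tolerance-window extension, with the T1 values as defaults; the (onset, offset) note representation is our own:

```python
def extend_boundaries(notes, onset_tol=0.3, offset_tol=0.6):
    # Extend each aligned note: onsets earlier, offsets later,
    # to absorb alignment errors (T1: 0.3/0.6 s, T2: 0.6/0.9 s).
    return [(max(0.0, on - onset_tol), off + offset_tol) for on, off in notes]
```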

We label the test case corresponding to the initialization with the raw output of the alignment system as Ali. Conversely, the test case corresponding to the tolerance window initialization is labeled as Ext. Furthermore, within the tolerance window we can refine the note onsets and offsets with the methods in Section 4.1 (Ref1) and Section 5.2 (Ref2), resulting in two other test cases. Since method Ref1 can only refine the score to a single channel, its results are solely computed with respect to that channel. For the multichannel refinement Ref2, we report the results of the alignment of each instrument with respect to each microphone. A graphic of the initialization of the framework with the four test cases listed above (Ali, Ext, Ref1, and Ref2), along with the ground truth score initialization (GT), is found in Figure 7; we present the source separation results for these cases in Section 6.5.3.

In order to evaluate the panning matrix estimation stage, we compute an ideal panning matrix based on the impulse responses generated by Roomsim during the creation of the multichannel audio (see Section 6.3). The ideal panning matrix gives the ideal contribution of each instrument in each channel, and it is computed by searching the maximum in the impulse response vector corresponding to each instrument-channel pair, as in

m_{\text{ideal}}(i, j) = \max_t \mathrm{IR}(i, j)(t),   (19)

where IR(i, j)(t) is the impulse response of source i in channel j. By comparing the estimated matrix m(i, j) with the ideal one, m_{ideal}(i, j), we can determine whether the algorithm picked a wrong channel for separation.
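A sketch of (19), assuming the Roomsim impulse responses are stacked in an array indexed as irs[i, j, t]:

```python
import numpy as np

def ideal_panning_matrix(irs):
    # irs: array of shape (n_sources, n_channels, n_taps).
    # m_ideal(i, j) = max over t of IR(i, j)(t), cf. equation (19).
    return irs.max(axis=2)
```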

6.4.3. Evaluation Metrics. For score alignment we are interested in a measure which relates to source separation and accounts for the audio frames which are correctly detected, rather than an alignment rate computed per note onset, as found in [13]. Thus, we evaluate the alignment at the frame level rather than at the note level. A similar reasoning on the evaluation of score alignment is found in [4].

We consider 0.011 s, the temporal granularity for this measure, as the size of a frame. Then a frame of a musical note is considered a true positive (tp) if it is found in the ground truth score and in the aligned score within the exact time boundaries. The same frame is labeled as a false positive (fp) if it is found only in the aligned score, and a false negative (fn) if it is found only in the ground truth score. Since the gains initialization is done with score information (see Section 4), lost frames (recall) and incorrectly detected frames (precision) impact the performance of the source separation algorithm. Precision is defined as p = tp/(tp + fp) and recall as r = tp/(tp + fn). Additionally, we compute the harmonic mean of precision and recall to obtain the F-measure as F = 2 · (p · r)/(p + r).
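A sketch of these frame-level metrics, assuming the ground truth and aligned scores have been rasterized into boolean note-activity matrices with a 0.011 s frame size:

```python
import numpy as np

def frame_level_prf(gt, est):
    # gt, est: boolean arrays (n_notes x n_frames) of note activity.
    tp = np.sum(gt & est)
    fp = np.sum(~gt & est)
    fn = np.sum(gt & ~est)
    p = tp / float(tp + fp)
    r = tp / float(tp + fn)
    return p, r, 2 * p * r / (p + r)
```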

The source separation evaluation framework and metrics employed are described in [41, 42]. Correspondingly, we use the Source to Distortion Ratio (SDR), the Source to Interference Ratio (SIR), the source Image to Spatial distortion Ratio (ISR), and the Source to Artifacts Ratio (SAR). While SDR measures the overall quality of the separation and ISR the spatial reconstruction of the source, SIR relates to the rejection of interferences and SAR to the absence of forbidden distortions and artifacts.

The evaluation of source separation is a computationally intensive process. Additionally, processing the long audio files in the dataset would require large memory for the matrix calculations. To reduce the memory requirements, the evaluation is performed on blocks of 30 s, with 1 s overlap to allow for continuation.

6.5. Results

6.5.1. Score Alignment. We evaluate the output of the alignment, Ali, along with the estimation of the note offsets, INT and NEX, in terms of F-measure (see Section 6.4.3), precision, and recall, ranging from 0 to 1. Additionally, we evaluate the optimal size for extending the note boundaries along the onsets and offsets, T1 and T2, for the refinement methods Ref1 and Ref2 and the baseline Ext. Since there are many differences between the pieces, we report the results individually per piece, in Table 4.

Methods Ref1 and Ref2 depend on a binarization threshold which determines how much energy is set to zero. A lower threshold results in the consolidation of larger blobs in blob detection. In [11] this threshold is set to 0.5 for a dataset of monaural recordings of Bach chorales played by four instruments. However, we are facing a multichannel scenario where capturing the reverberation is important, especially when we consider that offsets were annotated with a low energy threshold. Thus, we are interested in losing the least energy possible, and we set lower values for the threshold: 0.3 and 0.1. Consequently, when analyzing the results, the lower threshold achieves better performance in terms of F-measure for Ref1 (0.67 and 0.72) and for Ref2 (0.71 and 0.75).

According to Table 4, extending note offsets (NEX) rather than interpolating them (INT) gives lower recall in all pieces, and the method leads to losing more frames, which cannot be recovered even by extending the offset times in T2. NEX T2 always yields a lower recall when compared to INT T2 (e.g., r = 0.85 compared to r = 0.97 for Mozart).

The output of the alignment system, Ali, is not a good option to initialize the gains of the source separation system: it has a high precision and a very low recall (e.g., the case INT Ali has p = 0.95 and r = 0.39, compared to case INT Ext, which has p = 0.84 and r = 0.84, for Beethoven). For the piece by Beethoven the output is particularly poor compared to the other pieces. However, by extending the boundaries (Ext) and applying note refinement (Ref1 or Ref2), we are able to increase the recall and match the performance on the other pieces.

Table 4: Alignment evaluated in terms of F-measure (F), precision (p), and recall (r), ranging from 0 to 1.

                Mozart            Beethoven         Mahler            Bruckner
                F     p     r     F     p     r     F     p     r     F     p     r
INT
  Ali           0.69  0.93  0.55  0.56  0.95  0.39  0.61  0.85  0.47  0.60  0.94  0.32
  T1   Ref1     0.77  0.77  0.76  0.79  0.81  0.77  0.63  0.74  0.54  0.72  0.73  0.70
       Ref2     0.83  0.79  0.88  0.82  0.82  0.81  0.67  0.77  0.60  0.81  0.77  0.85
       Ext      0.82  0.73  0.94  0.84  0.84  0.84  0.77  0.69  0.87  0.86  0.78  0.96
  T2   Ref1     0.77  0.77  0.76  0.76  0.75  0.77  0.63  0.74  0.54  0.72  0.70  0.71
       Ref2     0.83  0.78  0.88  0.82  0.76  0.87  0.67  0.77  0.59  0.79  0.73  0.86
       Ext      0.72  0.57  0.97  0.79  0.70  0.92  0.69  0.55  0.93  0.77  0.63  0.98
NEX
  Ali           0.49  0.94  0.33  0.51  0.89  0.36  0.42  0.90  0.27  0.48  0.96  0.44
  T1   Ref1     0.70  0.77  0.64  0.72  0.79  0.66  0.63  0.74  0.54  0.68  0.72  0.64
       Ref2     0.73  0.79  0.68  0.71  0.79  0.65  0.66  0.77  0.58  0.73  0.74  0.72
       Ext      0.73  0.77  0.68  0.71  0.80  0.65  0.69  0.75  0.64  0.76  0.79  0.72
  T2   Ref1     0.74  0.78  0.71  0.72  0.74  0.70  0.63  0.74  0.54  0.72  0.79  0.72
       Ref2     0.80  0.80  0.80  0.75  0.75  0.75  0.67  0.77  0.59  0.79  0.73  0.86
       Ext      0.73  0.65  0.85  0.73  0.69  0.77  0.72  0.64  0.82  0.79  0.69  0.91

Table 5: Instruments for which the closest microphone was incorrectly determined, for different score information (GT, Ali, T1, T2, INT, and NEX) and two room setups.

         GT        INT: Ali               INT: T1          INT: T2                 NEX: Ali               NEX: T1   NEX: T2
Setup 1  Clarinet  Clarinet, double bass  Clarinet, flute  Clarinet, flute, horn   Clarinet, double bass  Clarinet  Flute
Setup 2  Cello     Cello, flute           Cello, flute     Cello, flute            Bassoon                Flute     Cello, flute


When comparing the sizes of the tolerance window for the onsets and offsets, we observe that the alignment is more accurate at detecting the onsets within 0.3 s and the offsets within 0.6 s. In Table 4, T1 achieves better results than T2 (e.g., F = 0.77 for T1 compared to F = 0.69 for T2, Mahler). Relying on a large window retrieves more frames but also significantly damages the precision. However, when considering source separation, we may want to lose as little information as possible. It is in this special case that the refinement methods Ref1 and Ref2 show their importance. When facing larger time boundaries, as with T2, Ref1 and especially Ref2 are able to reduce the errors by achieving better precision with a minimum loss in recall.

The refinement Ref1 has a worse performance than Ref2, the multichannel refinement (e.g., F = 0.72 compared to F = 0.81 for Bruckner, INT T1). Note that in its original version [11] Ref1 assumed monophony within a source, as it was tested in the simple case of the Bach10 dataset [4]. To that extent, it relied on a graph computation to determine the best distribution of blobs. However, due to the increased polyphony within an instrument (e.g., violins playing divisi), with simultaneous melodic lines, we disabled this feature; in this case Ref1 has lower recall: it loses more frames. On the other hand, Ref2 is more robust because it computes a blob estimation per channel; averaging out these estimations yields better results.

The refinement works worse for the more complex pieces (Mahler and Bruckner) than for the simpler ones (Mozart and Beethoven). Increasing the polyphony within a source and the number of instruments with many interleaving melodic lines (a less sparse score) makes the task more difficult.

6.5.2. Panning Matrix. Estimating the panning matrix correctly is an important step in the proposed method, since Wiener filtering is performed on the channel where the instrument has the most energy. If the algorithm picks a different channel for this step, we can find more interference between instruments in the separated audio files.

As described in Section 3.1, the estimation of the panning matrix depends on the number of nonoverlapping partials of the notes found in the score and their alignment with the audio. To that extent, the more nonoverlapping partials we have, the more robust the estimation.

Initially we experimented with computing the panning matrix separately for each piece. In the case of Bruckner, the piece is simply too short, and there are too few nonoverlapping partials to yield a good estimation, resulting in errors in the panning matrix. Since the instrument setup is the same for the Beethoven, Mahler, and Bruckner pieces (10 sources in the same positions on the stage), we decided to jointly estimate the matrix from the concatenated audio pieces and the associated scores. We denote as Setup 1 the Mozart piece, played by 8 sources, and as Setup 2 the Beethoven, Mahler, and Bruckner pieces, played by 10 sources.

Since the panning matrix is computed using the score, different score information can yield very different estimations of the panning matrix. To that extent, we evaluate the influence of audio-to-score alignment, namely, the cases INT, NEX, Ali, T1, and T2, and the initialization with the ground truth score information, GT.

In Table 5 we list the instruments for which the algorithm picked the wrong channel. Note that in the room setup generated with Roomsim, most of the instruments in Table 5 are placed close to other sources from the same family of instruments, for example, cello and double bass, or flute with clarinet, bassoon, and oboe. In this case the algorithm makes more mistakes when selecting the correct channel to perform source separation.

In the column GT of Table 5 we can see that having a perfectly aligned score yields fewer errors when estimating the panning matrix. Conversely, in a real-life scenario we cannot rely on a hand annotated score. In this case, for all columns of Table 5 excluding GT, the best estimation is obtained by the combination of NEX and T1: taking the offset time as the onset of the next note and then extending the score with a smaller window.

Furthermore, we compute the SDR values for the instruments in Table 5, column GT (clarinet and cello), if the separation were done in the correct channel or in the estimated channel. For Setup 1 the channel for clarinet is wrongly set to WWL (woodwinds left), the correct one being WWR (woodwinds right), when we have a perfectly aligned score (GT). However, the microphones WWL and WWR are very close (see Figure 4) and do not capture significant energy from other instrument sections, and the SDR difference is less than 0.01 dB. However, in Setup 2 the cello is wrongly separated in the WWL channel, and the SDR difference between this audio and the audio separated in the correct channel is around −1.1 dB for each of the three pieces.

6.5.3. Source Separation. We use the evaluation metrics described in Section 6.4.3. Since there is a lot of variability between the four pieces, it is more informative to present the results per piece rather than aggregating them.

Figure 6: Results in terms of SDR, SIR, SAR, and ISR for the instruments and the songs in the dataset (Mozart, Beethoven, Mahler, and Bruckner).


First, we analyze the separation results per instrument in an ideal case. We assume that the best results for score-informed source separation are obtained in the case of a perfectly aligned score (GT). Furthermore, for this case we calculate the separation in the correct channel for all the instruments, since in Section 6.5.2 we saw that picking a wrong channel can be detrimental. We present the results as a bar plot in Figure 6.

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6 we can see that the more complex the piece, the more difficult it is to achieve a good separation. For instance, note that cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for cello in Mozart compared to −5 dB in Mahler). Cello and double bass are close to each other in both setups, and similarly clarinet and flute, so we expect interference between them. Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in lower SIR values for double bass (5.5 dB SIR for Mozart, but −1.8, −0.1, and −0.2 dB SIR for the others) and flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask, the multichannel spectrogram is multiplied with the panning matrix. Thus, wrong values in this matrix can yield wrong amplitude values for the resulting signals.

This is the case for the trumpet, which is allocated a close microphone in the current setup and for which we expect a good separation. However, the trumpet achieves a poor ISR (5.5, 1, and −1 dB) but has a good separation in terms of SIR and SAR. Similarly, other instruments, such as cello, double bass, flute, and viola, face the same problem, particularly for the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.

Figure 7: The test cases for the initialization of score-informed source separation for the submatrix p^s_{jn}(t) of note s: ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).

A low SDR is obtained in the case of Mahler, which is related to the poor alignment results obtained for this piece. As seen in Table 4, for the INT case, the F-measure is almost 0.08 lower in Mahler than in the other pieces, mainly because of the bad precision.

The results are considerably worse for the double bass for the more complex pieces of Beethoven (−9.1 dB SDR), Mahler (−8.5 dB SDR), and Bruckner (−4.7 dB SDR); for further analysis we consider it an outlier, and we exclude it from the analysis.

Second, we want to evaluate the usefulness of note refinement in source separation. As seen in Section 4, the gains for NMF separation are initialized with score information or with a refined score, as described in Section 4.2. A summary of the different initialization options is seen in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization, with the ground truth annotations (Figure 7, GT); the direct output of the score alignment system (Figure 7, Ali); the common practice of NMF gains initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7, Ext); and the refinement approaches (Figure 7, Ref1 and Ref2). Note that Ref1 is the refinement done with the method in Section 4.1 and Ref2 with the multichannel method described in Section 5.2.

We test the difference between the binarization thresholds 0.5 and 0.1 used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives an F-value = 0.0796 and a p-value = 0.7778, which shows no significant difference between the two binarization thresholds.

The results for the five initializations, GT, Ref1, Ref2, Ext, and Ali, are presented in Figure 8 for each of the four pieces. Note that for Ref1, Ref2, and Ext we aggregate information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate between the instruments, the piece by Mahler having the worst results and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors, such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues we described in Section 6.1, can increase the difficulty of separation up to the point that Ref1, Ref2, and Ext yield a minimal improvement. To that extent, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps to improve the separation, increasing the SDR by 1–1.5 dB or more for all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase the SDR for most of the pieces, with the exception of the piece by Mahler. This is due to an increase in SIR and a decrease of the interferences in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext; for this piece the difference in SIR is around 2 dB. Then, for Beethoven, Ref1 and Ref2 gain 0.5 dB in terms of SDR when compared to Ext, and 1.5 dB in SIR. For Bruckner, solely Ref2 has a higher SDR; however, SIR increases by 1.5 dB for both Ref1 and Ref2. Note that Ref1 and Ref2 not only refine the time boundaries of the notes: the refinement also happens in frequency, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the estimation of note offsets, INT and NEX, and of the tolerance window sizes, T1 and T2, which account for errors in the alignment. Note that for this case we do not include the refinement in the results: we evaluate only the case Ext, leaving out the refinement in order to isolate the influence of T1 and T2. Results are presented in Figure 9 and show that the best results are obtained for the interpolation of the offsets, INT. This relates to the results presented in Section 6.5.1. Similarly to the analysis regarding the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to draw a conclusion on which strategy is better for the initialization, as the error bars for the ground truth overlap with the ones of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gains estimation, as proposed in Section 5.1, and the single channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR, and we obtain an F-value = 1.712 and a p-value = 0.1908. Hence, there is no significant difference between single channel and multichannel gain estimation when we are not performing postprocessing of the gains with note refinement. However, although the new update rules do not help by themselves in the multichannel case, we are able to better refine the gains: in this case we aggregate information over all the channels, and blob detection is more robust, even to variations of the binarization threshold. On account of that, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available (http://repovizz.upf.edu/phenicx/anechoic_multi).

Figure 8: Results in terms of SDR, SIR, SAR, and ISR for the NMF gains initialization in the different test cases (GT, Ref1, Ref2, Ext, and Ali) for Mozart, Beethoven, Mahler, and Bruckner.

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation is Instrument Emphasis, which aims at processing multitrack orchestral recordings. Once we have the separated tracks, it allows emphasizing a particular instrument over the full orchestra downmix recording. Our workflow consists in processing multichannel orchestral recordings for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of other sections, obtaining an enhanced signal for the selected instrument.

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user in the uploading of media content and presents the results to the user. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks. After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks. The separation quality is not degraded if the blocks have sufficient duration. In our case we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx) and are integrated into the PHENICX prototype (http://phenicx.com).

7.2. Acoustic Rendering. The second application for the separated material is augmented or virtual reality scenarios, where we apply a spatialization of the separated musical sources. Acoustic Rendering aims at recreating acoustically the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage and of the listener.

Figure 9: Results in terms of SDR for the combination of different offset estimation methods (INT and NEX) and different sizes of the tolerance window for note onsets and offsets (T1 and T2); GT is the ground truth alignment and Ali is the output of the score alignment.

We have considered binaural synthesis as the most suitable spatial audio technique for this application. Humans locate the direction of incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound will arrive with a different intensity

and at different time instants at both ears. The idea behind binaural synthesis is to artificially generate these cues to be able to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype, we have used the noncommercial Unity3D plugin for binaural synthesis provided by 3DCeption (https://twobigears.com/index.php).

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker, https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances), for the instrument visualization and Acoustic Rendering use-cases introduced above.

As inputs for the source localization method we need the multichannel recordings and the approximate position of the microphones on stage. In concert halls the recording setup typically consists of a grid structure of hanging overhead microphones. The position of the overhead microphones is therefore kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, it being currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes the refined score alignment obtained prior to the signal separation (see Section 4.1) and follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible locations of a sound source based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate a list of time delay values corresponding to all note onsets in the score and take the maximum of the histogram. In our experiment we have a time resolution of 28 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play isolated, as SRP-PHAT does [45], and it can be used in complex scenarios.
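A sketch of this voting scheme for one instrument and one microphone pair; the inputs are the onset times (s) of the same notes as observed at the two microphones, and the histogram bin width matches the stated resolution:

```python
import numpy as np

def tdoa_from_onsets(onsets_a, onsets_b, resolution=0.028):
    # Time difference of arrival estimated as the mode of the
    # per-note onset delays between two microphones.
    delays = np.asarray(onsets_b) - np.asarray(onsets_a)
    bins = np.arange(delays.min(), delays.max() + 2 * resolution, resolution)
    hist, edges = np.histogram(delays, bins=bins)
    k = np.argmax(hist)
    return 0.5 * (edges[k] + edges[k + 1])  # center of the most voted bin
```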

8. Outlook

In this paper we proposed a framework for score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings is objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment, which improved the separation for three of the four pieces in the dataset when compared to two other initialization options for our framework: the raw output of the score alignment and the tolerance window on which the alignment relies.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems in the recovery of the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done on a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach for source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone to separate a source is the closest one determined by the localization method.


When looking at the separation across instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies between pieces and instruments, and future research could provide more insight into this problem. Note that an increasing degree of consonance was related to a more difficult case for source separation [17]. Hence, we could expect a worse separation for instruments which are harmonizing or accompanying other sections, as is the case of viola, cello, and double bass in some pieces. Future research could offer more insight into the relation between the musical characteristics of the pieces (e.g., tonality and texture) and source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12,000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow for other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework, related to Instrument Emphasis or Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustín Martorell for his insights on the musicology side and for correcting the scores, and Óscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched and interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference, 2013.
[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.
[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.
[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal on Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.
[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.
[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodríguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.
[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.
[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.
[9] F. J. Rodríguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.
[10] F. J. Rodríguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.
[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, October 2015.
[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the Concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.
[13] J. J. Carabias-Orti, F. J. Rodríguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, 2015.
[14] O. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.
[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.
[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, Taipei, Taiwan, April 2009.
[17] J. J. Burred, From Sparse Models to Timbre Learning: New Methods for Musical Source Separation, Ph.D. thesis, 2008.
[18] F. J. Rodríguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.
[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.
[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.
[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.
[22] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.
[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.
[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.
[25] B. Niedermayer, Accurate Audio-to-Score Alignment: Data Acquisition in the Context of Computational Musicology, Ph.D. thesis, Johannes Kepler University Linz, Linz, Austria, 2012.
[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.
[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.
[28] D. Fitzgerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.
[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Málaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.
[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.
[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.
[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.
[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.
[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.
[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.
[38] M. Mauch and S. Dixon, "pYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.
[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.
[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.
[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.
[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.
[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.
[45] A. Martí Guerola, Multichannel Audio Processing for Speaker Localization, Separation and Enhancement, Ph.D. thesis, Universitat Politècnica de València, 2013.



3. Baseline Method for Multichannel Source Separation

According to the baseline model in [6], the short-term complex-valued Fourier transform (STFT) in time frame t and frequency f for channel i = 1, ..., I, where I is the total number of channels, is expressed as

x_i(f, t) \approx \hat{x}_i(f, t) = \sum_{j=1}^{J} m_{ij} s_j(f, t),   (1)

where s_j(f, t) represents the estimation of the complex-valued STFT computed for the source j = 1, ..., J, with J the total number of sources. Note that in this paper we consider as a source or instrument one or more instruments of the same kind (e.g., a section of violins). Additionally, m_{ij} is a mixing matrix of size I × J that accounts for the contribution of source j to channel i. In addition, we denote x_i(f, t) as the magnitude spectrogram and m_{ij} as the real-valued panning matrix.

Under the NMF model described in [18], each source s_j(f, t) is factored as a product of two matrices: g_{jn}(t), the matrix which holds the gains or activations of the basis function corresponding to pitch n at frame t, and b_{jn}(f), the matrix which holds the bases, where n = 1, ..., N is the pitch range for instrument j. Hence, source j is modeled as

s_j(f, t) \approx \sum_{n=1}^{N} b_{jn}(f) g_{jn}(t).   (2)

The model represents a pitch for each source j as a single template stored in the basis matrix b_{jn}(f). The temporal activation of a template (e.g., onset and offset times for a note) is modeled using the gains matrix g_{jn}(t). Under harmonicity constraints [18], the NMF model for the basis matrix is defined as

b_{jn}(f) = \sum_{h=1}^{H} a_{jn}(h) G(f - h f_0(n)),   (3)

where h = 1, ..., H indexes the harmonics, a_{jn}(h) is the amplitude of harmonic h for note n and instrument j, f_0(n) is the fundamental frequency of note n, and G(f) is the magnitude spectrum of the window function; the spectrum of a harmonic component at frequency h f_0(n) is approximated by G(f − h f_0(n)).

Considering the model given in (3), the initial equation (2) for the computation of the magnitude spectrogram of source j is expressed as

s_j(f, t) \approx \sum_{n=1}^{N} g_{jn}(t) \sum_{h=1}^{H} a_{jn}(h) G(f - h f_0(n)),   (4)

and (1) for the factorization of the magnitude spectrogram of channel i is rewritten as

\hat{x}_i(f, t) = \sum_{j=1}^{J} m_{ij} \sum_{n=1}^{N} g_{jn}(t) \sum_{h=1}^{H} a_{jn}(h) G(f - h f_0(n)).   (5)

3.1. Panning Matrix Estimation. The panning matrix gives the contribution of each instrument in each channel and, as seen in (1), directly influences the separation of the sources. The panning matrix is estimated by calculating an overlapping mask which discriminates the time-frequency zones for which the partials of a source are not overlapped with the partials of other sources. Then, using the overlapping mask, a panning coefficient is computed for each pair of sources at each channel. The estimation algorithm in the baseline framework is described in [6].

3.2. Augmented NMF for Parameter Estimation. According to [33], the parameters of the NMF model are estimated by minimizing a cost function which measures the reconstruction error between the observed x_i(f, t) and the estimated \hat{x}_i(f, t). For flexibility reasons we use the beta-divergence [34] cost function, which models popular cost functions for different values of β, such as the Euclidean (EUC) distance (β = 2), the Kullback-Leibler (KL) divergence (β = 1), and the Itakura-Saito (IS) divergence (β = 0).
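For reference, a sketch of the beta-divergence between an observed value x and its estimate y, covering the three named special cases:

```python
import numpy as np

def beta_divergence(x, y, beta):
    # d_beta(x || y): beta=2 Euclidean, beta=1 Kullback-Leibler,
    # beta=0 Itakura-Saito.
    x, y = np.asarray(x, float), np.asarray(y, float)
    if beta == 0:
        return np.sum(x / y - np.log(x / y) - 1.0)
    if beta == 1:
        return np.sum(x * np.log(x / y) - x + y)
    return np.sum((x ** beta + (beta - 1) * y ** beta
                   - beta * x * y ** (beta - 1)) / (beta * (beta - 1)))
```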

The minimization procedure assures that the distance between x_i(f, t) and \hat{x}_i(f, t) does not increase with each iteration, while accounting for the nonnegativity of the bases and the gains. By these means, the magnitude spectrogram of a source is explained solely by additive reconstruction.

3.3. Timbre-Informed Signal Model. An advantage of the harmonic model is that templates can be learned for various instruments if the appropriate training data is available. The RWC instrument database [15] offers recordings of solo instruments playing isolated notes along all of their corresponding pitch range. The method in [6] uses these recordings, along with the ground truth annotations, to learn instrument spectral templates for each note of each instrument. More details on the training procedure can be found in the original paper [6].

Once the basis functions b_{jn}(f) corresponding to the spectral templates are learned, they are used at the factorization stage in any orchestral setup which contains the targeted instruments. Thus, after training, the bases b_{jn}(f) are kept fixed, while the gains g_{jn}(t) are estimated during the factorization procedure.

34 Gains Estimation The factorization procedure to esti-mate the gains 119892119895119899(119905) considers the previously computedpanning matrix 119898119894119895 and the learned basis 119887119895119899(119891) from thetraining stage Consequently we have the following updaterules

$$g_{jn}(t) \leftarrow g_{jn}(t) \frac{\sum_{f,i} m_{ij}\, b_{jn}(f)\, x_i(f,t)\, \hat{x}_i(f,t)^{\beta-2}}{\sum_{f,i} m_{ij}\, b_{jn}(f)\, \hat{x}_i(f,t)^{\beta-1}} \quad (6)$$
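A vectorized sketch of one application of (6), under the same assumed toy shapes as above (g: sources x notes x frames, b: sources x notes x bins, m: channels x sources, x and x_hat: channels x bins x frames):

```python
import numpy as np

def update_gains(g, b, m, x, x_hat, beta=1.3, eps=1e-12):
    """One multiplicative update of the gains as in (6).

    g: (J, N, T), b: (J, N, F), m: (I, J), x and x_hat: (I, F, T).
    """
    num = np.einsum('ij,jnf,ift->jnt', m, b, x * x_hat ** (beta - 2))
    den = np.einsum('ij,jnf,ift->jnt', m, b, x_hat ** (beta - 1)) + eps
    return g * num / den   # zeros in g stay zero, as required by the initialization
```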

3.5. From the Estimated Gains to the Separated Signals. The reconstruction of the sources is done by estimating the complex amplitude for each time-frequency bin. In the case of binary separation, a cell is entirely associated with a single source. However, when having many instruments, as in orchestral music, it is more advantageous to redistribute the energy proportionally over all sources, as in the Wiener filtering method [23].


(1) Initialize $b_{jn}(f)$ with the values learned in Section 3.3.
(2) Initialize the gains $g_{jn}(t)$ with score information.
(3) Initialize the mixing matrix $m_{ij}$ with the values learned in Section 3.1.
(4) Update the gains using equation (6).
(5) Repeat Step (4) until the algorithm converges (or the maximum number of iterations is reached).

Algorithm 1: Gain estimation method.


This model allows for estimating each separated source $s_j(t)$ from mixture $x_i(t)$ using a generalized time-frequency Wiener filter over the short-time Fourier transform (STFT) domain, as in [3, 34].

Let $\alpha_{j'}$ be the Wiener filter of source $j'$, representing the relative energy contribution of the predominant source with respect to the energy of the multichannel mixed signal $x_i(t)$ at channel $i$:

$$\alpha_{j'}(t,f) = \frac{|A_{ij'}|^2\, |s_{j'}(f,t)|^2}{\sum_{j} |A_{ij}|^2\, |s_j(f,t)|^2} \quad (7)$$

Then, the corresponding spectrogram of source $j'$ is estimated as

$$\hat{s}_{j'}(f,t) = \frac{\alpha_{j'}(t,f)}{|A_{ij'}|^2}\, x_i(f,t) \quad (8)$$

The estimated source $\hat{s}_{j'}(t)$ is computed with the inverse overlap-add STFT of $\hat{s}_{j'}(f,t)$.

The estimated source magnitude spectrogram is computed using the gains $g_{jn}(t)$ estimated in Section 3.4 and the fixed basis functions $b_{jn}(f)$ learned in Section 3.3: $\hat{s}_j(t,f) = \sum_n g_{jn}(t)\, b_{jn}(f)$. Then, if we replace $s_j(t,f)$ with $\hat{s}_j(t,f)$, and if we consider the mixing matrix coefficients computed in Section 3.1, we can calculate the Wiener mask from (7):

$$\alpha_{j'}(t,f) = \frac{m_{ij'}^2\, \hat{s}_{j'}(f,t)^2}{\sum_{j} m_{ij}^2\, \hat{s}_j(f,t)^2} \quad (9)$$

Using (8), we can apply the Wiener mask to the multichannel signal spectrogram, thus obtaining $\hat{s}_{j'}(f,t)$, the estimated predominant source spectrogram. Finally, we use the phase information from the original mixture signal $x_i(t)$ and, through the inverse overlap-add STFT, we obtain the estimated predominant source $\hat{s}_{j'}(t)$.
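As an illustration, the following sketch builds the Wiener mask of (9) and applies it as in (8) for one channel; names and shapes are assumptions, and the inverse STFT step is left out:

```python
import numpy as np

def wiener_separate(x_i, s_hat, m_i, j, eps=1e-12):
    """Separate source j from channel i via the mask in (9).

    x_i: (F, T) complex STFT of channel i; s_hat: (J, F, T) estimated source
    magnitudes (gains times basis); m_i: (J,) panning coefficients for channel i.
    """
    num = (m_i[j] * s_hat[j]) ** 2
    den = np.einsum('j,jft->ft', m_i ** 2, s_hat ** 2) + eps
    alpha = num / den                            # Wiener mask, cf. (9)
    # Masked spectrogram of source j, cf. (8); an inverse overlap-add STFT
    # (not shown) then yields the time-domain source.
    return (alpha / (m_i[j] ** 2 + eps)) * x_i
```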

4. Gains Initialization with Score Information

In the baseline method [6], the gains are initialized following a transcription stage. In our case, the automatic alignment system yields a score which offers a representation analogous to the one obtained by transcription. To that extent, the output of the alignment is used to initialize the gains for the NMF-based methods for score-informed source separation.

Although the alignment algorithm aims at fixing global misalignments, it does not account for local misalignments. In the case of score-informed source separation, having a better-aligned score leads to better separation [11], since it increases the sparseness of the gains matrix by setting to zero the activations for the time frames in which a note is not played (i.e., the corresponding spectral template of the note in the basis matrix is not activated outside this time boundary). However, in a real-case scenario, the initialization of the gains derived from the MIDI score must take into account the local misalignments. This has traditionally been done by setting a tolerance window around the onsets and offsets [3, 5, 24] or by refining the gains after a number of NMF iterations and then reestimating the gains [11]. While the former integrates note refinement into the parametric model, the latter detects contours in the gains using image processing heuristics and explicitly associates them with meaningful entities such as notes. In this paper we present two methods for note refinement: in Section 4.1 we detail the method in [11], which is used as a baseline for our framework, and in Section 5.2 we adapt and improve this baseline for the multichannel case.

On these terms, if we account for errors of up to $d$ frames in the audio-to-score alignment, we need to increase the time interval around the onset and the offset of a MIDI note when we initialize the gains. Thus, the values in $g_{jn}(t)$ for instrument $j$ and the pitch corresponding to MIDI note $n$ are set to 1 for the frames in which the MIDI note is played, as well as for the neighboring $d$ frames. The other values in $g_{jn}(t)$ are set to 0 and do not change during computation, while the values set to 1 evolve according to the energy distributed between the instruments.

Having initialized the gains, the classical augmented NMF factorization is applied to estimate the gains corresponding to each source $j$ in the mixture. The process is detailed in Algorithm 1.
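A sketch of this gains initialization, where notes is a hypothetical list of aligned score notes and d the tolerance in frames:

```python
import numpy as np

def init_gains(notes, J, N, T, d):
    """Initialize gains from the aligned score with a tolerance of d frames.

    notes: iterable of (instrument j, pitch index n, onset frame, offset frame).
    Frames inside [onset - d, offset + d] are set to 1; the rest stay 0 and are
    never changed by the multiplicative updates.
    """
    g = np.zeros((J, N, T))
    for j, n, onset, offset in notes:
        g[j, n, max(0, onset - d):min(T, offset + d + 1)] = 1.0
    return g
```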

4.1. Note Refinement. The note refinement method in [11] aims at associating the values in the gains matrix $g_{jn}(t)$ with notes. Therefore, it is applied after a certain number of iterations of Algorithm 1, when the gains matrix yields a meaningful distribution of the energy between instruments.

The method is applied to each note separately, with the scope of refining the gains associated with the targeted note. The gains matrix $g_{jn}(t)$ can be understood as a greyscale image, with each element in the matrix representing a pixel in the image. A set of image processing heuristics is deployed to detect shapes and contours in this image, commonly known as blobs [35, p. 248].

[Figure 2: After NMF, the resulting gains (a) are split in submatrices (b), which are binarized and used to detect blobs [11].]

As a result, each blob is associated with a single note, giving the onset and offset times for the note and its frequency contour. This representation further increases the sparsity of the gains $g_{jn}(t)$, yielding less interference and better separation.

As seen in Figure 2, the method considers an image patch around the pixels corresponding to the pitch of the note and to the onset and offset of the note given by the alignment stage, plus the additional $d$ frames accounting for local misalignments. In fact, the method works with the same submatrices of $g_{jn}(t)$ which are set to 1 according to their corresponding MIDI note during the gains initialization stage, as explained in Section 4. Therefore, for a given note $k = 1,\ldots,K_j$, we process the submatrix $g^k_{jn}(t)$ of the gains matrix $g_{jn}(t)$, where $K_j$ is the total number of notes for instrument $j$.

The steps of the method in [11], image preprocessing, binarization, and blob selection, are explained in Sections 4.1.1, 4.1.2, and 4.1.3.

4.1.1. Image Preprocessing. The preprocessing stage ensures, through smoothing, that there are no energy discontinuities within an image patch. Furthermore, it gives more weight to the pixels situated closer to the central bin of the blob, in order to eliminate interference from neighboring notes (close in time and frequency) while still preserving vibratos and transitions between notes.

First, we convolve each row of the submatrix $g^k_{jn}(t)$ with a smoothing Gaussian filter [35, p. 86]. We choose a one-dimensional Gaussian:

$$w(t) = \frac{1}{\sqrt{2\pi}\,\phi}\, e^{-t^2/2\phi^2} \quad (10)$$

where $t$ is the time axis and $\phi$ is the standard deviation. Thus, each row vector of $g^k_{jn}(t)$ is convolved with $w(t)$, and the result is truncated in order to preserve the dimensions of the initial matrix by removing the mirrored frames.

Second, we penalize values in $g^k_{jn}(t)$ which are further away from the central bin by multiplying each column vector of this matrix with a one-dimensional Gaussian centered at the central frequency bin, represented by the vector $v(n)$:

$$v(n) = \frac{1}{\sqrt{2\pi}\,\nu}\, e^{-(n-\kappa)^2/2\nu^2} \quad (11)$$

where $n$ is the frequency axis, $\kappa$ is the position of the central frequency bin, and $\nu$ is the standard deviation. The values of these parameters are given in Section 6.4.1 as a part of the evaluation setup.
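The two steps of (10) and (11) can be sketched with SciPy as follows; patch stands for the submatrix of one note (frequency bins x frames), and the parameter values are the ones listed in Section 6.4.1 (note that gaussian_filter1d uses an internally normalized kernel, a small departure from the literal constants in (10)):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def preprocess_patch(patch, phi=3.0, kappa=4, nu=4.0):
    """Smooth each row in time, cf. (10), then weight rows by a Gaussian
    centered on the central frequency bin kappa, cf. (11)."""
    smoothed = gaussian_filter1d(patch, sigma=phi, axis=1, mode='mirror')
    n = np.arange(patch.shape[0])
    v = np.exp(-((n - kappa) ** 2) / (2 * nu ** 2)) / (np.sqrt(2 * np.pi) * nu)
    return smoothed * v[:, None]
```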

4.1.2. Image Binarization. Image binarization sets to zero the elements of the matrix $g^k_{jn}(t)$ which are lower than a threshold, and to one the elements larger than the threshold. This yields a binary submatrix $\bar{g}^k_{jn}(t)$ associated with note $k$:

$$\bar{g}^k_{jn}(t) = \begin{cases} 0 & \text{if } g^k_{jn}(t) < \operatorname{mean}\left(g^k_{jn}(t)\right) \\ 1 & \text{if } g^k_{jn}(t) \geq \operatorname{mean}\left(g^k_{jn}(t)\right) \end{cases} \quad (12)$$

4.1.3. Blob Selection. First, we detect blobs in each binary submatrix $\bar{g}^k_{jn}(t)$, using the connectivity rules described in [35, p. 248] and [27]. Second, from the detected blob candidates, we determine the best blob for each note, in a similar way to [11]. We assign a value to each blob depending on its area and on the overlap with the blobs corresponding to adjacent notes, which will help us penalize the overlap between blobs of adjacent notes.



As a first step, we penalize the parts of the blobs which overlap in time with blobs from the adjacent notes $k-1$ and $k+1$. This is done by weighting each element in $\bar{g}^k_{jn}(t)$ with a factor $\gamma$, depending on the amount of overlap with blobs from adjacent notes. The resulting score matrix has the following expression:

119896119895119899 (119905) =

120574 lowast 119896119895119899 (119905) if 119896119895119899 (119905) and 119896minus1119895119899 (119905) = 1120574 lowast 119896119895119899 (119905) if 119896119895119899 (119905) and 119896+1119895119899 (119905) = 1119896119895119899 (119905) otherwise

(13)

where $\gamma$ is a value in the interval $[0, 1]$.

Then, we compute a score for each note $k$ and for each blob associated with the note, by summing up the elements in the score matrix $q^k_{jn}(t)$ which are considered part of the blob. The best blob candidate is the one with the highest score; it is then associated with the note, its boundaries giving the note onset and offset.
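Blob detection and selection can be sketched with scipy.ndimage, which provides the connected-component labeling used here; the scoring below is a simplified rendering of (12)-(13):

```python
import numpy as np
from scipy.ndimage import label

def select_best_blob(patch, prev_blob=None, next_blob=None, gamma=0.5):
    """Binarize a note patch as in (12), score its blobs as in (13), and
    return a binary matrix keeping only the best blob.

    prev_blob/next_blob: binary matrices of the adjacent notes, or None.
    """
    binary = (patch >= patch.mean()).astype(float)      # (12)
    score = binary.copy()                               # (13): penalize overlaps
    for other in (prev_blob, next_blob):
        if other is not None:
            score[np.logical_and(binary == 1, other == 1)] *= gamma
    labels, n_blobs = label(binary)                     # connected components
    if n_blobs == 0:
        return binary
    best = max(range(1, n_blobs + 1), key=lambda b: score[labels == b].sum())
    return (labels == best).astype(float)
```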

4.2. Gains Reinitialization and Recomputation. Having associated a blob with each note (Section 4.1.3), we discard obtrusive energy from the gains matrix $g_{jn}(t)$ by eliminating the pixels corresponding to the blobs which were not selected, making the matrix sparser. Furthermore, the energy excluded from an instrument's gains is redistributed to other instruments, contributing to better source separation. Thus, the gains are reinitialized with the information obtained from the corresponding blobs, and we can repeat the factorization in Algorithm 1 to recompute $g_{jn}(t)$. Note that the energy excluded by note refinement is set to zero in $g_{jn}(t)$ and will remain zero during the factorization.

In order to refine the gains $g_{jn}(t)$, we define a set of matrices $p^k_{jn}(t)$, derived from the matrices corresponding to the best blobs, which contain 1 only for the elements associated with the best blob and 0 otherwise. We rebuild the gains matrix $g_{jn}(t)$ from the set of submatrices $p^k_{jn}(t)$: for the corresponding bins $n$ and time frames $t$ of note $k$, we initialize the values in $g_{jn}(t)$ with the values in $p^k_{jn}(t)$. Then, we reiterate the gains estimation with Algorithm 1. Finally, we obtain the spectrograms of the separated sources with the method described in Section 3.5.

5. PARAFAC Model for Multichannel Gains Estimation

Parallel factor analysis (PARAFAC) methods [28, 29] are mostly used under the nonnegative tensor factorization paradigm. By these means, the NMF model is extended to work with 3-valence tensors, where each slice of the tensor represents the spectrogram of a channel. Another approach is to stack up the spectrograms of all channels in a single matrix [36] and perform a joint estimation of the spectrograms of the sources in all channels. Hence, we extend the NMF model in Section 3 to jointly estimate the gains matrices in all the channels.

5.1. Multichannel Gains Estimation. The algorithm described in Section 3 estimates the gains $g_{jn}(t)$ of source $j$ with respect to a single channel $i$, determined as the row with the maximum value in column $j$ of the panning matrix. However, we argue that a better estimation can benefit from the information in all channels. To this extent, we can further include update rules for other parameters, such as the mixing matrix $m_{ij}$, which were otherwise kept fixed in Section 3, because the factorization algorithm estimates the parameters jointly for all the channels.

We propose to integrate information from all channels by concatenating their corresponding spectrogram matrices along the time axis, as in

$$x(f,t) = \left[\, x_1(f,t)\;\; x_2(f,t)\; \cdots\; x_I(f,t)\, \right] \quad (14)$$

We are interested in jointly estimating the gains of source $j$ in all the channels. Consequently, we concatenate in time the gains corresponding to each channel $i$, for $i = 1,\ldots,I$, where $I$ is the total number of channels, as seen in (15). The new gains are initialized with identical score information, obtained from the alignment stage. However, during the estimation, the gains $g^i_{jn}(t)$ for channel $i$ evolve according to the corresponding spectrogram $x_i(f,t)$. Moreover, during the gains refinement stage, each gain is refined separately with respect to each channel:

$$g_{jn}(t) = \left[\, g^1_{jn}(t)\;\; g^2_{jn}(t)\; \cdots\; g^I_{jn}(t)\, \right] \quad (15)$$

In (5) we describe the factorization model for the estimated spectrogram, considering the mixing matrix, the basis, and the gains. Since we estimate a set of $I$ gains for each source $j = 1,\ldots,J$, this results in $J$ estimations of the spectrograms corresponding to all the channels $i = 1,\ldots,I$, as seen in

$$\hat{x}^j_i(f,t) = \sum_{j=1}^{J} m_{ij} \sum_{n=1}^{N} g^i_{jn}(t) \sum_{h=1}^{H} a_{jn}(h)\, G(f - hf_0(n)) \quad (16)$$

Each iteration of the factorization algorithm yields additional information on the distribution of energy between each instrument and each channel. Therefore, we can include in the factorization the update rule for the mixing matrix $m_{ij}$ in (17). By updating the mixing parameters at each factorization step, we can obtain a better estimation of $\hat{x}^j_i(f,t)$:

$$m_{ij} \leftarrow m_{ij} \frac{\sum_{f,t} b_{jn}(f)\, g_{jn}(t)\, x_i(f,t)\, \hat{x}_i(f,t)^{\beta-2}}{\sum_{f,t} b_{jn}(f)\, g_{jn}(t)\, \hat{x}_i(f,t)^{\beta-1}} \quad (17)$$

Considering the above, the new rules to estimate the parameters are described in Algorithm 2.


(1) Initialize $b_{jn}(f)$ with the values learned in Section 3.3.
(2) Initialize the gains $g_{jn}(t)$ with score information.
(3) Initialize the panning matrix $m_{ij}$ with the values learned in Section 3.1.
(4) Update the gains using equation (6).
(5) Update the panning matrix using equation (17).
(6) Repeat Steps (4) and (5) until the algorithm converges (or the maximum number of iterations is reached).

Algorithm 2: Gain estimation method.
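A compact sketch of Algorithm 2 with per-channel gains, under assumed tensor shapes; the einsum index patterns follow (6), (16), and (17):

```python
import numpy as np

def estimate_multichannel(x, b, g, m, beta=1.3, n_iter=10, eps=1e-12):
    """Sketch of Algorithm 2: alternate the gains update (6) with the panning
    update (17). x: (I, F, T) spectrograms, b: (J, N, F) fixed basis,
    g: (I, J, N, T) per-channel gains, m: (I, J) panning matrix."""
    for _ in range(n_iter):
        x_hat = np.einsum('ij,jnf,ijnt->ift', m, b, g) + eps   # cf. (16)
        num = np.einsum('ij,jnf,ift->ijnt', m, b, x * x_hat ** (beta - 2))
        den = np.einsum('ij,jnf,ift->ijnt', m, b, x_hat ** (beta - 1)) + eps
        g = g * num / den                                      # (6), per channel
        num_m = np.einsum('jnf,ijnt,ift->ij', b, g, x * x_hat ** (beta - 2))
        den_m = np.einsum('jnf,ijnt,ift->ij', b, g, x_hat ** (beta - 1)) + eps
        m = m * num_m / den_m                                  # (17)
    return g, m
```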

Note that the current model does not estimate the phases for each channel. In order to reconstruct source $j$, the model in Section 3 uses the phase of the signal corresponding to the channel $i$ in which the panning matrix has its maximum value, as described in Section 3.5. Thus, in order to reconstruct the original signals, we rely solely on the gains estimated in that single channel $i$, in a similar way to the baseline method.

5.2. Multichannel Gains Refinement. As presented in Section 5.1, for a given source we obtain an estimation of the gains corresponding to each channel. Therefore, we can apply the note refinement heuristics in a similar manner to Section 4.1 for each of the gains $[g^1_{jn}(t),\ldots,g^I_{jn}(t)]$. Then, we can average the estimations over all channels, making the blob detection more robust to the variance between channels:

$$g'_{jn}(t) = \frac{\sum_{i=1}^{I} g^i_{jn}(t)}{I} \quad (18)$$

Having computed the mean over all channels as in (18), for each note $k = 1,\ldots,K_j$ we process the submatrix $g^k_{jn}(t)$ of the new gains matrix $g'_{jn}(t)$, where $K_j$ is the total number of notes for instrument $j$. Specifically, we apply the same steps, preprocessing (Section 4.1.1), binarization (Section 4.1.2), and blob selection (Section 4.1.3), to each matrix $g^k_{jn}(t)$, and we obtain a binary matrix $p^k_{jn}(t)$ having 1s for the elements corresponding to the best blob and 0s for the rest.

Our hypothesis is that averaging the gains over all channels makes blob detection more robust. However, when performing the averaging, we do not account for the delays between the channels. In order to compute the delay for a given channel, we can compute the best blob separately for that channel with the method in Section 4.1 and compare it with the one calculated from the averaged estimation, $p^k_{jn}(t)$. This step is equivalent to comparing the onset times of the two best blobs of the two estimations. Subtracting these onset times, we get the delay between the averaged estimation and the one obtained for the channel, and we can correct for it in the matrix $p^k_{jn}(t)$: we zero-pad the beginning of $p^k_{jn}(t)$ with the number of zeros corresponding to the delay, or we remove the trailing zeros for a negative delay.
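The following sketch averages per-channel gains as in (18) and applies a per-channel onset-delay correction; refine_fn stands for the single-channel best-blob routine of Section 4.1, and the shift helper is an assumed implementation of the zero-padding/truncation step (it also assumes a blob is always found):

```python
import numpy as np

def shift(mat, d):
    """Shift a (bins x frames) matrix by d frames: zero-pad at the beginning
    for positive d, drop leading frames for negative d."""
    out = np.zeros_like(mat)
    if d >= 0:
        out[:, d:] = mat[:, :mat.shape[1] - d] if d > 0 else mat
    else:
        out[:, :d] = mat[:, -d:]
    return out

def refine_multichannel(g_channels, refine_fn):
    """g_channels: (I, bins, frames) gains of one note in each channel;
    refine_fn maps a patch to its binary best-blob matrix (Section 4.1)."""
    g_mean = g_channels.mean(axis=0)                    # (18)
    p_mean = refine_fn(g_mean)                          # best blob on the average
    onset_mean = np.flatnonzero(p_mean.any(axis=0))[0]  # its onset frame
    refined = []
    for g_i in g_channels:
        onset_i = np.flatnonzero(refine_fn(g_i).any(axis=0))[0]
        refined.append(shift(p_mean, onset_i - onset_mean))
    return np.stack(refined)
```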

6. Materials and Evaluation

6.1. Dataset. The audio material used for evaluation was presented by Pätynen et al. [30] and consists of four passages of symphonic music from the Classical and Romantic periods. This work presented a set of anechoic recordings for each of the instruments, which were then synchronized so that they could later be combined into a mix of the orchestra. The musicians played in an anechoic chamber and, in order to be synchronous with the rest of the instruments, they followed a video featuring a conductor and a pianist playing each of the four pieces. Note that the benefit of having isolated recordings comes at the expense of ignoring the interactions between musicians, which commonly affect intonation and time-synchronization [37].

The four pieces differ in the number of instruments per instrument class, style, dynamics, and size. The first passage is a soprano aria of Donna Elvira from the opera Don Giovanni by W. A. Mozart (1756-1791); it corresponds to the Classical period and is traditionally played by a small group of musicians. The second passage is from L. van Beethoven's (1770-1827) Symphony no. 7, featuring big chords and string crescendo. The chords and pauses make the reverberation tail of a concert hall clearly audible. The third passage is from Bruckner's (1824-1896) Symphony no. 8 and represents the late Romantic period. It features large dynamics and a large orchestra. Finally, G. Mahler's Symphony no. 1, also featuring a large orchestra, is another example of late Romanticism. The piece has a more complex texture than the one by Bruckner. Furthermore, according to the musicians who recorded the dataset, the last two pieces were also more difficult to play and record [30].

In order to keep the evaluation setup consistent between the four pieces, we focus on the following instruments: violin, viola, cello, double bass, oboe, flute, clarinet, horn, trumpet, and bassoon. All tracks from a single instrument were joined into a single track for each of the pieces.

For the selected instruments, we list the differences between the four pieces in Table 1. Note that in the original dataset the violins are separated into two groups. However, for brevity of evaluation, and because in our separation framework we do not consider sources sharing the same instrument templates, we decided to merge the violins into a single group. Note that the pieces by Mahler and Bruckner have divisi in the groups of violins, which implies a larger number of instruments playing different melody lines simultaneously. This results in a scenario which is more challenging for source separation.

We created a ground truth score by hand-annotating the notes played by the instruments. In order to facilitate this process, we first gathered the scores in MIDI format and automatically computed a global audio-to-score alignment using the method from [13], which has won the MIREX score-following challenge in past years.


Table 1: Anechoic dataset [30] characteristics.

Piece      Duration    Period     Instrument sections   Number of tracks   Max tracks/instrument
Mozart     3 min 47 s  Classical  8                     10                 2
Beethoven  3 min 11 s  Classical  10                    20                 4
Mahler     2 min 12 s  Romantic   10                    30                 4
Bruckner   1 min 27 s  Romantic   10                    39                 12

[Figure 3: The steps to create the multichannel recordings dataset: the anechoic recordings and the score annotations are used to obtain denoised recordings, from which the multichannel recordings are generated.]

Then, we locally aligned the notes of each instrument by manually correcting the onsets and offsets to fit the audio. This was performed using Sonic Visualiser, with the guidance of the spectrogram and of the monophonic pitch estimation [38] computed for each of the isolated instruments. The annotation was performed by two of the authors, who cross-checked the work of their peer. Note that this dataset and the proposed annotation are useful not only for our particular task, but also for the evaluation of multiple pitch estimation and automatic transcription algorithms in large orchestral settings, a context which has not been considered so far in the literature. The annotations can be found at the associated page (http://mtg.upf.edu/download/datasets/phenicx-anechoic).

During the recording process detailed in [30], the gain of the microphone amplifiers was fixed to the same value for the whole session, which reduced the dynamic range of the recordings of the quieter instruments. This led to noisier recordings for most of the instruments. In Section 6.2 we describe the score-informed denoising procedure we applied to each track. From the denoised isolated recordings, we then used Roomsim to create a multichannel image, as detailed in Section 6.3. The steps necessary to pass from the anechoic recordings to the multichannel dataset are represented in Figure 3. The original files can be obtained from the Acoustics Group at Aalto University (http://research.cs.aalto.fi/acoustics/). For the denoising algorithm, please refer to http://mtg.upf.edu/download/datasets/phenicx-anechoic.

6.2. Dataset Denoising. The noise-related problems in the dataset were presented in [30]. We remove the noise in the recordings with the score-informed method in [39], which relies on learned noise spectral patterns. The main difference is that we rely on a manually annotated score, while in [39] the score is assumed to be misaligned, so further regularization is included to ensure that only certain note combinations in the score occur.

The annotated score yields the time intervals in which an instrument is not playing. Thus, the noise pattern is learned only within those intervals. In this way, the method assures that the desired noise, which is a part of the actual sound of the instrument, is preserved in the denoised recording.

[Figure 4: The sources (violins, violas, cellos, double bass, flutes, oboes, clarinets, bassoons, horns, trumpets) and the receivers (microphones CL, RVL, V1, V2, TR, HRN, WWR, WWL) in the simulated room.]

The algorithm takes each anechoic recording of a given instrument and removes the noise in the time intervals where the instrument is playing, while setting to zero the frames where the instrument is not playing.

6.3. Dataset Spatialization. To simulate a large reverberant hall, we use the software Roomsim [31]. We define a configuration file which specifies the characteristics of the hall and, for each of the microphones, its position relative to each of the sources. The simulated room has dimensions similar to those of the Royal Concertgebouw in Amsterdam, one of the partners in the PHENICX project, and represents a setup in which we tested our framework. The simulated room's width, length, and height are 28 m, 40 m, and 12 m. The absorption coefficients are specified in Table 2.

The positions of the sources and microphones in the room are common for orchestral concerts (Figure 4). A configuration file is created for each microphone, containing its coordinates (e.g., (14, 17, 4) for the center microphone). Then, each source is defined through polar coordinates relative to the microphone (e.g., (11.4455, -95.1944, -15.1954): radius, azimuth, and elevation for the bassoon relative to the center microphone). We selected all the microphones to be cardioid, in order to match the realistic setup of the Concertgebouw Hall.


Table 2: Room surface absorption coefficients.

Standard measurement frequencies (Hz)        125   250   500   1000  2000  4000
Absorption of wall in x = 0 plane            0.4   0.3   0.3   0.3   0.2   0.1
Absorption of wall in x = Lx plane           0.4   0.45  0.35  0.35  0.45  0.3
Absorption of wall in y = 0 plane            0.4   0.45  0.35  0.35  0.45  0.3
Absorption of wall in y = Ly plane           0.4   0.45  0.35  0.35  0.45  0.3
Absorption of floor, i.e., z = 0 plane       0.5   0.6   0.7   0.8   0.8   0.9
Absorption of ceiling, i.e., z = Lz plane    0.4   0.45  0.35  0.35  0.45  0.3

[Figure 5: The reverberation time (RT60, in seconds) versus frequency for the simulated room.]

Using the configuration file and the anechoic audio files corresponding to the isolated sources, Roomsim generates the audio files for each of the microphones, along with the impulse responses for each pair of instruments and microphones. The impulse responses and the anechoic signals are used during the evaluation to obtain the ground truth spatial image of each source in the corresponding microphone. Additionally, we plot in Figure 5 the reverberation time (RT60) [31] across frequency for the simulated room.

We need to adapt the ground truth annotations to the audio generated with Roomsim, as the original annotations were made on the isolated audio files. Roomsim creates the audio for a given microphone by convolving each of the sources with the corresponding impulse response and then summing up the results of the convolutions. We compute a delay for each pair of microphones and instruments by taking the position of the maximum value in the associated impulse response vector. Then, we generate a score for each of the pairs by adding the corresponding delay to the note onsets. Additionally, since the offset time depends on the reverberation and on the frequencies of the notes, we add 0.8 s to each note offset, besides the added delay, to account for the reverberation.

6.4. Evaluation Methodology

6.4.1. Parameter Selection. In this paper, we use a low-level spectral representation of the audio data, generated from a windowed FFT of the signal. We use a Hanning window with a size of 92 ms and a hop size of 11 ms.

A logarithmic frequency discretization is adopted. Furthermore, two time-frequency resolutions are used. First, to estimate the instrument models and the panning matrix, a single-semitone resolution is used; in particular, we implement the time-frequency representation by integrating the STFT bins corresponding to the same semitone. Second, for the separation task, a higher resolution of 1/4 of a semitone is used, which has proven to achieve better separation results [6]. This time-frequency representation is obtained by integrating the STFT bins corresponding to 1/4 of a semitone. Note that, in the separation stage, the learned basis functions $b_{jn}(f)$ are adapted to the 1/4-semitone resolution by replicating the basis at each semitone to the 4 samples of the 1/4-semitone resolution that belong to this semitone. For the image preprocessing stage, we pick $\phi = 3$ as the standard deviation of the first Gaussian and, for the second Gaussian, $\kappa = 4$ as the position of the central frequency bin and $\nu = 4$ as the standard deviation, corresponding to one semitone.

We pick 10 iterations for the NMF, and we set the beta-divergence distortion to $\beta = 1.3$, as in [6, 11].

6.4.2. Evaluation Setup. We perform three different kinds of evaluation: audio-to-score alignment, panning matrix estimation, and score-informed source separation. Note that for the alignment we evaluate the state-of-the-art system in [13]. This method does not align single notes but combinations of notes in the score (aka states). Here, the alignment is performed with respect to a single audio channel, corresponding to the microphone situated at the center of the stage. The offsets are estimated either by shifting the original duration of each note in the score [13] or by assigning as offset time the onset of the next state. We denote these two cases as INT and NEX.

Regarding the initialization of the separation framework, we can use the raw output of the alignment system. However, as stated in Section 4 and in [3, 5, 24], a better option is to extend the onsets and offsets along a tolerance window, to account for the errors of the alignment system, for the delays between the center channel (on which the alignment is performed) and the other channels, and for possible errors in the alignment itself. Thus, we test two hypotheses regarding the tolerance window. In the first case, we extend the boundaries with 0.3 s for onsets and 0.6 s for offsets (T1), and in the second with 0.6 s for onsets and 1 s for offsets (T2). Note that the value of 0.3 s for the onset times is not arbitrary, but the usual threshold for onsets in the MIREX evaluation of real-time score-following [40].


Table 3: Score information used for the initialization of score-informed source separation.

Tolerance window size              Offset estimation
T1: onsets 0.3 s, offsets 0.6 s    INT: interpolation of the offset time
T2: onsets 0.6 s, offsets 0.9 s    NEX: the offset is the onset of the next note

Two different tolerance windows were tested to account for the complexity of this novel scenario. The tolerance window is slightly larger for offsets, due to the reverberation time and because the ending of a note is not as clear as its onset. A summary of the score information used to initialize the source separation framework is found in Table 3.

We label the test case corresponding to the initialization with the raw output of the alignment system as Ali. Conversely, the test case corresponding to the tolerance window initialization is labeled Ext. Furthermore, within the tolerance window, we can refine the note onsets and offsets with the methods in Section 4.1 (Ref1) and Section 5.2 (Ref2), resulting in two other test cases. Since method Ref1 can only refine the score with respect to a single channel, its results are solely computed with respect to that channel. For the multichannel refinement Ref2, we report the results of the alignment of each instrument with respect to each microphone. A graphic of the initialization of the framework with the four test cases listed above (Ali, Ext, Ref1, and Ref2), along with the ground truth score initialization (GT), is found in Figure 7, where we present the results for these cases in terms of source separation.

In order to evaluate the panning matrix estimation stage, we compute an ideal panning matrix based on the impulse responses generated by Roomsim during the creation of the multichannel audio (see Section 6.3). The ideal panning matrix gives the ideal contribution of each instrument to each channel, and it is computed by searching for the maximum in the impulse response vector corresponding to each instrument-channel pair, as in

$$m_{\text{ideal}}(i,j) = \max_t \left( \text{IR}_{i,j}(t) \right) \quad (19)$$

where $\text{IR}_{i,j}(t)$ is the impulse response of source $i$ in channel $j$. By comparing the estimated matrix $m(i,j)$ with the ideal one, $m_{\text{ideal}}(i,j)$, we can determine whether the algorithm picked a wrong channel for separation.
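A direct transcription of (19), with impulse_responses as a nested list of 1-D NumPy arrays (an assumed data layout):

```python
import numpy as np

def ideal_panning(impulse_responses):
    """Ideal panning matrix from the Roomsim impulse responses, cf. (19).

    impulse_responses[i][j]: 1-D array, IR of source i at channel j.
    """
    I, J = len(impulse_responses), len(impulse_responses[0])
    m = np.zeros((I, J))
    for i in range(I):
        for j in range(J):
            m[i, j] = np.max(impulse_responses[i][j])
    return m
```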

6.4.3. Evaluation Metrics. For score alignment, we are interested in a measure which relates to source separation and accounts for the audio frames which are correctly detected, rather than in an alignment rate computed per note onset, as found in [13]. Thus, we evaluate the alignment at the frame level rather than at the note level. A similar reasoning on the evaluation of score alignment is found in [4].

We consider 0.011 s, the size of a frame, as the temporal granularity of this measure. Then, a frame of a musical note is considered a true positive (tp) if it is found in the ground truth score and in the aligned score within the exact time boundaries. The same frame is labeled as a false positive (fp) if it is found only in the aligned score, and as a false negative (fn) if it is found only in the ground truth score. Since the gains initialization is done with score information (see Section 4), lost frames (recall) and incorrectly detected frames (precision) impact the performance of the source separation algorithm. Precision is defined as $p = \text{tp}/(\text{tp} + \text{fp})$ and recall as $r = \text{tp}/(\text{tp} + \text{fn})$. Additionally, we compute the harmonic mean of precision and recall to obtain the $F$-measure, $F = 2 \cdot (p \cdot r)/(p + r)$.
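Computed over sets of (note, frame) pairs, the metrics reduce to a few lines (a sketch; the set representation is our own choice):

```python
def frame_metrics(gt_frames, aligned_frames):
    """Frame-level precision, recall, and F-measure, with frames of 0.011 s.

    gt_frames, aligned_frames: sets of (note_id, frame_index) pairs.
    """
    tp = len(gt_frames & aligned_frames)
    fp = len(aligned_frames - gt_frames)
    fn = len(gt_frames - aligned_frames)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```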

The source separation evaluation framework and the metrics employed are described in [41, 42]. Correspondingly, we use the Source to Distortion Ratio (SDR), the Source to Interference Ratio (SIR), the Source to Artifacts Ratio (SAR), and the source Image to Spatial distortion Ratio (ISR). While SDR measures the overall quality of the separation and ISR the spatial reconstruction of the source, SIR relates to the rejection of interferences and SAR to the absence of forbidden distortions and artifacts.

The evaluation of source separation is a computationally intensive process. Additionally, processing the long audio files in the dataset would require a large amount of memory for the matrix calculations. To reduce the memory requirements, the evaluation is performed in blocks of 30 s, with 1 s overlap to allow for continuation.

6.5. Results

6.5.1. Score Alignment. We evaluate the output of the alignment, Ali, along with the estimation of the note offsets, INT and NEX, in terms of $F$-measure (see Section 6.4.3), precision, and recall, ranging from 0 to 1. Additionally, we evaluate the optimal size for extending the note boundaries along the onsets and offsets, T1 and T2, for the refinement methods Ref1 and Ref2 and for the baseline Ext. Since there are many differences between the pieces, we report the results individually per piece in Table 4.

Methods Ref1 and Ref2 depend on a binarization threshold which determines how much energy is set to zero. A lower threshold results in the consolidation of larger blobs during blob detection. In [11], this threshold is set to 0.5 for a dataset of monaural recordings of Bach chorales played by four instruments. However, we are facing a multichannel scenario in which capturing the reverberation is important, especially considering that the offsets were annotated with a low energy threshold. Thus, we are interested in losing as little energy as possible, and we set lower values for the threshold: 0.3 and 0.1. Consequently, when analyzing the results, the lower threshold achieves better performance in terms of $F$-measure for Ref1 (0.67 and 0.72) and for Ref2 (0.71 and 0.75).

According to Table 4, extending the note offsets (NEX), rather than interpolating them (INT), gives lower recall for all pieces, and the method leads to losing more frames, which cannot be recovered even by extending the offset times as in T2. NEX T2 always yields a lower recall when compared to INT T2 (e.g., $r = 0.85$ compared to $r = 0.97$ for Mozart).

The output of the alignment system, Ali, is not a good option for initializing the gains of the source separation system: it has a high precision but a very low recall (e.g., the case INT Ali has $p = 0.95$ and $r = 0.39$, compared to the case INT Ext, which has $p = 0.84$ and $r = 0.84$, for Beethoven).


Table 4: Alignment evaluated in terms of F-measure, precision, and recall, ranging from 0 to 1.

                 Mozart            Beethoven         Mahler            Bruckner
                 F    p    r       F    p    r       F    p    r       F    p    r
INT  Ali         0.69 0.93 0.55    0.56 0.95 0.39    0.61 0.85 0.47    0.60 0.94 0.32
     T1  Ref1    0.77 0.77 0.76    0.79 0.81 0.77    0.63 0.74 0.54    0.72 0.73 0.70
         Ref2    0.83 0.79 0.88    0.82 0.82 0.81    0.67 0.77 0.60    0.81 0.77 0.85
         Ext     0.82 0.73 0.94    0.84 0.84 0.84    0.77 0.69 0.87    0.86 0.78 0.96
     T2  Ref1    0.77 0.77 0.76    0.76 0.75 0.77    0.63 0.74 0.54    0.72 0.70 0.71
         Ref2    0.83 0.78 0.88    0.82 0.76 0.87    0.67 0.77 0.59    0.79 0.73 0.86
         Ext     0.72 0.57 0.97    0.79 0.70 0.92    0.69 0.55 0.93    0.77 0.63 0.98
NEX  Ali         0.49 0.94 0.33    0.51 0.89 0.36    0.42 0.90 0.27    0.48 0.96 0.44
     T1  Ref1    0.70 0.77 0.64    0.72 0.79 0.66    0.63 0.74 0.54    0.68 0.72 0.64
         Ref2    0.73 0.79 0.68    0.71 0.79 0.65    0.66 0.77 0.58    0.73 0.74 0.72
         Ext     0.73 0.77 0.68    0.71 0.80 0.65    0.69 0.75 0.64    0.76 0.79 0.72
     T2  Ref1    0.74 0.78 0.71    0.72 0.74 0.70    0.63 0.74 0.54    0.72 0.79 0.72
         Ref2    0.80 0.80 0.80    0.75 0.75 0.75    0.67 0.77 0.59    0.79 0.73 0.86
         Ext     0.73 0.65 0.85    0.73 0.69 0.77    0.72 0.64 0.82    0.79 0.69 0.91


Table 5: Instruments for which the closest microphone was incorrectly determined, for different score information (GT, Ali, T1, T2, INT, and NEX) and two room setups.

         GT        INT Ali                INT T1           INT T2                  NEX Ali                NEX T1   NEX T2
Setup 1  Clarinet  Clarinet, double bass  Clarinet, flute  Clarinet, flute, horn   Clarinet, double bass  --       Clarinet, flute
Setup 2  Cello     Cello, flute           Cello, flute     Cello, flute            Bassoon                Flute    Cello, flute

For the case of Beethoven, the output is particularly poor compared to the other pieces. However, by extending the boundaries (Ext) and applying note refinement (Ref1 or Ref2), we are able to increase the recall and match the performance on the other pieces.

When comparing the sizes of the tolerance window for the onsets and offsets, we observe that the alignment is more accurate when detecting the onsets within 0.3 s and the offsets within 0.6 s. In Table 4, T1 achieves better results than T2 (e.g., $F = 0.77$ for T1 compared to $F = 0.69$ for T2, for Mahler). Relying on a large window retrieves more frames but also significantly damages the precision. However, for source separation we may want to lose as little information as possible. It is in this case that the refinement methods Ref1 and Ref2 show their importance: when facing larger time boundaries, as in T2, Ref1 and especially Ref2 are able to reduce the errors, achieving better precision with a minimal loss in recall.

The refinement Ref1 performs worse than Ref2, the multichannel refinement (e.g., $F = 0.72$ compared to $F = 0.81$ for Bruckner, INT T1). Note that, in its original version [11], Ref1 assumed monophony within a source, as it was tested in the simple case of the Bach10 dataset [4]. To that extent, it relied on a graph computation to determine the best distribution of blobs. However, due to the increased polyphony within an instrument (e.g., violins playing divisi), with simultaneous melodic lines, we disabled this feature, and in this case Ref1 has lower recall: it loses more frames. On the other hand, Ref2 is more robust, because it computes a blob estimation per channel; averaging out these estimations yields better results.

The refinement works worse for the more complex pieces (Mahler and Bruckner) than for the simpler ones (Mozart and Beethoven). The increased polyphony within a source and the larger number of instruments with many interleaving melodic lines, that is, a less sparse score, make the task more difficult.

6.5.2. Panning Matrix. Correctly estimating the panning matrix is an important step in the proposed method, since Wiener filtering is performed on the channel in which the instrument has the most energy. If the algorithm picks a different channel for this step, we find more interference between instruments in the separated audio files.

As described in Section 3.1, the estimation of the panning matrix depends on the number of nonoverlapping partials of the notes found in the score and on their alignment with the audio. To that extent, the more nonoverlapping partials we have, the more robust the estimation.

Initially, we experimented with computing the panning matrix separately for each piece. In the case of Bruckner, the piece is simply too short, and there are too few nonoverlapping partials to yield a good estimation, resulting in errors in the panning matrix. Since the instrument setup is the same for the Bruckner, Beethoven, and Mahler pieces (10 sources in the same positions on the stage), we decided to jointly estimate the matrix from the concatenated audio pieces and their associated scores. We denote as Setup 1 the Mozart piece, played by 8 sources, and as Setup 2 the Beethoven, Mahler, and Bruckner pieces, played by 10 sources.

Since the panning matrix is computed using the score, different score information can yield very different estimations of the panning matrix. To that extent, we evaluate the influence of the audio-to-score alignment, namely, the cases INT, NEX, Ali, T1, and T2, as well as the initialization with the ground truth score information, GT.

In Table 5, we list the instruments for which the algorithm picked the wrong channel. Note that, in the room setup generated with Roomsim, most of the instruments in Table 5 are placed close to other sources from the same family of instruments, for example, cello and double bass, or flute with clarinet, bassoon, and oboe. In such cases, the algorithm makes more mistakes when selecting the correct channel on which to perform source separation.

In the column GT of Table 5, we can see that having a perfectly aligned score yields fewer errors when estimating the panning matrix. Conversely, in a real-life scenario, we cannot rely on a hand-annotated score. In this case, for all columns of Table 5 excluding GT, the best estimation is obtained by the combination of NEX and T1: taking the offset time as the onset of the next note and then extending the score with the smaller tolerance window.

Furthermore, we compute the SDR values for the instruments in Table 5, column GT (clarinet and cello), when the separation is done in the correct channel and in the estimated channel. For Setup 1, the channel for the clarinet is wrongly set to WWL (woodwinds left), the correct one being WWR (woodwinds right), even with a perfectly aligned score (GT). However, the microphones WWL and WWR are very close (see Figure 4) and do not capture significant energy from other instrument sections, so the SDR difference is less than 0.01 dB. However, in Setup 2, the cello is wrongly separated in the WWL channel, and the SDR difference between this audio and the audio separated in the correct channel is around -1.1 dB for each of the three pieces.

6.5.3. Source Separation. We use the evaluation metrics described in Section 6.4.3.


[Figure 6: Results in terms of SDR, SIR, SAR, and ISR for the instruments and the songs in the dataset.]

Since there is a lot of variability between the four pieces, it is more informative to present the results per piece rather than aggregated.

First, we analyze the separation results per instrument in an ideal case. We assume that the best results for score-informed source separation are obtained in the case of a perfectly aligned score (GT). Furthermore, for this case, we perform the separation in the correct channel for all the instruments, since in Section 6.5.2 we saw that picking a wrong channel can be detrimental. We present the results as a bar plot in Figure 6.

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6, we can see that the more complex the piece, the more difficult it is to achieve a good separation. For instance, note that cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for the cello in Mozart compared to -5 dB in Mahler). Cello and double bass are close to each other in both setups, and similarly clarinet and flute, so we expect interference between them. Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in the lower SIR values for the double bass (5.5 dB SIR for Mozart, but -1.8, -0.1, and -0.2 dB SIR for the others) and for the flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask, the multichannel spectrogram is multiplied with the panning matrix. Thus, wrong values in this matrix can yield wrong amplitudes in the resulting signals.

This is the case for the trumpet, which is allocated a close microphone in the current setup and for which we expect a good separation. However, the trumpet achieves a poor ISR (5.5, 1, and -1 dB) while having a good separation in terms of SIR and SAR. Similarly, other instruments, such as cello, double bass, flute, and viola, face the same problem, particularly for the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.


[Figure 7: The test cases for the initialization of score-informed source separation of the submatrix $p^k_{jn}(t)$ for note $k$: ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).]

A low SDR is obtained in the case of Mahler, which is related to the poor alignment results obtained for this piece. As seen in Table 4, for the INT case, the $F$-measure is almost 8% lower for Mahler than for the other pieces, mainly because of the poor precision.

The results are considerably worse for the double bass in the more complex pieces by Beethoven (-9.1 dB SDR), Mahler (-8.5 dB SDR), and Bruckner (-4.7 dB SDR); for further analysis we consider it an outlier and exclude it from the analysis.

Second, we want to evaluate the usefulness of note refinement for source separation. As seen in Section 4, the gains for the NMF separation are initialized with score information, or with a refined score, as described in Section 4.2. A summary of the different initialization options is seen in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization with the ground truth annotations (Figure 7 (GT)), the direct output of the score alignment system (Figure 7 (Ali)), the common practice of NMF gains initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7 (Ext)), and the refinement approaches (Figure 7 (Ref1 and Ref2)). Note that Ref1 is the refinement done with the method in Section 4.1, and Ref2 with the multichannel method described in Section 5.2.

We test the difference between the binarization thresholds 0.5 and 0.1 used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives an $F$-value of 0.0796 and a $p$ value of 0.7778, which shows no significant difference between the two binarization thresholds.

The results for the five initializations, GT, Ref1, Ref2, Ext, and Ali, are presented in Figure 8 for each of the four pieces. Note that, for Ref1, Ref2, and Ext, we aggregate information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate between the instruments, the piece by Mahler having the worst results and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors, such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues described in Section 6.1, can increase the difficulty of separation, up to the point that Ref1, Ref2, and Ext bring only a minimal improvement. To that extent, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps to improve the separation, increasing the SDR by 1-1.5 dB or more for all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase the SDR for most of the pieces, with the exception of the piece by Mahler. This is due to an increase of SIR and a decrease of the interference in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext; for this piece, the difference in SIR is around 2 dB. Then, for Beethoven, Ref1 and Ref2 gain 0.5 dB in terms of SDR when compared to Ext, and 1.5 dB in SIR. For Bruckner, solely Ref2 has a higher SDR; however, SIR increases by 1.5 dB for both Ref1 and Ref2. Note that Ref1 and Ref2 not only refine the time boundaries of the notes; the refinement also happens in frequency, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the estimation of the note offsets, INT and NEX, and of the tolerance window sizes, T1 and T2, which account for errors in the alignment. For this case, we evaluate only the case Ext, leaving out the refinement in order to isolate the influence of T1 and T2. The results are presented in Figure 9 and show that the best results are obtained for the interpolation of the offsets, INT. This relates to the results presented in Section 6.5.1. Similarly to the analysis regarding the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to draw a conclusion on which strategy is better for the initialization, as the error bars for the ground truth overlap with those of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gains estimation, as proposed in Section 5.1, and the single-channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR, obtaining an $F$-value of 1.712 and a $p$-value of 0.1908. Hence, there is no significant difference between single-channel and multichannel gains estimation when we do not postprocess the gains with the refinement. However, although the new update rules do not help by themselves in the multichannel case, they allow us to better refine the gains: we aggregate information over all the channels, and blob detection becomes more robust, even to variations of the binarization threshold. On account of that, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available online (http://repovizz.upf.edu/phenicx/anechoic_multi).

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation is Instrument Emphasis, which aims at processing multitrack orchestral recordings.

[Figure 8: Results in terms of SDR, SIR, SAR, and ISR for the NMF gains initialization in the different test cases (GT, Ref1, Ref2, Ext, and Ali).]

Once we have the separated tracks, it allows emphasizing a particular instrument over the full orchestral downmix recording. Our workflow consists in processing multichannel orchestral recordings for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at certain distances from the sources. The goal is to reduce the leakage of other sections, obtaining an enhanced signal for the selected instrument.

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user when uploading media content and for presenting the results. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding, even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks. After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks. The separation quality is not degraded if the blocks have sufficient duration; in our case, we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx/), and it is integrated into the PHENICX prototype (http://phenicx.com).
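A minimal sketch of this block-wise processing, where separate_fn stands for the whole separation pipeline applied to one block (all names are illustrative):

```python
import numpy as np

def separate_in_blocks(audio, sr, separate_fn, block_s=60):
    """Split a long multichannel recording into blocks (one minute here, as in
    the text), separate each block, and concatenate the per-instrument results.

    audio: (channels, samples); separate_fn maps a block to (instruments, samples).
    """
    hop = block_s * sr
    parts = [separate_fn(audio[:, start:start + hop])
             for start in range(0, audio.shape[1], hop)]
    return np.concatenate(parts, axis=1)
```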

7.2. Acoustic Rendering. The second application for the separated material concerns augmented or virtual reality scenarios in which we apply a spatialization of the separated musical sources. Acoustic Rendering aims at acoustically recreating the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage and of the listener.

We have considered binaural synthesis the most suitable spatial audio technique for this application. Humans locate the direction of incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound arrives with a different intensity and at different time instances at both ears.


[Figure 9: Results in terms of SDR for the combination between the different offset estimation methods (INT and NEX) and the different sizes of the tolerance window for note onsets and offsets (T1 and T2). GT is the ground truth alignment and Ali is the output of the score alignment.]

The idea behind binaural synthesis is to artificially generate these cues in order to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype, we have used the noncommercial Unity3D plugin for binaural synthesis provided by 3DCeption (https://twobigears.com/index.php).

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker: https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances), for the instrument visualization and Acoustic Rendering use-cases introduced above.

As inputs for the source localization method, we need the multichannel recordings and the approximate position of the microphones on stage. In concert halls, the recording setup typically consists of a grid structure of hanging overhead mics. The position of the overhead mics is therefore kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, and it is currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes as input the refined score alignment obtained prior to the signal separation (see Section 4.1). It follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then, the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible location of a sound source based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate a list of time delay values corresponding to all note onsets in the score and take the maximum of the histogram. In our experiment we have a time resolution of 28 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play in isolation, as SRP-PHAT [45] does, and can therefore be used in complex scenarios.
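A minimal sketch of the histogram step is given below, assuming the per-microphone onset times of one instrument have already been matched note by note; the 28 ms resolution follows the text, while the max_delay bound is an illustrative assumption.

import numpy as np

def pairwise_delay(onsets_mic_a, onsets_mic_b, resolution=0.028, max_delay=0.5):
    # time delay of one instrument between two microphones, estimated as
    # the maximum of the histogram of per-note onset differences (seconds)
    deltas = np.asarray(onsets_mic_b) - np.asarray(onsets_mic_a)
    bins = np.arange(-max_delay, max_delay + resolution, resolution)
    counts, edges = np.histogram(deltas, bins=bins)
    best = int(np.argmax(counts))
    return 0.5 * (edges[best] + edges[best + 1])  # center of the densest bin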

8. Outlook

In this paper we proposed a framework for score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then, we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings is objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment which improved the separation for three of the four pieces in the dataset, when compared to two other initialization options for our framework: the raw output of the score alignment, and the tolerance window which the alignment relies on.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems in recovering the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done on a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach to source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone to separate a source is the closest one determined by the localization method.


When looking at separation across the instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies between the pieces and the instruments, and future research could provide more insight on this problem. Note that an increasing degree of consonance was related to a more difficult case for source separation [17]. Hence, we could expect a worse separation for instruments which are harmonizing or accompanying other sections, as is the case of viola, cello, and double bass in some pieces. Future research could give more insight on the relation between the musical characteristics of the pieces (e.g., tonality and texture) and source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12,000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us to denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow for other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework related to Instrument Emphasis or Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustin Martorell for his insights on the musicology side and for correcting the scores, and Oscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference 2013.
[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.
[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.
[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.
[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.
[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodriguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.
[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.
[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.
[9] F. J. Rodriguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.
[10] F. J. Rodriguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.
[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, October 2015.
[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the Concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.
[13] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Canadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, 2015.
[14] O. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.
[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.
[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, IEEE, Taipei, Taiwan, April 2009.
[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.
[18] F. J. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.
[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.
[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.
[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.
[22] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.
[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.
[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.
[25] B. Niedermayer, Accurate audio-to-score alignment—data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.
[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.
[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.
[28] D. Fitzgerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.
[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Málaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.
[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.
[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.
[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.
[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.
[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.
[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.
[38] M. Mauch and S. Dixon, "PYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.
[39] F. J. Canadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabanas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.
[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.
[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.
[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.
[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.
[45] A. Martí Guerola, Multichannel Audio Processing for Speaker Localization, Separation and Enhancement, Universitat Politècnica de València, 2013.


(1) Initialize $b_{jn}(f)$ with the values learned in Section 3.3.
(2) Initialize the gains $g_{jn}(t)$ with score information.
(3) Initialize the mixing matrix $m_{ij}$ with the values learned in Section 3.1.
(4) Update the gains using equation (6).
(5) Repeat Step (4) until the algorithm converges (or the maximum number of iterations is reached).

Algorithm 1: Gain estimation method.
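Equation (6) appears earlier in the paper and is not reproduced in this excerpt; the sketch below shows the shape of such a multiplicative beta-divergence gain update, with the basis functions kept fixed and the mixing matrix left aside for a single channel.

import numpy as np

def estimate_gains(x, b, g, beta=1.3, n_iter=10, eps=1e-12):
    # x: (F, T) channel spectrogram; b: (F, N) fixed basis functions;
    # g: (N, T) gains initialized from the aligned score (float array)
    for _ in range(n_iter):
        xhat = b @ g + eps                       # current model of the mixture
        num = b.T @ (x * xhat ** (beta - 2.0))
        den = b.T @ (xhat ** (beta - 1.0)) + eps
        g *= num / den                           # entries initialized to 0 stay 0
    return g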

energy proportionally over all sources, as in the Wiener filtering method [23].

This model allows for estimating each separated source $s_j(t)$ from mixture $x_i(t)$ using a generalized time-frequency Wiener filter over the short-time Fourier transform (STFT) domain, as in [3, 34].

Let $\alpha_{j'}$ be the Wiener filter of source $j'$, representing the relative energy contribution of the predominant source with respect to the energy of the multichannel mixed signal $x_i(t)$ at channel $i$:

$$\alpha_{j'}(t,f) = \frac{\left|A_{ij'}\right|^2 \left|s_{j'}(f,t)\right|^2}{\sum_j \left|A_{ij}\right|^2 \left|s_j(f,t)\right|^2}. \quad (7)$$

Then the corresponding spectrogram of source $j'$ is estimated as

$$\hat{s}_{j'}(f,t) = \frac{\alpha_{j'}(t,f)}{\left|A_{ij'}\right|^2}\, x_i(f,t). \quad (8)$$

The estimated source $\hat{s}_{j'}(t)$ is computed with the inverse overlap-add STFT of $\hat{s}_{j'}(f,t)$.

The estimated source magnitude spectrogram is computed using the gains $g_{jn}(t)$ estimated in Section 3.4 and $b_{jn}(f)$, the fixed basis functions learned in Section 3.3: $\hat{s}_j(t,f) = \sum_n g_{jn}(t)\, b_{jn}(f)$. Then, if we replace $s_j(t,f)$ with $\hat{s}_j(t,f)$, and if we consider the mixing matrix coefficients computed in Section 3.1, we can calculate the Wiener mask from (7):

$$\alpha_{j'}(t,f) = \frac{m_{ij'}^2\, \hat{s}_{j'}(f,t)^2}{\sum_j m_{ij}^2\, \hat{s}_j(f,t)^2}. \quad (9)$$

Using (8), we can apply the Wiener mask to the multichannel signal spectrogram, thus obtaining $\hat{s}_{j'}(f,t)$, the estimated predominant source spectrogram. Finally, we use the phase information from the original mixture signal $x_i(t)$, and through inverse overlap-add STFT we obtain the estimated predominant source $\hat{s}_{j'}(t)$.
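A simplified sketch of this masking step is shown below; it forms the per-source energies from the mixing coefficients and the model spectrograms and applies the resulting soft mask of equation (9) directly to the channel STFT, leaving the $|A|^2$ normalization of equation (8) aside.

import numpy as np

def wiener_separate(X_i, s_hat, m_i, j, eps=1e-12):
    # X_i: (F, T) complex STFT of channel i; s_hat: (J, F, T) model
    # magnitude spectrograms; m_i: (J,) mixing coefficients for channel i
    power = (m_i[:, None, None] * s_hat) ** 2     # per-source energy, eq. (9)
    alpha = power[j] / (power.sum(axis=0) + eps)  # Wiener mask of source j
    return alpha * X_i                            # masked STFT; invert with ISTFT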

4. Gains Initialization with Score Information

In the baseline method [6], the gains are initialized following a transcription stage. In our case, the automatic alignment system yields a score which offers a representation analogous to the one obtained by the transcription. To that extent, the output of the alignment is used to initialize the gains for the NMF-based methods for score-informed source separation.

Although the alignment algorithm aims at fixing global misalignments, it does not account for local misalignments. In the case of score-informed source separation, having a better aligned score leads to better separation [11], since it increases the sparseness of the gains matrix by setting to zero the activations for the time frames in which a note is not played (e.g., the corresponding spectral template of the note in the basis matrix is not activated outside this time boundary). However, in a real-case scenario, the initialization of the gains derived from the MIDI score must take into account the local misalignments. This has traditionally been done by setting a tolerance window around the onsets and offsets [3, 5, 24], or by refining the gains after a number of NMF iterations and then reestimating the gains [11]. While the former integrates note refinement into the parametric model, the latter detects contours in the gains using image processing heuristics and explicitly associates them with meaningful entities such as notes. In this paper we present two methods for note refinement: in Section 4.1 we detail the method in [11], which is used as a baseline for our framework, and in Section 5.2 we adapt and improve this baseline to the multichannel case.

On these terms, if we account for errors of up to $d$ frames in the audio-to-score alignment, we need to increase the time interval around the onset and the offset of a MIDI note when we initialize the gains. Thus, the values in $g_{jn}(t)$ for instrument $j$ and the pitch corresponding to a MIDI note $n$ are set to 1 for the frames where the MIDI note is played, as well as the neighboring $d$ frames. The other values in $g_{jn}(t)$ are set to 0 and do not change during computation, while the values set to 1 evolve according to the energy distributed between the instruments.
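This initialization amounts to painting binary note activations with a margin of $d$ frames; a minimal sketch for one instrument:

import numpy as np

def init_gains(notes, n_pitches, n_frames, d):
    # notes: list of (pitch_index, onset_frame, offset_frame) from the
    # aligned score; d: tolerance in frames for local misalignments
    g = np.zeros((n_pitches, n_frames))
    for pitch, onset, offset in notes:
        g[pitch, max(0, onset - d):min(n_frames, offset + d)] = 1.0
    return g  # only the entries set to 1 evolve during the factorization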

Having initialized the gains, the classical augmented NMF factorization is applied to estimate the gains corresponding to each source $j$ in the mixture. The process is detailed in Algorithm 1.

4.1. Note Refinement. The note refinement method in [11] aims at associating the values in the gains matrix $g_{jn}(t)$ with notes. Therefore, it is applied after a certain number of iterations of Algorithm 1, when the gains matrix yields a meaningful distribution of the energy between instruments.

The method is applied on each note separately, with the scope of refining the gains associated with the targeted note. The gains matrix $g_{jn}(t)$ can be understood as a greyscale image, with each element in the matrix representing a pixel in the image. A set of image processing heuristics is deployed to detect shapes and contours in this image, commonly known


Figure 2: After NMF, the resulting gains (a) are split in submatrices (b) and used to detect blobs [11]. [Panels: the time gains $g_{jn}(t)$ resulting after NMF; the submatrix $g^k_{jn}(t)$ for a single note; its binarization; and the resulting submatrix $p^k_{jn}(t)$.]

as blobs [35, p. 248]. As a result, each blob is associated with a single note, giving the onset and offset times of the note and its frequency contour. This representation further increases the sparsity of the gains $g_{jn}(t)$, yielding less interference and better separation.

As seen in Figure 2, the method considers an image patch around the pixels corresponding to the pitch of the note and to the onset and offset of the note given by the alignment stage, plus the additional $d$ frames accounting for local misalignments. In fact, the method works with the same submatrices of $g_{jn}(t)$ which are set to 1 according to their corresponding MIDI note during the gains initialization stage, as explained in Section 3.4. Therefore, for a given note $k = 1, \ldots, K_j$, we process the submatrix $g^k_{jn}(t)$ of the gains matrix $g_{jn}(t)$, where $K_j$ is the total number of notes for instrument $j$.

The steps of the method in [11], image preprocessing, binarization, and blob selection, are explained in Sections 4.1.1, 4.1.2, and 4.1.3.

4.1.1. Image Preprocessing. The preprocessing stage ensures, through smoothing, that there are no energy discontinuities within an image patch. Furthermore, it gives more weight to the pixels situated closer to the central bin of the blob, in order to eliminate interference from the neighboring notes (close in time and frequency) while still preserving vibratos or transitions between notes.

First, we convolve each row of the submatrix $g^k_{jn}(t)$ with a smoothing Gaussian filter [35, p. 86]. We choose a one-dimensional Gaussian filter:

$$w(t) = \frac{1}{\sqrt{2\pi}\,\phi}\, e^{-t^2/(2\phi^2)}, \quad (10)$$

where $t$ is the time axis and $\phi$ is the standard deviation. Thus, each row vector of $g^k_{jn}(t)$ is convolved with $w(t)$, and the result is truncated in order to preserve the dimensions of the initial matrix, by removing the mirrored frames.

Second, we penalize values in $g^k_{jn}(t)$ which are further away from the central bin by multiplying each column vector of this matrix with a one-dimensional Gaussian centered at the central frequency bin, represented by the vector $v(n)$:

$$v(n) = \frac{1}{\sqrt{2\pi}\,\nu}\, e^{-(n-\kappa)^2/(2\nu^2)}, \quad (11)$$

where $n$ is the frequency axis, $\kappa$ is the position of the central frequency bin, and $\nu$ is the standard deviation. The values of the parameters above are given in Section 6.4.2 as a part of the evaluation setup.

4.1.2. Image Binarization. Image binarization sets to zero the elements of the matrix $g^k_{jn}(t)$ which are lower than a threshold, and to one the elements larger than the threshold. This involves deriving a binary submatrix $\bar{g}^k_{jn}(t)$ associated with note $k$:

$$\bar{g}^k_{jn}(t) = \begin{cases} 0 & \text{if } g^k_{jn}(t) < \operatorname{mean}\left(g^k_{jn}(t)\right), \\ 1 & \text{if } g^k_{jn}(t) \geq \operatorname{mean}\left(g^k_{jn}(t)\right). \end{cases} \quad (12)$$
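The preprocessing and binarization steps map directly onto array operations; a sketch for one note patch, using the parameter values reported in Section 6.4.2 ($\phi = 3$, $\kappa = 4$, $\nu = 4$):

import numpy as np
from scipy.ndimage import gaussian_filter1d

def preprocess_and_binarize(patch, phi=3.0, kappa=4, nu=4.0):
    # patch: (bins, frames) submatrix of the gains for one note
    smoothed = gaussian_filter1d(patch, sigma=phi, axis=1)  # eq. (10), per row
    n = np.arange(patch.shape[0])
    weights = np.exp(-((n - kappa) ** 2) / (2 * nu ** 2))   # eq. (11), per column
    weighted = smoothed * weights[:, None]
    return (weighted >= weighted.mean()).astype(float)      # eq. (12)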

4.1.3. Blob Selection. First, we detect blobs in each binary submatrix $\bar{g}^k_{jn}(t)$ using the connectivity rules described in [35, p. 248] and [27]. Second, from the detected blob candidates, we determine the best blob for each note in a similar way to [11]. We assign a value to each blob depending on its area and on the overlap with the blobs corresponding to adjacent notes, which will help us penalize the overlap between blobs of adjacent notes.

As a first step, we penalize the parts of the blobs which overlap in time with blobs from the adjacent notes $k-1$ and $k+1$. This is done by weighting each element in $\bar{g}^k_{jn}(t)$ with a factor $\gamma$, depending on the amount of overlap with blobs from adjacent notes. The resulting score matrix has the following expression:

$$\tilde{g}^k_{jn}(t) = \begin{cases} \gamma \cdot \bar{g}^k_{jn}(t) & \text{if } \bar{g}^k_{jn}(t) \wedge \bar{g}^{k-1}_{jn}(t) = 1, \\ \gamma \cdot \bar{g}^k_{jn}(t) & \text{if } \bar{g}^k_{jn}(t) \wedge \bar{g}^{k+1}_{jn}(t) = 1, \\ \bar{g}^k_{jn}(t) & \text{otherwise}, \end{cases} \quad (13)$$

where $\gamma$ is a value in the interval $[0, 1]$. Then we compute a score for each note and for each blob associated with the note, by summing up the elements in the score matrix $\tilde{g}^k_{jn}(t)$ which are considered to be part of a blob. The best blob candidate is the one with the highest score, and it is associated with the note, its boundaries giving the note onset and offset.
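A sketch of the selection step, assuming the overlap masks of the adjacent notes are available; the value gamma = 0.5 is illustrative:

import numpy as np
from scipy.ndimage import label

def best_blob(binary_patch, overlap_prev, overlap_next, gamma=0.5):
    # penalize frames overlapping adjacent notes' blobs, eq. (13)
    score_matrix = np.where(overlap_prev | overlap_next,
                            gamma * binary_patch, binary_patch)
    labels, n_blobs = label(binary_patch)  # connected components = blob candidates
    if n_blobs == 0:
        return np.zeros_like(binary_patch)
    scores = [score_matrix[labels == i].sum() for i in range(1, n_blobs + 1)]
    best = 1 + int(np.argmax(scores))
    return (labels == best).astype(float)  # mask of the selected blob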

4.2. Gains Reinitialization and Recomputation. Having associated a blob with each note (Section 4.1.3), we discard obtrusive energy from the gains matrix $g_{jn}(t)$ by eliminating the pixels corresponding to the blobs which were not selected, making the matrix sparser. Furthermore, the energy excluded from an instrument's gains is redistributed to other instruments, contributing to better source separation. Thus, the gains are reinitialized with the information obtained from the corresponding blobs, and we can repeat the factorization in Algorithm 1 to recompute $g_{jn}(t)$. Note that the energy which is excluded by note refinement is set to zero in $g_{jn}(t)$ and will remain zero during the factorization.

In order to refine the gains $g_{jn}(t)$, we define a set of matrices $p^k_{jn}(t)$, derived from the matrices corresponding to the best blobs $\bar{g}^k_{jn}(t)$, which contain 1 only for the elements associated with the best blob and 0 otherwise. We rebuild the gains matrix $g_{jn}(t)$ with the set of submatrices $p^k_{jn}(t)$: for the corresponding bins $n$ and time frames $t$ of note $k$, we initialize the values in $g_{jn}(t)$ with the values in $p^k_{jn}(t)$. Then we reiterate the gains estimation with Algorithm 1. Furthermore, we obtain the spectrograms of the separated sources with the method described in Section 3.5.

5. PARAFAC Model for Multichannel Gains Estimation

Parallel factor analysis (PARAFAC) methods [28, 29] are mostly used under the nonnegative tensor factorization paradigm. By these means, the NMF model is extended to work with 3-valence tensors, where each slice of the tensor represents the spectrogram of one channel. Another approach is to stack up the spectrograms of all channels in a single matrix [36] and perform a joint estimation of the spectrograms of the sources in all channels. Hence, we extend the NMF model in Section 3 to jointly estimate the gains matrices in all the channels.

5.1. Multichannel Gains Estimation. The algorithm described in Section 3 estimates the gains $g_{nj}$ for source $j$ with respect to a single channel $i$, determined as the corresponding row in column $j$ of the panning matrix where element $m_{ij}$ has the maximum value. However, we argue that a better estimation can benefit from the information in all channels. To this extent, we can further include update rules for other parameters, such as the mixing matrix $m_{ij}$, which were otherwise kept fixed in Section 3, because the factorization algorithm estimates the parameters jointly for all the channels.

We propose to integrate information from all channels by concatenating their corresponding spectrogram matrices along the time axis, as in

$$x(f,t) = \left[x_1(f,t)\;\; x_2(f,t)\;\; \cdots\;\; x_I(f,t)\right]. \quad (14)$$

We are interested in jointly estimating the gains $g_{nj}$ of source $j$ in all the channels. Consequently, we concatenate in time the gains corresponding to each channel $i$, for $i = 1, \ldots, I$, where $I$ is the total number of channels, as seen in (15). The new gains are initialized with identical score information obtained from the alignment stage. However, during the estimation of the gains for channel $i$, the new gains $g^i_{nj}(t)$ evolve accordingly, taking into account the corresponding spectrogram $x_i(f,t)$. Moreover, during the gains refinement stage, each gain is refined separately with respect to each channel:

$$g_{nj}(t) = \left[g^1_{nj}(t)\;\; g^2_{nj}(t)\;\; \cdots\;\; g^I_{nj}(t)\right]. \quad (15)$$

In (5) we describe the factorization model for the estimated spectrogram, considering the mixing matrix, the basis, and the gains. Since we estimate a set of $I$ gains for each source $j = 1, \ldots, J$, this results in $J$ estimations of the spectrograms corresponding to all the channels $i = 1, \ldots, I$, as seen in

$$\hat{x}_i(f,t) = \sum_{j=1}^{J} m_{ij} \sum_{n=1}^{N} g^i_{jn}(t) \sum_{h=1}^{H} a_{jn}(h)\, G\left(f - h f_0(n)\right). \quad (16)$$

Each iteration of the factorization algorithm yields additional information regarding the distribution of energy between each instrument and each channel. Therefore, we can include in the factorization an update rule for the mixing matrix $m_{ij}$, as in (17). By updating the mixing parameters at each factorization step, we can obtain a better estimation of $\hat{x}_i(f,t)$:

$$m_{ij} \leftarrow m_{ij} \frac{\sum_{f,t} b_{jn}(f)\, g_{nj}(t)\, x_i(f,t)\, \hat{x}_i(f,t)^{\beta-2}}{\sum_{f,t} b_{jn}(f)\, g_{nj}(t)\, \hat{x}_i(f,t)^{\beta-1}}. \quad (17)$$

Considering the above, the new rules to estimate the parameters are described in Algorithm 2.


(1) Initialize $b_{jn}(f)$ with the values learned in Section 3.3.
(2) Initialize the gains $g_{jn}(t)$ with score information.
(3) Initialize the panning matrix $m_{ij}$ with the values learned in Section 3.1.
(4) Update the gains using equation (6).
(5) Update the panning matrix using equation (17).
(6) Repeat Steps (4)-(5) until the algorithm converges (or the maximum number of iterations is reached).

Algorithm 2: Gain estimation method.
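A sketch of the added mixing-matrix update, assuming the per-source model spectrograms (gains times bases) have already been formed; it mirrors the multiplicative form of equation (17):

import numpy as np

def update_mixing(m, s_hat, x, beta=1.3, eps=1e-12):
    # m: (I, J) mixing matrix; s_hat: (J, F, T) per-source model spectrograms;
    # x: (I, F, T) observed channel spectrograms
    I, J = m.shape
    for i in range(I):
        xhat = np.tensordot(m[i], s_hat, axes=1) + eps  # model of channel i
        for j in range(J):
            num = (s_hat[j] * x[i] * xhat ** (beta - 2.0)).sum()
            den = (s_hat[j] * xhat ** (beta - 1.0)).sum() + eps
            m[i, j] *= num / den
    return m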

Note that the current model does not estimate the phases for each channel. In order to reconstruct source $j$, the model in Section 3 uses the phase of the signal corresponding to the channel $i$ where the source has the maximum value in the panning matrix, as described in Section 3.5. Thus, in order to reconstruct the original signals, we can rely solely on the gains estimated in the single channel $i$, in a similar way to the baseline method.

5.2. Multichannel Gains Refinement. As presented in Section 5.1, for a given source we obtain an estimation of the gains corresponding to each channel. Therefore, we can apply the note refinement heuristics in a similar manner to Section 4.1 for each of the gains $[g^1_{nj}(t), \ldots, g^I_{nj}(t)]$. Then we can average out the estimations over all the channels, making the blob detection more robust to the variances between the channels:

$$g'_{nj}(t) = \frac{\sum_{i=1}^{I} g^i_{nj}(t)}{I}. \quad (18)$$

Having computed the mean over all channels as in (18), for each note $k = 1, \ldots, K_j$ we process the submatrix $g^k_{jn}(t)$ of the new gains matrix $g'_{jn}(t)$, where $K_j$ is the total number of notes for an instrument $j$. Specifically, we apply the same steps, preprocessing (Section 4.1.1), binarization (Section 4.1.2), and blob selection (Section 4.1.3), to each matrix $g^k_{jn}(t)$, and we obtain a binary matrix $p^k_{jn}(t)$ having 1s for the elements corresponding to the best blob and 0s for the rest.

Our hypothesis is that averaging out the gains between all channels makes blob detection more robust. However, when performing the averaging, we do not account for the delays between the channels. In order to compute the delay for a given channel, we can compute the best blob separately with the method in Section 4.1 (matrix $\bar{g}^k_{jn}(t)$) and compare it with the one calculated with the averaged estimation ($p^k_{jn}(t)$). This step is equivalent to comparing the onset times of the two best blobs for the two estimations. Subtracting these onset times, we get the delay between the averaged estimation and the one obtained for a channel, and we can correct this in the matrix $p^k_{jn}(t)$. Accordingly, we zero-pad the beginning of $p^k_{jn}(t)$ with the amount of zeros corresponding to the delay, or we remove the trailing zeros for a negative delay.
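The delay correction reduces to shifting the averaged blob mask; a minimal sketch, where the onset frames of the two best blobs are assumed to be known:

import numpy as np

def align_blob_to_channel(p_avg, onset_avg, onset_channel):
    # p_avg: (bins, frames) binary mask from the channel-averaged gains
    delay = onset_channel - onset_avg
    if delay > 0:   # zero-pad the beginning, keeping the original width
        return np.pad(p_avg, ((0, 0), (delay, 0)))[:, :p_avg.shape[1]]
    if delay < 0:   # remove the leading frames for a negative delay
        return np.pad(p_avg[:, -delay:], ((0, 0), (0, -delay)))
    return p_avg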

6. Materials and Evaluation

6.1. Dataset. The audio material used for evaluation was presented by Pätynen et al. [30] and consists of four passages of symphonic music from the Classical and Romantic periods. This work presented a set of anechoic recordings for each of the instruments, which were then synchronized so that they could later be combined into a mix of the orchestra. Musicians played in an anechoic chamber, and in order to be synchronous with the rest of the instruments, they followed a video featuring a conductor and a pianist playing each of the four pieces. Note that the benefit of having isolated recordings comes at the expense of ignoring the interactions between musicians, which commonly affect intonation and time-synchronization [37].

The four pieces differ in terms of the number of instruments per instrument class, style, dynamics, and size. The first passage is a soprano aria of Donna Elvira from the opera Don Giovanni by W. A. Mozart (1756–1791); it corresponds to the Classical period and is traditionally played by a small group of musicians. The second passage is from L. van Beethoven's (1770–1827) Symphony no. 7, featuring big chords and string crescendo. The chords and pauses make the reverberation tail of a concert hall clearly audible. The third passage is from Bruckner's (1824–1896) Symphony no. 8 and represents the late Romantic period; it features large dynamics and a large orchestra. Finally, G. Mahler's Symphony no. 1, also featuring a large orchestra, is another example of late Romanticism. The piece has a more complex texture than the one by Bruckner. Furthermore, according to the musicians who recorded the dataset, the last two pieces were also more difficult to play and record [30].

In order to keep the evaluation setup consistent between the four pieces, we focus on the following instruments: violin, viola, cello, double bass, oboe, flute, clarinet, horn, trumpet, and bassoon. All tracks from a single instrument were joined into a single track for each of the pieces.

For the selected instruments, we list the differences between the four pieces in Table 1. Note that in the original dataset the violins are separated into two groups. However, for brevity of evaluation, and because in our separation framework we do not consider sources sharing the same instrument templates, we decided to merge the violins into a single group. Note that the pieces by Mahler and Bruckner have divisi in the groups of violins, which implies a larger number of instruments playing different melody lines simultaneously. This results in a scenario which is more challenging for source separation.

We created a ground truth score by hand-annotating the notes played by the instruments. In order to facilitate this process, we first gathered the scores in MIDI format and automatically computed a global audio-to-score alignment using the method from [13], which has won the MIREX


Table 1: Anechoic dataset [30] characteristics.

Piece     | Duration   | Period    | Instrument sections | Number of tracks | Max tracks/instrument
Mozart    | 3 min 47 s | Classical | 8                   | 10               | 2
Beethoven | 3 min 11 s | Classical | 10                  | 20               | 4
Mahler    | 2 min 12 s | Romantic  | 10                  | 30               | 4
Bruckner  | 1 min 27 s | Romantic  | 10                  | 39               | 12

Figure 3: The steps to create the multichannel recordings dataset. [Flow: anechoic recordings + score annotations → denoised recordings → multichannel recordings.]

score-following challenge for the past years. Then we locally aligned the notes of each instrument by manually correcting the onsets and offsets to fit the audio. This was performed using Sonic Visualiser, with the guidance of the spectrogram and of the monophonic pitch estimation [38] computed for each of the isolated instruments. The annotation was performed by two of the authors, who cross-checked the work of their peer. Note that this dataset and the proposed annotations are useful not only for our particular task but also for the evaluation of multiple pitch estimation and automatic transcription algorithms in large orchestral settings, a context which has not been considered so far in the literature. The annotations can be found at the associated page (http://mtg.upf.edu/download/datasets/phenicx-anechoic).

During the recording process detailed in [30], the gain of the microphone amplifiers was fixed to the same value for the whole process, which reduced the dynamic range of the recordings of the quieter instruments. This led to noisier recordings for most of the instruments. In Section 6.2 we describe the score-informed denoising procedure we applied to each track. From the denoised isolated recordings, we then used Roomsim to create a multichannel image, as detailed in Section 6.3. The steps necessary to pass from the anechoic recordings to the multichannel dataset are represented in Figure 3. The original files can be obtained from the Acoustics Group at Aalto University (http://research.cs.aalto.fi/acoustics). For the denoising algorithm, please refer to http://mtg.upf.edu/download/datasets/phenicx-anechoic.

6.2. Dataset Denoising. The noise-related problems in the dataset were presented in [30]. We remove the noise in the recordings with the score-informed method in [39], which relies on learned noise spectral patterns. The main difference is that we rely on a manually annotated score, while in [39] the score is assumed to be misaligned, so further regularization is included to ensure that only certain note combinations in the score occur.

The annotated score yields the time intervals where an instrument is not playing. Thus, the noise pattern is learned only within those intervals. In this way, the method ensures that

Figure 4: The sources and the receivers (microphones) in the simulated room. [Stage plan showing the sections (violins, violas, cellos, double basses, flutes, oboes, clarinets, bassoons, horns, and trumpets) and the microphones (V1, V2, CL, RVL, TR, HRN, WWL, and WWR).]

the desired noise, which is a part of the actual sound of the instrument, is preserved in the denoised recording.

The algorithm takes each anechoic recording of a given instrument and removes the noise for the time intervals where the instrument is playing, while setting to zero the frames where the instrument is not playing.

6.3. Dataset Spatialization. To simulate a large reverberant hall, we use the software Roomsim [31]. We define a configuration file which specifies the characteristics of the hall and, for each of the microphones, their position relative to each of the sources. The simulated room has dimensions similar to those of the Royal Concertgebouw in Amsterdam, one of the partners in the PHENICX project, and represents a setup in which we tested our framework. The simulated room's width, length, and height are 28 m, 40 m, and 12 m. The absorption coefficients are specified in Table 2.

The positions of the sources and microphones in the room are common for orchestral concerts (Figure 4). A configuration file is created for each microphone, containing its coordinates (e.g., (14, 17, 4) for the center microphone). Then, each source is defined through polar coordinates relative to the microphone (e.g., (11.4455, −95.1944, −15.1954): radius, azimuth, and elevation for the bassoon relative to the center microphone). We selected all the microphones to be cardioid, in order to match the realistic setup of the Concertgebouw Hall.


Table 2: Room surface absorption coefficients.

Standard measurement frequencies (Hz) | 125 | 250  | 500  | 1000 | 2000 | 4000
Absorption of wall in x = 0 plane     | 0.4 | 0.3  | 0.3  | 0.3  | 0.2  | 0.1
Absorption of wall in x = Lx plane    | 0.4 | 0.45 | 0.35 | 0.35 | 0.45 | 0.3
Absorption of wall in y = 0 plane     | 0.4 | 0.45 | 0.35 | 0.35 | 0.45 | 0.3
Absorption of wall in y = Ly plane    | 0.4 | 0.45 | 0.35 | 0.35 | 0.45 | 0.3
Absorption of floor (z = 0 plane)     | 0.5 | 0.6  | 0.7  | 0.8  | 0.8  | 0.9
Absorption of ceiling (z = Lz plane)  | 0.4 | 0.45 | 0.35 | 0.35 | 0.45 | 0.3

Figure 5: The reverberation time versus frequency for the simulated room. [RT60 (sec) plotted against frequency (Hz).]

Using the configuration file and the anechoic audio files corresponding to the isolated sources, Roomsim generates the audio files for each of the microphones, along with the impulse responses for each pair of instruments and microphones. The impulse responses and the anechoic signals are used during the evaluation to obtain the ground truth spatial image of the sources in the corresponding microphone. Additionally, we plot in Figure 5 the reverberation time RT60 [31] of the simulated room across frequencies.

We need to adapt the ground truth annotations to the audio generated with Roomsim, as the original annotations were done on the isolated audio files. Roomsim creates the audio for a given microphone by convolving each of the sources with the corresponding impulse response and then summing up the results of the convolutions. We compute a delay for each pair of microphones and instruments by taking the position of the maximum value in the associated impulse response vector. Then we generate a score for each of the pairs by adding the corresponding delay to the note onsets. Additionally, since the offset time depends on the reverberation and the frequencies of the notes, we add 0.8 s to each note offset to account for the reverberation, besides the added delay.
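For one instrument-microphone pair, this adaptation reduces to reading the delay off the impulse response and shifting the annotations; a minimal sketch (the 0.8 s reverberation padding follows the text):

import numpy as np

def adapt_score(notes, impulse_response, sr, reverb_pad=0.8):
    # notes: list of (onset_sec, offset_sec) for one instrument
    delay = np.argmax(np.abs(impulse_response)) / sr  # direct-path arrival
    return [(onset + delay, offset + delay + reverb_pad)
            for onset, offset in notes]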

6.4. Evaluation Methodology

6.4.1. Parameter Selection. In this paper we use a low-level spectral representation of the audio data, generated from a windowed FFT of the signal. We use a Hanning window with a size of 92 ms and a hop size of 11 ms.

Here, a logarithmic frequency discretization is adopted. Furthermore, two time-frequency resolutions are used. First, to estimate the instrument models and the panning matrix, a single-semitone resolution is proposed. In particular, we implement the time-frequency representation by integrating the STFT bins corresponding to the same semitone. Second, for the separation task, a higher resolution of 1/4 of a semitone is used, which has proven to achieve better separation results [6]. The time-frequency representation is obtained by integrating the STFT bins corresponding to 1/4 of a semitone. Note that in the separation stage the learned basis functions $b_{jn}(f)$ are adapted to the 1/4-semitone resolution by replicating the basis at each semitone to the 4 samples of the 1/4-semitone resolution that belong to this semitone. For image binarization, we pick for the first Gaussian $\phi = 3$ as the standard deviation, and for the second Gaussian $\kappa = 4$ as the position of the central frequency bin and $\nu = 4$ as the standard deviation, corresponding to one semitone.

We picked 10 iterations for the NMF, and we set the beta-divergence distortion to $\beta = 1.3$, as in [6, 11].

6.4.2. Evaluation Setup. We perform three different kinds of evaluation: audio-to-score alignment, panning matrix estimation, and score-informed source separation. Note that for the alignment we evaluate the state-of-the-art system in [13]. This method does not align single notes but combinations of notes in the score (a.k.a. states). Here, the alignment is performed with respect to a single audio channel, corresponding to the microphone situated at the center of the stage. On the other hand, the offsets are estimated by shifting the original duration of each note in the score [13], or by assigning the offset time as the onset of the next state. We denote these two cases as INT and NEX.

Regarding the initialization of the separation framework, we can use the raw output of the alignment system. However, as stated in Section 4 and [3, 5, 24], a better option is to extend the onsets and offsets along a tolerance window, to account for the errors of the alignment system, for the delays between the center channel (on which the alignment is performed) and the other channels, and for the possible errors in the alignment itself. Thus, we test two hypotheses regarding the tolerance window for the possible errors. In the first case we extend the boundaries by 0.3 s for onsets and 0.6 s for offsets (T1), and in the second by 0.6 s for onsets and 1 s for offsets (T2). Note that the value of 0.3 s for the onset times is not arbitrary, but the usual threshold for onsets in the score-following in


Table 3: Score information used for the initialization of score-informed source separation.

Tolerance window size           | Offset estimation
T1: onsets 0.3 s, offsets 0.6 s | INT: interpolation of the offset time
T2: onsets 0.6 s, offsets 0.9 s | NEX: the offset is the onset of the next note

MIREX evaluation of real-time score-following [40]. Two different tolerance windows were tested to account for the complexity of this novel scenario. The tolerance window is slightly larger for offsets, due to the reverberation time and because the ending of a note is not as clear as its onset. A summary of the score information used to initialize the source separation framework is found in Table 3.

We label the test case corresponding to the initialization with the raw output of the alignment system as Ali. Conversely, the test case corresponding to the tolerance window initialization is labeled as Ext. Furthermore, within the tolerance window, we can refine the note onsets and offsets with the methods in Section 4.1 (Ref1) and Section 5.2 (Ref2), resulting in two other test cases. Since method Ref1 can only refine the score with respect to a single channel, its results are solely computed with respect to that channel. For the multichannel refinement Ref2, we report the results of the alignment of each instrument with respect to each microphone. A graphic of the initialization of the framework with the four test cases listed above (Ali, Ext, Ref1, and Ref2), along with the ground truth score initialization (GT), is found in Figure 7, where we present the results for these cases in terms of source separation.

In order to evaluate the panning matrix estimation stage, we compute an ideal panning matrix based on the impulse responses generated by Roomsim during the creation of the multichannel audio (see Section 6.3). The ideal panning matrix gives the ideal contribution of each instrument to each channel, and it is computed by searching for the maximum in the impulse response vector corresponding to each instrument-channel pair, as in

$$m_{\text{ideal}}(i,j) = \max_t \left(\mathrm{IR}_{(i,j)}(t)\right), \quad (19)$$

where $\mathrm{IR}_{(i,j)}(t)$ is the impulse response of source $i$ in channel $j$. By comparing the estimated matrix $m(i,j)$ with the ideal one $m_{\text{ideal}}(i,j)$, we can determine whether the algorithm picked a wrong channel for separation.

6.4.3. Evaluation Metrics. For score alignment, we are interested in a measure which relates to source separation and accounts for the audio frames which are correctly detected, rather than an alignment rate computed per note onset, as found in [13]. Thus, we evaluate the alignment at the frame level rather than at the note level. A similar reasoning on the evaluation of score alignment is found in [4].

We consider 0.011 s, the size of a frame, as the temporal granularity for this measure. Then, a frame of a musical note is considered a true positive (tp) if it is found in the ground truth score and in the aligned score within the exact time boundaries. The same frame is labeled as a false positive (fp) if it is found only in the aligned score, and as a false negative (fn) if it is found only in the ground truth score. Since the gains initialization is done with score information (see Section 4), lost frames (recall) and incorrectly detected frames (precision) impact the performance of the source separation algorithm. Precision is defined as $p = \text{tp}/(\text{tp} + \text{fp})$ and recall as $r = \text{tp}/(\text{tp} + \text{fn})$. Additionally, we compute the harmonic mean of precision and recall to obtain the $F$-measure: $F = 2 \cdot (p \cdot r)/(p + r)$.
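Counting matched note frames at the 0.011 s granularity, the metrics can be computed as below; representing frames as (note, frame index) pairs is an implementation choice for the sketch, not the paper's prescription:

def frame_level_scores(gt_frames, aligned_frames):
    # gt_frames / aligned_frames: sets of (note_id, frame_index) pairs
    tp = len(gt_frames & aligned_frames)
    fp = len(aligned_frames - gt_frames)
    fn = len(gt_frames - aligned_frames)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f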

The source separation evaluation framework and the metrics employed are described in [41, 42]. Correspondingly, we use Source to Distortion Ratio (SDR), Source to Interference Ratio (SIR), Source to Artifacts Ratio (SAR), and source Image to Spatial distortion Ratio (ISR). While SDR measures the overall quality of the separation and ISR the spatial reconstruction of the source, SIR relates to the rejection of interferences and SAR to the absence of forbidden distortions and artifacts.

The evaluation of source separation is a computationally intensive process. Additionally, processing the long audio files in the dataset would require a large amount of memory for the matrix calculations. To reduce the memory requirements, the evaluation is performed in blocks of 30 s, with 1 s overlap to allow for continuation.

6.5. Results

6.5.1. Score Alignment. We evaluate the output of the alignment, Ali, along with the estimation of the note offsets, INT and NEX, in terms of $F$-measure (see Section 6.4.3), precision, and recall, ranging from 0 to 1. Additionally, we evaluate the optimal size for extending the note boundaries along the onsets and offsets, T1 and T2, for the refinement methods Ref1 and Ref2 and the baseline Ext. Since there are many differences between the pieces, we report the results individually per piece in Table 4.

Methods Ref1 and Ref2 depend on a binarization threshold which determines how much energy is set to zero. A lower threshold results in the consolidation of larger blobs in the blob detection. In [11], this threshold is set to 0.5 for a dataset of monaural recordings of Bach chorales played by four instruments. However, we are facing a multichannel scenario where capturing the reverberation is important, especially when we consider that offsets were annotated with a low energy threshold. Thus, we are interested in losing the least energy possible, and we set lower values for the threshold: 0.3 and 0.1. Consequently, when analyzing the results, the lower threshold achieves better performance in terms of $F$-measure for Ref1 (0.67 and 0.72) and for Ref2 (0.71 and 0.75).

According to Table 4, extending note offsets (NEX) rather than interpolating them (INT) gives lower recall in all pieces, and this method leads to losing more frames, which cannot be recovered even by extending the offset times as in T2. NEX T2 always yields a lower recall when compared to INT T2 (e.g., $r = 0.85$ compared to $r = 0.97$ for Mozart).

The output of the alignment system, Ali, is not a good option to initialize the gains of the source separation system. It has a high precision and a very low recall (e.g., the case INT Ali has $p = 0.95$ and $r = 0.39$, compared to the case INT Ext,


Table 4: Alignment evaluated in terms of F-measure (F), precision (p), and recall (r), ranging from 0 to 1.

               |     Mozart       |    Beethoven     |      Mahler      |     Bruckner
               |  F     p     r   |  F     p     r   |  F     p     r   |  F     p     r
INT  Ali       | 0.69  0.93  0.55 | 0.56  0.95  0.39 | 0.61  0.85  0.47 | 0.60  0.94  0.32
     T1  Ref1  | 0.77  0.77  0.76 | 0.79  0.81  0.77 | 0.63  0.74  0.54 | 0.72  0.73  0.70
         Ref2  | 0.83  0.79  0.88 | 0.82  0.82  0.81 | 0.67  0.77  0.60 | 0.81  0.77  0.85
         Ext   | 0.82  0.73  0.94 | 0.84  0.84  0.84 | 0.77  0.69  0.87 | 0.86  0.78  0.96
     T2  Ref1  | 0.77  0.77  0.76 | 0.76  0.75  0.77 | 0.63  0.74  0.54 | 0.72  0.70  0.71
         Ref2  | 0.83  0.78  0.88 | 0.82  0.76  0.87 | 0.67  0.77  0.59 | 0.79  0.73  0.86
         Ext   | 0.72  0.57  0.97 | 0.79  0.70  0.92 | 0.69  0.55  0.93 | 0.77  0.63  0.98
NEX  Ali       | 0.49  0.94  0.33 | 0.51  0.89  0.36 | 0.42  0.90  0.27 | 0.48  0.96  0.44
     T1  Ref1  | 0.70  0.77  0.64 | 0.72  0.79  0.66 | 0.63  0.74  0.54 | 0.68  0.72  0.64
         Ref2  | 0.73  0.79  0.68 | 0.71  0.79  0.65 | 0.66  0.77  0.58 | 0.73  0.74  0.72
         Ext   | 0.73  0.77  0.68 | 0.71  0.80  0.65 | 0.69  0.75  0.64 | 0.76  0.79  0.72
     T2  Ref1  | 0.74  0.78  0.71 | 0.72  0.74  0.70 | 0.63  0.74  0.54 | 0.72  0.79  0.72
         Ref2  | 0.80  0.80  0.80 | 0.75  0.75  0.75 | 0.67  0.77  0.59 | 0.79  0.73  0.86
         Ext   | 0.73  0.65  0.85 | 0.73  0.69  0.77 | 0.72  0.64  0.82 | 0.79  0.69  0.91


Table 5: Instruments for which the closest microphone was incorrectly determined, for different score information (GT, Ali, T1, T2, INT, and NEX) and two room setups.

                           INT                                                      NEX
          GT         Ali                    T1               T2                     Ali                    T1      T2
Setup 1   Clarinet   Clarinet, double bass  Clarinet, flute  Clarinet, flute, horn  Clarinet, double bass  (none)  Clarinet, flute
Setup 2   Cello      Cello, flute           Cello, flute     Cello, flute           Bassoon                Flute   Cello, flute

For the case of Beethoven, the output of the alignment is particularly poor compared to the other pieces. However, by extending the boundaries (Ext) and applying note refinement (Ref1 or Ref2), we are able to increase the recall and match the performance on the other pieces.

When comparing the sizes of the tolerance window for the onsets and offsets, we observe that the alignment is more accurate when detecting the onsets within 0.3 s and the offsets within 0.6 s. In Table 4, T1 achieves better results than T2 (e.g., F = 0.77 for T1 compared to F = 0.69 for T2, for Mahler). Relying on a larger window retrieves more frames but also significantly damages the precision. However, when considering source separation, we might want to lose as little information as possible. It is in this case that the refinement methods Ref1 and Ref2 show their importance. When facing larger time boundaries, as with T2, Ref1 and especially Ref2 are able to reduce the errors, achieving better precision with a minimal loss in recall.

The refinement Ref1 performs worse than Ref2, the multichannel refinement (e.g., F = 0.72 compared to F = 0.81 for Bruckner, INT T1). Note that in its original version [11] Ref1 assumed monophony within a source, as it was tested on the simpler Bach10 dataset [4]. To this end, it relied on a graph computation to determine the best distribution of blobs. However, due to the increased polyphony within an instrument section (e.g., violins playing divisi), with simultaneous melodic lines, we disabled this feature; in this case, Ref1 has lower recall, as it loses more frames. On the other hand, Ref2 is more robust, because it computes a blob estimation per channel; averaging out these estimations yields better results.

The refinement works worse for the more complex pieces (Mahler and Bruckner) than for the simpler ones (Mozart and Beethoven). Increasing the polyphony within a source and the number of instruments with many interleaving melodic lines (i.e., a less sparse score) also makes the task more difficult.

6.5.2. Panning Matrix. Estimating the panning matrix correctly is an important step in the proposed method, since Wiener filtering is performed on the channel where the instrument has the most energy. If the algorithm picks a different channel for this step, we find more interference between instruments in the separated audio files.
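For clarity, the channel selection itself reduces to an argmax over the corresponding column of the panning matrix; a minimal sketch (with variable names of our own choosing) is:

```python
import numpy as np

def closest_channel(M, j):
    # M: panning matrix, shape (n_channels, n_sources); M[i, j] is the
    # weight of instrument j in channel i. Wiener filtering for source j
    # is performed on the returned channel, so an error here leaks other
    # sections into the separated track.
    return int(np.argmax(M[:, j]))
```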

As described in Section 3.1, the estimation of the panning matrix depends on the number of nonoverlapping partials of the notes found in the score and on their alignment with the audio. To this end, the more nonoverlapping partials we have, the more robust the estimation.

Initially, we experimented with computing the panning matrix separately for each piece. In the case of Bruckner, the piece is simply too short, and there are too few nonoverlapping partials to yield a good estimation, resulting in errors in the panning matrix. Since the instrument setup is the same for the Bruckner, Beethoven, and Mahler pieces (10 sources in the same positions on the stage), we decided to jointly estimate the matrix for the concatenated audio pieces and the associated scores. We denote as Setup 1 the Mozart piece, played by 8 sources, and as Setup 2 the Beethoven, Mahler, and Bruckner pieces, played by 10 sources.

Since the panning matrix is computed using the score, different score information can yield a very different estimation of the panning matrix. To this end, we evaluate the influence of the audio-to-score alignment, namely, the cases INT, NEX, Ali, T1, and T2, and the initialization with the ground truth score information, GT.

In Table 5 we list the instruments for which the algorithm picked the wrong channel. Note that, in the room setup generated with Roomsim, most of the instruments in Table 5 are placed close to other sources from the same family of instruments, for example, cello and double bass, or flute with clarinet, bassoon, and oboe. In these cases, the algorithm makes more mistakes when selecting the correct channel to perform source separation.

In the column GT of Table 5 we can see that having a perfectly aligned score yields fewer errors when estimating the panning matrix. Conversely, in a real-life scenario we cannot rely on a hand-annotated score. In that case, for all columns of Table 5 excluding GT, the best estimation is obtained by the combination of NEX and T1: taking the offset time as the onset of the next note and then extending the score with the smaller window.

Furthermore, we compute the SDR values for the instruments in Table 5, column GT (clarinet and cello), for the separation done in the correct channel and in the estimated channel. For Setup 1, the channel for clarinet is wrongly set to WWL (woodwinds left), the correct one being WWR (woodwinds right), even when we have a perfectly aligned score (GT). However, the microphones WWL and WWR are very close (see Figure 4) and do not capture significant energy from other instrument sections, so the SDR difference is less than 0.01 dB. However, in Setup 2 the cello is wrongly separated in the WWL channel, and the SDR difference between this audio and the audio separated in the correct channel is around −1.1 dB for each of the three pieces.

6.5.3. Source Separation. We use the evaluation metrics described in Section 6.4.3.

Figure 6: Results in terms of SDR, SIR, ISR, and SAR for the instruments (bassoon, cello, clarinet, double bass, flute, horn, viola, violin, oboe, and trumpet) and the songs in the dataset (Mozart, Beethoven, Mahler, and Bruckner).

Since there is a lot of variability between the four pieces, it is more informative to present the results per piece rather than aggregating them.

First, we analyze the separation results per instrument in an ideal case. We assume that the best results for score-informed source separation are obtained in the case of a perfectly aligned score (GT). Furthermore, for this case, we calculate the separation in the correct channel for all the instruments, since in Section 6.5.2 we saw that picking a wrong channel can be detrimental. We present the results as a bar plot in Figure 6.

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6 we can see that the more complex the piece, the more difficult it is to achieve a good separation. For instance, note that cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for cello in Mozart compared to −5 dB in Mahler). Cello and double bass are close to each other in both setups, and similarly clarinet and flute, so we expect interference between them. Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in the lower SIR values for double bass (5.5 dB SIR for Mozart, but −1.8, −0.1, and −0.2 dB SIR for the others) and flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask, the multichannel spectrogram is multiplied with the panning matrix. Thus, wrong values in this matrix can yield wrong amplitude values in the resulting signals.

This is the case for the trumpet, which is allocated a close microphone in the current setup and for which we expect a good separation. However, the trumpet achieves a poor ISR (5.5, 1, and −1 dB) while having a good separation in terms of SIR and SAR. Similarly, other instruments, such as cello, double bass, flute, and viola, face the same problem, particularly for the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.

Figure 7: The test cases for initialization of score-informed source separation for the submatrix p^s_jn(t) of note s: ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).

A low SDR is obtained in the case of Mahler; this is related to the poor alignment results obtained for this piece. As seen in Table 4, for the INT case, the F-measure is almost 8% lower in Mahler than in the other pieces, mainly because of the bad precision.

The results are considerably worse for the double bass in the more complex pieces by Beethoven (−9.1 dB SDR), Mahler (−8.5 dB SDR), and Bruckner (−4.7 dB SDR); for further analysis, we consider it an outlier and exclude it from the analysis.

Second, we want to evaluate the usefulness of note refinement for source separation. As seen in Section 4, the gains for NMF separation are initialized with score information or with a refined score, as described in Section 4.2. A summary of the different initialization options is given in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization with the ground truth annotations (Figure 7 (GT)), the direct output of the score alignment system (Figure 7 (Ali)), the common practice for NMF gains initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7 (Ext)), and the refinement approaches (Figure 7 (Ref1 and Ref2)). Note that Ref1 is the refinement done with the method in Section 4.1, and Ref2 with the multichannel method described in Section 5.2.

We test the difference between the binarization thresholds 0.3 and 0.1 used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives an F-value = 0.0796 and a p value = 0.7778, which shows no significant difference between the two binarization thresholds.
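Such a test can be reproduced with any standard statistics package; a minimal sketch with scipy follows, where the SDR arrays are hypothetical stand-ins for the per-track results obtained with the two thresholds:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in data: in the experiment these are the SDR values (dB) of all
# separated tracks, grouped by the binarization threshold used.
sdr_threshold_a = rng.normal(2.0, 1.0, size=40)
sdr_threshold_b = rng.normal(2.1, 1.0, size=40)

f_value, p_value = stats.f_oneway(sdr_threshold_a, sdr_threshold_b)
# A p value above 0.05, as reported here (p = 0.7778), means the null
# hypothesis of equal means cannot be rejected.
print(f_value, p_value)
```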

The results for the five initializations, GT, Ref1, Ref2, Ext, and Ali, are presented in Figure 8 for each of the four pieces. Note that for Ref1, Ref2, and Ext we aggregate information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate between the instruments, with the piece by Mahler having the worst results and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors, such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues described in Section 6.1, can increase the difficulty of separation, up to the point that Ref1, Ref2, and Ext give only a minimal improvement. To this end, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps improve the separation, increasing the SDR by 1–1.5 dB or more for all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase the SDR for most of the pieces, with the exception of the piece by Mahler. This is due to an increase in SIR and a decrease of the interferences in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext; for this piece, the difference in SIR is around 2 dB. Then, for Beethoven, Ref1 and Ref2 gain 0.5 dB in terms of SDR when compared to Ext, and 1.5 dB in SIR. For Bruckner, solely Ref2 has a higher SDR; however, SIR increases by 1.5 dB for both Ref1 and Ref2. Note that Ref1 and Ref2 not only refine the time boundaries of the notes; the refinement also happens in frequency, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the estimation of note offsets, INT and NEX, and of the tolerance window sizes, T1 and T2, which account for errors in the alignment. Note that for this case we do not include the refinement in the results; we evaluate only the case Ext, leaving out the refinement in order to isolate the influence of T1 and T2. The results are presented in Figure 9 and show that the best results are obtained for the interpolation of the offsets, INT. This relates to the results presented in Section 6.5.1. Similarly to the analysis regarding the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to draw a conclusion on which strategy is better for the initialization, as the error bars for the ground truth overlap with those of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gains estimation, as proposed in Section 5.1, and the single-channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR and obtained an F-value = 1.712 and a p value = 0.1908. Hence, there is no significant difference between single-channel and multichannel gain estimation when no postprocessing of the gains with note refinement is performed. However, although the new update rules do not help by themselves in the multichannel case, we are able to better refine the gains: in this case we aggregate information over all the channels, and the blob detection is more robust, even to variations of the binarization threshold. Accordingly, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available online (http://repovizz.upf.edu/phenicx/anechoic_multi).

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation is Instrument Emphasis, which aims at processing multitrack orchestral recordings.

Figure 8: Results in terms of SDR, SIR, SAR, and ISR for the NMF gains initialization in the different test cases (GT, Ref1, Ref2, Ext, and Ali) for the pieces by Mozart, Beethoven, Mahler, and Bruckner.

Once we have the separated tracks, it allows emphasizing a particular instrument over the full orchestra downmix recording. Our workflow consists in processing multichannel orchestral recordings for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of other sections, obtaining an enhanced signal for the selected instrument.

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user when uploading media content and for presenting the results to the user. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks. After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks. The separation quality is not degraded if the blocks have a sufficient duration. In our case, we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx/) and are integrated into the PHENICX prototype (http://phenicx.com).
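The back-end block processing can be sketched as follows; `separate_fn` stands in for the whole separation framework, and the I/O helper is an assumption (any WAV reader with the same behavior works):

```python
import numpy as np
import soundfile as sf  # assumed I/O library

BLOCK_S = 60.0  # 1-minute blocks

def separate_in_blocks(path, separate_fn):
    audio, sr = sf.read(path)                  # (n_samples, n_channels)
    size = int(BLOCK_S * sr)
    pieces = []
    for start in range(0, len(audio), size):
        block = audio[start:start + size]
        pieces.append(separate_fn(block, sr))  # (n_sources, block_samples)
    # concatenate the per-block results into full-length separated tracks
    return np.concatenate(pieces, axis=1)
```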

7.2. Acoustic Rendering. The second application for the separated material is augmented or virtual reality scenarios, where we apply a spatialization of the separated musical sources. Acoustic Rendering aims at recreating acoustically the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage relative to the listener.

We have considered binaural synthesis the most suitable spatial audio technique for this application. Humans locate the direction of incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound will arrive with a different intensity and at different time instances at both ears.

Figure 9: Results in terms of SDR, SIR, SAR, and ISR for the combination between the different offset estimation methods (INT and NEX) and the different sizes of the tolerance window for note onsets and offsets (T1 and T2). GT is the ground truth alignment and Ali is the output of the score alignment.

The idea behind binaural synthesis is to artificially generate these cues in order to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype, we have used the noncommercial Unity3D plugin for binaural synthesis provided by 3DCeption (https://twobigears.com/index.php).

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker: https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of the musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances) and for the instrument visualization and Acoustic Rendering use-cases introduced above.

As inputs for the source localization method, we need the multichannel recordings and the approximate positions of the microphones on stage. In concert halls, the recording setup typically consists of a grid structure of hanging overhead microphones. The position of the overhead microphones is therefore kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, and it is currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes the refined score alignment obtained prior to the signal separation (see Section 4.1). It follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then, the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible locations of a sound source based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate the list of time delay values corresponding to all note onsets in the score and take the maximum of their histogram. In our experiment, we have a time resolution of 2.8 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play in isolation, as SRP-PHAT [45] does, and can therefore be used in complex scenarios.
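A sketch of this delay estimation, assuming the refined per-microphone onset times are already available, is given below; taking the mode of the histogram makes the estimate robust to a few badly aligned notes:

```python
import numpy as np

HOP_S = 0.0028  # time resolution of the onsets (analysis hop size)

def instrument_delay(onsets_mic_a, onsets_mic_b):
    # onsets_mic_*: refined onset times (s) of the same score notes as
    # observed at two microphones; their differences are TDOA candidates.
    diffs = np.asarray(onsets_mic_a) - np.asarray(onsets_mic_b)
    if np.ptp(diffs) < HOP_S:                 # all candidates agree
        return float(np.median(diffs))
    bins = np.arange(diffs.min(), diffs.max() + HOP_S, HOP_S)
    hist, edges = np.histogram(diffs, bins=bins)
    k = int(np.argmax(hist))                  # mode of the histogram
    return 0.5 * (edges[k] + edges[k + 1])
```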

8. Outlook

In this paper, we proposed a framework for the score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then, we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings is objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment, which improved the separation for three of the four pieces in the dataset when compared to two other initialization options for our framework: the raw output of the score alignment and the tolerance window which the alignment relies on.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems in recovering the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done over a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach for source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To this end, the best microphone to separate a source is the closest one determined by the localization method.


When looking at the separation across instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies between the pieces and the instruments, and future research could provide more insight into this problem. Note that an increasing degree of consonance was related to a more difficult case for source separation [17]. Hence, we could expect a worse separation for instruments which harmonize or accompany other sections, as is the case of viola, cello, and double bass in some pieces. Future research could give more insight into the relation between the musical characteristics of the pieces (e.g., tonality and texture) and the source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12,000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To this end, the annotations helped us denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow for other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework, related to Instrument Emphasis or Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustín Martorell for his insights on the musicology side and for correcting the scores, and Oscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference 2013.

[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.

[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.

[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal on Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.

[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.

[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodriguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.

[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.

[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.

[9] F. J. Rodriguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.

[10] F. J. Rodriguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.

[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, October 2015.

[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.

[13] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Canadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, 2015.

[14] O. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.

[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.

[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, IEEE, Taipei, Taiwan, April 2009.

[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.

[18] F. J. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.

[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.

[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.

[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.

[22] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.

[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.

[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.

[25] B. Niedermayer, Accurate audio-to-score alignment: data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.

[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.

[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.

[28] D. FitzGerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.

[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Málaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.

[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.

[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.

[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.

[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.

[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.

[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.

[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.

[38] M. Mauch and S. Dixon, "PYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.

[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.

[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.

[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.

[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.

[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.

[45] A. Martí Guerola, Multichannel audio processing for speaker localization, separation and enhancement, Universitat Politècnica de València, 2013.



Figure 2: After NMF, the resulting gains g_jn(t) (a) are split into submatrices (b) and used to detect blobs [11].

as blobs [35, p. 248]. As a result, each blob is associated with a single note, giving the onset and offset times for the note and its frequency contour. This representation further increases the sparsity of the gains g_jn(t), yielding less interference and better separation.

As seen in Figure 2, the method considers an image patch around the pixels corresponding to the pitch of the note and to the onset and offset times given by the alignment stage, plus additional d frames accounting for local misalignments. In fact, the method works with the same submatrices of g_jn(t) which are set to 1 according to their corresponding MIDI note during the gains initialization stage, as explained in Section 3.4. Therefore, for a given note k = 1, ..., K_j, we process the submatrix g^k_jn(t) of the gains matrix g_jn(t), where K_j is the total number of notes for instrument j.

The steps of the method in [11], image preprocessing, binarization, and blob selection, are explained in Sections 4.1.1, 4.1.2, and 4.1.3.

4.1.1. Image Preprocessing. The preprocessing stage ensures, through smoothing, that there are no energy discontinuities within an image patch. Furthermore, it gives more weight to the pixels situated closer to the central bin of the blob, in order to eliminate interference from neighboring notes (close in time and frequency) while still preserving vibratos or transitions between notes.

First, we convolve each row of the submatrix g^k_jn(t) with a smoothing Gaussian filter [35, p. 86]. We choose a one-dimensional Gaussian filter:

w(t) = \frac{1}{\sqrt{2\pi}\,\phi} e^{-t^2/(2\phi^2)},  (10)

where t is the time axis and \phi is the standard deviation. Thus, each row vector of g^k_jn(t) is convolved with w(t), and the result is truncated in order to preserve the dimensions of the initial matrix by removing the mirrored frames.

Second, we penalize values in g^k_jn(t) which are further away from the central bin by multiplying each column vector of this matrix with a one-dimensional Gaussian centered at the central frequency bin, represented by the vector v(n):

v(n) = \frac{1}{\sqrt{2\pi}\,\nu} e^{-(n-\kappa)^2/(2\nu^2)},  (11)

where n is the frequency axis, \kappa is the position of the central frequency bin, and \nu is the standard deviation. The values of these parameters are given in Section 6.4.1 as a part of the evaluation setup.

4.1.2. Image Binarization. Image binarization sets to zero the elements of the matrix g^k_jn(t) which are lower than a threshold and to one the elements larger than the threshold. This involves deriving a binary submatrix \bar{g}^k_jn(t) associated with note k:

\bar{g}^k_{jn}(t) = \begin{cases} 0 & \text{if } g^k_{jn}(t) < \operatorname{mean}(g^k_{jn}(t)), \\ 1 & \text{if } g^k_{jn}(t) \ge \operatorname{mean}(g^k_{jn}(t)). \end{cases}  (12)

4.1.3. Blob Selection. First, we detect blobs in each binary submatrix \bar{g}^k_jn(t), using the connectivity rules described in [35, p. 248] and [27]. Second, from the detected blob candidates, we determine the best blob for each note in a similar way to [11]: we assign a value to each blob depending on its area and on its overlap with the blobs corresponding to adjacent notes, which will help us penalize the overlap between blobs of adjacent notes.

As a first step, we penalize the parts of the blobs which overlap in time with blobs from the adjacent notes k − 1 and k + 1. This is done by weighting each element in \bar{g}^k_jn(t) with a factor \gamma, depending on the amount of overlap with the blobs of the adjacent notes. The resulting score matrix has the following expression:

s^k_{jn}(t) = \begin{cases} \gamma \cdot \bar{g}^k_{jn}(t) & \text{if } \bar{g}^k_{jn}(t) \wedge \bar{g}^{k-1}_{jn}(t) = 1, \\ \gamma \cdot \bar{g}^k_{jn}(t) & \text{if } \bar{g}^k_{jn}(t) \wedge \bar{g}^{k+1}_{jn}(t) = 1, \\ \bar{g}^k_{jn}(t) & \text{otherwise}, \end{cases}  (13)

where \gamma is a value in the interval [0, 1]. Then, we compute a score for each note k and for each blob associated with the note, by summing up the elements in the score matrix s^k_jn(t) which are considered to be part of the blob. The best blob candidate is the one with the highest score; further on, it is associated with the note, its boundaries giving the note onset and offset.
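Blob detection and selection can be sketched with the connected-component labelling of scipy (an assumption; the paper follows the connectivity rules of [35, 27]); the scoring follows equation (13):

```python
import numpy as np
from scipy.ndimage import label

def best_blob(binary, score_matrix):
    # binary: 0/1 note submatrix from equation (12);
    # score_matrix: equation (13), i.e. the binary mask with the frames
    # overlapping adjacent notes' blobs scaled by gamma in [0, 1].
    labels, n_blobs = label(binary)            # connected components
    best_mask, best_score = None, -np.inf
    for b in range(1, n_blobs + 1):
        mask = labels == b
        score = score_matrix[mask].sum()       # sum of (13) inside the blob
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask   # onset/offset given by the bounds of this mask
```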

4.2. Gains Reinitialization and Recomputation. Having associated a blob with each note (Section 4.1.3), we discard obtrusive energy from the gains matrix g_jn(t) by eliminating the pixels corresponding to the blobs which were not selected, making the matrix sparser. Furthermore, the energy excluded from an instrument's gains is redistributed to the other instruments, contributing to a better source separation. Thus, the gains are reinitialized with the information obtained from the corresponding blobs, and we can repeat the factorization in Algorithm 1 to recompute g_jn(t). Note that the energy excluded by the note refinement is set to zero in g_jn(t) and will remain zero during the factorization.

In order to refine the gains g_jn(t), we define a set of matrices p^k_jn(t), derived from the matrices corresponding to the best blobs, which contain 1 only for the elements associated with the best blob and 0 otherwise. We rebuild the gains matrix g_jn(t) from the set of submatrices p^k_jn(t): for the corresponding bins n and time frames t of note k, we initialize the values in g_jn(t) with the values in p^k_jn(t). Then, we reiterate the gains estimation with Algorithm 1. Furthermore, we obtain the spectrograms of the separated sources with the method described in Section 3.5.

5. PARAFAC Model for Multichannel Gains Estimation

Parallel factor analysis (PARAFAC) methods [28, 29] are mostly used under the nonnegative tensor factorization paradigm. By these means, the NMF model is extended to work with 3-valence tensors, where each slice of the tensor represents the spectrogram of one channel. Another approach is to stack up the spectrograms of all channels in a single matrix [36] and perform a joint estimation of the spectrograms of the sources in all channels. Hence, we extend the NMF model in Section 3 to jointly estimate the gains matrices in all the channels.

5.1. Multichannel Gains Estimation. The algorithm described in Section 3 estimates the gains g_nj of source j with respect to a single channel i, determined as the row in column j of the panning matrix where the element m_ij has the maximum value. However, we argue that a better estimation can benefit from the information in all channels. To this extent, we can further include update rules for other parameters, such as the mixing matrix m_ij, which were otherwise kept fixed in Section 3, so that the factorization algorithm estimates the parameters jointly for all the channels.

We propose to integrate the information from all channels by concatenating the corresponding spectrogram matrices along the time axis, as in

x(f, t) = [x_1(f, t)\; x_2(f, t)\; \cdots\; x_I(f, t)].  (14)

We are interested in jointly estimating the gains g_nj of source j in all the channels. Consequently, we concatenate in time the gains corresponding to each channel i, for i = 1, ..., I, where I is the total number of channels, as seen in (15). The new gains are initialized with the identical score information obtained from the alignment stage. However, during the estimation of the gains for channel i, the new gains g^i_nj(t) evolve accordingly, taking into account the corresponding spectrogram x_i(f, t). Moreover, during the gains refinement stage, each gain is refined separately with respect to each channel:

g_{nj}(t) = [g^1_{nj}(t)\; g^2_{nj}(t)\; \cdots\; g^I_{nj}(t)].  (15)

In (5) we describe the factorization model for the estimated spectrogram, considering the mixing matrix, the basis, and the gains. Since we estimate a set of I gains for each source j = 1, ..., J, this results in J estimations of the spectrograms corresponding to all the channels i = 1, ..., I, as seen in

\hat{x}_i(f, t) = \sum_{j=1}^{J} m_{ij} \sum_{n=1}^{N} g^i_{jn}(t) \sum_{h=1}^{H} a_{jn}(h)\, G(f - h f_0(n)).  (16)

Each iteration of the factorization algorithm yields additional information regarding the distribution of energy between each instrument and each channel. Therefore, we can include in the factorization an update rule for the mixing matrix m_ij, as in (17). By updating the mixing parameters at each factorization step, we can obtain a better estimation of \hat{x}_i(f, t):

m_{ij} \leftarrow m_{ij} \frac{\sum_{f,t} b_{jn}(f)\, g_{nj}(t)\, x_i(f, t)\, \hat{x}_i(f, t)^{\beta-2}}{\sum_{f,t} b_{jn}(f)\, g_{nj}(t)\, \hat{x}_i(f, t)^{\beta-1}}.  (17)

Considering the above, the new rules to estimate the parameters are described in Algorithm 2.


(1) Initialize b_jn(f) with the values learned in Section 3.3.
(2) Initialize the gains g_jn(t) with score information.
(3) Initialize the panning matrix m_ij with the values learned in Section 3.1.
(4) Update the gains using equation (6).
(5) Update the panning matrix using equation (17).
(6) Repeat Steps (4)-(5) until the algorithm converges (or the maximum number of iterations is reached).

Algorithm 2: Gain estimation method.
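A sketch of the update in (17), vectorized over channels and sources, is shown below; `B_G[j]` stands in for the precomputed source spectrogram model \sum_n b_jn(f) g_nj(t), and the epsilon guard against division by zero is our addition:

```python
import numpy as np

def update_mixing_matrix(M, X, X_hat, B_G, beta=1.3, eps=1e-12):
    # M: (n_channels, n_sources); X, X_hat: (n_channels, n_bins, n_frames)
    # observed and estimated spectrograms; B_G: (n_sources, n_bins, n_frames).
    X_hat = np.maximum(X_hat, eps)
    num = np.einsum('jft,ift->ij', B_G, X * X_hat ** (beta - 2.0))
    den = np.einsum('jft,ift->ij', B_G, X_hat ** (beta - 1.0))
    return M * num / np.maximum(den, eps)      # multiplicative update (17)
```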

Note that the current model does not estimate the phases for each channel. In order to reconstruct source j, the model in Section 3 uses the phase of the signal corresponding to the channel i where the panning matrix has its maximum value, as described in Section 3.5. Thus, in order to reconstruct the original signals, we can rely solely on the gains estimated in the single channel i, in a similar way to the baseline method.

5.2. Multichannel Gains Refinement. As presented in Section 5.1, for a given source we obtain an estimation of the gains corresponding to each channel. Therefore, we can apply the note refinement heuristics, in a similar manner to Section 4.1, to each of the gains [g^1_nj(t), ..., g^I_nj(t)]. Then, we can average out the estimations over all the channels, making the blob detection more robust to the variance between the channels:

g'_{nj}(t) = \frac{\sum_{i=1}^{I} g^i_{nj}(t)}{I}.  (18)

Having computed the mean over all channels as in (18), for each note k = 1, ..., K_j we process the submatrix g^k_jn(t) of the new gains matrix g'_jn(t), where K_j is the total number of notes for an instrument j. Specifically, we apply the same steps, preprocessing (Section 4.1.1), binarization (Section 4.1.2), and blob selection (Section 4.1.3), to each matrix g^k_jn(t), and we obtain a binary matrix p^k_jn(t) having 1s for the elements corresponding to the best blob and 0s for the rest.

Our hypothesis is that averaging out the gains over all channels makes the blob detection more robust. However, when performing the averaging, we do not account for the delays between the channels. In order to compute the delay for a given channel, we can compute the best blob separately with the method in Section 4.1 (matrix \bar{g}^k_jn(t)) and compare it with the one calculated from the averaged estimation (p^k_jn(t)). This step is equivalent to comparing the onset times of the two best blobs of the two estimations. Subtracting these onset times, we get the delay between the averaged estimation and the one obtained for a channel, and we can correct this in matrix p^k_jn(t). Accordingly, we zero-pad the beginning of p^k_jn(t) with the amount of zeros corresponding to the delay, or we remove the trailing zeros for a negative delay.
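The averaging of (18) with the per-channel delay compensation can be sketched as follows; here, the correction is expressed as a frame shift of each channel's gains before averaging, with names of our own choosing:

```python
import numpy as np

def average_gains(per_channel_gains, delays):
    # per_channel_gains: list of I arrays (n_bins, n_frames), one per channel;
    # delays: per-channel delay in frames (positive = channel lags).
    aligned = []
    for g, d in zip(per_channel_gains, delays):
        shifted = np.roll(g, -d, axis=1)       # compensate the delay
        if d > 0:
            shifted[:, -d:] = 0.0              # do not wrap energy around
        elif d < 0:
            shifted[:, :-d] = 0.0
        aligned.append(shifted)
    return np.mean(aligned, axis=0)            # equation (18)
```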

6. Materials and Evaluation

6.1. Dataset. The audio material used for evaluation was presented by Pätynen et al. [30] and consists of four passages of symphonic music from the Classical and Romantic periods. This work presented a set of anechoic recordings for each of the instruments, which were then synchronized between them so that they could later be combined into a mix of the orchestra. The musicians played in an anechoic chamber and, in order to be synchronous with the rest of the instruments, they followed a video featuring a conductor and a pianist playing each of the four pieces. Note that the benefit of having isolated recordings comes at the expense of ignoring the interactions between musicians, which commonly affect intonation and time-synchronization [37].

The four pieces differ in terms of the number of instruments per instrument class, style, dynamics, and size. The first passage is a soprano aria of Donna Elvira from the opera Don Giovanni by W. A. Mozart (1756–1791); it corresponds to the Classical period and is traditionally played by a small group of musicians. The second passage is from L. van Beethoven's (1770–1827) Symphony no. 7, featuring big chords and string crescendos; the chords and pauses make the reverberation tail of a concert hall clearly audible. The third passage, from Bruckner's (1824–1896) Symphony no. 8, represents the late Romantic period; it features large dynamics and a large orchestra. Finally, G. Mahler's Symphony no. 1, also featuring a large orchestra, is another example of late Romanticism. This piece has a more complex texture than the one by Bruckner. Furthermore, according to the musicians who recorded the dataset, the last two pieces were also more difficult to play and record [30].

In order to keep the evaluation setup consistent between the four pieces, we focus on the following instruments: violin, viola, cello, double bass, oboe, flute, clarinet, horn, trumpet, and bassoon. All tracks of a single instrument were joined into a single track for each of the pieces.

For the selected instruments, we list the differences between the four pieces in Table 1. Note that in the original dataset the violins are separated into two groups. However, for brevity of evaluation, and because our separation framework does not consider sources sharing the same instrument templates, we decided to merge the violins into a single group. Note that the pieces by Mahler and Bruckner have divisi in the groups of violins, which implies a larger number of instruments playing different melody lines simultaneously. This results in a scenario which is more challenging for source separation.

We created a ground truth score by hand-annotating the notes played by the instruments. In order to facilitate this process, we first gathered the scores in MIDI format and automatically computed a global audio-to-score alignment using the method from [13], which has won the MIREX score-following challenge in the past years.


Table 1: Anechoic dataset [30] characteristics.

Piece      Duration    Period     Instrument sections  Number of tracks  Max tracks/instrument
Mozart     3 min 47 s  Classical  8                    10                2
Beethoven  3 min 11 s  Classical  10                   20                4
Mahler     2 min 12 s  Romantic   10                   30                4
Bruckner   1 min 27 s  Romantic   10                   39                12

Figure 3: The steps to create the multichannel recordings dataset (anechoic recordings + score annotations → denoised recordings → multichannel recordings).

Then, we locally aligned the notes of each instrument by manually correcting the onsets and offsets to fit the audio. This was performed using Sonic Visualiser, with the guidance of the spectrogram and of the monophonic pitch estimation [38] computed for each of the isolated instruments. The annotation was performed by two of the authors, who cross-checked the work of their peer. Note that this dataset and the proposed annotations are useful not only for our particular task but also for the evaluation of multiple pitch estimation and automatic transcription algorithms in large orchestral settings, a context which has not been considered so far in the literature. The annotations can be found at the associated page (http://mtg.upf.edu/download/datasets/phenicx-anechoic).

During the recording process, detailed in [30], the gain of the microphone amplifiers was fixed to the same value for the whole process, which reduced the dynamic range of the recordings of the quieter instruments. This led to noisier recordings for most of the instruments. In Section 6.2 we describe the score-informed denoising procedure we applied to each track. From the denoised isolated recordings, we then used Roomsim to create a multichannel image, as detailed in Section 6.3. The steps necessary to pass from the anechoic recordings to the multichannel dataset are represented in Figure 3. The original files can be obtained from the Acoustics Group at Aalto University (http://research.cs.aalto.fi/acoustics/). For the denoising algorithm, please refer to http://mtg.upf.edu/download/datasets/phenicx-anechoic.

6.2. Dataset Denoising. The noise-related problems in the dataset were presented in [30]. We remove the noise in the recordings with the score-informed method in [39], which relies on learned noise spectral patterns. The main difference is that we rely on a manually annotated score, while in [39] the score is assumed to be misaligned, so further regularization is included to ensure that only certain note combinations in the score occur.

The annotated score yields the time intervals where an instrument is not playing; the noise pattern is learned only within those intervals. In this way, the method assures that the desired noise, which is a part of the actual sound of the instrument, is preserved in the denoised recordings.

Figure 4: The sources (instrument sections: violins, violas, cellos, double bass, horns, trumpets, clarinets, bassoons, oboes, and flutes) and the receivers (microphones CL, RVL, V1, V2, TR, HRN, WWR, and WWL) in the simulated room.

The algorithm takes each anechoic recording of a given instrument and removes the noise in the time intervals where the instrument is playing, while setting to zero the frames where the instrument is not playing.

6.3. Dataset Spatialization. To simulate a large reverberant hall, we use the software Roomsim [31]. We define a configuration file which specifies the characteristics of the hall and, for each of the microphones, their position relative to each of the sources. The simulated room has dimensions similar to those of the Royal Concertgebouw in Amsterdam, one of the partners in the PHENICX project, and represents a setup in which we tested our framework. The simulated room's width, length, and height are 28 m, 40 m, and 12 m. The absorption coefficients are specified in Table 2.

The positions of the sources and microphones in the room are common for orchestral concerts (Figure 4). A configuration file is created for each microphone; it contains its coordinates (e.g., (14, 17, 4) for the center microphone), and each source is defined through polar coordinates relative to the microphone (e.g., (11.4455, −95.1944, −15.1954): radius, azimuth, and elevation for the bassoon relative to the center microphone). We selected all the microphones to be cardioid, in order to match the realistic setup of the Concertgebouw hall.


Table 2: Room surface absorption coefficients.

Standard measurement frequencies (Hz):   125    250    500    1000   2000   4000
Absorption of wall in x = 0 plane:       0.4    0.3    0.3    0.3    0.2    0.1
Absorption of wall in x = Lx plane:      0.4    0.45   0.35   0.35   0.45   0.3
Absorption of wall in y = 0 plane:       0.4    0.45   0.35   0.35   0.45   0.3
Absorption of wall in y = Ly plane:      0.4    0.45   0.35   0.35   0.45   0.3
Absorption of floor (z = 0 plane):       0.5    0.6    0.7    0.8    0.8    0.9
Absorption of ceiling (z = Lz plane):    0.4    0.45   0.35   0.35   0.45   0.3

Figure 5: The reverberation time (RT60, in seconds) versus frequency for the simulated room.

Using the configuration file and the anechoic audio files corresponding to the isolated sources, Roomsim generates the audio files for each of the microphones, along with the impulse responses for each pair of instruments and microphones. The impulse responses and the anechoic signals are used during the evaluation to obtain the ground truth spatial image of the sources in the corresponding microphone. Additionally, we plot the reverberation time RT60 [31] computed by Roomsim across frequencies in Figure 5.

We need to adapt the ground truth annotations to the audio generated with Roomsim, as the original annotations were done on the isolated audio files. Roomsim creates the audio for a given microphone by convolving each of the sources with the corresponding impulse response and then summing up the results of the convolutions. We compute a delay for each pair of microphones and instruments by taking the position of the maximum value in the associated impulse response vector. Then we generate a score for each of the pairs by adding the corresponding delay to the note onsets. Additionally, since the offset time depends on the reverberation and the frequencies of the notes, we add 0.8 s to each note offset, besides the added delay, to account for the reverberation.
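A sketch of this adaptation, assuming notes are given as (onset, offset, pitch) triples in seconds and one impulse response per instrument-microphone pair:

import numpy as np

def adapt_score_to_mic(notes, impulse_response, fs, reverb_pad=0.8):
    # direct-path delay = position of the maximum of the impulse response
    delay = np.argmax(np.abs(impulse_response)) / fs
    # shift both boundaries by the delay; pad offsets for the reverberation tail
    return [(on + delay, off + delay + reverb_pad, pitch)
            for (on, off, pitch) in notes]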

6.4. Evaluation Methodology

6.4.1. Parameter Selection. In this paper we use a low-level spectral representation of the audio data, which is generated from a windowed FFT of the signal. We use a Hanning window with a size of 92 ms and a hop size of 11 ms.

Here a logarithmic frequency discretization is adopted. Furthermore, two time-frequency resolutions are used. First, to estimate the instrument models and the panning matrix, a single-semitone resolution is proposed. In particular, we implement the time-frequency representation by integrating the STFT bins corresponding to the same semitone. Second, for the separation task, a higher resolution of 1/4 of a semitone is used, which has proven to achieve better separation results [6]. The time-frequency representation is obtained by integrating the STFT bins corresponding to 1/4 of a semitone. Note that in the separation stage the learned basis functions $b_{jn}(f)$ are adapted to the 1/4-semitone resolution by replicating the basis at each semitone 4 times, once for each of the 4 samples of the 1/4-semitone resolution that belong to this semitone. For image binarization we pick for the first Gaussian $\phi = 3$ as the standard deviation, and for the second Gaussian $\kappa = 4$ as the position of the central frequency bin and $\nu = 4$ as the standard deviation corresponding to one semitone.
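The integration of STFT bins into (fractions of) semitone bands can be sketched as follows; fmin and the band count are illustrative values, not taken from the paper.

import numpy as np

def semitone_band_map(n_fft, fs, fmin=27.5, n_semitones=88, frac=1):
    """Band index per rFFT bin; frac=1 for the single-semitone stage,
    frac=4 for the 1/4-semitone separation stage."""
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    n_bands = n_semitones * frac
    edges = fmin * 2.0 ** (np.arange(n_bands + 1) / (12.0 * frac))
    band = np.digitize(freqs, edges) - 1
    band[(band < 0) | (band >= n_bands)] = -1  # drop out-of-range bins
    return band, n_bands

def integrate_bins(S, band, n_bands):
    """Sum the rows of an rFFT magnitude spectrogram S that fall
    into the same (fraction of a) semitone band."""
    out = np.zeros((n_bands, S.shape[1]))
    for b in range(n_bands):
        out[b] = S[band == b].sum(axis=0)
    return out

The replication of the learned semitone bases to the finer 1/4-semitone grid then amounts to something like np.repeat(b, 4, axis=0).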

We picked 10 iterations for the NMF, and we set the beta-divergence distortion $\beta = 1.3$, as in [6, 11].

6.4.2. Evaluation Setup. We perform three different kinds of evaluation: audio-to-score alignment, panning matrix estimation, and score-informed source separation. Note that for the alignment we evaluate the state-of-the-art system in [13]. This method does not align notes but combinations of notes in the score (aka states). Here the alignment is performed with respect to a single audio channel, corresponding to the microphone situated in the center of the stage. On the other hand, the offsets are estimated by shifting the original duration for each note in the score [13] or by assigning the offset time as the onset of the next state. We denote these two cases as INT and NEX, respectively.

Regarding the initialization of the separation framework, we can use the raw output of the alignment system. However, as stated in Section 4 and [3, 5, 24], a better option is to extend the onsets and offsets along a tolerance window, to account for the errors of the alignment system itself and for the delays between the center channel (on which the alignment is performed) and the other channels. Thus we test two hypotheses regarding the tolerance window for the possible errors. In the first case we extend the boundaries by 0.3 s for onsets and 0.6 s for offsets (T1), and in the second by 0.6 s for onsets and 1 s for offsets (T2). Note that the value of 0.3 s for the onset times is not arbitrary, but the usual threshold for onsets in the score-following in the


Table 3: Score information used for the initialization of score-informed source separation.

Tolerance window size                Offset estimation
T1: onsets 0.3 s, offsets 0.6 s      INT: interpolation of the offset time
T2: onsets 0.6 s, offsets 0.9 s      NEX: the offset is the onset of the next note

MIREX evaluation of real-time score-following [40]. Two different tolerance windows were tested to account for the complexity of this novel scenario. The tolerance window is slightly larger for offsets, due to the reverberation time and because the ending of a note is not as clear as its onset. A summary of the score information used to initialize the source separation framework is found in Table 3.
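Both the NEX offset rule and the T1/T2 tolerance windows are simple list transforms over (onset, offset, pitch) notes; a minimal sketch, assuming notes sorted by onset within one voice:

def nex_offsets(notes):
    """NEX: each note ends where the next one starts."""
    return [(on, notes[i + 1][0] if i + 1 < len(notes) else off, p)
            for i, (on, off, p) in enumerate(notes)]

def extend(notes, onset_tol=0.3, offset_tol=0.6):
    """T1-style tolerance window around the aligned boundaries."""
    return [(max(0.0, on - onset_tol), off + offset_tol, p)
            for (on, off, p) in notes]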

We label the test case corresponding to the initialization with the raw output of the alignment system as Ali. Conversely, the test case corresponding to the tolerance window initialization is labeled as Ext. Furthermore, within the tolerance window we can refine the note onsets and offsets with the methods in Section 4.1 (Ref1) and Section 5.2 (Ref2), resulting in two additional test cases. Since method Ref1 can only refine the score to a single channel, its results are solely computed with respect to that channel. For the multichannel refinement Ref2 we report the results of the alignment of each instrument with respect to each microphone. A graphic of the initialization of the framework with the four test cases listed above (Ali, Ext, Ref1, and Ref2), along with the ground truth score initialization (GT), is found in Figure 7; we present the source separation results for these cases in Section 6.5.3.

In order to evaluate the panning matrix estimation stage, we compute an ideal panning matrix based on the impulse responses generated by Roomsim during the creation of the multichannel audio (see Section 6.3). The ideal panning matrix gives the ideal contribution of each instrument in each channel, and it is computed by searching for the maximum in the impulse response vector corresponding to each instrument-channel pair, as in

$m_{\mathrm{ideal}}(i, j) = \max_t \big(\mathrm{IR}(i, j)(t)\big),$  (19)

where $\mathrm{IR}(i, j)(t)$ is the impulse response of source $i$ in channel $j$. By comparing the estimated matrix $m(i, j)$ with the ideal one $m_{\mathrm{ideal}}(i, j)$, we can determine whether the algorithm picked a wrong channel for separation.
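In code, this check reduces to comparing channel argmaxes; the (source, channel) index convention follows (19), and everything else is an assumption of this sketch.

import numpy as np

def wrong_channel_sources(m_est, irs):
    """irs[i][j]: impulse response of source i in channel j.
    Returns the indices of the sources whose estimated best channel
    differs from the one given by the ideal panning matrix of (19)."""
    m_ideal = np.array([[np.max(np.abs(ir)) for ir in row] for row in irs])
    return np.flatnonzero(m_est.argmax(axis=1) != m_ideal.argmax(axis=1))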

6.4.3. Evaluation Metrics. For score alignment we are interested in a measure which relates to source separation and accounts for the audio frames which are correctly detected, rather than an alignment rate computed per note onset, as found in [13]. Thus we evaluate the alignment at the frame level rather than at the note level. A similar reasoning on the evaluation of score alignment is found in [4].

We consider 0.011 s, the size of a frame, as the temporal granularity for this measure. Then a frame of a musical note is considered a true positive (tp) if it is found in the ground truth score and in the aligned score within the exact time boundaries. The same frame is labeled as a false positive (fp) if it is found only in the aligned score, and as a false negative (fn) if it is found only in the ground truth score. Since the gains initialization is done with score information (see Section 4), lost frames (recall) and incorrectly detected frames (precision) impact the performance of the source separation algorithm. Precision is defined as $p = \mathrm{tp}/(\mathrm{tp} + \mathrm{fp})$ and recall as $r = \mathrm{tp}/(\mathrm{tp} + \mathrm{fn})$. Additionally, we compute the harmonic mean of precision and recall to obtain the F-measure as $F = 2 \cdot (p \cdot r)/(p + r)$.
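A sketch of this frame-level measure, rasterising (onset, offset, pitch) notes onto a 0.011 s grid:

import numpy as np

def frame_level_prf(gt_notes, aligned_notes, hop=0.011):
    def frames(notes):
        grid = set()
        for on, off, pitch in notes:
            grid.update((pitch, k)
                        for k in range(int(on / hop), int(off / hop)))
        return grid
    gt, al = frames(gt_notes), frames(aligned_notes)
    tp, fp, fn = len(gt & al), len(al - gt), len(gt - al)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f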

The source separation evaluation framework and the metrics employed are described in [41, 42]. Correspondingly, we use Source to Distortion Ratio (SDR), Source to Interference Ratio (SIR), Source to Artifacts Ratio (SAR), and source Image to Spatial distortion Ratio (ISR). While SDR measures the overall quality of the separation and ISR the spatial reconstruction of the source, SIR relates to the rejection of interferences and SAR to the absence of forbidden distortions and artifacts.

The evaluation of source separation is a computationally intensive process. Additionally, processing the long audio files in the dataset would require a large amount of memory for the matrix calculations. To reduce the memory requirements, the evaluation is performed on blocks of 30 s with 1 s overlap, to allow for continuation.
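A minimal block splitter for this purpose is sketched below; feeding each block of reference and estimated signals to a BSS Eval implementation (e.g., mir_eval.separation) is left out.

def blocks(signal, fs, block_dur=30.0, overlap=1.0):
    """Yield (start_time_sec, block) pairs of 30 s with 1 s overlap."""
    size = int(block_dur * fs)
    hop = int((block_dur - overlap) * fs)
    for start in range(0, max(len(signal) - size, 0) + 1, hop):
        yield start / fs, signal[start:start + size]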

6.5. Results

6.5.1. Score Alignment. We evaluate the output of the alignment (Ali), along with the estimation of the note offsets (INT and NEX), in terms of F-measure (see Section 6.4.3), precision, and recall, ranging from 0 to 1. Additionally, we evaluate the optimal size for extending the note boundaries along the onsets and offsets (T1 and T2) for the refinement methods Ref1 and Ref2 and the baseline Ext. Since there are many differences between the pieces, we report the results individually per piece in Table 4.

Methods Ref1 and Ref2 depend on a binarization threshold which determines how much energy is set to zero. A lower threshold results in the consolidation of larger blobs in blob detection. In [11] this threshold is set to 0.5 for a dataset of monaural recordings of Bach chorales played by four instruments. However, we are facing a multichannel scenario where capturing the reverberation is important, especially considering that offsets were annotated with a low energy threshold. Thus we are interested in losing as little energy as possible, and we set lower values for the threshold: 0.3 and 0.1. Consequently, when analyzing the results, the lower threshold achieves better performance in terms of F-measure for Ref1 (0.67 and 0.72) and for Ref2 (0.71 and 0.75).
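The role of the threshold can be sketched with a simple binarize-and-label step; the actual refinement additionally smooths the gains and penalises blobs that overlap adjacent notes (Section 4.1), which is omitted here.

import numpy as np
from scipy import ndimage

def best_blob(gains_note, threshold=0.1):
    """Binarise one note's gains submatrix and keep the connected blob
    with the most energy; a lower threshold consolidates larger blobs."""
    mask = gains_note / (gains_note.max() + 1e-12) > threshold
    labels, n = ndimage.label(mask)
    if n == 0:
        return mask
    energy = ndimage.sum(gains_note, labels, index=np.arange(1, n + 1))
    return labels == (1 + int(np.argmax(energy)))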

According to Table 4, extending note offsets (NEX) rather than interpolating them (INT) gives lower recall in all pieces; this method leads to losing more frames, which cannot be recovered even by extending the offset times in T2. NEX T2 always yields a lower recall when compared to INT T2 (e.g., r = 0.85 compared to r = 0.97 for Mozart).

The output of the alignment system (Ali) is not a good option for initializing the gains of the source separation system: it has a high precision and a very low recall (e.g., the case INT Ali has p = 0.95 and r = 0.39, compared to the case INT Ext, which has p = 0.84 and r = 0.84, for Beethoven).


Table 4: Alignment evaluated in terms of F-measure, precision, and recall, ranging from 0 to 1.

                  Mozart            Beethoven         Mahler            Bruckner
                  F     p     r     F     p     r     F     p     r     F     p     r
INT  Ali          0.69  0.93  0.55  0.56  0.95  0.39  0.61  0.85  0.47  0.60  0.94  0.44
     T1  Ref1     0.77  0.77  0.76  0.79  0.81  0.77  0.63  0.74  0.54  0.72  0.73  0.70
         Ref2     0.83  0.79  0.88  0.82  0.82  0.81  0.67  0.77  0.60  0.81  0.77  0.85
         Ext      0.82  0.73  0.94  0.84  0.84  0.84  0.77  0.69  0.87  0.86  0.78  0.96
     T2  Ref1     0.77  0.77  0.76  0.76  0.75  0.77  0.63  0.74  0.54  0.72  0.70  0.71
         Ref2     0.83  0.78  0.88  0.82  0.76  0.87  0.67  0.77  0.59  0.79  0.73  0.86
         Ext      0.72  0.57  0.97  0.79  0.70  0.92  0.69  0.55  0.93  0.77  0.63  0.98
NEX  Ali          0.49  0.94  0.33  0.51  0.89  0.36  0.42  0.90  0.27  0.48  0.96  0.32
     T1  Ref1     0.70  0.77  0.64  0.72  0.79  0.66  0.63  0.74  0.54  0.68  0.72  0.64
         Ref2     0.73  0.79  0.68  0.71  0.79  0.65  0.66  0.77  0.58  0.73  0.74  0.72
         Ext      0.73  0.77  0.68  0.71  0.80  0.65  0.69  0.75  0.64  0.76  0.79  0.72
     T2  Ref1     0.74  0.78  0.71  0.72  0.74  0.70  0.63  0.74  0.54  0.72  0.79  0.72
         Ref2     0.80  0.80  0.80  0.75  0.75  0.75  0.67  0.77  0.59  0.79  0.73  0.86
         Ext      0.73  0.65  0.85  0.73  0.69  0.77  0.72  0.64  0.82  0.79  0.69  0.91


Table 5: Instruments for which the closest microphone was incorrectly determined, for different score information (GT, Ali, T1, T2, INT, and NEX) and the two room setups.

         GT        INT: Ali               INT: T1          INT: T2                NEX: Ali               NEX: T1   NEX: T2
Setup 1  Clarinet  Clarinet, double bass  Clarinet, flute  Clarinet, flute, horn  Clarinet, double bass  —         Clarinet, flute
Setup 2  Cello     Cello, flute           Cello, flute     Cello, flute           Bassoon                Flute     Cello, flute

For the case of Beethoven, the output of Ali is particularly poor compared to the other pieces. However, by extending the boundaries (Ext) and applying note refinement (Ref1 or Ref2), we are able to increase the recall and match the performance on the other pieces.

When comparing the sizes of the tolerance window for the onsets and offsets, we observe that the alignment is more accurate when detecting the onsets within 0.3 s and the offsets within 0.6 s. In Table 4, T1 achieves better results than T2 (e.g., F = 0.77 for T1 compared to F = 0.69 for T2, for Mahler). Relying on a large window retrieves more frames but also significantly damages the precision. However, for source separation we might want to lose as little information as possible. It is in this special case that the refinement methods Ref1 and Ref2 show their importance. When facing larger time boundaries, as with T2, Ref1 and especially Ref2 are able to reduce the errors, achieving better precision with a minimal loss in recall.

The refinement Ref1 performs worse than Ref2, the multichannel refinement (e.g., F = 0.72 compared to F = 0.81 for Bruckner, INT T1). Note that in its original version [11] Ref1 assumed monophony within a source, as it was tested in the simple case of the Bach10 dataset [4]. To that extent, it relied on a graph computation to determine the best distribution of blobs. However, due to the increased polyphony within an instrument (e.g., violins playing divisi), with simultaneous melodic lines, we disabled this feature; in this case Ref1 has lower recall, as it loses more frames. On the other hand, Ref2 is more robust because it computes a blob estimation per channel; averaging out these estimations yields better results.

The refinement works worse for the more complex pieces (Mahler and Bruckner) than for the simpler ones (Mozart and Beethoven). Increasing the polyphony within a source and the number of instruments with many interleaving melodic lines (a less sparse score) also makes the task more difficult.

6.5.2. Panning Matrix. Estimating the panning matrix correctly is an important step in the proposed method, since Wiener filtering is performed on the channel where the instrument has the most energy. If the algorithm picks a different channel for this step, we find more interference between instruments in the separated audio files.

As described in Section 3.1, the estimation of the panning matrix depends on the number of nonoverlapping partials of the notes found in the score and on their alignment with the audio. To that extent, the more nonoverlapping partials we have, the more robust the estimation.

Initially we experimented with computing the panning matrix separately for each piece. In the case of Bruckner, the piece is simply too short, and there are too few nonoverlapping partials to yield a good estimation, resulting in errors in the panning matrix. Since the instrument setup is the same for the Bruckner, Beethoven, and Mahler pieces (10 sources in the same positions on the stage), we decided to jointly estimate the matrix for the concatenated audio pieces and the associated scores. We denote as Setup 1 the Mozart piece, played by 8 sources, and as Setup 2 the Beethoven, Mahler, and Bruckner pieces, played by 10 sources.

Since the panning matrix is computed using the score, different score information can yield very different estimates of the panning matrix. To that extent, we evaluate the influence of the audio-to-score alignment, namely, the cases INT, NEX, Ali, T1, and T2, as well as the initialization with the ground truth score information, GT.

In Table 5 we list the instruments for which the algorithm picked the wrong channel. Note that in the room setup generated with Roomsim, most of the instruments in Table 5 are placed close to other sources from the same family of instruments, for example, cello and double bass, or flute with clarinet, bassoon, and oboe. In such cases the algorithm makes more mistakes when selecting the correct channel on which to perform source separation.

In the column GT of Table 5 we can see that having a perfectly aligned score yields fewer errors when estimating the panning matrix. Conversely, in a real-life scenario we cannot rely on a hand-annotated score. In this case, for all columns of Table 5 excluding GT, the best estimation is obtained by the combination of NEX and T1: taking the offset time as the onset of the next note and then extending the score with the smaller window.

Furthermore, we compute the SDR values for the instruments in Table 5, column GT (clarinet and cello), for separation done in the correct channel and in the estimated channel. For Setup 1 the channel for the clarinet is wrongly assigned to WWL (woodwinds left), the correct one being WWR (woodwinds right), even when we have a perfectly aligned score (GT). However, the microphones WWL and WWR are very close (see Figure 4) and do not capture significant energy from other instrument sections, so the SDR difference is less than 0.01 dB. However, in Setup 2 the cello is wrongly separated in the WWL channel, and the SDR difference between this audio and the audio separated in the correct channel is around −11 dB for each of the three pieces.

6.5.3. Source Separation. We use the evaluation metrics described in Section 6.4.3. Since there is a lot of variability between the four pieces, it is more informative to present the results per piece rather than aggregating them.

Figure 6: Results in terms of SDR, SIR, SAR, and ISR for the instruments and the pieces (Mozart, Beethoven, Mahler, and Bruckner) in the dataset.

First we analyze the separation results per instrument in an ideal case. We assume that the best results for score-informed source separation are obtained in the case of a perfectly aligned score (GT). Furthermore, for this case we perform the separation in the correct channel for all the instruments, since in Section 6.5.2 we saw that picking a wrong channel can be detrimental. We present the results as a bar plot in Figure 6.

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6 we can see that the more complex the piece, the more difficult it is to achieve a good separation. For instance, note that cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for cello in Mozart compared to −5 dB in Mahler). Cello and double bass are close to each other in both setups, and similarly clarinet and flute, so we expect interference between them. Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in the lower SIR values for double bass (5.5 dB SIR for Mozart, but −1.8, −0.1, and −0.2 dB SIR for the others) and flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask the multichannel spectrogram is multiplied with the panning matrix. Thus, wrong values in this matrix can yield wrong amplitude values in the resulting signals.

This is the case for the trumpet, which is allocated a close microphone in the current setup and for which we would expect a good separation. However, the trumpet achieves a poor ISR (5.5, 1, and −1 dB) but a good separation in terms of SIR and SAR. Similarly, other instruments, such as cello, double bass, flute, and viola, face the same problem, particularly for the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.


Figure 7: The test cases for the initialization of score-informed source separation, for the submatrix $p^{s}_{jn}(t)$ of note s: ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).

A low SDR is obtained in the case of Mahler, which is related to the poor alignment results obtained for this piece. As seen in Table 4, for the INT case, the F-measure is almost 8% lower for Mahler than for the other pieces, mainly because of the bad precision.

The results are considerably worse for the double bass for the more complex pieces of Beethoven (−9.1 dB SDR), Mahler (−8.5 dB SDR), and Bruckner (−4.7 dB SDR); for further analysis we consider it an outlier and exclude it from the analysis.

Second, we want to evaluate the usefulness of note refinement in source separation. As seen in Section 4, the gains for NMF separation are initialized with score information or with a refined score, as described in Section 4.2. A summary of the different initialization options is shown in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization with the ground truth annotations (Figure 7 (GT)), the direct output of the score alignment system (Figure 7 (Ali)), the common practice of NMF gains initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7 (Ext)), and the refinement approaches (Figure 7 (Ref1 and Ref2)). Note that Ref1 is the refinement done with the method in Section 4.1 and Ref2 with the multichannel method described in Section 5.2.

We test the difference between the binarization thresholds 0.3 and 0.1 used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives F-value = 0.0796 and p value = 0.7778, which shows no significant difference between the two binarization thresholds.
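Such a test can be run with scipy.stats.f_oneway, sketched here with hypothetical SDR arrays (the actual per-excerpt values are not reproduced in the paper):

import numpy as np
from scipy import stats

# hypothetical per-excerpt SDR values (dB) for the two thresholds
sdr_t03 = np.array([4.1, 2.3, -0.5, 1.8])
sdr_t01 = np.array([4.0, 2.5, -0.4, 1.9])
f_val, p_val = stats.f_oneway(sdr_t03, sdr_t01)
print(f_val, p_val)  # a large p-value -> no significant difference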

The results for the five initializations (GT, Ref1, Ref2, Ext, and Ali) are presented in Figure 8 for each of the four pieces. Note that for Ref1, Ref2, and Ext we aggregate information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate between the instruments, with the piece by Mahler having the worst results and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors, such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues we described in Section 6.1, can increase the difficulty of separation up to the point that Ref1, Ref2, and Ext bring only a minimal improvement.

To that extent, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps improve the separation, increasing the SDR by 1–1.5 dB or more for all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase the SDR for most of the pieces, with the exception of the piece by Mahler. This is due to an increase in SIR and a decrease of the interferences in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext; for this piece the difference in SIR is around 2 dB. Then, for Beethoven, Ref1 and Ref2 gain 0.5 dB in terms of SDR when compared to Ext, and 1.5 dB in SIR. For Bruckner solely Ref2 has a higher SDR; however, SIR increases by 1.5 dB for both Ref1 and Ref2. Note that not only do Ref1 and Ref2 refine the time boundaries of the notes, but the refinement also happens in frequency, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the estimation of the note offsets (INT and NEX) and of the tolerance window sizes (T1 and T2), which account for errors in the alignment. Note that for this case we do not include the refinement in the results; we evaluate only the case Ext, leaving out the refinement in order to isolate the influence of T1 and T2. The results are presented in Figure 9 and show that the best results are obtained for the interpolation of the offsets, INT. This relates to the results presented in Section 6.5.1. Similarly to the analysis regarding the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to draw a conclusion on which strategy is better for the initialization, as the error bars for the ground truth overlap with those of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gains estimation, as proposed in Section 5.1, and the single-channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR and obtained F-value = 1.712 and p value = 0.1908. Hence, there is no significant difference between single-channel and multichannel gain estimation when we do not perform postprocessing of the gains using note refinement. However, although the new update rules do not by themselves help in the multichannel case, they allow us to better refine the gains: in this case we aggregate information over all the channels, and blob detection is more robust, even to variations of the binarization threshold. On account of that, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available (http://repovizz.upf.edu/phenicx/anechoic_multi).

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation is Instrument Emphasis, which aims at processing multitrack orchestral recordings.


Figure 8: Results in terms of SDR, SIR, SAR, and ISR for the NMF gains initialization in the different test cases (GT, Ref1, Ref2, Ext, and Ali).

Once we have the separated tracks, it allows emphasizing a particular instrument over the full orchestra downmix recording. Our workflow consists in processing a multichannel orchestral recording for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of other sections, obtaining an enhanced signal for the selected instrument.

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user when uploading media content and for presenting the results to the user. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks.

After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks. The separation quality is not degraded if the blocks have a sufficient duration. In our case we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx) and are integrated into the PHENICX prototype (http://phenicx.com).
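Once the separated tracks are available, the emphasis itself can be as simple as a time-aligned remix; the function and the alpha parameter below are illustrative, not part of the deployed system.

def emphasize(downmix, separated_track, alpha=0.5):
    """Crossfade between the full mix (alpha=0) and the isolated
    separated track (alpha=1); both arrays must be time-aligned."""
    return (1.0 - alpha) * downmix + alpha * separated_track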

7.2. Acoustic Rendering. The second application for the separated material involves augmented or virtual reality scenarios, where we apply a spatialization of the separated musical sources. Acoustic Rendering aims at recreating acoustically the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage and of the listener.

We have considered binaural synthesis as the most suitable spatial audio technique for this application. Humans locate the direction of an incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound will arrive with a different intensity and at different time instances at both ears.


Figure 9: Results in terms of SDR for the combination of the different offset estimation methods (INT and NEX) and the different sizes of the tolerance window for note onsets and offsets (T1 and T2); GT is the ground truth alignment and Ali is the raw output of the score alignment.

The idea behind binaural synthesis is to artificially generate these cues to be able to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype we have used the noncommercial Unity3D plugin for binaural synthesis provided by 3DCeption (https://twobigears.com/index.php).

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker, https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of the musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances), for the instrument visualization and Acoustic Rendering use-cases introduced above.

As inputs for the source localization method we need the multichannel recordings and the approximate position of the microphones on stage. In concert halls the recording setup typically consists of a grid structure of hanging overhead microphones, whose positions are therefore kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, it being currently a hot topic in acoustic signal processing [45].

Our approach is a novel time-difference-of-arrival (TDOA) method based on note-onset delay estimation. It takes the refined score alignment obtained prior to the signal separation (see Section 4.1). It follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible location of a sound source based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate a list of time delay values corresponding to all the note onsets in the score and take the maximum of their histogram. In our experiment we have a time resolution of 28 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play in isolation, as SRP-PHAT [45] does, and can therefore be used in complex scenarios.
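A sketch of the per-pair delay estimation, assuming per-microphone onset times refined as in Section 4.1; the subsequent hyperboloid intersection step is omitted.

import numpy as np

def pairwise_delay(onsets_mic_a, onsets_mic_b, resolution=0.028):
    """Mode of the histogram of per-note onset differences between
    two microphones, at the 28 ms analysis resolution."""
    diffs = np.asarray(onsets_mic_b) - np.asarray(onsets_mic_a)
    bins = np.arange(diffs.min(), diffs.max() + 2 * resolution, resolution)
    hist, edges = np.histogram(diffs, bins=bins)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])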

8. Outlook

In this paper we proposed a framework for the score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings is objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment, which improved the separation for three of the four pieces in the dataset when compared to the two other initialization options for our framework: the raw output of the score alignment and the tolerance window on which the alignment relies.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems in recovering the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done over a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach to source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone for separating a source is the closest one, as determined by the localization method.


When looking at the separation across instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies between the pieces and the instruments, and future research could provide more insight into this problem. Note that an increasing degree of consonance was related to a more difficult case for source separation in [17]. Hence, we could expect a worse separation for instruments which harmonize with or accompany other sections, as is the case of viola, cello, and double bass in some pieces. Future research could give more insight into the relation between the musical characteristics of the pieces (e.g., tonality and texture) and the source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us to denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow for other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework, related to Instrument Emphasis and Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustin Martorell for his insights on the musicology side and for correcting the scores, and Oscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference 2013.
[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.
[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.
[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal on Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.
[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.
[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodriguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.
[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.
[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.
[9] F. J. Rodriguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.
[10] F. J. Rodriguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.
[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, October 2015.
[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.
[13] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, 2015.
[14] Ö. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.
[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.
[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, IEEE, Taipei, Taiwan, April 2009.
[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.
[18] F. J. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.
[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.
[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.
[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.
[22] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.
[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.
[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.
[25] B. Niedermayer, Accurate audio-to-score alignment—data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.
[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.
[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.
[28] D. FitzGerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.
[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Málaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.
[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.
[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.
[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.
[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.
[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.
[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.
[38] M. Mauch and S. Dixon, "PYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.
[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.
[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.
[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.
[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.
[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.
[45] A. Martí Guerola, Multichannel audio processing for speaker localization, separation and enhancement, Universitat Politècnica de València, 2013.


Journal of Electrical and Computer Engineering 7

notes which will help us penalize the overlap between blobsof adjacent notes

As a first step we penalize parts of the blobswhich overlapin time with other blobs from different notes 119896 minus 1 119896 119896 + 1This is done by weighting each element in 119896119895119899(119905) with factor120574 depending on the amount of overlapping with blobs fromadjacent notes The resulting score matrix has the followingexpression

119896119895119899 (119905) =

120574 lowast 119896119895119899 (119905) if 119896119895119899 (119905) and 119896minus1119895119899 (119905) = 1120574 lowast 119896119895119899 (119905) if 119896119895119899 (119905) and 119896+1119895119899 (119905) = 1119896119895119899 (119905) otherwise

(13)

where 120574 is a value in the interval [0 1]Then we compute a score for each note 119897 and for each blob

associated with the note by summing up the elements in thescore matrix 119896119895119899(119905) which are considered to be part of a blobThe best blob candidate is the one with the highest score andfurther on it is associated with the note its boundaries givingthe note onset and offsets

42 Gains Reinitialization and Recomputation Having asso-ciated a blob with each note (Section 413) we discardobtrusive energy from the gainsmatrix 119892119895119899(119905) by eliminatingthe pixels corresponding to the blobswhichwere not selectedmaking thematrix sparser Furthermore the energy excludedfrom instrumentrsquos gains is redistributed to other instrumentscontributing to better source separation Thus the gainsare reinitialized with the information obtained from thecorresponding blobs and we can repeat the factorizationAlgorithm 1 to recompute 119892119895119899(119905) Note that the energy whichis excluded by note refinement is set to zero 119892119895119899(119905) and willremain zero during the factorization

In order to refine the gains 119892119895119899(119905) we define a set ofmatrices 119901119896119895119899(119905) derived from the matrices corresponding tothe best blobs 119896119895119899(119905) which contain 1 only for the elementsassociated with the best blob and 0 otherwise We rebuildthe gains matrix 119892119895119899(119905) with the set of submatrices 119901119896119895119899(119905)For the corresponding bins 119899 and time frames 119905 of note 119896we initialize the values in 119892119895119899(119905) with the values in 119901119896119895119899(119905)Then we reiterate the gains estimation with Algorithm 1Furthermore we obtain the spectrogram of the separatedsources with the method described in Section 35

5 PARAFAC Model for MultichannelGains Estimation

Parallel factor analysis methods (PARAFAC) [28 29] aremostly used under the nonnegative tensor factorizationparadigm By these means the NMF model is extended towork with 3-valence tensors where each slice of the sensorrepresents the spectrogram for a channel Another approachis to stack up spectrograms for each channel in a singlematrix[36] and perform a joint estimation of the spectrograms ofsources in all channels Hence we extend the NMF model

in Section 3 to jointly estimate the gains matrices in all thechannels

51 Multichannel Gains Estimation The algorithm describedin Section 3 estimates gains 119892119899119895 for source 119895 with respectto single channel 119894 determined as the corresponding rowin column 119895 of the panning matrix where element 119898119894119895has the maximum value However we argue that a betterestimation can benefit from the information in all channelsTo this extent we can further include update rules for otherparameters such as mixing matrix119898119894119895 which were otherwisekept fixed in Section 3 because the factorization algorithmestimates the parameters jointly for all the channels

We propose to integrate information from all channels byconcatenating their corresponding spectrogram matrices onthe time axis as in

119909 (119891 119905) = [1199091 (119891 119905) 1199091 (119891 119905) sdot sdot sdot 119909119868 (119891 119905)] (14)

We are interested in jointly estimating the gains 119892119899119895 of thesource 119895 in all the channels Consequently we concatenate intime the gains corresponding to each channel 119894 for 119894 = 1 119868where 119868 is the total number of channels as seen in (15) Thenew gains are initialized with identical score informationobtained from the alignment stage However during theestimation of the gains for channel 119894 the new gains 119892119894119899119895(119905)evolve accordingly taking into account the correspondingspectrogram 119909119894(119891 119905) Moreover during the gains refinementstage each gain is refined separately with respect to eachchannel

119892119899119895 (119905) = [1198921119899119895 (119905) 1198922119899119895 (119905) sdot sdot sdot 119892119868119899119895 (119905)] (15)

In (5) we describe the factorization model for theestimated spectrogram considering the mixing matrix thebasis and the gains Since we estimate a set of 119868 gains for eachsource 119895 = 1 119869 this will result in 119869 estimations of thespectrograms corresponding to all the channels 119894 = 1 119868as seen in

119895119894 (119891 119905)

=119869

sum119895=1

119898119894119895119873

sum119899=1

119892119894119895119899 (119905)119867

sumℎ=1

119886119895119899 (ℎ) 119866 (119891 minus ℎ1198910 (119899)) (16)

Each iteration of the factorization algorithm yields addi-tional information regarding the distribution of energybetween each instrument and each channel Therefore wecan include in the factorization update rules for mixingmatrix 119898119894119895 as in (17) By updating the mixing parameters ateach factorization step we can obtain a better estimation for119895119894 (119891 119905)

119898119894119895 larr997888 119898119894119895sum119891119905 119887119895119899 (119891) 119892119899119895 (119905) 119909119894 (119891 119905) 119894 (119891 119905)

120573minus2

sum119891119905 119887119895119899 (119891) 119892119899119895 (119905) (119891 119905)120573minus1

(17)

Considering the above the new rules to estimate theparameters are described in Algorithm 2

8 Journal of Electrical and Computer Engineering

(1) Initialize 119887119895119899(119891) with the values learned in Section 33(2) Initialize the gains 119892119895119899(119905) with score information(3) Initialize the panning matrix119898119894119895 with the values learned in Section 31(4) Update the gains using equation (6)(5) Update the panning matrix using equation (17)(6) Repeat Step (2) until the algorithm converges (or maximum number of iterations is reached)

Algorithm 2 Gain estimation method

Note that the current model does not estimate the phasesfor each channel In order to reconstruct source 119895 the modelin Section 3 uses the phase of the signal corresponding tochannel 119894 where it has the maximum value in the panningmatrix as described in Section 35 Thus in order to recon-struct the original signals we can solely rely on the gainsestimated in single channel 119894 in a similar way to the baselinemethod

52 Multichannel Gains Refinement As presented in Sec-tion 51 for a given source we obtain an estimation of thegains corresponding to each channelTherefore we can applynote refinement heuristics in a similar manner to Section 41for each of the gains [1198921119899119895(119905) 119892119868119899119895(119905)]Thenwe can averageout the estimations for all the channel making the blobdetection more robust to the variances between the channels

$$g'_{nj}(t) = \frac{\sum_{i=1}^{I} g^{i}_{nj}(t)}{I} \qquad (18)$$

Having computed the mean over all channels as in (18), for each note $k = 1, \ldots, K_{j}$ we process the submatrix $g^{k}_{jn}(t)$ of the new gains matrix $g'_{jn}(t)$, where $K_{j}$ is the total number of notes for an instrument $j$. Specifically, we apply the same steps, preprocessing (Section 4.1.1), binarization (Section 4.1.2), and blob selection (Section 4.1.3), to each matrix $g^{k}_{jn}(t)$, and we obtain a binary matrix $p^{k}_{jn}(t)$ having 1s for the elements corresponding to the best blob and 0s for the rest.

Our hypothesis is that averaging the gains over all channels makes blob detection more robust. However, when performing the averaging, we do not account for the delays between the channels. In order to compute the delay for a given channel, we can compute the best blob separately with the method in Section 4.1 (matrix $\tilde{p}^{k}_{jn}(t)$) and compare it with the one calculated from the averaged estimation ($p^{k}_{jn}(t)$). This step is equivalent to comparing the onset times of the two best blobs of the two estimations. Subtracting these onset times, we obtain the delay between the averaged estimation and the one obtained for the channel, and we can correct this in matrix $p^{k}_{jn}(t)$: we zero-pad the beginning of $p^{k}_{jn}(t)$ with the number of zeros corresponding to the delay, or we trim the corresponding frames from its beginning for a negative delay.
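A minimal sketch of this averaging and delay-correction step, assuming a helper detect_best_blob that implements the single-channel refinement of Section 4.1 (preprocessing, binarization, blob selection; not shown), and operating for simplicity on the whole gains matrix of one source rather than on the per-note submatrices:

```python
import numpy as np

def average_gains_and_correct_delay(G, detect_best_blob):
    # G: (I, N, T) gains of one source estimated in each of the I channels
    # detect_best_blob: callable returning a binary (N, T) blob mask
    I = G.shape[0]
    g_avg = G.mean(axis=0)            # eq. (18)
    p_avg = detect_best_blob(g_avg)   # blob mask from the averaged gains
    masks = []
    for i in range(I):
        p_i = detect_best_blob(G[i])  # per-channel blob mask (Section 4.1)
        # Delay = difference between the onset (first active frame) of the
        # per-channel blob and that of the averaged-estimation blob
        delay = int(np.argmax(p_i.any(axis=0))) - int(np.argmax(p_avg.any(axis=0)))
        # Shift the averaged mask: zero-pad the beginning for a positive
        # delay, trim leading frames for a negative one
        p_corr = np.roll(p_avg, delay, axis=1)
        if delay > 0:
            p_corr[:, :delay] = 0
        elif delay < 0:
            p_corr[:, delay:] = 0
        masks.append(p_corr)
    return masks
```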

6. Materials and Evaluation

6.1. Dataset. The audio material used for evaluation was presented by Pätynen et al. [30] and consists of four passages of symphonic music from the Classical and Romantic periods. This work presented a set of anechoic recordings for each of the instruments, which were then synchronized so that they could later be combined into a mix of the orchestra. Musicians played in an anechoic chamber and, in order to be synchronous with the rest of the instruments, they followed a video featuring a conductor and a pianist playing each of the four pieces. Note that the benefit of having isolated recordings comes at the expense of ignoring the interactions between musicians, which commonly affect intonation and time-synchronization [37].

The four pieces differ in terms of number of instruments per instrument class, style, dynamics, and size. The first passage is a soprano aria of Donna Elvira from the opera Don Giovanni by W. A. Mozart (1756–1791), corresponding to the Classical period and traditionally played by a small group of musicians. The second passage is from L. van Beethoven's (1770–1827) Symphony no. 7, featuring big chords and string crescendo. The chords and pauses make the reverberation tail of a concert hall clearly audible. The third passage is from Bruckner's (1824–1896) Symphony no. 8 and represents the late Romantic period. It features large dynamics and a large orchestra. Finally, G. Mahler's Symphony no. 1, also featuring a large orchestra, is another example of late Romanticism. The piece has a more complex texture than the one by Bruckner. Furthermore, according to the musicians who recorded the dataset, the last two pieces were also more difficult to play and record [30].

In order to keep the evaluation setup consistent between the four pieces, we focus on the following instruments: violin, viola, cello, double bass, oboe, flute, clarinet, horn, trumpet, and bassoon. All tracks from a single instrument were joined into a single track for each of the pieces.

For the selected instruments, we list the differences between the four pieces in Table 1. Note that in the original dataset the violins are separated into two groups. However, for brevity of evaluation, and because our separation framework does not consider sources sharing the same instrument templates, we decided to merge the violins into a single group. Note that the pieces by Mahler and Bruckner have divisi in the groups of violins, which implies a larger number of instruments playing different melody lines simultaneously. This results in a scenario which is more challenging for source separation.

We created a ground truth score by hand-annotating the notes played by the instruments. In order to facilitate this process, we first gathered the scores in MIDI format and automatically computed a global audio-to-score alignment using the method from [13], which has won the MIREX score-following challenge in the past years.


Table 1: Anechoic dataset [30] characteristics.

Piece      Duration    Period     Instrument sections  Number of tracks  Max tracks/instrument
Mozart     3 min 47 s  Classical  8                    10                2
Beethoven  3 min 11 s  Classical  10                   20                4
Mahler     2 min 12 s  Romantic   10                   30                4
Bruckner   1 min 27 s  Romantic   10                   39                12

[Figure 3: The steps to create the multichannel recordings dataset: anechoic recordings and score annotations yield denoised recordings, from which the multichannel recordings are generated.]

Then we locally aligned the notes of each instrument by manually correcting the onsets and offsets to fit the audio. This was performed using Sonic Visualiser, with the guidance of the spectrogram and the monophonic pitch estimation [38] computed for each of the isolated instruments. The annotation was performed by two of the authors, who cross-checked the work of their peer. Note that this dataset and the proposed annotation are useful not only for our particular task but also for the evaluation of multiple pitch estimation and automatic transcription algorithms in large orchestral settings, a context which has not been considered so far in the literature. The annotations can be found at the associated page (http://mtg.upf.edu/download/datasets/phenicx-anechoic).

During the recording process, detailed in [30], the gain of the microphone amplifiers was fixed to the same value for the whole process, which reduced the dynamic range of the recordings of the quieter instruments. This led to noisier recordings for most of the instruments. In Section 6.2 we describe the score-informed denoising procedure we applied to each track. From the denoised isolated recordings we then used Roomsim to create a multichannel image, as detailed in Section 6.3. The steps necessary to pass from the anechoic recordings to the multichannel dataset are represented in Figure 3. The original files can be obtained from the Acoustics Group at Aalto University (http://research.cs.aalto.fi/acoustics/). For the denoising algorithm, please refer to http://mtg.upf.edu/download/datasets/phenicx-anechoic.

6.2. Dataset Denoising. The noise-related problems in the dataset were presented in [30]. We remove the noise in the recordings with the score-informed method in [39], which relies on learned noise spectral patterns. The main difference is that we rely on a manually annotated score, while in [39] the score is assumed to be misaligned, so further regularization is included to ensure that only certain note combinations in the score occur.

The annotated score yields the time intervals where an instrument is not playing; the noise pattern is learned only within those intervals. In this way, the method ensures that the desired noise, which is part of the actual sound of the instrument, is preserved in the denoised recording.

[Figure 4: The sources and the receivers (microphones) in the simulated room.]

The algorithm takes each anechoic recording of a given instrument and removes the noise for the time intervals where the instrument is playing, while setting to zero the frames where the instrument is not playing.
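The following is a deliberately simplified sketch of the score-informed idea behind this step; the actual system applies the NMF-based method of [39] with learned noise spectral patterns, whereas here plain spectral subtraction stands in for it:

```python
import numpy as np

def denoise_track(S, playing, floor=0.0):
    # S: (F, T) complex STFT of one anechoic track
    # playing: (T,) boolean mask derived from the annotated score
    # Learn the noise pattern only on frames where the score says the
    # instrument is silent
    noise = np.abs(S[:, ~playing]).mean(axis=1, keepdims=True)
    # Stand-in spectral subtraction (the paper uses NMF-based restoration)
    mag = np.maximum(np.abs(S) - noise, floor)
    S_clean = mag * np.exp(1j * np.angle(S))
    S_clean[:, ~playing] = 0.0  # zero the frames where the instrument is silent
    return S_clean
```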

6.3. Dataset Spatialization. To simulate a large reverberant hall, we use the software Roomsim [31]. We define a configuration file which specifies the characteristics of the hall and, for each of the microphones, its position relative to each of the sources. The simulated room has dimensions similar to those of the Royal Concertgebouw in Amsterdam, one of the partners in the PHENICX project, and represents a setup in which we tested our framework. The simulated room's width, length, and height are 28 m, 40 m, and 12 m. The absorption coefficients are specified in Table 2.

The positions of the sources and microphones in the room are common for orchestral concerts (Figure 4). A configuration file is created for each microphone, which contains its coordinates (e.g., (14, 17, 4) for the center microphone). Then each source is defined through polar coordinates relative to the microphone (e.g., (11.4455, -95.1944, -15.1954): radius, azimuth, and elevation for the bassoon relative to the center microphone). We selected all the microphones to be cardioid, in order to match the realistic setup of the Concertgebouw Hall.


Table 2: Room surface absorption coefficients.

Standard measurement frequencies (Hz)      125   250   500   1000  2000  4000
Absorption of wall in x = 0 plane          0.4   0.3   0.3   0.3   0.2   0.1
Absorption of wall in x = Lx plane         0.4   0.45  0.35  0.35  0.45  0.3
Absorption of wall in y = 0 plane          0.4   0.45  0.35  0.35  0.45  0.3
Absorption of wall in y = Ly plane         0.4   0.45  0.35  0.35  0.45  0.3
Absorption of floor, i.e., z = 0 plane     0.5   0.6   0.7   0.8   0.8   0.9
Absorption of ceiling, i.e., z = Lz plane  0.4   0.45  0.35  0.35  0.45  0.3

[Figure 5: The reverberation time (RT60) versus frequency for the simulated room.]

Using the configuration file and the anechoic audio files corresponding to the isolated sources, Roomsim generates the audio files for each of the microphones, along with the impulse responses for each pair of instruments and microphones. The impulse responses and the anechoic signals are used during the evaluation to obtain the ground truth spatial image of the sources in the corresponding microphone. Additionally, we plot the reverberation time RT60 [31] computed by Roomsim across frequencies in Figure 5.

We need to adapt the ground truth annotations to the audio generated with Roomsim, as the original annotations were made on the isolated audio files. Roomsim creates the audio for a given microphone by convolving each of the sources with the corresponding impulse response and then summing the results of the convolutions. We compute a delay for each pair of microphones and instruments by taking the position of the maximum value in the associated impulse response vector. Then we generate a score for each of the pairs by adding the corresponding delay to the note onsets. Additionally, since the offset time depends on the reverberation and on the frequencies of the notes, we add 0.8 s to each note offset to account for the reverberation, besides the added delay.
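As an illustration, the delay and offset adaptation can be sketched as follows (the names are ours; offset_pad corresponds to the 0.8 s reverberation allowance):

```python
import numpy as np

def adapt_score_to_room(onsets, offsets, ir, fs, offset_pad=0.8):
    # onsets, offsets: arrays of note times in seconds (annotated score)
    # ir: impulse response of one instrument-microphone pair (from Roomsim)
    # fs: sampling rate of the impulse response
    delay = np.argmax(ir) / fs  # position of the maximum of the IR
    return onsets + delay, offsets + delay + offset_pad
```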

6.4. Evaluation Methodology

6.4.1. Parameter Selection. In this paper we use a low-level spectral representation of the audio data, generated from a windowed FFT of the signal. We use a Hanning window with a size of 92 ms and a hop size of 11 ms.

Here, a logarithmic frequency discretization is adopted. Furthermore, two time-frequency resolutions are used. First, to estimate the instrument models and the panning matrix, a single-semitone resolution is used; in particular, we implement the time-frequency representation by integrating the STFT bins corresponding to the same semitone. Second, for the separation task, a higher resolution of 1/4 of a semitone is used, which has proven to achieve better separation results [6]. This time-frequency representation is obtained by integrating the STFT bins corresponding to 1/4 of a semitone. Note that in the separation stage the learned basis functions $b_{jn}(f)$ are adapted to the 1/4-semitone resolution by replicating the value of the basis at each semitone to the 4 samples of the 1/4-semitone resolution that belong to this semitone. For image binarization, we pick for the first Gaussian $\phi = 3$ as the standard deviation, and for the second Gaussian $\kappa = 4$ as the position of the central frequency bin and $\nu = 4$ as the standard deviation, corresponding to one semitone.
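A possible sketch of this band integration; the exact frequency range and bin mapping used in our implementation may differ, so fmin and n_bands below are purely illustrative:

```python
import numpy as np

def integrate_bins(S_mag, freqs, bins_per_semitone=4, fmin=27.5, n_bands=352):
    # S_mag: (F, T) magnitude spectrogram; freqs: (F,) bin frequencies in Hz
    # Band index of each FFT bin on a logarithmic (semitone) axis
    band = np.floor(12 * bins_per_semitone *
                    np.log2(np.maximum(freqs, 1e-6) / fmin)).astype(int)
    out = np.zeros((n_bands, S_mag.shape[1]))
    valid = (band >= 0) & (band < n_bands)
    np.add.at(out, band[valid], S_mag[valid])  # sum bins sharing a band
    return out
```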

We picked 10 iterations for the NMF, and we set the beta-divergence distortion $\beta = 1.3$, as in [6, 11].

6.4.2. Evaluation Setup. We perform three different kinds of evaluation: audio-to-score alignment, panning matrix estimation, and score-informed source separation. Note that for the alignment we evaluate the state-of-the-art system in [13]. This method does not align single notes but combinations of notes in the score (aka states). Here, the alignment is performed with respect to a single audio channel, corresponding to the microphone situated at the center of the stage. On the other hand, the offsets are estimated by shifting the original duration of each note in the score [13] or by assigning the offset time as the onset of the next state. We denote these two cases as INT and NEX, respectively.

Regarding the initialization of the separation framework, we can use the raw output of the alignment system. However, as stated in Section 4 and [3, 5, 24], a better option is to extend the onsets and offsets along a tolerance window, to account for the errors of the alignment system, for the delays between the center channel (on which the alignment is performed) and the other channels, and for other possible errors in the alignment itself. Thus, we test two hypotheses regarding the tolerance window for the possible errors. In the first case we extend the boundaries by 0.3 s for onsets and 0.6 s for offsets (T1), and in the second by 0.6 s for onsets and 1 s for offsets (T2). Note that the value of 0.3 s for the onset times is not arbitrary, but the usual threshold for onsets in the MIREX evaluation of real-time score-following [40].


Table 3: Score information used for the initialization of score-informed source separation.

Tolerance window size             Offset estimation
T1: onsets 0.3 s, offsets 0.6 s   INT: interpolation of the offset time
T2: onsets 0.6 s, offsets 0.9 s   NEX: the offset is the onset of the next note

Two different tolerance windows were tested to account for the complexity of this novel scenario. The tolerance window is slightly larger for offsets, due to the reverberation time and because the ending of a note is not as clear as its onset. A summary of the score information used to initialize the source separation framework is found in Table 3.

We label the test case corresponding to the initialization with the raw output of the alignment system as Ali. Conversely, the test case corresponding to the tolerance window initialization is labeled as Ext. Furthermore, within the tolerance window we can refine the note onsets and offsets with the methods in Section 4.1 (Ref1) and Section 5.2 (Ref2), resulting in two further test cases. Since method Ref1 can only refine the score with respect to a single channel, its results are computed solely with respect to that channel. For the multichannel refinement Ref2, we report the results of the alignment of each instrument with respect to each microphone. A graphic of the initialization of the framework with the four test cases listed above (Ali, Ext, Ref1, and Ref2), along with the ground truth score initialization (GT), is found in Figure 7; we present the source separation results for these cases below.

In order to evaluate the panning matrix estimation stage, we compute an ideal panning matrix based on the impulse responses generated by Roomsim during the creation of the multichannel audio (see Section 6.3). The ideal panning matrix gives the ideal contribution of each instrument in each channel, and it is computed by searching for the maximum in the impulse response vector corresponding to each instrument-channel pair, as in (19):

$$m_{\mathrm{ideal}}(i,j) = \max_{t} \bigl(\mathrm{IR}_{(i,j)}(t)\bigr) \qquad (19)$$

where $\mathrm{IR}_{(i,j)}(t)$ is the impulse response of source $i$ in channel $j$. By comparing the estimated matrix $m(i,j)$ with the ideal one, $m_{\mathrm{ideal}}(i,j)$, we can determine whether the algorithm picked a wrong channel for separation.
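In code, (19) and the subsequent channel check can be sketched as:

```python
import numpy as np

def ideal_panning_matrix(irs):
    # irs[i][j]: impulse response vector of source i in channel j (Roomsim)
    return np.array([[np.max(ir_ij) for ir_ij in row] for row in irs])

# An instrument i is flagged as wrongly assigned when
# np.argmax(m_est[i]) != np.argmax(m_ideal[i]).
```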

6.4.3. Evaluation Metrics. For score alignment, we are interested in a measure which relates to source separation and accounts for the audio frames which are correctly detected, rather than an alignment rate computed per note onset as found in [13]. Thus, we evaluate the alignment at the frame level rather than at the note level. A similar reasoning on the evaluation of score alignment is found in [4].

We consider 0.011 s, the temporal granularity of this measure, as the size of a frame. Then a frame of a musical note is considered a true positive (tp) if it is found in the ground truth score and in the aligned score within the exact time boundaries. The same frame is labeled as a false positive (fp) if it is found only in the aligned score, and as a false negative (fn) if it is found only in the ground truth score. Since the gains initialization is done with score information (see Section 4), lost frames (recall) and incorrectly detected frames (precision) impact the performance of the source separation algorithm. Precision is defined as $p = \mathrm{tp}/(\mathrm{tp} + \mathrm{fp})$ and recall as $r = \mathrm{tp}/(\mathrm{tp} + \mathrm{fn})$. Additionally, we compute the harmonic mean of precision and recall to obtain the $F$-measure as $F = 2 \cdot (p \cdot r)/(p + r)$.
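These frame-level metrics can be computed directly from the sets of active (note, frame) pairs, as in the following sketch:

```python
def frame_level_scores(gt_frames, aligned_frames):
    # gt_frames, aligned_frames: sets of (note_id, frame_index) pairs,
    # one frame being 0.011 s of a note's annotated/aligned duration
    tp = len(gt_frames & aligned_frames)
    fp = len(aligned_frames - gt_frames)
    fn = len(gt_frames - aligned_frames)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```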

The source separation evaluation framework and metrics employed are described in [41, 42]. Correspondingly, we use Source to Distortion Ratio (SDR), Source to Interference Ratio (SIR), source Image to Spatial distortion Ratio (ISR), and Source to Artifacts Ratio (SAR). While SDR measures the overall quality of the separation and ISR the spatial reconstruction of the source, SIR relates to the rejection of interferences and SAR to the absence of forbidden distortions and artifacts.

The evaluation of source separation is a computationally intensive process. Additionally, processing the long audio files in the dataset would require large amounts of memory for the matrix calculations. To reduce the memory requirements, the evaluation is performed on blocks of 30 s, with 1 s overlap to allow for continuation.

6.5. Results

6.5.1. Score Alignment. We evaluate the output of the alignment (Ali), along with the estimation of the note offsets (INT and NEX), in terms of $F$-measure (see Section 6.4.3), precision, and recall, ranging from 0 to 1. Additionally, we evaluate the optimal size for extending the note boundaries along the onsets and offsets (T1 and T2) for the refinement methods Ref1 and Ref2 and the baseline Ext. Since there are many differences between the pieces, we report the results individually per piece in Table 4.

Methods Ref1 and Ref2 depend on a binarization threshold which determines how much energy is set to zero. A lower threshold results in the consolidation of larger blobs in blob detection. In [11] this threshold is set to 0.5 for a dataset of monaural recordings of Bach chorales played by four instruments. However, we are facing a multichannel scenario where capturing the reverberation is important, especially considering that offsets were annotated with a low energy threshold. Thus, we are interested in losing as little energy as possible, and we set lower values for the threshold: 0.3 and 0.1. Consequently, when analyzing the results, the lower threshold achieves better performance in terms of $F$-measure for Ref1 (0.67 and 0.72) and for Ref2 (0.71 and 0.75).

According to Table 4, extending note offsets (NEX) rather than interpolating them (INT) gives lower recall in all pieces, and the method leads to losing more frames, which cannot be recovered even by extending the offset times in T2. NEX T2 always yields a lower recall when compared to INT T2 (e.g., $r = 0.85$ compared to $r = 0.97$ for Mozart).

The output of the alignment system, Ali, is not a good option to initialize the gains of the source separation system: it has a high precision and a very low recall (e.g., the case INT Ali has $p = 0.95$ and $r = 0.39$, compared to case INT Ext, which has $p = 0.84$ and $r = 0.84$, for Beethoven).


Table 4: Alignment evaluated in terms of F-measure, precision, and recall, ranging from 0 to 1.

             Mozart            Beethoven         Mahler            Bruckner
             F     p     r     F     p     r     F     p     r     F     p     r
INT
  Ali        0.69  0.93  0.55  0.56  0.95  0.39  0.61  0.85  0.47  0.60  0.94  0.32
  T1 Ref1    0.77  0.77  0.76  0.79  0.81  0.77  0.63  0.74  0.54  0.72  0.73  0.70
  T1 Ref2    0.83  0.79  0.88  0.82  0.82  0.81  0.67  0.77  0.60  0.81  0.77  0.85
  T1 Ext     0.82  0.73  0.94  0.84  0.84  0.84  0.77  0.69  0.87  0.86  0.78  0.96
  T2 Ref1    0.77  0.77  0.76  0.76  0.75  0.77  0.63  0.74  0.54  0.72  0.70  0.71
  T2 Ref2    0.83  0.78  0.88  0.82  0.76  0.87  0.67  0.77  0.59  0.79  0.73  0.86
  T2 Ext     0.72  0.57  0.97  0.79  0.70  0.92  0.69  0.55  0.93  0.77  0.63  0.98
NEX
  Ali        0.49  0.94  0.33  0.51  0.89  0.36  0.42  0.90  0.27  0.48  0.96  0.44
  T1 Ref1    0.70  0.77  0.64  0.72  0.79  0.66  0.63  0.74  0.54  0.68  0.72  0.64
  T1 Ref2    0.73  0.79  0.68  0.71  0.79  0.65  0.66  0.77  0.58  0.73  0.74  0.72
  T1 Ext     0.73  0.77  0.68  0.71  0.80  0.65  0.69  0.75  0.64  0.76  0.79  0.72
  T2 Ref1    0.74  0.78  0.71  0.72  0.74  0.70  0.63  0.74  0.54  0.72  0.79  0.72
  T2 Ref2    0.80  0.80  0.80  0.75  0.75  0.75  0.67  0.77  0.59  0.79  0.73  0.86
  T2 Ext     0.73  0.65  0.85  0.73  0.69  0.77  0.72  0.64  0.82  0.79  0.69  0.91


Table 5: Instruments for which the closest microphone was incorrectly determined, for different score information (GT, Ali, T1, T2, INT, and NEX) and two room setups.

          GT        INT Ali                INT T1           INT T2                 NEX Ali                NEX T1           NEX T2
Setup 1   Clarinet  Clarinet, double bass  Clarinet, flute  Clarinet, flute, horn  Clarinet, double bass  Clarinet         Flute
Setup 2   Cello     Cello, flute           Cello, flute     Cello, flute           Bassoon                Flute            Cello, flute

For the case of Beethoven, the output is particularly poor compared to the other pieces. However, by extending the boundaries (Ext) and applying note refinement (Ref1 or Ref2), we are able to increase the recall and match the performance on the other pieces.

When comparing the sizes of the tolerance window for the onsets and offsets, we observe that the alignment is more accurate when detecting the onsets within 0.3 s and the offsets within 0.6 s. In Table 4, T1 achieves better results than T2 (e.g., $F = 0.77$ for T1 compared to $F = 0.69$ for T2, for Mahler). Relying on a large window retrieves more frames but also significantly damages the precision. However, when considering source separation, we may want to lose as little information as possible. It is in this special case that the refinement methods Ref1 and Ref2 show their importance. When facing larger time boundaries, as in T2, Ref1 and especially Ref2 are able to reduce the errors, achieving better precision with a minimal loss in recall.

The refinement Ref1 performs worse than Ref2, the multichannel refinement (e.g., $F = 0.72$ compared to $F = 0.81$ for Bruckner, INT T1). Note that in its original version [11], Ref1 assumed monophony within a source, as it was tested on the simple case of the Bach10 dataset [4]. To that extent, it relied on a graph computation to determine the best distribution of blobs. However, due to the increased polyphony within an instrument (e.g., violins playing divisi), with simultaneous melodic lines, we disabled this feature; in this case Ref1 has lower recall, losing more frames. On the other hand, Ref2 is more robust because it computes a blob estimation per channel; averaging out these estimations yields better results.

The refinement works worse for the more complex pieces (Mahler and Bruckner) than for the simpler pieces (Mozart and Beethoven). Increasing the polyphony within a source and the number of instruments with many interleaving melodic lines (a less sparse score) also makes the task more difficult.

6.5.2. Panning Matrix. Correctly estimating the panning matrix is an important step in the proposed method, since Wiener filtering is performed on the channel where the instrument has the most energy. If the algorithm picks a different channel for this step, we find more interference between instruments in the separated audio files.

As described in Section 3.1, the estimation of the panning matrix depends on the number of nonoverlapping partials of the notes found in the score and on their alignment with the audio. To that extent, the more nonoverlapping partials we have, the more robust the estimation.

Initially, we experimented with computing the panning matrix separately for each piece. In the case of Bruckner, the piece is simply too short, and there are too few nonoverlapping partials to yield a good estimation, resulting in errors in the panning matrix. Since the instrument setup is the same for the Bruckner, Beethoven, and Mahler pieces (10 sources in the same positions on the stage), we decided to jointly estimate the matrix for the concatenated audio pieces and the associated scores. We denote as Setup 1 the Mozart piece, played by 8 sources, and as Setup 2 the Beethoven, Mahler, and Bruckner pieces, played by 10 sources.

Since the panning matrix is computed using the score, different score information can yield very different estimations of the panning matrix. To that extent, we evaluate the influence of audio-to-score alignment, namely, the cases INT, NEX, Ali, T1, and T2, and the initialization with the ground truth score information, GT.

In Table 5 we list the instruments for which the algorithm picked the wrong channel. Note that in the room setup generated with Roomsim, most of the instruments in Table 5 are placed close to other sources from the same family of instruments: for example, cello next to double bass, and flute next to clarinet, bassoon, and oboe. In such cases, the algorithm makes more mistakes when selecting the correct channel to perform source separation.

In the column GT of Table 5 we can see that having a perfectly aligned score yields fewer errors when estimating the panning matrix. Conversely, in a real-life scenario we cannot rely on a hand-annotated score. In this case, for all columns of Table 5 excluding GT, the best estimation is obtained by the combination of NEX and T1: taking the offset time as the onset of the next note and then extending the score with the smaller tolerance window.

Furthermore, we compute the SDR values for the instruments in Table 5, column GT (clarinet and cello), for the cases in which the separation is done in the correct channel and in the estimated channel. For Setup 1, the channel for clarinet is wrongly assigned to WWL (woodwinds left), the correct one being WWR (woodwinds right), even when we have a perfectly aligned score (GT). However, the microphones WWL and WWR are very close (see Figure 4) and do not capture significant energy from other instrument sections, so the SDR difference is less than 0.01 dB. In Setup 2, on the other hand, the cello is wrongly separated in the WWL channel, and the SDR difference between this audio and the audio separated in the correct channel is around -11 dB for each of the three pieces.

6.5.3. Source Separation. We use the evaluation metrics described in Section 6.4.3.


[Figure 6: Results in terms of SDR, SIR, SAR, and ISR for the instruments and the pieces in the dataset.]

Since there is a lot of variability between the four pieces, it is more informative to present the results per piece rather than aggregating them.

First, we analyze the separation results per instrument in an ideal case. We assume that the best results for score-informed source separation are obtained in the case of a perfectly aligned score (GT). Furthermore, for this case we calculate the separation in the correct channel for all the instruments, since in Section 6.5.2 we saw that picking a wrong channel can be detrimental. We present the results as a bar plot in Figure 6.

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6 we can see that the more complex the piece, the more difficult it is to achieve a good separation. For instance, note that cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for cello in Mozart compared to -5 dB in Mahler). Cello and double bass are close to each other in both setups, and similarly for clarinet and flute, so we expect interference between them. Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in the lower SIR values for double bass (5.5 dB SIR for Mozart, but -1.8, -0.1, and -0.2 dB SIR for the others) and flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask, the multichannel spectrogram is multiplied with the panning matrix. Thus, wrong values in this matrix can yield wrong amplitude values in the resulting signals.

This is the case for the trumpet, which is allocated a close microphone in the current setup and for which we expect a good separation. However, the trumpet achieves a poor ISR (5.5, 1, and -1 dB) while having a good separation in terms of SIR and SAR. Similarly, other instruments, such as cello, double bass, flute, and viola, face the same problem, particularly for the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.


[Figure 7: The test cases for the initialization of score-informed source separation for the submatrix $p^{s}_{jn}(t)$ of note $s$: ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).]

A low SDR is obtained in the case of Mahler, which is related to the poor alignment results obtained for this piece. As seen in Table 4, for the INT case, the $F$-measure is almost 8% lower in Mahler than in the other pieces, mainly because of the bad precision.

The results are considerably worse for the double bass in the more complex pieces of Beethoven (-9.1 dB SDR), Mahler (-8.5 dB SDR), and Bruckner (-4.7 dB SDR); for further analysis we consider it an outlier and exclude it from the analysis.

Second, we want to evaluate the usefulness of note refinement in source separation. As seen in Section 4, the gains for NMF separation are initialized with score information or with a refined score, as described in Section 4.2. A summary of the different initialization options is seen in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization, with the ground truth annotations (Figure 7 (GT)); the direct output of the score alignment system (Figure 7 (Ali)); the common practice of NMF gains initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7 (Ext)); and the refinement approaches (Figure 7 (Ref1 and Ref2)). Note that Ref1 is the refinement done with the method in Section 4.1, and Ref2 with the multichannel method described in Section 5.2.

We test the difference between the binarization thresholds 0.5 and 0.1 used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives an $F$-value of 0.0796 and a $p$ value of 0.7778, which shows no significant difference between the two binarization thresholds.

The results for the five initializations, GT, Ref1, Ref2, Ext, and Ali, are presented in Figure 8 for each of the four pieces. Note that for Ref1, Ref2, and Ext we aggregate information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate between the instruments, with the piece by Mahler having the worst results and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors, such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues described in Section 6.1, can increase the difficulty of separation, up to the point that Ref1, Ref2, and Ext provide only a minimal improvement. To that extent, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps to improve the separation, increasing the SDR by 1-1.5 dB or more for all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase the SDR for most of the pieces, with the exception of the piece by Mahler. This is due to an increase in SIR and a decrease of the interferences in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext; for this piece, the difference in SIR is around 2 dB. Then, for Beethoven, Ref1 and Ref2 gain 0.5 dB in terms of SDR when compared to Ext, and 1.5 dB in SIR. For Bruckner, solely Ref2 has a higher SDR; however, SIR increases by 1.5 dB for both Ref1 and Ref2. Note that Ref1 and Ref2 not only refine the time boundaries of the notes; the refinement also happens in frequency, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the estimation of note offsets (INT and NEX) and of the tolerance window sizes (T1 and T2), which account for errors in the alignment. Note that for this case we do not include the refinement in the results; we evaluate only the case Ext, leaving out the refinement in order to isolate the influence of T1 and T2. The results are presented in Figure 9 and show that the best results are obtained for the interpolation of the offsets, INT. This relates to the results presented in Section 6.5.1. Similarly to the analysis regarding the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to draw a conclusion on which strategy is better for the initialization, as the error bars for the ground truth overlap with those of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gains estimation, as proposed in Section 5.1, and the single-channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR and obtained an $F$-value of 1.712 and a $p$-value of 0.1908. Hence, there is no significant difference between single-channel and multichannel gain estimation when we do not perform postprocessing of the gains using note refinement. However, although the new update rules do not help in the multichannel case by themselves, they allow us to better refine the gains: in this case we aggregate information over all the channels, and blob detection becomes more robust, even to variations of the binarization threshold. On account of that, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available online (http://repovizz.upf.edu/phenicx/anechoic_multi).

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation is Instrument Emphasis, which aims at processing multitrack orchestral recordings.


[Figure 8: Results in terms of SDR, SIR, SAR, and ISR for the NMF gains initialization in the different test cases (GT, Ref1, Ref2, Ext, and Ali).]

Once we have the separated tracks, it allows emphasizing a particular instrument over the full orchestra downmix recording. Our workflow consists in processing a multichannel orchestral recording for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of other sections, obtaining an enhanced signal for the selected instrument.

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user when uploading media content and for presenting the results. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks. After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks (see the sketch below). The separation quality is not degraded if the blocks have sufficient duration; in our case, we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx/), and it is integrated into the PHENICX prototype (http://phenicx.com).
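A minimal sketch of this block-wise back-end, assuming a separate_fn callable that wraps the full separation framework (both names are ours):

```python
import numpy as np

def separate_in_blocks(tracks, separate_fn, fs, block_s=60.0):
    # tracks: (channels, samples) multichannel recording
    # separate_fn: assumed callable mapping a block to {instrument: audio}
    out = {}
    hop = int(block_s * fs)
    for start in range(0, tracks.shape[1], hop):
        block = tracks[:, start:start + hop]
        for instrument, audio in separate_fn(block).items():
            out.setdefault(instrument, []).append(audio)
    # Concatenate the per-instrument blocks into full-length tracks
    return {name: np.concatenate(chunks) for name, chunks in out.items()}
```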

7.2. Acoustic Rendering. The second application for the separated material is augmented or virtual reality scenarios, where we apply a spatialization of the separated musical sources. Acoustic Rendering aims at recreating acoustically the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage and of the listener.

We have considered binaural synthesis as the most suitable spatial audio technique for this application. Humans locate the direction of an incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound arrives with a different intensity and at different time instants at each ear.


[Figure 9: Results in terms of SDR, SIR, SAR, and ISR for the combination of different offset estimation methods (INT and NEX) and different sizes of the tolerance window for note onsets and offsets (T1 and T2). GT is the ground truth alignment and Ali is the output of the score alignment.]

The idea behind binaural synthesis is to artificially generate these cues, so as to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For rapid integration into the virtual reality prototype, we have used the noncommercial binaural synthesis plugin for Unity3D provided by 3DCeption (https://twobigears.com/index.php).

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker; https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances), for the instrument visualization and Acoustic Rendering use-cases introduced above.

As inputs for the source localization method, we need the multichannel recordings and the approximate positions of the microphones on stage. In concert halls, the recording setup typically consists of a grid structure of hanging overhead microphones, whose positions are therefore kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, it being currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes the refined score alignment obtained prior to the signal separation (see Section 4.1). It follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then, the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible locations of a sound source, based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate a list of time delay values corresponding to all note onsets in the score and take the maximum of their histogram. In our experiment we have a time resolution of 28 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play in isolation, as SRP-PHAT does [45], and can therefore be used in complex scenarios.
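A sketch of the per-pair delay estimation; the onset lists are assumed to come from the refined alignment of each microphone channel:

```python
import numpy as np

def instrument_delay(onsets_a, onsets_b, resolution=0.028):
    # Delays of all aligned note onsets between the two microphones
    delays = np.asarray(onsets_a) - np.asarray(onsets_b)
    # Histogram with a bin width equal to the time resolution; the mode
    # of the histogram is taken as the delay of this instrument/pair
    edges = np.arange(delays.min(), delays.max() + 2 * resolution, resolution)
    hist, edges = np.histogram(delays, bins=edges)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])  # center of the most populated bin
```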

8. Outlook

In this paper we proposed a framework for score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then, we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings is objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment, which improved the separation for three of the four pieces in the dataset when compared to the two other initialization options for our framework: the raw output of the score alignment and the tolerance window on which the alignment relies.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems in recovering the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done on a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach for source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone with which to separate a source is the closest one determined by the localization method.


When looking at the separation across instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies between the pieces and the instruments, and future research could provide more insight into this problem. Note that an increasing degree of consonance was related to a more difficult case for source separation in [17]. Hence, we could expect a worse separation for instruments which are harmonizing or accompanying other sections, as is the case of viola, cello, and double bass in some pieces. Future research could give more insight into the relation between the musical characteristics of the pieces (e.g., tonality and texture) and source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12,000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us to denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow for other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework, related to Instrument Emphasis and Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustín Martorell for his insights on the musicology side and for correcting the scores, and Oscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference, 2013.
[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.
[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.
[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal on Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.
[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.
[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodríguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.
[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.
[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.
[9] F. J. Rodríguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.
[10] F. J. Rodríguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.
[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, October 2015.
[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.
[13] J. J. Carabias-Orti, F. J. Rodríguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, 2015.
[14] O. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.
[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.
[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, Taipei, Taiwan, April 2009.
[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.
[18] F. J. Rodríguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.
[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.
[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.
[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.
[22] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.
[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.
[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.
[25] B. Niedermayer, Accurate audio-to-score alignment: data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.
[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.
[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.
[28] D. FitzGerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.
[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Málaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.
[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.
[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.
[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.
[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.
[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.
[37] P. Papiotis, M. Marchini, A. Pérez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.
[38] M. Mauch and S. Dixon, "pYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.
[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.
[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.
[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.
[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.
[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.
[45] A. Martí Guerola, Multichannel audio processing for speaker localization, separation and enhancement, Universitat Politècnica de València, 2013.


(1) Initialize $b_{jn}(f)$ with the values learned in Section 3.3.
(2) Initialize the gains $g_{jn}(t)$ with score information.
(3) Initialize the panning matrix $m_{ij}$ with the values learned in Section 3.1.
(4) Update the gains using equation (6).
(5) Update the panning matrix using equation (17).
(6) Repeat Steps (4)-(5) until the algorithm converges (or the maximum number of iterations is reached).

Algorithm 2: Gain estimation method.

Note that the current model does not estimate the phases for each channel. In order to reconstruct source $j$, the model in Section 3 uses the phase of the signal corresponding to the channel $i$ where the source has the maximum value in the panning matrix, as described in Section 3.5. Thus, in order to reconstruct the original signals, we can rely solely on the gains estimated in the single channel $i$, in a similar way to the baseline method.
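As an illustration, here is a minimal NumPy sketch of this reconstruction step (the helper and variable names are our own assumptions; it presumes the estimated magnitude spectrogram of the source and the complex STFTs of all channels are already computed, and the result would be inverted with an ISTFT):

```python
import numpy as np

def reconstruct_source(mag_est, channel_stfts, panning, j):
    """Sketch: rebuild source j using the phase of its dominant channel.

    mag_est:       (F, T) estimated magnitude spectrogram of source j
    channel_stfts: (I, F, T) complex STFTs of the I channels
    panning:       (I, J) panning matrix m_ij
    """
    i_max = int(np.argmax(panning[:, j]))    # channel with the maximum panning value
    phase = np.angle(channel_stfts[i_max])   # borrow that channel's phase
    return mag_est * np.exp(1j * phase)      # complex STFT, to be inverted with an ISTFT
```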

5.2. Multichannel Gains Refinement. As presented in Section 5.1, for a given source we obtain an estimation of the gains corresponding to each channel. Therefore, we can apply note refinement heuristics, in a similar manner to Section 4.1, to each of the gains $[g_{1nj}(t), \ldots, g_{Inj}(t)]$. Then we can average the estimations over all channels, making the blob detection more robust to the variance between the channels:

$$g'_{nj}(t) = \frac{\sum_{i=1}^{I} g_{inj}(t)}{I}. \quad (18)$$

Having computed the mean over all channels as in (18), for each note $k = 1, \ldots, K_j$ we process the submatrix $g^{k}_{jn}(t)$ of the new gains matrix $g'_{jn}(t)$, where $K_j$ is the total number of notes for an instrument $j$. Specifically, we apply the same steps, preprocessing (Section 4.1.1), binarization (Section 4.1.2), and blob selection (Section 4.1.3), to each matrix $g^{k}_{jn}(t)$, and we obtain a binary matrix $p^{k}_{jn}(t)$ having 1s for the elements corresponding to the best blob and 0s for the rest.

Our hypothesis is that averaging the gains over all channels makes blob detection more robust. However, when performing the averaging we do not account for the delays between the channels. In order to compute the delay for a given channel, we can compute the best blob separately with the method in Section 4.1 and compare it with the one calculated from the averaged estimation ($p^{k}_{jn}(t)$). This step is equivalent to comparing the onset times of the two best blobs for the two estimations. Subtracting these onset times, we get the delay between the averaged estimation and the one obtained for a channel, and we can correct this in matrix $p^{k}_{jn}(t)$. Accordingly, we zero-pad the beginning of $p^{k}_{jn}(t)$ with the amount of zeros corresponding to the delay, or we remove the leading zeros for a negative delay.
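A minimal sketch of this averaging and delay-correction step is given below (NumPy; the array shapes and function names are our own assumptions, and the blob onset frames are taken as already detected):

```python
import numpy as np

def average_gains(gains):
    """Equation (18): mean over the I channels of the gains g_inj(t).

    gains: (I, N, T) array, one (N, T) gains matrix per channel.
    """
    return gains.mean(axis=0)

def correct_delay(p_avg, onset_avg, onset_channel):
    """Shift the averaged note mask p^k_jn(t) by the delay of one channel.

    The delay is the difference between the onset frame of the best blob
    detected on a single channel and the one detected on the averaged gains.
    """
    delay = onset_channel - onset_avg
    n_frames = p_avg.shape[1]
    if delay > 0:    # channel lags: zero-pad the beginning of the mask
        return np.pad(p_avg, ((0, 0), (delay, 0)))[:, :n_frames]
    if delay < 0:    # channel leads: drop the leading frames
        return np.pad(p_avg[:, -delay:], ((0, 0), (0, -delay)))
    return p_avg
```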

6. Materials and Evaluation

6.1. Dataset. The audio material used for evaluation was presented by Pätynen et al. [30] and consists of four passages of symphonic music from the Classical and Romantic periods. That work presented a set of anechoic recordings for each of the instruments, which were then synchronized so that they could later be combined into a mix of the orchestra. The musicians played in an anechoic chamber and, in order to stay synchronous with the rest of the instruments, they followed a video featuring a conductor and a pianist playing each of the four pieces. Note that the benefit of having isolated recordings comes at the expense of ignoring the interactions between musicians, which commonly affect intonation and time-synchronization [37].

The four pieces differ in the number of instruments per instrument class, style, dynamics, and size. The first passage is a soprano aria of Donna Elvira from the opera Don Giovanni by W. A. Mozart (1756–1791), corresponding to the Classical period and traditionally played by a small group of musicians. The second passage is from L. van Beethoven's (1770–1827) Symphony no. 7, featuring big chords and string crescendi; the chords and pauses make the reverberation tail of a concert hall clearly audible. The third passage is from Bruckner's (1824–1896) Symphony no. 8 and represents the late Romantic period; it features large dynamics and a large orchestra. Finally, G. Mahler's Symphony no. 1, also featuring a large orchestra, is another example of late Romanticism. This piece has a more complex texture than the one by Bruckner. Furthermore, according to the musicians who recorded the dataset, the last two pieces were also more difficult to play and record [30].

In order to keep the evaluation setup consistent across the four pieces, we focus on the following instruments: violin, viola, cello, double bass, oboe, flute, clarinet, horn, trumpet, and bassoon. All tracks from a single instrument were joined into a single track for each of the pieces.

For the selected instruments, we list the differences between the four pieces in Table 1. Note that in the original dataset the violins are separated into two groups. However, for brevity of evaluation, and because our separation framework does not consider sources sharing the same instrument templates, we decided to merge the violins into a single group. Note that the pieces by Mahler and Bruckner have divisi in the groups of violins, which implies a larger number of instruments playing different melodic lines simultaneously. This results in a scenario which is more challenging for source separation.

We created a ground truth score by hand-annotating the notes played by the instruments. In order to facilitate this process, we first gathered the scores in MIDI format and automatically computed a global audio-to-score alignment using the method from [13], which has won the MIREX score-following challenge in recent years.


Table 1: Anechoic dataset [30] characteristics.

Piece     | Duration   | Period    | Instrument sections | Number of tracks | Max tracks/instrument
Mozart    | 3 min 47 s | Classical | 8                   | 10               | 2
Beethoven | 3 min 11 s | Classical | 10                  | 20               | 4
Mahler    | 2 min 12 s | Romantic  | 10                  | 30               | 4
Bruckner  | 1 min 27 s | Romantic  | 10                  | 39               | 12

Figure 3: The steps to create the multichannel recordings dataset: the anechoic recordings and the score annotations are used to produce denoised recordings, which are then spatialized into multichannel recordings.

Then we locally aligned the notes of each instrument by manually correcting the onsets and offsets to fit the audio. This was performed in Sonic Visualiser, with the guidance of the spectrogram and of the monophonic pitch estimation [38] computed for each of the isolated instruments. The annotation was performed by two of the authors, who cross-checked each other's work. Note that this dataset and the proposed annotations are useful not only for our particular task but also for the evaluation of multiple pitch estimation and automatic transcription algorithms in large orchestral settings, a context which has not been considered so far in the literature. The annotations can be found at the associated page (http://mtg.upf.edu/download/datasets/phenicx-anechoic).

During the recording process, detailed in [30], the gain of the microphone amplifiers was fixed to the same value for the whole session, which reduced the dynamic range of the recordings of the quieter instruments. This led to noisier recordings for most of the instruments. In Section 6.2 we describe the score-informed denoising procedure we applied to each track. From the denoised isolated recordings we then used Roomsim to create a multichannel image, as detailed in Section 6.3. The steps necessary to pass from the anechoic recordings to the multichannel dataset are represented in Figure 3. The original files can be obtained from the Acoustics Group at Aalto University (http://research.cs.aalto.fi/acoustics). For the denoising algorithm, please refer to http://mtg.upf.edu/download/datasets/phenicx-anechoic.

6.2. Dataset Denoising. The noise-related problems in the dataset were presented in [30]. We remove the noise in the recordings with the score-informed method in [39], which relies on learned noise spectral patterns. The main difference is that we rely on a manually annotated score, while in [39] the score is assumed to be misaligned, so further regularization is included to ensure that only certain note combinations in the score occur.

The annotated score yields the time intervals where an instrument is not playing. Thus, the noise pattern is learned only within those intervals. In this way, the method assures that

the desired noise, which is part of the actual sound of the instrument, is preserved in the denoised recording.

Figure 4: The sources (instrument sections: violins, violas, cellos, double basses, flutes, oboes, clarinets, bassoons, horns, and trumpets) and the receivers (microphones CL, RVL, V1, V2, TR, HRN, WWR, and WWL) in the simulated room.

The algorithm takes each anechoic recording of a given instrument and removes the noise for the time intervals where the instrument is playing, while setting to zero the frames where the instrument is not playing.

6.3. Dataset Spatialization. To simulate a large reverberant hall, we use the software Roomsim [31]. We define a configuration file which specifies the characteristics of the hall and, for each of the microphones, its position relative to each of the sources. The simulated room has dimensions similar to those of the Royal Concertgebouw in Amsterdam, one of the partners in the PHENICX project, and represents a setup in which we tested our framework. The simulated room's width, length, and height are 28 m, 40 m, and 12 m. The absorption coefficients are specified in Table 2.

The positions of the sources and microphones in the room are common for orchestral concerts (Figure 4). A configuration file is created for each microphone, containing its coordinates (e.g., (14, 17, 4) for the center microphone). Then each source is defined through polar coordinates relative to the microphone (e.g., (11.4455, −95.1944, −15.1954): radius, azimuth, and elevation for the bassoon relative to the center microphone). We selected all the microphones to be cardioid, in order to match the realistic setup of the Concertgebouw hall.
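For illustration, the sketch below converts a Cartesian source position into the (radius, azimuth, elevation) triple used in the per-microphone configuration files; the helper name, the example source position, and the exact angle conventions (which depend on Roomsim) are our own assumptions:

```python
import numpy as np

def polar_relative(source_xyz, mic_xyz):
    """Polar coordinates (radius in m, azimuth and elevation in degrees)
    of a source relative to a microphone position."""
    dx, dy, dz = np.subtract(source_xyz, mic_xyz)
    radius = float(np.sqrt(dx * dx + dy * dy + dz * dz))
    azimuth = float(np.degrees(np.arctan2(dy, dx)))
    elevation = float(np.degrees(np.arcsin(dz / radius)))
    return radius, azimuth, elevation

# e.g., a hypothetical source position relative to the center microphone at (14, 17, 4)
print(polar_relative((13.0, 6.0, 1.5), (14.0, 17.0, 4.0)))
```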


Table 2: Room surface absorption coefficients.

Standard measurement frequencies (Hz)     | 125 | 250  | 500  | 1000 | 2000 | 4000
Absorption of wall in x = 0 plane         | 0.4 | 0.3  | 0.3  | 0.3  | 0.2  | 0.1
Absorption of wall in x = Lx plane        | 0.4 | 0.45 | 0.35 | 0.35 | 0.45 | 0.3
Absorption of wall in y = 0 plane         | 0.4 | 0.45 | 0.35 | 0.35 | 0.45 | 0.3
Absorption of wall in y = Ly plane        | 0.4 | 0.45 | 0.35 | 0.35 | 0.45 | 0.3
Absorption of floor, i.e., z = 0 plane    | 0.5 | 0.6  | 0.7  | 0.8  | 0.8  | 0.9
Absorption of ceiling, i.e., z = Lz plane | 0.4 | 0.45 | 0.35 | 0.35 | 0.45 | 0.3

Figure 5: The reverberation time (RT60, in seconds) versus frequency for the simulated room.

Using the configuration file and the anechoic audio files corresponding to the isolated sources, Roomsim generates the audio files for each of the microphones, along with the impulse responses for each pair of instruments and microphones. The impulse responses and the anechoic signals are used during the evaluation to obtain the ground truth spatial image of each source in the corresponding microphone. Additionally, we plot the reverberation time RT60 [31] computed by Roomsim across frequencies in Figure 5.

We need to adapt the ground truth annotations to the audio generated with Roomsim, as the original annotations were made on the isolated audio files. Roomsim creates the audio for a given microphone by convolving each of the sources with the corresponding impulse response and then summing the results of the convolutions. We compute a delay for each pair of microphones and instruments by taking the position of the maximum value in the associated impulse response vector. Then we generate a score for each of the pairs by adding the corresponding delay to the note onsets. Additionally, since the offset time depends on the reverberation and on the frequencies of the notes, we add 0.8 s to each note offset, besides the added delay, to account for the reverberation.
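A minimal sketch of this score adaptation is shown below (Python; the note representation and the helper name are our own assumptions):

```python
import numpy as np

REVERB_TAIL = 0.8  # extra offset time accounting for reverberation (s)

def adapt_score(notes, impulse_response, sample_rate):
    """Shift note times by the source-to-microphone delay and extend
    offsets by the reverberation tail.

    notes: iterable of (onset_s, offset_s, pitch) tuples.
    impulse_response: IR of this instrument/microphone pair (1-D array).
    """
    delay = np.argmax(impulse_response) / sample_rate  # position of the IR maximum
    return [(onset + delay, offset + delay + REVERB_TAIL, pitch)
            for onset, offset, pitch in notes]
```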

6.4. Evaluation Methodology

6.4.1. Parameter Selection. In this paper we use a low-level spectral representation of the audio data, generated from a windowed FFT of the signal. We use a Hanning window with a size of 92 ms and a hop size of 11 ms.

A logarithmic frequency discretization is adopted, and two time-frequency resolutions are used. First, to estimate the instrument models and the panning matrix, a single-semitone resolution is used; in particular, we implement this time-frequency representation by integrating the STFT bins corresponding to the same semitone. Second, for the separation task, a higher resolution of 1/4 of a semitone is used, which has proven to achieve better separation results [6]; this representation is obtained by integrating the STFT bins corresponding to 1/4 of a semitone. Note that in the separation stage the learned basis functions $b_{jn}(f)$ are adapted to the 1/4-of-a-semitone resolution by replicating the basis at each semitone to the 4 samples of the 1/4-of-a-semitone resolution that belong to this semitone. For image binarization, we pick for the first Gaussian $\phi = 3$ as the standard deviation, and for the second Gaussian $\kappa = 4$ as the position of the central frequency bin and $\nu = 4$ as the standard deviation corresponding to one semitone.
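The following sketch illustrates the integration of STFT bins into (fractions of) semitone bands (NumPy; the band edges, reference frequency, and parameter values are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def semitone_filterbank(n_fft, sr, f_min=27.5, n_semitones=88, bins_per_semitone=1):
    """Binary filterbank that sums the STFT bins falling inside each
    (fraction-of-a-)semitone band; apply as fb @ np.abs(stft)."""
    fft_freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    n_bands = n_semitones * bins_per_semitone
    # band edges spaced 1/bins_per_semitone of a semitone apart
    edges = f_min * 2.0 ** (np.arange(n_bands + 1) / (12.0 * bins_per_semitone))
    band = np.searchsorted(edges, fft_freqs) - 1    # band index of every FFT bin
    valid = (band >= 0) & (band < n_bands)
    fb = np.zeros((n_bands, fft_freqs.size))
    fb[band[valid], np.where(valid)[0]] = 1.0       # integrate bins inside each band
    return fb

# one-semitone resolution for the models, 1/4 of a semitone for separation
fb_models = semitone_filterbank(4096, 44100, bins_per_semitone=1)
fb_separation = semitone_filterbank(4096, 44100, bins_per_semitone=4)
```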

We picked 10 iterations for the NMF, and we set the beta-divergence distortion to $\beta = 1.3$, as in [6, 11].

6.4.2. Evaluation Setup. We perform three different kinds of evaluation: audio-to-score alignment, panning matrix estimation, and score-informed source separation. Note that for the alignment we evaluate the state-of-the-art system in [13]. This method does not align single notes but combinations of notes in the score (aka states). Here, the alignment is performed with respect to a single audio channel, corresponding to the microphone situated at the center of the stage. The offsets are estimated either by shifting the original duration of each note in the score [13] or by assigning as offset time the onset of the next state. We denote these two cases as INT and NEX, respectively.

Regarding the initialization of the separation framework, we can use the raw output of the alignment system. However, as stated in Section 4 and in [3, 5, 24], a better option is to extend the onsets and offsets along a tolerance window, to account for the errors of the alignment system and for the delays between the center channel (on which the alignment is performed) and the other channels. Thus, we test two hypotheses regarding the tolerance window. In the first case we extend the boundaries by 0.3 s for onsets and 0.6 s for offsets (T1), and in the second by 0.6 s for onsets and 0.9 s for offsets (T2). Note that the value of 0.3 s for the onset times is not arbitrary: it is the usual threshold for onsets in the MIREX evaluation of real-time score-following [40]. Two different tolerance windows were tested to account for the complexity of this novel scenario. The tolerance window is slightly larger for offsets, due to the reverberation time and because the ending of a note is not as clear as its onset. A summary of the score information used to initialize the source separation framework is found in Table 3.

Table 3: Score information used for the initialization of score-informed source separation.

Tolerance window size           | Offset estimation
T1: onsets 0.3 s, offsets 0.6 s | INT: interpolation of the offset time
T2: onsets 0.6 s, offsets 0.9 s | NEX: the offset is the onset of the next note

We label the test case corresponding to the initialization with the raw output of the alignment system as Ali. Conversely, the test case corresponding to the tolerance-window initialization is labeled as Ext. Furthermore, within the tolerance window we can refine the note onsets and offsets with the methods in Section 4.1 (Ref1) and Section 5.2 (Ref2), resulting in two further test cases. Since method Ref1 can only refine the score to a single channel, its results are solely computed with respect to that channel. For the multichannel refinement Ref2, we report the results of the alignment of each instrument with respect to each microphone. A graphic of the initialization of the framework with the four test cases listed above (Ali, Ext, Ref1, and Ref2), along with the ground truth score initialization (GT), is found in Figure 7; the source separation results for these cases are presented in Section 6.5.3.

In order to evaluate the panning matrix estimation stage, we compute an ideal panning matrix based on the impulse responses generated by Roomsim during the creation of the multichannel audio (see Section 6.3). The ideal panning matrix gives the ideal contribution of each instrument to each channel, and it is computed by searching for the maximum in the impulse response vector corresponding to each instrument-channel pair, as in

$$m_{\text{ideal}}(i, j) = \max_t \left( \text{IR}(i, j)(t) \right), \quad (19)$$

where $\text{IR}(i, j)(t)$ is the impulse response of source $i$ in channel $j$. By comparing the estimated matrix $m(i, j)$ with the ideal one $m_{\text{ideal}}(i, j)$, we can determine whether the algorithm picked a wrong channel for separation.
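A minimal sketch of (19) and of the channel-selection check is given below (NumPy; the array layout, with sources on the first axis and channels on the second, is an assumption):

```python
import numpy as np

def ideal_panning(impulse_responses):
    """Equation (19): m_ideal(i, j) = max_t IR(i, j)(t).

    impulse_responses: (n_sources, n_channels, T) array of Roomsim IRs.
    """
    return impulse_responses.max(axis=2)

def wrongly_assigned(m_est, m_ideal):
    """Indices of sources whose strongest channel differs between the
    estimated and the ideal panning matrices."""
    return np.where(m_est.argmax(axis=1) != m_ideal.argmax(axis=1))[0]
```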

6.4.3. Evaluation Metrics. For score alignment we are interested in a measure which relates to source separation and accounts for the audio frames which are correctly detected, rather than in an alignment rate computed per note onset, as found in [13]. Thus, we evaluate the alignment at the frame level rather than at the note level. A similar reasoning on the evaluation of score alignment is found in [4].

We consider 0.011 s, the size of a frame, as the temporal granularity for this measure. A frame of a musical note is considered a true positive (tp) if it is found within the exact time boundaries both in the ground truth score and in the aligned score. The same frame is labeled as a false positive (fp) if it is found only in the aligned score, and as a false negative (fn) if it is found only in the ground truth score. Since the gains initialization is done with score information (see Section 4), lost frames (recall) and incorrectly detected frames (precision) impact the performance of the source separation algorithm. Precision is defined as $p = \text{tp}/(\text{tp} + \text{fp})$ and recall as $r = \text{tp}/(\text{tp} + \text{fn})$. Additionally, we compute the harmonic mean of precision and recall to obtain the $F$-measure, $F = 2 \cdot (p \cdot r)/(p + r)$.
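These frame-level metrics can be computed as in the sketch below (Python; representing each aligned frame as a (note, frame index) pair on the 0.011 s grid is our own choice):

```python
def frame_metrics(gt_frames, est_frames):
    """Frame-level precision, recall, and F-measure; gt_frames and
    est_frames are sets of (note, frame_index) pairs on a 0.011 s grid."""
    tp = len(gt_frames & est_frames)   # frames present in both scores
    fp = len(est_frames - gt_frames)   # frames only in the aligned score
    fn = len(gt_frames - est_frames)   # frames only in the ground truth
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```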

The source separation evaluation framework and the metrics employed are described in [41, 42]. Correspondingly, we use Source to Distortion Ratio (SDR), Source to Interference Ratio (SIR), Source to Artifacts Ratio (SAR), and source Image to Spatial distortion Ratio (ISR). While SDR measures the overall quality of the separation and ISR the spatial reconstruction of the source, SIR relates to the rejection of interferences and SAR to the absence of forbidden distortions and artifacts.

The evaluation of source separation is a computationally intensive process. Additionally, processing the long audio files in the dataset would require a large amount of memory for the matrix calculations. To reduce the memory requirements, the evaluation is performed on blocks of 30 s, with 1 s overlap to allow for continuation.

6.5. Results

6.5.1. Score Alignment. We evaluate the output of the alignment (Ali), along with the two estimations of the note offsets (INT and NEX), in terms of $F$-measure, precision, and recall (see Section 6.4.3), ranging from 0 to 1. Additionally, we evaluate the optimal size for extending the note boundaries along the onsets and offsets (T1 and T2) for the refinement methods Ref1 and Ref2 and for the baseline Ext. Since there are considerable differences between the pieces, we report the results individually per piece in Table 4.

Methods Ref1 and Ref2 depend on a binarization threshold which determines how much energy is set to zero. A lower threshold results in the consolidation of larger blobs in the blob detection. In [11] this threshold is set to 0.5 for a dataset of monaural recordings of Bach chorales played by four instruments. However, we are facing a multichannel scenario where capturing the reverberation is important, especially considering that offsets were annotated with a low energy threshold. Thus, we are interested in losing as little energy as possible, and we set lower values for the threshold: 0.3 and 0.1. Consequently, when analyzing the results, the lower threshold achieves better performance in terms of $F$-measure for Ref1 (0.67 and 0.72) and for Ref2 (0.71 and 0.75).

According to Table 4, extending note offsets (NEX) rather than interpolating them (INT) gives lower recall in all pieces; this method loses more frames, which cannot be recovered even by extending the offset times as in T2. NEX T2 always yields a lower recall when compared to INT T2 (e.g., $r = 0.85$ compared to $r = 0.97$ for Mozart).

The raw output of the alignment system, Ali, is not a good option for initializing the gains of the source separation system: it has a high precision but a very low recall (e.g., the case INT Ali has $p = 0.95$ and $r = 0.39$, compared to the case INT Ext, which has $p = 0.84$ and $r = 0.84$, for Beethoven).

Table 4: Alignment evaluated in terms of F-measure (F), precision (p), and recall (r), ranging from 0 to 1.

            |      Mozart      |    Beethoven     |      Mahler      |     Bruckner
            |  F    p    r     |  F    p    r     |  F    p    r     |  F    p    r
INT Ali     | 0.69 0.93 0.55   | 0.56 0.95 0.39   | 0.61 0.85 0.47   | 0.60 0.94 0.44
INT T1 Ref1 | 0.77 0.77 0.76   | 0.79 0.81 0.77   | 0.63 0.74 0.54   | 0.72 0.73 0.70
INT T1 Ref2 | 0.83 0.79 0.88   | 0.82 0.82 0.81   | 0.67 0.77 0.60   | 0.81 0.77 0.85
INT T1 Ext  | 0.82 0.73 0.94   | 0.84 0.84 0.84   | 0.77 0.69 0.87   | 0.86 0.78 0.96
INT T2 Ref1 | 0.77 0.77 0.76   | 0.76 0.75 0.77   | 0.63 0.74 0.54   | 0.72 0.70 0.71
INT T2 Ref2 | 0.83 0.78 0.88   | 0.82 0.76 0.87   | 0.67 0.77 0.59   | 0.79 0.73 0.86
INT T2 Ext  | 0.72 0.57 0.97   | 0.79 0.70 0.92   | 0.69 0.55 0.93   | 0.77 0.63 0.98
NEX Ali     | 0.49 0.94 0.33   | 0.51 0.89 0.36   | 0.42 0.90 0.27   | 0.48 0.96 0.32
NEX T1 Ref1 | 0.70 0.77 0.64   | 0.72 0.79 0.66   | 0.63 0.74 0.54   | 0.68 0.72 0.64
NEX T1 Ref2 | 0.73 0.79 0.68   | 0.71 0.79 0.65   | 0.66 0.77 0.58   | 0.73 0.74 0.72
NEX T1 Ext  | 0.73 0.77 0.68   | 0.71 0.80 0.65   | 0.69 0.75 0.64   | 0.76 0.79 0.72
NEX T2 Ref1 | 0.74 0.78 0.71   | 0.72 0.74 0.70   | 0.63 0.74 0.54   | 0.72 0.79 0.72
NEX T2 Ref2 | 0.80 0.80 0.80   | 0.75 0.75 0.75   | 0.67 0.77 0.59   | 0.79 0.73 0.86
NEX T2 Ext  | 0.73 0.65 0.85   | 0.73 0.69 0.77   | 0.72 0.64 0.82   | 0.79 0.69 0.91

Table 5: Instruments for which the closest microphone was incorrectly determined, for different score information (GT, Ali, T1, T2, INT, and NEX) and the two room setups.

        | GT       | INT Ali               | INT T1          | INT T2                | NEX Ali               | NEX T1   | NEX T2
Setup 1 | Clarinet | Clarinet, double bass | Clarinet, flute | Clarinet, flute, horn | Clarinet, double bass | Clarinet | Flute
Setup 2 | Cello    | Cello, flute          | Cello, flute    | Cello, flute          | Bassoon               | Flute    | Cello, flute

For the case of Beethoven, the output of the alignment is particularly poor compared to the other pieces. However, by extending the boundaries (Ext) and applying note refinement (Ref1 or Ref2), we are able to increase the recall and match the performance on the other pieces.

When comparing the sizes of the tolerance window for the onsets and offsets, we observe that the alignment is more accurate at detecting onsets within 0.3 s and offsets within 0.6 s. In Table 4, T1 achieves better results than T2 (e.g., $F = 0.77$ for T1 compared to $F = 0.69$ for T2, for Mahler). Relying on a large window retrieves more frames but also significantly damages the precision. However, for source separation we want to lose as little information as possible. It is in this case that the refinement methods Ref1 and Ref2 show their importance: when facing larger time boundaries, as in T2, Ref1 and especially Ref2 are able to reduce the errors, achieving better precision with a minimal loss in recall.

The refinement Ref1 performs worse than Ref2, the multichannel refinement (e.g., $F = 0.72$ compared to $F = 0.81$ for Bruckner, INT T1). Note that in its original version [11] Ref1 assumed monophony within a source, as it was tested in the simple case of the Bach10 dataset [4]; to that extent, it relied on a graph computation to determine the best distribution of blobs. However, due to the increased polyphony within an instrument section (e.g., violins playing divisi) with simultaneous melodic lines, we disabled this feature; in this case Ref1 has lower recall, as it loses more frames. On the other hand, Ref2 is more robust because it computes a blob estimation per channel, and averaging these estimations yields better results.

The refinement works worse for the more complex pieces (Mahler and Bruckner) than for the simpler ones (Mozart and Beethoven). Increasing the polyphony within a source and the number of instruments playing many interleaving melodic lines (a less sparse score) makes the task more difficult.

6.5.2. Panning Matrix. Correctly estimating the panning matrix is an important step in the proposed method, since Wiener filtering is performed on the channel where the instrument has the most energy. If the algorithm picks a different channel for this step, we find more interference from other instruments in the separated audio files.

As described in Section 3.1, the estimation of the panning matrix depends on the number of nonoverlapping partials of the notes found in the score and on their alignment with the audio. To that extent, the more nonoverlapping partials we have, the more robust the estimation.

Initially, we experimented with computing the panning matrix separately for each piece. In the case of Bruckner, the piece is simply too short, and there are too few nonoverlapping partials to yield a good estimation, resulting in errors in the panning matrix. Since the instrument setup is the same for the Bruckner, Beethoven, and Mahler pieces (10 sources in the same positions on the stage), we decided to jointly estimate the matrix from the concatenated audio of these pieces and their associated scores. We denote as Setup 1 the Mozart piece, played by 8 sources, and as Setup 2 the Beethoven, Mahler, and Bruckner pieces, played by 10 sources.

Since the panning matrix is computed using the score, different score information can yield very different estimations of the panning matrix. To that extent, we evaluate the influence of the audio-to-score alignment, namely, the cases INT, NEX, Ali, T1, and T2, as well as the initialization with the ground truth score information, GT.

In Table 5 we list the instruments for which the algorithm picked the wrong channel. Note that in the room setup generated with Roomsim most of the instruments in Table 5 are placed close to other sources from the same family of instruments, for example, cello next to double bass, and flute next to clarinet, bassoon, and oboe. In such cases the algorithm makes more mistakes when selecting the correct channel on which to perform source separation.

In the column GT of Table 5 we can see that having a perfectly aligned score yields fewer errors when estimating the panning matrix. Conversely, in a real-life scenario we cannot rely on a hand-annotated score. In this case, for all columns of Table 5 excluding GT, the best estimation is obtained by the combination of NEX and T1: taking the offset time as the onset of the next note and then extending the score with the smaller tolerance window.

Furthermore, we compute the SDR values for the instruments in Table 5, column GT (clarinet and cello), when the separation is done in the correct channel and when it is done in the estimated channel. For Setup 1, the channel for clarinet is wrongly estimated as WWL (woodwinds left), the correct one being WWR (woodwinds right), even with a perfectly aligned score (GT). However, the microphones WWL and WWR are very close (see Figure 4) and do not capture significant energy from other instrument sections, so the SDR difference is less than 0.01 dB. In Setup 2, however, the cello is wrongly separated in the WWL channel, and the SDR difference between this audio and the audio separated in the correct channel is around −1.1 dB for each of the three pieces.

6.5.3. Source Separation. We use the evaluation metrics described in Section 6.4.3. Since there is a lot of variability between the four pieces, it is more informative to present the results per piece rather than aggregating them.

Figure 6: Results in terms of SDR, SIR, SAR, and ISR for the instruments and the pieces in the dataset.

First, we analyze the separation results per instrument in an ideal case. We assume that the best results for score-informed source separation are obtained in the case of a perfectly aligned score (GT). Furthermore, for this case we perform the separation in the correct channel for all the instruments, since in Section 6.5.2 we saw that picking a wrong channel can be detrimental. We present the results as a bar plot in Figure 6.

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6 we can see that the more complex the piece, the more difficult it is to achieve a good separation. For instance, cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for cello in Mozart compared to −5 dB in Mahler). Cello and double bass are close to each other in both setups, and similarly clarinet and flute, so we expect interference between them. Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in the lower SIR values for double bass (5.5 dB SIR for Mozart, but −1.8, −0.1, and −0.2 dB SIR for the others) and flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask the multichannel spectrogram is multiplied by the panning matrix. Thus, wrong values in this matrix can yield wrong amplitudes in the resulting signals.

This is the case for the trumpet, which is allocated a close microphone in the current setup and for which we would expect a good separation. However, the trumpet achieves a poor ISR (5.5, 1, and −1 dB) while having good separation in terms of SIR and SAR. Similarly, other instruments such as cello, double bass, flute, and viola face the same problem, particularly for the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.


Figure 7: The test cases for the initialization of score-informed source separation for the submatrix $p^{s}_{jn}(t)$ of note $s$: ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).

A low SDR is obtained in the case of Mahler, which relates to the poor alignment results obtained for this piece. As seen in Table 4, for the INT case the $F$-measure is almost 8% lower for Mahler than for the other pieces, mainly because of the poor precision.

The results are considerably worse for the double bass in the more complex pieces by Beethoven (−9.1 dB SDR), Mahler (−8.5 dB SDR), and Bruckner (−4.7 dB SDR); for further analysis we consider it an outlier and exclude it.

Second, we want to evaluate the usefulness of note refinement for source separation. As seen in Section 4, the gains for the NMF separation are initialized with score information, or with a refined score as described in Section 4.2. A summary of the different initialization options is given in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization with the ground truth annotations (Figure 7 (GT)), the direct output of the score alignment system (Figure 7 (Ali)), the common practice of NMF gains initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7 (Ext)), and the refinement approaches (Figure 7 (Ref1 and Ref2)). Note that Ref1 is the refinement done with the method in Section 4.1, and Ref2 with the multichannel method described in Section 5.2.

We test the difference between the binarization thresholds 0.5 and 0.1 used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives an $F$-value of 0.0796 and a $p$ value of 0.7778, which shows no significant difference between the two binarization thresholds.

The results for the five initializations, GT, Ref1, Ref2, Ext, and Ali, are presented in Figure 8 for each of the four pieces. Note that for Ref1, Ref2, and Ext we aggregate information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate the instruments, with the piece by Mahler having the worst results and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors, such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues described in Section 6.1, can increase the difficulty of the separation up to the point that Ref1, Ref2, and Ext bring only a minimal improvement. To that extent, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps to improve the separation, increasing the SDR by 1–1.5 dB or more over all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase the SDR for most of the pieces, with the exception of the piece by Mahler, due to an increase in SIR and a decrease of the interferences in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext; for this piece, the difference in SIR is around 2 dB. Then, for Beethoven, Ref1 and Ref2 gain 0.5 dB in SDR over Ext and 1.5 dB in SIR. For Bruckner, solely Ref2 has a higher SDR, but SIR increases by 1.5 dB for both Ref1 and Ref2. Note that Ref1 and Ref2 not only refine the time boundaries of the notes: the refinement also happens in frequency, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the estimation of note offsets, INT and NEX, and of the tolerance window sizes, T1 and T2, which account for errors in the alignment. For this comparison we evaluate only the case Ext, leaving out the refinement in order to isolate the influence of T1 and T2. The results are presented in Figure 9 and show that the best results are obtained when interpolating the offsets (INT), which relates to the results presented in Section 6.5.1. As in the analysis of the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to conclude which strategy is better for the initialization, as the error bars for the ground truth overlap with those of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gains estimation, proposed in Section 5.1, and the single-channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR, obtaining an $F$-value of 1.712 and a $p$ value of 0.1908. Hence, there is no significant difference between single-channel and multichannel gain estimation when no postprocessing of the gains with note refinement is performed. However, although the new update rules do not by themselves help in the multichannel case, they allow us to better refine the gains: we aggregate information over all the channels, and the blob detection is more robust, even to variations of the binarization threshold. On account of that, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available online (http://repovizz.upf.edu/phenicx/anechoic_multi).

Figure 8: Results in terms of SDR, SIR, SAR, and ISR for the NMF gains initialization in the different test cases (GT, Ref1, Ref2, Ext, and Ali).

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation is Instrument Emphasis, which aims at processing multitrack orchestral recordings. Once we have the separated tracks, it allows emphasizing a particular instrument over the full orchestral downmix recording. Our workflow consists in processing multichannel orchestral recordings for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of other sections, obtaining an enhanced signal for the selected instrument.

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user when uploading media content and for presenting the results; the back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks. After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks. The separation quality is not degraded if the blocks have a sufficient duration; in our case we set the block duration to 1 minute, as sketched in the code below. Examples of this application are found online (http://repovizz.upf.edu/phenicx), and it is integrated into the PHENICX prototype (http://phenicx.com).
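A minimal sketch of the block splitting is given below (Python; the function name and signature are our own):

```python
def block_bounds(n_samples, sr, block_s=60.0):
    """Start/end sample indices for processing a long recording in
    consecutive blocks; the separated blocks of each instrument are
    concatenated afterwards."""
    step = int(block_s * sr)
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]

# e.g., a 10-minute recording at 44.1 kHz split into 1-minute blocks
print(block_bounds(10 * 60 * 44100, 44100))
```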

7.2. Acoustic Rendering. The second application for the separated material concerns augmented or virtual reality scenarios, in which we apply a spatialization to the separated musical sources. Acoustic Rendering aims at acoustically recreating the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage relative to the listener.

We have considered binaural synthesis the most suitable spatial audio technique for this application. Humans locate the direction of incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound will arrive with a different intensity

Figure 9: Results in terms of SDR, SIR, SAR, and ISR for the combination of different offset estimation methods (INT and NEX) and different tolerance window sizes for note onsets and offsets (T1 and T2). GT is the ground truth alignment and Ali is the output of the score alignment.

and at different time instants at both ears. The idea behind binaural synthesis is to artificially generate these cues in order to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype, we used the noncommercial Unity3D plugin for binaural synthesis provided by 3DCeption (https://twobigears.com/index.php).

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker, https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of the musical instruments on stage. This is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances), for the instrument visualization and Acoustic Rendering use-cases introduced above.

As inputs for the source localization method, we need the multichannel recordings and the approximate positions of the microphones on stage. In concert halls, the recording setup typically consists of a grid structure of hanging overhead microphones, whose positions are kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, and it is currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes the refined score alignment obtained prior to the signal separation (see Section 4.1) and follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then, the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible locations of a sound source based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate the list of time delay values corresponding to all note onsets in the score and take the maximum of their histogram. In our experiment we have a time resolution of 28 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play in isolation, unlike SRP-PHAT [45], and can therefore be used in complex scenarios.
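The delay estimation step can be sketched as follows (NumPy; the histogram binning at the 28 ms resolution is an assumption about implementation details not given in the text):

```python
import numpy as np

def pairwise_delay(onsets_mic_a, onsets_mic_b, resolution=0.028):
    """TDOA of one instrument between two microphones: the mode of the
    per-note onset delays (onset times in seconds, one per score note)."""
    delays = np.asarray(onsets_mic_b) - np.asarray(onsets_mic_a)
    span = delays.max() - delays.min()
    n_bins = max(1, int(np.ceil(span / resolution)))
    hist, edges = np.histogram(delays, bins=n_bins)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])  # center of the most populated bin
```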

8. Outlook

In this paper we proposed a framework for score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). We then introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of multichannel orchestral recordings is objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment, which improved the separation for three of the four pieces in the dataset when compared to two other initialization options for our framework: the raw output of the score alignment and the tolerance window on which the alignment relies.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems in recovering the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done over a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach to source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone with which to separate a source is the closest one determined by the localization method.


When looking at separation across the instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies between the pieces and the instruments, and future research could provide more insight into this problem. Note that an increasing degree of consonance has been related to more difficult cases for source separation [17]; hence, we can expect worse separation for instruments which harmonize or accompany other sections, as is the case of viola, cello, and double bass in some pieces. Future research could shed more light on the relation between the musical characteristics of the pieces (e.g., tonality and texture) and the quality of source separation.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework, related to Instrument Emphasis or Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustín Martorell for his insights on the musicology side and for correcting the scores, and Óscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference 2013.

[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.

[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.

[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal on Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.

[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.

[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodriguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.

[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.

[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.

[9] F. J. Rodriguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.

[10] F. J. Rodriguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.

[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, October 2015.

[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.

[13] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, 2015.

[14] Ö. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.

[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.

[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, IEEE, Taipei, Taiwan, April 2009.

[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.

[18] F. J. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.

[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.

[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.

[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.

[22] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.

[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.

[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.

[25] B. Niedermayer, Accurate audio-to-score alignment—data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.

[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.

[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.

[28] D. Fitzgerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.

[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Málaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.

[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.

[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.

[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.

[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.

[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.

[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.

[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.

[38] M. Mauch and S. Dixon, "PYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.

[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.

[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.

[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.

[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.

[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.

[45] A. Martí Guerola, Multichannel audio processing for speaker localization, separation and enhancement, Universitat Politècnica de València, 2013.


Table 1: Anechoic dataset [30] characteristics.

Piece      Duration    Period     Instrument sections   Number of tracks   Max tracks/instrument
Mozart     3 min 47 s  Classical  8                     10                 2
Beethoven  3 min 11 s  Classical  10                    20                 4
Mahler     2 min 12 s  Romantic   10                    30                 4
Bruckner   1 min 27 s  Romantic   10                    39                 12

Figure 3: The steps to create the multichannel recordings dataset: anechoic recordings and score annotations yield denoised recordings, which are then spatialized into multichannel recordings.

score-following challenge for the past years. Then we locally aligned the notes of each instrument by manually correcting the onsets and offsets to fit the audio. This was performed using Sonic Visualiser, with the guidance of the spectrogram and of the monophonic pitch estimation [38] computed for each of the isolated instruments. The annotation was performed by two of the authors, who cross-checked the work of their peer. Note that this dataset and the proposed annotation are useful not only for our particular task but also for the evaluation of multiple pitch estimation and automatic transcription algorithms in large orchestral settings, a context which has not been considered so far in the literature. The annotations can be found at the associated page (http://mtg.upf.edu/download/datasets/phenicx-anechoic).

During the recording process, detailed in [30], the gain of the microphone amplifiers was fixed to the same value for the whole process, which reduced the dynamic range of the recordings of the quieter instruments. This led to noisier recordings for most of the instruments. In Section 6.2 we describe the score-informed denoising procedure we applied to each track. From the denoised isolated recordings we then used Roomsim to create a multichannel image, as detailed in Section 6.3. The steps necessary to pass from the anechoic recordings to the multichannel dataset are represented in Figure 3. The original files can be obtained from the Acoustics Group at Aalto University (http://research.cs.aalto.fi/acoustics). For the denoising algorithm please refer to http://mtg.upf.edu/download/datasets/phenicx-anechoic.

6.2. Dataset Denoising. The noise-related problems in the dataset were presented in [30]. We remove the noise in the recordings with the score-informed method in [39], which relies on learned noise spectral patterns. The main difference is that we rely on a manually annotated score, while in [39] the score is assumed to be misaligned, so further regularization is included to ensure that only certain note combinations in the score occur.

The annotated score yields the time intervals where an instrument is not playing. Thus, the noise pattern is learned only within those intervals. In this way, the method assures that the desired noise, which is a part of the actual sound of the instrument, is preserved in the denoised recording.

The algorithm takes each anechoic recording of a given instrument and removes the noise for the time intervals where the instrument is playing, while setting to zero the frames where the instrument is not playing.
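The core of this procedure can be illustrated with a minimal sketch: a score-derived activity mask selects the silent frames from which the noise spectrum is learned. The snippet below uses plain spectral subtraction instead of the constrained NMF of [39]; all function and variable names are our own illustration.

import numpy as np

def denoise_track(S, playing_mask, floor=1e-12):
    """Score-informed denoising sketch.

    S            : magnitude spectrogram (freq x frames) of one anechoic track
    playing_mask : boolean vector (frames,) derived from the annotated score,
                   True where the instrument is playing
    """
    # Learn the noise spectral pattern only from frames where the score
    # indicates silence, so that noise that is part of the instrument sound
    # (e.g., bowing or breathing while playing) is preserved.
    noise_profile = S[:, ~playing_mask].mean(axis=1, keepdims=True)

    # Spectral subtraction on the playing frames...
    S_denoised = np.maximum(S - noise_profile, floor)

    # ...and hard-set the non-playing frames to zero, as in the text.
    S_denoised[:, ~playing_mask] = 0.0
    return S_denoised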

6.3. Dataset Spatialization. To simulate a large reverberant hall we use the software Roomsim [31]. We define a configuration file which specifies the characteristics of the hall and, for each of the microphones, their position relative to each of the sources. The simulated room has dimensions similar to those of the Royal Concertgebouw in Amsterdam, one of the partners in the PHENICX project, and represents a setup in which we tested our framework. The simulated room's width, length, and height are 28 m, 40 m, and 12 m. The absorption coefficients are specified in Table 2.

The positions of the sources and microphones in the room are common for orchestral concerts (Figure 4). A configuration file is created for each microphone, which contains its coordinates (e.g., (14, 17, 4) for the center microphone). Then, each source is defined through polar coordinates relative to the microphone (e.g., (11.4455, −95.1944, −15.1954), i.e., radius, azimuth, and elevation, for the bassoon relative to the center microphone). We selected all the microphones to be cardioid, in order to match the realistic setup of the Concertgebouw Hall.

Figure 4: The sources (instrument sections: violins, violas, cellos, double basses, flutes, oboes, clarinets, bassoons, horns, and trumpets) and the receivers (microphones CL, RVL, V1, V2, TR, HRN, WWL, and WWR) in the simulated room.
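For illustration, such a polar entry can be derived from Cartesian positions as in the sketch below; the exact axis and angle conventions of Roomsim may differ, and the bassoon position used in the example call is hypothetical.

import numpy as np

def source_relative_to_mic(src_xyz, mic_xyz):
    """Polar coordinates (radius, azimuth, elevation) of a source relative
    to a microphone, as required by a Roomsim-style configuration file.
    Angles are returned in degrees."""
    dx, dy, dz = np.subtract(src_xyz, mic_xyz)
    radius = np.sqrt(dx**2 + dy**2 + dz**2)
    azimuth = np.degrees(np.arctan2(dy, dx))
    elevation = np.degrees(np.arcsin(dz / radius))
    return radius, azimuth, elevation

# Example: a hypothetical bassoon position relative to the center
# microphone at (14, 17, 4)
print(source_relative_to_mic((12.0, 28.0, 1.5), (14.0, 17.0, 4.0)))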


Table 2: Room surface absorption coefficients.

Standard measurement frequencies (Hz)   125   250   500   1000  2000  4000
Absorption of wall in x = 0 plane       0.4   0.3   0.3   0.3   0.2   0.1
Absorption of wall in x = Lx plane      0.4   0.45  0.35  0.35  0.45  0.3
Absorption of wall in y = 0 plane       0.4   0.45  0.35  0.35  0.45  0.3
Absorption of wall in y = Ly plane      0.4   0.45  0.35  0.35  0.45  0.3
Absorption of floor (z = 0 plane)       0.5   0.6   0.7   0.8   0.8   0.9
Absorption of ceiling (z = Lz plane)    0.4   0.45  0.35  0.35  0.45  0.3

Figure 5: The reverberation time (RT60, in seconds) versus frequency (Hz) for the simulated room.

Using the configuration file and the anechoic audio files corresponding to the isolated sources, Roomsim generates the audio files for each of the microphones, along with the impulse responses for each pair of instruments and microphones. The impulse responses and the anechoic signals are used during the evaluation to obtain the ground truth spatial image of the sources in the corresponding microphone. Additionally, we plot in Figure 5 the reverberation time RT60 [31] of the simulated room across frequencies.

We need to adapt the ground truth annotations to the audio generated with Roomsim, as the original annotations were done on the isolated audio files. Roomsim creates the audio for a given microphone by convolving each of the sources with the corresponding impulse response and then summing up the results of the convolutions. We compute a delay for each pair of microphones and instruments by taking the position of the maximum value in the associated impulse response vector. Then we generate a score for each of the pairs by adding the corresponding delay to the note onsets. Additionally, since the offset time depends on the reverberation and on the frequencies of the notes, we add 0.8 s to each note offset to account for the reverberation, besides the added delay.
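A minimal sketch of this adaptation step, assuming notes are represented as (onset, offset, pitch) tuples and that the sample rate of the simulation is known:

import numpy as np

SR = 44100          # sample rate of the simulated recordings (assumption)
REVERB_PAD = 0.8    # seconds added to each offset, as stated in the text

def adapt_score(notes, impulse_response, sr=SR):
    """Shift the note times of one instrument for one microphone.

    notes            : list of (onset_s, offset_s, pitch) tuples
    impulse_response : IR vector for this instrument/microphone pair
    """
    # The per-pair delay is the position of the IR maximum, in seconds.
    delay = np.argmax(np.abs(impulse_response)) / sr
    return [(onset + delay, offset + delay + REVERB_PAD, pitch)
            for onset, offset, pitch in notes]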

6.4. Evaluation Methodology

6.4.1. Parameter Selection. In this paper we use a low-level spectral representation of the audio data, which is generated from a windowed FFT of the signal. We use a Hanning window with a size of 92 ms and a hop size of 11 ms.

Here, a logarithmic frequency discretization is adopted. Furthermore, two time-frequency resolutions are used. First, to estimate the instrument models and the panning matrix, a resolution of a single semitone is proposed. In particular, we implement the time-frequency representation by integrating the STFT bins corresponding to the same semitone. Second, for the separation task, a higher resolution of 1/4 of a semitone is used, which has proven to achieve better separation results [6]. The time-frequency representation is obtained by integrating the STFT bins corresponding to 1/4 of a semitone. Note that, in the separation stage, the learned basis functions b_jn(f) are adapted to the 1/4-of-a-semitone resolution by replicating the basis at each semitone 4 times, onto the 4 samples of the finer resolution that belong to this semitone. For image binarization, we pick for the first Gaussian φ = 3 as the standard deviation, and for the second Gaussian κ = 4 as the position of the central frequency bin and ν = 4 as the standard deviation corresponding to one semitone.
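The semitone integration can be written as a filterbank matrix applied to the magnitude STFT, as in the sketch below; the lowest analysis frequency fmin is our assumption, and a semitone basis can be mapped to the finer axis with np.repeat(basis, 4, axis=0).

import numpy as np

def semitone_filterbank(n_fft, sr, fmin=27.5, bins_per_semitone=1):
    """Map linear STFT bins to a logarithmic (semitone) axis by integrating
    the bins that fall into each band. Returns a matrix M such that
    M @ |STFT| is the log-frequency representation."""
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    n_bands = int(12 * bins_per_semitone * np.log2((sr / 2) / fmin))
    edges = fmin * 2.0 ** (np.arange(n_bands + 1) / (12.0 * bins_per_semitone))
    M = np.zeros((n_bands, len(freqs)))
    for b in range(n_bands):
        # Integrate all STFT bins whose center falls inside band b.
        M[b, (freqs >= edges[b]) & (freqs < edges[b + 1])] = 1.0
    return M

# 1 bin/semitone for the instrument models and panning matrix;
# 4 bins/semitone (bins_per_semitone=4) for the separation stage.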

We picked 10 iterations for the NMF, and we set the beta-divergence distortion to β = 1.3, as in [6, 11].

6.4.2. Evaluation Setup. We perform three different kinds of evaluations: audio-to-score alignment, panning matrix estimation, and score-informed source separation. Note that, for the alignment, we evaluate the state-of-the-art system in [13]. This method does not align single notes but combinations of notes in the score (a.k.a. states). Here, the alignment is performed with respect to a single audio channel, corresponding to the microphone situated in the center of the stage. On the other hand, the offsets are estimated by shifting the original duration of each note in the score [13] or by assigning the offset time as the onset of the next state. We denote these two cases as INT and NEX.

Regarding the initialization of the separation framework, we can use the raw output of the alignment system. However, as stated in Section 4 and in [3, 5, 24], a better option is to extend the onsets and offsets along a tolerance window, to account for the errors of the alignment system and for the delays between the center channel (on which the alignment is performed) and the other channels. Thus, we test two hypotheses regarding the tolerance window for the possible errors. In the first case we extend the boundaries with 0.3 s for onsets and 0.6 s for offsets (T1), and in the second with 0.6 s for onsets and 1 s for offsets (T2). Note that the value of 0.3 s for the onset times is not arbitrary, but the usual threshold for onsets in the MIREX evaluation of real-time score-following [40].


Table 3: Score information used for the initialization of score-informed source separation.

Tolerance window size               Offset estimation
T1: onsets 0.3 s, offsets 0.6 s     INT: interpolation of the offset time
T2: onsets 0.6 s, offsets 0.9 s     NEX: the offset is the onset of the next note

Two different tolerance windows were tested to account for the complexity of this novel scenario. The tolerance window is slightly larger for offsets, due to the reverberation time and because the ending of a note is not as clear as its onset. A summary of the score information used to initialize the source separation framework is found in Table 3.
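As a concrete illustration, extending the aligned notes by such a tolerance window amounts to the following sketch (the (onset, offset, pitch) note representation is assumed):

def extend_notes(notes, onset_tol, offset_tol):
    """Extend aligned note boundaries by a tolerance window before
    initializing the NMF gains, e.g., T1 = (0.3, 0.6) or T2 = (0.6, 0.9)."""
    return [(max(0.0, onset - onset_tol), offset + offset_tol, pitch)
            for onset, offset, pitch in notes]

# T1 = extend_notes(aligned, 0.3, 0.6); T2 = extend_notes(aligned, 0.6, 0.9)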

We label the test case corresponding to the initialization with the raw output of the alignment system as Ali. Conversely, the test case corresponding to the tolerance-window initialization is labeled as Ext. Furthermore, within the tolerance window, we can refine the note onsets and offsets with the methods in Section 4.1 (Ref1) and Section 5.2 (Ref2), resulting in two other test cases. Since method Ref1 can only refine the score to a single channel, its results are solely computed with respect to that channel. For the multichannel refinement Ref2, we report the results of the alignment of each instrument with respect to each microphone. A graphic of the initialization of the framework with the four test cases listed above (Ali, Ext, Ref1, and Ref2), along with the ground truth score initialization (GT), is found in Figure 7; the source separation results for these cases are presented in Figure 8.

In order to evaluate the panning matrix estimation stage, we compute an ideal panning matrix based on the impulse responses generated by Roomsim during the creation of the multichannel audio (see Section 6.3). The ideal panning matrix gives the ideal contribution of each instrument in each channel, and it is computed by searching for the maximum in the impulse response vector corresponding to each instrument-channel pair, as in

m_ideal(i, j) = max_t IR_{i,j}(t), (19)

where IR_{i,j}(t) is the impulse response of source i in channel j. By comparing the estimated matrix m(i, j) with the ideal one m_ideal(i, j), we can determine whether the algorithm picked a wrong channel for separation.
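The computation in (19) reduces to a few lines; the sketch below also shows how the matrix is used to pick the separation channel per source.

import numpy as np

def ideal_panning_matrix(impulse_responses):
    """impulse_responses[i][j]: IR vector of source i in channel j.
    Implements m_ideal(i, j) = max_t IR_{i,j}(t) from (19)."""
    return np.array([[float(np.max(ir)) for ir in row]
                     for row in impulse_responses])

# The channel used to separate source i is m.argmax(axis=1)[i];
# comparing it with the estimate reveals wrong channel picks.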

6.4.3. Evaluation Metrics. For score alignment we are interested in a measure which relates to source separation and accounts for the audio frames which are correctly detected, rather than in an alignment rate computed per note onset as found in [13]. Thus, we evaluate the alignment at the frame level rather than at the note level. A similar reasoning on the evaluation of score alignment is found in [4].

We consider 0.011 s, the size of a frame, as the temporal granularity for this measure. A frame of a musical note is considered a true positive (tp) if it is found in the ground truth score and in the aligned score within the exact time boundaries. The same frame is labeled a false positive (fp) if it is found only in the aligned score, and a false negative (fn) if it is found only in the ground truth score. Since the gains initialization is done with score information (see Section 4), lost frames (recall) and incorrectly detected frames (precision) impact the performance of the source separation algorithm. Precision is defined as p = tp/(tp + fp) and recall as r = tp/(tp + fn). Additionally, we compute the harmonic mean of precision and recall to obtain the F-measure, F = 2·(p·r)/(p + r).
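These frame-level metrics can be computed directly from the two note lists, as in this sketch (note representation and helper names are our own):

def frame_level_prf(gt_notes, aligned_notes, frame=0.011):
    """Frame-level precision/recall/F-measure for one instrument.
    Notes are (onset_s, offset_s, pitch) tuples; a frame of a note is a
    true positive when (pitch, frame index) occurs in both scores."""
    def to_frames(notes):
        return {(pitch, t) for onset, offset, pitch in notes
                for t in range(int(onset / frame), int(offset / frame))}
    gt, est = to_frames(gt_notes), to_frames(aligned_notes)
    tp, fp, fn = len(gt & est), len(est - gt), len(gt - est)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f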

The source separation evaluation framework and the metrics employed are described in [41, 42]. Correspondingly, we use the Source to Distortion Ratio (SDR), the Source to Interference Ratio (SIR), the Source to Artifacts Ratio (SAR), and the source Image to Spatial distortion Ratio (ISR). While SDR measures the overall quality of the separation and ISR the spatial reconstruction of the source, SIR is related to the rejection of interferences and SAR to the absence of forbidden distortions and artifacts.

The evaluation of source separation is a computationally intensive process. Additionally, processing the long audio files in the dataset would require large amounts of memory for the matrix calculations. To reduce the memory requirements, the evaluation is performed on blocks of 30 s, with an overlap of 1 s to allow for continuation.
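The block boundaries can be derived as below; the function name is our own illustration of the stated 30 s / 1 s scheme.

def block_bounds(total_s, block_s=30.0, overlap_s=1.0):
    """Start/end times of evaluation blocks: block_s seconds long,
    overlapping by overlap_s so metrics continue across block borders."""
    bounds, start = [], 0.0
    while start < total_s:
        bounds.append((start, min(start + block_s, total_s)))
        start += block_s - overlap_s
    return bounds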

6.5. Results

6.5.1. Score Alignment. We evaluate the output of the alignment, Ali, along with the two estimations of the note offsets, INT and NEX, in terms of F-measure (see Section 6.4.3), precision, and recall, ranging from 0 to 1. Additionally, we evaluate the optimal size for extending the note boundaries along the onsets and offsets, T1 and T2, for the refinement methods Ref1 and Ref2 and for the baseline Ext. Since there are large differences between the pieces, we report the results individually per piece in Table 4.

Methods Ref1 and Ref2 depend on a binarization threshold which determines how much energy is set to zero. A lower threshold results in the consolidation of larger blobs in the blob detection. In [11], this threshold is set to 0.5 for a dataset of monaural recordings of Bach chorales played by four instruments. However, we are facing a multichannel scenario where capturing the reverberation is important, especially considering that the offsets were annotated with a low energy threshold. Thus, we are interested in losing the least energy possible, and we set lower values for the threshold: 0.3 and 0.1. Consequently, when analyzing the results, the lower threshold achieves better performance in terms of F-measure for Ref1 (0.67 and 0.72) and for Ref2 (0.71 and 0.75).

According to Table 4, extending the note offsets (NEX) rather than interpolating them (INT) gives a lower recall in all pieces, and this method leads to losing more frames, which cannot be recovered even by extending the offset times with T2. NEX T2 always yields a lower recall when compared to INT T2 (e.g., r = 0.85 compared to r = 0.97 for Mozart).

The output of the alignment system, Ali, is not a good option to initialize the gains of the source separation system: it has a high precision and a very low recall (e.g., the case INT Ali has p = 0.95 and r = 0.39, compared to the case INT Ext, which has p = 0.84 and r = 0.84, for Beethoven). For the Beethoven piece the output is particularly poor compared to the other pieces. However, by extending the boundaries (Ext) and applying note refinement (Ref1 or Ref2), we are able to increase the recall and match the performance on the other pieces.


Table 4: Alignment evaluated in terms of F-measure (F), precision (p), and recall (r), ranging from 0 to 1.

              Mozart            Beethoven         Mahler            Bruckner
              F     p     r     F     p     r     F     p     r     F     p     r
INT  Ali      0.69  0.93  0.55  0.56  0.95  0.39  0.61  0.85  0.47  0.60  0.94  0.32
     T1 Ref1  0.77  0.77  0.76  0.79  0.81  0.77  0.63  0.74  0.54  0.72  0.73  0.70
     T1 Ref2  0.83  0.79  0.88  0.82  0.82  0.81  0.67  0.77  0.60  0.81  0.77  0.85
     T1 Ext   0.82  0.73  0.94  0.84  0.84  0.84  0.77  0.69  0.87  0.86  0.78  0.96
     T2 Ref1  0.77  0.77  0.76  0.76  0.75  0.77  0.63  0.74  0.54  0.72  0.70  0.71
     T2 Ref2  0.83  0.78  0.88  0.82  0.76  0.87  0.67  0.77  0.59  0.79  0.73  0.86
     T2 Ext   0.72  0.57  0.97  0.79  0.70  0.92  0.69  0.55  0.93  0.77  0.63  0.98
NEX  Ali      0.49  0.94  0.33  0.51  0.89  0.36  0.42  0.90  0.27  0.48  0.96  0.44
     T1 Ref1  0.70  0.77  0.64  0.72  0.79  0.66  0.63  0.74  0.54  0.68  0.72  0.64
     T1 Ref2  0.73  0.79  0.68  0.71  0.79  0.65  0.66  0.77  0.58  0.73  0.74  0.72
     T1 Ext   0.73  0.77  0.68  0.71  0.80  0.65  0.69  0.75  0.64  0.76  0.79  0.72
     T2 Ref1  0.74  0.78  0.71  0.72  0.74  0.70  0.63  0.74  0.54  0.72  0.79  0.72
     T2 Ref2  0.80  0.80  0.80  0.75  0.75  0.75  0.67  0.77  0.59  0.79  0.73  0.86
     T2 Ext   0.73  0.65  0.85  0.73  0.69  0.77  0.72  0.64  0.82  0.79  0.69  0.91


Table 5: Instruments for which the closest microphone was incorrectly determined, for different score information (GT, Ali, T1, T2, INT, and NEX) and the two room setups.

                      INT                                                  NEX
         GT           Ali                    T1               T2                     Ali                    T1     T2
Setup 1  Clarinet     Clarinet, double bass  Clarinet, flute  Clarinet, flute, horn  Clarinet, double bass  —      Clarinet, flute
Setup 2  Cello        Cello, flute           Cello, flute     Cello, flute           Bassoon, flute         —      Cello, flute


When comparing the sizes of the tolerance window for the onsets and offsets, we observe that the alignment is more accurate when detecting the onsets within 0.3 s and the offsets within 0.6 s. In Table 4, T1 achieves better results than T2 (e.g., F = 0.77 for T1 compared to F = 0.69 for T2, on Mahler). Relying on a larger window retrieves more frames but also significantly damages the precision. However, when considering source separation, we might want to lose as little information as possible. It is in this special case that the refinement methods Ref1 and Ref2 show their importance: when facing larger time boundaries, as in T2, Ref1 and especially Ref2 are able to reduce the errors, achieving better precision with a minimal loss in recall.

The refinement Ref1 has a worse performance than Ref2, the multichannel refinement (e.g., F = 0.72 compared to F = 0.81 for Bruckner, INT T1). Note that in its original version [11] Ref1 assumed monophony within a source, as it was tested in the simple case of the Bach10 dataset [4]. To that extent, it relied on a graph computation to determine the best distribution of blobs. However, due to the increased polyphony within an instrument (e.g., violins playing divisi), with simultaneous melodic lines, we disabled this feature; in this case Ref1 has a lower recall: it loses more frames. On the other hand, Ref2 is more robust, because it computes a blob estimation per channel; averaging out these estimations yields better results.

The refinement works worse for the more complex pieces (Mahler and Bruckner) than for the simpler pieces (Mozart and Beethoven). Increasing the polyphony within a source and the number of instruments, with many interleaving melodic lines and a less sparse score, makes the task more difficult.

6.5.2. Panning Matrix. Correctly estimating the panning matrix is an important step in the proposed method, since Wiener filtering is performed on the channel where the instrument has the most energy. If the algorithm picks a different channel for this step, we can find more interference between instruments in the separated audio files.

As described in Section 3.1, the estimation of the panning matrix depends on the number of nonoverlapping partials of the notes found in the score and on their alignment with the audio. To that extent, the more nonoverlapping partials we have, the more robust the estimation.

Initially, we experimented with computing the panning matrix separately for each piece. In the case of Bruckner, the piece is simply too short, and there are too few nonoverlapping partials to yield a good estimation, resulting in errors in the panning matrix. Since the instrument setup is the same for the Bruckner, Beethoven, and Mahler pieces (10 sources in the same positions on the stage), we decided to jointly estimate the matrix for the concatenated audio pieces and the associated scores. We denote as Setup 1 the Mozart piece, played by 8 sources, and as Setup 2 the Beethoven, Mahler, and Bruckner pieces, played by 10 sources.

Since the panning matrix is computed using the score, different score information can yield a very different estimation of the panning matrix. To that extent, we evaluate the influence of the audio-to-score alignment, namely, the cases INT, NEX, Ali, T1, and T2, and the initialization with the ground truth score information, GT.

In Table 5 we list the instruments for which the algorithm picked the wrong channel. Note that, in the room setup generated with Roomsim, most of the instruments in Table 5 are placed close to other sources from the same family of instruments, for example, cello next to double bass, or flute next to clarinet, bassoon, and oboe. In such cases the algorithm makes more mistakes when selecting the correct channel to perform source separation.

In the column GT of Table 5 we can see that having a perfectly aligned score yields fewer errors when estimating the panning matrix. Conversely, in a real-life scenario we cannot rely on a hand-annotated score. In this case, for all columns of Table 5 excluding GT, the best estimation is obtained by the combination of NEX and T1: taking the offset time as the onset of the next note and then extending the score with the smaller window.

Furthermore, we compute the SDR values for the instruments in Table 5, column GT (clarinet and cello), for the cases where the separation is done in the correct channel and in the estimated channel. For Setup 1, the channel for the clarinet is wrongly set to WWL (woodwinds left), the correct one being WWR (woodwinds right), even with a perfectly aligned score (GT). However, the microphones WWL and WWR are very close (see Figure 4) and do not capture significant energy from other instrument sections, so the SDR difference is less than 0.01 dB. However, in Setup 2 the cello is wrongly separated in the WWL channel, and the SDR difference between this audio and the audio separated in the correct channel is around −1.1 dB for each of the three pieces.

6.5.3. Source Separation. We use the evaluation metrics described in Section 6.4.3. Since there is a lot of variability between the four pieces, it is more informative to present the results per piece rather than aggregating them.


Figure 6: Results in terms of SDR, SIR, ISR, and SAR (dB) for each instrument (bassoon, cello, clarinet, double bass, flute, horn, viola, violin, oboe, and trumpet) and each piece (Mozart, Beethoven, Mahler, and Bruckner).


First, we analyze the separation results per instrument in an ideal case. We assume that the best results for score-informed source separation are obtained in the case of a perfectly aligned score (GT). Furthermore, for this case, we calculate the separation in the correct channel for all the instruments, since in Section 6.5.2 we could see that picking a wrong channel can be detrimental. We present the results as a bar plot in Figure 6.

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6 we can see that the more complex the piece, the more difficult it is to achieve a good separation. For instance, note that cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for cello in Mozart compared to −5 dB in Mahler). Cello and double bass are close by in both of the setups, and similarly clarinet and flute, and we expect interference between them. Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in the lower SIR values for double bass (5.5 dB SIR for Mozart, but −1.8, −0.1, and −0.2 dB SIR for the others) and flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask, the multichannel spectrogram is multiplied with the panning matrix. Thus, wrong values in this matrix can yield wrong amplitude values in the resulting signals.

This is the case for the trumpet, which is allocated a close microphone in the current setup and for which we expect a good separation. However, the trumpet achieves a poor ISR (5.5, 1, and −1 dB) while having a good separation in terms of SIR and SAR. Similarly, other instruments, such as cello, double bass, flute, and viola, face the same problem, particularly for the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.


Figure 7: The test cases for the initialization of score-informed source separation, illustrated for the gains submatrix p_sjn(t) of a note s: ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).

A low SDR is obtained in the case of Mahler, which is related to the poor alignment results obtained for this piece. As seen in Table 4, for the INT case, the F-measure is almost 0.08 lower for Mahler than for the other pieces, mainly because of the low precision.

The results are considerably worse for the double bass in the more complex pieces of Beethoven (−9.1 dB SDR), Mahler (−8.5 dB SDR), and Bruckner (−4.7 dB SDR); for the further analysis we consider it an outlier and exclude it.

Second, we want to evaluate the usefulness of note refinement for source separation. As seen in Section 4, the gains for the NMF separation are initialized with score information, or with a refined score, as described in Section 4.2. A summary of the different initialization options is shown in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization with the ground truth annotations (Figure 7, GT), the direct output of the score alignment system (Figure 7, Ali), the common practice of NMF gains initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7, Ext), and the refinement approaches (Figure 7, Ref1 and Ref2). Note that Ref1 is the refinement done with the method in Section 4.1, and Ref2 with the multichannel method described in Section 5.2.

We test the difference between the binarization thresholds 0.5 and 0.1 used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives an F-value of 0.0796 and a p value of 0.7778, which shows no significant difference between the two binarization thresholds.
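Such a test is a one-liner with scipy; the SDR arrays below are hypothetical placeholders for the per-track values obtained with each threshold.

from scipy.stats import f_oneway

# Hypothetical per-track SDR values (dB) for the two binarization thresholds
sdr_threshold_05 = [4.1, 2.3, -0.5, 3.2, 1.8]
sdr_threshold_01 = [4.0, 2.6, -0.3, 3.1, 2.0]

f_value, p_value = f_oneway(sdr_threshold_05, sdr_threshold_01)
print(f"F = {f_value:.4f}, p = {p_value:.4f}")  # p > 0.05: no significant difference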

The results for the five initializations, GT, Ref1, Ref2, Ext, and Ali, are presented in Figure 8 for each of the four pieces. Note that for Ref1, Ref2, and Ext we aggregate information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate between the instruments: the piece by Mahler has the worst results, and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors, such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues we described in Section 6.1, can increase the difficulty of separation, up to the point that Ref1, Ref2, and Ext bring only a minimal improvement. To that extent, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps improve the separation, increasing the SDR by 1–1.5 dB or more for all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase the SDR for most of the pieces, with the exception of the piece by Mahler. This is due to an increase of SIR and a decrease of the interferences in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext; for this piece, the difference in SIR is around 2 dB. Then, for Beethoven, Ref1 and Ref2 gain 0.5 dB in terms of SDR when compared to Ext, and 1.5 dB in SIR. For Bruckner, solely Ref2 has a higher SDR; however, SIR increases by 1.5 dB with both Ref1 and Ref2. Note that Ref1 and Ref2 not only refine the time boundaries of the notes: the refinement also happens in frequency, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the estimation of the note offsets, INT and NEX, and of the tolerance window sizes, T1 and T2, which account for errors in the alignment. Note that for this comparison we do not include the refinement in the results: we evaluate only the case Ext, in order to isolate the influence of T1 and T2. The results are presented in Figure 9 and show that the best results are obtained for the interpolation of the offsets, INT. This relates to the results presented in Section 6.5.1. Similarly to the analysis regarding the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to draw a conclusion on which strategy is better for the initialization, as the error bars for the ground truth overlap with the ones of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gains estimation, as proposed in Section 5.1, and the single channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR, and we obtain an F-value of 1.712 and a p value of 0.1908. Hence, there is no significant difference between single channel and multichannel gain estimation when we are not postprocessing the gains with note refinement. However, although the new update rules do not help by themselves in the multichannel case, we are able to better refine the gains: in this case we aggregate information over all the channels, and blob detection is more robust, even to variations of the binarization threshold. On account of that, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available (http://repovizz.upf.edu/phenicx/anechoic_multi).

Figure 8: Results in terms of SDR, SIR, ISR, and SAR (dB) for the NMF gains initialization in the different test cases (GT, Ref1, Ref2, Ext, and Ali), for each of the four pieces.

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation

is Instrument Emphasis, which aims at processing multitrack orchestral recordings. Once we have the separated tracks, it allows emphasizing a particular instrument over the full orchestra downmix recording. Our workflow consists in processing multichannel orchestral recordings for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of other sections, obtaining an enhanced signal for the selected instrument.

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user when uploading media content and for presenting the results. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, the audio files are split into blocks. After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks. The separation quality is not degraded if the blocks have a sufficient duration; in our case, we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx) and are integrated into the PHENICX prototype (http://phenicx.com).

7.2. Acoustic Rendering. The second application for the separated material is augmented or virtual reality scenarios, where we apply a spatialization of the separated musical sources. Acoustic Rendering aims at recreating acoustically the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage relative to the listener.

We have considered binaural synthesis as the most suitable spatial audio technique for this application. Humans locate the direction of an incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound will arrive with a different intensity and at different time instances at both ears.


Figure 9: Results in terms of SDR (dB) for the combination of the different offset estimation methods (INT and NEX) and the different sizes of the tolerance window for note onsets and offsets (T1 and T2); GT is the ground truth alignment and Ali is the output of the score alignment.

The idea behind binaural synthesis is to artificially generate these cues in order to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype, we have used the noncommercial binaural synthesis plugin for Unity3D provided by 3DCeption (https://twobigears.com/index.php).
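The two cues can be sketched as follows; this is a crude illustration only, since proper binaural synthesis, as used in the application, relies on measured HRTFs, and the cue magnitudes and sign convention below are our own assumptions.

import numpy as np

def crude_binaural(mono, azimuth_deg, sr=44100):
    """Apply interaural time (ITD) and level (ILD) differences to a mono
    signal. Positive azimuth places the source to the listener's left."""
    az = np.radians(azimuth_deg)
    itd = 0.0007 * np.sin(az)                      # up to ~0.7 ms delay
    ild = 10.0 ** (-6.0 * abs(np.sin(az)) / 20.0)  # far ear up to ~6 dB quieter

    def delayed(sig, seconds):
        n = int(round(seconds * sr))
        return np.concatenate([np.zeros(n), sig])[:len(sig)]

    if azimuth_deg >= 0:   # source on the left: right ear lags, attenuated
        left, right = mono, delayed(mono, itd) * ild
    else:                  # source on the right: left ear lags, attenuated
        left, right = delayed(mono, -itd) * ild, mono
    return np.stack([left, right])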

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker: https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of the musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances), for the instrument visualization and Acoustic Rendering use-cases introduced above.

As inputs for the source localization method, we need the multichannel recordings and the approximate positions of the microphones on stage. In concert halls, the recording setup typically consists of a grid structure of hanging overhead microphones; their positions are therefore kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, it being currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes the refined score alignment obtained prior to the signal separation (see Section 4.1) and follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then, the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible locations of a sound source based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate the list of time delay values corresponding to all note onsets in the score and take the maximum of their histogram. In our experiment we have a time resolution of 28 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play in isolation, as SRP-PHAT [45] does, and can therefore be used in complex scenarios.
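The per-pair delay estimation can be sketched as below, under the assumption that the refined alignment provides, for each note, an onset time per microphone:

import numpy as np

def pairwise_delay(onsets_mic_a, onsets_mic_b, resolution=0.028):
    """TDOA estimate for one instrument and one microphone pair: histogram
    the per-note onset differences given by the refined alignment and take
    the mode (resolution = 28 ms, the Fourier transform hop size)."""
    deltas = np.asarray(onsets_mic_a) - np.asarray(onsets_mic_b)
    edges = np.arange(deltas.min(), deltas.max() + 2 * resolution, resolution)
    hist, edges = np.histogram(deltas, bins=edges)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])  # center of the most populated bin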

8. Outlook

In this paper we proposed a framework for the score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then, we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings has been objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment, which improved the separation for three of the four pieces in the dataset when compared to the other two initialization options for our framework: the raw output of the score alignment, and the tolerance window on which the alignment relies.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio, or in problems with the recovery of the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done on a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach for source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone with which to separate a source is the closest one, as determined by the localization method.


When looking at the separation across instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies within the pieces and the instruments, and future research could provide more insight into this problem. Note that an increasing degree of consonance was related to a more difficult case for source separation [17]. Hence, we could expect a worse separation for instruments which are harmonizing or accompanying other sections, as is the case of viola, cello, and double bass in some pieces. Future research could give more insight into the relation between the musical characteristics of the pieces (e.g., tonality and texture) and the source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework, related to Instrument Emphasis or Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustín Martorell for his insights on the musicology side and for correcting the scores, and Oscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference, 2013.
[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.
[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.
[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal on Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.
[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.
[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodriguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.
[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.
[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.
[9] F. J. Rodriguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.
[10] F. J. Rodriguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.
[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, October 2015.
[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.
[13] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, 2015.
[14] O. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.
[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.
[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, Taipei, Taiwan, April 2009.
[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.
[18] F. J. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.
[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.
[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.
[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.
[22] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.
[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.
[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.
[25] B. Niedermayer, Accurate audio-to-score alignment—data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.
[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.
[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.
[28] D. Fitzgerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.
[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Málaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.
[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.
[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.
[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.
[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.
[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.
[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.
[38] M. Mauch and S. Dixon, "PYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.
[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.
[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.
[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.
[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.
[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.
[45] A. Martí Guerola, Multichannel audio processing for speaker localization, separation and enhancement, Universitat Politècnica de València, 2013.


Page 10: Research Article Score-Informed Source Separation for ... · is paper proposes a system for score-informed audio source separation for multichannel orchestral recordings. e orchestral

10 Journal of Electrical and Computer Engineering

Table 2: Room surface absorption coefficients.

Standard measurement frequencies (Hz)      125    250    500    1000   2000   4000
Absorption of wall in x = 0 plane          0.4    0.3    0.3    0.3    0.2    0.1
Absorption of wall in x = Lx plane         0.4    0.45   0.35   0.35   0.45   0.3
Absorption of wall in y = 0 plane          0.4    0.45   0.35   0.35   0.45   0.3
Absorption of wall in y = Ly plane         0.4    0.45   0.35   0.35   0.45   0.3
Absorption of floor (z = 0 plane)          0.5    0.6    0.7    0.8    0.8    0.9
Absorption of ceiling (z = Lz plane)       0.4    0.45   0.35   0.35   0.45   0.3

Figure 5: The reverberation time RT60 (in seconds) versus frequency (Hz) for the simulated room.

Using the configuration file and the anechoic audio files corresponding to the isolated sources, Roomsim generates the audio files for each of the microphones, along with the impulse responses for each pair of instruments and microphones. The impulse responses and the anechoic signals are used during the evaluation to obtain the ground truth spatial image of the sources in the corresponding microphone. Additionally, in Figure 5 we plot the reverberation time RT60 [31] across frequencies, as computed by Roomsim.

We need to adapt the ground truth annotations to the audio generated with Roomsim, as the original annotations were done on the isolated audio files. Roomsim creates the audio for a given microphone by convolving each of the sources with the corresponding impulse response and then summing up the results of the convolution. We compute a delay for each pair of microphones and instruments by taking the position of the maximum value in the associated impulse response vector. Then, we generate a score for each of the pairs by adding the corresponding delay to the note onsets. Additionally, since the offset time depends on the reverberation and on the frequencies of the notes, we add 0.8 s to each note offset to account for the reverberation, besides the added delay.
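
A minimal sketch of this adaptation step is given below, assuming the impulse responses are available as numpy arrays and the annotations as (onset, offset) pairs in seconds; the sample rate and the function names are illustrative, not part of the framework.

import numpy as np

SR = 44100          # sample rate of the simulated recordings (assumption)
REVERB_TAIL = 0.8   # seconds added to each offset for the reverberation

def channel_delay(impulse_response, sr=SR):
    """Delay of an instrument in a microphone: the position of the
    maximum value in the associated impulse response vector."""
    return np.argmax(impulse_response) / float(sr)

def adapt_score(notes, impulse_response, sr=SR):
    """Shift the annotated (onset, offset) pairs by the estimated delay
    and extend the offsets to cover the reverberation tail."""
    delay = channel_delay(impulse_response, sr)
    return [(onset + delay, offset + delay + REVERB_TAIL)
            for onset, offset in notes]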

6.4. Evaluation Methodology

6.4.1. Parameter Selection. In this paper we use a low-level spectral representation of the audio data, which is generated from a windowed FFT of the signal. We use a Hanning window with a size of 92 ms and a hop size of 11 ms.

Here, a logarithmic frequency discretization is adopted. Furthermore, two time-frequency resolutions are used. First, to estimate the instrument models and the panning matrix, a single semitone resolution is proposed. In particular, we implement the time-frequency representation by integrating the STFT bins corresponding to the same semitone. Second, for the separation task, a higher resolution of 1/4 of a semitone is used, which has proven to achieve better separation results [6]. The time-frequency representation is obtained by integrating the STFT bins corresponding to 1/4 of a semitone. Note that, in the separation stage, the learned basis functions b_jn(f) are adapted to the 1/4-semitone resolution by replicating the basis at each semitone to the 4 samples of the 1/4-semitone resolution that belong to this semitone. For image binarization, we pick for the first Gaussian φ = 3 as the standard deviation, and for the second Gaussian κ = 4 as the position of the central frequency bin and ν = 4 as the standard deviation corresponding to one semitone.
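
A sketch of these two operations, integrating STFT bins per logarithmic band and replicating the learned bases onto the finer grid, could look as follows; fmin and the exact band mapping are assumptions for illustration.

import numpy as np

def band_map(n_fft, sr, bins_per_semitone=1, fmin=27.5):
    """Assign each FFT bin to a (fraction-of-a-)semitone band."""
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    bands = np.full(freqs.shape, -1, dtype=int)  # -1: below fmin, discarded
    ok = freqs >= fmin
    bands[ok] = np.round(12 * bins_per_semitone *
                         np.log2(freqs[ok] / fmin)).astype(int)
    return bands

def integrate_bins(mag_spec, bands):
    """Sum the STFT magnitude bins that fall into the same band."""
    out = np.zeros((bands.max() + 1, mag_spec.shape[1]))
    for b in range(out.shape[0]):
        out[b] = mag_spec[bands == b].sum(axis=0)
    return out

def replicate_bases(b_semitone):
    """Adapt semitone-resolution bases to the 1/4-semitone grid by
    replicating each semitone row over its 4 quarter-semitone samples."""
    return np.repeat(b_semitone, 4, axis=0)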

We picked 10 iterations for the NMF, and we set the beta-divergence distortion to β = 1.3, as in [6, 11].
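
For reference, the standard multiplicative update of the gains under the beta-divergence [34], with the basis functions kept fixed, is sketched below; this is a generic NMF update under the stated settings, not the exact parametric model of the framework.

import numpy as np

def update_gains(V, B, G, beta=1.3, n_iter=10, eps=1e-12):
    """Multiplicative beta-divergence updates of the gains G with fixed
    bases B, for the approximation V ~ B @ G."""
    for _ in range(n_iter):
        V_hat = B @ G + eps
        G *= (B.T @ (V * V_hat ** (beta - 2))) / (B.T @ (V_hat ** (beta - 1)) + eps)
    return G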

6.4.2. Evaluation Setup. We perform three different kinds of evaluation: audio-to-score alignment, panning matrix estimation, and score-informed source separation. Note that, for the alignment, we evaluate the state-of-the-art system in [13]. This method does not align notes but combinations of notes in the score (aka states). Here, the alignment is performed with respect to a single audio channel, corresponding to the microphone situated in the center of the stage. On the other hand, the offsets are estimated by shifting the original duration for each note in the score [13] or by assigning the offset time as the onset of the next state. We denote these two cases as INT and NEX, respectively.

Regarding the initialization of the separation framework, we can use the raw output of the alignment system. However, as stated in Section 4 and [3, 5, 24], a better option is to extend the onsets and offsets along a tolerance window, to account for the errors of the alignment system itself and for the delays between the center channel (on which the alignment is performed) and the other channels. Thus, we test two hypotheses regarding the tolerance window for the possible errors. In the first case we extend the boundaries with 0.3 s for onsets and 0.6 s for offsets (T1), and in the second with 0.6 s for onsets and 1 s for offsets (T2). Note that the value of 0.3 s for the onset times is not arbitrary but is the usual threshold for onsets in the MIREX evaluation of real-time score-following [40].


Table 3: Score information used for the initialization of score-informed source separation.

Tolerance window size                  Offset estimation
T1: onsets 0.3 s, offsets 0.6 s        INT: interpolation of the offset time
T2: onsets 0.6 s, offsets 0.9 s        NEX: the offset is the onset of the next note

Two different tolerance windows were tested to account for the complexity of this novel scenario. The tolerance window is slightly larger for offsets due to the reverberation time and because the ending of a note is not as clear as its onset. A summary of the score information used to initialize the source separation framework is found in Table 3.

We label the test case corresponding to the initialization with the raw output of the alignment system as Ali. Conversely, the test case corresponding to the tolerance window initialization is labeled as Ext. Furthermore, within the tolerance window, we can refine the note onsets and offsets with the methods in Section 4.1 (Ref1) and Section 5.2 (Ref2), resulting in two other test cases. Since method Ref1 can only refine the score to a single channel, its results are solely computed with respect to that channel. For the multichannel refinement Ref2, we report the results of the alignment of each instrument with respect to each microphone. A graphic of the initialization of the framework with the four test cases listed above (Ali, Ext, Ref1, and Ref2), along with the ground truth score initialization (GT), is found in Figure 7, and we present the results for these cases in terms of source separation.

In order to evaluate the panning matrix estimation stage, we compute an ideal panning matrix based on the impulse responses generated by Roomsim during the creation of the multichannel audio (see Section 6.3). The ideal panning matrix gives the ideal contribution of each instrument in each channel, and it is computed by searching for the maximum in the impulse response vector corresponding to each instrument-channel pair, as in

m_ideal(i, j) = max_t (IR(i, j)(t)),  (19)

where IR(i, j)(t) is the impulse response of source i in channel j. By comparing the estimated matrix m(i, j) with the ideal one, m_ideal(i, j), we can determine whether the algorithm picked a wrong channel for separation.
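
This check can be sketched as follows, assuming the impulse responses are stored as numpy arrays indexed by source and channel; the function names are illustrative.

import numpy as np

def ideal_panning_matrix(irs):
    """Equation (19): m_ideal(i, j) = max_t IR(i, j)(t), where irs[i][j]
    is the impulse response of source i in channel j."""
    m = np.zeros((len(irs), len(irs[0])))
    for i, source_irs in enumerate(irs):
        for j, ir in enumerate(source_irs):
            m[i, j] = np.max(ir)
    return m

def wrongly_assigned_sources(m_est, m_ideal):
    """Indices of sources for which the estimated matrix picks a channel
    different from the one where the source has the most energy."""
    return np.flatnonzero(m_est.argmax(axis=1) != m_ideal.argmax(axis=1))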

6.4.3. Evaluation Metrics. For score alignment we are interested in a measure which relates to source separation and which accounts for the audio frames that are correctly detected, rather than an alignment rate computed per note onset as found in [13]. Thus, we evaluate the alignment at the frame level rather than at the note level. A similar reasoning on the evaluation of score alignment is found in [4].

We consider 0.011 s, the temporal granularity for this measure, as the size of a frame. Then, a frame of a musical note is considered a true positive (tp) if it is found in the ground truth score and in the aligned score within the exact time boundaries.

The same frame is labeled as a false positive (fp) if it is found only in the aligned score, and as a false negative (fn) if it is found only in the ground truth score. Since the gains initialization is done with score information (see Section 4), lost frames (recall) and incorrectly detected frames (precision) impact the performance of the source separation algorithm: precision is defined as p = tp/(tp + fp) and recall as r = tp/(tp + fn). Additionally, we compute the harmonic mean of precision and recall to obtain the F-measure as F = 2 · (p · r)/(p + r).
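
These frame-level scores can be computed in a few lines; the sketch below assumes both scores are given as (pitch, onset, offset) triples in seconds.

def frame_level_scores(gt_notes, aligned_notes, frame=0.011):
    """Frame-level precision, recall, and F-measure between a ground
    truth score and an aligned score."""
    def active_frames(notes):
        frames = set()
        for pitch, onset, offset in notes:
            for k in range(int(onset / frame), int(offset / frame) + 1):
                frames.add((pitch, k))
        return frames

    gt, al = active_frames(gt_notes), active_frames(aligned_notes)
    tp, fp, fn = len(gt & al), len(al - gt), len(gt - al)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f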

The source separation evaluation framework and the metrics employed are described in [41, 42]. Correspondingly, we use Source to Distortion Ratio (SDR), Source to Interference Ratio (SIR), Source to Artifacts Ratio (SAR), and source Image to Spatial distortion Ratio (ISR). While SDR measures the overall quality of the separation and ISR the spatial reconstruction of the source, SIR is related to the rejection of interferences and SAR to the absence of forbidden distortions and artifacts.

The evaluation of source separation is a computationally intensive process. Additionally, processing the long audio files in the dataset would require large amounts of memory for the matrix calculations. To reduce the memory requirements, the evaluation is performed on blocks of 30 s with 1 s overlap to allow for continuation.

6.5. Results

6.5.1. Score Alignment. We evaluate the output of the alignment, Ali, along with the estimation of the note offsets, INT and NEX, in terms of F-measure (see Section 6.4.3), precision, and recall, ranging from 0 to 1. Additionally, we evaluate the optimal size for extending the note boundaries along the onsets and offsets, T1 and T2, for the refinement methods Ref1 and Ref2 and the baseline Ext. Since there are large differences between the pieces, we report the results individually per piece in Table 4.

Methods Ref1 and Ref2 depend on a binarization threshold which determines how much energy is set to zero. A lower threshold will result in the consolidation of larger blobs in blob detection. In [11], this threshold is set to 0.5 for a dataset of monaural recordings of Bach chorales played by four instruments. However, we are facing a multichannel scenario where capturing the reverberation is important, especially when we consider that the offsets were annotated with a low energy threshold. Thus, we are interested in losing the least energy possible, and we set lower values for the threshold: 0.3 and 0.1. Consequently, when analyzing the results, the lower threshold achieves better performance in terms of F-measure for Ref1 (0.67 and 0.72) and for Ref2 (0.71 and 0.75).

According to Table 4, extending note offsets (NEX) rather than interpolating them (INT) gives lower recall in all pieces, and the method leads to losing more frames, which cannot be recovered even by extending the offset times in T2: NEX T2 always yields a lower recall when compared to INT T2 (e.g., r = 0.85 compared to r = 0.97 for Mozart).

The output of the alignment system, Ali, is not a good option to initialize the gains of the source separation system. It has a high precision and a very low recall (e.g., the case INT Ali has p = 0.95 and r = 0.39, compared to the case INT Ext, which has p = 0.84 and r = 0.84, for Beethoven).


Table 4: Alignment evaluated in terms of F-measure (F), precision (p), and recall (r), ranging from 0 to 1.

                    Mozart            Beethoven         Mahler            Bruckner
                 F     p     r     F     p     r     F     p     r     F     p     r
INT  Ali       0.69  0.93  0.55  0.56  0.95  0.39  0.61  0.85  0.47  0.60  0.94  0.32
     T1 Ref1   0.77  0.77  0.76  0.79  0.81  0.77  0.63  0.74  0.54  0.72  0.73  0.70
     T1 Ref2   0.83  0.79  0.88  0.82  0.82  0.81  0.67  0.77  0.60  0.81  0.77  0.85
     T1 Ext    0.82  0.73  0.94  0.84  0.84  0.84  0.77  0.69  0.87  0.86  0.78  0.96
     T2 Ref1   0.77  0.77  0.76  0.76  0.75  0.77  0.63  0.74  0.54  0.72  0.70  0.71
     T2 Ref2   0.83  0.78  0.88  0.82  0.76  0.87  0.67  0.77  0.59  0.79  0.73  0.86
     T2 Ext    0.72  0.57  0.97  0.79  0.70  0.92  0.69  0.55  0.93  0.77  0.63  0.98
NEX  Ali       0.49  0.94  0.33  0.51  0.89  0.36  0.42  0.90  0.27  0.48  0.96  0.44
     T1 Ref1   0.70  0.77  0.64  0.72  0.79  0.66  0.63  0.74  0.54  0.68  0.72  0.64
     T1 Ref2   0.73  0.79  0.68  0.71  0.79  0.65  0.66  0.77  0.58  0.73  0.74  0.72
     T1 Ext    0.73  0.77  0.68  0.71  0.80  0.65  0.69  0.75  0.64  0.76  0.79  0.72
     T2 Ref1   0.74  0.78  0.71  0.72  0.74  0.70  0.63  0.74  0.54  0.72  0.79  0.72
     T2 Ref2   0.80  0.80  0.80  0.75  0.75  0.75  0.67  0.77  0.59  0.79  0.73  0.86
     T2 Ext    0.73  0.65  0.85  0.73  0.69  0.77  0.72  0.64  0.82  0.79  0.69  0.91


Table 5: Instruments for which the closest microphone was incorrectly determined, for different score information (GT, Ali, T1, T2, INT, and NEX) and the two room setups.

         GT        INT Ali                INT T1           INT T2                 NEX Ali                NEX T1   NEX T2
Setup 1  Clarinet  Clarinet, double bass  Clarinet, flute  Clarinet, flute, horn  Clarinet, double bass  —        Clarinet, flute
Setup 2  Cello     Cello, flute           Cello, flute     Cello, flute           Bassoon                Flute    Cello, flute

For the case of Beethoven, the output of the alignment is particularly poor compared to the other pieces. However, by extending the boundaries (Ext) and applying note refinement (Ref1 or Ref2), we are able to increase the recall and match the performance on the other pieces.

When comparing the sizes of the tolerance window for the onsets and offsets, we observe that the alignment is more accurate at detecting the onsets within 0.3 s and the offsets within 0.6 s. In Table 4, T1 achieves better results than T2 (e.g., F = 0.77 for T1 compared to F = 0.69 for T2, for Mahler). Relying on a large window retrieves more frames but also significantly damages the precision. However, when considering source separation, we might want to lose as little information as possible. It is in this case that the refinement methods Ref1 and Ref2 show their importance: when facing larger time boundaries, as with T2, Ref1 and especially Ref2 are able to reduce the errors, achieving better precision with a minimum loss in recall.

The refinement Ref1 has a worse performance than Ref2, the multichannel refinement (e.g., F = 0.72 compared to F = 0.81 for Bruckner, INT T1). Note that, in its original version [11], Ref1 assumed monophony within a source, as it was tested in the simple case of the Bach10 dataset [4]. To that extent, it relied on a graph computation to determine the best distribution of blobs. However, due to the increased polyphony within an instrument (e.g., violins playing divisi), with simultaneous melodic lines, we disabled this feature; in this case Ref1 has lower recall, as it loses more frames. On the other hand, Ref2 is more robust because it computes a blob estimation per channel, and averaging out these estimations yields better results.

The refinement works worse for the more complex pieces (Mahler and Bruckner) than for the simpler pieces (Mozart and Beethoven). Increasing the polyphony within a source and the number of instruments with many interleaving melodic lines (a less sparse score) also makes the task more difficult.

6.5.2. Panning Matrix. Estimating the panning matrix correctly is an important step in the proposed method, since Wiener filtering is performed on the channel where the instrument has the most energy. If the algorithm picks a different channel for this step, we can find more interference between instruments in the separated audio files.

As described in Section 3.1, the estimation of the panning matrix depends on the number of nonoverlapping partials of the notes found in the score and on their alignment with the audio. To that extent, the more nonoverlapping partials we have, the more robust the estimation.

Initially, we experimented with computing the panning matrix separately for each piece. In the case of Bruckner, the piece is simply too short, and there are too few nonoverlapping partials to yield a good estimation, resulting in errors in the panning matrix. Since the instrument setup is the same for the Bruckner, Beethoven, and Mahler pieces (10 sources in the same positions on the stage), we decided to jointly estimate the matrix for the concatenated audio pieces and the associated scores. We denote as Setup 1 the Mozart piece, played by 8 sources, and as Setup 2 the Beethoven, Mahler, and Bruckner pieces, played by 10 sources.

Since the panning matrix is computed using the score, different score information can yield very different estimates of the panning matrix. To that extent, we evaluate the influence of audio-to-score alignment, namely, the cases INT, NEX, Ali, T1, and T2, and the initialization with the ground truth score information, GT.

In Table 5 we list the instruments for which the algorithm picked the wrong channel. Note that, in the room setup generated with Roomsim, most of the instruments in Table 5 are placed close to other sources from the same family of instruments, for example, cello and double bass, or flute with clarinet, bassoon, and oboe. In this case, the algorithm makes more mistakes when selecting the correct channel to perform source separation.

In the column GT of Table 5 we can see that having a perfectly aligned score yields fewer errors when estimating the panning matrix. Conversely, in a real-life scenario we cannot rely on a hand-annotated score. In this case, for all columns of Table 5 excluding GT, the best estimation is obtained by the combination of NEX and T1: taking the offset time as the onset of the next note and then extending the score with the smaller window.

Furthermore, we compute the SDR values for the instruments in Table 5, column GT (clarinet and cello), for the cases where the separation is done in the correct channel and in the estimated channel. For Setup 1, the channel for the clarinet is wrongly assigned to WWL (woodwinds left), the correct one being WWR (woodwinds right), when we have a perfectly aligned score (GT). However, the microphones WWL and WWR are very close (see Figure 4), they do not capture significant energy from other instrument sections, and the SDR difference is less than 0.01 dB. However, in Setup 2, the cello is wrongly separated in the WWL channel, and the SDR difference between this audio and the audio separated in the correct channel is around −1.1 dB for each of the three pieces.

6.5.3. Source Separation. We use the evaluation metrics described in Section 6.4.3. Since there is a lot of variability between the four pieces, it is more informative to present the results per piece rather than aggregating them.


Figure 6: Results in terms of SDR, SIR, SAR, and ISR for the instruments and the pieces in the dataset.

First, we analyze the separation results per instrument in an ideal case. We assume that the best results for score-informed source separation are obtained in the case of a perfectly aligned score (GT). Furthermore, for this case we calculate the separation in the correct channel for all the instruments, since in Section 6.5.2 we saw that picking a wrong channel can be detrimental. We present the results as a bar plot in Figure 6.

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6 we can see that the more complex the piece is, the more difficult it is to achieve a good separation. For instance, note that cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for the cello in Mozart compared to −5 dB in Mahler). Cello and double bass are close to each other in both of the setups, and similarly for clarinet and flute, so we expect interference between them.

Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in the lower SIR values for the double bass (5.5 dB SIR for Mozart but −1.8, −0.1, and −0.2 dB SIR for the others) and the flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask, the multichannel spectrogram is multiplied with the panning matrix. Thus, wrong values in this matrix can yield wrong amplitude values in the resulting signals.

This is the case for the trumpet, which is allocated a close microphone in the current setup and for which we expect a good separation. However, the trumpet achieves a poor ISR (5.5, 1, and −1 dB) but has a good separation in terms of SIR and SAR. Similarly, other instruments, such as cello, double bass, flute, and viola, face the same problem, particularly for the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.


Figure 7: The test cases for the initialization of score-informed source separation for the submatrix p_sjn(t) of note s: ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).

A low SDR is obtained in the case of Mahler, and this is related to the poor alignment results obtained for this piece. As seen in Table 4, for the INT case, the F-measure is almost 8% lower in Mahler than in the other pieces, mainly because of the bad precision.

The results are considerably worse for the double bass for the more complex pieces of Beethoven (−9.1 dB SDR), Mahler (−8.5 dB SDR), and Bruckner (−4.7 dB SDR); for further analysis, we consider it an outlier and exclude it from the analysis.

Second, we want to evaluate the usefulness of note refinement in source separation. As seen in Section 4, the gains for NMF separation are initialized with score information or with a refined score, as described in Section 4.2. A summary of the different initialization options is given in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization with the ground truth annotations (Figure 7, GT), the direct output of the score alignment system (Figure 7, Ali), the common practice of NMF gains initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7, Ext), and the refinement approaches (Figure 7, Ref1 and Ref2). Note that Ref1 is the refinement done with the method in Section 4.1 and Ref2 with the multichannel method described in Section 5.2.

We test the difference between the binarization thresholds 0.5 and 0.1 used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives an F-value of 0.0796 and a p value of 0.7778, which shows no significant difference between the two binarization thresholds.
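
Such a test takes only a few lines with scipy; the sketch below uses made-up SDR arrays purely for illustration.

from scipy.stats import f_oneway

# Hypothetical per-excerpt SDR values for the two binarization thresholds.
sdr_threshold_a = [4.1, 3.8, 2.9, 5.0, 1.2]
sdr_threshold_b = [4.0, 3.9, 3.1, 4.8, 1.1]

f_value, p_value = f_oneway(sdr_threshold_a, sdr_threshold_b)
# A p value above 0.05 suggests no significant difference between groups.
print(f_value, p_value)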

The results for the five initializations, GT, Ref1, Ref2, Ext, and Ali, are presented in Figure 8 for each of the four pieces. Note that, for Ref1, Ref2, and Ext, we aggregate information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate between the instruments, with the piece by Mahler having the worst results and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors, such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues we described in Section 6.1, can increase the difficulty of separation, up to the point that Ref1, Ref2, and Ext bring a minimal improvement.

To that extent, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps to improve the separation, increasing the SDR by 1–1.5 dB or more for all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase the SDR for most of the pieces, with the exception of the piece by Mahler. This is due to an increase in SIR and a decrease of the interferences in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext; for this piece, the difference in SIR is around 2 dB. Then, for Beethoven, Ref1 and Ref2 gain 0.5 dB in terms of SDR when compared to Ext, and 1.5 dB in SIR. For Bruckner, solely Ref2 has a higher SDR; however, SIR increases by 1.5 dB for both Ref1 and Ref2. Note that not only do Ref1 and Ref2 refine the time boundaries of the notes, but the refinement also happens in frequency, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the estimation of the note offsets, INT and NEX, and of the tolerance window sizes, T1 and T2, which account for errors in the alignment. For this case we evaluate only the case Ext, leaving out the refinement in order to isolate the influence of T1 and T2. Results are presented in Figure 9 and show that the best results are obtained for the interpolation of the offsets, INT. This relates to the results presented in Section 6.5.1. Similarly to the analysis regarding the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to draw a conclusion on which strategy is better for the initialization, as the error bars for the ground truth overlap with the ones of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gains estimation, as proposed in Section 5.1, and the single-channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR, and we obtained an F-value of 1.712 and a p value of 0.1908. Hence, there is no significant difference between single-channel and multichannel gain estimation when we do not perform postprocessing of the gains using note refinement. However, although the new update rules do not help by themselves in the multichannel case, we are able to better refine the gains: in this case we aggregate information over all the channels, and blob detection is more robust, even to variations of the binarization threshold. Accordingly, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available online (http://repovizz.upf.edu/phenicx/anechoic_multi).

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation is Instrument Emphasis, which aims at processing multitrack orchestral recordings.


Figure 8: Results in terms of SDR, SIR, SAR, and ISR for the NMF gains initialization in the different test cases (GT, Ref1, Ref2, Ext, and Ali) for each piece.

Once we have the separated tracks, it is possible to emphasize a particular instrument over the full orchestra downmix recording. Our workflow consists in processing multichannel orchestral recordings for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of the other sections, obtaining an enhanced signal for the selected instrument.

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user when uploading media content and for presenting the results to the user. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks.

After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks. The separation quality is not degraded if the blocks have a sufficient duration; in our case, we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx) and are integrated into the PHENICX prototype (http://phenicx.com).
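
The block-wise workflow can be sketched as follows, assuming the mix is a numpy array of shape (channels, samples) and separate_fn encapsulates the separation of one block; both names are illustrative.

import numpy as np

BLOCK_SECONDS = 60  # quality is preserved for blocks of this duration

def separate_in_blocks(mix, sr, separate_fn):
    """Split a long recording into one-minute blocks, separate each block,
    and concatenate the per-instrument results into full-length tracks.
    separate_fn(block) returns a dict: instrument name -> separated audio."""
    hop = BLOCK_SECONDS * sr
    tracks = {}
    for start in range(0, mix.shape[-1], hop):
        block = mix[..., start:start + hop]
        for name, audio in separate_fn(block).items():
            tracks.setdefault(name, []).append(audio)
    return {name: np.concatenate(parts, axis=-1)
            for name, parts in tracks.items()}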

7.2. Acoustic Rendering. The second application for the separated material is augmented or virtual reality scenarios, where we apply a spatialization of the separated musical sources. Acoustic Rendering aims at recreating acoustically the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage and of the listener.

We have considered binaural synthesis as the most suitable spatial audio technique for this application. Humans locate the direction of an incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound will arrive with a different intensity and at different time instances at both ears.


Figure 9: Results in terms of SDR for the combination of the different offset estimation methods (INT and NEX) and the different sizes of the tolerance window for note onsets and offsets (T1 and T2), for each piece. GT is the ground truth alignment and Ali is the output of the score alignment.

The idea behind binaural synthesis is to artificially generate these cues in order to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype, we have used the noncommercial Unity3D plugin for binaural synthesis provided by 3DCeption (https://twobigears.com/index.php).

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker, https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of the musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances), for the instrument visualization and Acoustic Rendering use-cases introduced above.

As inputs for the source localization method, we need the multichannel recordings and the approximate positions of the microphones on stage. In concert halls, the recording setup typically consists of a grid structure of hanging overhead microphones. The positions of the overhead microphones are therefore kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, and it is currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes the refined score alignment obtained prior to the signal separation (see Section 4.1) and follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then, the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible locations of a sound source based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate the list of time delay values corresponding to all note onsets in the score and take the maximum of their histogram. In our experiment, we have a time resolution of 28 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play in isolation, as SRP-PHAT [45] does, and can therefore be used in complex scenarios.
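
The per-pair delay estimation can be sketched as below, assuming the aligned onset times of one instrument are available for two microphones; the maximum expected delay is an assumption.

import numpy as np

RESOLUTION = 0.028  # seconds, the time resolution mentioned above

def pairwise_delay(onsets_mic_a, onsets_mic_b, max_delay=0.1):
    """Time difference of arrival of one instrument between two
    microphones: histogram the per-note onset delays and take the mode."""
    delays = np.asarray(onsets_mic_b) - np.asarray(onsets_mic_a)
    bins = np.arange(-max_delay, max_delay + RESOLUTION, RESOLUTION)
    hist, edges = np.histogram(delays, bins=bins)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])  # centre of the fullest bin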

8. Outlook

In this paper we proposed a framework for the score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then, we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings is objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment, which improved the separation for three of the four pieces in the dataset when compared to the other two initialization options for our framework: the raw output of the score alignment and the tolerance window on which the alignment relies.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems in the recovery of the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done on a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach for source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone with which to separate a source is the closest one, as determined by the localization method.


When looking at the separation across instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies between the pieces and the instruments, and future research could provide more insight into this problem. Note that an increasing degree of consonance was related to a more difficult case for source separation [17]. Hence, we could expect a worse separation for instruments which are harmonizing or accompanying other sections, as is the case of viola, cello, and double bass in some pieces. Future research could give more insight into the relation between the musical characteristics of the pieces (e.g., tonality and texture) and the source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us to denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow for other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework, related to Instrument Emphasis or Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustín Martorell for his insights on the musicology side and for correcting the scores, and Oscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference, 2013.

[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.

[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.

[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal on Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.

[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.

[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodriguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.

[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.

[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.

[9] F. J. Rodriguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.

[10] F. J. Rodriguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.

[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, October 2015.

[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.

[13] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, 2015.

[14] O. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.

[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.

[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, IEEE, Taipei, Taiwan, April 2009.

[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.

[18] F. J. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.

[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.

[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.

[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.

[22] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.

[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.

[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.

[25] B. Niedermayer, Accurate audio-to-score alignment-data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.

[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.

[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.

[28] D. Fitzgerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.

[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Málaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.

[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.

[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.

[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.

[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.

[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.

[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.

[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.

[38] M. Mauch and S. Dixon, "PYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.

[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.

[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.

[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.

[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.

[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.

[45] A. Martí Guerola, Multichannel audio processing for speaker localization, separation and enhancement [Ph.D. thesis], Universitat Politècnica de València, 2013.

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 11: Research Article Score-Informed Source Separation for ... · is paper proposes a system for score-informed audio source separation for multichannel orchestral recordings. e orchestral

Journal of Electrical and Computer Engineering 11

Table 3 Score information used for the initialization of score-informed source separation

Tolerance window size Offset estimation

T1 onsets 03 s offsets 06 s INT interpolation of the offsettime

T2 onsets 06 s offsets 09 s NEX the offset is the onset ofthe next note

MIREX evaluation of real-time score-following [40] Twodifferent tolerance windows were tested to account for thecomplexity of this novel scenario The tolerance window isslightly larger for offsets due to the reverberation time andbecause the ending of the note is not as clear as its onsetA summary of the score information used to initialize thesource separation framework is found in Table 3

We label the test case corresponding to the initializationwith the raw output of the alignment system as Ali Con-versely the test case corresponding to the tolerance windowinitialization is labeled as Ext Furthermore within thetolerance window we can refine the note onsets and offsetswith the methods in Section 41 (Ref1) and Section 52 (Ref2)resulting in other two test cases Since method Ref1 can onlyrefine the score to a single channel the results are solelycomputed with respect to that channel For the multichannelrefinement Ref2 we report the results of the alignment ofeach instrument with respect to each microphone A graphicof the initialization of the framework with the four test caseslisted above (Ali Ext Ref1 and Ref2) along with the groundtruth score initialization (GT) is found in Figure 7 wherewe present the results for these cases in terms of sourceseparation

In order to evaluate the panning matrix estimation stagewe compute an ideal panning matrix based on the impulseresponses generated by Roomsim during the creation ofthe multichannel audio (see Section 63) The ideal panningmatrix gives the ideal contribution of each instrument in eachchannel and it is computed by searching the maximum in theimpulse response vector corresponding to each instrument-channel pair as in

119898ideal (119894 119895) = max (IR (119894 119895) (119905)) (19)

where IR(119894 119895)(119905) is the impulse response of source 119894 in channel119895 By comparing the estimated matrix 119898(119894 119895) with the idealone 119898ideal(119894 119895) we can determine if the algorithm picked awrong channel for separation

643 Evaluation Metrics For score alignment we are inter-ested in a measure which relates to source separation andaccounts for the audio frames which are correctly detectedrather than an alignment rate computed per note onset asfound in [13] Thus we evaluate the alignment at the framelevel rather than at a note level A similar reasoning on theevaluation of score alignment is found in [4]

We consider 0011 s the temporal granularity for thismeasure and the size of a frame Then a frame of a musicalnote is considered a true positive (tp) if it is found in theground truth score and in the aligned score in the exact

time boundaries The same frame is labeled as a false positive(fp) if it is found only in the aligned score and a falsenegative (fn) if it is found only in the ground truth scoreSince the gains initialization is done with score information(see Section 4) lost frames (recall) and incorrectly detectedframes (precision) impact the performance of the sourceseparation algorithm precision is defined as 119901 = tp(tp + fp)and recall as 119903 = tp(tp + fn) Additionally we compute theharmonic mean of precision and recall to obtain 119865-measureas 119865 = 2 sdot ((119901 sdot 119903)(119901 + 119903))

The source separation evaluation framework and metricsemployed are described in [41 42] Correspondingly weuse Source to Distortion Ratio (SDR) Source to InterferenceRatio (SIR) and Source to Artifacts Ratio (SAR) While SDRmeasures the overall quality of the separation and ISR thespatial reconstruction of the source SIR is related to rejectionof the interferences and SAR to the absence of forbiddendistortions and artifacts

The evaluation of source separation is a computationallyintensive process Additionally to process the long audio filesin the dataset would require large memory to perform thematrix calculations To reduce thememory requirements theevaluation is performed for blocks of 30 s with 1 s overlap toallow for continuation

65 Results

651 Score Alignment We evaluate the output of the align-ment Ali along the estimation of the note offsets INT andNEX in terms of 119865-measure (see Section 643) precisionand recall ranging from 0 to 1 Additionally we evaluatethe optimal size for extending the note boundaries along theonsets and offsets T1 and T2 for the refinement methodsRef1 and Ref2 and the baseline Ext Since there are lots ofdifferences between pieces we report the results individuallyper song in Table 4

Methods Ref1 and Ref2 depend on a binarization thresh-oldwhich determines howmuch energy is set to zero A lowerthreshold will result in the consolidation of larger blobs inblob detection In [11] this threshold is set to 05 for a datasetof monaural recordings of Bach chorales played by fourinstruments However we are facing a multichannel scenariowhere capturing the reverberation is important especiallywhen we consider that offsets were annotated with a lowenergy threshold Thus we are interested in losing the leastenergy possible and we set lower values for the threshold 03and 01 Consequently when analyzing the results a lowerthreshold achieves better performance in terms of119865-measurefor Ref1 (067 and 072) and for Ref2 (071 and 075)

According to Table 4 extending note offsets (NEX)rather than interpolating them (INT) gives lower recall inall pieces and the method leads to losing more frames whichcannot be recovered even by extending the offset times in T2NEX T2 yields always a lower recall when compared to INTT2 (eg 119903 = 085 compared to 119903 = 097 for Mozart)

The output of the alignment system Ali is not a goodoption to initialize the gains of source separation system Ithas a high precision and a very low recall (eg the case INTAli has 119901 = 095 and 119903 = 039 compared to case INT Ext

12 Journal of Electrical and Computer Engineering

Table4Alignm

entevaluated

interm

sof119865

-measureprecisio

nandrecallrang

ingfro

m0t

o1

Mozart

Beetho

ven

Mahler

Bruckn

er119865

119901119903

119865119901

119903119865

119901119903

119865119901

119903Ali

069

093

055

056

095

039

061

085

047

060

094

032

INT

T1Re

f1077

077

076

079

081

077

063

074

054

072

073

070

Ref2

083

079

088

082

082

081

067

077

060

081

077

085

Ext

082

073

094

084

084

084

077

069

087

086

078

096

T2Re

f1077

077

076

076

075

077

063

074

054

072

070

071

Ref2

083

078

088

082

076

087

067

077

059

079

073

086

Ext

072

057

097

079

070

092

069

055

093

077

063

098

Ali

049

094

033

051

089

036

042

090

027

048

096

044

NEX

T1Re

f1070

077

064

072

079

066

063

074

054

068

072

064

Ref2

073

079

068

071

079

065

066

077

058

073

074

072

Ext

073

077

068

071

080

065

069

075

064

076

079

072

T2Re

f1074

078

071

072

074

070

063

074

054

072

079

072

Ref2

080

080

080

075

075

075

067

077

059

079

073

086

Ext

073

065

085

073

069

077

072

064

082

079

069

091

Journal of Electrical and Computer Engineering 13

Table 5 Instruments for which the closest microphone was incorrectly determined for different score information (GT Ali T1 T2 INT andNEX) and two room setups

GT INT NEXAli T1 T2 Ali T1 T2

Setup 1 Clarinet Clarinet double bass Clarinet flute Clarinet flute horn Clarinet double bass Clarinet fluteSetup 2 Cello Cello flute Cello flute Cello flute Bassoon Flute Cello flute


When comparing the sizes of the tolerance window for the onsets and offsets, we observe that the alignment is more accurate when detecting the onsets within 0.3 s and the offsets within 0.6 s. In Table 4, T1 achieves better results than T2 (e.g., F = 0.77 for T1 compared to F = 0.69 for T2, Mahler). Relying on a large window retrieves more frames but also significantly damages the precision. However, when considering source separation, we might want to lose as little information as possible. It is in this special case that the refinement methods Ref1 and Ref2 show their importance. When facing larger time boundaries, as with T2, Ref1 and especially Ref2 are able to reduce the errors, achieving better precision with a minimal loss in recall.

The refinement Ref1 has a worse performance than Ref2, the multichannel refinement (e.g., F = 0.72 compared to F = 0.81 for Bruckner, INT T1). Note that in the original version [11] Ref1 assumed monophony within a source, as it was tested in the simple case of the Bach10 dataset [4]. To that extent, it relied on a graph computation to determine the best distribution of blobs. However, due to the increased polyphony within an instrument (e.g., violins playing divisi), with simultaneous melodic lines, we disabled this feature, and in this case Ref1 has lower recall: it loses more frames. On the other hand, Ref2 is more robust because it computes a blob estimation per channel. Averaging out these estimations yields better results.

The refinement works worse for the more complex pieces (Mahler and Bruckner) than for the simpler pieces (Mozart and Beethoven). Increasing the polyphony within a source and the number of instruments, which leads to many interleaving melodic lines and a less sparse score, also makes the task more difficult.

6.5.2. Panning Matrix. Estimating the panning matrix correctly is an important step in the proposed method, since Wiener filtering is performed on the channel where the instrument has the most energy. If the algorithm picks a different channel for this step, we can find more interference between instruments in the separated audio files.

As described in Section 3.1, the estimation of the panning matrix depends on the number of nonoverlapping partials of the notes found in the score and their alignment with the audio. To that extent, the more nonoverlapping partials we have, the more robust the estimation.
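A simplified sketch of this idea: accumulate the magnitude at each instrument's nonoverlapping partials in every channel and pick the channel with the most energy. The data layout and the way the partial bins are obtained are assumptions made for the example.

```python
import numpy as np

def estimate_panning(mag, partial_bins):
    """Estimate a panning matrix from nonoverlapping partials.

    mag: magnitude spectrograms, shape (n_channels, n_bins, n_frames).
    partial_bins: {instrument: [(bin, frame), ...]} taken from score notes
    whose partials do not overlap with those of other instruments.
    Returns the (n_instruments x n_channels) matrix and, per instrument,
    the channel holding the most energy.
    """
    instruments = sorted(partial_bins)
    panning = np.zeros((len(instruments), mag.shape[0]))
    for j, inst in enumerate(instruments):
        for b, t in partial_bins[inst]:
            panning[j] += mag[:, b, t]   # energy of this partial per channel
    panning /= panning.sum(axis=1, keepdims=True) + 1e-12
    return panning, panning.argmax(axis=1)
```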

Initially, we experimented with computing the panning matrix separately for each piece. In the case of Bruckner, the piece is simply too short, and there are too few nonoverlapping partials to yield a good estimation, resulting in errors in the panning matrix. Since the instrument setup is the same for the Bruckner, Beethoven, and Mahler pieces (10 sources in the same positions on the stage), we decided to jointly estimate the matrix for the concatenated audio pieces and the associated scores. We denote by Setup 1 the Mozart piece, played by 8 sources, and by Setup 2 the Beethoven, Mahler, and Bruckner pieces, played by 10 sources.

Since the panning matrix is computed using the score, different score information can yield very different estimations of the panning matrix. To that extent, we evaluate the influence of audio-to-score alignment, namely, the cases INT, NEX, Ali, T1, and T2, and the initialization with the ground truth score information, GT.

In Table 5 we list the instruments for which the algorithm picked the wrong channel. Note that in the room setup generated with Roomsim, most of the instruments in Table 5 are placed close to other sources from the same family of instruments, for example, cello and double bass, and flute with clarinet, bassoon, and oboe. In this case the algorithm makes more mistakes when selecting the correct channel to perform source separation.

In the column GT of Table 5 we can see that having a perfectly aligned score yields fewer errors when estimating the panning matrix. Conversely, in a real-life scenario we cannot rely on a hand-annotated score. In this case, for all columns of Table 5 excluding GT, the best estimation is obtained by the combination of NEX and T1: taking the offset time as the onset of the next note and then extending the score with a smaller window.

Furthermore, we compute the SDR values for the instruments in Table 5, column GT (clarinet and cello), if the separation is done in the correct channel or in the estimated channel. For Setup 1, the channel for clarinet is wrongly assigned to WWL (woodwinds left), the correct one being WWR (woodwinds right), when we have a perfectly aligned score (GT). However, the microphones WWL and WWR are very close (see Figure 4), they do not capture significant energy from other instrument sections, and the SDR difference is less than 0.01 dB. However, in Setup 2 the cello is wrongly separated in the WWL channel, and the SDR difference between this audio and the audio separated in the correct channel is around −1.1 dB for each of the three pieces.

6.5.3. Source Separation. We use the evaluation metrics described in Section 6.4.3. Since there is a lot of variability between the four pieces, it is more informative to present the results per piece rather than aggregating them.

Figure 6: Results in terms of SDR, SIR, SAR, and ISR for the instruments and the songs in the dataset (bar plots in dB per instrument: bassoon, cello, clarinet, double bass, flute, horn, oboe, trumpet, viola, and violin; one bar per piece: Mozart, Beethoven, Mahler, and Bruckner).

First, we analyze the separation results per instrument in an ideal case. We assume that the best results for score-informed source separation are obtained in the case of a perfectly aligned score (GT). Furthermore, for this case we calculate the separation in the correct channel for all the instruments, since in Section 6.5.2 we saw that picking a wrong channel can be detrimental. We present the results as a bar plot in Figure 6.

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6 we can see that the more complex the piece is, the more difficult it is to achieve a good separation. For instance, note that cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for cello in Mozart compared to −5 dB in Mahler). Cello and double bass are close by in both of the setups, and similarly for clarinet and flute, so we expect interference between them. Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in lower SIR values for double bass (5.5 dB SIR for Mozart but −1.8, −0.1, and −0.2 dB SIR for the others) and flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask, the multichannel spectrogram is multiplied with the panning matrix. Thus, wrong values in this matrix can yield wrong amplitude values for the resulting signals.
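Since equation (9) is not reproduced here, the following generic sketch only illustrates the kind of panning-scaled Wiener masking being discussed, for one channel and one source; the NMF channel model X_i ≈ Σ_k m[k, i] · W_k H_k is an assumption of the example.

```python
import numpy as np

def wiener_reconstruct(X_i, W, H, m, i, j, eps=1e-12):
    """Soft-mask source j in channel i under a panning-scaled NMF model.

    X_i: complex STFT of channel i, shape (n_bins, n_frames).
    W, H: lists of per-source basis and gain matrices.
    m: panning matrix with m[k, i] = weight of source k in channel i.
    """
    models = [m[k, i] * (W[k] @ H[k]) for k in range(len(W))]
    total = np.maximum(sum(models), eps)       # full channel model
    return (models[j] / total) * X_i           # masked complex spectrogram
```

A wrong value of m[j, i] directly scales the numerator of the mask, which distorts the amplitude of the recovered signal; this is one way the panning errors discussed above can degrade ISR.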

This is the case for trumpet, which is allocated a close microphone in the current setup and for which we expect a good separation. However, trumpet achieves a poor ISR (5.5, 1, and −1 dB) but has a good separation in terms of SIR and SAR. Similarly, other instruments, such as cello, double bass, flute, and viola, face the same problem, particularly for the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.


Figure 7: The test cases for the initialization of score-informed source separation for the submatrix p^s_{jn}(t) of note s: ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).

A low SDR is obtained in the case of Mahler, which is related to the poor alignment results obtained for this piece. As seen in Table 4, for the INT case, the F-measure is almost 8% lower in Mahler than in the other pieces, mainly because of the bad precision.

The results are considerably worse for double bass for the more complex pieces of Beethoven (−9.1 dB SDR), Mahler (−8.5 dB SDR), and Bruckner (−4.7 dB SDR), and for further analysis we consider it an outlier and exclude it from the analysis.

Second, we want to evaluate the usefulness of note refinement in source separation. As seen in Section 4, the gains for NMF separation are initialized with score information or with a refined score, as described in Section 4.2. A summary of the different initialization options is shown in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization with the ground truth annotations (Figure 7 (GT)), the direct output of the score alignment system (Figure 7 (Ali)), the common practice of NMF gains initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7 (Ext)), and the refinement approaches (Figure 7 (Ref1 and Ref2)). Note that Ref1 is the refinement done with the method in Section 4.1, and Ref2 with the multichannel method described in Section 5.2.

We test the difference between the binarization thresholds 0.5 and 0.1 used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives an F-value = 0.0796 and a p-value = 0.7778, which shows no significant difference between the two binarization thresholds.
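Such a test can be reproduced with scipy, as sketched below; the SDR values are placeholders for illustration, not the paper's measurements.

```python
from scipy import stats

# Illustrative SDR values for the two binarization thresholds (placeholders).
sdr_threshold_a = [2.1, 3.4, 1.8, 4.0]
sdr_threshold_b = [2.3, 3.1, 2.0, 4.2]

f_value, p_value = stats.f_oneway(sdr_threshold_a, sdr_threshold_b)
print(f"F = {f_value:.4f}, p = {p_value:.4f}")  # p > 0.05: no significant difference
```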

The results for the five initializations, GT, Ref1, Ref2, Ext, and Ali, are presented in Figure 8 for each of the four pieces. Note that for Ref1, Ref2, and Ext we aggregate information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate between the instruments, with the piece by Mahler having the worst results and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors, such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues we described in Section 6.1, can increase the difficulty of separation up to the point that Ref1, Ref2, and Ext bring a minimal improvement. To that extent, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps to improve the separation, increasing the SDR by 1–1.5 dB or more for all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase the SDR for most of the pieces, with the exception of the piece by Mahler. This is due to an increase of SIR and a decrease of the interferences in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext. For this piece, the difference in SIR is around 2 dB. Then, for Beethoven, Ref1 and Ref2 gain 0.5 dB in terms of SDR when compared to Ext, and 1.5 dB in SIR. For Bruckner, solely Ref2 has a higher SDR; however, SIR increases by 1.5 dB in Ref1 and Ref2. Note that not only do Ref1 and Ref2 refine the time boundaries of the notes, but the refinement also happens in frequency, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the estimation of the note offsets, INT and NEX, and of the tolerance window sizes, T1 and T2, which account for errors in the alignment. Note that for this case we do not include the refinement in the results, and we evaluate only the case Ext, as we leave out the refinement in order to isolate the influence of T1 and T2. The results are presented in Figure 9 and show that the best results are obtained for the interpolation of the offsets, INT. This relates to the results presented in Section 6.5.1. Similarly to the analysis regarding the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to draw a conclusion on which strategy is better for the initialization, as the error bars for the ground truth overlap with the ones of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gains estimation, as proposed in Section 5.1, and the single channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR, and we obtain an F-value = 1.712 and a p-value = 0.1908. Hence, there is no significant difference between single channel and multichannel gain estimation when we are not performing postprocessing of the gains using gain refinement. However, although the new update rules do not help in the multichannel case, we are able to better refine the gains. In this case, we aggregate information across all the channels, and blob detection is more robust, even to variations of the binarization threshold. In line with this, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available (http://repovizz.upf.edu/phenicx/anechoic_multi).

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation is Instrument Emphasis, which aims at processing multitrack orchestral recordings.

Figure 8: Results in terms of SDR, SIR, SAR, and ISR for the NMF gains initialization in the different test cases (GT, Ref1, Ref2, Ext, and Ali) for the pieces by Mozart, Beethoven, Mahler, and Bruckner.

Once we have the separated tracks, it allows emphasizing a particular instrument over the full orchestra downmix recording. Our workflow consists in processing a multichannel orchestral recording for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of other sections, obtaining an enhanced signal for the selected instrument.

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user, in the uploading of media content, and for presenting the results to the user. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks. After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks. The separation quality is not degraded if the blocks have a sufficient duration. In our case, we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx) and are integrated into the PHENICX prototype (http://phenicx.com).
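A minimal sketch of this block-wise processing is given below, assuming a separate_fn callable that maps one block of multichannel audio to per-instrument signals; the interface is hypothetical, standing in for the separation framework described above.

```python
import numpy as np

def separate_in_blocks(audio, sr, separate_fn, block_s=60.0):
    """Split a long recording into 1-minute blocks, separate each block,
    and concatenate the per-instrument results.

    audio: shape (n_channels, n_samples); separate_fn returns a dict
    {instrument: mono signal} for one block.
    """
    block = int(block_s * sr)
    tracks = {}
    for start in range(0, audio.shape[1], block):
        chunk = audio[:, start:start + block]
        for inst, sig in separate_fn(chunk).items():
            tracks.setdefault(inst, []).append(sig)
    return {inst: np.concatenate(parts) for inst, parts in tracks.items()}
```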

7.2. Acoustic Rendering. The second application for the separated material is augmented or virtual reality scenarios, where we apply a spatialization of the separated musical sources. Acoustic Rendering aims at recreating acoustically the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage and of the listener.

We have considered binaural synthesis as the most suitable spatial audio technique for this application. Humans locate the direction of incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound will arrive with a different intensity and at different time instances at both ears.

Figure 9: Results in terms of SDR for the combination between different offset estimation methods (INT and NEX) and different sizes of the tolerance window for note onsets and offsets (T1 and T2). GT is the ground truth alignment and Ali is the output of the score alignment.

The idea behind binaural synthesis is to artificially generate these cues to be able to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype, we have used the noncommercial Unity3D plugin for binaural synthesis provided by 3DCeption (https://twobigears.com/index.php).
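As a toy illustration of these interaural cues (not the 3DCeption plugin used in the prototype), one can delay and attenuate the far-ear signal according to the source azimuth; the Woodworth ITD formula and the 6 dB maximum level difference are simplifying assumptions of the example.

```python
import numpy as np

def toy_binaural(mono, sr, azimuth_deg, head_radius=0.0875, c=343.0):
    """Apply crude interaural time and level differences to a mono signal.

    Positive azimuths place the source to the right of the listener.
    """
    theta = np.radians(azimuth_deg)
    itd = head_radius / c * (abs(theta) + np.sin(abs(theta)))  # Woodworth ITD
    delay = int(round(itd * sr))                  # far-ear delay in samples
    ild = 10 ** (-6 * abs(np.sin(theta)) / 20)    # up to ~6 dB attenuation
    near = np.concatenate([mono, np.zeros(delay)])
    far = np.concatenate([np.zeros(delay), mono]) * ild
    left, right = (far, near) if theta > 0 else (near, far)
    return np.stack([left, right])                # stereo, shape (2, n + delay)
```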

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker, https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances), for the instrument visualization and Acoustic Rendering use-cases introduced above.

As inputs for the source localization method, we need the multichannel recordings and the approximate positions of the microphones on stage. In concert halls, the recording setup typically consists of a grid structure of hanging overhead mics. The position of the overhead mics is therefore kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, it being currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes the refined score alignment obtained prior to the signal separation (see Section 4.1). It follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated, and then the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible location of a sound source based on the measure of the time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate a list of time delay values corresponding to all note onsets in the score and take the maximum of the histogram. In our experiment we have a time resolution of 2.8 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play in isolation, as SRP-PHAT [45] does, and can be used in complex scenarios.
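A sketch of this per-instrument delay estimation is given below; the use of windowed cross-correlation around each aligned onset is an assumption made for the example, standing in for the paper's exact procedure.

```python
import numpy as np

def onset_delay(sig_a, sig_b, sr, onset_s, win_s=0.5, max_lag_s=0.05):
    """Delay (in seconds) of one note onset between two microphone signals."""
    lo, hi = int(onset_s * sr), int((onset_s + win_s) * sr)
    a, b = sig_a[lo:hi], sig_b[lo:hi]
    corr = np.correlate(a, b, mode="full")
    lags = np.arange(-len(b) + 1, len(a))
    keep = np.abs(lags) <= int(max_lag_s * sr)   # plausible acoustic delays only
    return lags[keep][np.argmax(corr[keep])] / sr

def instrument_delay(sig_a, sig_b, sr, onsets, resolution=0.0028):
    """Pool per-onset delays and take the histogram maximum.

    resolution mirrors the 2.8 ms time resolution quoted above."""
    delays = [onset_delay(sig_a, sig_b, sr, t) for t in onsets]
    edges = np.arange(min(delays) - resolution / 2,
                      max(delays) + resolution, resolution)
    hist, edges = np.histogram(delays, bins=edges)
    return edges[np.argmax(hist)] + resolution / 2   # centre of the top bin
```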

8. Outlook

In this paper we proposed a framework for the score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then, we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings is objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment, which improved the separation for three of the four pieces in the dataset when compared to the other two initialization options for our framework: the raw output of the score alignment and the tolerance window on which the alignment relies.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems in the recovery of the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done on a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach for source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone to separate a source is the closest one, as determined by the localization method.


When looking at the separation across the instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies within the pieces and the instruments, and future research could provide more insight into this problem. Note that an increasing degree of consonance was related to a more difficult case for source separation [17]. Hence, we could expect a worse separation for instruments which are harmonizing or accompanying other sections, as is the case of viola, cello, and double bass in some pieces. Future research could give more insight into the relation between the musical characteristics of the pieces (e.g., tonality and texture) and the source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us to denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow for other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework related to Instrument Emphasis or Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustin Martorell for his insights on the musicology side and for correcting the scores, and Oscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference 2013.

[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.

[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.

[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal on Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.

[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.

[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodríguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.

[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.

[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.

[9] F. J. Rodríguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.

[10] F. J. Rodríguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.

[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, October 2015.

[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.

[13] J. J. Carabias-Orti, F. J. Rodríguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Málaga, Spain, 2015.

[14] Ö. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.

[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.

[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, IEEE, Taipei, Taiwan, April 2009.

[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.

[18] F. J. Rodríguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.

[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.

[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.

[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.

[22] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.

[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.

[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.

[25] B. Niedermayer, Accurate audio-to-score alignment: data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.

[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.

[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.

[28] D. Fitzgerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.

[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Málaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.

[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.

[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.

[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.

[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.

[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.

[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.

[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.

[38] M. Mauch and S. Dixon, "PYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.

[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.

[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.

[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.

[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.

[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.

[45] A. Martí Guerola, Multichannel audio processing for speaker localization, separation and enhancement [Ph.D. thesis], Universitat Politècnica de València, 2013.



The evaluation was conducted on the dataset presentedin Section 61 The creation of the dataset was a verylaborious task which involved annotating around 12000 pairsof onsets and offsets denoising the original recordings andtesting different room configurations in order to create themultichannel recordings To that extent annotations helpedus to denoise the audio files which could then be usedin score-informed source separation experiments Further-more the annotations allow for other tasks to be testedwithinthis challenging scenario such as instrument detection ortranscription

Finally we presented several applications of the proposedframework related to Instrument Emphasis or Acoustic Ren-dering some of which are already at the stage of functionalproducts

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

The authors would like to thank all the partners of thePHENICX consortium for a very fruitful collaborationParticularly they would like to thank Agustin Martorell forhis insights on the musicology side and correcting the scoresandOscarMayor for his help regardingRepovizz and his con-tribution on the applicationsThis work is partially supportedby the Spanish Ministry of Economy and Competitivenessunder CASAS Project (TIN2015-70816-R)

References

[1] E Gomez M Grachten A Hanjalic et al ldquoPHENICX per-formances as highly enriched aNd Interactive concert experi-encesrdquo in Proceedings of the SMAC Stockholm Music AcousticsConference 2013 and SMC Sound and Music Computing Confer-ence 2013

[2] S Ewert and M Muller ldquoUsing score-informed constraints forNMF-based source separationrdquo in Proceedings of the IEEE Inter-national Conference on Acoustics Speech and Signal Processing(ICASSP rsquo12) pp 129ndash132 Kyoto Japan March 2012

[3] J Fritsch and M D Plumbley ldquoScore informed audio sourceseparation using constrained nonnegative matrix factorizationand score synthesisrdquo in Proceedings of the 38th IEEE Interna-tional Conference on Acoustics Speech and Signal Processing(ICASSP rsquo13) pp 888ndash891 Vancouver Canada May 2013

[4] Z Duan and B Pardo ldquoSoundprism an online system for score-informed source separation of music audiordquo IEEE Journal onSelected Topics in Signal Processing vol 5 no 6 pp 1205ndash12152011

[5] R Hennequin B David and R Badeau ldquoScore informed audiosource separation using a parametric model of non-negativespectrogramrdquo in Proceedings of the 36th IEEE InternationalConference on Acoustics Speech and Signal Processing (ICASSPrsquo11) pp 45ndash48 May 2011

[6] J J Carabias-Orti M Cobos P Vera-Candeas and F JRodriguez-Serrano ldquoNonnegative signal factorization withlearnt instrument models for sound source separation in close-microphone recordingsrdquo EURASIP Journal on Advances inSignal Processing vol 2013 article 184 2013

[7] T Pratzlich R M Bittner A Liutkus and M Muller ldquoKernelAdditive Modeling for interference reduction in multi-channelmusic recordingsrdquo in Proceedings of the 40th IEEE InternationalConference on Acoustics Speech and Signal Processing (ICASSPrsquo15) pp 584ndash588 Brisbane Australia April 2014

[8] S Ewert B Pardo M Mueller and M D Plumbley ldquoScore-informed source separation for musical audio recordings anoverviewrdquo IEEE Signal Processing Magazine vol 31 no 3 pp116ndash124 2014

[9] F J Rodriguez-Serrano Z Duan P Vera-Candeas B Pardoand J J Carabias-Orti ldquoOnline score-informed source separa-tion with adaptive instrument modelsrdquo Journal of New MusicResearch vol 44 no 2 pp 83ndash96 2015

[10] F J Rodriguez-Serrano S Ewert P Vera-Candeas and MSandler ldquoA score-informed shift-invariant extension of complexmatrix factorization for improving the separation of overlappedpartials in music recordingsrdquo in Proceedings of the IEEE Inter-national Conference on Acoustics Speech and Signal Processing(ICASSP rsquo16) pp 61ndash65 Shanghai China 2016

[11] M Miron J J Carabias and J Janer ldquoImproving score-informed source separation for classical music through noterefinementrdquo in Proceedings of the 16th International Societyfor Music Information Retrieval (ISMIR rsquo15) Malaga SpainOctober 2015

[12] A Arzt H Frostel T Gadermaier M Gasser M Grachtenand G Widmer ldquoArtificial intelligence in the concertgebouwrdquoin Proceedings of the 24th International Joint Conference onArtificial Intelligence pp 165ndash176 Buenos Aires Argentina2015

[13] J J Carabias-Orti F J Rodriguez-Serrano P Vera-CandeasN Ruiz-Reyes and F J Canadas-Quesada ldquoAn audio to scorealignment framework using spectral factorization and dynamictimewarpingrdquo inProceedings of the 16th International Society forMusic Information Retrieval (ISMIR rsquo15) Malaga Spain 2015

[14] O Izmirli and R Dannenberg ldquoUnderstanding features anddistance functions for music sequence alignmentrdquo in Proceed-ings of the 11th International Society for Music InformationRetrieval Conference (ISMIR rsquo10) pp 411ndash416 Utrecht TheNetherlands 2010

[15] M Goto ldquoDevelopment of the RWC music databaserdquo inProceedings of the 18th International Congress on Acoustics (ICArsquo04) pp 553ndash556 Kyoto Japan April 2004

[16] S Ewert M Muller and P Grosche ldquoHigh resolution audiosynchronization using chroma onset featuresrdquo in Proceedingsof the IEEE International Conference on Acoustics Speech andSignal Processing (ICASSP rsquo09) pp 1869ndash1872 IEEE TaipeiTaiwan April 2009

Journal of Electrical and Computer Engineering 19

[17] J J Burred From sparse models to timbre learning new methodsfor musical source separation [PhD thesis] 2008

[18] F J Rodriguez-Serrano J J Carabias-Orti P Vera-CandeasT Virtanen and N Ruiz-Reyes ldquoMultiple instrument mix-tures source separation evaluation using instrument-dependentNMFmodelsrdquo inProceedings of the 10th international conferenceon Latent Variable Analysis and Signal Separation (LVAICA rsquo12)pp 380ndash387 Tel Aviv Israel March 2012

[19] A Klapuri T Virtanen and T Heittola ldquoSound source sepa-ration in monaural music signals using excitation-filter modeland EM algorithmrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing (ICASSPrsquo10) pp 5510ndash5513 March 2010

[20] J-L Durrieu A Ozerov C Fevotte G Richard and B DavidldquoMain instrument separation from stereophonic audio signalsusing a sourcefilter modelrdquo in Proceedings of the 17th EuropeanSignal Processing Conference (EUSIPCO rsquo09) pp 15ndash19 GlasgowUK August 2009

[21] J J Bosch K Kondo R Marxer and J Janer ldquoScore-informedand timbre independent lead instrument separation in real-world scenariosrdquo in Proceedings of the 20th European SignalProcessing Conference (EUSIPCO rsquo12) pp 2417ndash2421 August2012

[22] P S Huang M Kim M Hasegawa-Johnson and P SmaragdisldquoSinging-voice separation from monaural recordings usingdeep recurrent neural networksrdquo in Proceedings of the Inter-national Society for Music Information Retrieval Conference(ISMIR rsquo14) Taipei Taiwan October 2014

[23] E K Kokkinis and J Mourjopoulos ldquoUnmixing acousticsources in real reverberant environments for close-microphoneapplicationsrdquo Journal of the Audio Engineering Society vol 58no 11 pp 907ndash922 2010

[24] S Ewert and M Muller ldquoScore-informed voice separationfor piano recordingsrdquo in Proceedings of the 12th InternationalSociety for Music Information Retrieval Conference (ISMIR rsquo11)pp 245ndash250 Miami Fla USA October 2011

[25] B Niedermayer Accurate audio-to-score alignment-data acqui-sition in the context of computational musicology [PhD thesis]Johannes Kepler University Linz Linz Austria 2012

[26] S Wang S Ewert and S Dixon ldquoCompensating for asyn-chronies between musical voices in score-performance align-mentrdquo in Proceedings of the 40th IEEE International Conferenceon Acoustics Speech and Signal Processing (ICASSP rsquo15) pp589ndash593 April 2014

[27] M Miron J Carabias and J Janer ldquoAudio-to-score alignmentat the note level for orchestral recordingsrdquo in Proceedings ofthe 15th International Society for Music Information RetrievalConference (ISMIR rsquo14) Taipei Taiwan 2014

[28] D Fitzgerald M Cranitch and E Coyle ldquoNon-negative tensorfactorisation for sound source separation acoustics speech andsignal processingrdquo in Proceedings of the 2006 IEEE InternationalConference on Acoustics Speech and Signal (ICASSP rsquo06)Toulouse France 2006

[29] C Fevotte and A Ozerov ldquoNotes on nonnegative tensorfactorization of the spectrogram for audio source separationstatistical insights and towards self-clustering of the spatialcuesrdquo in Exploring Music Contents 7th International Sympo-sium CMMR 2010 Malaga Spain June 21ndash24 2010 RevisedPapers vol 6684 of Lecture Notes in Computer Science pp 102ndash115 Springer Berlin Germany 2011

[30] J Patynen V Pulkki and T Lokki ldquoAnechoic recording systemfor symphony orchestrardquo Acta Acustica United with Acusticavol 94 no 6 pp 856ndash865 2008

[31] D Campbell K Palomaki and G Brown ldquoAMatlab simulationof shoebox room acoustics for use in research and teachingrdquoComputing and Information Systems vol 9 no 3 p 48 2005

[32] O Mayor Q Llimona MMarchini P Papiotis and E MaestreldquoRepoVizz a framework for remote storage browsing anno-tation and exchange of multi-modal datardquo in Proceedings of the21st ACM International Conference onMultimedia (MM rsquo13) pp415ndash416 Barcelona Spain October 2013

[33] D D Lee and H S Seung ldquoLearning the parts of objects bynon-negative matrix factorizationrdquo Nature vol 401 no 6755pp 788ndash791 1999

[34] C Fevotte N Bertin and J-L Durrieu ldquoNonnegative matrixfactorization with the Itakura-Saito divergence with applica-tion to music analysisrdquo Neural Computation vol 21 no 3 pp793ndash830 2009

[35] M Nixon Feature Extraction and Image Processing ElsevierScience 2002

[36] R Parry and I Essa ldquoEstimating the spatial position of spectralcomponents in audiordquo in Independent Component Analysis andBlind Signal Separation 6th International Conference ICA 2006Charleston SC USA March 5ndash8 2006 Proceedings J RoscaD Erdogmus J C Prıncipe and S Haykin Eds vol 3889of Lecture Notes in Computer Science pp 666ndash673 SpringerBerlin Germany 2006

[37] P Papiotis M Marchini A Perez-Carrillo and E MaestreldquoMeasuring ensemble interdependence in a string quartetthrough analysis of multidimensional performance datardquo Fron-tiers in Psychology vol 5 article 963 2014

[38] M Mauch and S Dixon ldquoPYIN a fundamental frequency esti-mator using probabilistic threshold distributionsrdquo in Proceed-ings of the IEEE International Conference on Acoustics Speechand Signal Processing (ICASSP rsquo14) pp 659ndash663 Florence ItalyMay 2014

[39] F J Canadas-Quesada P Vera-Candeas D Martınez-MunozN Ruiz-Reyes J J Carabias-Orti and P Cabanas-MoleroldquoConstrained non-negative matrix factorization for score-informed piano music restorationrdquo Digital Signal Processingvol 50 pp 240ndash257 2016

[40] A Cont D Schwarz N Schnell and C Raphael ldquoEvaluationof real-time audio-to-score alignmentrdquo in Proceedings of the 8thInternational Conference onMusic Information Retrieval (ISMIRrsquo07) pp 315ndash316 Vienna Austria September 2007

[41] E Vincent R Gribonval and C Fevotte ldquoPerformance mea-surement in blind audio source separationrdquo IEEE Transactionson Audio Speech and Language Processing vol 14 no 4 pp1462ndash1469 2006

[42] V Emiya E Vincent N Harlander and V Hohmann ldquoSub-jective and objective quality assessment of audio source sep-arationrdquo IEEE Transactions on Audio Speech and LanguageProcessing vol 19 no 7 pp 2046ndash2057 2011

[43] D R Begault and E M Wenzel ldquoHeadphone localization ofspeechrdquo Human Factors vol 35 no 2 pp 361ndash376 1993

[44] G S Kendall ldquoA 3-D sound primer directional hearing andstereo reproductionrdquo Computer Music Journal vol 19 no 4 pp23ndash46 1995

[45] A Martı Guerola Multichannel audio processing for speakerlocalization separation and enhancement UniversitatPolitecnica de Valencia 2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 13: Research Article Score-Informed Source Separation for ... · is paper proposes a system for score-informed audio source separation for multichannel orchestral recordings. e orchestral

Journal of Electrical and Computer Engineering 13

Table 5: Instruments for which the closest microphone was incorrectly determined, for different score information (GT, Ali, T1, T2, INT, and NEX) and the two room setups.

        | GT       | INT: Ali              | INT: T1         | INT: T2               | NEX: Ali              | NEX: T1 | NEX: T2
Setup 1 | Clarinet | Clarinet, double bass | Clarinet, flute | Clarinet, flute, horn | Clarinet, double bass |         | Clarinet, flute
Setup 2 | Cello    | Cello, flute          | Cello, flute    | Cello, flute          | Bassoon               | Flute   | Cello, flute

which has p = 0.84 and r = 0.84 for Beethoven). In the case of Beethoven, the output is particularly poor compared to the other pieces. However, by extending the boundaries (Ext) and applying note refinement (Ref1 or Ref2), we are able to increase the recall and match the performance on the other pieces.

When comparing the sizes of the tolerance window for the onsets and offsets, we observe that the alignment is more accurate when detecting onsets within 0.3 s and offsets within 0.6 s. In Table 4, T1 achieves better results than T2 (e.g., F = 0.77 for T1 compared to F = 0.69 for T2 on Mahler). Relying on a large window retrieves more frames but also significantly damages the precision. However, when considering source separation, we want to lose as little information as possible. It is in this case that the refinement methods Ref1 and Ref2 show their importance: when facing larger time boundaries, as with T2, Ref1 and especially Ref2 are able to reduce the errors, achieving better precision with a minimal loss in recall.
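
To make the frame-based alignment metrics concrete, the following is a minimal sketch of how precision, recall, and F-measure can be computed over note activations; the (onset, offset, pitch) tuple format, the hop size, and the function name are illustrative assumptions, not the exact evaluation code used in our experiments.

```python
import numpy as np

def frame_prf(est_notes, ref_notes, hop=0.01):
    """Frame-level precision/recall/F-measure for one instrument.

    est_notes, ref_notes: lists of (onset_s, offset_s, pitch) tuples.
    A frame counts as a true positive when an estimated note and a
    reference note of the same pitch both cover it.
    """
    duration = max(off for _, off, _ in est_notes + ref_notes)
    n = int(np.ceil(duration / hop))
    tp = fp = fn = 0
    for p in {q for _, _, q in est_notes + ref_notes}:
        est = np.zeros(n, dtype=bool)
        ref = np.zeros(n, dtype=bool)
        for on, off, q in est_notes:
            if q == p:
                est[int(on / hop):int(off / hop)] = True
        for on, off, q in ref_notes:
            if q == p:
                ref[int(on / hop):int(off / hop)] = True
        tp += np.sum(est & ref)
        fp += np.sum(est & ~ref)
        fn += np.sum(~est & ref)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

Extending the note boundaries (a larger tolerance window) marks more frames as active, which raises recall but lowers precision, exactly the trade-off observed between T1 and T2.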

The refinement Ref1 has a worse performance than Ref2, the multichannel refinement (e.g., F = 0.72 compared to F = 0.81 for Bruckner, INT, T1). Note that in the original version [11] Ref1 assumed monophony within a source, as it was tested in the simple case of the Bach10 dataset [4]. To that extent, it relied on a graph computation to determine the best distribution of blobs. However, due to the increased polyphony within an instrument (e.g., violins playing divisi) with simultaneous melodic lines, we disabled this feature; as a consequence, Ref1 has lower recall, since it loses more frames. On the other hand, Ref2 is more robust because it computes a blob estimation per channel, and averaging out these estimations yields better results.

The refinement works worse for the more complex pieces (Mahler and Bruckner) than for the simpler pieces (Mozart and Beethoven). Increasing the polyphony within a source and the number of instruments with many interleaving melodic lines (a less sparse score) also makes the task more difficult.

6.5.2. Panning Matrix. Estimating the panning matrix correctly is an important step in the proposed method, since Wiener filtering is performed on the channel where the instrument has the most energy. If the algorithm picks a different channel for this step, we find more interference between instruments in the separated audio files, as the toy example below illustrates.
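
The snippet below shows the channel-selection step in isolation: for each instrument, pick the channel with the largest weight in the panning matrix. The matrix values here are made up for the example; the real system estimates them from nonoverlapping partials, as described next.

```python
import numpy as np

# Hypothetical panning matrix: rows are channels, columns instruments.
M = np.array([[0.9, 0.1, 0.2],
              [0.2, 0.8, 0.3],
              [0.1, 0.3, 0.7]])

# Wiener filtering is applied in the channel where each instrument
# has the most energy; a wrong argmax here propagates as interference.
best_channel = M.argmax(axis=0)
print(best_channel)  # -> [0 1 2]
```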

As described in Section 3.1, the estimation of the panning matrix depends on the number of nonoverlapping partials of the notes found in the score and on their alignment with the audio. To that extent, the more nonoverlapping partials we have, the more robust the estimation; a simplified sketch of this idea follows.
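
As a simplified sketch of what "nonoverlapping partials" means, the helper below keeps, for each simultaneously sounding note, only the harmonics that no other instrument's harmonics approach within a tolerance. It assumes ideal harmonic spectra and a per-frame list of fundamentals; the actual estimator works on the aligned score and the multichannel spectrogram.

```python
import numpy as np

def nonoverlapping_partials(note_f0s, n_harmonics=10, tol_hz=30.0):
    """Harmonics of each note that are free of collisions (sketch).

    note_f0s: dict {instrument: f0_hz} of notes sounding in one frame.
    Returns dict {instrument: list of usable harmonic frequencies}.
    """
    result = {}
    for inst, f0 in note_f0s.items():
        others = [h * g for j, g in note_f0s.items() if j != inst
                  for h in range(1, n_harmonics + 1)]
        result[inst] = [h * f0 for h in range(1, n_harmonics + 1)
                        if all(abs(h * f0 - o) > tol_hz for o in others)]
    return result

print(nonoverlapping_partials({"cello": 110.0, "dbass": 55.0}))
```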

Initially we experimented with computing the panning matrix separately for each piece. In the case of Bruckner, the piece is simply too short and there are too few nonoverlapping partials to yield a good estimation, resulting in errors in the panning matrix. Since the instrument setup is the same for the Bruckner, Beethoven, and Mahler pieces (10 sources in the same positions on the stage), we decided to jointly estimate the matrix for the concatenated audio pieces and the associated scores. We denote as Setup 1 the Mozart piece, played by 8 sources, and as Setup 2 the Beethoven, Mahler, and Bruckner pieces, played by 10 sources.

Since the panning matrix is computed using the score, different score information can yield very different estimates of the panning matrix. To that extent, we evaluate the influence of audio-to-score alignment, namely, the cases INT, NEX, Ali, T1, and T2, as well as the initialization with the ground truth score information, GT.

In Table 5 we list the instruments for which the algorithm picked the wrong channel. Note that in the room setup generated with Roomsim, most of the instruments in Table 5 are placed close to other sources from the same family of instruments, for example, cello and double bass, or flute with clarinet, bassoon, and oboe. In such cases the algorithm makes more mistakes when selecting the correct channel on which to perform source separation.

In the column GT of Table 5 we can see that having a perfectly aligned score yields fewer errors when estimating the panning matrix. Conversely, in a real-life scenario we cannot rely on a hand-annotated score. In this case, for all columns of Table 5 excluding GT, the best estimation is obtained by the combination of NEX and T1: taking the offset time as the onset of the next note and then extending the score with the smaller tolerance window.

Furthermore, we compute the SDR values for the instruments in Table 5, column GT (clarinet and cello), for separation done in the correct channel versus the estimated channel. For Setup 1, the channel for clarinet is wrongly assigned to WWL (woodwinds left), the correct one being WWR (woodwinds right), even with a perfectly aligned score (GT). However, the microphones WWL and WWR are very close (see Figure 4) and do not capture significant energy from other instrument sections, so the SDR difference is less than 0.01 dB. In Setup 2, on the other hand, the cello is wrongly separated in the WWL channel, and the SDR difference between this audio and the audio separated in the correct channel is around −1.1 dB for each of the three pieces.

Figure 6: Results in terms of SDR, SIR, SAR, and ISR for the instruments and the pieces in the dataset.

6.5.3. Source Separation. We use the evaluation metrics described in Section 6.4.3. Since there is a lot of variability between the four pieces, it is more informative to present the results per piece rather than aggregating them.

First, we analyze the separation results per instrument in an ideal case. We assume that the best results for score-informed source separation are obtained in the case of a perfectly aligned score (GT). Furthermore, for this case we perform the separation in the correct channel for all the instruments, since in Section 6.5.2 we saw that picking a wrong channel can be detrimental. We present the results as a bar plot in Figure 6.

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6 we can see that the more complex the piece is, the more difficult it is to achieve a good separation. For instance, note that cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for cello in Mozart compared to −5 dB in Mahler). Cello and double bass are close by in both setups, as are clarinet and flute, so we expect interference between them. Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in the lower SIR values for double bass (5.5 dB SIR for Mozart, but −1.8, −0.1, and −0.2 dB SIR for the others) and flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask, the multichannel spectrogram is multiplied with the panning matrix. Thus, wrong values in this matrix can yield wrong amplitude values in the resulting signals.
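
The sketch below illustrates this masking step: each source's spectrogram model is weighted by its panning coefficient in every channel before the soft mask is formed, so errors in the panning matrix translate directly into wrong output amplitudes. The array shapes and the helper name are assumptions for illustration, not the exact implementation of (9).

```python
import numpy as np

def wiener_separate(X, V, A, eps=1e-12):
    """Per-channel soft masking with a panning matrix (sketch).

    X: (n_ch, F, T) complex mixture STFTs
    V: (n_src, F, T) nonnegative source models (e.g., from NMF)
    A: (n_ch, n_src) panning matrix
    Returns (n_src, n_ch, F, T) separated STFTs.
    """
    models = A[:, :, None, None] * V[None]            # (n_ch, n_src, F, T)
    masks = models / (models.sum(axis=1, keepdims=True) + eps)
    return masks.transpose(1, 0, 2, 3) * X[None]
```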

Such amplitude errors occur for trumpet, which is allocated a close microphone in the current setup and for which we would expect a good separation. However, trumpet achieves a poor ISR (5.5, 1, and −1 dB) while having a good separation in terms of SIR and SAR. Similarly, other instruments, such as cello, double bass, flute, and viola, face the same problem, particularly for the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.


Figure 7: The test cases for the initialization of score-informed source separation for the submatrix p_{j,n}^s(t) of note s: ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).

A low SDR is obtained in the case of Mahler, which is related to the poor alignment results obtained for this piece. As seen in Table 4, for the INT case the F-measure is almost 8% lower in Mahler than in the other pieces, mainly because of the bad precision.

The results are considerably worse for double bass for the more complex pieces of Beethoven (−9.1 dB SDR), Mahler (−8.5 dB SDR), and Bruckner (−4.7 dB SDR); for further analysis we consider it an outlier and exclude it from the analysis.

Second, we want to evaluate the usefulness of note refinement in source separation. As seen in Section 4, the gains for NMF separation are initialized with score information or with a refined score, as described in Section 4.2. A summary of the different initialization options is shown in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization with the ground truth annotations (Figure 7 (GT)), the direct output of the score alignment system (Figure 7 (Ali)), the common practice of NMF gain initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7 (Ext)), and the refinement approaches (Figure 7 (Ref1 and Ref2)). Note that Ref1 is the refinement done with the method in Section 4.1 and Ref2 with the multichannel method described in Section 5.2.
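
As a rough sketch of these initializations, the helper below writes binary note activations into a gain matrix, with an optional tolerance extension corresponding to the Ext case; the data layout and hop size are assumptions, and the refined cases (Ref1 and Ref2) would instead keep only the blob contours inside each note submatrix.

```python
import numpy as np

def init_gains(notes, n_pitches, n_frames, hop=0.01, tol=0.0):
    """Binary NMF gain initialization from an aligned score (sketch).

    notes: list of (onset_s, offset_s, pitch_index) for one instrument.
    tol=0 reproduces the raw alignment (Ali/GT); tol>0 extends every
    note by a tolerance window (Ext).
    """
    G = np.zeros((n_pitches, n_frames))
    for on, off, p in notes:
        a = max(0, int((on - tol) / hop))
        b = min(n_frames, int(np.ceil((off + tol) / hop)))
        G[p, a:b] = 1.0   # gains outside the activations stay zero
    return G
```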

We test the difference between the binarization thresholds 0.5 and 0.1 used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives an F-value of 0.0796 and a p value of 0.7778, which shows no significant difference between the two binarization thresholds.
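
Such a test is straightforward to reproduce with standard tools, for instance with SciPy; the SDR lists below are placeholders, not our measured values.

```python
from scipy.stats import f_oneway

sdr_threshold_05 = [2.1, 3.4, -0.5, 1.8]   # SDR (dB) with threshold 0.5
sdr_threshold_01 = [2.3, 3.1, -0.2, 1.5]   # SDR (dB) with threshold 0.1

F, p = f_oneway(sdr_threshold_05, sdr_threshold_01)
print(F, p)  # a large p-value indicates no significant difference
```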

The results for the five initializations, GT, Ref1, Ref2, Ext, and Ali, are presented in Figure 8 for each of the four pieces. Note that for Ref1, Ref2, and Ext we aggregate information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate between the instruments, with the piece by Mahler having the worst results and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors, such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues described in Section 6.1, can increase the difficulty of separation, up to the point that Ref1, Ref2, and Ext yield only a minimal improvement. To that extent, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps improve the separation, increasing the SDR by 1–1.5 dB or more for all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase the SDR for most of the pieces, with the exception of the piece by Mahler. This is due to an increase of SIR and a decrease of the interference in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext; for this piece, the difference in SIR is around 2 dB. For Beethoven, Ref1 and Ref2 gain 0.5 dB in SDR when compared to Ext, and 1.5 dB in SIR. For Bruckner, only Ref2 has a higher SDR; however, SIR increases by 1.5 dB for both Ref1 and Ref2. Note that Ref1 and Ref2 refine not only the time boundaries of the notes but also the frequency content, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the note offset estimation methods, INT and NEX, and of the tolerance window sizes, T1 and T2, which account for errors in the alignment. Note that for this case we do not include the refinement in the results; we evaluate only the case Ext, leaving out the refinement in order to isolate the influence of T1 and T2. Results are presented in Figure 9 and show that the best results are obtained for the interpolation of the offsets, INT. This relates to the results presented in Section 6.5.1. Similarly to the analysis regarding the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to draw a conclusion on which strategy is better for the initialization, as the error bars for the ground truth overlap with those of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gain estimation, proposed in Section 5.1, and the single-channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR and obtained an F-value of 1.712 and a p value of 0.1908. Hence, there is no significant difference between single-channel and multichannel gain estimation when we do not postprocess the gains with note refinement. However, although the new update rules do not help in the multichannel case by themselves, they allow us to better refine the gains: in this case we aggregate information over all the channels, and blob detection is more robust, even to variations of the binarization threshold. This accounts for the fact that, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available online (http://repovizz.upf.edu/phenicx/anechoic_multi).

Figure 8: Results in terms of SDR, SIR, SAR, and ISR for the NMF gain initialization in the different test cases.

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation is Instrument Emphasis, which aims at processing multitrack orchestral recordings. Once we have the separated tracks, it allows emphasizing a particular instrument over the full orchestra downmix recording. Our workflow consists in processing multichannel orchestral recordings for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of other sections, obtaining an enhanced signal for the selected instrument.

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user when uploading media content and for presenting the results. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks. After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks; a sketch of this strategy is given below. The separation quality is not degraded if the blocks have sufficient duration; in our case we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx) and are integrated into the PHENICX prototype (http://phenicx.com).
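
A minimal sketch of this block-wise strategy follows; the separate_fn callback and the data layout are hypothetical stand-ins for the actual decomposition engine.

```python
import numpy as np

def separate_in_blocks(x, sr, separate_fn, block_sec=60.0):
    """Split a long multichannel recording into blocks, separate each
    block, and concatenate the per-instrument results.

    x: (n_channels, n_samples) audio; separate_fn maps a block to a
    dict {instrument_name: mono_track}.
    """
    step = int(block_sec * sr)
    out = {}
    for start in range(0, x.shape[1], step):
        for name, track in separate_fn(x[:, start:start + step]).items():
            out.setdefault(name, []).append(track)
    return {name: np.concatenate(parts) for name, parts in out.items()}
```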

Figure 9: Results in terms of SDR for the combination of different offset estimation methods (INT and NEX) and different sizes of the tolerance window for note onsets and offsets (T1 and T2). GT is the ground truth alignment and A is the raw output of the score alignment.

7.2. Acoustic Rendering. The second application for the separated material is augmented or virtual reality scenarios, where we apply a spatialization of the separated musical sources. Acoustic Rendering aims at recreating acoustically the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage and of the listener.

We have considered binaural synthesis as the most suitable spatial audio technique for this application. Humans locate the direction of an incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound will arrive with a different intensity and at different time instances at both ears. The idea behind binaural synthesis is to artificially generate these cues in order to create the illusion of a directional sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype, we have used the noncommercial Unity3D plugin for binaural synthesis provided by 3DCeption (https://twobigears.com/index.php).
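
A crude sketch of how such cues can be generated is shown below: it applies an interaural time difference (via the Woodworth approximation) and a simple level difference to a mono source. Real binaural renderers, including the plugin we used, rely on measured HRTFs instead; all constants here are textbook approximations, not the plugin's parameters.

```python
import numpy as np

def simple_binaural(mono, sr, azimuth_deg, head_radius=0.0875, c=343.0):
    """Pan a mono source with crude ITD/ILD cues (sketch, not HRTF-based)."""
    az = np.deg2rad(azimuth_deg)                         # 0 = front, + = right
    itd = head_radius / c * (abs(az) + np.sin(abs(az)))  # Woodworth ITD
    delay = int(round(itd * sr))
    ild = 10 ** (-6.0 * abs(np.sin(az)) / 20.0)          # up to ~6 dB weaker
    near = mono
    far = np.concatenate([np.zeros(delay), mono])[:len(mono)] * ild
    left, right = (far, near) if az >= 0 else (near, far)
    return np.stack([left, right])
```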

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker: https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of the musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances), for the instrument visualization and Acoustic Rendering use cases introduced above.

As inputs for the source localization method, we need the multichannel recordings and the approximate positions of the microphones on stage. In concert halls, the recording setup typically consists of a grid structure of hanging overhead microphones. The positions of the overhead microphones are therefore kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, and it is currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes the refined score alignment obtained prior to the signal separation (see Section 4.1). It follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then, the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible location of a sound source based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate a list of time delay values corresponding to all note onsets in the score and take the maximum of their histogram, as sketched below. In our experiment we have a time resolution of 28 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play in isolation, as SRP-PHAT [45] does, and it can therefore be used in complex scenarios.
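
The sketch below shows the delay-estimation step for one instrument and one microphone pair: a short window around every score onset is cross-correlated between the two channels, and the mode of the histogram of lags is taken as the TDOA. The window length, lag range, and function name are illustrative assumptions; the histogram resolution mirrors the 28 ms time resolution mentioned above.

```python
import numpy as np

def tdoa_from_onsets(x_a, x_b, onsets, sr, max_lag_s=0.05, res_s=0.028):
    """Delay of mic B relative to mic A from note-onset windows (sketch)."""
    max_lag, win = int(max_lag_s * sr), int(0.1 * sr)
    lags = []
    for t in onsets:
        i = int(t * sr)
        if i < max_lag or i + win + max_lag > len(x_b) or i + win > len(x_a):
            continue  # skip onsets too close to the signal edges
        a = x_a[i:i + win]
        b = x_b[i - max_lag:i + win + max_lag]
        k = np.argmax(np.correlate(b, a, mode='valid'))  # 0..2*max_lag
        lags.append((k - max_lag) / sr)                  # + means B later
    hist, edges = np.histogram(lags, bins=np.arange(-max_lag_s,
                                                    max_lag_s + res_s, res_s))
    k = int(np.argmax(hist))
    return (edges[k] + edges[k + 1]) / 2                 # histogram mode
```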

8. Outlook

In this paper we proposed a framework for score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then, we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings has been objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment which improved the separation for three of the four pieces in the dataset, when compared to the other two initialization options for our framework: the raw output of the score alignment and the tolerance window on which the alignment relies.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems in recovering the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done on a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach to source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone to separate a source is the closest one, as determined by the localization method.


When looking at the separation across instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies between the pieces and the instruments, and future research could provide more insight into this problem. Note that an increasing degree of consonance has been related to a more difficult case for source separation [17]. Hence, we could expect a worse separation for instruments which harmonize or accompany other sections, as is the case of viola, cello, and double bass in some pieces. Future research could give more insight into the relation between the musical characteristics of the pieces (e.g., tonality and texture) and source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow for other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework related to Instrument Emphasis and Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustin Martorell for his insights on the musicology side and for correcting the scores, and Oscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference, 2013.

[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.

[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.

[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.

[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.

[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodriguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.

[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.

[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.

[9] F. J. Rodriguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.

[10] F. J. Rodriguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.

[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Malaga, Spain, October 2015.

[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the Concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.

[13] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Malaga, Spain, 2015.

[14] O. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.

[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.

[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, IEEE, Taipei, Taiwan, April 2009.

[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.

[18] F. J. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.

[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.

[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.

[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.

[22] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.

[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.

[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.

[25] B. Niedermayer, Accurate audio-to-score alignment: data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.

[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.

[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.

[28] D. Fitzgerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.

[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Malaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.

[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.

[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.

[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.

[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.

[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.

[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.

[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.

[38] M. Mauch and S. Dixon, "PYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.

[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.

[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.

[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.

[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.

[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.

[45] A. Martí Guerola, Multichannel Audio Processing for Speaker Localization, Separation and Enhancement, Universitat Politècnica de València, 2013.

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 14: Research Article Score-Informed Source Separation for ... · is paper proposes a system for score-informed audio source separation for multichannel orchestral recordings. e orchestral

14 Journal of Electrical and Computer Engineering

0

2

4

6(d

B)(d

B)

(dB)

(dB)

SDR

0

2

4

6

8

10

SIR

0

2

4

6

8

10

12

ISR

16

18

20

22

24

26

28

30SAR

Flut

e

minus2

minus2

minus4

minus2

minus4

minus6

minus8

minus10

MozartBeethoven

MahlerBruckner

Flut

e

Bass

oon

Cel

lo

Clar

inet

Dba

ss

Flut

e

Hor

n

Viol

a

Viol

in

Obo

e

Trum

pet

MozartBeethoven

MahlerBruckner

Bass

oon

Cel

lo

Clar

inet

Dba

ss

Flut

e

Hor

n

Viol

a

Viol

in

Obo

e

Trum

pet

Bass

oon

Cel

lo

Clar

inet

Dba

ss

Hor

n

Viol

a

Viol

in

Obo

e

Trum

pet

Bass

oon

Cel

lo

Clar

inet

Dba

ss

Hor

n

Viol

a

Viol

in

Obo

e

Trum

pet

Figure 6 Results in terms of SDR SIR SAR and ISR for the instruments and the songs in the dataset

between the four pieces it is more informative to present theresults per piece rather than aggregating them

First we analyze the separation results per instrumentsin an ideal case We assume that the best results for score-informed source separation are obtained in the case of aperfectly aligned score (GT) Furthermore for this case wecalculate the separation in the correct channel for all theinstruments since in Section 652 we could see that pickinga wrong channel could be detrimental We present the resultsas a bar plot in Figure 6

As described in Section 6.1 and Table 1, the four pieces have different levels of complexity. In Figure 6 we can see that the more complex the piece is, the more difficult it is to achieve a good separation. For instance, note that cello, clarinet, flute, and double bass achieve good results in terms of SDR on the Mozart piece but significantly worse results on the other three pieces (e.g., 4.5 dB for cello in Mozart compared to −5 dB in Mahler). Cello and double bass are placed close by in both of the setups, similarly for clarinet and flute, and we expect interference between them. Furthermore, these instruments usually share the same frequency range, which can result in additional interference. This is seen in lower SIR values for double bass (5.5 dB SIR for Mozart, but −1.8, −0.1, and −0.2 dB SIR for the others) and flute.

An issue for the separation is the spatial reconstruction, measured by the ISR metric. As seen in (9), when applying the Wiener mask, the multichannel spectrogram is multiplied with the panning matrix. Thus, wrong values in this matrix can yield wrong amplitude values in the resulting signals.

This is the case for trumpet, which is allocated a close microphone in the current setup and for which we expect a good separation. However, trumpet achieves a poor ISR (5.5, 1, and −1 dB) while having a good separation in terms of SIR and SAR. Similarly, other instruments such as cello, double bass, flute, and viola face the same problem, particularly in the piece by Mahler. Therefore, a good estimation of the panning matrix is crucial for a good ISR.
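To illustrate how errors in the panning matrix propagate to the separated signals, the following sketch shows a panning-weighted soft-mask reconstruction in the spirit of the framework above; the function name, array shapes, and the use of magnitude spectrograms are our own assumptions, not the exact formulation of (9).

```python
import numpy as np

def wiener_reconstruct(S_hat, X_mag, panning, eps=1e-12):
    """Sketch of panning-weighted Wiener reconstruction.

    S_hat   : (J, F, T) estimated magnitude spectrograms per source
    X_mag   : (C, F, T) magnitude spectrograms of the channel mixtures
    panning : (J, C) panning matrix (weight of source j in channel c)
    Returns : (J, C, F, T) per-source, per-channel magnitude estimates
    """
    J, F, T = S_hat.shape
    C = X_mag.shape[0]
    out = np.zeros((J, C, F, T))
    for c in range(C):
        # model of channel c as the panning-weighted sum of all sources
        model = np.einsum('j,jft->ft', panning[:, c], S_hat)
        for j in range(J):
            mask = (panning[j, c] * S_hat[j]) / (model + eps)  # soft mask
            out[j, c] = mask * X_mag[c]
    return out
```

If panning[j, c] under- or overestimates the weight of source j in channel c, the mask and hence the recovered amplitude are scaled accordingly, which shows up as a poor ISR.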

Figure 7: The test cases for the initialization of score-informed source separation for the submatrix p_jn^s(t): ground truth (GT), aligned (Ali), extended (Ext), refined 1 (Ref1), and refined 2 (Ref2).

A low SDR is obtained in the case of Mahler, which is related to the poor alignment results obtained for this piece. As seen in Table 4 for the INT case, the F-measure is almost 8% lower in Mahler than in the other pieces, mainly because of the poor precision.

The results are considerably worse for double bass in the more complex pieces by Beethoven (−9.1 dB SDR), Mahler (−8.5 dB SDR), and Bruckner (−4.7 dB SDR); for further analysis we consider it an outlier and exclude it from the analysis.

Second, we want to evaluate the usefulness of note refinement in source separation. As seen in Section 4, the gains for NMF separation are initialized with score information or with a refined score, as described in Section 4.2. A summary of the different initialization options is shown in Figure 7. Correspondingly, we evaluate five different initializations of the gains: the perfect initialization with the ground truth annotations (Figure 7 (GT)), the direct output of the score alignment system (Figure 7 (Ali)), the common practice of NMF gains initialization in state-of-the-art score-informed source separation [3, 5, 24] (Figure 7 (Ext)), and the refinement approaches (Figure 7 (Ref1 and Ref2)). Note that Ref1 is the refinement done with the method in Section 4.1 and Ref2 with the multichannel method described in Section 5.2.
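As a rough illustration of these variants, the sketch below builds a binary time-gain template per pitch from a list of aligned notes; with tol_frames = 0 it mimics the raw alignment (Ali) and with a positive tolerance it mimics the extended initialization (Ext). The function and its argument names are hypothetical; the refined cases (Ref1 and Ref2) would additionally restrict the template to the time-frequency blobs detected in the gains.

```python
import numpy as np

def init_gains(notes, n_frames, hop_time, tol_frames=0):
    """Binary NMF gain initialization from (pitch_index, onset_s, offset_s) notes."""
    n_pitches = 1 + max(pitch for pitch, _, _ in notes)
    gains = np.zeros((n_pitches, n_frames))
    for pitch, onset, offset in notes:
        start = max(0, int(onset / hop_time) - tol_frames)
        end = min(n_frames, int(offset / hop_time) + tol_frames)
        gains[pitch, start:end] = 1.0  # frames where this note may be active
    return gains

# Ali-like template vs. Ext-like template with a 0.3 s tolerance (10 ms hop)
notes = [(60, 0.50, 1.20), (64, 1.20, 2.00)]
ali = init_gains(notes, n_frames=300, hop_time=0.01)
ext = init_gains(notes, n_frames=300, hop_time=0.01, tol_frames=30)
```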

We test the difference between the binarization thresholds 0.5 and 0.1 used in the refinement methods Ref1 and Ref2. A one-way ANOVA on the SDR results gives F-value = 0.0796 and p value = 0.7778, which shows no significant difference between the two binarization thresholds.
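For reference, such a test can be reproduced with scipy; the SDR values below are placeholders, not the figures from our evaluation.

```python
from scipy.stats import f_oneway

# SDR (dB) per evaluated track for each binarization threshold (placeholder values)
sdr_threshold_05 = [2.1, 3.4, -0.5, 1.8, 0.9]
sdr_threshold_01 = [2.0, 3.6, -0.7, 1.9, 1.1]

f_val, p_val = f_oneway(sdr_threshold_05, sdr_threshold_01)
print(f"F = {f_val:.4f}, p = {p_val:.4f}")  # p > 0.05: no significant difference
```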

The results for the five initializations, GT, Ref1, Ref2, Ext, and Ali, are presented in Figure 8 for each of the four pieces. Note that for Ref1, Ref2, and Ext we aggregate information across all possible outputs of the alignment: INT, NEX, T1, and T2. Analyzing the results, we note that the more complex the piece, the more difficult it is to separate between the instruments, the piece by Mahler having the worst results and the piece by Bruckner a large variance, as seen in the error bars. For these two pieces, other factors such as the increased polyphony within a source, the number of instruments (e.g., 12 violins versus 4 violins in a group), and the synchronization issues described in Section 6.1 can increase the difficulty of separation to the point that Ref1, Ref2, and Ext yield only a minimal improvement. To that extent, for the piece by Bruckner, extending the boundaries of the notes (Ext) does not achieve significantly better results than the raw output of the alignment (Ali).

As seen in Figure 8, having a ground truth alignment (GT) helps improve the separation, increasing the SDR by 1–1.5 dB or more for all the test cases. Moreover, the refinement methods Ref1 and Ref2 increase SDR for most of the pieces, with the exception of the piece by Mahler. This is due to an increase of SIR and a decrease of interferences in the signal. For instance, in the piece by Mozart, Ref1 and Ref2 increase the SDR by 1 dB when compared to Ext; for this piece, the difference in SIR is around 2 dB. Then, for Beethoven, Ref1 and Ref2 gain 0.5 dB in terms of SDR over Ext and 1.5 dB in SIR. For Bruckner, only Ref2 has a higher SDR; however, SIR increases by 1.5 dB for both Ref1 and Ref2. Note that not only do Ref1 and Ref2 refine the time boundaries of the notes, but the refinement also happens in frequency, because the initialization is done with the contours of the blobs, as seen in Figure 7. This can also contribute to a higher SIR.

Third, we look at the influence of the estimation of note offsets, INT and NEX, and the tolerance window sizes, T1 and T2, which account for errors in the alignment. Note that for this case we do not include the refinement in the results and we evaluate only the case Ext, as we leave out the refinement in order to isolate the influence of T1 and T2. Results are presented in Figure 9 and show that the best results are obtained for the interpolation of the offsets, INT. This relates to the results presented in Section 6.5.1. Similarly to the analysis regarding the refinement, the results are worse for the pieces by Mahler and Bruckner, and we are not able to draw a conclusion on which strategy is better for the initialization, as the error bars for the ground truth overlap with those of the tested cases.

Fourth, we analyze the difference between the PARAFAC model for multichannel gains estimation, as proposed in Section 5.1, and the single channel estimation of the gains in Section 3.4. We performed a one-way ANOVA on SDR and obtained an F-value = 1.712 and a p-value = 0.1908. Hence, there is no significant difference between single channel and multichannel gain estimation when we do not postprocess the gains with the refinement step. However, although the new update rules do not help in the multichannel case, we are able to better refine the gains: in this case we aggregate information across all the channels, and blob detection is more robust, even to variations of the binarization threshold. Accordingly, for the piece by Bruckner, Ref2 outperforms Ref1 in terms of SDR and SIR. Furthermore, as seen in Table 4, the alignment is always better for Ref2 than for Ref1.

The audio excerpts from the dataset used for evaluation, as well as the tracks separated with the ground truth annotated score, are made available online (http://repovizz.upf.edu/phenicx/anechoic_multi).

7. Applications

7.1. Instrument Emphasis. The first application of our approach for multichannel score-informed source separation is Instrument Emphasis, which aims at processing multitrack orchestral recordings.

Figure 8: Results in terms of SDR, SIR, SAR, and ISR for the NMF gains initialization in the different test cases (GT, Ref1, Ref2, Ext, and Ali).

Once we have the separated tracks, it allows emphasizing a particular instrument over the full orchestra downmix recording. Our workflow consists in processing a multichannel orchestral recording for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of the other sections, obtaining an enhanced signal for the selected instrument.
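A minimal sketch of the emphasis operation itself, assuming sample-aligned signals: the separated track of the selected instrument is boosted against the downmix. The gain value and the normalization are illustrative choices, not part of the evaluated system.

```python
import numpy as np

def emphasize(downmix, separated, gain_db=6.0):
    """Boost one separated instrument over the full-orchestra downmix."""
    gain = 10.0 ** (gain_db / 20.0)
    # the downmix already contains the source once, so add only the extra gain
    out = downmix + (gain - 1.0) * separated
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out  # simple peak normalization
```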

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user, uploading the media content, and presenting the results. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks.

After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks. The separation quality is not degraded if the blocks have sufficient duration; in our case, we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx) and are integrated into the PHENICX prototype (http://phenicx.com).
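A sketch of this block-wise back-end processing under the stated constraints; separate_fn stands in for the full NMF separation of one block and is an assumed interface.

```python
import numpy as np

def separate_long_recording(audio, sr, separate_fn, block_s=60.0):
    """Split a long (channels x samples) recording into 1-minute blocks,
    separate each block, and concatenate the per-instrument results."""
    block = int(block_s * sr)
    results = [separate_fn(audio[:, i:i + block])      # dict: instrument -> signal
               for i in range(0, audio.shape[1], block)]
    return {name: np.concatenate([r[name] for r in results], axis=-1)
            for name in results[0]}
```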

7.2. Acoustic Rendering. The second application for the separated material is augmented or virtual reality scenarios, where we apply a spatialization of the separated musical sources. Acoustic Rendering aims at recreating acoustically the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage and of the listener.

We have considered binaural synthesis as the most suitable spatial audio technique for this application. Humans locate the direction of incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound arrives with a different intensity and at different time instances at both ears.

Figure 9: Results in terms of SDR for the combination of the different offset estimation methods (INT and NEX) and the different sizes of the tolerance window for note onsets and offsets (T1 and T2). GT is the ground truth alignment and A is the raw output of the score alignment.

The idea behind binaural synthesis is to artificially generate these cues in order to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype, we have used the noncommercial Unity3D binaural synthesis plugin provided by 3DCeption (https://twobigears.com/index.php).
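The plugin implements these cues with head-related transfer functions; as a back-of-the-envelope illustration only, the sketch below synthesizes a crude interaural time difference (Woodworth's approximation) and a level difference for a mono source at a given azimuth. All constants are illustrative.

```python
import numpy as np

def crude_binaural(mono, sr, azimuth_deg, head_radius=0.0875, c=343.0):
    """Pan a mono signal with a rough ITD/ILD; positive azimuth = source on the right."""
    az = np.deg2rad(abs(azimuth_deg))
    itd = head_radius / c * (az + np.sin(az))     # Woodworth ITD model (seconds)
    delay = int(round(itd * sr))                  # far-ear delay in samples
    ild = 10.0 ** (-6.0 * np.sin(az) / 20.0)      # up to ~6 dB quieter at the far ear
    near = mono
    far = ild * np.concatenate([np.zeros(delay), mono])[:len(mono)]
    left, right = (far, near) if azimuth_deg >= 0 else (near, far)
    return left, right
```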

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker, https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of the musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances) and for the instrument visualization and Acoustic Rendering use-cases introduced above.

As inputs for the source localization method, we need the multichannel recordings and the approximate positions of the microphones on stage. In concert halls, the recording setup typically consists of a grid structure of hanging overhead mics; the positions of the overhead mics are therefore kept as metadata of the performance recording.

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, and it is currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes as input the refined score alignment obtained prior to the signal separation (see Section 4.1) and follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then, the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible locations of a sound source based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate a list of time delay values corresponding to all the note onsets in the score and take the maximum of the histogram. In our experiment, we have a time resolution of 28 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play in isolation, as SRP-PHAT [45] does, and can therefore be used in complex scenarios.
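A sketch of the per-pair delay estimation under these assumptions: short windows around each aligned note onset are compared between the two mics (the cross-correlation is our assumed mechanism for measuring each per-onset delay), and the histogram peak is taken as the delay for the pair.

```python
import numpy as np

def pair_delay(sig_a, sig_b, onsets_s, sr, win_s=0.1, max_delay_s=0.05, bins=50):
    """Delay of mic b relative to mic a for one instrument, from its note onsets."""
    max_lag = int(max_delay_s * sr)
    win = int(win_s * sr)
    delays = []
    for t in onsets_s:
        i = int(t * sr)
        if i - max_lag < 0 or i + win > len(sig_a) or i + win + max_lag > len(sig_b):
            continue  # skip onsets too close to the signal edges
        a = sig_a[i:i + win]
        b = sig_b[i - max_lag:i + win + max_lag]
        xc = np.correlate(b, a, mode='valid')       # lags from -max_lag to +max_lag
        delays.append((np.argmax(xc) - max_lag) / sr)
    hist, edges = np.histogram(delays, bins=bins)
    k = np.argmax(hist)
    return 0.5 * (edges[k] + edges[k + 1])          # most frequent delay (seconds)
```

The delays obtained for several microphone pairs then define the half-hyperboloids whose intersection gives the source position.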

8. Outlook

In this paper, we proposed a framework for score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then, we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings has been objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system; thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment, which improved the separation for three of the four pieces in the dataset when compared to the other two initialization options for our framework: the raw output of the score alignment and the tolerance window on which the alignment relies.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems in recovering the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done over a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach for source localization in Section 7.3, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone with which to separate a source is the closest one, as determined by the localization method.


Looking at the separation across instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies with the pieces and the instruments, and future research could provide more insight into this problem. Note that an increasing degree of consonance has been related to more difficult cases for source separation [17]. Hence, we can expect a worse separation for instruments which are harmonizing or accompanying other sections, as is the case of viola, cello, and double bass in some pieces. Future research could give more insight into the relation between the musical characteristics of the pieces (e.g., tonality and texture) and source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task, which involved annotating around 12,000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework related to Instrument Emphasis and Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustin Martorell for his insights on the musicology side and for correcting the scores, and Oscar Mayor for his help regarding Repovizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gomez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference, 2013.
[2] S. Ewert and M. Muller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.
[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.
[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal on Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.
[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.
[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodriguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.
[7] T. Pratzlich, R. M. Bittner, A. Liutkus, and M. Muller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.
[8] S. Ewert, B. Pardo, M. Mueller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.
[9] F. J. Rodriguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.
[10] F. J. Rodriguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.
[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Malaga, Spain, October 2015.
[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.
[13] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Canadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Malaga, Spain, 2015.
[14] O. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.
[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.
[16] S. Ewert, M. Muller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, IEEE, Taipei, Taiwan, April 2009.
[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.
[18] F. J. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.
[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.
[20] J.-L. Durrieu, A. Ozerov, C. Fevotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.
[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.
[22] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.
[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.
[24] S. Ewert and M. Muller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.
[25] B. Niedermayer, Accurate audio-to-score alignment: data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.
[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.
[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.
[28] D. Fitzgerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.
[29] C. Fevotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Malaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.
[30] J. Patynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.
[31] D. Campbell, K. Palomaki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.
[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.
[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[34] C. Fevotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.
[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Principe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.
[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.
[38] M. Mauch and S. Dixon, "PYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.
[39] F. J. Canadas-Quesada, P. Vera-Candeas, D. Martinez-Munoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabanas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.
[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.
[41] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.
[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.
[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.
[45] A. Marti Guerola, Multichannel audio processing for speaker localization, separation and enhancement, Universitat Politecnica de Valencia, 2013.

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 15: Research Article Score-Informed Source Separation for ... · is paper proposes a system for score-informed audio source separation for multichannel orchestral recordings. e orchestral

Journal of Electrical and Computer Engineering 15

GT

Ali

Ext

Ref1

Ref2

Ground truth

Refined 1

Extended

Refined 2

Aligned

Submatrix psjn(t) for note s

Figure 7 The test cases for initialization of score-informed sourceseparation for the submatrix 119901119904119895119899(119905)

A low SDR is obtained in the case ofMahler that is relatedto the poor results in alignment obtained for this piece Asseen in Table 4 for INT case 119865-measure is almost 8 lowerin Mahler than in other pieces mainly because of the badprecision

The results are considerably worse for double bass for themore complex pieces of Beethoven (minus91 dB SDR) Mahler(minus85 dB SDR) and Bruckner (minus47 dB SDR) and for furtheranalysis we consider it as an outlier and we exclude it fromthe analysis

Second we want to evaluate the usefulness of note refine-ment in source separation As seen in Section 4 the gainsfor NMF separation are initialized with score information orwith a refined score as described in Section 42 A summaryof the different initialization options and seen in Figure 7Correspondingly we evaluate five different initializations ofthe gains the perfect initialization with the ground truthannotations (Figure 7 (GT)) the direct output of the scorealignment system (Figure 7 (Ali)) the common practice ofNMF gains initialization in state-of-the-art score-informedsource separation [3 5 24] (Figure 7 (Ext)) and the refine-ment approaches (Figure 7 (Ref1 and Ref2)) Note that Ref1 isthe refinement done with the method in Section 41 and Ref2with the multichannel method described in Section 52

We test the difference between the binarization thresholds05 and 01 used in the refinement methods Ref1 and Ref2One-way ANOVA on SDR results gives 119865-value = 00796and 119901 value = 07778 which shows no significant differencebetween both binarization thresholds

The results for the five initializations GT Ref1 Ref2Ext and Ali are presented in Figure 8 for each of thefour pieces Note that for Ref1 Ref2 and Ext we aggregateinformation across all possible outputs of the alignment INTNEX T1 and T2 Analyzing the results we note that themorecomplex the piece the more difficult to separate between theinstruments the piece by Mahler having the worse resultsand the piece by Bruckner a large variance as seen in theerror bars For these two pieces other factors as the increasedpolyphony within a source the number of instruments (eg12 violins versus 4 violins in a group) and the synchronizationissues we described in Section 61 can increase the difficultyof separation up to the point that Ref1 Ref2 and Ext havea minimal improvement To that extent for the piece by

Bruckner extending the boundaries of the notes (Ext) doesnot achieve significantly better results than the raw output ofthe alignment (Ali)

As seen in Figure 8 having a ground truth alignment(GT) helps improving the separation increasing the SDRwith 1ndash15 dB or more for all the test cases Moreover therefinement methods Ref1 and Ref2 increase SDR for most ofthe pieces with the exception of the piece by Mahler Thisis due to an increase of SIR and decrease of interferencesin the signal For instance in the piece by Mozart Ref1 andRef2 increase the SDR with 1 dB when compared to Ext Forthis piece the difference in SIR is around 2 dB Then forBeethoven Ref1 and Ref2 increase 05 dB in terms of SDRwhen compared to Ext and 15 dB in SIR For Bruckner solelyRef2 has a higher SDR however SIR increases with 15 dB inRef1 and Ref2 Note that not only do Ref1 and Ref2 refinethe time boundaries of the notes but also the refinementhappens in frequency because the initialization is done withthe contours of the blobs as seen in Figure 7 This can alsocontribute to a higher SIR

Third we look at the influence of the estimation of noteoffsets INT and NEX and the tolerance window sizes T1and T2 which accounts for errors in the alignment Notethat for this case we do not include the refinement in theresults and we evaluate only the case Ext as we leave outthe refinement in order to isolate the influence of T1 andT2 Results are presented in Figure 9 and show that the bestresults are obtained for the interpolation of the offsets INTThis relates to the results presented in Section 651 Similarlyto the analysis regarding the refinement the results are worsefor the pieces by Mahler and Bruckner and we are not ableto draw a conclusion on which strategy is better for theinitialization as the error bars for the ground truth overlapwith the ones of the tested cases

Fourth we analyze the difference between the PARAFACmodel for multichannel gains estimation as proposed inSection 51 compared with the single channel estimation ofthe gains in Section 34 We performed a one-way ANOVAon SDR and we obtain a 119865-value = 1712 and a 119901-value =01908 Hence there is no significant difference betweensingle channel and multichannel gain estimation when weare not performing postprocessing of the gains using grainrefinement However despite the new updates rule do nothelp in the multichannel case we are able to better refinethe gains In this case we aggregate information all overthe channels and blob detection is more robust even tovariations of the binarization threshold To account for thatfor the piece by Bruckner Ref2 outperforms Ref1 in terms ofSDR and SIR Furthermore as seen in Table 4 the alignmentis always better for Ref2 than Ref1

The audio excerpts from the dataset used for evaluationas well as tracks separated with the ground truth annotatedscore are made available (httprepovizzupfeduphenicxanechoic multi)

7 Applications

71 Instrument Emphasis The first application of our ap-proach for multichannel score-informed source separation

16 Journal of Electrical and Computer Engineering

Bruckner

1

3

5

7

SDR

0

2

4

6

8

SIR

0

2

4

6

8

10

ISR

02468

1012141618202224262830

SAR

(dB)

(dB)

(dB)

(dB)

minus1Mozart Beethoven Mahler

Test cases for gains initializationGTRef1Ref2

ExtAli

Test cases for gains initializationGTRef1Ref2

ExtAli

BrucknerMozart Beethoven Mahler

BrucknerMozart Beethoven MahlerBrucknerMozart Beethoven Mahler

Figure 8 Results in terms of SDR SIR AR and ISR for the NMF gains initialization in different test cases

is Instrument Emphasis which aims at processing multitrackorchestral recordings Once we have the separated tracksit allows emphasizing a particular instrument over the fullorchestra downmix recording Our workflow consists inprocessing multichannel orchestral recording from whichthe musical score is available The multitrack recordings areobtained froma typical on-stage setup in a concert hall wheremultiple microphones are placed on stage at certain distanceof the sourcesThe goal is to reduce leakage of other sectionsobtaining enhanced signal for the selected instrument

In terms of system integration this application hastwo parts The front-end is responsible for interacting withthe user in the uploading of media content and presentthe results to the user The back-end is responsible formanaging the audio data workflow between the differentsignal processing components We process the audio filesin batch estimating the signal decomposition for the fulllength For long audio files as in the case of symphonicrecordings the memory requirements can be too demandingeven for a server infrastructure Therefore to overcomethis limitation audio files are split into blocks After the

separation has been performed the blocks associated witheach instrument are concatenated resulting in the separatedtracks The separation quality is not degraded if the blockshave sufficient duration In our case we set the block durationto 1 minute Examples of this application are found online(httprepovizzupfeduphenicx) and are integrated into thePHENICX prototype (httpphenicxcom)

72 Acoustic Rendering The second application for the sep-arated material is augmented or virtual reality scenarioswhere we apply a spatialization of the separated musicalsources Acoustic Rendering aims at recreating acousticallythe recorded performance from a specific listening locationand orientation with a controllable disposition of instru-ments on the stage and the listener

We have considered binaural synthesis as the most suit-able spatial audio technique for this application Humanslocate the direction of incoming sound based on a number ofcues depending on the angle and distance between listenerand source the sound will arrive with a different intensity

Journal of Electrical and Computer Engineering 17

Bruckner

1

3

5

7

SDR

(dB)

minus1Mozart Beethoven Mahler

AlignmentGTINT T1INT T2NEX T1

NEX T2INT ANEX A

Figure 9 Results in terms of SDR SIR SAR and ISR for thecombination between different offset estimation methods (INT andExt) and different sizes for the tolerance window for note onsets andoffsets (T1 and T2) GT is the ground truth alignment and Ali is theoutput of the score alignment

and at different time instances at both ears The idea behindbinaural synthesis is to artificially generate these cues to beable to create an illusion of directivity of a sound sourcewhen reproduced over headphones [43 44] For a rapidintegration into the virtual reality prototype we have usedthe noncommercial plugin for Unity3D of binaural synthesisprovided by 3DCeption (httpstwobigearscomindexphp)

Specifically this application opens new possibilities inthe area of virtual reality (VR) where video companiesare already producing orchestral performances specificallyrecorded for VR experiences (eg the company WeMakeVRhas collaborated with the London Symphony Orchestraand the Berliner Philharmoniker httpswwwyoutubecomwatchv=ts4oXFmpacA) Using a VR headset with head-phones theAcoustic Rendering application is able to performan acoustic zoom effect when pointing at a given instrumentor section

73 Source Localization The third application aims to esti-mate the spatial location of musical instruments on stageThis application is useful for recordings where the orchestralayout is unknown (eg small ensemble performances) forthe instrument visualization and Acoustic Rendering use-cases introduced above

As for inputs for the source localization method we needthe multichannel recordings and the approximate positionof the microphones on stage In concert halls the recordingsetup consists typically of a grid structure of hanging over-headmicsThe position of the overheadmics is therefore keptas metadata of the performance recording

Automatic Sound Source Localization (SSL) meth-ods make use of microphone arrays and complex signal

processing techniques however undesired effects such asacoustic reflections and noise make this process difficultbeing currently a hot-topic task in acoustic signal processing[45]

Our approach is a novel time difference of arrival(TDOA) method based on note-onset delay estimation Ittakes the refined score alignment obtained prior to thesignal separation (see Section 41) It follows two stepsfirst for each instrument source the relative time delaysfor the various microphone pairs are evaluated and thenthe source location is found as the intersection of a pairof a set of half-hyperboloids centered around the differentmicrophone pairs Each half-hyperboloid determines thepossible location of a sound source based on the measure ofthe time difference of arrival between the two microphonesfor a specific instrument

To determine the time delay for each instrument andmicrophone pair we evaluate a list of time delay valuescorresponding to all note onsets in the score and take themaximum of the histogram In our experiment we havea time resolution of 28ms corresponding to the Fouriertransform hop size Note that this method does not requiretime intervals in which the sources to play isolated as SRP-PHAT [45] and can be used in complex scenarios

8 Outlook

In this paper we proposed a framework for score-informedseparation of multichannel orchestral recordings in distant-microphone scenarios Furthermore we presented a datasetwhich allows for an objective evaluation of alignment andseparation (Section 61) and proposed a methodology forfuture research to understand the contribution of the differentsteps of the framework (Section 65) Then we introducedseveral applications of our framework (Section 7) To ourknowledge this is the first time the complex scenario oforchestral multichannel recordings is objectively evaluatedfor the task of score-informed source separation

Our framework relies on the accuracy of an audio-to-score alignment system Thus we assessed the influence ofthe alignment on the quality of separation Moreover weproposed and evaluated approaches to refining the alignmentwhich improved the separation for three of the four piecesin the dataset when compared to other two initializationoptions for our framework the raw output of the scorealignment and the tolerance window which the alignmentrelies on

The evaluation shows that the estimation of panningmatrix is an important step Errors in the panning matrixcan result into more interference in separated audio or toproblems in recovery of the amplitude of a signal Sincethe method relies on finding nonoverlapping partials anestimation done on a larger time frame is more robustFurther improvements on determining the correct channelfor an instrument can take advantage of our approach forsource localization in Section 7 provided that the method isreliable enough in localizing a large number of instrumentsTo that extent the best microphone to separate a source is theclosest one determined by the localization method

18 Journal of Electrical and Computer Engineering

When looking at separation across the instruments violacello and double bass were more problematic in the morecomplex pieces In fact the quality of the separation in ourexperiment varies within the pieces and the instruments andfuture research could provide more insight on this problemNote that increasing degree of consonance was related toa more difficult case for source separation [17] Hence wecould expect a worse separation for instruments which areharmonizing or accompanying other sections as the case ofviola cello and double bass in some pieces Future researchcould find more insight on the relation between the musicalcharacteristics of the pieces (eg tonality and texture) andsource separation quality

The evaluation was conducted on the dataset presentedin Section 61 The creation of the dataset was a verylaborious task which involved annotating around 12000 pairsof onsets and offsets denoising the original recordings andtesting different room configurations in order to create themultichannel recordings To that extent annotations helpedus to denoise the audio files which could then be usedin score-informed source separation experiments Further-more the annotations allow for other tasks to be testedwithinthis challenging scenario such as instrument detection ortranscription

Finally we presented several applications of the proposedframework related to Instrument Emphasis or Acoustic Ren-dering some of which are already at the stage of functionalproducts

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

The authors would like to thank all the partners of thePHENICX consortium for a very fruitful collaborationParticularly they would like to thank Agustin Martorell forhis insights on the musicology side and correcting the scoresandOscarMayor for his help regardingRepovizz and his con-tribution on the applicationsThis work is partially supportedby the Spanish Ministry of Economy and Competitivenessunder CASAS Project (TIN2015-70816-R)

References

[1] E Gomez M Grachten A Hanjalic et al ldquoPHENICX per-formances as highly enriched aNd Interactive concert experi-encesrdquo in Proceedings of the SMAC Stockholm Music AcousticsConference 2013 and SMC Sound and Music Computing Confer-ence 2013

[2] S Ewert and M Muller ldquoUsing score-informed constraints forNMF-based source separationrdquo in Proceedings of the IEEE Inter-national Conference on Acoustics Speech and Signal Processing(ICASSP rsquo12) pp 129ndash132 Kyoto Japan March 2012

[3] J Fritsch and M D Plumbley ldquoScore informed audio sourceseparation using constrained nonnegative matrix factorizationand score synthesisrdquo in Proceedings of the 38th IEEE Interna-tional Conference on Acoustics Speech and Signal Processing(ICASSP rsquo13) pp 888ndash891 Vancouver Canada May 2013

[4] Z Duan and B Pardo ldquoSoundprism an online system for score-informed source separation of music audiordquo IEEE Journal onSelected Topics in Signal Processing vol 5 no 6 pp 1205ndash12152011

[5] R Hennequin B David and R Badeau ldquoScore informed audiosource separation using a parametric model of non-negativespectrogramrdquo in Proceedings of the 36th IEEE InternationalConference on Acoustics Speech and Signal Processing (ICASSPrsquo11) pp 45ndash48 May 2011

[6] J J Carabias-Orti M Cobos P Vera-Candeas and F JRodriguez-Serrano ldquoNonnegative signal factorization withlearnt instrument models for sound source separation in close-microphone recordingsrdquo EURASIP Journal on Advances inSignal Processing vol 2013 article 184 2013

[7] T Pratzlich R M Bittner A Liutkus and M Muller ldquoKernelAdditive Modeling for interference reduction in multi-channelmusic recordingsrdquo in Proceedings of the 40th IEEE InternationalConference on Acoustics Speech and Signal Processing (ICASSPrsquo15) pp 584ndash588 Brisbane Australia April 2014

[8] S Ewert B Pardo M Mueller and M D Plumbley ldquoScore-informed source separation for musical audio recordings anoverviewrdquo IEEE Signal Processing Magazine vol 31 no 3 pp116ndash124 2014

[9] F J Rodriguez-Serrano Z Duan P Vera-Candeas B Pardoand J J Carabias-Orti ldquoOnline score-informed source separa-tion with adaptive instrument modelsrdquo Journal of New MusicResearch vol 44 no 2 pp 83ndash96 2015

[10] F J Rodriguez-Serrano S Ewert P Vera-Candeas and MSandler ldquoA score-informed shift-invariant extension of complexmatrix factorization for improving the separation of overlappedpartials in music recordingsrdquo in Proceedings of the IEEE Inter-national Conference on Acoustics Speech and Signal Processing(ICASSP rsquo16) pp 61ndash65 Shanghai China 2016

[11] M Miron J J Carabias and J Janer ldquoImproving score-informed source separation for classical music through noterefinementrdquo in Proceedings of the 16th International Societyfor Music Information Retrieval (ISMIR rsquo15) Malaga SpainOctober 2015

[12] A Arzt H Frostel T Gadermaier M Gasser M Grachtenand G Widmer ldquoArtificial intelligence in the concertgebouwrdquoin Proceedings of the 24th International Joint Conference onArtificial Intelligence pp 165ndash176 Buenos Aires Argentina2015

[13] J J Carabias-Orti F J Rodriguez-Serrano P Vera-CandeasN Ruiz-Reyes and F J Canadas-Quesada ldquoAn audio to scorealignment framework using spectral factorization and dynamictimewarpingrdquo inProceedings of the 16th International Society forMusic Information Retrieval (ISMIR rsquo15) Malaga Spain 2015

[14] O Izmirli and R Dannenberg ldquoUnderstanding features anddistance functions for music sequence alignmentrdquo in Proceed-ings of the 11th International Society for Music InformationRetrieval Conference (ISMIR rsquo10) pp 411ndash416 Utrecht TheNetherlands 2010

[15] M Goto ldquoDevelopment of the RWC music databaserdquo inProceedings of the 18th International Congress on Acoustics (ICArsquo04) pp 553ndash556 Kyoto Japan April 2004

[16] S Ewert M Muller and P Grosche ldquoHigh resolution audiosynchronization using chroma onset featuresrdquo in Proceedingsof the IEEE International Conference on Acoustics Speech andSignal Processing (ICASSP rsquo09) pp 1869ndash1872 IEEE TaipeiTaiwan April 2009

Journal of Electrical and Computer Engineering 19

[17] J J Burred From sparse models to timbre learning new methodsfor musical source separation [PhD thesis] 2008

[18] F J Rodriguez-Serrano J J Carabias-Orti P Vera-CandeasT Virtanen and N Ruiz-Reyes ldquoMultiple instrument mix-tures source separation evaluation using instrument-dependentNMFmodelsrdquo inProceedings of the 10th international conferenceon Latent Variable Analysis and Signal Separation (LVAICA rsquo12)pp 380ndash387 Tel Aviv Israel March 2012

[19] A Klapuri T Virtanen and T Heittola ldquoSound source sepa-ration in monaural music signals using excitation-filter modeland EM algorithmrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing (ICASSPrsquo10) pp 5510ndash5513 March 2010

[20] J-L Durrieu A Ozerov C Fevotte G Richard and B DavidldquoMain instrument separation from stereophonic audio signalsusing a sourcefilter modelrdquo in Proceedings of the 17th EuropeanSignal Processing Conference (EUSIPCO rsquo09) pp 15ndash19 GlasgowUK August 2009

[21] J J Bosch K Kondo R Marxer and J Janer ldquoScore-informedand timbre independent lead instrument separation in real-world scenariosrdquo in Proceedings of the 20th European SignalProcessing Conference (EUSIPCO rsquo12) pp 2417ndash2421 August2012

[22] P S Huang M Kim M Hasegawa-Johnson and P SmaragdisldquoSinging-voice separation from monaural recordings usingdeep recurrent neural networksrdquo in Proceedings of the Inter-national Society for Music Information Retrieval Conference(ISMIR rsquo14) Taipei Taiwan October 2014

[23] E K Kokkinis and J Mourjopoulos ldquoUnmixing acousticsources in real reverberant environments for close-microphoneapplicationsrdquo Journal of the Audio Engineering Society vol 58no 11 pp 907ndash922 2010

[24] S Ewert and M Muller ldquoScore-informed voice separationfor piano recordingsrdquo in Proceedings of the 12th InternationalSociety for Music Information Retrieval Conference (ISMIR rsquo11)pp 245ndash250 Miami Fla USA October 2011

[25] B Niedermayer Accurate audio-to-score alignment-data acqui-sition in the context of computational musicology [PhD thesis]Johannes Kepler University Linz Linz Austria 2012

[26] S Wang S Ewert and S Dixon ldquoCompensating for asyn-chronies between musical voices in score-performance align-mentrdquo in Proceedings of the 40th IEEE International Conferenceon Acoustics Speech and Signal Processing (ICASSP rsquo15) pp589ndash593 April 2014

[27] M Miron J Carabias and J Janer ldquoAudio-to-score alignmentat the note level for orchestral recordingsrdquo in Proceedings ofthe 15th International Society for Music Information RetrievalConference (ISMIR rsquo14) Taipei Taiwan 2014

[28] D Fitzgerald M Cranitch and E Coyle ldquoNon-negative tensorfactorisation for sound source separation acoustics speech andsignal processingrdquo in Proceedings of the 2006 IEEE InternationalConference on Acoustics Speech and Signal (ICASSP rsquo06)Toulouse France 2006

[29] C Fevotte and A Ozerov ldquoNotes on nonnegative tensorfactorization of the spectrogram for audio source separationstatistical insights and towards self-clustering of the spatialcuesrdquo in Exploring Music Contents 7th International Sympo-sium CMMR 2010 Malaga Spain June 21ndash24 2010 RevisedPapers vol 6684 of Lecture Notes in Computer Science pp 102ndash115 Springer Berlin Germany 2011

[30] J Patynen V Pulkki and T Lokki ldquoAnechoic recording systemfor symphony orchestrardquo Acta Acustica United with Acusticavol 94 no 6 pp 856ndash865 2008

[31] D Campbell K Palomaki and G Brown ldquoAMatlab simulationof shoebox room acoustics for use in research and teachingrdquoComputing and Information Systems vol 9 no 3 p 48 2005

[32] O Mayor Q Llimona MMarchini P Papiotis and E MaestreldquoRepoVizz a framework for remote storage browsing anno-tation and exchange of multi-modal datardquo in Proceedings of the21st ACM International Conference onMultimedia (MM rsquo13) pp415ndash416 Barcelona Spain October 2013

[33] D D Lee and H S Seung ldquoLearning the parts of objects bynon-negative matrix factorizationrdquo Nature vol 401 no 6755pp 788ndash791 1999

[34] C Fevotte N Bertin and J-L Durrieu ldquoNonnegative matrixfactorization with the Itakura-Saito divergence with applica-tion to music analysisrdquo Neural Computation vol 21 no 3 pp793ndash830 2009

[35] M Nixon Feature Extraction and Image Processing ElsevierScience 2002

[36] R Parry and I Essa ldquoEstimating the spatial position of spectralcomponents in audiordquo in Independent Component Analysis andBlind Signal Separation 6th International Conference ICA 2006Charleston SC USA March 5ndash8 2006 Proceedings J RoscaD Erdogmus J C Prıncipe and S Haykin Eds vol 3889of Lecture Notes in Computer Science pp 666ndash673 SpringerBerlin Germany 2006

[37] P Papiotis M Marchini A Perez-Carrillo and E MaestreldquoMeasuring ensemble interdependence in a string quartetthrough analysis of multidimensional performance datardquo Fron-tiers in Psychology vol 5 article 963 2014

[38] M Mauch and S Dixon ldquoPYIN a fundamental frequency esti-mator using probabilistic threshold distributionsrdquo in Proceed-ings of the IEEE International Conference on Acoustics Speechand Signal Processing (ICASSP rsquo14) pp 659ndash663 Florence ItalyMay 2014

[39] F J Canadas-Quesada P Vera-Candeas D Martınez-MunozN Ruiz-Reyes J J Carabias-Orti and P Cabanas-MoleroldquoConstrained non-negative matrix factorization for score-informed piano music restorationrdquo Digital Signal Processingvol 50 pp 240ndash257 2016

[40] A Cont D Schwarz N Schnell and C Raphael ldquoEvaluationof real-time audio-to-score alignmentrdquo in Proceedings of the 8thInternational Conference onMusic Information Retrieval (ISMIRrsquo07) pp 315ndash316 Vienna Austria September 2007

[41] E Vincent R Gribonval and C Fevotte ldquoPerformance mea-surement in blind audio source separationrdquo IEEE Transactionson Audio Speech and Language Processing vol 14 no 4 pp1462ndash1469 2006

[42] V Emiya E Vincent N Harlander and V Hohmann ldquoSub-jective and objective quality assessment of audio source sep-arationrdquo IEEE Transactions on Audio Speech and LanguageProcessing vol 19 no 7 pp 2046ndash2057 2011

[43] D R Begault and E M Wenzel ldquoHeadphone localization ofspeechrdquo Human Factors vol 35 no 2 pp 361ndash376 1993

[44] G S Kendall ldquoA 3-D sound primer directional hearing andstereo reproductionrdquo Computer Music Journal vol 19 no 4 pp23ndash46 1995

[45] A Martı Guerola Multichannel audio processing for speakerlocalization separation and enhancement UniversitatPolitecnica de Valencia 2013



Figure 8: Results in terms of SDR, SIR, SAR, and ISR (dB) for the NMF gains initialization in the different test cases (GT, Ref1, Ref2, Ext, and Ali), for the Bruckner, Mozart, Beethoven, and Mahler pieces.

The first application is Instrument Emphasis, which aims at processing multitrack orchestral recordings. Once we have the separated tracks, it allows emphasizing a particular instrument over the full orchestra downmix recording. Our workflow consists in processing a multichannel orchestral recording for which the musical score is available. The multitrack recordings are obtained from a typical on-stage setup in a concert hall, where multiple microphones are placed on stage at a certain distance from the sources. The goal is to reduce the leakage of the other sections, obtaining an enhanced signal for the selected instrument.
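At its core, the emphasis is a remix of the separated track with the original downmix. The following is a minimal sketch of that step, assuming time-aligned mono signals; the function name and the gain parameter are illustrative and not part of the original system:

```python
import numpy as np

def emphasize(downmix: np.ndarray, separated: np.ndarray, gain_db: float = 6.0) -> np.ndarray:
    """Boost one separated instrument over the full orchestra downmix.

    Assumes both signals are time-aligned and share length and sample rate.
    """
    gain = 10.0 ** (gain_db / 20.0)            # dB to linear amplitude
    out = downmix + (gain - 1.0) * separated   # add extra energy of the chosen track
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out   # simple peak normalization to avoid clipping
```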

In terms of system integration, this application has two parts. The front-end is responsible for interacting with the user, uploading the media content, and presenting the results. The back-end is responsible for managing the audio data workflow between the different signal processing components. We process the audio files in batch, estimating the signal decomposition for the full length. For long audio files, as in the case of symphonic recordings, the memory requirements can be too demanding even for a server infrastructure. Therefore, to overcome this limitation, audio files are split into blocks. After the separation has been performed, the blocks associated with each instrument are concatenated, resulting in the separated tracks. The separation quality is not degraded if the blocks have a sufficient duration; in our case, we set the block duration to 1 minute. Examples of this application are found online (http://repovizz.upf.edu/phenicx) and are integrated into the PHENICX prototype (http://phenicx.com).
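The block-splitting logic described above can be sketched as follows; `separate` is a hypothetical stand-in for the full score-informed decomposition of one block, and the mono-signal assumption is ours:

```python
import numpy as np

BLOCK_SECONDS = 60  # we set the block duration to 1 minute

def separate_long_recording(audio, sr, separate, n_sources):
    """Split a long recording into blocks, separate each block, and
    concatenate the per-instrument results into full-length tracks.
    """
    hop = BLOCK_SECONDS * sr
    tracks = [[] for _ in range(n_sources)]
    for start in range(0, len(audio), hop):
        block = audio[start:start + hop]
        for i, source in enumerate(separate(block, sr)):  # returns n_sources arrays
            tracks[i].append(source)
    return [np.concatenate(chunks) for chunks in tracks]
```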

7.2. Acoustic Rendering. The second application for the separated material is augmented or virtual reality scenarios, where we apply a spatialization of the separated musical sources. Acoustic Rendering aims at recreating acoustically the recorded performance from a specific listening location and orientation, with a controllable disposition of the instruments on the stage relative to the listener. We have considered binaural synthesis to be the most suitable spatial audio technique for this application.

Figure 9: Results in terms of SDR, SIR, SAR, and ISR for the combination of different offset estimation methods (INT and NEX) and different sizes of the tolerance window for note onsets and offsets (T1 and T2), for the Bruckner, Mozart, Beethoven, and Mahler pieces. GT is the ground-truth alignment, and A is the output of the score alignment.

Humans locate the direction of incoming sound based on a number of cues: depending on the angle and distance between listener and source, the sound arrives with a different intensity and at different time instants at each ear. The idea behind binaural synthesis is to artificially generate these cues in order to create an illusion of directivity of a sound source when reproduced over headphones [43, 44]. For a rapid integration into the virtual reality prototype, we have used the noncommercial Unity3D binaural synthesis plugin provided by 3DCeption (https://twobigears.com/index.php).
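To make these two cues concrete, the sketch below delays and attenuates a mono source per ear using a spherical-head (Woodworth) approximation of the interaural time difference and a simple equal-power level difference; it is a didactic simplification under our own assumptions, not the plugin used in the prototype:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
HEAD_RADIUS = 0.0875    # m, average human head

def simple_binaural(mono, sr, azimuth_deg):
    """Render a mono source to stereo from a given azimuth
    (0 = front, positive = right) using ITD and ILD cues only."""
    az = np.radians(azimuth_deg)
    itd = (HEAD_RADIUS / SPEED_OF_SOUND) * (az + np.sin(az))  # Woodworth ITD, seconds
    delay = int(round(abs(itd) * sr))                          # samples
    gain_r = np.sqrt(0.5 * (1.0 + np.sin(az)))                 # equal-power gains
    gain_l = np.sqrt(0.5 * (1.0 - np.sin(az)))
    # The ear farther from the source receives the signal later.
    left = np.concatenate([np.zeros(delay if itd > 0 else 0), mono])
    right = np.concatenate([np.zeros(delay if itd < 0 else 0), mono])
    n = max(len(left), len(right))
    left = np.pad(left, (0, n - len(left))) * gain_l
    right = np.pad(right, (0, n - len(right))) * gain_r
    return np.stack([left, right], axis=1)
```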

Specifically, this application opens new possibilities in the area of virtual reality (VR), where video companies are already producing orchestral performances specifically recorded for VR experiences (e.g., the company WeMakeVR has collaborated with the London Symphony Orchestra and the Berliner Philharmoniker, https://www.youtube.com/watch?v=ts4oXFmpacA). Using a VR headset with headphones, the Acoustic Rendering application is able to perform an acoustic zoom effect when pointing at a given instrument or section.

7.3. Source Localization. The third application aims to estimate the spatial location of musical instruments on stage. This application is useful for recordings where the orchestra layout is unknown (e.g., small ensemble performances), for the instrument visualization and Acoustic Rendering use-cases introduced above.

As inputs for the source localization method, we need the multichannel recordings and the approximate positions of the microphones on stage. In concert halls, the recording setup typically consists of a grid structure of hanging overhead microphones, so the positions of the overhead microphones are kept as metadata of the performance recording.
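For illustration, such metadata could be stored as a simple mapping from microphone identifiers to approximate stage coordinates; the format and values below are hypothetical, not a schema prescribed by the paper:

```python
# Approximate positions (x, y, z) in meters of the hanging overhead
# microphones, with the origin at the front-center of the stage.
MIC_POSITIONS = {
    "overhead_L1": (-4.0, 2.0, 5.5),
    "overhead_R1": (4.0, 2.0, 5.5),
    "overhead_L2": (-4.0, 6.0, 5.5),
    "overhead_R2": (4.0, 6.0, 5.5),
}
```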

Automatic Sound Source Localization (SSL) methods make use of microphone arrays and complex signal processing techniques; however, undesired effects such as acoustic reflections and noise make this process difficult, and it is currently a hot topic in acoustic signal processing [45].

Our approach is a novel time difference of arrival (TDOA) method based on note-onset delay estimation. It takes as input the refined score alignment obtained prior to the signal separation (see Section 4.1). It follows two steps: first, for each instrument source, the relative time delays for the various microphone pairs are evaluated; then, the source location is found as the intersection of a set of half-hyperboloids centered around the different microphone pairs. Each half-hyperboloid determines the possible locations of a sound source based on the measured time difference of arrival between the two microphones for a specific instrument.

To determine the time delay for each instrument and microphone pair, we evaluate a list of time delay values corresponding to all note onsets in the score and take the maximum of their histogram. In our experiment, we have a time resolution of 2.8 ms, corresponding to the Fourier transform hop size. Note that this method does not require time intervals in which the sources play in isolation, as SRP-PHAT [45] does, and can therefore be used in complex scenarios.
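A condensed sketch of the two steps is given below. The cross-correlation around each onset and the grid search over candidate stage positions (standing in for the intersection of half-hyperboloids) are our own simplifications; all names are illustrative:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def pair_delay(sig_a, sig_b, onsets, sr, win_s=0.1, max_lag_s=0.05):
    """Step 1: TDOA of one instrument for one microphone pair.

    For each note onset (seconds, from the refined alignment),
    cross-correlate short windows of the two channels, collect the
    lag of the correlation peak, and return the histogram maximum.
    """
    w, max_lag, lags = int(win_s * sr), int(max_lag_s * sr), []
    for t in onsets:
        i = int(t * sr)
        a, b = sig_a[i:i + w], sig_b[i:i + w]
        if len(a) < w or len(b) < w:
            continue
        xcorr = np.correlate(a, b, mode="full")
        center = w - 1  # index of zero lag
        seg = xcorr[center - max_lag:center + max_lag + 1]
        lags.append((np.argmax(seg) - max_lag) / sr)
    if not lags:
        return 0.0
    hist, edges = np.histogram(lags, bins=50)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])  # mode of the delay histogram

def locate(mic_pos, delays, grid):
    """Step 2: choose the stage position whose predicted pairwise
    delays best match the measured ones (least squares over a grid)."""
    best, best_err = None, np.inf
    for p in grid:  # candidate positions as numpy arrays
        err = sum(
            ((np.linalg.norm(p - mic_pos[i]) - np.linalg.norm(p - mic_pos[j]))
             / SPEED_OF_SOUND - tau) ** 2
            for (i, j), tau in delays.items()
        )
        if err < best_err:
            best, best_err = p, err
    return best
```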

8. Outlook

In this paper, we proposed a framework for score-informed separation of multichannel orchestral recordings in distant-microphone scenarios. Furthermore, we presented a dataset which allows for an objective evaluation of alignment and separation (Section 6.1) and proposed a methodology for future research to understand the contribution of the different steps of the framework (Section 6.5). Then, we introduced several applications of our framework (Section 7). To our knowledge, this is the first time the complex scenario of orchestral multichannel recordings has been objectively evaluated for the task of score-informed source separation.

Our framework relies on the accuracy of an audio-to-score alignment system. Thus, we assessed the influence of the alignment on the quality of separation. Moreover, we proposed and evaluated approaches to refining the alignment, which improved the separation for three of the four pieces in the dataset when compared with the other two initialization options for our framework: the raw output of the score alignment and the tolerance window on which the alignment relies.

The evaluation shows that the estimation of the panning matrix is an important step. Errors in the panning matrix can result in more interference in the separated audio or in problems recovering the amplitude of a signal. Since the method relies on finding nonoverlapping partials, an estimation done on a larger time frame is more robust. Further improvements in determining the correct channel for an instrument can take advantage of our approach to source localization in Section 7, provided that the method is reliable enough in localizing a large number of instruments. To that extent, the best microphone with which to separate a source is the closest one, as determined by the localization method.


When looking at separation across the instruments, viola, cello, and double bass were more problematic in the more complex pieces. In fact, the quality of the separation in our experiment varies across pieces and instruments, and future research could provide more insight into this problem. Note that an increasing degree of consonance has been related to more difficult source separation [17]. Hence, we could expect a worse separation for instruments which are harmonizing or accompanying other sections, as is the case of viola, cello, and double bass in some pieces. Future research could shed more light on the relation between the musical characteristics of the pieces (e.g., tonality and texture) and source separation quality.

The evaluation was conducted on the dataset presented in Section 6.1. The creation of the dataset was a very laborious task which involved annotating around 12000 pairs of onsets and offsets, denoising the original recordings, and testing different room configurations in order to create the multichannel recordings. To that extent, the annotations helped us to denoise the audio files, which could then be used in score-informed source separation experiments. Furthermore, the annotations allow other tasks to be tested within this challenging scenario, such as instrument detection or transcription.

Finally, we presented several applications of the proposed framework related to Instrument Emphasis or Acoustic Rendering, some of which are already at the stage of functional products.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank all the partners of the PHENICX consortium for a very fruitful collaboration. Particularly, they would like to thank Agustin Martorell for his insights on the musicology side and for correcting the scores and Oscar Mayor for his help regarding RepoVizz and his contribution to the applications. This work is partially supported by the Spanish Ministry of Economy and Competitiveness under the CASAS project (TIN2015-70816-R).

References

[1] E. Gómez, M. Grachten, A. Hanjalic et al., "PHENICX: performances as highly enriched aNd Interactive concert experiences," in Proceedings of the SMAC Stockholm Music Acoustics Conference 2013 and SMC Sound and Music Computing Conference, 2013.

[2] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), pp. 129–132, Kyoto, Japan, March 2012.

[3] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 888–891, Vancouver, Canada, May 2013.

[4] Z. Duan and B. Pardo, "Soundprism: an online system for score-informed source separation of music audio," IEEE Journal on Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.

[5] R. Hennequin, B. David, and R. Badeau, "Score informed audio source separation using a parametric model of non-negative spectrogram," in Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 45–48, May 2011.

[6] J. J. Carabias-Orti, M. Cobos, P. Vera-Candeas, and F. J. Rodriguez-Serrano, "Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings," EURASIP Journal on Advances in Signal Processing, vol. 2013, article 184, 2013.

[7] T. Prätzlich, R. M. Bittner, A. Liutkus, and M. Müller, "Kernel Additive Modeling for interference reduction in multi-channel music recordings," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 584–588, Brisbane, Australia, April 2015.

[8] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: an overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.

[9] F. J. Rodriguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.

[10] F. J. Rodriguez-Serrano, S. Ewert, P. Vera-Candeas, and M. Sandler, "A score-informed shift-invariant extension of complex matrix factorization for improving the separation of overlapped partials in music recordings," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '16), pp. 61–65, Shanghai, China, 2016.

[11] M. Miron, J. J. Carabias, and J. Janer, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Malaga, Spain, October 2015.

[12] A. Arzt, H. Frostel, T. Gadermaier, M. Gasser, M. Grachten, and G. Widmer, "Artificial intelligence in the Concertgebouw," in Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 165–176, Buenos Aires, Argentina, 2015.

[13] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR '15), Malaga, Spain, 2015.

[14] O. Izmirli and R. Dannenberg, "Understanding features and distance functions for music sequence alignment," in Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR '10), pp. 411–416, Utrecht, The Netherlands, 2010.

[15] M. Goto, "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA '04), pp. 553–556, Kyoto, Japan, April 2004.

[16] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 1869–1872, Taipei, Taiwan, April 2009.

[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [Ph.D. thesis], 2008.

[18] F. J. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, "Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models," in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), pp. 380–387, Tel Aviv, Israel, March 2012.

[19] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), pp. 5510–5513, March 2010.

[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, "Main instrument separation from stereophonic audio signals using a source/filter model," in Proceedings of the 17th European Signal Processing Conference (EUSIPCO '09), pp. 15–19, Glasgow, UK, August 2009.

[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, "Score-informed and timbre independent lead instrument separation in real-world scenarios," in Proceedings of the 20th European Signal Processing Conference (EUSIPCO '12), pp. 2417–2421, August 2012.

[22] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, October 2014.

[23] E. K. Kokkinis and J. Mourjopoulos, "Unmixing acoustic sources in real reverberant environments for close-microphone applications," Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.

[24] S. Ewert and M. Müller, "Score-informed voice separation for piano recordings," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR '11), pp. 245–250, Miami, Fla, USA, October 2011.

[25] B. Niedermayer, Accurate audio-to-score alignment: data acquisition in the context of computational musicology [Ph.D. thesis], Johannes Kepler University Linz, Linz, Austria, 2012.

[26] S. Wang, S. Ewert, and S. Dixon, "Compensating for asynchronies between musical voices in score-performance alignment," in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '15), pp. 589–593, April 2015.

[27] M. Miron, J. Carabias, and J. Janer, "Audio-to-score alignment at the note level for orchestral recordings," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR '14), Taipei, Taiwan, 2014.

[28] D. Fitzgerald, M. Cranitch, and E. Coyle, "Non-negative tensor factorisation for sound source separation," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, 2006.

[29] C. Févotte and A. Ozerov, "Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues," in Exploring Music Contents: 7th International Symposium, CMMR 2010, Malaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.

[30] J. Pätynen, V. Pulkki, and T. Lokki, "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.

[31] D. Campbell, K. Palomäki, and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and teaching," Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.

[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, "RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 415–416, Barcelona, Spain, October 2013.

[33] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[34] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.

[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.

[36] R. Parry and I. Essa, "Estimating the spatial position of spectral components in audio," in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.

[37] P. Papiotis, M. Marchini, A. Perez-Carrillo, and E. Maestre, "Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data," Frontiers in Psychology, vol. 5, article 963, 2014.

[38] M. Mauch and S. Dixon, "PYIN: a fundamental frequency estimator using probabilistic threshold distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '14), pp. 659–663, Florence, Italy, May 2014.

[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, "Constrained non-negative matrix factorization for score-informed piano music restoration," Digital Signal Processing, vol. 50, pp. 240–257, 2016.

[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, "Evaluation of real-time audio-to-score alignment," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR '07), pp. 315–316, Vienna, Austria, September 2007.

[41] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.

[43] D. R. Begault and E. M. Wenzel, "Headphone localization of speech," Human Factors, vol. 35, no. 2, pp. 361–376, 1993.

[44] G. S. Kendall, "A 3-D sound primer: directional hearing and stereo reproduction," Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.

[45] A. Martí Guerola, Multichannel audio processing for speaker localization, separation and enhancement [Ph.D. thesis], Universitat Politècnica de València, 2013.

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 17: Research Article Score-Informed Source Separation for ... · is paper proposes a system for score-informed audio source separation for multichannel orchestral recordings. e orchestral

Journal of Electrical and Computer Engineering 17

Bruckner

1

3

5

7

SDR

(dB)

minus1Mozart Beethoven Mahler

AlignmentGTINT T1INT T2NEX T1

NEX T2INT ANEX A

Figure 9 Results in terms of SDR SIR SAR and ISR for thecombination between different offset estimation methods (INT andExt) and different sizes for the tolerance window for note onsets andoffsets (T1 and T2) GT is the ground truth alignment and Ali is theoutput of the score alignment

and at different time instances at both ears The idea behindbinaural synthesis is to artificially generate these cues to beable to create an illusion of directivity of a sound sourcewhen reproduced over headphones [43 44] For a rapidintegration into the virtual reality prototype we have usedthe noncommercial plugin for Unity3D of binaural synthesisprovided by 3DCeption (httpstwobigearscomindexphp)

Specifically this application opens new possibilities inthe area of virtual reality (VR) where video companiesare already producing orchestral performances specificallyrecorded for VR experiences (eg the company WeMakeVRhas collaborated with the London Symphony Orchestraand the Berliner Philharmoniker httpswwwyoutubecomwatchv=ts4oXFmpacA) Using a VR headset with head-phones theAcoustic Rendering application is able to performan acoustic zoom effect when pointing at a given instrumentor section

73 Source Localization The third application aims to esti-mate the spatial location of musical instruments on stageThis application is useful for recordings where the orchestralayout is unknown (eg small ensemble performances) forthe instrument visualization and Acoustic Rendering use-cases introduced above

As for inputs for the source localization method we needthe multichannel recordings and the approximate positionof the microphones on stage In concert halls the recordingsetup consists typically of a grid structure of hanging over-headmicsThe position of the overheadmics is therefore keptas metadata of the performance recording

Automatic Sound Source Localization (SSL) meth-ods make use of microphone arrays and complex signal

processing techniques however undesired effects such asacoustic reflections and noise make this process difficultbeing currently a hot-topic task in acoustic signal processing[45]

Our approach is a novel time difference of arrival(TDOA) method based on note-onset delay estimation Ittakes the refined score alignment obtained prior to thesignal separation (see Section 41) It follows two stepsfirst for each instrument source the relative time delaysfor the various microphone pairs are evaluated and thenthe source location is found as the intersection of a pairof a set of half-hyperboloids centered around the differentmicrophone pairs Each half-hyperboloid determines thepossible location of a sound source based on the measure ofthe time difference of arrival between the two microphonesfor a specific instrument

To determine the time delay for each instrument andmicrophone pair we evaluate a list of time delay valuescorresponding to all note onsets in the score and take themaximum of the histogram In our experiment we havea time resolution of 28ms corresponding to the Fouriertransform hop size Note that this method does not requiretime intervals in which the sources to play isolated as SRP-PHAT [45] and can be used in complex scenarios

8 Outlook

In this paper we proposed a framework for score-informedseparation of multichannel orchestral recordings in distant-microphone scenarios Furthermore we presented a datasetwhich allows for an objective evaluation of alignment andseparation (Section 61) and proposed a methodology forfuture research to understand the contribution of the differentsteps of the framework (Section 65) Then we introducedseveral applications of our framework (Section 7) To ourknowledge this is the first time the complex scenario oforchestral multichannel recordings is objectively evaluatedfor the task of score-informed source separation

Our framework relies on the accuracy of an audio-to-score alignment system Thus we assessed the influence ofthe alignment on the quality of separation Moreover weproposed and evaluated approaches to refining the alignmentwhich improved the separation for three of the four piecesin the dataset when compared to other two initializationoptions for our framework the raw output of the scorealignment and the tolerance window which the alignmentrelies on

The evaluation shows that the estimation of panningmatrix is an important step Errors in the panning matrixcan result into more interference in separated audio or toproblems in recovery of the amplitude of a signal Sincethe method relies on finding nonoverlapping partials anestimation done on a larger time frame is more robustFurther improvements on determining the correct channelfor an instrument can take advantage of our approach forsource localization in Section 7 provided that the method isreliable enough in localizing a large number of instrumentsTo that extent the best microphone to separate a source is theclosest one determined by the localization method

18 Journal of Electrical and Computer Engineering

When looking at separation across the instruments violacello and double bass were more problematic in the morecomplex pieces In fact the quality of the separation in ourexperiment varies within the pieces and the instruments andfuture research could provide more insight on this problemNote that increasing degree of consonance was related toa more difficult case for source separation [17] Hence wecould expect a worse separation for instruments which areharmonizing or accompanying other sections as the case ofviola cello and double bass in some pieces Future researchcould find more insight on the relation between the musicalcharacteristics of the pieces (eg tonality and texture) andsource separation quality

The evaluation was conducted on the dataset presentedin Section 61 The creation of the dataset was a verylaborious task which involved annotating around 12000 pairsof onsets and offsets denoising the original recordings andtesting different room configurations in order to create themultichannel recordings To that extent annotations helpedus to denoise the audio files which could then be usedin score-informed source separation experiments Further-more the annotations allow for other tasks to be testedwithinthis challenging scenario such as instrument detection ortranscription

Finally we presented several applications of the proposedframework related to Instrument Emphasis or Acoustic Ren-dering some of which are already at the stage of functionalproducts

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

The authors would like to thank all the partners of thePHENICX consortium for a very fruitful collaborationParticularly they would like to thank Agustin Martorell forhis insights on the musicology side and correcting the scoresandOscarMayor for his help regardingRepovizz and his con-tribution on the applicationsThis work is partially supportedby the Spanish Ministry of Economy and Competitivenessunder CASAS Project (TIN2015-70816-R)

References

[1] E Gomez M Grachten A Hanjalic et al ldquoPHENICX per-formances as highly enriched aNd Interactive concert experi-encesrdquo in Proceedings of the SMAC Stockholm Music AcousticsConference 2013 and SMC Sound and Music Computing Confer-ence 2013

[2] S Ewert and M Muller ldquoUsing score-informed constraints forNMF-based source separationrdquo in Proceedings of the IEEE Inter-national Conference on Acoustics Speech and Signal Processing(ICASSP rsquo12) pp 129ndash132 Kyoto Japan March 2012

[3] J Fritsch and M D Plumbley ldquoScore informed audio sourceseparation using constrained nonnegative matrix factorizationand score synthesisrdquo in Proceedings of the 38th IEEE Interna-tional Conference on Acoustics Speech and Signal Processing(ICASSP rsquo13) pp 888ndash891 Vancouver Canada May 2013

[4] Z Duan and B Pardo ldquoSoundprism an online system for score-informed source separation of music audiordquo IEEE Journal onSelected Topics in Signal Processing vol 5 no 6 pp 1205ndash12152011

[5] R Hennequin B David and R Badeau ldquoScore informed audiosource separation using a parametric model of non-negativespectrogramrdquo in Proceedings of the 36th IEEE InternationalConference on Acoustics Speech and Signal Processing (ICASSPrsquo11) pp 45ndash48 May 2011

[6] J J Carabias-Orti M Cobos P Vera-Candeas and F JRodriguez-Serrano ldquoNonnegative signal factorization withlearnt instrument models for sound source separation in close-microphone recordingsrdquo EURASIP Journal on Advances inSignal Processing vol 2013 article 184 2013

[7] T Pratzlich R M Bittner A Liutkus and M Muller ldquoKernelAdditive Modeling for interference reduction in multi-channelmusic recordingsrdquo in Proceedings of the 40th IEEE InternationalConference on Acoustics Speech and Signal Processing (ICASSPrsquo15) pp 584ndash588 Brisbane Australia April 2014

[8] S Ewert B Pardo M Mueller and M D Plumbley ldquoScore-informed source separation for musical audio recordings anoverviewrdquo IEEE Signal Processing Magazine vol 31 no 3 pp116ndash124 2014

[9] F J Rodriguez-Serrano Z Duan P Vera-Candeas B Pardoand J J Carabias-Orti ldquoOnline score-informed source separa-tion with adaptive instrument modelsrdquo Journal of New MusicResearch vol 44 no 2 pp 83ndash96 2015

[10] F J Rodriguez-Serrano S Ewert P Vera-Candeas and MSandler ldquoA score-informed shift-invariant extension of complexmatrix factorization for improving the separation of overlappedpartials in music recordingsrdquo in Proceedings of the IEEE Inter-national Conference on Acoustics Speech and Signal Processing(ICASSP rsquo16) pp 61ndash65 Shanghai China 2016

[11] M Miron J J Carabias and J Janer ldquoImproving score-informed source separation for classical music through noterefinementrdquo in Proceedings of the 16th International Societyfor Music Information Retrieval (ISMIR rsquo15) Malaga SpainOctober 2015

[12] A Arzt H Frostel T Gadermaier M Gasser M Grachtenand G Widmer ldquoArtificial intelligence in the concertgebouwrdquoin Proceedings of the 24th International Joint Conference onArtificial Intelligence pp 165ndash176 Buenos Aires Argentina2015

[13] J J Carabias-Orti F J Rodriguez-Serrano P Vera-CandeasN Ruiz-Reyes and F J Canadas-Quesada ldquoAn audio to scorealignment framework using spectral factorization and dynamictimewarpingrdquo inProceedings of the 16th International Society forMusic Information Retrieval (ISMIR rsquo15) Malaga Spain 2015

[14] O Izmirli and R Dannenberg ldquoUnderstanding features anddistance functions for music sequence alignmentrdquo in Proceed-ings of the 11th International Society for Music InformationRetrieval Conference (ISMIR rsquo10) pp 411ndash416 Utrecht TheNetherlands 2010

[15] M Goto ldquoDevelopment of the RWC music databaserdquo inProceedings of the 18th International Congress on Acoustics (ICArsquo04) pp 553ndash556 Kyoto Japan April 2004

[16] S Ewert M Muller and P Grosche ldquoHigh resolution audiosynchronization using chroma onset featuresrdquo in Proceedingsof the IEEE International Conference on Acoustics Speech andSignal Processing (ICASSP rsquo09) pp 1869ndash1872 IEEE TaipeiTaiwan April 2009

Journal of Electrical and Computer Engineering 19

[17] J J Burred From sparse models to timbre learning new methodsfor musical source separation [PhD thesis] 2008

[18] F J Rodriguez-Serrano J J Carabias-Orti P Vera-CandeasT Virtanen and N Ruiz-Reyes ldquoMultiple instrument mix-tures source separation evaluation using instrument-dependentNMFmodelsrdquo inProceedings of the 10th international conferenceon Latent Variable Analysis and Signal Separation (LVAICA rsquo12)pp 380ndash387 Tel Aviv Israel March 2012

[19] A Klapuri T Virtanen and T Heittola ldquoSound source sepa-ration in monaural music signals using excitation-filter modeland EM algorithmrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing (ICASSPrsquo10) pp 5510ndash5513 March 2010

[20] J-L Durrieu A Ozerov C Fevotte G Richard and B DavidldquoMain instrument separation from stereophonic audio signalsusing a sourcefilter modelrdquo in Proceedings of the 17th EuropeanSignal Processing Conference (EUSIPCO rsquo09) pp 15ndash19 GlasgowUK August 2009

[21] J J Bosch K Kondo R Marxer and J Janer ldquoScore-informedand timbre independent lead instrument separation in real-world scenariosrdquo in Proceedings of the 20th European SignalProcessing Conference (EUSIPCO rsquo12) pp 2417ndash2421 August2012

[22] P S Huang M Kim M Hasegawa-Johnson and P SmaragdisldquoSinging-voice separation from monaural recordings usingdeep recurrent neural networksrdquo in Proceedings of the Inter-national Society for Music Information Retrieval Conference(ISMIR rsquo14) Taipei Taiwan October 2014

[23] E K Kokkinis and J Mourjopoulos ldquoUnmixing acousticsources in real reverberant environments for close-microphoneapplicationsrdquo Journal of the Audio Engineering Society vol 58no 11 pp 907ndash922 2010

[24] S Ewert and M Muller ldquoScore-informed voice separationfor piano recordingsrdquo in Proceedings of the 12th InternationalSociety for Music Information Retrieval Conference (ISMIR rsquo11)pp 245ndash250 Miami Fla USA October 2011

[25] B Niedermayer Accurate audio-to-score alignment-data acqui-sition in the context of computational musicology [PhD thesis]Johannes Kepler University Linz Linz Austria 2012

[26] S Wang S Ewert and S Dixon ldquoCompensating for asyn-chronies between musical voices in score-performance align-mentrdquo in Proceedings of the 40th IEEE International Conferenceon Acoustics Speech and Signal Processing (ICASSP rsquo15) pp589ndash593 April 2014

[27] M Miron J Carabias and J Janer ldquoAudio-to-score alignmentat the note level for orchestral recordingsrdquo in Proceedings ofthe 15th International Society for Music Information RetrievalConference (ISMIR rsquo14) Taipei Taiwan 2014

[28] D Fitzgerald M Cranitch and E Coyle ldquoNon-negative tensorfactorisation for sound source separation acoustics speech andsignal processingrdquo in Proceedings of the 2006 IEEE InternationalConference on Acoustics Speech and Signal (ICASSP rsquo06)Toulouse France 2006

[29] C Fevotte and A Ozerov ldquoNotes on nonnegative tensorfactorization of the spectrogram for audio source separationstatistical insights and towards self-clustering of the spatialcuesrdquo in Exploring Music Contents 7th International Sympo-sium CMMR 2010 Malaga Spain June 21ndash24 2010 RevisedPapers vol 6684 of Lecture Notes in Computer Science pp 102ndash115 Springer Berlin Germany 2011

[30] J Patynen V Pulkki and T Lokki ldquoAnechoic recording systemfor symphony orchestrardquo Acta Acustica United with Acusticavol 94 no 6 pp 856ndash865 2008

[31] D Campbell K Palomaki and G Brown ldquoAMatlab simulationof shoebox room acoustics for use in research and teachingrdquoComputing and Information Systems vol 9 no 3 p 48 2005

[32] O Mayor Q Llimona MMarchini P Papiotis and E MaestreldquoRepoVizz a framework for remote storage browsing anno-tation and exchange of multi-modal datardquo in Proceedings of the21st ACM International Conference onMultimedia (MM rsquo13) pp415ndash416 Barcelona Spain October 2013

[33] D D Lee and H S Seung ldquoLearning the parts of objects bynon-negative matrix factorizationrdquo Nature vol 401 no 6755pp 788ndash791 1999

[34] C Fevotte N Bertin and J-L Durrieu ldquoNonnegative matrixfactorization with the Itakura-Saito divergence with applica-tion to music analysisrdquo Neural Computation vol 21 no 3 pp793ndash830 2009

[35] M Nixon Feature Extraction and Image Processing ElsevierScience 2002

[36] R Parry and I Essa ldquoEstimating the spatial position of spectralcomponents in audiordquo in Independent Component Analysis andBlind Signal Separation 6th International Conference ICA 2006Charleston SC USA March 5ndash8 2006 Proceedings J RoscaD Erdogmus J C Prıncipe and S Haykin Eds vol 3889of Lecture Notes in Computer Science pp 666ndash673 SpringerBerlin Germany 2006

[37] P Papiotis M Marchini A Perez-Carrillo and E MaestreldquoMeasuring ensemble interdependence in a string quartetthrough analysis of multidimensional performance datardquo Fron-tiers in Psychology vol 5 article 963 2014

[38] M Mauch and S Dixon ldquoPYIN a fundamental frequency esti-mator using probabilistic threshold distributionsrdquo in Proceed-ings of the IEEE International Conference on Acoustics Speechand Signal Processing (ICASSP rsquo14) pp 659ndash663 Florence ItalyMay 2014

[39] F J Canadas-Quesada P Vera-Candeas D Martınez-MunozN Ruiz-Reyes J J Carabias-Orti and P Cabanas-MoleroldquoConstrained non-negative matrix factorization for score-informed piano music restorationrdquo Digital Signal Processingvol 50 pp 240ndash257 2016

[40] A Cont D Schwarz N Schnell and C Raphael ldquoEvaluationof real-time audio-to-score alignmentrdquo in Proceedings of the 8thInternational Conference onMusic Information Retrieval (ISMIRrsquo07) pp 315ndash316 Vienna Austria September 2007

[41] E Vincent R Gribonval and C Fevotte ldquoPerformance mea-surement in blind audio source separationrdquo IEEE Transactionson Audio Speech and Language Processing vol 14 no 4 pp1462ndash1469 2006

[42] V Emiya E Vincent N Harlander and V Hohmann ldquoSub-jective and objective quality assessment of audio source sep-arationrdquo IEEE Transactions on Audio Speech and LanguageProcessing vol 19 no 7 pp 2046ndash2057 2011

[43] D R Begault and E M Wenzel ldquoHeadphone localization ofspeechrdquo Human Factors vol 35 no 2 pp 361ndash376 1993

[44] G S Kendall ldquoA 3-D sound primer directional hearing andstereo reproductionrdquo Computer Music Journal vol 19 no 4 pp23ndash46 1995

[45] A Martı Guerola Multichannel audio processing for speakerlocalization separation and enhancement UniversitatPolitecnica de Valencia 2013

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 18: Research Article Score-Informed Source Separation for ... · is paper proposes a system for score-informed audio source separation for multichannel orchestral recordings. e orchestral

18 Journal of Electrical and Computer Engineering

When looking at separation across the instruments violacello and double bass were more problematic in the morecomplex pieces In fact the quality of the separation in ourexperiment varies within the pieces and the instruments andfuture research could provide more insight on this problemNote that increasing degree of consonance was related toa more difficult case for source separation [17] Hence wecould expect a worse separation for instruments which areharmonizing or accompanying other sections as the case ofviola cello and double bass in some pieces Future researchcould find more insight on the relation between the musicalcharacteristics of the pieces (eg tonality and texture) andsource separation quality

The evaluation was conducted on the dataset presentedin Section 61 The creation of the dataset was a verylaborious task which involved annotating around 12000 pairsof onsets and offsets denoising the original recordings andtesting different room configurations in order to create themultichannel recordings To that extent annotations helpedus to denoise the audio files which could then be usedin score-informed source separation experiments Further-more the annotations allow for other tasks to be testedwithinthis challenging scenario such as instrument detection ortranscription

Finally we presented several applications of the proposedframework related to Instrument Emphasis or Acoustic Ren-dering some of which are already at the stage of functionalproducts

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

The authors would like to thank all the partners of thePHENICX consortium for a very fruitful collaborationParticularly they would like to thank Agustin Martorell forhis insights on the musicology side and correcting the scoresandOscarMayor for his help regardingRepovizz and his con-tribution on the applicationsThis work is partially supportedby the Spanish Ministry of Economy and Competitivenessunder CASAS Project (TIN2015-70816-R)

References

[1] E Gomez M Grachten A Hanjalic et al ldquoPHENICX per-formances as highly enriched aNd Interactive concert experi-encesrdquo in Proceedings of the SMAC Stockholm Music AcousticsConference 2013 and SMC Sound and Music Computing Confer-ence 2013

[2] S Ewert and M Muller ldquoUsing score-informed constraints forNMF-based source separationrdquo in Proceedings of the IEEE Inter-national Conference on Acoustics Speech and Signal Processing(ICASSP rsquo12) pp 129ndash132 Kyoto Japan March 2012

[3] J Fritsch and M D Plumbley ldquoScore informed audio sourceseparation using constrained nonnegative matrix factorizationand score synthesisrdquo in Proceedings of the 38th IEEE Interna-tional Conference on Acoustics Speech and Signal Processing(ICASSP rsquo13) pp 888ndash891 Vancouver Canada May 2013

[4] Z Duan and B Pardo ldquoSoundprism an online system for score-informed source separation of music audiordquo IEEE Journal onSelected Topics in Signal Processing vol 5 no 6 pp 1205ndash12152011

[5] R Hennequin B David and R Badeau ldquoScore informed audiosource separation using a parametric model of non-negativespectrogramrdquo in Proceedings of the 36th IEEE InternationalConference on Acoustics Speech and Signal Processing (ICASSPrsquo11) pp 45ndash48 May 2011

[6] J J Carabias-Orti M Cobos P Vera-Candeas and F JRodriguez-Serrano ldquoNonnegative signal factorization withlearnt instrument models for sound source separation in close-microphone recordingsrdquo EURASIP Journal on Advances inSignal Processing vol 2013 article 184 2013

[7] T Pratzlich R M Bittner A Liutkus and M Muller ldquoKernelAdditive Modeling for interference reduction in multi-channelmusic recordingsrdquo in Proceedings of the 40th IEEE InternationalConference on Acoustics Speech and Signal Processing (ICASSPrsquo15) pp 584ndash588 Brisbane Australia April 2014

[8] S Ewert B Pardo M Mueller and M D Plumbley ldquoScore-informed source separation for musical audio recordings anoverviewrdquo IEEE Signal Processing Magazine vol 31 no 3 pp116ndash124 2014

[9] F J Rodriguez-Serrano Z Duan P Vera-Candeas B Pardoand J J Carabias-Orti ldquoOnline score-informed source separa-tion with adaptive instrument modelsrdquo Journal of New MusicResearch vol 44 no 2 pp 83ndash96 2015

[10] F J Rodriguez-Serrano S Ewert P Vera-Candeas and MSandler ldquoA score-informed shift-invariant extension of complexmatrix factorization for improving the separation of overlappedpartials in music recordingsrdquo in Proceedings of the IEEE Inter-national Conference on Acoustics Speech and Signal Processing(ICASSP rsquo16) pp 61ndash65 Shanghai China 2016

[11] M Miron J J Carabias and J Janer ldquoImproving score-informed source separation for classical music through noterefinementrdquo in Proceedings of the 16th International Societyfor Music Information Retrieval (ISMIR rsquo15) Malaga SpainOctober 2015

[12] A Arzt H Frostel T Gadermaier M Gasser M Grachtenand G Widmer ldquoArtificial intelligence in the concertgebouwrdquoin Proceedings of the 24th International Joint Conference onArtificial Intelligence pp 165ndash176 Buenos Aires Argentina2015

[13] J J Carabias-Orti F J Rodriguez-Serrano P Vera-CandeasN Ruiz-Reyes and F J Canadas-Quesada ldquoAn audio to scorealignment framework using spectral factorization and dynamictimewarpingrdquo inProceedings of the 16th International Society forMusic Information Retrieval (ISMIR rsquo15) Malaga Spain 2015

[14] O Izmirli and R Dannenberg ldquoUnderstanding features anddistance functions for music sequence alignmentrdquo in Proceed-ings of the 11th International Society for Music InformationRetrieval Conference (ISMIR rsquo10) pp 411ndash416 Utrecht TheNetherlands 2010

[15] M Goto ldquoDevelopment of the RWC music databaserdquo inProceedings of the 18th International Congress on Acoustics (ICArsquo04) pp 553ndash556 Kyoto Japan April 2004

[16] S Ewert M Muller and P Grosche ldquoHigh resolution audiosynchronization using chroma onset featuresrdquo in Proceedingsof the IEEE International Conference on Acoustics Speech andSignal Processing (ICASSP rsquo09) pp 1869ndash1872 IEEE TaipeiTaiwan April 2009

Journal of Electrical and Computer Engineering 19

[17] J J Burred From sparse models to timbre learning new methodsfor musical source separation [PhD thesis] 2008

[18] F J Rodriguez-Serrano J J Carabias-Orti P Vera-CandeasT Virtanen and N Ruiz-Reyes ldquoMultiple instrument mix-tures source separation evaluation using instrument-dependentNMFmodelsrdquo inProceedings of the 10th international conferenceon Latent Variable Analysis and Signal Separation (LVAICA rsquo12)pp 380ndash387 Tel Aviv Israel March 2012

[19] A Klapuri T Virtanen and T Heittola ldquoSound source sepa-ration in monaural music signals using excitation-filter modeland EM algorithmrdquo in Proceedings of the IEEE InternationalConference on Acoustics Speech and Signal Processing (ICASSPrsquo10) pp 5510ndash5513 March 2010

[20] J-L Durrieu A Ozerov C Fevotte G Richard and B DavidldquoMain instrument separation from stereophonic audio signalsusing a sourcefilter modelrdquo in Proceedings of the 17th EuropeanSignal Processing Conference (EUSIPCO rsquo09) pp 15ndash19 GlasgowUK August 2009

[21] J J Bosch K Kondo R Marxer and J Janer ldquoScore-informedand timbre independent lead instrument separation in real-world scenariosrdquo in Proceedings of the 20th European SignalProcessing Conference (EUSIPCO rsquo12) pp 2417ndash2421 August2012

[22] P S Huang M Kim M Hasegawa-Johnson and P SmaragdisldquoSinging-voice separation from monaural recordings usingdeep recurrent neural networksrdquo in Proceedings of the Inter-national Society for Music Information Retrieval Conference(ISMIR rsquo14) Taipei Taiwan October 2014

[23] E K Kokkinis and J Mourjopoulos ldquoUnmixing acousticsources in real reverberant environments for close-microphoneapplicationsrdquo Journal of the Audio Engineering Society vol 58no 11 pp 907ndash922 2010

[24] S Ewert and M Muller ldquoScore-informed voice separationfor piano recordingsrdquo in Proceedings of the 12th InternationalSociety for Music Information Retrieval Conference (ISMIR rsquo11)pp 245ndash250 Miami Fla USA October 2011

[25] B Niedermayer Accurate audio-to-score alignment-data acqui-sition in the context of computational musicology [PhD thesis]Johannes Kepler University Linz Linz Austria 2012

[26] S Wang S Ewert and S Dixon ldquoCompensating for asyn-chronies between musical voices in score-performance align-mentrdquo in Proceedings of the 40th IEEE International Conferenceon Acoustics Speech and Signal Processing (ICASSP rsquo15) pp589ndash593 April 2014

[27] M Miron J Carabias and J Janer ldquoAudio-to-score alignmentat the note level for orchestral recordingsrdquo in Proceedings ofthe 15th International Society for Music Information RetrievalConference (ISMIR rsquo14) Taipei Taiwan 2014


[17] J. J. Burred, From sparse models to timbre learning: new methods for musical source separation [PhD thesis], 2008.

[18] F. J. Rodriguez-Serrano, J. J. Carabias-Orti, P. Vera-Candeas, T. Virtanen, and N. Ruiz-Reyes, “Multiple instrument mixtures source separation evaluation using instrument-dependent NMF models,” in Proceedings of the 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA ’12), pp. 380–387, Tel Aviv, Israel, March 2012.

[19] A. Klapuri, T. Virtanen, and T. Heittola, “Sound source separation in monaural music signals using excitation-filter model and EM algorithm,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’10), pp. 5510–5513, March 2010.

[20] J.-L. Durrieu, A. Ozerov, C. Févotte, G. Richard, and B. David, “Main instrument separation from stereophonic audio signals using a source/filter model,” in Proceedings of the 17th European Signal Processing Conference (EUSIPCO ’09), pp. 15–19, Glasgow, UK, August 2009.

[21] J. J. Bosch, K. Kondo, R. Marxer, and J. Janer, “Score-informed and timbre independent lead instrument separation in real-world scenarios,” in Proceedings of the 20th European Signal Processing Conference (EUSIPCO ’12), pp. 2417–2421, August 2012.

[22] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Singing-voice separation from monaural recordings using deep recurrent neural networks,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR ’14), Taipei, Taiwan, October 2014.

[23] E. K. Kokkinis and J. Mourjopoulos, “Unmixing acoustic sources in real reverberant environments for close-microphone applications,” Journal of the Audio Engineering Society, vol. 58, no. 11, pp. 907–922, 2010.

[24] S. Ewert and M. Müller, “Score-informed voice separation for piano recordings,” in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR ’11), pp. 245–250, Miami, Fla, USA, October 2011.

[25] B. Niedermayer, Accurate audio-to-score alignment: data acquisition in the context of computational musicology [PhD thesis], Johannes Kepler University Linz, Linz, Austria, 2012.

[26] S. Wang, S. Ewert, and S. Dixon, “Compensating for asynchronies between musical voices in score-performance alignment,” in Proceedings of the 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’15), pp. 589–593, April 2015.

[27] M. Miron, J. Carabias, and J. Janer, “Audio-to-score alignment at the note level for orchestral recordings,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR ’14), Taipei, Taiwan, 2014.

[28] D. FitzGerald, M. Cranitch, and E. Coyle, “Non-negative tensor factorisation for sound source separation,” in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’06), Toulouse, France, 2006.

[29] C. Févotte and A. Ozerov, “Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues,” in Exploring Music Contents: 7th International Symposium, CMMR 2010, Málaga, Spain, June 21–24, 2010, Revised Papers, vol. 6684 of Lecture Notes in Computer Science, pp. 102–115, Springer, Berlin, Germany, 2011.

[30] J. Pätynen, V. Pulkki, and T. Lokki, “Anechoic recording system for symphony orchestra,” Acta Acustica united with Acustica, vol. 94, no. 6, pp. 856–865, 2008.

[31] D. Campbell, K. Palomäki, and G. Brown, “A MATLAB simulation of shoebox room acoustics for use in research and teaching,” Computing and Information Systems, vol. 9, no. 3, p. 48, 2005.

[32] O. Mayor, Q. Llimona, M. Marchini, P. Papiotis, and E. Maestre, “RepoVizz: a framework for remote storage, browsing, annotation, and exchange of multi-modal data,” in Proceedings of the 21st ACM International Conference on Multimedia (MM ’13), pp. 415–416, Barcelona, Spain, October 2013.

[33] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[34] C. Févotte, N. Bertin, and J.-L. Durrieu, “Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis,” Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.

[35] M. Nixon, Feature Extraction and Image Processing, Elsevier Science, 2002.

[36] R. Parry and I. Essa, “Estimating the spatial position of spectral components in audio,” in Independent Component Analysis and Blind Signal Separation: 6th International Conference, ICA 2006, Charleston, SC, USA, March 5–8, 2006, Proceedings, J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin, Eds., vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Springer, Berlin, Germany, 2006.

[37] P. Papiotis, M. Marchini, A. Pérez-Carrillo, and E. Maestre, “Measuring ensemble interdependence in a string quartet through analysis of multidimensional performance data,” Frontiers in Psychology, vol. 5, article 963, 2014.

[38] M. Mauch and S. Dixon, “PYIN: a fundamental frequency estimator using probabilistic threshold distributions,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’14), pp. 659–663, Florence, Italy, May 2014.

[39] F. J. Cañadas-Quesada, P. Vera-Candeas, D. Martínez-Muñoz, N. Ruiz-Reyes, J. J. Carabias-Orti, and P. Cabañas-Molero, “Constrained non-negative matrix factorization for score-informed piano music restoration,” Digital Signal Processing, vol. 50, pp. 240–257, 2016.

[40] A. Cont, D. Schwarz, N. Schnell, and C. Raphael, “Evaluation of real-time audio-to-score alignment,” in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR ’07), pp. 315–316, Vienna, Austria, September 2007.

[41] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[42] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, “Subjective and objective quality assessment of audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.

[43] D. R. Begault and E. M. Wenzel, “Headphone localization of speech,” Human Factors, vol. 35, no. 2, pp. 361–376, 1993.

[44] G. S. Kendall, “A 3-D sound primer: directional hearing and stereo reproduction,” Computer Music Journal, vol. 19, no. 4, pp. 23–46, 1995.

[45] A. Martí Guerola, Multichannel audio processing for speaker localization, separation and enhancement [PhD thesis], Universitat Politècnica de València, 2013.
