Sparse Chroma Estimation for Harmonic Non-Stationary Audio Juhlin ... · SPARSE CHROMA ESTIMATION...

LUND UNIVERSITY

PO Box 117, 221 00 Lund, +46 46-222 00 00

Sparse Chroma Estimation for Harmonic Non-Stationary Audio

Juhlin, Maria; Kronvall, Ted; Swärd, Johan; Jakobsson, Andreas

Published in: Signal Processing Conference (EUSIPCO), 2015 23rd European

DOI: 10.1109/EUSIPCO.2015.7362338

2015

Link to publication

Citation for published version (APA): Juhlin, M., Kronvall, T., Swärd, J., & Jakobsson, A. (2015). Sparse Chroma Estimation for Harmonic Non-Stationary Audio. In Signal Processing Conference (EUSIPCO), 2015 23rd European (pp. 26-30). IEEE - Institute of Electrical and Electronics Engineers Inc. DOI: 10.1109/EUSIPCO.2015.7362338

General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain.
• You may freely distribute the URL identifying the publication in the public portal.

Take down policy
If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

https://doi.org/10.1109/EUSIPCO.2015.7362338

http://portal.research.lu.se/portal/en/publications/sparse-chroma-estimation-for-harmonic-nonstationary-audio(6578e77c-c7fb-4080-84e9-0f02b0eb99dd).html

SPARSE CHROMA ESTIMATION FOR HARMONIC NON-STATIONARY AUDIO

Maria Juhlin, Ted Kronvall, Johan Swärd, and Andreas Jakobsson

Centre for Mathematical Sciences, Lund University, Sweden.
email: {juhlin, ted, js, aj}@maths.lth.se

ABSTRACT

In this work, we extend our recently proposed block sparse chroma estimator so that the method also allows for signals with time-varying envelopes. Using a spline-based amplitude modulation of the chroma dictionary, the refined estimator is able to model longer frames than our earlier approach, as well as highly time-localized signals and signals containing sudden bursts, such as trumpet or trombone signals, thus retaining more signal information than other methods for chroma estimation. The performance of the proposed estimator is evaluated on a recorded trumpet signal, illustrating the improved performance as compared to other commonly used techniques.

Index Terms: chromagram, amplitude modulation, block sparsity, convex optimization, ADMM

1. INTRODUCTION

Music is an art form that most people enjoy, even more so today than earlier, as personal computers and smartphones have enabled ubiquitous music listening and allow everyone to be their own hobby DJ. When listening to, learning, composing, mixing, and identifying music, there are a number of aspects and approaches one may utilize, such as a composition's timbre, pitch, tempo, beat, rhythm, and chroma (see, e.g., [1]). Many such features involve analyzing the spectral content of the signal. Pitch, as a musical concept, is an ordinal scale of sounds which is related to, but not necessarily as cardinally specific as, the frequency scale. From a spectral point of view, a single pitch is a combination of many narrowband spectral peaks, which typically share an integer relationship in terms of their frequencies. In this sense, the pitch is typically defined by the component of lowest frequency, i.e., the fundamental, whereas the other frequencies are referred to as its harmonics. The number of harmonics in a certain pitch, as well as their power, varies greatly between different sounds.

Identifying pitches in a way similar to human perception has proved to be a difficult estimation problem. Partly, this difficulty is due to octaves: the distance in pitch corresponding to a factor of two in fundamental frequency is called an octave, and two pitches where one has exactly twice the fundamental frequency of the other are referred to as being octave equivalent. Octave equivalence is a central part of the Western musicological system. Within each octave, the Western musical system defines twelve so-called semitones, or chromas. The same chroma is then cyclically assigned to each doubling of the fundamental frequency, for all twelve chromas [2]. Methods for multi-pitch estimation in audio have been thoroughly examined in the literature (see, e.g., [3–5], and the references therein). Typically, trouble arises when the complexity of the audio signal increases, such that two or more pitches are present simultaneously, played by more than one instrument. Separating these complex combinations of components in the signal often proves difficult, even if the harmonic structure of the signal is taken into account. As introduced in [6], by collecting the pitches in groups in accordance with their respective chroma, we simplify the estimation, focusing only on chroma, while retaining much of the musical information.

Chroma features are widely used in applications such as cover song detection, transcription, and recommender systems (see, e.g., [7–9]). Most methods for chroma estimation begin with some form of pitch estimation, the result of which is then mapped to its respective chroma. Some of these methods take the harmonic structure into account, and others do not. The commonly used method by Ellis [10] is formed via a time-smoothed version of the short-time Fourier transform, whereas the method by Müller and Ewert uses a filterbank approach [11]. Neither of these uses the pitches' harmonic structure for estimation. On the other hand, taking this structure into account often requires knowledge of the number of pitches and their respective number of harmonics, which is notoriously difficult to obtain for multi-pitch signals.

Instead, we propose to estimate the present chromas using a sparse model reconstruction approach, where explicit model orders are not required. These are instead controlled implicitly using some tuning parameters, which may typically be set using cross-validation or simple heuristics. Recently, we proposed such a technique [6], generalizing an earlier work exploiting block sparsity for multi-pitch estimation [12]. Herein, we extend this model by generalizing it in accordance with the methods presented in [13, 14]. The proposed extension allows the signal to have a time-varying amplitude, extending the usability of the method to also allow for highly non-stationary signals, or signals with sudden bursts, like trumpets, whose nature may easily be misinterpreted using ordinary chroma selection techniques. As in [13], the extended model uses a spline basis to detail the time-varying envelope of the signal, thereby enabling the amplitudes to evolve smoothly with time. The time-localization offered by the new method also enables a better signal matching, such that more overall information is retained in the resulting chromagram. The performance of the proposed estimator is illustrated using a recorded trumpet scale, showing the improved performance as compared to typical reference methods and to our earlier proposed estimator.

Fig. 1. The normalized log-chromagram for the trumpet scale using the method by Müller and Ewert.

This work was supported in part by the Swedish Research Council, Carl Trygger's foundation, and the Royal Physiographic Society in Lund.

2. THE SIGNAL MODEL

As shown in [6], a harmonically related audio signal may be well modeled as a sum of $K$ distinct pitch signals, each consisting of $L_k$ harmonically related sinusoids with normalized fundamental frequency $f_k$. In this work, we allow the amplitudes of the harmonic components to vary over time, such that

$$y(t) = \sum_{k=1}^{K} \sum_{\ell=1}^{L_k} \alpha_{k,\ell}(t)\, e^{i 2\pi f_k \ell t}, \tag{1}$$

for $t = 1, \ldots, N$, where $\alpha_{k,\ell}(t)$ represents the amplitude of the $\ell$th harmonic of the $k$th pitch at time instant $t$. Similar to [13], we model the time-varying nature of the amplitudes using a spline basis with uniformly spaced knots, i.e.,

$$\boldsymbol{\alpha}_{k,\ell} = \sum_{r=1}^{R} \boldsymbol{\gamma}_r\, s_{r,k,\ell} = \boldsymbol{\Gamma}\, \mathbf{s}_{k,\ell}. \tag{2}$$

Here, the amplitude vector $\boldsymbol{\alpha}_{k,\ell}$ is a linear combination of the spline basis vectors $\boldsymbol{\gamma}_r \in \mathbb{R}^N$, and $s_{r,k,\ell}$ denotes the corresponding complex amplitude at spline point $r$ of the $\ell$th harmonic for the $k$th source, with

$$\boldsymbol{\alpha}_{k,\ell} = \begin{bmatrix} \alpha_{k,\ell}(1) & \alpha_{k,\ell}(2) & \cdots & \alpha_{k,\ell}(N) \end{bmatrix}^T, \tag{3}$$

$$\mathbf{s}_{k,\ell} = \begin{bmatrix} s_{1,k,\ell} & s_{2,k,\ell} & \cdots & s_{R,k,\ell} \end{bmatrix}^T, \tag{4}$$

$$\boldsymbol{\Gamma} = \begin{bmatrix} \boldsymbol{\gamma}_1 & \boldsymbol{\gamma}_2 & \cdots & \boldsymbol{\gamma}_R \end{bmatrix}, \tag{5}$$

where $[\cdot]^T$ denotes the transpose.

Fig. 2. The normalized log-chromagram for the trumpet scale using the method developed by Ellis.

To adapt our algorithm for use on harmonic audio signals, we, in accordance with [6], partition the different pitches into the twelve equivalence classes known as C, C#, D, D#, E, F, F#, G, G#, A, A#, and B. Furthermore, we design the range of $f_k$ to have the structure $f_k = f_{\text{base}} \cdot 2^{c_k/12 + o_k}$, where $c_k$ and $o_k$ denote the equivalence class and the octave of pitch $k$, respectively, and $f_{\text{base}}$ denotes a normalized tuning parameter. The reason for this design is that it conforms with the Western music scale examined here, which uses a cyclic scale partitioned into twelve semitones within an octave, with adjacent semitones spaced by a frequency ratio of $2^{1/12}$ [2]. In this work, we have chosen the tuning parameter $f_{\text{base}} = 440/2^{9/12+4}$ Hz, i.e., approximately 16.35 Hz, which corresponds to the note C0. Similar to [6], we thus propose to extend the signal model to

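$$y(t) \approx \sum_{c=0}^{11} \sum_{o=\underline{O}}^{\overline{O}} \sum_{\ell=1}^{L_{\max}} \alpha_{c,o,\ell}(t)\, e^{i 2\pi f_{\text{base}} 2^{(c/12+o)} \ell t}, \tag{6}$$

with $\underline{O}$, $\overline{O}$, and $L_{\max}$ denoting the lowest considered octave, the highest considered octave, and the maximum number of overtones, respectively. This may be expressed compactly as

$$y(t) = \sum_{c=0}^{11} \mathbf{W}_c(t)\, \boldsymbol{\alpha}_c(t), \tag{7}$$

where

$$\mathbf{W}_c = \begin{bmatrix} \mathbf{w}_c^{\underline{O}} & \cdots & \mathbf{w}_c^{\overline{O}} \end{bmatrix}^T,$$

$$\mathbf{w}_c = \begin{bmatrix} \mathbf{z}_c^{1} & \cdots & \mathbf{z}_c^{L_{\max}} \end{bmatrix}^T,$$

$$\mathbf{z}_c = \begin{bmatrix} e^{i 2\pi 2^{c/12} 1} & \cdots & e^{i 2\pi 2^{c/12} N} \end{bmatrix}^T,$$

$$\boldsymbol{\alpha}_c = \begin{bmatrix} \alpha_{c,\underline{O},1} & \cdots & \alpha_{c,\underline{O},L_{\max}} & \cdots & \alpha_{c,\overline{O},L_{\max}} \end{bmatrix}^T.$$

Fig. 3. The normalized log-chromagram for the trumpet scale using the CEBS method.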
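As a rough illustration of the dictionary structure described above, the following NumPy sketch computes the fundamental frequency of each chroma-octave pair and builds the block of harmonic sinusoids for one such pair. The function names, the sampling-rate argument `fs` used to normalize the frequencies, and the N x L_max column layout are assumptions made for this example and are not taken from the paper; aliasing of high harmonics is not handled.

```python
import numpy as np

# Tuning parameter from the text: f_base = 440 / 2^(9/12 + 4) Hz, i.e. the note C0.
F_BASE_HZ = 440.0 / 2.0 ** (9.0 / 12.0 + 4.0)


def chroma_frequency(c, o, f_base=F_BASE_HZ):
    """Fundamental frequency in Hz of chroma class c (0 = C, ..., 11 = B) in octave o."""
    return f_base * 2.0 ** (c / 12.0 + o)


def harmonic_block(c, o, n_samples, l_max, fs):
    """Hypothetical N x L_max block of sinusoids for one chroma-octave pair.

    Column l - 1 holds exp(i 2 pi f l t / fs) for t = 1, ..., N, where f is the
    fundamental frequency of the pair and l = 1, ..., L_max is the harmonic index.
    """
    t = np.arange(1, n_samples + 1)[:, None]        # N x 1 time indices
    harmonics = np.arange(1, l_max + 1)[None, :]    # 1 x L_max harmonic indices
    normalized = chroma_frequency(c, o) * harmonics / fs
    return np.exp(2j * np.pi * normalized * t)


# Example: the block for A4 with the frame length and sampling rate used in Section 4.
block_a4 = harmonic_block(9, 4, n_samples=1024, l_max=9, fs=22050.0)
```

With this parametrization, `chroma_frequency(9, 4)` recovers the 440 Hz concert pitch, and stacking such blocks over all considered chroma-octave pairs gives one possible realization of the full dictionary.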

Using (2), one may rewrite (7) as

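$$\mathbf{y} = \sum_{c=0}^{11} \sum_{o=\underline{O}}^{\overline{O}} \text{diag}\!\left(\boldsymbol{\Gamma}\, \mathbf{S}_{c,o} \mathbf{W}_{c,o}^T\right), \tag{8}$$

where $\mathbf{y} = [\, y(1) \ \cdots \ y(N) \,]^T$, $\mathbf{W}_{c,o}$ denotes the part of the dictionary corresponding to chroma $c$ and octave $o$, and

$$\mathbf{S}_{c,o} = \begin{bmatrix} \mathbf{s}_{c,o,1} & \cdots & \mathbf{s}_{c,o,L_{\max}} \end{bmatrix}, \tag{9}$$

$$\mathbf{s}_{c,o,\ell} = \begin{bmatrix} s_{1,c,o,\ell} & \cdots & s_{R,c,o,\ell} \end{bmatrix}^T. \tag{10}$$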
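The amplitude-modulated model term diag(Γ S W^T) can be sketched in a few lines of NumPy. The paper does not state the spline order, so the example below assumes a linear (hat-function) spline basis on uniformly spaced knots; the function names and the N x L_max layout of the dictionary block `w` are likewise assumptions for illustration only.

```python
import numpy as np


def linear_spline_basis(n_samples, n_knots):
    """N x R matrix whose columns are hat functions on uniformly spaced knots.

    Assumption: linear splines stand in for the (unspecified) spline basis.
    """
    t = np.arange(n_samples)
    knots = np.linspace(0, n_samples - 1, n_knots)
    basis = np.empty((n_samples, n_knots))
    for r in range(n_knots):
        unit = np.zeros(n_knots)
        unit[r] = 1.0
        # Interpolating a unit vector at the knots yields the r-th hat function.
        basis[:, r] = np.interp(t, knots, unit)
    return basis


def synthesize(gamma, s, w):
    """Return diag(Gamma S W^T) without forming the N x N product.

    gamma: N x R spline basis, s: R x L_max spline amplitudes,
    w: N x L_max sinusoids of one chroma-octave pair.
    """
    envelopes = gamma @ s                 # N x L_max time-varying amplitudes
    return np.sum(envelopes * w, axis=1)  # sample n: sum over l of envelope * sinusoid
```

Evaluating the row-wise sum avoids the N x N matrix inside diag(·), which matters for frame lengths such as the 1024 samples used in the numerical section.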

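As a result, the sought chroma features of the considered signal frame may be found as the parameters solving

$$\underset{\mathbf{S}_{0,\underline{O}}, \ldots, \mathbf{S}_{11,\overline{O}}}{\text{minimize}} \ \ \frac{1}{2} \left\| \mathbf{y} - \sum_{c=0}^{11} \sum_{o=\underline{O}}^{\overline{O}} \text{diag}\!\left(\boldsymbol{\Gamma}\, \mathbf{S}_{c,o} \mathbf{W}_{c,o}^T\right) \right\|_2^2, \tag{11}$$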

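where $\mathbf{y}$ denotes the vector containing the measured signal. To promote a sparse solution, one may rewrite and extend (11), using the tuning parameters $\lambda$ and $\mu$, as

$$\underset{\mathbf{S}_1, \ldots, \mathbf{S}_P}{\text{minimize}} \ \ \frac{1}{2} \left\| \mathbf{y} - \sum_{p=1}^{P} \text{diag}\!\left(\boldsymbol{\Gamma}\, \mathbf{S}_p \mathbf{W}_p^T\right) \right\|_2^2 + \lambda \sum_{p=1}^{P} \sum_{\ell=1}^{L_{\max}} \left\| \mathbf{s}_{p,\ell} \right\|_2 + \mu \sum_{c=0}^{11} \left\| \mathbf{S}_c \right\|_F, \tag{12}$$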

where the reparametrization from $c, o$ to $p$ is $p = 12(o - \underline{O}) + c$, and thus $P$ denotes the total number of chroma-octave pairs in the dictionary, and with

$$\mathbf{S}_c = \begin{bmatrix} \mathbf{S}_{c,\underline{O}} & \cdots & \mathbf{S}_{c,\overline{O}} \end{bmatrix}. \tag{13}$$

The first penalty term in (12) has the effect of forcing columns $\mathbf{s}_{p,\ell}$ with small $\ell_2$ norm to zero, whereas the second promotes sparsity in the resulting chroma estimate.

Fig. 4. The normalized log-chromagram for the trumpet scale using the CEAMS method.
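To make the two penalty terms in (12) concrete, the sketch below evaluates them for a given set of spline amplitudes. The array layout (one R x L_max block per chroma-octave pair, ordered by a 0-based version of the index p), the function names, and the symbols `lam` and `mu` for the two tuning parameters are assumptions made for this example.

```python
import numpy as np


def pair_index(c, o, o_low):
    """0-based version of the reparametrization p = 12 (o - O_low) + c."""
    return 12 * (o - o_low) + c


def sparsity_penalty(S, lam, mu):
    """Evaluate lam * sum_p sum_l ||s_{p,l}||_2 + mu * sum_c ||S_c||_F.

    S has shape (P, R, L_max), with P = 12 * (number of octaves) and the
    blocks ordered according to pair_index.
    """
    # l2 norm of every column s_{p,l}: one value per (pair, harmonic).
    column_norms = np.linalg.norm(S, axis=1)
    # Collect all octaves of each chroma and take one Frobenius norm per chroma.
    n_octaves = S.shape[0] // 12
    per_chroma = S.reshape(n_octaves, 12, *S.shape[1:])
    chroma_norms = np.sqrt(np.sum(np.abs(per_chroma) ** 2, axis=(0, 2, 3)))
    return lam * column_norms.sum() + mu * chroma_norms.sum()
```

The first term acts on individual harmonics of individual chroma-octave pairs, while the second couples everything that belongs to the same chroma across octaves, which is what drives entire chroma bins to zero.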

3. IMPLEMENTATION

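Since the problem at hand is convex, one may implement the proposed method efficiently using the Alternating Direction Method of Multipliers (ADMM) (see, e.g., [15]). Denoting $\mathbf{S} = [\, \mathbf{S}_1 \ \cdots \ \mathbf{S}_P \,]$, (12) may be rewritten as

$$\underset{\mathbf{X}, \mathbf{Z}}{\text{minimize}} \ \ f(\mathbf{X}) + g(\mathbf{Z}) \quad \text{subject to} \quad \mathbf{X} - \mathbf{Z} = \mathbf{0}, \tag{14}$$

where

$$f(\mathbf{X}) = \frac{1}{2} \left\| \mathbf{y} - \sum_{p=1}^{P} \text{diag}\!\left(\boldsymbol{\Gamma}\, \mathbf{X}_p \mathbf{W}_p^T\right) \right\|_2^2, \qquad g(\mathbf{Z}) = \lambda \sum_{p=1}^{P} \sum_{\ell=1}^{L_{\max}} \left\| \mathbf{Z}_{p,\ell} \right\|_2 + \mu \sum_{c=0}^{11} \left\| \mathbf{Z}_c \right\|_F, \tag{15}$$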

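with $\mathbf{X}$ and $\mathbf{Z}$ having the same structure as $\mathbf{S}$. It is worth noting that the ADMM separates the sought variable into two unknown variables, here denoted $\mathbf{X}$ and $\mathbf{Z}$, enabling the original problem to be decomposed into easier sub-problems. These are in turn solved iteratively until convergence. Introducing the augmented Lagrangian of (14), i.e.,

$$L_\rho(\mathbf{X}, \mathbf{Z}, \mathbf{U}) = f(\mathbf{X}) + g(\mathbf{Z}) + \frac{\rho}{2} \left\| \mathbf{X} - \mathbf{Z} + \mathbf{U} \right\|_F^2, \tag{16}$$

where $\mathbf{U}$ represents the scaled dual variable [15], allows (14) to be solved iteratively as

$$\mathbf{X}^{(r+1)} = \arg\min_{\mathbf{X}} \ L_\rho(\mathbf{X}, \mathbf{Z}^{(r)}, \mathbf{U}^{(r)}), \tag{17}$$

$$\mathbf{Z}^{(r+1)} = \arg\min_{\mathbf{Z}} \ L_\rho(\mathbf{X}^{(r+1)}, \mathbf{Z}, \mathbf{U}^{(r)}), \tag{18}$$

$$\mathbf{U}^{(r+1)} = \mathbf{X}^{(r+1)} - \mathbf{Z}^{(r+1)} + \mathbf{U}^{(r)}. \tag{19}$$

Fig. 5. The normalized 3-D log-chromagram for the trumpet scale using the CEBS method.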

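To solve (17), one differentiates $f(\mathbf{X}) + \frac{\rho}{2} \| \mathbf{X} - \mathbf{Z} + \mathbf{U} \|_F^2$ with respect to $\mathbf{X}_p$ and sets the result equal to zero, which yields

$$-\sum_{n=1}^{N} y(n)\, \boldsymbol{\Gamma}(n,\cdot)^H \mathbf{W}_p(\cdot,n)^H + \frac{\rho}{2} \left( \mathbf{X}_p - \mathbf{Z}_p + \mathbf{U}_p \right) + \sum_{u=1}^{P} \sum_{n=1}^{N} \boldsymbol{\Gamma}(n,\cdot)^H \boldsymbol{\Gamma}(n,\cdot)\, \mathbf{X}_u \mathbf{W}_u(\cdot,n)\, \mathbf{W}_p(\cdot,n)^H = \mathbf{0}.$$

By stacking all columns in $\mathbf{X}$ on top of each other, this may be represented as

$$\sum_{n=1}^{N} \mathbf{a}(p,n)^H y(n) + \frac{\rho}{2} \left( \mathbf{z}_p - \mathbf{u}_p \right) = \sum_{n=1}^{N} \sum_{u=1}^{P} \mathbf{a}(p,n)^H \mathbf{a}(u,n)\, \mathbf{x}_u + \frac{\rho}{2}\, \mathbf{x}_p, \tag{20}$$

where

$$\mathbf{a}(u,n) = \mathbf{W}_u(\cdot,n)^T \otimes \boldsymbol{\Gamma}(n,\cdot), \tag{21}$$

$$\mathbf{x}_u = \text{vec}(\mathbf{X}_u), \tag{22}$$

$$\mathbf{z}_u = \text{vec}(\mathbf{Z}_u), \tag{23}$$

$$\mathbf{u}_u = \text{vec}(\mathbf{U}_u), \tag{24}$$

with $\otimes$ denoting the Kronecker product, and $\mathbf{W}_u(\cdot,n)$ and $\boldsymbol{\Gamma}(n,\cdot)$ denoting the $n$th column of $\mathbf{W}_u^T$ and the $n$th row of $\boldsymbol{\Gamma}$, respectively. Let

$$\mathbf{A}(p,u) = \sum_{n=1}^{N} \mathbf{a}(p,n)^H \mathbf{a}(u,n), \tag{25}$$

$$\mathbf{y}(p) = \sum_{n=1}^{N} \mathbf{a}(p,n)^H y(n), \tag{26}$$

$$\mathbf{Y} = \begin{bmatrix} \mathbf{y}(1)^T & \cdots & \mathbf{y}(P)^T \end{bmatrix}^T, \tag{27}$$

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}(1,1) & \cdots & \mathbf{A}(1,P) \\ \vdots & \ddots & \vdots \\ \mathbf{A}(P,1) & \cdots & \mathbf{A}(P,P) \end{bmatrix}. \tag{28}$$

Fig. 6. The normalized 3-D log-chromagram for the trumpet scale using the proposed CEAMS method.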
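The vectorization step behind (20) and (21) rests on the identity vec(AXB) = (B^T ⊗ A) vec(X). The short NumPy check below verifies it for one sample and one dictionary block; the helper names and the random test data are assumptions for illustration, and vec(·) is taken to stack columns as stated in the text.

```python
import numpy as np


def model_sample(gamma_row, X, w_col):
    """Gamma(n, .) X W(., n): the contribution of one block to sample n."""
    return (gamma_row @ X @ w_col).item()


def kron_row(gamma_row, w_col):
    """a(u, n) = W_u(., n)^T kron Gamma(n, .), a 1 x (R * L_max) row vector."""
    return np.kron(w_col.T, gamma_row)


rng = np.random.default_rng(0)
R, L_MAX = 4, 3
gamma_row = rng.standard_normal((1, R))                      # n-th row of the spline basis
w_col = np.exp(2j * np.pi * rng.random((L_MAX, 1)))          # L_max x 1 sinusoid values
X = rng.standard_normal((R, L_MAX)) + 1j * rng.standard_normal((R, L_MAX))

x_vec = X.reshape(-1, order="F")                             # vec(X): stack the columns
direct = model_sample(gamma_row, X, w_col)
stacked = (kron_row(gamma_row, w_col) @ x_vec).item()
```

Both `direct` and `stacked` evaluate the same scalar, which is why the matrix-valued stationarity condition can be written as a linear system in the stacked vectors.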

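This yields the proposed algorithm, which is summarized in Algorithm 1. We term this the Chroma Estimation of Amplitude Modulated Signals (CEAMS) method. The soft thresholds $T$ and $\mathcal{T}$ used in Algorithm 1 are interpreted column-wise and defined as

$$T(\mathbf{x}, \kappa_1) = \frac{\mathbf{x}}{\|\mathbf{x}\|_2} \max\left( \|\mathbf{x}\|_2 - \kappa_1,\, 0 \right), \tag{29}$$

$$\mathcal{T}(\mathbf{X}, \kappa_2) = \frac{\mathbf{X}}{\|\mathbf{X}\|_F} \max\left( \|\mathbf{X}\|_F - \kappa_2,\, 0 \right). \tag{30}$$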
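The ADMM iterations and the nested soft thresholds can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on a dense dictionary `D` acting on the stacked unknowns, uses the textbook scaled-form x-update from [15] (which may differ in constant factors from the expressions above), runs a fixed number of iterations instead of testing convergence, and all function and argument names are hypothetical.

```python
import numpy as np


def group_soft_threshold(v, kappa):
    """Shrink the l2 (or Frobenius) norm of v by kappa, or return zeros."""
    norm = np.linalg.norm(v)
    if norm <= kappa:
        return np.zeros_like(v)
    return (1.0 - kappa / norm) * v


def admm_block_sparse(D, y, blocks, chroma_of_block, lam, mu, rho=1.0, n_iter=200):
    """ADMM for 0.5 ||y - D x||^2 + lam * sum_b ||x_b||_2 + mu * sum_c ||x_c||_2.

    D: N x M dictionary acting on the stacked unknowns.
    blocks: list of index lists, one per column s_{p,l}.
    chroma_of_block: chroma label of each block (same length as blocks).
    """
    m = D.shape[1]
    z = np.zeros(m, dtype=complex)
    u = np.zeros(m, dtype=complex)
    lhs = D.conj().T @ D + rho * np.eye(m)   # system matrix, fixed over iterations
    dhy = D.conj().T @ y
    for _ in range(n_iter):
        # x-update: regularized least squares.
        x = np.linalg.solve(lhs, dhy + rho * (z - u))
        # z-update: inner threshold per column, then outer threshold per chroma.
        v = x + u
        z = np.zeros(m, dtype=complex)
        for idx in blocks:
            z[idx] = group_soft_threshold(v[idx], lam / rho)
        for c in set(chroma_of_block):
            idx = np.concatenate(
                [b for b, cb in zip(blocks, chroma_of_block) if cb == c]
            )
            z[idx] = group_soft_threshold(z[idx], mu / rho)
        # Scaled dual update.
        u = u + x - z
    return z
```

Because the column groups are nested inside the chroma groups, applying the inner threshold first and the outer one second gives the proximal operator of the combined penalty, mirroring the nested thresholds in Algorithm 1.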

4. NUMERICAL RESULTS

The proposed method was evaluated using a concert C-scale played by a trumpet, acquired from [16]. Figures 1-4 illustrate the resulting chromagrams as obtained using the estimators in [11], [10], and [6], respectively, as well as the here proposed CEAMS estimator. For the latter, we use the parameter values $\lambda = 0.3$ and $\mu = 193$, a window length of 1024 samples, a sampling frequency of 22050 Hz, $L_{\max} = 9$ overtones, and 9 spline points. As is clear from Figures 1 and 2, both the estimators in [10, 11] suffer from apparent problems in choosing the correct chroma-bin for the scale. The CEBS estimate [6], shown in Figure 3, is on the other hand notably cleaner, but still suffers from some spurious chroma features. As is clear from Figure 4, these peaks are correctly estimated by CEAMS. Here, we have used the same basic settings for CEBS as for CEAMS, with $\lambda_2 = 0.05$, $\lambda_3 = 3$, and $\lambda_4 = 0.1$ (in setting these parameters, we have taken care to find the best possible setting for CEBS).

Note that the G in the scale is not detected by any method. This is because the fundamental frequency found in those time frames is 808 Hz, which is slightly closer to G#5 than to G5, using concert tuning. To illustrate the difference in time-localization between CEBS and CEAMS, Figures 5 and 6 show the 3-D chromagrams, where it once again can be noted that CEBS fails to identify the chroma-bin at G#. Moreover, one notes the spurious peaks produced by CEBS, which are of significant magnitude compared to the rest of the chromagram. This is in contrast to CEAMS, where none of the above-mentioned behaviour is present. This is also illustrated in Figure 7, showing the envelopes of the measured signal together with the CEBS and CEAMS estimates, clearly indicating the better fit of the latter.

Fig. 7. The envelopes of the raw signal, the estimate obtained using CEBS, and the estimate obtained using the proposed CEAMS.
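The remark about the 808 Hz fundamental can be checked with a few lines of Python. The helper below maps a frequency to the nearest equal-tempered note under concert tuning (A4 = 440 Hz); the function name and the use of MIDI note numbers for the octave bookkeeping are assumptions made for this example.

```python
import math

CHROMA_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def nearest_note(freq_hz, a4_hz=440.0):
    """Return (chroma name, octave, semitones above A4) for the nearest note."""
    semitones_from_a4 = 12.0 * math.log2(freq_hz / a4_hz)
    midi = 69 + round(semitones_from_a4)   # MIDI number 69 is A4
    return CHROMA_NAMES[midi % 12], midi // 12 - 1, semitones_from_a4


name, octave, offset = nearest_note(808.0)
```

Here `offset` is about 10.52 semitones above A4, just past the midpoint between G5 (10 semitones) and G#5 (11 semitones), so the nearest note is G#5, in line with the observation above.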

Algorithm 1 The proposed CEAMS algorithm
1: Initiate $\mathbf{X} = \mathbf{X}^{(0)}$, $\mathbf{Z} = \mathbf{Z}^{(0)}$, $\mathbf{U} = \mathbf{U}^{(0)}$, and $r = 0$
2: repeat
3:   $\mathbf{x}^{(r+1)} = \left(\mathbf{A} + \frac{\rho}{2}\mathbf{I}\right)^{-1} \left(\mathbf{Y} + \frac{\rho}{2}\left(\mathbf{z}^{(r)} - \mathbf{u}^{(r)}\right)\right)$, with $\mathbf{x}$, $\mathbf{z}$, and $\mathbf{u}$ stacking $\mathbf{x}_p$, $\mathbf{z}_p$, and $\mathbf{u}_p$ for $p = 1, \ldots, P$
4:   $\mathbf{Z}^{(r+1)} = \mathcal{T}\left(T\left(\mathbf{X}^{(r+1)} + \mathbf{U}^{(r)}, \lambda/\rho\right), \mu/\rho\right)$
5:   $\mathbf{U}^{(r+1)} = \mathbf{X}^{(r+1)} - \mathbf{Z}^{(r+1)} + \mathbf{U}^{(r)}$
6:   $r \leftarrow r + 1$
7: until convergence

5. REFERENCES

[1] M. Müller, D. P. W. Ellis, A. Klapuri, and G. Richard, “Signal Processing for Music Analysis,” IEEE J. Sel. Top. Sign. Proc., vol. 5, no. 6, pp. 1088–1110, 2011.

[2] R. Shepard, “Circularity in Judgements of Relative Pitch,” Journal of the Acoustical Society of America, vol. 36, no. 12, pp. 2346–2353, Dec. 1964.

[3] M. Christensen and A. Jakobsson, Multi-Pitch Estimation, Morgan & Claypool, 2009.

[4] A. Klapuri and M. Davy, Signal Processing Methods for Music Transcription, Springer, 2006.

[5] S. I. Adalbjörnsson, A. Jakobsson, and M. G. Christensen, “Multi-Pitch Estimation Exploiting Block Sparsity,” Elsevier Signal Processing, vol. 109, pp. 236–247, April 2015.

[6] T. Kronvall, M. Juhlin, S. I. Adalbjörnsson, and A. Jakobsson, “Sparse Chroma Estimation for Harmonic Audio,” in IEEE Int. Conf. on Acoustics, Speech, and Sig. Proc., Brisbane, Apr. 19-24 2015.

[7] M. A. Bartsch and G. H. Wakefield, “Audio Thumbnailing of Popular Music Using Chroma-based Representations,” IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 96–104, Feb. 2005.

[8] S. Kim and S. Narayanan, “Dynamic Chroma Feature Vectors with Applications to Cover Song Identification,” in 10th IEEE Workshop on Multimedia Signal Processing, 2008, pp. 984–987.

[9] T.-M. Chang, E.-T. Chen, C.-B. Hsieh, and P.-C. Chang, “Cover Song Identification with Direct Chroma Feature Extraction from AAC Files,” in IEEE 2nd Global Conference on Consumer Electronics, Oct. 2013, pp. 55–56.

[10] D. P. W. Ellis, “Chroma Feature Analysis and Synthesis,” http://www.ee.columbia.edu/~dpwe/resources/matlab/chroma-ansyn/, accessed Sept. 2014.

[11] M. Müller and S. Ewert, “Chroma Toolbox: MATLAB Implementations for Extracting Variants of Chroma-based Audio Features,” in Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), 2011.

[12] S. I. Adalbjörnsson, A. Jakobsson, and M. G. Christensen, “Estimating Multiple Pitches Using Block Sparsity,” in 38th IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Vancouver, May 26–31, 2013.

[13] S. I. Adalbjörnsson, J. Swärd, T. Kronvall, and A. Jakobsson, “A Sparse Approach for Estimation of Amplitude Modulated Signals,” in 48th Annual Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, USA, Nov. 2-5 2014.

[14] M. G. Christensen, A. Jakobsson, S. V. Andersen, and S. H. Jensen, “Amplitude Modulated Sinusoidal Signal Decomposition for Audio Coding,” IEEE Signal Process. Lett., vol. 13, no. 7, pp. 389–392, July 2006.

[15] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers,” Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, Jan. 2011.

[16] Mrs. Thomas, “Sound examples,” http://www.hffmcsd.org/webpages/arushkoski/nyssma.cfm, accessed Feb. 2015.
