
DEEP-RHYTHM FOR TEMPO ESTIMATION AND RHYTHM PATTERN RECOGNITION

Hadrien Foroughmand
IRCAM Lab - CNRS - Sorbonne Université
[email protected]

Geoffroy Peeters
LTCI - Télécom Paris - Institut Polytechnique de Paris
[email protected]

ABSTRACT

It has been shown that the harmonic series at the tempo frequency of the onset-strength-function of an audio signal accurately describes its rhythm pattern and can be used to perform tempo or rhythm pattern estimation. Recently, in the case of multi-pitch estimation, the depth of the input layer of a convolutional network has been used to represent the harmonic series of pitch candidates. We use a similar idea here to represent the harmonic series of tempo candidates. We propose the Harmonic-Constant-Q-Modulation, which represents, using a 4D-tensor, the harmonic series of modulation frequencies (considered as tempo frequencies) in several acoustic frequency bands over time. This representation is used as input to a convolutional network which is trained to estimate tempo or rhythm pattern classes. Using a large number of datasets, we evaluate the performance of our approach and compare it with previous approaches. We show that it slightly increases Accuracy-1 for tempo estimation but not the average-mean-Recall for rhythm pattern recognition.

1. INTRODUCTION

Tempo is one of the most important perceptual elements of music. Today numerous applications rely on tempo information (recommendation, playlist generation, synchronization, DJ-ing, audio or audio/video editing, beat-synchronous analysis). It is therefore crucial to develop algorithms to correctly estimate it. The automatic estimation of tempo from an audio signal was one of the first research topics addressed in Music Information Retrieval (MIR) [11]. Twenty-five years later, it is still a very active research subject in MIR. This is due to the fact that tempo estimation is still an unsolved problem (outside the prototypical cases of pop or techno music) and that recent deep-learning approaches [3] [37] bring new perspectives to it.

Tempo is usually defined as (and annotated as) the rate at which people tap their foot or their hand when listening to a music piece. Several people can therefore perceive different tempi for the same piece of music. This is due to the hierarchy of the metrical structure in music (to deal with this ambiguity, the research community has proposed to consider octave errors as correct) and due to the fact that, without the cultural knowledge of the rhythm pattern(s) being played, it can be difficult to perceive “the” tempo (or even “a” tempo). This last point, of course, opens the door to data-driven approaches, which can learn the specificities of the patterns. In this work, we do not deal with the inherent ambiguity of tempo and consider the values provided by annotated datasets as ground-truth. The method we propose here belongs to the data-driven systems in the sense that we learn from the data. It also considers both the tempo and the rhythm pattern in interaction by adequately modeling the audio content. The tempo of a track can of course vary along time, but in this work we focus on the estimation of constant (global) tempi and rhythm patterns.

© Hadrien Foroughmand, Geoffroy Peeters. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Hadrien Foroughmand, Geoffroy Peeters. “Deep-Rhythm for tempo estimation and rhythm pattern recognition”, 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.

In the following section, we summarize works related to tempo and rhythm pattern estimation from audio. We refer the reader to [12, 32, 44] for more detailed overviews.

1.1 Related works

Tempo estimation. Early MIR systems encoded domain knowledge (audio, auditory perception and musical knowledge) by hand-crafting signal processing and statistical models (hidden Markov models, dynamic Bayesian networks). Data were at most used to manually tune some parameters (such as filter frequencies or transition probabilities). Early techniques for beat tracking and/or tempo estimation belong to this category. Their overall flowchart is a multi-band separation combined with an onset strength function whose periodicity is measured. For example, Scheirer [36] proposes the use of band-pass filters combined with resonating comb filters and peak picking; Klapuri [18] uses a resonating comb filter bank driven by the band-wise accent signals, the main extension being the tracking of multiple metrical levels; Gainza and Coyle [8] propose a hybrid multiband decomposition where the periodicities of the onset functions are tracked in several frequency bands using autocorrelation and then weighted.

Since the pioneering work of [11], many audio datasets have been annotated with tempo. This has encouraged researchers to develop data-driven systems based on machine learning and deep learning techniques. Such machine-learning models include K-Nearest-Neighbors (KNN) [40], Gaussian Mixture Models (GMM) [33, 43], Support Vector Machines (SVM) [4, 9, 34], bags of classifiers [20], Random Forests [38] or, more recently, deep learning models. The first use of deep learning for tempo estimation


was proposed by Böck et al. [3], who use a deep recurrent neural network (bi-LSTM) to predict the position of the beats inside the signal. This output is then used as the input of a bank of resonating comb filters to detect the periodicity and hence the tempo. This technique still achieves the best results today in terms of Accuracy2. Recently, Schreiber and Müller [37] proposed a “single step approach” for tempo estimation using deep convolutional networks. The network design is inspired by the flowchart of handcrafted systems: the first layer is supposed to mimic the extraction of an onset-strength-function. Their system uses Mel-spectrograms as input, and the network is trained to classify the tempo of an audio excerpt into 256 tempo classes (from 30 to 285 BPM); it shows very good results in terms of Class-Accuracy and Accuracy1.

Rhythm pattern recognition. While tempo and rhythm pattern are closely interleaved, the recognition of rhythm patterns has received much less attention. This is probably due to the difficulty of creating datasets annotated into such rhythm patterns (defining the similarity between patterns, outside the trivial identity case, remains a difficult task). To create such a dataset, one may consider the equivalence between the rhythm pattern and the related dance (such as Tango): Ballroom [12], Extended-ballroom [24] and Greek-dances [14]. Systems to recognize rhythm patterns from the audio are all very different: Foote [7] defines a beat spectrum computed with a similarity matrix of MFCCs, Tzanetakis [42] defines a beat histogram computed from an autocorrelation function, Peeters [32] proposes a harmonic analysis of the rhythm pattern, Holzapfel [15] proposes the use of the scale transform (which provides a tempo-invariant representation), and Marchand [23, 25] extends the latter by combining it with the modulation spectrum and adding correlation coefficients between frequency bands. For a recognition task on the Extended-ballroom and Greek-dances, Marchand can be considered as the state-of-the-art.

1.2 Paper proposal and organization

In this paper we present a new audio representation, the Harmonic-Constant-Q-Modulation (HCQM), to be used as input to a convolutional network for global tempo and rhythm pattern estimation. We name it Deep Rhythm since it uses the depth of the input and a deep network. The paper is organized as follows. In §2, we describe the motivation for (§2.1) and the computation details of (§2.2) the HCQM. We then describe the architecture of the convolutional neural network we used (§2.3) and the training process (§2.4). In §3, we evaluate our proposal for the tasks of global tempo estimation (§3.1) and rhythm pattern recognition (§3.2) and discuss the results.

2. PROPOSED METHOD

2.1 Motivations

From Fourier series, it is known that any periodic signal x(t) with period T0 (of fundamental frequency f0 = 1/T0) can be represented as a weighted sum of sinusoidal components whose frequencies are the harmonics of f0: x̂_{f0,a}(t) = Σ_{h=1}^{H} a_h sin(2π h f0 t + φ_h).

For the voiced parts of speech or pitched musical instruments, this leads to the so-called “harmonic sinusoidal model” [26, 39] that can be used for audio coding or transformation. This model can also be used to estimate the pitch of a signal [21]: estimating the f0 such that x̂_{f0,a}(t) ≈ x(t). The values a_h can be estimated by sampling the magnitude of the DFT at the corresponding frequencies: a_{h,f0} = |X(h f0)|. The vector a_{f0} = {a_{1,f0}, ..., a_{H,f0}} represents the spectral envelope of the signal and is closely related to the timbre of the audio signal, hence to the instrument playing. For this reason, these values are often used for instrument classification [29].
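As a toy illustration of this pitch-estimation idea (our own sketch, not code from the paper or from [21]), the snippet below scores candidate f0 values by summing the DFT magnitude sampled at their first few harmonics; the candidate grid, the number of harmonics and the nearest-bin sampling are arbitrary choices.

```python
import numpy as np

def estimate_f0_harmonic_sum(x, sr, f0_candidates, n_harmonics=5):
    """Toy f0 estimator: keep the candidate f0 whose harmonics h*f0
    carry the most DFT magnitude (nearest-bin sampling of |X(h*f0)|)."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)

    def harmonic_energy(f0):
        bins = [np.argmin(np.abs(freqs - h * f0)) for h in range(1, n_harmonics + 1)]
        return spectrum[bins].sum()

    return max(f0_candidates, key=harmonic_energy)

# Example: a synthetic harmonic signal with fundamental 220 Hz.
sr = 8000
t = np.arange(sr) / sr
x = sum(np.sin(2 * np.pi * h * 220 * t) / h for h in range(1, 6))
print(estimate_f0_harmonic_sum(x, sr, f0_candidates=np.arange(100, 400, 1.0)))
```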

For audio musical rhythm, Peeters [30] [31] [32] proposes to apply such a harmonic analysis to an onset-strength-function. The period T0 is then defined as the duration of a beat. a_{1,f0} then represents the DFT magnitude at the 4th-note level, a_{2,f0} at the 8th-note level, a_{3,f0} at the 8th-note-triplet level, while a_{1/2,f0} represents the binary grouping of the beats and a_{1/3,f0} the ternary one. Peeters considers that the vector a is representative of the specific rhythm and that therefore a_{f0} represents a specific rhythm played at a specific tempo f0. He proposes the following harmonic series: h ∈ {1/4, 1/3, 1/2, 2/3, 3/4, 1, 1.25, 1.33, ..., 8}.

Using this, he shows - in [32], that given the tempo f0, the vector a_{f0} can be used to classify the different rhythm patterns; - in [30], that given manually-fixed prototype vectors a, it is possible to estimate the tempo f0 (looking for the f such that a_f ≈ a); - in [31], that the prototype vectors a can be learned (using simple machine learning) to achieve the best tempo estimation f0.
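To make this concrete, the following sketch (ours, not the authors' implementation) extracts such a vector a_{f0} by sampling the DFT magnitude of an onset-strength-function at the multiples h·f0 of a candidate tempo frequency. Only the harmonic-series members listed explicitly above are included, since the remaining terms (the "..." up to 8) are not spelled out in the text.

```python
import numpy as np

# First members of the harmonic series proposed in [30-32]; the remaining
# terms up to 8 are elided ("...") in the text, so they are not listed here.
RHYTHM_HARMONICS = [1/4, 1/3, 1/2, 2/3, 3/4, 1, 1.25, 1.33]

def rhythm_harmonic_vector(osf, frame_rate, f0_hz, harmonics=RHYTHM_HARMONICS):
    """Sample the DFT magnitude of an onset-strength-function `osf`
    (sampled at `frame_rate` Hz) at multiples h*f0 of a candidate tempo
    frequency f0 (in Hz), i.e. the vector a_f0 described above."""
    spectrum = np.abs(np.fft.rfft(osf))
    freqs = np.fft.rfftfreq(len(osf), d=1.0 / frame_rate)
    return np.array([spectrum[np.argmin(np.abs(freqs - h * f0_hz))]
                     for h in harmonics])
```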

The method we propose in this paper is in the continuation of this last work (learning the values a to estimate the tempo or the rhythm class), but we adapt it to the deep learning formalism recently proposed by Bittner et al. [2], where the depth of the input to a convolutional network is used to represent the harmonic series a_f. In [2], a constant-Q-transform (time τ and log-frequency f) is expanded to a third dimension which represents the harmonic series a_f of each f (with h ∈ {1/2, 1, 2, 3, 4, 5}). When f = f0, a_f represents the specific harmonic series of the musical instrument (plus an extra value at the (1/2)f position used to avoid octave errors). When f ≠ f0, a_f represents (almost) random values. In [2], the goal is to estimate the parameters of a filter such that, when multiplied with this third dimension a_f, it provides very different values when f = f0 and when f ≠ f0. This filter is then convolved over all log-frequencies f and times τ to estimate the f0's. This filter is trained using annotated data. In [2], there are actually several such filters; they constitute the first layer of a convolutional network. In practice, in [2], the a_{h,f} are not obtained as |X(hf)| but by stacking in depth several CQTs, each starting at a different minimal frequency h·fmin. The representation is denoted Harmonic Constant-Q Transform (HCQT): Xhcqt(f, τ, h).
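A minimal sketch of this HCQT construction using librosa (the library the paper's footnote says was used for the STFT and CQT). The parameter values (fmin = 32.7 Hz, 6 octaves of 60 bins per octave, hop 256) are common choices for this representation and should be read as assumptions rather than the confirmed settings of [2].

```python
import numpy as np
import librosa

def hcqt(y, sr=22050, fmin=32.7, harmonics=(0.5, 1, 2, 3, 4, 5),
         n_bins=360, bins_per_octave=60, hop_length=256):
    """Stack CQTs whose minimal frequencies are h*fmin, giving a tensor
    indexed by (log-frequency f, time tau, harmonic h), i.e. Xhcqt(f, tau, h).
    Assumes y is at least a few seconds long (the lowest filters are long)."""
    layers = [np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                                 fmin=h * fmin, n_bins=n_bins,
                                 bins_per_octave=bins_per_octave))
              for h in harmonics]
    return np.stack(layers, axis=-1)
```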

Page 3: DEEP-RHYTHM FOR TEMPO ESTIMATION AND RHYTHM …

[Figure 1 (flowchart, image not reproduced here): raw audio x(t) → STFT X(f, τ) → STFT in frequency bands X(b, τ) → Onset Strength Function Xo(b, τ) → HCQT → HCQM Xhcqm(φ, τ′, b, h).]

Figure 1. Flowchart of the computation of the Harmonic-Constant-Q-Modulation (HCQM). See text for details.

2.2 Input representation: the HCQM

As mentioned, our goal here is to adapt the harmonic representation of rhythm proposed in [30] [31] [32] to the deep learning formalism proposed in [2]. For this, the HCQT proposed by [2] is not applied to the audio signal, but to a set of Onset-Strength-Functions (OSF) which represent the rhythm content in several acoustic frequency bands. The OSFs are low-pass signals whose temporal evolution is related to the tempo and the rhythm pattern.

We denote our representation by Harmonic-Constant-Q-Modulation (HCQM). Like the Modulation Spectrum (MS) [1], it represents, using a time/frequency (τ′/φ) representation, the energy evolution (a low-pass signal) within each frequency band b of a first time/frequency (τ/f) representation. However, while the MS uses two interleaved Short-Time Fourier Transforms (STFTs) for this, we use a Constant-Q transform for the second time/frequency representation (in order to obtain a better spectral resolution). Finally, as proposed by [2], we add one extra dimension to represent the content at the harmonics of each frequency φ. We denote it by Xhcqm(φ, τ′, b, h), where φ are the modulation frequencies (which correspond to the tempo frequencies), τ′ are the times of the CQT frames, b are the acoustic frequency bands and h the harmonic numbers.

Computation. In Figure 1, we indicate the computation flowchart of the HCQM. Given an audio signal x(t), we first compute its STFT, denoted by X(f, τ). The acoustic frequencies f of the STFT are grouped into logarithmically-spaced acoustic frequency bands b ∈ [1, B = 8]. We denote the result by X(b, τ). The goal of this is to reduce the dimensionality while preserving the information on the spectral location of the rhythm events (kick patterns tend to lie in low frequencies while hi-hat patterns lie in high frequencies). For each band b, we then compute an Onset-Strength-Function over time τ, denoted by Xo(b, τ).

For a specific b, we now consider the signal s_b(τ) = Xo(b, τ) and perform the analysis of its periodicities over time τ. One possibility would be to compute a time-frequency representation S_b(φ, τ′) over tempo frequencies φ and time frames τ′ and then sample S_b(φ, τ′) at the positions hφ with h ∈ {1/2, 1, 2, 3, 4, 5} to obtain S_b(φ, τ′, h). This is the idea we used in [32]. However, in the present work, we use the idea proposed by [2]: we compute a set of CQTs 1, each one with a different starting frequency hφmin. We set φmin = 32.7. Each of these CQTs gives us S_{b,h}(φ, τ′) for one value of h. Stacking them over h therefore provides us with S_b(φ, τ′, h). The idea proposed by [2] therefore allows us to mimic the sampling at the positions hφ while providing the correct window length to achieve the appropriate spectral resolution. We finally stack the S_b(φ, τ′, h) over b to obtain the 4D-tensor Xhcqm(φ, τ′, b, h). The computation parameters (sample rate, hop size, window length and FFT size) are set such that the CQT modulation frequency φ represents the tempo value in BPM.
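The following sketch strings these steps together with numpy and librosa. It is a simplified stand-in rather than the authors' implementation: the mel filterbank used for the band grouping, the OSF definition (half-wave rectified log-magnitude difference) and all numeric parameters (hop sizes, 240 tempo bins, 60 bins per octave) are assumptions. Note that at these low modulation frequencies the longest CQT filters span several minutes of OSF, so short excerpts call for a coarser bins_per_octave or a filter_scale below 1.

```python
import numpy as np
import librosa

def compute_hcqm(y, sr=22050, n_bands=8, hop_length=128,
                 phi_min_bpm=32.7, harmonics=(0.5, 1, 2, 3, 4, 5),
                 n_bins=240, bins_per_octave=60, mod_hop=64):
    # 1) time/frequency representation grouped into a few acoustic bands
    #    (a mel filterbank is used here as a stand-in for the log-spaced bands)
    S = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length,
                                       n_mels=n_bands)
    # 2) one onset-strength-function per band:
    #    half-wave rectified difference of the log-magnitude
    osf = np.maximum(0.0, np.diff(np.log1p(S), axis=1))   # shape (b, frames)
    frame_rate = sr / hop_length                           # sampling rate of the OSFs
    phi_min_hz = phi_min_bpm / 60.0                        # tempo frequency in Hz
    # 3) for each harmonic h, one CQT per band starting at h * phi_min.
    #    At these modulation frequencies the lowest filters span several
    #    minutes of OSF; for short excerpts use a coarser bins_per_octave
    #    or pass filter_scale < 1 to librosa.cqt.
    layers = []
    for h in harmonics:
        per_band = [np.abs(librosa.cqt(osf[b], sr=frame_rate, hop_length=mod_hop,
                                       fmin=h * phi_min_hz, n_bins=n_bins,
                                       bins_per_octave=bins_per_octave))
                    for b in range(n_bands)]
        layers.append(np.stack(per_band, axis=-1))         # (phi, tau', b)
    return np.stack(layers, axis=-1)                       # (phi, tau', b, h)
```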

Illustration. For ease of visualisation (it is difficult to visualize a 4D-tensor), we illustrate the HCQM Xhcqm(φ, τ′, b, h) for a given τ′ (it is then a 3D-tensor). Figure 2 [Left] represents Xhcqm(φ, b, h) for a real audio signal with a tempo of 120 bpm. Each sub-figure represents Xhcqm(φ, b, h) for a different value of h ∈ {1/2, 1, 2, 3, 4, 5}. The y-axis and x-axis are the tempo frequency φ and the acoustic frequency band b. The dashed rectangle super-imposed on the sub-figures indicates the slice of values Xhcqm(φ = 120 bpm, b, h) which corresponds to the ground-truth tempo. It is this specific pattern over b and h that we want to learn using the filters W of the first layer of our convolutional network.
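For instance, such a per-τ′ visualization can be produced with matplotlib; the sketch below assumes `hcqm` is the (φ, τ′, b, h) tensor returned by the computation sketch above.

```python
import matplotlib.pyplot as plt

def plot_hcqm_slice(hcqm, tau_index, harmonics=(0.5, 1, 2, 3, 4, 5)):
    """Show Xhcqm(phi, b, h) at one time frame tau', one panel per harmonic h."""
    fig, axes = plt.subplots(1, len(harmonics), sharey=True, figsize=(12, 3))
    for i, ax in enumerate(axes):
        ax.imshow(hcqm[:, tau_index, :, i], aspect="auto", origin="lower")
        ax.set_title(f"h = {harmonics[i]}")
        ax.set_xlabel("acoustic band b")
    axes[0].set_ylabel("tempo frequency phi (CQT bin)")
    fig.tight_layout()
    plt.show()
```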

2.3 Architecture of the Convolutional Neural Network

The architecture of our network is inspired both by the one of [2] (since we perform convolutions over an input spectral representation and use its depth) and by the one of [37] (since we perform a classification). However, it differs in the definition of the input and output.

Input. In [2], the input is the 3D-tensor Xcqt(f, τ, h) and the convolution is done over f and τ (with filters of depth H). In our case, the input could be the 4D-tensor Xhcqm(φ, τ′, b, h) and the convolution could be done over φ, τ′ and b (with filters of depth H). However, to simplify the computation, we reduce Xhcqm(φ, τ′, b, h) to a sequence over τ′ of 3D-tensors Xhcqm(φ, b, h) 2. The convolution is then done over φ and b with filters of depth H.

Our goal is to learn filters W that are narrow in φ and large in b and which represent the specific shape of the harmonic content of a rhythm pattern.

1 For the STFT and CQT we used the librosa library [27].
2 Future works will concentrate on performing the convolution directly using the 4D-tensors, which would allow smoothing over time.


[Figure 2, Right panel (image not reproduced here): the CNN takes as input an HCQM slice Xhcqm(φ, b, h) with 240 tempo bins φ, 8 acoustic bands b and harmonic depth h ∈ {0.5, 1, 2, 3, 4, 5}; it applies convolutional layers of 128*(4,6), 64*(4,6), 64*(4,6), 32*(4,6) and 8*(120,6) filters, a Flatten layer of 15360 units, a fully-connected layer of 256 units and a softmax output over C classes. The Left panel shows an example HCQM.]

Figure 2. [Left] Example of the HCQM for a real audio with tempo 120 bpm. [Right] Architecture of our CNN.

We do the convolution over b because the same rhythm pattern can be played with instruments transposed in acoustic frequencies.

Output. The output of the network proposed by [2] is a 2D representation which represents a saliency map of the harmonic content over time. In our case, the outputs are either the C = 256 classes of tempo (as in [37], we consider the tempo estimation problem as a classification problem into 256 tempo classes), or the C = 13 (for Extended-Ballroom) or C = 6 (for Greek-dances) classes of rhythm pattern. To do so, we added at the end of the network proposed by [2] two dense layers, the last one with C units and a softmax activation.

Architecture. In Figure 2 [Right], we indicate the architecture of our network. The input is a 3D-tensor Xhcqm(φ, b, h) for a given time τ′. The first layer is a set of 128 convolutional filters of shape (φ = 4, b = 6) (with depth H). As mentioned, the convolution is done over φ and b. The shape of these filters has been chosen such that they are narrow in tempo frequency φ (to precisely estimate the tempo) but cover multiple acoustic frequency bands b (because the information relative to the tempo/rhythm covers several bands). As illustrated in Figure 2 [Left], the goal of the filters is to identify the pattern over b and h specific to φ = tempo frequency.

The first layer is followed by two convolutional layers of 64 filters of shape (4, 6), one layer of 32 filters of shape (4, 6) and one layer of 8 filters of shape (120, 6) (this allows tracking the relationships between the modulation frequencies φ). The output of the last convolutional layer is then flattened and followed by a dropout with p = 0.5 (to avoid over-fitting [41]), a fully-connected layer of 256 units and a last fully-connected layer of C units with a softmax activation to perform the classification into C classes. The softmax activation vector is denoted by y(τ′). The loss function to be minimized is the categorical cross-entropy.

All layers are preceded by a batch normalization [16]. We used Rectified Linear Units (ReLU) [28] for all convolutional layers, and Exponential Linear Units (eLU) [5] for the first fully-connected layer.
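A possible Keras realization of this architecture is sketched below. The input shape (240 tempo bins, 8 bands, harmonic depth 6), the "same" padding and the placement of the batch-normalization and activation layers are read off Figure 2 and the text (they reproduce the 15360-unit Flatten of the figure), but they remain assumptions since the paper does not give an implementation.

```python
import tensorflow as tf

def build_deep_rhythm_cnn(n_phi=240, n_bands=8, n_harmonics=6, n_classes=256):
    def conv_block(x, n_filters, shape):
        # "all layers are preceded by a batch normalization"
        x = tf.keras.layers.BatchNormalization()(x)
        return tf.keras.layers.Conv2D(n_filters, shape, padding="same",
                                      activation="relu")(x)

    # one HCQM slice Xhcqm(phi, b, h) at a given tau'; the harmonic depth h
    # plays the role of the input channels, so every filter has depth H
    inputs = tf.keras.Input(shape=(n_phi, n_bands, n_harmonics))
    x = conv_block(inputs, 128, (4, 6))
    x = conv_block(x, 64, (4, 6))
    x = conv_block(x, 64, (4, 6))
    x = conv_block(x, 32, (4, 6))
    x = conv_block(x, 8, (120, 6))      # tall filter relating distant tempo frequencies
    x = tf.keras.layers.Flatten()(x)    # 240 * 8 * 8 = 15360 units, as in Figure 2
    x = tf.keras.layers.Dropout(0.5)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dense(256, activation="elu")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```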

2.4 Training

The inputs of our network are the 3D-tensor HCQMs Xhcqm(φ, b, h) computed for all times τ′ of the music track.

On the other side, the datasets we use for our experiments only provide global tempi 3 or global rhythm classes as ground-truths. We therefore have several HCQMs for a given track which are all associated with the same ground-truth (multiple instance learning).

To fix the network hyper-parameters, we split the training set into a train part (90%) and a validation part (10%). We used the ADAM [17] optimizer to find the parameters of the network, with a constant learning rate of 0.001, β1 = 0.9, β2 = 0.999 and ε = 1e−8. We used mini-batches of 256 HCQMs with shuffling and a maximum of 20 epochs with early stopping.
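A possible training setup matching these hyper-parameters is sketched below (Keras, reusing the architecture sketch above). X_train and y_train are hypothetical placeholders for the HCQM slices and their one-hot labels, and the early-stopping patience is our assumption, since the paper only states that early stopping is used.

```python
import tensorflow as tf

model = build_deep_rhythm_cnn(n_classes=256)   # from the architecture sketch above
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                       beta_2=0.999, epsilon=1e-8),
    loss="categorical_crossentropy",
    metrics=["accuracy"])

# X_train: HCQM slices of shape (n_items, 240, 8, 6); y_train: one-hot labels.
# Both are hypothetical placeholders; the early-stopping patience is assumed.
early_stop = tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
model.fit(X_train, y_train, validation_split=0.1, batch_size=256,
          epochs=20, shuffle=True, callbacks=[early_stop])
```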

2.5 Aggregating decisions over time

Our network provides an estimation of the tempo or rhythm class at each time τ′. To obtain a global value, we aggregate the softmax activation vectors y(τ′) over time by choosing the maximum of the vector y computed as the average over τ′ of the y(τ′).
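In code, this aggregation amounts to the following small numpy sketch:

```python
import numpy as np

def aggregate_over_time(y_frames):
    """y_frames: array of shape (n_tau', C) holding the per-frame softmax
    vectors y(tau'); returns the index of the global class."""
    return int(np.argmax(y_frames.mean(axis=0)))
```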

3. EVALUATION OF THE SYSTEM

We evaluate our proposed system for two tasks: - global tempo estimation (§3.1), - rhythm pattern recognition (§3.2). We used the same system (same input representation and same network architecture) for the two tasks. However, considering that the class definitions are different, we performed two independent trainings 4.

3.1 Global tempo estimation

Training and testing sets. To be able to compare our results with previous works, we use the same paradigm (cross-dataset validation 5) and the same datasets as [37], which we consider here as the state-of-the-art. The training set is the union of

3 It should be noted that this does not always correspond to the reality of the track content, since some tracks have a tempo varying over time (tempo drift), a silent introduction or a break in the middle.

4 Future works will consider training a single network for the two tasks or applying transfer learning from one task to the other.

5 Cross-dataset validation uses separate datasets for training and testing, not only splitting a single dataset into a train and a test part.


• LMD tempo: a subset of the Lakh MIDI dataset [35] annotated with tempo by [38]; it contains 3611 items of 30 s excerpts covering 10 different genres;

• MTG tempo: a subset of the GiantSteps MTG key dataset (MTG tempo) [6] annotated using a tapping method by [37]; it contains 1159 items of 2 min excerpts of electronic dance music;

• Extended Ballroom [24]: it contains 3826 items of 13 genres (it should be noted that we removed from the Ballroom test-set the items existing in the Extended Ballroom training-set).

The total size of the training set is 8596. It covers multiple musical genres to favor generalization.

The test-sets are also the same as in [37] (see [38] for their details): - ACM-Mirum [33] (1410 items), - ISMIR04 [12] (464 items), - Ballroom [12] (698 items), - Hainsworth [13] (222 items), - GTzan-Rhythm [22] (1000 items), - SMC [14] (217 items), - Giantsteps Tempo [19] (664 items). We also added the RWC-Popular [10] (100 items) for comparison with [3]. As in [37], Combined denotes the union of all test-sets (except RWC-Popular).

Evaluation protocol. We measure the performances using the following metrics (a minimal implementation sketch is given after the list):

• Class-Accuracy: it measures the ability of our system to predict the correct tempo class (in our system we have 256 tempo classes ranging from 30 to 285 bpm);

• Accuracy1: it measures whether our estimated tempo is within ±4% of the ground-truth tempo;

• Accuracy2: the same as Accuracy1 but considering octave errors as correct.
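For reference, a minimal sketch of these metrics is given below; the integer-BPM class mapping and the set of octave-related factors accepted by Accuracy2 (2, 3, 1/2 and 1/3) follow common MIR practice and are assumptions rather than the paper's exact definitions.

```python
import numpy as np

def to_tempo_class(bpm, bpm_min=30, bpm_max=285):
    """Map a tempo in BPM to one of 256 integer-BPM classes (30..285)."""
    return int(np.clip(round(bpm), bpm_min, bpm_max)) - bpm_min

def class_accuracy(est_bpm, ref_bpm):
    return to_tempo_class(est_bpm) == to_tempo_class(ref_bpm)

def accuracy1(est_bpm, ref_bpm, tol=0.04):
    """Estimated tempo within +/- 4% of the ground-truth tempo."""
    return abs(est_bpm - ref_bpm) <= tol * ref_bpm

def accuracy2(est_bpm, ref_bpm, tol=0.04, factors=(1/3, 1/2, 1, 2, 3)):
    """Accuracy1, additionally accepting the usual octave-related factors."""
    return any(accuracy1(est_bpm, f * ref_bpm, tol) for f in factors)
```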

Results and discussions. The results are indicated in Tables 1, 2 and 3.

Validation of B and H. We first show that the separation of the signal into multiple acoustic frequency bands b ∈ [1, B = 8] and the use of the harmonic depth H are beneficial. For this, we compare the results obtained using: - B = 8 and h ∈ {1/2, 1, 2, 3, 4, 5} (column “new”); - B = 8 but h ∈ {1} (column “h=1”); - B = 1 and h ∈ {1/2, 1, 2, 3, 4, 5} (column “b=1”). Results are only indicated for the “Combined” test-set. For all metrics (Class-Accuracy, Accuracy-1 and Accuracy-2), the best results are obtained using B = 8 and h ∈ {1/2, 1, 2, 3, 4, 5}.

Comparison with state-of-the-art. We now compare our results with the state-of-the-art represented by the three following systems: - sch1 denotes the results published in [38], - sch2 those in [37] and - böck those in [3].

According to these tables, we see that our method allows an improvement for two test-sets: - Ballroom in terms of Class-Accuracy (73.8); - Ballroom and Giantsteps in terms of Accuracy1 (92.6 and 83.6) and Accuracy2 (98.7 and 97.9). This can be explained by the fact that the musical genres of these two test-sets are represented in our training set (by the Extended-Ballroom and MTG tempo respectively). It should be noted however that there is no intersection between the training and test sets. For the ISMIR04 and GTZAN test-sets, our results are very close to (but lower than) those of sch1. For the RWC-Popular test-set, our results (73.0 and 98.0) largely outperform those of böck in terms of Accuracy-1 and Accuracy-2. Finally, if we consider the Combined dataset, our method slightly outperforms the other ones in terms of Accuracy1 (74.4).

The worst results of our method were obtained for the SMC data-set. This can be explained by the fact that the SMC data-set contains rhythm patterns very different from the ones represented in our training-sets.

While our results for Accuracy2 (92.0) are of course higher than our results for Accuracy1 (74.4), they are still lower than those of böck (93.6). One reason for this is that our network (as the ones of sch1 and sch2) is trained to discriminate BPM classes; it therefore doesn't know anything about octave equivalences.

3.2 Rhythm pattern recognition

Training and testing sets. For rhythm pattern recognition, it is not possible to perform cross-dataset validation (as we did above) because the definition of the rhythm pattern classes is specific to each dataset. Therefore, as in previous works [25], we perform a 10-fold cross-validation using the following datasets:

• Extended Ballroom [24]: it contains 4180 samples of C=13 rhythm classes: ’Foxtrot’, ’Salsa’, ’Viennesewaltz’, ’Tango’, ’Rumba’, ’Wcswing’, ’Quickstep’, ’Chacha’, ’Slowwaltz’, ’Pasodoble’, ’Jive’, ’Samba’, ’Waltz’;

• Greek-dances [15]: it contains 180 samples of C=6 rhythm classes: ’Kalamatianos’, ’Kontilies’, ’Maleviziotis’, ’Pentozalis’, ’Sousta’ and ’Kritikos Syrtos’.

Evaluation protocol. As in previous works [25], we measure the performances using the average-mean-Recall R 6. For a given fold f, the mean-Recall Rf is the mean over the classes c of the class-recall Rf,c. The average mean-Recall R is then the average over f of the mean-Recall Rf. We also indicate the standard deviation over f of the mean-Recall Rf.
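A small sketch of this protocol, using scikit-learn's macro-averaged recall (which equals the mean over classes of the class-recall):

```python
import numpy as np
from sklearn.metrics import recall_score

def average_mean_recall(folds):
    """folds: list of (y_true, y_pred) pairs, one per cross-validation fold.
    Returns the average and standard deviation over folds of the
    macro-averaged recall (the mean over classes of the class-recall)."""
    per_fold = [recall_score(y_true, y_pred, average="macro")
                for y_true, y_pred in folds]
    return float(np.mean(per_fold)), float(np.std(per_fold))
```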

Results and discussions. The results are indicated in Table 4. We compare our results with the ones of [25], denoted as march, considered here as representative of the current state-of-the-art.

Our results (76.4 and 68.3) are largely below the ones of march (94.9 and 77.2). This can be explained by the fact that the “Scale and shift invariant time/frequency” representation proposed in [25] takes into account the inter-relationships between the frequency bands of the rhythmic events while our HCQM does not.

To better understand these results, we indicate in Figure 3 [Top] the confusion matrix for the Extended Ballroom. The diagonal represents the Recall Rc of each class c. We see that our system is actually suitable to detect the majority of the classes: Rc ≥ 88% for 9 classes out of 13. Chacha, Waltz and Wcswing make R drop completely. Waltz is actually estimated 97% of the

6 The mean-Recall is not sensitive to the distribution of the classes. Its random value is always 1/C for a problem with C classes.


Datasets     sch1  sch2  böck  new   h=1   b=1
ACMMirum     38.3  40.6  29.4  38.2
ISMIR04      37.7  34.1  27.2  29.4
Ballroom     46.8  67.9  33.8  73.8
Hainsworth   43.7  43.2  33.8  23.0
GTzan        38.8  36.9  32.2  27.6
SMC          14.3  12.4  17.1   7.8
Giantsteps   53.5  59.8  37.2  25.5
RWCpop        X     X     X    66.0
Combined     40.9  44.8  31.2  36.8  29.8  32.4

Table 1. Class-Accuracy

Datasets     sch1  sch2  böck  new   h=1   b=1
ACMMirum     72.3  79.5  74.0  73.3
ISMIR04      63.4  60.6  55.0  61.2
Ballroom     64.6  92.0  84.0  92.6
Hainsworth   65.8  77.0  80.6  73.4
GTzan        71.0  69.4  69.7  69.7
SMC          31.8  33.6  44.7  30.9
Giantsteps   63.1  73.0  58.9  83.6
RWCpop        X     X    60.0  73.0
Combined     66.5  74.2  69.5  74.4  64.2  67.9

Table 2. Accuracy1

Datasets     sch1  sch2  böck  new   h=1   b=1
ACMMirum     97.3  97.4  97.7  96.5
ISMIR04      92.2  92.2  95.0  87.1
Ballroom     97.0  98.4  98.7  98.7
Hainsworth   85.6  84.2  89.2  82.9
GTzan        93.3  92.6  95.0  89.1
SMC          55.3  50.2  67.3  50.7
Giantsteps   88.7  89.3  86.4  97.9
RWCpop        X     X    95.0  98.0
Combined     92.2  92.1  93.6  92.0  82.0  88.4

Table 3. Accuracy2

[Figure 3 (two confusion-matrix heatmaps over the 13 Extended Ballroom classes: Chacha, Foxtrot, Jive, Pasodoble, Quickstep, Rumba, Salsa, Samba, Slowwaltz, Tango, Viennesewaltz, Waltz, Wcswing; images not reproduced here).]

Figure 3. Confusion matrix for the Extended Ballroom. [Top] using h ∈ {0.5, 1, 2, 3, 4, 5}. [Bottom] using h ∈ {1/4, 1/3, 1/2, 2/3, 3/4, 1, 1.25, 1.33, ..., 8}.

time as Pasodoble. This can be explained by the fact that our current harmonic series h ∈ {1/2, 1, 2, 3, 4, 5} does not represent anything about the 3/4 meter specific to Waltz, which would be represented by h = 1/3. To verify our assumption, we redo the experiment using this time exactly the same harmonic series as proposed in

Datasets            march  new
Extended Ballroom   94.9   76.4 (33.1)
Greek-dances        77.2   68.3 (27.5)

Table 4. Average (std) Recall R for rhythm classification.

[32]: h ∈ {1/4, 1/3, 1/2, 2/3, 3/4, 1, 1.25, 1.33, ..., 8}. The corresponding confusion matrix is indicated in Figure 3 [Bottom], where Waltz is now almost perfectly recognized (97%); however, Slowwaltz is now recognized as Waltz in 97% of the cases, which makes the average mean-Recall actually decrease to 74.6% (even though the system behaves better). The low results for Wcswing can be explained by the (too) small number of training examples (23).

4. CONCLUSION

In this paper we have presented a new approach for global tempo and rhythm pattern classification. We have proposed the Harmonic-Constant-Q-Modulation (HCQM) representation, a 4D-tensor which represents the harmonic series at candidate tempo frequencies of a multi-band Onset-Strength-Function. This HCQM is then used as input to a deep convolutional network. The filters of the first layer of this network are supposed to learn the specific characteristics of the various rhythm patterns. We have evaluated our approach for two classification tasks: global tempo and rhythm pattern classification.

For tempo estimation, our method outperforms previous approaches (in terms of Accuracy1 and Accuracy2) for the Ballroom (ballroom music) and Giantsteps tempo (electronic music) test-sets. Both test-sets represent music genres with a strong focus on rhythm. It seems therefore that our approach works better when rhythm patterns are clearly defined. Our method also performs slightly better (in terms of Accuracy1) for the Combined test-set.

For rhythm classification, our method doesn't work as well as the state of the art [25]. However, the confusion matrices indicate that our recognition is above 90% for the majority of the classes of the Extended Ballroom. Moreover, we have shown that adapting the harmonic series h can help improve the performance.

Among future works, we would like to study the use of the HCQM 4D-tensors directly, to study other harmonic series, and to study the joint training of (or transfer learning between) tempo and rhythm pattern classification.


Acknowledgement: This work was partly supported by the European Union's Horizon 2020 research and innovation program under grant agreement No 761634 (FuturePulse project).

5. REFERENCES

[1] Les Atlas and Shihab A Shamma. Joint acoustic andmodulation frequency. Advances in Signal Processing,EURASIP Journal on, 2003:668–675, 2003.

[2] Rachel M Bittner, Brian McFee, Justin Salamon, Pe-ter Li, and Juan Pablo Bello. Deep salience represen-tations for f0 estimation in polyphonic music. In Proc.of ISMIR (International Society for Music InformationRetrieval), Suzhou, China, 2017.

[3] Sebastian Böck, Florian Krebs, and Gerhard Widmer.Accurate tempo estimation based on recurrent neuralnetworks and resonating comb filters. In Proc. of IS-MIR (International Society for Music Information Re-trieval), Malaga, Spain, 2015.

[4] Ching-Wei Chen, Markus Cremer, Kyogu Lee, Pe-ter DiMaria, and Ho-Hsiang Wu. Improving perceivedtempo estimation by statistical modeling of higher-level musical descriptors. In Audio Engineering Soci-ety Convention 126. Audio Engineering Society, 2009.

[5] Djork-Arné Clevert, Thomas Unterthiner, and SeppHochreiter. Fast and accurate deep network learn-ing by exponential linear units (elus). arXiv preprintarXiv:1511.07289, 2015.

[6] Angel Faraldo, Sergi Jorda, and Perfecto Herrera. Amulti-profile method for key estimation in edm. In AESInternational Conference on Semantic Audio, Ilmenau,Germany, 2017. Audio Engineering Society.

[7] Jonathan Foote, Matthew L Cooper, and Unjung Nam.Audio retrieval by rhythmic similarity. In Proc. of IS-MIR (International Society for Music Information Re-trieval), Paris, France, 2002.

[8] Mikel Gainza and Eugene Coyle. Tempo detection using a hybrid multiband approach. Audio, Speech and Language Processing, IEEE Transactions on, 19(1):57–68, 2011.

[9] Aggelos Gkiokas, Vassilios Katsouros, and George Carayannis. Reducing tempo octave errors by periodicity vector coding and SVM learning. In Proc. of ISMIR (International Society for Music Information Retrieval), Porto, Portugal, 2012.

[10] Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka. RWC music database: Popular, classical and jazz music databases. In Proc. of ISMIR (International Society for Music Information Retrieval), Paris, France, 2002.

[11] Masataka Goto and Yoichi Muraoka. A beat tracking system for acoustic signals of music. In Proceedings of the second ACM international conference on Multimedia, San Francisco, California, USA, 1994.

[12] Fabien Gouyon, Anssi Klapuri, Simon Dixon, Miguel Alonso, George Tzanetakis, Christian Uhle, and Pedro Cano. An experimental comparison of audio tempo induction algorithms. Audio, Speech and Language Processing, IEEE Transactions on, 14(5):1832–1844, 2006.

[13] Stephen Webley Hainsworth. Techniques for the Automated Analysis of Musical Audio. PhD thesis, University of Cambridge, UK, September 2004.

[14] Andre Holzapfel, Matthew E. P. Davies, José R. Zapata, João Lobato Oliveira, and Fabien Gouyon. Selective sampling for beat tracking evaluation. Audio, Speech and Language Processing, IEEE Transactions on, 20(9):2539–2548, 2012.

[15] André Holzapfel and Yannis Stylianou. Scale transform in rhythmic similarity of music. Audio, Speech and Language Processing, IEEE Transactions on, 19(1):176–185, 2011.

[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[18] Anssi P. Klapuri, Antti J. Eronen, and Jaakko T. Astola. Analysis of the meter of acoustic musical signals. Audio, Speech and Language Processing, IEEE Transactions on, 14(1):342–355, 2006.

[19] Peter Knees, Angel Faraldo, Perfecto Herrera, Richard Vogl, Sebastian Böck, Florian Hörschläger, and Mickael Le Goff. Two data sets for tempo estimation and key detection in electronic dance music annotated from user corrections. In Proc. of ISMIR (International Society for Music Information Retrieval), Malaga, Spain, 2015.

[20] Mark Levy. Improving perceptual tempo estimation with crowd-sourced annotations. In Proc. of ISMIR (International Society for Music Information Retrieval), Miami, Florida, USA, 2011.

[21] Robert C. Maher and James W. Beauchamp. Fundamental frequency estimation of musical signals using a two-way mismatch procedure. JASA (Journal of the Acoustical Society of America), 95(4):2254–2263, 1994.

[22] Ugo Marchand, Quentin Fresnel, and Geoffroy Peeters. GTZAN-Rhythm: Extending the GTZAN test-set with beat, downbeat and swing annotations. In Late-Breaking/Demo Session of ISMIR (International Society for Music Information Retrieval), Malaga, Spain, October 2015.

[23] Ugo Marchand and Geoffroy Peeters. The modulation scale spectrum and its application to rhythm-content description. In Proc. of DAFx (International Conference on Digital Audio Effects), Erlangen, Germany, 2014.

[24] Ugo Marchand and Geoffroy Peeters. The Extended Ballroom dataset. In Late-Breaking/Demo Session of ISMIR (International Society for Music Information Retrieval), New York, USA, August 2016.

[25] Ugo Marchand and Geoffroy Peeters. Scale and shift invariant time/frequency representation using auditory statistics: Application to rhythm description. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2016.

[26] R. McAulay and T. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. Acoustics, Speech and Signal Processing, IEEE Transactions on, 34:744–754, 1986.

[27] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, Austin, Texas, 2015.

[28] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. of ICML (International Conference on Machine Learning), Haifa, Israel, 2010.

[29] Geoffroy Peeters. A large set of audio features for sound description (similarity and classification) in the Cuidado project. Cuidado project report, Ircam, 2004.

[30] Geoffroy Peeters. Template-based estimation of time-varying tempo. Advances in Signal Processing, EURASIP Journal on, 2007(1):067215, 2006.

[31] Geoffroy Peeters. Template-based estimation of tempo: using unsupervised or supervised learning to create better spectral templates. In Proc. of DAFx (International Conference on Digital Audio Effects), pages 209–212, Graz, Austria, September 2010.

[32] Geoffroy Peeters. Spectral and temporal periodicity representations of rhythm for the automatic classification of music audio signal. Audio, Speech and Language Processing, IEEE Transactions on, 19(5):1242–1252, 2011.

[33] Geoffroy Peeters and Joachim Flocon-Cholet. Perceptual tempo estimation using GMM-regression. In Proceedings of the second international ACM workshop on Music information retrieval with user-centered and multimodal strategies, Nara, Japan, 2012.

[34] Graham Percival and George Tzanetakis. Streamlined tempo estimation based on autocorrelation and cross-correlation with pulses. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 22(12):1765–1776, 2014.

[35] Colin Raffel. Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching. PhD thesis, Columbia University, 2016.

[36] Eric D. Scheirer. Tempo and beat analysis of acoustic musical signals. JASA (Journal of the Acoustical Society of America), 103(1):588–601, 1998.

[37] Hendrik Schreiber and Meinard Müller. A single-step approach to musical tempo estimation using a convolutional neural network. In Proc. of ISMIR (International Society for Music Information Retrieval), Paris, France, 2018.

[38] Hendrik Schreiber and Meinard Müller. A post-processing procedure for improving music tempo estimates using supervised learning. In Proc. of ISMIR (International Society for Music Information Retrieval), Suzhou, China, 2017.

[39] Xavier Serra and Julius Smith. Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4):12–24, 1990.

[40] Klaus Seyerlehner, Gerhard Widmer, and Dominik Schnitzer. From rhythm patterns to perceived tempo. In Proc. of ISMIR (International Society for Music Information Retrieval), Vienna, Austria, 2007.

[41] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[42] George Tzanetakis and Perry Cook. Musical genre classification of audio signals. Speech and Audio Processing, IEEE Transactions on, 10(5):293–302, 2002.

[43] Linxing Xiao, Aibo Tian, Wen Li, and Jie Zhou. Using statistic model to capture the association between timbre and perceived tempo. In Proc. of ISMIR (International Society for Music Information Retrieval), Philadelphia, PA, USA, 2008.

[44] Jose Zapata and Emilia Gómez. Comparative evaluation and combination of audio tempo estimation approaches. In AES 42nd Conference on Semantic Audio, Ilmenau, Germany, 2011.