Transcript of 13-recurrent_neural_network.pdf (KAIST, mac.kaist.ac.kr/.../slides/13-recurrent_neural_network.pdf)
GCT634: Musical Applications of Machine Learning
Recurrent Neural Network
Deep Learning for AMT
Juhan Nam, Graduate School of Culture Technology, KAIST
Outline
• Recurrent Neural Networks (RNN)
- Introduction
- Mechanics
• Deep Learning for Automatic Music Transcription
- Onset Detection
- Chord Recognition
- Polyphonic Note Transcription
Piano Note Transcription Using MLP
• Frame-level approach
- Every frame is assumed to be independent → this is not so intuitive
[Figure: audio spectrogram frames x(t) (input) mapped frame by frame to a MIDI piano roll (output)]
Recurrent Neural Networks (RNN)
• Add explicit connections between previous states and current states of hidden layers
- The hidden layers are “state vectors” with regard to time

[Figure: network diagram with recurrent connections in the hidden layers]
Recurrent Neural Networks (RNN)
• Add explicit connections between previous states and current states of hidden layers
- The hidden layers are “state vectors” with regard to time
h_1(t) = f(W_h1 h_1(t−1) + W_x1 x(t) + b_1)
h_2(t) = f(W_h2 h_2(t−1) + W_x2 h_1(t) + b_2)
h_3(t) = f(W_h3 h_3(t−1) + W_x3 h_2(t) + b_3)
ŷ(t) = g(W_x4 h_3(t) + b_4)

recurrent connections: t = 0, 1, 2, …
Recurrent Neural Networks (RNN)
• This simple structure is often called the “Vanilla RNN”
- tanh(·) is a common choice for the nonlinearity f
h_1(t) = f(W_h1 h_1(t−1) + W_x1 x(t) + b_1)
h_2(t) = f(W_h2 h_2(t−1) + W_x2 h_1(t) + b_2)
h_3(t) = f(W_h3 h_3(t−1) + W_x3 h_2(t) + b_3)
ŷ(t) = g(W_x4 h_3(t) + b_4)

recurrent connections: t = 0, 1, 2, …
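As a concrete sketch of the recurrence above, here is a minimal single-layer vanilla RNN forward pass in NumPy (a one-layer version of the equations; the variable names and small random dimensions are illustrative, not from the slides):

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, b, h0=None):
    """Vanilla RNN: h(t) = tanh(W_h h(t-1) + W_x x(t) + b).

    x_seq: array of shape (T, input_dim); returns hidden states (T, hidden_dim).
    """
    hidden_dim = W_h.shape[0]
    h = np.zeros(hidden_dim) if h0 is None else h0
    states = []
    for x_t in x_seq:                      # unroll over time steps t = 0, 1, 2, ...
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, D, H = 5, 3, 4                          # 5 time steps, 3 inputs, 4 hidden units
hs = rnn_forward(rng.normal(size=(T, D)),
                 rng.normal(size=(H, H)) * 0.1,   # recurrent weights W_h
                 rng.normal(size=(H, D)) * 0.1,   # input weights W_x
                 np.zeros(H))
print(hs.shape)  # (5, 4)
```

Note that the same (W_h, W_x, b) are reused at every step, which is exactly the parameter sharing shown in the unrolled diagrams that follow.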
Training RNN
• Forward pass
- The hidden layers keep updating the states over the time steps
- The parameters (W_h, W_x) are fixed and shared over the time steps
[Figure: RNN unrolled over time steps t = 0, 1, 2, …, with the same weights W_x and W_h applied at every step]
Unrolled RNN
Training RNN
• Backward pass
- Backpropagation through time (BPTT)
[Figure: unrolled RNN with the output errors E(t) at each time step propagated backward through the shared weights (BPTT)]
The Problem of Vanilla RNN
• As the number of time steps increases, the gradients during BPTT can become unstable
- Exploding or vanishing gradients
• Exploding gradients can be controlled by gradient clipping, but vanishing gradients require a different architecture
• The vanilla RNN is usually applied to short sequences
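Gradient clipping, mentioned above as the usual remedy for exploding gradients, can be sketched as global-norm rescaling (the max_norm value of 5.0 is an illustrative choice, not from the slides):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their global L2 norm
    does not exceed max_norm; small gradients pass through unchanged."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# A deliberately exploding gradient: global norm sqrt(10 * 100) ≈ 31.6
grads = [np.full((2, 2), 10.0), np.full(2, 10.0)]
clipped = clip_gradients(grads, max_norm=5.0)
norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(round(norm, 4))  # 5.0
```

This only bounds the update magnitude; it does nothing for vanishing gradients, which is why the LSTM architecture on the next slides is needed.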
Vanilla RNN
• Another view
[Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Long Short-Term Memory (LSTM)
• Four neural network layers in one module
• Two recurrent flows
[Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Long Short-Term Memory (LSTM)
• Cell state (“the key to LSTM”)
- Information can flow through the cell states without being much changed
- Linear connections

[Figure: LSTM cell state with the forget gate, input gate, and new information (candidate) paths]

[Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Long Short-Term Memory (LSTM)
• Generate the next state from the cell
Output gate
[Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Another View of LSTM
• Three gates (sigmoid) and the original RNN connection (tanh)
i = σ(W_i [h_{t−1}; x_t]),  f = σ(W_f [h_{t−1}; x_t]),  o = σ(W_o [h_{t−1}; x_t]),  g = tanh(W_g [h_{t−1}; x_t])
c_t = f ⊙ c_{t−1} + i ⊙ g
h_t = o ⊙ tanh(c_t)

Long Short-Term Memory (LSTM) [Hochreiter et al., 1997]
[Stanford CS231n]
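A minimal NumPy sketch of one LSTM step following these equations; the single weight matrix over the stacked [h; x] input mirrors the CS231n view, and the dimensions and names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the stacked [h_prev; x_t] vector to the
    pre-activations of the i, f, o, g blocks (4*H rows in total)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])              # the "original RNN connection"
    c = f * c_prev + i * g            # cell state: near-linear flow over time
    h = o * np.tanh(c)                # hidden state read out through the output gate
    return h, c

rng = np.random.default_rng(1)
H, D = 4, 3
h, c = np.zeros(H), np.zeros(H)
W = rng.normal(size=(4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
for _ in range(10):                   # run a few time steps
    h, c = lstm_step(rng.normal(size=D), h, c, W, b)
print(h.shape, c.shape)
```

The cell update c = f * c_prev + i * g is the "linear connection" the slides highlight: gradients can flow through it largely unchanged.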
Another View of LSTM
• Much more powerful than the vanilla RNNs
- Uninterrupted gradient flow is possible through the cell over time steps
- The structure with two recurrent flows is similar to ResNet
• Long-term dependency can be learned
- We can use long sequence data as input
[Figure: ImageNet architectures from He et al. (2016), “Deep Residual Learning for Image Recognition”: VGG-19, a 34-layer plain network, and a 34-layer residual network with shortcut connections]
Deep Learning for Automatic Music Transcription
• MLP, CNN and RNN have been applied to many AMT tasks
- Onset detection / Beat tracking
- Chord recognition
- Polyphonic piano transcription
Onset Detection
• Binary classification using Bi-directional LSTM
- Input: half-wave rectified differences of mel-spectrograms with different window sizes (a bit of hand-design)
- After training the Bi-LSTM, a peak-detection rule is used to find the final onsets
Universal Onset Detection with Bidirectional Long Short-Term Memory Neural Networks, Eyben et al. 2010
3.2 Recurrent neural networks
Another technique for introducing past context to neural networks is to add backward (cyclic) connections to FNNs. The resulting network is called a recurrent neural network (RNN). RNNs can theoretically map from the entire history of previous inputs to each output. The recurrent connections form a kind of memory, which allows input values to persist in the hidden layer(s) and influence the network output in the future. If future context is also required, a delay between the input values and the output targets can be introduced.
3.3 Bidirectional recurrent neural networks
A more elegant incorporation of future context is provided by bidirectional recurrent networks (BRNNs). Two separate hidden layers are used instead of one, both connected to the same input and output layers. The first processes the input sequence forwards and the second backwards. The network therefore always has access to the complete past and future context in a symmetrical way, without bloating the input layer size or displacing the input values from the corresponding output targets. The disadvantage of BRNNs is that they must have the complete input sequence at hand before it can be processed.
3.4 Long Short-Term Memory
Although BRNNs have access to both past and future information, the range of context is limited to a few frames due to the vanishing gradient problem [11]. The influence of an input value decays or blows up exponentially over time, as it cycles through the network with its recurrent connections and gets dominated by new input values.
[Figure 2. An LSTM block with one memory cell: input, input gate, forget gate, memory cell with a self-connection of weight 1.0, and output gate]
To overcome this deficiency, a method called Long Short-Term Memory (LSTM) was introduced in [13]. In an LSTM hidden layer, the nonlinear units are replaced by LSTM memory blocks (Figure 2). Each block contains one or more self-connected linear memory cells and three multiplicative gates. The internal state of the cell is maintained with a recurrent connection of constant weight 1.0. This connection enables the cell to store information over long periods of time. The content of the memory cell is controlled by the multiplicative input, output, and forget gates, which – in computer memory terminology – correspond to write, read, and reset operations. More details on the training algorithm employed, and the bidirectional LSTM architecture in general, can be found in [10].
4. PROPOSED APPROACH
This section describes our novel approach for onset detection in music signals, which is based on bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks. In contrast to previous approaches, it is able to model the context an onset occurs in. The properties of an onset and the amount of relevant context are thereby learned from the data set used for training. The audio data is transformed to the frequency domain via two parallel STFTs with different window sizes. The obtained magnitude spectra and their first order differences are used as inputs to the BLSTM network, which produces an onset activation function at its output. Figure 3 shows this basic signal flow. The individual blocks are described in more detail in the following sections.
[Figure 3. Basic signal flow of the new neural-network-based onset detector: Signal → two parallel STFT & Difference blocks → BLSTM Network → Peak detection → Onsets]
4.1 Feature extraction
As input, the raw PCM audio signal with a sampling rate of fs = 44.1 kHz is used. To reduce the computational complexity, stereo signals are converted to a monaural signal by averaging both channels. The discrete input audio signal x(t) is segmented into overlapping frames of W samples length (W = 1024 and W = 2048, see Section 4.2), which are sampled at a rate of one per 10 ms (onset annotations are available on a frame level). A Hamming window is applied to these frames. Applying the STFT yields the complex spectrogram X(n, k), with n being the frame index and k the frequency bin index. The complex spectrogram is converted to the power spectrogram S(n, k) = |X(n, k)|².
The dimensionality of the spectra is reduced by applying psychoacoustic knowledge: a conversion to the Mel frequency scale is performed with openSMILE [8]. A filterbank with 40 triangular filters, which are equidistant on the Mel scale, is used to transform the spectrogram S(n, k) to the Mel spectrogram M(n, m). To match human perception of loudness, a logarithmic representation is chosen:

M_log(n, m) = log(M(n, m) + 1.0)    (1)
11th International Society for Music Information Retrieval Conference (ISMIR 2010)
The positive first order difference D+(n, m) is calculated by applying a half-wave rectifier function H(x) = (x + |x|) / 2 to the difference of two consecutive Mel spectra:

D+(n, m) = H(M_log(n, m) − M_log(n − 1, m))    (2)
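Eqs. (1) and (2) can be sketched in a few lines of NumPy, assuming a non-negative mel spectrogram array of shape (frames, bands); the zero-padded first row is an implementation choice for alignment, not from the paper:

```python
import numpy as np

def positive_difference(mel_spec):
    """Eq. (1)-(2): log compression, then half-wave-rectified first-order
    difference along time. mel_spec: (n_frames, n_bands), non-negative."""
    m_log = np.log(mel_spec + 1.0)          # Eq. (1)
    diff = m_log[1:] - m_log[:-1]           # consecutive-frame difference
    d_plus = (diff + np.abs(diff)) / 2.0    # H(x) = (x + |x|) / 2
    # pad the first frame with zeros so output frames align with input frames
    return np.vstack([np.zeros((1, mel_spec.shape[1])), d_plus])

# Toy 3-frame, 2-band mel spectrogram: band 0 rises then flattens, band 1 dips then jumps
mel = np.array([[1.0, 4.0],
                [3.0, 2.0],
                [3.0, 9.0]])
print(positive_difference(mel))
```

Decreasing energy is clipped to zero, so only rising spectral energy (onset-like behaviour) survives.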
4.2 Neural Network stage
As a neural network, an RNN with BLSTM units is used. As inputs to the neural network, two log Mel spectrograms M_log^23(n, m) and M_log^46(n, m) (computed with window sizes of 23.2 ms and 46.4 ms (W = 1024 and W = 2048 samples), respectively) and their corresponding positive first order differences D+_23(n, m) and D+_46(n, m) are applied, resulting in 160 input units. The network has three hidden layers for each direction (6 layers in total) with 20 LSTM units each. The output layer has two units, whose outputs are normalised to both lie between 0 and 1, and to sum to 1, using the softmax function. The normalised outputs represent the probabilities for the classes ‘onset’ and ‘no onset’. This allows the use of the cross entropy error criterion to train the network [10]. Alternative networks with a single output, where a value of 1 represents an onset frame and a value of 0 a non-onset frame, which are trained using the mean squared output error as criterion, were not as successful.
4.2.1 Network training
For network training, supervised learning with early stopping is used. Each audio sequence is presented frame by frame (in correct temporal order) to the network. Standard gradient descent with backpropagation of the output errors is used to iteratively update the network weights. To prevent over-fitting, the performance (cross entropy error, cf. [10]) on a separate validation set is evaluated after each training iteration (epoch). If no improvement of this performance over 20 epochs is observed, the training is stopped and the network with the best performance on the validation set is used as the final network. The gradient descent algorithm requires the network weights to be initialised with non-zero values. We initialise the weights with a random Gaussian distribution with mean 0 and standard deviation 0.1. The training data, as well as validation and test sets, are described in Section 5.
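The early-stopping rule described above (stop after 20 epochs without validation improvement, keep the best network) can be sketched as follows; `train_epoch` and `validate` are hypothetical callables standing in for the actual training and evaluation code:

```python
def train_with_early_stopping(train_epoch, validate, patience=20, max_epochs=1000):
    """Stop when the validation error has not improved for `patience` epochs;
    return the best epoch index and its validation error."""
    best_err, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        train_epoch()                      # one pass over the training data
        err = validate()                   # error on the separate validation set
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                          # no improvement for `patience` epochs
    return best_epoch, best_err

# Toy run: validation error improves until epoch 5, then worsens.
errs = iter([1.0, 0.8, 0.6, 0.5, 0.45, 0.44] + [0.5] * 100)
best_epoch, best_err = train_with_early_stopping(lambda: None, lambda: next(errs))
print(best_epoch, best_err)  # 5 0.44
```

In the paper's setting, the weights of the best epoch's network would also be saved and restored at the end.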
4.3 Peak detection stage
A network obtained after training as described in the previous section is able to classify each frame into two classes: ‘onset’ and ‘no onset’. The standard method of choosing the output node with the highest activation to determine the frame class has not proven effective. Hence, only the output activation of the ‘onset’ class is used. Thresholding and peak detection are applied to it, as described in the following sections:
4.3.1 Thresholding
[Figure 4. Top: log Mel spectrogram with ground truth onsets (vertical dashed lines). Bottom: network output with detected onsets (marked by dots), ground truth onsets (dotted vertical lines), and threshold θ (horizontal dashed line). 4 s excerpt from ‘Basement Jaxx - Rendez-Vu’.]

One problem with existing magnitude based reduction functions (cf. Section 2) is that the amplitude of the detection function depends on the amplitude of the signal or the magnitude of its short time spectrum. Thus, to successfully deal with high dynamic ranges, adaptive thresholds must be used when thresholding the detection function prior to peak picking. Similar to phase based reduction functions, the output activation function of the BLSTM network is not affected by input amplitude variations, since its value represents a probability of observing an onset rather than representing onset strength. In order to obtain optimal classification for each song, a fixed threshold θ is computed per song, proportional to the median of the activation function (frames n = 1 … N), constrained to the range from θmin = 0.1 to θmax = 0.3:
θ* = λ · median{a_o(1), …, a_o(N)}    (3)
θ = min(max(0.1, θ*), 0.3)    (4)
with a_o(n) being the output activation function of the BLSTM neural network for the onset class, and the scaling factor λ chosen to maximise the F1-measure on the validation set. The final onset function o_o(n) contains only the activation values greater than this threshold:
o_o(n) = a_o(n) if a_o(n) > θ, and 0 otherwise    (5)
4.3.2 Peak picking
The onsets are represented by the local maxima of the onset detection function o_o(n). Thus, using a standard peak search, the final onset function o(n) is given by:
o(n) = 1 if o_o(n − 1) ≤ o_o(n) ≥ o_o(n + 1), and 0 otherwise    (6)
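Eqs. (3)-(6) can be sketched together as follows, assuming a 1-D array of onset-class activations; the toy activation values and λ = 1 are illustrative:

```python
import numpy as np

def detect_onsets(a_o, lam=1.0):
    """Eqs. (3)-(6): per-song threshold from the median activation,
    clamped to [0.1, 0.3], then local-maximum peak picking."""
    theta = min(max(0.1, lam * np.median(a_o)), 0.3)   # Eqs. (3)-(4)
    o_o = np.where(a_o > theta, a_o, 0.0)              # Eq. (5)
    onsets = []
    for n in range(1, len(o_o) - 1):                   # Eq. (6): local maxima
        if o_o[n] > 0 and o_o[n - 1] <= o_o[n] >= o_o[n + 1]:
            onsets.append(n)
    return theta, onsets

# Toy activation curve with two clear peaks at frames 2 and 6
act = np.array([0.05, 0.1, 0.9, 0.2, 0.05, 0.15, 0.7, 0.1, 0.05])
theta, onsets = detect_onsets(act)
print(theta, onsets)  # 0.1 [2, 6]
```

Because the threshold is derived from each song's own activation statistics, the detector adapts to quiet and loud recordings alike.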
Onset Detection
• Binary classification using 2D CNN
- Input: 80-band mel spectrogram with three channels computed with different window sizes
- Output: the binary output is determined at the center of the input frames
- Filter size of the first conv layer: wide in time and narrow in frequency
[Fig. 2: One of the Convolutional Neural Network architectures used in this work. Starting from a stack of three spectrogram excerpts (3 input channels, 15×80), convolution (7×3) and max-pooling (1×3) in turns compute 10 feature maps (9×78 → 9×26), then 20 feature maps (7×24 → 7×8), which are classified with a fully-connected network (256 sigmoid units and one sigmoid output unit).]
[…] maximum value in non-overlapping 2×2 pixel cells. This both reduces the amount of data and introduces some translation invariance. To be used for classification, the computation chain of a CNN ends in a fully-connected network that integrates information across all locations in all feature maps of the layer below. When introduced, this type of architecture set the state of the art in handwritten digit recognition [19], and still defines the state of the art on several computer vision tasks [20].
3.2. Application to Onset Detection
To be used as an onset detector, we train a CNN on spectrogram excerpts centered on the frame to classify, giving binary labels to distinguish onsets from non-onsets (see Fig. 2).
Computer vision usually uses square filters and square pooling. In spectrograms, though, the two dimensions represent two different modalities, and we found rectangular shapes to be more effective (cf. [18]). In particular, as the task mostly entails finding changes over time, we use filters wide in time and narrow in frequency, and as the task requires results of high time resolution, but is oblivious to frequency, we perform max-pooling over frequencies only.
Computer vision often handles color images, presenting the input such that each neuron has access to the same local region in all color channels (e.g., red, green, and blue). Here we train on a stack of spectrograms instead, with different window sizes, but the same frame rate, and reduced to the same number of frequency bands with logarithmic filter banks. This way each neuron can combine information of high temporal and high frequency accuracy for its location.
To detect onsets in a test signal, we compute the spectrograms and feed them to the network (instead of giving excerpts of the size used in training, we can apply the convolution and pooling operations to the full input at once), obtaining an onset activation function over time. This function is smoothed by convolution with a Hamming window of 5 frames, and local maxima higher than a given threshold are reported as onsets.
3.3. Training methodology
We train our networks using mini-batch gradient descent with momentum, minimizing cross-entropy error. As an extension to our experiments in [1], for each training case we randomly drop 50% of the inputs of the two fully-connected layers and double the remaining connection weights, to improve generalization and avoid the need for early stopping (see [21]). As another extension, we note that for our spectrogram frame rate (100 Hz), assigning each annotated onset to a single frame may be inappropriate – some annotations are not accurate enough, and some onsets are not that sharp – so we assign it to three frames instead, weighting the extra frames less in training. We will investigate the effect of our extensions in the experiments.
4. EXPERIMENTAL RESULTS
Starting from the initial experiment of [1], we perform several modifications to both architecture and training, yielding a further improvement over the previous state of the art. We will report on these improvements in detail after describing the data and evaluation method.
4.1. Data
We evaluate our networks on a dataset of about 102 minutes of music annotated with 25,927 onsets, detailed in [22, p. 4] and also used in [5]. It contains monophonic and polyphonic instrumental recordings as well as popular music excerpts. Following [11], we compute three magnitude spectrograms with a hop size of 10 ms and window sizes of 23 ms, 46 ms and 93 ms. We apply an 80-band Mel filter from 27.5 Hz to 16 kHz and scale magnitudes logarithmically. We normalize each frequency band to zero mean and unit variance (constants computed on a hold-out set). The network input for a single decision consists of the frame to classify plus a context of ±70 ms (15 frames in total), from all three spectrograms, which is about the context we found the RNN of [10] to use.
4.2. Evaluation
As in [22, 5], a reported onset is considered correct if it is not farther than 25 ms from an unmatched target annotation; any excess detections and targets are false positives and negatives, respectively. From the precision/recall curve obtained by varying the threshold, we report metrics for the point of optimal F-score only. As in [22, 5], all results are obtained in 8-fold cross-validation.
4.3. Initial Architecture
Our initial architecture from [1] is depicted in Fig. 2: From the 3-channel spectrogram excerpts of 15 frames by 80 bands, a convolutional layer with filters of 7 frames by 3 bands (by 3 channels) computes 10 feature maps of 9 frames by 78 bands. The next layer performs max-pooling over 3 adjacent bands without overlap, reducing the maps to 26 bands. Another convolutional layer of 3×3 filters
Improved musical onset detection with Convolutional Neural Networks, Schlüter and Böck, 2014
Onset Detection
• CNN is powerful!
• CNN is more interpretable
- They learn the difference of short- and long-window spectrograms to find onsets.
Method        | Precision | Recall | F-score
RNN [10, 5]   | 0.892     | 0.855  | 0.873
CNN [1]       | 0.905     | 0.866  | 0.885
+ Dropout     | 0.909     | 0.871  | 0.890
+ Fuzziness   | 0.914     | 0.885  | 0.899
+ ReLU        | 0.917     | 0.889  | 0.903
SuperFlux [5] | 0.883     | 0.793  | 0.836

Table 1: Performance of the state-of-the-art RNN compared to the proposed CNN and a hand-designed method. See Sections 4.3–4.6 for details on rows 2–5.
and another 3-band max-pooling layer result in 20 maps of 7 frames by 8 bands (1120 neurons in total). These are processed by a fully-connected layer of 256 units and a final fully-connected layer of a single output unit predicting onsets. Both convolutional layers use the tanh nonlinearity (with a scalar bias per feature map), and the fully-connected layers use the logistic sigmoid.
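The feature-map sizes quoted for this architecture can be checked with a small shape calculation ('valid' convolution and frequency-only pooling, matching the dimensions in Fig. 2):

```python
def conv_valid(shape, filt):
    """Output size of a 'valid' convolution: n - f + 1 per dimension."""
    return tuple(n - f + 1 for n, f in zip(shape, filt))

def pool_freq(shape, p):
    """Non-overlapping max-pooling over the frequency axis only."""
    return (shape[0], shape[1] // p)

x = (15, 80)                 # 15 frames x 80 mel bands (3 input channels)
x = conv_valid(x, (7, 3))    # -> 10 maps of 9 x 78
x = pool_freq(x, 3)          # -> 9 x 26
x = conv_valid(x, (3, 3))    # -> 20 maps of 7 x 24
x = pool_freq(x, 3)          # -> 7 x 8
print(x, 20 * x[0] * x[1])   # (7, 8) 1120 neurons into the fully-connected layers
```

The final count of 20 × 7 × 8 = 1120 neurons matches the "1120 neurons in total" feeding the fully-connected layers.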
The network is trained in mini-batches of 256 examples, for 100 epochs, using a fixed learning rate of 0.05 and an initial momentum of 0.45, linearly increased to 0.9 between epochs 10 and 20.

It achieves an F-score of 88.5%, about one percent point above the state-of-the-art RNN (Table 1, rows 1–2). (Trained on single-channel spectrograms, both models lose about one percent point.)
4.4. Bagging and Dropout
Bagging is a straightforward way to improve the performance of a classifier without changing its architecture: Training four RNNs and averaging their outputs gives a slight improvement to 87.7% F-score. Similarly, bagging two of our CNNs improves results to 89.1%, but four CNNs perform the same. A single CNN with twice the number of units in each layer overfits and obtains 87.9% only. Jointly training two CNNs connected to the same output unit does not overfit, but is inferior to training them separately. We conclude that the benefit of bagging over simply enlarging the network does not stem from the fact that its constituent parts do not overfit, but that they are forced to solve the task on their own – when training two CNNs jointly, the second will not receive any learning signal when the first produces the correct answer with high confidence, and vice versa.
The same holds for the hidden units within each CNN. An elegant way to ensure that each unit receives a learning signal and is encouraged to solve its task independently of its peers is using dropout [21]: For each training case, half of the units are omitted from the network (cheaply accomplished by masking their output), chosen at random, and remaining weights are doubled to compensate. Applying this to the inputs of the two fully-connected layers and increasing the learning rate to 1.0, multiplied with 0.995 after each epoch, yields 89.0% F-score. Note that dropout does not incur any higher costs at test time, while bagging two CNNs is twice as expensive. Another key advantage is that it prevents overfitting, allowing us to fix training time to 300 epochs and try different setups without the need for early stopping on a validation set.
4.5. Fuzzier Training Examples
Onsets are annotated as time points. For training, we associate each annotation with its closest spectrogram frame and use this frame (along with its ±7 frames of context) as a positive example, and all others as negative examples. Some onsets have a soft attack, though, or are not annotated with 10 ms precision, resulting in actual onsets being presented to the network as negative training examples. To counter this, we would like to train on less sharply defined ground truth. One solution would be to replace the binary targets with sharp Gaussians and turn the classification problem into a regression one, but preliminary experiments on the RNN showed no improvement. Instead, we define a single frame before and after each annotated onset to be additional positive examples. To still teach the network about the most salient onset position, these examples are weighted with only 25% during training. This measure improves F-score to 89.9%, using a higher detection threshold than before. Simply excluding 1 or 2 frames around each onset from training, letting the network freely decide on those, works just slightly worse.
4.6. Rectified Linear Units
Both the hand-designed SuperFlux algorithm [5] and the state-of-the-art RNN [10] build on precomputed positive differences in spectral energy over time. Replacing the tanh activation function in the convolutional layers with the linear rectifier y(x) = max(0, x) provides a direct way for the CNN to learn to compute positive differences in its spectral input, and has been generally shown useful for supervisedly trained networks [23]. In our case, it improves F-score to our final result of 90.3%. Using rectified linear units for the fully-connected hidden layer as well reduces performance to 89.6%.
5. INTROSPECTION
While we have developed a state-of-the-art musical onset detector that is perfectly usable as a black box, we would like to know how it works. In particular, we hope to gain some insights on why it is better than existing hand-crafted algorithms, and possibly learn from its solution to improve these algorithms.
For this purpose, we train a CNN with the second convolutional layer and max-pooling layer removed to make it easier to interpret, and tanh units for the remaining convolutional layer (dropout and fuzziness as before). It achieves 88.8% F-score, which is still far superior to the SuperFlux algorithm (Table 1, last row), making it an interesting model to study. We will visualize both the connections learned by the model and its hidden unit states on test data to understand its computations. To guide us, we will start at the output unit and work our way backwards through the network, concentrating on the parts that contribute most to its classification decisions.
5.1. Output Unit
The output unit computes a weighted sum of the 256 hidden unit states below, then applies the logistic sigmoid function, resulting in a value between 0.0 and 1.0 interpretable as an onset probability. Fig. 3b shows this output over time for two well-chosen test signals: One rich in percussive onsets, the other in transient-free harmonic ones.¹ Except for a false positive in the latter, the network output well matches the ground truth.
To understand how the output is driven by the 256 hidden units, we visualize their states for the two signals, ordered by connection weight to the output unit (Fig. 3c). Interestingly, the most strongly connected units (near the top and bottom border) are hardly active and do not seem to be useful for these examples – they may have specialized to exotic corner cases in the training data. In contrast, a large number of units with small connection weights (near the sign change prominently visible in the figure) clearly reflects the onset locations. Comparing states for the two signals, we see that a number of positively connected units (below the sign change) detect percussive onsets only, while others also detect harmonic ones.

¹ http://ofai.at/~jan.schlueter/pubs/2014_icassp/
5.2. Fully-Connected Hidden Layer
Having identified the most interesting hidden units (the ones near the sign change), we will investigate what they compute. Fig. 3d visualizes the connections of two units to the feature maps in the layer below. The second one displays a sharp wide-band off-on-off connection to the fourth map, and similarly sharp connections to other maps. It is good at detecting percussive onsets, which are short wide-band bursts. The first unit computes more long-term differences, notably in the first and ninth map, and manages to capture harmonic onsets. Other units look very similar to the two types shown, with variations in timing and covered frequency bands.
5.3. Convolutional Layer
To close the remaining gap to the input, we will study the feature maps computed by the convolutional layer. From the previous investigation, maps 4 and 9 seem to play an important role. For the first signal, map 4 highlights the onsets very sharply (Fig. 3e). Looking at the corresponding filter (Fig. 3g), it seems to detect energy bursts of 1 to 3 frames in the mid-sized spectrogram, and compute a temporal difference in the long-window one. Map 9 also computes this temporal difference and contrasts it against a slightly offset difference in the short-window spectrogram (Fig. 3h). While still very fuzzy, this enhances onsets of the second signal (Fig. 3f).
5.4. Insights
Although our inspection was highly selective, covering a small part of the network only, we formed a basic intuition of what it does. Like spectral flux based methods, the network computes spectral differences over time. In doing so, it adapts the context to the spectrogram window length, which was also found to be crucial in [5]. And like [4], the CNN separates the detection of percussive and pitched onsets. As a novel feature, the network computes the difference of short- and long-window spectrograms to find onsets. However, imitating this is not enough to build a good onset detector. In fact, the key factor seems to be that the network combines hundreds of minor variations of the same approach, something that cannot be reproduced with hand-designed algorithms.
6. DISCUSSION
Through a combination of recent neural network training methods, we significantly advanced the state of the art in musical onset detection. Analyzing the learned model, we find that it rediscovered several ideas used in hand-designed methods, but is superior by combining results of many slightly different detectors. This shows that even for easily understandable problems, labelling data and applying machine learning may be more worthwhile than directly engineering a solution. Further improvements may be achieved by training larger networks, by trying other filter shapes, by regularizing the convolutional layers [24], and by including phase information. More insights might be won by recent CNN visualization techniques [25, 26]. Another direction for future research is to combine ideas from CNNs with RNNs, such as local connectivity and pooling, to obtain a state-of-the-art model suitable for low-latency real-time processing.
(a) input spectrograms (mid-sized window length only)
(b) network output (blue line) and ground truth (vertical red bars)
(c) penultimate layer states ordered by connection weight to output, from strongly negative (top) to strongly positive (bottom)

(d) weights of two penultimate layer units: each block shows connections to one of the ten feature maps in the layer below, with time increasing from left to right, frequency from bottom to top, red and blue denoting negative and positive weights, respectively

(e) feature map 4 for the first signal (after pooling and tanh)

(f) feature map 9 for the second signal (after pooling and tanh)

(g) filter kernel for map 4: three 7×3 blocks for the three input spectrograms (mid, short, long)

(h) filter kernel for map 9: three 7×3 blocks for the three input spectrograms (mid, short, long)
Fig. 3: Network weights and states for two test signals (see Sect. 5).
Acknowledgements: This research is supported by the Austrian Science Fund (FWF): TRP 307-N23, and by the European Union Seventh Framework Programme FP7 / 2007-2013 through the PHENICX project (grant agreement no. 601166). The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry for Transport, Innovation, and Technology.
Chord Recognition
• Deep chroma
- Supervised feature learning of chroma
- Input: 15 frames of quarter-tone spectrogram
- MLP: 3 dense layers of 512 rectifier units
- Output: 12-dimensional chroma vector (chord labels are then predicted by a linear classifier)
Feature Learning for Chord Recognition: The Deep Chroma Extractor, Korzeniowski and Widmer, 2016
…spectively, and σ_l is a (usually non-linear) activation function applied point-wise.
We define two additional special layers: an input layer that is feeding values to h_1 as h_0(x) = x, with U_0 being the input's dimensionality; and an output layer h_{L+1} that takes the same form as shown in Eq. 1, but has a specific semantic purpose: it represents the output of the network, and thus its dimensionality U_{L+1} and activation function σ_{L+1} have to be set accordingly.²
The weights and biases constitute the model's parameters. They are trained in a supervised manner by gradient methods and error back-propagation in order to minimise the loss of the network's output. The loss function depends on the domain, but is generally some measure of difference between the current output and the desired output (e.g. mean squared error, categorical cross-entropy, etc.).
In the following, we describe how we compute the input to the DNN, the concrete DNN architecture, and how it was trained.
4.1 Input Processing
We compute the time-frequency representation of the signal based on the magnitude of its STFT X. The STFT gives significantly worse results than the constant-q transform if used as the basis for traditional chroma extractors, but we found in preliminary experiments that our model is not sensitive to this phenomenon. We use a frame size of 8192 with a hop size of 4410 at a sample rate of 44100 Hz. Then, we apply triangular filters to convert the linear frequency scale of the magnitude spectrogram to a logarithmic one in what we call the quarter-tone spectrogram S = F^4_Log · |X|, where F^4_Log is the filter bank. The quarter-tone spectrogram contains only bins corresponding to frequencies between 30 Hz and 5500 Hz and has 24 bins per octave. This results in a dimensionality of 178 bins. Finally, we apply a logarithmic compression such that S_Log = log(1 + S), which we will call the logarithmic quarter-tone spectrogram. To be concise, we will refer to S_Log as "spectrogram" in the rest of this paper.
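The front end just described can be sketched in plain numpy. The triangular-filter edge handling and normalisation below are assumptions of this sketch, not the paper's exact implementation (which reports 178 bins for the 30 Hz – 5500 Hz range):

```python
import numpy as np

def quarter_tone_filterbank(sr=44100, n_fft=8192, fmin=30.0, fmax=5500.0,
                            bins_per_octave=24):
    """Triangular filters mapping linear FFT bins to a quarter-tone scale.
    Edge handling and area normalisation are assumptions of this sketch."""
    n_bins = int(np.floor(bins_per_octave * np.log2(fmax / fmin)))
    # geometrically spaced centre frequencies, plus one edge on each side
    centers = fmin * 2.0 ** (np.arange(-1, n_bins + 1) / bins_per_octave)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_bins, len(fft_freqs)))
    for i in range(n_bins):
        lo, c, hi = centers[i], centers[i + 1], centers[i + 2]
        rise = (fft_freqs - lo) / (c - lo)
        fall = (hi - fft_freqs) / (hi - c)
        fb[i] = np.maximum(0.0, np.minimum(rise, fall))
    # normalise each filter to unit area (guard against empty filters)
    fb /= np.maximum(fb.sum(axis=1, keepdims=True), 1e-12)
    return fb

def log_quarter_tone_spectrogram(magnitude, fb):
    """S_Log = log(1 + F |X|), applied to a magnitude spectrogram."""
    return np.log1p(fb @ magnitude)
```

The `log1p` compression guarantees a non-negative output, since the filtered magnitudes are themselves non-negative.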
Our model uses a context window around a target frame as input. Through systematic experiments on the validation folds (see Sec. 5.1) we found a context window of ±0.7 s to work best. Since we operate at 10 fps, we feed our network 15 consecutive frames at a time, which we will denote as a super-frame.
4.2 Model
We define the model architecture and set the model's hyper-parameters based on validation performance in several preliminary experiments. Although a more systematic approach might reveal better configurations, we found that results do not vary by much once we reach a certain model complexity.
2 For example, for a 3-class classification problem one would use 3 units in the output layer and a softmax activation function such that the network's output can be interpreted as a probability distribution over classes given the data.
Figure 1. Model overview. At each time step, 15 consecutive frames of the input quarter-tone spectrogram S_Log are fed to a series of 3 dense layers of 512 rectifier units, and finally to a sigmoid output layer of 12 units (one per pitch class), which represents the chroma vector for the centre input frame.
Our model is a deep neural network with 3 hidden layers of 512 rectifier units [11] each. Thus, σ_l(x) = max(0, x) for 1 ≤ l ≤ L. The output layer, representing the chroma vector, consists of 12 units (one unit per pitch class) with a sigmoid activation function σ_{L+1}(x) = 1 / (1 + exp(−x)). The input layer represents the input super-frame and thus has a dimensionality of 2670. Fig. 1 shows an overview of our model.
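As a sketch, the forward pass of this architecture in plain numpy (the weights below are randomly initialised placeholders, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# layer sizes from the paper: 15 frames x 178 bins = 2670 inputs,
# three hidden layers of 512 rectifier units, 12 sigmoid outputs
sizes = [2670, 512, 512, 512, 12]
params = [(rng.normal(scale=0.01, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def deep_chroma(x):
    """Forward pass: x is a super-frame of shape (2670,)."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W, b = params[-1]
    return sigmoid(h @ W + b)  # 12-dim chroma vector in (0, 1)
```

The sigmoid output keeps every pitch-class activation strictly between 0 and 1, matching the interpretation as a chroma vector.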
4.3 Training
To train the network, we propagate the gradient of the loss L with respect to the network parameters back through the network. Our loss is the binary cross-entropy between each pitch class in the predicted chroma vector p = h_{L+1}(S_Log) and the target chroma vector t, which is derived from the ground truth chord label. For a single data instance,
L = (1/12) Σ_{i=1}^{12} [ −t_i log(p_i) − (1 − t_i) log(1 − p_i) ].    (2)
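Eq. 2 translates directly to code; a small numpy version (the clipping constant is an implementation assumption of this sketch, added to avoid log(0)):

```python
import numpy as np

def chroma_bce(p, t, eps=1e-7):
    """Mean binary cross-entropy over the 12 pitch classes (Eq. 2).
    p: predicted chroma in (0, 1); t: binary target chroma."""
    p = np.clip(p, eps, 1.0 - eps)  # numerical guard, not in the paper
    return float(np.mean(-t * np.log(p) - (1.0 - t) * np.log(1.0 - p)))
```

A maximally uncertain prediction of 0.5 for every pitch class yields a loss of log 2 per class, a useful sanity check.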
We learn the parameters with mini-batch training (batch size 512) using the ADAM update rule [16]. We also tried simple stochastic gradient descent with Nesterov momentum and a number of manual learning rate schedules, but could not achieve better results (on the contrary, ADAM training usually converged earlier). To prevent over-fitting, we apply dropout [26] with probability 0.5 after each hidden layer, and early stopping if validation accuracy does not increase for 20 epochs.
5. EXPERIMENTS
To evaluate the chroma features our method produces, we set up a simple chord recognition task. We ignore any post-filtering methods and use a simple, linear classifier (logistic regression) to match features to chords. This way we want to isolate the effect of the feature on recognition accuracy. As is common, we restrict ourselves to distinguishing only major/minor chords, resulting in 24 chord classes and a 'no chord' class.
Chord Recognition
• Deep chroma vs. hand-crafted chroma
Figure 6. Average saliency of all input frames of the Beatles dataset (bottom image), summed over the time axis (top plot). We see that most relevant information can be collected in barely 3 octaves between G3 at 196 Hz and E6 at 1319 Hz. Hardly any harmonic information resides below 110 Hz and above 3136 Hz. The plot is spiky at frequency bins that correspond to clean semitones because most of the songs in the dataset seem to be tuned to a reference frequency of 440 Hz. The network thus usually pays little attention to the frequency bins between semitones.
Figure 7. Excerpts of chromagrams extracted from the song "Yesterday" by the Beatles. The lower image shows chroma computed by C^W_Log without smoothing. We see a good temporal resolution, but also noise. The centre image shows the same chromas after a moving average filter of 1.5 seconds. The filter reduced noise considerably, at the cost of blurring chord transitions. The upper plot shows the chromagram extracted by our proposed method. It displays precise pitch activations and low noise, while keeping chord boundaries crisp. Pixel values are scaled such that for each image, the lowest value in the respective chromagram is mapped to white, the highest to black.
seems to suffice as input to chord recognition methods. Using saliency maps and preliminary experiments on validation folds, we also found that a context of 1.5 seconds is adequate for local harmony estimation.
There are plenty of possibilities for future work to extend and/or improve our method. To achieve better results, we could use DNN ensembles instead of a single DNN. We could ensure that the network sees data for which its predictions are wrong more often during training, or similarly, we could simulate a more balanced dataset by showing the net super-frames of rare chords more often. To further assess how useful the extracted features are for chord recognition, we shall investigate how well they interact with post-filtering methods; since the feature extractor is trained discriminatively, Conditional Random Fields [17] would be a natural choice.
Finally, we believe that the proposed method extracts features that are useful in any other MIR applications that use chroma features (e.g. structural segmentation, key estimation, cover song detection). To facilitate respective experiments, we provide source code for our method as part of the madmom audio processing framework [2]. Information and source code to reproduce our experiments can be found at http://www.cp.jku.at/people/korzeniowski/dc.
8. ACKNOWLEDGEMENTS
This work is supported by the European Research Council (ERC) under the EU's Horizon 2020 Framework Programme (ERC Grant Agreement number 670035, project "Con Espressione"). The Tesla K40 used for this research was donated by the NVIDIA Corporation.
9. REFERENCES
[1] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, Aug. 2013.

[2] S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and G. Widmer. madmom: a new Python Audio and Music Signal Processing Library. arXiv preprint arXiv:1605.07008, 2016.

[3] S. Böck, F. Krebs, and G. Widmer. A multi-model approach to beat tracking considering heterogeneous music styles. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan, 2014.

[4] S. Böck, F. Krebs, and G. Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), Malaga, Spain, 2015.

[5] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Audio chord recognition with recurrent neural networks. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR), Curitiba, Brazil, 2013.
Deep chroma
Hand-crafted chroma

Figure 2. Validation WCSR for major/minor chord recognition of different methods given different audio context sizes. Whiskers represent 0.95 confidence intervals.
Our compound evaluation dataset comprises the Beatles [13], Queen and Zweieck [18] datasets (which form the "Isophonics" dataset used in the MIREX³ competition), the RWC pop dataset⁴ [12], and the Robbie Williams dataset [8]. The datasets total 383 songs, or approx. 21 hours and 39 minutes of music.
We perform 8-fold cross validation with random splits. For the Beatles dataset, we ensure that each fold has the same album distribution. For each test fold, we use six of the remaining folds for training and one for validation.
As evaluation measure, we compute the Weighted Chord Symbol Recall (WCSR), often called Weighted Average Overlap Ratio (WAOR), of major and minor chords using the mir_eval library [23].
5.1 Compared Features
We evaluate our extracted features C_D against three baselines: a standard chromagram C computed from a constant-q transform, a chromagram with frequency weighting and logarithmic compression of the underlying constant-q transform C^W_Log, and the quarter-tone spectrogram S_Log. The chromagrams are computed using the librosa library⁵. Their parametrisation closely follows the suggestions in [7], where C^W_Log was found to be the best chroma feature for chord recognition.
Each baseline can take advantage of context information. Instead of computing a running mean or median, we allow logistic regression to consider multiple frames of each feature⁶. This is a more general way to incorporate context, because a running mean is a subset of the context aggregation functions possible in our setup. Since training logistic regression is a convex problem, the result is at least as good as if we used a running mean.
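The context argument above can be made concrete: stacking frames and learning per-frame weights subsumes a running mean, which is the special case where every frame gets a weight of 1/(2k+1). A sketch of the frame stacking (edge padding at the boundaries is an assumption of this sketch):

```python
import numpy as np

def stack_context(feat, k):
    """Stack k frames of context on each side: (T, d) -> (T, (2k+1)*d)."""
    T, d = feat.shape
    padded = np.pad(feat, ((k, k), (0, 0)), mode="edge")
    return np.concatenate([padded[i:i + T] for i in range(2 * k + 1)], axis=1)
```

Multiplying a stacked super-frame by block-tied weights of 1/(2k+1) reproduces exactly the running mean, so a learned linear classifier over stacked frames can do no worse.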
3 http://www.music-ir.org/mirex
4 Chord annotations available at https://github.com/tmc323/Chord-Annotations
5 https://github.com/bmcfee/librosa
6 Note that this description applies only to the baseline methods. For our DNN feature extractor, "context" means the amount of context the DNN sees. The logistic regression always sees only one frame of the feature the DNN computed.
          Btls       Iso        RWC        RW         Total
C         71.0±0.1   69.5±0.1   67.4±0.2   71.1±0.1   69.2±0.1
C^W_Log   76.0±0.1   74.2±0.1   70.3±0.3   74.4±0.2   73.0±0.1
S_Log     78.0±0.2   76.5±0.2   74.4±0.4   77.8±0.4   76.1±0.2
C_D       80.2±0.1   79.3±0.1   77.3±0.1   80.1±0.1   78.8±0.1
Table 1. Cross-validated WCSR on the Maj/min task of compared methods on various datasets. Best results are bold-faced (p < 10^−9). Small numbers indicate standard deviation over 10 experiments. "Btls" stands for the Beatles, "Iso" for Isophonics, and "RW" for the Robbie Williams datasets. Note that the Isophonics dataset comprises the Beatles, Queen and Zweieck datasets.
We determined the optimal amount of context for each baseline experimentally using the validation folds, as shown in Fig. 2. The best results achieved were 79.0% with 1.5 s context for C_D, 76.8% with 1.1 s context for S_Log, 73.3% with 3.1 s context for C^W_Log, and 69.5% with 2.7 s context for C. We fix these context lengths for testing.
6. RESULTS
Table 1 presents the results of our method compared to the baselines on several datasets. The chroma features C and C^W_Log achieve results comparable to those [7] reported on a slightly different compound dataset. Our proposed feature extractor C_D clearly performs best, with p < 10^−9 according to a paired t-test. The results indicate that the chroma vectors extracted by the proposed method are better suited for chord recognition than those computed by the baselines.
To our surprise, the raw quarter-tone spectrogram S_Log performed better than the chroma features. This indicates that computing chroma vectors in the traditional way mixes harmonically relevant features found in the time-frequency representation with irrelevant ones, and the final classifier cannot disentangle them. This raises the question of why chroma features are preferred to spectrograms in the first place. We speculate that the main reason is their much lower dimensionality and thus ease of modelling (e.g. using Gaussian mixtures).
Artificial neural networks often give good results, but it is difficult to understand what they learned, or on which basis they generate their output. In the following, we will try to dissect the proposed model, understand its workings, and see what it pays attention to. To this end, we compute saliency maps using guided back-propagation [25], adapting code freely available⁷ for the Lasagne library [9]. Leaving out the technical details, a saliency map can be interpreted as an attention map of the same size as the input. The higher the absolute saliency at a specific input dimension, the stronger its influence on the output, where positive values indicate a direct relationship, negative values an indirect one.
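The paper computes saliency via guided back-propagation; as a generic stand-in, the same quantity (sensitivity of a scalar output to each input dimension) can be approximated by central finite differences for any scalar-output model `f`. This sketch is not the guided-backprop variant, only the plain-gradient interpretation:

```python
import numpy as np

def saliency_map(f, x, eps=1e-5):
    """Numeric stand-in for a back-propagated saliency map: the gradient of a
    scalar model output f with respect to each input dimension of x."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        d = np.zeros_like(x, dtype=float)
        d.flat[i] = eps
        # central difference approximates df/dx_i
        g.flat[i] = (f(x + d) - f(x - d)) / (2.0 * eps)
    return g
```

For a quadratic toy model the central difference recovers the exact gradient, which makes this easy to sanity-check.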
Fig. 3 shows a saliency map and its corresponding super-frame, representing a C major chord. As expected,
7 https://github.com/Lasagne/Recipes/
MadMom
• Python-based audio library: http://madmom.readthedocs.io/
- Contains pre-trained neural networks for onset detection, beat tracking and chord recognition
Polyphonic Note Transcription
• Bi-directional RNN
- Input: two log-spectrograms with different window sizes (short window: onset-sensitive, long window: pitch-sensitive)
• Use a regression output layer trained with mean-squared error
- Use thresholding to detect onsets and pitches
POLYPHONIC PIANO NOTE TRANSCRIPTION WITH RECURRENT NEURAL NETWORKS
Sebastian Böck, Markus Schedl
Department of Computational Perception, Johannes Kepler University, Linz, [email protected]
ABSTRACT
In this paper a new approach for polyphonic piano note onset transcription is presented. It is based on a recurrent neural network to simultaneously detect the onsets and the pitches of the notes from spectral features. Long Short-Term Memory units are used in a bidirectional neural network to model the context of the notes. The use of a single regression output layer instead of the often used one-versus-all classification approach enables the system to significantly lower the number of erroneous note detections. Evaluation is based on common test sets and shows exceptional temporal precision combined with a significant boost in note transcription performance compared to current state-of-the-art approaches. The system is trained jointly with various synthesized piano instruments and real piano recordings and thus generalizes much better than existing systems.
Index Terms— music information retrieval, neural networks
1. INTRODUCTION
Music transcription is the process of converting an audio recording into a musical score or a similar representation. In this paper we concentrate on the transcription of piano notes, especially on the two most important aspects of notes, their pitch and onset times. To detect them as accurately as possible is crucial for a proper transcription of the musical piece. We leave out higher level tasks like determining the length of a note (given either in seconds or in a musical notation like quarter note). Also we do not consider the velocity or intensity. The output of the system is a simplified piano-roll notation of the audio signal.
Traditional music transcription systems are based on a wide range of different technologies, but all have to deal with the subtasks of estimating the fundamental frequencies and the onset locations of the notes. A very basic approach formulated by Dixon [1] solely relies on the spectral peaks of the signal to detect notes; local maxima represent the onsets and the drop of energy below a minimum threshold marks the offset of the note. Bello et al. [2] additionally incorporate time-domain features to predict multiple sounding pitches, assuming that the signal can be constructed as a linear sum of individual waveforms based on a database of piano notes. Raphael [3] proposes a probability-based system which uses a hidden Markov model (HMM) to find chord sequences. The states are represented by frames with labels based on the sounding pitches. Ryynänen and Klapuri [4] also use HMMs to model note events based on multiple fundamental frequency features. Transitions between notes are controlled via musical knowledge.
Most of today's top performing piano transcription systems rely on machine learning approaches. Marolt [5] describes an elaborate approach based on different neural networks to recognize tones in an audio recording, combined with adaptive oscillators to track partials. Poliner and Ellis [6] use multiple support vector machine (SVM) classifiers trained on spectral features to detect the sounding fundamental frequencies of a frame. Post-processing with an HMM is applied to temporally smooth the output. Boogaart and Lienhart [7] use a cascade of boosted classifiers to predict the onsets and the corresponding pitches of each note. All these systems use multiple classifiers and thus cannot reliably distinguish whether a sounding pitch is the fundamental frequency of a note or a partial of another one. This results in many false note detections. In contrast, our system uses a single regression model and is thus able to distinguish between these states and hence lowers the number of false detections significantly.
2. SYSTEM DESCRIPTION
Figure 1 shows the proposed piano transcription system. It takes a discretely sampled audio signal as its input. The signal is transferred to the frequency domain via two parallel Short-Time Fourier Transforms (STFT) with different window lengths. The logarithmic magnitude spectrogram of each STFT is then filtered to obtain a compressed representation with the frequency bins corresponding to the tone scale of a piano with a semitone resolution. This representation is used as input to a bidirectional Long Short-Term Memory (BLSTM) recurrent neural network. The output of the network is a piano-roll like representation of the note onsets for each MIDI note.
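The dual-spectrogram front end can be sketched in plain numpy. Mapping the 46.4 ms and 185.5 ms windows to 2048 and 8192 samples at 44.1 kHz, the hop size, and the Hann window are assumptions of this sketch, not the paper's exact settings:

```python
import numpy as np

def log_stft(signal, win, hop):
    """Log-magnitude STFT frames; a plain-numpy stand-in for the front end."""
    n = 1 + max(0, (len(signal) - win) // hop)
    frames = np.stack([signal[i * hop : i * hop + win] for i in range(n)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
    return np.log1p(spec)

def dual_spectrograms(signal, hop=441):
    # 46.4 ms and 185.5 ms windows are ~2048 and ~8192 samples at 44.1 kHz
    return log_stft(signal, 2048, hop), log_stft(signal, 8192, hop)
```

The short-window spectrogram yields more (coarser-frequency) frames and sharper time resolution for onsets; the long window trades frames for the frequency resolution needed to resolve pitch.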
Fig. 1: Proposed piano transcription system overview. [Block diagram: audio → two spectrograms (46.4 ms and 185.5 ms windows) → semitone filterbanks → BLSTM network → note onset & pitch detection → notes.]
978-1-4673-0046-9/12/$26.00 ©2012 IEEE, ICASSP 2012
Polyphonic piano note transcription with recurrent neural networks, Böck and Schedl, 2012
Polyphonic Note Transcription
• Frame-wise classification
- Compare DNN, ConvNet (Conv+FC) and AllConv (Conv only)
- Input: logarithmically filtered spectrogram, with a 5-frame context for the Conv models
On the Potential of Simple Framewise Approaches to Piano Transcription, Kelz et al., 2016
Optimizer: Plain SGD, Momentum, Nesterov Momentum, Adam
Learning rate: 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 50.0, 100.0
Momentum: Off, 0.7, 0.8, 0.9
Learning rate scheduler: On, Off
Batch normalization: On, Off
Dropout: Off, 0.1, 0.3, 0.5
L1 penalty: Off, 1e-07, 1e-08, 1e-09
L2 penalty: Off, 1e-07, 1e-08, 1e-09

Table 3: The list of additional hyper-parameters varied, and their ranges.
Figure 3: Mean predicted performance for the shallow net model class, dependent on learning rate (on a logarithmic scale). The dark line shows the mean performance, and the gray area shows the standard deviation.
5. STATE OF THE ART MODELS
Having completed the analysis of input representation, more powerful model classes were tried: a deep neural network consisting entirely of dense layers (DNN), a mixed network with convolutional layers directly after the input followed by dense layers (ConvNet), and an all-convolutional network (AllConv [29]). Their architectures are described in detail in Table 4. To the best of our knowledge, this is the first time an all-convolutional net has been proposed for the task of framewise piano transcription.
We computed a logarithmically filtered spectrogram with logarithmic magnitude from audio with a sample rate of 44.1 kHz, a filterbank with 48 bins per octave, normed area filters, no circular shift and no zero padding. The choices for circular shift and zero padding ranged very low on the importance scale, so we simply left them switched off. This resulted in only 229 bins, which are logarithmically spaced in the higher frequency regions, and almost linearly spaced in the lower frequency regions as mentioned in Section 2.1. The dense network was presented one frame at a time, whereas the convolutional network was given a context in time of two frames to either side of the current frame, summing to 5 frames in total.
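The spacing behaviour follows from snapping log-spaced target frequencies to FFT bins and merging duplicates: at low frequencies the 48-per-octave targets fall closer together than one FFT bin, so they collapse onto linearly spaced bins. A sketch with hypothetical parameters (the FFT size and frequency range are assumptions; the paper reports 229 bins for its exact settings):

```python
import numpy as np

def log_filtered_bin_freqs(sr=44100, n_fft=4096, fmin=30.0, fmax=8000.0,
                           bins_per_octave=48):
    """Centre frequencies after snapping log-spaced targets to the nearest
    FFT bin and merging duplicates (parameters are illustrative only)."""
    n = int(np.floor(bins_per_octave * np.log2(fmax / fmin))) + 1
    targets = fmin * 2.0 ** (np.arange(n) / bins_per_octave)
    fft_freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # nearest FFT bin per target, deduplicated (np.unique also sorts)
    idx = np.unique(np.argmin(np.abs(fft_freqs[None, :] - targets[:, None]),
                              axis=1))
    return fft_freqs[idx]
```

The resulting bin spacing is roughly one FFT bin wide at the low end (linear) and grows geometrically at the high end (logarithmic), exactly the behaviour the paper describes.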
All further hyper-parameter tuning and architectural choices have been left to a human expert. Models within a model class were selected based on average F-measure across the four validation sets. An automatic search via a hyper-parameter search algorithm for these larger model
DNN             ConvNet         AllConv
Input 229       Input 5x229     Input 5x229
Dropout 0.1     Conv 32x3x3     Conv 32x3x3
Dense 512       Conv 32x3x3     Conv 32x3x3
BatchNorm       BatchNorm       BatchNorm
Dropout 0.25    MaxPool 1x2     MaxPool 1x2
Dense 512       Dropout 0.25    Dropout 0.25
BatchNorm       Conv 64x3x3     Conv 32x1x3
Dropout 0.25    MaxPool 1x2     BatchNorm
Dense 512       Dropout 0.25    Conv 32x1x3
BatchNorm       Dense 512       BatchNorm
Dropout 0.25    Dropout 0.5     MaxPool 1x2
Dense 88        Dense 88        Dropout 0.25
                                Conv 64x1x25
                                BatchNorm
                                Conv 128x1x25
                                BatchNorm
                                Dropout 0.5
                                Conv 88x1x1
                                BatchNorm
                                AvgPool 1x6
                                Sigmoid

# Params: 691288    1877880     284544

Table 4: Model architectures.
classes, as described in [4, 11, 28], is left for future work (the training time for a convolutional model is roughly 8–9 hours on a Tesla K40 GPU, which leaves us with 204·3·4·8 hours (variants × #models × #folds × hours per model), or on the order of 800–900 days of compute time to determine the best input representation exactly).
For these powerful models, we followed practical recommendations for training neural networks via gradient descent found in [1]. Particularly relevant is the way of setting the initial learning rate. Strategies that dynamically adapt the learning rate, such as Adam or Nesterov momentum [19, 22], help to a certain extent, but still do not spare us from tuning the initial learning rate and its schedule.
We observed that using a combination of batch normalization and dropout together with very simple optimization strategies leads to low validation error fairly quickly, in terms of the number of epochs trained. The strategy that worked best for determining the learning rate and its schedule was trying learning rates on a logarithmic scale, starting at 10.0, until the optimization did not diverge anymore [1], then training until the validation error flattened out for a few epochs, then multiplying the learning rate with a factor from the set {0.1, 0.25, 0.5, 0.75}. The rates and schedules we finally settled on were:
• DNN: SGD with momentum, α = 0.1, µ = 0.9, and halving of α every 10 epochs
• ConvNet: SGD with momentum, α = 0.1, µ = 0.9, and halving of α every 5 epochs
• AllConv: SGD with momentum, α = 1.0, µ = 0.9, and halving of α every 10 epochs
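The three schedules above share the same step form, a halving of the learning rate α at a fixed epoch interval:

```python
def lr_schedule(initial_lr, epoch, halve_every):
    """Step schedule: halve the learning rate every `halve_every` epochs."""
    return initial_lr * 0.5 ** (epoch // halve_every)
```

For the DNN setting (α = 0.1, halving every 10 epochs), the rate is 0.1 for epochs 0–9, 0.05 for epochs 10–19, and so on.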
The results for framewise prediction on the MAPS dataset can be found in Table 5. It should be noted that we compare straightforward, simple, and largely un-smoothed systems (ours) with hybrid systems [26]. There is a small degree of temporal smoothing happening when processing spectrograms with convolutional nets. The term simple is supposed to mean that the resulting models have a small number of parameters and are composed of a few low-complexity building blocks. All systems are evaluated on the same train-test splits, referred to as configuration I in [26], as well as on realistic train-test splits that were constructed in the same fashion as configuration II in [26].
Model Class          P      R      F1
Hybrid DNN [26]      65.66  70.34  67.92
Hybrid RNN [26]      67.89  70.66  69.25
Hybrid ConvNet [26]  72.45  76.56  74.45
DNN                  76.63  70.12  73.11
ConvNet              80.19  78.66  79.33
AllConv              80.75  75.77  78.07

Table 5: Results on the MAPS dataset. Test set performance was averaged across 4 folds as defined in configuration I in [26].
Model Class   P      R      F1
DNN [26]      -      -      59.91
RNN [26]      -      -      57.67
ConvNet [26]  -      -      64.14
DNN           75.51  57.30  65.15
ConvNet       74.50  67.10  70.60
AllConv       76.53  63.46  69.38

Table 6: Results on the MAPS dataset. Test set performance was averaged across 4 folds as defined in configuration II in [26].
6. CONCLUSION
We argue that the results demonstrate: the importance of a proper choice of input representation; the importance of hyper-parameter tuning, especially of the learning rate and its schedule; that convolutional networks have a distinct advantage over their deep and dense siblings because of their context window; and that all-convolutional networks perform nearly as well as mixed networks, although they have far fewer parameters. We propose these straightforward, framewise transcription networks as a new state-of-the-art baseline for framewise piano transcription on the MAPS dataset.
7. ACKNOWLEDGEMENTS
This work is supported by the European Research Council (ERC Grant Agreement 670035, project CON ESPRESSIONE), the Austrian Ministries BMVIT and BMWFW, the Province of Upper Austria (via the COMET Center SCCH) and the European Union Seventh Framework Programme FP7 / 2007-2013 through the GiantSteps project (grant agreement no. 610591). We would like to thank all developers of Theano [3] and Lasagne [9] for providing comprehensive and easy to use deep learning frameworks. The Tesla K40 used for this research was donated by the NVIDIA Corporation.
8. REFERENCES
[1] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, pages 437–478. Springer, 2012.

[2] Taylor Berg-Kirkpatrick, Jacob Andreas, and Dan Klein. Unsupervised Transcription of Piano Music. In Advances in Neural Information Processing Systems, pages 1538–1546, 2014.

[3] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU Math Expression Compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.

[4] James Bergstra, Dan Yamins, and David D. Cox. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, pages 13–20, 2013.

[5] Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. madmom: a new Python Audio and Music Signal Processing Library. arXiv preprint arXiv:1605.07008, 2016.

[6] Sebastian Böck and Markus Schedl. Polyphonic piano note transcription with recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 121–124. IEEE, 2012.

[7] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392, 2012.

[8] Judith C. Brown. Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America, 89(1):425–434, 1991.

[9] Sander Dieleman, Jan Schlüter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, Daniel Nouri, Daniel Maturana, Martin Thoma, Eric Battenberg, Jack Kelly, Jeffrey De Fauw, Michael Heilman, diogo149, Brian McFee, Hendrik Weideman, takacsg84, peterderivaz, Jon, instagibbs, Dr. Kashif Rasul, CongLiu, Britefury, and Jonas Degrave. Lasagne: First release., August 2015.

[10] Sander Dieleman and Benjamin Schrauwen. End-to-end learning for music audio. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6964–6968. IEEE, 2014.
Polyphonic Note Transcription
• Convolutional recurrent neural networks (CRNN)
- Onset detection networks and frame-level pitch detection networks are combined with a causal connection
repeated notes that should be held. Note onsets are important, but a piece played with only onset information would either have to be entirely staccato or use some kind of heuristic to determine when to release notes. A high note-with-offset score corresponds to a transcription that sounds good because it captures the perceptual information from both onsets and durations. More perceptually accurate metrics may be possible and warrant further research. In this work we focus on improving the note-with-offset score, but we also achieve state-of-the-art results for the more common frame and note scores.
3. MODEL CONFIGURATION
Framewise piano transcription tasks typically process frames of raw audio and produce frames of note activations. Previous framewise prediction models [3, 4] have treated frames as both independent and of equal importance, at least prior to being processed by a separate language model. We propose that some frames are more important than others, specifically the onset frame for any given note. Piano note energy decays starting immediately after the onset, so the onset is both the easiest frame to identify and the most perceptually significant.
We take advantage of the significance of onset frames by training a dedicated note onset detector and using the raw output of that detector as additional input for the framewise note activation detector. We also use the thresholded output of the onset detector during the inference process. An activation from the frame detector is only allowed to start a note if the onset detector agrees that an onset is present in that frame.
Our onset and frame detectors are built upon the convolutional acoustic model architecture presented in [4], with some modifications. We use librosa [12] to compute the same input representation: mel-scaled spectrograms with log amplitude of the raw input audio, with 229 logarithmically-spaced frequency bins, a hop length of 512, an FFT window of 2048, and a sample rate of 16 kHz. However, instead of presenting the network with one target frame at a time, we present the entire sequence at once. The advantage of this approach is that we can then use the output of the convolution layers as input to an RNN layer.
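The framing step of this input pipeline can be sketched in pure NumPy. This is a simplified stand-in for the paper's librosa-based pipeline: it computes framewise log-magnitude FFT spectra with the stated window and hop sizes, but omits the mel filterbank that maps the 1025 FFT bins down to 229 mel bands (the function name is ours):

```python
import numpy as np

def log_stft(y, n_fft=2048, hop=512):
    # Windowed framewise FFT with log amplitude. The paper applies
    # a 229-band mel filterbank on top of these magnitudes; that
    # step is omitted here for brevity.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mag + 1e-6)  # avoid log(0)
```

At 16 kHz, a 512-sample hop gives 32 ms frames, which matters later for the onset labels.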
The onset detector is composed of the acoustic model, followed by a bidirectional LSTM [13] with 128 units in both the forward and backward directions, followed by a fully connected sigmoid layer with 88 outputs representing the probability of an onset for each of the 88 piano keys.
The frame activation detector is composed of a separate acoustic model, followed by a fully connected sigmoid layer with 88 outputs. Its output is concatenated with the output of the onset detector and followed by a bidirectional LSTM with 128 units in both the forward and backward directions. Finally, the output of that LSTM is followed by a fully connected sigmoid layer with 88 outputs. During inference, we use a threshold of 0.5 to determine whether the onset detector or frame detector is active.
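The onset-gated inference rule can be sketched as follows; `gated_notes` is a hypothetical name and this is our simplified reading of the rule, not the authors' implementation: a frame activation may only start a note if the onset detector also fires, but once started the note is sustained for as long as the frame detector stays active.

```python
import numpy as np

def gated_notes(frame_probs, onset_probs, threshold=0.5):
    # frame_probs, onset_probs: (n_frames, n_pitches) sigmoid outputs.
    frames = frame_probs >= threshold
    onsets = onset_probs >= threshold
    active = np.zeros_like(frames)
    for p in range(frames.shape[1]):
        on = False
        for t in range(frames.shape[0]):
            if not frames[t, p]:
                on = False          # frame detector released the note
            elif onsets[t, p]:
                on = True           # onset detector agrees: start note
            active[t, p] = on
    return active
```

Note that a frame activation with no preceding onset agreement is suppressed entirely, which is what removes spurious sustained-energy detections.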
Training RNNs over long sequences can require large amounts of memory and is generally faster with larger batch sizes. To expedite training, we split the training audio into smaller files. However, when we do this splitting we do not want to cut the audio during notes, because the onset detector would miss an onset while the frame detector would still need to predict the note's presence. We found that 20-second splits allowed us to achieve a reasonable batch size during training of at least 8, while also forcing splits in only a small number of places where notes are active. When notes are active and we must split, we choose a zero-crossing of the audio signal. Inference is performed on the original, un-split audio file.
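The zero-crossing snapping can be sketched as below. This is an assumption-laden simplification: the paper only snaps to zero-crossings where notes are active (which requires the label data, omitted here), whereas this sketch snaps every split point to the nearest zero-crossing within one second:

```python
import numpy as np

def split_points(y, sr=16000, segment_seconds=20):
    # Propose a split at every segment boundary, snapped to the
    # nearest zero-crossing within +/- 1 second of the boundary.
    step = segment_seconds * sr
    points = []
    for target in range(step, len(y), step):
        start = max(0, target - sr)
        window = y[start: target + sr]
        # indices where the sign of consecutive samples flips
        zc = np.where(np.diff(np.signbit(window).astype(int)) != 0)[0]
        if len(zc) == 0:
            points.append(target)       # no zero-crossing nearby
        else:
            center = target - start
            points.append(start + int(zc[np.argmin(np.abs(zc - center))]))
    return points
```

Splitting at zero-crossings avoids introducing click artifacts at the segment boundaries.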
[Fig. 1. Diagram of Network Architecture: a log mel-spectrogram feeds two conv stacks; the onset branch (conv stack → BiLSTM → FC sigmoid) produces onset predictions with an onset loss, and its output is concatenated with the frame branch's FC sigmoid output, followed by a BiLSTM and a final FC sigmoid layer producing frame predictions with a frame loss.]
Our ground truth note labels are in continuous time, but the results from audio processing are in spectrogram frames, so we quantize our labels to calculate our training loss. When quantizing, we use the same frame size as the output of the spectrogram. However, when calculating metrics, we compare our inference results against the original, continuous-time labels.
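The quantization step amounts to mapping note times onto frame indices at the spectrogram's frame rate (sr / hop = 16000 / 512 = 31.25 frames per second, i.e. 32 ms frames); a minimal sketch, with a function name of our own choosing:

```python
def quantize_label(onset_sec, offset_sec, sr=16000, hop=512):
    # Map a continuous-time note interval to spectrogram frame
    # indices using the same hop size as the input representation.
    frames_per_sec = sr / hop               # 31.25, i.e. 32 ms frames
    start = int(round(onset_sec * frames_per_sec))
    end = int(round(offset_sec * frames_per_sec))
    return start, max(end, start + 1)       # a note spans >= 1 frame
```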
Our loss function is the sum of two cross-entropy losses: one from the onset side and one from the note side.

    L_total = L_onset + L_frame    (1)

    L_onset(l, p) = Σ_i [ −l(i) log p(i) − (1 − l(i)) log(1 − p(i)) ]    (2)
where l denotes the onset labels and p the onset predictions. The labels for the onset loss are created by truncating note lengths to min(note_length, onset_length) prior to quantization. We performed a coarse hyperparameter search over onset_length and found that 32 ms worked best. In hindsight this is not surprising, as it is also the length of our frames, so almost all onsets will end up spanning exactly two frames. Labeling only the frame that contains the exact beginning of the onset doesn't work as well because of possible misalignments of the audio and labels. We experimented with requiring a minimum amount of time a note had to be present in a frame before it was labeled, but found that the optimum was to include any presence.
Even within the frame-based loss term, we apply a weighting to encourage accuracy at the start of the note. A note starts at frame t1, completes its onset at t2, and ends at frame t3. Because the weight vector assigns higher weights to the early frames of notes, the model is encouraged to predict those frames accurately.
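One plausible form of such a weight vector is sketched below. The excerpt does not give the exact weights, so this is purely illustrative: `onset_weight` is an assumed hyperparameter, and the shape of the weighting (a flat raised weight over the onset region) is our guess, not the paper's definition:

```python
import numpy as np

def frame_weights(t1, t2, t3, onset_weight=5.0):
    # Hypothetical per-frame loss weights for one note: frames from
    # the note start t1 up to the end of its onset t2 get a raised
    # weight (onset_weight is an assumed value, not from the paper);
    # the remaining frames up to the note end t3 get weight 1.
    w = np.ones(t3 - t1)
    w[: t2 - t1] = onset_weight
    return w
```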
Onsets and Frames: Dual-Objective Piano Transcription, Hawthorne et al., 2018

[Figure legend: blue = frame prediction, red = onset prediction, magenta = both.]
[Figure legend: yellow = true positive, red = false negative, green = false positive.]
Polyphonic Note Transcription
• The “Onsets and Frames” model outperforms the previous state of the art
                          Frame                  Note                   Note with offset
                          P      R      F1       P      R      F1       P      R      F1
Sigtia [3] (our reimpl.)  71.99  73.32  72.22    44.97  49.55  46.58    17.64  19.71  18.38
Kelz [4] (our reimpl.)    81.18  65.07  71.60    44.27  61.29  50.94    20.13  27.80  23.14
Melodyne (decay mode)     71.85  50.39  58.57    62.08  48.53  54.02    21.09  16.56  18.40
Onsets and Frames         88.53  70.89  78.30    84.24  80.67  82.29    51.32  49.31  50.22

Table 1. Results on the MAPS configuration 2 test dataset (ENSTDkCl and ENSTDkAm full-length .wav files). Note-based scores are calculated by the mir_eval library, frame-based scores as defined in [11]. The final metric is the mean of scores calculated per piece. MIDI files used to calculate these scores are available at https://goo.gl/U3YoJz.
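The note-based scores come from mir_eval; a greatly simplified sketch of the idea behind them is below. This is not mir_eval's algorithm (which also checks pitch agreement and uses optimal matching, and whose "note with offset" variant additionally requires offset agreement); it only illustrates onset matching within a ±50 ms tolerance, with each note matched at most once:

```python
import numpy as np

def note_f1(ref_onsets, est_onsets, tolerance=0.05):
    # Greedy one-to-one matching of estimated to reference onsets
    # within +/- tolerance seconds, then the usual P/R/F1.
    ref = list(ref_onsets)
    tp = 0
    for est in est_onsets:
        dists = [abs(est - r) for r in ref]
        if dists and min(dists) <= tolerance:
            tp += 1
            ref.pop(int(np.argmin(dists)))   # consume matched note
    precision = tp / len(est_onsets) if est_onsets else 0.0
    recall = tp / len(ref_onsets) if ref_onsets else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```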
Removing any single component has only a small impact (< 6%) on the results. It is encouraging that forward-only RNNs cause only a small accuracy drop, as they can be used for online piano transcription.
We tried many other architectures and data augmentation strategies not listed in the table, none of which resulted in any improvement. Significantly, augmenting the training audio by adding normalization, reverb, compression, and noise, and synthesizing the training MIDI files with other synthesizers, made no difference. We believe this indicates a need for a much larger training dataset of real piano recordings that have fully accurate label alignments. These requirements are not satisfied by the current MAPS dataset, because only 60 of its 270 recordings are from real pianos, and they are also not satisfied by MusicNet [16], because its alignments are not fully accurate. Other approaches, such as seq2seq [17], may not require fully accurate alignments.
                                       F1
                            Frame   Note    Note with offset
Onsets and Frames           78.30   82.29   50.22
(a) Frame-only LSTM         76.12   62.71   27.89
(b) No Onset Inference      78.37   67.44   34.15
(c) Onset forward LSTM      75.98   80.77   46.36
(d) Frame forward LSTM      76.30   82.27   49.50
(e) No Onset LSTM           75.90   80.99   46.14
(f) Pretrain Onsets         75.56   81.95   48.02
(g) No Weighted Loss        75.54   80.07   48.55
(h) Shared conv             76.85   81.64   43.61
(i) Disconnected Detectors  73.91   82.67   44.83
(j) CQT Input               73.07   76.38   41.14
(k) No LSTM, shared conv    67.60   75.34   37.03

Table 2. Ablation study results (F1 scores).
6. NEED FOR MORE DATA, MORE RIGOROUS EVALUATION
The most common dataset for evaluation of piano transcription tasks is the MAPS dataset, in particular the ENSTDkCl and ENSTDkAm renderings of the MUS collection of pieces. This set has several desirable properties: the pieces are real music as opposed to randomly generated sequences, the pieces are played on a real physical piano as opposed to a synthesizer, and multiple recording environments are available (“close” and “ambient” configurations). The main drawback of this dataset is that it comprises only 60 .wav files.
Many papers, for example [8, 3, 18, 19], further restrict the data used in evaluation by using only the “close” collection and/or only the first 30 seconds or less of each file. We believe this results in an evaluation that is not representative of real-world transcription tasks. Table 3 shows how the score of our model increases dramatically as we increasingly restrict the dataset.
                                  Note
                                  P      R      F1
Cl and Am, full length            84.00  80.25  81.96
Cl only, full length              85.95  83.05  84.34
Cl only, first 30 s               87.13  85.96  86.38
Wang [8], Cl only, first 30 s     85.93  75.24  80.23
Gao [18], Cl only, first 30 s*    83.38  87.34  85.06

Table 3. Model results on various dataset configurations. *Results from Gao cannot be directly compared to the other results in this table because their model was trained on data from the test piano.
In addition to the small number of MAPS Disklavier recordings, we have also noticed several cases where the Disklavier appears to skip some notes played at low velocity. For example, at the beginning of the Beethoven Sonata No. 9, 2nd movement, several A♭ notes played with MIDI velocities in the mid-20s are clearly missing from the audio (https://goo.gl/U3YoJz). More analysis is needed to determine how frequently missed notes occur, but we have noticed that our model performs particularly poorly on notes with velocities below 30.
To best measure transcription quality, we believe a new and much larger dataset is needed. However, until that exists, evaluations should make full use of the data that is currently available.
7. CONCLUSION AND FUTURE WORK
We demonstrate a jointly trained onsets-and-frames model for transcribing polyphonic piano music and show that using onset information during inference yields significant improvements. The model transfers well between the disparate train and test distributions.
The current quality of the model's output is on the cusp of enabling downstream applications such as MIR and automatic music generation. To further improve the results, we need to create a new dataset that is much larger and more representative of various piano recording environments and music genres, for both training and evaluation. Combining an improved acoustic model with a language model is a natural next step. Another direction is to go beyond traditional spectrogram representations of audio signals; dilated convolutions [20] could enable sub-frame timing predictions.
It is very much worth listening to the transcription examples. Consider the Mozart Sonata K331, 3rd movement: our system does a good job of capturing harmony, melody, and even rhythm, and compared to the other systems the difference is quite audible. Audio examples are available at https://goo.gl/U3YoJz.
Polyphonic Note Transcription
• Demo: Onsets and Frames
- https://magenta.tensorflow.org/onsets-frames