Transcript of 13-recurrent_neural_network.pdf (KAIST, mac.kaist.ac.kr/.../slides/13-recurrent_neural_network.pdf)
GCT634: Musical Applications of Machine Learning
Recurrent Neural Network
Deep Learning for AMT
Juhan Nam, Graduate School of Culture Technology, KAIST
Outline
• Recurrent Neural Networks (RNN)
- Introduction
- Mechanics
• Deep Learning for Automatic Music Transcription
- Onset Detection
- Chord Recognition
- Polyphonic Note Transcription
Piano Note Transcription Using MLP
• Frame-level approach
- Every frame is assumed to be independent → this is not so intuitive
[Figure: audio spectrogram frames x(t) (input) mapped frame by frame to a MIDI piano roll (output)]
Recurrent Neural Networks (RNN)
• Add explicit connections between previous states and current states of hidden layers
- The hidden layers are “state vectors” with regard to time

[Figure: network diagram with recurrent connections in the hidden layers]
Recurrent Neural Networks (RNN)
• Add explicit connections between previous states and current states of hidden layers
- The hidden layers are “state vectors” with regard to time
h_1(t) = f(W_h1 h_1(t−1) + W_x1 x(t) + b_1)
h_2(t) = f(W_h2 h_2(t−1) + W_x2 h_1(t) + b_2)
h_3(t) = f(W_h3 h_3(t−1) + W_x3 h_2(t) + b_3)
ŷ(t) = g(W_x4 h_3(t) + b_4)

recurrent connections: t = 0, 1, 2, …
Recurrent Neural Networks (RNN)
• This simple structure is often called the “Vanilla RNN”
- tanh(·) is a common choice for the nonlinearity f
h_1(t) = f(W_h1 h_1(t−1) + W_x1 x(t) + b_1)
h_2(t) = f(W_h2 h_2(t−1) + W_x2 h_1(t) + b_2)
h_3(t) = f(W_h3 h_3(t−1) + W_x3 h_2(t) + b_3)
ŷ(t) = g(W_x4 h_3(t) + b_4)

recurrent connections: t = 0, 1, 2, …
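As a concrete sketch of the recurrence above, here is a minimal single-layer vanilla RNN forward pass in NumPy (a one-layer version of the equations; the variable names and small random dimensions are illustrative, not from the slides):

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, b, h0=None):
    """Vanilla RNN: h(t) = tanh(W_h h(t-1) + W_x x(t) + b).

    x_seq: array of shape (T, input_dim); returns hidden states (T, hidden_dim).
    """
    hidden_dim = W_h.shape[0]
    h = np.zeros(hidden_dim) if h0 is None else h0
    states = []
    for x_t in x_seq:                      # unroll over time steps t = 0, 1, 2, ...
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, D, H = 5, 3, 4                          # 5 time steps, 3 inputs, 4 hidden units
hs = rnn_forward(rng.normal(size=(T, D)),
                 rng.normal(size=(H, H)) * 0.1,   # recurrent weights W_h
                 rng.normal(size=(H, D)) * 0.1,   # input weights W_x
                 np.zeros(H))
print(hs.shape)  # (5, 4)
```

Note that the same (W_h, W_x, b) are reused at every step, which is exactly the parameter sharing shown in the unrolled diagrams that follow.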
Training RNN
• Forward pass
- The hidden layers keep updating the states over the time steps
- The parameters (W_h, W_x) are fixed and shared over the time steps
[Figure: RNN unrolled over time steps t = 0, 1, 2, …, with the same weights W_x and W_h applied at every step]
Unrolled RNN
Training RNN
• Backward pass
- Backpropagation through time (BPTT)
[Figure: unrolled RNN with the output errors E(t) at each time step propagated backward through the shared weights (BPTT)]
The Problem of Vanilla RNN
• As the number of time steps increases, the gradients during BPTT can become unstable
- Exploding or vanishing gradients
• Exploding gradients can be controlled by gradient clipping, but vanishing gradients require a different architecture
• The vanilla RNN is usually applied to short sequences
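Gradient clipping, mentioned above as the usual remedy for exploding gradients, can be sketched as global-norm rescaling (the max_norm value of 5.0 is an illustrative choice, not from the slides):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their global L2 norm
    does not exceed max_norm; small gradients pass through unchanged."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# A deliberately exploding gradient: global norm sqrt(10 * 100) ≈ 31.6
grads = [np.full((2, 2), 10.0), np.full(2, 10.0)]
clipped = clip_gradients(grads, max_norm=5.0)
norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(round(norm, 4))  # 5.0
```

This only bounds the update magnitude; it does nothing for vanishing gradients, which is why the LSTM architecture on the next slides is needed.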
Vanilla RNN
• Another view
[Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Long Short-Term Memory (LSTM)
• Four neural network layers in one module
• Two recurrent flows
[Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Long Short-Term Memory (LSTM)
• Cell state (“the key to LSTM”)
- Information can flow through the cell states without being much changed
- Linear connections

[Figure: LSTM cell state with the forget gate, input gate, and new information (candidate) paths]

[Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Long Short-Term Memory (LSTM)
• Generate the next state from the cell
Output gate
[Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Another View of LSTM
• Three gates (sigmoid) and the original RNN connection (tanh)
i = σ(W_i [h_{t−1}; x_t]),  f = σ(W_f [h_{t−1}; x_t]),  o = σ(W_o [h_{t−1}; x_t]),  g = tanh(W_g [h_{t−1}; x_t])
c_t = f ⊙ c_{t−1} + i ⊙ g
h_t = o ⊙ tanh(c_t)

Long Short-Term Memory (LSTM) [Hochreiter et al., 1997]
[Stanford CS231n]
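A minimal NumPy sketch of one LSTM step following these equations; the single weight matrix over the stacked [h; x] input mirrors the CS231n view, and the dimensions and names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the stacked [h_prev; x_t] vector to the
    pre-activations of the i, f, o, g blocks (4*H rows in total)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])              # the "original RNN connection"
    c = f * c_prev + i * g            # cell state: near-linear flow over time
    h = o * np.tanh(c)                # hidden state read out through the output gate
    return h, c

rng = np.random.default_rng(1)
H, D = 4, 3
h, c = np.zeros(H), np.zeros(H)
W = rng.normal(size=(4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
for _ in range(10):                   # run a few time steps
    h, c = lstm_step(rng.normal(size=D), h, c, W, b)
print(h.shape, c.shape)
```

The cell update c = f * c_prev + i * g is the "linear connection" the slides highlight: gradients can flow through it largely unchanged.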
Another View of LSTM
• Much more powerful than the vanilla RNNs
- Uninterrupted gradient flow is possible through the cell over time steps
- The structure with two recurrent flows is similar to ResNet
• Long-term dependency can be learned
- We can use long sequence data as input
[Figure: ImageNet architectures from He et al. (2016), “Deep Residual Learning for Image Recognition”: VGG-19, a 34-layer plain network, and a 34-layer residual network with shortcut connections]
Deep Learning for Automatic Music Transcription
• MLP, CNN and RNN have been applied to many AMT tasks
- Onset detection / Beat tracking
- Chord recognition
- Polyphonic piano transcription
Onset Detection
• Binary classification using Bi-directional LSTM
- Input: half-wave rectified differences of mel-spectrograms with different window sizes (a bit of hand-design)
- After training the Bi-LSTM, a peak-detection rule is used to find the final onsets
Universal Onset Detection with Bidirectional Long Short-Term Memory Neural Networks, Eyben et al. 2010
3.2 Recurrent neural networks
Another technique for introducing past context to neural networks is to add backward (cyclic) connections to FNNs. The resulting network is called a recurrent neural network (RNN). RNNs can theoretically map from the entire history of previous inputs to each output. The recurrent connections form a kind of memory, which allows input values to persist in the hidden layer(s) and influence the network output in the future. If future context is also required, a delay between the input values and the output targets can be introduced.
3.3 Bidirectional recurrent neural networks
A more elegant incorporation of future context is provided by bidirectional recurrent networks (BRNNs). Two separate hidden layers are used instead of one, both connected to the same input and output layers. The first processes the input sequence forwards and the second backwards. The network therefore always has access to the complete past and future context in a symmetrical way, without bloating the input layer size or displacing the input values from the corresponding output targets. The disadvantage of BRNNs is that they must have the complete input sequence at hand before it can be processed.
3.4 Long Short-Term Memory
Although BRNNs have access to both past and future information, the range of context is limited to a few frames due to the vanishing gradient problem [11]. The influence of an input value decays or blows up exponentially over time, as it cycles through the network with its recurrent connections and gets dominated by new input values.
[Figure 2. An LSTM block with one memory cell: input, input gate, forget gate, memory cell with a self-connection of weight 1.0, and output gate]
To overcome this deficiency, a method called Long Short-Term Memory (LSTM) was introduced in [13]. In an LSTM hidden layer, the nonlinear units are replaced by LSTM memory blocks (Figure 2). Each block contains one or more self-connected linear memory cells and three multiplicative gates. The internal state of the cell is maintained with a recurrent connection of constant weight 1.0. This connection enables the cell to store information over long periods of time. The content of the memory cell is controlled by the multiplicative input, output, and forget gates, which – in computer memory terminology – correspond to write, read, and reset operations. More details on the training algorithm employed, and the bidirectional LSTM architecture in general, can be found in [10].
4. PROPOSED APPROACH
This section describes our novel approach for onset detection in music signals, which is based on bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks. In contrast to previous approaches, it is able to model the context an onset occurs in. The properties of an onset and the amount of relevant context are thereby learned from the data set used for training. The audio data is transformed to the frequency domain via two parallel STFTs with different window sizes. The obtained magnitude spectra and their first order differences are used as inputs to the BLSTM network, which produces an onset activation function at its output. Figure 3 shows this basic signal flow. The individual blocks are described in more detail in the following sections.
[Figure 3. Basic signal flow of the new neural-network-based onset detector: Signal → two parallel STFT & Difference blocks → BLSTM Network → Peak detection → Onsets]
4.1 Feature extraction
As input, the raw PCM audio signal with a sampling rate of fs = 44.1 kHz is used. To reduce the computational complexity, stereo signals are converted to a monaural signal by averaging both channels. The discrete input audio signal x(t) is segmented into overlapping frames of W samples length (W = 1024 and W = 2048, see Section 4.2), which are sampled at a rate of one per 10 ms (onset annotations are available on a frame level). A Hamming window is applied to these frames. Applying the STFT yields the complex spectrogram X(n, k), with n being the frame index and k the frequency bin index. The complex spectrogram is converted to the power spectrogram S(n, k) = |X(n, k)|².
The dimensionality of the spectra is reduced by applying psychoacoustic knowledge: a conversion to the Mel frequency scale is performed with openSMILE [8]. A filterbank with 40 triangular filters, which are equidistant on the Mel scale, is used to transform the spectrogram S(n, k) to the Mel spectrogram M(n, m). To match human perception of loudness, a logarithmic representation is chosen:

M_log(n, m) = log(M(n, m) + 1.0)    (1)
11th International Society for Music Information Retrieval Conference (ISMIR 2010)
The positive first order difference D+(n, m) is calculated by applying a half-wave rectifier function H(x) = (x + |x|) / 2 to the difference of two consecutive Mel spectra:

D+(n, m) = H(M_log(n, m) − M_log(n − 1, m))    (2)
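Eqs. (1) and (2) can be sketched in a few lines of NumPy, assuming a non-negative mel spectrogram array of shape (frames, bands); the zero-padded first row is an implementation choice for alignment, not from the paper:

```python
import numpy as np

def positive_difference(mel_spec):
    """Eq. (1)-(2): log compression, then half-wave-rectified first-order
    difference along time. mel_spec: (n_frames, n_bands), non-negative."""
    m_log = np.log(mel_spec + 1.0)          # Eq. (1)
    diff = m_log[1:] - m_log[:-1]           # consecutive-frame difference
    d_plus = (diff + np.abs(diff)) / 2.0    # H(x) = (x + |x|) / 2
    # pad the first frame with zeros so output frames align with input frames
    return np.vstack([np.zeros((1, mel_spec.shape[1])), d_plus])

# Toy 3-frame, 2-band mel spectrogram: band 0 rises then flattens, band 1 dips then jumps
mel = np.array([[1.0, 4.0],
                [3.0, 2.0],
                [3.0, 9.0]])
print(positive_difference(mel))
```

Decreasing energy is clipped to zero, so only rising spectral energy (onset-like behaviour) survives.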
4.2 Neural Network stage
As a neural network, an RNN with BLSTM units is used. As inputs to the neural network, two log Mel spectrograms M_log^23(n, m) and M_log^46(n, m) (computed with window sizes of 23.2 ms and 46.4 ms (W = 1024 and W = 2048 samples), respectively) and their corresponding positive first order differences D+_23(n, m) and D+_46(n, m) are applied, resulting in 160 input units. The network has three hidden layers for each direction (6 layers in total) with 20 LSTM units each. The output layer has two units, whose outputs are normalised to both lie between 0 and 1, and to sum to 1, using the softmax function. The normalised outputs represent the probabilities for the classes ‘onset’ and ‘no onset’. This allows the use of the cross entropy error criterion to train the network [10]. Alternative networks with a single output, where a value of 1 represents an onset frame and a value of 0 a non-onset frame, which are trained using the mean squared output error as criterion, were not as successful.
4.2.1 Network training
For network training, supervised learning with early stopping is used. Each audio sequence is presented frame by frame (in correct temporal order) to the network. Standard gradient descent with backpropagation of the output errors is used to iteratively update the network weights. To prevent over-fitting, the performance (cross entropy error, cf. [10]) on a separate validation set is evaluated after each training iteration (epoch). If no improvement of this performance over 20 epochs is observed, the training is stopped and the network with the best performance on the validation set is used as the final network. The gradient descent algorithm requires the network weights to be initialised with non-zero values. We initialise the weights with a random Gaussian distribution with mean 0 and standard deviation 0.1. The training data, as well as validation and test sets, are described in Section 5.
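The early-stopping rule described above (stop after 20 epochs without validation improvement, keep the best network) can be sketched as follows; `train_epoch` and `validate` are hypothetical callables standing in for the actual training and evaluation code:

```python
def train_with_early_stopping(train_epoch, validate, patience=20, max_epochs=1000):
    """Stop when the validation error has not improved for `patience` epochs;
    return the best epoch index and its validation error."""
    best_err, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        train_epoch()                      # one pass over the training data
        err = validate()                   # error on the separate validation set
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                          # no improvement for `patience` epochs
    return best_epoch, best_err

# Toy run: validation error improves until epoch 5, then worsens.
errs = iter([1.0, 0.8, 0.6, 0.5, 0.45, 0.44] + [0.5] * 100)
best_epoch, best_err = train_with_early_stopping(lambda: None, lambda: next(errs))
print(best_epoch, best_err)  # 5 0.44
```

In the paper's setting, the weights of the best epoch's network would also be saved and restored at the end.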
4.3 Peak detection stage
A network obtained after training as described in the previous section is able to classify each frame into two classes: ‘onset’ and ‘no onset’. The standard method of choosing the output node with the highest activation to determine the frame class has not proven effective. Hence, only the output activation of the ‘onset’ class is used. Thresholding and peak detection are applied to it, as described in the following sections:
4.3.1 Thresholding
[Figure 4. Top: log Mel spectrogram with ground truth onsets (vertical dashed lines). Bottom: network output with detected onsets (marked by dots), ground truth onsets (dotted vertical lines), and threshold θ (horizontal dashed line). 4 s excerpt from ‘Basement Jaxx - Rendez-Vu’.]

One problem with existing magnitude based reduction functions (cf. Section 2) is that the amplitude of the detection function depends on the amplitude of the signal or the magnitude of its short time spectrum. Thus, to successfully deal with high dynamic ranges, adaptive thresholds must be used when thresholding the detection function prior to peak picking. Similar to phase based reduction functions, the output activation function of the BLSTM network is not affected by input amplitude variations, since its value represents a probability of observing an onset rather than representing onset strength. In order to obtain optimal classification for each song, a fixed threshold θ is computed per song, proportional to the median of the activation function (frames n = 1 … N), constrained to the range from θmin = 0.1 to θmax = 0.3:
θ* = λ · median{a_o(1), …, a_o(N)}    (3)
θ = min(max(0.1, θ*), 0.3)    (4)
with a_o(n) being the output activation function of the BLSTM neural network for the onset class, and the scaling factor λ chosen to maximise the F1-measure on the validation set. The final onset function o_o(n) contains only the activation values greater than this threshold:
o_o(n) = a_o(n) if a_o(n) > θ, and 0 otherwise    (5)
4.3.2 Peak picking
The onsets are represented by the local maxima of the onset detection function o_o(n). Thus, using a standard peak search, the final onset function o(n) is given by:
o(n) = 1 if o_o(n − 1) ≤ o_o(n) ≥ o_o(n + 1), and 0 otherwise    (6)
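Eqs. (3)-(6) can be sketched together as follows, assuming a 1-D array of onset-class activations; the toy activation values and λ = 1 are illustrative:

```python
import numpy as np

def detect_onsets(a_o, lam=1.0):
    """Eqs. (3)-(6): per-song threshold from the median activation,
    clamped to [0.1, 0.3], then local-maximum peak picking."""
    theta = min(max(0.1, lam * np.median(a_o)), 0.3)   # Eqs. (3)-(4)
    o_o = np.where(a_o > theta, a_o, 0.0)              # Eq. (5)
    onsets = []
    for n in range(1, len(o_o) - 1):                   # Eq. (6): local maxima
        if o_o[n] > 0 and o_o[n - 1] <= o_o[n] >= o_o[n + 1]:
            onsets.append(n)
    return theta, onsets

# Toy activation curve with two clear peaks at frames 2 and 6
act = np.array([0.05, 0.1, 0.9, 0.2, 0.05, 0.15, 0.7, 0.1, 0.05])
theta, onsets = detect_onsets(act)
print(theta, onsets)  # 0.1 [2, 6]
```

Because the threshold is derived from each song's own activation statistics, the detector adapts to quiet and loud recordings alike.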
Onset Detection
• Binary classification using 2D CNN
- Input: 80-band mel spectrogram with three channels computed with different window sizes
- Output: the binary output is determined at the center of the input frames
- Filter size of the first conv layer: wide in time and narrow in frequency
[Fig. 2: One of the Convolutional Neural Network architectures used in this work. Starting from a stack of three spectrogram excerpts (3 input channels, 15×80), convolution (7×3) and max-pooling (1×3) in turns compute 10 feature maps (9×78 → 9×26), then 20 feature maps (7×24 → 7×8), which are classified with a fully-connected network (256 sigmoid units and one sigmoid output unit).]
[…] maximum value in non-overlapping 2×2 pixel cells. This both reduces the amount of data and introduces some translation invariance. To be used for classification, the computation chain of a CNN ends in a fully-connected network that integrates information across all locations in all feature maps of the layer below. When introduced, this type of architecture set the state of the art in handwritten digit recognition [19], and still defines the state of the art on several computer vision tasks [20].
3.2. Application to Onset Detection
To be used as an onset detector, we train a CNN on spectrogram excerpts centered on the frame to classify, giving binary labels to distinguish onsets from non-onsets (see Fig. 2).
Computer vision usually uses square filters and square pooling. In spectrograms, though, the two dimensions represent two different modalities, and we found rectangular shapes to be more effective (cf. [18]). In particular, as the task mostly entails finding changes over time, we use filters wide in time and narrow in frequency, and as the task requires results of high time resolution, but is oblivious to frequency, we perform max-pooling over frequencies only.
Computer vision often handles color images, presenting the input such that each neuron has access to the same local region in all color channels (e.g., red, green, and blue). Here we train on a stack of spectrograms instead, with different window sizes, but the same frame rate, and reduced to the same number of frequency bands with logarithmic filter banks. This way each neuron can combine information of high temporal and high frequency accuracy for its location.
To detect onsets in a test signal, we compute the spectrograms and feed them to the network (instead of giving excerpts of the size used in training, we can apply the convolution and pooling operations to the full input at once), obtaining an onset activation function over time. This function is smoothed by convolution with a Hamming window of 5 frames, and local maxima higher than a given threshold are reported as onsets.
3.3. Training methodology
We train our networks using mini-batch gradient descent with momentum, minimizing cross-entropy error. As an extension to our experiments in [1], for each training case we randomly drop 50% of the inputs of the two fully-connected layers and double the remaining connection weights, to improve generalization and avoid the need for early stopping (see [21]). As another extension, we note that for our spectrogram frame rate (100 Hz), assigning each annotated onset to a single frame may be inappropriate – some annotations are not accurate enough, and some onsets are not that sharp – so we assign it to three frames instead, weighting the extra frames less in training. We will investigate the effect of our extensions in the experiments.
4. EXPERIMENTAL RESULTS
Starting from the initial experiment of [1], we perform several modifications to both architecture and training, yielding a further improvement over the previous state of the art. We will report on these improvements in detail after describing the data and evaluation method.
4.1. Data
We evaluate our networks on a dataset of about 102 minutes of music annotated with 25,927 onsets, detailed in [22, p. 4] and also used in [5]. It contains monophonic and polyphonic instrumental recordings as well as popular music excerpts. Following [11], we compute three magnitude spectrograms with a hop size of 10 ms and window sizes of 23 ms, 46 ms and 93 ms. We apply an 80-band Mel filter from 27.5 Hz to 16 kHz and scale magnitudes logarithmically. We normalize each frequency band to zero mean and unit variance (constants computed on a hold-out set). The network input for a single decision consists of the frame to classify plus a context of ±70 ms (15 frames in total), from all three spectrograms, which is about the context we found the RNN of [10] to use.
4.2. Evaluation
As in [22, 5], a reported onset is considered correct if it is not farther than 25 ms from an unmatched target annotation; any excess detections and targets are false positives and negatives, respectively. From the precision/recall curve obtained by varying the threshold, we report metrics for the point of optimal F-score only. As in [22, 5], all results are obtained in 8-fold cross-validation.
4.3. Initial Architecture
Our initial architecture from [1] is depicted in Fig. 2: From the 3-channel spectrogram excerpts of 15 frames by 80 bands, a convolutional layer with filters of 7 frames by 3 bands (by 3 channels) computes 10 feature maps of 9 frames by 78 bands. The next layer performs max-pooling over 3 adjacent bands without overlap, reducing the maps to 26 bands. Another convolutional layer of 3×3 filters
Improved musical onset detection with Convolutional Neural Networks, Schlüter and Böck, 2014
Onset Detection
• CNN is powerful!
• CNN is more interpretable
- They learn the difference of short- and long-window spectrograms to find onsets.
Method        | Precision | Recall | F-score
RNN [10, 5]   | 0.892     | 0.855  | 0.873
CNN [1]       | 0.905     | 0.866  | 0.885
+ Dropout     | 0.909     | 0.871  | 0.890
+ Fuzziness   | 0.914     | 0.885  | 0.899
+ ReLU        | 0.917     | 0.889  | 0.903
SuperFlux [5] | 0.883     | 0.793  | 0.836

Table 1: Performance of the state-of-the-art RNN compared to the proposed CNN and a hand-designed method. See Sections 4.3–4.6 for details on rows 2–5.
and another 3-band max-pooling layer result in 20 maps of 7 frames by 8 bands (1120 neurons in total). These are processed by a fully-connected layer of 256 units and a final fully-connected layer of a single output unit predicting onsets. Both convolutional layers use the tanh nonlinearity (with a scalar bias per feature map), and the fully-connected layers use the logistic sigmoid.
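The feature-map sizes quoted for this architecture can be checked with a small shape calculation ('valid' convolution and frequency-only pooling, matching the dimensions in Fig. 2):

```python
def conv_valid(shape, filt):
    """Output size of a 'valid' convolution: n - f + 1 per dimension."""
    return tuple(n - f + 1 for n, f in zip(shape, filt))

def pool_freq(shape, p):
    """Non-overlapping max-pooling over the frequency axis only."""
    return (shape[0], shape[1] // p)

x = (15, 80)                 # 15 frames x 80 mel bands (3 input channels)
x = conv_valid(x, (7, 3))    # -> 10 maps of 9 x 78
x = pool_freq(x, 3)          # -> 9 x 26
x = conv_valid(x, (3, 3))    # -> 20 maps of 7 x 24
x = pool_freq(x, 3)          # -> 7 x 8
print(x, 20 * x[0] * x[1])   # (7, 8) 1120 neurons into the fully-connected layers
```

The final count of 20 × 7 × 8 = 1120 neurons matches the "1120 neurons in total" feeding the fully-connected layers.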
The network is trained in mini-batches of 256 examples, for 100 epochs, using a fixed learning rate of 0.05 and an initial momentum of 0.45, linearly increased to 0.9 between epochs 10 and 20.

It achieves an F-score of 88.5%, about one percent point above the state-of-the-art RNN (Table 1, rows 1–2). (Trained on single-channel spectrograms, both models lose about one percent point.)
4.4. Bagging and Dropout
Bagging is a straightforward way to improve the performance of a classifier without changing its architecture: Training four RNNs and averaging their outputs gives a slight improvement to 87.7% F-score. Similarly, bagging two of our CNNs improves results to 89.1%, but four CNNs perform the same. A single CNN with twice the number of units in each layer overfits and obtains 87.9% only. Jointly training two CNNs connected to the same output unit does not overfit, but is inferior to training them separately. We conclude that the benefit of bagging over simply enlarging the network does not stem from the fact that its constituent parts do not overfit, but that they are forced to solve the task on their own – when training two CNNs jointly, the second will not receive any learning signal when the first produces the correct answer with high confidence, and vice versa.
The same holds for the hidden units within each CNN. An elegant way to ensure that each unit receives a learning signal and is encouraged to solve its task independently of its peers is using dropout [21]: For each training case, half of the units are omitted from the network (cheaply accomplished by masking their output), chosen at random, and remaining weights are doubled to compensate. Applying this to the inputs of the two fully-connected layers and increasing the learning rate to 1.0, multiplied with 0.995 after each epoch, yields 89.0% F-score. Note that dropout does not incur any higher costs at test time, while bagging two CNNs is twice as expensive. Another key advantage is that it prevents overfitting, allowing us to fix training time to 300 epochs and try different setups without the need for early stopping on a validation set.
4.5. Fuzzier Training Examples
Onsets are annotated as time points. For training, we associate each annotation with its closest spectrogram frame and use this frame (along with its ±7 frames of context) as a positive example, and all others as negative examples. Some onsets have a soft attack, though, or are not annotated with 10 ms precision, resulting in actual onsets being presented to the network as negative training examples. To counter this, we would like to train on less sharply defined ground truth. One solution would be to replace the binary targets with sharp Gaussians and turn the classification problem into a regression one, but preliminary experiments on the RNN showed no improvement. Instead, we define a single frame before and after each annotated onset to be additional positive examples. To still teach the network about the most salient onset position, these examples are weighted with only 25% during training. This measure improves F-score to 89.9%, using a higher detection threshold than before. Simply excluding 1 or 2 frames around each onset from training, letting the network freely decide on those, works just slightly worse.
4.6. Rectified Linear Units
Both the hand-designed SuperFlux algorithm [5] and the state-of-the-art RNN [10] build on precomputed positive differences in spectral energy over time. Replacing the tanh activation function in the convolutional layers with the linear rectifier y(x) = max(0, x) provides a direct way for the CNN to learn to compute positive differences in its spectral input, and has been generally shown useful for supervisedly trained networks [23]. In our case, it improves F-score to our final result of 90.3%. Using rectified linear units for the fully-connected hidden layer as well reduces performance to 89.6%.
5. INTROSPECTION
While we have developed a state-of-the-art musical onset detector that is perfectly usable as a black box, we would like to know how it works. In particular, we hope to gain some insights on why it is better than existing hand-crafted algorithms, and possibly learn from its solution to improve these algorithms.
For this purpose, we train a CNN with the second convolutional layer and max-pooling layer removed to make it easier to interpret, and tanh units for the remaining convolutional layer (dropout and fuzziness as before). It achieves 88.8% F-score, which is still far superior to the SuperFlux algorithm (Table 1, last row), making it an interesting model to study. We will visualize both the connections learned by the model and its hidden unit states on test data to understand its computations. To guide us, we will start at the output unit and work our way backwards through the network, concentrating on the parts that contribute most to its classification decisions.
5.1. Output Unit
The output unit computes a weighted sum of the 256 hidden unit states below, then applies the logistic sigmoid function, resulting in a value between 0.0 and 1.0 interpretable as an onset probability. Fig. 3b shows this output over time for two well-chosen test signals: One rich in percussive onsets, the other in transient-free harmonic ones.¹ Except for a false positive in the latter, the network output well matches the ground truth.
To understand how the output is driven by the 256 hidden units, we visualize their states for the two signals, ordered by connection weight to the output unit (Fig. 3c). Interestingly, the most strongly connected units (near the top and bottom border) are hardly active and do not seem to be useful for these examples – they may have specialized to exotic corner cases in the training data. In contrast, a large number of units with small connection weights (near the sign change prominently visible in the figure) clearly reflects the onset locations. Comparing states for the two signals, we see that a number of positively connected units (below the sign change) detect percussive onsets only, while others also detect harmonic ones.

¹ http://ofai.at/~jan.schlueter/pubs/2014_icassp/
5.2. Fully-Connected Hidden Layer
Having identified the most interesting hidden units (the ones near the sign change), we will investigate what they compute. Fig. 3d visualizes the connections of two units to the feature maps in the layer below. The second one displays a sharp wide-band off-on-off connection to the fourth map, and similarly sharp connections to other maps. It is good at detecting percussive onsets, which are short wide-band bursts. The first unit computes more long-term differences, notably in the first and ninth map, and manages to capture harmonic onsets. Other units look very similar to the two types shown, with variations in timing and covered frequency bands.
5.3. Convolutional Layer
To close the remaining gap to the input, we will study the feature maps computed by the convolutional layer. From the previous investigation, maps 4 and 9 seem to play an important role. For the first signal, map 4 highlights the onsets very sharply (Fig. 3e). Looking at the corresponding filter (Fig. 3g), it seems to detect energy bursts of 1 to 3 frames in the mid-sized spectrogram, and compute a temporal difference in the long-window one. Map 9 also computes this temporal difference and contrasts it against a slightly offset difference in the short-window spectrogram (Fig. 3h). While still very fuzzy, this enhances onsets of the second signal (Fig. 3f).
5.4. Insights
Although our inspection was highly selective, covering a small part of the network only, we formed a basic intuition of what it does. Like spectral flux based methods, the network computes spectral differences over time. In doing so, it adapts the context to the spectrogram window length, which was also found to be crucial in [5]. And like [4], the CNN separates the detection of percussive and pitched onsets. As a novel feature, the network computes the difference of short- and long-window spectrograms to find onsets. However, imitating this is not enough to build a good onset detector. In fact, the key factor seems to be that the network combines hundreds of minor variations of the same approach, something that cannot be reproduced with hand-designed algorithms.
6. DISCUSSION
Through a combination of recent neural network training methods, we significantly advanced the state of the art in musical onset detection. Analyzing the learned model, we find that it rediscovered several ideas used in hand-designed methods, but is superior by combining results of many slightly different detectors. This shows that even for easily understandable problems, labelling data and applying machine learning may be more worthwhile than directly engineering a solution. Further improvements may be achieved by training larger networks, by trying other filter shapes, by regularizing the convolutional layers [24], and by including phase information. More insights might be won by recent CNN visualization techniques [25, 26]. Another direction for future research is to combine ideas from CNNs with RNNs, such as local connectivity and pooling, to obtain a state-of-the-art model suitable for low-latency real-time processing.
(a) input spectrograms (mid-sized window length only)
(b) network output (blue line) and ground truth (vertical red bars)
(c) penultimate layer states ordered by connection weight to output, from strongly negative (top) to strongly positive (bottom)

(d) weights of two penultimate layer units: each block shows connections to one of the ten feature maps in the layer below, with time increasing from left to right, frequency from bottom to top, red and blue denoting negative and positive weights, respectively

(e) feature map 4 for the first signal (after pooling and tanh)

(f) feature map 9 for the second signal (after pooling and tanh)

(g) filter kernel for map 4: three 7×3 blocks for the three input spectrograms (mid, short, long)

(h) filter kernel for map 9: three 7×3 blocks for the three input spectrograms (mid, short, long)
Fig. 3: Network weights and states for two test signals (see Sect. 5).
Acknowledgements: This research is supported by the Austrian Science Fund (FWF): TRP 307-N23, and by the European Union Seventh Framework Programme FP7 / 2007-2013 through the PHENICX project (grant agreement no. 601166). The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry for Transport, Innovation, and Technology.
Chord Recognition
• Deep chroma
- Supervised feature learning of chroma
- Input: 15 frames of quarter-tone spectrogram
- MLP: 3 dense layers of 512 rectifier units
- Output: 12-dimensional chroma vector (chord labels are then predicted by a linear classifier)
Feature Learning for Chord Recognition: The Deep Chroma Extractor, Korzeniowski and Widmer, 2016
…spectively, and σ_l is a (usually non-linear) activation function applied point-wise.
We define two additional special layers: an input layer that is feeding values to h_1 as h_0(x) = x, with U_0 being the input's dimensionality; and an output layer h_{L+1} that takes the same form as shown in Eq. 1, but has a specific semantic purpose: it represents the output of the network, and thus its dimensionality U_{L+1} and activation function σ_{L+1} have to be set accordingly.²
The weights and biases constitute the model's parameters. They are trained in a supervised manner by gradient methods and error back-propagation in order to minimise the loss of the network's output. The loss function depends on the domain, but is generally some measure of difference between the current output and the desired output (e.g. mean squared error, categorical cross-entropy, etc.).
In the following, we describe how we compute the input to the DNN, the concrete DNN architecture, and how it was trained.
4.1 Input Processing
We compute the time-frequency representation of the signal based on the magnitude of its STFT X. The STFT gives significantly worse results than the constant-q transform if used as the basis for traditional chroma extractors, but we found in preliminary experiments that our model is not sensitive to this phenomenon. We use a frame size of 8192 with a hop size of 4410 at a sample rate of 44100 Hz. Then, we apply triangular filters to convert the linear frequency scale of the magnitude spectrogram to a logarithmic one in what we call the quarter-tone spectrogram S = F^4_Log · |X|, where F^4_Log is the filter bank. The quarter-tone spectrogram contains only bins corresponding to frequencies between 30 Hz and 5500 Hz and has 24 bins per octave. This results in a dimensionality of 178 bins. Finally, we apply a logarithmic compression such that S_Log = log(1 + S), which we will call the logarithmic quarter-tone spectrogram. To be concise, we will refer to S_Log as "spectrogram" in the rest of this paper.
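The front end just described can be sketched in plain numpy. The triangular-filter edge handling and normalisation below are assumptions of this sketch, not the paper's exact implementation (which reports 178 bins for the 30 Hz – 5500 Hz range):

```python
import numpy as np

def quarter_tone_filterbank(sr=44100, n_fft=8192, fmin=30.0, fmax=5500.0,
                            bins_per_octave=24):
    """Triangular filters mapping linear FFT bins to a quarter-tone scale.
    Edge handling and area normalisation are assumptions of this sketch."""
    n_bins = int(np.floor(bins_per_octave * np.log2(fmax / fmin)))
    # geometrically spaced centre frequencies, plus one edge on each side
    centers = fmin * 2.0 ** (np.arange(-1, n_bins + 1) / bins_per_octave)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_bins, len(fft_freqs)))
    for i in range(n_bins):
        lo, c, hi = centers[i], centers[i + 1], centers[i + 2]
        rise = (fft_freqs - lo) / (c - lo)
        fall = (hi - fft_freqs) / (hi - c)
        fb[i] = np.maximum(0.0, np.minimum(rise, fall))
    # normalise each filter to unit area (guard against empty filters)
    fb /= np.maximum(fb.sum(axis=1, keepdims=True), 1e-12)
    return fb

def log_quarter_tone_spectrogram(magnitude, fb):
    """S_Log = log(1 + F |X|), applied to a magnitude spectrogram."""
    return np.log1p(fb @ magnitude)
```

The `log1p` compression guarantees a non-negative output, since the filtered magnitudes are themselves non-negative.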
Our model uses a context window around a target frame as input. Through systematic experiments on the validation folds (see Sec. 5.1) we found a context window of ±0.7 s to work best. Since we operate at 10 fps, we feed our network 15 consecutive frames at a time, which we will denote as a super-frame.
4.2 Model
We define the model architecture and set the model's hyper-parameters based on validation performance in several preliminary experiments. Although a more systematic approach might reveal better configurations, we found that results do not vary by much once we reach a certain model complexity.
2 For example, for a 3-class classification problem one would use 3 units in the output layer and a softmax activation function such that the network's output can be interpreted as a probability distribution over classes given the data.
Figure 1. Model overview. At each time step, 15 consecutive frames of the input quarter-tone spectrogram S_Log are fed to a series of 3 dense layers of 512 rectifier units, and finally to a sigmoid output layer of 12 units (one per pitch class), which represents the chroma vector for the centre input frame.
Our model is a deep neural network with 3 hidden layers of 512 rectifier units [11] each. Thus, σ_l(x) = max(0, x) for 1 ≤ l ≤ L. The output layer, representing the chroma vector, consists of 12 units (one unit per pitch class) with a sigmoid activation function σ_{L+1}(x) = 1 / (1 + exp(−x)). The input layer represents the input super-frame and thus has a dimensionality of 2670. Fig. 1 shows an overview of our model.
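As a sketch, the forward pass of this architecture in plain numpy (the weights below are randomly initialised placeholders, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# layer sizes from the paper: 15 frames x 178 bins = 2670 inputs,
# three hidden layers of 512 rectifier units, 12 sigmoid outputs
sizes = [2670, 512, 512, 512, 12]
params = [(rng.normal(scale=0.01, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def deep_chroma(x):
    """Forward pass: x is a super-frame of shape (2670,)."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W, b = params[-1]
    return sigmoid(h @ W + b)  # 12-dim chroma vector in (0, 1)
```

The sigmoid output keeps every pitch-class activation strictly between 0 and 1, matching the interpretation as a chroma vector.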
4.3 Training
To train the network, we propagate the gradient of the loss L with respect to the network parameters back through the network. Our loss is the binary cross-entropy between each pitch class in the predicted chroma vector p = h_{L+1}(S_Log) and the target chroma vector t, which is derived from the ground truth chord label. For a single data instance,
L = (1/12) Σ_{i=1}^{12} [ −t_i log(p_i) − (1 − t_i) log(1 − p_i) ].    (2)
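Eq. 2 translates directly to code; a small numpy version (the clipping constant is an implementation assumption of this sketch, added to avoid log(0)):

```python
import numpy as np

def chroma_bce(p, t, eps=1e-7):
    """Mean binary cross-entropy over the 12 pitch classes (Eq. 2).
    p: predicted chroma in (0, 1); t: binary target chroma."""
    p = np.clip(p, eps, 1.0 - eps)  # numerical guard, not in the paper
    return float(np.mean(-t * np.log(p) - (1.0 - t) * np.log(1.0 - p)))
```

A maximally uncertain prediction of 0.5 for every pitch class yields a loss of log 2 per class, a useful sanity check.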
We learn the parameters with mini-batch training (batch size 512) using the ADAM update rule [16]. We also tried simple stochastic gradient descent with Nesterov momentum and a number of manual learning rate schedules, but could not achieve better results (on the contrary, ADAM training usually converged earlier). To prevent over-fitting, we apply dropout [26] with probability 0.5 after each hidden layer, and early stopping if validation accuracy does not increase for 20 epochs.
5. EXPERIMENTS
To evaluate the chroma features our method produces, we set up a simple chord recognition task. We ignore any post-filtering methods and use a simple, linear classifier (logistic regression) to match features to chords. This way we want to isolate the effect of the feature on recognition accuracy. As is common, we restrict ourselves to distinguishing only major/minor chords, resulting in 24 chord classes and a 'no chord' class.
Chord Recognition
• Deep chroma vs. hand-crafted chroma
Figure 6. Average saliency of all input frames of the Beatles dataset (bottom image), summed over the time axis (top plot). We see that most relevant information can be collected in barely 3 octaves between G3 at 196 Hz and E6 at 1319 Hz. Hardly any harmonic information resides below 110 Hz and above 3136 Hz. The plot is spiky at frequency bins that correspond to clean semitones because most of the songs in the dataset seem to be tuned to a reference frequency of 440 Hz. The network thus usually pays little attention to the frequency bins between semitones.
Figure 7. Excerpts of chromagrams extracted from the song "Yesterday" by the Beatles. The lower image shows chroma computed by C^W_Log without smoothing. We see a good temporal resolution, but also noise. The centre image shows the same chromas after a moving average filter of 1.5 seconds. The filter reduced noise considerably, at the cost of blurring chord transitions. The upper plot shows the chromagram extracted by our proposed method. It displays precise pitch activations and low noise, while keeping chord boundaries crisp. Pixel values are scaled such that for each image, the lowest value in the respective chromagram is mapped to white, the highest to black.
seems to suffice as input to chord recognition methods. Using saliency maps and preliminary experiments on validation folds, we also found that a context of 1.5 seconds is adequate for local harmony estimation.
There are plenty of possibilities for future work to extend and/or improve our method. To achieve better results, we could use DNN ensembles instead of a single DNN. We could ensure that the network sees data for which its predictions are wrong more often during training, or similarly, we could simulate a more balanced dataset by showing the net super-frames of rare chords more often. To further assess how useful the extracted features are for chord recognition, we shall investigate how well they interact with post-filtering methods; since the feature extractor is trained discriminatively, Conditional Random Fields [17] would be a natural choice.
Finally, we believe that the proposed method extracts features that are useful in any other MIR applications that use chroma features (e.g. structural segmentation, key estimation, cover song detection). To facilitate respective experiments, we provide source code for our method as part of the madmom audio processing framework [2]. Information and source code to reproduce our experiments can be found at http://www.cp.jku.at/people/korzeniowski/dc.
8. ACKNOWLEDGEMENTS
This work is supported by the European Research Council (ERC) under the EU's Horizon 2020 Framework Programme (ERC Grant Agreement number 670035, project "Con Espressione"). The Tesla K40 used for this research was donated by the NVIDIA Corporation.
9. REFERENCES
[1] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, Aug. 2013.

[2] S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and G. Widmer. madmom: a new Python Audio and Music Signal Processing Library. arXiv preprint arXiv:1605.07008, 2016.

[3] S. Böck, F. Krebs, and G. Widmer. A multi-model approach to beat tracking considering heterogeneous music styles. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan, 2014.

[4] S. Böck, F. Krebs, and G. Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), Malaga, Spain, 2015.

[5] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Audio chord recognition with recurrent neural networks. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR), Curitiba, Brazil, 2013.
Deep chroma
Hand-crafted chroma

Figure 2. Validation WCSR for major/minor chord recognition of different methods given different audio context sizes. Whiskers represent 0.95 confidence intervals.
Our compound evaluation dataset comprises the Beatles [13], Queen and Zweieck [18] datasets (which form the "Isophonics" dataset used in the MIREX³ competition), the RWC pop dataset⁴ [12], and the Robbie Williams dataset [8]. The datasets total 383 songs, or approx. 21 hours and 39 minutes of music.
We perform 8-fold cross validation with random splits. For the Beatles dataset, we ensure that each fold has the same album distribution. For each test fold, we use six of the remaining folds for training and one for validation.
As evaluation measure, we compute the Weighted Chord Symbol Recall (WCSR), often called Weighted Average Overlap Ratio (WAOR), of major and minor chords using the mir_eval library [23].
5.1 Compared Features
We evaluate our extracted features C_D against three baselines: a standard chromagram C computed from a constant-q transform, a chromagram with frequency weighting and logarithmic compression of the underlying constant-q transform C^W_Log, and the quarter-tone spectrogram S_Log. The chromagrams are computed using the librosa library⁵. Their parametrisation closely follows the suggestions in [7], where C^W_Log was found to be the best chroma feature for chord recognition.
Each baseline can take advantage of context information. Instead of computing a running mean or median, we allow logistic regression to consider multiple frames of each feature⁶. This is a more general way to incorporate context, because a running mean is a subset of the context aggregation functions possible in our setup. Since training logistic regression is a convex problem, the result is at least as good as if we used a running mean.
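The context argument above can be made concrete: stacking frames and learning per-frame weights subsumes a running mean, which is the special case where every frame gets a weight of 1/(2k+1). A sketch of the frame stacking (edge padding at the boundaries is an assumption of this sketch):

```python
import numpy as np

def stack_context(feat, k):
    """Stack k frames of context on each side: (T, d) -> (T, (2k+1)*d)."""
    T, d = feat.shape
    padded = np.pad(feat, ((k, k), (0, 0)), mode="edge")
    return np.concatenate([padded[i:i + T] for i in range(2 * k + 1)], axis=1)
```

Multiplying a stacked super-frame by block-tied weights of 1/(2k+1) reproduces exactly the running mean, so a learned linear classifier over stacked frames can do no worse.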
3 http://www.music-ir.org/mirex
4 Chord annotations available at https://github.com/tmc323/Chord-Annotations
5 https://github.com/bmcfee/librosa
6 Note that this description applies only to the baseline methods. For our DNN feature extractor, "context" means the amount of context the DNN sees. The logistic regression always sees only one frame of the feature the DNN computed.
          Btls       Iso        RWC        RW         Total
C         71.0±0.1   69.5±0.1   67.4±0.2   71.1±0.1   69.2±0.1
C^W_Log   76.0±0.1   74.2±0.1   70.3±0.3   74.4±0.2   73.0±0.1
S_Log     78.0±0.2   76.5±0.2   74.4±0.4   77.8±0.4   76.1±0.2
C_D       80.2±0.1   79.3±0.1   77.3±0.1   80.1±0.1   78.8±0.1
Table 1. Cross-validated WCSR on the Maj/min task of compared methods on various datasets. Best results are bold-faced (p < 10^−9). Small numbers indicate standard deviation over 10 experiments. "Btls" stands for the Beatles, "Iso" for Isophonics, and "RW" for the Robbie Williams datasets. Note that the Isophonics dataset comprises the Beatles, Queen and Zweieck datasets.
We determined the optimal amount of context for each baseline experimentally using the validation folds, as shown in Fig. 2. The best results achieved were 79.0% with 1.5 s context for C_D, 76.8% with 1.1 s context for S_Log, 73.3% with 3.1 s context for C^W_Log, and 69.5% with 2.7 s context for C. We fix these context lengths for testing.
6. RESULTS
Table 1 presents the results of our method compared to the baselines on several datasets. The chroma features C and C^W_Log achieve results comparable to those [7] reported on a slightly different compound dataset. Our proposed feature extractor C_D clearly performs best, with p < 10^−9 according to a paired t-test. The results indicate that the chroma vectors extracted by the proposed method are better suited for chord recognition than those computed by the baselines.
To our surprise, the raw quarter-tone spectrogram S_Log performed better than the chroma features. This indicates that computing chroma vectors in the traditional way mixes harmonically relevant features found in the time-frequency representation with irrelevant ones, and the final classifier cannot disentangle them. This raises the question of why chroma features are preferred to spectrograms in the first place. We speculate that the main reason is their much lower dimensionality and thus ease of modelling (e.g. using Gaussian mixtures).
Artificial neural networks often give good results, but it is difficult to understand what they learned, or on which basis they generate their output. In the following, we will try to dissect the proposed model, understand its workings, and see what it pays attention to. To this end, we compute saliency maps using guided back-propagation [25], adapting code freely available⁷ for the Lasagne library [9]. Leaving out the technical details, a saliency map can be interpreted as an attention map of the same size as the input. The higher the absolute saliency at a specific input dimension, the stronger its influence on the output, where positive values indicate a direct relationship, negative values an indirect one.
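The paper computes saliency via guided back-propagation; as a generic stand-in, the same quantity (sensitivity of a scalar output to each input dimension) can be approximated by central finite differences for any scalar-output model `f`. This sketch is not the guided-backprop variant, only the plain-gradient interpretation:

```python
import numpy as np

def saliency_map(f, x, eps=1e-5):
    """Numeric stand-in for a back-propagated saliency map: the gradient of a
    scalar model output f with respect to each input dimension of x."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        d = np.zeros_like(x, dtype=float)
        d.flat[i] = eps
        # central difference approximates df/dx_i
        g.flat[i] = (f(x + d) - f(x - d)) / (2.0 * eps)
    return g
```

For a quadratic toy model the central difference recovers the exact gradient, which makes this easy to sanity-check.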
Fig. 3 shows a saliency map and its corresponding super-frame, representing a C major chord. As expected,
7 https://github.com/Lasagne/Recipes/
MadMom
• Python-based audio library: http://madmom.readthedocs.io/
- Contains pre-trained neural networks for onset detection, beat tracking and chord recognition
Polyphonic Note Transcription
• Bi-directional RNN
- Input: two log-spectrograms with different window sizes (short window: onset-sensitive, long window: pitch-sensitive)
• Use a regression output layer trained with mean-squared error
- Use thresholding to detect onsets and pitches
POLYPHONIC PIANO NOTE TRANSCRIPTION WITH RECURRENT NEURAL NETWORKS
Sebastian Böck, Markus Schedl
Department of Computational Perception, Johannes Kepler University, Linz, [email protected]
ABSTRACT
In this paper a new approach for polyphonic piano note onset transcription is presented. It is based on a recurrent neural network to simultaneously detect the onsets and the pitches of the notes from spectral features. Long Short-Term Memory units are used in a bidirectional neural network to model the context of the notes. The use of a single regression output layer instead of the often used one-versus-all classification approach enables the system to significantly lower the number of erroneous note detections. Evaluation is based on common test sets and shows exceptional temporal precision combined with a significant boost in note transcription performance compared to current state-of-the-art approaches. The system is trained jointly with various synthesized piano instruments and real piano recordings and thus generalizes much better than existing systems.
Index Terms— music information retrieval, neural networks
1. INTRODUCTION
Music transcription is the process of converting an audio recording into a musical score or a similar representation. In this paper we concentrate on the transcription of piano notes, especially on the two most important aspects of notes, their pitch and onset times. To detect them as accurately as possible is crucial for a proper transcription of the musical piece. We leave out higher level tasks like determining the length of a note (given either in seconds or in a musical notation like quarter note). Also we do not consider the velocity or intensity. The output of the system is a simplified piano-roll notation of the audio signal.
Traditional music transcription systems are based on a wide range of different technologies, but all have to deal with the subtasks of estimating the fundamental frequencies and the onset locations of the notes. A very basic approach formulated by Dixon [1] solely relies on the spectral peaks of the signal to detect notes; local maxima represent the onsets and the drop of energy below a minimum threshold marks the offset of the note. Bello et al. [2] additionally incorporate time-domain features to predict multiple sounding pitches, assuming that the signal can be constructed as a linear sum of individual waveforms based on a database of piano notes. Raphael [3] proposes a probability-based system which uses a hidden Markov model (HMM) to find chord sequences. The states are represented by frames with labels based on the sounding pitches. Ryynänen and Klapuri [4] also use HMMs to model note events based on multiple fundamental frequency features. Transitions between notes are controlled via musical knowledge.
Most of today's top performing piano transcription systems rely on machine learning approaches. Marolt [5] describes an elaborate approach based on different neural networks to recognize tones in an audio recording, combined with adaptive oscillators to track partials. Poliner and Ellis [6] use multiple support vector machine (SVM) classifiers trained on spectral features to detect the sounding fundamental frequencies of a frame. Post-processing with an HMM is applied to temporally smooth the output. Boogaart and Lienhart [7] use a cascade of boosted classifiers to predict the onsets and the corresponding pitches of each note. All these systems use multiple classifiers and thus cannot reliably distinguish whether a sounding pitch is the fundamental frequency of a note or a partial of another one. This results in many false note detections. In contrast, our system uses a single regression model and is thus able to distinguish between these states and hence lowers the number of false detections significantly.
2. SYSTEM DESCRIPTION
Figure 1 shows the proposed piano transcription system. It takes a discretely sampled audio signal as its input. The signal is transferred to the frequency domain via two parallel Short-Time Fourier Transforms (STFT) with different window lengths. The logarithmic magnitude spectrogram of each STFT is then filtered to obtain a compressed representation with the frequency bins corresponding to the tone scale of a piano with a semitone resolution. This representation is used as input to a bidirectional Long Short-Term Memory (BLSTM) recurrent neural network. The output of the network is a piano-roll like representation of the note onsets for each MIDI note.
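The dual-spectrogram front end can be sketched in plain numpy. Mapping the 46.4 ms and 185.5 ms windows to 2048 and 8192 samples at 44.1 kHz, the hop size, and the Hann window are assumptions of this sketch, not the paper's exact settings:

```python
import numpy as np

def log_stft(signal, win, hop):
    """Log-magnitude STFT frames; a plain-numpy stand-in for the front end."""
    n = 1 + max(0, (len(signal) - win) // hop)
    frames = np.stack([signal[i * hop : i * hop + win] for i in range(n)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
    return np.log1p(spec)

def dual_spectrograms(signal, hop=441):
    # 46.4 ms and 185.5 ms windows are ~2048 and ~8192 samples at 44.1 kHz
    return log_stft(signal, 2048, hop), log_stft(signal, 8192, hop)
```

The short-window spectrogram yields more (coarser-frequency) frames and sharper time resolution for onsets; the long window trades frames for the frequency resolution needed to resolve pitch.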
Fig. 1: Proposed piano transcription system overview. [Block diagram: audio → two spectrograms (46.4 ms and 185.5 ms windows) → semitone filterbanks → BLSTM network → note onset & pitch detection → notes.]
978-1-4673-0046-9/12/$26.00 ©2012 IEEE, ICASSP 2012
Polyphonic piano note transcription with recurrent neural networks, Böck and Schedl, 2012
Polyphonic Note Transcription
• Frame-wise classification
- Compare DNN, ConvNet (Conv+FC) and AllConv (Conv only)
- Input: logarithmically filtered spectrogram, with a 5-frame context for the Conv models
On the Potential of Simple Framewise Approaches to Piano Transcription, Kelz et al., 2016
Optimizer: Plain SGD, Momentum, Nesterov Momentum, Adam
Learning rate: 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 50.0, 100.0
Momentum: Off, 0.7, 0.8, 0.9
Learning rate scheduler: On, Off
Batch normalization: On, Off
Dropout: Off, 0.1, 0.3, 0.5
L1 penalty: Off, 1e-07, 1e-08, 1e-09
L2 penalty: Off, 1e-07, 1e-08, 1e-09

Table 3: The list of additional hyper-parameters varied, and their ranges.
Figure 3: Mean predicted performance for the shallow net model class, dependent on learning rate (on a logarithmic scale). The dark line shows the mean performance, and the gray area shows the standard deviation.
5. STATE OF THE ART MODELS
Having completed the analysis of input representation, more powerful model classes were tried: a deep neural network consisting entirely of dense layers (DNN), a mixed network with convolutional layers directly after the input followed by dense layers (ConvNet), and an all-convolutional network (AllConv [29]). Their architectures are described in detail in Table 4. To the best of our knowledge, this is the first time an all-convolutional net has been proposed for the task of framewise piano transcription.
We computed a logarithmically filtered spectrogram with logarithmic magnitude from audio with a sample rate of 44.1 kHz, a filterbank with 48 bins per octave, normed area filters, no circular shift and no zero padding. The choices for circular shift and zero padding ranged very low on the importance scale, so we simply left them switched off. This resulted in only 229 bins, which are logarithmically spaced in the higher frequency regions, and almost linearly spaced in the lower frequency regions as mentioned in Section 2.1. The dense network was presented one frame at a time, whereas the convolutional network was given a context in time of two frames to either side of the current frame, summing to 5 frames in total.
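The spacing behaviour follows from snapping log-spaced target frequencies to FFT bins and merging duplicates: at low frequencies the 48-per-octave targets fall closer together than one FFT bin, so they collapse onto linearly spaced bins. A sketch with hypothetical parameters (the FFT size and frequency range are assumptions; the paper reports 229 bins for its exact settings):

```python
import numpy as np

def log_filtered_bin_freqs(sr=44100, n_fft=4096, fmin=30.0, fmax=8000.0,
                           bins_per_octave=48):
    """Centre frequencies after snapping log-spaced targets to the nearest
    FFT bin and merging duplicates (parameters are illustrative only)."""
    n = int(np.floor(bins_per_octave * np.log2(fmax / fmin))) + 1
    targets = fmin * 2.0 ** (np.arange(n) / bins_per_octave)
    fft_freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # nearest FFT bin per target, deduplicated (np.unique also sorts)
    idx = np.unique(np.argmin(np.abs(fft_freqs[None, :] - targets[:, None]),
                              axis=1))
    return fft_freqs[idx]
```

The resulting bin spacing is roughly one FFT bin wide at the low end (linear) and grows geometrically at the high end (logarithmic), exactly the behaviour the paper describes.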
All further hyper-parameter tuning and architectural choices have been left to a human expert. Models within a model class were selected based on average F-measure across the four validation sets. An automatic search via a hyper-parameter search algorithm for these larger model
DNN             ConvNet         AllConv
Input 229       Input 5x229     Input 5x229
Dropout 0.1     Conv 32x3x3     Conv 32x3x3
Dense 512       Conv 32x3x3     Conv 32x3x3
BatchNorm       BatchNorm       BatchNorm
Dropout 0.25    MaxPool 1x2     MaxPool 1x2
Dense 512       Dropout 0.25    Dropout 0.25
BatchNorm       Conv 64x3x3     Conv 32x1x3
Dropout 0.25    MaxPool 1x2     BatchNorm
Dense 512       Dropout 0.25    Conv 32x1x3
BatchNorm       Dense 512       BatchNorm
Dropout 0.25    Dropout 0.5     MaxPool 1x2
Dense 88        Dense 88        Dropout 0.25
                                Conv 64x1x25
                                BatchNorm
                                Conv 128x1x25
                                BatchNorm
                                Dropout 0.5
                                Conv 88x1x1
                                BatchNorm
                                AvgPool 1x6
                                Sigmoid

# Params: 691288    1877880     284544

Table 4: Model architectures.
classes, as described in [4, 11, 28], is left for future work (the training time for a convolutional model is roughly 8–9 hours on a Tesla K40 GPU, which leaves us with 204·3·4·8 hours (variants × #models × #folds × hours per model), or on the order of 800–900 days of compute time to determine the best input representation exactly).
For these powerful models, we followed practical recommendations for training neural networks via gradient descent found in [1]. Particularly relevant is the way of setting the initial learning rate. Strategies that dynamically adapt the learning rate, such as Adam or Nesterov momentum [19, 22], help to a certain extent, but still do not spare us from tuning the initial learning rate and its schedule.
We observed that using a combination of batch normalization and dropout together with very simple optimization strategies leads to low validation error fairly quickly, in terms of the number of epochs trained. The strategy that worked best for determining the learning rate and its schedule was trying learning rates on a logarithmic scale, starting at 10.0, until the optimization did not diverge anymore [1], then training until the validation error flattened out for a few epochs, then multiplying the learning rate with a factor from the set {0.1, 0.25, 0.5, 0.75}. The rates and schedules we finally settled on were:
• DNN: SGD with momentum, α = 0.1, µ = 0.9, and halving of α every 10 epochs
• ConvNet: SGD with momentum, α = 0.1, µ = 0.9, and halving of α every 5 epochs
• AllConv: SGD with momentum, α = 1.0, µ = 0.9, and halving of α every 10 epochs
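The three schedules above share the same step form, a halving of the learning rate α at a fixed epoch interval:

```python
def lr_schedule(initial_lr, epoch, halve_every):
    """Step schedule: halve the learning rate every `halve_every` epochs."""
    return initial_lr * 0.5 ** (epoch // halve_every)
```

For the DNN setting (α = 0.1, halving every 10 epochs), the rate is 0.1 for epochs 0–9, 0.05 for epochs 10–19, and so on.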
The results for framewise prediction on the MAPS dataset can be found in Table 5. It should be noted that we compare straightforward, simple, and largely un-smoothed systems (ours) with hybrid systems [26]. There is a small degree of temporal smoothing happening when processing spectrograms with convolutional nets. The term simple is supposed to mean that the resulting models have a small number of parameters and are composed of a few low-complexity building blocks. All systems are evaluated on the same train-test splits, referred to as configuration I in [26], as well as on realistic train-test splits that were constructed in the same fashion as configuration II in [26].
Model Class          P      R      F1
Hybrid DNN [26]      65.66  70.34  67.92
Hybrid RNN [26]      67.89  70.66  69.25
Hybrid ConvNet [26]  72.45  76.56  74.45
DNN                  76.63  70.12  73.11
ConvNet              80.19  78.66  79.33
AllConv              80.75  75.77  78.07

Table 5: Results on the MAPS dataset. Test set performance was averaged across 4 folds as defined in configuration I in [26].
Model Class   P      R      F1
DNN [26]      -      -      59.91
RNN [26]      -      -      57.67
ConvNet [26]  -      -      64.14
DNN           75.51  57.30  65.15
ConvNet       74.50  67.10  70.60
AllConv       76.53  63.46  69.38

Table 6: Results on the MAPS dataset. Test set performance was averaged across 4 folds as defined in configuration II in [26].
6. CONCLUSION
We argue that the results demonstrate: the importance of a proper choice of input representation; the importance of hyper-parameter tuning, especially of the learning rate and its schedule; that convolutional networks have a distinct advantage over their deep and dense siblings because of their context window; and that all-convolutional networks perform nearly as well as mixed networks, although they have far fewer parameters. We propose these straightforward, framewise transcription networks as a new state-of-the-art baseline for framewise piano transcription on the MAPS dataset.
7. ACKNOWLEDGEMENTS
This work is supported by the European Research Council (ERC Grant Agreement 670035, project CON ESPRESSIONE), the Austrian Ministries BMVIT and BMWFW, the Province of Upper Austria (via the COMET Center SCCH) and the European Union Seventh Framework Programme FP7 / 2007-2013 through the GiantSteps project (grant agreement no. 610591). We would like to thank all developers of Theano [3] and Lasagne [9] for providing comprehensive and easy to use deep learning frameworks. The Tesla K40 used for this research was donated by the NVIDIA Corporation.
8. REFERENCES
[1] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, pages 437–478. Springer, 2012.

[2] Taylor Berg-Kirkpatrick, Jacob Andreas, and Dan Klein. Unsupervised Transcription of Piano Music. In Advances in Neural Information Processing Systems, pages 1538–1546, 2014.

[3] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU Math Expression Compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.

[4] James Bergstra, Dan Yamins, and David D. Cox. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, pages 13–20, 2013.

[5] Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. madmom: a new Python Audio and Music Signal Processing Library. arXiv preprint arXiv:1605.07008, 2016.

[6] Sebastian Böck and Markus Schedl. Polyphonic piano note transcription with recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 121–124. IEEE, 2012.

[7] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392, 2012.

[8] Judith C. Brown. Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America, 89(1):425–434, 1991.

[9] Sander Dieleman, Jan Schlüter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, Daniel Nouri, Daniel Maturana, Martin Thoma, Eric Battenberg, Jack Kelly, Jeffrey De Fauw, Michael Heilman, diogo149, Brian McFee, Hendrik Weideman, takacsg84, peterderivaz, Jon, instagibbs, Dr. Kashif Rasul, CongLiu, Britefury, and Jonas Degrave. Lasagne: First release., August 2015.

[10] Sander Dieleman and Benjamin Schrauwen. End-to-end learning for music audio. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6964–6968. IEEE, 2014.
Polyphonic Note Transcription
• Convolutional recurrent neural networks (CRNN)
- Onset detection networks and frame-level pitch detection networks are combined with a causal connection
repeated notes that should be held. Note onsets are important, but a piece played with only onset information would either have to be entirely staccato or use some kind of heuristic to determine when to release notes. A high note-with-offset score corresponds to a transcription that sounds good because it captures the perceptual information from both onsets and durations. More perceptually accurate metrics may be possible and warrant further research. In this work we focus on improving the note-with-offset score, but we also achieve state-of-the-art results for the more common frame and note scores.
3. MODEL CONFIGURATION
Framewise piano transcription tasks typically process frames of raw audio and produce frames of note activations. Previous framewise prediction models [3, 4] have treated frames as both independent and of equal importance, at least prior to being processed by a separate language model. We propose that some frames are more important than others, specifically the onset frame for any given note. Piano note energy decays starting immediately after the onset, so the onset is both the easiest frame to identify and the most perceptually significant.
We take advantage of the significance of onset frames by training a dedicated note onset detector and using the raw output of that detector as additional input for the framewise note activation detector. We also use the thresholded output of the onset detector during the inference process. An activation from the frame detector is only allowed to start a note if the onset detector agrees that an onset is present in that frame.
Our onset and frame detectors are built upon the convolutional acoustic model architecture presented in [4], with some modifications. We use librosa [12] to compute the same input representation: mel-scaled spectrograms with log amplitude of the raw input audio, with 229 logarithmically-spaced frequency bins, a hop length of 512, an FFT window of 2048, and a sample rate of 16 kHz. However, instead of presenting the network with one target frame at a time, we present the entire sequence at once. The advantage of this approach is that we can then use the output of the convolution layers as input to an RNN layer.
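The framing step of this input pipeline can be sketched in pure NumPy. This is a simplified stand-in for the paper's librosa-based pipeline: it computes framewise log-magnitude FFT spectra with the stated window and hop sizes, but omits the mel filterbank that maps the 1025 FFT bins down to 229 mel bands (the function name is ours):

```python
import numpy as np

def log_stft(y, n_fft=2048, hop=512):
    # Windowed framewise FFT with log amplitude. The paper applies
    # a 229-band mel filterbank on top of these magnitudes; that
    # step is omitted here for brevity.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mag + 1e-6)  # avoid log(0)
```

At 16 kHz, a 512-sample hop gives 32 ms frames, which matters later for the onset labels.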
The onset detector is composed of the acoustic model, followed by a bidirectional LSTM [13] with 128 units in both the forward and backward directions, followed by a fully connected sigmoid layer with 88 outputs representing the probability of an onset for each of the 88 piano keys.
The frame activation detector is composed of a separate acoustic model, followed by a fully connected sigmoid layer with 88 outputs. Its output is concatenated with the output of the onset detector and followed by a bidirectional LSTM with 128 units in both the forward and backward directions. Finally, the output of that LSTM is followed by a fully connected sigmoid layer with 88 outputs. During inference, we use a threshold of 0.5 to determine whether the onset detector or frame detector is active.
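The onset-gated inference rule can be sketched as follows; `gated_notes` is a hypothetical name and this is our simplified reading of the rule, not the authors' implementation: a frame activation may only start a note if the onset detector also fires, but once started the note is sustained for as long as the frame detector stays active.

```python
import numpy as np

def gated_notes(frame_probs, onset_probs, threshold=0.5):
    # frame_probs, onset_probs: (n_frames, n_pitches) sigmoid outputs.
    frames = frame_probs >= threshold
    onsets = onset_probs >= threshold
    active = np.zeros_like(frames)
    for p in range(frames.shape[1]):
        on = False
        for t in range(frames.shape[0]):
            if not frames[t, p]:
                on = False          # frame detector released the note
            elif onsets[t, p]:
                on = True           # onset detector agrees: start note
            active[t, p] = on
    return active
```

Note that a frame activation with no preceding onset agreement is suppressed entirely, which is what removes spurious sustained-energy detections.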
Training RNNs over long sequences can require large amounts of memory and is generally faster with larger batch sizes. To expedite training, we split the training audio into smaller files. However, when we do this splitting we do not want to cut the audio during notes, because the onset detector would miss an onset while the frame detector would still need to predict the note's presence. We found that 20-second splits allowed us to achieve a reasonable batch size during training of at least 8, while also forcing splits in only a small number of places where notes are active. When notes are active and we must split, we choose a zero-crossing of the audio signal. Inference is performed on the original, un-split audio file.
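The zero-crossing snapping can be sketched as below. This is an assumption-laden simplification: the paper only snaps to zero-crossings where notes are active (which requires the label data, omitted here), whereas this sketch snaps every split point to the nearest zero-crossing within one second:

```python
import numpy as np

def split_points(y, sr=16000, segment_seconds=20):
    # Propose a split at every segment boundary, snapped to the
    # nearest zero-crossing within +/- 1 second of the boundary.
    step = segment_seconds * sr
    points = []
    for target in range(step, len(y), step):
        start = max(0, target - sr)
        window = y[start: target + sr]
        # indices where the sign of consecutive samples flips
        zc = np.where(np.diff(np.signbit(window).astype(int)) != 0)[0]
        if len(zc) == 0:
            points.append(target)       # no zero-crossing nearby
        else:
            center = target - start
            points.append(start + int(zc[np.argmin(np.abs(zc - center))]))
    return points
```

Splitting at zero-crossings avoids introducing click artifacts at the segment boundaries.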
[Fig. 1. Diagram of Network Architecture: a log mel-spectrogram feeds two conv stacks; the onset branch (conv stack → BiLSTM → FC sigmoid) produces onset predictions with an onset loss, and its output is concatenated with the frame branch's FC sigmoid output, followed by a BiLSTM and a final FC sigmoid layer producing frame predictions with a frame loss.]
Our ground truth note labels are in continuous time, but the results from audio processing are in spectrogram frames, so we quantize our labels to calculate our training loss. When quantizing, we use the same frame size as the output of the spectrogram. However, when calculating metrics, we compare our inference results against the original, continuous-time labels.
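The quantization step amounts to mapping note times onto frame indices at the spectrogram's frame rate (sr / hop = 16000 / 512 = 31.25 frames per second, i.e. 32 ms frames); a minimal sketch, with a function name of our own choosing:

```python
def quantize_label(onset_sec, offset_sec, sr=16000, hop=512):
    # Map a continuous-time note interval to spectrogram frame
    # indices using the same hop size as the input representation.
    frames_per_sec = sr / hop               # 31.25, i.e. 32 ms frames
    start = int(round(onset_sec * frames_per_sec))
    end = int(round(offset_sec * frames_per_sec))
    return start, max(end, start + 1)       # a note spans >= 1 frame
```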
Our loss function is the sum of two cross-entropy losses: one from the onset side and one from the note side.

    L_total = L_onset + L_frame    (1)

    L_onset(l, p) = Σ_i [ −l(i) log p(i) − (1 − l(i)) log(1 − p(i)) ]    (2)
where l denotes the onset labels and p the onset predictions. The labels for the onset loss are created by truncating note lengths to min(note_length, onset_length) prior to quantization. We performed a coarse hyperparameter search over onset_length and found that 32 ms worked best. In hindsight this is not surprising, as it is also the length of our frames, so almost all onsets will end up spanning exactly two frames. Labeling only the frame that contains the exact beginning of the onset doesn't work as well because of possible misalignments of the audio and labels. We experimented with requiring a minimum amount of time a note had to be present in a frame before it was labeled, but found that the optimum was to include any presence.
Even within the frame-based loss term, we apply a weighting to encourage accuracy at the start of the note. A note starts at frame t1, completes its onset at t2, and ends at frame t3. Because the weight vector assigns higher weights to the early frames of notes, the model is encouraged to predict those frames accurately.
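One plausible form of such a weight vector is sketched below. The excerpt does not give the exact weights, so this is purely illustrative: `onset_weight` is an assumed hyperparameter, and the shape of the weighting (a flat raised weight over the onset region) is our guess, not the paper's definition:

```python
import numpy as np

def frame_weights(t1, t2, t3, onset_weight=5.0):
    # Hypothetical per-frame loss weights for one note: frames from
    # the note start t1 up to the end of its onset t2 get a raised
    # weight (onset_weight is an assumed value, not from the paper);
    # the remaining frames up to the note end t3 get weight 1.
    w = np.ones(t3 - t1)
    w[: t2 - t1] = onset_weight
    return w
```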
Onsets and Frames: Dual-Objective Piano Transcription, Hawthorne et al., 2018

[Figure legend: blue = frame prediction, red = onset prediction, magenta = both.]
[Figure legend: yellow = true positive, red = false negative, green = false positive.]
Polyphonic Note Transcription
• The “Onsets and Frames” model outperforms the previous state of the art
                          Frame                  Note                   Note with offset
                          P      R      F1       P      R      F1       P      R      F1
Sigtia [3] (our reimpl.)  71.99  73.32  72.22    44.97  49.55  46.58    17.64  19.71  18.38
Kelz [4] (our reimpl.)    81.18  65.07  71.60    44.27  61.29  50.94    20.13  27.80  23.14
Melodyne (decay mode)     71.85  50.39  58.57    62.08  48.53  54.02    21.09  16.56  18.40
Onsets and Frames         88.53  70.89  78.30    84.24  80.67  82.29    51.32  49.31  50.22

Table 1. Results on the MAPS configuration 2 test dataset (ENSTDkCl and ENSTDkAm full-length .wav files). Note-based scores are calculated by the mir_eval library, frame-based scores as defined in [11]. The final metric is the mean of scores calculated per piece. MIDI files used to calculate these scores are available at https://goo.gl/U3YoJz.
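The note-based scores come from mir_eval; a greatly simplified sketch of the idea behind them is below. This is not mir_eval's algorithm (which also checks pitch agreement and uses optimal matching, and whose "note with offset" variant additionally requires offset agreement); it only illustrates onset matching within a ±50 ms tolerance, with each note matched at most once:

```python
import numpy as np

def note_f1(ref_onsets, est_onsets, tolerance=0.05):
    # Greedy one-to-one matching of estimated to reference onsets
    # within +/- tolerance seconds, then the usual P/R/F1.
    ref = list(ref_onsets)
    tp = 0
    for est in est_onsets:
        dists = [abs(est - r) for r in ref]
        if dists and min(dists) <= tolerance:
            tp += 1
            ref.pop(int(np.argmin(dists)))   # consume matched note
    precision = tp / len(est_onsets) if est_onsets else 0.0
    recall = tp / len(ref_onsets) if ref_onsets else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```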
Removing any single component has only a small impact (< 6%) on the results. It is encouraging that forward-only RNNs cause only a small accuracy drop, as they can be used for online piano transcription.
We tried many other architectures and data augmentation strategies not listed in the table, none of which resulted in any improvement. Significantly, augmenting the training audio by adding normalization, reverb, compression, and noise, and synthesizing the training MIDI files with other synthesizers, made no difference. We believe this indicates a need for a much larger training dataset of real piano recordings that have fully accurate label alignments. These requirements are not satisfied by the current MAPS dataset, because only 60 of its 270 recordings are from real pianos, and they are also not satisfied by MusicNet [16], because its alignments are not fully accurate. Other approaches, such as seq2seq [17], may not require fully accurate alignments.
                                       F1
                            Frame   Note    Note with offset
Onsets and Frames           78.30   82.29   50.22
(a) Frame-only LSTM         76.12   62.71   27.89
(b) No Onset Inference      78.37   67.44   34.15
(c) Onset forward LSTM      75.98   80.77   46.36
(d) Frame forward LSTM      76.30   82.27   49.50
(e) No Onset LSTM           75.90   80.99   46.14
(f) Pretrain Onsets         75.56   81.95   48.02
(g) No Weighted Loss        75.54   80.07   48.55
(h) Shared conv             76.85   81.64   43.61
(i) Disconnected Detectors  73.91   82.67   44.83
(j) CQT Input               73.07   76.38   41.14
(k) No LSTM, shared conv    67.60   75.34   37.03

Table 2. Ablation study results (F1 scores).
6. NEED FOR MORE DATA, MORE RIGOROUS EVALUATION
The most common dataset for evaluation of piano transcription tasks is the MAPS dataset, in particular the ENSTDkCl and ENSTDkAm renderings of the MUS collection of pieces. This set has several desirable properties: the pieces are real music as opposed to randomly generated sequences, the pieces are played on a real physical piano as opposed to a synthesizer, and multiple recording environments are available (“close” and “ambient” configurations). The main drawback of this dataset is that it comprises only 60 .wav files.
Many papers, for example [8, 3, 18, 19], further restrict the data used in evaluation by using only the “close” collection and/or only the first 30 seconds or less of each file. We believe this results in an evaluation that is not representative of real-world transcription tasks. Table 3 shows how the score of our model increases dramatically as we increasingly restrict the dataset.
                                  Note
                                  P      R      F1
Cl and Am, full length            84.00  80.25  81.96
Cl only, full length              85.95  83.05  84.34
Cl only, first 30 s               87.13  85.96  86.38
Wang [8], Cl only, first 30 s     85.93  75.24  80.23
Gao [18], Cl only, first 30 s*    83.38  87.34  85.06

Table 3. Model results on various dataset configurations. *Results from Gao cannot be directly compared to the other results in this table because their model was trained on data from the test piano.
In addition to the small number of MAPS Disklavier recordings, we have also noticed several cases where the Disklavier appears to skip some notes played at low velocity. For example, at the beginning of the Beethoven Sonata No. 9, 2nd movement, several A♭ notes played with MIDI velocities in the mid-20s are clearly missing from the audio (https://goo.gl/U3YoJz). More analysis is needed to determine how frequently missed notes occur, but we have noticed that our model performs particularly poorly on notes with velocities below 30.
To best measure transcription quality, we believe a new and much larger dataset is needed. However, until that exists, evaluations should make full use of the data that is currently available.
7. CONCLUSION AND FUTURE WORK
We demonstrate a jointly trained onsets-and-frames model for transcribing polyphonic piano music and show that using onset information during inference yields significant improvements. The model transfers well between the disparate train and test distributions.
The current quality of the model's output is on the cusp of enabling downstream applications such as MIR and automatic music generation. To further improve the results, we need to create a new dataset that is much larger and more representative of various piano recording environments and music genres, for both training and evaluation. Combining an improved acoustic model with a language model is a natural next step. Another direction is to go beyond traditional spectrogram representations of audio signals; dilated convolutions [20] could enable sub-frame timing predictions.
It is very much worth listening to the transcription examples. Consider the Mozart Sonata K331, 3rd movement: our system does a good job of capturing harmony, melody, and even rhythm, and compared to the other systems the difference is quite audible. Audio examples are available at https://goo.gl/U3YoJz.
Polyphonic Note Transcription
• Demo: Onsets and Frames
- https://magenta.tensorflow.org/onsets-frames