Musical style transfer with generative neural network models

Master's dissertation submitted in order to obtain the academic degree of
Master of Science in Computer Science Engineering

Academic year 2018-2019
Supervisors: Prof. dr. Tijl De Bie, Dr. ir. Thomas Demeester
Maarten Moens
Student number: 01304857
Preface
In this preface I would like to express my gratitude to the people who made this
dissertation possible.
I would like to thank my mother for all the love and hard work she has done for me.
Without her, I probably would never have succeeded at even getting to my Master’s. I
would like to thank my dad for inspiring me and sparking my interest in science and
maths. May you both live long and healthy lives.
I would like to thank Thomas and Tijl, my supervisors, for giving me advice all year
round. I would have been very lost without you.
A special thanks to Afif Hasan, for feeding me when I was hungry, for helping me with
setting up the survey, for proofreading my dissertation and for countless other things.
Thank you, from the bottom of my heart.
A thank you to all the people I’ve ever played music with and to everyone who
makes me actively enjoy music, even when I’m not playing it myself. In particular, a thank
you to Jeroen De Vos, for inspiring me to pick up the bass guitar.
Last but not least, a thank you to all of my engineering friends. My time at UGent
would have been a lot duller without you.
Permission for loan
The author gives permission to make this master’s dissertation available for consultation
and to copy parts of this master’s dissertation for personal use. In all cases of other use,
the copyright terms have to be respected, in particular with regard to the obligation to
explicitly state the source when quoting results from this master’s dissertation.
Maarten Moens, 2019
Abstract
Overview
• Master’s dissertation submitted in order to obtain the academic degree of Master
of Science in Computer Science Engineering
• Title: Musical style transfer with generative neural network models
• Author: Maarten Moens
• Supervisors: Prof. dr. Tijl De Bie, Dr. ir. Thomas Demeester
• Faculty of Engineering and Architecture, Academic year 2018-2019
Summary
The research presented proposes a neural algorithm for making realistic covers of human
composed music available in a symbolic format. Using the Cyclic Generative Adversarial
Network algorithm, we have trained a network to stylize music from one out of two specific
genres as music from the other one. The quality of this generated music has been evaluated
by humans and a comparison with human composed symbolic music is made. Whilst our
generated music is less pleasing than human-composed music, respondents rated it as
being closer to pleasant than unpleasant.
Keywords
Music Generation, Neural Style Transfer, Generative Adversarial Networks
Musical style transfer with generative neural network models
Maarten Moens
Supervisor(s): Prof. Dr. Tijl De Bie, Dr. Ir. Thomas Demeester
Abstract— The research presented proposes a neural algorithm for making realistic covers of human composed music available in a symbolic format. Using the Cyclic Generative Adversarial Network algorithm, we have trained a network to stylize music from one out of two specific genres as music from the other one. The quality of this generated music has been evaluated by humans and a comparison with human composed symbolic music is made. Whilst our generated music is less pleasing than human-composed music, respondents rated it as being closer to pleasant than unpleasant.
Keywords— Music Generation, Neural Style Transfer, Generative Adversarial Networks
I. Introduction
Man has tried to make algorithms for deceptively realistic computer art since the dawn of the computer. As creating art is one of the aspects of human intelligence that most distinguishes us from other sentient beings, understanding and artificially replicating it might give us insight into the inner workings of the human brain.
To advance our knowledge of one sub-task in this grand endeavor, this dissertation focuses on generating musical pieces, pleasant to the human ear, that are interpretations of other existing, symbolic, human-composed examples, using modern generative machine learning techniques. In particular, deep neural networks and adversarial networks were used to generate this music.
II. Related Works
This section covers related literature consulted during the research. First, let us introduce a couple of terms. A musical style transfer is the generation of music featuring the content of an input song, stylized to match the style of a genre learned during training. Symbolic music is music represented by some type of high-level notation instead of by a waveform; no characteristics of human performance or instrument quality are thus passed on. In this section, we will only introduce papers significant to symbolic musical style transfer. Some research has been done on waveform audio style transfer as well, but we suspect those systems have fundamentally different models and characteristics than ours.
Malik (2017) presents GenreNet, a network trained on note velocity data for different genres. [1] By using a single encoder and multiple LSTM decoders (one for each genre), the network outputs note velocities for input songs. 46% of songs were correctly identified, 25% were wrongly identified and 25% could not be identified. Thus, it passed what the author calls the "musical Turing test".
Spitael (2017) presents a master's dissertation on style transfer between classical composers. The author tried to transfer the style of one composer to a composition of another by coupling multiple decoders (one for each composer) to a single universal encoder. A new way of vectorizing MIDI was proposed: instead of the piano-roll representation (a vector representing the complete state of the keyboard at every time instant), only one pitch (input as a value, not as a vector), its corresponding velocity and the elapsed time would be encoded. The system would still be polyphonic, as one could input multiple notes with a time difference of zero between them. An expert and non-expert audience found the generated transferred songs more pleasing than random noise, but preferred the originals.
De Coster (2017) presents a master's dissertation on using RNNs to generate polyphonic music with style transitions. [2] Instead of using different decoders for different composers, as his colleague Spitael did, the author tried to predict the next sample given a certain input sample and a given composer whose style the generator tries to match. This output can then be fed into the network again, generating new sequences. During generation, the style vector can change from one composer to another, forcing the machine to predict a different future output and making it possible to interpolate between composers. The author assessed that the quality of the output is lower than that of the real compositions, but not by much.
Brunner (2018) used a Variational Auto Encoder to generate style-transferred music by adding a genre tag to the features emitted by the encoder. [3] By training the VAE until it reproduced the original input, one could simply swap the genre tag during evaluation to listen to the style-transferred output. One of the difficulties with this setup is that the genre has to be a meaningful feature for the decoder: if it just ignores that particular feature, swapping the genre tag will not result in much change. Contrary to most other techniques, they use note velocity as an input. Note length is also fed separately to the encoder.
In another paper, Brunner (2018) used a Cycle GAN setup to generate style-transferred music. [4] To evaluate the generated music automatically, they used a genre classifier trained on three genres. If the classifier placed the generated song in the correct genre, the sample was evaluated positively. They transformed songs bar per bar, transforming whole songs this way for some of the examples uploaded online. No human evaluation was done in their paper.
III. Problem Statement
Consider two datasets consisting of fixed-length excerpts of songs, SA and SB. We wish to generate an excerpt that would fit the general style of dataset SB, based on an element from SA, in such a way that the element from SA remains recognizable. Let us call the excerpt from SA x and the generated excerpt x′. Our algorithm is successful if:
• A human observer finds the generated excerpt x′ pleasing to listen to.
• A human observer could imagine x′ being a part of the dataset SB.
• A human observer could recognize the similarity between x and x′.
An example: the nursery rhyme ’Mary Had a Little Lamb’ has a very recognizable melody, but it is rhythmically simple. Jazz is a genre known for its musical intricacies, especially in terms of rhythm. If a human were to create a jazz cover of ’Mary Had a Little Lamb’, the rhythm would change quite a bit, but the melody would still be recognizable. We want our algorithm to behave the same way.
IV. Dataset and representation
A dataset was created by searching the internet for songs downloadable in the MIDI format. We chose to use only symbolic music written for solo piano, for a number of reasons:
• Piano songs are mostly polyphonic. Most songs have a melody played by the right hand and an accompaniment played by the left hand.
• Compared to songs using one distinct instrument to play melodies and another to play basslines, cutting the right tracks out of the MIDI files was easier.
• Outside of note velocity (which is something to account for with every musical instrument), no unusual nuances can be played on a piano. There is no modulation of the pitch as with string instruments ("string bending") and no multiple ways to sound the notes of the keyboard (such as pizzicato versus bowed for certain string instruments). Such information would be cut out in the binarization.
Two genres with a large library of songs written for solo piano are Classical music and Ragtime music. The songs that fit the above criteria were then binarized into piano rolls, quantized at a sixteenth-note level, with an excerpt consisting of four bars of sixteen of these sixteenth notes each. One can imagine these piano rolls as multi-hot vectors corresponding to the pitches played on the piano, one for every quantized timestep. This quantization is a non-trivial problem: we used the pypianoroll library for it. [5]
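The binarization just described can be sketched as follows; `to_pianoroll` and its note-event input format are illustrative assumptions, not the dissertation's actual pipeline (which relies on pypianoroll):

```python
import numpy as np

def to_pianoroll(notes, n_steps=64, n_pitches=128):
    """Binarize note events into a multi-hot piano roll.

    `notes` is a list of (pitch, start, end) tuples whose start/end are
    already quantized to sixteenth-note steps. Rows are timesteps (4 bars
    of 16 sixteenth notes = 64), columns are MIDI pitches; an entry is 1
    if that pitch sounds at that timestep. Velocity is discarded, as in
    the binarization described in the text.
    """
    roll = np.zeros((n_steps, n_pitches), dtype=np.uint8)
    for pitch, start, end in notes:
        roll[start:end, pitch] = 1
    return roll

# A C major chord (C4, E4, G4) held for one bar of 16 sixteenth notes.
excerpt = to_pianoroll([(60, 0, 16), (64, 0, 16), (67, 0, 16)])
```

One 4-bar excerpt is thus a 64 x 128 binary matrix, the input and output shape of the generators.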
By searching the internet, a set of 240 songs per genre was created. To remove bias, every song was pitch-shifted into all twelve keys so that every key occurred equally often in the dataset. This gave us 2880 different songs. This dataset was then split into training, validation and test sets consisting of 80%, 10% and 10% of the data respectively. Up to ten excerpts of four bars each were taken out of these songs.
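The key augmentation and split described above could look roughly like this minimal numpy sketch (the function names and the use of `np.roll` are our assumptions; wrap-around at the extremes of the pitch axis is ignored here):

```python
import numpy as np

def augment_keys(roll):
    """Pitch-shift a piano roll into all 12 keys by shifting the pitch axis.

    Shifts of -5..+6 semitones cover each key exactly once. np.roll wraps
    pitches around at the edges, which is harmless for typical piano ranges.
    """
    return [np.roll(roll, shift, axis=1) for shift in range(-5, 7)]

def split_dataset(songs, seed=0):
    """Shuffle and split songs 80/10/10 into train/validation/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(songs))
    n_train = int(0.8 * len(songs))
    n_val = int(0.1 * len(songs))
    train = [songs[i] for i in idx[:n_train]]
    val = [songs[i] for i in idx[n_train:n_train + n_val]]
    test = [songs[i] for i in idx[n_train + n_val:]]
    return train, val, test
```

Splitting at the song level, before excerpting, keeps excerpts of the same song out of different splits.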
V. Methodology
Our methodology is discussed in this section. As mentioned in the abstract and the introduction, we used a Cycle Consistent Generative Adversarial Network to generate new music from the excerpts in the dataset. A GAN or generative adversarial network consists of two sub-networks: a generator that tries to generate truthful data and a discriminator that tries to tell the real data from the fake data. [6] The discriminator is trained on example data from the dataset and on the output of the generator. The generator tries to optimize its output in such a way that the discriminator can't tell the difference between original and generated data. This is visualized in Figure 1.
Fig. 1
The standard GAN architecture visualized. Image from
Slideshare. [7]
In Equation 1 and Equation 2, the loss functions for a standard generator and discriminator are shown: σ stands for the sigmoid function, x for original data samples from the training dataset and z for a random noise sample.
DD(x, z) = −log(σ(D(x)))− log(σ(−D(G(z)))) (1)
DG(x, z) = −log(σ(D(G(z)))) (2)
A Cycle Consistent GAN uses two pairs of generators and discriminators instead of one. Consider two datasets X and Y. One generator, G, tries to generate artificial data as if it were elements of Y, given elements from X as input. The other generator, F, tries to generate artificial data as if it were elements of X, given an element from Y as input. One discriminator is trained using the elements of X and artificial elements generated by F, the other using the elements of Y and artificial elements generated by G. The cycle consistency comes from a loss function forcing the application of F to the output of G to equal the input given to G, and vice versa. G and F should thus be inverses of one another. This concept is visualised in Figure 2 and Figure 3.
Fig. 2
The Cycle GAN algorithm. Image from Zhu (2017). [8]
Fig. 3
A more concrete visualization of the Cycle GAN algorithm
with musical examples. [8]
One can also enforce idempotency by adding a loss function where the generators have to perform an identity mapping on certain elements, for instance G(y) = y and F(x) = x. The loss functions for this type of GAN can be seen in Equation 3 and Equation 4. The α and β variables are hyperparameters that control the relative importance of the invertibility and idempotency losses. The complete loss function for the generator can be found in Equation 5. Please note that this loss is centered around the generator G; the loss for F is symmetrical to the one given. The loss function for the discriminators does not change with respect to standard GANs.
LFGcyc(x, y) = α|F (G(x))− x|+ β|G(y)− y| (3)
LFGstyle(x) = −log(σ(DY (G(x)))) (4)
LFG(x, y) = LFGcyc(x, y) + LFGstyle(x) (5)
As a caveat: since our generators try to predict whether a specific pitch will be played or not at a specific timestep, a more logical metric for comparing the input x and the output F(G(x)) would be the Binary Cross Entropy, instead of the L1 loss presented in Equation 3.
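A minimal numpy sketch of the generator losses in Equations 3 to 5, plus the Binary Cross Entropy variant from the caveat. The arguments `F_G_x`, `G_y` and `d_out` are placeholders for network outputs, and the function names are ours, not the dissertation's:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def cycle_loss(x, y, F_G_x, G_y, alpha=10.0, beta=0.0):
    """Equation 3: alpha-weighted invertibility term |F(G(x)) - x| plus
    beta-weighted idempotency term |G(y) - y| (L1, averaged)."""
    return alpha * np.abs(F_G_x - x).mean() + beta * np.abs(G_y - y).mean()

def style_loss(d_out):
    """Equation 4: -log(sigmoid(D_Y(G(x)))), given the discriminator output."""
    return -np.log(sigmoid(d_out)).mean()

def generator_loss(x, y, F_G_x, G_y, d_out, alpha=10.0, beta=0.0):
    """Equation 5: the sum of the cycle and style losses for G."""
    return cycle_loss(x, y, F_G_x, G_y, alpha, beta) + style_loss(d_out)

def bce_cycle_loss(x, F_G_x, eps=1e-7):
    """Binary Cross Entropy alternative to the L1 term of Equation 3,
    for per-pitch, per-timestep note probabilities in [0, 1]."""
    p = np.clip(F_G_x, eps, 1 - eps)
    return -(x * np.log(p) + (1 - x) * np.log(1 - p)).mean()
```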
The Wasserstein GAN paper introduced a new class of loss functions that do not use sigmoids to bound the discriminator's output. [9] To use this type of loss function, one's discriminator has to be Lipschitz continuous; a function f is Lipschitz continuous if Equation 6 holds. Wasserstein GANs are far less prone to mode collapse, a failure mode for GANs where only one output is generated by the generator. They are also less sensitive to hyperparameter changes.
||f(x)− f(y)|| ≤ ||x− y|| (6)
Two possible combinations of discriminator and generator loss functions for Wasserstein GANs can be seen in Equation 7 and Equation 8, which we will call WGAN style loss functions from now on, and in Equation 9 and Equation 10, which we will call Hinge style loss functions from now on. The R in Equation 9 stands for the Rectified Linear function: R(x) = max(0, x).
DDW (x, z) = −ExD(x) + E
zD(G(z)) (7)
DGW (x, z) = −EzD(G(z)) (8)
DDH(x, z) = ExR(1−D(x)) + E
zR(1 +D(G(z))) (9)
DGH(x, z) = −EzD(G(z)) (10)
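Equations 7 to 10 can be computed directly from batches of discriminator outputs; a small numpy sketch (the function names are ours):

```python
import numpy as np

def relu(v):
    """The Rectified Linear function R(x) = max(0, x) from Equation 9."""
    return np.maximum(0.0, v)

def wgan_d_loss(d_real, d_fake):
    """Equation 7: -E[D(x)] + E[D(G(z))]; requires a Lipschitz D."""
    return -d_real.mean() + d_fake.mean()

def wgan_g_loss(d_fake):
    """Equations 8 and 10: the generator loss -E[D(G(z))] is identical
    for the WGAN and Hinge style variants."""
    return -d_fake.mean()

def hinge_d_loss(d_real, d_fake):
    """Equation 9: E[R(1 - D(x))] + E[R(1 + D(G(z)))]."""
    return relu(1.0 - d_real).mean() + relu(1.0 + d_fake).mean()
```

Note that the hinge discriminator loss saturates at zero once real and fake outputs are separated by the margin, while the WGAN loss is unbounded.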
These functions F and G are implemented with an encoder-decoder style setup. As our examples are of fixed length, we chose a convolutional neural network architecture. Inspired by Zhang (2018), we also added attentional layers. [10] We will discuss the two attempted architectures:
• Architecture A: the architecture we did the most human evaluation on, which uses a Hinge style loss and attention in the discriminator. This architecture showed irregular learning behavior.
• Architecture B: an architecture which improved upon Architecture A's learning behavior. It uses WGAN style loss functions.
One can see architecture A visualized in Figure 4 and Figure 5. We used Batch Norm in the generator to speed up training. [11] We used Spectral Norm in the discriminator to enforce Lipschitz continuity. [12] The sigmoid activation function in the last layer ensures that our output is between 0 and 1. Please note that the kernel sizes in the downsampling and upsampling layers, i.e. those with a stride different from one, are multiples of their strides. Odena (2016) shows that this kernel size is preferable to avoid checkerboarding artifacts. [13] For hyperparameters, architecture A used an α of 10 and a β of 0, a learning rate of 0.0005 for the generators, a learning rate of 0.005 for the discriminators and a batch size of 40.
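Odena (2016)'s guideline amounts to a one-line check per resampling layer (a hypothetical helper, not code from the dissertation):

```python
def checkerboard_safe(kernel_size, stride):
    """True if a conv/transposed-conv layer's kernel size is a multiple of
    its stride, so every output position is covered by the same number of
    kernel applications and no checkerboard artifact arises."""
    return stride == 1 or kernel_size % stride == 0

# e.g. a stride-2 upsampling layer with kernel size 4 is safe; kernel size 3 is not.
```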
Fig. 4
Architecture A Generator.
As the quality of the output of architecture A regressed after training for longer periods of time, early stopping had to be employed. A training imbalance was noted: one generator's quality stayed more or less constant, whilst the quality of the other generator degraded. This can be seen in Figure 6 (the lower the loss, the better the quality). Architecture B is very similar to architecture A. The generator runs a bit deeper and the discriminator does not use attention. We suspect that the attentional layer broke the Lipschitz continuity of the discriminator and that this caused the training instability. The number of channels was also changed to better reflect the amount of data removed by downsampling. For hyperparameters, architecture B used an α of 1 and a β of 1, a learning rate of 0.0005 for the generators, a learning rate of 0.001 for the discriminators and a batch size of 40. One can see the generator and the discriminator of architecture B in Figure 7 and Figure 8. Early stopping did not have to be used for Architecture B. One can see the improved training profile in Figure 9. Also note that the training and evaluation cycle consistency losses are very close to zero: they were lower than in Figure 6, but this is not really visible on the graph.

Fig. 5
Architecture A Discriminator.

Fig. 6
Training imbalance.
VI. Results
In this section the generated music of both architectures will be evaluated. As no survey could be conducted for Architecture B, only general properties of its generated music will be described.

Fig. 7
Architecture B Generator.
A. Architecture A
In this subsection we will describe the music generated by architecture A and show the results of the survey conducted on its output. The amount of rhythm change generated by architecture A was minimal. The network sometimes tried adding extra notes to a song and also had a tendency to change the key of the song. A survey was conducted to assess the quality of the samples generated by this network, once with and once without attention layers in the generator. 41 persons responded, 51.4% of whom called themselves knowledgeable about music. All tests were single blind. In the first section, the respondents listened to one example song from the test set and its style transfer. In general, only 15% of the respondents found the style-transferred samples pleasing to listen to when compared with the original. However, the respondents scored the samples high in terms of the possibility of being a reinterpretation of the originals, as can be seen in Table I.

Fig. 8
Architecture B Discriminator.

Fig. 9
Training profile of architecture B.
Genre of og. | Architecture | Yes | Neutral | No
Classical | Attention | 71.1% | 15.8% | 13.2%
Ragtime | Attention | 42.1% | 23.7% | 34.2%
Classical | No Attention | 47.4% | 34.2% | 18.4%
Ragtime | No Attention | 44.7% | 28.9% | 26.3%

TABLE I
Raw results of the reinterpretation questions in the A-B comparisons between generated and non-generated excerpts.
In the second section, respondents had to listen to three generated samples and one excerpt of a song composed by a human, and give these fragments a score between 1 and 5 in terms of pleasingness. The respondents were asked to do this twice. In the first question, one of the generated samples came from a generator that failed catastrophically, one from Architecture A with attention and one from Architecture A without attention. In the second question, one of the generated samples came from a generator that failed because it only emitted one type of output, albeit one with a certain appealing quality to it, which shall be referred to as the Pleasant Collapse from here on out; one from Architecture A with attention and one from Architecture A without attention. The results can be seen in Table II.
Type | Avg Score
Real | 4.603
Generator Attention | 2.778
Generator No Attention | 2.793
Generated Classical (no collapse) | 2.699
Generated Ragtime (no collapse) | 2.872
Catastrophic Failure | 1.842
Pleasant Collapse | 3.474

TABLE II
Averages of qualitative experiment.
Our generated music scored better than the catastrophic failure. No big difference in pleasingness between the generators was noted. Strangely enough, people found the Pleasant Collapse to sound better than the generated samples. Neither comes close to the average score given to the human-composed fragment, however.
B. Architecture B
In this subsection we will describe the music generated by architecture B. No survey was conducted to evaluate its output. Output generated by Architecture B used the melody almost verbatim, but a big difference in the accompaniment of the melody was noticeable. For instance, the Classical-to-Ragtime generator learned a specific rhythm pattern that fits the genre quite well. This behavior was not present in Architecture A.
C. Comparison with recent research
The setup used by Brunner (2018) is very similar to ours. [4] The size of their dataset is roughly the same as our unaugmented dataset; as we augmented our data by shifting the keys, our dataset became 12 times larger than theirs. Brunner (2018) generated covers bar per bar, instead of in groups of four bars like we do. We suspect their model exhibited mode collapse or an imbalance like our model did, as they stopped after training for only 30 epochs.
VII. Conclusion and Future Work
Cycle Consistent GANs are a viable way to generate realistic covers. Whilst there is still a significant difference between the quality of human-made music and our computer-generated covers, an architecture which can be trained without early stopping might be the beginning of much improved generators. We have also shown that Cycle GANs can produce generated music from larger source inputs than Brunner (2018): we use four bars per sample, while they use only one. [4] Music generated by this algorithm was rated by humans as being closer to pleasing music than to non-pleasing music. One thing that could speed up development in this field by a large margin would be a musical analogue to the Fréchet Inception Distance [14] for quantitative computer evaluation of generated music, as evaluation by a developer is time-consuming. An interesting angle of attack for the future of style transfer is the use of a Transformer-like architecture for the generators, which Huang (2018) has shown to be very effective in the generation of music. [15]
References
[1] Iman Malik and Carl Henrik Ek, “Neural translation of musical style,” CoRR, vol. abs/1708.03535, 2017.
[2] Mathieu De Coster, “Polyphonic music generation with style transitions using recurrent neural networks,” Master's dissertation, Ghent University, 2017.
[3] Gino Brunner, Andres Konrad, Yuyi Wang, and Roger Wattenhofer, “MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer,” in 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, September 2018.
[4] Gino Brunner, Yuyi Wang, Roger Wattenhofer, and Sumu Zhao, “Symbolic music genre transfer with CycleGAN,” CoRR, vol. abs/1809.07575, 2018.
[5] Hao-Wen Dong, Wen-Yi Hsiao, and Yi-Hsuan Yang, “Pypianoroll: Open source Python package for handling multitrack pianorolls,” in Late-Breaking Demos of the 19th International Society for Music Information Retrieval Conference (ISMIR), 2018.
[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” ArXiv e-prints, June 2014.
[7] Kevin McGuinness, “Deep learning for computer vision: Generative models and adversarial training,” Slideshare, August 2016.
[8] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks,” ArXiv e-prints, Mar. 2017.
[9] Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning, Doina Precup and Yee Whye Teh, Eds., International Convention Centre, Sydney, Australia, 06-11 Aug 2017, vol. 70 of Proceedings of Machine Learning Research, pp. 214-223, PMLR.
[10] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena, “Self-Attention Generative Adversarial Networks,” arXiv e-prints, p. arXiv:1805.08318, May 2018.
[11] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015.
[12] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida, “Spectral normalization for generative adversarial networks,” CoRR, vol. abs/1802.05957, 2018.
[13] Augustus Odena, Vincent Dumoulin, and Chris Olah, “Deconvolution and checkerboard artifacts,” Distill, 2016.
[14] Shaohui Liu, Yi Wei, Jiwen Lu, and Jie Zhou, “An improved evaluation framework for generative adversarial networks,” CoRR, vol. abs/1803.07474, 2018.
[15] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, and Douglas Eck, “An improved relative self-attention mechanism for transformer with application to music generation,” CoRR, vol. abs/1809.04281, 2018.
Contents
1 Introduction
2 Neural Network Building Blocks
  2.1 Convolutional Neural Networks
    2.1.1 Convolutional Network Layer
    2.1.2 Pooling
    2.1.3 Transposed Convolutional Layers
  2.2 Recurrent Neural Networks
    2.2.1 LSTM
    2.2.2 GRU
  2.3 Residual Network
  2.4 Normalization
    2.4.1 Network layer output normalization
    2.4.2 Layer Weight normalization
    2.4.3 Lipschitz continuity
    2.4.4 Spectral Normalization
  2.5 Attention
  2.6 Dropout
3 Generative Neural Networks
  3.1 Variational Auto Encoder
  3.2 Generative Adversarial Network
    3.2.1 Original GAN
    3.2.2 Mode collapse
    3.2.3 Wasserstein GAN and other improvements
4 Style Transfer
  4.1 Pioneering style transfer works
  4.2 Cycle consistent style transfer
5 State of the Art in Musical Style Transfer
  5.1 Generative techniques
  5.2 Music Generation
  5.3 Previous master's dissertations received
  5.4 Musical Style Transfer
6 Dataset
  6.1 MIDI
    6.1.1 Piano rolls
    6.1.2 Song selection
7 Architecture
  7.1 High level overview
  7.2 Original Implementation
  7.3 Training problems
  7.4 Improved Implementation
8 Results
  8.1 Researchers evaluation
  8.2 Human Evaluation
9 Conclusion
List of Figures
2.1 Application of a Sobel filter visualized. Image from Medium. [1]
2.2 One dimensional transposed convolution visualized. Image from Distill. [2]
2.3 View of the inner workings of a peephole LSTM. Image from Medium. [3]
2.4 The Resnet pattern visualised. Image from He (2015). [4]
2.5 Resnet training curves. Image from He (2015). [4]
2.6 Multiple types of layer output normalization. Image from Towards Data Science. [5]
2.7 Lipschitz cone. Image from Wikimedia. [6]
2.8 Picture explaining the SAGAN self attention implementation. Image from Zhang (2018). [7]
2.9 Comparison between the o variable in the attention algorithm and the original input to the layer.
2.10 Dropout visualization. Image from Srivastava (2014). [8]
3.1 A VAE architecture. Image from Github.io. [9]
3.2 The standard GAN architecture visualized. Image from Slideshare. [10]
3.3 An example of mode collapse from a network learning the popular MNIST dataset. Image from Metz (2016). [11]
3.4 The DCGAN guidelines. Image from Radford (2015). [12]
3.5 The DCGAN architecture. Image from Radford (2015). [12]
4.1 Filter responses throughout the convolutional network. Image from Gatys (2014). [13]
4.2 Example of style transfer. Image from Gatys (2015). [13]
4.3 How the offline style transfer algorithm calculates its loss. Image from Johnson (2016). [14]
4.4 Cycle Consistent style algorithm. Image from Zhu (2017). [15]
4.5 An example of style consistent style transfers on pictures of horses and zebras. Image from Zhu (2017). [15]
4.6 Architecture of CycleGAN. Image from Zhu (2017). [15]
4.7 Cycle-transferred skymap with added details that aren't visible in the roadmap. Image from Chu (2017). [16]
4.8 A comparison between an evaluation roadmap and a training roadmap, respectively. High-frequency encoding is made visible by adaptive histogram equalization. Image from Chu (2017). [16]
5.1 The architecture used by Spitael (2017). [17] The numbered sequences portray how the autoencoder was trained.
5.2 The architecture used by De Coster (2017). [18]
5.3 The architecture for velocity prediction in GenreNet. Image from Malik (2017). [19]
5.4 The VAE architecture used by Brunner (2018). [20]
5.5 The Cycle GAN architecture used by Brunner (2018). [20]
5.6 Genre classifier architecture used by Brunner (2018). [20]
6.1 A MIDI message. Image from PlanetofTunes. [21]
6.2 Typical visualisation of a piano roll. Image from Researchgate. [22]
6.3 The distribution of notes in the unaugmented training set.
6.4 The distribution of notes in the augmented training set.
6.5 Note count of multiple datasets.
6.6 Average amount of notes per song of multiple datasets.
7.1 Algorithm outline.
7.2 Original encoder and decoder architecture.
7.3 Original discriminator architecture.
7.4 Original VAE encoder architecture.
7.5 Comparison between images with and without the checkerboarding artifact. Image from Distill. [2]
7.6 A plot showing the inevitable imbalance between the two generators.
7.7 A less imbalanced learning profile.
7.8 Updated generator architecture.
7.9 Updated discriminator architecture.
8.1 Original rhythm.
8.2 Style transferred rhythm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
LIST OF FIGURES v
8.3 Age distribution of our respondents. . . . . . . . . . . . . . . . . . . . . . . 54
8.4 Distribution of our respondents self-reported knowledge about music. . . . 55
List of Tables
8.1 Raw results of the coherency questions in the A-B comparisons between generated and non-generated excerpts.
8.2 Raw results of the similarity questions in the A-B comparisons between generated and non-generated excerpts.
8.3 Raw results of the reinterpretation questions in the A-B comparisons between generated and non-generated excerpts.
8.4 Musicians' positive assessment of generated songs.
8.5 Non-musicians' positive assessment of generated songs.
8.6 Raw results of the qualitative comparison between generators.
8.7 Averages of the qualitative experiment.
Chapter 1
Introduction
Since the dawn of the computer, mankind has tried to build algorithms that produce deceptively realistic art. As creating art is one of the aspects of human intelligence that most distinguishes us from other sentient beings, understanding and artificially replicating it might give us insight into the inner workings of the human brain. To advance our knowledge in one sub-task of this grand endeavor, this dissertation is about generating convincing musical pieces that are interpretations of existing, human-composed examples, using modern generative machine learning techniques.
In other words, we want to know whether it is possible to generate a new song that is a cover or interpretation of another song, by learning from examples drawn from two datasets of music from different genres. Our datasets consist of symbolic examples of classical and ragtime piano music. No rules will be hard-coded: our neural network will attempt the style transfer using only the examples and a set of constraints to optimize.
To differentiate ourselves from Brunner (2018), [20] who published an article on a similar research question, we have tried applying some novel research on GAN training: those authors did not experiment with attention layers, more modern Generative Adversarial Networks such as the Wasserstein GAN, or generator architectures using latent variables. What follows is a summary of the content of this dissertation:
• In chapter 2, we will introduce some technical constructs related to deep neural
networks. We expect the reader to have some basic knowledge about neural networks
and machine learning and will explain the more modern concepts.
• In chapter 3, we will introduce some advanced generative algorithms, mostly imple-
mented with the constructs introduced in the previous chapter, that were used in
state of the art literature or in our research.
• In chapter 4, we will introduce the most common style transfer algorithms, considering for each its feasibility for use on music.
• In chapter 5, we will briefly review some state of the art literature and previous dissertations on similar subjects submitted to Ghent University.
• In chapter 6, we will present our dataset and walk through some of the underlying
technical details of converting the data we gathered to something our neural network
can act upon.
• In chapter 7, we will present the used architectures and a reasoning for why these
were chosen. We will also go into some of the technical problems we encountered
during the course of our research.
• In chapter 8, we will evaluate the generated music by means of human evaluation.
• Finally, in chapter 9, we will present a conclusion about our research and a future
work section containing some of the more interesting ways one could improve upon
our research.
Chapter 2
Neural Network Building Blocks
In this chapter we will explore the general neural network building blocks used to create our final network architecture. Some of these may have been covered in the introductory machine learning course offered at our university, but not in depth. Knowledge of how Stochastic Gradient Descent works and how it changes the weights in the different layers of a neural network is required. What follows is a summary of the content of this chapter:
• In section 2.1, we will discuss convolutional neural networks, a type of neural network with fewer weights than the fully connected neural network, as well as the pooling operations often used in conjunction with this type of network.
• In section 2.2, we will discuss Recurrent Neural Networks. Although these have not been used in our own research, they have been used in multiple state of the art publications and thus deserve a mention.
• In section 2.3, we will discuss networks with so-called skip connections, including the most popular kind, the residual neural network.
• In section 2.4, we will discuss normalization techniques for layer outputs and weights.
• In section 2.5, we will explain the attention mechanism and its uses.
• In section 2.6, we will introduce Dropout, a regularization technique often used in neural networks.
2.1 Convolutional Neural Networks
As the fully connected neural network suffers from a plethora of issues, ranging from a tendency to overfit to a large runtime cost, network types with fewer parameters per hidden layer per channel have become popular. In this section we will go deeper into how
Convolutional Neural Networks work and what types of variations exist.
As our experiments often used convolutional layers, this section was added for those
not familiar with this type of neural network.
2.1.1 Convolutional Network Layer
A convolutional layer consists of a kernel of H × W learnable parameters. This kernel slides over the input matrix, generating each output value by multiplying the kernel weights element-wise with the values in the current window and summing the results. The way the window slides over the input is controlled by the stride: a vector containing, for each dimension, the number of elements the center of the kernel moves per step. A convolutional layer with a stride greater than one performs a downsampling operation, reducing the dimensionality of the input. One can see a visual explanation of the convolutional operation in Figure 2.1; in this image, a Sobel high-pass filter is used instead of a kernel with learnable parameters.
Figure 2.1: Application of a Sobel filter visualized. Image from Medium. [1]
Because the weights in the sliding window are reused, there are fewer weights to learn than in a dense or fully connected layer, where every input has a weight connecting it to every output. By splitting our input into multiple channels and applying different kernels to those channels at the same time, we can learn multiple representations of the same input, often in a smaller dimensional space.
A network architecture consisting of convolutional layers and max pooling operations was first used by LeCun (1989) to recognize handwritten digits. [23] Krizhevsky (2012) used a similar but deeper architecture. [24] The dawn of the general purpose graphics processing unit allowed them to train this network more efficiently than their predecessors could, winning them multiple image recognition contests. Van den Oord (2016) used a special type of convolutional network with dilations to generate realistic speech. [25]
Convolutional networks have been used to generate symbolic music as well: Yang (2017) used them in MidiNet. [26] According to human evaluation, MidiNet generated music as realistic as that of other contemporary state of the art systems, but more pleasant to listen to.
2.1.2 Pooling
Pooling operations pool together the values in a sliding window without a learnable kernel. The operation applied to the values can be non-linear (for instance, max pooling takes the maximum of the window as output value). These operations are often used for downsampling, as applying them with a stride greater than one downsamples, just like a strided convolutional layer.
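A minimal max pooling sketch in the same spirit (illustrative only, not our experiment code):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling: keep only the maximum inside each sliding window."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.arange(16.0).reshape(4, 4)
pooled = max_pool(x)  # 2x2 windows with stride 2: downsampled to 2x2
```

Because only the largest value in each window contributes, the gradient flows back to a single input per window, which is the sparse-gradient issue mentioned above.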
According to Springenberg (2014), max pooling layers can be substituted by convolutional layers with an equally large stride for downsampling purposes. [27] This result is used in the Deep Convolutional Generative Adversarial Network of Radford (2015), since max pooling layers create too sparse a gradient. [12] We have used this result in our own experiments.
2.1.3 Transposed Convolutional Layers
Downsampling techniques were covered in the sections above. Transposed convolutions, by contrast, can learn an upsampling kernel. [28] A transposed convolution maps each input value to a region of outputs by multiplying the input value with multiple learned parameters. One can see an example in Figure 2.2. Note that, like normal convolutional layers, these can be used on multiple channels.
These layers can thus be used in auto-encoders to upsample a downsampled signal.
Figure 2.2: One dimensional transposed convolution visualized. Image from Distill. [2]
We will use them for this purpose in our experiment.
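The scatter-and-sum behavior of a transposed convolution can be illustrated in one dimension (toy numpy sketch; note how overlapping output regions sum up):

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """Each input value is multiplied with the whole kernel and added into a
    stride-spaced region of the output; overlapping regions sum up."""
    k = len(kernel)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * kernel
    return out

x = np.array([1.0, 2.0, 3.0])
kernel = np.array([1.0, 1.0, 1.0])
up = transposed_conv1d(x, kernel)  # 3 inputs upsampled to 7 outputs
```

Three inputs become seven outputs: the operation is the upsampling counterpart of the strided convolution above.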
2.2 Recurrent Neural Networks
Recurrent Neural Networks are neural networks with recurrent connections and thus some sort of memory-like behavior. They are often used for signals with time-series-like behavior, for instance in natural language processing (as the meaning of the current word depends on the context set by previous words) or in music. In this section we will give a brief, not too technical introduction to LSTMs and GRUs, the two most popular classes of Recurrent Neural Networks.
While not directly used in this research, much music generation research has been
done using recurrent neural networks or RNNs. Some of these attempts will be discussed
in later sections.
2.2.1 LSTM
The most commonly used recurrent neural network is the Long Short-Term Memory network or LSTM. [29] While a detailed explanation of how LSTMs work is out of scope for a cursory understanding, one can see how multiple nodes are interconnected in an LSTM in Figure 2.3. The small × symbols are element-wise multiplications of their inputs; otherwise, each node takes a weighted sum of its inputs. The S-shaped circles stand for the preferred activation function, often the logistic sigmoid or tanh function.
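For the technically inclined reader, one step of a basic (non-peephole) LSTM cell can be written out as follows. The weight layout is one common convention, not necessarily the one in Figure 2.3, and the code is an illustrative sketch rather than anything from our experiments:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One step of a basic LSTM cell. W, U and b stack the weights of the
    input, forget and output gates and the candidate cell values."""
    n = h.shape[0]
    z = W @ x + U @ h + b          # weighted sums of input and previous state
    i = sigmoid(z[0:n])            # input gate
    f = sigmoid(z[n:2 * n])        # forget gate
    o = sigmoid(z[2 * n:3 * n])    # output gate
    g = np.tanh(z[3 * n:4 * n])    # candidate cell values
    c_new = f * c + i * g          # the small 'x' symbols: element-wise products
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```

The cell state c carries information across steps almost unchanged when the forget gate stays near one, which is where the long-term memory comes from.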
Roberts (2017) used LSTMs in a Variational Auto-Encoder style setup to generate melody loops and drum loops, even simultaneously for trios. [30] They used an extremely large dataset consisting of every MIDI song the Google search engine could scrape off the web, keeping only songs in 4/4, and augmented the data by shifting the pitches so that every pitch would be equally prevalent.
Figure 2.3: View of the inner workings of a peephole LSTM. Image from Medium. [3]
2.2.2 GRU
Another often encountered recurrent neural network is the Gated Recurrent Unit or GRU. [31] GRUs are fairly new in comparison to LSTMs and perform better in some applications; however, they fail at certain tasks LSTMs can handle. [32]
2.3 Residual Network
Training very deep networks gets increasingly difficult with every layer added. In 2015, two papers explored the idea of using skip connections. Srivastava (2015) showed that CNNs often lose performance as depth increases, unrelated to overfitting. [33] They showed that with their highway networks, deeper networks could be trained more efficiently.
He (2015) hypothesized that the objective of certain layers is hard to optimize for. [4] Using skip connections, which sum a previous layer's output with the newly calculated feature map, added layers can easily implement an identity function if no better optimization is available. They were able to train networks up to a hundred layers deep using these skip connections. This concept is explained visually in Figure 2.4.
ResNet layers can only be used when no dimensionality change happens between layers, as summing the input and the output requires both to have the same dimensionality. We have used these ResNet layers in our own experiments after downsampling the original songs.
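The skip connection itself is a one-line operation. The sketch below (illustrative only) shows both the dimensionality constraint and how a block falls back to the identity when its inner layers output zeros:

```python
import numpy as np

def residual_block(x, layer_fn):
    """A ResNet block: sum the inner layers' output with the block's input.
    Input and output must share the same shape for the sum to work."""
    out = layer_fn(x)
    assert out.shape == x.shape, "skip connection needs matching shapes"
    return out + x

x = np.array([1.0, -2.0, 3.0])
# If the inner layers learn to emit zeros, the block is exactly the identity:
identity_like = residual_block(x, lambda v: np.zeros_like(v))
```

This identity fallback is precisely why added layers "can easily perform an identity function" in He's formulation.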
Figure 2.4: The Resnet pattern visualised. Image from He (2015). [4]
In Figure 2.5 one can see the difference between networks with and without ResNet connections: the ResNet network performs better as the layer count grows.
Figure 2.5: Resnet training curves. Image from He (2015). [4]
2.4 Normalization
Most neural network layers perform better when their input lies within some determined range. A good technique to enforce this is normalization: centering the distribution of an input around a predetermined mean and rescaling its variance to a fitting value. Initially, one could only normalize the input data of the network; however, techniques have since been designed that allow normalization between hidden layers. What follows is an enumeration of the topics discussed in the following subsections:
• In subsection 2.4.1, we will discuss normalization techniques where dataset statistics of a layer's outputs are learned in order to normalize those outputs.
• In subsection 2.4.2, we will discuss normalization techniques where the norm of the weight vector and its direction are decoupled.
• In subsection 2.4.3, we will explain what Lipschitz continuous functions are. These functions are used in Wasserstein GANs, introduced in subsection 3.2.3. We advise the reader to read that section first.
• In the last subsection, subsection 2.4.4, we will explain how spectral normalization works. This type of normalization is only used in GAN-like networks, introduced in section 3.2. We advise the reader to read that section first.
2.4.1 Network layer output normalization
Normalizing the output of a layer normally happens by keeping a running mean and a running variance of the features. The resulting vector x̂ approximately follows the normal distribution with mean 0 and variance 1. It can then be rescaled by two learned parameters σ and µ: µ centers the distribution around a new mean and σ gives it a new standard deviation. The equations are listed in Equation 2.1 and Equation 2.2.
x̂ = (x − E[x]) / √Var[x] (2.1)

x* = σ · x̂ + µ (2.2)
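Equations 2.1 and 2.2 amount to the following numpy sketch of Batch Normalization over the batch dimension; the per-batch statistics are used directly and the running-average bookkeeping is omitted for brevity (illustrative code, not from our experiments):

```python
import numpy as np

def batch_norm(x, sigma, mu, eps=1e-5):
    """Equations 2.1 and 2.2: normalize each feature over the batch
    dimension, then rescale with the learned parameters sigma and mu."""
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)  # Eq. 2.1
    return sigma * x_hat + mu                                    # Eq. 2.2

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))  # batch of 64, 8 features
y = batch_norm(x, sigma=np.ones(8), mu=np.zeros(8))
```

With σ = 1 and µ = 0 the output is simply recentered and rescaled to unit variance; the learned parameters let the network choose a different target distribution.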
For a multi-dimensional tensor, this mean can be taken along multiple directions. The most common, and the original, output normalization algorithm is Batch Normalization. [34] Introduced by Ioffe (2015), it normalizes the output along the batch dimension, thus also serving as a sort of regularizer, since features develop a dependence across multiple inputs. This isn't always desirable (for instance, RNNs often have difficulty with Batch Norm), so other normalization directions have been tried over the years.
Batch Norm and some of the other normalization directions are depicted graphically in Figure 2.6. The N dimension is the batch dimension, the C dimension is the channel dimension, and H and W are the vertical and horizontal dimensions respectively. Four types of output normalization are commonly used in machine learning: one normalizes features over the batch dimension in the Batch Normalization algorithm, over the C, H and W dimensions for Layer Normalization, and over the H and W dimensions per channel for Instance Normalization. [34, 35, 36] Group Normalization sits between Layer Norm and Instance Norm: one normalizes over H, W and a group of channels at the same time. [37]
Figure 2.6: Multiple types of layer output normalization. Image from Towards Data Science. [5]
2.4.2 Layer Weight normalization
Weight normalization tries to normalize the outputs of a layer by normalizing the norm of its weights. [38] It does so by dividing the weight vector v by its own norm after each iteration. A learned scalar g then rescales the output of the layer to a preferred scale for the next layer. The authors showed that this improves learning speed for networks where Batch Norm's inherent noisiness is a problem. Another benefit of weight norm is that it has a lower runtime impact than Batch Norm.
w = (g / ||v||) · v (2.3)
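Equation 2.3 in code (toy sketch; the direction comes from v and the scale from the learned scalar g):

```python
import numpy as np

def weight_norm(v, g):
    """Equation 2.3: keep the direction of v, set the norm to g."""
    return (g / np.linalg.norm(v)) * v

v = np.array([3.0, 4.0])   # norm 5
w = weight_norm(v, g=2.0)  # same direction as v, norm exactly g
```

After the rescaling, optimizing g changes only the scale of the layer's output while optimizing v changes only its direction, which is the decoupling the technique is after.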
2.4.3 Lipschitz continuity
The following subsection pertains to a mathematical property of a function. This property is used in the Wasserstein GAN paper and is approximated by the spectral normalization described in the next subsection. If the reader is unfamiliar with Generative Adversarial Networks, we advise them to read section 3.2 first.
A function f is Lipschitz continuous with constant K if Equation 2.4 holds; the case K = 1 is called Lipschitz-1 continuity. [39]

||f(x) − f(y)|| ≤ K ||x − y||, ∀x, y ∈ A (2.4)
An interesting property is portrayed in Figure 2.7: if one slides the pictured cone along the plot of a 2D function, the function is Lipschitz continuous if its graph never crosses into the white section of the cone.
Figure 2.7: Lipschitz cone. Image from Wikimedia. [6]
Lipschitz continuous functions exhibit properties that are useful for approximating the Earth Mover's distance between distributions, as discussed in Arjovsky (2017); the full, very maths-heavy discussion is out of scope for this master's dissertation. [40] A simplified explanation can be given, however: the Earth Mover's distance is the minimum cost of moving a probability distribution from shape A to shape B. We want to use it in the WGAN algorithm to move the distribution of the generated data B closer to the shape A of the original. If the norms of A and B differ too much, the distance no longer makes sense.
2.4.4 Spectral Normalization
The following subsection pertains to a special type of normalization, used only for stabilizing the training of GANs. If the reader is unfamiliar with this topic, we advise them to read section 3.2 first.
Miyato (2018) introduced spectral normalization, a novel type of weight normalization that enforces Lipschitz continuity. [41] The algorithm rescales the weights of a layer so that the implemented function becomes Lipschitz-1 continuous and is thus usable for newer GAN-like methods such as the Wasserstein GAN. [40] A weight manipulation as in WeightNorm is used, but with the spectral norm instead of the Euclidean norm.
The spectral norm is the largest singular value of a matrix. The singular values of a matrix W are the square roots of the eigenvalues of WᵀW. As a quick refresher, an eigenvalue is the amount by which a vector multiplied with a matrix gets stretched in the direction of the corresponding eigenvector. If this stretch is small, one can see that the function the matrix implements will not increase drastically. While the authors gave no proof that spectrally normalized networks are strictly Lipschitz continuous, this type of normalization gives a good approximation.
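The spectral norm can be approximated cheaply by power iteration, which is also how Miyato (2018) computes it in practice (with a single iteration per training step). The sketch below is our own illustrative version, not the paper's code:

```python
import numpy as np

def spectral_norm(W, n_iters=50):
    """Approximate the largest singular value of W by power iteration."""
    u = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    v = None
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)   # dominant right singular vector
        u = W @ v
        u /= np.linalg.norm(u)   # dominant left singular vector
    return u @ W @ v

W = np.array([[2.0, 0.0],
              [0.0, 0.5]])
sigma = spectral_norm(W)  # largest singular value of W: 2.0
W_sn = W / sigma          # the spectrally normalized weight matrix
```

Dividing the weights by their spectral norm makes the layer's largest stretch factor equal to one, which is the Lipschitz-1 property the method is after.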
2.5 Attention
Attention is a memory-like mechanism used in the machine learning field, introduced by Bahdanau (2014). [42] By using a measure of similarity between the inputs in a sequence and a hidden state, one creates a sort of heatmap over the original, in which the more important values score higher. It has been used successfully to create a music generator with very long-term memory using a Transformer network, a special type of architecture using only attention [43] as a replacement for recurrent neural networks. [44]
Attention comes in hard and soft variants: hard attention makes the network attend to only a single location, while soft attention is spread out over the whole input domain. Hard attention cannot be used in conjunction with backpropagation, since there is no gradient for the parts not attended to, so it is restricted to evolutionary networks or networks trained via reinforcement learning. Most soft attention implementations use a dot product as the measure of similarity.
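A minimal dot-product soft attention over a toy sequence (illustrative sketch; real implementations first project queries, keys and values with learned matrices):

```python
import numpy as np

def soft_attention(query, keys, values):
    """Dot-product soft attention: similarity scores become a softmax
    'heatmap' over the sequence, which then weights the values."""
    scores = keys @ query                    # dot-product similarity
    weights = np.exp(scores - scores.max())  # numerically stable softmax...
    weights /= weights.sum()                 # ...summing to one over positions
    return weights @ values, weights

keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([4.0, 1.0])                 # most similar to the third key
attended, weights = soft_attention(query, keys, values)
```

The softmax weights form exactly the heatmap described above: every position gets a nonzero share, but the position most similar to the query dominates the output.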
Zhang (2018) used self-attention to improve the performance of GANs. [7] A visual explanation of how they used attention can be seen in Figure 2.8, an image taken from the SAGAN paper. Note the 1x1 convolutional kernels used in the paper: a 1x1 convolution is simply a learned linear combination across the channels. One can see images taken from a top-level attention layer of one of our own networks in Figure 2.9.
2.6 Dropout
Dropout is a technique to minimize the overfitting phenomenon in neural networks, in-
troduced by Srivastava (2014). [8] By probabilistically zeroing out certain elements from
2.6. DROPOUT 13
Figure 2.8: Picture explaining the SAGAN self attention implementation. Image from Zhang (2018). [7]
a hidden layer, deeper layers will have to take a more averaged weight from the layer,
meaning that it will be less likely to overfit. During evaluation, the Dropout is deactivated
as to improve quality. A figure can be seen in Figure 2.10.
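A sketch of dropout in numpy. We show the "inverted" variant that rescales the surviving activations during training so that nothing needs to change at evaluation time; the original paper instead rescales the weights at test time:

```python
import numpy as np

def dropout(x, p=0.5, training=True, seed=0):
    """Inverted dropout: during training, zero each element with
    probability p and rescale the survivors by 1 / (1 - p)."""
    if not training:
        return x  # dropout is deactivated at evaluation time
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= p  # keep with probability 1 - p
    return x * mask / (1.0 - p)

x = np.ones((4, 8))
train_out = dropout(x, p=0.5, training=True)  # zeros and rescaled survivors
eval_out = dropout(x, training=False)         # untouched
```

The rescaling keeps the expected activation equal between training and evaluation, which is why the evaluation path can simply pass the input through.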
Spatial Dropout was introduced by Tompson (2014). [45] It zeroes out a complete channel when using Convolutional Neural Networks: if nearby pixels within a channel are strongly correlated, standard dropout would not regularize the network. They used it to create a better human body pose estimator.
Figure 2.9: Comparison between the o variable in the attention algorithm and the original input to the layer.
Figure 2.10: Dropout visualization. Image from Srivastava (2014). [8]
Chapter 3
Generative Neural Networks
In this chapter, two broad techniques for generating new samples from an example dataset will be explained: the Variational Autoencoder or VAE and the Generative Adversarial Network or GAN. Both are key to understanding how our network generates new samples from the given dataset. A brief overview of generative techniques for transferring the style of one input to the style of another input or input dataset, also called style transfer, will be given later on. What follows is an enumeration of the sections in this chapter:
• In section 3.1 we will discuss Variational Auto Encoders, a generative network that uses an autoencoder-like architecture with a twist.
• In section 3.2 we will discuss different types of Generative Adversarial Networks:
an algorithm where one uses two networks with different purposes. One network
tries to generate convincing samples. The other network tries to correctly classify
generated samples from real samples.
3.1 Variational Auto Encoder
An auto-encoder is a neural network that compresses its input into fewer dimensions and then tries to emit its own input as output. One could then try to generate new output that behaves like the input by feeding some type of noise into the decoder. However, one doesn't know the distribution of the features the encoder emits.
Kingma (2013) laid the foundation for the variational autoencoder. [46] VAEs are autoencoders with a probabilistic latent space as intermediate representation, which can be seen in Figure 3.1. The features in the latent
Figure 3.1: A VAE architecture. Image from Github.io. [9]
space are thus probability distributions.
Since a sample from a probability distribution does not by itself convey meaningful information about the mean or variance to backpropagate upon, the reparametrization trick is used. Adding a number to a random variable shifts its mean, and multiplying it by a number scales its standard deviation. These basic properties can be used to create any Gaussian distribution from the standard normal distribution, and to backpropagate through the mean and standard deviation of this distribution, as seen in Equation 3.1.

X ∼ N(µ, σ²) ⟺ X = σ · N(0, 1) + µ (3.1)
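Equation 3.1 in code: the randomness is isolated in a fixed N(0, 1) source, so µ and σ become ordinary differentiable parameters (numpy sketch; the gradient machinery itself is not shown):

```python
import numpy as np

def reparametrize(mu, sigma, rng):
    """Equation 3.1: a sample from N(mu, sigma^2) built from N(0, 1) noise."""
    eps = rng.standard_normal(mu.shape)  # the only source of randomness
    return mu + sigma * eps              # shift the mean, scale the std

rng = np.random.default_rng(0)
mu = np.full(100_000, 2.0)
sigma = np.full(100_000, 3.0)
z = reparametrize(mu, sigma, rng)  # ~ N(2, 9), differentiable in mu and sigma
```

Because eps does not depend on µ or σ, the chain rule can flow through the sample back into the encoder's outputs.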
To be able to generate from the decoder of the VAE, an extra loss term is added: the Kullback–Leibler divergence (or KL divergence). The KL divergence is a measure of how much one probability distribution differs from another. The divergence between the distributions emitted by the encoder and a prior distribution (often the univariate normal distribution) is used to map the overall distribution of the latent variables to that of the prior, as diminishing this divergence makes the latent variables act more like the prior. It can also enforce some regularization (as distributions with, for instance, zero variance are not really probabilistic and amount to overfitting) when a VAE-like structure is used without the need to generate from the decoder manually later. Equation 3.2 is the standard formula for calculating the KL divergence. Equation 3.3 is the KL divergence for a random variable R that consists of a set of Gaussian variables with known mean and variance. The derivation can be found in the appendix of the original VAE publication. [46]
DKL(P ‖ Q) = ∑x P(x) log(P(x) / Q(x)) (3.2)

DKL(R ‖ N(0, 1)) = −½ ∑j (1 + log(σj²) − µj² − σj²) (3.3)
3.2 Generative Adversarial Network
A GAN or generative adversarial network is a network consisting of two sub-networks: a generator that tries to generate truthful data and a discriminator that tries to discriminate the real data from the fake data. There are many different types of GANs. In the following subsections, we will discuss some of the developments pertaining to GANs and some known difficulties concerning their use. A good introduction to the current state of the art has been written by Kurach (2018). [47]
3.2.1 Original GAN
A GAN or generative adversarial network is a network consisting of two sub-networks: a generator that tries to generate truthful data and a discriminator that tries to discriminate the real data from the fake data. [48] The generator can use the discriminator's output to optimize its own output, and vice versa. This way, the networks play a minimax game in which both try to update their parameters so as to minimize their own loss, until they reach a Nash equilibrium. This is visualized in Figure 3.2, an image taken from a Kevin McGuinness presentation found on Slideshare. [10]
D_D(x, z) = −log(σ(D(x))) − log(σ(−D(G(z)))) (3.4)

D_G(x, z) = −log(σ(D(G(z)))) (3.5)
Figure 3.2: The standard GAN architecture visualized. Image from Slideshare. [10]
In Equation 3.4 and Equation 3.5, one can see the losses of the two networks: σ stands for the sigmoid function, x for original data samples from the training dataset, and z for a random noise sample.
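Equations 3.4 and 3.5 in code, on scalar discriminator outputs (illustrative sketch; log(σ(z)) is computed in a numerically stable way):

```python
import numpy as np

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -np.logaddexp(0.0, -z)

def discriminator_loss(d_real, d_fake):
    """Equation 3.4: reward high D(x) on real data, low D(G(z)) on fakes."""
    return -log_sigmoid(d_real) - log_sigmoid(-d_fake)

def generator_loss(d_fake):
    """Equation 3.5: the generator instead rewards high D(G(z))."""
    return -log_sigmoid(d_fake)

good = discriminator_loss(d_real=5.0, d_fake=-5.0)  # confident and right
bad = discriminator_loss(d_real=-5.0, d_fake=5.0)   # confident and wrong
```

Each side descends its own loss on alternating steps, which is the minimax game described above.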
3.2.2 Mode collapse
GAN-type networks are notoriously difficult to train: small hyperparameter changes can alter the network's output drastically and cause all types of failures. Many publications have presented solutions or explanations for this subset of problems; however, the biggest cause of instability is the mode collapse failure mode. An example from the Unrolled GAN paper can be seen in Figure 3.3. [11] The network in the figure was trained on the popular MNIST dataset; after a certain number of epochs, its output would 'collapse' to contain only sixes.
3.2.3 Wasserstein GAN and other improvements
The most popular basis for a solution was presented in the Wasserstein GAN paper and its follow-ups. Its authors claimed to have solved the mode collapse problem theoretically, [40] by changing the loss and forcing the discriminator to implement a Lipschitz continuous function (a topic handled in subsection 2.4.3), so that a heuristic for the Earth Mover's distance could be used. They enforce this strict constraint by clipping the weights of the discriminator, as this limits the norm of the gradient of the network. The new loss functions can be seen in Equation 3.6 and Equation 3.7.
Figure 3.3: An example of mode collapse from a network learning the popular MNIST dataset. Image from Metz (2016). [11]
D_L(x, z) = −D(x) + D(G(z)) (3.6)

D_G(x, z) = −D(G(z)) (3.7)
The reason this Lipschitz continuity is such a useful property for a discriminator is that one is guaranteed that ||f(x) − f(x + ε)|| ≤ ||ε||, so small changes in the input cannot completely change the discriminator's output. In other words, if the discriminator recognizes a fake that greatly resembles some real examples, it still has to give it good marks.
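Equations 3.6 and 3.7, together with the weight clipping the authors use to enforce Lipschitz continuity (illustrative numpy sketch, not the original implementation):

```python
import numpy as np

def critic_loss(d_real, d_fake):
    """Equation 3.6: no sigmoid or log; the critic emits unbounded scores."""
    return -np.mean(d_real) + np.mean(d_fake)

def wgan_generator_loss(d_fake):
    """Equation 3.7."""
    return -np.mean(d_fake)

def clip_weights(weights, c=0.01):
    """The original WGAN recipe: clamp every critic weight into [-c, c]
    after each update, a crude way to bound the gradient norm."""
    return [np.clip(w, -c, c) for w in weights]

weights = [np.array([[0.5, -0.002], [0.03, -0.5]])]
clipped = clip_weights(weights)
loss = critic_loss(np.array([1.0]), np.array([0.25]))
```

The clipping threshold c is a hyperparameter; the gradient penalty of Gulrajani (2017) and the spectral normalization of Miyato (2018), discussed next, are softer alternatives to this hard clamp.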
Radford (2015) was the first to scale up GANs, in their deep convolutional GAN or DCGAN. [12] This paper is the baseline for image-based GANs. In Figure 3.4, one can see the authors' guidelines for training a GAN; most of these points are still valid. In Figure 3.5, the architecture used is visualized. The length of each block signifies the number of channels used. Note that they upsample with stride 2 while halving the number of channels in every step: the amount of data doubles up until the last layer.
Gulrajani (2017) tried to improve this method of enforcing Lipschitz-1 continuity by adding an extra gradient penalty term. [49] Another technique that enforces this Lipschitz
Figure 3.4: The DCGAN guidelines. Image from Radford (2015). [12]
Figure 3.5: The DCGAN architecture. Image from Radford (2015). [12]
continuity is the spectral normalization introduced by Miyato (2018) and discussed in
subsection 2.4.4. [41]
Brock (2018) improved upon previous GAN papers by scaling up the setting to an
immense size. [50] They used 512 cores and trained their network for multiple days to
obtain the then state-of-the-art GAN results. They also used what they call the 'truncation'
trick: by resampling the noise whenever it falls outside a certain predefined range, they
improved the GAN's performance during evaluation.
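The truncation trick reduces to a small sampling routine: any noise component whose magnitude exceeds a threshold is drawn again. A minimal sketch, with the threshold of 0.5 chosen for illustration rather than taken from the paper:

```python
import numpy as np

def truncated_normal(shape, threshold=0.5, rng=None):
    """Sample z ~ N(0, 1) and resample any component that falls
    outside [-threshold, threshold] (BigGAN-style truncation)."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(shape)
    while True:
        mask = np.abs(z) > threshold        # components that violate the range
        if not mask.any():
            return z
        z[mask] = rng.standard_normal(mask.sum())  # redraw only those

z = truncated_normal((4, 128), threshold=0.5)
```

A smaller threshold trades sample variety for individual sample quality, which is why it is only applied at evaluation time.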
Zhou (2018) researched numerous loss functions and proposed an optimal class of
functions for GAN losses. [51] They described the useful properties a GAN loss function
should have and devised new loss functions accordingly. For instance, the standard GAN
loss presented in Equation 3.4 is actually their improvement upon the original.
Chapter 4
Style Transfer
A style transfer maps a given work of art to a new work of art that takes on the
artistic style of another work, while still resembling the original in content.
All of the pioneering research presented in this chapter was done on images, but some of
the discussed techniques can be applied to musical style transfer as well.
An enumeration of the content in this chapter:
• In section 4.1 a brief history of the style transfer algorithm is given.
• In section 4.2 the cycle consistent style transfer will be explained, an important
algorithm for our own setup.
4.1 Pioneering style transfer works
The pioneering paper by Gatys (2014) used the responses of a network trained for visual
object detection to backpropagate through an image initialized as random noise,
until it had similar object features (final layer output) to the original content image. [13]
To make sure the picture would have the same style as the style input image, a loss was
defined on the correlations between multiple filter responses throughout the network. In
Figure 4.1, one can see the filter responses chosen to signify style for both pictures,
in multiple layers of the network. Notice that the content representation becomes noisier
deeper into the network.
The biggest problem with this algorithm is its lack of reusability: the algorithm
has to backpropagate through the network anew for every style or content input;
there is no reusable component, so to speak. Specific to our research, there also does
not exist an analogue of an object detector for music. One can see the
Figure 4.1: Filter responses throughout the convolutional network. Image from Gatys (2014). [13]
effect of their algorithm in Figure 4.2: the left image is the content input and the painting
on the bottom left of the right image is the style input.
Figure 4.2: Example of style transfer. Image from Gatys (2015). [13]
Later on, an offline approach was proposed that would generate the correctly
stylized image for any content input given during evaluation, with a single style image
chosen before training. [14] Instead of changing noise until it became a correctly stylized
image, the authors proposed using a network that would be trained under the same losses as in the
original algorithm (thus acting on the responses of the layers of a pre-trained network),
but would learn to convert images into one single style. This solved the biggest
problem the pioneering style transfer algorithm had, as no additional training has
to be done to convert an image to the learned style. A visual of how the loss is calculated
can be seen in Figure 4.3.
Figure 4.3: How the offline style transfer algorithm calculates its loss. Image from Johnson (2016). [14]
4.2 Cycle consistent style transfer
The techniques discussed above would still be impossible to implement for music, since
content in music can't really be expressed in terms of objects; thus a content metric
based on which objects are detected would not work.
In 2017, the Cycle GAN paper introduced a style transfer algorithm that is more suitable
for music, based on the GAN architecture. [15] If one has two datasets X and Y containing
works of art of a particular kind, one could define a mapping F from domain X to Y and a
mapping G from domain Y to X, and use invertibility as a loss criterion. However,
the outputs of the mappings would not necessarily be realistic in the other domain. The
discriminators from the GAN architecture come into play here: they can enforce that the
style of F(x), for x an element of the dataset X, is indistinguishable from the style of Y.
One can see a detailed infographic of their algorithm in Figure 4.4 and an example of
the fine-grained content change this setup is able to achieve in Figure 4.5. In Figure 4.4,
X and Y are the image datasets, and F and G the functions mapping an image from Y
to X and from X to Y, respectively. The cycle consistency loss is visually explained in
the two rightmost pictures: mapping sample x to Y and back should land it close
to x again; idem for the sample y shown on the right. In Figure 4.5, one can see
the results of a cycle consistent style transfer on datasets containing pictures of horses
and zebras: notice that the horses and zebras are in exactly the same position, only the
coat color has changed. In Figure 4.6, one can see the architecture they used. Notice the
similarity to a standard autoencoder, as one downsamples and upsamples
the image. The convolutionally downsampled middle layer can be thought of as a high-level
representation of the image, although they do not downsample that heavily. These
images have been taken from the Cycle GAN paper. [15]
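The cycle consistency idea reduces to a small loss term: map a sample to the other domain and back, and penalize how far it lands from where it started. A minimal sketch on toy one-dimensional "domains" (the mappings `F` and `G` below are hypothetical stand-ins for the learned generators):

```python
import numpy as np

def cycle_consistency_loss(F, G, x):
    """L1 cycle loss: mapping x to the other domain and back
    should land close to x again (mean of |G(F(x)) - x|)."""
    return np.mean(np.abs(G(F(x)) - x))

# Toy mappings: F doubles, G halves, so the cycle is exact.
F = lambda v: 2.0 * v
G = lambda v: 0.5 * v

x = np.array([1.0, -3.0, 0.25])
loss = cycle_consistency_loss(F, G, x)      # exact inverse: loss is zero
```

If `G` were instead `0.5 * v + 1.0`, every sample would land one unit away after the round trip and the loss would be exactly 1. The discriminators then add the missing half of the criterion: the intermediate F(x) must also look like it belongs to Y.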
However, Chu (2017) found that Cycle GAN networks could add unwanted high-frequency
information to aid themselves in learning a complex cyclical mapping. [16] Almahairi
(2018) added a defence against this type of error by introducing a latent variable
in their Augmented CycleGAN framework. [52] In Figure 4.7, one can see an example of
the "steganography" a CycleGAN can employ: whilst there is no detail of a chimney in
the 2D map, the recreated satellite view does contain chimneys. In Figure 4.8 one
can see the reason: imperceptible Gaussian noise was added to the second feature map to
recreate the image.
Figure 4.4: Cycle Consistent style algorithm. Image from Zhu (2017). [15]
Figure 4.5: An example of cycle consistent style transfers on pictures of horses and zebras. Image from Zhu (2017). [15]
Figure 4.6: Architecture of CycleGAN. Image from Zhu (2017). [15]
Figure 4.7: Cycle-transferred satellite map with added details that aren't visible in the roadmap. Image from Chu (2017). [16]
Figure 4.8: A comparison between an evaluation roadmap and a training roadmap, respectively. High-frequency encoding is made visible by adaptive histogram equalization. Image from Chu (2017). [16]
Chapter 5
State of the Art in Musical Style
Transfer
This chapter lists the most important papers with regard to this dissertation.
A small summary of the sections:
• section 5.1 is a section about which generative techniques are the most novel.
• section 5.2 is about computer music generators. As our style transfer also has to
generate music, some insight into what works and what doesn’t in that department
is interesting.
• section 5.3 is a brief look at master's dissertations on similar subjects at our
university.
• section 5.4 is about style transfers in the musical domain. Not too many papers on
this subject have been accepted at conferences, thus a brief overview of what has
been tried in the last couple of years will be given.
5.1 Generative techniques
Zhang (2018) introduced SAGAN (Self-Attention Generative Adversarial Networks). [7]
These are GANs using Spectral Norm, with an attentional layer in between. This network
generates very realistic looking images and recorded a very strong (low) FID score during
evaluation. One peculiarity, however, is that they use hinge-style losses instead of the
normal WGAN loss.
Almahairi (2018) created the Augmented CycleGAN. Instead of just using a one-to-one
mapping, they went further and designed a many-to-many mapping by giving the network
an extra latent variable as input.
5.2 Music Generation
Yang (2017) used a convolutional GAN for their MidiNet. [26] According to human
evaluation, MidiNet generated music as realistic as other state-of-the-art techniques at
the time, but more pleasant to listen to.
Huang (2018) created a Transformer-based music generator. [44] A special property
of this generator is its ability to generate music with sound long-term structure. Whilst it
still glitches, the music this generator produces was evaluated as very realistic by humans.
5.3 Previous master's dissertations
In 2017, two master’s dissertations were written at our university about style transitions
and a style transfer. We thought it would be interesting to feature their work in our own
dissertation. Training GANs was not as stable as it is nowadays, so auto-encoder style
setup would have looked a lot more feasible during that era.
Spitael (2017) published a master's dissertation on style transfer between classical
composers. [17] The author tried to transfer the style of one composer to a composition
of another by coupling multiple decoders (one for each composer) to a single
universal encoder. The setup they used can be seen in Figure 5.1. The numbered lines
represent how the encoder and corresponding decoders were trained; during evaluation
these could then be coupled in another way. A new way of vectorizing MIDI was
proposed: instead of the piano-roll representation (a vector representing the complete
state of the keyboard at every time instant), only one pitch (input as a value and not as
a vector), its corresponding velocity and the elapsed time would be encoded. The system
would still be polyphonic, as one could input multiple notes with a time difference
of zero. An expert and non-expert audience found the generated transferred songs
more pleasing than random noise, but preferred the originals. Sadly, the published
audio files have been removed from the author's SoundCloud.
Figure 5.1: The architecture used by Spitael (2017). [17] The numbered sequences portray how the autoencoder was trained.
De Coster (2017) published a master's dissertation on using RNNs to generate polyphonic
music with style transitions. [18] Instead of using different decoders for different
composers, as his colleague Spitael did, the author tried to predict the next
sample given a certain input sample and a given composer whose style the generator tries
to match. This output can then be fed into the network again, generating new sequences.
During generation, the style vector can change from one composer to another, forcing
the machine to predict a different future output and making it possible to interpolate
between composers. The author assessed that the quality of the output is lower than that of
the real compositions, but not by much. One can see his network architecture in Figure 5.2.
Figure 5.2: The architecture used by De Coster (2017). [18]
5.4 Musical Style Transfer
Malik (2017) published a paper about GenreNet, a network they trained on note velocity
data for different genres. [19] By using a single encoder and multiple LSTM decoders (one
for each genre), the network outputs note velocities for input songs. This architecture
can be seen in Figure 5.3. In the evaluation, 46% of songs were correctly identified, 25%
wrongly identified and 25% couldn't be identified. Thus, it passed what the author calls
the "musical Turing test".
Figure 5.3: The architecture for velocity prediction in GenreNet. Image from Malik (2017). [19]
Brunner published two papers about musical style transfer. One of them uses a
Variational Autoencoder, the other a Cycle GAN style setup.
Brunner (2018) used a Variational Autoencoder to generate style-transferred music
by adding a genre tag to the features emitted by the encoder. [53] By training the VAE
until it reproduced the original input, one can simply swap the genre tag during
evaluation to listen to the style-transferred output. One of the difficulties with this setup is
that the genre has to be a meaningful feature for the decoder: if it simply ignores
that particular feature, swapping the genre tag will not result in much change. Contrary
to most other techniques, they actually use note velocity as an input. Note length is also
fed separately to the encoder.
In another paper, Brunner (2018) used a Cycle GAN setup to generate style-transferred
music. [20] To evaluate the generated music automatically, they used a genre classifier
trained on three genres. If the classifier placed the song in the correct genre,
the generated sample was evaluated positively. They transformed the songs bar
per bar, transforming whole songs this way for some of the examples uploaded online.
Figure 5.4: The VAE architecture used by Brunner (2018). [20]
No human evaluation was done in their paper. The architecture of the generator and
discriminator can be seen in Figure 5.5 and the architecture of the classifier in Figure 5.6.
Figure 5.5: The Cycle GAN architecture used by Brunner (2018). [20]
Figure 5.6: Genre classifier architecture used by Brunner (2018). [20]
Chapter 6
Dataset
As a network can only be as good as the data it receives, the dataset is an important
part of the experiment. For data of any kind, there are multiple choices concerning its
interpretation and layout, and this also holds for music. We collected a dataset of
symbolic music in the MIDI format only and binarized it using existing libraries.
What follows is an enumeration of all the sections in this chapter:
• In section 6.1 we will explain what MIDI is, how it can be used and in what way
we use it.
• In subsection 6.1.1 we will explain what piano rolls are, a way to represent music
played on a particular instrument in terms of pitch and rhythm.
• In subsection 6.1.2 we will explain why we chose the songs we chose, out of a larger
dataset.
6.1 MIDI
MIDI (Musical Instrument Digital Interface) is a format originally intended for
communication between multiple musical instruments. MIDI describes a piece of music as a
sequence of events. As it was originally meant for live music, an event can be a change of
tempo, a change of instrument or a note event. Note events describe which notes are set
on or off at certain time-steps and how loud they should be played, also known as the
velocity of the note. If a MIDI stream is recorded, playing it back on another synthesizer
should produce the same song (albeit possibly with a different timbre due to differences
in equipment). This way, MIDI can be played back on a PC or smartphone using the
operating system's wavetable synth. A dissection of a MIDI message can be seen in
Figure 6.1. This message
contains the following sections: the note-on or note-off message code, the pitch of the note
to be turned on or off, and the velocity (note loudness). For our research we have omitted
the velocity information everywhere, as it would further complicate the design of the
network.
Figure 6.1: A MIDI-message. Image from PlanetofTunes. [21]
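The dissection described above can be sketched in code. The helper below is a hypothetical illustration of how the three bytes of a channel voice message split into a status nibble, a channel, a pitch and a velocity; note that, per the MIDI convention, a note-on with velocity 0 is treated as a note-off.

```python
def parse_midi_note_message(msg):
    """Split a 3-byte MIDI channel voice message into its parts."""
    status, pitch, velocity = msg
    kind = status & 0xF0           # high nibble: message type
    channel = status & 0x0F        # low nibble: MIDI channel
    if kind == 0x90 and velocity > 0:
        event = "note_on"
    elif kind == 0x80 or (kind == 0x90 and velocity == 0):
        event = "note_off"         # velocity-0 note-on counts as note-off
    else:
        event = "other"            # e.g. control change, pitch bend
    return event, channel, pitch, velocity

# Middle C (pitch 60) played at velocity 100 on channel 0:
event, channel, pitch, velocity = parse_midi_note_message([0x90, 60, 100])
```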
6.1.1 Piano rolls
However, a MIDI message's structure does not fit into our neural network, which works
on tensors and vectors, without some preprocessing. To achieve this, we convert our piano
pieces to piano rolls. A piano roll is a representation of the notes as
a concatenation of multi-hot vectors, one for the notes played at each timestep. These
can be imagined as vectors containing the notes held down at a specific time-step, i.e.
the keys pressed on a keyboard, laid out along the time dimension. An example can be seen in
Figure 6.2. The time axis goes from left to right; the pitch axis has low pitches at the
top of the image and high pitches at the bottom.
Figure 6.2: Typical visualisation of a piano roll. Image from Researchgate. [22]
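The conversion from note events to a piano roll can be sketched as follows. `to_piano_roll` is a hypothetical helper, with notes given as (start, end, pitch) triples on the quantized time grid; each row of the result is the multi-hot vector for one time-step.

```python
import numpy as np

def to_piano_roll(notes, n_steps, n_pitches=128):
    """Build a binary piano roll (time x pitch) from
    (start_step, end_step, pitch) triples; end is exclusive."""
    roll = np.zeros((n_steps, n_pitches), dtype=np.uint8)
    for start, end, pitch in notes:
        roll[start:end, pitch] = 1       # hold the note over its duration
    return roll

# A C major triad held for 4 time-steps, then a single C for 4 more:
notes = [(0, 4, 60), (0, 4, 64), (0, 4, 67), (4, 8, 60)]
roll = to_piano_roll(notes, n_steps=8)
```

Because multiple notes can be active in the same row, the representation is polyphonic by construction.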
6.1.2 Song selection
A dataset containing piano-only symbolic MIDI music was assembled from MIDI songs
found on the internet. We chose to only use piano data, as we suspected it would simplify
the experiment and lead to better results: people tend to play different notes on different
instruments. For example, the general dissonance (how many notes are played from
outside of a chosen scale) in standard electric guitar playing will be higher than for piano,
even within a fixed genre such as rock music.
The selected genres were classical and ragtime, as many piano-only MIDI files could be
found for both genres, and both have a distinct feel to them. The ragtime
MIDI files were played by a human pianist, whilst the classical MIDI files are suspected
to have been sequenced rather than performed. This could be important for the timing noise, as
a human-performed piece would be slightly more swung. The classical dataset is slightly
larger, with 320 pieces against the ragtime set's 220 pieces.
For use in our neural network, these MIDI files are converted to piano-roll sequences.
Piano-rolls are vectors containing the notes held down at a specific time-step. Since the
MIDI format allows for up to 128 notes to be played on each instrument, these vectors are
128 elements long, even though a real-life piano only has 88 keys. One could criticise this
choice, as other setups, even multi-instrument ones like Brunner (2018), use a smaller piano
roll per timestep, [20] but the possibility of learning from any dataset for any MIDI
instrument in the future is a definite plus.
Both sets are quantized to the 16th note, both to reduce timing noise and to limit
network complexity. The songs are then cut up into pieces of 4 bars: we take multiple
non-overlapping 4-bar excerpts from each song for training. Data augmentation
is done by adding pitch-shifted versions of the songs to the dataset: every song is
shifted from six semitones down to five semitones up. This also nullifies an inherent scale
imbalance in our dataset.
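The pitch-shift augmentation described above (six semitones down to five up) can be sketched on piano-roll arrays. `pitch_shift` and `augment` are hypothetical helpers that zero-fill any notes shifted past the edge of the pitch range instead of wrapping them around.

```python
import numpy as np

def pitch_shift(roll, semitones):
    """Shift a (time x pitch) piano roll along the pitch axis,
    zero-filling notes shifted past either edge of the range."""
    shifted = np.zeros_like(roll)
    if semitones >= 0:
        shifted[:, semitones:] = roll[:, :roll.shape[1] - semitones]
    else:
        shifted[:, :semitones] = roll[:, -semitones:]
    return shifted

def augment(roll, down=6, up=5):
    """All shifted copies from `down` semitones below to `up` above."""
    return [pitch_shift(roll, s) for s in range(-down, up + 1)]

roll = np.zeros((8, 128), dtype=np.uint8)
roll[:, 60] = 1                     # a held middle C
copies = augment(roll)              # 12 shifted versions, including s = 0
```

Shifting by whole semitones keeps intervals intact, which is why it balances the key distribution without changing the character of the music.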
One can compare the augmented and unaugmented datasets through the graphs
in Figure 6.3 and Figure 6.4. The top graphs are distributions by pitch; the bottom
graphs are distributions over the notes of the chromatic scale, the most commonly used music
system. A clear imbalance can be seen in the unaugmented graphs: the distribution of
pitches isn't smooth at all, and the distribution over notes is far from constant. After
augmentation, these problems have gone away. Note that many possible pitches at the
higher and lower ends of the MIDI pitch spectrum are unused; these notes
aren't often picked by composers, and a real-life piano also only has 88 keys compared
to the MIDI standard's 128 pitches.
Figure 6.3: The distribution of notes in the unaugmented training set.
Keen readers will notice that the notes still aren't perfectly balanced after the shifting
augmentation. This is because a small number of shifted examples were held out of the
training dataset, to be able to assess overfitting if needed. This held-out set did not get
used, however.
Loading and quantizing the MIDI files was done with the Pypianoroll library. [54] The
MIDI was binarized with this library as well; the binarization was bugged because of a
typo in the pip version, so a nightly version from GitHub had to be installed.
In Figure 6.5 one can see the number of notes per dataset. The classical set contains
more notes, partly because there are more classical songs than ragtime songs in our
dataset. In Figure 6.6 one can see the average number of notes per excerpt: the classical
set has more notes by every measure.
Figure 6.4: The distribution of notes in the augmented training set.
Figure 6.5: Note count of multiple datasets.
Figure 6.6: Average number of notes per song for the different datasets.
Chapter 7
Architecture
Many architectural and algorithmic choices have to be made. In this chapter we will detail
two architectures used in our experiment and some problems encountered while engineering
them.
What follows is an enumeration of the different sections in this chapter.
• In section 7.1 we will discuss the general idea of our algorithm and model.
• In section 7.2 we will divulge all of the details about our first attempt at making a
style transfer network.
• In section 7.3 we will lay out an imbalance between the generators that was observed
after training for long enough.
• In section 7.4 we will explain how we solved the above problem.
7.1 High level overview
Looking at our list of generative techniques and style transfers, we could see two possible
choices: using a VAE or using a Cycle GAN. We opted for the second option, as the
research on Cycle GAN wasn’t public at the time we started writing.
Although we opted for the Cycle GAN architecture to enforce the stylization, [15] we
did not rule out that a latent layer could make the style transfer more realistic. Another
new angle for us to explore was the use of attentional layers; the SAGAN paper showed
that attentional layers can work well in generative settings. [7] One last thing we
wanted to try was the WGAN-style Lipschitz discriminator. [40] None of the things
mentioned in this paragraph were explored by Brunner (2018). [20] One can see a
high-level overview of the algorithm in Figure 7.1.
Figure 7.1: Algorithm outline.
7.2 Original Implementation
In this section we will explain all of the architectural building blocks which were
experimented with and give the reasoning for why these specific blocks were put in the
places they were. All neural networks were designed with the PyTorch machine learning
framework. [55] All graphs were made with matplotlib. [56]
In Equation 7.1 and Equation 7.2 one can see the complete loss functions used in the
end: for the discriminator we used a hinge loss, and for the generator the standard
WGAN loss in conjunction with a binary cross entropy. The generator subscript ab
denotes that genre a gets converted to genre b; the losses for the b-to-a generator
and the a discriminator are symmetrical to the ones listed in the equations. The second BCE
term in the generator's loss is a way to enforce idempotency on the network. The w0 and
w1 parameters scale the losses on the generator's side; w1 was mostly set to
zero, and w0 to five.
L_Db(x_a, x_b) = E_{x_b}[ max(0, 1 − D_b(x_b)) ] + E_{x_a}[ max(0, 1 + D_b(G_ab(x_a))) ] (7.1)

L_Gab(x_a, x_b) = −E_{x_a}[ D_b(G_ab(x_a)) ] + w_0 E_{x_a}[ WBCE(x_a, G_ba(G_ab(x_a))) ] + w_1 E_{x_b}[ WBCE(x_b, G_ab(x_b)) ] (7.2)
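A NumPy sketch of the loss terms in Equations 7.1 and 7.2. This is an illustration only: the per-note weighting of the WBCE is omitted, and all inputs are dummy score vectors rather than real network outputs.

```python
import numpy as np

def hinge_d_loss(d_real, d_fake):
    """Equation 7.1: hinge loss on discriminator scores for real
    samples (d_real) and generated samples (d_fake)."""
    return np.mean(np.maximum(0.0, 1.0 - d_real)) + \
           np.mean(np.maximum(0.0, 1.0 + d_fake))

def bce(target, pred, eps=1e-7):
    """Plain binary cross entropy between a target piano roll and a
    reconstruction (the weighting of WBCE is omitted in this sketch)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def gen_loss(d_fake, x_a, cycled_a, x_b, identity_b, w0=5.0, w1=0.0):
    """Equation 7.2: Wasserstein term plus weighted cycle and
    identity reconstruction terms (w0 = 5, w1 = 0, as in the text)."""
    return -np.mean(d_fake) + w0 * bce(x_a, cycled_a) + w1 * bce(x_b, identity_b)

d_real = np.array([1.5, 0.5])       # dummy scores on real samples
d_fake = np.array([-2.0, 0.0])      # dummy scores on generated samples
x = np.array([1.0, 0.0, 1.0])       # a tiny "piano roll" slice
loss_d = hinge_d_loss(d_real, d_fake)
loss_g = gen_loss(d_fake, x, x.astype(float), x, x.astype(float))
```

With a perfect reconstruction the BCE terms vanish and the generator loss reduces to the Wasserstein term alone.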
In the following figures we divulge the architectures of our encoder, decoder and
discriminator, in Figure 7.2 and Figure 7.3. Please note that this discriminator is
Lipschitz-1 continuous, so we have to use one of the unbounded discriminator loss functions
with this architecture. Dimensions are noted as H×W×C. One can see the architecture of the
VAE's encoder in Figure 7.4.
The attentional layers were implemented as explained in the SAGAN paper and shown
in Figure 2.8: they give the network a receptive field bigger than the composition of
convolutional layers allows, which is very interesting for music, as it doesn't
necessarily have a structure that fits into the space of one filter (as edges do in normal
pictures).
For normalization in the generator, Batch Norm was chosen. [34] While it has
some issues in evaluation mode, because the running mean over batches has
to be recalculated, [57] it kept the model regularized to a greater extent than Instance
Norm. [36]
For normalization and enforcing the Lipschitz continuity in the discriminator, Spectral
Normalization was chosen. Models without this type of discriminator showed mode
collapse problems.
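Spectral normalization divides a weight matrix by an estimate of its largest singular value, usually obtained by power iteration. A minimal NumPy sketch follows; real implementations (e.g. in deep learning frameworks) run only one power iteration per training step and cache the vector u between steps, whereas this sketch iterates until convergence for clarity.

```python
import numpy as np

def spectral_normalize(W, n_iters=50, rng=None):
    """Divide W by a power-iteration estimate of its largest
    singular value, so the normalized matrix has spectral norm 1."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v          # estimated largest singular value
    return W / sigma

W = np.array([[3.0, 0.0],
              [0.0, 1.0]])    # singular values 3 and 1
W_sn = spectral_normalize(W)  # spectral norm of W_sn is 1
```

Constraining every layer's spectral norm to 1 bounds the Lipschitz constant of the whole (convolutional) discriminator, which is exactly the property the WGAN losses require.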
As for filter sizes, at first we opted for convolutional layers with kernel size 3x3 and
stride 2 for the encoding and decoding parts of our generators, like Brunner (2018) did. [20]
But this type of convolutional layer has a tendency to put more information in certain
output elements than one whose kernel size is a multiple of the stride. [2] A hypothesis
is that this type of artifact had a smaller effect on Brunner's model because their
outputs were only of size 64x84 instead of 256x128. All other recent GAN papers use
deconvolutions for which this non-checkerboarding property applies. A comparison for
pictures can be seen in Figure 7.5.
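The checkerboard effect can be illustrated by counting, for a 1-D transposed convolution, how many input elements contribute to each output position: with kernel 3 and stride 2 the counts alternate between positions, while kernel 4 (a multiple of the stride) gives uniform coverage. The helper below is an illustration, not part of any real network.

```python
import numpy as np

def overlap_counts(n_in, kernel, stride):
    """For a 1-D transposed convolution, count how many input
    elements contribute to each output position."""
    n_out = (n_in - 1) * stride + kernel
    counts = np.zeros(n_out, dtype=int)
    for i in range(n_in):
        counts[i * stride : i * stride + kernel] += 1
    return counts

# Look only at interior positions (the borders always taper off):
uneven = overlap_counts(16, kernel=3, stride=2)[4:-4]  # alternates 2, 1, 2, 1, ...
even   = overlap_counts(16, kernel=4, stride=2)[4:-4]  # uniformly 2
```

The alternating counts are exactly the checkerboard pattern: every other output pixel receives twice as much "paint" from the kernel as its neighbours.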
A typical failure mode for GANs consists of the generator or discriminator getting too
good too early on. Instead of training one of them for more iterations than the other
to counteract this, we opted for different learning rates for generator and discriminator.
This method was introduced in the Two Time-scale Update Rule (TTUR) GAN paper. [58]
We used a TTUR factor of 10, as in the SAGAN paper, meaning the learning rate of the
discriminator is 10 times as big as that of the generator. [7] As our images are pretty
lightweight, we chose a bigger batch size and a higher learning rate than most other GAN
papers. The standard GAN β values for the Adam optimizer were chosen: β1 set to 0 and
β2 set to 0.9.
Our experiments with a VAE-style encoder did not noticeably improve or degrade the
output or the learning behaviour of the network; we did note, however, that the VAE-style
encoder used more resources and needed more epochs to reach the same reconstruction
loss.
7.3 Training problems
Even after improving the stability by using the Lipschitz-1 discriminator, training prob-
lems persisted. The Wasserstein distance of one of the two generators steadily increased,
whilst the Wasserstein distance of the other one would stay relatively small. In the gen-
erated output Whilst the combined network is in this regime, the output of one of the
generators is often something very amusical, whilst the other generator produces some-
thing more akin to human made music. An example training loss graph can be seen in
Figure 7.6: please note that this example was generated with a smaller dataset with only
one sample per song. Note that the development dataset loss rises up to the vicinity of
one: this means the network also starts overfitting (it is hard to see due to the large scale
of this graph and the lack of significant losses in the training BCE loss).
Figure 7.2: Original encoder and decoder architecture.
Figure 7.3: Original discriminator architecture.
Figure 7.4: Original VAE encoder architecture.
Figure 7.5: Comparison between images with checkerboarding artifacts and without. Image from Distill. [2]
Figure 7.6: A plot showing the inevitable imbalance between the two generators.
A hypothesis was made that this possibly had to do with the way the weights were
updated. None of the papers using a similar algorithm explored this. We tried three
different strategies as demonstrated in Algorithm 1, Algorithm 2, Algorithm 3.
Algorithm 1 One of the three backpropagation choices one could make.
for i in epochs do
    for Batch a, Batch b in Dataset A, Dataset B do
        Calculate loss and backprop through discriminators
        Calculate loss of batch a
        Backprop through both generator AB and BA
        Calculate loss of batch b
        Backprop through both generator AB and BA
    end for
end for

Algorithm 2 One of the three backpropagation choices one could make.
for i in epochs do
    for Batch a, Batch b in Dataset A, Dataset B do
        Calculate loss and backprop through discriminators
        Calculate loss of batch a
        Calculate loss of batch b
        Sum losses and backprop through both generator AB and BA
    end for
end for

Algorithm 3 One of the three backpropagation choices one could make.
for i in epochs do
    for Batch a, Batch b in Dataset A, Dataset B do
        Calculate loss and backprop through discriminators
        Calculate loss of batch a
        Backprop through generator AB
        Calculate loss of batch b
        Backprop through generator BA
    end for
end for
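The three update schedules above can be sketched as control flow. This sketch follows Algorithm 2 (losses of both batches summed, one backward pass through both generators); all network calls are replaced by hypothetical stand-ins that merely record the order of operations.

```python
# Hypothetical stand-ins for the real discriminator/generator updates:
log = []

def discriminator_step(batch_a, batch_b):
    log.append("disc_backprop")          # would update both discriminators

def generator_loss(batch):
    log.append(f"loss_{batch}")          # would run a batch through G and D
    return 1.0                           # placeholder loss value

def generator_backprop(total_loss):
    log.append("gen_backprop")           # single backward pass through AB and BA

epochs = [0, 1]
dataset = [("a0", "b0"), ("a1", "b1")]   # paired batches from datasets A and B

for _ in epochs:
    for batch_a, batch_b in dataset:
        discriminator_step(batch_a, batch_b)
        total = generator_loss(batch_a) + generator_loss(batch_b)
        generator_backprop(total)        # one update for both generators
```

Algorithm 1 would instead call `generator_backprop` once per batch loss, and Algorithm 3 would route each loss only to its own generator.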
Algorithm 3, where each generator can only change its weights according to its own
loss, clearly had the worst performance in this regard, as the generators failed to learn
anything meaningful. Algorithm 1 and Algorithm 2 behaved similarly, but Algorithm 2
was faster, while Algorithm 1 uses less memory. None of the strategies solved the
imbalance, however.
Regularizing the network caused the imbalance to show up later but did not completely
eliminate it.
7.4 Improved Implementation
One problem not addressed in the previous sections still went unsolved: using a standard
Wasserstein-style loss instead of the hinge loss ruined the performance of our network,
with ever-increasing losses that made the program crash at the end of the run. It turns out
our discriminator (and the one used in the SAGAN paper) is probably not Lipschitz
continuous: the attentional layers contain a sum term, which violates the definition of
Lipschitz continuity. Removing the attentional layer from the discriminator immediately
caused the loss to decrease steadily after the first few epochs and stabilize. w0 and
w1 from the loss function were both set to 1 instead of the values declared above.
Using this loss function also stabilized the long-term learning profile of our overall
network: discriminator and generator take turns improving their performance,
as can be seen in Figure 7.7. Note that the Wasserstein loss keeps steadily increasing
until the 80th epoch and hovers around the same value from then on. This
model could be trained with 10 samples per song, something that was impossible with
the previous one, where one had to stop very early or use a very small dataset to keep the
imbalance from ruining the output.
The architecture was also slightly changed. One can see this reflected in Figure 7.8
and Figure 7.9.
Figure 7.7: A less imbalanced learning profile.
Figure 7.8: Updated discriminator architecture.
Figure 7.9: Updated generator architecture.
Chapter 8
Results
In this chapter we will assess the quality of our generated music. A breakthrough in terms
of model imbalance only came after the survey was sent out, so all of the quantitative data
concerns our first-generation generator.
In section 8.1 we will evaluate the music generators ourselves. The results of our
survey will be discussed in section 8.2.
8.1 Researchers' evaluation
The music generated with Architecture A has a tendency to only shuffle notes around
and to change the structural rhythm of the input song. Some outliers exist: a classical
excerpt consisting of multiple piano glissandos gets transformed into a rather quiet piece.
Architecture B (and especially the Classical-to-Ragtime generator) had a tendency
to try and match the left-hand rhythm of the genre it generates music for. The
Classical-to-Ragtime generator produced a lot of songs with a simple, driving 16th-note
rhythm, which is something that does occur often in ragtime. Whilst these covers do
not sound revolutionary by any means, humanlike behavior is noticeable. One can see this
pictured in the piano rolls in Figure 8.1 and Figure 8.2.
8.2 Human Evaluation
To assess the quality of our work a Google forms survey was sent out, consisting of 4 parts.
The test was executed single blind style, as only one questionnaire was sent out, and the
samples sent were known beforehand. First, we asked the respondents name and their
age. Their name was asked to be able to detect vandalism on the questionnaire (although
53
8.2. HUMAN EVALUATION 54
Figure 8.1: Original rhythm.
Figure 8.2: Style transferred rhythm.
it was not a required question). 41 persons responded; 51.4% of those surveyed considered themselves knowledgeable about music, and the age distribution is shown in Figure 8.3. 31 persons gave a name: no clear vandalism of the study was detected.
Figure 8.3: Age distribution of our respondents.
First, we showcased some reference ragtime and classical samples, each taken from the test set of the dataset. The respondents had to answer some questions about the likeness of two unlabeled excerpts (coming either from the test set or from the output of one of our
Figure 8.4: Distribution of our respondents' self-reported knowledge about music.
generators). The first two questions had one ragtime example and one sample generated from that example. The fourth and fifth questions had one classical example and one ragtime sample generated from the classical example. The third and sixth questions had random generated and non-generated samples, once with differing genres and once with the same genre. The raw results of the A-B comparison can be seen in Table 8.1.
Most of the respondents found the majority of generated samples incoherent in this first test, whilst all samples that were covers of each other scored high on the 'Could one be a cover of the other' question and the 'Are the previous songs similar' question. We examine the perceived quality once per style and once per generator architecture.
Question  Generated  Genre      Architecture  Coherent  Neutral  Not coherent
1         Yes        Classical  1              5.3%     10.5%    78.9%
1         No         Ragtime    1             76.3%     15.8%    13.2%
2         Yes        Ragtime    1             15.8%     21.1%    15.8%
2         No         Classical  1             86.8%      7.9%     5.3%
3         Yes        Ragtime    1             14.6%     63.4%    22.0%
3         No         Classical  1             95.1%      0.0%     4.9%
4         Yes        Classical  2             13.2%     13.2%    73.7%
4         No         Ragtime    2             73.7%     10.5%    15.8%
5         Yes        Ragtime    2             18.4%     10.5%    71.1%
5         No         Classical  2             92.1%      7.9%     0.0%
6         No         Classical  2             68.3%     19.5%    12.2%
6         No         Classical  2             95.1%      4.9%     0.0%
Table 8.1: Raw results of the coherency questions in the A-B comparisons between generated and non-generated excerpts.
Question  Genre of original  Architecture  Yes    Neutral  No
1         Classical          1             86.8%   7.9%     5.3%
2         Ragtime            1             63.2%  15.8%    21.1%
3         Mixed              /              5.3%  23.7%    71.1%
4         Classical          2             73.7%  10.5%    15.8%
5         Ragtime            2             65.8%   5.3%    28.9%
6         Classical          /             15.8%  10.5%    73.7%
Table 8.2: Raw results of the similarity questions in the A-B comparisons between generated and non-generated excerpts.
Question  Genre of original  Architecture  Yes    Neutral  No
1         Classical          1             71.1%  15.8%    13.2%
2         Ragtime            1             42.1%  23.7%    34.2%
3         Mixed              1              0.0%  26.3%    73.7%
4         Classical          2             47.4%  34.2%    18.4%
5         Ragtime            2             44.7%  28.9%    26.3%
6         Classical          2             10.5%  23.7%    65.8%
Table 8.3: Raw results of the reinterpretation questions in the A-B comparisons between generated and non-generated excerpts.
The respondents had a tendency to vote for the generated excerpt as unpleasant and the non-generated song as pleasant. However, when pitting two non-generated songs against each other, there is a significant dip in perceived coherence for one of the two samples.
Many respondents found the excerpts to be similar, but not necessarily reinterpretations. The generator with the attentional layers scores better in terms of similarity and likelihood of being a cover.
Musicians had a tendency to rate the generated samples higher than non-musicians. A pure yes/no comparison on the generated samples can be seen in Table 8.4 and Table 8.5. This might indicate that people who play music have a softer boundary for claiming musical coherence.
Question  Positive answers  Percentage
1         1                  5.2%
2         5                 26.0%
4         4                 21.0%
5         2                 10.5%
Table 8.4: Musicians' positive assessments of generated songs.
Question  Positive answers  Percentage
1         1                  4.7%
2         1                  4.7%
4         2                  9.4%
5         5                 23.5%
Table 8.5: Non-musicians' positive assessments of generated songs.
In a second section, respondents were asked to score four samples on their musical coherence on a scale from one to five, and to do this twice. In each set of samples, one came from an old model: for the first set, a model that exhibited mode collapse (albeit a not unpleasant one), and for the second set, a model that generated an unpleasant sound according to the researcher. Another sample came from the test sets, and the other two were generated by the architectures described above. The scores are shown in Table 8.6. It is clear that the real sample is by far the most popular.
Q.  Arch.  Genre   Score of 1  Score of 2  Score of 3  Score of 4  Score of 5
7   1      Class.   7.9%       26.3%       31.6%       26.3%        7.9%
7   2      Class.  15.8%       47.4%       23.7%        7.9%        5.3%
7   Real   Class.   0.0%        0.0%       10.5%       28.9%       60.5%
7   P.C.   Class.   0.0%       15.8%       34.2%       36.8%       13.2%
8   1      Rag.    13.2%       34.2%       42.1%        5.3%        5.3%
8   2      Rag.     0.0%       15.8%       55.3%       23.7%        5.3%
8   Real   Rag.     0.0%        0.0%        5.3%       18.4%       76.3%
8   U.C.   Rag.    42.1%       31.6%       26.3%        0.0%        0.0%
Table 8.6: Raw results of the qualitative comparison between generators.
We suspect that one of the reasons the third section received 'worse' responses is that the samples were coupled with the originals, making any dissonances quite obvious because the source material could be heard. Some respondents told us after the questionnaire that they noticed an increase in quality in the generated samples from the fourth section. Many people liked the more pleasant sample from the mode-collapsed generator; averaged out, they liked it more than our generated examples. However, that architecture was absolutely not the way to go, as it produced the same output regardless of the input sample. One can see the average score of the samples in Table 8.7.
Type                  Avg. score
Real                  4.603
Generator 1           2.778
Generator 2           2.793
Generated Classical   2.699
Generated Ragtime     2.872
Catastrophic Failure  1.842
Pleasant Collapse     3.474
Table 8.7: Averages of the qualitative experiment.
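The averages in Table 8.7 follow directly from the score distributions in Table 8.6: each distribution is turned into a weighted mean, and the classical and ragtime questions are averaged per sample type. A minimal sketch (the dictionary below simply restates Table 8.6, with shorthand labels of our own):

```python
# Score distributions (percent of votes for scores 1..5) copied from Table 8.6.
TABLE_8_6 = {
    ("Arch 1", "Classical"): [7.9, 26.3, 31.6, 26.3, 7.9],
    ("Arch 2", "Classical"): [15.8, 47.4, 23.7, 7.9, 5.3],
    ("Real",   "Classical"): [0.0, 0.0, 10.5, 28.9, 60.5],
    ("Arch 1", "Ragtime"):   [13.2, 34.2, 42.1, 5.3, 5.3],
    ("Arch 2", "Ragtime"):   [0.0, 15.8, 55.3, 23.7, 5.3],
    ("Real",   "Ragtime"):   [0.0, 0.0, 5.3, 18.4, 76.3],
}

def mean_score(dist):
    """Weighted mean of a 1-5 score distribution given in percent."""
    return sum(score * pct for score, pct in enumerate(dist, start=1)) / 100.0

def row_average(label):
    """One row of Table 8.7: the mean over the classical and ragtime questions."""
    return (mean_score(TABLE_8_6[(label, "Classical")])
            + mean_score(TABLE_8_6[(label, "Ragtime")])) / 2.0

print(round(row_average("Real"), 3))    # 4.603
print(round(row_average("Arch 1"), 3))  # 2.778
print(round(row_average("Arch 2"), 3))  # 2.793
```

The 'Generated Classical' and 'Generated Ragtime' rows average the two architectures per genre instead, yielding 2.699 and 2.872.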
Chapter 9
Conclusion
We can answer the question stated in the introduction, whether it is possible to generate a new song that is a cover or interpretation of another song purely by giving examples to an algorithm, with a yes. However, this music is still of inferior quality compared to the original human compositions. A foundation for algorithms emitting longer music was laid in this dissertation, but for input of non-fixed length other ideas will have to be explored, as one can only scale up the input of a convolutional network so much. Our very latest model could clearly pick up some high-level genre characteristics, such as frequently used rhythms, and could separate melody from accompaniment. Still, some questions remain, even for fixed-length Cycle-GANs:
• Can a similar technique be used when a dataset using multiple instrument inputs
is used?
• Can note velocity be added to this type of model?
• How would this work for different genres?
On a more technical note, we succeeded in creating a Cycle-GAN that, using modern GAN techniques, could be trained in a relatively stable manner without having to resort to early stopping. We suspect that Brunner et al. [20] did not achieve this, as they trained their GAN for only 30 epochs. Our networks also used a lower number of channels.
It is hard to compare our work to current state-of-the-art solutions in terms of pure generation and covering ability, as they all work on different scales. We think that the invertibility property lowers the quality of the generated samples, but also firmly roots the generators in the realm of the originals.
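This invertibility is what the cycle-consistency term enforces: an input translated to the other genre and back should reproduce itself. A minimal numpy sketch, where g_ab and g_ba are hypothetical stand-ins for the two trained generators:

```python
import numpy as np

def cycle_consistency_loss(x, g_ab, g_ba):
    """L1 cycle loss: translate an input A -> B -> A and
    compare the round trip with the original."""
    reconstruction = g_ba(g_ab(x))
    return float(np.mean(np.abs(reconstruction - x)))

# A pair of generators that perfectly invert each other gives zero loss:
x = np.arange(12.0).reshape(3, 4)
loss = cycle_consistency_loss(x, lambda a: 2.0 * a, lambda a: a / 2.0)  # 0.0
```

In the full Cycle-GAN objective this term is added (in both directions) to the two adversarial losses.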
Something that could significantly improve the speed at which these networks are evaluated is a pre-trained network, analogous to the Inception network, for an evaluation loss like the FID. Currently, one has to wait until a network is fully trained to get any kind of qualitative information about it, as the discriminator is also tainted by the training data, and cycle consistency is not a good indicator of how well the network is performing. [59]
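As a rough sketch of what such a metric computes: the FID is the Fréchet distance between two Gaussians fitted to the feature activations of real and generated data, so it only needs the means and covariances of those features. The version below is a numpy-only illustration (using the fact that for positive-definite covariances, the trace of the matrix square root of their product equals the sum of the square roots of its eigenvalues), not the reference implementation from [58]:

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrtm(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    # Eigenvalues of a product of SPD matrices are real and non-negative,
    # so Tr(sqrtm(sigma1 @ sigma2)) is the sum of their square roots.
    eigs = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.maximum(eigs.real, 0.0)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)
```

A pre-trained genre classifier could play the role the Inception network plays for images, supplying the feature statistics during training instead of after it.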
As a follow-up to this dissertation, one could implement the same architecture with a Transformer instead of a convolutional approach, given their ability to generate realistic long-term musical sequences. [44] Since the Transformer is more rooted in the realm of NLP, tricks for unbounded input could be experimented with.
Another option would be trying a many-to-many Cycle-GAN framework. For one, it completely eliminates the 'steganography' problem a standard Cycle-GAN may have. It also allows one to emit songs in multiple styles.
Bibliography
[1] D. Cornelisse, “An intuitive guide to convolutional neural networks,” Medium.com,
2018.
[2] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,”
Distill, 2016.
[3] O. Davydova, “7 types of artificial neural networks for natural language processing,”
Medium.com, 2018.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
CoRR, vol. abs/1512.03385, 2015.
[5] F. Doukkali, “Batch normalization in neural networks,” Towards Data Science, 2017.
[6] Taschee (Wikimedia Commons user). https://commons.wikimedia.org/wiki/File:
Lipschitz_Visualisierung.gif, 2017.
[7] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-Attention Generative
Adversarial Networks,” arXiv e-prints, p. arXiv:1805.08318, May 2018.
[8] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
“Dropout: A simple way to prevent neural networks from overfitting,” Journal of
Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[9] B. Keng, “Semi-supervised learning with variational autoencoders,” Self-published
via Github.io, September 2017.
[10] K. McGuiness, “Deep learning for computer vision: Generative models and adver-
sarial training,” Slideshare, August 2016.
[11] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, “Unrolled generative adversarial
networks,” CoRR, vol. abs/1611.02163, 2016.
[12] A. Radford, L. Metz, and S. Chintala, “Unsupervised Representation Learn-
ing with Deep Convolutional Generative Adversarial Networks,” arXiv e-prints,
p. arXiv:1511.06434, Nov 2015.
[13] L. A. Gatys, A. S. Ecker, and M. Bethge, “A Neural Algorithm of Artistic Style,”
ArXiv e-prints, Aug. 2015.
[14] J. Johnson, A. Alahi, and F. Li, “Perceptual losses for real-time style transfer and
super-resolution,” CoRR, vol. abs/1603.08155, 2016.
[15] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-to-Image Translation
using Cycle-Consistent Adversarial Networks,” ArXiv e-prints, Mar. 2017.
[16] C. Chu, A. Zhmoginov, and M. Sandler, “Cyclegan, a master of steganography,”
CoRR, vol. abs/1712.02950, 2017.
[17] D. Spitael, "Style transfer for polyphonic music." Master's dissertation, Ghent University, 2017.
[18] M. D. Coster, "Polyphonic music generation with style transitions using recurrent neural networks." Master's dissertation, Ghent University, 2017.
[19] I. Malik and C. H. Ek, “Neural translation of musical style,” CoRR,
vol. abs/1708.03535, 2017.
[20] G. Brunner, Y. Wang, R. Wattenhofer, and S. Zhao, “Symbolic music genre transfer
with cyclegan,” CoRR, vol. abs/1809.07575, 2018.
[21] Unknown, "What is a MIDI message?" http://www.planetoftunes.com/
midi-sequencing/midi-messages.html, 1998.
[22] J.-P. Briot, G. Hadjeres, and F. Pachet, “Deep learning techniques for music gener-
ation - a survey,” Researchgate, 09 2017.
[23] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and
L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural
Comput., vol. 1, pp. 541–551, Dec. 1989.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep con-
volutional neural networks,” in Advances in Neural Information Processing Systems
25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 1097–
1105, Curran Associates, Inc., 2012.
[25] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalch-
brenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw
audio,” CoRR, vol. abs/1609.03499, 2016.
[26] L. Yang, S. Chou, and Y. Yang, “Midinet: A convolutional generative adversarial
network for symbolic-domain music generation using 1d and 2d conditions,” CoRR,
vol. abs/1703.10847, 2017.
[27] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for Sim-
plicity: The All Convolutional Net,” arXiv e-prints, p. arXiv:1412.6806, Dec 2014.
[28] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,”
arXiv e-prints, p. arXiv:1603.07285, Mar 2016.
[29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput.,
vol. 9, pp. 1735–1780, Nov. 1997.
[30] A. Roberts, J. Engel, and D. Eck, “Hierarchical variational autoencoders for music,”
in NIPS Workshop on Machine Learning for Creativity and Design, 2017.
[31] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated
recurrent neural networks on sequence modeling,” CoRR, vol. abs/1412.3555, 2014.
[32] G. Weiss, Y. Goldberg, and E. Yahav, “On the practical computational power of
finite precision rnns for language recognition,” CoRR, vol. abs/1805.04908, 2018.
[33] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” CoRR,
vol. abs/1505.00387, 2015.
[34] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training
by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015.
[35] J. Lei Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,” arXiv e-prints,
p. arXiv:1607.06450, Jul 2016.
[36] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, “Instance normalization: The missing
ingredient for fast stylization,” CoRR, vol. abs/1607.08022, 2016.
[37] Y. Wu and K. He, “Group normalization,” CoRR, vol. abs/1803.08494, 2018.
[38] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization
to accelerate training of deep neural networks,” CoRR, vol. abs/1602.07868, 2016.
[39] H. F. Walker, "Course handouts."
[40] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial net-
works,” in Proceedings of the 34th International Conference on Machine Learning
(D. Precup and Y. W. Teh, eds.), vol. 70 of Proceedings of Machine Learning Re-
search, (International Convention Centre, Sydney, Australia), pp. 214–223, PMLR,
06–11 Aug 2017.
[41] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for
generative adversarial networks,” CoRR, vol. abs/1802.05957, 2018.
[42] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learn-
ing to Align and Translate,” arXiv e-prints, p. arXiv:1409.0473, Sep 2014.
[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
and I. Polosukhin, “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017.
[44] C. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D.
Hoffman, and D. Eck, “An improved relative self-attention mechanism for trans-
former with application to music generation,” CoRR, vol. abs/1809.04281, 2018.
[45] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient object
localization using convolutional networks,” CoRR, vol. abs/1411.4280, 2014.
[46] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” ArXiv e-prints,
Dec. 2013.
[47] K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly, “The GAN
landscape: Losses, architectures, regularization, and normalization,” CoRR,
vol. abs/1807.04720, 2018.
[48] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio, “Generative Adversarial Networks,” ArXiv e-prints,
June 2014.
[49] C. Gang, wgan-gp repository. https://github.com/caogang/wgan-gp/.
[50] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity
natural image synthesis,” CoRR, vol. abs/1809.11096, 2018.
[51] Z. Zhou, Y. Song, L. Yu, and Y. Yu, “Understanding the effectiveness of lipschitz
constraint in training of gans via gradient analysis,” CoRR, vol. abs/1807.00751,
2018.
[52] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. C. Courville, “Aug-
mented cyclegan: Learning many-to-many mappings from unpaired data,” CoRR,
vol. abs/1802.10151, 2018.
[53] G. Brunner, A. Konrad, Y. Wang, and R. Wattenhofer, “MIDI-VAE: Modeling Dy-
namics and Instrumentation of Music with Applications to Style Transfer,” in 19th
International Society for Music Information Retrieval Conference (ISMIR), Paris,
France, September 2018.
[54] H.-W. Dong, W.-Y. Hsiao, and Y.-H. Yang, “Pypianoroll: Open source python pack-
age for handling multitrack pianorolls,” in Late-Breaking Demos of the 19th Interna-
tional Society for Music Information Retrieval Conference (ISMIR), 2018.
[55] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmai-
son, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W,
2017.
[56] J. D. Hunter, “Matplotlib: A 2d graphics environment,” Computing In Science &
Engineering, vol. 9, no. 3, pp. 90–95, 2007.
[57] smth and multiple PyTorch users, "Why don't we put models in .train() or
.eval() modes in dcgan example." Forum thread with replies by a lead developer
of the PyTorch library. https://discuss.pytorch.org/t/
why-dont-we-put-models-in-train-or-eval-modes-in-dcgan-example/7422,
2017.
[58] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochre-
iter, “Gans trained by a two time-scale update rule converge to a nash equilibrium,”
CoRR, vol. abs/1706.08500, 2017.
[59] S. Liu, Y. Wei, J. Lu, and J. Zhou, “An improved evaluation framework for generative
adversarial networks,” CoRR, vol. abs/1803.07474, 2018.