Musical style transfer with generative neural network models

Master's dissertation submitted in order to obtain the academic degree of
Master of Science in Computer Science Engineering

Academic year 2018-2019
Supervisors: Prof. dr. Tijl De Bie, Dr. ir. Thomas Demeester
Maarten Moens
Student number: 01304857
Preface
In this preface I would like to express my gratitude to the people who made this
dissertation possible.
I would like to thank my mother for all the love and hard work she has done for me.
Without her, I probably would never have succeeded at even getting to my Master’s. I
would like to thank my dad for inspiring me and sparking my interest in science and
maths. May you both live long and healthy lives.
I would like to thank Thomas and Tijl, my supervisors, for giving me advice all year
round. I would have been very lost without you.
A special thanks to Afif Hasan, for feeding me when I was hungry, for helping me with
setting up the survey, for proofreading my dissertation and for countless other things.
Thank you, from the bottom of my heart.
A thank you to all the people I’ve ever played music with and to everyone who
makes me actively enjoy music, even when I’m not playing it myself. In particular, a thank
you to Jeroen De Vos, for inspiring me to pick up the bass guitar.
Last but not least, a thank you to all of my engineering friends. My time at UGent
would have been a lot duller without you.
Permission for loan
The author gives permission to make this master’s dissertation available for consultation
and to copy parts of this master’s dissertation for personal use. In all cases of other use,
the copyright terms have to be respected, in particular with regard to the obligation to
explicitly state the source when quoting results from this master’s dissertation.
Maarten Moens, 2019
Abstract
Overview
• Master’s dissertation submitted in order to obtain the academic degree of Master
of Science in Computer Science Engineering
• Title: Musical style transfer with generative neural network models
• Author: Maarten Moens
• Supervisors: Prof. dr. Tijl De Bie, Dr. ir. Thomas Demeester
• Faculty of Engineering and Architecture, Academic year 2018-2019
Summary
The research presented proposes a neural algorithm for making realistic covers of human
composed music available in a symbolic format. Using the Cyclic Generative Adversarial
Network algorithm, we have trained a network to stylize music from one out of two specific
genres as music from the other one. The quality of this generated music has been evaluated
by humans and a comparison with human composed symbolic music is made. Whilst our
generated music is less pleasing than human-composed music, respondents rated it as
being closer to pleasant than unpleasant.
Keywords
Music Generation, Neural Style Transfer, Generative Adversarial Networks
Musical style transfer with generative neural network models
Maarten Moens
Supervisor(s): Prof. Dr. Tijl De Bie, Dr. Ir. Thomas Demeester
Abstract— The research presented proposes a neural algorithm for making realistic covers of human composed music available in a symbolic format. Using the Cyclic Generative Adversarial Network algorithm, we have trained a network to stylize music from one out of two specific genres as music from the other one. The quality of this generated music has been evaluated by humans and a comparison with human composed symbolic music is made. Whilst our generated music is less pleasing than human-composed music, respondents rated it as being closer to pleasant than unpleasant.
Keywords— Music Generation, Neural Style Transfer, Generative Adversarial Networks
I. Introduction
Man has tried to make algorithms for deceptively realistic computer art since the dawn of the computer. As creating art is one of the aspects of human intelligence that most distinguishes us from other sentient beings, understanding and artificially replicating it might give us insight into the inner workings of the human brain.
To advance our knowledge of one sub-task in this grand endeavor, this dissertation focuses on generating musical pieces, pleasant to the human ear, that are interpretations of other existing, symbolic, human-composed examples, using modern generative machine learning techniques. In particular, deep neural networks and adversarial networks were used to generate this music.
II. Related Works
This section covers related literature consulted during the research. First, let us introduce a couple of terms. A musical style transfer is the generation of music featuring the content of an input song, stylized to match the style of a genre learned during training. Symbolic music is music represented by some type of high-level notation instead of by a waveform; no characteristics of human performance or instrument quality are thus passed on. In this section, we will only introduce papers significant to symbolic musical style transfer. Some research has been done on waveform audio style transfer as well, but we suspect those systems have fundamentally different models and characteristics than ours.
Malik (2017) presents GenreNet, a network trained on note velocity data for different genres. [1] By using a single encoder and multiple LSTM decoders (one for each genre), the network outputs note velocities for input songs. 46% of songs were correctly identified, 25% were wrongly identified and 25% could not be identified. Thus, it passed what the author calls the "musical Turing test".
Spitael (2017) presents a master's dissertation on style transfer between classical composers. The author tried to transfer the style of one composer to a composition of another by coupling multiple decoders (one for each composer) to a single universal encoder. A new way of vectorizing MIDI was proposed: instead of the piano-roll representation (a vector representing the complete state of the keyboard at every time instant), only one pitch (input as a value, not as a vector), its corresponding velocity and the elapsed time would be encoded. The system would still be polyphonic, as one could input multiple notes with a time difference of zero between them. An expert and non-expert audience found the generated transferred songs more pleasing than random noise, but preferred the originals.
De Coster (2017) presents a master's dissertation on using RNNs to generate polyphonic music with style transitions. [2] Instead of using different decoders for different composers, as his colleague Spitael did, the author tried to predict the next sample given a certain input sample and a given composer whose style the generator tries to match. This output can then be fed into the network again, generating new sequences. During generation, the style vector can change from one composer to another, forcing the machine to predict a different future output and making it possible to interpolate between composers. The author assessed that the quality of the output is lower than that of the real compositions, but not by much.
Brunner (2018) used a Variational Auto Encoder to generate style-transferred music by adding a genre tag to the features emitted by the encoder. [3] By training the VAE until it reproduced the original input, one could simply swap the genre tag during evaluation to listen to the style-transferred output. One of the difficulties with this setup is that the genre has to be a meaningful feature for the decoder: if it just ignores that particular feature, swapping the genre tag will not result in much change. Contrary to most other techniques, they use note velocity as an input. Note length is also fed separately to the encoder.
In another paper, Brunner (2018) used a Cycle GAN setup to generate style-transferred music. [4] To evaluate the generated music automatically, they used a genre classifier trained on three genres. If the classifier placed the generated song in the correct genre, the sample was evaluated positively. They transformed songs bar per bar, transforming whole songs this way for some of the examples uploaded online. No human evaluation was done in their paper.
III. Problem Statement
Consider two datasets consisting of fixed-length excerpts of songs, SA and SB. We wish to generate an excerpt that would fit the general style of dataset SB, based on an element from SA, in such a way that the element from SA remains recognizable. Let us call the excerpt from SA x and the generated excerpt x′. Our algorithm is successful if:
• A human observer finds the generated excerpt x′ pleasing to listen to.
• A human observer could imagine x′ being a part of the dataset SB.
• A human observer could recognize the similarity between x and x′.
An example: the nursery rhyme ’Mary Had a Little Lamb’ has a very recognizable melody, but it is rhythmically simple. Jazz is a genre known for its musical intricacies, especially in terms of rhythm. If a human were to create a jazz cover of ’Mary Had a Little Lamb’, the rhythm would change quite a bit, but the melody would still be recognizable. We want our algorithm to behave the same way.
IV. Dataset and representation
A dataset was created by searching the internet for songs downloadable in the MIDI format. We chose to use only symbolic music written for solo piano, for a number of reasons:
• Piano songs are mostly polyphonic. Most songs have a melody played by the right hand and an accompaniment played by the left hand.
• Compared to songs using one distinct instrument to play melodies and another to play basslines, cutting the right tracks out of the MIDI files was easier.
• Outside of note velocity (which is something to account for with every musical instrument), no unusual nuances can be played on a piano. There is no modulation of the pitch as with string instruments ("string bending") and no multiple ways to sound the notes of the keyboard (such as pizzicato versus bowed for certain string instruments). Such information would be cut out in the binarization.
Two genres with a large library of songs written for solo piano are Classical music and Ragtime music. The songs that fit the above criteria were then binarized into piano rolls, quantized at a sixteenth-note level, with an excerpt consisting of four bars of sixteen of these sixteenth notes each. One can imagine these piano rolls as multi-hot vectors corresponding to the pitches played on the piano, one for every quantized timestep. This quantization is a non-trivial problem: we used the pypianoroll library for it. [5]
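The binarization just described can be sketched as follows; `to_pianoroll` and its note-event input format are illustrative assumptions, not the dissertation's actual pipeline (which relies on pypianoroll):

```python
import numpy as np

def to_pianoroll(notes, n_steps=64, n_pitches=128):
    """Binarize note events into a multi-hot piano roll.

    `notes` is a list of (pitch, start, end) tuples whose start/end are
    already quantized to sixteenth-note steps. Rows are timesteps (4 bars
    of 16 sixteenth notes = 64), columns are MIDI pitches; an entry is 1
    if that pitch sounds at that timestep. Velocity is discarded, as in
    the binarization described in the text.
    """
    roll = np.zeros((n_steps, n_pitches), dtype=np.uint8)
    for pitch, start, end in notes:
        roll[start:end, pitch] = 1
    return roll

# A C major chord (C4, E4, G4) held for one bar of 16 sixteenth notes.
excerpt = to_pianoroll([(60, 0, 16), (64, 0, 16), (67, 0, 16)])
```

One 4-bar excerpt is thus a 64 x 128 binary matrix, the input and output shape of the generators.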
By searching the internet, a set of 240 songs per genre was created. To remove bias, every song was pitch-shifted into all twelve keys so that every key occurred equally often in the dataset. This gave us 2880 different songs. This dataset was then split into training, validation and test sets consisting of 80%, 10% and 10% of the data respectively. Up to ten excerpts of four bars each were taken out of these songs.
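The key augmentation and split described above could look roughly like this minimal numpy sketch (the function names and the use of `np.roll` are our assumptions; wrap-around at the extremes of the pitch axis is ignored here):

```python
import numpy as np

def augment_keys(roll):
    """Pitch-shift a piano roll into all 12 keys by shifting the pitch axis.

    Shifts of -5..+6 semitones cover each key exactly once. np.roll wraps
    pitches around at the edges, which is harmless for typical piano ranges.
    """
    return [np.roll(roll, shift, axis=1) for shift in range(-5, 7)]

def split_dataset(songs, seed=0):
    """Shuffle and split songs 80/10/10 into train/validation/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(songs))
    n_train = int(0.8 * len(songs))
    n_val = int(0.1 * len(songs))
    train = [songs[i] for i in idx[:n_train]]
    val = [songs[i] for i in idx[n_train:n_train + n_val]]
    test = [songs[i] for i in idx[n_train + n_val:]]
    return train, val, test
```

Splitting at the song level, before excerpting, keeps excerpts of the same song out of different splits.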
V. Methodology
Our methodology is discussed in this section. As mentioned in the abstract and the introduction, we used a Cycle Consistent Generative Adversarial Network to generate new music from the excerpts in the dataset. A GAN or generative adversarial network consists of two sub-networks: a generator that tries to generate truthful data and a discriminator that tries to tell the real data from the fake data. [6] The discriminator is trained on example data from the dataset and on the output of the generator. The generator tries to optimize its output in such a way that the discriminator can't tell the difference between original and generated data. This is visualized in Figure 1.
Fig. 1
The standard GAN architecture visualized. Image from
Slideshare. [7]
In Equation 1 and Equation 2, the loss functions for a standard generator and discriminator are shown: σ stands for the sigmoid function, x for original data samples from the training dataset and z for a random noise sample.
DD(x, z) = −log(σ(D(x)))− log(σ(−D(G(z)))) (1)
DG(x, z) = −log(σ(D(G(z)))) (2)
A Cycle Consistent GAN uses two pairs of generators and discriminators instead of one. Consider two datasets X and Y. One generator, G, tries to generate artificial data as if it were elements of Y, given elements from X as input. The other generator, F, tries to generate artificial data as if it were elements of X, given an element from Y as input. One discriminator is trained using the elements of X and artificial elements generated by F, the other using the elements of Y and artificial elements generated by G. The cycle consistency comes from a loss function forcing the application of F to the output of G to equal the input given to G, and vice versa. G and F should thus be inverses of one another. This concept is visualised in Figure 2 and Figure 3.
Fig. 2
The Cycle GAN algorithm. Image from Zhu (2017). [8]
Fig. 3
A more concrete visualization of the Cycle GAN algorithm
with musical examples. [8]
One can also enforce idempotency by adding a loss function where the generators have to perform an identity mapping on certain elements, for instance G(y) = y and F(x) = x. The loss functions for this type of GAN can be seen in Equation 3 and Equation 4. The α and β variables are hyperparameters that control the relative importance of the invertibility and idempotency losses. The complete loss function for the generator can be found in Equation 5. Please note that this loss is centered around the generator G; the loss for F is symmetrical to the one given. The loss function for the discriminators does not change with respect to standard GANs.
LFGcyc(x, y) = α|F (G(x))− x|+ β|G(y)− y| (3)
LFGstyle(x) = −log(σ(DY (G(x)))) (4)
LFG(x, y) = LFGcyc(x, y) + LFGstyle(x) (5)
As a caveat: since our generators try to predict whether a specific pitch will be played or not at a specific timestep, a more logical metric for comparing the input x and the output F(G(x)) would be the Binary Cross Entropy, instead of the L1 loss presented in Equation 3.
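A minimal numpy sketch of the generator losses in Equations 3 to 5, plus the Binary Cross Entropy variant from the caveat. The arguments `F_G_x`, `G_y` and `d_out` are placeholders for network outputs, and the function names are ours, not the dissertation's:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def cycle_loss(x, y, F_G_x, G_y, alpha=10.0, beta=0.0):
    """Equation 3: alpha-weighted invertibility term |F(G(x)) - x| plus
    beta-weighted idempotency term |G(y) - y| (L1, averaged)."""
    return alpha * np.abs(F_G_x - x).mean() + beta * np.abs(G_y - y).mean()

def style_loss(d_out):
    """Equation 4: -log(sigmoid(D_Y(G(x)))), given the discriminator output."""
    return -np.log(sigmoid(d_out)).mean()

def generator_loss(x, y, F_G_x, G_y, d_out, alpha=10.0, beta=0.0):
    """Equation 5: the sum of the cycle and style losses for G."""
    return cycle_loss(x, y, F_G_x, G_y, alpha, beta) + style_loss(d_out)

def bce_cycle_loss(x, F_G_x, eps=1e-7):
    """Binary Cross Entropy alternative to the L1 term of Equation 3,
    for per-pitch, per-timestep note probabilities in [0, 1]."""
    p = np.clip(F_G_x, eps, 1 - eps)
    return -(x * np.log(p) + (1 - x) * np.log(1 - p)).mean()
```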
The Wasserstein GAN paper introduced a new class of loss functions that do not use sigmoids to bound the discriminator's output. [9] To use this type of loss function, one's discriminator has to be Lipschitz continuous; a function f is Lipschitz continuous if Equation 6 holds. Wasserstein GANs are far less prone to mode collapse, a failure mode for GANs where only one output is generated by the generator. They are also less sensitive to hyperparameter changes.
||f(x)− f(y)|| ≤ ||x− y|| (6)
Two possible combinations of discriminator and generator loss functions for Wasserstein GANs can be seen in Equation 7 and Equation 8, which we will call WGAN style loss functions from now on, and in Equation 9 and Equation 10, which we will call Hinge style loss functions from now on. The R in Equation 9 stands for the Rectified Linear function: R(x) = max(0, x).
DDW (x, z) = −ExD(x) + E
zD(G(z)) (7)
DGW (x, z) = −EzD(G(z)) (8)
DDH(x, z) = ExR(1−D(x)) + E
zR(1 +D(G(z))) (9)
DGH(x, z) = −EzD(G(z)) (10)
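Equations 7 to 10 can be computed directly from batches of discriminator outputs; a small numpy sketch (the function names are ours):

```python
import numpy as np

def relu(v):
    """The Rectified Linear function R(x) = max(0, x) from Equation 9."""
    return np.maximum(0.0, v)

def wgan_d_loss(d_real, d_fake):
    """Equation 7: -E[D(x)] + E[D(G(z))]; requires a Lipschitz D."""
    return -d_real.mean() + d_fake.mean()

def wgan_g_loss(d_fake):
    """Equations 8 and 10: the generator loss -E[D(G(z))] is identical
    for the WGAN and Hinge style variants."""
    return -d_fake.mean()

def hinge_d_loss(d_real, d_fake):
    """Equation 9: E[R(1 - D(x))] + E[R(1 + D(G(z)))]."""
    return relu(1.0 - d_real).mean() + relu(1.0 + d_fake).mean()
```

Note that the hinge discriminator loss saturates at zero once real and fake outputs are separated by the margin, while the WGAN loss is unbounded.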
These functions F and G are implemented with an encoder-decoder style setup. As our examples are of fixed length, we chose a convolutional neural network architecture. Inspired by Zhang (2018), we also added attentional layers. [10] We will discuss the two attempted architectures:
• Architecture A: the architecture we did the most human evaluation on, which uses a Hinge style loss and attention in the discriminator. This architecture showed irregular learning behavior.
• Architecture B: an architecture which improved upon Architecture A's learning behavior. It uses WGAN style loss functions.
One can see architecture A visualized in Figure 4 and Figure 5. We used Batch Norm in the generator to speed up training. [11] We used Spectral Norm in the discriminator to enforce Lipschitz continuity. [12] The sigmoid activation function in the last layer ensures that our output is between 0 and 1. Please note that the kernel sizes in the downsampling and upsampling layers, i.e. those with a stride different from one, are multiples of their strides. Odena (2016) shows that this kernel size is preferable to avoid checkerboarding artifacts. [13] For hyperparameters, architecture A used an α of 10 and a β of 0, a learning rate of 0.0005 for the generators, a learning rate of 0.005 for the discriminators and a batch size of 40.
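Odena (2016)'s guideline amounts to a one-line check per resampling layer (a hypothetical helper, not code from the dissertation):

```python
def checkerboard_safe(kernel_size, stride):
    """True if a conv/transposed-conv layer's kernel size is a multiple of
    its stride, so every output position is covered by the same number of
    kernel applications and no checkerboard artifact arises."""
    return stride == 1 or kernel_size % stride == 0

# e.g. a stride-2 upsampling layer with kernel size 4 is safe; kernel size 3 is not.
```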
Fig. 4
Architecture A Generator.
As the quality of the output of architecture A regressed after training for longer periods of time, early stopping had to be employed. A training imbalance was noted: one generator's quality stayed more or less constant, whilst the quality of the other generator degraded. This can be seen in Figure 6 (the lower the loss, the better the quality). Architecture B is very similar to architecture A. The generator runs a bit deeper and the discriminator does not use attention. We suspect that the attentional layer broke the Lipschitz continuity of the discriminator and that this caused the training instability. The number of channels was also changed to better reflect the amount of data removed by downsampling. For hyperparameters, architecture B used an α of 1 and a β of 1, a learning rate of 0.0005 for the generators, a learning rate of 0.001 for the discriminators and a batch size of 40. One can see the generator and the discriminator of architecture B in Figure 7 and Figure 8. Early stopping did not have to be used for Architecture B. One can see the improved training profile in Figure 9. Also note that the training and evaluation cycle consistency losses are very close to zero: they were lower than in Figure 6, but this is not really visible on the graph.

Fig. 5
Architecture A Discriminator.

Fig. 6
Training imbalance.
VI. Results
In this section the generated music of both architectures will be evaluated. As no survey could be conducted for Architecture B, only general properties of its generated music will be described.

Fig. 7
Architecture B Generator.
A. Architecture A
In this subsection we will describe the music generated by architecture A and show the results of the survey conducted on its output. The amount of rhythm change generated by architecture A was minimal. The network sometimes tried adding extra notes to a song and also had a tendency to change the key of the song. A survey was conducted to assess the quality of the samples generated by this network, once with and once without attention layers in the generator. 41 persons responded, 51.4% of whom called themselves knowledgeable about music. All tests were single blind. In the first section, the respondents listened to one example song from the test set and its style transfer. In general, only 15% of the respondents found the style-transferred samples pleasing to listen to when compared with the original. However, the respondents scored the samples high in terms of the possibility of being a reinterpretation of the originals, as can be seen in Table I.

Fig. 8
Architecture B Discriminator.

Fig. 9
Training profile of architecture B.
Genre of og. | Architecture | Yes | Neutral | No
Classical | Attention | 71.1% | 15.8% | 13.2%
Ragtime | Attention | 42.1% | 23.7% | 34.2%
Classical | No Attention | 47.4% | 34.2% | 18.4%
Ragtime | No Attention | 44.7% | 28.9% | 26.3%

TABLE I
Raw results of the reinterpretation questions in the A-B comparisons between generated and non-generated excerpts.
In the second section, respondents had to listen to three generated samples and one excerpt of a song composed by a human, and give these fragments a score between 1 and 5 in terms of pleasingness. The respondents were asked to do this twice. In the first question, one of the generated samples came from a generator that failed catastrophically, one from Architecture A with attention and one from Architecture A without attention. In the second question, one of the generated samples came from a generator that failed because it only emitted one type of output, albeit one with a certain appealing quality to it, which shall be referred to as the Pleasant Collapse from here on out; one from Architecture A with attention and one from Architecture A without attention. The results can be seen in Table II.
Type | Avg Score
Real | 4.603
Generator Attention | 2.778
Generator No Attention | 2.793
Generated Classical (no collapse) | 2.699
Generated Ragtime (no collapse) | 2.872
Catastrophic Failure | 1.842
Pleasant Collapse | 3.474

TABLE II
Averages of qualitative experiment.
Our generated music scored better than the catastrophic failure. No big difference in pleasingness between the generators was noted. Strangely enough, people found the Pleasant Collapse to sound better than the generated samples. Neither comes close to the average score given to the human-composed fragment, however.
B. Architecture B
In this subsection we will describe the music generated by architecture B. No survey was conducted to evaluate its output. Output generated by Architecture B used the melody almost verbatim, but a big difference in the accompaniment of the melody was noticeable. For instance, the Classical-to-Ragtime generator learned a specific rhythm pattern that fits the genre quite well. This behavior was not present in Architecture A.
C. Comparison with recent research
The setup used by Brunner (2018) is very similar to ours. [4] The size of their dataset is roughly the same as our unaugmented dataset; as we augmented our data by shifting the keys, our dataset became 12 times larger than theirs. Brunner (2018) generated covers bar per bar, instead of in groups of four bars like we do. We suspect their model exhibited mode collapse or an imbalance like our model did, as they stopped after training for only 30 epochs.
VII. Conclusion and Future Work
Cycle Consistent GANs are a viable way to generate realistic covers. Whilst there is still a significant difference between the quality of human-made music and our computer-generated covers, an architecture which can be trained without early stopping might be the beginning of much improved generators. We have also shown that Cycle GANs can produce generated music from larger source inputs than Brunner (2018): we use four bars per sample, while they use only one. [4] Music generated by this algorithm was rated by humans as being closer to pleasing music than to non-pleasing music. One thing that could speed up development in this field by a large margin would be a musical analogue to the Fréchet Inception Distance [14] for quantitative computer evaluation of generated music, as evaluation by a developer is time-consuming. An interesting angle of attack for the future of style transfer is the use of a Transformer-like architecture for the generators, which Huang (2018) has shown to be very effective in the generation of music. [15]
References
[1] Iman Malik and Carl Henrik Ek, “Neural translation of musical style,” CoRR, vol. abs/1708.03535, 2017.
[2] Mathieu De Coster, “Polyphonic music generation with style transitions using recurrent neural networks,” Master's dissertation, Ghent University, 2017.
[3] Gino Brunner, Andres Konrad, Yuyi Wang, and Roger Wattenhofer, “MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer,” in 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, September 2018.
[4] Gino Brunner, Yuyi Wang, Roger Wattenhofer, and Sumu Zhao, “Symbolic music genre transfer with CycleGAN,” CoRR, vol. abs/1809.07575, 2018.
[5] Hao-Wen Dong, Wen-Yi Hsiao, and Yi-Hsuan Yang, “Pypianoroll: Open source Python package for handling multitrack pianorolls,” in Late-Breaking Demos of the 19th International Society for Music Information Retrieval Conference (ISMIR), 2018.
[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” ArXiv e-prints, June 2014.
[7] Kevin McGuinness, “Deep learning for computer vision: Generative models and adversarial training,” Slideshare, August 2016.
[8] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks,” ArXiv e-prints, Mar. 2017.
[9] Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning, Doina Precup and Yee Whye Teh, Eds., International Convention Centre, Sydney, Australia, 06-11 Aug 2017, vol. 70 of Proceedings of Machine Learning Research, pp. 214-223, PMLR.
[10] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena, “Self-Attention Generative Adversarial Networks,” arXiv e-prints, p. arXiv:1805.08318, May 2018.
[11] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015.
[12] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida, “Spectral normalization for generative adversarial networks,” CoRR, vol. abs/1802.05957, 2018.
[13] Augustus Odena, Vincent Dumoulin, and Chris Olah, “Deconvolution and checkerboard artifacts,” Distill, 2016.
[14] Shaohui Liu, Yi Wei, Jiwen Lu, and Jie Zhou, “An improved evaluation framework for generative adversarial networks,” CoRR, vol. abs/1803.07474, 2018.
[15] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, and Douglas Eck, “An improved relative self-attention mechanism for transformer with application to music generation,” CoRR, vol. abs/1809.04281, 2018.
Contents
1 Introduction
2 Neural Network Building Blocks
  2.1 Convolutional Neural Networks
    2.1.1 Convolutional Network Layer
    2.1.2 Pooling
    2.1.3 Transposed Convolutional Layers
  2.2 Recurrent Neural Networks
    2.2.1 LSTM
    2.2.2 GRU
  2.3 Residual Network
  2.4 Normalization
    2.4.1 Network layer output normalization
    2.4.2 Layer Weight normalization
    2.4.3 Lipschitz continuity
    2.4.4 Spectral Normalization
  2.5 Attention
  2.6 Dropout
3 Generative Neural Networks
  3.1 Variational Auto Encoder
  3.2 Generative Adversarial Network
    3.2.1 Original GAN
    3.2.2 Mode collapse
    3.2.3 Wasserstein GAN and other improvements
4 Style Transfer
  4.1 Pioneering style transfer works
  4.2 Cycle consistent style transfer
5 State of the Art in Musical Style Transfer
  5.1 Generative techniques
  5.2 Music Generation
  5.3 Previous master's dissertations received
  5.4 Musical Style Transfer
6 Dataset
  6.1 MIDI
    6.1.1 Piano rolls
    6.1.2 Song selection
7 Architecture
  7.1 High level overview
  7.2 Original Implementation
  7.3 Training problems
  7.4 Improved Implementation
8 Results
  8.1 Researchers evaluation
  8.2 Human Evaluation
9 Conclusion
List of Figures
2.1 Application of a Sobel filter visualized. Image from Medium. [1]
2.2 One dimensional transposed convolution visualized. Image from Distill. [2]
2.3 View of the inner workings of a peephole LSTM. Image from Medium. [3]
2.4 The Resnet pattern visualised. Image from He (2015). [4]
2.5 Resnet training curves. Image from He (2015). [4]
2.6 Multiple types of layer output normalization. Image from Towards Data Science. [5]
2.7 Lipschitz cone. Image from Wikimedia. [6]
2.8 Picture explaining the SAGAN self attention implementation. Image from Zhang (2018). [7]
2.9 Comparison between the o variable in the attention algorithm and the original input to the layer.
2.10 Dropout visualization. Image from Srivastava (2014). [8]
3.1 A VAE architecture. Image from Github.io. [9]
3.2 The standard GAN architecture visualized. Image from Slideshare. [10]
3.3 An example of mode collapse from a network learning the popular MNIST dataset. Image from Metz (2016). [11]
3.4 The DCGAN guidelines. Image from Radford (2015). [12]
3.5 The DCGAN architecture. Image from Radford (2015). [12]
4.1 Filter responses throughout the convolutional network. Image from Gatys (2014). [13]
4.2 Example of style transfer. Image from Gatys (2015). [13]
4.3 How the offline style transfer algorithm calculates its loss. Image from Johnson (2016). [14]
4.4 Cycle Consistent style algorithm. Image from Zhu (2017). [15]
4.5 An example of style consistent style transfers on pictures of horses and zebras. Image from Zhu (2017). [15]
4.6 Architecture of CycleGAN. Image from Zhu (2017). [15]
4.7 Cycle-transferred skymap with added details that aren't visible in the roadmap. Image from Chu (2017). [16]
4.8 A comparison between an evaluation roadmap and a training roadmap, respectively. High-frequency encoding is made visible by adaptive histogram equalization. Image from Chu (2017). [16]
5.1 The architecture used by Spitael (2017). [17] The numbered sequences portray how the autoencoder was trained.
5.2 The architecture used by De Coster (2017). [18]
5.3 The architecture for velocity prediction in GenreNet. Image from Malik (2017). [19]
5.4 The VAE architecture used by Brunner (2018). [20]
5.5 The Cycle GAN architecture used by Brunner (2018). [20]
5.6 Genre classifier architecture used by Brunner (2018). [20]
6.1 A MIDI message. Image from PlanetofTunes. [21]
6.2 Typical visualisation of a piano roll. Image from Researchgate. [22]
6.3 The distribution of notes in the unaugmented training set.
6.4 The distribution of notes in the augmented training set.
6.5 Note count of multiple datasets.
6.6 Average amount of notes per song of multiple datasets.
7.1 Algorithm outline.
7.2 Original encoder and decoder architecture.
7.3 Original discriminator architecture.
7.4 Original VAE encoder architecture.
7.5 Comparison between images with and without the checkerboarding artifact. Image from Distill. [2]
7.6 A plot showing the inevitable imbalance between the two generators.
7.7 A less imbalanced learning profile.
7.8 Updated generator architecture.
7.9 Updated discriminator architecture.
8.1 Original rhythm.
8.2 Style transferred rhythm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
LIST OF FIGURES v
8.3 Age distribution of our respondents. . . . . . . . . . . . . . . . . . . . . . . 54
8.4 Distribution of our respondents self-reported knowledge about music. . . . 55
List of Tables
8.1 Raw results of the coherency questions in the A-B comparisons between generated and non-generated excerpts.
8.2 Raw results of the similarity questions in the A-B comparisons between generated and non-generated excerpts.
8.3 Raw results of the reinterpretation questions in the A-B comparisons between generated and non-generated excerpts.
8.4 Musicians' positive assessment of generated songs.
8.5 Non-musicians' positive assessment of generated songs.
8.6 Raw results of the qualitative comparison between generators.
8.7 Averages of the qualitative experiment.
Chapter 1
Introduction
Since the dawn of the computer, mankind has tried to build algorithms that produce deceptively realistic art. As creating art is one of the aspects of human intelligence that most distinguishes us from other sentient beings, understanding and artificially replicating it might give us insight into the inner workings of the human brain. To advance our knowledge in one sub-task of this grand endeavor, this dissertation is about generating convincing musical pieces that are interpretations of existing, human-composed examples, using modern generative machine learning techniques.
In other words, we want to know whether it is possible to generate a new song that is a cover or interpretation of another song, by learning from examples drawn from two datasets of music from different genres. Our datasets consist of symbolic examples of classical and ragtime piano music. No rules will be hard-coded: our neural network will attempt the style transfer using only the examples and a set of constraints to optimize.
To differentiate ourselves from Brunner (2018), [20] who published an article on a similar research question, we have tried applying some novel research on GAN training: those authors did not experiment with attention layers, more modern Generative Adversarial Networks such as the Wasserstein GAN, or generator architectures using latent variables. What follows is a summary of the content of this dissertation:
• In chapter 2, we will introduce some technical constructs related to deep neural
networks. We expect the reader to have some basic knowledge about neural networks
and machine learning and will explain the more modern concepts.
• In chapter 3, we will introduce some advanced generative algorithms, mostly imple-
mented with the constructs introduced in the previous chapter, that were used in
state of the art literature or in our research.
• In chapter 4, we will introduce the most common style transfer algorithms, considering for each its feasibility for use on music.
• In chapter 5, we will briefly review some state of the art literature and previous dissertations on similar subjects submitted to Ghent University.
• In chapter 6, we will present our dataset and walk through some of the underlying
technical details of converting the data we gathered to something our neural network
can act upon.
• In chapter 7, we will present the used architectures and a reasoning for why these
were chosen. We will also go into some of the technical problems we encountered
during the course of our research.
• In chapter 8, we will evaluate the generated music by means of human evaluation.
• Finally, in chapter 9, we will present a conclusion about our research and a future
work section containing some of the more interesting ways one could improve upon
our research.
Chapter 2
Neural Network Building Blocks
In this chapter we will explore the general neural network building blocks used to create our final network architecture. Some of these may have been covered in the introductory machine learning course offered at our university, but not in depth. Knowledge of how Stochastic Gradient Descent works and how it changes the weights in the different layers of a neural network is required. What follows is a summary of the content of this chapter:
• In section 2.1, we will discuss convolutional neural networks, a type of neural network with fewer weights than the fully connected neural network, as well as the pooling operations often used in conjunction with this type of network.
• In section 2.2, we will discuss Recurrent Neural Networks. Although these have not been used in our own research, they have been used in multiple state of the art publications and thus deserve a mention.
• In section 2.3, we will discuss networks with so-called skip connections, including the most popular kind, the residual neural network.
• In section 2.4, we will discuss normalization techniques for layer outputs and weights.
• In section 2.5, we will explain the attention mechanism and its uses.
• In section 2.6, we will introduce Dropout, a regularization technique often used in neural networks.
2.1 Convolutional Neural Networks
As the fully connected neural network suffers from a plethora of issues, ranging from a tendency to overfit to a large runtime cost, network types with fewer parameters per hidden layer per channel have become popular. In this section we will go deeper into how
Convolutional Neural Networks work and what types of variations exist.
As our experiments often used convolutional layers, this section was added for those
not familiar with this type of neural network.
2.1.1 Convolutional Network Layer
A convolutional layer consists of a kernel of H × W learnable parameters. This kernel slides over the input matrix, generating each output value by multiplying the kernel weights element-wise with the values in the current window and summing the results. The way the window slides over the input is controlled by the stride: a vector containing, for each dimension, the number of elements the center of the kernel moves per step. A convolutional layer with a stride greater than one performs a downsampling operation, reducing the dimensionality of the input. One can see a visual explanation of the convolutional operation in Figure 2.1; in this image, a Sobel high-pass filter is used instead of a kernel with learnable parameters.
Figure 2.1: Application of a Sobel filter visualized. Image from Medium. [1]
Because the weights in the sliding window are reused, there are fewer weights to learn than in a dense or fully connected layer, where every input has a weight connecting it to every output. By splitting our input into multiple channels and applying different kernels to those channels at the same time, we can learn multiple representations of the same input, often in a smaller dimensional space.
A network architecture consisting of convolutional layers and max pooling operations was first used by LeCun (1989) to recognize handwritten digits. [23] Krizhevsky (2012) used a similar but deeper architecture. [24] The dawn of the general purpose graphics processing unit allowed them to train this network more efficiently than their predecessors could, winning them multiple image recognition contests. Van den Oord (2016) used a special type of convolutional network with dilations to generate realistic speech. [25]
Convolutional networks have been used to generate symbolic music as well: Yang (2017) used them in MidiNet. [26] According to human evaluation, MidiNet generated music as realistic as that of other contemporary state of the art systems, but more pleasant to listen to.
2.1.2 Pooling
Pooling operations pool together the values in a sliding window without a learnable kernel. The operation applied to the values can be non-linear (for instance, max pooling takes the maximum of the window as output value). These operations are often used for downsampling, as applying them with a stride greater than one downsamples, just like a strided convolutional layer.
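A minimal max pooling sketch in the same spirit (illustrative only, not our experiment code):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling: keep only the maximum inside each sliding window."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.arange(16.0).reshape(4, 4)
pooled = max_pool(x)  # 2x2 windows with stride 2: downsampled to 2x2
```

Because only the largest value in each window contributes, the gradient flows back to a single input per window, which is the sparse-gradient issue mentioned above.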
According to Springenberg (2014), max pooling layers can be substituted by convolutional layers with an equally large stride for downsampling purposes. [27] This result is used in the Deep Convolutional Generative Adversarial Network of Radford (2015), since max pooling layers create too sparse a gradient. [12] We have used this result in our own experiments.
2.1.3 Transposed Convolutional Layers
Downsampling techniques were covered in the sections above. Transposed convolutions, by contrast, can learn an upsampling kernel. [28] A transposed convolution maps each input value to a region of outputs by multiplying the input value with multiple learned parameters. One can see an example in Figure 2.2. Note that, like normal convolutional layers, these can be used on multiple channels.
These layers can thus be used in auto-encoders to upsample a downsampled signal.
Figure 2.2: One dimensional transposed convolution visualized. Image from Distill. [2]
We will use them for this purpose in our experiment.
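The scatter-and-sum behavior of a transposed convolution can be illustrated in one dimension (toy numpy sketch; note how overlapping output regions sum up):

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """Each input value is multiplied with the whole kernel and added into a
    stride-spaced region of the output; overlapping regions sum up."""
    k = len(kernel)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * kernel
    return out

x = np.array([1.0, 2.0, 3.0])
kernel = np.array([1.0, 1.0, 1.0])
up = transposed_conv1d(x, kernel)  # 3 inputs upsampled to 7 outputs
```

Three inputs become seven outputs: the operation is the upsampling counterpart of the strided convolution above.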
2.2 Recurrent Neural Networks
Recurrent Neural Networks are neural networks with recurrent connections and thus some sort of memory-like behavior. They are often used for signals with time-series-like behavior, for instance in natural language processing (as the meaning of the current word depends on the context set by previous words) or in music. In this section we will give a brief, not too technical introduction to LSTMs and GRUs, the two most popular classes of Recurrent Neural Networks.
While not directly used in this research, much music generation research has been
done using recurrent neural networks or RNNs. Some of these attempts will be discussed
in later sections.
2.2.1 LSTM
The most commonly used recurrent neural network is the Long Short-Term Memory network or LSTM. [29] While a detailed explanation of how LSTMs work is out of scope for a cursory understanding, one can see how multiple nodes are interconnected in an LSTM in Figure 2.3. The small × symbols are element-wise multiplications of their inputs; otherwise, each node takes a weighted sum of its inputs. The S-shaped circles stand for the preferred activation function, often the logistic sigmoid or tanh function.
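For the technically inclined reader, one step of a basic (non-peephole) LSTM cell can be written out as follows. The weight layout is one common convention, not necessarily the one in Figure 2.3, and the code is an illustrative sketch rather than anything from our experiments:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One step of a basic LSTM cell. W, U and b stack the weights of the
    input, forget and output gates and the candidate cell values."""
    n = h.shape[0]
    z = W @ x + U @ h + b          # weighted sums of input and previous state
    i = sigmoid(z[0:n])            # input gate
    f = sigmoid(z[n:2 * n])        # forget gate
    o = sigmoid(z[2 * n:3 * n])    # output gate
    g = np.tanh(z[3 * n:4 * n])    # candidate cell values
    c_new = f * c + i * g          # the small 'x' symbols: element-wise products
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```

The cell state c carries information across steps almost unchanged when the forget gate stays near one, which is where the long-term memory comes from.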
Roberts (2017) used LSTMs in a Variational Auto-Encoder style setup to generate melody loops and drum loops, even simultaneously for trios. [30] They used an extremely large dataset consisting of every MIDI song the Google search engine could scrape off the web, keeping only songs in 4/4, and augmented the data by shifting the pitches so that every pitch would be equally prevalent.
Figure 2.3: View of the inner workings of a peephole LSTM. Image from Medium. [3]
2.2.2 GRU
Another often encountered recurrent neural network is the Gated Recurrent Unit or GRU. [31] GRUs are fairly new in comparison to LSTMs and perform better in some applications; however, they fail at certain tasks LSTMs can handle. [32]
2.3 Residual Network
Training very deep networks gets increasingly difficult with every layer added. In 2015, two papers explored the idea of using skip connections. Srivastava (2015) showed that CNNs often lose performance as depth increases, unrelated to overfitting. [33] They showed that with their highway networks, deeper networks could be trained more efficiently.
He (2015) hypothesized that the objective of certain layers is hard to optimize for. [4] Using skip connections, which sum a previous layer's output with the newly calculated feature map, added layers can easily implement an identity function if no better optimization is available. They were able to train networks up to a hundred layers deep using these skip connections. This concept is explained visually in Figure 2.4.
ResNet layers can only be used when no dimensionality change happens between layers, as summing the input and the output requires both to have the same dimensionality. We have used these ResNet layers in our own experiments after downsampling the original songs.
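The skip connection itself is a one-line operation. The sketch below (illustrative only) shows both the dimensionality constraint and how a block falls back to the identity when its inner layers output zeros:

```python
import numpy as np

def residual_block(x, layer_fn):
    """A ResNet block: sum the inner layers' output with the block's input.
    Input and output must share the same shape for the sum to work."""
    out = layer_fn(x)
    assert out.shape == x.shape, "skip connection needs matching shapes"
    return out + x

x = np.array([1.0, -2.0, 3.0])
# If the inner layers learn to emit zeros, the block is exactly the identity:
identity_like = residual_block(x, lambda v: np.zeros_like(v))
```

This identity fallback is precisely why added layers "can easily perform an identity function" in He's formulation.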
Figure 2.4: The Resnet pattern visualised. Image from He (2015). [4]
In Figure 2.5 one can see the difference between networks with and without ResNet connections: the ResNet network performs better as the layer count grows.
Figure 2.5: Resnet training curves. Image from He (2015). [4]
2.4 Normalization
Most neural network layers perform better when their input lies within some determined range. A good technique to enforce this is normalization: centering the distribution of an input around a predetermined mean and rescaling its variance to a fitting value. Initially, one could only normalize the input data of the network; however, techniques have since been designed that allow normalization between hidden layers. What follows is an enumeration of the topics discussed in the following subsections:
• In subsection 2.4.1, we will discuss normalization techniques where dataset statistics of a layer's outputs are learned in order to normalize those outputs.
• In subsection 2.4.2, we will discuss normalization techniques where the norm of the weight vector and its direction are decoupled.
• In subsection 2.4.3, we will explain what Lipschitz continuous functions are. These functions are used in Wasserstein GANs, introduced in subsection 3.2.3. We advise the reader to read that section first.
• In the last subsection, subsection 2.4.4, we will explain how spectral normalization works. This type of normalization is only used in GAN-like networks, introduced in section 3.2. We advise the reader to read that section first.
2.4.1 Network layer output normalization
Normalizing the output of a layer normally happens by keeping a running mean and a running variance of the features. The resulting vector x̂ approximately follows the normal distribution with mean 0 and variance 1. It can then be rescaled by two learned parameters σ and µ: µ centers the distribution around a new mean and σ gives it a new standard deviation. The equations are listed in Equation 2.1 and Equation 2.2.
x̂ = (x − E[x]) / √Var[x] (2.1)

x* = σ · x̂ + µ (2.2)
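Equations 2.1 and 2.2 amount to the following numpy sketch of Batch Normalization over the batch dimension; the per-batch statistics are used directly and the running-average bookkeeping is omitted for brevity (illustrative code, not from our experiments):

```python
import numpy as np

def batch_norm(x, sigma, mu, eps=1e-5):
    """Equations 2.1 and 2.2: normalize each feature over the batch
    dimension, then rescale with the learned parameters sigma and mu."""
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)  # Eq. 2.1
    return sigma * x_hat + mu                                    # Eq. 2.2

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))  # batch of 64, 8 features
y = batch_norm(x, sigma=np.ones(8), mu=np.zeros(8))
```

With σ = 1 and µ = 0 the output is simply recentered and rescaled to unit variance; the learned parameters let the network choose a different target distribution.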
For a multi-dimensional tensor, this mean can be taken along multiple directions. The most common, and the original, output normalization algorithm is Batch Normalization. [34] Introduced by Ioffe (2015), it normalizes the output along the batch dimension, thus also serving as a sort of regularizer, since features develop a dependence across multiple inputs. This isn't always desirable (for instance, RNNs often have difficulty with Batch Norm), so other normalization directions have been tried over the years.
Batch Norm and some of the other normalization directions are depicted graphically in Figure 2.6. The N dimension is the batch dimension, the C dimension is the channel dimension, and H and W are the vertical and horizontal dimensions respectively. Four types of output normalization are commonly used in machine learning: one normalizes features over the batch dimension in the Batch Normalization algorithm, over the C, H and W dimensions for Layer Normalization, and over the H and W dimensions per channel for Instance Normalization. [34, 35, 36] Group Normalization sits between Layer Norm and Instance Norm: one normalizes over H, W and a group of channels at the same time. [37]
Figure 2.6: Multiple types of layer output normalization. Image from Towards Data Science. [5]
2.4.2 Layer Weight normalization
Weight normalization tries to normalize the outputs of a layer by normalizing the norm of its weights. [38] It does so by dividing the weight vector v by its own norm after each iteration. A learned scalar g then rescales the output of the layer to a preferred scale for the next layer. The authors showed that this improves learning speed for networks where Batch Norm's inherent noisiness is a problem. Another benefit of weight norm is that it has a lower runtime impact than Batch Norm.
w = (g / ||v||) · v (2.3)
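Equation 2.3 in code (toy sketch; the direction comes from v and the scale from the learned scalar g):

```python
import numpy as np

def weight_norm(v, g):
    """Equation 2.3: keep the direction of v, set the norm to g."""
    return (g / np.linalg.norm(v)) * v

v = np.array([3.0, 4.0])   # norm 5
w = weight_norm(v, g=2.0)  # same direction as v, norm exactly g
```

After the rescaling, optimizing g changes only the scale of the layer's output while optimizing v changes only its direction, which is the decoupling the technique is after.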
2.4.3 Lipschitz continuity
The following subsection pertains to a mathematical property of a function. This property is used in the Wasserstein GAN paper and is approximated by the spectral normalization described in the next subsection. If the reader is unfamiliar with Generative Adversarial Networks, we advise them to read section 3.2 first.
A function f is Lipschitz continuous with constant K if Equation 2.4 holds; the case K = 1 is called Lipschitz-1 continuity. [39]

||f(x) − f(y)|| ≤ K ||x − y||, ∀x, y ∈ A (2.4)
An interesting property is portrayed in Figure 2.7: if one slides the pictured cone along the plot of a 2D function, the function is Lipschitz continuous if its graph never crosses into the white section of the cone.
Figure 2.7: Lipschitz cone. Image from Wikimedia. [6]
Lipschitz continuous functions exhibit properties that are useful for approximating the Earth Mover's distance between distributions, as discussed in Arjovsky (2017); the full, very maths-heavy discussion is out of scope for this master's dissertation. [40] A simplified explanation can be given, however: the Earth Mover's distance is the minimum cost of moving a probability distribution from shape A to shape B. We want to use it in the WGAN algorithm to move the distribution of the generated data B closer to the shape A of the original. If the norms of A and B differ too much, the distance no longer makes sense.
2.4.4 Spectral Normalization
The following subsection pertains to a special type of normalization, used only for stabilizing the training of GANs. If the reader is unfamiliar with this topic, we advise them to read section 3.2 first.
Miyato (2018) introduced spectral normalization, a novel type of weight normalization that enforces Lipschitz continuity. [41] The algorithm rescales the weights of a layer so that the implemented function becomes Lipschitz-1 continuous and is thus usable for newer GAN-like methods such as the Wasserstein GAN. [40] A weight manipulation as in WeightNorm is used, but with the spectral norm instead of the Euclidean norm.
The spectral norm is the largest singular value of a matrix. The singular values of a matrix W are the square roots of the eigenvalues of WᵀW. As a quick refresher, an eigenvalue is the amount by which a vector multiplied with a matrix gets stretched in the direction of the corresponding eigenvector. If this stretch is small, one can see that the function the matrix implements will not increase drastically. While the authors gave no proof that spectrally normalized networks are strictly Lipschitz continuous, this type of normalization gives a good approximation.
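The spectral norm can be approximated cheaply by power iteration, which is also how Miyato (2018) computes it in practice (with a single iteration per training step). The sketch below is our own illustrative version, not the paper's code:

```python
import numpy as np

def spectral_norm(W, n_iters=50):
    """Approximate the largest singular value of W by power iteration."""
    u = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    v = None
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)   # dominant right singular vector
        u = W @ v
        u /= np.linalg.norm(u)   # dominant left singular vector
    return u @ W @ v

W = np.array([[2.0, 0.0],
              [0.0, 0.5]])
sigma = spectral_norm(W)  # largest singular value of W: 2.0
W_sn = W / sigma          # the spectrally normalized weight matrix
```

Dividing the weights by their spectral norm makes the layer's largest stretch factor equal to one, which is the Lipschitz-1 property the method is after.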
2.5 Attention
Attention is a memory-like mechanism used in the machine learning field, introduced by Bahdanau (2014). [42] By using a measure of similarity between the inputs in a sequence and a hidden state, one creates a sort of heatmap over the original, in which the more important values score higher. It has been used successfully to create a music generator with very long-term memory using a Transformer network, a special type of architecture using only attention [43] as a replacement for recurrent neural networks. [44]
Attention comes in hard and soft variants: hard attention makes the network attend to only a single location, while soft attention is spread out over the whole input domain. Hard attention cannot be used in conjunction with backpropagation, since there is no gradient for the parts not attended to, so it is restricted to evolutionary networks or networks trained via reinforcement learning. Most soft attention implementations use a dot product as the measure of similarity.
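A minimal dot-product soft attention over a toy sequence (illustrative sketch; real implementations first project queries, keys and values with learned matrices):

```python
import numpy as np

def soft_attention(query, keys, values):
    """Dot-product soft attention: similarity scores become a softmax
    'heatmap' over the sequence, which then weights the values."""
    scores = keys @ query                    # dot-product similarity
    weights = np.exp(scores - scores.max())  # numerically stable softmax...
    weights /= weights.sum()                 # ...summing to one over positions
    return weights @ values, weights

keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([4.0, 1.0])                 # most similar to the third key
attended, weights = soft_attention(query, keys, values)
```

The softmax weights form exactly the heatmap described above: every position gets a nonzero share, but the position most similar to the query dominates the output.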
Zhang (2018) used self-attention to improve the performance of GANs. [7] A visual explanation of how they used attention can be seen in Figure 2.8, an image taken from the SAGAN paper. Note the 1x1 convolutional kernels used in the paper: a 1x1 convolution is simply a learned linear combination across the channels. One can see images taken from a top-level attention layer of one of our own networks in Figure 2.9.
2.6 Dropout
Dropout is a technique to minimize the overfitting phenomenon in neural networks, in-
troduced by Srivastava (2014). [8] By probabilistically zeroing out certain elements from
2.6. DROPOUT 13
Figure 2.8: Picture explaining the SAGAN self attention implementation. Image from Zhang (2018). [7]
a hidden layer, deeper layers will have to take a more averaged weight from the layer,
meaning that it will be less likely to overfit. During evaluation, the Dropout is deactivated
as to improve quality. A figure can be seen in Figure 2.10.
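A sketch of dropout in numpy. We show the "inverted" variant that rescales the surviving activations during training so that nothing needs to change at evaluation time; the original paper instead rescales the weights at test time:

```python
import numpy as np

def dropout(x, p=0.5, training=True, seed=0):
    """Inverted dropout: during training, zero each element with
    probability p and rescale the survivors by 1 / (1 - p)."""
    if not training:
        return x  # dropout is deactivated at evaluation time
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= p  # keep with probability 1 - p
    return x * mask / (1.0 - p)

x = np.ones((4, 8))
train_out = dropout(x, p=0.5, training=True)  # zeros and rescaled survivors
eval_out = dropout(x, training=False)         # untouched
```

The rescaling keeps the expected activation equal between training and evaluation, which is why the evaluation path can simply pass the input through.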
Spatial Dropout was introduced by Tompson (2014). [45] It zeroes out a complete channel when using Convolutional Neural Networks: if nearby pixels within a channel are strongly correlated, standard dropout would not regularize the network. They used it to create a better human body pose estimator.
Figure 2.9: Comparison between the o variable in the attention algorithm and the original input to the layer.
Figure 2.10: Dropout visualization. Image from Srivastava (2014). [8]
Chapter 3
Generative Neural Networks
In this chapter, two broad techniques for generating new samples from an example dataset will be explained: the Variational Autoencoder or VAE and the Generative Adversarial Network or GAN. Both are key to understanding how our network generates new samples from the given dataset. A brief overview of generative techniques for transferring the style of one input to the style of another input or input dataset, also called style transfer, will be given later on. What follows is an enumeration of the sections in this chapter:
• In section 3.1 we will discuss Variational Auto Encoders, a generative network that uses an autoencoder-like architecture with a twist.
• In section 3.2 we will discuss different types of Generative Adversarial Networks:
an algorithm where one uses two networks with different purposes. One network
tries to generate convincing samples. The other network tries to correctly classify
generated samples from real samples.
3.1 Variational Auto Encoder
An auto-encoder is a neural network that compresses its input into fewer dimensions and then tries to emit its own input as output. One could then try to generate new output that behaves like the input by feeding some type of noise into the decoder. However, one doesn't know the distribution of the features the encoder emits.
Kingma (2013) laid the foundation for the variational autoencoder. [46] VAEs are autoencoders with a probabilistic latent space as intermediate representation, which can be seen in Figure 3.1. The features in the latent
Figure 3.1: A VAE architecture. Image from Github.io. [9]
space are thus probability distributions.
Since a sample from a probability distribution does not by itself convey meaningful information about the mean or variance to backpropagate upon, the reparametrization trick is used. Adding a number to a random variable shifts its mean, and multiplying it by a number scales its standard deviation. These basic properties can be used to create any Gaussian distribution from the standard normal distribution, and to backpropagate through the mean and standard deviation of this distribution, as seen in Equation 3.1.

X ∼ N(µ, σ²) ⟺ X = σ · N(0, 1) + µ (3.1)
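Equation 3.1 in code: the randomness is isolated in a fixed N(0, 1) source, so µ and σ become ordinary differentiable parameters (numpy sketch; the gradient machinery itself is not shown):

```python
import numpy as np

def reparametrize(mu, sigma, rng):
    """Equation 3.1: a sample from N(mu, sigma^2) built from N(0, 1) noise."""
    eps = rng.standard_normal(mu.shape)  # the only source of randomness
    return mu + sigma * eps              # shift the mean, scale the std

rng = np.random.default_rng(0)
mu = np.full(100_000, 2.0)
sigma = np.full(100_000, 3.0)
z = reparametrize(mu, sigma, rng)  # ~ N(2, 9), differentiable in mu and sigma
```

Because eps does not depend on µ or σ, the chain rule can flow through the sample back into the encoder's outputs.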
To be able to generate from the decoder of the VAE, an extra loss term is added: the Kullback–Leibler divergence (or KL divergence). The KL divergence is a measure of how much one probability distribution differs from another. The divergence between the distributions emitted by the encoder and a prior distribution (often the univariate normal distribution) is used to map the overall distribution of the latent variables to that of the prior, as diminishing this divergence makes the latent variables act more like the prior. It can also enforce some regularization (as distributions with, for instance, zero variance are not really probabilistic and amount to overfitting) when a VAE-like structure is used without the need to generate from the decoder manually later. Equation 3.2 is the standard formula for calculating the KL divergence. Equation 3.3 is the KL divergence for a random variable R that consists of a set of Gaussian variables with known mean and variance. The derivation can be found in the appendix of the original VAE publication. [46]
DKL(P ‖ Q) = ∑x P(x) log(P(x) / Q(x)) (3.2)

DKL(R ‖ N(0, 1)) = −½ ∑j (1 + log(σj²) − µj² − σj²) (3.3)
3.2 Generative Adversarial Network
A GAN or generative adversarial network is a network consisting of two sub-networks: a generator that tries to generate truthful data and a discriminator that tries to discriminate the real data from the fake data. There are many different types of GANs. In the following subsections, we will discuss some of the developments pertaining to GANs and some known difficulties concerning their use. A good introduction to the current state of the art has been written by Kurach (2018). [47]
3.2.1 Original GAN
A GAN or generative adversarial network is a network consisting of two sub-networks: a generator that tries to generate truthful data and a discriminator that tries to discriminate the real data from the fake data. [48] The generator can use the discriminator's output to optimize its own output, and vice versa. This way, the networks play a minimax game in which both try to update their parameters so as to minimize their own loss, until they reach a Nash equilibrium. This is visualized in Figure 3.2, an image taken from a Kevin McGuinness presentation found on Slideshare. [10]
D_D(x, z) = −log(σ(D(x))) − log(σ(−D(G(z)))) (3.4)

D_G(x, z) = −log(σ(D(G(z)))) (3.5)
Figure 3.2: The standard GAN architecture visualized. Image from Slideshare. [10]
In Equation 3.4 and Equation 3.5, one can see the losses of the two networks: σ stands for the sigmoid function, x for original data samples from the training dataset, and z for a random noise sample.
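Equations 3.4 and 3.5 in code, on scalar discriminator outputs (illustrative sketch; log(σ(z)) is computed in a numerically stable way):

```python
import numpy as np

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -np.logaddexp(0.0, -z)

def discriminator_loss(d_real, d_fake):
    """Equation 3.4: reward high D(x) on real data, low D(G(z)) on fakes."""
    return -log_sigmoid(d_real) - log_sigmoid(-d_fake)

def generator_loss(d_fake):
    """Equation 3.5: the generator instead rewards high D(G(z))."""
    return -log_sigmoid(d_fake)

good = discriminator_loss(d_real=5.0, d_fake=-5.0)  # confident and right
bad = discriminator_loss(d_real=-5.0, d_fake=5.0)   # confident and wrong
```

Each side descends its own loss on alternating steps, which is the minimax game described above.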
3.2.2 Mode collapse
GAN-type networks are notoriously difficult to train: small hyperparameter changes can alter the network's output drastically and cause all types of failures. Many publications have presented solutions or explanations for this subset of problems; however, the biggest cause of instability is the mode collapse failure mode. An example from the Unrolled GAN paper can be seen in Figure 3.3. [11] The network in the figure was trained on the popular MNIST dataset; after a certain number of epochs, its output would 'collapse' to contain only sixes.
3.2.3 Wasserstein GAN and other improvements
The most popular basis for a solution was presented in the Wasserstein GAN paper and its follow-ups. Its authors claimed to have solved the mode collapse problem theoretically, [40] by changing the loss and forcing the discriminator to implement a Lipschitz continuous function (a topic handled in subsection 2.4.3), so that a heuristic for the Earth Mover's distance could be used. They enforce this strict constraint by clipping the weights of the discriminator, as this limits the norm of the gradient of the network. The new loss functions can be seen in Equation 3.6 and Equation 3.7.
Figure 3.3: An example of mode collapse from a network learning the popular MNIST dataset. Image from Metz (2016). [11]
D_L(x, z) = −D(x) + D(G(z)) (3.6)

D_G(x, z) = −D(G(z)) (3.7)
The reason this Lipschitz continuity is such a useful property for a discriminator is that one is guaranteed that ||f(x) − f(x + ε)|| ≤ ||ε||, so small changes in the input cannot completely change the discriminator's output. In other words, if the discriminator recognizes a fake that greatly resembles some real examples, it still has to give it good marks.
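Equations 3.6 and 3.7, together with the weight clipping the authors use to enforce Lipschitz continuity (illustrative numpy sketch, not the original implementation):

```python
import numpy as np

def critic_loss(d_real, d_fake):
    """Equation 3.6: no sigmoid or log; the critic emits unbounded scores."""
    return -np.mean(d_real) + np.mean(d_fake)

def wgan_generator_loss(d_fake):
    """Equation 3.7."""
    return -np.mean(d_fake)

def clip_weights(weights, c=0.01):
    """The original WGAN recipe: clamp every critic weight into [-c, c]
    after each update, a crude way to bound the gradient norm."""
    return [np.clip(w, -c, c) for w in weights]

weights = [np.array([[0.5, -0.002], [0.03, -0.5]])]
clipped = clip_weights(weights)
loss = critic_loss(np.array([1.0]), np.array([0.25]))
```

The clipping threshold c is a hyperparameter; the gradient penalty of Gulrajani (2017) and the spectral normalization of Miyato (2018), discussed next, are softer alternatives to this hard clamp.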
Radford (2015) was the first to scale up GANs, in their deep convolutional GAN or DCGAN. [12] This paper is the baseline for image-based GANs. In Figure 3.4, one can see the authors' guidelines for training a GAN; most of these points are still valid. In Figure 3.5, the architecture used is visualized. The length of each block signifies the number of channels used. Note that they upsample with stride 2 while halving the number of channels in every step: the amount of data doubles up until the last layer.
Gulrajani (2017) tried to improve this method of enforcing Lipschitz-1 continuity by adding an extra gradient penalty term. [49] Another technique that enforces this Lipschitz
Figure 3.4: The DCGAN guidelines. Image from Radford (2015). [12]
Figure 3.5: The DCGAN architecture. Image from Radford (2015). [12]
continuity is the spectral normalization introduced by Miyato (2018) and discussed in
subsection 2.4.4. [41]
Brock (2018) improved upon previous GAN papers by scaling up the setting to an
immense size. [50] They used 512 cores and trained their network for multiple days to
obtain the then state-of-the-art GAN results. They also used what they call the 'truncation'
trick: by resampling the noise whenever it falls outside a certain predefined range, they
improved the GAN's performance during evaluation.
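The truncation trick reduces to a small sampling routine: any noise component whose magnitude exceeds a threshold is drawn again. A minimal sketch, with the threshold of 0.5 chosen for illustration rather than taken from the paper:

```python
import numpy as np

def truncated_normal(shape, threshold=0.5, rng=None):
    """Sample z ~ N(0, 1) and resample any component that falls
    outside [-threshold, threshold] (BigGAN-style truncation)."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(shape)
    while True:
        mask = np.abs(z) > threshold        # components that violate the range
        if not mask.any():
            return z
        z[mask] = rng.standard_normal(mask.sum())  # redraw only those

z = truncated_normal((4, 128), threshold=0.5)
```

A smaller threshold trades sample variety for individual sample quality, which is why it is only applied at evaluation time.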
Zhou (2018) researched numerous loss functions and proposed an optimal class of
functions for GAN losses. [51] They described the useful properties a GAN loss function
should have and devised new loss functions accordingly. For instance, the standard GAN
loss presented in Equation 3.4 is actually their improvement upon the original.
Chapter 4
Style Transfer
A style transfer maps a given work of art to a new work of art that takes on the
artistic style of another work, while still resembling the original in content.
All of the pioneering research presented in this chapter was done on images, but some of
the discussed techniques can be applied to musical style transfer as well.
An enumeration of the content in this chapter:
• In section 4.1 a brief history of the style transfer algorithm is given.
• In section 4.2 the cycle consistent style transfer will be explained, an important
algorithm for our own setup.
4.1 Pioneering style transfer works
The pioneering paper by Gatys (2014) used the responses of a network trained for visual
object detection to backpropagate through an image initialized as random noise,
until it had similar object features (final layer output) to the original content image. [13]
To make sure the picture would have the same style as the style input image, a loss was
defined on the correlations between multiple filter responses throughout the network. In
Figure 4.1, one can see the filter responses chosen to signify style for both pictures,
in multiple layers of the network. Notice that the content representation becomes noisier
deeper into the network.
The biggest problem with this algorithm is its lack of reusability: the algorithm
has to backpropagate through the network anew for every style or content input;
there is no reusable component, so to speak. Specific to our research, there also does
not exist an analogue of an object detector for music. One can see the
Figure 4.1: Filter responses throughout the convolutional network. Image from Gatys (2014). [13]
effect of their algorithm in Figure 4.2: the left image is the content input and the painting
on the bottom left of the right image is the style input.
Figure 4.2: Example of style transfer. Image from Gatys (2015). [13]
Later on, an offline approach was proposed that would generate the correctly
stylized image for any content input given during evaluation, with a single style image
chosen before training. [14] Instead of changing noise until it became a correctly stylized
image, the authors proposed using a network that would be trained under the same losses as in the
original algorithm (thus acting on the responses of the layers of a pre-trained network),
but would learn to convert images into one single style. This solved the biggest
problem the pioneering style transfer algorithm had, as no additional training has
to be done to convert an image to the learned style. A visual of how the loss is calculated
can be seen in Figure 4.3.
Figure 4.3: How the offline style transfer algorithm calculates its loss. Image from Johnson (2016). [14]
4.2 Cycle consistent style transfer
The techniques discussed above would still be impossible to implement for music, since
content in music can't really be expressed in terms of objects; thus a content metric
based on which objects are detected would not work.
In 2017, the Cycle GAN paper introduced a style transfer algorithm that is more suitable
for music, based on the GAN architecture. [15] If one has two datasets X and Y containing
works of art of a particular kind, one could define a mapping F from domain X to Y and a
mapping G from domain Y to X, and use invertibility as a loss criterion. However,
the outputs of the mappings would not necessarily be realistic in the other domain. The
discriminators from the GAN architecture come into play here: they can enforce that the
style of F(x), for x an element of the dataset X, is indistinguishable from the style of Y.
One can see a detailed infographic of their algorithm in Figure 4.4 and an example of
the fine-grained content change this setup is able to achieve in Figure 4.5. In Figure 4.4,
X and Y are the image datasets, and F and G the functions mapping an image from Y
to X and from X to Y, respectively. The cycle consistency loss is visually explained in
the two rightmost pictures: mapping sample x to Y and back should land it close
to x again; idem for the sample y shown on the right. In Figure 4.5, one can see
the results of a cycle consistent style transfer on datasets containing pictures of horses
and zebras: notice that the horses and zebras are in exactly the same position, only the
coat color has changed. In Figure 4.6, one can see the architecture they used. Notice the
similarity to a standard autoencoder, as one downsamples and upsamples
the image. The convolutionally downsampled middle layer can be thought of as a high-level
representation of the image, although they do not downsample that heavily. These
images have been taken from the Cycle GAN paper. [15]
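The cycle consistency idea reduces to a small loss term: map a sample to the other domain and back, and penalize how far it lands from where it started. A minimal sketch on toy one-dimensional "domains" (the mappings `F` and `G` below are hypothetical stand-ins for the learned generators):

```python
import numpy as np

def cycle_consistency_loss(F, G, x):
    """L1 cycle loss: mapping x to the other domain and back
    should land close to x again (mean of |G(F(x)) - x|)."""
    return np.mean(np.abs(G(F(x)) - x))

# Toy mappings: F doubles, G halves, so the cycle is exact.
F = lambda v: 2.0 * v
G = lambda v: 0.5 * v

x = np.array([1.0, -3.0, 0.25])
loss = cycle_consistency_loss(F, G, x)      # exact inverse: loss is zero
```

If `G` were instead `0.5 * v + 1.0`, every sample would land one unit away after the round trip and the loss would be exactly 1. The discriminators then add the missing half of the criterion: the intermediate F(x) must also look like it belongs to Y.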
However, Chu (2017) found that Cycle GAN networks could add unwanted high-frequency
information to aid themselves in learning a complex cyclical mapping. [16] Almahairi
(2018) added a defence against this type of error by introducing a latent variable
in their Augmented CycleGAN framework. [52] In Figure 4.7, one can see an example of
the "steganography" a CycleGAN can employ: whilst there is no detail of a chimney in
the 2D map, the recreated satellite view does contain chimneys. In Figure 4.8 one
can see the reason: imperceptible Gaussian noise was added to the second feature map to
recreate the image.
Figure 4.4: Cycle Consistent style algorithm. Image from Zhu (2017). [15]
Figure 4.5: An example of cycle consistent style transfers on pictures of horses and zebras. Image from Zhu (2017). [15]
Figure 4.6: Architecture of CycleGAN. Image from Zhu (2017). [15]
Figure 4.7: Cycle-transferred satellite map with added details that aren't visible in the roadmap. Image from Chu (2017). [16]
Figure 4.8: A comparison between an evaluation roadmap and a training roadmap, respectively. High-frequency encoding is made visible by adaptive histogram equalization. Image from Chu (2017). [16]
Chapter 5
State of the Art in Musical Style
Transfer
This chapter lists the most important papers with regard to this dissertation.
A small summary of the sections:
• section 5.1 is a section about which generative techniques are the most novel.
• section 5.2 is about computer music generators. As our style transfer also has to
generate music, some insight into what works and what doesn’t in that department
is interesting.
• section 5.3 is a brief look at master's dissertations on similar subjects at our
university.
• section 5.4 is about style transfers in the musical domain. Not too many papers on
this subject have been accepted at conferences, thus a brief overview of what has
been tried in the last couple of years will be given.
5.1 Generative techniques
Zhang (2018) introduced SAGAN (Self-Attention Generative Adversarial Networks). [7]
These are GANs using Spectral Norm, with an attentional layer in between. This network
generates very realistic looking images and recorded a very strong (low) FID score during
evaluation. One peculiarity, however, is that they use hinge-style losses instead of the
normal WGAN loss.
Almahairi (2018) created the Augmented CycleGAN. Instead of just using a one-to-one
mapping, they went further and designed a many-to-many mapping by giving the network
an extra latent variable as input.
5.2 Music Generation
Yang (2017) used a convolutional GAN for their MidiNet. [26] According to human
evaluation, MidiNet generated music as realistic as other state-of-the-art techniques at
the time, but more pleasant to listen to.
Huang (2018) created a Transformer-based music generator. [44] A special property
of this generator is its ability to generate music with sound long-term structure. Whilst it
still glitches, the music this generator produces was evaluated as very realistic by humans.
5.3 Previous master's dissertations
In 2017, two master’s dissertations were written at our university about style transitions
and a style transfer. We thought it would be interesting to feature their work in our own
dissertation. Training GANs was not as stable as it is nowadays, so auto-encoder style
setup would have looked a lot more feasible during that era.
Spitael (2017) published a master's dissertation on style transfer between classical
composers. [17] The author tried to transfer the style of one composer to a composition
of another by coupling multiple decoders (one for each composer) to a single
universal encoder. The setup they used can be seen in Figure 5.1. The numbered lines
represent how the encoder and corresponding decoders were trained; during evaluation
these could then be coupled in another way. A new way of vectorizing MIDI was
proposed: instead of the piano-roll representation (a vector representing the complete
state of the keyboard at every time instant), only one pitch (input as a value and not as
a vector), its corresponding velocity and the elapsed time would be encoded. The system
would still be polyphonic, as one could input multiple notes with a time difference
of zero. An expert and non-expert audience found the generated transferred songs
more pleasing than random noise, but preferred the originals. Sadly, the published
audio files have been removed from the author's SoundCloud.
Figure 5.1: The architecture used by Spitael (2017). [17] The numbered sequences portray how the autoencoder was trained.
De Coster (2017) published a master's dissertation on using RNNs to generate polyphonic
music with style transitions. [18] Instead of using different decoders for different
composers, as his colleague Spitael did, the author tried to predict the next
sample given a certain input sample and a given composer whose style the generator tries
to match. This output can then be fed into the network again, generating new sequences.
During generation, the style vector can change from one composer to another, forcing
the machine to predict a different future output and making it possible to interpolate
between composers. The author assessed that the quality of the output is lower than that of
the real compositions, but not by much. One can see his network architecture in Figure 5.2.
Figure 5.2: The architecture used by De Coster (2017). [18]
5.4 Musical Style Transfer
Malik (2017) published a paper about GenreNet, a network they trained on note velocity
data for different genres. [19] By using a single encoder and multiple LSTM decoders (one
for each genre), the network outputs note velocities for input songs. This architecture
can be seen in Figure 5.3. In the evaluation, 46% of songs were correctly identified, 25%
wrongly identified and 25% couldn't be identified. Thus, it passed what the author calls
the "musical Turing test".
Figure 5.3: The architecture for velocity prediction in GenreNet. Image from Malik (2017). [19]
Brunner published two papers about musical style transfer. One of them uses a
Variational Autoencoder, the other a Cycle GAN style setup.
Brunner (2018) used a Variational Autoencoder to generate style-transferred music
by adding a genre tag to the features emitted by the encoder. [53] By training the VAE
until it reproduced the original input, one can simply swap the genre tag during
evaluation to listen to the style-transferred output. One of the difficulties with this setup is
that the genre has to be a meaningful feature for the decoder: if it simply ignores
that particular feature, swapping the genre tag will not result in much change. Contrary
to most other techniques, they actually use note velocity as an input. Note length is also
fed separately to the encoder.
In another paper, Brunner (2018) used a Cycle GAN setup to generate style-transferred
music. [20] To evaluate the generated music automatically, they used a genre classifier
trained on three genres. If the classifier placed the song in the correct genre,
the generated sample was evaluated positively. They transformed the songs bar
per bar, transforming whole songs this way for some of the examples uploaded online.
Figure 5.4: The VAE architecture used by Brunner (2018). [20]
No human evaluation was done in their paper. The architecture of the generator and
discriminator can be seen in Figure 5.5 and the architecture of the classifier in Figure 5.6.
Figure 5.5: The Cycle GAN architecture used by Brunner (2018). [20]
Figure 5.6: Genre classifier architecture used by Brunner (2018). [20]
Chapter 6
Dataset
As a network can only be as good as the data it receives, the dataset is an important
part of the experiment. For data of any kind, there are multiple choices concerning its
interpretation and layout, and this also holds for music. We collected a dataset of
symbolic music in the MIDI format only and binarized it using existing libraries.
What follows is an enumeration of all the sections in this chapter:
• In section 6.1 we will explain what MIDI is, how it can be used and in what way
we use it.
• In subsection 6.1.1 we will explain what piano rolls are, a way to represent music
played on a particular instrument in terms of pitch and rhythm.
• In subsection 6.1.2 we will explain why we chose the songs we chose, out of a larger
dataset.
6.1 MIDI
MIDI (Musical Instrument Digital Interface) is a format originally intended for
communication between multiple musical instruments. MIDI describes a piece of music as a
sequence of events. As it was originally meant for live music, an event can be a change of
tempo, a change of instrument or a note event. Note events describe which notes are set
on or off at certain time-steps and how loud they should be played, also known as the
velocity of the note. If a MIDI stream is recorded, playing it back on another synthesizer
should produce the same song (albeit possibly with a different timbre due to differences
in equipment). This way, MIDI can be played back on a PC or smartphone using the
operating system's wavetable synth. A dissection of a MIDI message can be seen in
Figure 6.1. This message
contains the following sections: the note-on or note-off message code, the pitch of the note
to be turned on or off, and the velocity (note loudness). For our research we have omitted
the velocity information everywhere, as it would further complicate the design of the
network.
Figure 6.1: A MIDI-message. Image from PlanetofTunes. [21]
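The dissection described above can be sketched in code. The helper below is a hypothetical illustration of how the three bytes of a channel voice message split into a status nibble, a channel, a pitch and a velocity; note that, per the MIDI convention, a note-on with velocity 0 is treated as a note-off.

```python
def parse_midi_note_message(msg):
    """Split a 3-byte MIDI channel voice message into its parts."""
    status, pitch, velocity = msg
    kind = status & 0xF0           # high nibble: message type
    channel = status & 0x0F        # low nibble: MIDI channel
    if kind == 0x90 and velocity > 0:
        event = "note_on"
    elif kind == 0x80 or (kind == 0x90 and velocity == 0):
        event = "note_off"         # velocity-0 note-on counts as note-off
    else:
        event = "other"            # e.g. control change, pitch bend
    return event, channel, pitch, velocity

# Middle C (pitch 60) played at velocity 100 on channel 0:
event, channel, pitch, velocity = parse_midi_note_message([0x90, 60, 100])
```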
6.1.1 Piano rolls
However, a MIDI message's structure does not fit into our neural network, which works
on tensors and vectors, without some preprocessing. To achieve this, we convert our piano
pieces to piano rolls. A piano roll is a representation of the notes as
a concatenation of multi-hot vectors, one for the notes played at each timestep. These
can be imagined as vectors containing the notes held down at a specific time-step, i.e.
the keys pressed on a keyboard, laid out along the time dimension. An example can be seen in
Figure 6.2. The time axis goes from left to right; the pitch axis has low pitches at the
top of the image and high pitches at the bottom.
Figure 6.2: Typical visualisation of a piano roll. Image from Researchgate. [22]
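The conversion from note events to a piano roll can be sketched as follows. `to_piano_roll` is a hypothetical helper, with notes given as (start, end, pitch) triples on the quantized time grid; each row of the result is the multi-hot vector for one time-step.

```python
import numpy as np

def to_piano_roll(notes, n_steps, n_pitches=128):
    """Build a binary piano roll (time x pitch) from
    (start_step, end_step, pitch) triples; end is exclusive."""
    roll = np.zeros((n_steps, n_pitches), dtype=np.uint8)
    for start, end, pitch in notes:
        roll[start:end, pitch] = 1       # hold the note over its duration
    return roll

# A C major triad held for 4 time-steps, then a single C for 4 more:
notes = [(0, 4, 60), (0, 4, 64), (0, 4, 67), (4, 8, 60)]
roll = to_piano_roll(notes, n_steps=8)
```

Because multiple notes can be active in the same row, the representation is polyphonic by construction.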
6.1.2 Song selection
A dataset containing piano-only symbolic MIDI music was assembled from MIDI songs
found on the internet. We chose to only use piano data, as we suspected it would simplify
the experiment and lead to better results: people tend to play different notes on different
instruments. For example, the general dissonance (how many notes are played from
outside of a chosen scale) in standard electric guitar playing will be higher than for piano,
even within a fixed genre such as rock music.
The selected genres were classical and ragtime, as many piano-only MIDI files could be
found for both genres, and both have a distinct feel to them. The ragtime
MIDI files were played by a human pianist, whilst the classical MIDI files are suspected
to have been sequenced rather than performed. This could be important for the timing noise, as
a human-performed piece would be slightly more swung. The classical dataset is slightly
larger, with 320 pieces against the ragtime set's 220 pieces.
For use in our neural network, these MIDI files are converted to piano-roll sequences.
Piano-rolls are vectors containing the notes held down at a specific time-step. Since the
MIDI format allows for up to 128 notes to be played on each instrument, these vectors are
128 elements long, even though a real-life piano only has 88 keys. One could criticise this
choice, as other setups, even multi-instrument ones like Brunner (2018), use a smaller piano
roll per timestep, [20] but the possibility of learning from any dataset for any MIDI
instrument in the future is a definite plus.
Both sets are quantized to the 16th note, both to reduce timing noise and to limit
network complexity. The songs are then cut up into pieces of 4 bars: we take multiple
non-overlapping 4-bar excerpts from each song for training. Data augmentation
is done by adding pitch-shifted versions of the songs to the dataset: every song is
shifted from six semitones down to five semitones up. This also nullifies an inherent scale
imbalance in our dataset.
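The pitch-shift augmentation described above (six semitones down to five up) can be sketched on piano-roll arrays. `pitch_shift` and `augment` are hypothetical helpers that zero-fill any notes shifted past the edge of the pitch range instead of wrapping them around.

```python
import numpy as np

def pitch_shift(roll, semitones):
    """Shift a (time x pitch) piano roll along the pitch axis,
    zero-filling notes shifted past either edge of the range."""
    shifted = np.zeros_like(roll)
    if semitones >= 0:
        shifted[:, semitones:] = roll[:, :roll.shape[1] - semitones]
    else:
        shifted[:, :semitones] = roll[:, -semitones:]
    return shifted

def augment(roll, down=6, up=5):
    """All shifted copies from `down` semitones below to `up` above."""
    return [pitch_shift(roll, s) for s in range(-down, up + 1)]

roll = np.zeros((8, 128), dtype=np.uint8)
roll[:, 60] = 1                     # a held middle C
copies = augment(roll)              # 12 shifted versions, including s = 0
```

Shifting by whole semitones keeps intervals intact, which is why it balances the key distribution without changing the character of the music.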
One can compare the augmented and unaugmented datasets through the graphs
in Figure 6.3 and Figure 6.4. The top graphs are distributions by pitch; the bottom
graphs are distributions over the notes of the chromatic scale, the most commonly used music
system. A clear imbalance can be seen in the unaugmented graphs: the distribution of
pitches isn't smooth at all, and the distribution over notes is far from constant. After
augmentation, these problems have gone away. Note that many possible pitches at the
higher and lower ends of the MIDI pitch spectrum are unused; these notes
aren't often picked by composers, and a real-life piano also only has 88 keys compared
to the MIDI standard's 128 pitches.
Figure 6.3: The distribution of notes in the unaugmented training set.
Keen readers will notice that the notes still aren't perfectly balanced after the shifting
augmentation. This is because a small number of shifted examples were held out of the
training dataset, to be able to assess overfitting if needed. This held-out set did not get
used, however.
Loading and quantizing the MIDI files was done with the Pypianoroll library. [54] The
MIDI was binarized with this library as well; the binarization was bugged because of a
typo in the pip version, so a nightly version from GitHub had to be installed.
In Figure 6.5 one can see the number of notes per dataset. The classical set contains
more notes, partly because there are more classical songs than ragtime songs in our
dataset. In Figure 6.6 one can see the average number of notes per excerpt: the classical
set has more notes by every measure.
Figure 6.4: The distribution of notes in the augmented training set.
Figure 6.5: Note count of multiple datasets.
Figure 6.6: Average number of notes per song for the different datasets.
Chapter 7
Architecture
Many architectural and algorithmic choices have to be made. In this chapter we will detail
two architectures used in our experiment and some problems encountered while engineering
them.
What follows is an enumeration of the different sections in this chapter.
• In section 7.1 we will discuss the general idea of our algorithm and model.
• In section 7.2 we will divulge all of the details about our first attempt at making a
style transfer network.
• In section 7.3 we will lay out an imbalance between the generators that was observed
after training for long enough.
• In section 7.4 we will explain how we solved the above problem.
7.1 High level overview
Looking at our list of generative techniques and style transfers, we could see two possible
choices: using a VAE or using a Cycle GAN. We opted for the second option, as the
research on Cycle GAN wasn’t public at the time we started writing.
Although we opted for the Cycle GAN architecture to enforce the stylization, [15] we
did not rule out that a latent layer could make the style transfer more realistic. Another
new angle for us to explore was the use of attentional layers; the SAGAN paper showed
that attentional layers can work well in generative settings. [7] One last thing we
wanted to try was the WGAN-style Lipschitz discriminator. [40] None of the things
mentioned in this paragraph were explored by Brunner (2018). [20] One can see a
high-level overview of the algorithm in Figure 7.1.
Figure 7.1: Algorithm outline.
7.2 Original Implementation
In this section we will explain all of the architectural building blocks which were
experimented with and give the reasoning for why these specific blocks were put in the
places they were. All neural networks were designed with the PyTorch machine learning
framework. [55] All graphs were made with matplotlib. [56]
In Equation 7.1 and Equation 7.2 one can see the complete loss functions used in the
end: for the discriminator we used a hinge loss, and for the generator the standard
WGAN loss in conjunction with a binary cross entropy. The generator subscript ab
denotes that genre a gets converted to genre b; the losses for the b-to-a generator
and the a discriminator are symmetrical to the ones listed in the equations. The second BCE
term in the generator's loss is a way to enforce idempotency on the network. The w0 and
w1 parameters scale the losses on the generator's side; w1 was mostly set to
zero, and w0 to five.
L_Db(x_a, x_b) = E_{x_b}[ max(0, 1 − D_b(x_b)) ] + E_{x_a}[ max(0, 1 + D_b(G_ab(x_a))) ] (7.1)

L_Gab(x_a, x_b) = −E_{x_a}[ D_b(G_ab(x_a)) ] + w_0 E_{x_a}[ WBCE(x_a, G_ba(G_ab(x_a))) ] + w_1 E_{x_b}[ WBCE(x_b, G_ab(x_b)) ] (7.2)
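A NumPy sketch of the loss terms in Equations 7.1 and 7.2. This is an illustration only: the per-note weighting of the WBCE is omitted, and all inputs are dummy score vectors rather than real network outputs.

```python
import numpy as np

def hinge_d_loss(d_real, d_fake):
    """Equation 7.1: hinge loss on discriminator scores for real
    samples (d_real) and generated samples (d_fake)."""
    return np.mean(np.maximum(0.0, 1.0 - d_real)) + \
           np.mean(np.maximum(0.0, 1.0 + d_fake))

def bce(target, pred, eps=1e-7):
    """Plain binary cross entropy between a target piano roll and a
    reconstruction (the weighting of WBCE is omitted in this sketch)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def gen_loss(d_fake, x_a, cycled_a, x_b, identity_b, w0=5.0, w1=0.0):
    """Equation 7.2: Wasserstein term plus weighted cycle and
    identity reconstruction terms (w0 = 5, w1 = 0, as in the text)."""
    return -np.mean(d_fake) + w0 * bce(x_a, cycled_a) + w1 * bce(x_b, identity_b)

d_real = np.array([1.5, 0.5])       # dummy scores on real samples
d_fake = np.array([-2.0, 0.0])      # dummy scores on generated samples
x = np.array([1.0, 0.0, 1.0])       # a tiny "piano roll" slice
loss_d = hinge_d_loss(d_real, d_fake)
loss_g = gen_loss(d_fake, x, x.astype(float), x, x.astype(float))
```

With a perfect reconstruction the BCE terms vanish and the generator loss reduces to the Wasserstein term alone.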
In the following figures we divulge the architectures of our encoder, decoder and
discriminator, in Figure 7.2 and Figure 7.3. Please note that this discriminator is
Lipschitz-1 continuous, so we have to use one of the unbounded discriminator loss functions
with this architecture. Dimensions are noted as H×W×C. One can see the architecture of the
VAE's encoder in Figure 7.4.
The attentional layers were implemented as explained in the SAGAN paper and shown
in Figure 2.8: they give the network a receptive field bigger than the composition of
convolutional layers allows, which is very interesting for music, as it doesn't
necessarily have a structure that fits into the space of one filter (as edges do in normal
pictures).
For normalization in the generator, Batch Norm was chosen. [34] While it has
some issues in evaluation mode, because the running mean over batches has
to be recalculated, [57] it kept the model regularized to a greater extent than Instance
Norm. [36]
For normalization and enforcing the Lipschitz continuity in the discriminator, Spectral
Normalization was chosen. Models without this type of discriminator showed mode
collapse problems.
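Spectral normalization divides a weight matrix by an estimate of its largest singular value, usually obtained by power iteration. A minimal NumPy sketch follows; real implementations (e.g. in deep learning frameworks) run only one power iteration per training step and cache the vector u between steps, whereas this sketch iterates until convergence for clarity.

```python
import numpy as np

def spectral_normalize(W, n_iters=50, rng=None):
    """Divide W by a power-iteration estimate of its largest
    singular value, so the normalized matrix has spectral norm 1."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v          # estimated largest singular value
    return W / sigma

W = np.array([[3.0, 0.0],
              [0.0, 1.0]])    # singular values 3 and 1
W_sn = spectral_normalize(W)  # spectral norm of W_sn is 1
```

Constraining every layer's spectral norm to 1 bounds the Lipschitz constant of the whole (convolutional) discriminator, which is exactly the property the WGAN losses require.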
As for filter sizes, at first we opted for convolutional layers with kernel size 3x3 and
stride 2 for the encoding and decoding parts of our generators, like Brunner (2018) did. [20]
But this type of convolutional layer has a tendency to put more information in certain
output elements than one whose kernel size is a multiple of the stride. [2] A hypothesis
is that this type of artifact had a smaller effect on Brunner's model because their
outputs were only of size 64x84 instead of 256x128. All other recent GAN papers use
deconvolutions for which this non-checkerboarding property applies. A comparison for
pictures can be seen in Figure 7.5.
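The checkerboard effect can be illustrated by counting, for a 1-D transposed convolution, how many input elements contribute to each output position: with kernel 3 and stride 2 the counts alternate between positions, while kernel 4 (a multiple of the stride) gives uniform coverage. The helper below is an illustration, not part of any real network.

```python
import numpy as np

def overlap_counts(n_in, kernel, stride):
    """For a 1-D transposed convolution, count how many input
    elements contribute to each output position."""
    n_out = (n_in - 1) * stride + kernel
    counts = np.zeros(n_out, dtype=int)
    for i in range(n_in):
        counts[i * stride : i * stride + kernel] += 1
    return counts

# Look only at interior positions (the borders always taper off):
uneven = overlap_counts(16, kernel=3, stride=2)[4:-4]  # alternates 2, 1, 2, 1, ...
even   = overlap_counts(16, kernel=4, stride=2)[4:-4]  # uniformly 2
```

The alternating counts are exactly the checkerboard pattern: every other output pixel receives twice as much "paint" from the kernel as its neighbours.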
A typical failure mode for GANs consists of the generator or discriminator getting too
good too early on. Instead of training one of them for more iterations than the other
to counteract this, we opted for different learning rates for generator and discriminator.
This method was introduced in the Two Time-scale Update Rule (TTUR) GAN paper. [58]
We used a TTUR factor of 10, as in the SAGAN paper, meaning the learning rate of the
discriminator is 10 times as big as that of the generator. [7] As our images are pretty
lightweight, we chose a bigger batch size and a higher learning rate than most other GAN
papers. The standard GAN β values for the Adam optimizer were chosen: β1 set to 0 and
β2 set to 0.9.
Our experiments with a VAE-style encoder did not noticeably improve or degrade the
output or the learning behaviour of the network; we did note, however, that the VAE-style
encoder used more resources and needed more epochs to reach the same reconstruction
loss.
7.3 Training problems
Even after improving the stability by using the Lipschitz-1 discriminator, training prob-
lems persisted. The Wasserstein distance of one of the two generators steadily increased,
whilst the Wasserstein distance of the other one would stay relatively small. In the gen-
erated output Whilst the combined network is in this regime, the output of one of the
generators is often something very amusical, whilst the other generator produces some-
thing more akin to human made music. An example training loss graph can be seen in
Figure 7.6: please note that this example was generated with a smaller dataset with only
one sample per song. Note that the development dataset loss rises up to the vicinity of
one: this means the network also starts overfitting (it is hard to see due to the large scale
of this graph and the lack of significant losses in the training BCE loss).
Figure 7.2: Original encoder and decoder architecture.
Figure 7.3: Original discriminator architecture.
Figure 7.4: Original VAE encoder architecture.
Figure 7.5: Comparison between images with checkerboarding artifacts and without. Image from Distill. [2]
Figure 7.6: A plot showing the inevitable imbalance between the two generators.
A hypothesis was made that this possibly had to do with the way the weights were
updated. None of the papers using a similar algorithm explored this. We tried three
different strategies as demonstrated in Algorithm 1, Algorithm 2, Algorithm 3.
Algorithm 1 One of the three backpropagation choices one could make.
for i in epochs do
    for Batch a, Batch b in Dataset A, Dataset B do
        Calculate loss and backprop through discriminators
        Calculate loss of batch a
        Backprop through both generator AB and BA
        Calculate loss of batch b
        Backprop through both generator AB and BA
    end for
end for

Algorithm 2 One of the three backpropagation choices one could make.
for i in epochs do
    for Batch a, Batch b in Dataset A, Dataset B do
        Calculate loss and backprop through discriminators
        Calculate loss of batch a
        Calculate loss of batch b
        Sum losses and backprop through both generator AB and BA
    end for
end for

Algorithm 3 One of the three backpropagation choices one could make.
for i in epochs do
    for Batch a, Batch b in Dataset A, Dataset B do
        Calculate loss and backprop through discriminators
        Calculate loss of batch a
        Backprop through generator AB
        Calculate loss of batch b
        Backprop through generator BA
    end for
end for
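The three update schedules above can be sketched as control flow. This sketch follows Algorithm 2 (losses of both batches summed, one backward pass through both generators); all network calls are replaced by hypothetical stand-ins that merely record the order of operations.

```python
# Hypothetical stand-ins for the real discriminator/generator updates:
log = []

def discriminator_step(batch_a, batch_b):
    log.append("disc_backprop")          # would update both discriminators

def generator_loss(batch):
    log.append(f"loss_{batch}")          # would run a batch through G and D
    return 1.0                           # placeholder loss value

def generator_backprop(total_loss):
    log.append("gen_backprop")           # single backward pass through AB and BA

epochs = [0, 1]
dataset = [("a0", "b0"), ("a1", "b1")]   # paired batches from datasets A and B

for _ in epochs:
    for batch_a, batch_b in dataset:
        discriminator_step(batch_a, batch_b)
        total = generator_loss(batch_a) + generator_loss(batch_b)
        generator_backprop(total)        # one update for both generators
```

Algorithm 1 would instead call `generator_backprop` once per batch loss, and Algorithm 3 would route each loss only to its own generator.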
Algorithm 3, where each generator can only change its weights according to its own
loss, clearly had the worst performance in this regard, as the generators failed to learn
anything meaningful. Algorithm 1 and Algorithm 2 behaved similarly, but Algorithm 2
was faster, while Algorithm 1 uses less memory. None of the strategies solved the
imbalance, however.
Regularizing the network caused the imbalance to show up later but did not completely
eliminate it.
7.4 Improved Implementation
One problem not addressed in the previous sections still went unsolved: using a standard
Wasserstein-style loss instead of the hinge loss ruined the performance of our network,
with ever-increasing losses that made the program crash at the end of the run. It turns out
our discriminator (and the one used in the SAGAN paper) is probably not Lipschitz
continuous: the attentional layers contain a sum term, which violates the definition of
Lipschitz continuity. Removing the attentional layer from the discriminator immediately
caused the loss to decrease steadily after the first few epochs and stabilize. w0 and
w1 from the loss function were both set to 1 instead of the values declared above.
Using this loss function also stabilized the long-term learning profile of our overall
network: discriminator and generator take turns improving their performance,
as can be seen in Figure 7.7. Note that the Wasserstein loss keeps steadily increasing
until the 80th epoch and hovers around the same value from then on. This
model could be trained with 10 samples per song, something that was impossible with
the previous one, where one had to stop very early or use a very small dataset to keep the
imbalance from ruining the output.
The architecture was also slightly changed. One can see this reflected in Figure 7.8
and Figure 7.9.
Figure 7.7: A less imbalanced learning profile.
Figure 7.8: Updated discriminator architecture.
Figure 7.9: Updated generator architecture.
Chapter 8
Results
In this chapter we will assess the quality of our generated music. A breakthrough in terms
of model imbalance only came after the survey was sent out, so all of the quantitative data
concerns our first-generation generator.
In section 8.1 we will evaluate the music generators ourselves. The results of our
survey will be discussed in section 8.2.
8.1 Researchers' evaluation
The music generated with Architecture A has a tendency to only shuffle notes around
and to change the structural rhythm of the input song. Some outliers exist: a classical
excerpt consisting of multiple piano glissandos gets transformed into a rather quiet piece.
Architecture B (and especially the Classical-to-Ragtime generator) had a tendency
to try and match the left-hand rhythm of the genre it generates music for. The
Classical-to-Ragtime generator produced a lot of songs with a simple, driving 16th-note
rhythm, which is something that does occur often in ragtime. Whilst these covers do
not sound revolutionary by any means, humanlike behavior is noticeable. One can see this
pictured in the piano rolls in Figure 8.1 and Figure 8.2.
8.2 Human Evaluation
To assess the quality of our work a Google forms survey was sent out, consisting of 4 parts.
The test was executed single blind style, as only one questionnaire was sent out, and the
samples sent were known beforehand. First, we asked the respondents name and their
age. Their name was asked to be able to detect vandalism on the questionnaire (although
53
8.2. HUMAN EVALUATION 54
Figure 8.1: Original rhythm.
Figure 8.2: Style transferred rhythm.
it was not a required question). 41 persons responded; 51.4% of those surveyed considered themselves knowledgeable about music, and the age distribution is shown in Figure 8.3. 31 persons gave a name: no clear vandalism of the study was detected.
Figure 8.3: Age distribution of our respondents.
First, we showcased some reference ragtime and classical samples, each taken from the test set of the dataset. The respondents had to answer some questions about the likeness of two unlabeled excerpts (coming either from the test set or from the output of one of our
Figure 8.4: Distribution of our respondents' self-reported knowledge about music.
generators). The first two questions had one ragtime example and one sample generated from that example. The fourth and fifth questions had one classical example and one ragtime sample generated from the classical example. The third and sixth questions had random generated and non-generated samples, once with differing genres and once with the same genre. The raw results of the A-B comparison can be seen in Table 8.1.
Most of the respondents found the majority of generated samples incoherent in this first test, whilst all samples that were covers of each other scored high on the 'Could one be a cover of the other' question and the 'Are the previous songs similar' question. We examine the perceived quality once per style and once per generator architecture.
Question  Generated  Genre      Architecture  Coherent  Neutral  Not coherent
1         Yes        Classical  1              5.3%     10.5%    78.9%
1         No         Ragtime    1             76.3%     15.8%    13.2%
2         Yes        Ragtime    1             15.8%     21.1%    15.8%
2         No         Classical  1             86.8%      7.9%     5.3%
3         Yes        Ragtime    1             14.6%     63.4%    22.0%
3         No         Classical  1             95.1%      0.0%     4.9%
4         Yes        Classical  2             13.2%     13.2%    73.7%
4         No         Ragtime    2             73.7%     10.5%    15.8%
5         Yes        Ragtime    2             18.4%     10.5%    71.1%
5         No         Classical  2             92.1%      7.9%     0.0%
6         No         Classical  2             68.3%     19.5%    12.2%
6         No         Classical  2             95.1%      4.9%     0.0%
Table 8.1: Raw results of the coherency questions in the A-B comparisons between generated and non-generated excerpts.
Question  Genre of original  Architecture  Yes    Neutral  No
1         Classical          1             86.8%   7.9%     5.3%
2         Ragtime            1             63.2%  15.8%    21.1%
3         Mixed              /              5.3%  23.7%    71.1%
4         Classical          2             73.7%  10.5%    15.8%
5         Ragtime            2             65.8%   5.3%    28.9%
6         Classical          /             15.8%  10.5%    73.7%
Table 8.2: Raw results of the similarity questions in the A-B comparisons between generated and non-generated excerpts.
Question  Genre of original  Architecture  Yes    Neutral  No
1         Classical          1             71.1%  15.8%    13.2%
2         Ragtime            1             42.1%  23.7%    34.2%
3         Mixed              1              0.0%  26.3%    73.7%
4         Classical          2             47.4%  34.2%    18.4%
5         Ragtime            2             44.7%  28.9%    26.3%
6         Classical          2             10.5%  23.7%    65.8%
Table 8.3: Raw results of the reinterpretation questions in the A-B comparisons between generated and non-generated excerpts.
The respondents had a tendency to vote for the generated excerpt as unpleasant and the non-generated song as pleasant. However, when pitting two non-generated songs against each other, there is a significant dip in perceived coherence for one of the two samples.
Many respondents found the excerpts to be similar, but not necessarily reinterpretations. The generator with the attentional layers scores better in terms of similarity and likelihood of being a cover.
Musicians had a tendency to rate the generated samples higher than non-musicians. A pure yes/no comparison on the generated samples can be seen in Table 8.4 and Table 8.5. This might indicate that people who play music have a softer boundary for claiming musical coherence.
Question  Positive answers  Percentage
1         1                  5.2%
2         5                 26.0%
4         4                 21.0%
5         2                 10.5%
Table 8.4: Musicians' positive assessments of generated songs.
Question  Positive answers  Percentage
1         1                  4.7%
2         1                  4.7%
4         2                  9.4%
5         5                 23.5%
Table 8.5: Non-musicians' positive assessments of generated songs.
In a second section, respondents were asked to score four samples on their musical coherence on a scale from one to five, and to do this twice. In each set of samples, one came from an old model: for the first set, a model that exhibited mode collapse (albeit a not unpleasant one), and for the second set, a model that generated an unpleasant sound according to the researcher. Another sample came from the test sets, and the other two were generated by the architectures described above. The scores are shown in Table 8.6. It is clear that the real sample is by far the most popular.
Q.  Arch.  Genre   Score of 1  Score of 2  Score of 3  Score of 4  Score of 5
7   1      Class.   7.9%       26.3%       31.6%       26.3%        7.9%
7   2      Class.  15.8%       47.4%       23.7%        7.9%        5.3%
7   Real   Class.   0.0%        0.0%       10.5%       28.9%       60.5%
7   P.C.   Class.   0.0%       15.8%       34.2%       36.8%       13.2%
8   1      Rag.    13.2%       34.2%       42.1%        5.3%        5.3%
8   2      Rag.     0.0%       15.8%       55.3%       23.7%        5.3%
8   Real   Rag.     0.0%        0.0%        5.3%       18.4%       76.3%
8   U.C.   Rag.    42.1%       31.6%       26.3%        0.0%        0.0%
Table 8.6: Raw results of the qualitative comparison between generators.
We suspect that one of the reasons the third section received 'worse' responses is that the samples were coupled with the originals, making any dissonances quite obvious because the source material could be heard. Some respondents told us after the questionnaire that they noticed an increase in quality in the generated samples from the fourth section. Many people liked the more pleasant sample from the mode-collapsed generator; averaged out, they liked it more than our generated examples. However, that architecture was absolutely not the way to go, as it produced the same output regardless of the input sample. One can see the average score of the samples in Table 8.7.
Type                  Avg. score
Real                  4.603
Generator 1           2.778
Generator 2           2.793
Generated Classical   2.699
Generated Ragtime     2.872
Catastrophic Failure  1.842
Pleasant Collapse     3.474
Table 8.7: Averages of the qualitative experiment.
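The averages in Table 8.7 follow directly from the score distributions in Table 8.6: each distribution is turned into a weighted mean, and the classical and ragtime questions are averaged per sample type. A minimal sketch (the dictionary below simply restates Table 8.6, with shorthand labels of our own):

```python
# Score distributions (percent of votes for scores 1..5) copied from Table 8.6.
TABLE_8_6 = {
    ("Arch 1", "Classical"): [7.9, 26.3, 31.6, 26.3, 7.9],
    ("Arch 2", "Classical"): [15.8, 47.4, 23.7, 7.9, 5.3],
    ("Real",   "Classical"): [0.0, 0.0, 10.5, 28.9, 60.5],
    ("Arch 1", "Ragtime"):   [13.2, 34.2, 42.1, 5.3, 5.3],
    ("Arch 2", "Ragtime"):   [0.0, 15.8, 55.3, 23.7, 5.3],
    ("Real",   "Ragtime"):   [0.0, 0.0, 5.3, 18.4, 76.3],
}

def mean_score(dist):
    """Weighted mean of a 1-5 score distribution given in percent."""
    return sum(score * pct for score, pct in enumerate(dist, start=1)) / 100.0

def row_average(label):
    """One row of Table 8.7: the mean over the classical and ragtime questions."""
    return (mean_score(TABLE_8_6[(label, "Classical")])
            + mean_score(TABLE_8_6[(label, "Ragtime")])) / 2.0

print(round(row_average("Real"), 3))    # 4.603
print(round(row_average("Arch 1"), 3))  # 2.778
print(round(row_average("Arch 2"), 3))  # 2.793
```

The 'Generated Classical' and 'Generated Ragtime' rows average the two architectures per genre instead, yielding 2.699 and 2.872.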
Chapter 9
Conclusion
We can answer the question stated in the introduction, whether it is possible to generate a new song that is a cover or interpretation of another song purely by giving examples to an algorithm, with a yes. However, this music is still of inferior quality compared to the original human compositions. A foundation for algorithms emitting longer music was laid in this dissertation, but for input of non-fixed length other ideas will have to be explored, as one can only scale up the input of a convolutional network so much. Our very latest model could clearly pick up some high-level genre characteristics, such as frequently used rhythms, and could separate melody from accompaniment. Still, some questions remain, even for fixed-length Cycle-GANs:
• Can a similar technique be used when a dataset using multiple instrument inputs
is used?
• Can note velocity be added to this type of model?
• How would this work for different genres?
On a more technical note, we succeeded in creating a Cycle-GAN that, using modern GAN techniques, could be trained in a relatively stable manner without having to resort to early stopping. We suspect that Brunner et al. [20] did not achieve this, as they trained their GAN for only 30 epochs. Our networks also used a lower number of channels.
It is hard to compare our work to current state-of-the-art solutions in terms of pure generation and covering ability, as they all work on different scales. We think that the invertibility property lowers the quality of the generated samples, but also firmly roots the generators in the realm of the originals.
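This invertibility is what the cycle-consistency term enforces: an input translated to the other genre and back should reproduce itself. A minimal numpy sketch, where g_ab and g_ba are hypothetical stand-ins for the two trained generators:

```python
import numpy as np

def cycle_consistency_loss(x, g_ab, g_ba):
    """L1 cycle loss: translate an input A -> B -> A and
    compare the round trip with the original."""
    reconstruction = g_ba(g_ab(x))
    return float(np.mean(np.abs(reconstruction - x)))

# A pair of generators that perfectly invert each other gives zero loss:
x = np.arange(12.0).reshape(3, 4)
loss = cycle_consistency_loss(x, lambda a: 2.0 * a, lambda a: a / 2.0)  # 0.0
```

In the full Cycle-GAN objective this term is added (in both directions) to the two adversarial losses.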
Something that could significantly improve the speed at which these networks are evaluated is a pre-trained network, analogous to the Inception network, for an evaluation loss like the FID. Currently, one has to wait until a network is fully trained to get any kind of qualitative information about it, as the discriminator is also tainted by the training data, and cycle consistency is not a good indicator of how well the network is performing. [59]
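As a rough sketch of what such a metric computes: the FID is the Fréchet distance between two Gaussians fitted to the feature activations of real and generated data, so it only needs the means and covariances of those features. The version below is a numpy-only illustration (using the fact that for positive-definite covariances, the trace of the matrix square root of their product equals the sum of the square roots of its eigenvalues), not the reference implementation from [58]:

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrtm(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    # Eigenvalues of a product of SPD matrices are real and non-negative,
    # so Tr(sqrtm(sigma1 @ sigma2)) is the sum of their square roots.
    eigs = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.maximum(eigs.real, 0.0)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)
```

A pre-trained genre classifier could play the role the Inception network plays for images, supplying the feature statistics during training instead of after it.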
As a follow-up to this dissertation, one could implement the same architecture with a Transformer instead of a convolutional approach, given their ability to generate realistic long-term musical sequences. [44] Since the Transformer is more rooted in the realm of NLP, tricks for unbounded input could be experimented with.
Another option would be trying a many-to-many Cycle-GAN framework. For one, it completely eliminates the 'steganography' problem a standard Cycle-GAN may have. It also allows one to emit songs in multiple styles.
Bibliography
[1] D. Cornelisse, “An intuitive guide to convolutional neural networks,” Medium.com,
2018.
[2] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,”
Distill, 2016.
[3] O. Davydova, “7 types of artificial neural networks for natural language processing,”
Medium.com, 2018.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
CoRR, vol. abs/1512.03385, 2015.
[5] F. Doukkali, “Batch normalization in neural networks,” Towards Data Science, 2017.
[6] Taschee (Wikimedia Commons user). https://commons.wikimedia.org/wiki/File:
Lipschitz_Visualisierung.gif, 2017.
[7] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-Attention Generative
Adversarial Networks,” arXiv e-prints, p. arXiv:1805.08318, May 2018.
[8] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
“Dropout: A simple way to prevent neural networks from overfitting,” Journal of
Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[9] B. Keng, “Semi-supervised learning with variational autoencoders,” Self-published
via Github.io, September 2017.
[10] K. McGuiness, “Deep learning for computer vision: Generative models and adver-
sarial training,” Slideshare, August 2016.
[11] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, “Unrolled generative adversarial
networks,” CoRR, vol. abs/1611.02163, 2016.
[12] A. Radford, L. Metz, and S. Chintala, “Unsupervised Representation Learn-
ing with Deep Convolutional Generative Adversarial Networks,” arXiv e-prints,
p. arXiv:1511.06434, Nov 2015.
[13] L. A. Gatys, A. S. Ecker, and M. Bethge, “A Neural Algorithm of Artistic Style,”
ArXiv e-prints, Aug. 2015.
[14] J. Johnson, A. Alahi, and F. Li, “Perceptual losses for real-time style transfer and
super-resolution,” CoRR, vol. abs/1603.08155, 2016.
[15] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-to-Image Translation
using Cycle-Consistent Adversarial Networks,” ArXiv e-prints, Mar. 2017.
[16] C. Chu, A. Zhmoginov, and M. Sandler, “Cyclegan, a master of steganography,”
CoRR, vol. abs/1712.02950, 2017.
[17] D. Spitael, "Style transfer for polyphonic music." Master's dissertation, Ghent University, 2017.
[18] M. D. Coster, "Polyphonic music generation with style transitions using recurrent neural networks." Master's dissertation, Ghent University, 2017.
[19] I. Malik and C. H. Ek, “Neural translation of musical style,” CoRR,
vol. abs/1708.03535, 2017.
[20] G. Brunner, Y. Wang, R. Wattenhofer, and S. Zhao, “Symbolic music genre transfer
with cyclegan,” CoRR, vol. abs/1809.07575, 2018.
[21] Unknown, "What is a MIDI message?" http://www.planetoftunes.com/
midi-sequencing/midi-messages.html, 1998.
[22] J.-P. Briot, G. Hadjeres, and F. Pachet, “Deep learning techniques for music gener-
ation - a survey,” Researchgate, 09 2017.
[23] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and
L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural
Comput., vol. 1, pp. 541–551, Dec. 1989.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep con-
volutional neural networks,” in Advances in Neural Information Processing Systems
25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 1097–
1105, Curran Associates, Inc., 2012.
[25] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalch-
brenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw
audio,” CoRR, vol. abs/1609.03499, 2016.
[26] L. Yang, S. Chou, and Y. Yang, “Midinet: A convolutional generative adversarial
network for symbolic-domain music generation using 1d and 2d conditions,” CoRR,
vol. abs/1703.10847, 2017.
[27] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for Sim-
plicity: The All Convolutional Net,” arXiv e-prints, p. arXiv:1412.6806, Dec 2014.
[28] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,”
arXiv e-prints, p. arXiv:1603.07285, Mar 2016.
[29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput.,
vol. 9, pp. 1735–1780, Nov. 1997.
[30] A. Roberts, J. Engel, and D. Eck, “Hierarchical variational autoencoders for music,”
in NIPS Workshop on Machine Learning for Creativity and Design, 2017.
[31] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated
recurrent neural networks on sequence modeling,” CoRR, vol. abs/1412.3555, 2014.
[32] G. Weiss, Y. Goldberg, and E. Yahav, “On the practical computational power of
finite precision rnns for language recognition,” CoRR, vol. abs/1805.04908, 2018.
[33] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” CoRR,
vol. abs/1505.00387, 2015.
[34] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training
by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015.
[35] J. Lei Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,” arXiv e-prints,
p. arXiv:1607.06450, Jul 2016.
[36] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, “Instance normalization: The missing
ingredient for fast stylization,” CoRR, vol. abs/1607.08022, 2016.
[37] Y. Wu and K. He, “Group normalization,” CoRR, vol. abs/1803.08494, 2018.
[38] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization
to accelerate training of deep neural networks,” CoRR, vol. abs/1602.07868, 2016.
[39] H. F. Walker, "Course handouts."
[40] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial net-
works,” in Proceedings of the 34th International Conference on Machine Learning
(D. Precup and Y. W. Teh, eds.), vol. 70 of Proceedings of Machine Learning Re-
search, (International Convention Centre, Sydney, Australia), pp. 214–223, PMLR,
06–11 Aug 2017.
[41] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for
generative adversarial networks,” CoRR, vol. abs/1802.05957, 2018.
[42] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learn-
ing to Align and Translate,” arXiv e-prints, p. arXiv:1409.0473, Sep 2014.
[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
and I. Polosukhin, “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017.
[44] C. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D.
Hoffman, and D. Eck, “An improved relative self-attention mechanism for trans-
former with application to music generation,” CoRR, vol. abs/1809.04281, 2018.
[45] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient object
localization using convolutional networks,” CoRR, vol. abs/1411.4280, 2014.
[46] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” ArXiv e-prints,
Dec. 2013.
[47] K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly, “The GAN
landscape: Losses, architectures, regularization, and normalization,” CoRR,
vol. abs/1807.04720, 2018.
[48] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio, “Generative Adversarial Networks,” ArXiv e-prints,
June 2014.
[49] C. Gang, wgan-gp repository. https://github.com/caogang/wgan-gp/.
[50] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity
natural image synthesis,” CoRR, vol. abs/1809.11096, 2018.
[51] Z. Zhou, Y. Song, L. Yu, and Y. Yu, “Understanding the effectiveness of lipschitz
constraint in training of gans via gradient analysis,” CoRR, vol. abs/1807.00751,
2018.
[52] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. C. Courville, “Aug-
mented cyclegan: Learning many-to-many mappings from unpaired data,” CoRR,
vol. abs/1802.10151, 2018.
[53] G. Brunner, A. Konrad, Y. Wang, and R. Wattenhofer, “MIDI-VAE: Modeling Dy-
namics and Instrumentation of Music with Applications to Style Transfer,” in 19th
International Society for Music Information Retrieval Conference (ISMIR), Paris,
France, September 2018.
[54] H.-W. Dong, W.-Y. Hsiao, and Y.-H. Yang, “Pypianoroll: Open source python pack-
age for handling multitrack pianorolls,” in Late-Breaking Demos of the 19th Interna-
tional Society for Music Information Retrieval Conference (ISMIR), 2018.
[55] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmai-
son, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W,
2017.
[56] J. D. Hunter, “Matplotlib: A 2d graphics environment,” Computing In Science &
Engineering, vol. 9, no. 3, pp. 90–95, 2007.
[57] smth and multiple PyTorch users, "Why don't we put models in .train() or
.eval() modes in dcgan example." Forum thread with replies by a lead developer
of the PyTorch library. https://discuss.pytorch.org/t/
why-dont-we-put-models-in-train-or-eval-modes-in-dcgan-example/7422,
2017.
[58] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochre-
iter, “Gans trained by a two time-scale update rule converge to a nash equilibrium,”
CoRR, vol. abs/1706.08500, 2017.
[59] S. Liu, Y. Wei, J. Lu, and J. Zhou, “An improved evaluation framework for generative
adversarial networks,” CoRR, vol. abs/1803.07474, 2018.