
Jukepix: A Cross-modality Approach to Transform Paintings into Music Segments

Xingchao Wang1*, Zenghao Gao2*, Huihuan Qian1 and Yangsheng Xu1

Abstract— The challenge of transforming paintings into music is well known, since the relationship between the two kinds of art is unclear: different composers write different music when presented with the same painting. In this paper, a cross-modality model is proposed for transforming images into multitrack music, based on the framework of deep convolutional generative adversarial networks (DCGANs). The proposed model is trained on a classical music dataset and a dataset of impressionist paintings, and can be applied to transform impressionist paintings into two-track classical music. Music evaluation methods confirm the harmonicity of the generated segments. To the best of our knowledge, our model is the first attempt at transforming paintings into music segments.

Keywords: Cross-modality Transformation, DCGANs, Image to Music

I. INTRODUCTION

In the last three years, AI with the creative ability to generate artworks has attracted much attention. An increasing number of studies use unsupervised networks to generate images, videos and text, and some methods have also demonstrated the ability to generate symbolic music segments. Style transfer has thereby become a hot research topic, such as generating paintings in the style of Van Gogh using VGG-19 [1]. Researchers pay increasing attention to constructing AI that can understand and generate art. However, cross-modality transformations, such as transforming paintings into music, are not widely studied and remain tempting and challenging, as the relationship between different modalities is hard to uncover.

This paper presents a transformation project between different forms of art. Generally, different forms of art have inner relations and common characteristics. Both painting and music are the creative work of artists, so we can explore the relation between art forms by using deep learning. Such a relationship helps AI to understand human behavior and creativity: if we can make transformations between different kinds of art, we will learn something about the relationship itself. Thus, transforming paintings into music helps people understand how a composer conveys, through music, what he or she feels from a painting.

In this project, music composition and transformation between different forms of art are two important problems.

*Both authors contributed equally to this work.

1Xingchao Wang, Huihuan Qian, and Yangsheng Xu are with the Robotics and Artificial Intelligence Laboratory, Chinese University of Hong Kong, Shenzhen, 518172, China. [217019013@link, hhqian@, ysxu@].cuhk.edu.cn

2Zenghao Gao is with Choate Rosemary Hall, Wallingford, CT 06492, USA. His work was done as an intern in the Robotics and Artificial Intelligence Laboratory. [email protected]

Fig. 1. Sample paintings and the corresponding results of music segments

Attempts at algorithmic composition began in the 1980s. Some researchers have used traditional methods: in A New Kind of Science [2], Stephen Wolfram proposed a one-dimensional cellular automaton, now known as the Wolfram Automaton. Wolfram Tones [3] was based on his findings; it can generate music segments that reflect a long-term pattern while still allowing for short-term randomness. More recent approaches such as RNN-LSTM [4] are able to produce better results, and the same author also proposed the GGT-NN model [5]. However, we simply do not know the relationship between impressionist paintings and classical music, and therefore supervised learning such as RNN-LSTM would not be a viable path.

A significant conceptual and technical innovation introduced by Goodfellow is the generative adversarial network (GAN) [6]. Recent years have seen major progress in the development of GANs. Conditional GANs have demonstrated their abilities in image-to-image translation [7]: such networks can transform images for different tasks, for example turning a map into an aerial photo. However, the transformation process has so far been limited to a single modality. Additionally, MuseGAN [8] uses conditional GANs to generate MIDI music. Conditional GANs have been studied intensively in the last three years; nonetheless, few attempts have been made at cross-modality training.

In this paper, we propose a cross-modality model that can transform paintings into music segments. To correlate image and music, a modified encoder-decoder network is proposed. To ensure harmonicity between different tracks, we use the feature map as the style director. We use a temporal generator in order to obtain fluctuation within a segment of music, which we call a phrase. Since the concatenated phrase of four bars is fed to the discriminator, harmonicity across bar boundaries is enhanced.

We name our model Jukepix. The name originates from the jukebox, an automated music-playing device: since our approach generates music from paintings, the pixels of a painting provide the information necessary for music composition. Although transforming paintings into music segments is our focus in this paper, the approach can also be applied to other cross-modality research.

Our contributions:

• We propose a novel cross-modality model based on DCGAN.

• We implement the transformation from impressionist paintings to dual-track classical music segments based on this model.

• We propose a temporal method to generate temporal music without losing image style information.

II. DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS

The initial concept of the generative adversarial network was proposed by Ian Goodfellow [6] in 2014 and focuses on unsupervised tasks. The framework trains generative models by means of an adversarial network. A GAN consists of two networks: a generator and a discriminator. The generator G maps input data drawn from a prior distribution to the data space, while the discriminator D estimates the probability that a sample comes from the training data rather than from G. The training process thus corresponds to a two-player minimax game between the generator and the discriminator:

min_G max_D  E_{x∼p_d}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]

where z represents the input noise, x is the real data (training data), p_d represents the distribution of the real data, p_z is the distribution of z, and G(z) represents the generated output.
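For concreteness, the two sides of this minimax game can be written as separate losses that are optimized alternately. The following is a minimal TensorFlow sketch (TensorFlow being our implementation framework); D and G are assumed to be models returning a probability in (0, 1) and a generated sample, respectively, and the code is illustrative rather than the exact implementation used in this work.

```python
import tensorflow as tf

def gan_losses(D, G, x_real, z, eps=1e-8):
    """Split the minimax value function into the two losses that are
    minimized in alternation. D(x) estimates the probability that x is
    real; G(z) maps noise z to a generated sample. (Illustrative sketch.)"""
    x_fake = G(z)
    d_real = D(x_real)   # D(x),    x ~ p_d
    d_fake = D(x_fake)   # D(G(z)), z ~ p_z

    # The discriminator maximizes E[log D(x)] + E[log(1 - D(G(z)))],
    # i.e. minimizes the negative of that quantity.
    d_loss = -tf.reduce_mean(tf.math.log(d_real + eps)
                             + tf.math.log(1.0 - d_fake + eps))
    # The generator minimizes E[log(1 - D(G(z)))].
    g_loss = tf.reduce_mean(tf.math.log(1.0 - d_fake + eps))
    return d_loss, g_loss
```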

Based on GANs, Alec Radford and Luke Metz [9] proposed DCGANs with certain architectural constraints. In their model, pooling layers are replaced with strided convolutions in the discriminator and fractional-strided convolutions in the generator, and fully connected layers are removed. Moreover, ReLU and LeakyReLU activations are used in G and D, respectively. Training on image datasets, they demonstrate that DCGANs are a better choice for unsupervised learning because they converge more easily, and that they can generate images that closely resemble real ones. Therefore, we adopt the layer construction of DCGANs in our model.

Furthermore, image-to-image translation was proposed by Phillip Isola in 2016. That project uses DCGANs to implement several image-to-image translation tasks: the networks learn a loss function to train the mapping from input image to output image. Isola et al. used an encoder-decoder construction for the image-to-image task. Moreover, the DCGAN objective is combined with an L1 loss to encourage less blurring:

L_L1(G) = E_{x,y,z}[‖y − G(x, z)‖_1]

Therefore, the final generative objective is

G* = arg min_G max_D  L_cGAN(G, D) + λ L_L1(G)
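A minimal sketch of the generator side of this combined objective, assuming a conditional discriminator D(x, ·) with sigmoid output and a weighting λ (the value 100 is a common pix2pix-style choice, not a figure taken from this paper):

```python
import tensorflow as tf

def generator_objective(D, G, x, y, z, lam=100.0, eps=1e-8):
    """Conditional-GAN generator loss plus the L1 term, i.e. the
    G-dependent part of L_cGAN(G, D) + lambda * L_L1(G).
    x: conditioning input, y: real target, z: noise. (Sketch only.)"""
    y_fake = G(x, z)
    adv = tf.reduce_mean(tf.math.log(1.0 - D(x, y_fake) + eps))  # adversarial term
    l1 = tf.reduce_mean(tf.abs(y - y_fake))                      # E[||y - G(x, z)||_1]
    return adv + lam * l1
```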

The image feature decoder from image-to-image translation has been incorporated into our network to connect the input image with the output layer.

Building on these developments, GANs have become effective models for generating images and videos. Hao-Wen Dong et al. used MuseGAN to generate coherent four-bar music from random noise; several of their evaluation metrics and their data representation are used in our project.

III. PROPOSED MODEL

The transformation model proposed in this paper consists of three main parts: data representation, network architecture and the evaluation model. We use the bar as the basic unit of our generator, and a phrase of 4 bars is the output of the model. Although our data representation is similar to that of MuseGAN, we have made significant changes to MuseGAN's network architecture. Our network consists of four parts: an image feature encoder, a temporal input generator, a bar generator and a phrase discriminator. Additionally, we adopt evaluation methods to check whether the output music is harmonic between tracks and satisfies basic principles of classical music.

A. Data Representation

Our music training dataset is in piano-roll format. A piano-roll represents music as a matrix whose vertical axis is pitch and whose horizontal axis is symbolic time. The note pitch has 128 possibilities, ranging from C-1 to G9. The temporal resolution is set to 24 time steps per beat in order to cover common temporal patterns, and all songs in our dataset are in 4/4 time. We define the basic segment of music as a "bar" and the generated segment as a "phrase": four bars make up one phrase, and a bar can be represented as a time step × note pitch matrix.

Fig. 2. Piano-roll sample

Fig. 3. Networks Architecture

B. Network Architecture

To achieve the transformation between different modalities, we construct a novel network architecture based on the image-to-image translation model. Following MuseGAN, we consider the "bar" the basic generative unit. The generator is composed of an image feature encoder, a temporal information generator and a bar generator. The discriminator, called the phrase discriminator, discriminates a phrase of 4 bars with 2 tracks.

1) Image feature encoder: In order to acquire image features and use them as a style director for composition, we use an encoder network to extract features from the image. The relationship between images and music is crucial in the transformation process; therefore, an encoder-decoder network is used to find the correlation between the two modalities.

2) Temporal input generator: The goal of this project is to generate a phrase of 4 bars. If we used the same input features, extracted from the same image, for every bar, the outputs of different bars would be exactly the same (although this does make comparison between different training examples possible). Furthermore, if we used independent noise as the input for each bar, the boundaries between bars within a phrase would not be harmonic, since the inputs would have no temporal correlation. Thus, a temporal generator is used to obtain temporally related inputs.

3) Bar generator: Because music is an art of time, most approaches use RNNs or other temporal models to generate music. An RNN cannot be used here to generate music from an image, because our work adopts an unsupervised approach. As described above, the temporal generator maps noise to a sequence of temporal vectors, which are fed to the bar generator bar by bar. In our project, we concatenate the image feature layer and the temporal input to form the input of the bar generator. The bar generator is a decoder network that transforms the image feature and temporal information into one bar of music:

G(i, z) = G_bar(G_temporal(z) + F(i))

where i is the input image and z is the basic noise input. We then combine the bars obtained from the bar generator into one phrase. During training, the generator updates its parameters to obtain the best results, and finally we get a phrase of music consisting of 4 bars. For multitrack music, the bar generator generates music with respect to each track. Because the input is the same when generating each bar, the bars share the same information across different tracks, producing correlated multitrack music. This method enhances the harmonicity between tracks.
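To make the data flow concrete, the sketch below composes hypothetical sub-models F (image feature encoder), G_temporal and G_bar as described above; the names and shapes are illustrative assumptions and do not reproduce the exact layer configuration of Table I.

```python
import tensorflow as tf

def generate_phrase(F, G_temporal, G_bar, image, z, n_bars=4):
    """Sketch of G(i, z): the image feature F(i) acts as a shared style
    director for every bar, the temporal generator maps noise z to one
    latent vector per bar, and the bar generator decodes each pair into
    a bar. (Hypothetical sub-model interfaces, illustration only.)"""
    feat = F(image)                           # shared image feature, (batch, d_f)
    temporal = G_temporal(z)                  # per-bar latents, (batch, n_bars, d_t)
    bars = []
    for b in range(n_bars):
        cond = tf.concat([feat, temporal[:, b]], axis=-1)  # concatenate, as in the text
        bars.append(G_bar(cond))              # one bar, e.g. (batch, 96, 128, tracks)
    return tf.stack(bars, axis=1)             # a phrase of n_bars bars
```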

4) Phrase discriminator: In this project, the discriminator judges whether the generated music is classical. The music obtained from the generator is fed to the discriminator, which judges the phrase's similarity to the training data. If a music segment cannot pass the discriminator, the feedback helps to train the generator; this adversarial game drives the music generation. Notably, a full multitrack phrase of 4 bars is placed into the discriminator:

L(G, D) = E[log D(i, y)] + E[log(1 − D(i, G(i, z)))]

where i is the input image, z is the basic noise input, and y is the real music segment from the training data.

Overall, the image feature encoder generates a feature map from an input image. Because the network needs temporal relations, we combine the feature map with the temporal input obtained from the temporal input generator. The combined vector is then used to generate a bar of music, and the phrase discriminator judges whether the resulting music segment is real.

C. Evaluation methods

We adopt several methods to evaluate the generated music segments. These methods have proven effective for evaluating music:

EB: ratio of empty bars.
UPC: number of used pitch classes per bar.
QN: ratio of "qualified" notes.

We can compare the generated outputs to determine an adequate number of training steps. Moreover, the fidelity of our model can be judged by these metrics, which give us important feedback for improving the model.
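These quantities can be computed directly from binarized piano-rolls. The sketch below follows MuseGAN-style definitions, under the assumption that a "qualified" note is one lasting at least three time steps; it is illustrative rather than the exact evaluation code used here.

```python
import numpy as np

def _note_lengths(col):
    """Lengths of consecutive runs of 1s (i.e. notes) in a binary time series."""
    padded = np.concatenate(([0], col.astype(int), [0]))
    diff = np.diff(padded)
    return np.where(diff == -1)[0] - np.where(diff == 1)[0]

def eb_upc_qn(bars, qualified_len=3):
    """bars: binary array of shape (n_bars, time_steps, 128).
    Returns (EB, UPC, QN); 'qualified' notes last >= qualified_len steps."""
    eb = float(np.mean([b.sum() == 0 for b in bars]))   # ratio of empty bars

    upcs, lengths = [], []
    for b in bars:
        active = np.where(b.any(axis=0))[0]             # pitches used in this bar
        if active.size:
            upcs.append(len(np.unique(active % 12)))    # distinct pitch classes
        for p in active:
            lengths.extend(_note_lengths(b[:, p]))
    upc = float(np.mean(upcs)) if upcs else 0.0
    lengths = np.asarray(lengths)
    qn = float(np.mean(lengths >= qualified_len)) if lengths.size else 0.0
    return eb, upc, qn
```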

IV. EXPERIMENT

To train and evaluate the model, a music dataset and an impressionist image dataset were built for the experiment. We use TensorFlow to implement our networks.

A. Dataset

1) Music dataset: Our data sources are primarily MuseData, Nottingham and Piano-midi.de. Some of our data also come from older sources, such as the MIDI collection "THE Greats series". The data in the music dataset are in piano-roll format. We focus on two separate datasets:

The piano dataset consists of two tracks, piano right and piano left, where right and left identify the hand playing the music; the two are kept on different channels. All piano-roll data in this category are gathered from Piano-midi.de. This dataset contains 60 selected songs and 1280 bars segmented from them.

The dual-track dataset consists of two instruments on two tracks. Piano right and piano left are appended together, and another instrument, which we call "Strings", is added to the data; this track represents all MIDI instruments with a program number between 41 and 48. Strings and piano are on different channels. This dataset contains 177 selected songs and 4386 bars segmented from them.

For the training data, we fix the music segment tensor to x (phrases) × 4 (bars per phrase) × 96 (time steps per bar) × 128 (pitches) × 2 (channels).
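As a concrete illustration of this layout, the following sketch groups already-binarized per-bar matrices into the fixed training tensor (the names and the grouping strategy are illustrative assumptions, not the project's exact preprocessing code):

```python
import numpy as np

BARS_PER_PHRASE, STEPS_PER_BAR, N_PITCHES, N_TRACKS = 4, 96, 128, 2

def stack_phrases(bars):
    """bars: list of binary arrays in temporal order, each of shape
    (STEPS_PER_BAR, N_PITCHES, N_TRACKS). Returns a tensor of shape
    (n_phrases, 4, 96, 128, 2); trailing bars that do not complete a
    phrase are dropped. (Illustrative sketch.)"""
    n_phrases = len(bars) // BARS_PER_PHRASE
    data = np.stack(bars[:n_phrases * BARS_PER_PHRASE]).astype(bool)
    return data.reshape(n_phrases, BARS_PER_PHRASE,
                        STEPS_PER_BAR, N_PITCHES, N_TRACKS)
```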

2) Image Dataset: All images in our dataset are impressionist landscape paintings, collected from the Kaggle competition Painter by Numbers (https://www.kaggle.com/c/painter-by-numbers). The dataset includes 5666 images. Each image is stored as a numpy matrix with dimensions 256×256×3.

B. Data Preprocessing

1) Music data processing: To obtain clean and regular piano-roll data, we need to process the MIDI files collected from the Internet. Because a regular format is important for training, we discard all music that is not in 4/4 time.

Fig. 4. Results from Matlab analysis

We then parse the tracks from Piano-midi.de, putting piano right and piano left into different tracks using the pypianoroll package provided by Hao-Wen Dong.

Our dual-instrument dataset is prepared similarly; however, many more MIDI files are available for dual instruments. A statistical analysis of all collected MIDI files shows that the piano is used around 5000 times, followed by string instruments at around 3000 times. Therefore, our dual-instrument set contains piano and strings.

After parsing, the data are segmented and labeled using MATLAB. Since there is no segmentation algorithm for symbolic data, we treat the MIDI data as raw audio files that contain information about multiple instruments in one file: we compress the piano-rolls into one by summation and pass the result to MATLAB for parsing [10]. We resort to structural features for segmentation because of their simplicity and effectiveness. MATLAB segments the data and generates the corresponding matrix files; Fig. 4 shows a sample segmentation. We then place the segments in a numpy array.
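The compression step is a simple summation over tracks before the MATLAB segmentation; a minimal numpy sketch (the structural-feature segmentation itself [10] is not reproduced here):

```python
import numpy as np

def compress_tracks(pianoroll):
    """pianoroll: array of shape (time_steps, 128, n_tracks) for one piece.
    Collapse all tracks into a single piano-roll by summation so the piece
    can be treated like one audio-style signal for segmentation. (Sketch.)"""
    return pianoroll.sum(axis=-1)   # shape (time_steps, 128)
```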

2) Image processing: The gathered images are not all the same size, and some even have thick black borders. We use Canny edge detection in OpenCV to crop out the image frames [11]. After obtaining the paintings without frames, we crop the maximum possible square within each image and resize the cropped squares to 256×256×3.

TABLE I
GENERATOR ARCHITECTURE

Layer       Output shape   Kernel size   Strides   Activation
conv        16             [1,2]         [1,2]     LeakyReLU
conv        32             [1,2]         [1,2]     LeakyReLU
conv        64             [1,2]         [1,2]     LeakyReLU
conv        64             [1,2]         [1,2]     LeakyReLU
conv        64             [1,2]         [1,2]     LeakyReLU
transconv   [1,1]          [1,1]         [1,1]     ReLU
transconv   [4,1]          [2,1]         [2,1]     ReLU
transconv   [8,1]          [2,1]         [2,1]     ReLU
transconv   [16,1]         [2,1]         [2,1]     ReLU
transconv   [32,1]         [2,1]         [2,1]     ReLU
transconv   [96,1]         [3,1]         [3,1]     ReLU
transconv   [96,2]         [1,2]         [1,2]     ReLU
transconv   [96,16]        [1,8]         [1,8]     ReLU
transconv   [96,128]       [1,8]         [1,8]     ReLU

Page 5: Jukepix: A Cross-modality Approach to Transform Paintings into Music … · 2019. 1. 12. · Jukepix: A Cross-modality Approach to Transform Paintings into Music Segments Xingchao

Fig. 5. Piano results

Fig. 6. Dual tracks results

We then calculate the color variance within each image, since we want to keep the images with the largest possible color variance.
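A minimal OpenCV sketch of this pipeline (the Canny thresholds and the variance measure are illustrative assumptions, not the exact values used in this work):

```python
import cv2

def preprocess_painting(path, size=256):
    """Sketch of the image pipeline: find the painting inside any dark
    border via Canny edges, crop the largest centred square, resize to
    size x size x 3, and compute a colour-variance score used to prefer
    colourful images. (Thresholds are assumptions, not the paper's values.)"""
    img = cv2.imread(path)                                  # BGR, uint8
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                        # assumed thresholds
    x, y, w, h = cv2.boundingRect(cv2.findNonZero(edges))   # content bounding box
    img = img[y:y + h, x:x + w]

    s = min(img.shape[:2])                                  # largest centred square
    cy, cx = img.shape[0] // 2, img.shape[1] // 2
    img = img[cy - s // 2:cy - s // 2 + s, cx - s // 2:cx - s // 2 + s]
    img = cv2.resize(img, (size, size))

    colour_var = float(img.reshape(-1, 3).var(axis=0).mean())
    return img, colour_var
```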

C. Network Settings and Training Details

In the Network Architecture section, we describe the main parts of our model. CNN layers connect every layer of the network. Since the input dimensions differ from the output dimensions, we adjust the strides to obtain the appropriate shape for the output matrix. Between the fifth and sixth layers, a concatenation operation fixes the input vectors to length 128.

Table I shows the parameters of the generator, which consists of an encoder and a bar generator. In the training process, we use the Adam optimizer. To accelerate training, we update the discriminator 50 times per generator update during the first 100 steps; afterwards we update the generator once every five updates of D, as suggested by Gulrajani et al. in 2017 [12]. We implement our network in TensorFlow and train the model on an Nvidia 1070 GPU; training takes 48 hours on the dual-instrument dataset with a batch size of 64.
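The alternating schedule described above can be summarized as a short loop; d_train_step and g_train_step are hypothetical callables that each perform one Adam update on the discriminator or generator (illustrative sketch only):

```python
def train(d_train_step, g_train_step, total_g_updates, warmup=100):
    """Alternating GAN update schedule (sketch): during the first `warmup`
    generator updates, the discriminator is updated 50 times per generator
    update; afterwards the generator is updated once per 5 discriminator
    updates, following Gulrajani et al. [12]."""
    for g_step in range(total_g_updates):
        n_d = 50 if g_step < warmup else 5
        for _ in range(n_d):
            d_train_step()   # one optimizer step on D
        g_train_step()       # one optimizer step on G
```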

V. RESULTS

Both music datasets are combined with the impressionist image dataset to train our model. Outputs from different training iterations are stored and written onto a music sheet for better visualization.

A. Dual piano tracks

The generated samples from different training steps show significant differences. After 500 training steps, the generated music consists mostly of random noise, and the distribution of the notes resembles a point cloud. After 10000 steps, the result has a regular distribution but still contains much noise. After 80000 steps, the generated music data share more similarities with real music data and contain little noise.

Nevertheless, the results on the piano dataset are not impressive; even after training the model for 200000 steps, the program fails to remove the remaining noise. We believe this is caused by differences among composers, who display different characteristics in their music. Bach, for example, does not differentiate strongly between the left and right hand; combining Bach with another composer leads to confusion. We also believe that "classical" music is too broad a scope for training on "each hand": the dataset should be split into smaller categories by composer and by music type (symphony, solo, etc.).

The dual piano tracks' results are shown in Fig. 5.

B. Dual instruments tracks

Analyzing the instruments of every piece in the dataset is crucial, because if two pieces of music share the same instrument combination from beginning to end, we can treat them as similar. In our dataset, most music segments contain both string instruments and piano. The output samples improve significantly through training. At step 500, the output is still random noise. As training continues, the samples show clear patterns across different input images: the network learns both the patterns within an instrument and the patterns between instruments. At step 80000, meaningful structure has formed in both tracks, and when training ends at step 100000 there is little noise in the music segments. Comparing the results, we find that different paintings generate different segments, which means the input images have an effect on the generated music.

The dual instrument tracks' results are shown in Fig. 6.

C. Evaluation

The evaluations are performed during the training process. In Fig. 7, we plot the values of the three evaluation metrics against training steps. The ratio of empty bars (EB) rises quickly and then remains mostly constant, meaning the model has learned a proper ratio of empty bars. The number of used pitch classes per bar (UPC) decreases toward a fitting value, meaning the generator has learned an appropriate and meaningful number of used pitch classes. Similarly, for the ratio of qualified notes (QN), the generator finds a proper value after sufficient training time. However, for the dual piano dataset, whose generated music sounds unconvincing, the value of EB fluctuates significantly; we believe the generated music sounds like irregular noise because G cannot find a proper EB.

VI. CONCLUSIONS

In this paper, a cross-modality model based on DCGAN is proposed to transform images into multitrack music segments. The results suggest that the generated music segments are harmonic and unique to the paintings' styles. Although they still fall behind human composers, this model has made it possible for AI to establish a method of transformation between two forms of art. In the future, we will improve the model by combining solo instrument tracks with orchestral tracks in order to generate concertante music.

ACKNOWLEDGMENT

We thank our colleagues in the AGV team at the Robotics and Artificial Intelligence Laboratory for their continued support. We are grateful to Hao-Wen Dong and Wen-Yi Hsiao, who authored MuseGAN and helped our understanding of the MuseGAN architecture. We would also like to show our gratitude to Haoming Li for his assistance in editing this paper.

Fig. 7. Evaluation (EB, UPC, QN)

REFERENCES

[1] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414-2423.

[2] S. Wolfram. A new kind of science. Champaign, IL: Wolfram Media, 2002, p. 130.

[3] S. Wolfram. Statistical mechanics of cellular automata. Reviews of Modern Physics, 55(3), 1983, p. 601.

[4] D. Johnson. Composing music with recurrent neural networks, 2015.

[5] D. D. Johnson. Learning graphical state transitions, 2016.

[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014, pp. 2672-2680.

[7] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.

[8] H. W. Dong, W. Y. Hsiao, L. C. Yang, and Y. H. Yang. MuseGAN: Symbolic-domain music generation and accompaniment with multi-track sequential generative adversarial networks. arXiv preprint arXiv:1709.06298, 2017.

[9] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[10] M. Müller. Fundamentals of music processing: Audio, analysis, algorithms, applications. Springer, 2015.

[11] K. Dawson-Howe. A practical introduction to computer vision with OpenCV. John Wiley and Sons, 2014.

[12] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 2017, pp. 5767-5777.