Predictive songwriting with concatenative accompaniment


Transcript of Predictive songwriting with concatenative accompaniment

  • Predictive songwriting with concatenative accompaniment

    Benedikte Wallace

    Thesis submitted for the degree of Master in Programming and Networks

    60 credits

    Department of Informatics
    Faculty of mathematics and natural sciences

    UNIVERSITY OF OSLO

    Spring 2018

  • Predictive songwriting with concatenative accompaniment

    Benedikte Wallace

  • © 2018 Benedikte Wallace

    Predictive songwriting with concatenative accompaniment

    http://www.duo.uio.no/

    Printed: Reprosentralen, University of Oslo


  • Abstract

    Musicians often use tools such as loop-pedals and multitrack recorders to assist in improvisation and songwriting. While these devices are useful in creating new compositions from scratch, they do not contribute to the composition directly. In recent years, new musical instruments, interfaces and controllers using machine learning algorithms to create new sounds, generate accompaniment or construct novel compositions have become available for both professional musicians and novices to enjoy.

    This thesis describes the design, implementation and evaluation of a system for predictive songwriting and improvisation using concatenative accompaniment, which has been given the nickname PSCA. In its most simple form, the PSCA functions as an audio looper for vocal improvisation, but the system also utilises machine learning approaches to predict suitable harmonies to accompany the playback loop. Two machine learning algorithms were compared and implemented into the PSCA to facilitate harmony prediction: the hidden Markov model (HMM) and the Bidirectional Long Short-Term Memory (BLSTM). The HMM and BLSTM algorithms are trained on a dataset of lead sheets in order to learn the relationship between the notes in a melody and the chord which accompanies it, as well as learning dependencies between chords to model chord progressions. In quantitative testing, the BLSTM model was found to learn the harmony prediction task more effectively than the HMM model; this was also supported by a qualitative analysis of musicians using the PSCA system.

    The system proposed in this thesis provides a novel approach in which these two machine learning models are compared with regard to prediction accuracy on the dataset as well as the perceived musicality of each model when used for harmony prediction in the PSCA. This approach results in a system which can contribute to the improvisation and songwriting process by adding harmonies to the audio loop on-the-fly.


  • Contents

    1 Introduction
        1.1 Motivation
        1.2 Goals
            1.2.1 Songwriting
            1.2.2 Accompaniment
            1.2.3 Concatenation
            1.2.4 Prediction
        1.3 Machine learning approaches
        1.4 Research Methods
            1.4.1 Quantitative model evaluation
            1.4.2 Qualitative model evaluation
            1.4.3 User study
        1.5 Outline
    2 Background
        2.1 Programming and music
        2.2 NIMEs
        2.3 Music Information Retrieval
        2.4 Musical Artificial Intelligence
    3 Sequence Classifiers
        3.1 Harmony prediction
        3.2 Hidden Markov Model
            3.2.1 Markov Chains
            3.2.2 Hidden Events
            3.2.3 Decoding
            3.2.4 Applying HMM to Harmony Prediction
        3.3 Recurrent Neural Nets
            3.3.1 Learning
            3.3.2 LSTM: Long Short-Term Memory
            3.3.3 Bidirectional RNN
            3.3.4 Applying BLSTM for Harmony Prediction
    4 Design and Implementation
        4.1 The audio looper
        4.2 Program structure
            4.2.1 Pitch analysis
            4.2.2 The note bank
            4.2.3 Chord construction
            4.2.4 Creating note vectors
            4.2.5 Base case implementation
        4.3 Machine learning approaches
            4.3.1 Dataset
            4.3.2 Pre-processing
            4.3.3 Chord types
            4.3.4 HMM model
            4.3.5 BLSTM model
            4.3.6 Discussion
        4.4 Summary
    5 Testing and evaluation
        5.1 Model Validation
            5.1.1 Motivations
            5.1.2 Methods
            5.1.3 Results
            5.1.4 Discussion
        5.2 Qualitative Model Analysis
            5.2.1 Methods
            5.2.2 Results
            5.2.3 Discussion
            5.2.4 Handling imbalanced datasets
        5.3 Production Model Analysis
            5.3.1 Heuristic BLSTM
            5.3.2 Heuristic HMM
            5.3.3 Results
        5.4 User experience study
            5.4.1 Motivations
            5.4.2 Session overview
            5.4.3 Data collection
            5.4.4 Data analysis
            5.4.5 Results
            5.4.6 Discussion
    6 Conclusion and future work
        6.1 Overview of results
        6.2 Future work
            6.2.1 Improving the audio looper and concatenation process
            6.2.2 Improving the PSCA controller
            6.2.3 Harmony prediction
        6.3 Final remarks
    A Music Concepts
        A.1 Music concepts
    B Written feedback from UX study & participant form
        B.1 Reflective feedback participant 1
        B.2 Reflective feedback participant 2
        B.3 User experience study participant form


  • List of Figures

    1.1 The PSCA system setup: laptop running PSCA software, sound card, microphone, headset and Arduino foot switch controller. Photo: Benedikte Wallace
    1.2 Common tools to aid in songwriting and improvisation
    1.3 Melody with missing chord notation; the machine learning algorithms used in the PSCA are trained to predict a suitable chord for each bar.
    1.4 Songwriter with computer assistance
    1.5 PSCA system overview: Audio recorded by the user is added to the playback loop together with a chord selected by the harmony prediction models. The harmony is constructed using concatenated segments of the recorded audio.
    2.1 First attempt at implementing an audio looper for the PSCA using PureData. Photo: Benedikte Wallace
    2.2 Results reprinted from Nayebi and Vitelli's paper GRUV: Algorithmic Music Generation using Recurrent Neural Networks [53]
    2.3 Open NSynth Super physical interface for the NSynth algorithm. [43]
    3.1 The harmony prediction task: Given a set of consecutive measures containing a monophonic melody, select a suitable harmony (qn) for each measure.
    3.2 A Markov model for transitions between four chords. Transition and entry probabilities are omitted.
    3.3 HMM chord states and note emissions
    3.4 RNN
    3.5 RNN unrolled over time
    3.6 LSTM cell unrolled over time
    3.7 Figure showing the three gates of a single LSTM cell.
    4.1 The PSCA system consists of software and hardware components. The Python scripts for running the PSCA system are available at https://github.com/benediktewallace/PSCA. Photos: Benedikte Wallace
    4.2 System diagram of the hardware setup for using the PSCA system: laptop, sound card, microphone, Arduino and foot switches
    4.3 First prototype for controlling playback and recording with the PSCA. Photo: Benedikte Wallace
    4.4 Program flow for the PSCA: User records audio which in turn is analysed, segmented and added to the note bank. The harmony prediction model predicts a chord; this chord is constructed by concatenating and adding audio segments from the note bank. Finally this concatenated audio layer is added to the playback loop together with the latest recording made by the user.
    4.5 librosa piptrack function. Definition and example of use
    4.6 Mappings of notes to integers from 0-11. Highlighted is a C major triad
    4.7 Example of measure and resulting note vector
    4.8 Note vectors created when singing one 4.8(a), two 4.8(b) and three notes 4.8(c)
    4.9 Example of a lead sheet from the dataset: Let Me Call You Sweetheart (Freidman and Whitson, 1910)
    4.10 Transition probabilities (A) and emission probabilities (B) for major-key and minor-key datasets as generated by the HMM.
    4.11 Implementation of the bidirectional LSTM using Keras
    4.12 The sample function for reweighting and sampling a probability distribution as presented in listing 8.6 in Deep Learning with Python [16]
    5.1 K-fold cross validation accuracy using 24- and 60-chord datasets. The BLSTM model had the highest median accuracy over all tests; it also had a much narrower inter-quartile range than the HMM models, suggesting that it is less sensitive to changes in the dataset.
    5.2 Confusion Matrix for HMM and BLSTM using the minor key dataset with 24 unique chords. These results show that the HMM misclassifies most samples as belonging to classes C, G or Am, while the matrix pertaining to the BLSTM model shows a clear diagonal line where the model has predicted the correct class.
    5.3 Confusion Matrix for HMM and BLSTM using the major key dataset with 24 unique chords. These results show that the HMM misclassifies most samples as belonging to classes C, G or F, while the matrix pertaining to the BLSTM model shows a clear diagonal line where the model has predicted the correct class.
    5.4 Confusion Matrix for HMM and BLSTM using the mixed dataset with 24 unique chords. The predictions generated by the HMM using the mixed dataset are skewed towards C, F and G. The results are similar to those generated by the HMM using only songs in major keys. As with the major and minor song datasets, the BLSTM achieves better prediction accuracy than the HMM on the mixed dataset.
    5.5 Confusion Matrix for HMM and BLSTM using the minor key dataset with 60 unique chords. These figures show that the HMM predictions are skewed towards C; some misclassifications also fall into classes Am and Asus. Predictions generated by the BLSTM mostly belong to the major and minor chord types. Other chord types, such as suspended, augmented and so on, are rarely generated by either model.
    5.6 Confusion Matrix for HMM and BLSTM using the major key dataset with 60 unique chords. These figures show that when using major key songs only, predictions from both models are skewed towards C, G and F.
    5.7 Confusion Matrix for HMM and BLSTM using the mixed dataset with 60 unique chords. As with the 24-chord mixed dataset, the HMM predictions are similar to those generated with the major key dataset. The BLSTM model generates more varied predictions, and seems to model the underlying mixed dataset more precisely.
    5.8 Prediction distribution for BLSTM and HMM trained on the 24-chord mixed dataset. The figures show that the HMM predicts class C more than 80% of the time. Predictions generated by the BLSTM model show a distribution similar to the true chord distribution of the 24-chord mixed dataset (shown in figure 5.9(a)).
    5.9 Distribution of the chords found in the 24-chord and 60-chord mixed datasets
    5.10 Distribution of chords generated by heuristic BLSTM
    5.11 Distribution of chords generated by heuristic HMM
    5.12 Examples of different chord sequences as predicted by the models trained for use in the PSCA, as well as the original chord sequence. Both the HMM and BLSTM predict suitable chords for each bar. Although the naive, base case approach results in acceptable choices for the first and third measure, the second and fourth chords create dissonance. This shows how the naive approach generates unpredictable chord progressions.
    5.13 Performers interacting with the PSCA system; a video is available on YouTube: https://www.youtube.com/watch?v=t-owEHRYmi4 and Zenodo: https://zenodo.org/record/1214505#.WumcwMixXVM
    6.1 Evolution of the PSCA system


  • List of Tables

    4.1 Examples of original chords found in the lead sheets and the resulting triad label used when simplifying to closest triad
    4.2 Example of predictions sampled at different temperatures
    5.1 Lead sheet groupings and resulting datasets. The models are trained on each of the six datasets in order to compare the 60-chord vs. 24-chord approach as well as the consequences of segmenting the dataset into separate sets of major and minor key songs
    5.2 24-chord models: Average accuracy for k-fold cross validation with k = 10.
    5.3 60-chord models: Average accuracy for k-fold cross validation with k = 10.
    5.4 User experience study: Session structure
    5.5 Concept codes for feedback analysis
    5.6 Feedback and labels from UX study
    5.7 User experience study: Session details


  • Abbreviations

    BLSTM Bidirectional Long Short-Term Memory

    CNN Convolutional Neural Network

    HMM hidden Markov model

    LSTM Long Short-Term Memory

    MIR Music Information Retrieval

    ML machine learning

    NIME New Interfaces for Musical Expression

    RNN Recurrent Neural Network

    TA Thematic Analysis

    UX User Experience

  • Acknowledgement

    I would like to extend my deepest gratitude to my supervisor, postdoctoral fellow Charles Martin, for his continuous support, inspiration and guidance throughout the thesis work. I would also like to thank my family, especially my talented sister, jazz singer and composer Karoline Wallace, for sharing her musical expertise, and my significant other Magnus B. Skarpheðinsson for his endless support.


  • Chapter 1

    Introduction

    Figure 1.1: The PSCA system setup: laptop running PSCA software, sound card, microphone, headset and Arduino foot switch controller. Photo: Benedikte Wallace

    1.1 Motivation

    Examining the intersection between science and art has always fascinated me and driven my studies. Studying machine learning has allowed me to delve deeper and participate in exciting research in this field as we


  • see machine learning being used for many different tasks in music, from building new musical instruments and creating sound emulators, to music generation. In this thesis I propose a novel system which applies machine learning methods to predict suitable harmonies to accompany a real-time audio looper for vocal improvisation. The system has been nicknamed PSCA, referring to the acronym of the thesis title.

    During the songwriting process, musicians often use tools such as multitrack recorders and loop-pedals to assist in improvisation and trying out new musical ideas. Today's multitrack recorders are often referred to as digital audio workstations (DAWs). DAWs such as the software shown in figure 1.2(b) allow the user to record and edit several recordings to construct a piece of music. Although offering a great deal of control and a range of different audio processing tools, creating music this way requires knowledge of the DAW software and can be a difficult task for novices. Using a loop-pedal enables the musician to layer and loop recordings on-the-fly. By adding layers, a single performer can create complex harmonies, rhythms and dynamics. Since the loop-pedal provides the musician with a way to record new layers without stopping the playback of the previous recordings, loopers such as the Boss RC-30 presented in figure 1.2(a) are often used in live performance.

    The challenge for many performers and songwriters when working with such tools is that they have to create every piece of sound that is used, either by recording or synthesising them directly or by relying on libraries containing synthesised instruments. This can make the songwriting process uninspiring as well as time consuming. If instead, accompaniment could be generated using the performers' own creative output, this may enhance the songwriting process. Furthermore, if additional accompaniment could be produced by use of machine learning methods, this would create novel contributions to the composition, inspired, but not directly generated, by the performer.

    1.2 Goals

    I developed the PSCA system to address these problems in the following ways: the PSCA is, in its simplest form, an audio looper, but the PSCA goes beyond the playback of live recordings by generating accompaniment in the form of harmonies that suit the recorded melody. This adds a new dimension to the standard audio looping functions. Also, these harmonies are constructed using the recordings of the musician's own voice instead of using pre-recorded or synthesised instruments. This provides the user with automatic accompaniment with an interesting sound. In the following subsections I have outlined how the system relates to each of the concepts contained in the acronym, PSCA.


  • 1.2.1 Songwriting

    As mentioned above, my work is based on creating an audio looper for vocals. The user sings into a microphone and the system loops the recording, adding new layers to the playback loop as the user records them. These audio looping functions are implemented in Python and controlled using a pair of foot switches and an Arduino. The goal of the PSCA system is to provide a similarly simple interface to looping audio as in the "Loop Station" pedal, but to harness some of the power of computer-based audio as in a DAW. Using the PSCA system a musician can improvise or test out new ideas to assist in the songwriting process.

    (a) Boss RC30 Loop Station. Photo: Copyright ©Roland Corporation

    (b) Ableton Live DAW for multitrack recording. Photo: Benedikte Wallace

    Figure 1.2: Common tools to aid in songwriting and improvisation

    1.2.2 Accompaniment

    In addition to a standard audio looping function, the PSCA system generates chords to accompany the audio recording. This creates an underlying harmony that changes as the user records new layers to the loop. Several systems exist today to aid in songwriting and improvisation by generating additional accompaniment to the user input, for example, Microsoft SongSmith [71]. The user of the SongSmith system records a vocal melody by following a metronome. After this recording has been analysed the system plays it back with added harmonies. The user can also set certain variables such as the preferred genre, the amount of happy and sad chords (reflecting the major or minor feel of the accompaniment), and a jazz-factor which controls the complexity and predictability of the generated harmonies. Similarly, the Reflexive Looper [57] functions in much the same way as traditional loop-pedals, but is also able to provide accompaniment to a predefined song structure by arranging the musician's recordings. Other systems such as ChordRipple [38] provide musicians with a user interface which recommends chord choices and allows the user to experiment with different chord progressions. The PSCA aims to predict


    harmonies as in the SongSmith system, but use the performer's own audio as the basis for the accompaniment, as in Reflexive Looper.

    1.2.3 Concatenation

    A common trait among systems that generate musical accompaniment is the use of prerecorded or synthesised instruments to generate the accompaniment. Instead of using predefined instrument libraries, the PSCA system concatenates smaller slices of the recorded audio to create the desired triad. This process is similar to that of granular synthesis. In granular synthesis, pieces of audio are sampled and split into small segments (less than 1 second) referred to as grains. Curtis Roads, one of the prominent names in granular synthesis, has written on the topic in his book Microsound [65]. The grains can be processed to change parameters such as pitch and volume, be layered on top of each other or played as a sequence. The PSCA collects 1/8th-note segments of the recorded audio in a “note bank”. When a chord is constructed, the audio segments containing the appropriate notes are chosen from the note bank and concatenated to create an underlying harmony for the duration of the audio loop.
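    The following sketch illustrates this concatenation step. It is not the PSCA implementation, only a minimal example under assumed names: note_bank is taken to be a dict mapping a pitch class (0-11) to a list of mono NumPy audio segments, and the triad is given as a list of pitch classes.

        import numpy as np

        def build_chord_layer(note_bank, triad, loop_len_samples, rng=None):
            """Concatenate note-bank segments for each triad note and mix the
            resulting voices into one mono layer of loop_len_samples samples."""
            rng = rng or np.random.default_rng()
            mix = np.zeros(loop_len_samples, dtype=np.float32)
            voices = 0
            for pitch_class in triad:                  # e.g. a C major triad: [0, 4, 7]
                segments = note_bank.get(pitch_class, [])
                if not segments:                       # this note was never sung
                    continue
                voice = np.zeros(0, dtype=np.float32)
                while voice.size < loop_len_samples:   # tile segments until the loop is filled
                    voice = np.concatenate([voice, segments[rng.integers(len(segments))]])
                mix += voice[:loop_len_samples]
                voices += 1
            return mix / max(voices, 1)                # average the voices to avoid clipping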

    Figure 1.3: Melody with missing chord notation; the machine learning algorithms used in the PSCA are trained to predict a suitable chord for each bar.

    1.2.4 Prediction

    Though theoretically, there are no “wrong” chord choices, most people would agree that, given a melody, some chord choices create a dissonance that does not sound particularly good [9]. In order to decide precisely what chords should accompany a melody, a model needs to be able to learn relationships between the notes in a melody and the chords that are chosen to accompany them. Also, in order to model chord progression, the model needs to be able to learn temporal dependencies between chords.

    This type of problem is often referred to as a sequence classification problem; the model is given a sequence of tokens and its goal is to predict the next token. In this work, this problem is referred to as "harmony prediction". An example of this problem is shown in Figure 1.3, where a melody is given with blank spaces for the harmony. A human would solve


    this problem by choosing a sequence of chords that works for the melody, as well as forming a sensible harmonic sequence. The goal of PSCA is to implement chord prediction in an interactive system; further, we aim to use the system to compare two machine learning approaches to this problem.

    1.3 Machine learning approaches

    (a) Songwriter with audio software (b) Songwriter with smart audio software such as the PSCA

    Figure 1.4: Songwriter with computer assistance

    Two different machine learning (ML) approaches for harmony prediction are compared in this work and implemented into the PSCA system. The first is an HMM, one of the most prominent methods used in similar research. The strategy proposed by Simon et al. in their paper on Microsoft SongSmith [71] is used as a basis for the implementation of the HMM. Simon et al. train an HMM on a dataset of lead sheets in order to generate chords for a recorded vocal melody. In addition, a deep learning approach to this problem was implemented. For this a Recurrent Neural Network (RNN), more specifically a Long Short-Term Memory (LSTM) [36], was implemented. For the LSTM, the approach proposed by Lim et al. [46] in their paper “Chord Generation from Symbolic Melody using BLSTM Networks” was used. Both papers, as well as my own implementations, use models that are trained on symbolic melody input and target chord labels from a dataset of lead sheets. When using the PSCA, the audio recorded by the performer is transformed into a note vector and fed to the trained models, resulting in a new chord prediction.
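    To make the note-vector idea concrete, a minimal sketch is given below. The exact encoding used in the thesis is described in chapter 4 and may differ; here a measure is simply summarised as a 12-dimensional vector of relative note durations per pitch class, and the note names and durations are hypothetical inputs.

        import numpy as np

        NOTE_TO_PC = {'C': 0, 'C#': 1, 'D': 2, 'D#': 3, 'E': 4, 'F': 5,
                      'F#': 6, 'G': 7, 'G#': 8, 'A': 9, 'A#': 10, 'B': 11}

        def note_vector(measure):
            """measure: list of (note_name, duration_in_beats) pairs from one bar.
            Returns a 12-dimensional vector of relative durations per pitch class."""
            vec = np.zeros(12)
            for name, duration in measure:
                vec[NOTE_TO_PC[name]] += duration
            total = vec.sum()
            return vec / total if total > 0 else vec

        # A bar holding C for two beats and E and G for one beat each:
        print(note_vector([('C', 2.0), ('E', 1.0), ('G', 1.0)]))
        # [0.5  0.   0.   0.   0.25 0.   0.   0.25 0.   0.   0.   0. ]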

    The research of Lim et al. does not include an interactive system; therefore there is no way to utilise the models they have created in a music system. Also, in contrast to SongSmith, which uses synthesised instruments and must process the input before adding harmonies, the PSCA includes concatenative accompaniment which is generated on-the-fly. My work presents a novel approach in which the HMM and BLSTM are compared with regard to prediction accuracy on the lead sheet dataset as well as the perceived musicality of each model when used for chord generation in the PSCA. Figure 1.5 shows an overview of the PSCA system.


    Figure 1.5: PSCA system overview: Audio recorded by the user is added to the playback loop together with a chord selected by the harmony prediction models. The harmony is constructed using concatenated segments of the recorded audio.

    1.4 Research Methods

    Multiple research methods are used in this thesis to evaluate different aspects of the PSCA system. Three studies are presented in this work. A quantitative analysis is performed in order to assess how much each model has learned, and a qualitative analysis of each model is executed to determine how good their proposed solutions are. Finally, a user study is performed which addresses the performers' experience using the system. In this final study the PSCA overall is evaluated, not just the individual models. As a base case for evaluating the machine learning approaches, a naive approach to harmony prediction is implemented as well. This naive approach chooses the harmony based only on the pitch of the first note in each measure.
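    As an illustration of such a naive base case, the sketch below maps the first note of a measure to a triad built on that pitch class. The choice of a major triad here is my own simplifying assumption for the example; the actual base-case rule is described in chapter 4.

        def naive_harmony(first_note_pc):
            """Base-case sketch: accompany a measure with a triad rooted on the
            pitch class (0-11) of the measure's first note (major triad assumed)."""
            return [first_note_pc, (first_note_pc + 4) % 12, (first_note_pc + 7) % 12]

        # A measure starting on E (pitch class 4) would be accompanied by [4, 8, 11].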

    1.4.1 Quantitative model evaluation

    Cross validation is used in order to evaluate how much the models have learned. The dataset is split into two disjoint sets, one set for training and one set for testing. The songs included in the training data are used to train the models, while the test set is withheld until testing. When the models are done training, the test data is used to evaluate the accuracy of the models on unseen data by comparing the predicted chords for each measure with the true chords from the test set. The training set contains approximately 80% of the data while the remaining 20% is used for testing.
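    A minimal sketch of this evaluation procedure is shown below, assuming the dataset has already been encoded as per-song inputs and chord-label sequences. The variable names are placeholders, not the thesis code.

        from sklearn.model_selection import train_test_split

        songs = [f"song_{i}" for i in range(10)]     # placeholder per-song inputs
        chords = [f"chords_{i}" for i in range(10)]  # placeholder per-song chord labels

        # 80/20 split at the song level, so no song appears in both sets
        train_x, test_x, train_y, test_y = train_test_split(
            songs, chords, test_size=0.2, random_state=42)

        def measure_accuracy(predicted, true):
            """Fraction of measures whose predicted chord matches the lead-sheet chord."""
            return sum(p == t for p, t in zip(predicted, true)) / len(true)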


  • 1.4.2 Qualitative model evaluation

    Only examining the prediction accuracy of the models is not sufficient to understand the quality of the predictions they make. Therefore, qualitative evaluations of the models' predictions are needed. This is done using confusion matrices as well as prediction histograms, allowing for a visual representation of which chords the models are suggesting. The evaluation procedure is oriented towards identifying how well the models generate interesting and aesthetically pleasing, as well as theoretically correct, chord progressions.
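    For example, scikit-learn's confusion_matrix together with a simple count of predictions gives exactly this kind of visual summary; the chord labels below are placeholders, not results from the thesis.

        from collections import Counter
        from sklearn.metrics import confusion_matrix

        true_chords = ['C', 'G', 'Am', 'F', 'C', 'G']   # placeholder ground truth
        predicted   = ['C', 'C', 'Am', 'C', 'C', 'G']   # placeholder model output

        labels = sorted(set(true_chords) | set(predicted))
        print(confusion_matrix(true_chords, predicted, labels=labels))  # rows: true, columns: predicted
        print(Counter(predicted))  # prediction histogram: how often each chord is suggested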

    1.4.3 User study

    The human evaluation of the PSCA system is an equally important aspect to consider. Since the underlying mechanic of the system is an audio looper, the accompanying chord is not the only element affecting the overall harmony. The previously recorded layers loop in the background as well, creating more complex and slow-changing harmonies than those that would be created if the recordings were not looped and layered. Consequently, the accuracies of the ML models on the dataset are naturally not transferable to the user's experience of the different models when used in the PSCA system. The performer's experience when working with the PSCA is used to evaluate the enjoyability and creative potential of the different modes of the PSCA system (base-case mode, BLSTM mode or HMM mode). A User Experience (UX) study with two participants was carried out; the feedback collected from these participants has been analysed using a Thematic Analysis (TA) approach [12], which aims to identify recurring themes found in the user comments.

    1.5 Outline

    This thesis consists of six chapters: introduction, background, sequence classifiers, design and implementation, testing and evaluation, and conclusions and future work. Chapter 2 surveys past and related work and chapter 3 presents descriptions of the ML models used. Chapter 4 contains an overview of the design and implementation process, presents the program structure of the PSCA system and describes details on training the ML models for the harmony prediction task.

    Chapter 5 outlines the experiments performed in order to evaluate the ML models and presents the results. This chapter also includes a UX study of the PSCA system. The final chapter contains a general discussion and conclusions of this thesis as well as suggestions for future work.



  • Chapter 2

    Background

    Last year, 2017, marked the 60th anniversary of the first time a piece of music was played by a computer. In 1957 Max Mathews, often referred to as the father of computer music synthesis, composed a short piece of music, which was then played by an IBM 704. In the more than 60 years that followed, the field has grown, and the development of new digital music instruments, music programming languages and instrument controllers has continued to interest developers, researchers and musicians alike. This chapter attempts to give an overview of the different fields relevant to this thesis. First a history of programming in music is presented, followed by an overview and recent works from the New Interfaces for Musical Expression (NIME) and Music Information Retrieval (MIR) communities, and finally a closer look at machine learning in music.

    2.1 Programming and music

    When developing an application for music and sound creation, there are several available programming paradigms to consider. Which approach is the best fit depends on the application being made. The following section presents some of the music programming paradigms that are developed specifically to assist musicians in creating music.

    The evolution of the modern computer music systems and dedicated programming languages for music generation can be said to have its beginning in the early programming environments of MUSIC-N, developed by Max Mathews in the late 1950s. The MUSIC-N systems introduced key concepts [82] that still influence the development of new computer music systems. The sound and music computing system Csound [45], which is widely used today, is considered its direct descendant. This is also true of the Max/MSP music programming environment.

    Named after Max Mathews, Max was developed by Miller S. Puckette in the mid-1980s [61]. Max/MSP is a modular music and sound developing platform allowing for the assembly of modules, or patches, for custom


    sound creation and manipulation. Max/MSP works with any type of physical controller that outputs MIDI or OSC (Open Sound Control) [30]. The platform can also output these messages for controlling external, physical devices. Max/MSP remains a particularly popular tool for a wide variety of tasks ranging from visual and sonic art to construction of new music interfaces and controllers. Puckette also developed the open source environment Pure Data, shortened to Pd, in the late 1990s [62]. Pd offers a high-level programming language that allows the user to build larger modules out of smaller, basic functions as well as offering a range of convenient modules for quick assembly and testing of new ideas. The user connects modules by dragging “wires” between the in- and outputs of each module using the Pd GUI. As we enter the 2000s and the era of high performance personal computers, we see an increase in on-the-fly music programming systems for use in live performance. An early example of this is ChucK, developed in 2003 by Wang and Cook [83].

    These systems have in common that they are developed in order to assist musicians in music composition, and so low-level features are often hidden from the user. This can be practical when the system is used directly for creating music, or when quick prototyping is needed, but often does not allow for control over details such as how audio is stored and parsed, and how communication between the system and other applications works. One approach allowing for more control is to communicate with one of these languages from a more traditional programming environment (e.g., C++ or Java).

    SuperCollider [84], developed initially in the 1990s, uses OSC to communicate between a server and client and uses its own, Smalltalk-like language [49]. Today several music programming languages communicate with a SuperCollider synthesis server, which provides an ability to define arbitrary synthesisers and trigger and manipulate them in real time. One such example is Sonic Pi. Sonic Pi was developed to help children learn programming by creating music [2]. The Sonic Pi language, developed by Aaron et al. at Cambridge University, is implemented using a Ruby DSL that facilitates communication with a SuperCollider server. Aaron and Blackwell also presented Overtone in 2013 [1], an elegant functional language which also communicates with a SuperCollider server. Similarly, it is possible to communicate with the Pd environment using libPd [13], which turns Pure Data into an embeddable audio library, enabling developers to use Pd as the sound engine for their applications.

    Another approach is to alternatively use all-purpose programming languages such as C++ or Python and utilise external libraries especially designed for sound and music creation. Today, several libraries exist for audio analysis and manipulation for almost every programming language. In the PSCA, a Python module called sounddevice [31] is used to access and control recording and playback. This module provides bindings


    Figure 2.1: First attempt at implementing an audio looper for the PSCA using PureData. Photo: Benedikte Wallace

    to PortAudio, a cross-platform audio input/output library written in C. Using sounddevice, the audio is stored in and read from NumPy arrays, a practical format as audio can then be manipulated in an efficient way with a multitude of convenient functions to add, slice and store their content.
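    A minimal sketch of this record-and-layer idea using sounddevice and NumPy is given below. It is not the PSCA implementation (which, among other things, must record and play back simultaneously); it only shows how recorded audio ends up in NumPy arrays that can be mixed and replayed.

        import numpy as np
        import sounddevice as sd

        FS = 44100                                      # sample rate in Hz
        FRAMES = FS * 4                                 # a four-second loop

        loop = np.zeros((FRAMES, 1), dtype='float32')   # the running playback loop

        def overdub():
            """Record one loop length from the default microphone and layer it onto the loop."""
            global loop
            take = sd.rec(FRAMES, samplerate=FS, channels=1, dtype='float32')
            sd.wait()                                   # block until the recording is finished
            loop = loop + take                          # mix the new take into the loop

        def play_loop():
            sd.play(loop, FS)
            sd.wait()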

    Often graphical representation and high-level abstraction can come at the cost of low-level controls, and choosing the right tool for the job is as key for music applications as for any other system. For the development of the PSCA it was crucial to be able to access and manipulate the audio directly in order to store and analyse audio segments, as well as reformatting them to generate predictions from the models. Hence, this approach was the most sensible for this work, using a programming language I was already familiar with to build an application which can deal with both the machine learning aspects of the program and the audio manipulation. In the initial stages of this work a looping system was prototyped in Pd. An image of this Pd patch can be found in figure 2.1. However, in order to integrate the looping functions with the machine learning models, it was easier to transfer development to Python. Although this Pd approach was not used in the development of the PSCA in the end, the graphical representations of the Pd environment allowed for easy construction and better understanding of the functions needed to build the audio looper when I later began developing the Python script using the sounddevice library.

    2.2 NIMEs

    During the last 20 years we have also seen a rapid evolution in musical controllers due in part to the availability of inexpensive hardware and


    new sensors that can capture features such as force and position, as discussed by Cook in 2001 [19]. This same year the first New Interfaces for Musical Expression (NIME) workshop [60] was held in Seattle, Washington. It subsequently became a regular conference for musicians, composers, researchers and developers dedicated to research on the development of new technologies and their role in musical expression and performance.

    The initial NIME workshop had as its aim to initiate an exchange of ideas in order to, amongst other things, “survey and discuss the current state of control interfaces for musical performance, identify current and promising directions of research and unsolved problems” [60, p. 1]. Since this workshop, the resulting conference has showcased research from a wide variety of contexts, not only pertaining to music controllers but also cross-disciplinary research to answer fundamental questions like “What is a musical instrument?” and “What is a composition?”.

    In 2017 Jensenius and Lyons published A NIME Reader [40], as the need for a collection of articles that could broadly represent the conference series became apparent. In the preface of this book the editors describe the acronym NIME as having several potential meanings, and indeed this field encompasses many themes, from reflective studies on the field itself, to development of new music systems [68], art installations [56], and naturally, the creation of new interfaces and controllers for musical expression. The Wekinator, presented at NIME in 2009, is an example of a NIME that can be used to create NIMEs. The Wekinator [28] is a meta-instrument that allows its user to interactively modify and train many different machine learning (ML) algorithms in real-time. The software is created to allow musicians, composers and new instrument designers to train an algorithm to react in certain ways to certain input. The input sources can vary from custom sensors to gestures and audio. Using the Wekinator GUI the user records examples of input and output mappings and trains the algorithm to recognise these inputs in the future and thus map them to the correct output.

    The Wekinator is just one example of a NIME that uses machine learning tools. Another theme at NIME which ties in with my work is the generation of accompaniment for musical improvisation. Kitani and Koike [42] presented an online system which generates accompaniment for live percussion input, simulating the improvisation between two percussionists. Similarly, Xia and Dannenberg [86] present an “artificial pianist” that can learn to interact with a human pianist using rehearsal experience. In the paper by Pachet et al. [57] titled Reflexive Loopers for Solo Musical Improvisation they propose a new approach to loop pedals that addresses the issue that loop pedals only play back the same audio, making performances potentially monotonous. Solo improvisation can be difficult, and the goal for the reflexive looper is to create accompaniment that fits the given input. Reflexive loopers (RLs) differ from standard


    loop pedals in two major ways: firstly, the RL determines the playback material according to a predetermined chord sequence as well as the style of the musician's playing, so that the generated accompaniment follows the musician's performance. Secondly, the RLs can distinguish between several playing modes. As a result, if the musician is playing bass, for example, the RL will follow the “other members” principle and play differently than if the input had been a harmony. The NIME conferences have also showcased works relating to vocal performance. The VOICON [58] is an augmented microphone which allows mappings between the singer's gestures when holding the microphone and vocal modifications. Circular motion of the microphone generates vibrato, while tilting causes a change in pitch. The user can also assign a vocal effect and control the amount of the effect by applying pressure.

    The PSCA is in its own way a NIME, although New Instrument for Musical Experimentation would possibly be a more fitting acronym. While the PSCA so far does little in the way of presenting an interface or gesture controls to the user, it does combine some of the other themes seen at NIME, specifically using machine learning in improvisation and accompaniment and new systems for vocal performance.

    2.3 Music Information Retrieval

    Another important research area that has facilitated the development of systems such as the one presented in this thesis is MIR. Research in this field focuses on technologies to aid in information mining in music. MIR research is often in the intersection of digital signal processing, information retrieval and musicology. As noted by Downie in his chapter on Music Information Retrieval in the annual review of information science and technology in 2003: “The difficulties that arise from the complex interaction of the different music information facets can be labeled the “multifaceted challenge.” ” [23, p. 297]

    Downie presents 7 facets of MIR: pitch, temporal, harmonic, timbral, editorial, textual, and bibliographic facets. Each facet has its own representations and brings its own challenges. The pitch facet is concerned with analysing pitch. Pitch is represented in several different ways, e.g., graphical notation on a staff (♩), note names (e.g., C, C#, D) or pitch class numbers (0, 1, 2, . . . , 11). Given that pitch can be represented in so many ways, techniques used for analysing pitch vary greatly. The temporal facet includes any information pertaining to the duration of musical events. This may refer to the duration of pitches, rests or harmonies, metronome indication (i.e., beats per minute) or other, relative, tempo descriptions such as rubato or accelerando. In many music styles it is expected that musicians may stray from the strict tempo notation of written music; retrieving temporal information can therefore be a quite complex task.


    Information concerning simultaneous pitches falls into the harmonic facet. Harmonic events can be denoted using chord names, as in the lead sheets used for this thesis, but this is not always the case. As long as two or more pitches sound at the same time, this is considered a harmony, and often such events are not denoted with chord names. Retrieving harmony information in such cases may require analysis of each of the pitches contained in the harmony.

    The timbral facet is composed of information pertaining to the colour of a note. Timbre is what enables a listener to distinguish between a note played on a piano and the same note being played on a clarinet. Thus, the timbre of an instrument can be used to identify it amongst other instruments in an audio recording. This requires relatively advanced signal processing capabilities in order to separate and examine the amplitude and duration of the frequencies that make up the audio signal. The editorial facet is comprised of performance instructions, or lack thereof. Performance instructions include dynamic instructions, articulations and ornamentation and can be given in a textual (e.g., crescendo, decrescendo) or an iconic format, and sometimes both. Different musicians may use differing editorial information when transcribing the same piece of music; this causes problems when attempting to decide on a definitive version of a work for inclusion in a MIR system. The final two facets are the textual and bibliographic facets. The textual facet is concerned with song lyrics, while the bibliographic facet pertains to information such as the work's title, names of composers and publishers, publication dates and so on. In other words, information pertaining to the bibliographic facet is not derived from the content of the composition; rather, it is music metadata.

    In 2008 Lartillot et al. presented the MIRToolbox [44], a set of functions written in Matlab for extracting features from audio. The MIRToolbox contains the necessary functions to extract information on, for example, rhythm, tonality, timbre and form of the audio. A problem shared by those in the MIR community is that of multiple representations of music. Collections of music can contain several different types of audio recording (CD, LP, MP3 and so on) as well as multiple symbolic representations (printed notes, MIDI and many others). Therefore, MIR encompasses many different techniques and technologies in order to handle the various representations and tasks.

    Another significant problem in MIR research is the ability for researchers to access large collections of music. The Sonny Bono Copyright Term Extension Act and other similar laws have led to a considerable decrease in the number of sound recordings in the public domain. This also means that researchers are unable to build a shared, persistent collection of music written after 1922 [24]. This is an issue I experienced myself during my work, as the lead sheet datasets used by other researchers cannot be shared publicly. Wikifonia.org obtained a temporary licence to share a large set


    of lead sheets from various genres, but unfortunately their licence expired in 2013, and the dataset is no longer available through their site. Instead, researchers in need of lead sheets often use private collections and give little specific information regarding their contents.

    Numerous MIR tools have been used in the development of the PSCA, most prominently the music21 package [20] and the librosa module [50]. As well as using several of the convenience functions offered by librosa for transforming pitches to notes and so on, the function piptrack is used for pitch analysis. The piptrack function uses parabolic interpolation on the short-time Fourier transform (STFT) [73]. The Fourier transform is a core component of sound processing and synthesis. Using the Fourier transform we can decompose a sound into its elementary sine waves, enabling both analysis and transformation of the original audio signal. These principles are also used for the creation of new sounds from computer-generated sine waves. Music21 is a Python toolkit for analysing and generating music scores in the MusicXML format. It allows for easy extraction of chord and note information as well as information pertaining to the score itself, such as key, time signature and tempo.
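    As a small illustration of the pitch-analysis step (the actual PSCA code is shown in figure 4.5, not here), the sketch below takes the strongest piptrack bin in each analysis frame and converts its frequency to a note name, assuming a recent librosa version:

        import librosa

        def dominant_notes(audio, sr):
            """Rough per-frame pitch estimate: pick the strongest piptrack bin in
            each frame and convert its frequency to a note name."""
            pitches, magnitudes = librosa.piptrack(y=audio, sr=sr)
            notes = []
            for t in range(pitches.shape[1]):
                i = magnitudes[:, t].argmax()
                if pitches[i, t] > 0:
                    notes.append(librosa.hz_to_note(pitches[i, t]))
            return notes

        y = librosa.tone(261.63, sr=22050, duration=1.0)   # a synthetic C4 test tone
        print(set(dominant_notes(y, 22050)))               # expected to contain 'C4'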

    2.4 Musical Artificial Intelligence

    Examples of artificial intelligence, or AI, in music can be divided into the following categories: compositional AI, improvisational AI and AI for musical performance [21]. The history of musical AI has its roots in algorithmic music composition. Therefore, the examples of compositional AI have the longest history. Though often considered to be pioneered by Lejaren Hiller's development of a language for computer-assisted music composition named MUSICOMP, the history of algorithmic composition can be said to go back a good deal further [66]. The pianola, for example, was constructed in the 1800s and plays songs automatically by reading a piano roll, similar to a punch card. Other pioneers of this field, such as Iannis Xenakis, also used composition algorithms without the aid of computers. Xenakis criticised [85] the results of serialism's strict predetermination and suggested a solution using statistical methods such as Markov chains and probabilistic logic. This approach of using Markov models, or the hidden Markov model (HMM), to generate new music is still used today. Allan and Williams used this approach for generating chorales in the style of Johann Sebastian Bach [4] and it is also the algorithm used by Simon et al. in the SongSmith system.

    SongSmith [71] is a system that automatically chooses chords to accompany vocal melodies. By singing into a microphone the user can experiment with different music styles and chord patterns through a user interface designed to be intuitive also for non-musicians. The system is based on a HMM trained on a music database consisting of 298 lead sheets, each


    consisting of a melody and an associated chord sequence. The HMM represents probability distributions over sequences of observations, allowing it to learn chord transition probabilities. Examining each song in the database, the system learns the statistics governing chord transitions in a song by tallying the number of transitions from each chord to every other chord.
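    The transition-tallying step can be sketched as follows. The toy chord vocabulary and the add-one smoothing are my own simplifications for the example, not details taken from the SongSmith paper:

        import numpy as np

        CHORDS = ['C', 'F', 'G', 'Am']                   # toy chord vocabulary
        INDEX = {c: i for i, c in enumerate(CHORDS)}

        def transition_matrix(songs):
            """Tally chord-to-chord transitions over all songs and normalise each
            row into a probability distribution (add-one smoothing avoids zero rows)."""
            counts = np.ones((len(CHORDS), len(CHORDS)))
            for chord_seq in songs:
                for prev, nxt in zip(chord_seq, chord_seq[1:]):
                    counts[INDEX[prev], INDEX[nxt]] += 1
            return counts / counts.sum(axis=1, keepdims=True)

        print(transition_matrix([['C', 'F', 'G', 'C'], ['Am', 'F', 'C', 'G']]))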

    Although the HMM has been used for music generation and harmony prediction tasks with some success, probabilistic models like the HMM do have some limitations. Markov models can, at best, reproduce examples they have seen in the training data. In contrast, neural networks such as the Recurrent Neural Network (RNN) are able to learn a fuzzy representation of the data. Such artificial neural nets use their internal representations to perform a high dimensional interpolation between training examples, as noted by Graves [33], and rarely generate the same thing twice.

    In A Connectionist Approach To Algorithmic Composition [78], published in 1989, Peter M. Todd describes an RNN approach to generating symbolic music data note by note. Using RNNs has been shown to be somewhat limited for music generation, as the RNN shows poor results for long-term dependencies due to the problem of vanishing gradients during back propagation [8]. In an attempt to resolve this issue, the Long Short-Term Memory (LSTM) was presented in 1997 by Hochreiter and Schmidhuber [36]. In 2015, Nayebi and Vitelli at Stanford University presented GRUV [53], a music generation system based on recurrent neural nets. The networks take raw audio as input. The input is formatted into a sequence of audio vectors representing the waveform at time t and the system aims to predict the waveform at time t + 1. Their dataset consists of two corpora, each containing 20 songs. One dataset contains songs written by the electronic musician Madeon, the other, songs by rock musician David Bowie. Two recurrent neural nets were compared in their paper: a GRU, or gated recurrent unit (first presented by Cho et al. in 2014 [15]), and an LSTM network. Their findings showed that the audio generated by the LSTM significantly outperformed that of the GRU. Though the training and validation loss was only slightly lower for the LSTM, the GRU generated audio consisting mostly of white noise, while the LSTM was able to generate musically plausible audio sequences. The image of the generated spectrograms can be found in figure 2.2. It has been included here in order to show the improvements of LSTM networks on long-term dependencies.

    JamBot, developed in 2017 [14], uses two RNN LSTMs: one for predicting chord progressions based on a chord embedding similar to the natural language approach of word embeddings used in Google's word2vec [52], and another LSTM for generating polyphonic music for the given chord progression. Their results do exhibit long term structures similar to what one would hear during a jam session. The LSTM unit was used by Schmidhuber and Eck in 2002 to improvise blues music [25], concluding


    that the LSTM “successfully learned the global structure of a musical form, and used that information to compose new pieces in the form.” [25, p. 755] Eck later became a part of the Magenta team at Google Brain.

    Figure 2.2: Results reprinted from Nayebi and Vitelli's paper GRUV: Algorithmic Music Generation using Recurrent Neural Networks [53]

    Magenta is a research project devoted to the exploration of the role of machine learning in creating art and music. Recent developments include MusicVAE [67] and NSynth (Neural Synthesizer) [26]. The autoencoder used in NSynth is similar to WaveNet, a network designed at Google DeepMind in 2016. WaveNet [55] uses a fully convolutional network designed to input and output raw audio. As well as showing promising potential for enhancing text-to-speech applications, the researchers have also applied the WaveNet to the task of generating music. The Convolutional Neural Network (CNN) approach is not often used to solve problems that contain long-term dependencies such as music generation. When the network is trained on about 60 hours of solo piano music collected from YouTube, the researchers found that enlarging the receptive field of the convolution filters was crucial to obtain samples that sounded musical. Modelling these longer sequences proved to be problematic even with receptive fields lasting several seconds. Nevertheless, the paper notes that the samples produced were often both harmonic and aesthetically pleasing.

    With NSynth, Magenta presents both a WaveNet-style autoencoder and a large-scale, high-quality dataset of musical notes. This dataset is an order of magnitude larger than comparable public datasets. In contrast to traditional audio synthesis, NSynth generates sounds using the learned embeddings of the autoencoder. Users can then experiment with different timbres to create new sounds. In collaboration with Google Creative Lab, the Open NSynth Super (shown in figure 2.3) was created as an experimental physical interface for the algorithm.


    Figure 2.3: Open NSynth Super physical interface for the NSynth algorithm. [43]

    Magenta is also the team behind the Performance-RNN [72], which generates expressive timing and dynamics via a stream of MIDI events. These MIDI events are then transferred to a synthesiser to generate piano sounds. Their network learns melody, timing and velocity using a dataset of MIDI performances created by skilled pianists.

    The research presented so far in this section is mostly focused on compositional AI, but as mobile touch screens and cloud-based applications become able to perform more and more advanced tasks, we have also seen a rise in machine learning in musical performance and collaboration. Martin and Torresen recently presented RoboJam [48], a machine learning system that assists users of a music app by generating short responses to their improvisations in near real-time. It uses an RNN followed by a Mixture Density network in order to learn touch screen interactions and predicts a sequence of control gestures, thus taking the role of a remote collaborator. Since the system focuses on free-form touch expressions it has the potential for use in different creative apps, with many different musical mappings. Martin et al. also presented an artificial neural net for ensemble improvisation using a touch screen musical app [47]. The network is trained on a dataset of time-series gesture data collected from 150 performances on a mobile percussion app. The network can generate the simulated performance of one or three other players to accompany a real, lead player, thus creating either a duet or a quartet. This allows a single performer to generate an ensemble performance.

The work presented in my thesis contributes to the research on two popular ML approaches for music generation and use in performance, the LSTM and the HMM, by comparing their prediction accuracy on the training data as well as implementing them into the PSCA for real-time improvisation.


Chapter 3

    Sequence Classifiers

This chapter presents descriptions of the machine learning algorithms that were compared and implemented into the PSCA system. As the goal for the PSCA is to generate chords using note observations, a classifier is needed that can model the relation between notes and chords as well as the progression of the chords over time.

In this section, two sequential machine learning models are discussed: the hidden Markov model (HMM) and the Recurrent Neural Network (RNN). The RNN and the HMM are often referred to as sequence models, or sequence classifiers, as their goal is to map a series of observations to a series of class labels. These models are used in the PSCA to solve the harmony prediction problem. This task involves learning the dependencies between chords in a chord progression and the relationships between the notes in a melody and the chords that accompany them.

Figure 3.1: The harmony prediction task: Given a set of consecutive measures containing a monophonic melody, select a suitable harmony (q_n) for each measure.

    3.1 Harmony prediction

The task these models are trained to solve is in this thesis referred to as the harmony (or chord) prediction problem. At times the problem will also be


referred to as chord prediction, as the harmonies predicted by the models represent a triad chord. Moreover, in some sections it has been practical to distinguish between the chords predicted by the models and the harmonies produced when a performer uses the PSCA; the latter are affected by the underlying audio loop as well as the predicted chords. The harmony prediction task can be stated as follows:

Given a sequence of consecutive measures containing a monophonic melody, select a suitable harmony to accompany each measure.

The inputs to the models are the notes contained in a measure. The target value of each measure is a harmony represented by a chord symbol. The task is illustrated in figure 3.1. Here the harmonies are denoted as q, and the figure shows how the choice of harmony is affected by the notes in the corresponding measure as well as the preceding harmonies.

    3.2 Hidden Markov Model

Hidden Markov models were first presented by Baum et al. in the article A maximisation technique occurring in the statistical analysis of probabilistic functions of Markov chains [7], and have been applied to a range of problems involving time series data [63], most notably, perhaps, to problems in speech processing [79], [5]. The HMM is characterised by an underlying process governing an observable sequence. For the harmony prediction problem, the observable sequence is the notes found in each measure, while the process which governs these observable sequences is the underlying chord progression.

Figure 3.2: A Markov model for transitions between four chords. Transition and entry probabilities are omitted.

    3.2.1 Markov Chains

In order to describe the structure of the HMM, we begin by describing Markov chains. A Markov chain is a sequence of states that satisfies the Markov assumption, that is, that the probability of moving to any state depends only on the current state. A Markov chain is a sequence which is generated using the following components:

    1. A set of N states, S

S = {s_1, s_2, s_3, s_4, . . . , s_N}   (3.1)


2. A transition probability matrix A

   A = {a_{1,1}, a_{1,2}, . . . , a_{N,N}}   (3.2)

   Every a_{i,j} represents the probability of moving from state i to state j, and

   ∀i ∈ {1, . . . , N}:  ∑_{j=1}^{N} a_{i,j} = 1   (3.3)

3. π, the initial probability distribution over all states. π_i is the probability that the Markov chain will start in state i. Also,

   ∑_{i=1}^{N} π_i = 1   (3.4)

    3.2.2 Hidden Events

The Markov chain is useful for calculating probabilities for the next event in a sequence when we can observe the sequence directly in the world. At other times, what we are interested in is not the sequence we can observe, but a “hidden” sequence which governs the resulting observable sequence. In the task of speech recognition, the observations would be the audio input, and the hidden states would be the words that are contained in the audio. In harmony prediction tasks, the hidden states are the chords, and the durations of the notes in the melody are the observations.

Markov chains with hidden states can be described formally as follows [11]:

    1. A set of N states, S

S = {s_1, s_2, s_3, s_4, . . . , s_N}   (3.5)

2. A transition probability matrix A

   A = {a_{1,1}, a_{1,2}, . . . , a_{N,N}}   (3.6)

   Every a_{i,j} represents the probability of moving from state i to state j, and

   ∀i ∈ {1, . . . , N}:  ∑_{j=1}^{N} a_{i,j} = 1   (3.7)

3. π, the initial probability distribution over all states. π_i is the probability that the Markov chain will start in state i. Also,

   ∑_{i=1}^{N} π_i = 1   (3.8)


4. The observation space, O:

O = {o_1, o_2, o_3, . . . , o_k}   (3.9)

    The set of unique, observable tokens.

    5. An emission probability matrix, B.

B = {b_{1,1}, b_{1,2}, b_{1,3}, . . . , b_{N,k}}   (3.10)

Every b_{i,j} represents the probability of observing o_j in state i.

    6. A sequence of l observations, Y.

Y = y_1, y_2, y_3, . . . , y_l   (3.11)

Each observation y_i contains the observed data at time step i.

In a supervised learning scenario (such as harmony prediction), the HMM is learned by building the aforementioned matrices (A, B and π) empirically from a set of training examples. Training the HMM for the harmony prediction problem consists of calculating the transition (A), start (π) and emission (B) probabilities for each state using the information found in the lead sheets. Unsupervised procedures, which are often used to train HMMs when the hidden processes are unknown, require Expectation Maximisation [22] or similar strategies. In contrast, for the harmony prediction problem presented in my work, the hidden states and the probabilities for transitioning between them can be drawn directly from the data. The matrices are populated directly by counting occurrences while traversing each measure of each lead sheet. This process is described further in the upcoming chapter on the design and implementation of the system (Section 4.3.4).
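A minimal sketch of this counting procedure is shown below. It assumes that each song has already been pre-processed into a list of (chord_index, note_vector) pairs, one pair per measure; the function and variable names are illustrative and not taken from the PSCA code.

import numpy as np

def train_hmm_by_counting(songs, n_states):
    # songs: list of songs, each a list of (chord_index, note_vector) pairs per measure.
    pi = np.zeros(n_states)               # start probabilities
    A = np.zeros((n_states, n_states))    # transition probabilities
    B = np.zeros((n_states, 12))          # emission statistics: note durations per chord

    for song in songs:
        first_chord, _ = song[0]
        pi[first_chord] += 1
        for (prev_chord, _), (curr_chord, _) in zip(song, song[1:]):
            A[prev_chord, curr_chord] += 1
        for chord, note_vector in song:
            B[chord] += note_vector       # accumulate the 12-element duration vector

    # Normalise the counts into probability distributions (rows sum to 1).
    pi /= max(pi.sum(), 1e-12)
    A /= np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    B /= np.maximum(B.sum(axis=1, keepdims=True), 1e-12)
    return pi, A, B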

    Figure 3.3: HMM chord states and note emissions

    3.2.3 Decoding

Decoding is the process of deciding what hidden sequence is most likely for a given sequence of observations. For this task one of the most common decoding algorithms for HMMs, the Viterbi algorithm [29], is used. Given a model, HMM = (A, B, π), and a set of observations, O, the Viterbi algorithm begins by looking at the observations from left to right. For each


observation, the max probability for all states is calculated. By keeping a pointer to the previously chosen state, the algorithm can back-trace through these pointers to construct the most likely path. Given a set of observations y = y_1, y_2, . . . , y_t as the input, the Viterbi algorithm will return the path of states x = x_1, x_2, . . . , x_t using the following steps:

Viterbi_{1,N} = P(y_1 | N) · π_N   (3.12a)
Viterbi_{t,N} = max_{x∈S}( P(y_t | N) · a_{x,N} · Viterbi_{t−1,x} )   (3.12b)
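As an illustration, this recursion can be written in a few lines of NumPy. The sketch below works in log space for numerical stability and takes the emission log-probabilities as a precomputed matrix; it is illustrative and not the PSCA's implementation (described in section 4.3.4).

import numpy as np

def viterbi(obs_logprob, A, pi):
    # obs_logprob[t, s] = log P(y_t | state s); A[i, j] = P(j | i); pi = start distribution.
    T, N = obs_logprob.shape
    log_A, log_pi = np.log(A + 1e-12), np.log(pi + 1e-12)

    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    V[0] = log_pi + obs_logprob[0]                      # equation (3.12a)

    for t in range(1, T):
        scores = V[t - 1][:, None] + log_A              # previous path + transition
        back[t] = scores.argmax(axis=0)                 # pointer to the best previous state
        V[t] = scores.max(axis=0) + obs_logprob[t]      # equation (3.12b), in log space

    # Back-trace through the stored pointers to recover the most likely path.
    path = [int(V[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]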

    3.2.4 Applying HMM to Harmony Prediction

Disregarding octave information, there are 12 possible notes we can observe. An observation can therefore be defined as a vector which represents the 12 notes of the chromatic scale and contains the durations of each note. The emission probabilities represent the relationship between what notes are played and our set of chord states. The emission probability matrix contains a note probability distribution for each state.

    3.3 Recurrent Neural Nets

The recurrent neural net is a type of artificial neural net (ANN), a system which contains interconnected layers of nodes, or neurons. Each node has an input where signals from the previous layer of nodes can enter, and an output where the signal from the current node can travel to the next layer of nodes. The strength of each node’s signal is regulated by its internal weights. Learning is achieved by updating these weights by feeding the input forwards and then propagating the error of the output layer backwards through the network, from the output layer to the input layer. Such ANNs are called feedforward nets. The internal weights of each node are updated individually, meaning the weights of one node are not directly affected by the weights of any other nodes. In a way the RNN can be thought of as a feedforward net which is unrolled over time, where the weights are shared across time steps. The RNN applies a recurrent connection, which allows the state of the previous time step to affect the current time step.

When humans take in information, for example when watching a film, we process the audio and video bit by bit from start to end. We do not forget what we have seen or heard as soon as it has been processed by our brains; we hold the information in our memory in order to understand the story. This is the underlying idea behind the RNN. As with HMMs, RNNs have also shown good results when used in other time series tasks such as speech recognition, as demonstrated by Graves et al. in 2014 [34].

The input to a simple RNN has the form (time_step, input_features). The RNN internally loops over the input features at each time step while


Figure 3.4: RNN

maintaining an internal state. In each loop the RNN considers the state at this time step as well as the input features at this time step when calculating its output. The output then becomes the new state for the next time step.
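This recurrence can be summarised in a few lines of NumPy. The sketch below uses a tanh activation, which is one common choice; the weight names are illustrative.

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # The new state depends on the current input and the previous state;
    # the same weights are reused at every time step.
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

def rnn_forward(x_seq, W_x, W_h, b):
    h = np.zeros(W_h.shape[0])             # internal state, reset for each new sequence
    outputs = []
    for x_t in x_seq:                      # loop over time steps
        h = rnn_step(x_t, h, W_x, W_h, b)  # the output becomes the next state
        outputs.append(h)
    return np.stack(outputs)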

    3.3.1 Learning

This way the RNN can learn temporal dependencies from the start of the sequence to the end. More specifically, the RNN is trained using BPTT (“Back-Propagation Through Time”) or RTRL (“Real-Time Recurrent Learning”). At the start of every new sequence the internal state of the RNN is reset; hence the RNN does not retain any information between sequences, only between time steps within each sequence.

    Figure 3.5: RNN unrolled over time

    3.3.2 LSTM: Long Short-Term Memory

Even though the RNN can, in theory, learn time dependencies found in the input sequence, there are still problems with learning long-term dependencies. As the number of time steps in a sequence increases, the RNN starts to “forget” what it has learned from the first time steps. This is due to the back propagation training procedure producing vanishing or


exploding gradients. This phenomenon was examined by Hochreiter et al. in their article Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies [37] as well as by Bengio et al. in Learning Long-Term Dependencies with Gradient Descent is Difficult [8]. Long Short-Term Memory (LSTM) is designed to address exactly these issues.

Proposed in 1997 by Hochreiter and Schmidhuber [36], LSTM uses a so-called CEC (“Constant Error Carousel”) to carry information across many time steps. This prevents the information from earlier time steps from vanishing. When looking at the LSTM unrolled over time (shown in figure 3.6), one can imagine the CEC as a conveyor belt carrying information across time steps.

    Figure 3.6: LSTM cell unrolled over time

Since LSTM allows information to be stored across arbitrary time steps, it is also important to forget information that is no longer needed. Otherwise, the states of a cell may grow in an unbounded way. In the report Learning to Forget: Continual Prediction with LSTM, Gers, Schmidhuber and Cummins describe a solution to the problem of unbounded growth in cell states, the Forget Gate. The Forget Gate learns to reset the cell memory when the information in it has become useless [32]. In general, the LSTM cell consists of three multiplicative gates (see figure 3.7): the input, output and forget gates.
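One common formulation of the resulting gated update is given below, where σ is the logistic sigmoid, ⊙ denotes element-wise multiplication and the W, U and b terms are learned parameters. This is standard notation and is not taken from figure 3.7.

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)    (forget gate)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)    (input gate)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)    (output gate)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)

The additive update of the cell state c_t is what lets information flow across many time steps, while the forget gate f_t can scale old contents towards zero when they are no longer useful.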

    3.3.3 Bidirectional RNN

The bidirectional RNN (BRNN) [70], as the name implies, combines an RNN which loops through the input sequence forwards through time with an RNN which moves through the input backwards through time. Thereby, bidirectional networks can learn contexts over time in both directions. Separate nodes handle information forwards and backwards through the


Figure 3.7: The three gates of a single LSTM cell.

network. Thus, the output at time t can utilise a summary of what has been learned from the beginning of the sequence forwards to t, as well as what is learned from the end of the sequence backwards to t. Bidirectional LSTM was presented by Graves and Schmidhuber in 2005 [35] and is arguably a better choice than the RNN or LSTM for the PSCA, as the current chord at some time step is affected both by the preceding chord and the next chord. The Bidirectional Long Short-Term Memory (BLSTM) can learn both these contexts.

    3.3.4 Applying BLSTM for Harmony Prediction

When training the BLSTM for harmony prediction, the measures found in the lead sheets represent the time steps. The dataset is structured into sequences of eight measures. The first sample of the dataset is the first eight measures, measures 1 to 8; for the next sample we move one step to the right and create a sequence which contains measures 2 to 9, and so on. Each measure consists of a chord symbol (represented using one-hot encoded matrices) and an associated note vector with 12 elements corresponding to the duration of each of the 12 possible notes. The input to the BLSTM is thus given the form (number_of_sequences, sequence_length, note_vector) and the output will have the form (number_of_sequences, sequence_length, one_hot_encoded_targets). In the upcoming chapter, section 4.3.5, the implementation of the BLSTM is described in further detail. In figure 4.11 the code for building the BLSTM model is shown.
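For reference, a model of this kind can be built in a few lines of Keras. The sketch below is not the code from figure 4.11: the layer size and optimiser are illustrative assumptions, while the input and output shapes follow the description above (sequences of eight measures, 12-element note vectors and 60 one-hot chord classes).

from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, N_NOTES, N_CHORDS = 8, 12, 60     # eight measures, 12 pitch classes, 60 triads

model = keras.Sequential([
    # Bidirectional LSTM over the sequence of note vectors.
    layers.Bidirectional(layers.LSTM(128, return_sequences=True),
                         input_shape=(SEQ_LEN, N_NOTES)),
    # One softmax over the 60 chord classes per measure.
    layers.TimeDistributed(layers.Dense(N_CHORDS, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

Training inputs would then have shape (number_of_sequences, 8, 12) and targets shape (number_of_sequences, 8, 60), matching the forms given above.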


Chapter 4

    Design and Implementation

This chapter gives an overview of the work done to develop the PSCA. This includes the implementation of an audio looper written in Python, descriptions of pitch detection and chord construction, as well as details of the dataset and the implementation of the machine learning approaches.

    4.1 The audio looper

    (a) Python looper implementation (b) Arduino and foot switches

Figure 4.1: The PSCA system consists of software and hardware components. The Python scripts for running the PSCA system are available at https://github.com/benediktewallace/PSCA. Photos: Benedikte Wallace

The underlying system developed for the PSCA is an audio looper written in Python that is controlled by a pair of foot switches through an Arduino. When the program is running, the user sings into a microphone and controls the system using the foot switches. One switch toggles playback on or off, the other toggles recording on or off. The user can also clear the playback loop by pressing a small button to start over. During



Figure 4.2: System diagram of the hardware setup for using the PSCA system: laptop, sound card, microphone, Arduino and foot switches

early prototyping, a circuit consisting of three buttons was used (see figure 4.3); ultimately the foot pedal version was added so that the user could keep their hands free.

For the construction of the main audio looping function, the Python module sounddevice was used. Sounddevice provides bindings to PortAudio, the cross-platform audio I/O library. This module also facilitates treating the audio signals as NumPy arrays. In the PSCA system, two 2D NumPy arrays are used for audio data: the recording array and the playback array. When the system is recording, the audio is continuously written to the recording array. The playback array is the array that is read from when the program is in playback mode. A third NumPy array is used to hold the concatenated chord harmony. The dimensions of each of the arrays are a function of the audio sample rate, the tempo (BPM) and the duration of the loop (set in beats per measure). These values are hard coded into the program and so the array dimensions are static. The default sample rate is set at 48 kHz, and the default BPM is set to 80. In future work on this system these values could be set by the user instead: the user could set the sample rate and beats per measure at program start, and by adding a potentiometer to the Arduino circuit the user could set their preferred BPM on-the-fly.
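The relationship between these values and the array dimensions can be sketched as follows. The beats-per-loop and channel values are illustrative assumptions; only the sample rate and BPM defaults are taken from the text above.

import numpy as np

SAMPLE_RATE = 48000        # Hz, default sample rate
BPM = 80                   # default tempo
BEATS_PER_LOOP = 8         # loop duration in beats (assumed value)
CHANNELS = 1               # assumed mono input

samples_per_beat = int(SAMPLE_RATE * 60 / BPM)      # 36 000 samples per beat
loop_samples = samples_per_beat * BEATS_PER_LOOP    # 288 000 samples, i.e. 6 seconds

playback    = np.zeros((loop_samples, CHANNELS), dtype="float32")
recording   = np.zeros((loop_samples, CHANNELS), dtype="float32")
chord_layer = np.zeros((loop_samples, CHANNELS), dtype="float32")

With the assumed eight beats per loop, this gives the roughly six-second loop mentioned in the next section.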

Figure 4.3: First prototype for controlling playback and recording with the PSCA. Photo: Benedikte Wallace


4.2 Program structure

In addition to the standard audio looping functions, the system also constructs an underlying harmony, or chord, which is added to the playback array. This chord is created by concatenating 1/8th-note segments taken from the recorded audio. A third array, named the chord layer array, is used to hold the concatenated audio. When the chord has been constructed, the chord layer array is added to the playback loop. The structure of the program can be divided into three main procedures: the functions pertaining to simultaneous playback and recording, the analysis and storing of the notes found in the recordings, and lastly the concatenation of audio to create a chord. The following sections describe the program flow step by step. Furthermore, figure 4.4 gives a visual representation of the program flow of the PSCA.

As the user presses the foot switch to initiate playback, the program begins reading from the so far empty playback array, reading the audio from beginning to end and then looping back to play from the beginning of the array again. When the record switch is pressed, the audio from the microphone is written to the recording array. In the same way audio is read from the playback array, the audio is written to the recording array from start to end. If the user has not pressed the switch again to stop the recording, the audio is written to the beginning of the array again until the user stops recording. When the record switch is pressed again to signal the end of a recording, the recording array contains approximately the last 6 seconds of audio recorded, given the default loop duration and BPM. Next, the pitches contained in this new recording are analysed.

    4.2.1 Pitch analysis

For pitch analysis the Librosa module is used [50], more precisely the piptrack function. This function tracks pitch using a thresholded, parabolically interpolated STFT [73]. piptrack takes as its input the NumPy array containing the audio and returns two arrays, the pitches and the corresponding magnitudes found in the input. The pitch and magnitude arrays have shape (d, t), where d is the subset of FFT bins within the minimum and maximum frequency and t represents a time step. Time steps are calculated using the hop_length parameter. The function allows for control of the maximum and minimum frequency (fmin and fmax), thus truncating the frequency range to the area of interest. In the PSCA, fmin is set to 250 Hz and fmax to 1050 Hz. This range covers approximately two octaves, from C4 (261 Hz) to C6 (1046 Hz). In addition to the fundamental frequencies, the human voice also generates overtones. The overtones are what make up the timbre of a sound; they vary from one person’s voice to another and, of course, also vary greatly between instruments. The pitch, sustain and amplitude of each of these frequencies are among the ways we as humans differentiate between different instruments playing the same note.


Figure 4.4: Program flow for the PSCA: The user records audio which in turn is analysed, segmented and added to the note bank. The harmony prediction model predicts a chord, which is then constructed by concatenating and adding audio segments from the note bank. Finally, this concatenated audio layer is added to the playback loop together with the latest recording made by the user.


Since the goal is to discover what notes were recorded, the overtones can be misleading and should be ignored; instead, the aim is to find the fundamental frequencies. Experimenting with different frequency ranges could improve the pitch detection function of the PSCA. One could argue that the fmin and fmax values used in the system today could be reduced even further, for example to a single octave, as octave information is disregarded in the PSCA anyway, and a slimmer frequency range would help to exclude the overtone information. The piptrack function is also used to provide a naive assessment of what key the user will be singing in. As the user records the first layer of the playback loop, this recording is analysed in order to find the first note which is present. This note is assumed to be equal to the key of the recording.

'''
librosa.core.piptrack(y=None, sr=22050, S=None, n_fft=2048, hop_length=None,
                      fmin=150.0, fmax=4000.0, threshold=0.1)
'''
pitches, magnitudes = librosa.piptrack(y=indata, sr=SAMPLE_RATE,
                                       fmin=250.0, fmax=1050.0)

    Figure 4.5: librosa piptrack function. Definition and example of use

    4.2.2 The note bank

The audio used for building chords is stored in a dictionary, the note bank. From the note bank the audio can be accessed later when a certain note is needed to build a chord. Each key corresponds to one of the 12 notes. The recording is divided into segments of 1/8th-note durations. The pitch of each segment is analysed and mapped to one of the 12 notes, disregarding the octave information. When new examples of a note are found in the current recording array, they are added to the bank. Since the user may not have perfect pitch, some variation in pitch is tolerated, but not too much, as this could affect the harmony when the segment is used to construct a chord. Therefore, only segments where the singer is within 25 cents of the correct pitch are added. As the program runs and the user records new audio, the program continues to fill the note bank.
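A small helper like the one below illustrates how a detected frequency can be mapped to one of the 12 pitch classes with a tolerance given in cents. It is a sketch of the idea, with illustrative names, and not the PSCA's exact code.

import numpy as np

def pitch_class_within_tolerance(freq_hz, tolerance_cents=25.0):
    # Convert the frequency to a fractional MIDI note number (A4 = 440 Hz = 69).
    if freq_hz <= 0:
        return None
    midi = 69 + 12 * np.log2(freq_hz / 440.0)
    nearest = int(round(midi))
    cents_off = abs(midi - nearest) * 100.0       # 100 cents per semitone
    if cents_off > tolerance_cents:
        return None                               # too far off pitch, discard the segment
    return nearest % 12                           # 0 = C, octave information discarded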

    4.2.3 Chord construction

Chords are represented as a set of note indexes corresponding to the integer representation (0-11) of the notes that constitute that chord. For example, C major = 0, 4, 7. The root of a C major chord is the note C, which maps to zero; the second note is the major third interval, 4 half steps (semitones) above the root; and the third note in the major triad is a perfect fifth, 7 semitones above the root. See figure 4.6 for a visual representation of these note mappings.


Figure 4.6: Mappings of notes to integers from 0 to 11. Highlighted is a C major triad

When a chord has been selected, the system attempts to find audio at the relevant indexes in the note bank. If the note bank does not contain an example of a pitch, audio from the closest non-empty index of the note bank is shifted to the correct note using the librosa function pitch_shift. Each of the notes in the chosen chord is concatenated to fill the chord layer array. This creates a continuous harmony for the duration of the loop. A chord can be selected either by one of the machine learning models (Section 4.3) or using the base case approach (Section 4.2.5).
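The lookup-and-shift step could look roughly like the sketch below. It assumes the note bank maps pitch-class indexes (0-11) to lists of mono 1/8th-note segments and that at least one entry is filled; the function names and the way segments are tiled to fill the loop are illustrative assumptions rather than the PSCA's actual implementation.

import numpy as np
import librosa

def get_note_audio(target, note_bank, sr):
    # Return audio for one pitch class, pitch-shifting the nearest filled
    # note bank entry when the target pitch class is missing.
    if note_bank.get(target):
        return note_bank[target][0]
    filled = [n for n in note_bank if note_bank[n]]
    src = min(filled, key=lambda n: abs(n - target))
    return librosa.effects.pitch_shift(note_bank[src][0], sr=sr, n_steps=target - src)

def build_chord_layer(chord_notes, note_bank, sr, loop_len):
    # Concatenate the chord's note segments and repeat them until the
    # chord layer covers the whole loop.
    segments = [get_note_audio(n, note_bank, sr) for n in chord_notes]
    layer = np.concatenate(segments)
    reps = int(np.ceil(loop_len / len(layer)))
    return np.tile(layer, reps)[:loop_len]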

Figure 4.7: Example of a measure and the resulting note vector

    4.2.4 Creating note vectors

To predict a new chord, a note vector is fed to the hidden Markov model (HMM) or Bidirectional Long Short-Term Memory (BLSTM) model. The note vector contains the duration of each note that was present in the new recording. The note vectors are created using the piptrack function. After the entire recording has been analysed by piptrack, the resulting pitch array is iterated through and the occurrences of each of the 12 notes are counted. Finally, the note vector is normalised according to the total number of time steps as returned by piptrack. Figure 4.8 shows the bar plots generated from note vectors created when singing one, two or three notes into the PSCA. The figures show that the piptrack function captures the notes that were sung, but also some information at other notes. These could be generated due to slight drifts in pitch as well as some of the overtones that are present in my voice.
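The procedure can be summarised in a short function. The sketch below keeps only the strongest pitch in each piptrack frame, which is one reasonable simplification; it is illustrative and not the PSCA's exact code.

import numpy as np
import librosa

def note_vector(audio, sr, fmin=250.0, fmax=1050.0):
    # Count, per pitch class, how many frames piptrack assigns a pitch to,
    # then normalise by the total number of frames.
    pitches, magnitudes = librosa.piptrack(y=audio, sr=sr, fmin=fmin, fmax=fmax)
    vec = np.zeros(12)
    n_frames = pitches.shape[1]
    for t in range(n_frames):
        strongest_bin = magnitudes[:, t].argmax()
        freq = pitches[strongest_bin, t]
        if freq > 0:
            pitch_class = int(np.round(librosa.hz_to_midi(freq))) % 12
            vec[pitch_class] += 1
    return vec / max(n_frames, 1)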


(a) Note vector created when singing one note: C

(b) Note vector created when singing two consecutive notes: F and A

(c) Note vector created when singing three consecutive notes: C, F and A

Figure 4.8: Note vectors created when singing one 4.8(a), two 4.8(b) and three notes 4.8(c)


4.2.5 Base case implementation

As a base case for comparing the results of our machine learning models, a naive approach to chord prediction is implemented. Running the PSCA in the base case mode means that no machine learning techniques will be used to choose an appropriate chord. Instead, the program will attempt to detect the first note in each new recorded loop and will generate a major triad with this first note as its root. For this approach the song key is disregarded completely, as it would not affect the chord choice.
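In code, the base case amounts to little more than the following; the helper name is illustrative.

def base_case_chord(first_note):
    # Major triad on the detected root: root, major third (4 semitones)
    # and perfect fifth (7 semitones), all modulo 12.
    return [first_note % 12, (first_note + 4) % 12, (first_note + 7) % 12]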

    4.3 Machine learning approaches

Many have researched generating musical accompaniment using HMMs or Recurrent Neural Networks (RNNs). In the PSCA system, two machine learning approaches are used for the task of harmony prediction: the HMM and the BLSTM. These models, discussed in chapter 3, have previously been used in other harmony prediction systems. In particular, the HMM implementation in the PSCA is inspired by the article by Simon et al. at Microsoft [71] presenting MySong. Similarly, the research of Lim et al. [46] inspired the BLSTM implementation.

    4.3.1 Dataset

The dataset consists of 1850 lead sheets in MusicXML format (.mxl) collected online. These were sourced from a dataset previously shared at wikifonia.org until 2013. Each lead sheet contains a monophonic melody and chord notation as well as key and time signature. The dataset contains western popular music, with examples from genres like pop, rock, RnB and jazz, as well as songs from musical theatre and children's songs. The dataset contains songs in both major and minor keys: approximately 70% of the songs are in major keys and 30% in minor keys.

Using the music21 toolkit [20], a Python toolkit developed to analyse, transform and search musical scores, each lead sheet can be flattened into consecutive measures containing melody and chord information. The music21 toolkit also extracts information about key and time signature from each lead sheet. In order to create example and target pairs for machine learning, each measure in each of the songs should contain only one chord. For the songs where the occasional measure has more than one chord, all but the final chord in the measure are ignored. Should a measure contain no chord notation, the chord from the preceding measure is used. Similarly, measures that do not contain any notes are not added. Songs that contain key changes are removed from the dataset, though one option would be to process these songs by segmenting the parts with different key signatures and processing each part as a separate song. Figure 4.9 shows an example of a lead sheet from the dataset.


Figure 4.9: Example of a lead sheet from the dataset: Let Me Call You Sweetheart (Friedman and Whitson, 1910)


4.3.2 Pre-processing

The music21 toolkit parses the MusicXML files in the dataset, creating useful Python objects to represent the score. From each of the songs, a list of consecutive measures is extracted. Each measure contains the notes found in the melody of this measure and the measure's chord symbol. The former becomes the note vector input to our classifiers, and the chord symbol becomes the corresponding target class label. The information available through the chord object attribute pitchedCommonName is used to acquire the type (major, minor, augmented, diminished or suspended) of the chord. Similarly, the duration attribute of the note object is used to generate the note vector of each measure by counting the duration of each note in the measure and dividing by the song's time signature reciprocal. In total, the dataset contains 47564 measures.
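A simplified sketch of this extraction with music21 is shown below. It assumes the melody and chord symbols are stored in the first part of the score and it omits the time-signature normalisation and chord simplification; it is illustrative rather than the PSCA's pre-processing code.

from music21 import converter, harmony, note

def measures_from_leadsheet(path):
    score = converter.parse(path)
    samples = []
    last_chord = None
    for measure in score.parts[0].getElementsByClass('Measure'):
        durations = [0.0] * 12                    # one slot per pitch class, 0 = C
        for element in measure.notesAndRests:
            if isinstance(element, harmony.ChordSymbol):
                last_chord = element.pitchedCommonName
            elif isinstance(element, note.Note):
                durations[element.pitch.pitchClass] += element.duration.quarterLength
        # Keep the measure if it has notes; reuse the previous chord when none is given.
        if last_chord is not None and sum(durations) > 0:
            samples.append((durations, last_chord))
    return samples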

    4.3.3 Chord types

The songs in the dataset, as is the case for most popular songs in western music, can be classified as being in one of 12 musical keys. The key of a song tells us something about which notes and chords are most likely to be used in that song. By shifting all the notes up or down (transposing), the key of a song can be changed without losing any information. For example, the notes CCGGAAG (Twinkle, Twinkle, Little Star) in the key of C are equivalent to FFCCDDC in the key of F. As long as the intervals between notes remain the same, we experience the melodies as equivalent. To be able to analyse and compare the chord progressions and melodies in the dataset, all songs are transposed to the same key without the loss of any information: songs originally in a major key are transposed to C major, and all minor-key songs to A minor. C major and A minor are equivalent in that the C major and A minor scales contain the same notes.
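With music21, this normalisation can be sketched as follows. The key analysis and interval calculation shown here are a common music21 pattern, but the PSCA's own transposition code may differ in detail.

from music21 import interval, pitch

def transpose_to_common_key(score):
    # Estimate the key, then transpose to C major or A minor accordingly.
    key = score.analyze('key')
    target_tonic = 'C' if key.mode == 'major' else 'A'
    shift = interval.Interval(key.tonic, pitch.Pitch(target_tonic))
    return score.transpose(shift)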

The number of unique chords in the dataset, after the lead sheets are transposed to the key of C, is 151. Attempting to use each of these unique chords as a separate class would result in very few training examples for most of the chords. To minimise the number of possible chord choices, each chord is simplified to its closest triad. Each chord is labelled as one of 5 rudimentary chord types: minor, major, diminished, augmented and suspended, as proposed in the SongSmith system. This gives us 60 unique chords, as there are 5 chord types for each of the 12 root notes. Even when simplifying to the chord's closest triad, some of the triads still have few, or no, examples (specifically, suspended versions of the chords C#, D#, F#, G# and A#).

During training I also experimented with categorising the chords in the dataset into 2 types, major and minor, resulting in 24 unique chords. However, this approach still leads to an unbalanced dataset. For the models used in the PSCA the 60-chord approach was used in order to allow for the


Table 4.1: Examples of original chords found in the lead sheets and the resulting triad label used when simplifying to the closest triad

Triad Label    Original Chord
Major          CMaj7
Minor          Cm7
Sus            G5
Aug            F7+5
Dim            Edim7

possibility of more varied chord types in the predictions. More on this topic will be explored in section 5.2.4.

    4.3.4 HMM model

    Training the HMM

Training of the HMM is done by building probability matrices directly from the data. As there are no HMM libraries available that are particularly appropriate for the task, the HMM has been coded from scratch in Python. As described in the previous chapter in section 3.2, the HMM consists of three matrices: the transition probability matrix, the emission probability matrix and the start probability matrix. Hence the problem can be divided into three separate tasks:

1. Transition probabilities