Automatic Analysis of Music in Standard MIDI Files


Automatic Analysis of Music in Standard MIDI Files

Zheng Jiang

May 2019

Music and Technology
Carnegie Mellon University

Pittsburgh, PA 15213

Thesis Committee:
Roger Dannenberg, Chair

Barnabas Poczos

Submitted in partial fulfillment of the requirements for the degree of Master of Science.

Copyright © 2019 Zheng Jiang


Keywords: MIDI, Melody Analysis, Chord Analysis, Structure Analysis


For my love.


Abstract

The identification of melody, bass, chords, and structure are important steps in music analysis. This thesis presents a comprehensive tool to identify the melody and bass in each measure of a Standard MIDI File, and provides chord labels and general structure information. We also share an open dataset of manually labeled music for researchers. We use a Bayesian maximum-likelihood approach and dynamic programming as the basis of our work on melody identification. We have trained parameters on data sampled from the million song dataset [16, 21] and tested on a dataset including 1703 measures of music from different genres. Our algorithm achieves an overall accuracy of 89% on the test dataset. We compare our results to previous work. For bass identification, since our algorithm is rule-based, we only label the test set; bass identification achieves over 95% accuracy. Our chord-labeling algorithm is adapted from Temperley [20] and tested on a manually labeled test set containing 1890 labels. Currently, this algorithm achieves around 78% accuracy when matching both chord root and chord type, and around 82% accuracy when matching the chord root only. We also discovered an optimization: by removing the notes in the melody channel, we can improve the accuracy of both evaluations by 1.3%. For structure analysis, we also provide a simple algorithm based on a similarity matrix and give some analysis of the result. By automatically labeling and analyzing MIDI files, a rich source of symbolic music information, we hope to enable new studies of music style, music composition, music structure, and computational music theory.


Acknowledgments

Looking as far into the past as my memory allows, I discern a few milestones that led to the creation of this project. Each milestone is associated with someone who played a vital role in my ability to realize the present work.

Two years ago, I submitted my application to the Music and Technology program because Music and Computer Science are the two essentials in my life. In this program I was so pleased to be advised by a pioneer in this field, my advisor, Prof. Roger B. Dannenberg, and I was honored, and a little surprised, to be the only student admitted that year. In addition, I had the opportunity to use the facilities of his research lab. This was a very generous arrangement that few graduate students have the opportunity to enjoy. I am forever grateful to Roger for this.

In the second semester, I took a course named Art and Machine Learning given by Prof. Barnabas Poczos and Prof. Eunsu Kang. In that course we created a Chinese opera with machine learning algorithms and presented it at a workshop of the Thirty-second Conference on Neural Information Processing Systems (2018). It is my honor that Barnabas became my co-advisor to support my research in computer-based music analysis.

During this research, our work on melody analysis was accepted at the Sound and Music Computing Conference 2019. It is a milestone in my academic life as a graduate student.

And great thanks to my parents for supporting my dreams in both music and computer science!

Thanks to Prof. Richard Randall, Prof. Richard Stern, and Prof. Riccardo Schulz for giving me comments on my proposal and thesis.

Also thanks to Shuqi Dai and Junyan Jiang for providing some MIDI files and some of the labels in my dataset.


Contents

1 Introduction

2 Related Work
   2.1 Melody and Bass Analysis
   2.2 Chord Analysis
   2.3 Structure Analysis

3 Machine Readable Music Representation
   3.1 Introduction to MIDI
   3.2 Music Representation

4 Melody Analysis
   4.1 Introduction
   4.2 Dataset
   4.3 Algorithm
       4.3.1 Bayesian Probability Model
       4.3.2 Training data
       4.3.3 Melodic probability
       4.3.4 Dynamic Programming
   4.4 Experiment and Result
       4.4.1 Training
       4.4.2 Testing
   4.5 Analysis
       4.5.1 A Successful Sample
       4.5.2 Failed Samples
   4.6 Conclusion

5 Bass Analysis
   5.1 Introduction
   5.2 Algorithm
   5.3 Experiment and Result
   5.4 Conclusion

6 Chord Analysis
   6.1 Introduction
   6.2 Dataset
   6.3 Definition of Chords
   6.4 Algorithm
       6.4.1 Optimization: Subtracting Melody
   6.5 Evaluation
       6.5.1 Evaluation Criterion
       6.5.2 Result and Analysis
   6.6 Conclusion

7 Structure Analysis
   7.1 Introduction
   7.2 Algorithm
       7.2.1 Similarity Matrix
       7.2.2 Repetition Detection
       7.2.3 Representation Transformation
   7.3 Evaluation
       7.3.1 Evaluation Method for Structure Analysis
   7.4 Conclusion

8 Future Work
   8.1 Future Work on Melody
   8.2 Future Work on Chord
   8.3 Future Work on Structure

Bibliography


List of Figures

4.1 Measure Level Melody Channel Distribution
4.2 Accuracy for different values of window size and switch penalty
4.3 Accuracy on the test dataset vs. Window Size (= 1 + 2N) for Switch Penalty = 36
4.4 Accuracy on the test dataset vs. Switch Penalty with Window Size = 5. The highest accuracy is 89.15% using any Switch Penalty ∈ {30, 32, ..., 38}
4.5 The melody detected in the song “Hotel California”
4.6 The algorithm identified the top (red) channel as the melody of “Being,” but the correct melody is shown in yellow at the bottom
4.7 Two channels together make up the melody
7.1 Maximum Sharing Rate Similarity Matrix for Alimountain
7.2 Chroma Vector Similarity Matrix for Alimountain
7.3 After the Energy Based Algorithm for Alimountain
7.4 After the Filter and Cluster Algorithm for Alimountain
7.5 Reference and Analysis Result of Alimountain
7.6 Reference and Analysis Result of Yesterday
7.7 Reference and Analysis Result of Moonwish


List of Tables

3.1 Melody Representation
3.2 Bass Representation
3.3 Harmony Representation
3.4 C Major Pitch Class Vector
3.5 Harmony Representation
3.6 Example Structure Representation
4.1 Mean and standard deviation of accuracy in 5-fold cross validation using the top 5 feature sets, window size = 5
4.2 Mean and standard deviation of accuracy in 5-fold cross validation using individual features, window size = 5
6.1 Chord Notations
6.2 Chord Test Result without Optimization
6.3 Chord Test Result with Optimization
7.1 Final Output of Alimountain
7.2 Evaluation scores for 10 songs


Chapter 1

Introduction

The field of Music Information Retrieval (MIR) began with a focus on audio content-based search and classification. After a decade of intense activity, it became clear that a major problem for machine processing of music is a deep understanding of music structure, patterns, and relationships. Thus, music analysis that identifies melody, bass, chords, and other elements is seen as an important challenge for MIR. Recently, machine learning algorithms have dominated work in MIR. One property of machine learning algorithms is that they require a lot of data. There is a wealth of data in the form of MIDI files, which contain pitch and beat information. This symbolic representation of music is very difficult to extract directly and precisely from audio files. On the other hand, humans do understand the concepts of notes, pitches, and beats. Working with this level of representation is relevant to human music perception even if we cannot model the transformation of audio into these representations. MIDI gives us the opportunity to extract much useful information for studying music. But so far, no system does a good job of extracting such data from MIDI files in general. Typically, researchers use carefully selected or prepared MIDI files and look for particular information. As far as we know, there is no comprehensive system for MIDI file analysis. The aim of this thesis is to build a comprehensive system that can extract melody, bass, harmony, and structure information, which is the basic information for a music theorist and a starting point for any kind of analysis in music theory. The hope is that by building this tool, researchers will be able to get more useful information from MIDI files and use it for various tasks. These tasks include automatic music composition, building models for music listening (models of music and musical expectation when people read music notation or listen to music), and so on. Having music data to look at is important not only for generating new music but also for studies of music listening. And because there are so many music genres, in this thesis we decide to focus on pop music.

The thesis consists of the following chapters: Chapter 2 on related work, Chapter 3 on machine-readable music representations, Chapter 4 on melody analysis, Chapter 5 on bass analysis, Chapter 6 on harmony analysis, Chapter 7 on structure analysis, and Chapter 8 on future work.


Chapter 2

Related Work

2.1 Melody and Bass Analysis

In the field of symbolic files, Skyline is a very simple algorithm proposed by Uitdenbogerd [22]. In brief, the idea of this method is to pick the highest pitch at any moment as belonging to the melody. Chai and Vercoe offer an enhanced version of this approach [5]. In pop music, we observe that there are often accompaniment notes above the melody line, leading to failure of the Skyline algorithms. Uitdenbogerd presents three more methods [9]: 1) Top Channel (choose the channel with the highest mean pitch), 2) Entropy Channel (choose the channel with the highest entropy), and 3) Entropy Part (segment first, then use Entropy Channel). Shan [18] proposed using greatest volume (MIDI velocity) because melody is typically emphasized through dynamics. Li et al. identify melodies by finding common sequences in multiple MIDI files, but this obviously requires multiple versions of songs [13]. Li, Yang, and Chen [12] use a Neural Network and features such as chord rate, pronunciation rate, average note pitch, instrument, etc., trained on 800 songs, to estimate the likelihood that a channel is the melody channel. Velusamy et al. [23] use a similar approach, but prune notes that do not satisfy certain heuristics and use a hand-crafted linear model for ranking channels [12].

All of these algorithms assume that the melody appears on one and only one channel, so the problem is always to select one of up to 16 channels as the melody channel. Depending on the data, this can be a frequent cause of failure, since the melody can be expressed by different instruments in different channels at different times. An interesting approach is Tunerank [25], which groups and labels notes according to harmony and dissonance with other notes, pitch intervals between consecutive notes, and instrumentation, without assuming the melody is in only one channel.

Previous work is hard to evaluate, with accuracy reports ranging from 60% to 97%, no labeled public datasets, and few shared implementations. The properties of music arrangements and orchestrations in MIDI files can cause many problems. The simplest case, often assumed in the literature, is that the melody appears in one and only one channel. At least four more complex conditions are often found: 1) the melody is sometimes played in unison or octaves in another channel, 2) the melody switches from one instrument (channel) to another from one phrase or repetition to another, 3) the melody is fragmented across channels even within a single measure (this happens but seems to be rare), and 4) there are multiple overlapping melodies as in counterpoint, rounds, etc.

Unlike melody, there are not many papers that discuss bass extraction in MIDI files. One related research area is bass transcription from audio [17]. However, since this thesis focuses on symbolic data, we will not discuss the audio transcription topic here.

There are some applications of detected bass lines. One of them is automatic music genre classification using bass lines [19]. Besides that, the output of our system can also be used in music generation systems.

2.2 Chord Analysis

Temperley [20] proposes an algorithm for harmonic analysis. It generates a representation in which the piece is divided into segments labeled with roots. One of its major innovations is that pitches and chords are both placed on a spatial representation known as the “line of fifths”.

Datasets are a perennial concern when we consider a computational model for analyzing music. In the audio area, there are some datasets that have been well labeled. For example, Burgoyne [4] proposes such a dataset. Working with a team of trained jazz musicians, they collected time-aligned transcriptions of the harmony in more than a thousand songs selected randomly from the Billboard “Hot 100” chart in the United States between 1958 and 1991. However, in the symbolic area, such manually labeled datasets are very rare. One method is to convert chord information from lead sheets. Lim et al. [14] publish a CSV Leadsheet Database which contains a lead sheet database provided by Wikifonia.org. It has Western music lead sheets in MusicXML format, including rock, pop, country, jazz, folk, RnB, children's songs, etc. The dataset is split into two sets: a training set of 1802 songs and a test set of 450 songs. However, because lead sheets do not contain notes other than the melody, it is hard to consider this as a training or testing dataset for the chord analysis task.

Pardo [15] also describes a system for segmenting and labeling tonal music based on template matching and graph-search techniques. His ideas about evaluation also inspired the evaluation methods in this thesis.

2.3 Structure Analysis

Dannenberg and Goto [6] introduce methods to analyze textures and the repetition of phrases or entire sections, and their algorithms that leverage the similarity matrix are quite inspiring. Our algorithm takes some inspiration from this work, although the analysis space is changed from audio to MIDI. Klapuri et al. [10] give an overview of state-of-the-art methods for computational music structure analysis, where the general goal is to divide an audio recording into temporal segments corresponding to musical parts and to group these segments into musically meaningful categories. Lerdahl et al. [11] model music understanding from the perspective of cognitive science and a grammar of music with the aid of generative linguistics, which is also a kind of structure analysis. Bruderer et al. [2] investigate the perception of structural boundaries in Western popular music and examine the musical cues responsible for their perception.

Goto and Dannenberg [8] introduce some applications and interfaces that are enabled by advances in automatic music structure analysis. One of them is the SmartMusicKIOSK interface [7]. A user can actively listen to various parts of a song, guided by the visualized music structure (“music map”) in the window.


Chapter 3

Machine Readable Music Representation

3.1 Introduction to MIDI

Currently, MIDI is one of the most popular formats for storing music information. In this section, we will introduce some properties of MIDI to explain the parts that will be used in our research and the information that we ignore.

We can consider a MIDI file to be a list of events of different categories. In general, there are three major categories: channel events, system events, and meta events. A channel event contains a channel number (in the MIDI specification there are 16 channels, and each channel can be assigned an instrument) and is used to represent note on/off, control change, program (instrument) change, pitch wheel, etc. System events contain start, stop, reset, and similar commands. Meta events contain the time signature, tempo, etc. MIDI files do not necessarily indicate the beat number or measure number for each note. The meta events will be used in our system to identify beats and measures. If the time signature and tempo meta events are missing, that piece is considered invalid input to our system. In our research, we are mainly concerned with pitches, durations, and instruments, so we ignore many events associated with synthesizer control, such as volume and pitch wheel events.

Because the MIDI file is stored in a binary format, it is not very convenient for programmers to read and write MIDI information. In our system, we use Serpent1 to read binary MIDI files and then write the content that we need to an ASCII text file. Our algorithms only consider the information in the text files instead of the original MIDI file.
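The thesis tool performs this conversion with Serpent; purely as an illustration, the following minimal sketch does the same kind of MIDI-to-text dump in Python using the third-party mido library (an assumption, not the tool described above), and the output line format is invented for the example.

import mido

def midi_to_text(midi_path, text_path):
    # Read a Standard MIDI File and write the events we care about as plain text.
    mid = mido.MidiFile(midi_path)
    with open(text_path, "w") as out:
        out.write(f"ticks_per_beat {mid.ticks_per_beat}\n")
        for track in mid.tracks:
            ticks = 0  # absolute time in ticks within this track
            for msg in track:
                ticks += msg.time  # msg.time is a delta time in ticks
                if msg.type == "time_signature":
                    out.write(f"timesig {ticks} {msg.numerator}/{msg.denominator}\n")
                elif msg.type == "set_tempo":
                    out.write(f"tempo {ticks} {msg.tempo}\n")  # microseconds per beat
                elif msg.type == "program_change":
                    out.write(f"program {ticks} ch={msg.channel} prog={msg.program}\n")
                elif msg.type == "note_on" and msg.velocity > 0:
                    out.write(f"note_on {ticks} ch={msg.channel} "
                              f"pitch={msg.note} vel={msg.velocity}\n")
                elif msg.type == "note_off" or (msg.type == "note_on" and msg.velocity == 0):
                    out.write(f"note_off {ticks} ch={msg.channel} pitch={msg.note}\n")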

3.2 Music Representation

How to store music information in a digital computer is an essential concern when we design a music processing system. In this section, we will discuss the music representation for the output of our system. Since there are four main parts in our output, namely melody, bass, harmony, and structure, we list the specification for each below.

1. https://www.cs.cmu.edu/~music/serpent/serpent.htm


Considering that many researchers use Matlab, our music representation is saved in a CSV format, which has at least two advantages: 1) it is easy to check the correctness by eye; 2) it is easy to read CSV data in a variety of languages such as Matlab or Python. This design may introduce an extra cost of storage, but nowadays memory is much cheaper than decades ago when the Standard MIDI File format was designed, so we think this convenience-and-space trade-off is reasonable.

• Melody

In our output, we use quantization to restrict the data to integers. We use 1/24 beat as the minimal unit for duration because it can represent 32nd notes and 64th-note triplets, which is fine enough for almost any rhythm. So a melody note in the first measure (m), second beat (b), which has a duration (d) of one beat, pitch (p) of 60 (C4 on the keyboard), velocity (v) of 127, channel (c) of 0, and instrument (i) number of 1 (piano) is arranged as in Table 3.1.

Table 3.1: Melody Representation

m  b   d   p   v    c  i
0  24  24  60  127  0  1

• Bass

The bass representation is similar to melody. For example, a bass note in the second measure (m), first beat (b), which has a duration (d) of four beats, pitch (p) of 36 (C2 on the keyboard), velocity (v) of 77, channel (c) of 1, and instrument (i) number of 32 (acoustic bass guitar) is arranged as in Table 3.2.

Table 3.2: Bass Representation

m  b  d   p   v   c  i
1  0  96  48  77  1  33

• Harmony

Unlike melody and bass, in the harmony part the pitch, velocity, channel, and instrument are ignored. Instead, we encode the root, type, and pitch class set. The root is an integer from 0 to 11 (chromatic scale, [C, C#/Db, D, ..., B]). The type is the classification such as major or minor, represented as an integer; the mapping from number to meaning can be found in Table 3.3. The pitch class set is a number that represents a pitch class vector as a decimal integer. For example, the C major chord has the pitch classes C, E, and G. The pitch class vector is shown in Table 3.4. In binary, with C as the low-order bit, we can encode the set {C, E, G}, i.e. the numbers {0, 4, 7}, as the binary number 000010010001B, which is the decimal number 145 (a short code sketch of this encoding appears after Table 3.6). We can use the pitch class set to represent rare chords without conventional types. Also, we do not distinguish, say, Major from Major-Seventh chords, so the pitch class set gives additional information about chords. Note also that the pitch class set alone is not sufficient information; for example, C-major-6th has the same pitch class set as A-minor-7th. The pitch class set is simply the set of all pitch classes that are observed within the segment of music labeled by the chord. For example, suppose music experts consider the true chord type to be a G dominant 7 chord, but while the chord is playing, the melody plays not only chord tones but also the non-chord passing tones Ab and F#. Our chord label will be G Major (we do not consider dominant seventh chords to be a different category). The pitch class set will contain all the observed pitch classes: {D, F, F#, G, Ab, B}. Alternatively, if the true chord type is G dominant 7 and we have the same melody, but the fifth (D) is not played, then the pitch class set will not contain the fifth (D): {F, F#, G, Ab, B}.

Table 3.3: Harmony Representation

#    Meaning
0    Major
1    Minor
2    Diminished
3    Augmented
4    Suspended 4
-1   Others

Table 3.4: C Major Pitch Class Vector

B  A#  A  G#  G  F#  F  E  D#  D  C#  C
0  0   0  0   1  0   0  1  0   0  0   1

So a C major chord in the first measure (m), first beat (b), with a duration (d) of 4 beats, root (r) of C, type (t) of Major, and pitch class vector (s) of 145 is arranged as in Table 3.5.

Table 3.5: Harmony Representation

m  b  d   r  t  s
0  0  96  0  0  145

• Structure

To represent structure, we only consider the measure, duration, and section information. As Table 3.6 shows, the first section starts from the first measure and lasts for 8 measures. The first section repeats once (9th-16th measures). The second section starts from the 17th measure and lasts for 8 measures. Then, the first section repeats again, so that is an AABA structure. In this thesis, we only consider the very high-level structure, so very detailed structures will not be shown. Also notice that here the unit of duration is not the 1/24 beat but the measure, and numbering is zero-based, i.e. 0 denotes the 1st measure, etc.


Table 3.6: Example Structure Representation

m   d  s
0   8  0
8   8  0
16  8  1
24  8  0
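As a concrete illustration of the harmony encoding above, here is a small sketch (assumed code, not part of the thesis system): it converts a pitch class set to its decimal integer and writes one harmony record as a CSV row. The helper name pitch_class_set_to_int and the file name harmony.csv are illustrative assumptions.

import csv

def pitch_class_set_to_int(pitch_classes):
    """Encode a set of pitch classes (0=C ... 11=B) as a bit vector, C = low-order bit."""
    value = 0
    for pc in pitch_classes:
        value |= 1 << (pc % 12)
    return value

# C major triad {C, E, G} = {0, 4, 7} -> 1 + 16 + 128 = 145, matching Table 3.4/3.5.
assert pitch_class_set_to_int({0, 4, 7}) == 145

with open("harmony.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # C major chord: measure 0, beat 0, duration 4 beats (96 in 1/24-beat units),
    # root 0 (C), type 0 (Major), pitch class set 145 -- the row shown in Table 3.5.
    writer.writerow([0, 0, 96, 0, 0, 145])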


Chapter 4

Melody Analysis

Melody identification is an important early step in music analysis. In this chapter, we present a tool to identify the melody in each measure of a Standard MIDI File. We also share an open dataset of manually labeled music for researchers. We use a Bayesian maximum-likelihood approach and dynamic programming as the basis of our work. We have trained parameters on data sampled from the million song dataset [16, 21] and tested on a dataset including 1703 measures of music from different genres. Our algorithm achieves an overall accuracy of 89% on the test dataset. We compare our results to previous work.

4.1 Introduction

When we listen to a piece of music, the melody is usually the first thing that catches our attention. Therefore, the identification of melody is one of the most important elements of music analysis. Melody is commonly understood to be a prominent linear sequence of pitches, usually higher than harmonizing and bass pitches. The concept of melody resists formalization, making melody identification an interesting music analysis task. Melody is used to identify songs. Often, other elements such as harmony and rhythm are best understood in relation to the melody.

Many music applications depend on melody, including Query-by-Humming systems, music cover song identification, emotion detection [24], and expressive performance rendering. Many efforts in automatic composition could benefit from training data consisting of isolated melodies.

There has been a lot of research on extracting melody from audio [9]. The problem is generally easier for MIDI than audio because at least the notes are already identified and separated. However, compared to audio, there seems to be less research on melody extraction from MIDI, and most of it addresses channel-level identification. This chapter proposes an algorithm combining a Bayesian probability model and dynamic programming to extract melody at the measure level.

4.2 Dataset

In most related work, published links to datasets have expired, so we collected and manually labeled a new dataset which contains training data of 5823 measures in 51 songs and test data of 1703 measures in 22 songs. For each song in the training data, the melody is mostly all on one channel, which is labeled as such. (This made labeling much easier, but we had to reject files where the melody appeared significantly in multiple channels or there was no melody at all.) In the test data, the melody is not constrained to a single channel, and each measure of the song is labeled with the channel that contains the melody. Measure boundaries are based on tempo and time signature information in the MIDI file. The MIDI files are drawn from songs in the Lakh dataset [16]. This collection contains MIDI files that are matched to a subset of files in the million song dataset [21]. We used tags there to limit our selection to pop songs.

Figure 4.1: Measure Level Melody Channel Distribution

The test data is collected from Chinese, Japanese, and American pop songs. We specifically chose popular music because a melody is usually present and there is usually a single melody. In addition, we hope to use this research to learn about melody structure in popular music. Figure 4.1 shows the distribution of the measure-level melody channel in the test data. Channel 0 is the most popular melody channel, with around 38% of measures. Some channels in our dataset never carry the melody, such as channels 2, 5, 8, 9, and 15.

It might be noted that there are many high quality MIDI files of piano music. Since all piano notes are typically on one channel, this can make melody identification or separation a more challenging problem, and different techniques are required. We assume that in our data, once the channel containing the melody is identified, it is fairly easy to obtain the melody. Either the melody is the only thing present in the channel, or the melody is harmonized, and the melody is obtained by removing the lower notes using the Skyline algorithm.


4.3 Algorithm

Our problem consists of labeling each measure of a song with the channel that contains the melody. (It would be useful also to allow non-labels, or nil, indicating there is no melody, but our study ignores this option.) The algorithm begins with a Bayesian model to estimate $M_{m,c}$, the probability that channel c in measure m contains the melody. The estimation uses features that are assumed to be jointly normally distributed and independent. Features are calculated from the content of each channel, considering the measure itself and N previous and subsequent measures, with N ∈ {0, 1, ...}. In all of our experiments, we assume that the melody never appears on channel 10, which is used for drums in General MIDI.

Although we could stop there and report the most likely channel in each measure,

$c_m = \arg\max_c M_{m,c}$   (4.1)

this does not work well in practice. There are many cases where a measure of accompaniment, counter-melody, or bass appears to be more “melody-like” than the true melody. (For example, the melody could simply be a whole note in some measures.) However, it is rare for the melody to switch from one channel to another because typically the melody is played by one instrument on one channel. Channel switches are only likely to occur when the melody is repeated or on major phrase boundaries.

We can consider the melody channel for each measure, $c_m$, as a sequence of hidden states and the per-measure probabilities as observations. We wish to find the most likely overall sequence $c_m$ according to the per-measure probabilities, taking into account a penalty for switching channels from one measure to the next. We model the probability of the hidden state sequence $c_m$ as:

$P(c_m) = \prod_m M_{m,c_m}\, S_{c_{m-1},c_m}$   (4.2)

where $S_{c_{m-1},c_m} = 1$ if there is no change in the channel ($c_{m-1} = c_m$), and $S_{c_{m-1},c_m}$ is some penalty less than one if there is a channel change ($c_{m-1} \neq c_m$). Thus, channel switches are allowed from any measure to the next, but channel switches are considered unlikely, and any labeling that switches channels frequently is considered highly unlikely.

The parameters of this model must be learned, including: statistics for features used to estimate $M_{m,c}$, the best feature set, the number of neighboring measures N to use in computing features, and the penalty S for changing channels. We select the feature set and compute feature statistics using our training dataset, and we evaluate their performance and sensitivity to N and S using the test dataset.

4.3.1 Bayesian Probability Model

The probability of melody given a set of features is represented by Equation 4.3, where $C_0$ is the condition that the melody is present, $C_1$ indicates the melody is not present, $x_i$ are feature values, and n is the number of real-valued features. The details of the features are discussed below.

$P(C_0 \mid x_1, \ldots, x_n)$   (4.3)


By Bayes' theorem, this conditional probability can be rewritten as Equation 4.4.

$P(C_0 \mid x_1, \ldots, x_n) = \dfrac{P(C_0)\, P(x_1, \ldots, x_n \mid C_0)}{P(x_1, \ldots, x_n)}$   (4.4)

With the assumption of independence for each feature, we can rewrite this as Equation 4.5:

$P(C_0 \mid x_1, \ldots, x_n) = \dfrac{1}{Z}\, P(C_0) \prod_{i=1}^{n} P(x_i \mid C_0)$   (4.5)

where Z is:

$Z = P(x_1, \ldots, x_n) = \sum_{k=0}^{1} \Big( P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k) \Big)$   (4.6)

Our features $x_i$ are continuous values, and we assume they are distributed according to a Gaussian distribution as in Equation 4.7. Under this assumption, we can simply collect the feature statistics $\mu_{i,k}$ and $\sigma_{i,k}$ from training data to estimate the probability model.

$P(x_i = v \mid C_k) = \dfrac{1}{\sqrt{2\pi\sigma_{i,k}^2}}\, e^{-\frac{(v-\mu_{i,k})^2}{2\sigma_{i,k}^2}}$   (4.7)
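To make Equations 4.5-4.7 concrete, here is a minimal sketch (an assumed illustration, not the thesis code) of the Gaussian naive Bayes estimate of P(melody | features) for one measure and channel; the dictionary-based data layout and the function names are assumptions.

import math

def gaussian(v, mu, sigma):
    # Equation 4.7: univariate normal density
    return math.exp(-(v - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def melody_probability(x, stats, prior):
    """x: dict feature_name -> value for one measure/channel;
    stats[k][feature_name] = (mu, sigma) for class k (0 = melody, 1 = not melody);
    prior[k] = P(C_k)."""
    joint = []
    for k in (0, 1):
        p = prior[k]
        for name, v in x.items():
            mu, sigma = stats[k][name]
            p *= gaussian(v, mu, sigma)      # independence assumption, Eq. 4.5
        joint.append(p)
    z = sum(joint)                           # normalizer Z, Eq. 4.6
    return joint[0] / z if z > 0 else 0.0    # P(C_0 | x_1, ..., x_n)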

We now describe the details of the features, which are note density, vel mean, vel std, pitch mean, pitch std, IOI mean, and IOI std:

Note Density

The note density is the sum of all note durations divided by the total length of the music (Equation 4.8). A melody without rests has a note density of 1, a rest has a note density of 0, a sequence of triads without rests has a note density of 3, etc.

$\text{note density} = \dfrac{\sum_{\text{note}} \text{note.dur}}{\text{total length}}$   (4.8)

Velocity

We take the mean and standard deviation of velocity (Equations 4.9 and 4.10).

$\text{vel mean} = \dfrac{\sum_{i=1}^{N} \text{note}_i.\text{vel}}{N}$   (4.9)

$\text{vel std} = \sqrt{\dfrac{1}{N-1} \sum_{i=1}^{N} (\text{note}_i.\text{vel} - \text{vel mean})^2}$   (4.10)


Pitch

We take the mean and standard deviation of pitch (Equations 4.11 and 4.12).

$\text{pitch mean} = \dfrac{\sum_{i=1}^{N} \text{note}_i.\text{pitch}}{N}$   (4.11)

$\text{pitch std} = \sqrt{\dfrac{1}{N-1} \sum_{i=1}^{N} (\text{note}_i.\text{pitch} - \text{pitch mean})^2}$   (4.12)
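A sketch (assumed, not the thesis code) of how these per-window features might be computed for one channel; the note attribute names .dur, .pitch, and .vel, and the window_beats parameter, are illustrative assumptions (the IOI features are handled separately below).

from statistics import mean, stdev

def measure_features(notes, window_beats):
    # notes: the notes of one channel within the window of measures being analyzed
    if not notes:
        return None  # no notes on this channel in the window
    pitches = [n.pitch for n in notes]
    vels = [n.vel for n in notes]
    return {
        "note_density": sum(n.dur for n in notes) / window_beats,   # Eq. 4.8
        "pitch_mean": mean(pitches),                                # Eq. 4.11
        "pitch_std": stdev(pitches) if len(pitches) > 1 else 0.0,   # Eq. 4.12
        "vel_mean": mean(vels),                                     # Eq. 4.9
        "vel_std": stdev(vels) if len(vels) > 1 else 0.0,           # Eq. 4.10
    }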

Inter-Onset Interval

The Inter-Onset Interval, or IOI, is the interval between onsets of successive notes. Considering that ornaments and chords may introduce very short IOIs, we set a window of 75 ms, and when two note onsets are within that window, we treat them as a single onset [1]. The IOI calculation is described in detail in Algorithm 1.

Result: The mean and standard deviation of the IOIs for a list of notes in onset-time order
stats is an object that implements the calculation of mean and standard deviation;
note[i] is the i-th note;
N is the number of notes;
i ← 0;
while i < N do
    j ← i + 1;
    while (j < N) ∧ (note[j].on_time − note[j−1].on_time) < 0.075 do
        j ← j + 1;
    end
    if j < N then
        IOI ← note[j].on_time − note[i].on_time;
        stats.add_point(IOI);
    end
    i ← j;
end
IOI_mean ← stats.get_mean();
IOI_std ← stats.get_std();

Algorithm 1: IOI Feature Calculation
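For reference, a direct Python rendering of Algorithm 1 (an assumed sketch; the attribute name on_time and the use of the standard statistics module are illustrative choices):

from statistics import mean, stdev

def ioi_features(notes, window=0.075):
    # notes must be sorted by onset time (seconds); onsets within 75 ms are merged.
    iois = []
    i = 0
    n = len(notes)
    while i < n:
        j = i + 1
        # skip onsets that fall within the merge window of the previous onset
        while j < n and notes[j].on_time - notes[j - 1].on_time < window:
            j += 1
        if j < n:
            iois.append(notes[j].on_time - notes[i].on_time)
        i = j
    if not iois:
        return 0.0, 0.0
    return mean(iois), (stdev(iois) if len(iois) > 1 else 0.0)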

4.3.2 Training data

We compute features for each measure and channel of the training data. For feature selection, we use cross-validation, dividing the training songs into 5 groups, holding out each group and estimating $\mu_{i,k}$ and $\sigma_{i,k}$ from the remaining training data, and evaluating the resulting model by counting the number of measures where the melody channel is judged most likely by the model. We take the average result over all five groups.


After doing this for every combination of features (7 features, thus 127 combinations), for various values of N (the maximum distance to neighboring measures to use in feature calculation), we determine the features that produce the best result for each value of N.

We then re-estimate the probability model using all of the training data. In principle, we should also use the training data to learn the best window size N and the best penalty S, but in our training data, melodies are all in one channel, so the ideal value of N for this data would be large, and the ideal S would be zero (the highest penalty) to prevent the melody from changing channels. Instead, we determine N and S from our test data, where every measure is labeled as melody or not, and we report how these parameters affect accuracy using the test dataset.
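The feature-selection loop described above can be sketched as follows (assumed code; cross_validate is a hypothetical helper standing in for the 5-fold training and evaluation procedure, and the feature names mirror Table 4.1):

from itertools import combinations

FEATURES = ["note_density", "pitch_mean", "pitch_std",
            "ioi_mean", "ioi_std", "vel_mean", "vel_std"]

def select_features(songs, cross_validate):
    # Enumerate all 127 non-empty subsets of the 7 features and keep the best one.
    best_subset, best_acc = None, -1.0
    for r in range(1, len(FEATURES) + 1):
        for subset in combinations(FEATURES, r):
            acc = cross_validate(songs, subset, folds=5)  # mean accuracy over folds
            if acc > best_acc:
                best_subset, best_acc = subset, acc
    return best_subset, best_acc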

4.3.3 Melodic probability

To prepare for the dynamic programming step, we compute $M_{m,c}$, which is the natural log of the probability of melody in channel c at measure m. (The feature values are different for each combination of m and c.) In the next step, note that finding the labels with the greatest sum of log probabilities is equivalent to finding the labels with the greatest product of probabilities. Logarithms are used to avoid numerical underflow.

4.3.4 Dynamic Programming

We use dynamic programming to select the channel containing the melody in each measure. Algorithm 2 shows how the assignment of channels maximizes the sum of $M_{m,c}$ values, adjusted by subtracting $SP = -\log(S)$ each time the melody changes channels. The backtracking step is not shown since it is standard.1

4.4 Experiment and Result

4.4.1 Training

We tried different combinations of features, and the results are shown in Table 4.1 for windows of 5 measures (N = 2). The top 5 feature sets are shown along with the mean and standard deviation of accuracy across 5-fold cross-validation. Differences among the top feature sets are minimal. We use all features except velocity standard deviation.

Table 4.2 shows the results using each feature individually for 5-measure windows. This shows that all features offer some information (random guessing would be 1/15, or less than 7%, correct), but no single feature works nearly as well as the best combination.

If we assume the melody appears in only one channel, which is mostly the case for this training dataset, we can consider the measure-by-measure melody channel results as votes, picking the channel with the majority of votes as the melody channel. Our best feature set (all but velocity standard deviation) gives an accuracy of 96% (2 errors out of 51 songs), using 5-fold cross-validation. In the next section, we relax the assumption that the melody appears in only one channel.

1. https://en.wikipedia.org/wiki/Viterbi_algorithm


Result: For each measure, determine the channel containing the melody.
N is the number of measures, indexed from 0 to N−1;
C is the number of channels, indexed from 0 to C−1;
SP is the channel switch penalty, a parameter; SP = −ln(S);
A[m,c] is the accumulated score for measure m and channel c;
B[m,c] stores the optimal channel number of the previous measure;
M[m,c] tells how melodic channel c is in measure m;
for i in [0 . . . C) do
    A[0,i] ← M[0,i];
end
for m in [1 . . . N) do
    for c in [0 . . . C) do
        x ← A[m−1,c] + M[m,c];
        B[m,c] ← c;
        for i in [0 . . . C) do
            y ← A[m−1,i] + M[m,c] − SP;
            if c ≠ i ∧ y > x then
                x ← y;
                B[m,c] ← i;
            end
        end
        A[m,c] ← x;
    end
end

Algorithm 2: Dynamic Programming
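A Python sketch of Algorithm 2 (an assumed illustration, not the thesis implementation), including the standard backtracking step that the pseudocode omits. M[m][c] holds the log melodic probability of channel c in measure m, and SP is the switch penalty −ln(S).

def best_melody_channels(M, SP):
    n_measures = len(M)
    n_channels = len(M[0])
    A = [[0.0] * n_channels for _ in range(n_measures)]   # accumulated scores
    B = [[0] * n_channels for _ in range(n_measures)]     # best previous channel
    A[0] = list(M[0])
    for m in range(1, n_measures):
        for c in range(n_channels):
            # staying on the same channel costs nothing ...
            best_score, best_prev = A[m - 1][c] + M[m][c], c
            # ... switching from any other channel costs SP
            for i in range(n_channels):
                score = A[m - 1][i] + M[m][c] - SP
                if i != c and score > best_score:
                    best_score, best_prev = score, i
            A[m][c] = best_score
            B[m][c] = best_prev
    # backtrack from the best final channel
    labels = [0] * n_measures
    labels[-1] = max(range(n_channels), key=lambda c: A[-1][c])
    for m in range(n_measures - 1, 0, -1):
        labels[m - 1] = B[m][labels[m]]
    return labels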


Table 4.1: Mean and standard deviation of accuracy in 5-fold cross validation using the top 5 feature sets, window size = 5. Here, nd means note density, pm means pitch mean, ps means pitch std, im means IOI mean, is means IOI std, vm means vel mean, and vs means vel std.

nd  pm  ps  im  is  vm  vs  mean    std
1   1   1   1   1   1   0   72.40%  7.20%
1   1   0   1   1   1   0   71.60%  7.06%
1   1   1   1   1   1   1   71.40%  8.26%
1   1   1   1   0   1   0   71.40%  7.70%
1   1   1   1   0   1   1   71.00%  8.83%

Table 4.2: Mean and standard deviation of accuracy in 5-fold cross validation using individual features, window size = 5

nd  pm  ps  im  is  vm  vs  mean    std
1   0   0   0   0   0   0   45.80%  7.22%
0   1   0   0   0   0   0   44.60%  8.11%
0   0   1   0   0   0   0   35.80%  6.50%
0   0   0   1   0   0   0   31.00%  2.55%
0   0   0   0   1   0   0   30.40%  5.64%
0   0   0   0   0   1   0   32.00%  5.87%
0   0   0   0   0   0   1   20.60%  4.10%

4.4.2 Testing

Our test dataset labels each measure with a set of channels containing melody. In measures with no melody, this is the empty set. In some measures, the melody is duplicated in different channels, so the label can contain more than one channel. Note that if we can identify one channel containing the melody, it is simple to search for copies in the other channels. Our algorithm labels every measure with exactly one melody channel. We consider the output to be correct either if it is in the set of true melody channels according to our manual labels, or if the label is the empty set. Typically, the empty set (no melody label) appears in introductions, endings, and measures where the melody channel rests. In these cases (approximately 12% of all measures), there is no clearly correct answer, so any output is counted as correct.

We evaluated accuracy on the test dataset with many values of N and SP. For each value of N, we used the best feature set as determined from the training data and then evaluated the system with different values of SP. The results are shown in Figure 4.2.

Since N and SP are optimized on the test dataset to obtain the best accuracy of 89.15%, there is some risk of overfitting parameters to the test data. Given more labeled data, we would have used a different dataset to select N and SP, and then we could evaluate the entire system on the test dataset. Instead, we argue that the system is not very sensitive to N or SP, so overfitting is unlikely. Figure 4.3 shows how accuracy is affected by varying the window size using an optimal value of SP = 36 (again, the window includes the measure ± N measures, so the window size is 2N + 1). This figure shows that 5-measure windows worked the best, but windows up to about 15 measures also work well, with just a few percent variation in accuracy.


Figure 4.2: Accuracy for different values of window size and switch penalty.

Figure 4.3: Accuracy on the test dataset vs. Window Size (= 1 + 2N ) for Switch Penalty = 36.

Figure 4.4 shows how the accuracy is affected by varying SP, the switch penalty, using the optimal value of N = 2. The best performance is obtained with SP between 30 and 38, but any value from 2 to 38 will achieve performance within a few percent of the best. Since both graphs are fairly flat around the best values of N and SP, the exact values of these parameters are not critical for good performance. In fact, we would expect the best values to depend upon style, genre, and other factors.

From the results, we can observe that increasing the window size helps performance. The highest accuracy goes from 58.00% to 89.15% when the window size grows from 1 to 5.


Figure 4.4: Accuracy on the test dataset vs. Switch Penalty with Window Size = 5. The highest accuracy is 89.15% using any Switch Penalty ∈ {30, 32, ..., 38}.

However, accuracy does not continue to increase for even larger windows. We believe a size of 5 measures is large enough to obtain some meaningful statistical features yet small enough to register when the melody has switched channels. In the next section, we analyze some specific examples of success and failure. We also see that the switch penalty matters. When we set the penalty to zero, we get the locally best choice of melody channel at each measure, independent of other measures, but it seems clear that this “locally best, independent” policy is not particularly good, and this is why we introduced the Viterbi step in our algorithm. On the other hand, when the penalty is very large, we force all measures to be labeled with the same channel. This is not a good policy either, with at best 71.24% accuracy. The results show that our Viterbi step is effective in using context to improve melody identification.

4.5 Analysis

To better understand our approach, we analyzed some songs in our test dataset.

4.5.1 A Successful Sample

In most cases, this algorithm works well. For example, Figure 4.5 shows a clip from the popular song “Hotel California.” In the figure, the melody is labeled by our algorithm in red (at the top) and other notes are shown in yellow. Notice that this melody is not particularly “melodic” in that it only uses two pitches and there is a lot of repetition. This is one illustration of the need for multiple features and statistical methods. The features for the melody channel have a higher likelihood according to our learned probabilistic model, and the melody is correctly identified.


Figure 4.5: The melody detected in the song “Hotel California.”

4.5.2 Failed Samples

A failure case is shown in Figure 4.6. Here, the detected melody is shown in red at the top of the figure. The same (red) channel actually contained the melody in the immediately preceding measures, but at this point the melody switched to another channel, shown below in yellow. (In the figure, the lower melody is visually separated for clarity, but both channels actually occupy the same pitch range.) Evidently, the algorithm continued to label the top (red) channel as melody to avoid the switch penalty that would be required to label the melody correctly. In fact, the true (yellow) melody appeared earlier in the top (red) channel, so perhaps a higher-level analysis of music structure would also be useful for melody identification and disambiguation.

Figure 4.6: The algorithm identified the top (red) channel as the melody of “Being,” but the correct melody is shown in yellow at the bottom.

Another song in our dataset is “Ali Mountain.” In this piece, at measure 12, the melody is split across two different channels, represented in red (darker) and yellow (lighter) in Figure 4.7. Taken together, the combined channels would be judged to be very melodic. However, when we consider the channels separately, it is hard to hear whether either is part of a melody, and our algorithm does not rate either channel highly. Since we assume that the melody will be played by one and only one channel within a measure, the melody is not identified in this test case.


Figure 4.7: Two channels together make up the melody

4.6 Conclusion

In this chapter, we contributed a novel algorithm to detect the melody channel for each measure in a MIDI file. We utilize a Bayesian probability model to estimate the probability that the melody is on a particular channel in each measure. We then use dynamic programming to find the most likely channel for the melody in each measure, considering that switching channels from measure to measure is unlikely. We obtained an overall accuracy of 89% on our test dataset, which seems to compare favorably to most other results in the literature. The lack of a large shared dataset prevents a detailed comparison.

Our dataset, including Standard MIDI Files, melody labels, associated software, and documentation, is available at the following website: http://www.cs.cmu.edu/~music/data/melody-identification.


Chapter 5

Bass Analysis

5.1 Introduction

Besides the melody, the bass part is also very important for music generation, listening, and music analysis. The movement of the bass influences the inversion of the harmony. Sometimes, the rhythm of the bass is the spotlight of a piece of music. Thus, the identification of the bass plays a key role in music analysis. Also, harmony and rhythm analysis refer to information in the bass: a change in the bass often indicates a change of harmony. Many efforts in automatic composition could benefit from training data consisting of isolated bass lines.

Compared to melody analysis, bass analysis is not as difficult. The reason is that the melody can appear in the top, middle, or even the bottom pitch range, whereas the bass part seldom appears in a middle or high range. Thus, this problem can be solved by a relatively simple algorithm. This thesis proposes a simple algorithm to extract the bass at the measure level.

5.2 Algorithm

The algorithm for bass analysis is relatively simple compared to melody analysis. Roughly, it consists of two parts: 1) pitch mean calculation and 2) argmin selection. We can easily calculate the pitch mean for each measure. The bass channel for each measure is the channel that has the lowest pitch mean, as described by Algorithm 3.

5.3 Experiment and Result

Since the algorithm for bass analysis is deterministic, meaning there is no training process, we only label a test dataset for this part. As with melody analysis, the MIDI files are read and transformed into the CSV text file format. Then the system runs the simple algorithm we just discussed. The criterion that we use to evaluate the algorithm is the same as for melody. The overall accuracy is 97.36%.


Result: For each measure, determine the channel containing the bass.
N is the number of measures, indexed from 0 to N−1;
C is the number of channels, indexed from 0 to C−1;
M[m,c] tells the pitch mean of channel c in measure m; if there is no note, assign +∞;
A[m] tells the bass channel in measure m;
for m in [0 . . . N) do
    c′ ← argmin_c M[m,c];
    if M[m,c′] == +∞ then
        A[m] ← null;
    end
    else
        A[m] ← c′;
    end
end

Algorithm 3: Simple Bass Analysis Algorithm
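A Python sketch of Algorithm 3 (an assumed illustration; the 2-D list pitch_mean is an assumed input in which math.inf marks a measure/channel pair with no notes):

import math

def bass_channels(pitch_mean):
    # pitch_mean[m][c] is the pitch mean of channel c in measure m, or math.inf.
    labels = []
    for row in pitch_mean:                                 # one row per measure
        c = min(range(len(row)), key=lambda i: row[i])     # lowest pitch mean
        labels.append(None if math.isinf(row[c]) else c)   # None = no bass found
    return labels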

12 songs (among the 22) get 100% accuracy. 9 other songs have accuracy in the range [90%, 100%). Only 1 of the 22 songs has lower accuracy: 80%. The reason is that in our simple algorithm, there is no restriction or penalty for channel switching, so when the pitch mean of the accompaniment is lower than that of the bass channel, the result switches immediately. One way to alleviate this is to use an algorithm that penalizes bass channel switching.

5.4 Conclusion

Compared to melody analysis, bass analysis is much easier. We use a bass version of the Skyline algorithm to solve this problem. It is easy to implement and its time complexity is linear in the input size.


Chapter 6

Chord Analysis

6.1 Introduction

Compared to melody and bass, which provide horizontal continuity through time, chords or harmony represent vertical connections across voices in music. Sequences of specific chords are used in analyzing higher levels of musical function. For example, the progression from the Dominant Chord to the Tonic Chord can be considered evidence of an Authentic Cadence. The different modes of chords also provide different harmonic functions. In the same example, according to the inversion and the highest note, the cadence can be a Perfect Authentic Cadence (the chords are in root position, and the tonic is in the highest voice of the final chord), a root-position Imperfect Authentic Cadence (similar to a Perfect Authentic Cadence, but the highest voice is not the tonic), or an inverted Imperfect Authentic Cadence (similar to a Perfect Authentic Cadence, but one or both chords is inverted). In undergraduate music education, Harmony is also one of the most important courses. Harmonic analysis involves human judgement and subjective decision making because music is full of non-chord tones that only suggest chords and harmonic function, and chord labels are ambiguous. However, chord analysis is very important because it provides an explanation of the music. One promising application is that harmonic analysis can be used as input to a music generation system. For example, an algorithm proposed by Brunner [3] generates music with an advanced machine learning model based on generated chords.

The analysis of chords includes several aspects: 1) indicating the starting and ending points of each chord; 2) providing the root of the chord; 3) providing the mode of the chord.

In our representation, the starting and ending points of each chord are indicated by time and duration, which are given in beats. The root is given directly. The mode is indicated roughly by the chord type, with additional details encoded as the pitch class set.

6.2 Dataset

One dataset used in chord analysis is from David Temperley: a corpus of 45 excerpts of tonal music. The excerpts are from the Kostka and Payne music theory textbook (Kostka and Payne 1984), hereafter referred to as the KP corpus. In this dataset, chords are labeled by the pair of timestamp and chord root, without type information. In this thesis, we contribute a chord analysis dataset with manual labels. There are 20 songs with 1890 chord labels in total. This dataset contains not only root information but also type information for each chord. We use this dataset to test our system.

Table 6.1: Chord Notations

Chord Symbol   Chord Name         Example Notes
m              Minor              C Eb G
M              Major              C E G
aug            Augmented          C E G#
dim            Diminished         C Eb Gb
sus4           Suspended Fourth   C F G

It is very time-consuming to label the dataset manually. There are over 800 pages of music notation in our current dataset, and each page takes several minutes to label. Compared to other fields such as Computer Vision, labels for music are much harder to obtain.

6.3 Definition of Chords

Chord symbols, names, and examples can be found in Table 6.1. As we introduced in Chapter 3, only a subset of chord types (major, minor, augmented, diminished, and suspended 4) is represented in our output. The detailed pitch class information is also presented to compensate for the simplification of the chord type.

We decided to use simplified chord types because chord types are often ambiguous. For example, a C13 chord may or may not contain a fifth or ninth or eleventh. The 13th has the same pitch class as the 6th, so the chord might be labeled C6. In some Jazz manuscripts, a chord is labeled as a C6 or C13 when the melody has the sixth, and it is unclear and certainly optional whether the sixth should be played by piano or guitar, adding further confusion. We believe it is just as meaningful to label the chord type as C or C7 and then provide a set of pitches used (which indicates whether the fifth, ninth, eleventh, and thirteenth are included). The basic chord type indicates the chord function (major, dominant, etc.) and the pitch set gives details of how the chord function is realized.

6.4 Algorithm

Our main work was to integrate an algorithm from Temperley [20] into our overall analysis system. Temperley's algorithm uses several rules:

1. Pitch Variance Rule: Try to label nearby pitches so that they are close together on the line of fifths.

2. Compatibility Rule: In choosing roots for chord spans, prefer certain TPC (Tonal Pitch Class)-root relationships over others. Prefer them in the following order: 1, 5, 3, b3, b7, ornamental. (An ornamental relationship is any relationship besides these five.)

3. Strong-Beat Rule: Prefer chord spans that start on strong beats of the meter.


4. Harmonic Variance Rule: Prefer roots that are close to the roots of nearby segments on the line of fifths.

5. Ornamental Dissonance Rule: An event is an ornamental dissonance if it does not have a chord-tone relationship to the chosen root. Prefer ornamental dissonances that are closely followed by an event a step or half-step away in pitch height.

However, Temperley's algorithm produces no type information. For example, the output will be "A" to indicate the "A chord", but we do not know whether it is A minor, A Major, or even A diminished. Thus, in our work, we provide a simple algorithm to predict the type of chord based on the output of Temperley's algorithm and the MIDI file, described in the following steps:

1. In Temperley's output, the timestamp for each chord is given. The starting time (in milliseconds) and finishing time of each chord can be calculated from that.

2. Given the starting and finishing time, we can get the notes in the corresponding MIDI file. Given the notes, we can get the pitch class set (each element is an integer from 0 to 11).

3. Based on the chord root and the pitch class set, the chord type is determined. For example, if the third ((root + 4) % 12) and fifth ((root + 7) % 12) exist, and the seventh does not, the chord type is assigned as "Major". A sketch of this step is given below.
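A minimal Python sketch of this type-assignment step follows. The interval templates correspond to Table 6.1, but the function name and the order in which the templates are checked are our own illustration rather than the system's exact rules.

# Sketch of step 3: assign a chord type from a root and a pitch-class set.
# The templates follow Table 6.1; the checking order is an illustrative assumption.
CHORD_TEMPLATES = [
    ("aug",  {0, 4, 8}),   # augmented: root, major third, augmented fifth
    ("dim",  {0, 3, 6}),   # diminished: root, minor third, diminished fifth
    ("sus4", {0, 5, 7}),   # suspended fourth: root, fourth, fifth
    ("m",    {0, 3, 7}),   # minor: root, minor third, perfect fifth
    ("M",    {0, 4, 7}),   # major: root, major third, perfect fifth
]

def chord_type(root, pitch_class_set):
    """root: integer 0-11; pitch_class_set: set of integers 0-11 collected from
    the notes sounding between the chord's starting and finishing times."""
    relative = {(pc - root) % 12 for pc in pitch_class_set}
    for name, template in CHORD_TEMPLATES:
        if template <= relative:          # all template tones are present
            return name
    return "M"                            # fall back to major if nothing matches

For example, root 9 (A) with pitch classes {9, 0, 4} gives the relative set {0, 3, 7}, which matches the minor template and returns "m".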

6.4.1 Optimization: Subtracting Melody

For musical expression, the melody often contains dissonant notes. Because we have a good melody classifier from Chapter 4, we propose an optimization: subtract the notes in the melody channel before applying the chord analysis algorithm. By eliminating the dissonant non-chord tones of the melody, we hope to improve the ability to find the correct chord label.
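As a concrete illustration, if the MIDI file is read with the pretty_midi library and the melody track index comes from the classifier of Chapter 4, the preprocessing might look like the sketch below; the function name and the assumption that the melody occupies a single track are ours.

import pretty_midi

def remove_melody_track(in_path, out_path, melody_track_index):
    """Write a copy of the MIDI file with the identified melody track removed,
    so that chord analysis only sees the accompaniment."""
    midi = pretty_midi.PrettyMIDI(in_path)
    midi.instruments = [inst for i, inst in enumerate(midi.instruments)
                        if i != melody_track_index]
    midi.write(out_path)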

6.5 Evaluation

6.5.1 Evaluation Criterion

In our work, we evaluate the correctness of the output by four methods:

1. For each 24th beat, we assign 1 point when the corresponding predicted and labeled chords (both root and type) match.

2. For each 24th beat, we assign 1 point when the corresponding predicted and labeled chord roots match.

3. For each chord, we give 1 point in total, so if the chord spans N 24th beats, each 24th beat gets 1/N points if the predicted chord matches the label (both root and type).

4. The fourth method is like the third, except we only give credit when the chord roots match.

Because the number of chords differs from song to song, instead of reporting raw point values we report a percentage, calculated as the assigned points divided by the total number of potential points. A sketch of these computations is given below.
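The four criteria can be computed directly once both the prediction and the reference have been sampled on a grid of 24th beats. The sketch below assumes each grid entry is a (root, type) pair, or None where no chord is labeled, and that the labeled chords are also given as index ranges on the same grid; these data structures are our own illustration, not the system's internal representation.

def evaluate_chords(pred_grid, ref_grid, ref_chord_spans):
    """pred_grid, ref_grid: lists of (root, type) sampled at every 24th beat
    (None where nothing is labeled). ref_chord_spans: one (start, end) index
    range per labeled chord. Returns (Eval1, Eval2, Eval3, Eval4)."""
    labeled = [(p, r) for p, r in zip(pred_grid, ref_grid) if r is not None]
    n = len(labeled)
    eval1 = sum(1 for p, r in labeled if p == r) / n
    eval2 = sum(1 for p, r in labeled if p is not None and p[0] == r[0]) / n

    # Eval3 / Eval4: each labeled chord is worth one point in total,
    # split evenly over the 24th beats it covers.
    eval3 = eval4 = 0.0
    for start, end in ref_chord_spans:
        span = max(end - start, 1)
        for i in range(start, end):
            p, r = pred_grid[i], ref_grid[i]
            if p == r:
                eval3 += 1.0 / span
            if p is not None and r is not None and p[0] == r[0]:
                eval4 += 1.0 / span
    eval3 /= len(ref_chord_spans)
    eval4 /= len(ref_chord_spans)
    return eval1, eval2, eval3, eval4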


6.5.2 Result and Analysis

Table 6.2 shows the results of Temperley's algorithm. The accuracy varies a lot between pieces. The songs SanFrancisco and lovelyrose have high accuracy, over 90% for each evaluation, while the song lovestory has an accuracy lower than 60% for each evaluation.

The gap between Eval1 and Eval2 (or Eval3 and Eval4) shows that some errors are introduced by the type prediction rather than the root prediction. In some songs, for example california, lovelyrose, and SanFrancisco, the gap is very small. However, for others such as autumn, the gap is over 10%.

Table 6.2: Chord Test Result without Optimization

Music name                          Eval1    Eval2    Eval3    Eval4
0adaf64db7fad3f972b4c4b4a7fb77bf    68.21%   74.67%   69.27%   76.01%
alimountain                         85.20%   91.37%   80.36%   87.36%
areyouhappy                         70.08%   75.71%   70.86%   77.12%
autumn                              59.59%   74.64%   59.29%   72.13%
Being                               79.97%   87.43%   76.90%   84.72%
california                          90.87%   90.87%   91.56%   91.56%
dontwantanything                    78.52%   80.59%   75.68%   78.40%
Heaven                              85.87%   89.00%   83.35%   87.25%
lovelyrose                          91.50%   91.50%   92.99%   92.99%
lovestory                           49.53%   53.83%   50.35%   55.72%
mayflower                           78.41%   84.75%   78.36%   86.93%
moonwish                            86.00%   86.00%   85.66%   85.66%
onmyown                             82.56%   83.12%   80.85%   81.09%
riverside                           74.73%   74.73%   77.51%   77.51%
SanFrancisco                        91.96%   91.96%   91.47%   91.47%
SayYouSayMe                         70.12%   76.48%   69.85%   75.31%
skyoftaipei                         71.33%   74.95%   66.75%   74.15%
smalltown                           82.80%   89.36%   84.84%   90.31%
tenderness                          84.74%   86.89%   83.43%   85.69%
Venus                               87.35%   90.31%   90.17%   93.82%
Total mean                          78.47%   82.41%   77.98%   82.26%
Total std                           10.78%    9.16%   10.86%    8.98%

Table 6.3 shows the results with the optimization. For each evaluation, the total mean accuracy shows an increase of 1.3%.

6.6 Conclusion

In this chapter, we described a pop music dataset with thousands of chord labels and we tested the algorithm for chord analysis using this dataset.


Table 6.3: Chord Test Result with Optimization

Music name                          Eval1    Eval2    Eval3    Eval4
0adaf64db7fad3f972b4c4b4a7fb77bf    69.54%   75.64%   69.93%   75.90%
alimountain                         79.06%   84.74%   71.78%   77.82%
areyouhappy                         68.28%   75.86%   69.81%   77.10%
autumn                              64.61%   78.93%   65.90%   78.54%
Being                               79.97%   87.32%   77.43%   85.23%
california                          92.13%   92.13%   92.84%   92.84%
dontwantanything                    81.30%   83.36%   79.08%   81.80%
Heaven                              87.86%   91.18%   85.87%   90.38%
lovelyrose                          91.50%   91.50%   93.11%   93.11%
lovestory                           53.13%   57.42%   54.10%   59.54%
mayflower                           77.42%   84.78%   77.55%   86.73%
moonwish                            95.28%   95.28%   93.92%   93.92%
onmyown                             83.04%   83.60%   81.83%   82.06%
riverside                           78.43%   78.43%   80.55%   80.55%
SanFrancisco                        93.86%   93.86%   93.82%   93.82%
SayYouSayMe                         73.64%   80.57%   72.66%   78.72%
skyoftaipei                         68.80%   71.75%   64.15%   69.64%
smalltown                           86.33%   93.58%   89.35%   94.95%
tenderness                          85.02%   86.24%   83.69%   84.97%
Venus                               87.35%   90.31%   90.17%   93.82%
Total mean                          79.83%   83.82%   79.38%   83.57%
Total std                           10.67%    8.97%   10.99%    9.09%

We also showed that Temperley's algorithm can be at least slightly improved by using our automated melody identification to remove the melody from the music before performing chord analysis.
The system and experiments in this chapter contribute a number of things to the field of chord analysis by computer, and the work is easy to extend. The performance of the system is measured using clear evaluation methods, providing an empirical baseline against which future work may be measured. We also believe the automated chord labeling is good enough to serve as the first step in bootstrap learning, which would allow us to label a much larger database of music using more advanced statistical machine learning methods.


Chapter 7

Structure Analysis

7.1 Introduction

It is obvious that music has structure. When we talk about a piece of music, it is natural to have ideas such as "This piece's structure is AABA". For humans it is not very hard to detect structure by listening or by analyzing music notation, whether common practice notation or a graphical representation. For the task of music generation, the high-level structure is usually produced by a random number generator or a manually assigned table. In this thesis, we present a method to analyze structure information automatically from MIDI files and output the "AABA" information in the form introduced in Chapter 3.

7.2 Algorithm

The structure analysis algorithm can be divided into three parts: 1) calculation of a similarity matrix; 2) repetition section detection; 3) representation transformation.

7.2.1 Similarity Matrix

Similar to the work in audio music structure analysis by Klapuri et al. [10], we define a Frame to be all of the notes in a certain time interval. In our work, the time interval is set to 200 milliseconds, which means all of the notes that overlap this time interval are considered part of the frame.

Similarity Matrix
The similarity matrix entry S_{i,j} is the similarity value of Frame_i and Frame_j. Different similarity functions are possible. In this thesis, we test two categories of formulas for calculating similarity values. The input of the calculation is two frames and the output is a real number representing how similar they are.

Maximum Sharing Rate
In the Maximum Sharing Rate method, C_{i,j} is the number of common pitches, and T_{i,j} is the minimum of the number of pitches in Frame_i and Frame_j, that is, T_{i,j} = min(|Frame_i|, |Frame_j|), where |Frame| is the number of pitches in the frame.


Figure 7.1 shows the similarity matrix of Alimountain using the Maximum Sharing Rate metric.

Figure 7.1: Maximum Sharing Rate Similarity Matrix for Alimountain

Chroma Vector
Another metric that we test in this thesis is the Chroma Vector. The chroma vector of one frame represents the pitch classes that are present. Suppose that in one frame there are C4 (pitch value 60), G4 (pitch value 67), and C5 (pitch value 72). The vector has 12 slots representing the pitch classes 0 to 11. The pitch classes of this frame are 0, 7, and 0, so the chroma vector for this frame is [2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]/√5. The counts are divided by √5 to normalize the length to 1. The similarity value is the dot product of two chroma vectors: the higher the dot product of two vectors, the more similar they are considered. Figure 7.2 shows the similarity matrix of Alimountain using the Chroma Vector metric.
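Both frame similarities are straightforward to compute. In the sketch below, a frame is the collection of MIDI pitch numbers sounding in its 200 ms window; reading the Maximum Sharing Rate as the number of shared pitches divided by the size of the smaller frame is our interpretation of the definitions above.

import numpy as np

def max_sharing_rate(frame_a, frame_b):
    """Number of common pitches divided by the size of the smaller frame
    (our reading of C_{i,j} / T_{i,j})."""
    a, b = set(frame_a), set(frame_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def chroma_vector(frame):
    """Count pitch classes and normalize to unit length, e.g.
    {C4, G4, C5} -> [2,0,0,0,0,0,0,1,0,0,0,0] / sqrt(5)."""
    v = np.zeros(12)
    for pitch in frame:
        v[pitch % 12] += 1
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def chroma_similarity(frame_a, frame_b):
    # Dot product of the two unit-length chroma vectors.
    return float(np.dot(chroma_vector(frame_a), chroma_vector(frame_b)))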

7.2.2 Repetition Detection

When the similarity matrix is ready, we have a texture of the music. The next step is to detect the repeated fragments from the similarity matrix. For a repeated part, there is a long white line in the image, so the repetition detection task is reduced to finding the off-diagonal "long white lines" in the picture of the similarity matrix.

In our algorithm, we rotate the original similarity matrix picture by 45 degrees, turning the square matrix into a diamond matrix. The reason is that in the diamond matrix, the pixels on the same white line have nearly equal y-coordinates, which is more convenient for computer vision algorithms. Here, following Occam's Razor, we apply a very simple algorithm to solve this problem.
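The thesis does not spell out the exact image transform; one simple index mapping with the desired property is sketched below, assuming only the lower left triangle is kept. Every pair (i, j) with the same offset i - j lands on a single row of the rotated array, so off-diagonal lines become (nearly) horizontal.

import numpy as np

def to_diamond(S):
    """Map the lower left triangle of an N x N similarity matrix so that each
    diagonal offset k = i - j becomes one row; this is one possible realization
    of the 45-degree rotation, not necessarily the one used in the thesis."""
    n = S.shape[0]
    diamond = np.zeros((n, 2 * n - 1))
    for i in range(n):
        for j in range(i + 1):
            diamond[i - j, i + j] = S[i, j]
    return diamond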

Before introducing our algorithm, we first present our reasoning and intuition. We can observe from the picture that some points are very white (the frames there are very similar) and some points are grey (the frames there are not very similar). In music, there are many small variations even when sections share the same structure.


Figure 7.2: Chroma Vector Similarity Matrix for Alimountain

Such white-grey patterns are very common and expected. Based on this observation and our intuition, we design an energy-based algorithm and a clustering method to detect the lines.

Energy Based Algorithm
Here we define the energy to be a real number from 0 to 255, and for convenience we project the similarity values to the range [0, 256). A repeated structure in music usually has the same beginning and different variations afterward. With this in mind, we design a starting threshold: a potential line starts when the similarity value exceeds a certain value, at which point we refill the energy to the maximum value. In our algorithm, we manually set this threshold to 200. Then we scan the line from left to right, starting from the starting point. When we scan a grey block (0 < value ≤ 254), the energy decays linearly by a factor based on how white the block is. When we scan a completely dark block (value = 0), the energy decays exponentially by a decay factor. Algorithm 4 presents this in detail.

Line Clustering
The Energy Based Algorithm gives us a number of short lines. In the Line Clustering step, we merge the short lines into a long line that represents the repeated part. In detail, we first build a graph: each node represents one line, and an edge connects two nodes that are near enough and have some overlap. Here, near means that the absolute difference of the two y-coordinates is less than a threshold, which is 15 in our work, and overlap means that the two lines share a common interval on the x-axis. A sketch of this step follows.
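A minimal sketch of the clustering step, assuming each short line is given as (y, x_start, x_end) in the rotated picture; merging a connected group into its bounding interval is our own simplification of "forming the short lines into a long line".

def cluster_lines(lines, y_threshold=15):
    """lines: list of (y, x_start, x_end) found by the energy-based scan.
    Lines are connected when their y-coordinates differ by less than
    y_threshold and their x-intervals overlap; each connected group is
    merged into one long line."""
    parent = list(range(len(lines)))

    def find(a):                      # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, (yi, si, ei) in enumerate(lines):
        for j, (yj, sj, ej) in enumerate(lines[:i]):
            if abs(yi - yj) < y_threshold and min(ei, ej) > max(si, sj):
                parent[find(i)] = find(j)

    groups = {}
    for idx in range(len(lines)):
        groups.setdefault(find(idx), []).append(lines[idx])
    return [(min(y for y, _, _ in g),
             min(s for _, s, _ in g),
             max(e for _, _, e in g)) for g in groups.values()]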

7.2.3 Representation Transformation

As described in Chapter 3, the output of our system for structure analysis is a list where each element has the form "measure", "duration", and "section". In this subsection we introduce an algorithm to transform the similarity matrix "picture" into the structure representation.

1. Since the picture is symmetric around the diagonal from upper left to lower right, we will only consider the lower left triangle of the picture.


Result: For each block, determine the energy value.
N is the number of rows in the diamond picture;
C is the number of columns in the diamond picture;
Index_{i,j} is the left-most x-coordinate of the line passing through (i, j);
Count_{i,j} is the number of points in the line passing through (i, j);
DARK = 0;
START_THRESHOLD = 200;
line_index = 0;
for i in [0 ... N) do
    last_index = 0;
    last_energy = 0;
    cnt = 0;
    for j in [0 ... C) do
        data = diamond[i, j];
        if data ≥ START_THRESHOLD then
            if last_index == 0 then
                line_index += 1;
                cnt += 1;
                last_index = line_index;
                last_energy = MAX_ENERGY;
            end
        end
        if data ≤ DARK then
            last_energy = exp_decay(last_energy, data);
        else
            last_energy = linear_decay(last_energy, data);
        end
        if last_energy < FINISH_THRESHOLD then
            last_index = 0;
            last_energy = 0;
            cnt = 0;
        else
            cnt += 1;
        end
        Index_{i,j} = last_index;
    end
end

Algorithm 4: Energy Based Algorithm


2. A red line in the picture with coordinates (x1, x2, y1, y2) is transformed into two segments, (x1, y1) and (x2, y2), which means that the music from time x1 to y1 is similar to the music from time x2 to y2.

3. We build an abstract graph based on the segments. For each segment, we create a node. For each pair (x1, y1) and (x2, y2) generated from the picture, we connect the corresponding nodes by an edge. In addition, for each segment pair (a, b) and (c, d), if their overlap is over a threshold (60% in our system), we connect those two nodes by an edge.

4. Then we assign each connected group a color (or a unique group number) to indicate the section number.

5. Sort the line segments in increasing order of their first coordinate (x1 if the segment is (x1, y1)). In the final output, we quantize the start time and duration to units of measures and truncate the tail of a section if it overlaps with the next one. A compact sketch of steps 3-5 follows this list.
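Steps 3-5 amount to grouping segments into connected components and then labeling them. The Python sketch below uses our own assumptions for the segment and pair representations; the quantization to whole measures and the truncation of overlapping tails are left out for brevity.

def overlap_ratio(a, b):
    """Overlap between two segments (start, end), relative to the shorter one."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    return max(inter, 0) / max(min(a[1] - a[0], b[1] - b[0]), 1e-9)

def build_sections(pairs, threshold=0.6):
    """pairs: list of ((x1, y1), (x2, y2)) similar-segment pairs from step 2.
    Segments of a pair are linked, as are any two segments whose overlap ratio
    exceeds the threshold; each connected group becomes one section label.
    Returns (start, duration, section) triples sorted by start."""
    segments = sorted({seg for pair in pairs for seg in pair})
    index = {seg: i for i, seg in enumerate(segments)}
    parent = list(range(len(segments)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in pairs:                                   # step 3: pair edges
        parent[find(index[a])] = find(index[b])
    for i, a in enumerate(segments):                     # step 3: overlap edges
        for b in segments[:i]:
            if overlap_ratio(a, b) > threshold:
                parent[find(i)] = find(index[b])

    labels, output = {}, []
    for seg in segments:                                 # steps 4 and 5
        root = find(index[seg])
        labels.setdefault(root, len(labels))
        output.append((seg[0], seg[1] - seg[0], labels[root]))
    return output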

7.3 Evaluation

Figure 7.3 shows the result after the Energy Based Algorithm; the lines that meet our requirements are colored. In our work, we want to detect the most general structure information. It is easy to observe that in one piece of music there are many small repetitions. To filter out the small repetitions, we set a threshold on the length of our lines: we require that a repetition line be at least 1/10 of the total length of the original music. The result after the filtering and the clustering algorithm can be found in Figure 7.4. We can see that the general structure for Alimountain is ABBABB, or AA in a higher-level view.

Figure 7.3: After the Energy Based Algorithm for Alimountain

The final output for Alimountain is shown in Table 7.1. The output shows that the structure for the song is BACA, where the A's are measures #3 to #47 and measures #55 to #99. When we check the music, we find that it predicts the main part correctly. However, there is no B or C section in the music. We notice that there is a tempo change around measure #5, from 76 beats per minute to 101 beats per minute.


Figure 7.4: After the Filter and Cluster Algorithm for Alimountain

This explains why our algorithm detects a separate section before A.

Table 7.1: Final Output of Alimountain

m    d    s
0    2    2
2    45   1
47   7    3
54   45   1

7.3.1 Evaluation Method for Structure Analysis

Given a reference and an analysis result, we want to evaluate the degree to which the result matches the reference. In this section we describe such an evaluation method.

Figure 7.5 shows the reference and the analysis result for Alimountain. In the reference, the structure category A has two segments: one is measures #1 to #52; the other is measures #53 to #101. In the analysis result, there are four different categories. Category A has two segments: one is measures #3 to #47; the other is measures #55 to #99. Category B has one segment: measures #1 to #2. Categories C and D each have one segment: measures #48 to #54 for C and measures #99 to #101 for D.

In structure analysis, the same structure can have different names. For example, the structure ABAB is the same as the structure BABA. So finding a quantitative evaluation can be reduced to a matching in a bipartite graph, as follows:

1. We form the categories in the reference as the left part of the bipartite graph, and the categories in the analysis result as the right part of the bipartite graph.


Figure 7.5: Reference and Analysis Result of Alimountain

2. We connect each node in the left part to each node in the right part with a weighted edge. The weight of the edge between X and Y is the overlap score of categories X and Y.

3. Overlap scores are calculated as follows. Category X has segments x_1, x_2, ..., x_n, and category Y has segments y_1, y_2, ..., y_m. Let Xscore_i be the maximum overlap between x_i and any one of the segments y_j, and let Xsum be the sum of the Xscore_i. Similarly, let Yscore_i be the maximum overlap between y_i and any one of the segments x_j, and let Ysum be the sum of the Yscore_i. The overlap score is the minimum of Xsum and Ysum.

4. Now we have a weighted bipartite graph. We use the Kuhn-Munkres algorithm to calculate the matching with the maximum weight. The score is the sum of the weights in the matching. A sketch of this step using an off-the-shelf assignment solver is given below.
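The sketch below uses scipy's linear_sum_assignment as the Kuhn-Munkres solver (it minimizes cost, so the weights are negated). Segment boundaries are assumed to be given in measures, and the conversion of the raw matching weight into the percentages reported later in Table 7.2 is omitted.

import numpy as np
from scipy.optimize import linear_sum_assignment

def segment_overlap(a, b):
    """Overlap, in measures, between two segments (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def category_overlap_score(X, Y):
    """Step 3: for each segment, take its best overlap with the other side,
    sum per side, and return the minimum of the two sums."""
    x_sum = sum(max(segment_overlap(x, y) for y in Y) for x in X)
    y_sum = sum(max(segment_overlap(y, x) for x in X) for y in Y)
    return min(x_sum, y_sum)

def structure_score(reference, analysis):
    """reference, analysis: dicts mapping a category name to its segment list.
    Returns the weight of the maximum-weight bipartite matching."""
    ref_cats, ana_cats = list(reference), list(analysis)
    W = np.array([[category_overlap_score(reference[r], analysis[a])
                   for a in ana_cats] for r in ref_cats])
    rows, cols = linear_sum_assignment(-W)   # negate because the solver minimizes
    return W[rows, cols].sum()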

The evaluation score for Alimountain is shown in the lower left of Figure 7.5. In this chapter we analyze 10 songs in total, and the evaluation scores are shown in Table 7.2.

We also show the results for Yesterday (low evaluation score) and Moonwish (high evaluation score) in Figure 7.6 and Figure 7.7. For Yesterday, the reference structure is AABB. Comparing this to the analysis result, we can see that the repetition of A is detected but has some error at the beginning, and the repetition of B is not detected. As expected, this analysis received a low score. For Moonwish, it is easy to tell from the picture that the analysis result and the reference are nearly the same. Both are clearly AABA, and the result receives a high score.

7.4 Conclusion

In this thesis we demonstrated that, by utilizing the similarity matrix, a very simple algorithm has the potential to extract high-level structure information from music given as a MIDI file. Structure analysis has many applications, such as finding the chorus section, producing audio "thumbnails" by eliminating repetition, and making music search faster by reducing long files to main themes.


Table 7.2: Evaluation scores for 10 songs

Song Name       Evaluation Score
Alimountain     89.11%
Being           90.99%
california      50.00%
Heaven          78.79%
Lovelyrose      70.00%
Mayflower       60.82%
Moonwish        97.22%
Onmyown         66.67%
SanFrancisco    81.58%
Yesterday       65.48%

Figure 7.6: Reference and Analysis Result of Yesterday

Current music generation schemes based on sequence learning are not able to learn and generate typical music structures such as a 32-bar AABA form. Automatic structure analysis might be used in future music generation systems to create more typical song forms.


Figure 7.7: Reference and Analysis Result of Moonwish


Chapter 8

Future Work

8.1 Future Work on Melody

We believe further improvements could be made by studying failures. With bootstrapping techniques, it might be possible to obtain much more training data and learn note-by-note melody identification, which would solve the problem of melodic phrases split across two or more channels. Our current dataset is relatively small, so collecting a larger dataset could be beneficial for tuning this algorithm and developing others. With larger datasets, deep learning and other techniques might be enabled. Perhaps bootstrapping (or semi-supervised learning) techniques could be used starting with the present algorithm to label a larger dataset automatically. We also believe that music structure can play an important role in melody identification. Melodies are likely to be longer sequences that are repeated and/or transposed, and these non-local properties might help to distinguish "true" melodies as perceived by human listeners, even when the melodies are not particularly "melodic" in terms of local features.

8.2 Future Work on Chord

In this thesis we adopted Temperley's rule-based algorithm for chord analysis. If we had more labels, it would be promising to invent new data-driven algorithms for this task. As with melody, we believe it would be useful to use this algorithm as a bootstrap algorithm to label more data. Another potential direction is to render the music from MIDI and then utilize an audio model as a source of ensemble learning to raise performance.

8.3 Future Work on Structure

Currently, the structure analysis is focused on a very high level of abstraction. There are potential details that are meaningful at the phrase or motif level, which would enrich the music representation. These lower-level and hierarchical structure details might be useful as input to a music generation system. The similarity matrix algorithm has a time complexity of O(N²), where N is the number of 24th beats in the entire piece.


When the MIDI music is very long, such as more than 20 minutes, this algorithm could be very slow. Thus, an algorithm with better time complexity might be needed. The use of N-grams might be useful, and there is a substantial literature on biological sequence matching that might apply to MIDI data.


Bibliography

[1] Joshua Bloch and Roger B. Dannenberg. Real-time accompaniment of polyphonic keyboard performance. In Proceedings of the 1985 International Computer Music Conference, pages 279–290, 1985.

[2] Michael J. Bruderer, Martin F. McKinney, and Armin Kohlrausch. Structural boundary perception in popular music. In ISMIR, pages 198–201, 2006.

[3] Gino Brunner, Yuyi Wang, Roger Wattenhofer, and Jonas Wiesendanger. JamBot: Music theory aware chord based generation of polyphonic music with LSTMs. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), pages 519–526. IEEE, 2017.

[4] John Ashley Burgoyne, Jonathan Wild, and Ichiro Fujinaga. An expert ground truth set for audio chord recognition and music analysis. In ISMIR, volume 11, pages 633–638, 2011.

[5] Wei Chai and Barry Vercoe. Melody retrieval on the web. In Multimedia Computing and Networking 2002, volume 4673, pages 226–242. International Society for Optics and Photonics, 2001.

[6] Roger B. Dannenberg and Masataka Goto. Music structure analysis from acoustic signals. In Handbook of Signal Processing in Acoustics, pages 305–331. Springer, 2008.

[7] Masataka Goto. A chorus section detection method for musical audio signals and its application to a music listening station. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1783–1794, 2006.

[8] Masataka Goto and Roger B. Dannenberg. Music interfaces based on automatic music signal analysis: New ways to create and listen to music. IEEE Signal Processing Magazine, 36(1):74–81, 2019.

[9] Cihan Isikhan and Giyasettin Ozcan. A survey of melody extraction techniques for music information retrieval. In Proceedings of the 4th Conference on Interdisciplinary Musicology (SIM08), Thessaloniki, Greece, 2008.

[10] Anssi Klapuri, Jouni Paulus, and Meinard Müller. Audio-based music structure analysis. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2010.

[11] Fred Lerdahl and Ray S. Jackendoff. A Generative Theory of Tonal Music. MIT Press, 1985.

[12] Jiangtao Li, Xiaohong Yang, and Qingcai Chen. MIDI melody extraction based on improved neural network. In Machine Learning and Cybernetics, 2009 International Conference on, volume 2, pages 1133–1138. IEEE, 2009.

[13] Liu Li, Cai Junwei, Wang Lei, and Ma Yan. Melody extraction from polyphonic MIDI files based on melody similarity. In Information Science and Engineering, 2008. ISISE'08. International Symposium on, volume 2, pages 232–235. IEEE, 2008.

[14] Hyungui Lim, Seungyeon Rhyu, and Kyogu Lee. Chord generation from symbolic melody using BLSTM networks. arXiv preprint arXiv:1712.01011, 2017.

[15] Bryan Pardo and William P. Birmingham. Algorithms for chordal analysis. Computer Music Journal, 26(2):27–49, 2002.

[16] Colin Raffel. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. Columbia University, 2016.

[17] Matti Ryynanen and Anssi Klapuri. Automatic bass line transcription from streaming polyphonic audio. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'07), volume 4, pages IV-1437. IEEE, 2007.

[18] Man-Kwan Shan and Fang-Fei Kuo. Music style mining and classification by melody. IEICE Transactions on Information and Systems, 86(3):655–659, 2003.

[19] Umut Simsekli. Automatic music genre classification using bass lines. In 2010 20th International Conference on Pattern Recognition, pages 4137–4140. IEEE, 2010.

[20] David Temperley. An algorithm for harmonic analysis. Music Perception: An Interdisciplinary Journal, 15(1):31–68, 1997.

[21] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24–28, 2011, Miami, Florida, pages 591–596. University of Miami, 2011.

[22] Alexandra Uitdenbogerd and Justin Zobel. Melodic matching techniques for large music databases. In Proceedings of the Seventh ACM International Conference on Multimedia (Part 1), pages 57–66. ACM, 1999.

[23] Sudha Velusamy, Balaji Thoshkahna, and K. R. Ramakrishnan. A novel melody line identification algorithm for polyphonic MIDI music. In International Conference on Multimedia Modeling, pages 248–257. Springer, 2007.

[24] Zhao Wei, Li Xiaoli, and Li Yang. Extraction and evaluation model for the basic characteristics of MIDI file music. In Control and Decision Conference (2014 CCDC), The 26th Chinese, pages 2083–2087. IEEE, 2014.

[25] Hanqing Zhao and Zengchang Qin. TuneRank model for main melody extraction from multi-part musical scores. In Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2014 Sixth International Conference on, volume 2, pages 176–180. IEEE, 2014.
