Jacek Wołkowicz - Dalhousie University

WARSAW UNIVERSITY OF TECHNOLOGY (POLITECHNIKA WARSZAWSKA)

FACULTY OF ELECTRONICS AND INFORMATION TECHNOLOGY

INSTITUTE OF COMPUTER SCIENCE

Academic year 2006/2007

MASTER'S THESIS

Jacek Wołkowicz

N-gram-based approach to automatic composer recognition


Supervisor: prof. nzw. dr hab. inż. Zbigniew Kulka

Specialization: Information Systems Engineering

Date of birth: 8 November 1983

Start of studies: 1 October 2002

Curriculum vitae

My name is Jacek Wołkowicz. I was born on 8 November 1983 in Warsaw. I began my education at Primary School No. 106 (Szkoła Podstawowa nr 106 im. Ryszarda Suskiego) in Warsaw. I then entered the XIV LO im. Stanisława Staszica high school in Warsaw, where for four years I studied in a mathematics and physics class with an extended computer science curriculum. During high school I took part in the International Young Physicists' Tournaments, where I received second prize (Helsinki 2001) and first prize (Odessa 2002). After achieving a very good result in the entrance examination to the Warsaw University of Technology (PW), I began studying computer science at the Faculty of Electronics and Information Technology. In the academic year 2005/2006 I took part in an international exchange between PW and Dalhousie University in Canada. Music has always accompanied my education. Besides regular schooling, I attended the Witold Lutosławski Primary Music School (Szkoła Muzyczna I st.) and the Józef Elsner Secondary Music School (Szkoła Muzyczna II st.) in Warsaw, in the piano class. During my studies I took part in extracurricular classes on acoustics.


Abstract

The methods of Natural Language Processing (NLP) can be successfully applied to symbolic musical content, since music can be treated not as an artificial language but as a natural one. Showing that some NLP methods work on music opens the way to applying well-known methods such as clustering, plagiarism detection and information retrieval to musical content. A method of converting complex musical structure into features corresponding to the words of a text is introduced, and a mutual correspondence between the two representations is shown. As far as composer recognition is concerned, given that authorship attribution for text has been solved successfully using statistical analysis of n-grams, one can expect this method to work for composer attribution as well. The aim of this work is to create such a tool. With suitable parameters, the method reaches an accuracy of almost 90% on the collected corpus.

Key words: Statistical Computing; Content Analysis and Indexing – Linguistic processing; Sound and Music Computing - Methodologies and Techniques; Natural Language Processing

Summary

Application of n-gram information analysis to automatic composer recognition

Natural Language Processing techniques can be applied effectively to musical data, provided that music is treated as a natural language. Showing that these techniques can be applied to music leads to solutions to the problems of clustering, plagiarism detection and information retrieval for musical data. As for composer recognition, given that effective solutions to the twin problem for text have already been proposed using the analysis of n-gram statistics, one may expect this method to work for composer recognition as well. A method of converting complex musical notation into features corresponding to text words was proposed. A one-to-one correspondence between the two forms of knowledge representation was shown. The proposed method achieved high effectiveness.

Keywords: music processing, natural language processing, artificial intelligence, music information retrieval.

EXTENDED SUMMARY:

Music processing is becoming an important issue nowadays. Ever larger amounts of musical information are being collected both in private collections and in libraries available to everyone through the WWW. For this reason, tools for efficient searching and automatic identification of musical pieces are increasingly in demand. This work observes that musical information shares many features with text and points out the possibility of applying known natural language processing and information retrieval methods to the analysis of music. It also outlines basic information from psychoacoustics and presents different approaches to storing digital musical information: direct digital recording of the acoustic signal, perceptual coding, the MIDI protocol, and computer-based methods of encoding musical notation.

The second chapter describes the MIDI standard, since MIDI files were used as the research material for the detailed implementation and for the evaluation of the tested algorithms and regularities. MIDI files are a digital record of a MIDI protocol session (a set of MIDI messages) and consist of chunks containing events, that is, MIDI messages with timing information, or additional control information. It was pointed out that, for the purpose of extracting musical content, the chunks of interest are MThd (representing the file) and MTrk (representing a track), and the events of interest are Note-On, Note-Off and Tempo. From these, all the information about the timing and pitch of the played notes can be extracted. Using knowledge of the MIDI file structure, a parser was implemented that extracts the note sequences for a whole piece, which are then taken as the basis for further processing. The concept of uni-grams as the smallest units of musical information in this representation was presented. The notion of n-grams as the basic feature corresponding to text words was also introduced, along with a shorthand notation for their simplest forms. This idea was compared with existing proposals, mainly from the field of MIR (music information retrieval).
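To make the chunk structure concrete, here is a minimal Python sketch of reading the MThd header and listing the chunks of a Standard MIDI File. It is an illustration only, not the Perl parser written for this thesis; the function names `parse_mthd` and `iter_chunks` and the hand-built sample bytes are assumptions made for this example.

```python
import struct

def parse_mthd(data: bytes) -> dict:
    """Parse the 14-byte MThd chunk at the start of a Standard MIDI File."""
    if len(data) < 14:
        raise ValueError("need at least 14 bytes for an MThd chunk")
    # Big-endian: 4-byte id, 4-byte length, then format, track count, division.
    chunk_id, length, fmt, ntracks, division = struct.unpack(">4sIHHH", data[:14])
    if chunk_id != b"MThd":
        raise ValueError("not a Standard MIDI File")
    return {"length": length, "format": fmt, "tracks": ntracks, "division": division}

def iter_chunks(data: bytes):
    """Yield (chunk_id, payload) for every chunk (MThd, MTrk, ...) in the file."""
    pos = 0
    while pos + 8 <= len(data):
        chunk_id, length = struct.unpack(">4sI", data[pos:pos + 8])
        yield chunk_id, data[pos + 8:pos + 8 + length]
        pos += 8 + length

# Hand-built sample (hypothetical data): format 0, one track, 480 ticks per
# quarter note, and a track holding only the End-of-Track meta event.
sample = (
    b"MThd" + struct.pack(">IHHH", 6, 0, 1, 480)
    + b"MTrk" + struct.pack(">I", 4) + b"\x00\xff\x2f\x00"
)
header = parse_mthd(sample)
chunk_ids = [cid for cid, _ in iter_chunks(sample)]
print(header, chunk_ids)
```

Decoding the Note-On, Note-Off and Tempo events inside each MTrk payload would additionally require handling variable-length delta times, which this sketch leaves out.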

Various examples of the analysis of a musical corpus were presented. The study used piano pieces by five classical composers, collected from many different public websites. It was shown that music, under the adopted representation, obeys Zipf's law. Using the notion of information entropy, the positions of keywords in musical documents were located. A singular value decomposition of the matrix of collected documents was carried out; however, the results did not prove helpful for the recognition problem in this case.

Next, the concept of the composer recognition algorithm was presented. The method is based on building a profile of each composer and then comparing these profiles with the profile of the document being classified. The final result is an overall assessment of the similarity of rhythm, melody, and the way rhythm is combined with melody. No fixed values of the system parameters were assumed; the n-gram length, the aging factor used when training the classifier, normalization, profile size limiting, and the profile comparison method were left variable.
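The profile idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the system built in this thesis: it uses a single feature stream (toy melodic intervals), ignores aging and normalization, and compares profiles with the relative-difference distance commonly used in n-gram authorship attribution for text, which is assumed here rather than taken from this summary. All names and the toy data are hypothetical.

```python
from collections import Counter

def ngrams(seq, n):
    """Slide a window of length n over the sequence, one step at a time."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def profile(seq, n, size=None):
    """Relative n-gram frequencies, optionally keeping only the `size` most frequent."""
    counts = Counter(ngrams(seq, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(size)}

def dissimilarity(p, q):
    """Sum of squared relative frequency differences over the union of n-grams."""
    d = 0.0
    for g in set(p) | set(q):
        a, b = p.get(g, 0.0), q.get(g, 0.0)
        d += ((a - b) / ((a + b) / 2)) ** 2
    return d

def classify(piece, composer_profiles, n):
    """Return the composer whose profile is closest to the piece's profile."""
    piece_profile = profile(piece, n)
    return min(composer_profiles,
               key=lambda c: dissimilarity(composer_profiles[c], piece_profile))

# Toy training data (hypothetical): sequences of melodic intervals in semitones.
training = {
    "A": [2, 2, 1, 2, 2, 2, 1] * 4,
    "B": [-1, 3, -1, 3, -2, 4] * 4,
}
profiles = {name: profile(seq, n=2) for name, seq in training.items()}
guess = classify([2, 2, 1, 2, 2], profiles, n=2)
print(guess)
```

In this toy run the unknown piece shares all of its bigrams with composer A and none with composer B, so A is the closer profile. A fuller version would build separate rhythm, melody and combined profiles and merge their scores, as the summary describes.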

As part of the work, a composer recognition system was created. It was implemented in Perl using the Tk graphical library. A file format for storing the built composer profiles was proposed. The system structure clearly separates the application logic layer (the Engine subsystem), presentation (the UI subsystem), event handling (the UI::cmd.pm module), and a library of auxiliary tools independent of the application's main task (the Utils subsystem).

The system was tested for many possible parameter configurations. A careful interpretation of the obtained results indicates accurate and intuitively correct outcomes. Even when the program's decision is uncertain, a deeper analysis of the given composer's output reveals the reason. A detailed analysis of the algorithm shows that, with appropriate parameters, the accuracy of the system on the collected data reaches almost 90%. The optimal n-gram length was determined to be 6 to 7.

Acknowledgements

I would like to thank my supervisor, prof. dr hab. inż. Zbigniew Kulka, for his insight and helpful comments on acoustic matters.

I would like to thank my Canadian advisor, Dr. Vlado Keselj, for his help in shaping the idea of how to approach composer recognition.

This work would not have its present shape without the incalculable help of the best of friends: my beloved sis Ola Kontkiewicz and my best crony Kuba Gawryjołek.

Copyright

by

Jacek Wołkowicz

2007


CONTENTS:

1 INTRODUCTION
1.1 Aim of the work
1.2 Music as a natural language – basic information about NLP
1.3 Psychoacoustic foundations
1.4 Music storage approaches
1.4.1 Waveform
1.4.2 Perceptual audio coding
1.4.3 MIDI
1.4.4 Symbolic notations
1.5 Musical data vs. textual data
2 MUSIC REPRESENTATION IN ALGORITHMS
2.1 MIDI on computers
2.2 MIDI parsing
2.2.1 File structure
2.2.2 Events
2.2.3 Parser implementation
2.3 N-grams extraction
2.3.1 Uni-grams
2.3.2 N-grams
2.3.3 Compression of N-grams representation
2.4 Related work
2.4.1 Musical Information Retrieval
2.4.2 Existing approaches
3 CORPUS AND ITS FEATURES
3.1 Building a musical data corpus
3.2 N-gram features
3.3 Zipf’s law for music
3.4 Entropy analysis
3.4.1 Information entropy
3.4.2 Term ranking
3.5 Singular value decomposition
4 THE ALGORITHM FOR COMPOSER ATTRIBUTION
4.1 Related work
4.2 Algorithm
4.2.1 Testing and training set
4.2.2 Building profiles
4.2.3 Building piece representation
4.2.4 Profiles comparison
4.2.5 Final judgment
4.3 Algorithm details
4.3.1 N-gram length
4.3.2 Aging factor
4.3.3 Normalization
4.3.4 Profiles size
5 COMPOSER RECOGNITION SYSTEM
5.1 Functionality
5.2 Project
5.2.1 CDB file
5.2.2 Importing MIDI files
5.3 Implementation
5.3.1 Packages overview
5.3.2 Engine subsystem
5.3.3 Utils subsystem
5.3.4 UI subsystem
5.3.5 Running script – Run.pl
5.3.6 Testing plug-in
6 ANALYSIS OF THE RESULTS
6.1 Results interpretation
6.1.1 Proper judgment
6.1.2 Wrong judgment
6.1.3 Unseen composers
6.2 Algorithm evaluation
6.2.1 Profiles comparison
6.2.2 Normalization
6.2.3 N-gram length
6.2.4 Profile’s sizes
6.2.5 Aging factor
6.2.6 Representative data
6.2.7 Key-words-based classification
7 CONCLUSIONS
A. APPENDIX – MUSIC NOTATION
I. Western music system
II. Staff system
III. Temporal information
BIBLIOGRAPHY


LIST OF FIGURES:

Figure 1.1 Spectral Analysis of c-moll prelude BWV 846, J.S. Bach
Figure 1.2 Spectral Analysis of Etude c-moll, op. 25 no 12, F. Chopin
Figure 1.3 NLP and Music processing domains
Figure 2.1 Sample MThd chunk header
Figure 2.2 Sample MTrk chunk header
Figure 2.3 Sample Tempo event
Figure 2.4 Sample Note-On – Note-Off sequence
Figure 2.5 Solving the problem of parallelism
Figure 2.6 Unigrams extraction
Figure 2.7 Gliding window
Figure 2.8 The sample of Thai document
Figure 2.9 Two sample melodies
Figure 3.1 Composers Timeline
Figure 3.2 Zipf’s law for text
Figure 3.3 Zipf’s law for music corpus
Figure 3.4 Entropy dist
Figure 3.5 Eigenvalues for the corpus
Figure 3.6 SVD dimensions: 1, 2, and 3
Figure 4.1 Building profiles
Figure 4.2 Measure components for different n-grams values
Figure 4.3 Aging example
Figure 5.1 System scheme
Figure 5.2 System structure
Figure 5.3 Adding to database window
Figure 5.4 Adding composer window
Figure 5.5 Adding composer window
Figure 5.6 Application main window
Figure 5.7 Application: recognizing window
Figure 6.1 Normalization’s influence to the results
Figure 6.2 Accuracy of sieved and full profiles
Figure A.1 Clefs
Figure A.2 Time related symbols
Figure A.3 Staff layout
Figure A.4 Plethora of music notation potpourri


LIST OF TABLES:

Table 1.1 Levels of NLP. Text vs. music
Table 2.1 Variable Length Quantities
Table 2.2 Pitch and rhythm quantization
Table 3.1 Composer Corpus
Table 3.2 Sample document-term matrix
Table 3.3 Class entropies calculations
Table 3.4 Sample term ranking
Table 4.1 Training and testing set split
Table 6.1 Evaluation of the Frederic Chopin prelude Op. 28 No. 22
Table 6.2 Evaluation of the Ludwig van Beethoven Sonata Op. 49 No. 2
Table 6.3 Evaluation of the Franz Liszt Concert Etude No. 3 ‘Un sospiro’
Table 6.4 Unknown Composers assignments
Table 6.5 Algorithm results of aging 0.96
Table 6.6 Algorithm results of aging 0.96 with profiles normalization
Table 6.7 Maximal accuracies for different aging factors
Table 6.8 Results for representative data
Table A.1 Pitches
Table A.2 Notes and Rests


1 Introduction

1.1 Aim of the work

People store large amounts of music on their computers nowadays. They listen to it in the background almost all the time, and sometimes they even treat computers as mini sound studios that provide aural relaxation. These facts make the problem of processing musical data increasingly important. Musical data are still treated as unstructured binary data, left on the same shelf as images, movies and programs, as opposed to textual data, which are easy to process, search and index, and which are served by a wide range of computer-aided techniques from NLP (Natural Language Processing), IR (Information Retrieval) and TDM (Text Data Mining), such as classification, analysis, generation, summarization, searching and much more.

The aim of the work is to show that music can be treated as a natural language; to this end, an automatic composer recognition system has been developed. The system is based on the solution of the same problem for text provided by Keselj, Peng, Cercone and Thomas [29]. The program was implemented in Perl, a language designed for text processing. In order to measure the accuracy of the algorithm, a corpus of MIDI files containing piano pieces by various classical composers has been built. Since an NLP algorithm was to be applied, it has been shown how one can obtain equivalents of characters and words for music [14] and apply the comparison algorithm as it is. The system works, and it shows that there is a lot to be done in this area, from sound processing (in order to manage personal music libraries) through MIR (music IR) to advanced music semantic analysis (musical NLP). Music recognition software tools can be very important nowadays, since there are many web music repositories and, until now, all of them have had to be indexed manually. With automatic, content-based tools one can build more sophisticated systems, for instance one similar to Google for text [32].

One might think that music is in the same situation as other fine arts, such as painting, dance or sculpture, so why not treat those arts as natural languages too? In my opinion, this is not possible. There is a big difference between music and writing on the one hand and the other fine arts mentioned above on the other. Music, like writing, uses a symbolic notation in order to easily exchange artworks and preserve them for future generations, which none of the other fine arts do.

1.2 Music as a natural language – basic information about NLP

In order to treat music as a natural language, one has to show that music processing works on the same classes of problems as NLP does. One distinguishes certain levels of text processing, listed in Table 1.1. NLP, as well as music processing, tries to pass through all the levels, from recording (a voice, speech) to understanding (the meaning of a discourse). Of course, there is no tool that does everything at once, i.e., understands the meaning and extracts knowledge from a raw waveform. In practice, NLP tools concentrate on a certain level and try to lift the problem to the next level up.

Table 1.1 Levels of NLP. Text vs. music.

Level        Text processing             Music processing
phonetics    Recorded voice              Recording
phonology    Phonemes of the language    Separated notes
morphology   Words structure             Notes in the score
syntax       Words order                 N-grams, notes order
semantics    Words meaning, POS          Harmonic functions
pragmatics   The meaning of a sentence   Phrase structure
discourse    Context of a text           Interpretation of a piece

Music, similarly to natural language, can be recorded and presented primarily as a waveform. On the ‘phonetics’ level one tries to investigate the structure of a sound, and to separate and distinguish between notes and instruments. This task, combined with note recognition, is a well-known problem to contemporary sound engineers, even if they do not know that they are involved in NLP tasks. The analogous task is a major one in NLP, and many different approaches to it have proved successful. Nevertheless, music is much more complex, and sound recognition for musical pieces is still in its infancy. A simple explanation of this fact, with an example, is given later in this chapter.

The second very important similarity results from the fact that music, like text, has a symbolic representation. The first writing system, cuneiform, was invented in ancient Mesopotamia by the Sumerians about 3200 B.C. [51]. The origins of music notation date back to the 8th century and the Carolingian Empire, when the neumatic system, the first notation devised for music only, was invented, while the first inscription that may be treated as a basic music notation dates back to 2000 B.C. [58]. It is true that these two inventions are far apart in time, but music and writing are the only human activities that have a symbolic representation. This fact allows and encourages thinking about music content analysis as the next step of this so-called MLP (Music Language Processing); similarly to text, music can be analyzed on the semantic and syntactic levels. The music score also consists of characters, which are called notes. As with NLP’s morphology and syntax, music has a hidden, grammar-like structure with hidden rules. In this case it is called harmony. It determines how to put words (notes) together and how to build well-formed phrases from them. It also governs the musical meaning of a piece, which is the order of chords. In the first case (notes) we may talk about the syntax of music, while in the second case (chords, harmonic functions) we may talk about the semantics of a certain chord or the pragmatics of a phrase. This is very similar in form to one of the main problems of NLP nowadays, which is grammar analysis. A method of detecting probabilistic harmony was introduced by Bod [6]. He based his investigation on the Essen Folksong Collection (a collection of folk themes, the equivalent of the Penn Treebank).

The second main direction of NLP, statistical NLP, can be applied to musical data as well. MIR (Music Information Retrieval), which is now an intensively explored research domain, is an example of this approach. Another problem with music is that there are no word boundaries and phrasing is driven by harmony, so one has to work out the structure of a piece as well as its harmonic representation in order to retrieve the musical meaning successfully. However, there are methods of partitioning pieces into smaller themes [55], [57]. A similar problem can be found in some natural languages that are written without whitespace, such as Thai, a language of almost 65 million people.

The highest levels of NLP (pragmatics and discourse) also apply to music. Pieces can be positive (major) or negative (minor). They may represent human ideas, desires or aspirations (romantic music) as well as depict real situations and actions (program music). Paul Dukas’s The Sorcerer's Apprentice is a very good example of program music; it was adapted in Walt Disney’s 1940 animated film Fantasia, in which Mickey Mouse plays the role of the apprentice.

1.3 Psychoacoustic foundations

When talking about music as a natural language, especially at the low levels of analysis (phonetics, phonology), one has to point out some basic facts from the psychoacoustics of hearing and human aural perception.

Sounds are disturbances of pressure that propagate from the sound source through matter (air) as a longitudinal wave. They are picked up by the eardrum and transmitted through the middle ear into the cochlea, where the mechanical energy of the sound is converted to neural signals, which are then carried to the brain in order to create a sound scene (sound picture).

The nature of sound results from the physical properties of air and of the human auditory system. The shape and dimensions of the ear make us most sensitive to frequencies from 1 kHz to 4 kHz. The length of the cochlea in the inner ear limits human perception to sounds in the range of 20 Hz to 20 kHz. This information was used implicitly and unwittingly by people in building and inventing musical instruments as well as in developing contemporary musical systems. Nowadays it is also used more consciously in sound processing and music storage. Attention will be paid to some other facts from psychoacoustics throughout this thesis. Nevertheless, they all follow from the foundations of human sound perception.

1.4 Music storage approaches

Music can be represented digitally in various ways. However, there are two main types of storage approach:

1. Raw (waveform) – the sound recorded by microphones, representing nothing but the motion of the speaker’s (or microphone’s) membrane. The data are a kind of snapshot of a real recording. It generally does not matter whether they are compressed (mp3, ogg and other well-known formats) or stored explicitly (pcm, wav or aiff format).

2. Symbolic representation – score notations (mus, sib, abc, xml) and the MIDI protocol, which store information about musical events rather than about the actual sound.

People got used to raw representations because they like to hear “real” artists’
music, not the symbolic version, which is played differently on every machine. Another
reason is that not everyone can understand music by reading it. Musical education and
studying scores are not as common a pastime as they used to be. The ease of off-line
listening to music comes from the rapid spread of vinyl long-play discs, followed by
compact cassettes (audio tapes) and finally, in 1982, audio compact discs (CDs). These
technologies make music available to all, but they also mean that fewer people need to
be involved in actively creating music. The progress in compression methods and the
rapid development of personal computers and the Internet allow music to be shared
through the web. People are surrounded by music while knowing nothing about it or
its content.

These formats will be briefly described in the following sections.

1.4.1 Waveform

Waveform is an audio format in which music is stored as a digital audio signal.
An analog sound signal is a variation of acoustic pressure, usually represented by a
continuous-time voltage signal at the output of a microphone, which is then low-pass
filtered, sampled, quantized and binary coded. The output of this process, a digital
PCM (Pulse Code Modulation) signal, is then stored in a file. There are plenty of
possible combinations of sampling rate and quantization depth; however, one of them
became more popular than the others. The human ear is sensitive to sounds up to 20
kHz. According to the Shannon-Kotelnikov sampling theorem, in order to encode a signal
with a maximum component frequency of 20 kHz one has to sample it at a frequency
greater than 40 kHz. Then no information is lost and the analog signal can be
reconstructed. In order to leave a safety margin, a standard sampling rate of 44.1 kHz
was chosen. The second problem is the quantization depth. Research on human sound
perception has shown that the dynamic range of the human ear (i.e. the ratio of the
loudest sound to the quietest one) is about 120 dB. Each additional bit in the
quantized sample adds about 6 dB to the resulting dynamic range, thus the most
convenient quantization depths are 16 bits (96 dB) and 24 bits (144 dB).

The standardized parameters for audio (CD Audio) are 44.1 kHz sampling rate

and 16 bits quantization (96 dB dynamic range and more than 20 kHz maximal

frequency) and these are the most common settings for waveform files.
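The arithmetic behind these parameters can be checked with a short sketch. The function names below are illustrative only and are not part of the thesis software; the dynamic range is approximated as 20·log10(2^bits), which gives the "about 6 dB per bit" rule quoted above.

```python
import math


def nyquist_rate_hz(max_frequency_hz: float) -> float:
    """Lower bound on the sampling rate for a band-limited signal.

    The sampling frequency has to be greater than twice the highest
    frequency component (sampling theorem).
    """
    return 2.0 * max_frequency_hz


def dynamic_range_db(bits: int) -> float:
    """Approximate dynamic range of linear PCM with the given bit depth.

    Ratio of full scale to one quantization step, expressed in decibels.
    """
    return 20.0 * math.log10(2 ** bits)


if __name__ == "__main__":
    print(nyquist_rate_hz(20_000))          # bound for a 20 kHz signal
    for depth in (16, 24):
        print(depth, round(dynamic_range_db(depth), 1))
```

With these definitions, 16 bits correspond to roughly 96 dB and 24 bits to roughly 144 dB, and the CD sampling rate of 44.1 kHz lies above the 40 kHz bound.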

Two examples of piano music with additional graphical information for two
different excerpts are shown in Figure 1.1 and Figure 1.2. Sound, in the basic
approach, can be represented as a discrete function of voltage in time. This
representation is called a waveform and it is shown just above the notes in both
figures. There are some simple approaches to waveform analysis, such as various
zero-crossing parameters or envelope shape analysis [33], but these methods can be used
only in a few simple tasks, such as speech/music discrimination. The second possible
approach is to calculate a spectral representation of the waveform. A spectral
representation is a two-dimensional function of sound energy depending on time and
frequency. It shows the distribution of sound components of various frequencies. Both
spectral images in Figure 1.1 and Figure 1.2 were calculated using a Kaiser (180 dB)
sliding window function with 8192 bands (~2.5 Hz frequency resolution and ~50 ms time
resolution).

Spectral analysis (i.e. the analysis of this kind of function) is currently the only
approach to distinguishing and recognizing notes from pure sound data. Sound
recognition comes down to finding all maxima in the spectral view and classifying them
by their shape and position into different groups (note recognition, instrument
identification). Considerable research has been done in this area ([3], [5], [18],
[33], [35], [36] and [59]). However, the results are still very poor. Notes can be
recognized in simple examples (Figure 1.1), but more complex and messy ones (e.g.
Figure 1.2) still cause too many errors.

Figure 1.1 Spectral Analysis of c-moll prelude BWV 846, J.S. Bach

Figure 1.2 Spectral Analysis of Etude c-moll, op. 25 no 12, F. Chopin

1.4.2 Perceptual audio coding

One very important use of the spectral approach is perceptual audio coding, where
several psychoacoustic phenomena are incorporated in the codec design in order to
remove the redundancy caused by irrelevant information. Many lossy compression
algorithms for storing audio data have been invented and many music file formats are in
use. They keep only the information about events in the spectral view that are audible
to the human ear (tones and noise bands). Psychoacoustic phenomena such as the
threshold in quiet, derived from the equal-loudness contours obtained for sine tones,
together with temporal and frequency masking, are used in this case. For instance, each
frequency peak and noise band can mask other sounds that occur in its close vicinity;
such sounds, despite being registered by the microphone and probably appearing during
playback, will not be noticed by the human ear. Masking occurs in both the frequency
and the time dimension, and may be both forward and backward. All additional
information that would not be perceived is not stored, therefore the compression is
lossy. Analyzing this kind of musical data may be simpler than analyzing a pure
recording, because all the important features that represent maxima on the spectral map
are already extracted. The use of perceptual audio coding in music was studied, for
instance, at Fraunhofer IIS, Germany, where the mp3 and aac formats were invented [19].

1.4.3 MIDI

MIDI files represent a different approach to storing musical data. Instead of storing
exact information about a performance of a piece, i.e. a recording, one can store
information about the piece itself. MIDI stores information about musical events, such
as pressing or releasing a key. Perceptually coded audio files contain a similar type
of information; however, it is obtained automatically from the original recording
(storing about 1/10 of the original data while keeping all the audible information)
without any human semantic feedback, while MIDI files are sequenced by people, so their
content may be described as semantic. The size of a MIDI file is usually about 1/1000
of that of the original recording, which illustrates the difference.

MIDI stands for Musical Instrument Digital Interface and, according to
Wikipedia, it is an industry-standard electronic communications protocol that enables
electronic musical instruments, computers and other equipment to communicate,
control and synchronize with each other in real time. MIDI does not transmit an audio
signal or media; it simply transmits digital “event messages” such as the pitch and
intensity of musical notes to play, control signals for parameters such as volume,
vibrato and panning, cues, and clock signals to set the tempo [39]. For instance,
RFC 4695 [31] describes a standard for network communication using the MIDI protocol
and MIDI infrastructure. As an electronic protocol, it is notable for its success, both
in its widespread adoption throughout the industry and in remaining essentially
unchanged in the face of technological developments since its introduction in 1983.

MIDI files are binary files that consist of concurrent channels and tracks ([23],

[24] and [37]). Each channel is a container for events – MIDI messages. There are three

categories of these messages:

1. Channel (voice) messages – essential for midi usage, representing things that

may happen during music generation, such as: Note-On (pressing a key), Note-Off

(releasing a key), Key Pressure, etc.,

2. System Real-time messages – messages controlling real-time events that may

happen during music performance, regarding the time control and flow of other

sync messages such as: Clock, Tick, Start, Stop, Continue, Reset,

3. System Common messages – additional information about performance of the

piece such as: System Exclusive messages (various, usually textual data) and

playlist controllers (Song Select, Song Request).

Both channel and real-time messages depend on time. The time of events in a
Standard MIDI File is counted in ticks, which the tempo setting (given in microseconds
per quarter note) converts to real time, so it is easy to determine the exact moment of
an event. While time information still remains real (as it is in waveform data), pitch
information becomes symbolic. Instead of a frequency value, a key number on a virtual
instrument is used. There are 128 possible notes on a MIDI device, numbered from 0 to
127. Middle C (about 261.6 Hz) is note number 60 and, as in the equal-tempered scale
system, the frequency of each note is $2^{1/12}$ times that of the previous one. This
solution has a drawback, however, because it ties MIDI to Western music, which uses the
chromatic, twelve-pitch scale. It is of little use for other systems, such as
pentatonic, Pythagorean or natural tuning, but it makes MIDI events clear to interpret.
It will be explained in the following chapters how one can use this information.
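The mapping from key numbers to frequencies can be illustrated with a minimal sketch. It assumes the common convention that note 69 (the A above middle C) is tuned to 440 Hz; this reference pitch is an assumption of the example and is not stated in the text. The function name is illustrative.

```python
def midi_note_to_frequency(note: int, a4_hz: float = 440.0) -> float:
    """Frequency of a MIDI key number in twelve-tone equal temperament.

    Assumes note 69 corresponds to `a4_hz`; each semitone step multiplies
    the frequency by 2 ** (1 / 12).
    """
    if not 0 <= note <= 127:
        raise ValueError("MIDI note numbers range from 0 to 127")
    return a4_hz * 2.0 ** ((note - 69) / 12.0)


if __name__ == "__main__":
    print(round(midi_note_to_frequency(60), 2))   # middle C
    print(midi_note_to_frequency(69))             # reference A
```

Under this assumption middle C (note 60) comes out at about 261.6 Hz, in agreement with the value quoted above.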

The MIDI format contains information about music structure only. It does not
preserve any information about the exact performance (except the timing schedule), so
MIDI files are played using special programs: synthesizers, where each note is
generated algorithmically, or samplers, which use fixed sets of previously recorded
notes.

1.4.4 Symbolic notations

There is also another way of storing musical data, which stores information about
the musical score only. There are lots of formats that follow this paradigm but,
unfortunately, there is no standard comparable to MIDI. They are usually designed for
different score editors (so called ‘Scorewriters’) and are not yet widely used. Here
is a short list of some of these formats:

1. Finale file format [16] – binary file, commercial (MakeMusic, Inc.) but used

widely by composers,

2. Chris Walshaw’s ABCMusic notation [62] – designed to notate music, tunes
and lyrics, in ASCII format under various licenses, both commercial and open
source,

3. LilyPond notation [34] – text scripting notation for engraving sheet music.
Unlike some commercial proprietary programs such as Finale and Sibelius,
LilyPond does not contain its own graphical user interface for the creation of
scores. It is developed under the GNU General Public License,

4. MusicXML [40] – XML standard for storing musical data. It is very interesting

as far as music information retrieval or music as a natural language processing is

concerned because of the standardization of XML formats, but useless unless it is

widely used.

These formats could be very good for analysis, but they are not used over the
Internet as widely as MIDI is and there are not enough resources to build a reasonable
music corpus for research.

1.5 Musical data vs. textual data

Two main approaches to storing musical data, the symbolic one and the actual
(recorded) one, were described in this chapter. The same relationship can be shown for
texts. Speech can also be recorded and, as has been shown, there are already well-known
processing tools that involve both text and speech processing. The resemblance between
NLP and music processing is not accidental and cannot be neglected. Representing speech
as text seems natural to us; on the other hand, we can store human speech as waveform
data, which preserves the original author’s voice. But people are used to representing
text symbolically. This representation is easy to store, edit (using computer
keyboards) and process, because it is flat: words occur one after another and there is
no concurrency in text. The second issue is that almost everyone can read, but few
people can play music and almost nobody understands it. The reason for this situation
lies in the contemporary education model – music is simply not needed nowadays. That is
why people got used to mp3 files.

The similarities between the two domains with respect to their subject matter (the
types of data being processed) are summarized in Figure 1.3; all the content of this
diagram was described in the previous sections.

Music and natural languages were both created and used by people in order to
communicate and to exchange thoughts and sensations. They have been used since the
very beginning of human existence. They both seem to have free structures. In fact,
both are created using a complex set of rules called grammar (or harmony), and both
have been evolving over the centuries. Both text processing and music processing remain
key problems whose solution may help us understand human nature and the nature of
human thinking.

Figure 1.3 NLP and Music processing domains

2 Music representation in algorithms

2.1 MIDI on computers

MIDI files are binary records of MIDI messages. They are prepared for
playback on computers and other similar machines. There are two types of software
instruments that perform MIDI files: synthesizers and samplers. Both are virtual
instruments that can be operated through the MIDI protocol, while MIDI files are
practically records of performances, usually artificially sequenced rather than real.
Synthesizers generate musical sounds mathematically (algorithmically), so the sound is
totally artificial. They are more popular and are usually distributed with operating
systems or with the software provided with sound cards. As they need far fewer
resources (disk and processor), they have existed since the beginning of PC
development. Samplers contain a set of original sound samples for many instruments,
recorded with different pitches, sound pressures and durations (with regard to
different transients). Samplers require much more disk space (several megabytes per
instrument) and processor time (they should run smoothly on a 1 GHz processor). One
such system is a part of Finale, a commercial score-writing system [16]. There are also
some hardware implementations that use DSP. The main advantage of samplers is the real,
not artificial, sound. It is also possible to record samples of a valuable, unique
instrument and use these samples to generate synthetic performances. There are a lot of
free and commercial sound libraries available online (search for DLS files [20]).

As mentioned in the introduction, the MIDI protocol was not designed for
any special purpose. In order to standardize MIDI files, the MIDI Manufacturers
Association (MMA [37]) provides a specification for synthesizers which imposes
several requirements beyond the more abstract MIDI standard. While MIDI itself provides

the protocol which ensures that different instruments can interoperate at a fundamental

level (e.g. that pressing keys on a MIDI keyboard will cause an attached MIDI sound

module to play musical notes), General MIDI (or GM) goes further in two ways: it

requires that all GM-compatible instruments meet a certain minimal set of features,

such as being able to play at least 24 notes simultaneously (polyphony), and it attaches

certain interpretations to many parameters and control messages which were left

unspecified in MIDI, such as defining instrument sounds for each of 128 program

numbers [22]. Next, MIDI messages (along with timing information) can be collected

and stored as a file in a computer file system, in what is commonly called a MIDI file,

or more formally, a Standard MIDI File (SMF). The SMF specification was also developed
and is maintained by the MMA [39].

2.2 MIDI parsing

2.2.1 File structure

Building a parsing tool is an essential first step when processing data with a
complex structure, and the Standard MIDI File format has such a structure. According to
the SMF specification, a MIDI file can contain any number of tracks and every track may
contain up to 16 independent channels.

Data in a MIDI file are organized in chunks and there can be many chunks
inside a file. Chunks can have different sizes, so the size of the data is always
stored in the chunk header. Each chunk header contains the chunk ID (four bytes) that
identifies the chunk type and the 32-bit length of the chunk data. The 4 bytes that
make up the length are stored in the (Motorola) "Big Endian" byte order, not in the
(Intel) "Little Endian" reverse byte order, and this has to be taken into
consideration, especially on PCs.
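The chunk layout described above can be walked with a few lines of code. The following is a minimal sketch, not the parser used in this work; the function name is illustrative. The `>` prefix in the `struct` format string requests big-endian decoding regardless of the host machine.

```python
import struct
from typing import Iterator, Tuple


def iter_chunks(data: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (chunk_id, chunk_data) pairs from the raw bytes of a MIDI file.

    Each chunk starts with a 4-byte ID followed by a 32-bit big-endian
    length of the chunk data.
    """
    offset = 0
    while offset + 8 <= len(data):
        chunk_id, length = struct.unpack_from(">4sI", data, offset)
        start = offset + 8
        end = start + length
        if end > len(data):
            raise ValueError("chunk length exceeds the size of the file")
        yield chunk_id, data[start:end]
        offset = end


if __name__ == "__main__":
    # A tiny hand-made byte string: one header chunk and one track chunk.
    sample = (b"MThd" + struct.pack(">I", 6) + bytes(6)
              + b"MTrk" + struct.pack(">I", 4) + b"\x00\xff\x2f\x00")
    for chunk_id, payload in iter_chunks(sample):
        print(chunk_id, len(payload))
```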

The MThd chunk defines the primary features of the file. Its data consist of 6 bytes:
a 16-bit format, a 16-bit number of tracks and a 16-bit division (timing information).
A sample MThd header is shown in Figure 2.1:

Figure 2.1 Sample MThd chunk header

There are 3 different formats of MIDI files. Type ‘0’ means that the file contains
a single track containing MIDI data on possibly all 16 MIDI channels. Type ‘1’ means
that the file contains one or more simultaneous tracks (i.e., all starting from the
assumed time of 0), perhaps each on a single MIDI channel. This is the most common type
nowadays and an example of it is shown in Figure 2.1 (bytes 8 and 9). Type ‘2’ means
that the file contains one or more sequentially independent single-track pieces.

The second pair of bytes in MThd chunk is the number of tracks in the midi file.

It should have a value of 1 for format ‘0’. In the example it equals 5.

The third pair of bytes describes timing information. If the value is positive, it
gives the number of pulses per quarter note (PPQN) (192 in the example given above).
If it is negative, the first byte defines the SMPTE (Society of Motion Picture and
Television Engineers) ‘frame rate’ of the piece (-24, -25, -29 or -30 fps) and the
second byte gives the number of subframes per frame. So if the division is E7 28
(-25, 40), it gives 25*40=1000 ticks per second. A detailed explanation of these tempo
markings is available in [25].
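The three header fields and the two interpretations of the division can be decoded as in the sketch below. The function name and the dictionary keys are illustrative, and the SMPTE branch simply treats the first byte as a signed integer frame rate, which is a simplification made for this example.

```python
import struct


def parse_mthd(data: bytes) -> dict:
    """Decode the 6 data bytes of an MThd chunk (format, tracks, division)."""
    if len(data) < 6:
        raise ValueError("MThd data must contain at least 6 bytes")
    file_format, track_count, division = struct.unpack(">HHH", data[:6])
    header = {"format": file_format, "tracks": track_count}
    if division & 0x8000:
        # Negative division: signed frame rate in the first byte,
        # subframes per frame in the second byte.
        frame_rate = -struct.unpack(">b", data[4:5])[0]
        subframes = data[5]
        header["ticks_per_second"] = frame_rate * subframes
    else:
        # Positive division: pulses (ticks) per quarter note.
        header["ppqn"] = division
    return header


if __name__ == "__main__":
    print(parse_mthd(bytes.fromhex("0001000500C0")))  # format 1, 5 tracks, 192 PPQN
    print(parse_mthd(bytes.fromhex("00010005E728")))  # division E7 28
```

The first call reproduces the values discussed for Figure 2.1 (format 1, 5 tracks, 192 PPQN); the second reproduces the E7 28 example with 1000 ticks per second.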

There are one or more MTrk (Track) chunks after the MThd chunk. Each MTrk

chunk contains the chunk ID (4 bytes ‘MTrk’) and the chunk data length (4 bytes). No

additional information is stored in MTrk chunk:

Figure 2.2 Sample MTrk chunk header

The MTrk chunk is a container for all MIDI messages. Each message is preceded by
a time stamp stored as a ‘Variable Length Quantity’ (VLQ). If the value of a byte is
less than 128, the byte value is the final value. If the MSB is set, the remaining
seven bits of that byte are the most significant bits of the final value and the rest
follows in the subsequent bytes. This continues until a byte whose MSB is not set
(i.e., its value is less than 128) is read; then the lower seven bits of all the bytes
read, from the first to the current one, form the resulting value. Examples of VLQ are
shown in Table 2.1. Each VLQ value in SMF describes a delta time (the number of ticks
that have elapsed since the previous event).

Table 2.1 Variable Length Quantities

Quantity      VLQ representation
0x0           00
0x10          10
0x7F          7F
0x80          81 00
0x1000        A0 00
0x3FFF        FF 7F
0x4000        81 80 00
0x100000      C0 80 00
0x1FFFFF      FF FF 7F
0x200000      81 80 80 00
0x1000000     88 80 80 00
0xFFFFFFF     FF FF FF 7F
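The entries of Table 2.1 can be reproduced with a small encoder and decoder. This is a sketch written for illustration (the function names are not taken from the parser used in this work); the upper limit of 0x0FFFFFFF corresponds to the largest value listed in the table.

```python
from typing import Tuple


def encode_vlq(value: int) -> bytes:
    """Encode a non-negative integer as a Variable Length Quantity."""
    if not 0 <= value <= 0x0FFFFFFF:
        raise ValueError("value out of range for a 4-byte VLQ")
    groups = [value & 0x7F]              # last byte: MSB cleared
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)   # earlier bytes: MSB set
        value >>= 7
    return bytes(reversed(groups))


def decode_vlq(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode a VLQ starting at `offset`; return (value, next_offset)."""
    value = 0
    while True:
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:                  # MSB not set: this was the last byte
            return value, offset


if __name__ == "__main__":
    for quantity in (0x0, 0x7F, 0x80, 0x1000, 0x3FFF, 0x4000, 0xFFFFFFF):
        print(hex(quantity), encode_vlq(quantity).hex(" ").upper())
```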

2.2.2 Events

MTrk chunks are containers for MIDI events (MIDI messages plus delta-time

information). There are numerous event types in SMF specification, but in this case

only a few of them remain interesting:

1. Tempo,

2. Note-Off,

3. Note-On.

Tempo is a non-MIDI (meta) event. It has the structure ‘delta-time FF 51 03 xx xx xx’,
where the last three bytes are the new tempo. It expresses tempo as “the amount of time
(i.e., microseconds) per quarter note”. The default tempo is 500,000 (0x07A120), i.e.
120 BPM (beats per minute). In the example in Figure 2.3 it is changed to 315,789
(190 BPM) and the delta-time is 0 (this is the very beginning of the piece). Tempo
defines how fast ticks are triggered, which makes it possible to compute the actual
time (in ms) of other events.
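The relation between the tempo value, BPM and real time can be written down directly. The sketch below assumes a PPQN-based division and a constant tempo over the converted interval; the function names are illustrative and the event bytes are passed without the leading delta-time.

```python
def parse_tempo_event(event: bytes) -> int:
    """Return microseconds per quarter note from 'FF 51 03 xx xx xx'."""
    if len(event) != 6 or event[:3] != b"\xff\x51\x03":
        raise ValueError("not a tempo event")
    return int.from_bytes(event[3:], "big")


def tempo_to_bpm(us_per_quarter: int) -> float:
    """Convert microseconds per quarter note to beats per minute."""
    return 60_000_000 / us_per_quarter


def ticks_to_ms(ticks: int, us_per_quarter: int, ppqn: int) -> float:
    """Convert a number of ticks to milliseconds at a constant tempo."""
    return ticks * us_per_quarter / ppqn / 1000.0


if __name__ == "__main__":
    default = parse_tempo_event(bytes.fromhex("FF510307A120"))
    print(default, tempo_to_bpm(default))
    print(round(tempo_to_bpm(315_789)))
    print(ticks_to_ms(192, default, 192))   # one quarter note at 192 PPQN
```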

Figure 2.3 Sample Tempo event

Note-Off and Note-On both have the same structure ‘delta-time XY ww zz’,
where X defines the event type (‘8’ stands for Note-Off and ‘9’ for Note-On), Y is the
channel number (a value from 0x0 to 0xF), ‘ww’ gives the key number (pitch)
and ‘zz’ gives the velocity (volume level). The velocity information will not be
needed, except when a Note-On event has velocity 0. In this situation it is identical
to a Note-Off event.
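Decoding the ‘XY ww zz’ bytes, including the velocity-0 special case, might look as follows. This is a sketch with illustrative names; it expects a complete three-byte message with an explicit status byte and does not handle other MIDI features such as running status.

```python
from typing import NamedTuple, Optional


class NoteEvent(NamedTuple):
    is_on: bool      # True for an effective Note-On, False for Note-Off
    channel: int     # 0..15
    pitch: int       # MIDI key number
    velocity: int


def decode_note_event(message: bytes) -> Optional[NoteEvent]:
    """Decode a 3-byte Note-On/Note-Off message; return None for other messages."""
    if len(message) != 3:
        return None
    status, pitch, velocity = message
    kind, channel = status >> 4, status & 0x0F
    if kind not in (0x8, 0x9):
        return None
    # A Note-On with velocity 0 is treated as a Note-Off.
    is_on = kind == 0x9 and velocity > 0
    return NoteEvent(is_on, channel, pitch, velocity)


if __name__ == "__main__":
    print(decode_note_event(bytes([0x90, 60, 100])))
    print(decode_note_event(bytes([0x90, 60, 0])))
    print(decode_note_event(bytes([0x81, 64, 64])))
```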

As before, an excerpt from a MIDI file showing how Note-On and Note-Off events
work is presented in Figure 2.4:

Figure 2.4 Sample Note-On – Note-Off sequence

2.2.3 Parser implementation

Keeping all this information about MIDI files in mind, a MIDI parser was
implemented. It is based on the MIDI package which is a part of an open source project,
abcMIDI, distributed under the GNU General Public License. The original software
resources are available at [62]. It provides a framework for analyzing MIDI files. The
program collects Note-On and Note-Off events for each channel separately. Each event is
assigned a time value (in ms). In the next step each pair of Note-On and Note-Off
events is merged into one note. Each note is characterized by onset time, pitch and
duration. However, that is not all. As mentioned above, music differs from text in that
text is flat but music is not. In this approach one has to derive a linear order of
notes from this concurrency. One can treat every channel separately, provided that
every channel represents one hand. Then the problem of parallelism has to be solved on
each channel. According to basic psychoacoustic knowledge, people concentrate on the
highest currently played note (Figure 2.5):

Figure 2.5 Solving the problem of parallelism

This was also shown by Uitdenbogerd and Zobel [61]. They tried to find the best
heuristic for reducing polyphonic music to a monophonic representation. In their
experiments with human listeners, the heuristic that selects the highest currently
played note in chords or concurrencies performed best. This suggests that it is what
people really perceive, so it may be a key to understanding human music perception.

In the next step, the set of notes in each channel is sorted by onset time in
ascending order and, secondarily, by pitch in descending order. Then, a sequence of the
highest currently played notes is created for each channel. The output is a list of
channels, each with the channel number, the number of notes in the channel and a list
of notes (pitch and duration information):

channel 0 <number of notes>

<pitch duration>

<pitch duration>

...

<pitch duration>

channel 1 <number of notes>

<pitch duration>

...

<pitch duration>

...

channel <n> <number of notes>

...
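The note merging and the reduction to a single melodic line can be sketched as below. This is not the code of the implemented parser: the names are illustrative, and "highest currently played note" is interpreted here, as an assumption, as the highest pitch among notes that share the same onset time. Notes that start while a higher note is still sounding are kept.

```python
from typing import Dict, List, NamedTuple, Tuple


class Note(NamedTuple):
    onset_ms: float
    pitch: int
    duration_ms: float


def pair_notes(events: List[Tuple[float, bool, int]]) -> List[Note]:
    """Merge Note-On/Note-Off pairs of one channel into notes.

    `events` holds (time_ms, is_on, pitch) tuples in chronological order.
    """
    open_notes: Dict[int, float] = {}
    notes: List[Note] = []
    for time_ms, is_on, pitch in events:
        if is_on:
            open_notes.setdefault(pitch, time_ms)
        elif pitch in open_notes:
            onset = open_notes.pop(pitch)
            notes.append(Note(onset, pitch, time_ms - onset))
    return notes


def highest_note_line(notes: List[Note]) -> List[Note]:
    """Sort by onset (ascending) and pitch (descending); keep the top note per onset."""
    ordered = sorted(notes, key=lambda n: (n.onset_ms, -n.pitch))
    line: List[Note] = []
    for note in ordered:
        if not line or note.onset_ms != line[-1].onset_ms:
            line.append(note)
    return line


if __name__ == "__main__":
    events = [(0, True, 60), (0, True, 64), (500, False, 60),
              (500, False, 64), (500, True, 62), (1000, False, 62)]
    for note in highest_note_line(pair_notes(events)):
        print(note.pitch, note.duration_ms)
```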

2.3 N-grams extraction

2.3.1 Uni-grams

Unigrams are the elementary units that are taken into consideration during data
processing. In NLP, these can be characters or words. Which of the two gives better or
worse results depends on the task and should be checked for every task. With music, we
do not have such a choice. Of course, there are ‘phrases’ in music that could be given
the meaning of ‘words’, but there are no whitespaces or other delimiters in music that
would simply separate phrases. One could use barlines as delimiters, but this is
unlikely to work well.

The simplest approach to unigram extraction is to take the duration or pitch as the
basic feature, but this cannot bring good results. Pieces can be played at different
speeds and can be transposed to any key. The features therefore need to be
key-independent: what matters is not the pitch itself but the pitch relative to other
notes. This is important because the key does not tell us anything about a particular
work, e.g., J. S. Bach wrote two sets of preludes and fugues, with one fugue in each
key of the well-tempered scale, so a pitch distribution analysis of them would yield a
flat, uniform distribution. The second significant feature of musical n-grams is that
they should be tempo-independent. Duration is not given explicitly in MIDI files as
quarter notes, eighth notes or half notes, but directly, in a form that can be mapped
to milliseconds. MIDI files representing the same piece but sequenced by different
people (or programs) will look slightly different. That is why relative rather than
absolute durations are used. Each difference is expressed on a logarithmic scale and
rounded in order to absorb random tempo fluctuations. The number ‘1’ means that the
following note lasts twice as long as the previous one, ‘2’ means four times as long,
‘0’ means the same duration and ‘-1’ means half as long. The procedure of extracting
n-grams is shown in Figure 2.6 and the proposed formula applied to each pair of notes
is given as follows:

$$(P_i, T_i) = \left( p_{i+1} - p_i,\ \mathrm{round}\left( \log_2 \frac{t_{i+1}}{t_i} \right) \right) \qquad (2.1)$$

where $P_i$ and $T_i$ denote the resulting relative values, $p_i$ is the pitch of the
i-th note represented by its MIDI value and $t_i$ is the duration (in seconds) of the
i-th note. A rounding precision of 0.2 was chosen, and after this smoothing the changes
were assessed to be imperceptible compared to the original performance.
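Formula (2.1) can be turned into a short function. In this sketch the "rounding precision of 0.2" is interpreted, as an assumption, as rounding the logarithm to the nearest multiple of 0.2; the function name and the input layout (a list of pitch and duration pairs) are illustrative.

```python
import math
from typing import List, Tuple


def extract_unigrams(notes: List[Tuple[int, float]],
                     precision: float = 0.2) -> List[Tuple[int, float]]:
    """Compute (P_i, T_i) for each pair of consecutive notes.

    `notes` holds (midi_pitch, duration_seconds) tuples. P_i is the pitch
    difference in semitones, T_i the rounded base-2 logarithm of the
    duration ratio.
    """
    unigrams = []
    for (pitch, duration), (next_pitch, next_duration) in zip(notes, notes[1:]):
        relative_pitch = next_pitch - pitch
        log_ratio = math.log2(next_duration / duration)
        # Round to the nearest multiple of `precision`, then strip float noise.
        relative_duration = round(round(log_ratio / precision) * precision, 10)
        unigrams.append((relative_pitch, relative_duration))
    return unigrams


if __name__ == "__main__":
    melody = [(60, 0.5), (64, 1.0), (67, 1.0), (65, 0.25)]
    print(extract_unigrams(melody))
```

A sequence of k notes yields k - 1 unigrams, since each unigram describes a pair of consecutive notes.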

Figure 2.6 Unigrams extraction

In fact, all the preprocessing steps described above that lead to unigrams are
implemented in the preprocessor (the MIDI parser), so a sample (although very short)
output of the parser can look like this:

channel 0 2

1.0 -1

-1.6 3

One has to keep in mind that this example describes an excerpt containing three notes,
not two, because each unigram describes relative quantities and thus represents a pair
of notes.

One thing that may strike us is that the files that come out of the parser are a
kind of textual data for music processing algorithms, similar to documents from text
corpora. They contain ‘letters’ and may be easily analyzed. That is why they can be
given an ‘mtxt’ suffix (for musical txt) to emphasize their similarity to textual data.

2.3.2 N-grams

N-grams are simply n consecutive tokens [60]. In the case of text, one can
distinguish character n-grams and word n-grams. The task in this case is to retrieve
n-grams from musical data. In this solution a sequence of tuples (relative pitch,
relative duration) is obtained from the preprocessor, and three types of n-grams can be
extracted from it. One can consider the rhythm only, the melody only, or n-grams that
combine both features. N-grams are collected using a sliding window which is shifted
along each channel (Figure 2.7):

Figure 2.7 Gliding window

Almost every unigram is used n times during processing, so the total number of
extracted n-grams does not depend much on n. What is essential when choosing the length
of n-grams is that larger values of n exponentially increase the size of the set of
possible n-grams, so the number of distinct n-grams grows considerably. The optimal
n-gram length will be chosen and evaluated in the following sections.
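The sliding window and the three n-gram types (melody only, rhythm only, both) can be sketched as follows. The function name and the `mode` parameter are illustrative and not taken from the implementation described in this work.

```python
from typing import List, Sequence, Tuple


def extract_ngrams(unigrams: Sequence[Tuple[int, float]], n: int,
                   mode: str = "both") -> List[tuple]:
    """Collect n-grams with a window of length n shifted by one unigram.

    `unigrams` holds (relative_pitch, relative_duration) tuples of one channel.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if mode == "melody":
        tokens = [pitch for pitch, _ in unigrams]
    elif mode == "rhythm":
        tokens = [duration for _, duration in unigrams]
    elif mode == "both":
        tokens = list(unigrams)
    else:
        raise ValueError("mode must be 'melody', 'rhythm' or 'both'")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    channel = [(4, -1), (3, 0), (5, 1.6), (-1, -1.6)]
    print(extract_ngrams(channel, 2, "melody"))
    print(extract_ngrams(channel, 2, "rhythm"))
    print(extract_ngrams(channel, 3))
```

A channel with k unigrams yields k - n + 1 n-grams, which is why the total count depends only weakly on n for long channels.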

Since it is not easy to identify equivalents of words in music, one can assign the
meaning of ‘words’ to n-grams. In NLP, character n-grams are treated in the same way as
single words and are usually used in tasks where word separation is not evident, such
as OCR, or when processing a language without evident word boundaries. Thai is one of
these languages. The sample of Thai shown in Figure 2.8 was taken from the webpage of
Mahidol University, Bangkok [26]:

Figure 2.8 The sample of Thai document

This language has the same features as the unigram music representation: it consists
of sequences of atoms (letters) that are not separated by additional symbols such as
whitespace. N-gram analysis is one of the techniques that work in this situation
[52].

2.3.3 Compression of N-grams representation

N-grams with larger values of n need more space to be stored. However, the
most frequent n-grams are those whose internal structure is very simple in this
representation. Consider the following situation. Two melodies, a simple one and a
complex one, are given in Figure 2.9:

Figure 2.9 Two sample melodies

The logical representations as 7-grams of both items are as follows:

1. (4,3,5,-1,-2,-4,-3;-1,0,1.6,-1.6,1,0,1),

2. (0,0,0,0,0,0,0;0,0,0,0,0,0,0).

One can see that the second example gives a very simple n-gram representation and,
importantly, n-grams of this kind are the most common ones. That is why one may try to
compress this representation in order to simplify further processing. The following
steps were proposed in order to compress these strings:

1. Replacing delimiters by ‘#’:

1) (4,3,5,-1,-2,-4,-3;-1,0,1.6,-1.6,1,0,1) => 4#3#5#-1#-2#-4#-3#-1#0#1.6#-1.6#1#0#1,

2) (0,0,0,0,0,0,0;0,0,0,0,0,0,0) => 0#0#0#0#0#0#0#0#0#0#0#0#0#0.

2. Removing ‘zeros’:

1) 4#3#5#-1#-2#-4#-3#-1#0#1.6#-1.6#1#0#1 => 4#3#5#-1#-2#-4#-3#-1##1.6#-1.6#1##1,

2) 0#0#0#0#0#0#0#0#0#0#0#0#0#0 => #############.

3. Choosing some additional symbols and replacing successively each symbol. In

this situation it is: ‘##’=>’$’, ‘$$’=>’@’, ‘@@’=>’%’, ‘%%’=>’&’ so the

following sequences can be compressed as follows:

1) 4#3#5#-1#-2#-4#-3#-1##1.6#-1.6#1##1 => 4#3#5#-1#-2#-4#-3#-1$1.6#-1.6#1$1,

2) ############# => $$$$$$# => @@@# => %@#.

Thus the first, complex sequence was hardly compressed, but the more frequent
pattern was compressed to the short ‘%@#’. Of course, this algorithm can also be
applied in one pass, but presenting it in steps makes it easier to follow.

Applying the steps in reverse order restores the original n-gram representation.

2.4 Related work

2.4.1 Musical Information Retrieval

Since the number of music documents has been increasing rapidly with the
development of computers and networking, handling these datasets has become a serious
problem. Music Information Retrieval (MIR) grew out of Information Retrieval (IR), the
field concerned with the structure, analysis, organization, storage, searching and
retrieval of relevant information from large textual databases. In the early days of IR
(the 1940s) the problem was to manage the huge scientific literature stored in textual
documents. The task of IR is to provide mechanisms to retrieve documents or texts whose
information content is relevant to the user's needs [53]. However, along with the
development of multimedia technology, the information content that needs to be made
searchable has changed its nature, from purely textual data to multimedia content
(text, images, video and audio). MIR is nowadays a growing international community
drawing upon multidisciplinary expertise from computer science, sound engineering,
library science, information science, cognitive science, musicology and music theory
[15]. MIR systems that are operational or in widespread use have been developed using
meta-data such as filenames, titles, textual references and other non-musical
information provided with a piece. Now researchers and developers face the task of
creating content-based MIR systems. The most advanced content-based systems working on
waveform data currently rely on the idea of a musical fingerprint. It consists in
creating a small set of features that can be easily extracted from a piece and
retrieving information based on these features [41].

The most important research in this context concerns symbolic music
representation. With the pitch and rhythm dimensions quite easily obtainable from music
data, one can obtain a textual string representation of the music and then try to apply
text-based techniques to solve MIR tasks. The main problem is to define the relation
between pitch and rhythm information and the textual representation of music.

2.4.2 Existing approaches

Various music representations have already been proposed. Buzzanca [8] proposed using symbolic note values, i.e. pitches such as c’, d’, c” and durations such as ‘quarter-note’ and ‘half-note’, instead of absolute values for pitch and duration. However, the task addressed there was the classification of highly prepared themes representing the same type of music. Moreover, these features were then given as input to a neural network, so one does not know what was really taken into consideration. This is the main drawback of neural networks: the network gives no feedback on whether our ideas and assumptions are good or bad. Thom ([55], [56]) suggests splitting the piece into bars. She contends that using a fixed-length sliding window could make the problem sparse. This is true; however, the research conducted in this work shows that modern computers can handle even such a sparse problem with good results and performance. The next example is the Essen Folksong Collection. It provides a large sample of (mostly) European folksongs that have been collected and encoded under the supervision of Helmut Schaffrath at the University of Essen (see [47], [48], [50]). Each of the 6,251 folksongs in the Essen Folksong Collection is annotated with the Essen Associative Code (ESAC), which includes pitch and duration information ([6], [7]). In this approach the pitch is given explicitly, while the time information is more flexible, because it gives the duration relative to the first (or shortest) note in the passage. Another approach was presented in [17], where the original MIDI pitch representation and absolute time values with 20 ms resolution are used.

In contrast to the approaches presented above, MIR researchers prefer an approach similar to the one presented in this work. The first such approach was introduced by Downie [14]. In that work, only pitch was encoded, as the interval between two consecutive notes. The intervals were then coded into letters as follows:

- ‘@’ stands for ‘no difference’ (perfect unison),

- lowercase letters stand for lower notes: ‘a’ is a minor second, ‘b’ is a major second … ‘g’ is a perfect fifth, ‘l’ is a perfect octave,

- capital letters stand for higher notes: ‘A’ is a minor second, ‘B’ is a major second … ‘G’ is a perfect fifth, ‘L’ is a perfect octave.
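The letter coding described above can be illustrated with a short sketch. It assumes that intervals are measured in semitones between consecutive MIDI pitch numbers and that intervals larger than an octave are simply not covered (the list above stops at the octave); the function names are hypothetical and are not taken from [14].

```python
def interval_to_letter(semitones: int) -> str:
    """Map a pitch interval in semitones to one character.

    0 -> '@'; +1..+12 -> 'A'..'L' (higher note); -1..-12 -> 'a'..'l' (lower note).
    """
    if semitones == 0:
        return "@"
    size = abs(semitones)
    if size > 12:
        # Assumption: the coding listed in the text ends at the perfect octave.
        raise ValueError("intervals larger than an octave are not covered here")
    base = ord("A") if semitones > 0 else ord("a")
    return chr(base + size - 1)


def encode_melody(midi_pitches: list[int]) -> str:
    """Encode a sequence of MIDI pitches as a string of interval letters."""
    return "".join(
        interval_to_letter(b - a) for a, b in zip(midi_pitches, midi_pitches[1:])
    )


# C4 C4 D4 E4 G4 C4: unison, +2, +2, +3, -7 semitones
print(encode_melody([60, 60, 62, 64, 67, 60]))  # prints "@BBCg"
```

The resulting string can then be cut into n-grams exactly like ordinary text.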

In this approach no information about time (duration) is stored. However, Downie claims that this is sufficient to treat such sequences as text and to perform successful n-gram-based retrieval. A more precise approach was presented by Doraisamy [13]. She encoded both pitch (as the interval to the previous note) and duration (as the ratio of the durations of two consecutive notes). However, she did not take the logarithm of this ratio. In the work on theme classification by Pollastri and Simoncelli [43], relative pitch and relative duration were also used. However, they quantized both dimensions as shown in Table 2.2:

Table 2.2 Pitch and rhythm quantization

Pitch                                  Rhythm
‘much higher’   Interval ∈ [3,∞]
‘higher’        Interval ∈ [1,2]       ‘faster’   Ratio<1
‘same’          Interval = 0           ‘same’     Ratio=1
‘lower’         Interval ∈ [-1,-2]     ‘slower’   Ratio>1
‘much lower’    Interval ∈ [-3,-∞]
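The quantization in Table 2.2 can be expressed as two small functions. This is only an illustrative sketch: the function names are hypothetical, and the interval is assumed to be an integer number of steps in whatever unit the table uses.

```python
def quantize_pitch(interval: int) -> str:
    """Quantize a relative pitch interval into the five classes of Table 2.2."""
    if interval >= 3:
        return "much higher"
    if interval >= 1:
        return "higher"
    if interval == 0:
        return "same"
    if interval >= -2:
        return "lower"
    return "much lower"


def quantize_rhythm(ratio: float) -> str:
    """Quantize a duration ratio of two consecutive notes into three classes."""
    if ratio < 1:
        return "faster"
    if ratio > 1:
        return "slower"
    return "same"


print(quantize_pitch(-2), quantize_rhythm(0.5))
```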


3 Corpus and its features

3.1 Building a musical data corpus

A set of MIDI files by different composers was collected. For better compatibility, only piano works were included in the corpus. Moreover, each piece in the set needs to be well-sequenced, i.e., each channel has to represent only one staff or hand. The aim of this work is to classify musical pieces by authorship, so one has to keep in mind that a composer thinks about one hand or voice at a time while composing a piece. Choosing the pieces that satisfy this criterion was the only manual preprocessing task carried out on this data. The reason for this requirement is that it is very easy to produce a MIDI sequence that sounds good but is internally disorganized. So if one is looking for score-like music, it has to be checked whether the MIDI file is well-sequenced in this sense.

Table 3.1 Composer Corpus

Composer                   Number of pieces   Total size
Johann Sebastian Bach      109                963 kB
Ludwig van Beethoven       44                 1399 kB
Frederic Chopin            58                 1052 kB
Wolfgang Amadeus Mozart    17                 448 kB
Franz Schubert             23                 1116 kB

A corpus of the following classical piano composers was set up: Johann Sebastian Bach, Ludwig van Beethoven, Frederic Chopin, Wolfgang Amadeus Mozart and Franz Schubert. The numbers of pieces and the sizes are given in Table 3.1 above.


When considering music files, one has to point out that pieces differ greatly in size. Some miniatures are quite tiny, but there are also very large forms, like concertos. Thus, it is better to describe the volume of a corpus in bytes rather than in number of pieces. The second important point is that the differences between composers vary depending on the composers’ backgrounds and lifetimes; e.g., the difference between Schubert and Bach is greater than that between Schubert and Chopin, because the latter two both lived in the 19th century. The lifetimes of the composers are shown in Figure 3.1 for a better understanding of the relations between them:

Figure 3.1 Composers Timeline

3.2 N-gram features

As shown in the previous sections, one can retrieve musical words from a MIDI corpus and then build a document-term matrix, well known from IR. It is large, it is sparse and it contains all the information about the documents in the corpus. The only problem is to retrieve the knowledge. A document-term matrix is a table where columns represent documents, i.e. each column is one piece, and rows represent terms, i.e. each row contains information about one n-gram in all documents. The value in a cell describes the relation between a term and a document. Many such relations are in use. In a binary matrix the value in the cell is positive if the corresponding document contains the corresponding term. In the majority of applications, this value is the number of occurrences of the n-gram in the given document. There are more sophisticated approaches, for example the tf.idf measure, but they are beyond the scope of this work, although the use of more refined measures may give better results.
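A minimal sketch of such a matrix is given below. It assumes that each document is already a sequence of tokens; single letters stand in here for the (pitch, duration) unigrams, and all function names are hypothetical. Rows are terms and columns are documents, as described above.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) produced by a sliding window of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(documents, n):
    """Build a count matrix: one row per n-gram, one column per document."""
    counts = [Counter(ngrams(doc, n)) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


def to_binary(matrix):
    """Binary variant: 1 if the document contains the term, 0 otherwise."""
    return [[1 if value > 0 else 0 for value in row] for row in matrix]


documents = [list("ABAB"), list("ABBA")]  # toy stand-ins for two pieces
terms, matrix = document_term_matrix(documents, n=2)
for term, row in zip(terms, matrix):
    print("".join(term), row)
```

For a real corpus the matrix is very sparse, so storing one hash table per document (as `counts` does internally) is usually preferable to the dense list of rows.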

Apart from the n-gram-to-piece affinity, two types of values are associated with n-grams. The first, quite straightforward one is the probability of the n-gram itself in all documents, which is simply the frequency of the n-gram in the corpus:

$$P(w_{k-n+1} \ldots w_k) = \mathrm{Freq}(w_{k-n+1} \ldots w_k) \qquad (3.1)$$

where P(wk-n+1...wk) is the probability of an n-gram, Freq(wk-n+1...wk) is the frequency of the given n-gram in the corpus and wi is the i-th (pitch, duration) pair.

This approach can be applied in recognition tasks: one can obtain the statistical distribution of n-grams in the corpus and reason from this information. The second approach is to calculate the conditional probability of the last unigram given the preceding n-1 unigrams. This is also known as the Markov assumption:

$$P(w_k \mid w_{0 \ldots N}) \cong P(w_k \mid w_{k-n+1 \ldots k-1}) \qquad (3.2)$$

It states that the probability of the unigram at the k-th position does not depend on the distribution of all unigrams in the document, but only on the n-1 preceding tokens.

According to Bayes’ rule, the conditional probability for each n-gram is:

$$P(w_k \mid w_{k-n+1 \ldots k-1}) = \frac{P(w_{k-n+1 \ldots k})}{P(w_{k-n+1 \ldots k-1})} = \frac{\mathrm{Freq}(w_{k-n+1 \ldots k})}{\mathrm{Freq}(w_{k-n+1 \ldots k-1})} \cdot \frac{\mathrm{Count}_{n-1}}{\mathrm{Count}_n} \qquad (3.3)$$

where P(wk-n+1...wk) denotes the probability of an n-gram, P(wk-n+1...wk-1) denotes the probability of an (n-1)-gram, Countn-1 is the total number of (n-1)-grams and Countn is the total number of n-grams. Since the ratio Countn-1/Countn is the same for all n-grams, the relation can be simplified to:

$$P(w_k \mid w_{k-n+1 \ldots k-1}) \approx \frac{\mathrm{Freq}(w_{k-n+1 \ldots k})}{\mathrm{Freq}(w_{k-n+1 \ldots k-1})} \qquad (3.4)$$
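Relation (3.4) amounts to dividing two occurrence counts. The sketch below illustrates this on a toy token sequence; the function name is hypothetical, letters stand in for (pitch, duration) unigrams, and returning 0 for an unseen context is an assumption made only for this example.

```python
from collections import Counter


def conditional_probability(tokens, context, next_token):
    """Estimate P(next_token | context) as Freq(n-gram) / Freq((n-1)-gram)."""
    context = tuple(context)
    n = len(context) + 1
    ngram_freq = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    prefix_freq = Counter(
        tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2)
    )
    if prefix_freq[context] == 0:
        return 0.0  # assumption: an unseen context gets probability 0
    return ngram_freq[context + (next_token,)] / prefix_freq[context]


tokens = list("ABABAC")
# "A" occurs 3 times; it is followed by "B" twice and by "C" once.
print(conditional_probability(tokens, ("A",), "B"))
print(conditional_probability(tokens, ("A",), "C"))
```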

This formula can be applied in music generation tasks, as shown by Ponsford, Wiggins and Mellish [44] as well as Rivasseau [46]. They built a system that statistically learns harmonic movements from a given corpus and then generates sample pieces that satisfy this statistical harmony, a task equivalent to the PCFG (Probabilistic Context-Free Grammar) task known from NLP.

The main problem in both these approaches is that if ‘n’ is too small, the method cannot tell anything about the problem, but if ‘n’ is too big, one faces a very large problem on sparse data, which requires a lot of computation, storage and memory.

3.3 Zipf’s law for music

Statistical research on musical data represented as n-grams shows close agreement with fundamental NLP and IR theories. The n-grams obtained from the corpus satisfy Zipf’s law, which constitutes a foundation of Information Retrieval. According to this law, the frequency of any term is roughly inversely proportional to its rank. If both axes are logarithmic, the relation should be linear. Figure 3.2 shows how this law is satisfied for the Polish poem “Pan Tadeusz”:

Figure 3.2 Zipf’s law for text

According to the investigations conducted in this work, Zipf’s law is also satisfied for the n-gram music representation, as presented in Figure 3.3. The investigations were conducted independently for the three types of n-grams defined in chapter 2:


Figure 3.3 Zipf’s law for music corpus

In the case of music, the rhythmic and melodic curves behave slightly differently, i.e., the rhythm curve has a steeper slope; however, the regularity is preserved.
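A rank-frequency check of this kind can be sketched as follows. The code ranks terms by frequency and fits a straight line to the log-log points by ordinary least squares; a slope close to -1 corresponds to the inverse proportionality stated by Zipf's law. The function names and the toy data are assumptions made for illustration, not the procedure used to produce the figures.

```python
import math
from collections import Counter


def rank_frequency(terms):
    """Return (rank, frequency) pairs, rank 1 being the most frequent term."""
    frequencies = sorted(Counter(terms).values(), reverse=True)
    return list(enumerate(frequencies, start=1))


def log_log_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Toy data with frequency exactly proportional to 1/rank (120, 60, 40, 30, 24).
terms = ["a"] * 120 + ["b"] * 60 + ["c"] * 40 + ["d"] * 30 + ["e"] * 24
pairs = rank_frequency(terms)
slope = log_log_slope(pairs)
print(pairs, round(slope, 6))
```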

3.4 Entropy analysis

3.4.1 Information entropy

The second investigation carried out on this data was to identify the position of the most important ‘words’ of the musical content. It turned out that the situation is similar to that of textual data. The most frequent n-grams behave like stop-words: they occur in every piece with almost the same probability. The least frequent n-grams, which occur only a few times, make up the majority of the lexicon and do not have any positive effect on tasks such as retrieval or classification. The most important ones lie in the middle of the rank axis.

A tool that sieves the profiles, extracting only the n-grams that give more information about a certain class (or classes) than others do, can play an important role in showing the correspondence between text and music. These ‘key-words’ might simply be separated from both ‘noise’ and ‘stop’ words using the same data-mining tools as those used in text processing, such as information gain or similar methods. In order to find them, the following experiment was conducted; first, however, the term ‘entropy’ should be explained.

Information entropy is a measure of the uncertainty associated with a random variable. It can be interpreted as the shortest average message length, in bits, that can be sent to communicate the true value of the random variable to a recipient, so entropy can also be interpreted as an amount of information. The information entropy of a set X containing n events that occur {x1...xn} times in the sample is defined as:

$$H(X) = \sum_{i=1}^{n} p(x_i) \log_2 \frac{1}{p(x_i)} = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) \qquad (3.5)$$

where the probability of each event is given by:

$$p(x_i) = \frac{x_i}{\sum_{k=1}^{n} x_k} \qquad (3.6)$$

The entropy is higher when the distribution of values in the set is flatter. The value is undefined if p(xi)=0; in this case the corresponding element of the sum was assumed to be 0. The entropy of an empty set (i.e., one containing no events) was also assigned the value ‘0’.
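Equations (3.5) and (3.6), together with the two conventions stated above, translate into a few lines of code. The function name is hypothetical; the input is a list of occurrence counts.

```python
import math


def entropy(counts):
    """Entropy in bits of a list of occurrence counts (eq. 3.5 and 3.6).

    Zero counts contribute nothing to the sum, and an empty or all-zero
    list has entropy 0, following the conventions adopted in the text.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(
        (c / total) * math.log2(total / c) for c in counts if c > 0
    )


print(entropy([1, 1]))     # two equally likely events: 1 bit
print(entropy([0, 0, 1]))  # a single event: 0 bits
print(round(entropy([5, 4, 5]), 2))
```

A flat distribution such as `[5, 4, 5]` gives a value close to the maximum log2(3) ≈ 1.58, which matches the statement that flatter distributions have higher entropy.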

Entropy analysis will be conducted on the data arranged in the document-term matrix described above. Table 3.2 contains examples of various types of terms to make the reasoning in the next paragraph easier to follow. It is a sample document-term matrix containing n documents, belonging to N classes, which together contain k terms:

Table 3.2 Sample document-term matrix

         class 1       class 2       class 3 ... class N-1    class N
         d1  d2  d3    d4  d5  d6    d7 … dn-3                dn-2  dn-1  dn
term1    0   0   1     5   4   5     ...0...                  2     0     0
term2    2   3   1     4   3   3     ..2..5..4..              2     3     2
term3    0   0   1     1   1   0     ...0...                  0     0     1
...      ... ... ...   ... ... ...   ...                      ...   ...   ...
termk    1   0   0     0   2   0     ...0...                  0     0     0

Table 3.3 Class entropies calculations

         class 1   class 2   class 3 … class N-1     class N   Entropy (N=5)
term1    0         1.58      …0…                     0         0
term2    1.46      1.57      ..1.26..1.12..1...      1.56      2.20
term3    0         1         …0…                     0         0
...      …         …         …                       …         …
termk    0         0         …0…                     0         0


Table 3.3 contains the precalculated entropy values for the successive classes, with an additional last column giving the entropy of the entropy values in a given row (assuming the number of classes equals 5).

3.4.2 Term ranking

A feature that is a good indicator for distinguishing between classes needs to occur quite frequently in all the documents belonging to a certain class (i.e. the entropy of this term within that class should be high), but has to be quite rare in the documents that do not belong to the class (i.e., the entropy of the entropies calculated for each class should be as small as possible). Thus, given that the maximum entropy on N elements equals log2N, the rank of each term is defined as:

$$R(i) = \max_{k \in 1..N} \big( H(i,k) \big) \cdot \big( \log_2 N - H(i) \big) \qquad (3.7)$$

where H(i,k) is the entropy of the i-th term inside the k-th class and H(i) is the entropy of all class entropies for the i-th term (the last column in Table 3.3). The rank is large if the term is a good discriminator between certain classes, and low if it does not discriminate well. The limiting value for being a key word is log2N, where N is the number of classes (for N=5, log25=2.32). This value is obtained when there are only two occurrences of the term and they happen to fall into the same class. The probability of this event is 1/N.
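The rank (3.7) can be checked numerically on the rows of Table 3.2. The sketch below is illustrative only: the function names are hypothetical, N=5 is assumed, and the columns elided in the table for the middle classes are replaced by a single zero count per class.

```python
import math


def entropy(counts):
    """Entropy in bits of a list of non-negative values (0 for an empty list)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def term_rank(class_counts):
    """Rank of one term (eq. 3.7) from its per-class lists of document counts."""
    n_classes = len(class_counts)
    class_entropies = [entropy(counts) for counts in class_counts]
    entropy_of_entropies = entropy(class_entropies)
    return max(class_entropies) * (math.log2(n_classes) - entropy_of_entropies)


# Rows of Table 3.2; the middle classes are assumed to contain only zeros.
term1 = [[0, 0, 1], [5, 4, 5], [0], [0], [2, 0, 0]]
term3 = [[0, 0, 1], [1, 1, 0], [0], [0], [0, 0, 1]]
termk = [[1, 0, 0], [0, 2, 0], [0], [0], [0, 0, 0]]

for name, row in (("term1", term1), ("term3", term3), ("termk", termk)):
    print(name, round(term_rank(row), 2))
```

Computed without intermediate rounding, term1 evaluates to about 3.66, term3 to log2 5 ≈ 2.32 and termk to 0.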

Table 3.4 Sample term ranking

Term     Rank    Group
…        …
term1    3.67    Key-words
…        …
term3    2.32    Noise-words
…        …
term2    0.19    Stop-words
…        …
termk    0       Noise-words
…        …

Listing all the terms sorted by Rank in descending order (Table 3.4) reveals the

following groups having:

1) R(i)>log2N (‘key words’),

2) R(i)= log2N (‘random pairs’),

3) R(i)< log2N (‘stop words’),


4) R(i)=0 (‘noise words’).

The first group contains the words that carry the most information about their classes. The second one is the random pairs group described above. This group sets the limit for our research. The terms from the third group carry less information than random words; these are stop words, which occur equally frequently in every group. They disturb the classification more than they help. The fourth group represents the words that occur at most once in every group. These words carry no information, but this is the largest group in the corpus, so leaving these terms out saves computation time and storage. The distribution of these groups is given in Figure 3.4 (the vertical axis shows the proportion of each group, the horizontal axis the log rank of each term assigned when calculating Zipf’s law). One can see that, as expected, key words occur in the middle of the scale (each group is labeled with the number of its position in the list shown above):

Figure 3.4 Entropy dist

This method can be used to sieve out noise in classification tasks. Experiments conducted on text classification show that the error rate decreased 3 to 4 times when this method was used to sieve the training data [21]. The method may also work for music classification [63].


3.5 Singular value decomposition

SVD (Singular Value Decomposition) analysis may also be performed in order to find the most frequent patterns in piece classes (composer groups) and sift out the noise. This is an unsupervised method of analyzing the data, so no additional information about membership in a particular composer’s group is needed.


Singular value decomposition, closely related to PCA (Principal Component Analysis), consists in converting a matrix to a new basis so that the values of successive dimensions are the lowest possible. In general, with almost 200 000 features (different n-grams) and 250 pieces, 250 dimensions are enough to distinguish precisely between those pieces (the simplest solution is for each dimension to be identical to the corresponding document vector), but SVD may simplify this situation. In other words, the aim of this method is to reach a situation where each consecutive dimension gives less and less information, so that one can cut off some dimensions while losing only insignificant information. What then remains is a few dimensions that can be visualized and analyzed. This is an unsupervised method, so no information about the composer of a piece needs to be given. This has many advantages, because if it succeeds, we may assume that after adding more pieces and more composers the result will still be satisfactory, i.e. every new composer should be detected and recognized. The main drawback of this method is that one does not provide any information to the program, so we have to rely on its insight (the same problem as with neural networks, described in the introduction).

A program that calculates the SVD of the document-term matrix obtained from the corpus was implemented. The program was based on the algorithm presented in [45]. The visualization was written using OpenGL.

The results produced by the program were not satisfactory. Figure 3.5 shows the eigenvalues of the resulting dimensions:

Figure 3.5 Eigenvalues for the corpus

The first k values that carry half of the useful information are taken into consideration. One can see that the first 20 dimensions contain more than half of the information. Therefore, by using them, we might be able to perform successful classification, but half of the information is still lost and we do not actually know what the dimensions really indicate. In order to find information about each composer’s contribution, the pieces were visualized in 3D space in an attempt to find the dimensions responsible for composer attribution. The investigations show that there is a certain regularity, but it is not significant, so one might conclude that an unsupervised approach to the composer recognition task is not what one is looking for. Moreover, twenty dimensions are still too many for 3D visualization, which requires three-element vectors. However, one can visualize different triples of dimensions. The following colors are used in the following figures: Bach (red), Beethoven (green), Chopin (blue), Mozart (yellow), Schubert (cyan). The first screen (Figure 3.6) shows dimensions 1, 2 and 3:


Figure 3.6 SVD dimensions: 1, 2, and 3
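The cut-off mentioned above, i.e. the first k dimensions that carry half of the information, can be sketched as follows. The text does not state how the share of information is measured; the sketch assumes it is the share of the sum of the eigenvalues, and both the function name and the sample values are hypothetical.

```python
from itertools import accumulate


def dimensions_for_share(values, share=0.5):
    """Smallest k such that the k largest values reach `share` of their sum."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    for k, running_sum in enumerate(accumulate(ordered), start=1):
        if running_sum >= share * total:
            return k
    return len(ordered)


# Made-up spectrum: 4 + 3 = 7 is the first partial sum reaching half of 10.
print(dimensions_for_share([4, 3, 2, 1]))  # prints 2
```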

Apart from the grouping of red dots on the right (Bach), no particular information about composership is stored in the most ‘informational’ dimensions.


In Figure 3.7 (dimensions 1, 4, and 6) one can observe a distinction between Bach (red dots) and Beethoven (green dots).

In Figure 3.8 (dimensions 4, 7, 8), Bach is visibly grouped in the center of the screen.

It turned out that composership is not a primary feature of a piece, because the very first dimensions do not distinguish between composer groups. On second thought, however, this may be obvious. The most important features of a piece are mode, length, tempo and form, and these are the features that may come out of the analysis. Some groups are revealed by deeper analysis, but they are observed after, not before, assigning colors to the pieces, which defeats the purpose of an unsupervised method.


4 The algorithm for composer attribution

4.1 Related work

A system for authorship attribution of texts was developed by Keselj, Peng, Cercone and Thomas [29]. They reported that a successful authorship attribution method can be applied to text using an n-gram-based statistical approach from natural language processing, with accuracy reaching 100%. The method is very simple in concept and might be successfully applied in other fields such as music.

Pollastri and Simoncelli [43] built a theme recognition system using Hidden Markov Models and report 42% accuracy among 5 composers. This is not a satisfactory result. However, they claimed, citing psychological research, that the ability of professionals to recognize themes is about 40%. They also used n-grams, as described in the previous section, and conducted their research on monophonic themes only.

Ponsford et al. [44] and Rivasseau [46] applied unsupervised learning of harmony in order to produce a set of grammar rules and generate random harmonic structures. Their reasoning was clear, but similar results may be obtained by applying probabilistic context-free harmonic grammar rules, although this would of course not be an unsupervised approach. Nevertheless, their results show that music can be investigated using statistical analysis.

A lot of work has been done to recognize some aspects of waveform data using different methods ([3], [18], [35]), but this field has so far not been investigated enough and the results are quite poor. The main problem in this field is that we still cannot interpret waveform data well, and without this insight the work remains a groping in the dark.

Kranenburg and Backer [30] conducted interesting research on classifying the composer from a fixed set of features, using machine learning methods to predict the results, but they did not clearly report the accuracy of their system. The main drawback of this system is that it relies on direct expert knowledge. Nowadays, most systems try to learn by themselves, because the amount of data is too large to be reviewed by people, but fortunately large enough for computers to extract useful knowledge and patterns from it.

A lot of work has been done in the field of psychology and psychoacoustics ([11], [64]). It shows that people are not very good at recognizing music in general, but they are very fast at recognizing well-known pieces.

A successful style recognition system was built by Buzzanca [8]. He used neural networks and reports 97% accuracy, but “highly prepared data” were used in this solution. “Highly prepared data” means that themes were selected from the pieces, rather than whole pieces being given for classification. With that in mind, this solution is not fully automated, because it involves lengthy preprocessing by users or experts, which is not the case in this thesis. Secondly, the use of neural networks cannot explain the observed behavior and results. It does not give insight into the features that distinguish between different composers. The system may work, but it will not increase human knowledge in this area. In the n-gram-based approach one assumes that the order of notes plays a role, and afterwards one can take the profiles and check which features (sequences of notes) characterize a composer’s contribution.

4.2 Algorithm

According to Wikipedia, document classification or categorization is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents. Document classification tasks can be divided into two sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification of documents, and unsupervised document classification, where the classification must be done entirely without reference to external information [12]. The problem of assigning composership to musical pieces is such a task. Musical pieces are electronic documents that can be processed, and useful information can be obtained from them. The previous section described how this can be done. It was also shown that a few regularities that occur in texts hold for music as well. Therefore NLP techniques may be applied to music content.

In the previous section it was shown that an unsupervised approach may not work in the case of composer recognition. Unsupervised methods are usually less effective than supervised ones, which is why the decision was made to try the following method.

In order to classify musical documents the following steps can be done:

1. splitting corpus into two sets, training and testing,

2. building profiles for each composer using information from training corpus,

3. building representation of each piece from the testing corpus,

4. comparing each of the testing documents to each profile,

5. judgment for the resulting assignments.

In the following sections each of these steps will be briefly described.

4.2.1 Testing and training set

The corpus was split into two sets. If a composer’s subset was big enough, 10 items were drawn from it. For smaller subcorpora, the number of elements was chosen so that the remaining training part was still big enough to train the classifier. The result of the split is presented in Table 4.1:

Table 4.1 Training and testing set split

Composer            Training Set          Testing Set
J. S. Bach          99 items, 890 kB      10 items, 73 kB
L. van Beethoven    34 items, 1029 kB     10 items, 370 kB
F. Chopin           48 items, 870 kB      10 items, 182 kB
W. A. Mozart        15 items, 357 kB      2 items, 91 kB
F. Schubert         18 items, 863 kB      5 items, 253 kB

The resulting number of testing items is not large enough for a very precise evaluation of the algorithm; however, it is still enough to show whether the method works. Building corpora is a tough problem because, as mentioned in the previous chapter, one has to choose only those MIDI files that are well-sequenced, while only a minority of the MIDI files available on the web satisfy this requirement.

4.2.2 Building profiles

In the next step, each item from the training corpus is used to build the composers’ profiles. Each document is preprocessed and three sets of n-grams are extracted: melodic n-grams, rhythmic n-grams and combined n-grams. Then the n-grams in each group are counted and each n-gram is stored together with its number of occurrences. The process of building such profiles is illustrated in Figure 4.1:

Figure 4.1 Building profiles

Trigrams are used in the example. In this case each n-gram occurs once; however, as far as whole pieces are concerned, some n-grams are more frequent than others. Each profile is a hash table where n-grams are keys and numbers of occurrences are values. As a result, one obtains three independent profiles, which are analyzed separately in the next steps.

Then, three profiles are created for each composer. Each profile is a vector of

features (n-grams) which is a join of all profiles of the same type from all pieces of a

composer. The process of joining profiles will be described in algorithm details section.
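The hash-table profiles described above can be sketched with a counter keyed by n-grams. The piece-level function follows the description directly; joining piece profiles by plain summation is only an assumption made for this illustration, since the actual join procedure is deferred to the algorithm details section. Letters stand in for the melodic, rhythmic or combined unigrams, and the function names are hypothetical.

```python
from collections import Counter


def piece_profile(tokens, n=3):
    """Hash table mapping each n-gram of a piece to its number of occurrences."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def composer_profile(piece_profiles):
    """Join piece profiles of one type (assumption: plain summation of counts)."""
    joined = Counter()
    for profile in piece_profiles:
        joined.update(profile)
    return joined


first = piece_profile(list("abcabc"))   # trigrams: abc, bca, cab, abc
second = piece_profile(list("abcd"))    # trigrams: abc, bcd
profile = composer_profile([first, second])
print(profile[("a", "b", "c")])  # prints 3
```

In the setting described above, three such profiles (melodic, rhythmic and combined) would be kept per composer.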

As already mentioned, some n-grams are quite frequent in all documents regardless of composership, while others occur only a few times. There are also n-grams that are true composer indicators, and they make the algorithm work as planned.

Building composers’ profiles is the last training step. At this point the entire

knowledge is stored in profiles and one can sell the system, if it is a commercial project

or share if it is a free distribution.

4.2.3 Building piece representation

The next step of the algorithm is logically the first step of the end-user part. It works similarly to the previous, profile-building part. Here, the piece to be recognized is converted to the same form as the original profiles: each piece is also represented as a vector (or a hash table) of n-gram occurrences, and these representations are then compared.

4.2.4 Profiles comparison

One may notice from the previous chapters that both the composer profiles and the profiles of the analyzed pieces have the same structure: a multidimensional vector. There are many methods of comparing such representations [9]. They are widely developed in NLP and IR, since documents are represented there in the same way, as vectors.

The most popular method used in IR is the cosine similarity measure. It is simply the cosine of the angle between the vectors representing the document and the composer’s profile. Following the definition of the cosine, it returns ‘1’ if the vectors are parallel (i.e. if the vectors are equal regardless of scale), ‘0’ if the vectors are orthogonal (the documents contain disjoint sets of words) and a value in between otherwise, according to the following formula:

$$\mathrm{CosSim}(\vec{x}, \vec{y}) = \frac{\sum (x_i \cdot y_i)}{\sqrt{\sum (x_i^2) \cdot \sum (y_i^2)}} \qquad (4.1)$$

where $\vec{x}$ and $\vec{y}$ are the documents’ vectors and xi and yi represent their i-th values. Of course, both vectors must have the same dimensionality, so if a feature does not exist in a profile, i.e. there was no such term in the document, ‘0’ is assumed.
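For profiles stored as hash tables, formula (4.1) can be evaluated without building full vectors, because missing n-grams count as 0. The sketch below is illustrative; the function name is hypothetical, and returning 0 for an empty profile is an assumption made to avoid division by zero.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity (eq. 4.1) of two sparse profiles given as dicts."""
    dot = sum(value * y.get(key, 0) for key, value in x.items())
    norm_x = math.sqrt(sum(value * value for value in x.values()))
    norm_y = math.sqrt(sum(value * value for value in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an empty profile is similar to nothing
    return dot / (norm_x * norm_y)


piece = {"abc": 1, "bcd": 2}
scaled = {"abc": 2, "bcd": 4}     # same proportions, twice the counts
disjoint = {"xyz": 3}
print(cosine_similarity(piece, scaled))    # close to 1
print(cosine_similarity(piece, disjoint))  # 0.0
```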

Cosine similarity has many advantages. It responds to vectors whose values are in the same proportions, not only to vectors with the same values. This is a frequent situation when comparing single documents against profiles of whole subcorpora. However, it behaves unfairly if some dimensions are incomparably larger (i.e. contain larger values) than others. In this situation the cosine similarity measure is biased toward the most frequent terms. This is undesirable, because the most frequent terms, known as stop-words, do not contain useful knowledge and mislead the assessment.

Another method for comparing profiles is proposed here. It is a modification of the method described by Keselj, Peng, Cercone and Thomas [29], which was used for comparing the profiles of text authors:

$$\mathrm{Sim}(\vec{x}, \vec{y}) = \sum \left( 4 - \left( \frac{2 \cdot (x_i - y_i)}{x_i + y_i} \right)^2 \right) \qquad (4.2)$$

This similarity measure consists in computing the difference of the values (xi − yi) relative to their mean (xi + yi)/2 for each dimension separately and then summing the results up.

This method avoids the situation in which the more complete profiles bias the verdict in their favor. Every n-gram may increase the final verdict by a small value between ‘0’, if the feature is missing from one of the vectors, and ‘4’, if the values are equal. Graphs of the possible component values depending on the feature’s values are shown in Figure 4.2, where the maximum values of four are marked by a dashed line. However, the measure is sensitive to the situation where the vectors are the same but scaled. This happens if a piece contains far fewer n-grams than a profile. Still, when the same piece is compared against different profiles, the results are larger or smaller depending on the piece size but remain comparable among all profiles (if the profiles are reasonably balanced).

Figure 4.2 Measure components for different n-grams values
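Measure (4.2) can be sketched for hash-table profiles as follows. Summing over the union of the n-grams of both profiles is an assumption of this sketch (n-grams absent from both profiles would give 0/0 and are skipped), and the function name is hypothetical.

```python
def profile_similarity(x, y):
    """Similarity (eq. 4.2): each shared or unshared n-gram adds between 0 and 4."""
    total = 0.0
    for key in set(x) | set(y):
        xi = x.get(key, 0)
        yi = y.get(key, 0)
        if xi + yi == 0:
            continue  # n-gram absent from both profiles
        total += 4 - (2 * (xi - yi) / (xi + yi)) ** 2
    return total


print(profile_similarity({"abc": 3}, {"abc": 3}))  # equal values: 4.0
print(profile_similarity({"abc": 3}, {"bcd": 3}))  # no common n-gram: 0.0
print(profile_similarity({"abc": 1}, {"abc": 3}))  # 4 - (2*(-2)/4)**2 = 3.0
```

The third call shows the scaling effect noted above: the same n-gram with different counts contributes less than the maximum of 4.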


4.2.5 Final judgment

For each piece one obtains 3n values, where n stands for the number of analyzed composers. Many judgment algorithms could be applied to find the most appropriate choice. This issue, however, concerns classification in general rather than composer classification, so it was decided not to investigate this parameter. The following steps were applied:

1. Sum up all the similarities for profiles of each composer,

2. Sort the sums in descending order,

3. Take the topmost composer as the result.

Sample calculations based on a real example are shown in Table 6.1 (on page 71). The example presented there also shows another aspect of this algorithm: at its foundations it evaluates the rhythmic and melodic adherence of each piece to the corresponding element of each composer’s style. Analyses of the algorithm’s behavior are given in the evaluation chapter.
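The three judgment steps can be sketched as follows, using the similarity values from Table 6.1 as input. The function name and the data layout are assumptions made for this illustration. The table values are rounded, so the sums computed here may differ slightly from the totals printed in the table.

```python
def final_judgment(similarities):
    """Sum the similarities per composer and rank composers by the sum."""
    totals = {
        composer: sum(scores.values())
        for composer, scores in similarities.items()
    }
    ranking = sorted(totals, key=totals.get, reverse=True)
    return ranking, totals


# Similarities of the Chopin prelude Op. 28 No. 22 (values from Table 6.1).
table_6_1 = {
    "Beethoven": {"melodic": 43.2, "rhythmic": 17.2, "combined": 11.0},
    "Mozart": {"melodic": 49.2, "rhythmic": 11.4, "combined": 6.4},
    "Bach": {"melodic": 62.4, "rhythmic": 8.2, "combined": 6.4},
    "Schubert": {"melodic": 19.3, "rhythmic": 13.2, "combined": 5.9},
    "Chopin": {"melodic": 86.8, "rhythmic": 25.1, "combined": 10.9},
}

ranking, totals = final_judgment(table_6_1)
print(ranking[0])  # the composer with the highest sum is the verdict
```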

4.3 Algorithm details

Before the algorithm is applied, one has to specify a number of other parameters which, unlike the comparison measure, are not essential to the algorithm itself but are inherent in its ‘runtime mode’ (as opposed to its logical, on-paper specification).

4.3.1 N-gram length

As described in the section on n-gram features, the use of n-grams follows from the Markov assumption. The value of n determines the so-called thought horizon of the algorithm. If n is too small, n-grams contain little information about the context of a note. If n is large, one has to train the algorithm on an ever larger corpus; otherwise the profiles suffer from a lack of data, most n-grams occur only a few times and the recognition system is unlikely to work.

The n-gram length was not limited in this work. However, the investigations showed that the n-gram length should not exceed several units.
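The role of n can be illustrated with a short sketch that cuts a unigram sequence into n-grams with a sliding window and counts them. The unigrams are treated here as opaque tokens; the function names and the sample sequence are hypothetical.

```python
from collections import Counter


def ngrams(unigrams, n):
    """Return all contiguous n-grams of a unigram sequence as tuples."""
    return [tuple(unigrams[i:i + n]) for i in range(len(unigrams) - n + 1)]


def count_profile(unigrams, n):
    """Count how often each n-gram occurs in the sequence."""
    return Counter(ngrams(unigrams, n))


sequence = ["a", "b", "a", "b", "c"]
print(count_profile(sequence, 2))  # short n-grams repeat often
print(count_profile(sequence, 4))  # long n-grams are mostly unique
```

Even in this tiny example the longer n-grams occur only once each, which is the data sparsity problem described above.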


4.3.2 Aging factor

Aging consists in lowering the values in the profiles during learning by multiplying them by the aging factor. This operation should be repeated before each training document is added. Aging may lose some of the information that comes from the very first pieces, but it makes it possible to generalize the information stored in the training pieces. The mechanism is useful if some pieces contribute corrupted data that may damage the accuracy of the profiles. To illustrate this, the following example was prepared. Let us assume there are two types of incoming n-grams: good but rare ones (each occurs twice in every piece) and misleading ones, which occur five times less frequently but are five times stronger. The training process on these data is shown in Figure 4.3:

Figure 4.3 Aging example

It turned out that the larger but less frequent event was remembered almost four times more weakly. Without aging, both features would end up with a value of 20.
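The effect can be reproduced with a small simulation. This is only a sketch under stated assumptions: ten training pieces, the misleading n-gram placed in every fifth piece, and an aging factor of 0.8. The actual factor and ordering behind Figure 4.3 are not given in the text, so the exact ratio from the figure is not reproduced here.

```python
def train_with_aging(documents, aging_factor):
    """Build a profile, multiplying it by the aging factor before each document."""
    profile = {}
    for document in documents:
        for key in profile:
            profile[key] *= aging_factor
        for key, value in document.items():
            profile[key] = profile.get(key, 0.0) + value
    return profile


# Hypothetical training data: "good" occurs twice in every piece,
# "misleading" occurs in every fifth piece but is five times stronger.
documents = []
for index in range(10):
    document = {"good": 2.0}
    if index % 5 == 0:
        document["misleading"] = 10.0
    documents.append(document)

print(train_with_aging(documents, 1.0))  # no aging: both values reach 20
print(train_with_aging(documents, 0.8))  # aging: the misleading value is lower
```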

4.3.3 Normalization

Normalization is the process of equalizing the values in profiles to ensure that no composer gains an advantage merely because his profiles are more complete than the others. Testing whether to normalize was also one of the aims of my research.

One may also consider normalizing the profiles of the pieces that are compared against the composers’ profiles; however, it was decided not to pursue this point. As a result, smaller pieces have smaller cumulative similarities but, as already mentioned, the results for a given piece remain comparable across profiles.


4.3.4 Profiles size

The size of the profiles reflects the non-functional limitations of the system. The number of distinct n-grams for n=6 reaches 200,000 even for this relatively small corpus, and many of them occur only a few times. Current machines should easily handle such data, so it was decided not to treat resource usage as a concern. The parameter was introduced for two other reasons. The first was to observe the problems and regularities that may occur when such a system is applied to much larger datasets. The second was to observe how the size of the profiles affects other parameters such as n-gram length, normalization or aging factor.

The final system does not limit the size of the profiles. Looking ahead, the final release uses less than 100 MB of RAM and the processing delays were acceptable. However, discarding some information with a more sophisticated method, such as entropy sieving, may lower the resource requirements of the system without a significant drop in its performance.
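One simple way to limit a profile is sketched below. It assumes that the n-grams with the largest values are kept; the text does not state which limiting criterion was used, so this rule and the function name are assumptions.

```python
import heapq


def limit_profile(profile, max_size):
    """Keep only the max_size n-grams with the largest values."""
    if len(profile) <= max_size:
        return dict(profile)
    largest = heapq.nlargest(max_size, profile.items(), key=lambda item: item[1])
    return dict(largest)


profile = {"a": 5.0, "b": 1.0, "c": 3.0, "d": 0.5}
print(limit_profile(profile, 2))
```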


5 Composer recognition system

5.1 Functionality

The implementation scheme is given in Figure 5.1. Squares stand for the units that analyze the data. The preprocessor was implemented in C++ and converts input MIDI files into the unigram representation, called Musical TXT (MTXT) files because the representation is linear, which makes it correspond to written text. The trainer creates the profiles from the training data and stores them in external files. The classifier classifies every testing file in three ways, using three types of n-grams (rhythmic, melodic and combined) independently, and then merges the results to find the final verdict, as explained in Section 4.2.5.

The system should also be able to store profiles in external files and allow saving and loading them, so that they can be used in different locations and at different times.

It should also provide the ability to create composers’ profiles for the various system parameters described in the previous chapter.


Figure 5.1 System scheme

5.2 Project

Some additional system features should be analyzed before implementation. The following sections briefly describe how composers’ profiles and the MIDI corpus are stored. Other features, such as the system structure, are described in the implementation section.


5.2.1 CDB file

To store the information about a single set-up, a composer database (CDB) file type is defined. All the required information is stored in a zip archive (renamed to .cdb) that contains the following files:

1. settings.dat – file containing information about database settings, n-gram

length, aging factor and profiles’ size,

2. composers.dat – file with a list of composers containing composers names and

their IDs,

3. the following files are stored for each $composer from composers.dat:

3.1. $composer.piecelist – list of trained pieces with various additional

meta-information,

3.2. $composer/* – MIDI files associated with certain composer. No

additional information about MTXT files is stored since it is quick and easy

to obtain n-gram’s representation directly from the MIDI file,

3.3. $composer.$type.profile – profile hashes for each $type from the

following list: rhythm, melody and both, containing information about each

profile type for each composer.
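The archive layout listed above can be sketched with the standard zipfile module. This is only an illustration of the member names: the internal format of each file is not specified in the text, so the members are written empty, and the composer identifier and MIDI file name used here are hypothetical.

```python
import io
import zipfile

PROFILE_TYPES = ("rhythm", "melody", "both")


def cdb_member_names(composers, midi_files):
    """List the archive members expected for the given composers."""
    names = ["settings.dat", "composers.dat"]
    for composer in composers:
        names.append(f"{composer}.piecelist")
        for midi_name in midi_files.get(composer, []):
            names.append(f"{composer}/{midi_name}")
        for profile_type in PROFILE_TYPES:
            names.append(f"{composer}.{profile_type}.profile")
    return names


def write_empty_cdb(fileobj, composers, midi_files):
    """Write a zip archive with the CDB layout and placeholder contents."""
    with zipfile.ZipFile(fileobj, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in cdb_member_names(composers, midi_files):
            archive.writestr(name, b"")  # placeholder, real format unknown


buffer = io.BytesIO()
write_empty_cdb(buffer, ["chopin"], {"chopin": ["prelude.mid"]})
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    listed = archive.namelist()
print(listed)
```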

5.2.2 Importing MIDI files

One way of managing external files is not to manage them at all and to let users keep the files in their own filesystem. A user would then have to browse for the files each time they are added to a database, classified or moved out of a database. This approach is simple but may be inconvenient for maintaining this kind of data.

The other approach is to store MIDI files inside the program’s own structures so that they are easy to manage. It was decided to require the user to import MIDI files into the program before they are used for training or testing; the files are copied into the system’s structures at import time. This requires the system to store and maintain the whole set of MIDI files. However, MIDI files are small, so the storage (disk) requirements remain modest. Moreover, the user is spared the trouble of moving pieces in and out of the profiles and across different databases, since the files are available in the program all the time.


5.3 Implementation

The composer recognition system is implemented in the Perl scripting language with the Tk GUI toolkit distributed with the Perl package. I used Perl version 5.8.8.820 distributed by ActiveState [1] under the Artistic License [2]. Additionally, the MIDI parser was implemented in ANSI C, so it can be compiled with many popular compilers. I compiled it with the compiler provided with Visual Studio 2005 on Win32 and with gcc on UNIX (for tests). I used the zip package provided by Info-ZIP [28], which is distributed freely for any purpose.

5.3.1 Packages overview

The system is divided into three subsystems, complemented by a runtime script. Their structure is shown in Figure 5.2:

Figure 5.2 System structure

1. Engine – subsystem that provides the fundamental system functionality, such as creating profiles, managing databases and passing judgments,

2. UI – subsystem responsible for graphical user interaction, containing the business procedures for all functionality provided by the Engine API,

3. Utils – subsystem containing functionality that is not inherent to the system but is vital for its operation, such as the OS API, the packager API and the preprocessor API,

4. Run.pl – runtime script.

All subsystems are described in the following sections.


5.3.2 Engine subsystem

Engine subsystem contains the following modules (Perl packages):

5.3.2.1 DataConnectionMgr.pm

This package provides an API for loading, saving and creating new CDB (composers’ database) files. While loading and creating a database, it informs the user about the progress and refreshes the information held in other packages.

5.3.2.2 ListMgr.pm

This package handles external MIDI files. These files can then be added to profiles or classified against existing profiles. It was assumed that external MIDI files should be imported into the system even if they are not included in profiles. This enables easy transfer of files among different profiles without accessing the local filesystem. The ListMgr package is responsible for adding, removing and loading these files and provides functions for accessing them.

5.3.2.3 ProfileMgr.pm

This package handles information about the current configuration, such as composers’ information, created profiles and system parameters. It loads the current configuration from CDB files and saves it, and allows composers to be added and removed and files to be added to and removed from profiles. The ProfileMgr package also contains the functions that compare profiles using the comparison algorithm presented in the previous chapter.

5.3.2.4 MidiProcessor.pm

It contains a set of functions that retrieve n-grams from MIDI files for other packages. It also implements the n-gram packing algorithm.

5.3.3 Utils subsystem

Utils subsystem contains the following modules (Perl Packages and programs):

5.3.3.1 color.pm

This package implements color operations and generates colors with the desired parameters for all UI packages.

5.3.3.2 debug.pm

This package handles information about the environment of the application and provides some useful information about system performance and status.


5.3.3.3 hashop.pm

It provides basic operations on the hashes that represent document vectors and composers’ profiles. The functionality includes adding and subtracting two vectors (for adding files to and removing files from profiles), multiplying a vector by a scalar (for aging) and various vector limiting functions (for limiting profile sizes).
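The vector operations named above can be sketched on dictionaries, which play the role of Perl hashes here. This is not the code of hashop.pm; the function names are hypothetical, and dropping entries whose value falls to zero or below after subtraction is an assumption.

```python
def add_vectors(a, b):
    """Return the element-wise sum of two sparse vectors."""
    result = dict(a)
    for key, value in b.items():
        result[key] = result.get(key, 0) + value
    return result


def subtract_vectors(a, b):
    """Return a - b, dropping entries that are no longer positive."""
    result = dict(a)
    for key, value in b.items():
        remaining = result.get(key, 0) - value
        if remaining > 0:
            result[key] = remaining
        else:
            result.pop(key, None)
    return result


def scale_vector(a, factor):
    """Multiply every value by a scalar, as done for aging."""
    return {key: value * factor for key, value in a.items()}


profile = add_vectors({"x": 2}, {"x": 1, "y": 3})    # add a file
restored = subtract_vectors(profile, {"x": 1, "y": 3})  # remove it again
print(profile, restored, scale_vector(restored, 0.5))
```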

5.3.3.4 refresh.pm

It provides functionality that ensures all application windows are refreshed.

5.3.3.5 settingsMgr.pm

It stores information about the opened database between launches of the system. It also provides a function that creates a Composer ID that is unique within the system.

5.3.3.6 system.pm

This package contains functions that access the operating system. The actions that it takes depend on the operating system that the program is running on. It provides various file listing functions, storing and loading of document vectors on disk, archiving and restoring data from zip files, copying files and managing folders.

5.3.3.7 zip.exe and unzip.exe

Tools that pack and unpack data for storage purposes when the program is run on Windows.

5.3.3.8 preprocessor.cpp and preprocessor.exe

Preprocessing programs for retrieving unigrams described in chapter 2.

5.3.4 UI subsystem

UI subsystem contains these modules, usually containing widget creation

methods:

5.3.4.1 AddToDatabaseWindow.pm (Figure 5.3)

This package creates the dialog window for adding external MIDI files to the profiles of the composer specified in the Composer field. The Save source option indicates whether the original MIDI files are included in the CDB file. Choosing this option results in a larger file but allows the MIDI files to be retrieved later and conversion operations to be performed on composers’ profiles. Pieces are added in the order in which they appear on the list. The buttons on the left can be used to change this order, which matters if aging is applied.

Figure 5.3 Adding to database window

5.3.4.2 ComposerWindow.pm (Figure 5.4)

This is a small modal window for adding new composers to the database. After typing the composer’s name, one can press Enter to accept the new composer, or press Tab to accept it and add another one, in which case an empty Add new composer window pops up.

Figure 5.4 Adding composer window

5.3.4.3 DBSettingsWindow.pm (Figure 5.5)

This window pops up when the New database action is triggered. It allows the system parameters to be chosen, such as n-gram length, profile size or aging factor.


Figure 5.5 Database settings window

5.3.4.4 MainWindow.pm (Figure 5.6)

This package builds the main window of the application. The window is split into two main parts. The left part contains the current database view with composers and already trained pieces (with information about the age of each piece). Colors indicate the status of a piece: red is used when no MIDI source is available for the piece, green when the MIDI file is stored in the CDB file. The right part lists the MIDI resources, i.e. the files imported into the system. For each piece, information about the channels that contain notes, including the number of n-grams in each channel, may be displayed. Each list is followed by a set of function buttons.


Figure 5.6 Application main window

5.3.4.5 Message.pm

This package is responsible for popping up messages to the user.

5.3.4.6 RecognizeWindow.pm (Figure 5.7)

This package creates the recognition window. It performs recognition on the selected pieces and presents the results to the user. Each piece is assigned its similarity values for all composers, followed by a final verdict. The user is informed about the progress by progress bars at the bottom of the window.


Figure 5.7 Application: recognizing window

5.3.4.7 DBTree.pm

It holds the database list (the left pane of the main window). It creates the list widget and manages operations on the list, such as refreshing, adding and removing composers and files, and retrieving the selected items.

5.3.4.8 FileTree.pm

The purpose of this package is the same as DBTree.pm but it is used for

managing external files list, on the right side of main window.

5.3.4.9 Menu.pm

This package creates a menu and binds actions to menu items.

5.3.4.10 progress.pm

This package is responsible for informing the user about the progress of actions

connected with CDB operations such as loading, saving databases. It creates progress

bar at the bottom of main window.


5.3.4.11 WindowList.pm

It manages the proper refreshing of windows. It allows the Utils::refresh package to refresh all currently opened windows during time-consuming operations.

5.3.4.12 graphicd.pm and icons folder

It is responsible for loading all required icons and provides easy access to them.

5.3.4.13 cmd.pm

This package contains all business procedures of the application. All actions in the menu, the main window and the other windows point to functions in this package. The cmd.pm package contains business scenarios for the following actions:

1. creating and removing composers,

2. creating, opening, saving and ‘saving as’ databases,

3. adding file, adding folder, removing items from external MIDI files list,

4. moving files in and out the database from MIDI files list,

5. recognizing action,

6. closing application.

5.3.5 Running script – Run.pl

It runs the application. When run on Win32, it releases the console and lowers the priority class of the program (for better overall system performance during complex and time-consuming operations). It sets the application’s environment settings, creates the main window of the application and loads the most recently used database.

5.3.6 Testing plug-in

The system was also provided with a script for testing the algorithm. Since in-depth analyses of the system are time-consuming, the script was designed to apply different parameters to the system for the same data, to run the tests automatically and to write the results to an external file.


6 Analysis of the results

A set of tests of the system was conducted for various parameters using the corpus described in the previous chapter. The tested parameter space includes the following parameters:

1. normalization – with or without,

2. similarity measures: cosine similarity, proposed similarity,

3. n-gram lengths: 2, 3, 4, 5, 6, 7, 9, 12,

4. profile sizes: 100, 250, 500, 1000, 2500, 5000, 10000,

5. aging factors: 0.7, 0.8, 0.85, 0.9, 0.96, 0.99 and without aging.

Accuracy, i.e. the ratio of the number of correctly assigned pieces to the total number of tested files, was used as the evaluation measure. The small number of available testing files does not allow more sophisticated measures such as precision or recall to be applied, because they would not be reliable enough. Nevertheless, accuracy is good enough to show whether the algorithm works.
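The accuracy measure defined above amounts to the following short sketch; the function name and the sample labels are hypothetical.

```python
def accuracy(true_composers, predicted_composers):
    """Ratio of correctly assigned pieces to all tested pieces."""
    if len(true_composers) != len(predicted_composers):
        raise ValueError("both lists must describe the same pieces")
    if not true_composers:
        raise ValueError("at least one tested piece is required")
    correct = sum(
        truth == verdict
        for truth, verdict in zip(true_composers, predicted_composers)
    )
    return correct / len(true_composers)


truth = ["Bach", "Chopin", "Mozart", "Bach"]
verdicts = ["Bach", "Chopin", "Bach", "Bach"]
print(accuracy(truth, verdicts))
```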

Testing consists in repeating all the steps (training, classifying) for various parameter values. Although the tests were run on a four-processor SPARC Solaris machine, they sometimes took many days, because I decided to test the parameter space exhaustively.
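The size of the exhaustive search can be estimated by enumerating the listed parameter values. The sketch below assumes a full Cartesian product of the five parameters and represents "without aging" as a factor of 1.0; the names are hypothetical.

```python
from itertools import product

NORMALIZATION = (False, True)
MEASURES = ("cosine", "proposed")
NGRAM_LENGTHS = (2, 3, 4, 5, 6, 7, 9, 12)
PROFILE_SIZES = (100, 250, 500, 1000, 2500, 5000, 10000)
AGING_FACTORS = (0.7, 0.8, 0.85, 0.9, 0.96, 0.99, 1.0)  # 1.0 = without aging


def parameter_grid():
    """Yield every combination of the tested parameter values."""
    combinations = product(
        NORMALIZATION, MEASURES, NGRAM_LENGTHS, PROFILE_SIZES, AGING_FACTORS
    )
    for normalize, measure, length, size, aging in combinations:
        yield {
            "normalize": normalize,
            "measure": measure,
            "ngram_length": length,
            "profile_size": size,
            "aging_factor": aging,
        }


configurations = list(parameter_grid())
print(len(configurations))  # number of train-and-classify runs
```

Each configuration requires a complete training and classification run, which is consistent with the long testing times mentioned above.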

6.1 Results interpretation

Running the experiments with a program that outputs all intermediate results makes it possible to look inside the algorithm’s operation. The judgment is unambiguous because the assignment of pieces to groups is well defined, i.e. one knows for sure who the author of a piece is. Many problems lack this certainty, for instance text classification by topic. Of course, when talking about a certain composer we usually have his style in mind. Bach wrote his compositions in Bach’s style and F. Chopin had Chopin’s style, but one also talks about influences between composers. These influences can also be observed in the following examples:

6.1.1 Proper judgment

Example of the successful judgment is shown in Table 6.1:

Table 6.1 Evaluation of the Frederic Chopin prelude Op. 28 No. 22

Composer    Melodic  Rhythmic  Combined  Total  Verdict
Beethoven      43.2      17.2      11.0     71        3
Mozart         49.2      11.4       6.4     67        4
Bach           62.4       8.2       6.4     77        2
Schubert       19.3      13.2       5.9     38        5
Chopin         86.8      25.1      10.9    122        1

According to this example, the prelude shows a high affinity to Chopin, although it exhibits a Bach-like melody combined with a typical romantic rhythmic structure (high marks for Beethoven and Schubert). It is a well-known fact that Chopin was fascinated by Bach’s compositions. Before playing in concert he shut himself up and played, not Chopin but Bach, always Bach [27]. These facts were brought out by the composer recognition system.

6.1.2 Wrong judgment

Table 6.2 shows the results for Beethoven’s training sonata:

Table 6.2 Evaluation of the Ludwig van Beethoven Sonata Op. 49 No. 2


Composer    Melodic  Rhythmic  Combined  Total  Verdict
Beethoven     303.7     208.8     109.7    622        4
Mozart        319.2     201.7     124.4    645        2
Bach          366.1     263.0      83.7    712        1
Schubert      315.6     201.8     119.1    636        3
Chopin        296.5     127.3      79.0    502        5


This composition is not a typical Beethoven work. Nevertheless, it was drawn as a member of the testing corpus. Because it is a simple piece in the style of his predecessors, Bach and Mozart, these two composers were ranked at the top. One can also see how low Chopin was ranked in terms of rhythmic structure. This is because the piece has a very simple rhythmic structure, which is unusual for Chopin.

6.1.3 Unseen composers

The algorithm behaves surprisingly well even for unseen composers. The result for a piece by Liszt is shown in Table 6.3. F. Liszt was a contemporary of F. Chopin (they were born almost at the same time). The system did not know Liszt’s compositions; however, it ordered the known composers by how close their lifetimes were to that of F. Liszt (sic!).

Table 6.3 Evaluation of the Franz Liszt Concert Etude No. 3 ‘Un sospiro’


Composer    Melodic  Rhythmic  Combined  Total  Verdict
Beethoven     203.8     115.9      76.6    396        2
Mozart        198.8      60.7      45.5    305        4
Bach          178.8      42.8      36.0    257        5
Schubert      192.0      93.6      82.6    368        3
Chopin        309.0     109.0      88.5    506        1

A small set of pieces by other composers was collected separately. The assignments made by the system, together with my opinion on whether each choice was reasonable (√) or not (×), are presented in Table 6.4. Only professionals are qualified to pass such judgments, but these results look good.

Table 6.4 Unknown Composers assignments

Piece                                   Assignment
L. Boccherini Badinerie                 Chopin     ×
A. Borodin Prince Igor                  Schubert   √
C. Debussy Golliwogs Cakewalk           Chopin     √
C. Debussy Petit Negre                  Beethoven  ×
L. Delibes Lakme                        Bach       ×
E. Grieg Wedding March                  Schubert   √
J. Haydn Arietta                        Bach       √
J. Haydn Capriccio                      Schubert   ×
S. Joplin Entertainer                   Bach       ×
F. Liszt Un Sospiro                     Chopin     √
F. Mendelssohn Characteristic Piece     Chopin     √
F. Mendelssohn Christmas Piece          Chopin     √
J. Pachelbel Canon                      Mozart     √


6.2 Algorithm evaluation

The best results for regular pieces (from training set) are given in Table 6.5. They were obtained without normalization, using the proposed similarity measure. With large profiles and an n-gram length of about 6, the system reaches about 85% accuracy. Columns represent the profile size, rows indicate the n-gram length, and the results are given for an aging factor of 0.96:

Table 6.5 Algorithm results of aging 0.96

n \ size   100   250   500    1k  2.5k    5k   10k
2           41    38    38    35    32    43    43
3           46    54    59    62    59    51    43
4           62    70    65    73    73    78    86
5           54    62    70    78    78    81    81
6           54    59    68    68    84    78    84
7           46    49    68    68    68    70    84
9           46    57    49    51    57    68    76
12          41    46    41    41    41    46    49

It should be pointed out that a random classifier would have an accuracy of 20% for five composers, so a result above 80% is very good and shows that the system really works. Moreover, some pieces were written by a composer in a style different from his usual one, and even people find it really hard to assign a piece they do not know to the proper class.

6.2.1 Profiles comparison

I tested the algorithm with both the cosine similarity measure and the one proposed in this thesis. Cosine similarity behaves quite well. However, for the best parameters its accuracy reaches 65%, which is still less than with the proposed method. This probably results from the fact that the most frequent words drive most of the final verdict, while it was shown that the most important words lie in the middle of the frequency ranking.

6.2.2 Normalization

One might think that the answer to the question of whether to normalize profiles is obviously yes. The investigations show that this is not quite true: normalization helps, but not in the region of interest. It turned out that normalization is necessary for small profiles and small n-gram lengths, whereas for larger values of these parameters it makes the results slightly worse. A kind of normalization is already provided by the aging mechanism, which damps incidental growth of values. The results for normalized profiles, with the other parameters the same as for Table 6.5, are shown in Table 6.6, while the differences between the results are presented in Figure 6.1. The ‘∆’ value describes the accuracy advantage of normalized profiles over regular profiles:

Table 6.6 Algorithm results of aging 0.96 with profiles normalization

n \ size   100   250   500    1k  2.5k    5k   10k
2           62    68    73    57    62    62    54
3           59    68    68    62    76    62    54
4           41    65    65    70    73    76    76
5           43    62    70    76    70    78    78
6           46    59    59    68    76    78    81
7           32    57    68    68    65    78    84
9           43    43    54    54    43    57    73
12          19    32    46    41    32    46    41

Figure 6.1 Normalization’s influence to the results


6.2.3 N-gram length

The experiments show that the best results were obtained for n-gram lengths of about 6 or 7. Since a unigram describes two notes, not one, an n-gram of length n spans n+1 notes, so one can infer that an average phrase contains 7 or 8 notes. This is the value that was expected. Similar values were obtained by Downie [14]. The result can be seen as another voice in the discussion about the average length of a musical phrase.

6.2.4 Profile’s sizes

It turned out that the profile size can be limited to improve system performance. The differences in accuracy between unlimited and large profiles are imperceptible, especially in the absence of a large testing and training corpus. It should be noted that the best n-gram length grows as the size of the profiles increases.

6.2.5 Aging factor

The best results are obtained around an aging factor of 0.96, for large profiles and n-gram lengths of about 6 or 7 (7 or 8 notes). The problem with the highest aging factor is that some pieces have misleading contents, and without aging these misleading features are preserved for the final classification. The aging factor represents the generalization ability of the classifier. The maximal accuracies obtained for different aging factors are shown in Table 6.7. These results lead to the conclusion that the aging factor should be as high as possible while still remaining below 1.0.

Table 6.7 Maximal accuracies for different aging factors

Aging factor 0.7 0.8 0.85 0.9 0.96 0.99 1.00

Best accuracy 69% 75% 75% 84% 86% 86% 81%

6.2.6 Representative data

The algorithm was also tested on testing data selected from the dataset so that the pieces in the testing set are representative of each composer rather than chosen randomly. Very good results were obtained in this case (Table 6.8), because the tests were conducted without the misleading contents described above.


With these data an accuracy of 100% can be reached, and over 90% was easy to obtain. This shows that the results for a random test set are around 80% rather than near 100% because each composer wrote some pieces that are not in his own style.

Table 6.8 Results for representative data

n \ size   100   250   500    1k  2.5k    5k   10k
2           38    31    25    38    38    38    38
3           50    69    69    62    75    44    38
4           69    81    81    88    81    94   100
5           62    81    81    81    88   100    94
7           56    75    81    75    81    81    88
9           44    75    75    69    69    88    88
12          44    62    69    69    56    69    56

This conclusion leads to the next one: the classifier recognizes the style as well as the “hand” of a composer. This is both good and bad, because it shows that the “hand” of a composer is not the only indicator, and that this method may recognize the style and genre of music better than the authorship.

The classifier also behaves very well when tested on the training data. In this case aging is not needed and the best results are obtained for an aging factor equal to ‘1.0’. However, the classifier then does not recognize a style but tries to remember the piece.

6.2.7 Key-words-based classification

One can also investigate whether the algorithm works better for the n-grams that are key-words only. In this case only a small part of each profile is taken into consideration, since entropy sieving is applied to the profiles (page 44) and only key-words are left. This is usually from 5 to 15 percent of the content of the full profile and also depends on the initial size of the profile. The experiment that measures the accuracy of the system with and without sieving shows that the results are quite poor compared to those obtained for subject-based text classification [21] (Figure 6.2). The reason is probably that a lot of useful information lies between the keywords (or key n-grams), and this information is more important for complex classification tasks such as composer classification than for plainer ones such as subject categorization. Nevertheless, since the new profiles are about 10 times smaller than the original ones, the classification process requires about 10 times less memory and processor time.


Figure 6.2 Accuracy of sieved and full profiles


7 Conclusions

The analysis shows that music can be treated as a natural language and is thus amenable to NLP and IR tools. Converting music to a flat, textual representation allows these methods to be applied directly, which was exploited in this work. The full process of obtaining an n-gram representation of music from MIDI files has been shown, from a brief introduction to the MIDI file specification, through the description of the MIDI parser, to the method of retrieving n-grams.

The n-gram representation of musical data is strikingly similar to text, since a number of classic NLP and IR regularities, such as Zipf's law or the distribution of keywords, hold for musical n-grams as well. A small corpus was collected to carry out these investigations; however, it should be pointed out that it was a toy database. In order to conduct detailed analyses of algorithms in the field of music processing, a broad community project needs to be set up to create a large, freely distributed, well-tagged corpus of musical data. Only then can one talk about the comparison and performance of various algorithms.

An algorithm for composer recognition was proposed and a system based on this approach has been developed. A system accuracy of over 80% was reported for a corpus of five classical composers. However, the method can still be improved, and various features and details of the algorithm may be changed and replaced by more sophisticated solutions. On the other hand, the steps of the algorithm presented in this work may be applied to many different tasks. The most important common value of these methods is that doing research on music data is also doing research on the achievements of human thought, which may lead to finding the answer to the question of what distinguishes mankind from apes.


A. Appendix – music notation

I. Western music system

The Western music system is based on the standard western chromatic scale [49]. Each doubling of the frequency of a sound is called an octave. Assuming that humans perceive sound in the range of 20 Hz – 20 kHz, there are about 10 audible octaves (since $2^{10} \approx 1000$). In the western chromatic scale, each octave consists of twelve semitones, which are the smallest integral units. The distribution of tones in the octave is uniform on the logarithmic scale, i.e., the ratio of the frequencies of every pair of adjacent notes remains the same. This is also described as equal temperament. With twelve notes in the octave, the pitch of each note depends on the previous one according to the following rule:

$$p_i = 2^{1/12} \cdot p_{i-1} \qquad (A.1)$$

The modern musical system is referenced to the note called A4, which has a frequency of exactly 440 Hz. The number that follows the note letter indicates the octave number. Following the rule presented above, the middle 'c' note (C4) has 261.6 Hz, while the note at 880 Hz is called A5. The MIDI system assumes that the middle 'c' (C4) has the value of '60', where the values represent the semitone order. Since the reference note A4 has an order of '69', the following mapping may be applied to all MIDI pitches:

$$p = 69 + 12 \cdot \log_2\left(\frac{f}{440}\right) \qquad (A.2)$$

where p is the midi pitch of a note and f is the frequency.
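Formula (A.2) and its inverse can be checked with a short sketch. The function names are hypothetical; the constants come from the text (A4 = 440 Hz, MIDI pitch 69).

```python
import math

A4_FREQUENCY_HZ = 440.0
A4_MIDI_PITCH = 69


def frequency_to_midi(frequency_hz):
    """Map a frequency in Hz to a (possibly fractional) MIDI pitch, as in (A.2)."""
    return A4_MIDI_PITCH + 12.0 * math.log2(frequency_hz / A4_FREQUENCY_HZ)


def midi_to_frequency(pitch):
    """Inverse mapping: each semitone multiplies the frequency by 2 ** (1 / 12)."""
    return A4_FREQUENCY_HZ * 2.0 ** ((pitch - A4_MIDI_PITCH) / 12.0)


print(frequency_to_midi(440.0))          # A4
print(round(frequency_to_midi(261.6)))   # middle C
print(midi_to_frequency(60))             # frequency of middle C in Hz
```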


II. Staff system

The staff is the fundamental latticework of a musical score, upon which all musical symbols, such as notes, are placed. Each of the five staff lines and the intervening spaces corresponds to one of the seven repeating pitches of the diatonic scale [38]. The diatonic scale contains seven notes chosen out of the twelve of the chromatic scale; they are presented directly on the staff, while the remaining five notes are obtained by alteration using flats, which lower the pitch by a semitone, and sharps, which raise the pitch by a semitone (Figure A.4, 1a-1d, page 83). Using multiple alteration signs shifts the pitch further. Each transition between a position on a line and the adjacent space corresponds to one diatonic step.

Each staff is characterized by a clef, which indicates the reference pitch on the diatonic scale. There are three types of clefs (Figure A.1):

Figure A.1 Clefs

1. Treble clef (G clef) – its reference pitch (G4) is indicated by the center of the spiral,

2. Alto clef (C clef) – the middle point of the clef defines the C4 pitch line,

3. Bass clef (F clef) – its reference point (the place between the dots) defines F3.

With the staff system and clefs introduced, Table A.1 shows examples of different notes together with their detailed descriptions:

Table A.1 Pitches

[Staff notation of each note not reproduced]

Name         C1   C2   C3   F3   C4   G4   A4   C5   C6    C7
Freq. (Hz)   33   65   131  175  262  392  440  523  1047  2093
MIDI pitch   24   36   48   53   60   67   69   72   84    96
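Entries of this kind can be derived from a note name alone. The following is a minimal sketch, assuming the convention stated earlier that C4 maps to MIDI pitch 60 and A4 to 440 Hz; the helper names and the restricted input format (a natural note letter followed by an octave number, no sharps or flats) are illustrative assumptions, not part of the thesis.

```python
# Semitone offsets of the natural notes within one octave, counted from C.
SEMITONES_FROM_C = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def name_to_midi(name: str) -> int:
    """Convert a natural note name such as 'G4' to a MIDI pitch (C4 = 60)."""
    letter, octave = name[0].upper(), int(name[1:])
    return 12 * (octave + 1) + SEMITONES_FROM_C[letter]


def midi_to_freq(pitch: int) -> float:
    """Frequency in Hz of a MIDI pitch, with A4 (pitch 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)


if __name__ == "__main__":
    for name in ["F3", "C4", "G4", "A4", "C5"]:
        pitch = name_to_midi(name)
        print(f"{name}: MIDI {pitch}, {midi_to_freq(pitch):.0f} Hz")
```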


III. Temporal information

Notes carry duration information. Each note starts at the point where the previous note stopped, so the flow of notes is contiguous. Rests may be used to introduce periods of silence. Note and rest values are not defined absolutely, but are proportional in duration to all other note and rest values. This proportion is indicated by the shape of the rest, or by the shape of the note stem and the filling of the note head. In Table A.2, the semibreve (a 'whole' note) is used as the reference value.

Table A.2 Notes and Rests

[Note and rest symbols not reproduced]

Name                 a.k.a.          Duration
Breve                double whole    2
Semibreve            whole           1
Minim                half            1/2
Crotchet             quarter         1/4
Quaver               eighth          1/8
Semiquaver           sixteenth       1/16
Demisemiquaver       thirty-second   1/32
Hemidemisemiquaver   sixty-fourth    1/64

Notes with flags may be merged into groups using beams instead of flags (Figure A.2a). Notes may also be assigned additional symbols, such as dots (Figure A.2b) or fermatas (Figure A.2c), that change their length.

Figure A.2 Time related symbols
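The relative durations of Table A.2 and the effect of dots can be sketched with exact fractions. This is an illustrative sketch only: the thesis does not state by how much a dot lengthens a note, so the code assumes the usual convention that each dot adds half of the value added by the previous one. Fermatas are left out because they have no fixed numerical length. The names `DURATIONS` and `dotted` are hypothetical.

```python
from fractions import Fraction

# Durations from Table A.2, relative to the semibreve (whole note).
DURATIONS = {
    "breve": Fraction(2),
    "semibreve": Fraction(1),
    "minim": Fraction(1, 2),
    "crotchet": Fraction(1, 4),
    "quaver": Fraction(1, 8),
    "semiquaver": Fraction(1, 16),
    "demisemiquaver": Fraction(1, 32),
    "hemidemisemiquaver": Fraction(1, 64),
}


def dotted(duration: Fraction, dots: int = 1) -> Fraction:
    """Length of a note carrying `dots` augmentation dots.

    Assumed convention: each dot adds half of the value added by the
    previous one, which gives duration * (2 - 1 / 2**dots).
    """
    return duration * (2 - Fraction(1, 2 ** dots))


if __name__ == "__main__":
    print(dotted(DURATIONS["crotchet"]))       # dotted quarter -> 3/8
    print(dotted(DURATIONS["minim"], dots=2))  # double-dotted half -> 7/8
```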

Notes with pitch and duration information are organized in bars (Figure A.3a), separated by various barlines (Figure A.3b). Each bar contains a certain number of given time units (Figure A.3c). For instance, 4/4 means four quarters per bar, and 12/8 means twelve eighths per bar. The special symbol C (common time) is defined as 4/4, and the crossed C (cut time) as 2/2. The exact duration of the reference note (i.e., the tempo) is chosen by the performer, guided by the verbal indication of the speed of the piece (Figure A.3d) and the tempo marking (Figure A.3d), placed at the beginning of a piece.


Figure A.3 Staff layout
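The relation between a time signature and the capacity of a bar described above can be sketched as follows. This is a minimal illustration under the assumption that a time signature is given as a pair (number of units, unit denominator) and that durations are expressed relative to the whole note; the function names `bar_length` and `fills_bar` are hypothetical.

```python
from fractions import Fraction


def bar_length(beats: int, unit: int) -> Fraction:
    """Capacity of one bar in whole-note units, e.g. 12/8 -> 3/2."""
    return Fraction(beats, unit)


def fills_bar(durations, beats: int, unit: int) -> bool:
    """True if the given note/rest durations exactly fill one bar."""
    return sum(durations, Fraction(0)) == bar_length(beats, unit)


if __name__ == "__main__":
    quarter, eighth = Fraction(1, 4), Fraction(1, 8)
    print(bar_length(4, 4))                 # four quarters per bar -> 1
    print(bar_length(12, 8))                # twelve eighths per bar -> 3/2
    print(fills_bar([quarter] * 4, 4, 4))   # True
    print(fills_bar([eighth] * 11, 12, 8))  # False, one eighth is missing
```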

There are many auxiliary musical symbols that enable a precise description of a performance. Some of them relate to events that are important for content-based music processing and retrieval, such as note alterations (modifications of a note's pitch) (Figure A.4, group 1). Others affect volatile elements of the performance that are important for the player but may be negligible for music analysts, such as note relationship markings, dynamics markings, articulation markings, or ornamentation. Examples of all these groups are assembled in Figure A.4:

Figure A.4 Plethora of music notation potpourri


Bibliography

[1] ActivePerl 5.8.8.820 Perl distribution for Windows. ActiveState. Online resource. http://www.activestate.com/products/activeperl/. Retrieved 1.05.2007.

[2] ActivePerl License Agreement. ActiveState. Online resource. http://www.activestate.com/Products/ActivePerl/license_agreement.plex. Retrieved 1.05.2007.

[3] Allamanche, E., Herre, J., Hellmuth, O., Fröba, B., Kastner, T., Cremer, M. (2001). Content-based Identification of Audio Material Using MPEG-7 Low Level Description. In Proceedings of the International Symposium of Music Information Retrieval. http://ismir2001.ismir.net/pdf/allamanche.pdf

[4] Baumann, S. (1995). A Simplified Attributed Graph Grammar for High-Level Music Recognition. In Proceedings of the Third International Conference on Document Analysis and Recognition. http://ieeexplore.ieee.org/iel3/4755/13256/00602096.pdf

[5] Berenzweig, A., Logan, B., Ellis, D.P.W., Whitman, B. (2004). A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures. Computer Music Journal, 28:2, pp. 63–76. http://www.ee.columbia.edu/~dpwe/pubs/ismir03-sim-draft.pdf

[6] Bod, R. (2001). Probabilistic Grammars for Music. In Proceedings of the Belgian-Dutch Conference on Artificial Intelligence. http://staff.science.uva.nl/~rens/bnaic01.pdf

[7] Bod, R. (2002). A unified Model of Structural Organization in Language and Music. Journal of Artificial Intelligence Research 17 (2002), 289-308. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/jair/OldFiles/OldFiles/pub/volume17/bod02a.pdf

[8] Buzzanca, G. (1997). A Supervised Learning Approach to Musical Style Recognition. In Proceedings of International Computer Music Conference. http://www.conservatoriopiccinni.it/~g.buzzanca/A_Supervised_Learning_Approach.PDF

[9] Chapman, S. String Similarity Metrics for Information Integration. Online resource. http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Retrieved 1.05.2007.

[10] Conklin, D. (2003). Music Generation from Statistical Models. In Proceedings of the AISB 2003 Symposium on Artificial Intelligence and Creativity in the Arts and Sciences, Aberystwyth, Wales, p. 30–35. http://www.soi.city.ac.uk/~conklin/papers/AISB/paper.pdf

[11] Damiani, A., Olivetti Belardinelli, M. (2003). Recognition of Composer’s Style from Musical Fragments. In Proceedings of the 5th Triennial ESCOM Conference. p. 254-256. http://www.epos.uos.de/music/books/k/klww003/pdfs/228_Damiani_Proc.pdf


[12] Document classification. Wikimedia Foundation, Inc. Online Resource. http://en.wikipedia.org/wiki/Document_classification. Retrieved 1.05.2007

[13] Doraisamy, S. (2004). Polyphonic Music Retrieval: The N-gram Approach. Ph.D. thesis. University of London.

[14] Downie, S. (1999). Evaluating a simple approach to music information retrieval: Conceiving melodic n-grams as text. Ph.D. Thesis, University of Western Ontario.

[15] Downie, S. (2003). Music Information Retrieval. Annual Review of Information Science and Technology 37, 295-340. http://www.music-ir.org/downie_mir_arist37.pdf

[16] Finale, composing and score-writing tool. MakeMusic, Inc. Online resource. http://www.finalemusic.com/finale/. Retrieved 1.05.2007.

[17] Francu, C., Nevill-Manning, C. G. (2000). Distance Metrics and Indexing Strategies for a Digital Library of Popular Music. IEEE International Conference on Multimedia and Expo (II). http://cristian.francu.com/Papers/icme00.pdf

[18] Franklin, D. R., Chicharo, J. F. (1999). Paganini – A Music Analysis and Recognition Program. Fifth International Symposium on Signal Processing and its Applications, Brisbane. p. 107-110 vol. 1. http://ieeexplore.ieee.org/iel5/6605/17735/00818124.pdf

[19] Fraunhofer Institut Integrierte Schaltungen. Online resource. http://www.iis.fraunhofer.de/. Retrieved 22.05.2007.

[20] Frojdh, P., Lindgren, U., Westerlund, M. (2006). Media Type Registration for Downloadable Sounds for Musical Instrument Digital Interface. RFC 4613. IETF. Online resource. http://www.apps.ietf.org/rfc/rfc4613.txt. Retrieved 1.05.2007

[21] Gawryjołek, J. (2007). Analiza i wizualizacja wpisów w serwisach typu wiki, (in Polish). B.Sc. Thesis. Warsaw University of Technology.

[22] General MIDI. Wikimedia Foundation, Inc. Online Resource. http://en.wikipedia.org/wiki/General_MIDI. Retrieved 1.05.2007

[23] Glatt, J. (2004a). MIDI Specification. Online resource. http://www.borg.com/~jglatt/tech/midispec.htm. Retrieved 1.05.2007

[24] Glatt, J. (2004b). MIDI File Format. Online resource. http://www.borg.com/~jglatt/tech/midifile.htm. Retrieved 1.05.2007

[25] Glatt, J. (2004c). MIDI File Format: Tempo and Timebase. Online resource. http://www.borg.com/~jglatt/tech/midifile/ppqn.htm. Retrieved 1.05.2007

[26] History of Mahidol University (in Thai). Online Resource. http://www.mahidol.ac.th/muthai/history/history.htm. Retrieved 1.05.2007.

[27] Huneker, J. (1900). Chopin: The Man and His Music. New York. ISBN 1-603-03588-5

[28] Info-Zip 2.32. GNU Project. Online resource. http://www.info-zip.org/. Retrieved 1.05.2007


[29] Keselj, V., Peng, F., Cercone, N., Thomas, C. (2003). N-gram-based Author Profiles for Authorship Attribution. In Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING’03, pages 255–264, Dalhousie University, Halifax, Nova Scotia, Canada. http://users.cs.dal.ca/~vlado/papers/pacling03.pdf

[30] Kranenburg, P. van, Backer, E. (2004). Musical style recognition – a quantitative approach. In Proceedings of the Conference on Interdisciplinary Musicology. http://gewi.uni-graz.at/~cim04/CIM04_paper_pdf/Kranenburg_Backer_CIM04_proceedings.pdf

[31] Lazzaro, J., Wawrzynek, J. (2006). RTP Payload Format for MIDI. RFC 4695. IETF. Online resource. http://www.apps.ietf.org/rfc/rfc4695.txt. Retrieved 1.05.2007

[32] Lemstrom, K. (2000). String matching Techniques for Music Retrieval. Ph.D. Thesis, University of Helsinki. http://www.cs.helsinki.fi/~klemstroz/THESIS/thesis-gzipped.pdf/lemstr00string.pdf

[33] Li, T., Ogihara, M., Li, Q. (2003). A Comparative Study on Content-Based Music Genre Classification. Proceedings of the 26th Annual International ACM Conference on Research and Development in Information Retrieval. http://portal.acm.org/citation.cfm?id=860435.860487.

[34] LilyPond, automated engraving system. GNU Project. Online resource. http://lilypond.org/. Retrieved 1.05.2007.

[35] Martin, K. D. (1999). Sound-Source Recognition: A Theory and Computational Model. Ph.D. Thesis, Massachusetts Institute of Technology. http://xenia.media.mit.edu/~kdm/research/papers/kdm-phdthesis.pdf

[36] Martin, K. D., Kim, Y. E. (1998). 2pMU9. Musical instrument identification: A pattern-recognition approach. In Proceedings of the 136th meeting of the Acoustical Society of America. http://sound.media.mit.edu/Papers/kdm-asa98.pdf

[37] MIDI specification. MIDI Manufacturers Association. Online resource. http://www.midi.org/. Retrieved 1.05.2007.

[38] Modern Musical Symbols. Wikimedia Foundation, Inc. Online Resource. http://en.wikipedia.org/wiki/Modern_musical_symbols. Retrieved 1.05.2007

[39] Musical Instrument Digital Interface. Wikimedia Foundation, Inc. Online Resource. http://en.wikipedia.org/wiki/MIDI. Retrieved 1.05.2007

[40] MusicXML Definition. Recordare LLC. Online resource. http://www.musicxml.org/xml.html. Retrieved 1.05.2007.

[41] Pardo, B. (2006). Finding Structure in Audio for Music Information Retrieval. IEEE Signal Processing Magazine. Vol. 23 Issue 4, 126-132. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&arnumber=1628889&isnumber=34166


[42] Perez-Sancho, C., Inesta, J.M., Calera-Rubio, J. (2004). Style Recognition through Statistical Event Models. In Proceedings of the Sound and Music Computing Conference, SMC ’04. http://smc04.ircam.fr/scm04actes/P23.pdf

[43] Pollastri, E., Simoncelli, G. (2001). Classification of Melodies by Composer with Hidden Markov Models. In Proceedings of the First International Conference on Web Delivering of Music, p. 88-95. http://ieeexplore.ieee.org/iel5/7752/21299/00990162.pdf

[44] Ponsford, D., Wiggins, G., Mellish, C. (1999). Statistical learning of harmonic movement. Journal of New Music Research, 28(2). http://www.doc.gold.ac.uk/~mas02gw/papers/JNMR97.pdf

[45] Press, W.H. (1992). Numerical recipes in C: the art of scientific computing. Cambridge University Press. ISBN 0-521-43720-2.

[46] Rivasseau, J.-N. (2004). Learning harmonic changes for musical style modeling. Project report, University of British Columbia. http://www.elvanor.net/files/learning_harmonic_changes.pdf

[47] Schaffrath, H. (1993). Repräsentation einstimmiger Melodien: computerunterstützte Analyse und Musikdatenbanken. In B. Enders and S. Hanheide (eds.) Neue Musiktechnologie, 277-300, Mainz, B. Schott’s Söhne.

[48] Schaffrath, H., Huron, D. (ed.). (1995). The Essen Folksong Collection in the Humdrum Kern Format. Menlo Park, CA. CCARH.

[49] Scientific pitch notation. Wikimedia Foundation, Inc. Online Resource. http://en.wikipedia.org/wiki/Scientific_pitch_notation. Retrieved 1.05.2007

[50] Selfridge-Field, E. (1995). The Essen Musical Data Package. Menlo Park, California. CCARH.

[51] Senner, W.M. (1991). The Origins of Writing. University of Nebraska Press. ISBN 0-80-329167-1

[52] Sornlertlamvanich, V., Potipiti, T., Charoenporn, T. (2000). Automatic Corpus-based Thai Word Extraction with the C4.5 Learning Algorithm. Proceedings of the 18th conference on Computational Linguistics - Volume 2, 802-807. http://acl.ldc.upenn.edu/C/C00/C00-2116.pdf

[53] Spärck Jones, K., Willett, P. (1997). Readings in Information Retrieval. Overall Introduction, 1-7. Morgan Kaufmann.

[54] Spevak, C., Thom, B., Hothker, K. (2002). Evaluating Melodic Segmentation. In Proceedings of the 2nd Conference on Music and Artificial Intelligence, ICMAI’02, Edinburgh, Scotland. http://www.cs.cmu.edu/~bthom/PAPERS/icmai02.pdf

[55] Thom, B. (2000a). Unsupervised Learning and Interactive Jazz/Blues Improvisation. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, p. 652–657. http://www.cs.cmu.edu/~bthom/PAPERS/aaai2k.pdf

[56] Thom, B. (2000b). BoB: an Interactive Improvisational Music Companion. In Proceedings of the Fourth International Conference on Autonomous Agents (Agents-2000), Barcelona, Spain. http://www.cs.cmu.edu/~bthom/PAPERS/agents2k.pdf

[57] Thom, B. (2001). Machine Learning Techniques for Real-time Improvisation Solo Trading. In Proceedings of the 2001 International Computer Music Conference, Havana, Cuba. http://www.cs.cmu.edu/~bthom/PAPERS/icmc01.pdf

[58] Treitler, L. (2000). With Voice and Pen. Oxford University Press. ISBN 0-19-816644-3

[59] Truong, B. (2002). Trancedence: An artificial life approach to the synthesis of music. http://www.informatics.susx.ac.uk/easy/Publications/Online/MSc2002/bt20.pdf

[60] Tseng, Y. (1999). Content-based retrieval for music collections. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp 176-182. http://blue.lins.fju.edu.tw/~tseng/papers/p176-tseng.pdf

[61] Uitdenbogerd, A., Zobel, J. (1999). Melodic matching techniques for large database. In Proceedings of the seventh ACM international conference on Multimedia, pp 57-66. http://pi3.informatik.uni-mannheim.de/~helmer/m1.pdf.

[62] Walshaw, C. Abc Music Notation Software Package. Web resource. http://staffweb.cms.gre.ac.uk/~c.walshaw/abc/ http://abc.sourceforge.net/abcMIDI/original/

[63] Wołkowicz, J. (2006). Analysis of piano pieces of various composers stored in MIDI files, (in Polish). Project Report. Warsaw University of Technology. http://torch.cs.dal.ca/~jacek/papers/projects/classification_enhancement.pdf

[64] Zdrahal-Urbanek, J., Vitouch, O. (2003). Recognize the tune? A Study on Rapid Recognition of Classical Music. In Proceedings of the 5th Triennial ESCOM Conference, p. 257-260. http://www.musicpsychology.net/vitouch/vitouch_2003b.pdf
