
Unsupervised Incremental Online Learning and Prediction of Musical Audio Signals

Ricard Marxer, Hendrik Purwins

Abstract—Guided by the idea that musical human-computer interaction may become more effective, intuitive, and creative when basing its computer part on cognitively more plausible learning principles, we employ unsupervised incremental online learning (i.e. clustering) to build a system that predicts the next event in a musical sequence, given as audio input. The flow of the system is as follows: 1) segmentation by onset detection, 2) timbre representation of each segment by Mel frequency cepstrum coefficients, 3) discretization by incremental clustering, yielding a tree of different sound classes (e.g. timbre categories/instruments) that can grow or shrink on the fly driven by the instantaneous sound events, resulting in a discrete symbol sequence, 4) extraction of statistical regularities of the symbol sequence, using hierarchical N-grams and the newly introduced conceptual Boltzmann machine that adapt to the dynamically changing clustering tree in 3), and 5) prediction of the next sound event in the sequence, given the last n previous events. The system's robustness is assessed with respect to complexity and noisiness of the signal. Clustering in isolation yields an adjusted Rand index (ARI) of 82.7% / 85.7% for data sets of singing voice and drums. Onset detection jointly with clustering achieves an ARI of 81.3% / 76.3%, and the prediction of the entire system yields an ARI of 27.2% / 39.2%.

Index Terms—Music information retrieval, unsupervised learning, adaptive algorithms, prediction algorithms

I. INTRODUCTION

Most traditional computational music analysis systems are not based on the cognitively plausible principles of unsupervised, online, and incremental learning [1]. Clustering, as an example of unsupervised learning (as opposed to supervised learning with a 'teacher'), consists of grouping instances according to their proximity to each other, e.g. as infants learn to form phoneme classes from articulation instances. Whereas in batch learning an entire database remains accessible, to search in or to run through once or several times in order to train a learning algorithm, in online learning only one item is accessible at a time and can be used to train the learning algorithm. Human online learning occurs when listening to one phoneme triggers an instantaneous, subtle update of the

R. Marxer is with the Speech and Hearing Group, Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK, r.marxer@sheffield.ac.uk.

H. Purwins is with the Audio Analysis Lab and Sound and Music Computing Group, Aalborg Universitet København, A.C. Meyers Vænge 15, 2450 Copenhagen SV, [email protected].

R. Marxer and H. Purwins were with the Music Technology Group, Universitat Pompeu Fabra, Roc Boronat, 138, 08018 Barcelona, Spain.

H. Purwins was also with the Neurotechnology Group, Berlin Institute of Technology, Sekr. MAR 4-3, Marchstr. 23, 10587 Berlin, Germany.

This work was partially funded by the EmCAP (European Commission FP6-IST, contract 013123) and INSPIRE-ITN (European Commission FP7-PEOPLE-2011-290000) projects. Thanks to Daniel Bartz from Berlin Institute of Technology for advice in statistics and for proofreading.

neural connections in the brain. In incremental learning, the current input may also modify the representation, e.g. in infant language acquisition, protocategories of allophones join later to form a phoneme category ([2] cited in [3]). If the representation used by the computer imitated these learning principles used by humans, such a representation may be suitable to determine musical expectation, similarity, communication, emotion, interpretation, and visualization in music analysis, music information retrieval, and musical human-computer interaction.

For music transcription [4], prediction [5, 6], representation [7, 8], automatic accompaniment, or human-machine improvisation [9, 10], a traditional system is usually based either on symbolic data instead of audio input, or on classifiers that are pre-trained on a labelled database [4]. If a system based on pre-trained classifiers needs to cope with new musical concepts (instruments, harmonies, pitches, motifs) it has not been designed for, it may cease to work reasonably. Such a system would have to be retrained with labeled data every time a new instrument (pitch, harmony, etc.) appears. This reveals a severe lack of flexibility of such a system, in contrast to human cognition, which processes new instruments and harmonies with ease, even if one has not heard them before. A human mind can grasp a novel motif when listening to a piece or an improvisation. Unsupervised learning (clustering) instead of supervised classification is one paradigm for how an algorithm can model the cognition of novel concepts [11, 12, 13]. Based on a discrete representation of the input derived by clustering, an n-gram, i.e. a suffix tree, can be used as a statistical representation of the structure of the input sequence [5, 6]. In this paper, we extend such a system by equipping it with the capability to deal with a varying number of clusters. The number of clusters can increase if e.g. a timbre cluster for hi-hat splits into subclusters of open and closed hi-hats. The cluster number may decrease if two instruments come to sound very similar. We implement these features by using unsupervised online learning. This requires that the n-gram (suffix tree) be coupled with the clustering in order to be able to merge or split the symbol counts when cluster numbers change. We introduce a system prototype that learns in an unsupervised, adaptive manner and that generates predictions from audio sequences. From the first note, it begins to generate reasonable predictions without using previous knowledge.

Many previous approaches to predicting musical sequences are based on symbolic representation [7, 5, 8, 9, 10, 6]. Paiement et al. [14] present a model that is capable of predicting and generating melodies using a combination of Bayesian networks, clustering, rhythmic self-similarity and


a special representation of melody. The distances between rhythmical patterns are clustered and the continuation of a melody is predicted conditioned on the chord root, chord type, and Narmour group of recent melodic notes. Hazan et al. [11] build a system for generation of musical expectation that operates on music in audio data format. The auditory front-end segments the musical stream and extracts both timbre and timing description. In an initial bootstrap phase, an unsupervised clustering process builds up and maintains a set of different sound classes. The resulting sequence of symbols is then processed by a multi-scale technique based on n-grams. Model selection is performed during a bootstrap phase via the Akaike information criterion. Marchini and Purwins [13] present a non-adaptive system that learns rhythmic patterns from drum audio recordings and synthesizes music variations from the learned sequence. The procedure uses a fuzzy multi-level representation. Moreover, a tempo estimation procedure is used to guarantee that the metrical structure is preserved in the generated sequence. Online clustering has been proposed by Zhang et al. [15] for document clustering. Bertin-Mahieux et al. [16] have used online k-means to cluster beat-chroma patterns. The Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) [17] has been used for segmentation in conjunction with clustering. Fox et al. [18] and Ren et al. [19] have proposed 'sticky' versions of the HDP-HMM that introduce explicit modelling of state occupancy duration. These models are applied to segmentation of a Beethoven sonata into musical sections [19] and to speaker diarization [18]. Stepleton et al. [20] used the block diagonal infinite hidden Markov model for musical theme labelling. However, these methods do not perform incremental online learning, whereas we propose an online incremental clustering method that uses a separate segmentation method (onset detection) and switches relatively rapidly between states. Bargi et al. [21] have adapted the HDP-HMM to an online setting employing an initial supervised learning phase (bootstrap), whereas our approach is entirely unsupervised.

A part of the work covered in this paper, the application of the hierarchical n-grams to the Voice data, has been presented previously [12]. Here we compare that method with the conceptual Boltzmann machine and with the HDP-HMM on an extended data set, using a more advanced evaluation measure (the adjusted Rand index) and providing more examples of adaptive clustering. We will give an overview of the system and introduce its components, namely segmentation, timbre representation, clustering, and prediction. Then we will introduce the adjusted Rand index and test the performance of the sequence analysis algorithms under noisy conditions, of each system module separately, and in conjunction. Finally, we will give some demonstration examples. Audio-visual data and examples are available on the supporting website [22].

II. SYSTEM OVERVIEW

The system that we present in this paper (cf. Fig. 1) consists of four main stages: segmentation by onset detection, feature extraction resulting in a timbre representation, incremental clustering giving a symbol sequence, and sequence analysis yielding a prediction of the next symbol. In particular, the clustering tree generated by incremental clustering grows and shrinks online, driven by the most recent sounds. In turn, the sequence model adapts to the changing numbers of symbols. Segmentation and representation can be interpreted as a model of perception, whereas discretization and prediction can be considered to be a cognitive model.


Fig. 1. System architecture: An audio sound file is segmented using onset detection. Each segment is then represented as a high-dimensional timbre feature vector, which is clustered into symbols. Symbols are added to or removed from the clustering tree on the fly. The symbol sequence is then statistically analysed, adapting to the varying number of symbols, allowing for prediction of the next symbol and the next inter-onset interval (IOI).

A. Segmentation by Onset Detection

A piece of music can be segmented on several hierarchical time scales, such as notes, motifs, phrases, themes, or parts. As in this work we focus on the prediction of the next event/note based on the series of previous events/notes, onset detection appears to be a promising method to segment an audio stream into events. In order to be more generally applicable, we have employed the complex-domain onset detector [23], since it subsumes onset detection algorithms based on energy, spectral difference, or phase as special cases. This onset detection function captures onsets due to abrupt energy changes as well as soft onsets induced by pitch changes with little energy variation. For each frame l, the short-term Fourier transform yields a complex spectrum X_k(l) = r_k(l)e^{iφ_k(l)}, with magnitude r_k and phase φ_k for the k-th bin, with frame length K (0 ≤ k ≤ K − 1). We build the onset detection function as the Euclidean distance between the actual complex spectrum X_k(l) at bin k and the estimated complex spectrum [23]:

X̂_k(l) = r̂_k(l) e^{iφ̂_k(l)},    (1)

where the estimated amplitude r̂_k(l) is set equal to the magnitude of the previous frame ‖X_k(l − 1)‖, and the estimated phase φ̂_k(l) is calculated as the linear extrapolation from the unwrapped phases of the two preceding frames:

φ̂_k(l) = princarg [ϕ_k(l − 1) + (ϕ_k(l − 1) − ϕ_k(l − 2))],

where ϕ denotes the unwrapped phase and the princarg operator maps the unwrapped value back to the (−π, π] range. We calculate the bin-wise Euclidean distance between the actual and the estimated complex spectrum, quantifying the stationarity of the k-th bin as Δ_k(l) = ‖X_k(l) − X̂_k(l)‖. By summing across all K bins and across M + 1 consecutive frames centered around frame l (smoothing), we obtain the onset detection function:


η(l) = (1/M) ∑_{j=⌈−M/2⌉}^{⌊M/2⌋} ∑_{k=0}^{K−1} Δ_k(l + j).    (2)

Similarly to previous approaches [24], an adaptive threshold θ(l) is used. This threshold is calculated as the scaled median across a look-ahead window of length P + 1:

θ(l) = C · median_{n ∈ {l, l+1, ..., l+P}} (η(n)),    (3)

with 0 ≤ C ≤ 1 being a predefined parameter controlling the sensitivity of the onset detector. In order to eliminate multiple occurrences of onsets shortly one after another, smoothing is applied via another window of length W + 1 centered at sample l:

µ(l) = ∑_{m=⌈−W/2⌉}^{⌈W/2⌉} max(η(l + m) − θ(l + m), 0).    (4)

A silence threshold θ_s is applied:

µ_s(l) = max(µ(l) − θ_s, 0).    (5)

Finally, the local maxima of µ_s(l) define the predicted onset times.
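To make the detector concrete, the following is a minimal NumPy sketch of Eqs. (1)-(5), not the authors' implementation: the stft helper, the simple peak picking, and all function and parameter names are our own assumptions, and boundary handling is simplified.

    import numpy as np

    def stft(x, win=1024, hop=128):
        # Short-term Fourier transform with a Hann window.
        w = np.hanning(win)
        frames = [np.fft.rfft(w * x[i:i + win])
                  for i in range(0, len(x) - win, hop)]
        return np.array(frames)                       # (n_frames, n_bins)

    def detect_onsets(x, M=33, C=0.9, P=10, W=11, theta_s=0.002):
        X = stft(x)
        r = np.abs(X)
        phi = np.unwrap(np.angle(X), axis=0)          # unwrapped phase
        # Eq. (1): previous magnitude, linearly extrapolated phase
        # (no princarg needed, since exp(i*phi) is 2*pi-periodic).
        X_hat = r[1:-1] * np.exp(1j * (2 * phi[1:-1] - phi[:-2]))
        delta = np.abs(X[2:] - X_hat)                 # bin-wise distance
        # Eq. (2): sum over bins, smooth over M+1 frames.
        eta = np.convolve(delta.sum(axis=1), np.ones(M + 1) / M, mode='same')
        # Eq. (3): adaptive threshold, scaled median over a look-ahead window.
        theta = C * np.array([np.median(eta[l:l + P + 1])
                              for l in range(len(eta))])
        # Eq. (4): rectify and smooth over a window of length W+1.
        mu = np.convolve(np.maximum(eta - theta, 0.0), np.ones(W + 1),
                         mode='same')
        mu_s = np.maximum(mu - theta_s, 0.0)          # Eq. (5): silence threshold
        # Local maxima of mu_s are the predicted onset frames.
        return np.flatnonzero((mu_s[1:-1] > mu_s[:-2]) &
                              (mu_s[1:-1] >= mu_s[2:]) & (mu_s[1:-1] > 0)) + 1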

B. Feature Extraction for Timbre Representation

For each onset, a short window of length L subsequent to the onset time is analyzed. For each frame within this window, the first 13 Mel-frequency cepstrum coefficients (MFCC) [25] are calculated. To model the coefficients' temporal behaviour right after the onset, another discrete cosine transform (DCT) is calculated on the sequence of each coefficient across the frames. Taking the first 4 DCT coefficients for each MFCC yields a 52-dimensional vector, representing timbral features of both the sound event's spectral characteristics and their initial temporal development.
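As an illustration, a sketch of this 52-dimensional representation might look as follows, assuming librosa for the MFCC computation; the function name timbre_vector and the default window length are our own choices, not the paper's code.

    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def timbre_vector(x, onset, sr=44100, L_ms=150, n_mfcc=13, n_dct=4):
        # Analyze a short window of length L right after the onset.
        segment = x[onset:onset + int(sr * L_ms / 1000)]
        # First 13 MFCCs per frame within the window.
        mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=1024, hop_length=128)
        # A second DCT across frames models each coefficient's temporal
        # course; 4 DCT coefficients per MFCC yield 13 x 4 = 52 dimensions.
        return dct(mfcc, axis=1, norm='ortho')[:, :n_dct].flatten()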

C. Incremental Clustering for Symbol Sequence Generation

The clustering stage receives multivariate feature vectors from the preprocessing stage and converts them into symbols. It is important to state that in our system the events are clustered in an online manner and in order of arrival, since this symbolic representation is used immediately to create predictions of future events. As a reference and benchmark, we compare online clustering by Cobweb with a state-of-the-art batch clustering method exploiting sequential information, the HDP-HMM.

1) Cobweb: For this purpose, Marxer et al. [26] used the Cobweb algorithm [27]. Cobweb is an incremental clustering model which continuously builds a knowledge tree (a hierarchical partitioning of the object space) and assigns to each instance a partition created at each level until the object reaches the leaves of the tree. Each node of the tree represents a concept. A concept is modelled by a univariate Gaussian for each feature dimension. The edges of the structure represent taxonomic relations. Further works [28, 29] have proposed techniques to create, in an unsupervised manner, the concept tree based on the sequence of data presented, by the use of a heuristic function to be maximized. The heuristic function used in this paper is the numerical version of the standard category utility function used by Fisher and introduced by Gluck and Corter [30]. The version of Cobweb that we use was presented as Cobweb/3 [28] and later extended as Cobweb/95 [29]. This algorithm clusters the D-dimensional feature vectors x = (x_1, ..., x_D) extracted in the previous section. Consider a particular cluster containing I feature vectors. Let σ_d be the standard deviation in component d of the input feature vectors assigned to that cluster. Then ∑_{d=1}^{D} 1/σ_d is the specificity of that cluster across all feature dimensions. We consider the utility U to quantify the gain in specificity obtained by splitting this cluster into K child clusters. For a potential child cluster 1 ≤ k ≤ K with I_k instances and each input feature dimension d, we define σ_dk to be the inner cluster standard deviation in that dimension. Then ∑_{d=1}^{D} 1/σ_dk is the specificity of cluster k, and ∑_{k=1}^{K} (I_k/I) ∑_{d=1}^{D} 1/σ_dk is the specificity of the child clusters altogether. For the cluster utility holds:

U ∝ (1/K) ( ∑_{k=1}^{K} (I_k/I) ∑_{d=1}^{D} 1/max(σ_dk, a) − ∑_{d=1}^{D} 1/σ_d ).    (6)

The acuity parameter a is an upper limit on the maximal specificity (a lower limit on the standard deviation) of the clusters, thereby controlling the maximal resolution of the clustering discrimination.
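The utility of Eq. (6) is straightforward to transcribe; the sketch below uses our own data layout (per-child instance counts and per-dimension standard deviations) and is only illustrative.

    import numpy as np

    def category_utility(sigma_parent, children, acuity):
        # children: list of (I_k, sigma_k) pairs, where sigma_k holds the
        # inner-cluster standard deviations of child k in each dimension.
        I = sum(I_k for I_k, _ in children)
        K = len(children)
        spec_parent = np.sum(1.0 / sigma_parent)
        # The acuity a bounds the child specificity: 1 / max(sigma_dk, a).
        spec_children = sum((I_k / I) * np.sum(1.0 / np.maximum(sigma_k, acuity))
                            for I_k, sigma_k in children)
        return (spec_children - spec_parent) / K   # Eq. (6), up to a constant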

The incorporation of an object is a process of clustering the object by descending the tree along an appropriate path, updating counts along the way, and possibly performing one of several operations at each level. These operators are:

• creating a new node,
• removing all children from a node (pruning),
• combining two clusters into a single node, and
• splitting a node into several nodes.

While these operations are applied to a single object set partition (i.e., a set of siblings in the tree), compositions of these primitive operations transform a single clustering tree. As a search strategy, we use hill-climbing through a space of clustering trees.

Thereby, the input is converted into a sequence not only of symbols, but also of meta-symbols (partitions) according to their parent nodes and grandparent nodes in the Cobweb tree. The symbols and meta-symbols provide the alphabet on which expectations will be generated by the hierarchical N-gram.

We modify the set of possible Cobweb operations (see above) in order to achieve persistent partitioning. This reduced set of operations can perform any of Cobweb's original operations. We reformulate the second Cobweb operation (see above) in order to control the clustering only by new incoming events. Other partitions and past events should not be considered. This reduces the operations to:

• creating a new partition inside a container partition,
• removing a partition, reparenting its children if it has any.


2) Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM): The feature vector sequence may also be modelled as the emission of an HDP-HMM, a Bayesian nonparametric model in which the hidden states can be considered as clusters. Given the observed feature vector sequence, the most likely hidden state sequence can be interpreted as a sequence of symbols. In the HDP-HMM, the hidden states are assumed to be drawn from a countably infinite state space. The HDP-HMM is used to jointly estimate the number of clusters, the cluster assignment of the feature vectors, and the transition probabilities between clusters. Inference in the HDP-HMM is performed using the weak limit approximation [18] implemented in pyhsmm [31].¹ However, the inference does not work in an online manner: it requires the entire feature vector sequence as input. This method is offline (batch mode) and is only used as a reference and benchmark, since it does not fulfil the constraint of clustering the feature vectors as they arrive to perform immediate prediction from the very beginning.

¹ http://github.com/mattjj/pyhsmm

D. Sequence Analysis for Next Symbol/Onset Prediction

We choose two methods (hierarchical N-grams and the conceptual Boltzmann machine) [32] that require relatively little storage by deducing frequency counts for longer sequences from frequency counts of their shorter subsequences. These algorithms iteratively predict the next symbol (or the inter-onset interval, IOI, respectively) c_{t+1} based on the previous symbols (IOIs) c_{t−n+1}, c_{t−n+2}, ..., c_t, previously generated by incremental clustering. Thereby we derive which sound to expect when. The prediction of symbols and of IOIs is performed independently. By predicting the IOI, we can determine the onset time of symbol c_{t+1}.

1) Hierarchical N-Grams (HN): N-grams have been used in the analysis of genome sequences and in language modeling [33]. Exhaustive N-grams count the instances of all possible symbol (IOI) sequences of length N. Their memory requirement is exponential in the sequence length N, and the problem arises of how to account for patterns that have not occurred before (the zero frequency problem). We use N-grams as an estimate of the forward conditional distribution for online prediction of the next symbol (IOI).

Hierarchical N-grams (HN) [32] need less memory than exhaustive N-grams. HN are a combination of sparse N-gram models in a hierarchical structure that allows compositional learning. Compositional learning consists in learning long patterns from already learned sub-patterns. In sparse N-grams, counts of the most frequent patterns and a separate total count for the non-frequent patterns are kept. This technique separates the estimates of patterns whose statistics are reliable from the estimates of infrequent patterns whose statistics are biased. On the other hand, the multi-width exhaustive approach consists in keeping the count of all possible patterns of at most length N. These models are able to represent any distribution of patterns up to width N.

Let C_1 = {c_1, ..., c_{|C_1|}} be the set of cluster indices, renumbered so that they reflect the order of their first appearance in the symbol sequence c = (c_1, ..., c_t), obtained from the clustering process in the previous section. C_1 forms the alphabet of the n-gram. In principle, all possible n-grams of length n composed from alphabet C_1 could be tracked; to exploit sparsity, we only consider the patterns that have actually occurred as a subsequence of c so far until time t. The set of patterns of length n having occurred so far will be denoted by C_n = {c^1, ..., c^{|C_n|}}, in which again the subpatterns are ordered according to their first appearance. o(c) is defined as the position of c in C_{|c|}. We consider HNs of maximal length N. Let C_{n,i} (n ≤ N) be the frequency count of the i-th pattern of length n and let T_{n,i} be the total count of patterns of length n since pattern i occurred for the first time. In Algorithm 1, we use the counts T_{n,i} and C_{n,i} to iteratively estimate the joint probabilities P_{n,i} for all patterns seen so far. We define T_{n,0} := T_{1,1} for 1 ≤ n ≤ N. In simple N-grams, the empirical frequency C_{n,i}/T_{n,1} could be used as an estimate of the probability of a pattern of length n. In the HN method (Eq. 12), the probability P^{n−1}_{n,i} of the i-th pattern of length n under the joint distribution of width n − 1 is estimated. For a pattern c^i = (c^i_1, ..., c^i_n), the statistical estimates (Eq. 8.2 in Pfleger [32])

P^{n−1}_{n,i} = P(c^i_1, ..., c^i_{n−1}) · P(c^i_2, ..., c^i_n) / ∑_{y∈C_1} P(c^i_2, ..., c^i_{n−1}, y)    (7)

are calculated, using the sub-patterns of the i-th pattern of length n. We estimate the probability Q^{n−1}_{n,i} (Eq. 11) that the sub-patterns of length n − 1 of the i-th pattern of length n are not a subpattern of the first i patterns of length n. In Eq. 12, they are weighted by their confidence. The confidence values depend on the number of occurrences of the patterns. Therefore, when a pattern of length n has appeared rarely in the data stream, its probability of occurrence is estimated from a small number of counts and is not reliable. In this case the probability of appearance is better estimated from the n − 1 length sub-patterns through P^{n−1}_{n,i}. In other words, the information of patterns of large lengths is integrated with the information of models of small lengths. Pfleger shows that the probability of a given pattern can be calculated in a linear sweep by updating all the probabilities in order of the patterns' first occurrence and length.

In order to adapt Pfleger’s HN [32] to our architecture, wehave to link the operations of the clustering model to theoperations on the n-gram (Fig. 2). When two or more clustersare merged in the clustering model, we have to remove thesuperfluous clusters from the set of cluster indices (Eq. 8)and to sum up the counts for the merged clusters (Eq. 9).For example, if the n-gram tracks patterns bbc and bbd andsuddenly the clustering model merges symbols c and d intoa new symbol e, the n-gram must sum up the counts of bbcand bbd and substitute them with the count of bbe. Mergingof clusters (symbols) requires a merge of the counts in theHN, but the statistics gathered in the HN do not affect theclustering process. In contrast, in the HDP-HMM, the clusternumber, the assignment of feature vectors to clusters, and thetransition probabilities between clusters are estimated jointly.The estimate of the number of clusters and the estimation


Algorithm 1 The Hierarchical N-Gram for Merged Clusters

Initialization: C_n = {} for 1 ≤ n ≤ N
for each incoming event c_t do
    for 1 ≤ n ≤ N do
        if (c_{t−n+1}, ..., c_t) ∉ C_n then
            Add new pattern: C_n = C_n ∪ {(c_{t−n+1}, ..., c_t)}, T_{n,|C_n|} = 0, C_{n,|C_n|} = 0
        end if
        if c^1, ..., c^k ∈ C_n are merged by Cobweb then
            o′ = min(o(c^1), ..., o(c^k))    (8)
            C_{n,o′} = ∑_{i=1}^{k} C_{n,o(c^i)}    (9)
            C_n = C_n \ {c^1, ..., c^{o′−1}, c^{o′+1}, ..., c^k}    (10)
            Update indices
        end if
        Update counts: C_{n,o(c_{t−n+1},...,c_t)} = C_{n,o(c_{t−n+1},...,c_t)} + 1
        Update total counts: T_{n,i} = T_{n,i} + 1 for 1 ≤ i ≤ |C_n|
    end for
    Calculate joint probabilities:
    P^0_{1,i} = 1/|C_1|    (1 ≤ i ≤ |C_1|)
    for 1 ≤ n ≤ N do
        for 1 ≤ i ≤ |C_n| do
            Q^m_{n,i} = 1 − ∑_{k=1}^{i} P^m_{n,k}    (m = n, n−1)    (11)
            Calculate P^{n−1}_{n,i} according to Eq. 7
            P^n_{n,i} = (1/T_{1,1}) ( C_{n,i} + ∑_{j=0}^{i−1} (T_{n,j} − T_{n,j+1}) · Q^n_{n,j} · (P^{n−1}_{n,i} / Q^{n−1}_{n,j}) )    (12)
        end for
    end for
end for

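The count bookkeeping that couples Cobweb merges to the n-gram (Eqs. 8-10) can be sketched with plain dictionaries; this toy version keeps exhaustive counts rather than Pfleger's sparse scheme, and the class and method names are ours.

    from collections import Counter

    class MergeableNGram:
        def __init__(self, N=3):
            self.N = N
            self.counts = [Counter() for _ in range(N + 1)]  # counts[n]: length-n patterns

        def observe(self, history):
            # Count every pattern ending at the newest event.
            for n in range(1, self.N + 1):
                if len(history) >= n:
                    self.counts[n][tuple(history[-n:])] += 1

        def merge(self, old_symbols, new_symbol):
            # E.g. merging c and d into e: counts of bbc and bbd add up as bbe.
            for n in range(1, self.N + 1):
                merged = Counter()
                for pattern, c in self.counts[n].items():
                    merged[tuple(new_symbol if s in old_symbols else s
                                 for s in pattern)] += c
                self.counts[n] = merged

        def predict(self, history):
            # Most frequent continuation of the longest matching context.
            for n in range(self.N, 1, -1):
                ctx = tuple(history[-(n - 1):])
                cand = {p[-1]: c for p, c in self.counts[n].items()
                        if p[:-1] == ctx}
                if cand:
                    return max(cand, key=cand.get)
            return None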

2) Conceptual Boltzmann Machine (CB): The Boltzmann machine [34] is a stochastic, symmetric recurrent neural network that can be used to represent a joint distribution of random variables, to complete patterns, and in particular (as in our case) to predict the continuation of a time series.

Formally, a Boltzmann machine consists of a vector of binary units (s_1, ..., s_I) ∈ {0, 1}^I, symmetric weights w_ij ∈ ℝ between pairs of units (s_i, s_j), an update rule for the units, and a learning rule for the weights.

Applied to categorical data [35], a V-valued symbol c^u is encoded as binary units s^u_1, ..., s^u_V with s^u_v = 1 if and only if c^u = v. To connect two V-valued variables c^u and c^i, V² weights w_{ij,uv} are needed to connect the binary units representing the two variables. Initially, the architecture of our particular Boltzmann machine implementation consists of sets of binary variables for consecutive symbols, where the binary nodes of each variable are initially only connected to the binary nodes of the previous and the next symbol. Depending on the other units and weights, the stochastic softmax update rule for a symbol is:

P(c^i = j) = 1 / (1 + e^{−(1/T) ∑_{u≠i} ∑_{v=1}^{V} s^u_v w_{ij,uv}}),    (13)

with temperature T decreasing from T = 50 to T = 0.005 in 100 steps. As an example of Gibbs sampling, this update rule is applied iteratively. In general, through simulated annealing of the temperature T, the states converge to a particular state vector [34].

For training the Boltzmann machine, the weights w_ij have to be learned. As in the case of the restricted Boltzmann machine [36], in our case not all pairs (s_i, s_j) are connected by non-zero weights w_ij. Units representing the same symbol are not connected among each other. For each binary previous symbol sequence, the update rule (13) is iteratively applied until the final states are reached (denoted by s^+_i). In addition, the update rule is applied with no units fixed until another vector of final states (s^−_i) is reached. Then a stochastic gradient-based learning step for the weights can be performed with learning rate µ for a single training instance yielding s^+_i s^+_j:

Δw_ij = µ(s^+_i s^+_j − s^−_i s^−_j).    (14)

The learning step aims at minimizing the difference between s^+_i s^+_j and s^−_i s^−_j. We use µ = 0.1.
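A toy transcription of Eqs. (13) and (14) follows, under our own simplifications: the weight tensor is dense (ignoring the sparse connectivity described above), and the per-value logistic of Eq. (13) is normalized so that exactly one value is sampled per symbol.

    import numpy as np

    rng = np.random.default_rng(0)

    def update_symbol(s, W, i, T):
        # s: (n_symbols, V) one-hot states; W: (n, V, n, V) symmetric weights.
        n, V = s.shape
        net = np.zeros(V)
        for j in range(V):
            for u in range(n):
                if u != i:
                    net[j] += s[u] @ W[i, j, u]   # sum_v s^u_v * w_ij,uv
        p = 1.0 / (1.0 + np.exp(-net / T))        # Eq. (13), per value j
        s[i] = np.eye(V)[rng.choice(V, p=p / p.sum())]

    def learn_step(W, s_plus, s_minus, mu=0.1):
        # Eq. (14): contrastive update on the flattened unit vectors.
        sp, sm = s_plus.ravel(), s_minus.ravel()
        return W + mu * (np.outer(sp, sp) - np.outer(sm, sm)).reshape(W.shape)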



Fig. 2. The effect of a concept merge in the hierarchical n-gram: nodes c and d are merged into the new symbol e, and the n-gram transfers the counts of patterns containing c and d to the corresponding patterns containing e.


Fig. 3. Illustration of the continuous composition of symbols (atoms) in the Boltzmann machine into longer patterns (chunks).


When a weight w_ij rises above a threshold θ_w, a new hidden unit is created, representing the concatenation of the symbols connected by strong weights (cf. Fig. 3) [35]. In addition, the weight w_ij is removed. Iteratively, hidden units for patterns of length n + 1 are created from nodes representing patterns of length n, and a new set of binary nodes representing patterns of length n is appended. We set θ_w = 0.2, 0.15, 0.1, 0.05, respectively, depending on the length of the pattern the unit represents (lengths 1, 2, 3, 4). This variant of the Boltzmann machine is called the compositionally-constructive categorical Boltzmann machine [35]. For predicting the next symbol c_{t+1} in a trained Boltzmann machine, the respective units are fixed to the previous symbol sequence c_{t−n+1}, ..., c_t. After running the unit update rule (13) until convergence, the predicted next symbol c_{t+1} in the sequence can be retrieved from the corresponding binary units of the Boltzmann machine.

In our system, we have implemented a new method called the conceptual Boltzmann machine (CB). In Pfleger [32], the Boltzmann machine acts on a static set of categories. We have extended this to an architecture which operates on a dynamically changing taxonomy of categories. Therefore, the model adjusts to the tree structure generated by the Cobweb. This means the Boltzmann machine changes its architecture on the fly, guided by the creation, removal, splitting, and merging operations suggested by the Cobweb. Accordingly, in the Boltzmann machine, the units and the update rule must be adjusted to the new structure.

During the run, sequences of atoms cause the creation of higher-level chunks that represent patterns. The newly created chunks are then further chunked into nodes that represent longer patterns. The maximal length of a pattern represented by a node is fixed to a value of N.

III. PERFORMANCE ANALYSIS OF THE SYSTEM

A. Measures for Clustering Evaluation

Unlike in supervised learning, where accuracy can be measured between the annotated labels and the labels predicted by a classifier, the number of clusters predicted by the analysis can be different from the number of annotated label categories. In addition, the mapping between annotated and predicted labels is unclear. This creates the need for a dedicated clustering evaluation measure. The following measures for evaluating the agreement between annotations and predicted labels have been suggested: purity [37], F-measure [38], Pearson's chi-squared coefficient [39], and the Rand index [39]. We choose the latter measure for evaluation, since it is a natural extension of classifying elements to pairs of elements.

A partition (clustering) C of a set X is defined as a set C = {C_1, ..., C_J} of subsets C_j ⊂ X, so that ∪_j C_j = X and C_j, C_j′ are disjoint for j ≠ j′. Let P be the set of all partitions of X and let |C| be the number of elements in a partition C ∈ P. Let A ∈ P be a partition generated by annotation and let C ∈ P be a predicted partition derived from an algorithm. Let P = {(x, x′) | x, x′ ∈ X, x ≠ x′} be the set of pairs of distinct events. Let L ⊂ P be the set of event pairs where both x and x′ share the same labels/annotation provided by A, and let K ⊂ P be the set of event pairs where both x and x′ lie in the same cluster provided by C. Then |K ∩ L| is the number of point pairs that lie in the same cluster and, at the same time, share the same annotated labels. For |C| > 1, the Rand index [39] is defined as:

R(A, C) = 2(|L ∩ K| + |P\L ∩ P\K|) / (|C|(|C| − 1)).    (15)

Since R depends on the number of clusters |C|, we adjust the Rand index, comparing it with the expected value of R (the baseline of a random clustering), E_R. The expected value of R over all partition combinations P × P is calculated as [40]:

E_R = (1/|P|²) ∑_{A,C∈P} R(A, C).    (16)


R becomes maximal for A = C:

R_max = (1/|P|) ∑_{C∈P} R(C, C).    (17)

Then the adjusted Rand index (ARI) is given by:

ARI(A, C) = (R(A, C) − E_R) / (R_max − E_R).    (18)

The ARI has values between 0 (random partitioning) and 1 (A = C).

The ARI assumes that annotations and clusterings are drawn randomly with a fixed number of clusters and a fixed number of elements per cluster [39]. Although this assumption will not always hold in our evaluation, we will use the ARI, since it is a more established measure than alternative ones, such as the Fowlkes-Mallows index, the Mirkin metric, the Jaccard index [41], or entropy-based measures [37, 39], e.g. normalized mutual information and variation of information.
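In practice the ARI need not be re-implemented; assuming scikit-learn is available, the comparison of a predicted symbol sequence with the annotated labels reduces to a single call (toy sequences shown):

    from sklearn.metrics import adjusted_rand_score

    annotated = ['ta', 'tschi', 'bum', 'ta', 'tschi', 'bum']
    predicted = [0, 1, 2, 0, 1, 1]     # cluster indices from the system

    print(adjusted_rand_score(annotated, predicted))
    # 1.0 only if the two partitions agree up to relabelling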

In evaluating our system, we use the ARI in two ways: in the evaluation of 1) the clustering of the feature vectors of each event (Tables IV and V in the supplementary material [22]) and of 2) the prediction of the entire symbol sequence, as explained in the sequel. According to Fig. 1, by segmentation, feature extraction, and clustering, the input sound wave is transformed into a sequence c = (c_1, c_2, ..., c_T) of T events, each one represented as one of J symbols. All occurrences of symbol j can be included in a cluster C_j that contains all the indices t where event c_t equals symbol j. Then C = (C_1, C_2, ..., C_J) is a partition of X = (1, 2, ..., T). To evaluate the prediction c, we annotate one of the ground truth labels (1, 2, ..., I) to each segment of the input, yielding an annotated sequence a = (a_1, a_2, ..., a_T). From this, a partition A = (A_1, ..., A_I) can be generated in the same way as the partition C for c. The number I of annotated labels is not necessarily the same as the number of symbols J determined by the clustering stage of our system. Then the ARI can be used to compare C and A, as done in Tables I-II and Figs. 4-6.

B. Data Sets

Two sets of test data are employed:

• Repetitive symbol sequences are used to assess the learning rate and the robustness of the sequence learning techniques with respect to errors in onset detection (skipping noise) or clustering (switching noise): We generate sequences that consist of patterns of length n_l = 2, ..., 5 made up of I distinct symbols. These patterns are repeated 20 times. For each pattern length n_l and each partition A = {A_1, A_2, ..., A_I} of (1, 2, ..., n_l), one sequence is generated in such a way that the elements of each partition subset A_i are symbol i's positions in the sequence (a generation sketch follows after this list). E.g. for n_l = 5 and partition A = {A_1, A_2} = {{1, 3, 5}, {2, 4}}, symbol '1' occurs at positions A_1 = {1, 3, 5} and symbol '2' occurs at positions A_2 = {2, 4}, yielding the symbol sequence ('1','2','1','2','1'). Since prediction of the next symbol is the same for structurally equivalent sequences, we only need one sequence to represent sequences where symbols are permuted, such as ('2','1','2','1','2').

• Audio recordings are used to evaluate the system in parts and as a whole:

Voice: Informal, low-quality and short voice recordings of very simplified beat boxing, each consisting of 2-3 different sound categories with different degrees of tonality, with a simple changing rhythm, a sequence of a repetitive three-sound pattern, and a ritardando, altogether 5 recordings each of 10-13 s duration. In order to demonstrate the unsupervised character of our system, we choose sounds that do not belong to a predefined category (e.g. an acoustical instrument). For Voice, onsets and segment labels (timbre classes) have been annotated manually.

ENST Drums: Formal, high-quality recorded drum sequences: 5 segments described in terms of style, complexity, and tempo as disco (simple slow, complex medium), rock (simple fast), and country (simple slow, complex medium) [42]. We use Gillet and Richard's [42] annotations from multi-channel recordings, which had been obtained by applying automatic onset detection with subsequent manual correction of onsets and labels, based on audio and video inspection.

Audio data is available on the website [22].
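The repetitive test sequences can be generated as in the following sketch, with our own function names, shown for the example partition given above.

    def pattern_from_partition(partition, nl):
        # E.g. [{1, 3, 5}, {2, 4}] with nl = 5 -> ['1', '2', '1', '2', '1'].
        pattern = [None] * nl
        for symbol, positions in enumerate(partition, start=1):
            for p in positions:
                pattern[p - 1] = str(symbol)
        return pattern

    def repetitive_sequence(partition, nl, repetitions=20):
        return pattern_from_partition(partition, nl) * repetitions

    print(repetitive_sequence([{1, 3, 5}, {2, 4}], nl=5)[:10])
    # -> ['1', '2', '1', '2', '1', '1', '2', '1', '2', '1']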

C. Results

The system architecture consists of the processing chain: 1) onset detection and feature extraction, 2) clustering, 3) expectation. We will evaluate stages 1), 2), 3) in isolation, 1) + 2) together (referred to as transcription), and the entire chain 1) + 2) + 3) together (referred to as prediction). For expectation, the performance of the sequence analysis is evaluated on a symbol sequence. For prediction, the sequence analysis differs from a purely symbolic analysis: the symbols assigned to the events may change in retrospect, e.g. when two clusters are merged online and only transitions to or from the newly merged cluster are counted. In addition, the sequence analysis needs to cope with two kinds of noise, due to wrong segmentation and/or due to assigning the wrong symbol to a segment. We use the repetitive symbol sequences in order to assess expectation, i.e. the learning rate and noise robustness of the sequence analysis. The audio recordings are used to test the processing stages of the entire system separately.

1) Learning Rate and Noise Robustness with Repetitive Symbol Sequences: We assess the learning rate and noise robustness of the sequence analysis stage (CB and HN). The sequence learning algorithm receives an initial chunk of a repetitive symbol sequence (Section III-B) as input. From this, the algorithm determines the most probable (expected) next 4n_l symbols. The algorithm outputs the expected next symbols c_{t+1}, ..., c_{4·n_l}, given the annotated symbols a_1, a_2, ..., a_t. For each t, from the predictions c_{t+1}, c_{t+2}, ..., c_{4·n_l} a partition C is generated and compared with the partition of the corresponding A based on annotations, as explained in Section III-A. For t ≤ 5·n_l all annotations so far are used for prediction; afterwards only the last 12 annotations a_{t−11}, a_{t−10}, ..., a_t are used for prediction. For the stochastic CB, the ARI is averaged over 100 runs of all partitions of a given length n_l. The trivial sequence that consists of a constant repetition of the same symbol is not considered.



Fig. 4. Learning rate of the two sequence learning algorithms (CB and HN), depending on the number of pattern repetitions. The ARI (Eq. 18) is given for an increasing number of repetitions of a pattern with various lengths n_l = 2, 3, 4, 5. HN reaches a perfect ARI quickly, in contrast to CB.


Fig. 5. Robustness of the two sequence learning algorithms (CB, HN, cf. Fig. 4) with respect to skipping noise. The ARI is given for a sequence of 20 repetitions of patterns of different lengths n_l and increasing probability p_sk of randomly skipping an event. For p_sk < 0.4, HN performs better than CB.

First we assess how the learning rate scales with pattern length and number of pattern repetitions. Fig. 4 shows the ARI averaged across all partitions of lengths n_l = 2, ..., 5. For this test, the HN is set to a maximum N-gram length of N = 5. The HN reaches perfect prediction (ARI = 1) after 4n_l events (2 pattern repetitions). CB converges much more slowly than HN; for n_l = 2 it reaches an ARI above 0.8 after 8 events and then increases much more slowly. For higher n_l, CB seems to converge towards perfect prediction even more slowly.

Different types of noise are used to transform the sequence in order to assess the robustness of the sequence learning techniques:

Skipping noise: In the original sequence, a symbol is skipped with a given probability 0 ≤ p_sk ≤ 0.95.


Fig. 6. Robustness of CB and HN (cf. Fig. 4) with respect to switching noise. The ARI is given for an increasing probability p_sw of randomly switching a symbol. HN performs better than CB for p_sw < 0.5, reaching random guess level (ARI = 0) at p_sw = 0.5.

Switching noise: In the original sequence, with a given probability of 0 ≤ p_sw ≤ 0.95, a symbol is selected randomly with uniform distribution across the n_l alternative symbols.

The average ARI is calculated over 100 runs for n_l = 2, 3, over 50 runs for n_l = 4, and over 20 runs for n_l = 5, for both CB and HN. Fig. 5 shows how the prediction performance (ARI) is affected by skipping symbols with a given p_sk in the repetitive symbol sequences. This simulates e.g. the failure of the onset extraction algorithm to detect an event. The prediction is performed given a sequence of 20 repetitions of the basic pattern. For HN and CB, the performance degrades until p_sk = 0.5, where random guess level is reached (ARI = 0). Until p_sk = 0.4, HN appears to be more robust towards skipping noise than CB, with CB having a worse ARI for higher n_l.

In Fig. 6, the effect of clustering errors on the sequence learning process is simulated. With increasing switching probability p_sw, a symbol is replaced by any of the n_l symbols under uniform distribution. The graph shows the prediction performance in terms of the ARI for CB and HN for different pattern lengths. The results are similar to those for skipping noise (Fig. 5): HN is more robust with respect to noise than CB, reaching random guess level (ARI = 0) at p_sw = 0.5. It can be summarized that for relatively small noise levels, HN appears to be more robust to skipping and switching noise, especially for longer pattern lengths.
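The two noise models are trivial to reproduce; a sketch with our own function names follows.

    import random

    def skip_noise(seq, p_sk):
        # Skipping noise: drop each symbol with probability p_sk
        # (simulating a missed onset).
        return [c for c in seq if random.random() >= p_sk]

    def switch_noise(seq, p_sw, alphabet):
        # Switching noise: with probability p_sw, replace a symbol by one
        # drawn uniformly from the alphabet (simulating a clustering error).
        return [random.choice(alphabet) if random.random() < p_sw else c
                for c in seq]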

2) Testing of Processing Stages with Audio Recordings: The tests with the Voice recordings (Section III-B) serve as a proof of concept of clustering with dynamically varying numbers of clusters. The ENST recordings are used for a more comprehensive quantitative evaluation of the system. We test each processing stage separately. For the audio, the sample rate is f_s = 44100 Hz. For segmentation and feature extraction, the hop size is 128 samples, and the window size is 1024 samples.

a) Onset detection: For the evaluation of the onset detection (Section II-A), we employ a widely used procedure


[4, 43]. Onset times manually annotated by subjects serve as references. The onsets estimated by the onset detection algorithm are then compared to the manually annotated onsets. Annotated and estimated onsets are considered a match when their difference in time is smaller than a given threshold. In our evaluation, we use an onset match threshold of ∼50 ms. Since the data is assumed to be monophonic, the evaluation only permits a one-to-one mapping between estimated and annotated onsets.
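A sketch of this matching procedure, with greedy one-to-one assignment in time order as our own simplification of the usual evaluation code:

    def onset_f_measure(annotated, estimated, tol=0.05):
        # Match each annotated onset to at most one estimated onset
        # within the tolerance (in seconds).
        used, matched = set(), 0
        for a in annotated:
            for j, e in enumerate(estimated):
                if j not in used and abs(a - e) <= tol:
                    used.add(j)
                    matched += 1
                    break
        if matched == 0:
            return 0.0
        precision = matched / len(estimated)
        recall = matched / len(annotated)
        return 2 * precision * recall / (precision + recall)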

Using the following onset detection parameters: smoothing length M = 33 in Eq. 2, sensitivity C = 0.9 and look-ahead window length P = 10 in Eq. 3, threshold window length W = 11 in Eq. 4, and silence threshold θ_s = 0.002 in Eq. 5, onset detection yields an F-measure of ∼99% for the Voice data set. Therefore, we focus on the clustering and the prediction stage. We also notice that for smoothing lengths M > 33 the system does not improve significantly. Large smoothing lengths reduce the temporal precision of onsets, which is important for good feature extraction, since most of the information about an event is located in the attack.

b) Clustering: We now compare the performance of our incremental online Cobweb clustering with the offline (batch) HDP-HMM, whose constant cluster number is nevertheless inferred, as a benchmark. In order to assess the clustering process in isolation, we assume error-free onset detection in the previous stage; to achieve this, we use the annotated onsets as input. In order to assess the stability of the system, we performed a grid search over the two most sensitive parameters of the task and the algorithm. For Cobweb, we explore the analysis window length L (Section II-B) and the acuity a (Eq. 6). On the parameter grid, the window length/acuity pair with maximal ARI is determined, extending the parameter grid if the maximum lies on the grid border, with empirically set constant grid step sizes.

For Voice, Cobweb performance peaks at ARI = 82.7% for L = 150 ms, a = 18.5 on a parameter grid over L = 50, 75, ..., 175; a = 15, 15.5, ..., 19. For ENST, Cobweb performs best at 85.7% for L = 50 ms, a = 13.5 on a parameter grid over L = 25, 50, ..., 100; a = 13, 13.5, ..., 15 (cf. Tables IV and V in the supplementary material [22]). This means that the timbre model and clustering process can successfully classify the audio events. We also notice that Voice needs a much longer analysis window than ENST. This test, as explained above, was performed using the annotated onsets. The results could change when the onsets are estimated. This effect is evaluated in the transcription test (Section III-C2c).

For the HDP-HMM, we first reduce the feature vectors of the input to D dimensions by means of a PCA on the full sequence. The observation distributions used are Gaussian, with parameters sampled i.i.d. from a normal inverse Wishart prior [31] with parameters µ_0 = 0, κ_0 = 0.4, Λ_0 = 0.001, ν_0 = D + 2.² The maximum number of states of the weak limit approximation inference is set to 10 and the number of Gibbs sampling iterations to 100. For Voice, HDP-HMM performance peaks at ARI = 99.1% for γ = 8.0, α = 7.0, D = 2 on

² Cf. Murphy [44], Section 9.2, p. 20 for the meaning of the parameters.

TABLE I
EXPECTATION OF Voice (LEFT) AND ENST (RIGHT): ARI (IN %) FOR DIFFERENT MAXIMUM LENGTHS N OF CB/HN (ROWS).

Voice:              ENST:
N   CB    HN        N   CB    HN
2   7.4   22.4      2   6.0   18.9
3   6.8   27.3      3   9.1   28.9
4   7.3   41.1      4   7.8   43.2
5   5.1   50.9      5   6.3   42.7
6   4.4   50.9      6   6.6   42.7
7   5.1   50.9      7   7.7   42.6

a parameter grid over γ, α = 4.0, 5.0, ..., 11.0 and D = 2, 3. For ENST, HDP-HMM performs best at ARI = 84.0% for HDP concentration parameters γ = 6.0, α = 12.0 [17] and D = 2 on a parameter grid over γ, α = 4.0, 5.0, ..., 13.0 and D = 2, 3. The benchmark HDP-HMM performs better for Voice than Cobweb, whereas for ENST, Cobweb performs 1.7% better than HDP-HMM. When comparing these results, one has to keep in mind that HDP-HMM clustering has jointly learned, offline, a stable cluster number and the transition probabilities, exploiting sequential information, whereas Cobweb has been trained online with an adaptive cluster number.

c) Transcription: The transcription test evaluates the subsystem composed of onset detection, feature extraction, and clustering. In contrast to the expectation test, the entire symbol (inter-onset interval) sequence c_1, c_2, ..., c_t extracted from the clustering stage is always used from the beginning to predict the next symbol (inter-onset interval) c_{t+1}. The annotations a are not used for prediction, only for evaluation. The partitions generated from the detected events were compared with the partitions generated from the annotated labels using the ARI. Online learning Cobweb with dynamically changing cluster numbers and offline learning HDP-HMM with a constant cluster number are compared. For Cobweb, Voice performs with ARI = 81.3% for L = 150, a = 17. On the ENST data set, Cobweb yields ARI = 76.3% for L = 50, a = 13.5, using the same parameter grids as for the clustering (Section III-C2b). In comparison to the results for clustering, the ARI degrades a bit, in particular for ENST, due to wrongly estimated onsets (cf. Tables VI and VII in the supplementary material [22]). HDP-HMM transcription performance for Voice peaks at ARI = 98.8% for γ = 8.0, α = 12.0 on a parameter grid over γ, α = 4.0, 5.0, ..., 13.0. For ENST, HDP-HMM transcription performance peaks at ARI = 76.2% for γ = 5.0, α = 8.0 on the same parameter grid as for Voice. For ENST, HDP-HMM and Cobweb are almost equal. Although for Voice the ARI is much higher for the HDP-HMM benchmark than for Cobweb, we have to keep in mind that Cobweb learns online with changing cluster numbers over time, whereas HDP-HMM is trained offline with a constant number of clusters.

d) Expectation: The expectation test evaluates the performance of the sequence learning module on the data sets. We predict the cluster label c_{t+1} of event t + 1 based on the annotations from the start: a_1, a_2, ..., a_t. The results in Table I show that for the prediction of the sequences of the Voice and the ENST data sets, HN (ARI = 43.2% for N = 4) works a lot better than CB, which yields an ARI of 7.8%, just slightly better than random (0%). CB's low performance can


TABLE II
FULL PREDICTION FOR THE Voice DATA SET USING HN: ARI (IN %) FOR DIFFERENT TEMPORAL ACUITIES a_t (ROWS) AND TIMBRAL ACUITIES a FROM EQ. 6 (COLUMNS).

a_t \ a   17     17.5   18     18.5   19     19.5   20
0.05      16.7   15.5   16.3   16.0   25.2   23.8   23.8
0.0625    15.9   15.3   16.2   16.0   25.2   24.0   24.0
0.075     14.4   14.3   15.2   16.7   27.2   25.8   25.8
0.0875    15.7   16.4   17.2   17.3   25.8   24.5   24.5

TABLE III
FULL PREDICTION FOR THE ENST DATA SET USING HN: ARI (IN %) FOR DIFFERENT TEMPORAL ACUITIES a_t (ROWS) AND TIMBRAL ACUITIES a FROM EQ. 6 (COLUMNS).

a_t \ a   18     18.5   19     19.5   20     20.5   21
0.075     33.1   33.8   33.0   32.6   36.2   34.7   33.3
0.0875    35.1   36.2   35.3   34.2   37.8   36.3   34.9
0.1       36.3   37.6   36.6   35.4   39.2   37.6   36.2
0.1125    35.9   37.2   36.2   35.0   38.8   37.2   35.8
0.125     34.6   35.9   34.9   33.7   37.5   35.9   34.5

be attributed to various factors: In general, many traditional recurrent networks are known to have a slow learning rate [45]. In particular, we have observed a slow learning rate (Fig. 4) and low noise robustness (Figs. 5 and 6). Whereas for HN the updates in frequency counts become smaller relative to the count so far (from 1 to 2 is a higher step relative to 1 than from 100 to 101 relative to 100), in CB the weights are updated with a constant learning rate µ (Eq. 14). Weight updates are performed by stochastic gradient descent, where each instance is only used once, when it has just occurred. Although this is cognitively plausible if we assume that only a limited number of instances can be stored by the cognitive system, it comes at the price of diminished learning speed, compared to a system where the update is performed using a batch of instances. In addition, the architecture of the CB may be suboptimal with respect to the hidden nodes. Also, in the network, new hidden nodes are only generated one at a time, further limiting learning speed. Furthermore, the thresholds θ_w for creating new hidden nodes are chosen heuristically and may be suboptimal for short sequences like the ones presented. We can also see that for n-gram maximum lengths n > 5 for Voice (n > 3 for ENST) the result does not improve. For linguistic data, slower convergence and worse performance of CB relative to HN are also observed in Pfleger [32], pp. 80 & 133. In the sequel, we will only use HN.

e) Prediction: The prediction task consists in running the full system, including HN as the sequence analyzer (Tables II and III). After the transcription of the events c1, . . . , ct, the system predicts the next symbol and its timing (the next IOI) ct+1. For evaluating the match between predicted and annotated onsets, we set the tolerance threshold to 150 ms. For the best configuration, the full prediction yields an ARI of 27.2% for Voice and an ARI of 39.2% for ENST. The performance is limited by the weakest of its components, in this case the sequence analysis.
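As an illustration of the tolerance criterion, the following sketch matches predicted onset times to annotated ones within 150 ms, consuming each annotation at most once. The greedy nearest-first strategy shown here is an assumption for illustration, not necessarily the exact matching procedure used in the evaluation.

TOL = 0.150  # tolerance in seconds

def match_onsets(predicted, annotated, tol=TOL):
    hits, used = [], set()
    for p in predicted:
        best, best_d = None, tol
        for j, a in enumerate(annotated):
            if j not in used and abs(p - a) <= best_d:
                best, best_d = j, abs(p - a)
        if best is not None:     # a hit: annotation within tolerance
            used.add(best)
            hits.append((p, annotated[best]))
    return hits

print(match_onsets([0.10, 0.52, 1.31], [0.05, 0.60, 1.90]))
# -> [(0.1, 0.05), (0.52, 0.6)]; 1.31 is unmatched (no onset within 150 ms)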

D. Examples

In this section, we present a few examples (audio on the website [22]) of transcription and prediction using HN in order to demonstrate the performance, evolution, and shortcomings of the system. From Hazan et al. [11], we adopt the procedure to optimally map the annotated symbols to the clusters found by the clustering algorithm. We calculate the matching matrix between the annotations ('score') and clusters of each event. In this matching matrix, we then iteratively select the maximal entry, thereby establishing a connection between a row (annotations) and a column (clusters). After eliminating the row and column of the maximal entry, we determine the maximal entry again, until the matrix vanishes.
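This mapping procedure can be sketched in a few lines of Python; the greedy row/column elimination below follows the description above, while the function and variable names are hypothetical.

import numpy as np

def map_annotations_to_clusters(annotations, clusters):
    rows = sorted(set(annotations))
    cols = sorted(set(clusters))
    m = np.zeros((len(rows), len(cols)))
    for a, c in zip(annotations, clusters):
        m[rows.index(a), cols.index(c)] += 1   # matching (confusion) matrix
    mapping = {}
    while m.max() > 0:
        i, j = np.unravel_index(np.argmax(m), m.shape)
        mapping[rows[i]] = cols[j]   # link annotation row to cluster column
        m[i, :] = -1                 # eliminate the row (annotation) ...
        m[:, j] = -1                 # ... and the column (cluster)
    return mapping

print(map_annotations_to_clusters(["ta", "tschi", "ta", "bum"], [2, 0, 2, 1]))
# -> {'ta': 2, 'bum': 1, 'tschi': 0}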

Fig. 7. The system (with HN) quickly captures a simple ta-tschi-bum pattern ('Full Prediction of bumtatschi'). Time (horizontal axis) is plotted against event labels (one line each for 'ta', 'tschi', and 'bum'). Annotated labels are indicated in black below the lines. Above the horizontal lines we find events that are correctly estimated ('•'), matched to the wrong cluster ('■'), and unmatched ('▲') due to a wrongly estimated onset.

In Figs. 7 and 8, we display sequences of annotations and clusters on the same line if they are linked through this mapping. In Fig. 7, a simple ta-tschi-bum pattern is quickly captured. We can see how the first three events, annotated as ta, tschi, and bum, are matched with the wrong clusters bum (initial blue triangle above the top line), ta, and tschi (red squares). The first three cluster mismatches are expected, since the system has no previous knowledge of the symbol space nor of the sequence and therefore cannot predict symbols or patterns that have not yet occurred. At around 5.5 s, an event annotated as tschi is matched with the ta cluster. The timing of the last bum is misestimated, and for the last tschi both timing and cluster matching are wrong. The time deviation errors are due to the fact that the recorded voice does not follow a temporally regular pattern.

Fig. 8. The system (with HN) adapts to a pattern change from ta-bong to ta-ta-bong ('Full Prediction of bongtabongtata'; cf. Fig. 7).


Fig. 9. Cluster merging: after 38 events, two clusters ('•', '■') merge into one. The projection of the MFCC vectors (timbre representation) onto their first two principal components after 10, 20, and 38 events (above) and the corresponding incremental clustering trees (below) are shown.

In Fig. 8, we observe how the system adapts to pattern changes within the sequence. For the first two events, the cluster matching is wrong. Then, after having processed enough sounds, the system performs correct predictions. In the middle, around 7.5 s, when the repetition pattern of ta is introduced, the onset is wrongly estimated for three ta events, two of these events also being matched with the wrong cluster, and one additional event being matched with the wrong cluster only. The errors in the middle of the sequence are due to the pattern change. The N-gram is able to update the statistics and perform correct predictions after three occurrences of the new pattern.

In Fig. 9 (sound and video on the website [22]), a sequence of alternating bass drum and hi-hat samples is played. During the sequence, the hi-hat is gradually mixed linearly with an increasing amount of bass drum and vice versa, so that in the end both hi-hat and bass drum are mixed together in a balanced way, yielding a repetitive sequence of similar sounds. The system recognizes the two sound clusters in the beginning, and finally merges the two clusters into one single cluster.

In Fig. 10 (sound and video on the website [22]), a sequence of sound events is analyzed that starts with one sound, later joined by a second and a third sound. The system is able to split the initial cluster gradually into two and then three clusters.

IV. CONCLUSION AND PERSPECTIVES

We have presented a full system that predicts the next sound event from the previous events, operating on audio data. Taking into account no previous knowledge, neither of the sounds or instruments used nor of the timing and rhythmical structure of the audio segment, the system starts from tabula rasa, performing predictions from the very first sound event. The system adapts to pattern changes in the sequence as well as to the appearance of new sounds or instruments at any time. Currently the system is limited by the lack of metrical analysis, making it especially sensitive to missed onsets.

Fig. 10. Creation of new clusters: after 20 and 80 sound events, new clusters ('•', '▲') emerge on the fly. The PCA projections after 8, 20, and 80 events (above) and the corresponding clustering trees (below) are shown; cf. Fig. 9.

Considering the metrical context could significantly improve the quality of predictions. For this goal, a metrical alignment procedure [13, 46] could be combined with incremental learning. As alternatives to CB and HN, variable length Markov models [47, 13, 46] or other deep learning architectures could be used, thereby overcoming the context length limitation of HN and CB and the slow learning of CB. The long short-term memory (LSTM) [45] is a recurrent neural network that was developed to capture dependencies between disconnected, distant chunks within the same time series. Crucial to this, and for speeding up learning, LSTM uses special memory cells whose access can be opened and closed by gating units. Successfully applied to protein homology detection, automatic composition [48], and handwriting and spoken language recognition, LSTM could be used to replace CB or HN and improve learning speed in our application. HDP-HMM [17] could be adapted to online learning [21] with incremental addition/removal of clusters, comprising also segmentation, thereby replacing onset detection. The presented system can also be modified to learn melodies and chord progressions. For learning melodies, in the feature extraction stage (Section II-B), the MFCCs need to be replaced by a pitch detection method, as used in Marxer et al. [26] for learning songs by the Mbenzele pygmies or in Cherla et al. [49] for learning guitar riffs. When analysing (piano) chord progressions, MFCCs can be replaced by constant Q profiles [50, 51]. Future work includes the development of a better representation of pitch and harmony, using a larger training set when processing more complex music.

Inspired by these ideas, we imagine a musical improvisational dialogue between a human and a machine in which the human may spontaneously articulate novel ideas such as new sounds, motifs, rhythms, or harmonies. A dumb and ignorant machine would dampen and finally stop the musical flow. But if the machine could take up the novel idea and reply to it, varying the suggestions of its human partner, they could develop an enhanced musical conversation.


REFERENCES

[1] C. Parisien, A. Fazly, and S. Stevenson, "An incremental Bayesian model for learning syntactic categories," in Proceedings of the Twelfth Conference on Computational Natural Language Learning, Stroudsburg, PA, USA, 2008, pp. 89–96.

[2] P. K. Kuhl, "Mechanisms of developmental change in speech and language," in Proceedings of the XIIIth International Congress of Phonetic Sciences, vol. 2, 1995, pp. 132–139.

[3] B. Hayes, "Phonological acquisition in optimality theory: The early stages," Constraints in Phonological Acquisition, pp. 158–203, 2004.

[4] J. S. Downie, K. West, A. F. Ehmann, and E. Vincent, "The 2005 music information retrieval evaluation exchange (MIREX 2005): Preliminary overview," in Proc. Int. Society Music Information Retrieval, 2005, pp. 320–323.

[5] D. Conklin and I. H. Witten, "Multiple viewpoint systems for music prediction," Journal of New Music Research, vol. 24, no. 1, pp. 51–73, 1995.

[6] M. T. Pearce and G. A. Wiggins, "Improved methods for statistical modelling of monophonic music," Journal of New Music Research, vol. 33, no. 4, pp. 367–385, 2004.

[7] M. Mozer, "Neural network music composition by prediction: Exploring the benefits of psychophysical constraints and multiscale processing," Connection Science, vol. 6, pp. 247–280, 1994.

[8] O. Lartillot, S. Dubnov, G. Assayag, and G. Bejerano, "Automatic modeling of musical style," in Proc. Int. Computer Music Conference, 2001.

[9] F. Pachet, "The continuator: Musical interaction with style," Journal of New Music Research, vol. 32, no. 3, pp. 333–341, 2003.

[10] G. Assayag and S. Dubnov, "Using factor oracles for machine improvisation," Soft Computing, vol. 8, no. 9, pp. 604–610, 2004.

[11] A. Hazan, R. Marxer, P. Brossier, H. Purwins, P. Herrera, and X. Serra, "What/when causal expectation modelling applied to audio signals," Connection Science, vol. 21, pp. 119–143, 2009.

[12] R. Marxer and H. Purwins, "Unsupervised incremental learning and prediction of audio signals," in Proc. Int. Symposium on Music Acoustics, 2010.

[13] M. Marchini and H. Purwins, "Unsupervised generation of percussion sound sequences from a sound example," in Sound and Music Computing Conference, 2010.

[14] J. Paiement, Y. Grandvalet, and S. Bengio, "Predictive models for music," Connection Science, vol. 21, no. 2, pp. 253–272, 2009.

[15] J. Zhang, Z. Ghahramani, and Y. Yang, "A probabilistic model for online document clustering with application to novelty detection," in Advances in Neural Information Processing Systems 17. Cambridge, MA: MIT Press, 2005, pp. 1617–1624.

[16] T. Bertin-Mahieux, R. J. Weiss, and D. P. Ellis, "Clustering beat-chroma patterns in a large music database," in Proc. Int. Society Music Information Retrieval, 2010, pp. 111–116.

[17] Y. W. Teh and M. I. Jordan, "Hierarchical Bayesian nonparametric models with applications," Journal of the American Statistical Association, vol. 1, 2010.

[18] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky, "A sticky HDP-HMM with application to speaker diarization," The Annals of Applied Statistics, pp. 1020–1056, 2011.

[19] L. Ren, D. B. Dunson, and L. Carin, "The dynamic hierarchical Dirichlet process," in Proc. Int. Conf. on Machine Learning. ACM, 2008, pp. 824–831.

[20] T. S. Stepleton, Z. Ghahramani, G. J. Gordon, and T. S. Lee, "The block diagonal infinite hidden Markov model," in Int. Conf. on Artificial Intelligence and Statistics, 2009, pp. 552–559.

[21] A. Bargi, R. Xu, and M. Piccardi, "An online HDP-HMM for joint action segmentation and classification in motion capture data," in Computer Vision and Pattern Recognition Workshops, IEEE Computer Society Conf. on, June 2012, pp. 1–7.

[22] R. Marxer and H. Purwins. (2014, Dec.) Unsupervised incremental learning and prediction of audio signals: Supplementary material. [Online]. Available: http://www.ricardmarxer.com/research/unsupervised2014

[23] C. Duxbury, J. Bello, M. Davies, and M. Sandler, "Complex domain onset detection for musical signals," in Proc. Digital Audio Effects Workshop, London, UK, 2003.

[24] J. P. Bello and M. B. Sandler, "Phase-based note onset detection for music signals," Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, vol. 5, no. 2, pp. 441–444, 2003.

[25] P. Mermelstein, "Distance measures for speech recognition, psychological and instrumental," in Pattern Recognition and Artificial Intelligence. New York: Academic, 1976, pp. 374–388.

[26] R. Marxer, P. Holonowicz, H. Purwins, and A. Hazan, "Dynamical hierarchical self-organization of harmonic, motivic, and pitch categories," in Music, Brain and Cognition Workshop, held at Neural Information Processing Conference, 2007.

[27] D. H. Fisher, "Knowledge acquisition via incremental conceptual clustering," Machine Learning, vol. 2, no. 2, pp. 139–172, 1987.

[28] K. McKusick and K. Thompson, "Cobweb 3: A portable implementation," Technical Report No. FIA-90-6-18-2, 1990.

[29] J. Yoo and S. Yoo, "Concept formation in numeric domains," in Proc. ACM Annual Conf. on Computer Science, 1995, pp. 36–41.

[30] M. Gluck and J. Corter, "Information, uncertainty, and the utility of categories," Proc. Annual Conf. of the Cognitive Science Society, pp. 283–287, 1985.

[31] M. J. Johnson and A. S. Willsky, "Bayesian nonparametric hidden semi-Markov models," Journal of Machine Learning Research, vol. 14, pp. 673–701, February 2013.

[32] K. Pfleger, "On-line learning of predictive compositional hierarchies," Ph.D. dissertation, Stanford University, 2002.

[33] I. Zitouni, O. Siohan, H.-K. J. Kuo, and C.-H. Lee, "Backoff hierarchical class n-gram language modelling for automatic speech recognition systems," in Proc. Interspeech, 2002.

[34] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Science, vol. 9, no. 1, pp. 147–169, 1985.

[35] K. Pfleger, "On-line learning of predictive compositional hierarchies by Hebbian chunking," in Proceedings of the AAAI 2000 Workshop on New Research Problems for Machine Learning, Tech. Rep., 2002.

[36] P. Smolensky, "Information processing in dynamical systems: Foundations of harmony theory," in Parallel Distributed Processing, J. M. D. Rumelhart, Ed. Cambridge, MA: MIT Press, 1986, vol. 1, pp. 194–281.

[37] Y. Zhao and G. Karypis, "Hierarchical clustering algorithms for document datasets," Data Mining and Knowledge Discovery, vol. 10, no. 2, pp. 141–168, 2005.

[38] B. Larsen and C. Aone, "Fast and effective text mining using linear-time document clustering," in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 1999, pp. 16–22.

[39] S. Wagner and D. Wagner, Comparing Clusterings: An Overview. Universität Karlsruhe, Fakultät für Informatik, Karlsruhe, 2007.

[40] A. L. Fred and A. K. Jain, "Combining multiple clusterings using evidence accumulation," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 6, pp. 835–850, 2005.

[41] M. Meila, "Comparing clusterings: an information based distance," Journal of Multivariate Analysis, vol. 98, no. 5, pp. 873–895, 2007.

[42] O. Gillet and G. Richard, "ENST-drums: an extensive audio-visual database for drum signals processing," in ISMIR, 2006, pp. 156–159.

[43] P. Leveau and L. Daudet, "Methodology and tools for the evaluation of automatic onset detection algorithms in music," in Proc. Int. Society Music Information Retrieval, 2004, pp. 72–75.

[44] K. P. Murphy, "Conjugate Bayesian analysis of the Gaussian distribution," University of British Columbia, Tech. Rep., 2007.

[45] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[46] M. Marchini and H. Purwins, "Unsupervised analysis and generation of audio percussion sequences," in Exploring Music Contents. Springer, 2011, pp. 205–218.

[47] P. Bühlmann and A. J. Wyner, "Variable length Markov chains," Annals of Statistics, vol. 27, pp. 480–513, 1999.

[48] D. Eck and J. Schmidhuber, "Finding temporal structure in music: Blues improvisation with LSTM recurrent networks," in IEEE Workshop on Neural Networks for Signal Processing. IEEE, 2002, pp. 747–756.

[49] S. Cherla, H. Purwins, and M. Marchini, "Automatic phrase continuation from guitar and bass guitar melodies," Computer Music Journal, vol. 37, no. 3, pp. 68–81, 2013.

[50] H. Purwins, B. Blankertz, and K. Obermayer, "A new method for tracking modulations in tonal music in audio data format," in Int. Joint Conf. on Neural Networks (IJCNN'00), vol. 6. NI-BIT, 2000, pp. 270–275.

[51] K. Kosta, M. Marchini, and H. Purwins, "Unsupervised chord-sequence generation from an audio example," in ISMIR, 2012, pp. 481–486.

Ricard Marxer is currently a Research Associate at the Speech and Hearing Research Group, University of Sheffield, United Kingdom, where he previously participated as a Marie Curie Experienced Researcher in the INSPIRE-ITN European project. He obtained his Ph.D. in source separation of music audio signals from Universitat Pompeu Fabra, Barcelona, Spain. He studied Telecommunications Engineering at the Institut National des Sciences Appliquées, Lyon, France. He has taught various Masters and Bachelors level courses at Universitat Pompeu Fabra and Elisava School of Design. He has written more than 15 scientific papers and is an inventor of several patents. He has participated in several international research projects, including an industrial research project in collaboration with Yamaha Corporation, and has been a visiting researcher at Universiteit van Amsterdam and KU Leuven. His research interests include source separation, machine listening, unsupervised and active learning, speech recognition, and music analysis.

Hendrik Purwins is Assistant Professor at the Audio Analysis Lab and Sound and Music Computing Group, Aalborg Universitet København, Denmark. He obtained his Ph.D. from Berlin Institute of Technology (EE/CS), with a scholarship from the "Studienstiftung des deutschen Volkes". He studied mathematics at Bonn and Münster University (diploma in pure mathematics). He has been a lecturer at Universitat Pompeu Fabra, Barcelona, a researcher at the Berlin Brain Computer Interface, and a visiting scholar at IRCAM, Paris, CCRMA, Stanford, and McGill University, Montreal. He has written more than 70 scientific papers and has won 13 research grants/prizes. His interests comprise statistical, unsupervised models for machine listening, music generation, sound resynthesis, and failure prediction in semiconductor manufacturing. Having started playing the violin at the age of 7, he also studied music at Cologne Music Academy, Münster University, and Berlin University of the Arts.