Melody Extraction and Musical Onset Detection from Framewise STFT Peak Data

Harvey D. Thornburg, Student Member, IEEE, Randal J. Leistikow, Student Member, IEEE, and Jonathan Berger

Abstract— We propose a probabilistic method for the joint segmentation and melody extraction of musical audio signals which arise from a monophonic score. The method operates on framewise short-time Fourier transform (STFT) peaks, enabling a computationally efficient inference of note onset, offset, and pitch attributes while retaining sufficient information for pitch determination and spectral change detection. The system explicitly models note events in terms of transient and steady-state regions, as well as possible gaps between note events. In this way, the system readily distinguishes abrupt spectral changes associated with musical onsets from other abrupt-change events. Additionally, the method may incorporate melodic context by modeling note-to-note dependences. The method is successfully applied to a variety of piano and violin recordings containing reverberation, effective polyphony due to legato playing style, expressive pitch variations, and background voices. While the method does not provide a sample-accurate segmentation, it facilitates the latter in subsequent processing by isolating musical onsets to frame neighborhoods and identifying possible pitch content before and after the true onset sample location.

I. INTRODUCTION

MUSICAL signals are typically organized as sequences of note events. These events often arise from a performer's action, and the pitch content is expected to be similar for the duration of each event. The beginning of each note event is termed a note onset, and the end a note endpoint. For the melody extraction task, we aim to estimate the onset time, endpoint time, and pitch of each event. This extraction is so far restricted to nominally monophonic signals, which are signals that may arise from a monophonic score. Of course, the actual recording may contain significant overlaps in the pitch content over time, especially at note transition points, where reverberation or performance techniques sustain notes until after subsequent ones have begun.

Due to computational considerations, the extraction operates on framewise sequences of short-time Fourier transform (STFT) peaks. In most cases these peaks suffice for pitch determination [10], [27], [30]. When tracking pitches over time, however, problems arise where the frame boundaries do not exactly coincide with the pitched portions of note events. Frames may, for instance, contain samples from the transient portion of a note event in which pitch information is not yet salient, cross over the boundary between two notes, contain only background noise, or include information due to other interfering sound sources.

Fortunately, musical signals are highly structured, both at the signal level, in terms of the expected timbral evolution of each note event, and at higher levels, in terms of melodic and rhythmic tendencies. This structure creates context which can be used to predict pitch values and note event boundaries. For instance, prior knowledge of musical key generates expectations favoring certain pitch sequences. Although the key itself is seldom known a priori, we can still benefit from the knowledge that the key is likely to be consistent for long sequences of frames, while being robust to situations in which the key happens to fluctuate more rapidly than expected. To best exploit contextual knowledge in ways that are robust to inherent uncertainties in this knowledge, we propose a probabilistic model which satisfies the following objectives:

• It segments the signal into discrete note events, possibly punctuated by regions of silence or spurious phenomena. Each event is further segmented into transient and steady-state regions. Transient regions account as well for corrupted peak information caused by frame boundary misalignment at the true onset location.

• It distinguishes abrupt spectral changes associated with musical onsets from other, spurious abrupt-change events, such as the performer knocking the microphone, clicks and pops, etc.

• It improves pitch detection across all frames containing significant pitch content in a note event, because the information in all frames is used to infer the content in a given frame.

The desire to fuse information from raw signal observations with prior and contextual knowledge leads naturally to a Bayesian approach, especially when the goal is to minimize the probability of making incorrect decisions about a particular attribute (e.g., the pitch of a note event)1. The Bayesian temporal integration and segmentation of spectral features is presently an emerging field to which many researchers have contributed; e.g., Kashino et al. [13], Raphael [22], Walmsley et al. [31], Sheh and Ellis [24], Hainsworth [11], Kashino and Godsill [14], and Cemgil et al. [4], [5], among others.

Perhaps the closest comparable work (in terms of modeling aspects) is that by Cemgil et al., which proposes a generative probabilistic model for note identification in both monophonic and polyphonic cases. Cemgil's model explicitly encodes note segmentation via a binary (note-off/note-on) indicator variable, the latter influencing the evolution of signal characteristics (frequencies, decay rates, and amplitudes of sinusoidal components). By contrast, the proposed method yields not only a segmentation into individual note events, but also a sub-segmentation of each event into transient and steady-state regions. The sub-segmentation proves useful in applications ranging from analysis-synthesis transformations to window switching for audio data compression [18], [7], [29], [8]. In time-scaling, for instance, it is often desirable to dilate or contract the steady-state portions of notes without altering the transient portions [7], [18].

1The decision-theoretic optimality of Bayesian methods is briefly reviewed in Section II-B.

As in the proposed model, the approaches of Cemgil [4], [5] and Hainsworth [11], among others, allow the segmentation to influence the evolution of the signal characteristics, the latter varying smoothly during note events and abruptly across event boundaries. However, there exists an important distinction between these methods and the present approach in the encoding of these characteristics, a distinction which has profound implications for the computability of exact Bayesian inference. The proposed model, by discretizing the relevant signal characteristics (i.e., note pitch as measured by integer value and fractional tuning endpoint, as well as amplitudes of pitched and transient information), obtains an exact inference solution which is computable in linear time, while the exact solutions for the models of [4], [5] are at best quadratic time. Although the authors [11], [5] develop linear-time approximate inference strategies which seem to work well in practice, it may not be possible to check the validity of these approximations in cases where the cost of computing exact inference is prohibitive. The computational cost of the proposed method is analyzed in Section VI. The loss of resolution imposed by the discretization of signal characteristics in the proposed approach is not problematic, for two reasons: first, the primary quantities of interest for segmentation and melody extraction (note values, onset incidences, region labels) are themselves discrete; second, the effects of the discretization may be checked by increasing the grid resolutions and examining whether there is any appreciable change in the result.

The proposed method obtains additional computational savings by operating on a highly reduced, yet quite general, feature set: framewise STFT peaks. While this approach does not provide a sample-accurate segmentation like some existing methods [4], [5], [11], sample accuracy may be recovered in a subsequent pass which obtains the precise onset sample locations by preselecting just those frames in which onsets are detected (plus surrounding frames) and performing sample-accurate processing on just those frames. Since onset frame neighborhoods are of limited duration, it may be possible to apply even quadratic-time algorithms to perform the sample-accurate processing. The latter may also be facilitated by partial knowledge, obtained from the framewise pitch content analysis, of the signal models before and after the change.

Another key innovation of the proposed method is the use of two distinct criteria to perform transcription: the global error rate across all frames is minimized for the segmentation variable; then, given that optimal sequence, symbol error rates are minimized in each frame for the signal characteristic variables. Applying a global maximum a posteriori criterion to the segmentation variable yields segmentations in which sequential integrity is preserved, avoiding sequences which are undesirable in transcription, such as those with onsets in successive frames. Section II-B provides a detailed discussion motivating these criteria as the correct ones for the melody extraction and onset detection task.

II. PROBABILISTIC MODEL AND EXTRACTION GOALS

A. Probabilistic Model

The proposed model accounts for the evolution of monophonic signal characteristics governed by a succession of note events punctuated by regions of silence or spurious phenomena. Each note event is further divided into two principal regions, a transient region in which pitch content is ambiguous, followed by a steady-state region in which pitch content is salient. The result is a cyclic succession of regimes (transient → pitched → null), as displayed in Figure 1.

Fig. 1. Cyclic succession of note event regimes in a monophonic passage

The duration of transient or null regimes may be zero or more frames; however, a pitched regime is expected to occupy at least one frame.

We define a musical onset as a function of two successive frames indicating that the transient boundary has been crossed. To represent the cyclic succession of regimes, as well as the onset incidence, we define the following modes, which serve to label each frame:

• ‘OT’ – the beginning frame of a transient region, of which there can be at most one per note event.

• ‘OP’ – the beginning frame of a note event in which the first frame already contains salient pitch content, of which there can be at most one per note event.

• ‘CT’ – the continuation of a transient region in the event the region occupies more than one frame; must follow a previous ‘CT’ or ‘OT’.

• ‘CP’ – the continuation of a pitched region; must follow ‘OP’, ‘CP’, ‘OT’, or ‘CT’.

• ‘N’ – a silent or spurious frame, which occurs any time after the last frame of a note event. An ‘N’ is followed by either another ‘N’ or an onset (‘OT’ or ‘OP’).

The onset incidence is represented by the event that either M_t = ‘OT’ or M_t = ‘OP’. Such an explicit representation facilitates the modeling of abrupt-change behaviors which characterize the beginnings of note events. For instance, the amplitude of percussive sounds generally undergoes an abrupt increase at onset locations; this behavior may not occur at other regime transitions, such as the transient/pitched boundary. In general, grouping modes into sets with common properties proves useful in modeling the evolution of signal characteristics; Table I summarizes these sets. Additionally, we define M as the set of all modes:

$$\mathcal{M} \triangleq \mathcal{P} \cup \mathcal{Q} = \mathcal{O} \cup \mathcal{C} \cup \{\text{‘N’}\} \tag{1}$$


Symbol   Definition            Description
P        {‘OP’, ‘CP’}          pitched modes
Q        {‘OT’, ‘CT’, ‘N’}     non-pitched modes
T        {‘OT’, ‘CT’}          transient modes
O        {‘OT’, ‘OP’}          onset modes
C        {‘CT’, ‘CP’}          continuation modes

TABLE I
Definitions of mode groupings
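These groupings translate directly into code. Below is a minimal sketch encoding Table I; the names are our own, not taken from any released implementation:

```python
# Mode labels and the groupings of Table I, as frozensets.
OT, OP, CT, CP, N = "OT", "OP", "CT", "CP", "N"

PITCHED      = frozenset({OP, CP})     # P: modes with salient pitch content
NON_PITCHED  = frozenset({OT, CT, N})  # Q: transient or null frames
TRANSIENT    = frozenset({OT, CT})     # T
ONSET        = frozenset({OT, OP})     # O: onset incidence at frame t
CONTINUATION = frozenset({CT, CP})     # C
ALL_MODES    = PITCHED | NON_PITCHED   # M = P ∪ Q = O ∪ C ∪ {'N'}

def is_onset(mode: str) -> bool:
    """A musical onset is declared whenever M_t is in O."""
    return mode in ONSET
```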

We associate with the frame at time t a variable M_t ∈ M, which describes the mode associated with that frame. Therefore, we can represent an onset by M_t ∈ O and, by the constraints of the mode succession, segment the passage into contiguous note event regions.

Recall that in the melody extraction task, we wish at minimum to estimate the pitch of each note region, as well as that region's onset time and duration. Additionally, we can extract information potentially useful for a more refined musical transcription by tracking amplitudes and tuning endpoints. Even though note pitches are of primary interest, tunings and amplitudes are likely to exhibit structure across frames in a way that helps both pitch determination and onset detection; hence it becomes necessary to model them regardless of the specific transcription objectives. Under normal circumstances, tuning is likely to be constant or to exhibit only local variations due to drifts associated with the onset transients of percussive sounds or expressive modulations such as vibrato. Of course, we must also be robust to the situation where the tuning experiences a global shift, for instance, when the playback of a recording from analog tape undergoes a sudden speed change.

We represent the state quantities of note value, tuning, and amplitude as follows:

• N_t ∈ S_N = {N_min, N_min+1, ..., N_max}, where each element of S_N is an integer representing the MIDI note value (e.g., the note C4, middle C, corresponds to N_t = 60).

• T_t ∈ S_T, where S_T is a uniformly spaced set of tuning values in [−0.5, 0.5), with the minimum value equal to −0.5.

• A_t ∈ S_A, where S_A is an exponentially spaced set of reference amplitude values active when M_t ∈ P.

• A^Q_t ∈ S_{A^Q}, where S_{A^Q} is an exponentially spaced set of reference amplitudes active when M_t ∈ Q.

We define S_t, the state at time t, to be the collection of valid possibilities for all state quantities:

$$S_t \in \mathcal{S} = S_N \otimes S_T \otimes (S_A \cup S_{A^Q}) \tag{2}$$

which is to say, either S_t = {N_t, T_t, A_t} if M_t ∈ P, or S_t = {N_t, T_t, A^Q_t} if M_t ∈ Q. State information is expected to be continuous during the steady-state region of each note event, but may vary abruptly across the incidence of an onset. In this way, we may integrate information across sequences of frames which associate with the same pitch event, while responding as quickly as possible to changes in this information.
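To make the discretization concrete, the sketch below builds finite grids for the state components; the grid sizes, ranges, and the note-to-frequency mapping are illustrative assumptions, not values prescribed by the paper:

```python
import numpy as np

# Discretized state grids (sizes and ranges here are illustrative only).
N_MIN, N_MAX = 36, 96                       # assumed MIDI note range (C2..C7)
S_N = np.arange(N_MIN, N_MAX + 1)           # integer MIDI note values; C4 = 60

N_TUNINGS = 5                               # uniform grid on [-0.5, 0.5)
S_T = -0.5 + np.arange(N_TUNINGS) / N_TUNINGS   # minimum value is -0.5

N_AMPS = 8                                  # exponentially spaced amplitudes
S_A  = 10.0 ** np.linspace(-3, 0, N_AMPS)   # pitched reference amps (-60..0 dB)
S_AQ = 10.0 ** np.linspace(-4, -1, N_AMPS)  # null/transient reference amps

def f0_hz(n, tuning, a4=440.0):
    """Fundamental frequency for MIDI note n with fractional tuning offset
    (an A440 mapping, assumed here for illustration)."""
    return a4 * 2.0 ** ((n + tuning - 69) / 12.0)
```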

Finally, we represent the observations Y_t as the set of all STFT peak frequencies and amplitudes in frame t. Peaks are chosen from overlapping, Hamming-windowed, zero-padded frames following the quadratic interpolation method described in [25].
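For concreteness, the following sketch extracts peaks in this spirit: local maxima of the log-magnitude spectrum of a Hamming-windowed, zero-padded frame, each refined by quadratic (parabolic) interpolation. The threshold and zero-padding factor are illustrative choices, not the paper's settings:

```python
import numpy as np

def stft_peaks(frame, fs, zero_pad=4, min_db=-60.0):
    """Return (frequency_hz, amplitude) pairs for spectral peaks in one frame,
    refined by fitting a parabola to the log magnitude at each local maximum."""
    win = np.hamming(len(frame))
    nfft = zero_pad * len(frame)
    spec = np.fft.rfft(frame * win, n=nfft)
    mag_db = 20.0 * np.log10(np.abs(spec) + 1e-12)

    peaks = []
    for k in range(1, len(mag_db) - 1):
        a, b, g = mag_db[k - 1], mag_db[k], mag_db[k + 1]
        if b > a and b > g and b > min_db:
            p = 0.5 * (a - g) / (a - 2.0 * b + g)   # vertex offset, in bins
            peak_db = b - 0.25 * (a - g) * p        # interpolated peak height
            peaks.append(((k + p) * fs / nfft, 10.0 ** (peak_db / 20.0)))
    return peaks
```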

The joint distribution of all variables of interest, P(M_{0:K}, S_{0:K}, Y_{1:K}), where K is the number of frames, factors over the directed acyclic graph shown in Figure 2, i.e.:

$$P(M_{0:K}, S_{0:K}, Y_{1:K}) = P(M_0, S_0) \prod_{t=1}^{K} P(M_t|M_{t-1})\, P(S_t|M_{t-1}, M_t, S_{t-1})\, P(Y_t|S_t) \tag{3}$$

Fig. 2. Directed acyclic graph for the probabilistic model
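The factorization (3) transcribes directly into code. The sketch below evaluates the joint log-probability of one complete assignment of modes, states, and observations, with each factor supplied as a callable; the interface is hypothetical:

```python
def joint_log_prob(modes, states, obs,
                   log_prior, log_mode_trans, log_state_trans, log_obs_lik):
    """Evaluate log P(M_{0:K}, S_{0:K}, Y_{1:K}) per the factorization (3).
    `modes` and `states` are indexed 0..K; `obs[t-1]` holds Y_t."""
    lp = log_prior(modes[0], states[0])                   # log P(M_0, S_0)
    for t in range(1, len(modes)):
        lp += log_mode_trans(modes[t], modes[t - 1])      # log P(M_t|M_{t-1})
        lp += log_state_trans(states[t], modes[t - 1],    # log P(S_t|M_{t-1},
                              modes[t], states[t - 1])    #       M_t, S_{t-1})
        lp += log_obs_lik(obs[t - 1], states[t])          # log P(Y_t|S_t)
    return lp
```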

B. Inference and Estimation Goals

The primary goals are to make optimal decisions concerning successions of onsets, endpoints, note pitches, tunings, and amplitudes over frames. However, the exact optimality criteria differ according to each attribute's interpretation, as well as its role in the melody extraction task.

Two common criteria are as follows. First, we simply minimize the probability of any error in the attribute sequence. That is, letting Z_{1:K} denote the true attribute sequence, and \hat{Z}_{1:K}(Y_{1:K}) any estimate of the true sequence based on the observations Y_{1:K}, the optimal estimate, Z*_{1:K}, satisfies the following:

$$Z^*_{1:K} = \operatorname*{argmin}_{\hat{Z}_{1:K}(Y_{1:K})} P\left( \hat{Z}_{1:K}(Y_{1:K}) \neq Z_{1:K} \right) \tag{4}$$

It is easily shown [3] that the Z*_{1:K} satisfying (4) maximizes the posterior P(Z_{1:K}|Y_{1:K}):

$$Z^*_{1:K} = \operatorname*{argmax}_{Z_{1:K}} P(Z_{1:K}|Y_{1:K}) \tag{5}$$

This choice is commonly called the MAP (maximum a posteriori) estimate.

For long enough sequences and under sufficiently noisy conditions, the probability of \hat{Z}_{1:K}(Y_{1:K}) ≠ Z_{1:K} may approach unity. That is, it may be impossible to decode the attribute sequence in an errorless fashion. Now, since an estimate \hat{Z}_{1:K}(Y_{1:K}) having nothing to do with the recording being analyzed is valued the same as an estimate that differs in only one symbol from the true sequence Z_{1:K}, there seems little incentive for the MAP scheme to favor sequences somehow "close to" the true sequence. To remedy this, we propose to minimize the symbol error rate, which is the same as minimizing the expected number of symbol errors. This second criterion is given as follows:

$$Z^*_{1:K} = \operatorname*{argmin}_{\hat{Z}_{1:K}} \sum_{t=1}^{K} P\left( \hat{Z}_t \neq Z_t \right) \tag{6}$$

One may show that Z*_{1:K} in (6) is equivalently obtained by maximizing the framewise posterior, P(Z_t|Y_{1:K}), for each t ∈ 1:K. This posterior is traditionally called the smoothed posterior.

$$Z^*_{1:K} = \{Z^*_t\}_{t=1}^{K}, \qquad Z^*_t = \operatorname*{argmax}_{Z_t} P(Z_t|Y_{1:K}) \tag{7}$$

A key disadvantage of the maximized smoothed posterior is that it may fail to preserve the integrity of the entire attribute sequence. This becomes most relevant when considering segmentation (onset/endpoint detection; characterization of contiguous note regions), which is tantamount to estimating the mode sequence M_{1:K}. For instance, suppose the true sample location of an onset lies infinitesimally close to a frame boundary, though just preceding it; here, considerable ambiguity exists concerning whether to assign the onset to the frame in which it actually occurred, or to the subsequent frame, for which the change in spectral content relative to the previous frame is most salient. Now, it is clearly less costly for the estimator M*_{1:K} to hedge in favor of declaring onsets in both frames, which incurs at most one symbol error, as opposed to declaring a single onset in the wrong frame, which incurs at least two errors. However, the introduction of an extraneous onset is disastrous for the melody extraction task, whereas the shift of the estimated onset location by one frame usually has a negligible effect. In this case the MAP criterion (4) is preferable, since by construction it is unlikely that the joint posterior P(M_{1:K}|Y_{1:K}) concentrates on sequences containing onsets in adjacent frames. In fact, these sequences are guaranteed to have zero probability under our choice of transition prior P(M_{t+1}|M_t).

As a result, the optimal mode sequence is estimated via

$$M^*_{1:K} = \operatorname*{argmax}_{M_{1:K}} P(M_{1:K}|Y_{1:K}) \tag{8}$$

although, as Section IV discusses, adherence to (8) is only approximate due to computational cost considerations.

As for the remaining state sequences (i.e., N_{1:K}, T_{1:K}, A_{1:K} ∪ A^Q_{1:K}), there seem to be no special reasons to apply the MAP criterion as opposed to estimating the sequences which minimize symbol error rates, as long as these estimates synchronize in some fashion with M*_{1:K}. A natural objective is to minimize the expected number of symbol errors given M*_{1:K}. The latter minimization is achieved by maximizing P(Z_t|M*_{1:K}, Y_{1:K}), where Z_t is one of N_t, T_t, or A_t ∪ A^Q_t. The overall melody extraction and segmentation process (including the postprocessing steps which are necessary to convert the identified mode and state sequences to MIDI data) is summarized in the block diagram of Figure 3.

Fig. 3. Block diagram of melody extraction and segmentation method: the input signal is preprocessed (STFT, peak picking) into framewise STFT peaks Y_{1:K}; distributional (signal model) parameters P(S_{t+1}, M_{t+1}|S_t, M_t) are estimated; primary inference yields M*_{1:K}, N*_{1:K}, T*_{1:K}, A*_{1:K}, A^{*Q}_{1:K}; postprocessing produces onsets, durations, note values, and transient regions.

Preprocessing is discussed in Section II-A, estimation of distributional parameters in Sections III and IV-B, primary inference in Section IV-A, and postprocessing in Section V.

III. DISTRIBUTIONAL SPECIFICATIONS

For the model in Figure 2, it remains to specify the prior P(S_0, M_0), the transition distribution across frames, P(S_{t+1}, M_{t+1}|S_t, M_t), and the observation likelihood P(Y_t|S_t).

A. Prior

The prior encodes information about the frame immediately preceding the recording. When this information is absent, we desire a solution which applies in the widest variety of musical contexts. The prior admits the factorization:

$$P(S_0, M_0) = P(M_0)\, P(S_0|M_0) \tag{9}$$

In the most general case we specify P(M_0) as uniform, and P(S_0|M_0) as factorizing independently among the components of S_0:

$$P(S_0|M_0 \in \mathcal{P}) = P(T_0)\, P(N_0)\, P(A_0)$$
$$P(S_0|M_0 \in \mathcal{Q}) = P(T_0)\, P(N_0)\, P(A^Q_0) \tag{10}$$

where P(T_0), P(N_0), P(A_0), and P(A^Q_0) are uniform.2

However, there do exist situations in which specific knowledge about the preceding frame is available. It is common, for instance, that the beginning of the recording coincides with the beginning of the first note event. In this case, the frame immediately preceding the recording should not contain information associated with a note event; i.e., P(M_0 = ‘N’) = 1. All other conditional distributions remain unchanged.

2In many recordings of Western music, we expect the tuning to be roughly A440, meaning that the note A4 ≈ 440 Hz; however, it is dangerous to assume this for all applications, because recordings are commonly transposed in web-based music applications, for example, to maintain consistent tempo within a collection of pieces.

B. Transition distribution

The transition distribution factors accordingly:

$$P(S_{t+1}, M_{t+1}|S_t, M_t) = P(M_{t+1}|M_t)\, P(S_{t+1}|M_t, M_{t+1}, S_t) \tag{11}$$

The mode transition dependence, P(M_{t+1}|M_t), develops according to the transition diagram displayed in Figure 4.

Fig. 4. Mode transition diagram for P(M_{t+1}|M_t)

In the figure, solid lines indicate transitions with positive probability under the standard note evolution hypothesis modeling the cyclic succession in Figure 1, while dotted lines indicate additional transitions due to other, spurious incidents, for instance an attack transient followed immediately by silence3.

The rationale behind the standard note evolution hypothesis is as follows. A primary governing principle is that onsets, as they indicate the beginnings of note events, may not occur in adjacent frames; i.e., an onset mode must be followed immediately by a continuation or null mode: P(M_{t+1} ∈ C ∪ {‘N’} | M_t ∈ O) = 1. The latter ensures that the segmentation is well defined, especially when the attack transient occupies more than one frame. Additionally, each note event must have at least one frame containing pitch content. The transition behavior adheres otherwise to the cyclic succession (transient → pitched → null) discussed in Section II-A.

Nonzero elements of P(M_{t+1}|M_t) corresponding to the standard hypothesis are estimated via the EM algorithm [6]. The latter encompasses iterations which, if properly initialized, converge to the (constrained) maximum-likelihood estimate of P(M_{t+1}|M_t). A favorable initialization ensures not only a correct convergence, but also that the convergence within specified limits takes as few iterations as possible. Further details appear in Section IV-B.

3Such incidents may arise from the sudden splicing of recordings in the middle of note events. We do not attempt to rigorously model such phenomena. To be robust, however, we assign these transitions a small, nonzero probability.

Our initialization of P(M_{t+1}|M_t) proceeds via a generative, heterogeneous Poisson process model, as shown in Figure 1. This model represents the cyclic succession of a transient region of expected length τ_T, followed by a pitched region of expected length τ_P, followed by a null region of expected length τ_N. Individual lengths are modeled as independent, exponentially distributed random variables.

Table II gives the Poisson model in algebraic form. Each term p^{(m)}_{j,k} denotes the probability that the beginning of the next frame lies in a region of type k of the mth subsequent cycle, given that the beginning of the current frame lies in a region of type j, where j, k ∈ {T, P, N}, and where T corresponds to a transient, P to a pitched, and N to a null region. For example, if the current frame corresponds to a pitched region, the probability that no transition occurs in the next frame is p^{(0)}_{P,P}. The probability that the boundary of the next frame lies within the pitched region of the subsequent note is p^{(1)}_{P,P}. Finally, p_s represents the probability of a spurious transition corresponding to any of the dotted lines in Figure 4; we set p_s to some small, fixed nonzero value; i.e., p_s = 0.001 is used to generate the results in Section VII.

             M_{t+1}=‘OT’     M_{t+1}=‘OP’     M_{t+1}=‘CT’
M_t=‘OT’     0                0                (1−p_s) p^{(0)}_{T,T}
M_t=‘OP’     0                0                0
M_t=‘CT’     p_s              p_s              (1−3p_s) p^{(0)}_{T,T}
M_t=‘CP’     p^{(0)}_{P,T}    p^{(1)}_{P,P}    0
M_t=‘N’      p^{(0)}_{N,T}    p^{(1)}_{N,P}    0

             M_{t+1}=‘CP’                  M_{t+1}=‘N’
M_t=‘OT’     (1−p_s)(1−p^{(1)}_{T,T})      p_s
M_t=‘OP’     p^{(0)}_{P,P}                 1−p^{(0)}_{P,P}
M_t=‘CT’     (1−3p_s) p^{(1)}_{T,T}        p_s
M_t=‘CP’     p^{(0)}_{P,P}                 p^{(0)}_{P,N}
M_t=‘N’      0                             p^{(0)}_{N,N}

TABLE II
Generative Poisson model for the initialization of P(M_{t+1}|M_t)

C. State transition behavior

The state transition behavior, P(S_{t+1}|M_t, M_{t+1}, S_t), describes the prior uncertainty in the next state, S_{t+1}, given the current state, S_t, over different values of M_t and M_{t+1}. We note that this distribution depends on M_t at least through M_t ∈ P or M_t ∈ Q, as the relation between two temporally adjacent pitched states is fundamentally different from the relation between a pitched state and the non-pitched state preceding it. However, we assume no additional dependence on M_t.

For fixed M_t, the variation of P(S_{t+1}|M_t, M_{t+1}, S_t) with respect to M_{t+1} yields the primary consideration for the detection of note region boundaries. For instance, given M_t ∈ P, M_{t+1} = ‘CP’ indicates that frames t and t+1 belong to the same note event; hence N_{t+1}, T_{t+1}, and A_{t+1} are expected to be close to N_t, T_t, and A_t, respectively. On the other hand, M_{t+1} = ‘OP’ signifies that frame t+1 contains the onset of a new note event. Here, A_{t+1} is independent of A_t, and N_{t+1} depends on N_t only through the probabilistic relation between the values of adjacent notes. The latter models specific tendencies in melodic transitions. It may be difficult to say much about these tendencies, however, unless musical key is also introduced as a hidden state variable.

For fixed M_t and M_{t+1}, the transition behavior factors independently over the components of S_t:

$$\begin{aligned} P(S_{t+1}|S_t, M_{t+1}\in\mathcal{P}, M_t\in\mathcal{P}) ={}& P(T_{t+1}|T_t, M_{t+1}\in\mathcal{P}, M_t\in\mathcal{P}) \\ &\times P(N_{t+1}|N_t, M_{t+1}\in\mathcal{P}, M_t\in\mathcal{P}) \\ &\times P(A_{t+1}|A_t, M_{t+1}\in\mathcal{P}, M_t\in\mathcal{P}) \end{aligned} \tag{12}$$

and similarly for M_t ∈ Q or M_{t+1} ∈ Q, although in these cases A_{t+1} is replaced by A^Q_{t+1}, or A_t is replaced by A^Q_t.

The factorization (12) assumes no interdependence between state components when M_t and M_{t+1} are in evidence. Though situations may arise which do indicate interdependence (e.g., {T_t = 0.49, N_t = 60} and {T_t = −0.5, N_t = 61} refer to the same pitch hypothesis), such occurrences are infrequent enough that in most practical situations it is unnecessary to model this interdependence.

We discuss now the individual distributions on the r.h.s. of (12), considering note, tuning, and amplitude, in that order. To begin, if M_{t+1} = ‘CP’ then M_t ∈ P with probability one; hence, frames t and t+1 belong to the same note event, and N_{t+1} ≈ N_t. In these cases, we choose the conditional distribution of N_{t+1} given N_t to concentrate about N_t in such a way that large changes in note value are less likely than small changes or no change. To express this concentration, we define the double-sided exponential distribution:

$$E_2(X_1|X_0, \alpha_+, \alpha_-) = \begin{cases} c, & X_1 = X_0 \\ c\,\alpha_+^{\kappa(X_1)-\kappa(X_0)}, & X_1 > X_0 \\ c\,\alpha_-^{\kappa(X_0)-\kappa(X_1)}, & X_0 > X_1 \end{cases} \tag{13}$$

where c is chosen such that the distribution sums to unity, and κ(X) = k means that X is the kth smallest element in the finite set of values for X. For N_{t+1} given N_t, the dependence is symmetric:

$$P(N_{t+1}|N_t, M_{t+1}=\text{‘CP’}, M_t\in\mathcal{P}) = E_2(N_{t+1}|N_t, \alpha_N, \alpha_N) \tag{14}$$
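A direct implementation of (13) on a finite grid might look as follows; `grid` is any ordered set of state values, indexed by rank κ:

```python
import numpy as np

def e2(grid, k0, alpha_plus, alpha_minus):
    """Double-sided exponential E2 of eq. (13) over a finite sorted grid.
    Returns a probability vector over grid indices, centered on index k0."""
    d = np.arange(len(grid), dtype=float) - k0   # κ(X1) − κ(X0), in rank units
    w = np.empty_like(d)
    up = d >= 0
    w[up] = np.power(float(alpha_plus), d[up])       # X1 >= X0 branch
    w[~up] = np.power(float(alpha_minus), -d[~up])   # X0 > X1 branch
    return w / w.sum()                               # c normalizes to unity
```

With alpha_plus = alpha_minus = α_N near zero, e2 places essentially all probability mass on N_{t+1} = N_t, as intended for within-event note evolution.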

Ideally α_N = 0, but we allow some small deviation for robustness to the case where the tuning endpoint approaches ±0.5, as here some ambiguity may result as to the note value. On the other hand, allowing for more extreme tuning variations (e.g., vocal vibrato greater than ±0.5) requires that α_N remain very close to zero. Now, if M_{t+1} ∈ Q, no information about the note is reflected in the observations. Here we adopt the convention that N_{t+1} refers to the value of the most recent note event; upon transition to a new event, we will have memorized the value of the previous event and can thus apply knowledge concerning note-to-note dependences. Thus:

$$P(N_{t+1}|N_t, M_{t+1}\in\mathcal{Q}, M_t\in\mathcal{M}) = E_2(N_{t+1}|N_t, \alpha_N, \alpha_N) \tag{15}$$

Let P_note_trans(N_1|N_0) be the dependence where N_0 and N_1 are the values of adjacent note events; if such information is absent, the dependence is simply uniform over N_1, independent of N_0. If M_{t+1} = ‘OP’, frame t+1 is the first frame where the value of the new note event can be observed. Since N_t memorizes the value of the previous note, the conditional distribution of N_{t+1} must follow P_note_trans(N_1|N_0):

$$P(N_{t+1}|N_t, M_{t+1}=\text{‘OP’}, M_t\in\mathcal{M}) = P_{\text{note trans}}(N_{t+1}|N_t) \tag{16}$$

The remaining cases involve certain (M_t, M_{t+1}) combinations which occur with zero probability due to the mode transition mechanism of Table II. These distributions can be set to anything without affecting the inference outcome, so we specify them so as to minimize computational effort (see Table III).

The conditional distribution of the tuning, T_{t+1}, reflects the assumption that tuning is expected to be constant, or to vary only slightly throughout the recording, independently of onsets, endpoints, and note events. Of course, this is not entirely true; as some instruments exhibit a dynamic pitch envelope, we should allow for some degree of variation in order to be robust. Hence:

$$P(T_{t+1}|T_t, M_{t+1}\in\mathcal{M}, M_t\in\mathcal{M}) = E_2(T_{t+1}|T_t, \alpha_{T+}, \alpha_{T-}) \tag{17}$$

where α_{T+} = α_{T−} ≜ α_T indicates symmetry of the expected tuning variation.

Finally, we consider the conditional distributions of the pitched and null amplitudes. The case (M_{t+1} = ‘CP’, M_t ∈ P) implies that A_t and A_{t+1} belong to the same note event, A_{t+1} concentrating about A_t as follows:

$$P(A_{t+1}|A_t, M_{t+1}=\text{‘CP’}, M_t\in\mathcal{P}) = E_2(A_{t+1}|A_t, \alpha_{A+}, \alpha_{A-}) \tag{18}$$

where α_{A+} ≤ α_{A−}. Setting α_{A+} < α_{A−} indicates a decaying amplitude evolution throughout the note duration, best adapted to percussive tones like marimba and piano. On the other hand, setting α_{A+} = α_{A−} may be more appropriate for violin, voice, and other sustained tones, or tones with lengthy attack regions. In all other cases, A_{t+1} is independent of A_t (or A^Q_t). Where M_{t+1} = ‘OP’, A_{t+1} corresponds to the pitch amplitude of the onset of a note event. In these cases, A_{t+1} resamples from a distribution favoring higher amplitudes:

$$P(A_{t+1}|A_t, M_{t+1}=\text{‘OP’}, M_t\in\mathcal{P}) = E_1(A_{t+1}, \beta_{A,\text{‘OP’}}) \tag{19}$$

$$P(A_{t+1}|A^Q_t, M_{t+1}=\text{‘OP’}, M_t\in\mathcal{Q}) = E_1(A_{t+1}, \beta_{A,\text{‘OP’}}) \tag{20}$$


where, using the notation of (13),

$$E_1(X, \beta) = c\,\beta^{\kappa(X)} \tag{21}$$

The constant c is chosen such that, for fixed β, E_1(X, β) sums to unity over values of X. Setting β_{A,‘OP’} > 1 means that the pitched onset amplitude distribution favors higher amplitudes. Care must be taken that β_{A,‘OP’} is not too large, however, to allow for quiet notes. Where M_{t+1} = ‘OT’ or M_{t+1} = ‘CT’ (i.e., M_{t+1} ∈ T), the distribution is similar, but concerns A^Q_{t+1} in place of A_{t+1}:

$$P(A^Q_{t+1}|A_t, M_{t+1}\in\mathcal{T}, M_t\in\mathcal{P}) = E_1(A^Q_{t+1}, \beta_{A,\mathcal{T}})$$
$$P(A^Q_{t+1}|A^Q_t, M_{t+1}\in\mathcal{T}, M_t\in\mathcal{Q}) = E_1(A^Q_{t+1}, \beta_{A,\mathcal{T}}) \tag{22}$$

where β_{A,T} > 1. Where M_{t+1} = ‘N’, the distribution of A^Q_{t+1} follows either line of (22), depending on whether M_t ∈ P or Q, but with β_{A,‘N’} < 1 in place of β_{A,T}, since the null mode favors lower amplitudes. Table III summarizes the aforementioned state transition behavior, filling in the "don't-care" possibilities.

M_{t+1}    M_t ∈ Q
‘OT’       P(N_{t+1}|N_t) = E_2(N_{t+1}|N_t, α_N, α_N)
           P(T_{t+1}|T_t) = E_2(T_{t+1}|T_t, α_T, α_T)
           P(A^Q_{t+1}|A^Q_t) = E_1(A^Q_{t+1}, β_{A,T})
‘OP’       P(N_{t+1}|N_t) = P_note_trans(N_{t+1}|N_t)
           P(T_{t+1}|T_t) = E_2(T_{t+1}|T_t, α_T, α_T)
           P(A_{t+1}|A^Q_t) = E_1(A_{t+1}, β_{A,‘OP’})
‘CT’       P(N_{t+1}|N_t) = E_2(N_{t+1}|N_t, α_N, α_N)
           P(T_{t+1}|T_t) = E_2(T_{t+1}|T_t, α_T, α_T)
           P(A^Q_{t+1}|A^Q_t) = E_1(A^Q_{t+1}, β_{A,T})
‘CP’       P(N_{t+1}|N_t) = P_note_trans(N_{t+1}|N_t)
           P(T_{t+1}|T_t) = E_2(T_{t+1}|T_t, α_T, α_T)
           P(A_{t+1}|A^Q_t) = E_1(A_{t+1}, β_{A,‘OP’})
‘N’        P(N_{t+1}|N_t) = E_2(N_{t+1}|N_t, α_N, α_N)
           P(T_{t+1}|T_t) = E_2(T_{t+1}|T_t, α_T, α_T)
           P(A^Q_{t+1}|A^Q_t) = E_1(A^Q_{t+1}, β_{A,‘N’})

M_{t+1}    M_t ∈ P
‘OT’       P(N_{t+1}|N_t) = E_2(N_{t+1}|N_t, α_N, α_N)
           P(T_{t+1}|T_t) = E_2(T_{t+1}|T_t, α_T, α_T)
           P(A^Q_{t+1}|A_t) = E_1(A^Q_{t+1}, β_{A,T})
‘OP’       P(N_{t+1}|N_t) = P_note_trans(N_{t+1}|N_t)
           P(T_{t+1}|T_t) = E_2(T_{t+1}|T_t, α_T, α_T)
           P(A_{t+1}|A_t) = E_1(A_{t+1}, β_{A,‘OP’})
‘CT’       P(N_{t+1}|N_t) = E_2(N_{t+1}|N_t, α_N, α_N)
           P(T_{t+1}|T_t) = E_2(T_{t+1}|T_t, α_T, α_T)
           P(A^Q_{t+1}|A_t) = E_1(A^Q_{t+1}, β_{A,T})
‘CP’       P(N_{t+1}|N_t) = E_2(N_{t+1}|N_t, α_N, α_N)
           P(T_{t+1}|T_t) = E_2(T_{t+1}|T_t, α_T, α_T)
           P(A_{t+1}|A_t) = E_2(A_{t+1}|A_t, α_{A+}, α_{A−})
‘N’        P(N_{t+1}|N_t) = E_2(N_{t+1}|N_t, α_N, α_N)
           P(T_{t+1}|T_t) = E_2(T_{t+1}|T_t, α_T, α_T)
           P(A^Q_{t+1}|A_t) = E_1(A^Q_{t+1}, β_{A,‘N’})

TABLE III
State transition table for component distributions of P(S_{t+1}|S_t, M_{t+1}, M_t)
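The dispatch on (M_t, M_{t+1}) in Table III is mechanical. The following sketch implements E_1 of (21) and selects the amplitude component accordingly, reusing the e2 helper defined after (14); the parameter names are ours:

```python
import numpy as np

def e1(grid, beta):
    """One-sided exponential E1 of eq. (21): proportional to beta**κ(X)."""
    w = np.power(float(beta), np.arange(len(grid), dtype=float))
    return w / w.sum()

def amplitude_transition(m_t, m_next, amp_index, grid,
                         alpha_plus, alpha_minus,
                         beta_op, beta_tr, beta_null):
    """Amplitude rows of Table III: a distribution over `grid` for A_{t+1}
    (or A^Q_{t+1}), given the previous amplitude's grid index."""
    if m_next == "CP" and m_t in ("OP", "CP"):  # same note event: E2 evolution
        return e2(grid, amp_index, alpha_plus, alpha_minus)
    if m_next in ("OP", "CP"):                  # pitched (re)onset: resample,
        return e1(grid, beta_op)                #   favoring high amplitudes
    if m_next in ("OT", "CT"):                  # transient-region amplitudes
        return e1(grid, beta_tr)
    return e1(grid, beta_null)                  # 'N': null mode favors quiet
```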

D. Frame likelihood

The likelihood for frames with pitch content (where the amplitude is represented by A_t) is computed by a modification of the method of [30], to which we refer as the canonical evaluation. The latter evaluates P_can(Y_t|f_{0,t}, A_{0,t}), where f_{0,t} is the radian fundamental frequency and A_{0,t} is an unknown reference amplitude. In [30], [28] this method is shown to be robust to real-world phenomena such as inharmonicity, undetected peaks, and spurious peaks due to noise and other interference phenomena (see also [16] for polyphonic extensions).

Care must be taken in the association of the hypotheses (f_{0,t}, A_{0,t}) with those of the state (N_t, T_t, and A_t). While f_{0,t} is uniquely determined by N_t and T_t, the relation between the reference amplitude, A_{0,t}, and A_t is more involved. In [30], the reference amplitude is estimated as the maximum amplitude over all peaks in the frame, denoted A_{max,t}. The latter seems to yield favorable psychoacoustic properties (as borne out by informal listening tests) in the context of many real-world signals which are assumed to be monophonic but are actually polyphonic. For instance, consider a recording of the introductory motive of Bach's Invention 2 in C minor (BWV 773) by Glenn Gould. Here the pianist hums two octaves below the piano melody. The humming can barely be heard in most frames; nevertheless, the likelihood evaluation sometimes favors the voice's fundamental rather than that of the piano, especially when these fundamentals are in an exact harmonic relationship. While this result may be technically correct in the absence of explicit timbral models, it fails to represent what is heard as salient. Now, one may argue that the perceived salience of the piano melody arises from the consistency of pitch and amplitude information across long segments of frames, as the voice tends to fade in and out over these regions. We find, nevertheless, that the perceived salience of the piano tone persists even in the absence of contextual cues; for instance, when a single frame is extracted and repeated for any given duration. A relevant explanation is that in the absence of other contextual cues, we focus on the loudest of multiple pitch components: A_{0,t} = A_{max,t}.

Unfortunately, the latter choice ignores the state variable A_t, allowing little possibility for the conditional distribution of A_t to be influenced by the signal, except indirectly via the mode layer. This choice disables the capacity of jumps in the signal's amplitude envelope to inform the segmentation, which can be a critical issue when detecting onsets of repeated notes. Our solution is to take A_{0,t} = A_t while introducing A_{max,t} as an independent noisy observation4 of A_t, as shown in Figure 5. By so doing, we blend the strategy which derives A_{0,t} from the state (A_{0,t} = A_t) with the strategy incorporating psychoacoustic salience (A_{0,t} = A_{max,t}). The conditional distribution for the observation layer becomes:

$$P(Y_t, A_{max,t}|N_t, T_t, A_t) = P(Y_t|N_t, T_t, A_t)\, P(A_{max,t}|A_t) \tag{23}$$

4It may seem counterintuitive to model A_{max,t} and Y_t as conditionally independent given A_t since, unconditionally speaking, A_{max,t} is a deterministic function of Y_t. However, we wish not to introduce bias by assuming specific dependences between the noise on A_{max,t} and the amplitude/frequency noises on other peaks of Y_t.


Fig. 5. Observation layer dependence with A_{max,t}

Here the dependence of Y_t is modeled via P_can(Y_t|f_{0,t}, A_{0,t}) using A_{0,t} = A_t, and A_{max,t} is modeled as A_t plus Gaussian noise:

$$P(Y_t|N_t, T_t, A_t) = P_{can}(Y_t|f_0(N_t, T_t), A_{0,t} = A_t)$$
$$P(A_{max,t}|A_t) = \mathcal{N}(A_t, \sigma^2_A) \tag{24}$$

We may interpolate between the rigid cases (A_{0,t} = A_{max,t} vs. A_{0,t} = A_t) simply by varying σ²_A between zero and infinity. A detailed discussion of this amplitude observation weighting appears in Appendix I.
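In the log domain, the blended observation likelihood (23)-(24) might be computed as follows; `log_p_can` stands in for the canonical evaluation of [30], which is not reproduced here, and the A440 note-to-frequency mapping is an illustrative assumption:

```python
import numpy as np

def log_obs_likelihood(peaks, n_t, tuning, a_t, sigma_a, log_p_can):
    """log P(Y_t, Amax_t | N_t, T_t, A_t) per eqs. (23)-(24).
    `peaks` is a list of (freq_hz, amp); `log_p_can(peaks, f0, a0)` is a
    placeholder for the canonical evaluation P_can of [30]."""
    f0 = 440.0 * 2.0 ** ((n_t + tuning - 69) / 12.0)   # illustrative mapping
    a_max = max(a for _, a in peaks) if peaks else 0.0
    # Gaussian observation of the reference amplitude: Amax_t ~ N(A_t, sigma_a^2)
    log_gauss = (-0.5 * ((a_max - a_t) / sigma_a) ** 2
                 - np.log(sigma_a * np.sqrt(2.0 * np.pi)))
    return log_p_can(peaks, f0, a_t) + log_gauss
```

Letting sigma_a grow flattens the Gaussian term, recovering the pure A_{0,t} = A_t strategy; a small sigma_a ties A_t tightly to the observed A_{max,t}.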

Another consideration is the computational complexity of the frame likelihood computation. As shown in Section IV, the inference requires that P_can(Y_t|f_0(N_t, T_t), A_t) be evaluated over each note, tuning, and amplitude value for each frame. The resultant computational cost may be excessive, especially for real-time operation. Hence, we introduce a faster, if possibly less accurate, likelihood approximation than the MCMC method proposed in [30]. This approximation, detailed in Appendix II, works best when the pitch hypothesis is most salient and the STFT peaks are well separated in frequency.

As a final note, some frames may lack pitch content altogether; these correspond to purely transient effects, background noise, or silence. In these cases the relevant amplitude component of S_t is A^Q_t. Since we still wish to model a general amplitude characteristic associated with these frames, in order to distinguish transients from silence, for instance, we model the frame via P_can(Y_t|f_0(N_t, T_t), A^Q_t) under the restriction that all peaks are spurious. Future work in modeling transient characteristics may involve additional spectral envelope features, for instance mel-frequency cepstral coefficients (MFCCs). The latter have been demonstrated as vital for timbre discrimination in the absence of pitch information; see [26] for a recent study and associated bibliography.

IV. INFERENCE METHODOLOGY

A. Primary inference

The primary inference goals, summarized from the discussion in Section II-B, are the maximum a posteriori (MAP) determination of M_{1:K}, denoted M*_{1:K}, and the computation of the conditional smoothed state posterior given M*_{1:K}, namely P(S_t|M*_{1:K}, Y_{1:K}). Via (5), the MAP determination satisfies

$$M^*_{1:K} = \operatorname*{argmax}_{M_{1:K}} P(M_{1:K}|Y_{1:K}) \tag{25}$$

If, for every t ∈ 1:K−1, Y_{t+1} were conditionally independent of Y_{1:t} and M_{1:t} given M_{t+1}, the Viterbi algorithm [21] could be used to identify M*_{1:K}. Unfortunately, the implicit marginalization of S_{1:K} in (25) makes this conditional independence impossible to satisfy. It is possible, nevertheless, to approximate the Viterbi inference in a manner which seems to hold quite well in practice, following one of the approaches presented in [20]. This approximation rests on the following assumption:

$$P(Y_{t+1}|M_{1:t+1}, Y_{1:t}) \approx P(Y_{t+1}|M^*_{1:t-1}(M_t), M_t, M_{t+1}, Y_{1:t}) \tag{26}$$

where

$$M^*_{1:t-1}(M_t) \approx \operatorname*{argmax}_{M_{1:t-1}} P(M_{1:t-1}|M_t, Y_{1:t}) \tag{27}$$

We refer to M*_{1:t−1}(M_t) as the (approximate) M_t-optimal mode sequence, define M*_{a:b}(M_t) as the restriction of this sequence to frames a through b, and adopt the shorthand M*_a ≜ M*_{a:a}. As a side effect, the approximate inference computes the smoothed posteriors P(S_t|M*_{1:K}, Y_{1:K}) jointly with the MAP estimate M*_{1:K}. Complexity is linear in the number of frames and quadratic in the number of modes, exactly as in the classic Viterbi algorithm.

The approximate Viterbi algorithm consists of two passes, a forward pass followed by a backward pass. Table IV summarizes the quantities propagated in these passes; the (≈) symbol indicates that the corresponding quantity is approximate. By computing the quantities M*_{1:K} and σ*(S_t) displayed in Table IV for t ∈ 1:K, we complete the primary inference objectives, satisfying (approximately) the MAP determination given in (25) and obtaining the conditional smoothed state posteriors P(S_t|M*_{1:K}, Y_{1:K}) for t ∈ 1:K.

Symbol                        Quantity                                                          Description
τ*(M_t, S_t)                  P(S_t | M*_{1:t−1}(M_t), M_t, Y_{1:t−1})                          Predicted posterior given M_t-optimal mode sequence
µ*(M_t, S_t)                  P(S_t | M*_{1:t−1}(M_t), M_t, Y_{1:t})                            Smoothed posterior given M_t-optimal mode sequence
J(M_t)                        max_{M_{1:t−1}} P(M_{1:t} | Y_{1:t}) (≈)                          Objective at time t
M*_{t−1}(M_t)                 argmax_{M_{t−1}} max_{M_{1:t−2}} P(M_{1:t} | Y_{1:t}) (≈)         Backpointer
M*_t                          argmax_{M_t} max_{M_{1:t−1}, M_{t+1:K}} P(M_{1:K} | Y_{1:K}) (≈)  MAP mode at time t
σ*_t(S_t)                     P(S_t | M*_{1:K}, Y_{1:K})                                        Smoothed posterior
µ_0(M_t, S_{t+1}, M_{t+1})    P(S_{t+1}, Y_{t+1} | M*_{1:t−1}(M_t), M_t, M_{t+1}, Y_{1:t})      Intermediate
τ(M_t, S_{t+1}, M_{t+1})      P(S_{t+1} | M*_{1:t−1}(M_t), M_t, M_{t+1}, Y_{1:t})               Intermediate
µ(M_t, S_{t+1}, M_{t+1})      P(S_{t+1} | M*_{1:t−1}(M_t), M_t, M_{t+1}, Y_{1:t+1})             Intermediate
Σ_0(M_t, M_{t+1})             P(Y_{t+1} | M*_{1:t−1}(M_t), M_t, M_{t+1}, Y_{1:t})               Intermediate
J_0(M_t, M_{t+1})             max_{M_{1:t−1}} P(M_{1:t+1} | Y_{1:t+1}) (≈)                      Intermediate

TABLE IV
Inference inputs and propagated quantities

The forward pass is initialized as follows:

$$J(M_1) = P(M_1|Y_1)$$
$$\mu^*(M_1, S_1) = P(S_1|M_1, Y_1) \tag{28}$$

These quantities are easily computed from P(M_1, S_1|Y_1), which is obtained from the given prior and conditional distributions:

$$P(M_1, S_1|Y_1) \propto P(Y_1|S_1) \sum_{S_0} \Big[ P(S_1|S_0, M_1) \sum_{M_0} P(M_0, S_0)\, P(M_1|M_0) \Big] \tag{29}$$

The remainder of the forward pass consists of the following recursions, for t ∈ 1:K−1:

$$\tau(M_t, S_{t+1}, M_{t+1}) = \sum_{S_t} \mu^*(M_t, S_t)\, P(S_{t+1}|M_t, M_{t+1}, S_t)$$

$$\mu_0(M_t, S_{t+1}, M_{t+1}) = P(Y_{t+1}|S_{t+1})\, \tau(M_t, S_{t+1}, M_{t+1})$$

$$\Sigma_0(M_t, M_{t+1}) = \sum_{S_{t+1}} \mu_0(M_t, S_{t+1}, M_{t+1})$$

$$J_0(M_t, M_{t+1}) = J(M_t)\, P(M_{t+1}|M_t)\, \Sigma_0(M_t, M_{t+1})$$

$$\mu(M_t, S_{t+1}, M_{t+1}) = \frac{\mu_0(M_t, S_{t+1}, M_{t+1})}{\Sigma_0(M_t, M_{t+1})}$$

$$M^*_t(M_{t+1}) = \operatorname*{argmax}_{M_t} J_0(M_t, M_{t+1})$$

$$J(M_{t+1}) = \frac{J_0(M^*_t(M_{t+1}), M_{t+1})}{P(Y_{t+1}|Y_{1:t})}$$

$$\mu^*(M_{t+1}, S_{t+1}) = \mu(M^*_t(M_{t+1}), S_{t+1}, M_{t+1})$$

$$\tau^*(M_{t+1}, S_{t+1}) = \tau(M^*_t(M_{t+1}), S_{t+1}, M_{t+1}) \tag{30}$$
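One forward step of (30), vectorized over modes and states, might look as follows. The dense state-transition array is an illustrative simplification; an efficient implementation would exploit the factorization of Section III rather than materialize the full kernel:

```python
import numpy as np

def viterbi_forward_step(J, mu_star, mode_trans, state_trans, obs_lik):
    """One step of the approximate Viterbi recursions (30).
    J:           (NM,)             J(M_t)
    mu_star:     (NM, NS)          mu*(M_t, S_t)
    mode_trans:  (NM, NM)          P(M_{t+1}|M_t), rows indexed by M_t
    state_trans: (NM, NM, NS, NS)  P(S_{t+1}|M_t, M_{t+1}, S_t), last axis S_t
    obs_lik:     (NS,)             P(Y_{t+1}|S_{t+1})
    Returns updated J, mu*, tau*, and the backpointers M*_t(M_{t+1})."""
    tau = np.einsum('ms,mnrs->mnr', mu_star, state_trans)  # sum over S_t
    mu0 = tau * obs_lik[None, None, :]          # multiply in P(Y_{t+1}|S_{t+1})
    sigma0 = mu0.sum(axis=2)                    # Sigma0(M_t, M_{t+1})
    J0 = J[:, None] * mode_trans * sigma0       # J0(M_t, M_{t+1})
    back = J0.argmax(axis=0)                    # M*_t(M_{t+1})
    n = np.arange(J0.shape[1])
    J_new = J0[back, n]
    J_new = J_new / J_new.sum()                 # rescale to avoid underflow;
                                                # stands in for P(Y_{t+1}|Y_{1:t})
    mu_new = mu0[back, n, :] / sigma0[back, n][:, None]
    tau_new = tau[back, n, :]
    return J_new, mu_new, tau_new, back
```

The uniform rescaling of J affects neither subsequent argmax operations nor the backpointers, so the recovered mode sequence is unchanged.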

To prepare for the backward pass, the quantities µ*(M_t, S_t) and M*_{t−1}(M_t) are stored for t ≥ 1, as well as τ*(M_t, S_t) for t ≥ 2. The final-stage objective, J(M_K), is also stored. The backward pass is initialized as follows:

$$M^*_K = \operatorname*{argmax}_{M_K} J(M_K)$$
$$\sigma^*(S_K) = \mu^*(M^*_K, S_K) \tag{31}$$

The remainder of the backward pass consists of the following recursions, as t decreases from K−1 down to 1:

$$M^*_t = M^*_t(M^*_{t+1})$$
$$\sigma^*(S_t) = \mu^*(M^*_t, S_t) \sum_{S_{t+1}} \frac{\sigma^*(S_{t+1})\, P(S_{t+1}|S_t, M^*_t, M^*_{t+1})}{\tau^*(M^*_{t+1}, S_{t+1})} \tag{32}$$

A complete derivation of the approximate Viterbi recursions (28)–(32) is similar to that given in Appendix A of [28].

B. Estimation of P(M_{t+1}|M_t)

The distribution P(M_{t+1}|M_t) depends on the collection of free parameters in Table II. These parameters are identified via the EM algorithm. To begin, define

$$p_{k|j} \triangleq P(M_{t+1} = k \,|\, M_t = j) \quad \forall\, j, k \in \mathcal{M} \tag{33}$$

For each j ∈ M, let S_j ⊆ M denote the set of possibilities for k for which p_{k|j} is unknown. That is, we characterize the collection of free parameters, θ, as follows:

$$\theta = \bigcup_{j\in\mathcal{M}} \bigcup_{k\in S_j} \{p_{k|j}\} \tag{34}$$

The EM algorithm begins with an initial guess for θ, i.e., θ^{(0)}, and proceeds over iterations i, updating θ = θ^{(i)}. Each iteration comprises two steps: the first, or expectation, step computes the expected log likelihood log P(M_{0:K}, S_{0:K}, Y_{1:K}|θ), where M_{0:K} and S_{0:K} are generated according to the distribution P(M_{0:K}, S_{0:K}|Y_{1:K}). That is, we form

$$Q(\theta\,|\,\theta^{(i)}) = E_{P(M_{0:K}, S_{0:K}|Y_{1:K}, \theta^{(i)})}\left[ \log P(M_{0:K}, S_{0:K}, Y_{1:K}\,|\,\theta) \right] \tag{35}$$

The subsequent maximization step chooses θ^{(i+1)} as a value of θ maximizing Q(θ|θ^{(i)}).

The EM steps are as follows ([28], Appendix B):

• Expectation step: Compute

$$\sigma^{(2)}(M_t, M_{t+1}) = P(M_t, M_{t+1}\,|\,Y_{1:K}, \theta^{(i)}) \tag{36}$$

for all t ∈ 1:K−1 and M_t, M_{t+1} ∈ M.

• Maximization step: Update, for each j ∈ M and k ∈ S_j,

$$p^{(i+1)}_{k|j} = \frac{\sum_{t=1}^{K-1} \sigma^{(2)}(M_t = j, M_{t+1} = k)}{\sum_{k'\in\mathcal{M}} \sum_{t=1}^{K-1} \sigma^{(2)}(M_t = j, M_{t+1} = k')} \tag{37}$$

Computation of the posterior quantities σ^{(2)}(M_t, M_{t+1}) for t ∈ 1:K−1 is the result of a standard Bayesian smoothing inference. The latter consists of a forward, filtering pass followed by a backward, smoothing pass analogous to the Rauch-Tung-Striebel method for Kalman smoothing [19], [12]. The associated recursions are given in Appendix III.
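The maximization step (37) reduces to normalizing expected transition counts; a sketch, applied only to the free parameters:

```python
import numpy as np

def em_m_step(sigma2, free_mask, A):
    """Maximization step, eq. (37): update the free parameters p_{k|j}.
    sigma2:    (K-1, NM, NM) pair marginals P(M_t=j, M_{t+1}=k | Y_{1:K})
    free_mask: (NM, NM) boolean; True where k is in S_j (p_{k|j} is free)
    A:         (NM, NM) current transition matrix; fixed entries are kept."""
    counts = sigma2.sum(axis=0)                # expected transition counts n(j,k)
    denom = counts.sum(axis=1, keepdims=True)  # n(j,·), summed over all k in M
    p = counts / np.maximum(denom, 1e-300)     # eq. (37)
    return np.where(free_mask, p, A)
```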

V. POSTPROCESSING

The goal of postprocessing is to take the maximum a posteriori mode sequence, M*_{1:K}, and the smoothed note posterior P(N_t|M*_{1:K}, Y_{1:K}), and produce a string of distinct note events which can be stored in a standard MIDI file5. The MIDI file output contains a sequence of note events, each consisting of an onset time, note value, and duration. Additionally, we provide a sub-segmentation into transient and pitched regions by specifying the duration of each note event's transient region.

5Note that P(N_t|M*_{1:K}, Y_{1:K}) may be derived from the primary inference output, P(S_t|M*_{1:K}, Y_{1:K}), by marginalization.


Table V specifies the quantities propagated in the postprocessing recursions for the mth note event.

Symbol       Description
o^{(m)}      Onset frame for note event
p^{(m)}      First pitched frame in note event
d^{(m)}      Note event duration
e^{(m)}_+    One frame beyond end of note event
N^{*(m)}     MIDI note value

TABLE V
Transcription output quantities

Now let S be a collection of distinct integers, and let min(S) be the minimum integer in the collection if S is nonempty. Define:

$$\overset{+}{\min}(S) \triangleq \begin{cases} \infty, & S = \emptyset \\ \min(S), & \text{otherwise} \end{cases} \tag{38}$$

The postprocessing algorithm iterates over note events m, stopping only when the onset frame, pitch boundary, or advanced end point (o^{(m)}, p^{(m)}, or e^{(m)}_+) is infinite. This stopping condition indicates that there is not enough signal to determine information about the current or subsequent note events.

The onset frame for the first event is initialized as follows:

$$o^{(1)} = \overset{+}{\min}\{t \geq 1 : M^*_t \in \mathcal{O}\} \tag{39}$$

This search for an explicit onset automatically discards tail portions of note events which are truncated by the beginning of the signal.

In general, the recursions used to extract note event information are as follows:

$$o^{(m)} = \overset{+}{\min}\{t \geq e^{(m-1)}_{+} : M^*_t \in \mathcal{O}\}$$
$$p^{(m)} = \overset{+}{\min}\{t \geq o^{(m)} : M^*_t \in \mathcal{P}\}$$
$$e^{(m)}_{+} = \overset{+}{\min}\{t \geq p^{(m)} : M^*_t \notin \mathcal{C}\} \tag{40}$$

If m = 1, the initialization (39) is used in place of (40) in the case of o^{(m)}. As indicated, e^{(m)}_+ lies one frame beyond the last frame of the note event. The duration of note m is just the simple difference between e^{(m)}_+ and o^{(m)}, unless e^{(m)}_+ has been truncated by the end of the signal. In the latter case, the duration is K − o^{(m)} + 1. Both cases are captured by:

$$d^{(m)} = \min(e^{(m)}_{+}, K + 1) - o^{(m)} \tag{41}$$

To obtain the MIDI note value, we extract

$$N^{*(m)} = \operatorname*{argmax}_{n} P\big(N_{p^{(m)}+c} = n \,\big|\, Y_{1:K}\big) \tag{42}$$

where

$$c = \min\big(c_0,\; e^{(m)}_{+} - p^{(m)} - 1\big) \tag{43}$$

Here c is a margin variable ensuring that the maximum a posteriori pitch value assigned to the entire event is sampled from a frame which is some distance away from the end of the transient region. The canonical value of c, ignoring truncation effects, is c_0; c_0 = 3 suffices in the extraction of MIDI information from the examples of Section VII.
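The recursions (38)-(43) amount to a single linear scan over the MAP mode sequence. A sketch follows (MIDI serialization omitted; the event-dictionary format is our own):

```python
INF = float("inf")

def plus_min(iterable):
    """The +min of eq. (38): minimum of a possibly empty collection, else ∞."""
    return min(iterable, default=INF)

def extract_events(modes, c0=3):
    """Scan the MAP mode sequence per eqs. (39)-(41); `modes[t-1]` holds M*_t.
    Returns onset frame, first pitched frame, duration, and the frame p+c
    from which the note value is read, per eqs. (42)-(43)."""
    ONSET, PITCHED, CONT = {"OT", "OP"}, {"OP", "CP"}, {"CT", "CP"}
    K = len(modes)
    events, e_prev = [], 1
    while True:
        o = plus_min(t for t in range(e_prev, K + 1) if modes[t - 1] in ONSET)
        if o == INF:
            break
        p = plus_min(t for t in range(o, K + 1) if modes[t - 1] in PITCHED)
        if p == INF:
            break
        # Search strictly after p so that an 'OP' frame (which is not in C)
        # does not terminate its own event -- our reading of eq. (40).
        e = plus_min(t for t in range(p + 1, K + 1) if modes[t - 1] not in CONT)
        d = min(e, K + 1) - o                   # eq. (41), handles truncation
        c = min(c0, e - p - 1)                  # eq. (43) margin
        events.append({"onset": o, "pitched": p, "duration": d,
                       "note_frame": p + c})
        if e == INF:
            break
        e_prev = e
    return events
```

The note value N^{*(m)} is then the argmax of the smoothed note posterior at frame note_frame, per (42).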

VI. COMPUTATIONAL COST

As discussed in Section II-B, the overall melody extraction and segmentation algorithm operates by estimating the transition matrix P(M_{t+1}|M_t) by the EM approach of Section IV-B, followed by the primary inference discussed in Section IV-A, followed by postprocessing to convert the MAP mode sequence M*_{1:K} and posteriors P(S_{1:K}|Y_{1:K}, M*_{1:K}) into MIDI data as discussed in the previous section. We assess computational cost with respect to the following variables: K, the number of signal frames; N_M, the number of distinct modes6; N_N, the number of note values; N_T, the number of tunings; N_A, the number of pitched amplitudes; and N_{A^Q}, the number of null amplitudes.

The cost of estimating P(M_{t+1}|M_t) is the combined cost of the expectation and maximization steps (36), (37), times the number of EM iterations (typically fixed at two for the Poisson initialization; see Figure 8 for typical results). The expectation step consists of the standard Bayesian smoothing inference detailed in Appendix III. From (66), (67) the inference appears linear in K and quadratic in N_M. Additionally, the summations

$$\sum_{S_t} \mu(M_t, S_t|Y_{1:t})\, P(S_{t+1}|S_t, M_{t+1}, M_t), \qquad \sum_{S_{t+1}} \phi(M_{t+1}, S_{t+1})\, P(S_{t+1}|S_t, M_{t+1}, M_t) \tag{44}$$

appear quadratic in N_S, the number of distinct state possibilities, which equals N_N N_T (N_A + N_{A^Q}). However, a more careful analysis of the factorization of P(S_{t+1}|S_t, M_{t+1}, M_t), as presented in Section III, enables the interchange of orders of summation among the components of S_t or S_{t+1} to reduce the overall cost. Let us consider the first summation and the case M_t ∈ P, M_{t+1} ∈ P; all other summations and cases are similar. From (12), it follows:

$$\sum_{S_t} \mu(M_t\in\mathcal{P}, S_t|Y_{1:t})\, P(S_{t+1}|S_t, M_{t+1}\in\mathcal{P}, M_t\in\mathcal{P}) = \sum_{T_t} \Big[ P(T_{t+1}|T_t, \cdot) \sum_{N_t} \Big[ P(N_{t+1}|N_t, \cdot) \sum_{A_t} P(A_{t+1}|A_t, \cdot)\, \mu(M_t, S_t|Y_{1:t}) \Big] \Big] \tag{45}$$

where (·) abbreviates the conditioning on M_{t+1} ∈ P, M_t ∈ P. The cost of (45) becomes O(N_N N_T (N_A + N_{A^Q})(N_N + N_T + N_A)). The cost for all required summations of this form hence becomes O(N_N N_T (N_A + N_{A^Q})(N_N + N_T + N_A + N_{A^Q})), meaning that the overall cost for estimation of P(M_{t+1}|M_t) is O(K N_M² N_N N_T (N_A + N_{A^Q})(N_N + N_T + N_A + N_{A^Q})).
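The interchange of summation orders in (45) is conveniently expressed with einsum: each component kernel is contracted along its own axis, and the dense N_S × N_S kernel is never formed. A sketch for the pitched-to-pitched case, with illustrative shapes:

```python
import numpy as np

def factored_predict(mu, P_T, P_N, P_A):
    """Apply the factored state transition (45) to a posterior `mu` over
    (T_t, N_t, A_t) without forming the full N_S x N_S kernel.
    mu:  (NT, NN, NA)   current posterior over the state grid
    P_T: (NT, NT)       P(T_{t+1}|T_t), first index is T_{t+1}
    P_N: (NN, NN)       P(N_{t+1}|N_t)
    P_A: (NA, NA)       P(A_{t+1}|A_t)"""
    tmp = np.einsum('ij,jkl->ikl', P_T, mu)    # contract over T_t
    tmp = np.einsum('kj,ijl->ikl', P_N, tmp)   # contract over N_t
    return np.einsum('lj,ikj->ikl', P_A, tmp)  # contract over A_t
```

Each contraction costs the size of the state grid times one grid dimension, reproducing the O(N_N N_T (N_A + N_{A^Q})(N_N + N_T + N_A + N_{A^Q})) total cited above, versus O(N_S²) for the naive double sum.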

6While N_M = 5 for the present algorithm, one may consider adding modes which model more sophisticated envelope behavior (attack-decay-sustain-release) or which encode certain types of pitch fluctuations such as vibrato. Hence, it is useful to consider N_M as a variable for the cost analysis.


From (30), (32), we see that terms of the same type as in (45) dominate the cost of the primary inference; hence the computational cost of the latter is also O(K N_M² N_N N_T (N_A + N_AQ)(N_N + N_T + N_A + N_AQ)). Finally, the postprocessing steps require the marginalization of P(N_t | M*_{1:K}, Y_{1:K}) from P(S_t | M*_{1:K}, Y_{1:K}), a process which requires O(N_S) = O(N_N N_T (N_A + N_AQ)) computations per frame, plus the iterations (40), which are at most linear in the number of frames and constant with respect to the other quantities. The postprocessing computations thus appear negligible relative to the other stages. As such, the overall computational cost of the proposed method remains O(K N_M² N_N N_T (N_A + N_AQ)(N_N + N_T + N_A + N_AQ)).
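To make the savings from the summation interchange in (45) concrete, the following sketch evaluates a summation of the first type in (44) both naively over the joint state and via sequential per-component sums, for fixed modes. The factored transition matrices P_N, P_T, P_A are random placeholders for the conditionals of Section III, and the single amplitude dimension here stands in for the combined pitched and null amplitudes. The naive contraction touches a joint transition kernel of size N_S², while the factored version costs only O(N_N N_T N_A (N_N + N_T + N_A)).

    import numpy as np

    NN, NT, NA = 10, 5, 8                       # note, tuning, amplitude cardinalities
    rng = np.random.default_rng(0)

    def rand_trans(n):                          # random row-stochastic matrix
        P = rng.random((n, n))
        return P / P.sum(axis=1, keepdims=True)

    P_N, P_T, P_A = rand_trans(NN), rand_trans(NT), rand_trans(NA)
    mu = rng.random((NN, NT, NA)); mu /= mu.sum()   # filtered posterior, fixed modes

    # Naive evaluation: contract against the full joint transition kernel,
    # O((NN*NT*NA)^2) operations.
    joint = np.einsum('nm,tu,ab->ntamub', P_N, P_T, P_A)   # (S_t, S_{t+1})
    out_naive = np.einsum('nta,ntamub->mub', mu, joint)

    # Factored evaluation per (45): sum out A_t, then N_t, then T_t,
    # O(NN*NT*NA*(NN+NT+NA)) operations.
    tmp = np.einsum('nta,ab->ntb', mu, P_A)      # sum over A_t
    tmp = np.einsum('ntb,nm->mtb', tmp, P_N)     # sum over N_t
    out_fact = np.einsum('mtb,tu->mub', tmp, P_T)  # sum over T_t

    assert np.allclose(out_naive, out_fact)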

VII. RESULTS

A. Primary inference

Our joint melody extraction and onset detection algorithm has been applied to a small collection of violin and piano recordings. In each case the score is monophonic; however, reverberation, note overlaps due to legato playing style, and even the performer's tendency to sing accompanying phrases are all in evidence. A representative piano recording is the introductory motive of Bach's Invention 2 in C minor (BWV 773), performed by Glenn Gould. The recording is fairly clean, with minimal reverberation; however, legato playing and the performer's tendency to vocalize approximately two octaves below the melody line pose particular difficulties for the pitch estimation. Onset locations, by contrast, are clearly retrievable by eye or by a heuristic following of amplitude jumps as in Schloss's “surfboard” algorithm [23].

A representative violin recording, several notes from the third movement of Bach's solo violin Sonata No. 1 in G minor (BWV 1001), performed by Nathan Milstein, is awash in reverberation, making the note transitions extremely difficult to locate by time-domain methods which monitor only amplitude. The reverberation also induces effective polyphony during significant portions of the note duration. Moreover, the presence of portamento renders ambiguous the exact moments of several note transitions. Nevertheless, the melody is clearly audible.

Primary inference results for the piano example are shown in Figure 6. Directly beneath the time-domain waveform are shown the onset indication and the MAP mode determination, M*_{1:K}. The MAP mode determination indicates the segmentation according to onset locations and note durations, as well as the sub-segmentation into transient and pitched regions within each note event (see Section V for details about how this information is extracted via postprocessing). Lower sections of the figure display posteriors for note, tuning, amplitude, and null amplitude values for t ∈ 1:K given M*_{1:K} and the STFT peaks Y_{1:K}.

We observe that the concentration of the note posterior on the correct value continues after the endpoint of the note event and into the initial transient portion of the next event. This continuation is by design: to model note-to-note dependences via P_note-trans(N_{t+1}|N_t), we memorize the note value until the beginning of the first pitched region of the next event (see Section III). For melody extraction, we select the value maximizing the note posterior for the initial frame within the first pitched region. This value is assigned to the entire note event. In the example, there appears virtually no pitch ambiguity, despite the occasional presence of the performer's voice. Amplitude posteriors trace out appropriately decaying envelopes, and tuning remains relatively constant throughout.

Results for the violin example are shown in Figure 7. In this example, we validate the segmentation by extracting portions corresponding to identified note events and repeatedly listening to them, since it is extremely difficult to detect the majority of onset locations by eye. The reverberation and expressive pitch variations in this example tend toward the worst case among the range of similar recordings tested; however, such effects are common enough that it becomes necessary to assess the performance under these conditions. We note that the initial grace note is incorrectly detected an octave below the true pitch; however, the surrounding notes are correctly segmented. The tuning posterior is less consistent than in the piano example, owing to the range of pitch variations (portamento, vibrato, etc.) available to the performer. A clear illustration of tuning variability appears in the form of a portamento transition between the fourth and fifth notes of the violin example (Figure 7). As the performer slides smoothly down one semitone, the tuning posterior tracks the expressive gesture by first concentrating on pitches corresponding to the lower tuning values relative to the fourth note, then concentrating on higher-pitched tunings at the beginning of the fifth note7. Indeed, despite the highly reverberant context and expressive pitch variations, we observe that onset locations are identified in the places that make the best sense for the determination of note boundaries.

B. Estimation of P(M_{t+1}|M_t)

The convergence of the constrained EM approach for the estimation of P(M_{t+1}|M_t) is shown in Figure 8 for the uniform and Poisson initializations. While the uniform case nears convergence after five iterations, the Poisson case converges much faster, after only two iterations. We note empirically that the melody extraction and segmentation results remain relatively insensitive to deviations in the specification of P(M_{t+1}|M_t); however, the exact distribution may itself be of interest in characterizing recording qualities and performance tendencies.

VIII. CONCLUSIONS AND FUTURE RESEARCH

As the included piano and violin examples demonstrate, our method provides a robust segmentation and melody extraction in the presence of significant reverberation, note overlaps across onset boundaries, expressive pitch variations, and background voices. The violin example demonstrates the capacity of the algorithm to correctly segment audio even when the locations of many onsets in the time-domain waveform are ambiguous.

7In cases where the pitch fluctuation within one note may be greater than a half-step (e.g., vocal vibrato), the space of tunings may be extended to encompass a wider range of values.


Fig. 6. Piano example: introductory motive of Bach's Invention 2 in C minor (BWV 773), performed by Glenn Gould. Panels, top to bottom: Signal, Onset indicator, Modes (N, CP, CT, OP, OT), Notes (44-80), Tunings (−0.50 to 0.41), Amps (0.6911-55.2931), and NullAmps (0.0006-0.0501), plotted against frame index (1-150). Rectangle sizes vary logarithmically according to probability, with the smallest visible rectangle corresponding to a probability of 0.03, and the largest, 1.0.

Fig. 7. Violin example: several notes from the third movement of Bach's solo violin Sonata No. 1 in G minor (BWV 1001), performed by Nathan Milstein. Panels, top to bottom: Signal, Onset indicator, Modes (N, CP, CT, OP, OT), Notes (44-80), Tunings (−0.50 to 0.41), Amps (0.6911-55.2931), and NullAmps (0.0006-0.0501), plotted against frame index (1-110). Rectangle sizes vary logarithmically according to probability, with the smallest visible rectangle corresponding to a probability of 0.03, and the largest, 1.0.


Fig. 8. Comparison of EM convergence with uniform vs. Poisson initialization. Panels: (a) initial uniform; (b)-(f) uniform iterations 1-5; (g) initial Poisson; (h)-(l) Poisson iterations 1-5. In each panel, states along the horizontal axis correspond to M_{t+1} and those along the vertical axis to M_t, over the modes OT, OP, CT, CP, N. Rectangle size varies logarithmically according to probability, with the smallest visible square corresponding to a probability of 0.03, and the largest, 1.0.

Forthcoming extensions include the integration of additional layers of musical context and the simultaneous sample-accurate segmentation and demixing of individual note events overlapping at onset boundaries. To incorporate more general models of melodic expectation, we augment the state S_t to encode higher-level contextual attributes such as key and additional pitch history via N-grams. Such encoding allows us to capture melodic tendencies such as inertia, gravity, and magnetism, among others addressed in functional tonal analysis [15], [17]. To perform sample-accurate segmentation, we recognize that the approximate location of the musical onset is determined by the frame in which it is detected; this region may be expanded to include a surrounding neighborhood of frames. Additionally, information about the pitch content before and after onset locations, as provided by our method, may greatly aid the time-domain methods to be used for sample-accurate segmentation. Such investigations, as well as the development of polyphonic extensions using the multipitch likelihood evaluation presented in [16], are currently in progress.

APPENDIX I
INDEPENDENT AMPLITUDE OBSERVATION WEIGHTING

Recall from Section III-D that we may interpolate between the rigid cases (A_{0,t} = A_{max,t} vs. A_{0,t} = A_t) simply by varying σ²_A between zero and infinity. Assuming A_t ∈ ℝ⁺, as σ²_A → 0, the pitch inference P(N_t, T_t | Y_t, A_{max,t}) becomes identical to the inference P′(N_t, T_t | Y_t), where P′(Y_t | N_t, T_t, A_t) is defined as the canonical evaluation with A_{0,t} replaced by A_{max,t}. On the other hand, as σ²_A → ∞, the pitch inference P(N_t, T_t | Y_t, A_{max,t}) converges to P(N_t, T_t | Y_t), which is the canonical evaluation using A_{0,t} = A_t, with A_{max,t} being ignored.

To show this, we first consider the case σ²_A → ∞. Here the dependence on A_t vanishes:

    P(A_{max,t} | A_t) → P(A_{max,t})    (46)

As a result, A_{max,t} and the collection {Y_t, N_t, T_t, A_t} become mutually independent. Then P(N_t, T_t | Y_t, A_{max,t}) converges to P(N_t, T_t | Y_t), as was to be shown.

Now we consider the case σ²_A → 0. To begin, we note that as σ²_A → 0, P(A_{max,t} | A_t) becomes impulsively concentrated about A_t; i.e.,

    P(A_{max,t} | A_t) ∼ δ(A_{max,t}, A_t)    (47)

It suffices to show that, given (47), P(N_t, T_t | Y_t, A_{max,t}) becomes identical to the inference P′(N_t, T_t | Y_t), where

    P′(Y_t | N_t, T_t, A_t) = P_can(Y_t | f_0(N_t, T_t), A_{0,t} = A_{max,t})    (48)

Expanding P(N_t, T_t | Y_t, A_{max,t}) according to Bayes' rule yields the following:

    P(N_t, T_t | Y_t, A_{max,t}) × Σ_{ν,τ} ∫_{A_t} ρ(A_t, ν, τ, Y_t, A_{max,t}) dA_t
      = ∫_{A_t} ρ(A_t, N_t, T_t, Y_t, A_{max,t}) dA_t    ∀ N_t, T_t    (49)

where

    ρ(A_t, N_t, T_t, Y_t, A_{max,t}) ≜ P(A_t, N_t, T_t, Y_t, A_{max,t})    (50)

and

    P(A_t, N_t, T_t, Y_t, A_{max,t}) = P(A_t) P(N_t, T_t | A_t)
      × P_can(Y_t | f_0(N_t, T_t), A_{0,t} = A_t) δ(A_{max,t}, A_t)    (51)

Substituting (51) into (49) yields integral expressions with impulsive terms. These expressions, and hence (49), simplify as follows:

    ∫_{A_t} P(A_t, N_t, T_t, Y_t, A_{max,t}) dA_t = P(N_t, T_t | A_t = A_{max,t})
      × P_can(Y_t | f_0(N_t, T_t), A_{0,t} = A_{max,t})    (52)


Now, since A_t and {N_t, T_t} are a priori independent, (52) simplifies further:

    ∫_{A_t} P(A_t, N_t, T_t, Y_t, A_{max,t}) dA_t = P(N_t, T_t)
      × P_can(Y_t | f_0(N_t, T_t), A_{0,t} = A_{max,t})    (53)

It thereby follows that the substitution of (53) into (49) yields the same relation as the expansion of (48) via Bayes' rule (in parallel fashion to (49)). As such,

    P(N_t, T_t | Y_t, A_{max,t}) = P′(N_t, T_t | Y_t)    (54)

as was to be shown.

In the preceding, the space of A_t was assumed to be ℝ⁺, which is an uncountably infinite space. In actuality, the domain of A_t is limited, and the space is discretized to a finite set of possibilities. Nevertheless, provided the domain's extent is sufficiently large and σ²_A considerably exceeds the square of the largest spacing between A_t values, the results realized in practice become virtually identical to those of the analyzed situation where A_t ∈ ℝ⁺.
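The limiting behavior can be checked numerically on such a discretized toy model. In the sketch below, the grid A_grid, the two pitch hypotheses, and the Gaussian forms assumed for P(A_{max,t} | A_t) and for the likelihood are illustrative stand-ins, not the canonical model of [30].

    import numpy as np

    A_grid = np.linspace(0.1, 4.0, 400)            # discretized amplitude domain
    prior_A = np.full_like(A_grid, 1.0 / len(A_grid))

    def lik(h, A0):
        """Toy stand-in for Pcan(Y | f0(h), A0): peaks at a hypothesis-dependent
        amplitude, sharper for hypothesis 0."""
        center, width = (1.0, 0.3) if h == 0 else (2.0, 0.6)
        return np.exp(-0.5 * ((A0 - center) / width) ** 2)

    def posterior(Amax, sigma2):
        """P(h | Y, Amax) for two hypotheses, marginalizing A_t over the grid."""
        w = np.exp(-0.5 * (Amax - A_grid) ** 2 / sigma2)   # Gaussian P(Amax | At)
        p = np.array([np.sum(prior_A * lik(h, A_grid) * w) for h in (0, 1)])
        return p / p.sum()

    Amax = A_grid[120]                             # a grid point, for exact pinning

    # sigma2 -> 0: the inference pins A0 to Amax.
    small = posterior(Amax, sigma2=1e-8)
    pinned = np.array([lik(0, Amax), lik(1, Amax)]); pinned /= pinned.sum()
    assert np.allclose(small, pinned, atol=1e-6)

    # sigma2 -> infinity: Amax is ignored and At is marginalized under its prior.
    large = posterior(Amax, sigma2=1e6)
    marg = np.array([np.sum(prior_A * lik(h, A_grid)) for h in (0, 1)]); marg /= marg.sum()
    assert np.allclose(large, marg, atol=1e-4)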

APPENDIX II
DETERMINISTIC LIKELIHOOD APPROXIMATION

Let Y denote the set of STFT peak observations for a single frame. Y consists of a vector of peak frequencies F and a parallel vector of peak amplitudes A. We denote the radian frequency of the jth output peak by F(j), and the corresponding amplitude by A(j). The task is to evaluate the likelihood of these observations given hypotheses f_0 for the fundamental frequency and A_0 for the reference amplitude.

In [30], it is shown that this likelihood may be evaluated with respect to a template of fictitious input peaks which represent actual sinusoidal components generated by the pitch hypothesis (f_0, A_0). Each input peak governs the frequency and amplitude distribution of a peak potentially observed in the STFT output. Of course, not all input peaks are observed; moreover, spurious peaks due to interference and sidelobe detection may also appear in the output. In [30], the unknown mapping between input and output peaks is marginalized:

    P(F, A | f_0, A_0) = Σ_{L∈𝓛} P(L | f_0, A_0) P(F, A | L, f_0, A_0)    (55)

where L, a linkmap in the set of valid maps 𝓛, describes which output peaks correspond to which input peaks, which output peaks are spurious, and (by implication) which input peaks are missing from the output8. L can be represented as a mapping L : J_o → J_i, where J_o ≜ 1:N_o and J_i ≜ (1:N_i) ∪ {'S'}, N_o is the number of output peaks, N_i is the number of input peaks, and 'S' denotes the spurious possibility. That is, if j is the index of an output peak, L(j) returns the index of the corresponding input peak, except when L(j) = 'S', which means that output peak is spurious. The indices of output peaks are sorted in terms of increasing frequency; the indices of input peaks are sorted in terms of increasing mean frequency parameter. An example linkmap is shown in Figure 9.

8The problem of unknown linkage of STFT peak data with possible sinusoidal components generated by the pitch hypothesis is conceptually similar to that encountered in radar applications, where various measurements, some of which may be spurious, must be associated with specific hypothesized targets. Each input peak in the pitch hypothesis evaluation corresponds to a target in the radar problem. It is hence unsurprising that the direct marginalization of the unknown linkmap in (55) parallels earlier approaches from the radar literature, particularly the joint probabilistic data association filter (JPDAF) method of Bar-Shalom et al. [1], [2].

Fig. 9. Example linkmap, where numerical values represent the radian frequency associated with each peak. Input peaks lie at .05, .10, .15, .20, .25; output peaks at .07, .10, .13, .28. Symbolically, L(1) = 'S', L(2) = 2, L(3) = 'S', and L(4) = 5.

From [30], we obtain that the set of valid linkmaps comprises those L for which either of the following statements holds for any two output indices j(0) and j(1) in J_o:

• If both output peaks are linked, the linkages do not cross; i.e., if L(j(0)) ∈ 1:N_i and L(j(1)) ∈ 1:N_i, then j(1) > j(0) → L(j(1)) > L(j(0)).
• One or both of the output peaks is spurious; i.e., L(j(0)) = 'S' or L(j(1)) = 'S'.
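For clarity, the non-crossing condition can be expressed in a few lines of code. The sketch below, with 'S' encoded as None, is only an illustration of the validity rule; the brute-force count it checks against anticipates the binomial expression given later in (62).

    from itertools import product
    from math import comb

    def is_valid_linkmap(L, n_inputs):
        """L[j] is the (1-based) input index linked to output peak j+1, or None
        if that output peak is spurious. A map is valid when the linkages over
        the non-spurious output peaks are strictly increasing (no crossings)."""
        linked = [x for x in L if x is not None]
        return (all(1 <= x <= n_inputs for x in linked)
                and all(a < b for a, b in zip(linked, linked[1:])))

    # The linkmap of Figure 9: L(1) = 'S', L(2) = 2, L(3) = 'S', L(4) = 5.
    assert is_valid_linkmap([None, 2, None, 5], n_inputs=5)
    assert not is_valid_linkmap([3, 2, None, 5], n_inputs=5)   # crossing linkage

    # Brute-force enumeration over all maps Jo -> Ji for small No, Ni agrees
    # with the binomial count given later in (62).
    No, Ni = 4, 5
    count = sum(is_valid_linkmap(list(L), Ni)
                for L in product([None] + list(range(1, Ni + 1)), repeat=No))
    assert count == sum(comb(No, j) * comb(Ni, j) for j in range(min(No, Ni) + 1))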

In practice, we may approximate the summation in (55) by taking P(L | f_0, A_0) as a uniform prior. That is,

    P(F, A | f_0, A_0) ≈ (1/#𝓛) Σ_{L∈𝓛} P(F, A | L, f_0, A_0)    (56)

where #𝓛 is the cardinality of 𝓛. Even though the specification of P(L | f_0, A_0) in [30] varies significantly with L, we expect less variation for those L for which P(F, A | L, f_0, A_0) is large, provided (f_0, A_0) is reasonably close to the true pitch hypothesis. This approximation has been tested empirically on a limited number of acoustic examples ranging from clean to noisy and reverberant recordings of piano and violin, including those used to generate the results in Section VII.

As specified in [30], P(F, A | L, f_0, A_0) factors as a product distribution over individual output peaks:

    P(F, A | L, f_0, A_0) = Π_{j=1}^{N_o} P(F(j), A(j) | L(j), f_0, A_0)    (57)

which means

    P(F, A | f_0, A_0) ≈ (1/#𝓛) Σ_{L∈𝓛} Π_{j=1}^{N_o} P(F(j), A(j) | L(j), f_0, A_0)    (58)

The form of the individual P(F(j), A(j) | L(j), f_0, A_0) distributions, as given in [30], takes into account amplitude and frequency deviations in STFT peaks resulting from signals containing sinusoids embedded in additive white Gaussian noise.

Now, suppose we extend the summation in (58) over the set of all possible maps J_o → J_i, which we denote as 𝓛*. Then 𝓛* becomes a Cartesian product space, enabling the interchange of sums and products in (58). Making this substitution yields

    P(F, A | f_0, A_0) ≈ (1/#𝓛) Σ_{L∈𝓛*} Π_{j=1}^{N_o} P(F(j), A(j) | L(j), f_0, A_0)    (59)

where 𝓛* = l*_1 ⊗ l*_2 ⊗ … ⊗ l*_{N_o}, and l*_j denotes the set of possible maps from the index j to J_i, each of which corresponds to a possibility for L(j). Thus, the summation (59) may be written

    P(F, A | f_0, A_0) ≈ (1/#𝓛) Σ_{L(1)∈J_i} Σ_{L(2)∈J_i} … Σ_{L(N_o)∈J_i} Π_{j=1}^{N_o} P(F(j), A(j) | L(j), f_0, A_0)    (60)

Interchanging sums and products yields

    P(F, A | f_0, A_0) ≈ (1/#𝓛) Π_{j=1}^{N_o} Σ_{L(j)∈J_i} P(F(j), A(j) | L(j), f_0, A_0)    (61)

The approximation and the product-sum interchange allow the likelihood evaluation to take place using far fewer computations than the exact evaluation of (55). In (61), the number of link evaluations of the form P(F(j), A(j) | L(j), f_0, A_0) is exactly N_o(N_i + 1), which for N_o = N_i = N is O(N²). By contrast, the number of terms enumerated in (55) is9:

    #𝓛 = Σ_{j=0}^{min(N_o, N_i)} C(N_o, j) C(N_i, j)    (62)

where C(n, k) denotes the binomial coefficient. For N_o = N_i = N, the number of terms in (62) is O(4^N / √N) [30]. Hence the deterministic approximation yields a polynomial-time solution to an inherently exponential-time problem.
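The following sketch illustrates the approximate evaluation (61): given a table link_lik of precomputed link likelihoods (each output peak against each input peak, with the last column the spurious case), the approximation reduces to a product of per-peak sums scaled by the count (62). The table entries are random placeholders here; in the actual system, each entry would come from the peak distributions of [30].

    import numpy as np
    from math import comb

    No, Ni = 6, 8
    rng = np.random.default_rng(1)
    # link_lik[j, i] = P(F(j), A(j) | L(j) = i+1, f0, A0) for i < Ni;
    # the last column is the spurious case L(j) = 'S'. Random placeholders.
    link_lik = rng.random((No, Ni + 1))

    # Cardinality of the valid-linkmap set, per (62).
    card_L = sum(comb(No, j) * comb(Ni, j) for j in range(min(No, Ni) + 1))

    # Deterministic approximation (61): a product over output peaks of the sum
    # over all link possibilities -- only No*(Ni+1) link evaluations in total.
    approx_lik = np.prod(link_lik.sum(axis=1)) / card_L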

In practice, approximating the sum (58) by (61) seems to obtain virtually identical results for the acoustic examples considered. The approximation appears to hold best where the output peaks are well separated in frequency and where the pitch content is salient, meaning that the standard deviation of F(j) given any L(j) ∈ 1:N_i is small. We expect output peaks to be well separated in the context of nonreverberant harmonic sounds which lack complex character (beating, chorusing, etc.), or when sufficiently long analysis windows are used. To show this, define 𝓛̄ = 𝓛* \ 𝓛; the approximation error, comprising the difference between the right-hand sides of (59) and (58), may be expressed as

    η(F, A | f_0, A_0) = (1/#𝓛) Σ_{L∈𝓛̄} Π_{j=1}^{N_o} P(F(j), A(j) | L(j), f_0, A_0)    (63)

9Though we may precompute the N_o(N_i + 1) individual link evaluations for the product term in (58), there remains the task of forming the specific product combination for each L ∈ 𝓛.

By definition of 𝓛̄, each L ∈ 𝓛̄ has the property that there exist output indices j(0), j(1) ∈ 1:N_o for which j(1) > j(0) but L(j(1)) ≤ L(j(0)). If P(F(j(0)), A(j(0)) | L(j(0)), f_0, A_0) is negligibly small, this term tends to annihilate the product corresponding to L in (63). On the other hand, if P(F(j(0)), A(j(0)) | L(j(0)), f_0, A_0) is non-negligible, then by the pitch salience hypothesis F(j(0)) must lie close to the mean frequency of the input peak at index L(j(0)). Now by assumption j(1) > j(0), implying that F(j(1)) > F(j(0)). Furthermore, if we assume that the output peaks are well separated in frequency, the difference F(j(1)) − F(j(0)) is non-negligible. However, because L(j(1)) ≤ L(j(0)), the frequency-sorting construction implies that the mean frequency of the input peak with index L(j(1)) is less than or equal to that of the input peak with index L(j(0)). Hence F(j(1)) exceeds the mean frequency of the input peak associated with L(j(1)) by a non-negligible amount. As a result, P(F(j(1)), A(j(1)) | L(j(1)), f_0, A_0) becomes negligible, annihilating the product corresponding to L in (63). Thus, for all L ∈ 𝓛̄, each product becomes negligible, implying that η(F, A | f_0, A_0) is also negligible. Hence

    Σ_{L∈𝓛} Π_{j=1}^{N_o} P(F(j), A(j) | L(j), f_0, A_0) ≈ Σ_{L∈𝓛*} Π_{j=1}^{N_o} P(F(j), A(j) | L(j), f_0, A_0)    (64)

as was to be shown.

APPENDIX III
BAYESIAN SMOOTHING INFERENCE RECURSIONS

Recall from Section IV-B that each iteration of the EM depends on the computation of the pairwise mode posterior (36):

    σ(2)(M_t, M_{t+1}) = P(M_t, M_{t+1} | Y_{1:K}, θ(i)_M)    (65)

The quantities propagated in the filtering and smoothing passes, as well as the necessary inputs, are summarized in Table VI.

TABLE VI
INFERENCE INPUTS AND PROPAGATED QUANTITIES

Symbol                     | Quantity                                          | Description
π(M_0, S_0)                | P(M_0, S_0)                                       | Prior
                           | P(M_{t+1} | M_t)                                  | Mode transition probability
                           | P(S_{t+1} | S_t, M_{t+1}, M_t)                    | State transition probability
τ(M_t, S_t)                | P(M_t, S_t | Y_{1:t−1})                           | Predicted posterior
μ(M_t, S_t)                | P(M_t, S_t | Y_{1:t})                             | Filtered posterior
σ(M_t, S_t)                | P(M_t, S_t | Y_{1:K})                             | Smoothed posterior
σ(2)(M_t, M_{t+1})         | P(M_t, M_{t+1} | Y_{1:K})                         | Pairwise mode posterior
Ψ(M_t, S_{t+1}, M_{t+1})   | P(M_t, S_{t+1}, M_{t+1}) / P(M_{t+1} | M_t)       | Intermediate
φ(M_t, S_t)                | P(M_t, S_t | Y_{1:K}) / P(M_t, S_t | Y_{1:t−1})   | Intermediate

To initialize the filtering pass, we set μ(M_0, S_0) = π(M_0, S_0). The remainder of the filtering pass consists of the following recursions for t ∈ 0:K−1.


    Ψ(M_t, S_{t+1}, M_{t+1}) = Σ_{S_t} μ(M_t, S_t | Y_{1:t}) P(S_{t+1} | S_t, M_{t+1}, M_t)

    τ(M_{t+1}, S_{t+1}) = Σ_{M_t} P(M_{t+1} | M_t) Ψ(M_t, S_{t+1}, M_{t+1})

    μ(M_{t+1}, S_{t+1}) = τ(M_{t+1}, S_{t+1}) P(Y_{t+1} | M_{t+1}, S_{t+1})
      / Σ_{M_{t+1}, S_{t+1}} τ(M_{t+1}, S_{t+1}) P(Y_{t+1} | M_{t+1}, S_{t+1})    (66)

The smoothing pass begins with the initialization σ(M_K, S_K) = μ(M_K, S_K). The remainder of the smoothing pass consists of the following recursions, as t decreases from K−1 down to 1:

    φ(M_{t+1}, S_{t+1}) = P(M_{t+1}, S_{t+1} | Y_{1:K}) / P(M_{t+1}, S_{t+1} | Y_{1:t})

    σ(M_t, S_t) = μ(M_t, S_t) Σ_{M_{t+1}} [ P(M_{t+1} | M_t)
      Σ_{S_{t+1}} φ(M_{t+1}, S_{t+1}) P(S_{t+1} | S_t, M_{t+1}, M_t) ]

    σ(2)(M_t, M_{t+1}) = P(M_{t+1} | M_t) Σ_{S_{t+1}} φ(M_{t+1}, S_{t+1}) Ψ(M_t, S_{t+1}, M_{t+1})    (67)
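The recursions (66)-(67) amount to a switching-state variant of standard discrete forward-backward smoothing and may be vectorized directly. In the sketch below, the factored state S_t is flattened into a single index of size N_S, so that P(S_{t+1} | S_t, M_{t+1}, M_t) becomes the four-dimensional array Ps; the array shapes and the 0-based prior slot are implementation conventions of the sketch rather than of the paper.

    import numpy as np

    def smooth(Pm, Ps, obs, prior):
        """Bayesian smoothing recursions (66)-(67), vectorized with einsum.

        Pm    : (NM, NM),         Pm[m, n] = P(M_{t+1}=n | M_t=m)
        Ps    : (NM, NM, NS, NS), Ps[m, n, s, r] = P(S_{t+1}=r | S_t=s, M_{t+1}=n, M_t=m)
        obs   : (K, NM, NS),      obs[t] = P(Y_{t+1} | M_{t+1}, S_{t+1}) for frames 1..K
        prior : (NM, NS),         normalized prior pi(M_0, S_0)
        Returns smoothed posteriors sigma (K+1, NM, NS) and pairwise mode
        posteriors sigma2 (K, NM, NM).
        """
        K = obs.shape[0]
        NM, NS = prior.shape
        mu = np.empty((K + 1, NM, NS)); tau = np.empty((K + 1, NM, NS))
        Psi = np.empty((K, NM, NS, NM))    # Psi[t] indexed (M_t, S_{t+1}, M_{t+1})
        mu[0] = prior                      # filtering initialization

        for t in range(K):                 # forward pass (66)
            Psi[t] = np.einsum('ms,mnsr->mrn', mu[t], Ps)
            tau[t + 1] = np.einsum('mn,mrn->nr', Pm, Psi[t])
            mu[t + 1] = tau[t + 1] * obs[t]
            mu[t + 1] /= mu[t + 1].sum()

        sigma = np.empty_like(mu); sigma2 = np.empty((K, NM, NM))
        sigma[K] = mu[K]                   # smoothing initialization

        for t in range(K - 1, -1, -1):     # backward pass (67)
            phi = sigma[t + 1] / tau[t + 1]    # P(.|Y_{1:K}) / P(.|Y_{1:t})
            sigma[t] = mu[t] * np.einsum('mn,mnsr,nr->ms', Pm, Ps, phi)
            sigma2[t] = Pm * np.einsum('nr,mrn->mn', phi, Psi[t])

        return sigma, sigma2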

ACKNOWLEDGMENT

The authors would like to thank Brent Townshend for suggestions motivating the deterministic approximation for the single-frame likelihood evaluation, as well as several anonymous reviewers for helpful suggestions and corrections.

REFERENCES

[1] Y. Bar-Shalom, T. E. Fortmann, and M. Scheffe, “Joint probabilistic data association for multiple targets in clutter.” Proc. Conference on Information Sciences and Systems, Princeton, NJ, 1980.

[2] Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Association. Academic Press, Orlando, FL, 1988.

[3] C. Bishop, Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press, 1995.

[4] A. Cemgil, B. Kappen, and D. Barber, “Generative model based polyphonic music transcription.” Proc. 2003 IEEE WASPAA, New Paltz, NY, 2003.

[5] A. Cemgil, Bayesian Music Transcription. Ph.D. thesis, Radboud University Nijmegen, the Netherlands, 2004.

[6] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm.” J. Royal Statistical Society, Series B, 39:1-38, 1977.

[7] C. Duxbury, M. Davies, and M. Sandler, “Improved time-scaling of musical audio using phase locking at transients.” Proc. 112th AES Convention, Munich, 2002.

[8] B. Edler, “Codierung von Audiosignalen mit überlappender Transformation und adaptiven Fensterfunktionen” [Coding of audio signals with overlapping transform and adaptive window functions]. Frequenz, 43(9), 252-256, 1989.

[9] P. Fearnhead, “Exact and efficient Bayesian inference for multiple changepoint problems.” Technical report, Dept. of Mathematics and Statistics, Lancaster University, 2003.

[10] J. Goldstein, “An optimum processor theory for the central formation of the pitch of complex tones.” J. Acoustical Society of America, 54:1496-1516, 1973.

[11] S. W. Hainsworth, Techniques for the Automated Analysis of Musical Audio. Ph.D. thesis, University of Cambridge, 2003.

[12] T. Kailath, A. Sayed, and B. Hassibi, Linear Estimation. Prentice Hall, Englewood Cliffs, NJ, 2000.

[13] K. Kashino, K. Nakadai, T. Kinoshita, and H. Tanaka, “Application of Bayesian probability network to music scene analysis.” Working Notes of the IJCAI Workshop on Computational Auditory Scene Analysis (IJCAI-CASA), Montreal, 1995.

[14] K. Kashino and S. J. Godsill, “Bayesian estimation of simultaneous musical notes based on frequency domain modeling.” Proc. 2004 IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’04), Montreal, 2004.

[15] S. Larson and S. McAdams, “Musical forces and melodic expectations: comparing computer models and experimental results.” Music Perception, 21:4, 457-498, 2004.

[16] R. J. Leistikow, H. Thornburg, J. O. Smith, and J. Berger, “Bayesian identification of closely-spaced chords from single-frame STFT peaks.” Proc. 7th Intl. Conf. on Digital Audio Effects (DAFx-04), Naples, Italy, 2004.

[17] R. J. Leistikow, Bayesian Modeling of Musical Expectations via Maximum Entropy Stochastic Grammars. Ph.D. thesis, Stanford University, 2005.

[18] S. N. Levine and J. O. Smith, “A sines+transients+noise audio representation for data compression and time/pitch-scale modifications.” Proc. 105th AES Convention, San Francisco, CA, 1998.

[19] K. Murphy, “Filtering, smoothing, and the junction tree algorithm.” http://www.cs.ubc.ca/~murphyk/Papers/smooth.ps.gz, 1998.

[20] V. Pavlovic, J. M. Rehg, and T. Cham, “A dynamic Bayesian network approach to tracking using learned switching dynamic models.” Proc. Intl. Workshop on Hybrid Systems, Pittsburgh, PA, 2000.

[21] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition.” Proc. IEEE, 77:2, 257-286, 1989.

[22] C. Raphael, “Automatic segmentation of acoustic musical signals using hidden Markov models.” IEEE Trans. Pattern Analysis and Machine Intelligence, 21:360-370, 1999.

[23] W. A. Schloss, On the Automatic Transcription of Percussive Music – From Acoustic Signal to High-Level Analysis. Ph.D. thesis, Stanford University, 1995.

[24] A. Sheh and D. P. W. Ellis, “Chord segmentation and recognition using EM-trained hidden Markov models.” Proc. 4th Intl. Symposium on Music Information Retrieval (ISMIR-03), Baltimore, MD, 2003.

[25] J. O. Smith and X. Serra, “PARSHL: an analysis/synthesis program for nonharmonic sounds based on a sinusoidal representation.” Proc. 1987 Intl. Computer Music Conf. (ICMC-87), San Francisco, CA, 1987.

[26] H. Terasawa, M. Slaney, and J. Berger, “Perceptual distance in timbre space.” Proc. ICAD 05 – Eleventh Meeting of the International Conference on Auditory Display, Limerick, Ireland, 2005.

[27] E. Terhardt, “Pitch, consonance and harmony.” J. Acoustical Society of America, 55:1061-1069, 1974.

[28] H. Thornburg, Detection and Modeling of Transient Audio Signals with Prior Information. Ph.D. thesis, Stanford University, 2005.

[29] H. Thornburg and R. J. Leistikow, “Analysis and resynthesis of quasi-harmonic sounds: an iterative filterbank approach.” Proc. 6th Intl. Conf. on Digital Audio Effects (DAFx-03), London, 2003.

[30] H. Thornburg and R. J. Leistikow, “A new probabilistic spectral pitch estimator: exact and MCMC-approximate strategies.” Proc. Computer Music Modeling and Retrieval (CMMR-2004), Esbjerg, Denmark, 2004.

[31] P. J. Walmsley, S. J. Godsill, and P. J. W. Rayner, “Polyphonic pitch tracking using joint Bayesian estimation of multiple frame parameters.” Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, 1999.

[31] P. J. Walmsley, S. J. Godsill, and P. J. W. Rayner, “Polyphonic pitchtracking using joint Bayesian estimation of multiple frame parameters.”Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audioand Acoustics, New Paltz, NY, 1999.