Extracting Information from Music Audio

Transcript of "Extracting Information from Music Audio" (dpwe/talks/musicie-2006-02), 32 slides.

  • Slide 1: Extracting Information from Music Audio

    1. Motivation: Learning Music
    2. Notes Extraction
    3. Drum Pattern Modeling
    4. Music Similarity

    Dan Ellis
    Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
    Dept. of Electrical Engineering, Columbia University, NY USA
    http://labrosa.ee.columbia.edu/

  • Slide 2: LabROSA Overview

    [Diagram: Information Extraction at the intersection of Machine Learning and Signal Processing,
    applied to Speech, Music, and Environment audio, for Recognition, Retrieval, and Separation]

  • Slide 3: 1. Learning from Music

    • A lot of music data is available: e.g. 60 GB of MP3 ≈ 1000 hr of audio, 15k tracks
    • What can we do with it? An implicit definition of 'music'
    • Quality vs. quantity - the speech recognition lesson: 10x data with 1/10th the annotation is twice as useful
    • Motivating applications: music similarity (recommendation, playlists), computer (assisted) music generation, insight into music

  • Slide 4: Ground Truth Data

    • A lot of unlabeled music data is available; manual annotation is expensive and rare
    • Unsupervised structure discovery is possible .. but labels help to indicate what you want
    • Weak annotation sources: artist-level descriptions, symbol sequences without timing (MIDI), errorful transcripts
    • Evaluation requires ground truth - the limiting factor in Music IR evaluations?

    [Figure: spectrogram of /Users/dpwe/projects/aclass/aimee.wav (0-7 kHz, 0:02-0:28),
    hand-labeled with 'mus' and 'vox' segments]

  • Slide 5: Talk Roadmap

    [Diagram: roadmap connecting Music audio, Drums extraction, Eigen-rhythms, Event extraction,
    Melody extraction, Fragment clustering, Anchor models, Semantic bases, Similarity/recommend'n,
    and Synthesis/generation; stages numbered 1-4 follow the talk outline]

  • Slide 6: 2. Notes Extraction (with Graham Poliner)

    • Audio → Score is very desirable: for data compression, searching, learning
    • A full solution is elusive: signal separation of overlapping voices; music is constructed to frustrate!
    • Maybe simplify the problem: "Dominant Melody" at each time frame

    [Figure: spectrogram, frequency 0-4000 Hz vs. time 0-5 s]

  • Slide 7: Conventional Transcription

    • Pitched notes have harmonic spectra → transcribe by searching for harmonics, e.g. sinusoid modeling + grouping
    • Explicit, expert-derived knowledge

    [Embedded slide: E6820 SAPR, Dan Ellis, L10 - Music Analysis, 2005-04-06]

    Spectrogram Modeling
    • Sinusoid model - as with synthesis, but the signal is more complex
    • Break tracks - need to detect new 'onsets' at single frequencies
    • Group by onset & common harmonicity - find sets of tracks that start around the same time, plus a stable harmonic pattern
    • Pass on to constraint-based filtering...

    [Figure: sinusoid tracks over a spectrogram, freq 0-3000 Hz vs. time 0-4 s, with onset detail over 0-1.5 s]

  • Slide 8: Transcription as Classification

    • Signal models are typically used for transcription: harmonic spectrum, superposition
    • But ... trade domain knowledge for data - transcription as a pure classification problem:
      a single N-way discrimination for "melody"; per-note classifiers for polyphonic transcription
      (see the sketch below)

    [Diagram: Audio → Trained classifier → p("C0"|Audio), p("C#0"|Audio), p("D0"|Audio),
    p("D#0"|Audio), p("E0"|Audio), p("F0"|Audio), ...]
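    Below is a minimal sketch of this classification framing, assuming a precomputed feature
    matrix X (frames x spectral bins) and per-frame note labels y (e.g. MIDI note number, with 0
    for "no melody"); it is illustrative only, not the system evaluated in the talk.

    from sklearn.svm import SVC

    def train_melody_classifier(X, y):
        # All-pairs (one-vs-one) multiclass SVM over note labels, echoing the
        # "all-pairs SVMs (Weka)" setup mentioned later in the talk; the kernel
        # and C value are placeholder choices.
        clf = SVC(kernel="rbf", C=1.0, decision_function_shape="ovo")
        clf.fit(X, y)
        return clf

    def transcribe_frames(clf, X_test):
        # Independent per-frame note decisions; real systems add temporal
        # smoothing on top of this.
        return clf.predict(X_test)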

  • Slide 9: Melody Transcription Features

    • Short-time Fourier Transform magnitude (spectrogram)
    • Standardize over a 50-point frequency window (see the sketch below)
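    A sketch of this feature chain as I read it: STFT magnitude, then each bin standardized by the
    mean and deviation over a 50-bin local frequency window. The FFT size and hop are assumptions,
    not values from the talk.

    import numpy as np
    from scipy.signal import stft
    from scipy.ndimage import uniform_filter1d

    def melody_features(x, sr, n_fft=1024, hop=256, win_bins=50):
        # Magnitude spectrogram.
        _, _, Z = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
        S = np.abs(Z)                                    # (freq bins, frames)
        # Standardize each bin using a 50-bin local frequency window.
        mu = uniform_filter1d(S, size=win_bins, axis=0)
        var = uniform_filter1d(S ** 2, size=win_bins, axis=0) - mu ** 2
        sd = np.sqrt(np.maximum(var, 1e-12))
        return ((S - mu) / sd).T                         # frames x bins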

  • Slide 10: Training Data

    • Need {data, label} pairs for classifier training
    • Sources:
      pre-mixing multitrack recordings + hand-labeling?
      synthetic music (MIDI) + forced alignment?

    [Figure: spectrograms (freq 0-2 kHz vs. time 0-3.5 s) with an aligned note-label track]

  • Slide 11: Melody Transcription Results

    Table 1: Results of the formal MIREX 2005 Audio Melody Extraction evaluation, from
    http://www.music-ir.org/evaluation/mirex-results/audio-melody/. Results marked * are not
    directly comparable to the others because those systems did not perform voiced/unvoiced
    detection. Results marked † are artificially low due to an unresolved algorithmic issue.

    Rank  Participant  Overall Accuracy  Voicing d′  Raw Pitch  Raw Chroma  Runtime / s
    1     Dressler     71.4%             1.85        68.1%      71.4%       32
    2     Ryynänen     64.3%             1.56        68.6%      74.1%       10970
    3     Poliner      61.1%             1.56        67.3%      73.4%       5471
    3     Paiva 2      61.1%             1.22        58.5%      62.0%       45618
    5     Marolt       59.5%             1.06        60.1%      67.1%       12461
    6     Paiva 1      57.8%             0.83        62.7%      66.7%       44312
    7     Goto         49.9%*            0.59*       65.8%      71.8%       211
    8     Vincent 1    47.9%*            0.23*       59.8%      67.6%       ?
    9     Vincent 2    46.4%*            0.86*       59.6%      71.1%       251
    10    Brossier     3.2%*†            0.14*†      3.9%†      8.1%†       41

    [...] STFT frame in the analysis of the synthesized audio.

    1.4 Segmentation

    Voiced/unvoiced melody classification is performed by simple energy thresholding. The sum of
    the magnitude-squared energy over the frequency range 200 < f < 1800 Hz is calculated for each
    10 ms frame. Each frame is normalized by the median energy value for the given song, and
    segments are classified as voiced or unvoiced with respect to a global threshold (a sketch of
    this step follows).
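    A minimal sketch of that segmentation step, with assumed frame and threshold settings (the
    excerpt does not give the exact values):

    import numpy as np
    from scipy.signal import stft

    def voicing_mask(x, sr, thresh=1.0):
        # ~20 ms frames with a 10 ms hop.
        f, _, Z = stft(x, fs=sr, nperseg=int(0.02 * sr), noverlap=int(0.01 * sr))
        band = (f > 200) & (f < 1800)
        energy = (np.abs(Z[band]) ** 2).sum(axis=0)      # band energy per frame
        energy /= (np.median(energy) + 1e-12)            # normalize by song median
        return energy > thresh                           # True = voiced (melody present)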

    2 Results

    The results of the formal MIREX 2005 Audio Melody Extraction evaluation are shown in Table 1.
    While "Raw Pitch" and "Raw Chroma" measure the accuracy of the dominant melody pitch extraction
    (measured only over the frames that were tagged as containing melody in the ground truth, and
    where the latter ignores octave errors), the "Overall Accuracy" combines pitch accuracy with
    correct detection of unvoiced frames; the "Voicing d′" figure indicates the accuracy of the
    detection of frames that do or do not contain melody (d′ is the separation between two
    unit-variance Gaussians that would give the observed false alarm and false reject rates for
    some choice of threshold).

    Calculating statistical significance for these results is tricky because the classification of
    individual 10 ms windows is highly non-independent - in most cases, two temporally-adjacent
    frames will correspond to virtually identical classification problems. Each individual melody
    note comes much closer to an independent trial: we estimate that there are about 2000 such
    trials in the test set, which consisted of 25 musical excerpts from a range of styles of
    between 10 s and 40 s in length. Given this many trials, and assuming the error rates remain
    the same at the note level, a one-tailed binomial significance test requires a difference in
    error rates of about 2.4% for significance at the 5% level for results in this range. Thus,
    roughly, for overall accuracy the performance differences between the rank 1 (Dressler) and 2
    (Ryynänen) systems are significant, but the next three (including ours at rank 4) are not
    significantly different. Raw pitch and chroma, however, give another picture: for pitch, our
    system is in a three-way tie for top performance with the top two ranked systems, and when
    octave errors are ignored we are insignificantly worse than the best system (Ryynänen in this
    case), and almost significantly better than the top-ranked system of Dressler.

    The fact that Dressler's system performed best overall even though it did not have the highest
    raw pitch accuracy is because it combined high pitch accuracy with the best voicing detection
    scheme, achieving the highest d′. Our voicing detection scheme, which consisted of a simple
    adaptive energy threshold, came in a joint second on this measure. Because voicing errors lead
    to false negatives (deletion of pitched frames) and false positives (insertion of pitch values
    during non-melody times), this aspect of the algorithm had a significant impact on overall
    performance. Naturally, the systems that did not include a mechanism to distinguish between
    melody and accompaniment (Goto, Vincent, and Brossier) scored much lower on overall accuracy
    despite, in some cases, raw pitch and chroma performance very similar to the higher-ranked
    systems.

    We note with some regret that our system failed to score better overall than Paiva's 2nd
    submission despite exceeding it by a healthy margin on the other measures. This paradoxical
    result is explained in part by the fact that the voicing d′ is calculated from all frames
    pooled together, whereas the other measures are averaged at the level of the individual
    excerpts, giving greater weight to the shorter excerpts. Paiva 2 did better than our system on
    voicing detection in the shorter excerpts (which tended to be the non-pop-music examples),
    thus compensating for the worse performance on raw pitch. Also, although not represented in
    the statistics of Table 1, the voicing detection of Paiva 2 had an overall higher threshold
    (more false negatives and fewer false positives), which turned out to be a better strategy.

    The final column in Table 1 shows the execution time in seconds for each algorithm. We see an
    enormous variation of more than 1000:1 between the fastest and slowest systems - with the
    top-ranked system of Dressler also the fastest! Our system is expensive, at almost 200 times
    slower, but not as expensive as several of the others. The evaluation, of course, did not
    place any emphasis on exe[...]

    Melody Transcription Results

    • Trained on 17 examples, plus transpositions out to +/- 6 semitones; all-pairs SVMs (Weka)
    • Tested on the ISMIR MIREX 2005 set; includes foreground/background detection
    • Example...

  • Slide 12: Polyphonic Transcription

    • Train SVM detectors for every piano note: same features & classifier, but different labels;
      88 separate detectors, independent smoothing (see the sketch below)
    • Use MIDI syntheses and player-piano recordings: about 30 min of training data

    [Figure: per-note detector output (pitch A1-A6 vs. time 0-9 s, level / dB), Bach 847, Disklavier]
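    A hedged sketch of the per-note detector framing (not the released code): one binary
    classifier per piano key, each followed by simple median smoothing along time. X is a
    frame-feature matrix and Y a binary piano roll, both assumed to be given.

    import numpy as np
    from scipy.signal import medfilt
    from sklearn.svm import LinearSVC

    def train_note_detectors(X, Y):
        # One binary detector per piano key (assumes every note occurs at least
        # once in the training labels Y, a frames x 88 binary piano roll).
        return [LinearSVC(C=1.0).fit(X, Y[:, n]) for n in range(Y.shape[1])]

    def piano_roll(detectors, X_test, smooth=5):
        R = np.stack([d.predict(X_test) for d in detectors], axis=1)   # frames x 88
        # Independent median smoothing along time for each note.
        return np.stack([medfilt(R[:, n].astype(float), smooth)
                         for n in range(R.shape[1])], axis=1)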

  • Slide 13: Piano Transcription Results

    • Significant improvement from the classifier - frame-level accuracy results:
    • Breakdown by frame type:

    http://labrosa.ee.columbia.edu/projects/melody/

    Table 1: Frame-level transcription results.

    Algorithm           Errs    False Pos  False Neg  d′
    SVM                 43.3%   27.9%      15.4%      3.44
    Klapuri & Ryynänen  66.6%   28.1%      38.5%      2.71
    Marolt              84.6%   36.5%      48.1%      2.35

    • Overall accuracy Acc: a frame-level version of the metric proposed by Dixon in
      [Dixon, 2000], defined as:

        Acc = N / (FP + FN + N)                              (3)

      where N is the number of correctly transcribed frames, FP is the number of unvoiced frames
      UV transcribed as voiced V, and FN is the number of voiced frames transcribed as unvoiced.

    • Error rate Err: the unbounded error rate is defined as:

        Err = (FP + FN) / V                                  (4)

      Additionally, we define the false positive rate FPR and false negative rate FNR as FP/V and
      FN/V respectively.

    • Discriminability d′: a measure of the sensitivity of a detector that attempts to factor out
      the overall bias toward labeling any frame as voiced (which can move both hit rate and false
      alarm rate up and down in tandem). It converts the hit rate and false alarm rate into
      standard deviations away from the mean of an equivalent Gaussian distribution, and reports
      the difference between them. A larger value indicates a detection scheme with better
      discrimination between the two classes [Duda et al., 2001]:

        d′ = |Qinv(N/V) − Qinv(FP/UV)|                       (5)
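    The following sketch computes these three metrics from binary reference/estimate voicing
    masks; it is my own helper (not code from the paper), and it takes Qinv to be the inverse of
    the Gaussian tail (Q) function.

    import numpy as np
    from scipy.stats import norm

    def frame_metrics(ref, est):
        ref, est = np.asarray(ref, bool), np.asarray(est, bool)
        V, UV = ref.sum(), (~ref).sum()      # voiced / unvoiced reference frames
        N = (ref & est).sum()                # correctly transcribed voiced frames
        FP = (~ref & est).sum()              # unvoiced labeled as voiced
        FN = (ref & ~est).sum()              # voiced labeled as unvoiced
        acc = N / (FP + FN + N)              # Eq. (3)
        err = (FP + FN) / V                  # Eq. (4)
        qinv = norm.isf                      # inverse Gaussian tail (Q) function
        d_prime = abs(qinv(N / V) - qinv(FP / UV))   # Eq. (5)
        return acc, err, d_prime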

    As displayed in Table 1, the discriminative model provides a significant performance advantage
    on the test set with respect to frame-level transcription accuracy. This result highlights the
    merit of a discriminative model for candidate note identification. Since the transcription
    problem becomes more complex with the number of simultaneous notes, we have also plotted the
    frame-level classification accuracy versus the number of notes present for each of the
    algorithms in the left panel of Figure 4, and the composition of the classification error rate
    versus the number of simultaneously occurring notes for the proposed algorithm is displayed in
    the right panel. As expected, there is an inverse relationship between the number of notes
    present and the proportional contribution of insertion errors to the total error rate.
    However, the performance degradation of the proposed system is not as significant as for the
    harmonic-based models.

    [Figure 4: frame-level classification accuracy vs. number of simultaneous notes (left); error
    composition vs. number of notes for the proposed algorithm (right)]

  • Slide 14: 3. Eigenrhythms: Drum Pattern Space (with John Arroyo)

    • Pop songs are built on a repeating "drum loop": variations on a few bass, snare, hi-hat patterns
    • Eigen-analysis (or ...) to capture the variations? by analyzing lots of (MIDI) data, or from audio
    • Applications: music categorization, "beat box" synthesis, insight

  • Slide 15: Aligning the Data

    • Need to align patterns prior to modeling...
      tempo (stretch): by inferring BPM & normalizing
      downbeat (shift): correlate against a 'mean' template (see the sketch below)
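    A hedged sketch of these two alignment steps under an assumed data layout (each pattern as an
    instrument x time matrix); this is not the paper's exact procedure.

    import numpy as np

    def normalize_tempo(pattern, target_len):
        # pattern: (n_instruments, n_samples); stretch each row to target_len.
        _, n = pattern.shape
        xo, xn = np.linspace(0, 1, n), np.linspace(0, 1, target_len)
        return np.stack([np.interp(xn, xo, row) for row in pattern])

    def align_downbeat(pattern, template):
        # Pick the circular shift that best correlates with the 'mean' template.
        flat_t = template.ravel()
        scores = [np.dot(np.roll(pattern, s, axis=1).ravel(), flat_t)
                  for s in range(pattern.shape[1])]
        return np.roll(pattern, int(np.argmax(scores)), axis=1)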

  • Slide 16: Eigenrhythms (PCA)

    • Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dims)
    • Eigenrhythms both add and subtract (a PCA sketch follows)
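    A minimal PCA sketch of the eigenrhythm computation, assuming P holds the aligned patterns as
    rows of 1200-dimensional vectors (illustrative only, not the paper's code).

    import numpy as np

    def eigenrhythms(P, k=20):
        # P: patterns x 1200 (e.g. 3 instruments x 400 time samples, flattened).
        mean = P.mean(axis=0)
        _, _, Vt = np.linalg.svd(P - mean, full_matrices=False)
        basis = Vt[:k]                       # k eigenrhythms
        weights = (P - mean) @ basis.T       # coordinates in eigenrhythm space
        return mean, basis, weights

    def resynthesize(mean, basis, w):
        # Rebuild a (flattened) drum pattern from eigenspace coordinates w.
        return mean + w @ basis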

  • Slide 17: Posirhythms (NMF)

    • Nonnegative: only adds beat-weight
    • Capturing some structure (an NMF sketch follows the figure)

    [Figure: Posirhythms 1-6 as bass drum (BD) / snare (SN) / hi-hat (HH) activation patterns over
    four beats (samples at 200 Hz, beats at 120 BPM)]
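    A minimal non-negative matrix factorization sketch (standard multiplicative updates)
    illustrating the posirhythm idea; the rank k, iteration count, and initialization are
    arbitrary choices, not values from the talk.

    import numpy as np

    def nmf(P, k=6, iters=200, eps=1e-9):
        # P: non-negative patterns x time-feature matrix; factor P ≈ W @ H.
        rng = np.random.default_rng(0)
        W = rng.random((P.shape[0], k)) + eps    # per-pattern activations
        H = rng.random((k, P.shape[1])) + eps    # k "posirhythm" basis patterns
        for _ in range(iters):
            H *= (W.T @ P) / (W.T @ W @ H + eps)
            W *= (P @ H.T) / (W @ H @ H.T + eps)
        return W, H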

  • Slide 18: Eigenrhythms for Classification

    • Projections in eigenspace / LDA space
    • 10-way genre classification (nearest neighbor): PCA3: 20% correct, LDA4: 36% correct
      (see the sketch below)

    [Figure: scatter plots of the PCA(1,2) projection (16% correct) and LDA(1,2) projection (33%
    correct), colored by genre: blues, country, disco, hiphop, house, newwave, rock, pop, punk, rnb]
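    A sketch of the projection-plus-nearest-neighbour classification described above; the accuracy
    figures quoted on the slide come from the talk, not from this code.

    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    def genre_classifier(X_train, y_train, n_dims=4, use_lda=True):
        # Project aligned patterns into a low-dimensional space, then classify
        # with a nearest-neighbour rule.
        proj = (LinearDiscriminantAnalysis(n_components=n_dims) if use_lda
                else PCA(n_components=n_dims))
        return make_pipeline(proj, KNeighborsClassifier(n_neighbors=1)).fit(X_train, y_train)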

  • Slide 19: Eigenrhythm BeatBox

    • Resynthesize rhythms from eigen-space

  • Slide 20: 4. Music Similarity (with Mike Mandel and Adam Berenzweig)

    • Can we predict which songs "sound alike" to a listener .. based on the audio waveforms?
      there are many aspects to subjective similarity
    • Applications: query-by-example, automatic playlist generation, discovering new music
    • Problems: the right representation, modeling individual similarity

  • Slide 21: Music Similarity Features

    • Need "timbral" features: Mel-Frequency Cepstral Coefficients (MFCCs)
      auditory-like frequency warping
      log-domain
      discrete cosine transform for orthogonalization
      (see the sketch below)

    [Diagram: Spectrogram → Mel-Frequency Spectrogram → Mel-Frequency Cepstral Coefficients]
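    A hedged sketch of the MFCC chain in the diagram (mel warping, log, DCT); a real system would
    more likely use a standard implementation, and the simplified triangular filterbank here is an
    assumption.

    import numpy as np
    from scipy.fftpack import dct
    from scipy.signal import stft

    def mfcc(x, sr, n_mels=40, n_ceps=13, n_fft=1024):
        f, _, Z = stft(x, fs=sr, nperseg=n_fft)
        power = np.abs(Z) ** 2                             # spectrogram
        mel = 2595 * np.log10(1 + f / 700)                 # auditory-like warping
        edges = np.linspace(mel.min(), mel.max(), n_mels + 2)
        fb = np.zeros((n_mels, len(f)))
        for m in range(n_mels):                            # triangular mel filters
            lo, ctr, hi = edges[m], edges[m + 1], edges[m + 2]
            fb[m] = np.clip(np.minimum((mel - lo) / (ctr - lo),
                                       (hi - mel) / (hi - ctr)), 0, None)
        logmel = np.log(fb @ power + 1e-10)                # log-domain mel spectrogram
        return dct(logmel, axis=0, norm="ortho")[:n_ceps].T  # frames x cepstral coeffs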

  • Slide 22: Timbral Music Similarity

    • Measure similarity of the feature distribution:
      i.e. collapse across time to get a density p(xi)
      compare by e.g. KL divergence
    • e.g. Artist identification:
      learn an artist model p(xi | artist X) (e.g. as a GMM)
      classify an unknown song to the closest model (see the sketch below)

    [Diagram: MFCCs from training songs → GMMs for Artist 1 and Artist 2; a test song is compared
    to each model by KL divergence and assigned to the minimum-divergence artist]
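    A sketch of the GMM-based artist-ID idea; note that the slide compares feature distributions
    by KL divergence, whereas this illustration uses the closely related average log-likelihood of
    the test song's frames under each artist model.

    from sklearn.mixture import GaussianMixture

    def train_artist_models(mfccs_by_artist, n_components=16):
        # mfccs_by_artist: dict artist -> (frames x n_mfcc) training features.
        return {a: GaussianMixture(n_components, covariance_type="diag").fit(X)
                for a, X in mfccs_by_artist.items()}

    def identify_artist(models, test_mfccs):
        # Assign the test song to the model giving its frames the highest
        # average log-likelihood.
        return max(models, key=lambda a: models[a].score(test_mfccs))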

  • Slide 23: "Anchor Space"

    • Acoustic features describe each song
      .. but from a signal, not a perceptual, perspective
      .. and not the differences between songs
    • Use genre classifiers to define a new space: prototype genres are the "anchors"
      (see the sketch below)

    [Diagram: audio input (Class i or Class j) → GMM modeling → conversion to anchor space as an
    n-dimensional vector of anchor posteriors p(a1|x), p(a2|x), ..., p(an|x) → similarity
    computation (KL-d, EMD, etc.)]
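    A hedged sketch of the conversion to anchor space: each frame is re-described by its
    posteriors under a set of anchor-genre GMMs (equal priors assumed; names and model sizes are
    illustrative).

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_anchors(frames_by_genre, n_components=8):
        # One GMM per anchor genre.
        return {g: GaussianMixture(n_components, covariance_type="diag").fit(X)
                for g, X in frames_by_genre.items()}

    def to_anchor_space(anchors, X):
        # Per-frame posteriors p(anchor | x), assuming equal anchor priors.
        names = sorted(anchors)
        ll = np.stack([anchors[g].score_samples(X) for g in names], axis=1)
        ll -= ll.max(axis=1, keepdims=True)
        post = np.exp(ll)
        return names, post / post.sum(axis=1, keepdims=True)   # frames x n_anchors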

  • Slide 24: Anchor Space

    • Frame-by-frame high-level categorizations
      compare to raw features?
      properties in distributions? dynamics?

    [Figure: scatter plots of Madonna vs. Bowie frames in cepstral feature space (3rd vs. 5th
    cepstral coefficient) and in anchor space (Country vs. Electronica)]

  • Slide 25: 'Playola' Similarity Browser

  • Slide 26: Ground-truth Data

    • Hard to evaluate Playola's 'accuracy': user tests... ground truth?
    • "Musicseer" online survey:
      ran for 9 months in 2002
      > 1,000 users, > 20k judgments
      http://labrosa.ee.columbia.edu/projects/musicsim/

  • Slide 27: Evaluation

    Top-N ranking agreement score:

        s_i = Σ_{r=1}^{N} α_r^r α_c^{k_r},   with α_r = (1/2)^{1/3} and α_c = α_r^2
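    A worked sketch of the ranking-agreement score as reconstructed above; my reading is that k_r
    is the position of the reference's r-th item in the candidate ranking, and the slide shows no
    normalization, so none is applied here.

    ALPHA_R = 0.5 ** (1.0 / 3.0)        # α_r = (1/2)^(1/3)
    ALPHA_C = ALPHA_R ** 2              # α_c = α_r^2

    def ranking_agreement(ref_topN, candidate_ranking):
        # ref_topN: the reference's top-N items, r = 1..N;
        # candidate_ranking: the ordered list produced by the measure under test.
        pos = {item: k for k, item in enumerate(candidate_ranking, start=1)}
        return sum(ALPHA_R ** r * ALPHA_C ** pos[item]
                   for r, item in enumerate(ref_topN, start=1) if item in pos)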

    [Figure: "Top rank agreement" bar chart (%, 0-80) comparing measures cei, cmb, erd, e3d, opn,
    kn2, rnd, ANK against four ground-truth sets: SrvKnw (4789x3.58), SrvAll (6178x8.93),
    GamKnw (7410x3.96), GamAll (7421x8.92)]

    • Compare classifier measures against Musicseer subjective results:
      "triplet" agreement percentage
      Top-N ranking agreement score (formula and sketch above)
      first-place agreement percentage - a simple significance test

  • Slide 28: Using SVMs for Artist ID

    • Support Vector Machines (SVMs) find hyperplanes in a high-dimensional space
      relies only on the matrix of distances between points
      much 'smarter' than nearest-neighbor/overlap
      want diversity of reference vectors... (see the sketch below)

    [Figure: standard SVM margin diagram - separating hyperplane (w·x) + b = 0 with margin
    boundaries (w·x) + b = ±1 between classes y_i = +1 and y_i = -1]
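    A hedged sketch of the "distances only" point: an SVM trained on a precomputed kernel derived
    from a pairwise distance matrix. The exponential mapping is one common choice, not necessarily
    the one used in this work.

    import numpy as np
    from sklearn.svm import SVC

    def svm_from_distances(D_train, y_train, gamma=None):
        # D_train: (n x n) pairwise distances between training songs.
        gamma = gamma or 1.0 / np.median(D_train[D_train > 0])
        clf = SVC(kernel="precomputed").fit(np.exp(-gamma * D_train), y_train)
        return clf, gamma

    def predict_from_distances(clf, gamma, D_test_train):
        # D_test_train: (m x n) distances from test songs to all training songs.
        return clf.predict(np.exp(-gamma * D_test_train))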

  • Slide 29: Song-Level SVM Artist ID

    • Instead of one model per artist/genre, use every training song as an 'anchor';
      the SVM then finds the best support for each artist

    [Diagram: MFCCs → per-song features/distances for Artist 1 and Artist 2 training songs →
    DAG SVM → artist label for a test song]

  • Slide 30: Artist ID Results

    • ISMIR/MIREX 2005 also evaluated Artist ID
    • 148 artists, 1800 files (split train/test), from 'uspop2002'
    • The song-level SVM clearly dominates - using only MFCCs!

    Table 4: Results of the formal MIREX 2005 Audio Artist ID evaluation (USPOP2002), from
    http://www.music-ir.org/evaluation/mirex-results/audio-artist/.

    Rank  Participant  Raw Accuracy  Normalized  Runtime / s
    1     Mandel       68.3%         68.0%       10240
    2     Bergstra     59.9%         60.9%       86400
    3     Pampalk      56.2%         56.0%       4321
    4     West         41.0%         41.0%       26871
    5     Tzanetakis   28.6%         28.5%       2443
    6     Logan        14.8%         14.8%       ?
    7     Lidy         Did not complete

    References

    Jean-Julien Aucouturier and Francois Pachet. Improving timbre similarity: How high's the sky?
    Journal of Negative Results in Speech and Audio Sciences, 1(1), 2004.

    Adam Berenzweig, Beth Logan, Dan Ellis, and Brian Whitman. A large-scale evaluation of acoustic
    and subjective music similarity measures. In International Symposium on Music Information
    Retrieval, October 2003.

    Dan Ellis, Adam Berenzweig, and Brian Whitman. The "uspop2002" pop music data set, 2005.
    http://labrosa.ee.columbia.edu/projects/musicsim/uspop2002.html.

    Jonathan T. Foote. Content-based retrieval of music and audio. In C.-C. J. Kuo, Shih-Fu Chang,
    and Venkat N. Gudivada, editors, Proc. SPIE Vol. 3229, Multimedia Storage and Archiving
    Systems II, pages 138-147, October 1997.

    Alex Ihler. Kernel density estimation toolbox for MATLAB, 2005. http://ssg.mit.edu/~ihler/code/.

    Beth Logan. Mel frequency cepstral coefficients for music modelling. In International
    Symposium on Music Information Retrieval, 2000.

    Beth Logan and Ariel Salomon. A music similarity function based on signal analysis. In ICME
    2001, Tokyo, Japan, 2001.

    Michael I. Mandel, Graham E. Poliner, and Daniel P. W. Ellis. Support vector machine active
    learning for music retrieval. ACM Multimedia Systems Journal, 2005. Submitted for review.

    Pedro J. Moreno, Purdy P. Ho, and Nuno Vasconcelos. A Kullback-Leibler divergence based kernel
    for SVM classification in multimedia applications. In Sebastian Thrun, Lawrence Saul, and
    Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press,
    Cambridge, MA, 2004.

    Alan V. Oppenheim. A speech analysis-synthesis system based on homomorphic filtering. Journal
    of the Acoustical Society of America, 45:458-465, February 1969.

    William D. Penny. Kullback-Leibler divergences of normal, gamma, dirichlet and wishart
    densities. Technical report, Wellcome Department of Cognitive Neurology, 2001.

    John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin DAGs for multiclass
    classification. In S.A. Solla, T.K. Leen, and K.-R. Mueller, editors, Advances in Neural
    Information Processing Systems 12, pages 547-553, 2000.

    George Tzanetakis and Perry Cook. Musical genre classification of audio signals. IEEE
    Transactions on Speech and Audio Processing, 10(5):293-302, July 2002.

    Kristopher West and Stephen Cox. Features and classifiers for the automatic classification of
    musical audio signals. In International Symposium on Music Information Retrieval, 2004.

    Brian Whitman, Gary Flake, and Steve Lawrence. Artist detection in music with Minnowmatch. In
    IEEE Workshop on Neural Networks for Signal Processing, pages 559-568, Falmouth,
    Massachusetts, September 10-12, 2001.

    Changsheng Xu, Namunu C. Maddage, Xi Shao, Fang Cao, and Qi Tian. Musical genre classification
    using support vector machines. In International Conference on Acoustics, Speech, and Signal
    Processing. IEEE, 2003.


  • Slide 31: Playlist Generation

    • SVMs are well suited to "active learning": solicit labels on the items closest to the
      current decision boundary (see the sketch below)
    • An automatic player with "skip" = ground-truth data collection
      → active-SVM automatic playlist generation
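    A hedged sketch of the active-learning loop implied above: treat plays and skips as labels,
    retrain, and query the unlabeled song nearest the current decision boundary.

    import numpy as np
    from sklearn.svm import SVC

    def next_query(X_labeled, y, X_unlabeled):
        # y: +1 for played (kept), -1 for skipped. Returns the index of the
        # unlabeled song closest to the decision boundary, plus the model.
        clf = SVC(kernel="rbf").fit(X_labeled, y)
        uncertainty = np.abs(clf.decision_function(X_unlabeled))
        return int(np.argmin(uncertainty)), clf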

  • Slide 32: Conclusions

    • Lots of data + noisy transcription + weak clustering ⇒ musical insights?

    [Diagram: the talk roadmap repeated - Music audio, Drums extraction, Eigen-rhythms, Event
    extraction, Melody extraction, Fragment clustering, Anchor models, Semantic bases,
    Similarity/recommend'n, Synthesis/generation]