TRIM Indexation Audio - Telecom Paris
![Page 1: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/1.jpg)
Institut Mines-Télécom
Master MVA Analyse des signaux Audiofréquences Audio Signal Analysis, Indexing and Transformation
Lecture on Audio indexing or Machine Listening
Gaël RICHARD
TELECOM ParisTech
Image, Data, Signal department
January 2018
« Usage rights licence » (Licence de droits d'usage): http://formation.enst.fr/licences/pedago_sans.html
![Page 2: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/2.jpg)
Content
Introduction
• Interest and some applications
• A few dimensions of musical signals
• Some basics in signal processing
Analysing the music signal
• Pitch and Harmony,…
• Tempo and rhythm,…
• Timbre and musical instruments,..
• Polyphony,…
Some other machine listening applications
• Audio fingerprint
• Audio scene recognition
• Audio-based video search for music videos
![Page 3: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/3.jpg)
Foreword
Lecture largely based on:
• M. Mueller, D. Ellis, A. Klapuri, G. Richard, « Signal Processing for Music Analysis », IEEE Journal of Selected Topics in Signal Processing, Oct. 2011
With the help of some slides from:
• O. Gillet,
• A. Klapuri
• M. Mueller
• S. Fenet
• V. Bisot
![Page 4: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/4.jpg)
Audio indexing: motivation
An enormous amount of unstructured multimedia data is available nowadays.
The continuously growing amount of this digital multimedia information makes it increasingly difficult to access and manage, which hampers its practical usefulness.
New challenges for the information society:
• Making digital information more readily available to the user is becoming ever more critical.
• This creates a need for content-based parsing, indexing, processing and retrieval techniques.
![Page 5: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/5.jpg)
Search by content…..
![Page 6: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/6.jpg)
Why analyse the music signal?
Search by content
• From a music piece…
• From a hummed query…
• New music that I will like/love…
• A cover version of my favourite title
• A video that matches a music piece…
• …
New applications
• Semantic playlists (play music pieces that are gradually faster…)
• « Smart » karaoke (the music follows the singer…)
• Predicting the potential success of a single
• Automatic mixing, DJing
• Active listening…
[Figure labels: musical jogging; synchronous modifications; playlist, « musical space »; search by voice; automatic music score]
![Page 7: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/7.jpg)
Acoustic scene and sound event recognition
Acoustic scene recognition:
• « associating a semantic label to an audio stream that identifies the environment in which it has been produced »
• Related to CASA (Computational Auditory Scene Analysis) and soundscape cognition (psychoacoustics)
D. Barchiesi, D. Giannoulis, D. Stowell and M. Plumbley, « Acoustic Scene Classification », IEEE Signal Processing Magazine [16], May 2015
[Figure: acoustic scene recognition system; candidate output labels: Subway? Restaurant?]
![Page 8: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/8.jpg)
Acoustic scene and sound event recognition
Sound event recognition
• “aims at transcribing an audio signal into a symbolic description of the corresponding sound events present in an auditory scene”.
[Figure: sound event recognition system producing a symbolic description (bird, car horn, coughing)]
![Page 9: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/9.jpg)
Applications of scene and event recognition
• Smart hearing aids (context recognition for adaptive hearing aids, robot audition…)
• Security
• Indexing
• Sound retrieval
• Predictive maintenance
• Bioacoustics
• Environment-robust speech recognition
• Elderly assistance
• …
![Page 10: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/10.jpg)
Classification systems
Several problems, a similar approach
• Speaker identification/recognition
• Automatic musical genre recognition
• Automatic music instruments recognition.
• Acoustic scene recognition
• Sound samples classification.
• Sound track labeling (speech, music, special effects etc…).
• Automatically generated Play list
• Hit predictor...
![Page 11: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/11.jpg)
Traditional Classification system
From G. Richard, S. Sundaram, S. Narayanan, “Perceptually-motivated audio indexing and
classification”, Proc. of the IEEE, 2013
![Page 12: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/12.jpg)
A little bit of signal processing
![Page 13: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/13.jpg)
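… A little bit of signal processing
Let x(t) be a continuous-time signal (e.g. captured by a microphone).
Let x(n) = x(nT) be the discrete-time signal obtained by sampling x(t) at times t = nT, where T is the sampling period.
[Figure: the continuous signal x(t) and its sampled version x(n) = x(nT), both plotted against time t, with sample spacing T]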
![Page 14: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/14.jpg)
Time-Frequency representation
Fourier Transform
[Figure: a signal x_n and the magnitude |X_k| of its Fourier transform]
![Page 15: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/15.jpg)
Spectral analysis of an audio signal (1) (drawing from J. Laroche)
[Figure: time-frequency plane; axes: time, frequency]
![Page 16: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/16.jpg)
Spectral analysis of an audio signal (2)
[Figure: a signal x_n, the magnitudes |X_k| of its spectra, and the resulting spectrogram]
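A spectrogram of this kind can be sketched in a few lines. This is only a minimal illustration, assuming a Hann window, a frame length of 512 samples and a hop of 256 samples (none of these values come from the slides): the signal is cut into overlapping frames and |X_k| is computed for each frame.

```python
import numpy as np

def spectrogram(x, frame_len=512, hop=256):
    """Magnitude STFT: rows are frequency bins k, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

fs = 8000                                  # assumed sampling rate (Hz)
t = np.arange(fs) / fs                     # one second of samples
x = np.sin(2 * np.pi * 1000 * t)           # 1 kHz test tone
S = spectrogram(x)
peak_bin = int(S.mean(axis=1).argmax())    # strongest frequency bin on average
peak_hz = peak_bin * fs / 512              # convert the bin index to Hz
```

For the test tone, the strongest bin should sit at the tone frequency, which is a quick sanity check of the time and frequency axes.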
![Page 17: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/17.jpg)
Audio signal representations
Example on a music signal: note C (262 Hz) produced by a piano and a violin.
[Figure panels: temporal signal; spectrogram]
From M. Mueller & al., « Signal Processing for Music Analysis », IEEE Journal of Selected Topics in Signal Processing, Oct. 2011
![Page 18: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/18.jpg)
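Z-transform / Discrete Fourier Transform
The Z-transform of a signal x(n) is given by:
$X(z) = \sum_{n=-\infty}^{+\infty} x(n)\, z^{-n}$, with $z$ a complex variable.
Link between the Z-transform and the DFT (for a signal of length N):
$X_k = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} = X(z)\big|_{z = e^{j 2\pi k / N}}, \quad k = 0, \dots, N-1$
• This corresponds to sampling the Z-transform at N points regularly spaced on the unit circle.
[Figure: unit circle in the complex plane, axes Re(z) and Im(z), with the point of index N/2 marked]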
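The link between the two transforms can be checked numerically. The sketch below evaluates the Z-transform of a short finite-length signal at N points of the unit circle and compares the result with NumPy's FFT; the helper name `z_transform` and the random test signal are illustrative choices, not part of the lecture.

```python
import numpy as np

def z_transform(x, z):
    """Evaluate X(z) = sum_n x(n) z^(-n) for a finite-length signal x."""
    n = np.arange(len(x))
    return np.sum(x * z ** (-n.astype(float)))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
N = len(x)
unit_circle = np.exp(2j * np.pi * np.arange(N) / N)    # z_k = e^{j 2 pi k / N}
X_from_z = np.array([z_transform(x, z) for z in unit_circle])
X_dft = np.fft.fft(x)                                  # DFT of the same signal
```

Both arrays should agree up to rounding error, which is exactly the statement that the DFT samples the Z-transform on the unit circle.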
![Page 19: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/19.jpg)
Gaël RICHARD – Master of Science - Filtering
Digital filtering
Linear shift-invariant system
[Figure: input sequence (excitation) x(nT) → system R[·] → output sequence y(nT)]
The filter is characterised by its impulse response, or equivalently by its transfer function.
y(nT) = R[x(nT)], where T is the sampling period.
Choosing T = 1, we have: y(n) = R[x(n)]
![Page 20: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/20.jpg)
Digital filtering
Linear constant-coefficient difference equations (a subclass of shift-invariant systems)
Causal recursive filters
Causal non-recursive filters
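The two filter families can be illustrated with a direct implementation of a difference equation. The sketch assumes the common convention y(n) = Σ_k b_k x(n−k) − Σ_{k≥1} a_k y(n−k) with a_0 = 1; the slides may use a different sign or indexing convention, and the coefficient values below are arbitrary examples.

```python
import numpy as np

def difference_equation(b, a, x):
    """y(n) = sum_k b[k] x(n-k) - sum_{k>=1} a[k] y(n-k), assuming a[0] = 1."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc
    return y

impulse = np.zeros(6)
impulse[0] = 1.0
# Non-recursive (FIR): the output depends only on present and past inputs.
fir = difference_equation([0.5, 0.5], [1.0], impulse)
# Recursive (IIR): the output also depends on past outputs, y(n) = x(n) + 0.5 y(n-1).
iir = difference_equation([1.0], [1.0, -0.5], impulse)
```

The non-recursive filter has a finite impulse response (two non-zero samples here), whereas the recursive one keeps decaying geometrically.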
![Page 21: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/21.jpg)
Digital filtering: convolution
Convolution represents the input-output transformation realised by a linear shift-invariant filter.
The impulse response is also the response to the
unit sample at n=k:
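As a small worked example, the convolution sum y(n) = Σ_k x(k) h(n−k) can be written out directly and compared with `numpy.convolve`. The signal and impulse response values are arbitrary illustrations.

```python
import numpy as np

def convolve_direct(x, h):
    """y(n) = sum_k x(k) h(n - k) for finite-length x and h."""
    y = np.zeros(len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y

x = np.array([1.0, 2.0, 3.0])      # input sequence
h = np.array([1.0, -1.0])          # impulse response (a first-order difference)
y = convolve_direct(x, h)
y_ref = np.convolve(x, h)          # library implementation, for comparison
```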
![Page 22: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/22.jpg)
A widely used model: the source-filter model
Source signal (vocal folds), X(f) → Filter: resonator (vocal tract), H(f) → Speech, Y(f)
![Page 23: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/23.jpg)
Some dimensions of the musical signal…
Pitch, harmony… Tempo, rhythm…
Timbre, instruments… Polyphony, melody…
![Page 25: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/25.jpg)
A quasi-periodic sound
[Figure: a piano sound (C3) with its period T0 marked, F0 = 1/T0; spectrum of a piano sound]
How can we estimate the pitch (height) of a note?
Equivalently, how can we estimate its fundamental period (T0) or fundamental frequency (F0)?
![Page 26: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/26.jpg)
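Signal Model
• $x(n) = \sum_{k=1}^{H} A_k \cos(2\pi k f_0 n + \phi_k) + w(n)$
• $f_0$ is the normalised fundamental frequency
• $H$ is the number of harmonics
• The amplitudes $\{A_k\}$ are real numbers > 0
• The phases $\{\phi_k\}$ are independent random variables, uniform on $[0, 2\pi[$
• $w$ is a centred white noise of variance $\sigma^2$, independent of the phases $\{\phi_k\}$
• $x(n)$ is a centred second-order process with autocovariance $r_x(m) = \sum_{k=1}^{H} \frac{A_k^2}{2} \cos(2\pi k f_0 m) + \sigma^2 \delta(m)$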
![Page 27: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/27.jpg)
Time domain methods
Autocovariance estimation (biased)
![Page 28: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/28.jpg)
Time domain methods
Autocorrelation
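A time-domain pitch estimator along these lines can be sketched as follows. It assumes the usual biased estimator r(m) = (1/N) Σ_{n=0}^{N−1−m} x(n) x(n+m) and simply picks the lag with the largest autocovariance inside a plausible pitch range; the search range (50 to 1000 Hz) and the synthetic test signal are illustrative assumptions.

```python
import numpy as np

def biased_autocovariance(x, max_lag):
    """r(m) = (1/N) sum_{n=0}^{N-1-m} x(n) x(n+m), for m = 0..max_lag."""
    x = x - x.mean()
    N = len(x)
    return np.array([np.dot(x[:N - m], x[m:]) / N for m in range(max_lag + 1)])

def estimate_f0(x, fs, f_min=50.0, f_max=1000.0):
    """Pick the lag with the largest autocovariance in [fs/f_max, fs/f_min]."""
    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    r = biased_autocovariance(x, lag_max)
    best_lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return fs / best_lag

fs = 16000
n = np.arange(4096)
f0 = 200.0
# Synthetic harmonic tone with five partials of decreasing amplitude.
x = sum(np.cos(2 * np.pi * k * f0 * n / fs) / k for k in range(1, 6))
f0_hat = estimate_f0(x, fs)
```

The biased estimator attenuates large lags by a factor (N−m)/N, which helps the first period win over its multiples; on real signals, octave errors remain a well-known weakness of this family of methods.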
![Page 29: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/29.jpg)
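Maximum likelihood approach
• Signal model: $x(n) = a(n) + w(n)$, $n = 0, \dots, N-1$
─ $a$ is a deterministic signal of period $T_0$
─ $w$ is white Gaussian noise of variance $\sigma^2$
• Observation likelihood: $p(x \mid a, \sigma^2, T_0) = (2\pi\sigma^2)^{-N/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x(n) - a(n))^2\right)$
• Log-likelihood: $L(a, \sigma^2, T_0) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x(n) - a(n))^2$
• Method: maximise $L$ successively with respect to $a$, then $\sigma^2$, and then $T_0$.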
![Page 30: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/30.jpg)
Maximum likelihood approach
• It can be shown that maximising $L$ with respect to $T_0$ is equivalent to maximising the spectral sum
![Page 31: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/31.jpg)
Spectral product
• By analogy to spectral sum (often more robust)
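Both criteria can be sketched on a synthetic tone. The code assumes the usual forms S(f0) = Σ_{k=1}^{H} |X(k f0)|² for the spectral sum and P(f0) = Π_{k=1}^{H} |X(k f0)|² for the spectral product (computed here as a sum of logarithms); the number of harmonics H, the candidate grid and the zero-padded FFT size are illustrative choices, not values from the lecture.

```python
import numpy as np

def harmonic_scores(x, fs, candidates, H=5, n_fft=1 << 16):
    """Return (spectral sum, log of spectral product) for each candidate f0."""
    X2 = np.abs(np.fft.rfft(x * np.hanning(len(x)), n_fft)) ** 2
    sums, log_products = [], []
    for f0 in candidates:
        # FFT bins closest to the first H multiples of the candidate f0.
        bins = np.round(np.arange(1, H + 1) * f0 * n_fft / fs).astype(int)
        values = X2[bins]
        sums.append(values.sum())
        log_products.append(np.log(values + 1e-12).sum())
    return np.array(sums), np.array(log_products)

fs = 16000
n = np.arange(4096)
x = sum(np.cos(2 * np.pi * k * 220.0 * n / fs) / k for k in range(1, 6))
candidates = np.arange(100.0, 500.0, 1.0)      # candidate f0 values in Hz
S, logP = harmonic_scores(x, fs, candidates)
f0_sum = candidates[S.argmax()]
f0_prod = candidates[logP.argmax()]
```

The product penalises any candidate for which even one harmonic is missing, which is one intuitive reason why it tends to reject sub-harmonic candidates more firmly than the sum.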
![Page 32: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/32.jpg)
Pitch Features
![Page 33: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/33.jpg)
Pitch Features
Model assumption: Equal-tempered scale
MIDI pitches:
Piano notes:
Concert pitch:
Center frequency:
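The equal-tempered model can be made concrete with a small conversion helper. It assumes the standard MIDI convention in which pitch p = 69 is A4 at the 440 Hz concert pitch, so that the centre frequency of pitch p is 440 · 2^((p − 69)/12) Hz; the function names are illustrative.

```python
import math

def midi_to_hz(p, a4_hz=440.0):
    """Centre frequency of MIDI pitch p, assuming p = 69 is A4 = a4_hz."""
    return a4_hz * 2.0 ** ((p - 69) / 12.0)

def hz_to_midi(f, a4_hz=440.0):
    """Fractional MIDI pitch of frequency f (inverse of midi_to_hz)."""
    return 69 + 12 * math.log2(f / a4_hz)

a3 = midi_to_hz(57)          # one octave below A4
c4 = midi_to_hz(60)          # middle C under this convention
nearest = round(hz_to_midi(262.0))
```

Under this convention the C at about 262 Hz used in the earlier piano and violin example corresponds to MIDI pitch 60.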
![Page 34: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/34.jpg)
Pitch Features
Logarithmic frequency distribution
Octave: doubling of frequency
A2: 110 Hz, A3: 220 Hz, A4: 440 Hz
![Page 35: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/35.jpg)
Towards a more specific representation
Idea: Binning of Fourier coefficients
• Divide up the frequency axis into logarithmically spaced
“pitch regions”
• …and combine spectral coefficients (e.g. ) of each
region to form a single pitch coefficient.
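A minimal sketch of this binning is given below. It assigns every DFT bin to the nearest MIDI pitch (assuming A4 = MIDI 69 = 440 Hz) and sums the squared magnitudes within each pitch region; the choice of squared magnitudes, the pitch range 21 to 108 and the test tone are assumptions made for the illustration.

```python
import numpy as np

def pitch_binning(mag, fs, n_fft, p_min=21, p_max=108):
    """Sum squared DFT magnitudes whose centre frequency falls in each pitch band."""
    freqs = np.arange(len(mag)) * fs / n_fft
    out = np.zeros(p_max - p_min + 1)
    valid = freqs > 0                                   # skip the DC bin
    pitches = np.round(69 + 12 * np.log2(freqs[valid] / 440.0)).astype(int)
    energy = mag[valid] ** 2
    for p, e in zip(pitches, energy):
        if p_min <= p <= p_max:
            out[p - p_min] += e
    return out

fs, n_fft = 22050, 8192
n = np.arange(n_fft)
x = np.sin(2 * np.pi * 440.0 * n / fs)                  # pure A4 test tone
mag = np.abs(np.fft.rfft(x * np.hanning(n_fft)))
pitch_energy = pitch_binning(mag, fs, n_fft)
strongest_pitch = 21 + int(pitch_energy.argmax())
```

With a fixed-length DFT the low pitch regions contain very few bins (sometimes none), which is the limitation that motivates the constant-Q approach.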
![Page 36: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/36.jpg)
Towards a more specific representation
Towards a Constant-Q time-frequency transform:
[Figure panels: windowing in the time domain; windowing in the frequency domain]
![Page 37: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/37.jpg)
Towards a more specific representation
From M. Mueller & al., « Signal Processing for Music Analysis », IEEE Journal of Selected Topics in Signal Processing, Oct. 2011
![Page 38: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/38.jpg)
Towards a more specific representation
In practice:
• This solution is only partially satisfactory
More appropriate solution: use temporal windows of a different size for each frequency bin k’
[Figure: temporal windows of different lengths for bins k1’, k2’, …, kN’]
J. Brown and M. Puckette, An efficient algorithm for the calculation of a constant Q transform, JASA, 92(5):2698–2701, 1992.
J. Prado, Une inversion simple de la transformée à Q constant, technical report, 2011, (in French)
http://www.tsi.telecom-paristech.fr/aao/en/2011/06/06/inversible-cqt/
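The idea of one window length per frequency bin can be sketched numerically. The code assumes the usual constant-Q definition, Q = 1/(2^(1/b) − 1) for b bins per octave and a window of about Q·fs/f_k samples for the bin centred at f_k; the minimum frequency, the number of bins and the sampling rate are illustrative values.

```python
import numpy as np

def cqt_window_lengths(f_min, n_bins, fs, bins_per_octave=12):
    """Quality factor, centre frequencies and window lengths of a constant-Q analysis."""
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    freqs = f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    lengths = np.ceil(Q * fs / freqs).astype(int)      # longer windows for low bins
    return Q, freqs, lengths

Q, freqs, lengths = cqt_window_lengths(f_min=55.0, n_bins=48, fs=22050)
```

Each octave halves the window length, so low-pitched bins obtain fine frequency resolution while high-pitched bins obtain fine time resolution.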
![Page 39: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/39.jpg)
Towards a more specific representation
Example: Chromatic scale (Credit M. Mueller)
[Figure: spectrogram; axes: time (seconds), frequency (Hz); colour scale: intensity (dB)]
![Page 40: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/40.jpg)
Towards a more specific representation
Example: Chromatic scale
[Figure: log-frequency spectrogram; axes: time (seconds), MIDI pitch; colour scale: intensity (dB)]
![Page 41: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/41.jpg)
Some dimensions of the musical signal…
Pitch, harmony… Tempo, rhythm…
Timbre, instruments… Polyphony, melody…
![Page 42: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/42.jpg)
Harmony: the chroma features
Pitches are perceived as related (or harmonically similar) if they differ by an octave (the notes have the same name).
Idea: build features which gather this “similar” information.
We consider the 12 traditional notes of the tempered scale.
For a given note, the chroma is obtained by adding up the contributions of all its octaves.
The result is a vector of dimension 12 (the “chromas”).
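The octave folding can be sketched as follows, assuming the input is an energy vector indexed by MIDI pitch (with C4 = 60, so that pitch modulo 12 equals 0 for every C). The input values are arbitrary illustrations.

```python
import numpy as np

def pitch_to_chroma(pitch_energy, p_min=0):
    """Fold a pitch-indexed energy vector into 12 chroma bins (C = 0, ..., B = 11)."""
    chroma = np.zeros(12)
    for i, e in enumerate(pitch_energy):
        chroma[(p_min + i) % 12] += e
    return chroma

pitch_energy = np.zeros(128)
pitch_energy[[36, 48, 60]] = [1.0, 2.0, 3.0]    # C2, C3, C4
pitch_energy[61] = 0.5                           # C#4
chroma = pitch_to_chroma(pitch_energy)
chroma_unit = chroma / np.linalg.norm(chroma)    # Euclidean normalisation
```

All three C notes end up in the same chroma bin, and the Euclidean normalisation makes the vector independent of the overall signal level.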
![Page 43: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/43.jpg)
Chroma Features
![Page 44: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/44.jpg)
Chroma Features
C2 C3 C4
Chroma C
![Page 45: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/45.jpg)
Chroma Features
C#2 C#3 C#4
Chroma C#
![Page 46: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/46.jpg)
Chroma Features
D2 D3 D4
Chroma D
![Page 47: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/47.jpg)
Chroma Features
Shepard's helix of pitch perception; chromatic circle
http://en.wikipedia.org/wiki/Pitch_class_space
[Bartsch/Wakefield, IEEE-TMM 2005] [Gómez, PhD 2006]
![Page 48: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/48.jpg)
Chroma Features
Example: Chromatic scale
[Figure: log-frequency spectrogram; axes: time (seconds), MIDI pitch; colour scale: intensity (dB)]
![Page 49: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/49.jpg)
Chroma Features
Example: Chromatic scale
[Figure: chroma representation; axes: time (seconds), chroma; colour scale: intensity (dB)]
![Page 50: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/50.jpg)
Chroma Features
Example: Chromatic scale
[Figure: chroma representation (normalized, Euclidean); axes: time (seconds), chroma; colour scale: intensity (normalized)]
![Page 51: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/51.jpg)
Chroma Features
Example: Friedrich Burgmüller, Op. 100, No. 2
[Figure: log-frequency spectrogram; axes: time (seconds), MIDI pitch; colour scale: intensity (dB)]
![Page 52: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/52.jpg)
Chroma Features
Example: Friedrich Burgmüller, Op. 100, No. 2
[Figure: chroma representation; axes: time (seconds), chroma; colour scale: intensity (dB)]
![Page 53: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/53.jpg)
Application to chord recognition…
Using theoretical chroma templates
• Examples of two chroma templates, without and with higher harmonics
[Figure panels: C major (1 harmonic); C major (6 harmonics)]
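Template matching for chords can be sketched as follows. This is a deliberately simple illustration and not the method of any cited work: it uses binary templates containing only the fundamentals of each triad (the "1 harmonic" case), restricts the vocabulary to the 24 major and minor triads, and scores each template by its inner product with the normalised chroma vector. The chord labels such as `C:maj` are a hypothetical naming scheme.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chord_templates():
    """Binary templates (fundamentals only) for the 12 major and 12 minor triads."""
    templates = {}
    for root in range(12):
        for name, third in (("maj", 4), ("min", 3)):
            t = np.zeros(12)
            t[[root, (root + third) % 12, (root + 7) % 12]] = 1.0
            templates[f"{NOTE_NAMES[root]}:{name}"] = t / np.linalg.norm(t)
    return templates

def recognise_chord(chroma):
    """Return the label of the template with the largest inner product."""
    chroma = chroma / (np.linalg.norm(chroma) + 1e-12)
    templates = chord_templates()
    return max(templates, key=lambda name: float(templates[name] @ chroma))

chroma = np.zeros(12)
chroma[[0, 4, 7]] = [1.0, 0.8, 0.9]     # energy on C, E and G
chroma[9] = 0.1                          # a little energy on A
label = recognise_chord(chroma)
```

Templates that also include the expected energy of higher harmonics (the "6 harmonics" case) could be obtained by adding weighted contributions at the chroma of each partial.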
![Page 54: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/54.jpg)
Application to chord recognition…
Chord and/or tonality recognition…
• Other applications:
─ Audio/audio or audio/score alignment
─ Audio fingerprinting…
From L. Oudre, PhD thesis, Telecom ParisTech, 2010
![Page 55: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/55.jpg)
Some dimensions of the musical signal…
Pitch, harmony… Tempo, rhythm…
Timbre, instruments… Polyphony, melody…
![Page 56: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/56.jpg)
Interest of rhythmic information
Rhythm is an essential component of the musical signal
Numerous applications:
• Automatic mixing, DJing: synchronisation of tempo, rhythm…
• Smart karaoke
• Automatic playlists (podcasts…)
• Genre recognition
• Music/video synchronisation
• « Smart jogging shoes »?
• …
![Page 57: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/57.jpg)
Rhythm or tempo extraction
Rhythm: a musical concept that is intuitively simple to understand but difficult to define!
Handel (1989): « The experience of rhythm involves movement, regularity, grouping and yet accentuation and differentiation »
The rhythm of a signal being listened to does not necessarily have a unique interpretation!
The pulse (or « beat ») is frequently defined
![Page 58: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/58.jpg)
Rhythm or “Tempo” Extraction
Principle
Rhythmic description chain: musical events detection → periodicity estimation → periodicity tracking → metrical level selection
Approaches shown in the diagram, with representative references:
• Filterbanks: Scheirer98, Alonso07
• Low-level features: Sethares04, Gouyon05
• Temporal methods: Seppanen01, Foote01
• Frequency methods: Gouyon05, Peeters05
• Network of oscillators: Scheirer98, Klapuri04
• Probabilistic methods: Laroche01, Sethares05
• Probabilistic: Hainsworth03, Sethares05
• Deterministic: Laroche03, Collins05, Alonso07
• Agents/histograms: Dixon01, Eck05
![Page 59: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/59.jpg)
Discovering the rhythmic information…
Use of filterbanks (e.g. separating the frequency information…)
[Figure: musical signal in different bands (Fs = 16 kHz)]
• Band 1 (0 – 500 Hz)
• Band 2 (500 – 1000 Hz)
• Band 3 (1000 – 1500 Hz)
• Band 4 (1500 – 2000 Hz)
• Band 5 (2000 – 2500 Hz)
• Band 6 (2500 – 3000 Hz)
• Band 7 (3000 – 3500 Hz)
• Bands 8-16 (3500 – 8000 Hz)
![Page 60: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/60.jpg)
Rhythm or “Tempo” Extraction
Processing chain: musical events detection → periodicity estimation → periodicity tracking → metrical level selection
[Figure panels: signal + onsets; « detection function »; autocorrelation; periodicity tracking (« tempogram »); metrical level selection → tempo]
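The first two stages of this chain can be sketched on a synthetic click track. This is a simplified illustration, not the method of any of the cited papers: the detection function is a half-wave rectified spectral difference, periodicity is estimated from its autocorrelation, and the metrical level is chosen crudely by restricting the search to 60–200 BPM. The frame size, hop size and sampling rate are assumptions.

```python
import numpy as np

def detection_function(x, frame_len=1024, hop=512):
    """Half-wave rectified frame-to-frame increase of the spectral magnitude."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    mags = np.array([np.abs(np.fft.rfft(x[i * hop:i * hop + frame_len] * window))
                     for i in range(n_frames)])
    return np.maximum(np.diff(mags, axis=0), 0.0).sum(axis=1)

def estimate_tempo(x, fs, hop=512, bpm_min=60.0, bpm_max=200.0):
    """Tempo (BPM) from the strongest autocorrelation lag of the detection function."""
    d = detection_function(x, hop=hop)
    d = d - d.mean()
    r = np.correlate(d, d, mode="full")[len(d) - 1:]   # autocorrelation, lags >= 0
    frame_rate = fs / hop
    lag_min = int(np.floor(60.0 * frame_rate / bpm_max))
    lag_max = int(np.ceil(60.0 * frame_rate / bpm_min))
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return 60.0 * frame_rate / lag

fs = 16384                                     # assumed rate, so that 0.5 s = 16 frames
click = np.random.default_rng(0).standard_normal(256)
x = np.zeros(10 * fs)
for start in range(0, len(x) - 256, fs // 2):  # one click every 0.5 s, i.e. 120 BPM
    x[start:start + 256] = click
tempo = estimate_tempo(x, fs)
```

The autocorrelation also peaks at multiples of the beat period (60 BPM here), which is why a metrical level selection stage is needed in a complete system.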
![Page 61: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/61.jpg)
Tempo and beat extraction
A filterbank approach (Scheirer, 1998)
![Page 62: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/62.jpg)
Discovering the rhythmic information…
Harmonic + noise decomposition
• Original = sinusoidal component + noise
Examples given by R. Badeau
![Page 63: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/63.jpg)
Improved tempo extraction (from Alonso et al.)
Possible improvements:
• Exploiting the fact that rhythm is mainly carried by onsets
• Using a harmonic + noise decomposition
Demo: tempo tracking (video_tracking.mp4)
![Page 64: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/64.jpg)
Rhythm and tempo estimation: a feature of great interest
Rhythm: useful information for style/genre classification or swing estimation.
[Figure: histogram of onset positions on a techno-style music signal (Laroche2001)]
![Page 65: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/65.jpg)
Rhythm and tempo estimation: a feature of great interest
Audio-based video retrieval
Exploit semantic correlations between audio and video
Application: search for audio that « fits » the video stream
O. Gillet, S. Essid and G. Richard, On the Correlation of Audio and Visual Segmentations of Music Videos, IEEE Transactions on Circuits and Systems for Video Technology, 17 (2), March 2007, pp 347-355.
![Page 66: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/66.jpg)
Current trends…
Estimate rhythm (tatum, tempo) but also the downbeat (a higher-level semantic concept)
Exploit machine learning (and deep learning in particular)
Use and combine multiple representations
• Rhythm is intrinsically multi-dimensional
• The downbeat depends on melody, chords, bass, etc.
![Page 67: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/67.jpg)
Downbeat estimation (Durand & al. 2017)
![Page 68: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/68.jpg)
Downbeat estimation (Durand & al. 2017)
S Durand & al., "Robust Downbeat Tracking Using an Ensemble of Convolutional Networks", IEEE/ACM
Transactions on Audio, Speech, and Language Processing, Vol 25, N°1, 2017
![Page 69: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/69.jpg)
Downbeat estimation: demo
Examples at the output of each network
• https://simondurand.github.io/dnn_audio.html
Video example
• directory: Démos
Other audio examples
• JBB (Downbeat)
• JBB (Tatum)
• Example (Downbeat)
• Example (Tatum)
![Page 70: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/70.jpg)
Some dimensions of the musical signal…
Pitch, harmony… Tempo, rhythm…
Timbre, instruments… Polyphony, melody…
![Page 71: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/71.jpg)
Traditional Classification system
From G. Richard, S. Sundaram, S. Narayanan, “Perceptually-motivated audio indexing and
classification”, Proc. of the IEEE, 2013
![Page 72: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/72.jpg)
Timbre: what is it?
A possible definition: « The attribute of auditory perception that allows two sounds of equal pitch and equal intensity to be differentiated. »
Closely related to sound source identification and auditory organization
Examples of sounds with the same pitch and root-mean-square (RMS) levels, but different timbre:
Recent PhD theses addressing musical instrument recognition: [Essid06], [Kitahara-07], [Eronen-09]
![Page 73: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/73.jpg)
“Polyphonic” timbre
Refers to the overall timbral mixture, the “global sound”, of a piece of music [Alluri-10]
Mainly affected by instrumentation
Example:
• “Bohemian Rhapsody” by Queen
• “Bohemian Rhapsody” by the London Symphony Orchestra
Closely related to genre classification [Scaringella-06] or music tagging [Turnbull-08]
![Page 74: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/74.jpg)
Facets of timbre
Timbre is a multidimensional concept
Several parameters, of both spectral and temporal kinds
Schouten’s [1968] list of five major parameters of timbre:
1. range between tonal and noise-like character
2. spectral envelope
3. time envelope
4. changes of spectral envelope and pitch
5. onset differing notably from the sustained part
![Page 75: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/75.jpg)
Perceptual dimensions of timbre
Some important acoustic parameters (Krumhansl-89, McAdams-95, Peeters2004)
• The spectral centroid (SC)
─ High SC: bright sound
─ Low SC: warm, round sound
• The spectral flux (« temporal variation of the spectral content »)
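These two descriptors can be sketched as follows. Definitions vary between authors, so this is only one plausible choice: the centroid is taken as the magnitude-weighted mean frequency, and the flux as the squared Euclidean distance between successive unit-norm magnitude spectra. The test tones are illustrative.

```python
import numpy as np

def spectral_centroid(mag, fs, n_fft):
    """Magnitude-weighted mean frequency (Hz) of one spectrum."""
    freqs = np.arange(len(mag)) * fs / n_fft
    return float((freqs * mag).sum() / (mag.sum() + 1e-12))

def spectral_flux(mag_prev, mag_cur):
    """Squared distance between two successive unit-norm magnitude spectra."""
    a = mag_prev / (np.linalg.norm(mag_prev) + 1e-12)
    b = mag_cur / (np.linalg.norm(mag_cur) + 1e-12)
    return float(np.sum((b - a) ** 2))

fs, n_fft = 16000, 1024
n = np.arange(n_fft)
w = np.hanning(n_fft)
low = np.abs(np.fft.rfft(np.sin(2 * np.pi * 500 * n / fs) * w))    # "warm" tone
high = np.abs(np.fft.rfft(np.sin(2 * np.pi * 4000 * n / fs) * w))  # "bright" tone
centroid_low = spectral_centroid(low, fs, n_fft)
centroid_high = spectral_centroid(high, fs, n_fft)
flux_same = spectral_flux(low, low)
flux_change = spectral_flux(low, high)
```

The brighter tone yields the higher centroid, and the flux is zero when the spectrum does not change and large when the spectral content moves to a different region.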
![Page 76: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/76.jpg)
Perceptual dimensions of timbre
• Some important acoustic parameters (continued)
─ Attack time: “log attack time”: log(tmax – tthresh)
─ Spectral irregularity (mean difference between the amplitudes of adjacent partials)
• On the perception of polyphonic timbre
─ Few studies [Cogan-84, Kendall-91, Alluri-10]
─ Related to spectro-temporal modulations [Alluri-10]
![Page 77: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/77.jpg)
Timbre of musical instruments
Timbre « space »
![Page 78: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/78.jpg)
Timbre of musical instruments
3D space (Krumhansl, 1989, McAdams, 1992)
BSN = bassoon
CNT = clarinet
GTR = guitar
HRN = French horn
HRP = harp
TPT = trumpet
PNO = piano
VBS = vibraphone
![Page 79: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/79.jpg)
Parameterisation
Benefits of a filterbank analysis
• Separates information that is localised in frequency
• Reduces complexity (sub-sampling in each band)
• Special case: FFT
• Makes it possible to use « perceptual » frequency scales
─ Mel scale: an approximation of the psychological sensation of the pitch of a sound
![Page 80: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/80.jpg)
Time-varying spectral envelope
As a first approximation, let us assume
“timbre ≈ levels at critical bands as a function of time”
(this stems from ASR, and is not entirely satisfactory in music)
[Figure: flute (left) and violin (right) spectrograms; axes: time (s), frequency (CB, CB = critical band), magnitude (dB); credit A. Klapuri]
![Page 81: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/81.jpg)
Filter banks distributed on a Mel scale
Mel scale filtering (from Rabiner93)
[Figure: Mel-spaced filter bank; the energy in each band gives S_1, …, S_j, …, S_N]
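A Mel filter bank of this kind can be sketched as follows. The code assumes one common analytic form of the Mel scale, mel(f) = 2595 · log10(1 + f/700), and triangular filters whose edges are equally spaced in Mel; other Mel formulas and filter shapes are in use, and the number of filters (24) and FFT size are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centres equally spaced on the Mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    bank = np.zeros((n_filters, len(freqs)))
    for j in range(n_filters):
        lo, mid, hi = edges_hz[j], edges_hz[j + 1], edges_hz[j + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[j] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

fs, n_fft = 16000, 512
bank = mel_filterbank(24, n_fft, fs)
frame = np.random.default_rng(0).standard_normal(n_fft)      # stand-in signal frame
power = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
S = bank @ power          # S[j]: energy in Mel band j
```

The filters are narrow at low frequencies and progressively wider at high frequencies, mirroring the decreasing frequency resolution of hearing.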
![Page 82: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/82.jpg)
Cepstral representation
Interest
• Source/filter model of speech production
Source-filter model in the cepstral domain
Real cepstrum: a sum of two almost non-overlapping terms
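Why the real cepstrum is a sum of two terms can be seen with a short derivation. It assumes the signal x(n) is the convolution of a source (excitation) g(n) with a filter (vocal tract) h(n); the symbols are chosen here for illustration and are not taken from the slide.

```latex
\begin{align*}
x(n) &= g(n) * h(n) \\
|X(f)| &= |G(f)|\,|H(f)| \\
\log|X(f)| &= \log|G(f)| + \log|H(f)| \\
c_x(\tau) &= \mathcal{F}^{-1}\{\log|X(f)|\} = c_g(\tau) + c_h(\tau)
\end{align*}
```

For voiced speech the filter term c_h is concentrated at low quefrencies (smooth spectral envelope), while the source term c_g shows peaks around the pitch period and its multiples, which is why the two terms barely overlap.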
![Page 83: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/83.jpg)
Cepstral Representation (from Furui2001)
Examples:
• of Spectrum (left)
• of Cepstrum c(τ) (right)
τ has the dimension of time
and is called the quefrency
![Page 84: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/84.jpg)
Cepstral Representation
Separation of the vocal tract contribution and of the source
contribution by liftering
![Page 85: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/85.jpg)
MFCC « Mel-Frequency Cepstral Coefficients »
The most common features (from Furui, 2001)
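As a rough sketch of the last stage of an MFCC front end: the coefficients are commonly obtained by taking a discrete cosine transform (DCT-II) of the log energies in the mel bands. The function name, the number of coefficients (13) and the small floor added before the logarithm are illustrative assumptions.

```python
import numpy as np

def mfcc_from_band_energies(S, n_coeffs=13, floor=1e-10):
    """Unnormalised DCT-II of the log mel-band energies S (shape: n_bands,)."""
    S = np.asarray(S, dtype=float)
    n_bands = S.size
    log_S = np.log(S + floor)                         # floor avoids log(0)
    j = np.arange(n_bands)                            # band index
    k = np.arange(n_coeffs)[:, None]                  # coefficient index
    basis = np.cos(np.pi * k * (j + 0.5) / n_bands)   # DCT-II basis
    return basis @ log_S

# Flat log-spectrum (all log energies equal to 1): only c0 is non-zero
flat = mfcc_from_band_energies(np.full(24, np.e))
# Energy decreasing with band index (spectral tilt): c1 becomes positive
tilted = mfcc_from_band_energies(np.exp(-0.2 * np.arange(24)))
```

The low-order coefficients summarise the coarse shape of the spectral envelope, which is why a small number of them is usually kept as the feature vector.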
![Page 86: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/86.jpg)
Cepstral smoothing
Envelope estimation by cepstrum:
• Compute the real cepstrum Cn, then apply low-quefrency liftering
• (log) Spectral envelope reconstruction E = FFT(Cn)
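The two steps above can be sketched as follows, assuming a rectangular low-quefrency lifter. The function name and the cutoff `n_keep` are hypothetical choices for illustration; the test signal is synthetic.

```python
import numpy as np

def cepstral_envelope(frame, n_keep=30, floor=1e-12):
    """Smoothed log-magnitude spectrum of one frame via low-quefrency liftering."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + floor)   # log |X(f)|
    cep = np.fft.ifft(log_mag).real                       # real cepstrum C_n
    lifter = np.zeros_like(cep)
    lifter[:n_keep] = 1.0                                  # keep low quefrencies
    if n_keep > 1:
        lifter[-(n_keep - 1):] = 1.0                       # and their symmetric part
    envelope = np.fft.fft(cep * lifter).real               # E = FFT(C_n), log domain
    return log_mag, envelope

# Synthetic harmonic sound: F0 = 200 Hz, i.e. a pitch period of 80 samples at 16 kHz
sr, n = 16000, 1024
t = np.arange(n) / sr
frame = np.hanning(n) * sum(np.cos(2 * np.pi * 200.0 * h * t) / h for h in range(1, 20))
log_mag, env = cepstral_envelope(frame)
```

With `n_keep` below the pitch period (80 samples here), the harmonic ripple is removed and only the slowly varying envelope remains.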
![Page 87: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/87.jpg)
Some other parameters….
Warped Linear prediction Cepstral coefficients
Onset Spectral « Asynchrony »
Wavelet coefficients
Harmonic / noise separation
Entropy,
Entropy variation,
….
No real consensus on the most
appropriate feature set even for a specific
audio transcription task.
![Page 88: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/88.jpg)
Classification, with the example of “automatic musical instrument recognition”
Aim of classification:
• Find the class (i.e., the instrument) from the features computed on
the music signal
![Page 89: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/89.jpg)
Some of the most common classification schemes used in audio classification
K-nearest neighbors (for simple problems)
Gaussian Mixture Models (GMM)
Support Vector machines
Linear Regression
Decision tree, Random forest
…
And more recently Deep neural networks
• Recurrent Neural networks (RNN) , Gated Recurrent Units (GRU)
• Convolutional Neural Networks (CNN applied on spectrograms)
• Long Short-Term Memory (LSTM)
• Generative Adversarial Networks (GANs)
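To make the classification step concrete, here is a minimal sketch of the simplest scheme in the list above, K-nearest neighbors, applied to instrument labels. The feature values are made-up numbers standing in for real features (for example averaged MFCCs), and the function name is hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Label of feature vector x by majority vote among its k nearest training vectors."""
    diffs = np.asarray(train_X, dtype=float) - np.asarray(x, dtype=float)
    dists = np.linalg.norm(diffs, axis=1)              # Euclidean distances
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D features for two instrument classes (illustrative values only)
train_X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [2.0, 2.1], [2.2, 1.9], [1.9, 2.3]]
train_y = ["flute", "flute", "flute", "violin", "violin", "violin"]
label = knn_predict(train_X, train_y, [1.8, 2.0], k=3)
```

The other schemes in the list replace the distance-and-vote rule with a trained model, but the interface is the same: a feature vector goes in, a class label comes out.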
![Page 90: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/90.jpg)
A typical recent example in Audio scene and
event recognition
Acoustic scene recognition vs Acoustic event recognition
![Page 91: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/91.jpg)
Recent approaches for Audio scene and event
recognition
![Page 92: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/92.jpg)
A recent framework for Audio scene and event
recognition (Bisot & al. 2017)
![Page 93: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/93.jpg)
Use of unsupervised decomposition methods (for example Non-negative Matrix Factorization, or NMF)
Principle of NMF :
Why NMF ?
Image from R. Hennequin
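Since the figure illustrating the principle is not reproduced here, a minimal sketch may help: NMF approximates a non-negative matrix V (for example a magnitude spectrogram, frequency × time) by a product W·H of a dictionary of spectral templates W and their time activations H. The sketch below assumes the Euclidean cost and the classical multiplicative updates of Lee and Seung; other costs and algorithms are used in practice, and the toy matrix is invented for illustration.

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Approximate V (F x T, non-negative) by W @ H with W (F x rank) and H (rank x T) >= 0."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the Euclidean cost ||V - WH||^2
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram": two spectral templates active at different times
W_true = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.8]])
H_true = np.array([[1.0, 1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0, 0.0]])
V = W_true @ H_true
W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Because the updates are multiplicative, W and H stay non-negative throughout, which gives a parts-based, additive description of the spectrogram.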
![Page 94: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/94.jpg)
Example for scene classification
![Page 95: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/95.jpg)
Unsupervised NMF for acoustic scene
recognition
![Page 96: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/96.jpg)
Unsupervised NMF for acoustic scene
recognition
![Page 97: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/97.jpg)
Example with DNN: acoustic scene recognition
V. Bisot & al., "Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification", IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.
V. Bisot & al., "Leveraging deep neural networks with nonnegative representations for improved environmental sound classification", IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Sep. 2017, Tokyo.
![Page 98: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/98.jpg)
Typical performances of Acoustic scene
recognition (challenge DCASE 2016)
A Mesaros & al. Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 challenge IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2), 379-393
![Page 99: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/99.jpg)
Some dimensions of the musical signal …
Pitch, Harmony.. Tempo, rhythm,…
Timbre, instruments,… Polyphony, melody, ….
![Page 100: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/100.jpg)
How to analyse polyphonic signals ?
Process the signal globally
• Recognize the polyphonic timbre “violin+ cello”
…or exploit more or less sophisticated source
separation principles
• E.g., filterbank used in tempo estimation…
![Page 101: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/101.jpg)
Process the signal globally
An example with chord recognition
─ Use of « global templates »
C major (1 harmonic) C major (6 harmonics)
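A minimal sketch of chord recognition with global templates, assuming a 12-bin chroma vector as input and binary triad templates (the "1 harmonic" case, where only the chord notes themselves are marked). The chroma values and function names are illustrative assumptions; templates with more harmonics would add weighted bins for the partials of each note.

```python
import numpy as np

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chord_templates():
    """Unit-norm binary chroma templates for the 24 major and minor triads."""
    templates = {}
    for root in range(12):
        for suffix, intervals in (("", (0, 4, 7)), ("m", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            templates[NOTES[root] + suffix] = t / np.linalg.norm(t)
    return templates

def recognize_chord(chroma, templates):
    """Name of the template with the highest cosine similarity to the chroma vector."""
    c = np.asarray(chroma, dtype=float)
    c = c / (np.linalg.norm(c) + 1e-12)
    return max(templates, key=lambda name: float(templates[name] @ c))

# Chroma with energy mainly on C, E and G, plus a little energy elsewhere
chroma = np.full(12, 0.05)
chroma[[0, 4, 7]] = [1.0, 0.8, 0.9]
chord = recognize_chord(chroma, chord_templates())
```

The signal is never separated into notes here: the whole chroma frame is compared with each chord template, which is what "processing the signal globally" means.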
![Page 102: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/102.jpg)
Process the signal globally
Process globally
• Use mixtures of classes (for instance class « violin+cello »,…)
• For more instruments, possibility to use hierarchical approaches…..
• Or even to automatically learn intermediate classes …
Preprocessing
An example with music instrument recognition in polyphonic
music
![Page 103: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/103.jpg)
Process the signal globally
Examples of classes for jazz quartets
• Bs = bass; Dr = drums; Pn = piano; Eg = electric guitar; Tr = trumpet; V = singing voice; ….
S. Essid, G. Richard, B. David. Instrument recognition in polyphonic music based on
automatic taxonomies. IEEE Trans. on Audio, Speech, and Language Proc. 14 (2006), no. 1
![Page 104: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/104.jpg)
Exploit source separation principles
An example with polyphonic transcription…
• First, detect the most prominent note …
• Subtract this note from the polyphony
• Then, detect the next most prominent note
• Subtract this note from the polyphony
• Etc… until all notes are found
• Approaches followed for example in:
─ Anssi P. Klapuri, “Multiple Fundamental Frequency Estimation Based on Harmonicity and Spectral Smoothness”, IEEE Trans. on Speech and Audio Proc., 11(6), 2003
─ Anssi P. Klapuri, “Multipitch Analysis of Polyphonic Music and Speech Signals Using an Auditory Model”, IEEE Trans. on ASLP, Feb. 2008
![Page 105: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/105.jpg)
Iterative multipitch estimation
Chord of two synthetic notes C – F#
Detect the most prominent note (in red)
Subtract the detected note
Detect the next most prominent note
There are no more notes: the chord C – F# is recognized
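The detect / subtract / repeat loop can be illustrated with a deliberately simplified toy. It is not Klapuri's algorithm: salience is just the sum of spectral magnitudes at the harmonic positions of each candidate F0, and "subtraction" zeroes those bins. The candidate list, the integer-Hz note frequencies (chosen so that harmonics fall exactly on FFT bins) and all parameter values are assumptions made for the example.

```python
import numpy as np

def iterative_multipitch(x, sr, candidates, n_harm=5, max_notes=4, rel_thresh=0.1):
    """Detect notes one by one: pick the most salient F0, cancel its harmonics, repeat."""
    mag = np.abs(np.fft.rfft(x))
    n = len(x)

    def harmonic_bins(f0):
        bins = np.round(np.arange(1, n_harm + 1) * f0 * n / sr).astype(int)
        return bins[bins < mag.size]

    found, first_salience = [], None
    for _ in range(max_notes):
        saliences = [mag[harmonic_bins(f0)].sum() for f0 in candidates]
        best = int(np.argmax(saliences))
        if first_salience is None:
            first_salience = saliences[best]
        if saliences[best] <= rel_thresh * first_salience:
            break                                   # nothing prominent is left
        found.append(candidates[best])
        mag[harmonic_bins(candidates[best])] = 0.0  # "subtract" the detected note
    return found

sr = 8000
t = np.arange(sr) / sr                      # 1 s of signal, so FFT bins are 1 Hz apart

def note(f0, amp):
    return amp * sum(np.cos(2 * np.pi * h * f0 * t) / h for h in range(1, 6))

x = note(262, 1.0) + note(370, 0.8)         # C4 and F#4, rounded to integer Hz
candidates = [262, 277, 294, 311, 330, 349, 370, 392, 415, 440, 466, 494]
notes = iterative_multipitch(x, sr, candidates)
```

On real recordings the harmonics of different notes overlap, so simply zeroing bins removes too much; this is the difficulty that the spectral smoothness principle in the first reference above is meant to address.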
![Page 106: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/106.jpg)
Exploit source separation principles…
Using sparse decomposition methods (or source
separation)
• For example: atomic decomposition of polyphonic signals
• The signal is represented as a linear combination of atoms
chosen in a fixed dictionary.
![Page 107: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/107.jpg)
Exploit source separation principles…
The atomic decomposition is for example obtained by “matching pursuit”:
• The most prominent atom (i.e. the most correlated with the signal) is extracted and subtracted from the original signal.
• Iterate the procedure until a predefined number of atoms has been selected (or until a predefined SNR has been reached)
Figure from L. Daudet: Audio Sparse Decompositions in Parallel, IEEE Signal Processing Magazine, 2010
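The matching pursuit procedure described above can be sketched in a few lines. The dictionary used in the example (unit-norm cosine atoms) and the parameter names are assumptions for illustration; audio applications typically use much larger, redundant dictionaries such as Gabor or MDCT atoms.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms=10, target_snr_db=None):
    """Greedy decomposition of `signal` on the unit-norm columns of `dictionary`."""
    residual = np.asarray(signal, dtype=float).copy()
    signal_energy = float(residual @ residual)
    selected = []                                     # list of (atom index, weight)
    for _ in range(n_atoms):
        correlations = dictionary.T @ residual
        k = int(np.argmax(np.abs(correlations)))      # most correlated atom
        weight = float(correlations[k])
        residual -= weight * dictionary[:, k]         # subtract its contribution
        selected.append((k, weight))
        residual_energy = float(residual @ residual)
        if residual_energy == 0.0:
            break
        if target_snr_db is not None:
            if 10 * np.log10(signal_energy / residual_energy) >= target_snr_db:
                break
    return selected, residual

# Dictionary of 64 cosine atoms (columns), normalised to unit norm
N = 64
n = np.arange(N)
D = np.array([np.cos(np.pi * (n + 0.5) * k / N) for k in range(N)]).T
D /= np.linalg.norm(D, axis=0)

signal = 3.0 * D[:, 2] - 1.5 * D[:, 5]
selected, residual = matching_pursuit(signal, D, n_atoms=2)
```

Both stopping rules from the text are available: `n_atoms` bounds the number of selected atoms and `target_snr_db` stops as soon as the reconstruction is accurate enough.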
![Page 108: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/108.jpg)
Uncovering tonal / timbral information
• Use informed atoms: e.g. characteristics of a pitch
and of an instrument
• Introduce continuity constraints to build
“molecules” (a note played by an instrument)
Demo from P. Leveau
P. Leveau, E. Vincent, G. Richard and L. Daudet, « Instrument-Specific Harmonic Atoms for Mid-Level Musical
Audio Representation » IEEE Trans. on ASLP, Volume 16, N°1 Jan. 2008 Page(s):116 - 128
![Page 109: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/109.jpg)
Exploit source separation principles
Use of unsupervised decomposition methods (for example Non-negative Matrix Factorization, or NMF)
Principle of NMF:
Image from R. Hennequin
![Page 110: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/110.jpg)
An example of musical transcription (From V. Emiya, PhD thesis, Telecom ParisTech, 2008)
Processing chain:
• Segmentation (onset detection)
• Segment analysis: candidate notes (multi-F0 estimation), tracking of note combinations (HMM)
• Detection of repeating notes
Demo
![Page 111: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/111.jpg)
A few examples of music track extraction….
Bass line extraction …..
M. Ryynanen and A. Klapuri, “Automatic bass line transcription from streaming polyphonic audio,” in IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Hawaii, USA, 2007.
![Page 112: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/112.jpg)
A few examples of music track extraction….
Main melody extraction (trumpet)
Audio examples: original, trumpet, accompaniment
J-L Durrieu, G. Richard, B. David, C. Févotte, Source/Filter Model for Unsupervised Main Melody Extraction From Polyphonic
Audio Signals, IEEE Transactions on ASLP, March 2010.
J-L Durrieu, B. David, G. Richard, A musically motivated mid-level representation for pitch estimation and musical audio source
separation, IEEE Journal on Selected Topics in Signal Processing, October 2011.
![Page 113: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/113.jpg)
A few examples of music track extraction….
From Leglaive, 2017
![Page 114: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/114.jpg)
A few examples of music track extraction….
Drum track extraction …..
O. Gillet, G. Richard. Transcription and separation of drum signals from polyphonic music. IEEE Trans. on Audio, Speech and Language Proc., March 2008
![Page 115: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/115.jpg)
A few references…
Audio classification / Music signal processing
• M. Mueller, D. Ellis, A. Klapuri, G. Richard, "Signal Processing for Music Analysis", IEEE Journal on Selected Topics in Signal Processing, October 2011.
• G. Richard, S. Sundaram, S. Narayanan, "An overview on Perceptually Motivated Audio Indexing and Classification", Proceedings of the IEEE, 2013.
• M. Mueller, "Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications", Springer, 2015.
• A. Klapuri, M. Davy, "Signal Processing Methods for Music Transcription", Springer, New York, 2006.
• G. Peeters, "Automatic classification of large musical instrument databases using hierarchical classifiers with inertia ratio maximization", in 115th AES convention, New York, USA, Oct. 2003.
• G. Peeters, "A large set of audio features for sound description (similarity and classification) in the CUIDADO project", Technical report, IRCAM, 2004.
Rhythm/tempo estimation
• M. Alonso, G. Richard, B. David, “Accurate tempo estimation based on harmonic+noise decomposition”, EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 82795, 14 pages, 2007.
• E. Scheirer, "Tempo and Beat Analysis of Acoustic Musical Signals", Journal of the Acoustical Society of America, Vol. 103, No. 1, pp. 588-601, 1998.
• J. Laroche, "Estimating Tempo, Swing, and Beat Locations in Audio Recordings", in Proc. of WASPAA'01, New York, NY, USA, October 2001.
• S Durand, J. Bello, S. Leglaive, B. David, G. Richard, "Robust Downbeat Tracking Using an Ensemble of Convolutional Networks", IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol 25, N°1, 2017
Music instrument recognition
• S. Essid, G. Richard, B. David. Instrument recognition in polyphonic music based on automatic taxonomies. IEEE Trans. on Audio, Speech, and Language Proc. 14 (2006), no. 1
• A. Eronen, "Comparison of features for musical instrument recognition", Proc. of IEEE WASPAA 2001.
• [Eronen-09] A. Eronen, “Signal processing method for audio classification and music content analysis,” Ph.D. dissertation, Tampere University of Technology, Finland, June 2009.
• S. Essid, G. Richard, B. David. Musical Instrument recognition by pairwise classification strategies. IEEE Trans. on Audio, Speech and Language Proc. 14 (2006), no. 4
• [Barbedo-11] J. Barbedo and G. Tzanetakis, "Musical instrument classification using individual partials," IEEE Trans. Audio, Speech and language Processing, 19(1), 2011.
• [Leveau-08]: P. Leveau, E. Vincent, G. Richard, and L. Daudet, “Instrument-specific harmonic atoms for mid-level music representation,” IEEE Trans. Audio, Speech and Language Processing, vol. 16, no. 1, pp. 116–128, 2008.
• [Kitahara-07] T. Kitahara, “Computational musical instrument recognition and its application to content-based music information retrieval,” Ph.D. dissertation,
![Page 116: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/116.jpg)
A few references…
Chord estimation
• L. Oudre. Template-based chord recognition from audio signals. PhD thesis, TELECOM ParisTech, 2010.
Multipitch estimation
• A. Klapuri, “Multiple fundamental frequency estimation based on harmonicity and spectral smoothness,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 804–816, 2003.
• V. Emiya, PhD thesis. Telecom ParisTech.
Perception
• [Alluri-10] V. Alluri and P. Toiviainen, “Exploring perceptual and acoustical correlates of polyphonic timbre,” Music Perception, vol. 27, no. 3, pp. 223–241, 2010.
• [Kendall-91] R. A. Kendall and E. C. Carterette, “Perceptual scaling of simultaneous wind instrument timbres,” Music Perception, vol.
8, no. 4, pp. 369–404, 1991.
• [McAdams-95] McAdams, S., Winsberg, S., Donnadieu, S., DeSoete, G., and Krimphoff, J. “Perceptual Scaling of synthesized
musical timbres: Common dimensions, specificities and latent subject classes,” Psychological Research, 1995.
• [Schouten-68] J. F. Schouten, “The perception of timbre,” in 6th International Congress on Acoustics, Tokyo, Japan, 1968,
Source separation
• O. Gillet, G. Richard. Transcription and separation of drum signals from polyphonic music. IEEE Trans. on Audio, Speech and Language Proc. (2008)
• M. Ryynänen and A. Klapuri, “Automatic bass line transcription from streaming polyphonic audio,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hawaii, USA, 2007.
• S. Leglaive, R. Badeau, G. Richard, "Multichannel Audio Source Separation with Probabilistic Reverberation Priors", IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 24, no. 12, December 2016
• J-L Durrieu, B. David, G. Richard, A musically motivated mid-level representation for pitch estimation and musical audio source separation, IEEE Journal on Selected Topics in Signal Processing, October 2011.
Acoustic Scene and event recognition
• V. Bisot & al., "Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification", IEEE/ACM Transactions on
Audio, Speech, and Language Processing, (2017),
• V. Bisot & al., Leveraging deep neural networks with nonnegative representations for improved environmental sound classification
IEEE International Workshop on Machine Learning for Signal Processing MLSP, Sep 2017, Tokyo,
• A Mesaros & al. Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 challenge IEEE/ACM
Transactions on Audio, Speech, and Language Processing 26 (2), 379-393
![Page 117: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/117.jpg)
Another application: audio fingerprinting
![Page 118: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/118.jpg)
Audio identification, or AudioID
Audio ID = retrieving high-level metadata from a sound or track
Challenges:
• Effectiveness in adverse conditions (distortion, noise, …)
• Scalability (databases > 100,000 titles)
• Speed / real time
Example product: Shazam
Diagram: audio → identification → information about the excerpt (e.g. for music: title, artist, …)
![Page 119: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/119.jpg)
Audio fingerprinting
Audio fingerprinting: one approach to Audio ID
The principle:
• For each reference, a unique audio fingerprint.
• Identifying a sound: compute its fingerprint and compare it with a database of reference fingerprints.
Diagram (after Sébastien Fenêt):
• Database creation: reference audio tracks → fingerprint processing → database of reference fingerprints
• Identification: excerpt → fingerprint processing → DB query → DB answer → ID result: information about the excerpt (e.g. for music: title, album, artist, …)
![Page 120: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/120.jpg)
Signal model used
‘Binarization’ of the spectrogram (2D peak picking)
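A minimal sketch of 2D peak picking, assuming a non-negative magnitude spectrogram stored as a frequency × time array. A point is kept (a "white point" in the binarized image) when it is the maximum of its local neighbourhood and not negligible compared with the global maximum; the neighbourhood size and the relative floor are hypothetical parameters.

```python
import numpy as np

def peak_pick_2d(spec, half_size=2, rel_floor=0.1):
    """Boolean mask of the points that are the maximum of their local neighbourhood."""
    spec = np.asarray(spec, dtype=float)
    padded = np.pad(spec, half_size, mode="constant", constant_values=-np.inf)
    local_max = np.full(spec.shape, -np.inf)
    rows, cols = spec.shape
    # Running maximum over all shifts inside the (2*half_size+1)^2 window
    for di in range(2 * half_size + 1):
        for dj in range(2 * half_size + 1):
            local_max = np.maximum(local_max, padded[di:di + rows, dj:dj + cols])
    # Keep local maxima that are also not negligible w.r.t. the global maximum
    return (spec == local_max) & (spec >= rel_floor * spec.max())

spec = np.zeros((8, 10))
spec[2, 3], spec[2, 4], spec[6, 8] = 5.0, 3.0, 4.0
mask = peak_pick_2d(spec)
peaks = [tuple(int(v) for v in p) for p in np.argwhere(mask)]
```

In the example the point at (2, 4) is discarded because a stronger neighbour sits next to it, so only two isolated peaks survive.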
![Page 121: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/121.jpg)
Efficient search strategy
Unknown excerpt to be identified in a
database of 100,000 items
Possible strategies
• Direct comparison with every reference in the
database (with all possible time shifts)
• Use the location of the white points as an index
• Use pairs of points as an index
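The third strategy can be sketched as an inverted index: each pair of peaks is turned into a key, and the index maps that key to the references (and times) where it occurs. The key layout (f1, f2, Δt), the pairing limits `max_dt` and `fan_out`, and the toy peak lists are assumptions made for this example.

```python
from collections import defaultdict

def make_pairs(peaks, max_dt=50, fan_out=3):
    """Pair each peak (t, f) with up to `fan_out` later peaks at most `max_dt` frames away."""
    peaks = sorted(peaks)
    pairs = []
    for i, (t1, f1) in enumerate(peaks):
        n_paired = 0
        for t2, f2 in peaks[i + 1:]:
            if t2 - t1 > max_dt or n_paired == fan_out:
                break
            if t2 > t1:
                pairs.append(((f1, f2, t2 - t1), t1))   # (index key, time of the pair)
                n_paired += 1
    return pairs

def build_index(references):
    """Map each pair key to the list of (reference id, time of occurrence)."""
    index = defaultdict(list)
    for ref_id, peaks in references.items():
        for key, t1 in make_pairs(peaks):
            index[key].append((ref_id, t1))
    return index

# Toy peak lists (time frame, frequency bin) for two references
references = {
    "song_a": [(0, 10), (5, 20), (9, 15)],
    "song_b": [(0, 12), (4, 20), (30, 7)],
}
index = build_index(references)
```

A pair is far more specific than a single point, so each key returns few candidate references, and the key does not depend on where the excerpt starts in the track.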
![Page 122: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/122.jpg)
Finding the best reference
For each pair, one query to the database: “which references contain this pair, and at what time does the pair occur?”
If the pair occurs at T1 in the unknown excerpt and at T2 in the reference, the time shift is defined as ΔT(pair) = T2 − T1
Algorithm for finding the best reference:
For each pair:
    Get the references having the pair;
    For each reference found:
        Store the time-shift;
Look for the reference with the most frequent time-shift;
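The algorithm above translates almost line for line into a histogram of (reference, time shift) votes. In this sketch the index is a plain dictionary mapping a pair key to a list of (reference id, T2); the keys "p1", "p2", "p3" and the toy index contents are placeholders invented for the example.

```python
from collections import Counter

def best_reference(query_pairs, index):
    """Return (reference id, time shift, vote count) for the most frequent time shift."""
    votes = Counter()
    for key, t1 in query_pairs:                  # for each pair of the unknown excerpt
        for ref_id, t2 in index.get(key, []):    # references having the pair
            votes[(ref_id, t2 - t1)] += 1        # store the time shift T2 - T1
    if not votes:
        return None                              # no pair was found in the database
    (ref_id, shift), count = votes.most_common(1)[0]
    return ref_id, shift, count

index = {
    "p1": [("song_a", 100), ("song_b", 40)],
    "p2": [("song_a", 103)],
    "p3": [("song_a", 110), ("song_b", 300)],
}
query_pairs = [("p1", 0), ("p2", 3), ("p3", 10)]
result = best_reference(query_pairs, index)
```

Pairs that match by chance produce scattered time shifts, whereas a true match makes many pairs agree on the same shift, here 100 frames for "song_a".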
![Page 123: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/123.jpg)
Rejecting an out-of-database excerpt:
fusion of local decisions
The unknown excerpt is divided into sub-segments
For each segment, the algorithm returns a best candidate
If one reference appears predominantly (or a number of times above a threshold), the excerpt is identified
Otherwise, the query is judged to be out of the database
Correct detection rate close to 90% (for a database of 7,500 references)
Figure: unknown excerpt split into sub-segments, each with its own best match (#1 to #6)
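The fusion rule can be sketched as a simple vote over the per-segment best candidates. The threshold `min_votes` and the function name are hypothetical; the slide does not state which threshold was used.

```python
from collections import Counter

def fuse_decisions(segment_candidates, min_votes=3):
    """Identify the excerpt if one reference wins at least `min_votes` segments, else reject."""
    counts = Counter(c for c in segment_candidates if c is not None)
    if not counts:
        return None
    ref_id, n = counts.most_common(1)[0]
    return ref_id if n >= min_votes else None    # None means "out of database"

identified = fuse_decisions(["a", "a", "c", "a", "b", "a"])
rejected = fuse_decisions(["a", "b", "c", "d", "e", "f"])
```

An excerpt that is not in the database still produces a best match for every segment, but those matches disagree with each other, so no reference reaches the threshold.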
![Page 124: TRIM Indexation Audio - Telecom Paris](https://reader031.fdocuments.net/reader031/viewer/2022022804/621b10028893c500851d31d7/html5/thumbnails/124.jpg)
Evaluation (detection of recurring events) - Quaero 2012