Nonnegative Tensor Factorization for Source Separation of Loops in Audio
Jordan B. L. Smith National Institute of Advanced Industrial Science and Technology (AIST), Japan
Masataka Goto National Institute of Advanced Industrial Science and Technology (AIST), Japan
Laboratoire de signaux et systèmes (L2S) & IRCAM, Paris
Introduction
Extracting loops from music
• In some musical styles, songs are built from loops. E.g.:
1. Collection of loops: A, B, C, D (Drum, Melody, Bass, FX)
2. Loops arranged to make a song
[Figure: layout of loops A, B, C, D along a timeline from 0:00 to 1:00]
3. Song mixed down to audio
→ composition process →
Audio examples (and test data) all borrowed from [López-Serrano et al. 2016]
• Goal: decompose the audio signal to recover:
  • the layout of the song
  • the source-separated loops
← decomposition procedure ←
Extracting loops from music
• Two previous approaches that inspired us:
• Fingerprint-based loop detection [López-Serrano et al. 2016]
  Inputs: original loops (A, B, C, D) + mixed audio
  Output: map of loop activations
• Iterative NMF [Seetharaman & Pardo 2016]
  Inputs: mixed audio + assumption that loops are introduced additively
  Output: separated tracks, one per loop (A, B, C, D)
Extracting loops from music
• Our proposed system:
  Input: mixed audio
  Outputs: map of loop activations + separated tracks, one per loop (A, B, C, D)
• We attempt to solve both problems in one step, without assumption of additive layout
• We do so by extending nonnegative matrix factorization (NMF) to handle periodicity
Source separation using NMF*
NMF can handle many types of repetition:
• Steady-state notes → NMF with harmonic templates
• Note sequences repeated in time → NMFD with time-evolving templates [Smaragdis 2004]
• Transposed notes → NMF2D with transposed harmonic templates [e.g., FitzGerald, Cranitch & Coyle 2008]
• Periodicity (especially at downbeats) → ...no nonnegative approach! NB: REPET, a median-filtering approach [Rafii, Liutkus, & Pardo 2014]
Method
Nonnegative tensor factorization
• Step 1: estimate downbeats [madmom, Böck et al. 2016]
• Step 2: stack the 2D spectrograms into a 3D volume (a “spectral cube”)
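Step 1 could be sketched as follows with madmom's downbeat tracker. The slides only name the library, so the choice of processors, the `beats_per_bar` candidates, and `fps=100` are assumptions; the helper names `downbeat_times` and `estimate_downbeats` are hypothetical.

```python
import numpy as np

def downbeat_times(beats):
    """Keep the times of beats whose position-in-bar equals 1.

    `beats` holds (time_in_seconds, position_in_bar) rows, the format
    returned by madmom's DBNDownBeatTrackingProcessor.
    """
    beats = np.asarray(beats, dtype=float).reshape(-1, 2)
    return beats[beats[:, 1] == 1, 0]

def estimate_downbeats(audio_path, beats_per_bar=(3, 4), fps=100):
    """Run madmom's RNN + DBN downbeat tracker on an audio file (assumed setup)."""
    # Imported lazily so that downbeat_times() works without madmom installed.
    from madmom.features.downbeats import (
        DBNDownBeatTrackingProcessor,
        RNNDownBeatProcessor,
    )
    activations = RNNDownBeatProcessor()(audio_path)
    tracker = DBNDownBeatTrackingProcessor(
        beats_per_bar=list(beats_per_bar), fps=fps
    )
    return downbeat_times(tracker(activations))
```

The downbeat times would then be converted to spectrogram frame indices before the bars are stacked.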
Detour: understanding the spectral cube
[Figure: the spectral cube and its three axes]
• Frequency: bottom to top (low frequency to high)
• Bar number (time in piece): back to front (beginning to end of piece)
• Time in bar: left to right (beginning to end of a bar)
Visualizing a 3D volume: CT scan
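A minimal sketch of how such a cube might be built from a magnitude spectrogram and downbeat frame indices. The slides do not say how bars of unequal length are handled, so resampling every bar to a common `frames_per_bar` by linear interpolation is an assumption, as is the function name `spectral_cube`.

```python
import numpy as np

def spectral_cube(spec, downbeat_frames, frames_per_bar):
    """Stack the bars of a spectrogram into a (frequency, time-in-bar, bar) cube.

    spec            : nonnegative array, shape (n_freq, n_frames)
    downbeat_frames : increasing frame indices of the downbeats
    frames_per_bar  : common length each bar is resampled to (assumption)
    """
    spec = np.asarray(spec, dtype=float)
    new_t = np.linspace(0.0, 1.0, frames_per_bar)
    bars = []
    for start, stop in zip(downbeat_frames[:-1], downbeat_frames[1:]):
        bar = spec[:, start:stop]
        old_t = np.linspace(0.0, 1.0, bar.shape[1])
        # Interpolate each frequency row so that all bars have equal length.
        bars.append(np.stack([np.interp(new_t, old_t, row) for row in bar]))
    # The last axis indexes the bar number (time in piece).
    return np.stack(bars, axis=-1)
```

With this axis order, the three modes line up with frequency, time in bar, and bar number.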
Nonnegative tensor factorization
• Step 1: estimate downbeats
• Step 2: stack the 2D spectrograms into a 3D volume (a “spectral cube”)
• Step 3: use nonnegative tensor factorization (NTF) to model the spectral cube
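Step 3 could be sketched with multiplicative updates for a nonnegative Tucker model. The slides do not state which loss or update rule was used, so the Frobenius loss, the random initialization, and the names `reconstruct` and `fit_ntf` are assumptions. The factor shapes are W (M × r_w), H (r_h × P), D (r_d × Q), and C (r_w × r_h × r_d).

```python
import numpy as np

def reconstruct(C, W, H, D):
    # X_hat[m, p, q] = sum over i, j, k of C[i, j, k] * W[m, i] * H[j, p] * D[k, q]
    return np.einsum("ijk,mi,jp,kq->mpq", C, W, H, D, optimize=True)

def fit_ntf(X, ranks, n_iter=200, seed=0, eps=1e-12):
    """Fit X ≈ C ◦ (W ◦ H ◦ D) with multiplicative updates (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    (M, P, Q), (rw, rh, rd) = X.shape, ranks
    W, H, D = rng.random((M, rw)), rng.random((rh, P)), rng.random((rd, Q))
    C = rng.random((rw, rh, rd))

    def ratio(subscripts, *others):
        # Same contraction applied to the data and to the current model.
        Xh = reconstruct(C, W, H, D)
        num = np.einsum(subscripts, X, *others, optimize=True)
        den = np.einsum(subscripts, Xh, *others, optimize=True)
        return num / (den + eps)

    errors = [np.linalg.norm(X - reconstruct(C, W, H, D))]
    for _ in range(n_iter):
        # The ratios are nonnegative, so every factor stays nonnegative.
        W *= ratio("mpq,ijk,jp,kq->mi", C, H, D)
        H *= ratio("mpq,ijk,mi,kq->jp", C, W, D)
        D *= ratio("mpq,ijk,mi,jp->kq", C, W, H)
        C *= ratio("mpq,mi,jp,kq->ijk", W, H, D)
        errors.append(np.linalg.norm(X - reconstruct(C, W, H, D)))
    return C, W, H, D, errors
```

The returned `errors` list tracks the reconstruction error from the random start through each iteration, which gives a quick check that the fit is improving.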
Nonnegative matrix factorization
• NMF: X ≈ W ◦ H
  • W = note templates
  • H = activation functions
  • Dimensions: X is M × N, W is M × r, H is r × N
• Needs post-processing to separate sources:
  • which templates in W belong to the same source?
  • different sources could use the same harmonic components!
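For reference, a minimal NMF sketch using the Lee and Seung multiplicative updates for the Frobenius loss, reading ◦ in the slide as the ordinary matrix product. The function name `nmf` and its parameters are illustrative, not taken from the talk.

```python
import numpy as np

def nmf(X, r, n_iter=200, seed=0, eps=1e-12):
    """Factor a nonnegative M x N matrix X into W (M x r) and H (r x N)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, r))  # columns play the role of note templates
    H = rng.random((r, N))  # rows play the role of activation functions
    for _ in range(n_iter):
        # Multiplicative updates keep both factors nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Nothing in `W` or `H` says which templates belong to the same source, which is the post-processing problem the slide points out.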
Nonnegative tensor factorization
• Tucker Decomposition: X ≈ C ◦ (W ◦ H ◦ D)
  • W = note templates
  • H = activation functions (time-in-bar)
  • D = loop activation functions (time-in-piece)
  • C = core tensor = recipe for each loop type
[Figure: Tucker decomposition of the tensor 𝓧, with dimensions labeled M, P, Q]
Interpreting the NTF model
• W, H, and D all musically intuitive:
• Loop template activations directly estimate layout of song
[Figure: estimated layout of loops A, B, C, D over the song]
Interpreting the NTF model
• Core tensor C = recipe for each loop type
• Pixel C(i, j, k) tells us to play note w_i with activation function h_j whenever loop d_k appears.
[Figure: loop recipes (C), e.g. (w4, h7) + (w11, h10) + (w24, h16)]
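The "recipe" reading of the core tensor can be checked numerically: each entry C(i, j, k) contributes one outer product of a note template, a time-in-bar activation, and a loop activation, and these contributions add up to the full model. The sketch below assumes W has templates as columns and H and D have activations as rows; the function names are hypothetical.

```python
import numpy as np

def core_entry_term(C, W, H, D, i, j, k):
    """Contribution of the single core entry C[i, j, k] to the spectral cube."""
    # Outer product of note template w_i, activation h_j and loop activation d_k.
    return C[i, j, k] * np.einsum("m,p,q->mpq", W[:, i], H[j, :], D[k, :])

def sum_of_terms(C, W, H, D):
    """Add up the contributions of every core entry."""
    total = np.zeros((W.shape[0], H.shape[1], D.shape[1]))
    for i, j, k in np.ndindex(*C.shape):
        total += core_entry_term(C, W, H, D, i, j, k)
    return total
```

A sparse core therefore means each loop is built from only a few (note, activation) pairs.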
Interpreting the NTF model
• To recover entire spectrogram: C ◦ (W ◦ H ◦ D)
• To recover individual loop source: C[:,:,k] ◦ (W ◦ H ◦ D[k,:])
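A sketch of the per-loop recovery written with `numpy.einsum`, under the same shape assumptions as the slide's indexing (C is r_w × r_h × r_d, D has one row per loop). The helper `cube_to_spectrogram`, which lays the bars back out in time order, is a hypothetical addition; turning the result into audio is not covered here.

```python
import numpy as np

def loop_source(C, W, H, D, k):
    """Spectral cube of loop k alone: C[:, :, k] ◦ (W ◦ H ◦ D[k, :])."""
    # C[:, :, k] is the recipe of loop k; D[k, :] says in which bars it plays.
    return np.einsum("ij,mi,jp,q->mpq", C[:, :, k], W, H, D[k, :])

def cube_to_spectrogram(cube):
    """Concatenate the bars of a (freq, time-in-bar, bar) cube along time."""
    M, P, Q = cube.shape
    return cube.transpose(0, 2, 1).reshape(M, Q * P)
```

Summing `loop_source` over all k gives back the full model, so the separated tracks partition the reconstruction.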
Evaluation
Evaluation
• We used synthetic data [López-Serrano et al. 2016]
  • 7 sets of loops × 3 different layouts (arrangements)
• Algorithm output 1: separated signals (estimated source tracks vs. stem tracks)
  • Evaluate quality with SDR, SIR, SAR
• Algorithm output 2: loop layout (estimated map vs. ground truth map)
  • Evaluate accuracy with correlation
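One way the layout comparison might be computed is sketched below. The slides only say "correlation", so the use of Pearson correlation per loop, the exhaustive search over row orderings to match estimated loops to ground-truth loops, and the name `layout_correlation` are all assumptions. Rows must not be constant, since their correlation is undefined.

```python
from itertools import permutations

import numpy as np

def layout_correlation(est, ref):
    """Best mean Pearson correlation between estimated and reference layouts.

    est, ref : arrays of shape (n_loops, n_bars) with non-constant rows
    Returns (score, order), where order[a] is the reference row matched
    to estimated row a.
    """
    est, ref = np.asarray(est, dtype=float), np.asarray(ref, dtype=float)
    n = ref.shape[0]
    # corr[a, b] = correlation between estimated row a and reference row b
    corr = np.corrcoef(est, ref)[:n, n:]
    best = max(
        permutations(range(n)),
        key=lambda order: sum(corr[a, order[a]] for a in range(n)),
    )
    score = float(np.mean([corr[a, best[a]] for a in range(n)]))
    return score, best
```

The exhaustive search is only practical for a handful of loops, which matches the four-loop examples in the slides.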
Good separation example
• When it works, it works
Collection of loops for genre “Acid”: Drum, Melody, Bass, FX
Extracted loops: 1, 2, 3, 4
Flawed separation example
Original tracks for genre “Brezo”
Source separated tracks
[Figure: original layout of loops A, B, C, D next to the estimated layout, which contains fewer activations of A and B; the two are related by the steps “swap rows” and “substitute C=CA”]
[Figure: SDR, SIR, and SAR for three systems: (proposed), (performance ceiling), and [Seetharaman & Pardo 2016]]
• Our reconstruction quality is average. :-|
• We have more noisy artifacts. :-(
• We have less crosstalk than others! :-D
[Figure: correlation of estimated layouts, on an axis from 0.0 to 1.0]
• We get very clean layouts! :-D
Conclusion
Conclusion
• Proposed method of decomposing audio into loops that:
  • Models periodicity using the spectral cube
  • Models source signals and song composition jointly
  • Uses a Tucker decomposition that is musically intuitive
• Weaknesses include:
  • Very conservative reconstructions don’t model the whole signal
  • Like NMFD, we cannot distinguish between algebraically equivalent decompositions
• Future work: searching for repetitions at multiple hierarchical time scales
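The "algebraically equivalent decompositions" weakness can be illustrated with a small sketch: replacing D by A·D and compensating in the core leaves the model output unchanged for any invertible A. The function `remix_loops` and the example matrices are illustrative assumptions, not the authors' analysis.

```python
import numpy as np

def reconstruct(C, W, H, D):
    # X_hat[m, p, q] = sum over i, j, k of C[i, j, k] * W[m, i] * H[j, p] * D[k, q]
    return np.einsum("ijk,mi,jp,kq->mpq", C, W, H, D)

def remix_loops(C, D, A):
    """Replace D by A @ D and compensate in the core so the model is unchanged."""
    A_inv = np.linalg.inv(A)
    # The core absorbs A^-1 along its loop axis: C_new[i, j, l] = sum_k C[i, j, k] A_inv[k, l]
    C_new = np.einsum("ijk,kl->ijl", C, A_inv)
    return C_new, A @ D
```

A permutation matrix A corresponds to swapping rows of the layout. A matrix such as [[1, 0], [1, 1]] replaces one loop's activation by the sum of two, and the compensated core can still be nonnegative, so the nonnegativity constraint alone does not always pick out the intended layout.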
Future work: hierarchical analysis
• Different loops in the song have different lengths and periods
• Spectral cubes with different periods highlight different consistent repetitions
[Figure: spectral cubes with PERIOD: 2 beats, 1 downbeat, 2 downbeats, 4 downbeats]
Thank you!
PS. Jordan is now at: