R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
1
Lisbon, Portugal June 5-8, 2017
SPARS 2017 Signal Processing with Adaptive Sparse Structured Representations
Submission deadline: December 12, 2016 Notification of acceptance: March 27, 2017 Summer School: May 31-June 2, 2017 (tbc) Workshop: June 5-8, 2017
spars2017.lx.it.pt
Rémi Gribonval Inria Rennes - Bretagne Atlantique
Random Moments for Compressive Learning
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Main Contributors & Collaborators
3
Anthony Bourrier, Nicolas Keriven, Yann Traonmilin
Nicolas Tremblay
Gilles Puy
Mike Davies, Patrick Pérez, Gilles Blanchard
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Agenda
From Compressive Sensing to Compressive Learning ?
The Sketch Trick
Compressive K-means
Compressive GMM
Conclusion
4
From Compressive Sensing to Compressive Learning
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Machine Learning
Available data: training collection of feature vectors = point cloud
Goals: infer parameters to achieve a certain task; generalization to future samples with the same probability distribution
Examples
6
[Embedded poster: Compressive Gaussian Mixture Estimation, ICASSP 2013]
Compressive Gaussian Mixture Estimation
Anthony Bourrier^{1,2}, Rémi Gribonval^{2}, Patrick Pérez^{1}
^1 Technicolor, 975 Avenue des Champs Blancs, 35576 Cesson-Sévigné, France
^2 INRIA Rennes - Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes, France — [email protected]

Motivation
Goal: infer parameters $\theta$ from $n$-dimensional data $X = \{x_1, \ldots, x_N\}$. This typically requires extensive access to the data. Proposed method: infer from a sketch of the data $\Rightarrow$ memory and privacy savings.
Figure 1: Illustration of the proposed sketching framework. Learning set (size $Nn$) $\xrightarrow{\hat A}$ database sketch $\hat z$ (size $m$) $\xrightarrow{L}$ learned parameters $\theta$ (size $K$). $A$ is a sketching operator, $L$ is a learning method from the sketch.

Model and problem statement
Application to a mixture of isotropic Gaussians in $\mathbb{R}^n$:
$f_\mu(x) \propto \exp\!\left(-\|x-\mu\|_2^2/(2\sigma^2)\right)$.   (1)
Data $X = \{x_j\}_{j=1}^N \overset{\text{i.i.d.}}{\sim} p = \sum_{s=1}^k \alpha_s f_{\mu_s}$ with:
- weights $\alpha_1, \ldots, \alpha_k$ (positive, sum to one)
- means $\mu_1, \ldots, \mu_k \in \mathbb{R}^n$.
Sketch = Fourier samplings at different frequencies: $(Af)_\ell = \hat f(\omega_\ell)$. Empirical version: $(\hat A(X))_\ell = \frac{1}{N}\sum_{j=1}^N \exp(-\mathrm{i}\langle\omega_\ell, x_j\rangle) \approx (Ap)_\ell$.
We want to infer the mixture parameters from $\hat z = \hat A(X)$. The problem is cast as
$\hat p = \operatorname*{arg\,min}_{q\in\Sigma_k} \|\hat z - Aq\|_2^2$,   (2)
where $\Sigma_k$ = mixtures of $k$ isotropic Gaussians with positive weights.

|              | Standard CS                  | Our problem                                                        |
| Signal       | $x \in \mathbb{R}^n$         | $f \in L^1(\mathbb{R}^n)$                                          |
| Dimension    | $n$                          | infinite                                                           |
| Sparsity     | $k$                          | $k$                                                                |
| Dictionary   | $\{e_1, \ldots, e_n\}$       | $\mathcal{F} = \{f_\mu,\ \mu \in \mathbb{R}^n\}$                   |
| Measurements | $x \mapsto \langle a, x\rangle$ | $f \mapsto \int_{\mathbb{R}^n} f(x)\, e^{-\mathrm{i}\langle\omega, x\rangle}\, dx$ |

Algorithm
Current estimate $\hat p$ with weights $\{\hat\alpha_s\}_{s=1}^k$ and support $\{\hat\mu_s\}_{s=1}^k$. Residual $\hat r = \hat z - A\hat p$.
1. Searching new support functions: search for "good components to add" to the support $\Rightarrow$ local minima of $\mu \mapsto -\langle Af_\mu, \hat r\rangle$ are added to the support, yielding an enlarged support.
2. $k$-term thresholding: projection of $\hat z$ onto the enlarged support with positivity constraints on coefficients:
$\operatorname*{arg\,min}_{\beta \in \mathbb{R}_+^{K}} \|\hat z - U\beta\|_2^2$,   (3)
with $U = [Af_{\hat\mu_1}, \ldots, Af_{\hat\mu_K}]$. The $k$ highest coefficients and the corresponding support are kept $\to$ new support and coefficients $\hat\alpha_1, \ldots, \hat\alpha_k$.
3. Final "shift": gradient descent on the objective function, initialized at the current support and coefficients.
Figure 2: Algorithm illustration in dimension n = 1 for k = 3 Gaussians. Top: iteration 1. Bottom: iteration 2. Blue curve = true mixture, red curve = reconstructed mixture, green curve = gradient function. Green dots = candidate centroids, red dots = reconstructed centroids.

Experimental results
Data setup: $\sigma = 1$, $(\alpha_1, \ldots, \alpha_k)$ drawn uniformly on the simplex. Entries of $\mu_1, \ldots, \mu_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$.
Algorithm heuristics:
- Frequencies drawn i.i.d. from $\mathcal{N}(0, \mathrm{Id})$.
- New support function search (step 1) initialized as $ru$, where $r$ is uniformly drawn in $[0, \max_{x\in X}\|x\|_2]$ and $u$ is uniformly drawn on $B_2(0, 1)$.
Comparison between:
- Our method: the sketch is computed on the fly and the data is discarded.
- EM: the data is stored to allow the standard optimization steps to be performed.
Quality measures: KL divergence and Hellinger distance.

Table 1: Comparison between our method and an EM algorithm (n = 20, k = 10, m = 1000).
N     | Compressed: KL div. / Hell. / Mem.      | EM: KL div. / Hell. / Mem.
10^3  | 0.68±0.28 / 0.06±0.01 / 0.6             | 0.68±0.44 / 0.07±0.03 / 0.24
10^4  | 0.24±0.31 / 0.02±0.02 / 0.6             | 0.19±0.21 / 0.01±0.02 / 2.4
10^5  | 0.13±0.15 / 0.01±0.02 / 0.6             | 0.13±0.21 / 0.01±0.02 / 24

Figure 3: Left: example of data and its sketch for n = 2. Right: reconstruction quality for n = 10 (panel title "n=10, Hell. for 80%"; axes: sketch size m, k*n/m).
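The poster's quality measures (KL divergence and Hellinger distance between the true and estimated mixtures) have no closed form for GMMs. The snippet below is my own illustrative Monte Carlo estimator, not the poster's evaluation code; the function names and the shared-variance parameterization (weights, means, sigma2) are assumptions for the example.

```python
import numpy as np

def gmm_pdf(X, weights, means, sigma2):
    """Density of an isotropic GMM (shared variance sigma2) at the rows of X."""
    n = X.shape[1]
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)        # (N, k)
    comp = np.exp(-d2 / (2 * sigma2)) / (2 * np.pi * sigma2) ** (n / 2)
    return comp @ weights

def mc_divergences(p_params, q_params, n_samples=100_000, rng=None):
    """Monte Carlo estimates of KL(p||q) and squared Hellinger distance H^2(p,q)."""
    rng = np.random.default_rng(rng)
    w, mu, s2 = p_params
    # sample from p: pick a component index, then add isotropic Gaussian noise
    idx = rng.choice(len(w), size=n_samples, p=w)
    X = mu[idx] + np.sqrt(s2) * rng.standard_normal((n_samples, mu.shape[1]))
    p, q = gmm_pdf(X, *p_params), gmm_pdf(X, *q_params)
    kl = np.mean(np.log(p / q))             # KL(p||q) = E_p[log(p/q)]
    hell2 = 1.0 - np.mean(np.sqrt(q / p))   # H^2(p,q) = 1 - E_p[sqrt(q/p)]
    return kl, hell2
```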
PCA: principal subspace
Dictionary learning: dictionary
Clustering: centroids
Classification: classifier parameters (e.g. support vectors)
X
Challenging dimensions
7
Point cloud = large matrix of feature vectors $X = [x_1\ x_2\ \ldots\ x_N]$
High feature dimension $n$, large collection size $N$
Challenge: compress before learning ?
Compressive Machine Learning ?
8
Point cloud = large matrix of feature vectors $X$
Reduce feature dimension [Calderbank & al 2009, Reboredo & al 2013]: (random) feature projection $Y = MX$; exploits / needs a low-dimensional feature model
Challenges of large collections
9
Feature projection $Y = MX$: limited impact
"Big Data" challenge: compress collection size
Compressive Machine Learning ?
10
Point cloud = … empirical probability distribution
Reduce collection dimension: (adaptive) column sampling / coresets, see e.g. [Agarwal & al 2003, Feldman 2010]; sketching & hashing, see e.g. [Thaper & al 2002, Cormode & al 2005]
$X \to M \to z \in \mathbb{R}^m$
Sketching operator: nonlinear in the feature vectors, linear in their probability distribution
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Example: Compressive K-means
11
X
[Poster thumbnail: Compressive Gaussian Mixture Estimation (Bourrier, Gribonval & Pérez, ICASSP 2013)]
$X \to M \to z \in \mathbb{R}^m$
Recovery algorithm
estimated centroids vs. ground truth
$N = 1000$; $n = 2$; $m = 60$
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Computational impact of sketching
12
Ph.D. A. Bourrier & N. Keriven
Computation time and memory vs. collection size $N$
Figure 7: Time (top) and memory (bottom) usage of all algorithms on synthetic data with dimension n = 10, number of components K = 5 (left) or K = 20 (right), and number of frequencies m = 5(2n+1)K, with respect to the number of items N in the database. Time curves: Sketching (no distributed computing), CLOMP, CLOMPR, BS-CGMM, CHS, EM. Memory curves: sketch + frequencies (compressive methods) vs. data (EM).
distributions for further work. In this configuration, the speaker verification results will indeed be far from state-of-the-art, but as mentioned before our goal is mainly to test our compressive approach on a different type of problem than that of GMM estimation on synthetic data, for which we have already observed excellent results.

In the GMM-UBM model, each speaker S is represented by one GMM $(\Theta_S, \alpha_S)$. The key point is the introduction of a model $(\Theta_{UBM}, \alpha_{UBM})$ that represents a "generic" speaker, referred to as the Universal Background Model (UBM). Given speech data $\mathcal{X}$ and a candidate speaker S, the statistic used for hypothesis testing is a likelihood ratio between the speaker and the generic model:

$T(\mathcal{X}) = \dfrac{p_{\Theta_S, \alpha_S}(\mathcal{X})}{p_{\Theta_{UBM}, \alpha_{UBM}}(\mathcal{X})}$.   (23)

If $T(\mathcal{X})$ exceeds a threshold $\tau$, the data $\mathcal{X}$ are considered as being uttered by the speaker S.

The GMMs corresponding to each speaker must somehow be "comparable" to each other and to the UBM. Therefore, the UBM is learned prior to individual speaker models, using a large database of speech data uttered by many speakers. Then, given training data $\mathcal{X}_S$ specific to one speaker, one M-step of the EM algorithm initialized with the UBM is used to adapt the UBM and derive the model $(\Theta_S, \alpha_S)$. We refer the reader to [51] for more details on this procedure.

In our framework, the EM or compressive estimation algorithms are used to learn the UBM.

5.2 Setup

The experiments were performed on the classical NIST05 speaker verification database. Both training and testing fragments are 5-minute conversations between two speakers. The database contains approximately 650 speakers and 30000 trials.
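The scoring step of eq. (23) is easy to state in code. Below is a minimal sketch (my own illustration, not the authors' implementation) of the likelihood-ratio test, assuming diagonal-covariance GMMs stored as (weights, means, variances) tuples; the ratio is computed in the log domain for numerical stability.

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Average log-likelihood of the rows of X under a diagonal-covariance GMM."""
    # X: (N, n); weights: (K,); means, variances: (K, n)
    diff2 = (X[:, None, :] - means[None, :, :]) ** 2                       # (N, K, n)
    log_comp = -0.5 * (np.sum(diff2 / variances[None], axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))    # (N, K)
    log_weighted = log_comp + np.log(weights)[None, :]
    m = log_weighted.max(axis=1, keepdims=True)                            # stable log-sum-exp
    return float(np.mean(m[:, 0] + np.log(np.exp(log_weighted - m).sum(axis=1))))

def gmm_ubm_score(X, speaker_gmm, ubm_gmm):
    """Log-likelihood ratio between the speaker model and the UBM (eq. 23 in log domain)."""
    return gmm_loglik(X, *speaker_gmm) - gmm_loglik(X, *ubm_gmm)

# Decision rule: accept the claimed speaker if the score exceeds a threshold tau.
```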
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
The Sketch Trick
13
Data distribution $X \sim p(x)$ $\to$ Sketch $z$
Signal Processing (inverse problems, compressive sensing): signal space $x$ $\to$ observation space $y$ via a linear operator $M$.
Machine Learning (method of moments, compressive learning): probability space $p$ $\to$ sketch space $z$ via a sketching operator $M$.
Linear "projection":
$z_\ell = \int h_\ell(x)\, p(x)\, dx = \mathbb{E}\, h_\ell(X) \approx \frac{1}{N}\sum_{i=1}^{N} h_\ell(x_i)$
nonlinear in the feature vectors, linear in the distribution $p(x)$
finite-dimensional Mean Map Embedding, cf. Smola & al 2007, Sriperumbudur & al 2010
Information preservation ?
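To make the sketching operation concrete, here is a minimal numpy sketch (my own illustration, not the talk's code) of the empirical "projection" $z_\ell = \frac{1}{N}\sum_i h_\ell(x_i)$, with generic feature functions $h_\ell$ supplied by the caller; the moment example at the bottom is an assumption chosen only for the demo.

```python
import numpy as np

def empirical_sketch(X, feature_fns):
    """Empirical sketch: z_l = (1/N) sum_i h_l(x_i) for each feature function h_l.

    X: (N, n) array of feature vectors; feature_fns: list of callables R^n -> scalar.
    The map is nonlinear in each x_i but linear in the empirical distribution.
    """
    return np.array([np.mean([h(x) for x in X]) for h in feature_fns])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    # tiny example sketch: first and (uncentered) second moments of 2-D data
    h = [lambda x: x[0], lambda x: x[1], lambda x: x[0] ** 2, lambda x: x[1] ** 2]
    print(empirical_sketch(X, h))
```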
The Sketch Trick
14
Dimension reduction ?
Compressive Learning (Heuristic) Examples
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Compressive Machine Learning
Point cloud = empirical probability distribution
Reduce collection dimension ~ sketching
16
X
$z_\ell = \frac{1}{N}\sum_{i=1}^{N} h_\ell(x_i), \qquad 1 \le \ell \le m$
Sketching operator $M$: $z \in \mathbb{R}^m$
Choosing an information-preserving sketch ?
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Goal: find k centroids
Standard approach = K-means
Sketching approach: $p(x)$ is spatially localized $\Rightarrow$ need "incoherent" sampling $\Rightarrow$ choose Fourier sampling: sample the characteristic function; choose the sampling frequencies
Example: Compressive K-means
17
$z_\ell = \frac{1}{N}\sum_{i=1}^{N} e^{\mathrm{j}\,\omega_\ell^\top x_i}, \qquad \omega_\ell \in \mathbb{R}^n$
~ pooled Random Fourier Features, cf. Rahimi & Recht 2007
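A minimal sketch of this empirical characteristic-function sampling, assuming the frequency heuristic stated on the poster (frequencies drawn i.i.d. from N(0, Id)); function names are mine, not from the papers.

```python
import numpy as np

def draw_frequencies(m, n, rng=None):
    """Poster heuristic: m frequencies omega_l drawn i.i.d. from N(0, Id) in R^n."""
    rng = np.random.default_rng(rng)
    return rng.standard_normal((m, n))                 # W: (m, n), rows are omega_l

def characteristic_sketch(X, W):
    """z_l = (1/N) sum_i exp(j * omega_l^T x_i): pooled random Fourier features."""
    return np.mean(np.exp(1j * X @ W.T), axis=0)       # (m,) complex sketch vector

# Usage: z = characteristic_sketch(X, draw_frequencies(m=60, n=X.shape[1]))
```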
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Goal: find k centroids
18
$X \to M \to z \in \mathbb{R}^m$ (sampled characteristic function)
$N = 1000$; $n = 2$; $m = 60$
Example: Compressive K-means
[Plot: data with ground-truth centroids]
Density model = mixture of K Diracs:
$p \approx \sum_{k=1}^{K} \alpha_k\, \delta_{\theta_k}$
Sketch of the model: $z = Mp \approx \sum_{k=1}^{K} \alpha_k\, M\delta_{\theta_k}$
[Plot: estimated centroids vs. ground truth]
Recovery algorithm = "decoder" $\Delta$
CLOMP = Compressive Learning OMP, similar to OMP with Replacement, Subspace Pursuit & CoSaMP
$z = Mp \approx \sum_{k=1}^{K} \alpha_k\, M\delta_{\theta_k}$
$(\hat\alpha_k, \hat\theta_k) \approx \operatorname*{arg\,min}_{\alpha_k, \theta_k} \Big\| z - \sum_{k=1}^{K} \alpha_k\, M\delta_{\theta_k} \Big\|_2$
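To show what a greedy decoder of this kind looks like, here is a deliberately simplified, toy CLOMP-style loop (my own code, not the CLOMP/CLOMPR implementation from the papers): it omits the replacement step and the final joint gradient descent, and uses generic scipy optimizers for the atom search and the nonnegative weight fit.

```python
import numpy as np
from scipy.optimize import minimize, nnls

def sketch_atom(theta, freqs):
    """Sketch of a Dirac at theta: M delta_theta = exp(j * freqs @ theta)."""
    return np.exp(1j * freqs @ theta)

def clomp_kmeans(z, freqs, K, n_init=5, rng=None):
    """Toy CLOMP-like decoder for compressive K-means (illustrative only).

    Greedily adds K centroids: each step picks the centroid whose sketch atom
    best correlates with the residual, then refits nonnegative weights.
    """
    rng = np.random.default_rng(rng)
    n = freqs.shape[1]
    centroids, residual = [], z.copy()
    for _ in range(K):
        best, best_val = None, np.inf
        for _ in range(n_init):                      # multi-start local atom search
            theta0 = rng.standard_normal(n)
            obj = lambda t: -np.abs(np.vdot(sketch_atom(t, freqs), residual))
            res = minimize(obj, theta0, method="Nelder-Mead")
            if res.fun < best_val:
                best, best_val = res.x, res.fun
        centroids.append(best)
        # refit nonnegative weights on the real/imag-stacked linear system
        A = np.stack([sketch_atom(c, freqs) for c in centroids], axis=1)
        A_ri = np.vstack([A.real, A.imag])
        z_ri = np.concatenate([z.real, z.imag])
        w, _ = nnls(A_ri, z_ri)
        residual = z - A @ w
    return np.array(centroids), w
```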
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Compressive K-Means: Empirical Results
19
[Figure: relative memory, relative time of estimation, and relative SSE as a function of m/(Kn), for N = 10^4, 10^5, 10^6, 10^7 training samples]
N training samples, K clusters
Training set: $x_i \in \mathbb{R}^n$; sketch vector: $z \in \mathbb{R}^m$; matrix of centroids: $C = \mathrm{CLOMP}(z) \in \mathbb{R}^{n\times K}$
K-means objective: $\mathrm{SSE}(\mathcal{X}, C) = \sum_{i=1}^{N} \min_k \|x_i - c_k\|^2$
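The evaluation metric above is straightforward to compute once the decoder has returned the centroids. A minimal sketch (my own, with centroids stored as rows for convenience rather than columns as on the slide):

```python
import numpy as np

def sse(X, C):
    """K-means objective SSE(X, C) = sum_i min_k ||x_i - c_k||^2.

    X: (N, n) training set; C: (K, n) centroids, e.g. returned by a CLOMP-style decoder.
    """
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    return d2.min(axis=1).sum()
```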
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Compressive K-Means: Empirical Results
20
N training samples, K = 10 clusters
[Figure: SSE/N for K-means (KM, Lloyd-Max) vs. Compressive K-means (CKM, Sketch+CLOMP) with 1 or 5 replicates (random initialization), on spectral features of MNIST / infMNIST, for N1 = 70 000, N2 = 300 000, N3 = 1 000 000]
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Example: Compressive GMM
21
Goal: fit k Gaussians
$X \to M \to z \in \mathbb{R}^m$ (sampled characteristic function); $N = 60\,000\,000$; $n = 12$; $m = 5\,000$
Density model = GMM with diagonal covariance: $p \approx \sum_{k=1}^{K} \alpha_k\, p_{\theta_k}$
Sketch of the model: $z = Mp \approx \sum_{k=1}^{K} \alpha_k\, M p_{\theta_k}$
Recovery algorithm = "decoder" $\Delta$:
$(\hat\alpha_k, \hat\theta_k) \approx \operatorname*{arg\,min}_{\alpha_k, \theta_k} \Big\| z - \sum_{k=1}^{K} \alpha_k\, M p_{\theta_k} \Big\|_2$
$\to$ estimated GMM parameters $(\Theta, \alpha)$
Compressive Hierarchical Splitting (CHS) = extension of CLOMP to general GMM
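For a diagonal-covariance Gaussian the model sketch $Mp_\theta$ has a closed form (the characteristic function of a Gaussian), so the fitting objective above can be evaluated directly. The snippet below is my own illustrative sketch of that forward model, not the CHS implementation; the (weights, means, variances) parameterization is an assumption for the example.

```python
import numpy as np

def gaussian_char_fn(freqs, mu, var):
    """Characteristic function of N(mu, diag(var)) sampled at the rows of freqs:
    E[exp(j w^T X)] = exp(j w^T mu - 0.5 * sum_d w_d^2 var_d)."""
    return np.exp(1j * freqs @ mu - 0.5 * (freqs ** 2) @ var)

def gmm_sketch(freqs, weights, means, variances):
    """Sketch of a diagonal-covariance GMM: M p = sum_k alpha_k * M p_theta_k."""
    return sum(a * gaussian_char_fn(freqs, m, v)
               for a, m, v in zip(weights, means, variances))

def fit_residual(z_emp, freqs, weights, means, variances):
    """Objective ||z - sum_k alpha_k M p_theta_k||_2 that the decoder minimizes."""
    return np.linalg.norm(z_emp - gmm_sketch(freqs, weights, means, variances))
```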
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Proof of Concept: Speaker Verification Results (DET-curves)
22
MFCC coefficients $x_i \in \mathbb{R}^{12}$
$N = 300\,000\,000$ (~ 50 Gbytes, ~ 1000 hours of speech)
After silence detection: $N = 60\,000\,000$ (used for CHS)
Maximum size manageable by EM: $N = 300\,000$ (used for EM)
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Proof of Concept: Speaker Verification Results (DET-curves)
25
~ 50 Gbytes ~ 1000 hours of speech
m = 5 000: 720 000-fold compression; exploit the whole collection; improved performance
m = 1 000: 3 600 000-fold compression
m = 500: 7 200 000-fold compression
CHS
Computational Efficiency
Computational Aspects
27
Sketching the empirical characteristic function:
$z_\ell = \frac{1}{N}\sum_{i=1}^{N} e^{\mathrm{j}\,\omega_\ell^\top x_i}$
Compute $X \to WX \to h(WX)$ with $h(\cdot) = e^{\mathrm{j}(\cdot)}$, then average over the N samples to get $z$
~ one-layer random neural net; decoding = next layers; DNN ~ hierarchical sketching ? see also [Bruna & al 2013, Giryes & al 2015]
Privacy-preserving: sketch and forget
Streaming algorithms: one pass, online update
Distributed computing: decentralized (HADOOP) / parallel (GPU)
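The one-pass / mergeable nature of the sketch is easy to see in code. Below is a minimal streaming-sketch sketch of my own (not the talk's implementation): batches or data shards update a running sum, and sketches computed on different machines merge by adding sums and counts.

```python
import numpy as np

class StreamingSketch:
    """One-pass, mergeable sketch of the empirical characteristic function."""

    def __init__(self, freqs):
        self.freqs = freqs                              # W: (m, n) frequency matrix
        self.total = np.zeros(len(freqs), dtype=complex)
        self.count = 0

    def update(self, batch):
        """Fold a batch of samples (shape (B, n)) into the running sum."""
        self.total += np.exp(1j * self.freqs @ batch.T).sum(axis=1)
        self.count += len(batch)

    def merge(self, other):
        """Combine with a sketch computed on another machine / data shard."""
        self.total += other.total
        self.count += other.count

    @property
    def sketch(self):
        return self.total / self.count                  # z = (1/N) sum_i exp(j W x_i)
```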
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Summary: Compressive K-means / GMM
31
✓ Dimension reduction ✓ Resource efficiency
✓ In the pipe: information preservation (generalized RIP, “intrinsic dimension”)
✓ Neural net - like
• Challenge: provably good recovery algorithms ?
Conclusion
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Projections & Learning
33
Signal Processing (compressive sensing): signal space $x$ $\to$ observation space $y$ via a linear operator $M$; compressive sensing = random projections of data items $\to$ reduce the dimension of data items.
Machine Learning (compressive learning): probability space $p$ $\to$ sketch space $z$ via a sketching operator $M$; compressive learning with sketches = random projections of collections, nonlinear in the feature vectors but linear in their probability distribution $\to$ reduce the size of the collection.
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Summary
Compressive GMM
Bourrier, G. & Pérez, Compressive Gaussian Mixture Estimation, ICASSP 2013
Keriven & al, Sketching for Large-Scale Learning of Mixture Models, ICASSP 2016 & arXiv:1606.02838
Compressive k-means
Keriven & al, Compressive K-Means, submitted to ICASSP 2017
Compressive spectral clustering (with graph signal processing)
Tremblay & al, Accelerated Spectral Clustering using Graph Filtering of Random Signals, ICASSP 2016
Tremblay & al, Compressive Spectral Clustering, ICML 2016 & arXiv:1602.02018
Ex: with the Amazon graph (10^6 edges), 5x speedup (3 hours instead of 15 hours for k = 500 classes)
34
[Slide thumbnail from N. Tremblay, Graph signal processing for clustering, Rennes, 13 January 2016: "What's the point of using a graph ?" N points in d = 2 dimensions; result with k-means (k = 2) vs. result after creating a graph from the N points' interdistances and running the spectral clustering algorithm (k = 2).]
Complexity: $O(k^2 \log^2 k + N(\log N + k))$ vs. $O(k^2 N)$
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
Recent / ongoing work / challenges
When is information preserved with sketches / projections ? Bourrier & al, Fundamental perf. limits for ideal decoders in high-dimensional linear inverse problems. IEEE Transactions on Information Theory, 2014
Notion of Instance Optimal Decoders = uniform guarantees; fundamental role of a general Restricted Isometry Property
How to reconstruct: algorithm / decoder ? Traonmilin & G., Stable recovery of low-dimensional cones in Hilbert spaces - One RIP to rule them all. ACHA 2016
RIP guarantees for general (convex & nonconvex) regularizers
How to (maximally) reduce dimension? [Dirksen 2014]: given a random sub-gaussian linear form. Puy & al, Recipes for stable linear embeddings from Hilbert spaces to ℝ^m, arXiv:1509.06947
Role of covering dimension / Gaussian width of normalized secant set
What is the achievable compression for learning tasks ? Compressive statistical learning, work in progress with G. Blanchard, N. Keriven, Y. Traonmilin
Number of random moments = “intrinsic dimension” of PCA, k-means, Dictionary Learning … Statistical learning: risk minimization + generalization to future samples with same distribution
35
Guarantees ?
$\Delta(y) := \operatorname*{arg\,min}_{x \in \mathcal{H}} f(x) \ \text{s.t.}\ \|Mx - y\| \le \epsilon$
R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016
THANKS
Lisbon, Portugal June 5-8, 2017
SPARS 2017 Signal Processing with Adaptive Sparse Structured Representations
Submission deadline: December 12, 2016 Notification of acceptance: March 27, 2017 Summer School: May 31-June 2, 2017 (tbc) Workshop: June 5-8, 2017
spars2017.lx.it.pt