Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees...
Transcript of Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees...
![Page 1: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/1.jpg)
Universite des ActuairesMachine-Learning pour les donnees massives:algorithmes randomises, en ligne et distribues
Stephan Clemencon
Institut Mines Telecom - Telecom ParisTech
July 7, 2014
![Page 2: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/2.jpg)
Agenda
I Technologies Big Data: contexte et opportunites
I Machine-Learning: un bref tour d’horizon
I Les defis des applications du ML dans l’industrie et lesservices:
Vers une generalisation?
![Page 3: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/3.jpg)
”Big Data” - Le contexte
Une accumulation de donnees massives dans de nombreuxdomaines:
I Biologie/Medecine (genomique, metabolomique, essaiscliniques, imagerie, etc.
I Grande distribution, marketing (CRM), e-commerce
I Moteurs de recherche internet (contenu multimedia)
I Reseaux sociaux (Facebook, Tweeter, ...)
I Banque/Finance (risque de marche/liquidite, acces au credit)
I Securite (ex: biometrie, videosurveillance)
I Administrations (Sante Publique, Douanes)
I Risques operationnels
![Page 4: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/4.jpg)
”Big Data” - Le contexte
Un deluge de donnees qui rend inoperant:
I les outils basiques deI stockage de donneesI gestion de base de donnees (MySQL)
I le pretraitement reposant sur l’expertise humaineI indexation, analyse semantiqueI modelisationI intelligence decisionnelle
![Page 5: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/5.jpg)
”Big Data” - Le contexteUne multitude de briques technologiques et de services disponiblespour:
I La parallelisation massive (Velocity)
I Le calcul distribue (Volume)
I La gestion de donnees sans schema predefini (Variety)
parmi lesquels:
I Le modele de programmation MapReduce: calculsparallelises/distribuees
I Framework Hadoop
I NoSQL: SGBD Cassandra, MongoDB, bases de donneesorientees graphe, moteur de recherche Elasticsearch, etc.
I Clouds: infrastructures, plate-formes, logiciels as a Service
promus par Google, Amazon, Facebook, etc.
![Page 6: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/6.jpg)
”Big Data” - Les opportunites
Des avancees spectaculaires pour
I la collecte et le stockage (distribue) des donnees
I la recherche automatique d’objets, de contenu
I le partage de donnees peu structurees
Le Big Data: un moteur pour la technologie, la science,l’economie
I Moteurs de recherche, moteurs de recommandation
I Maintenance predictive
I Marketing viral a travers les reseaux sociaux
I Detection des fraudes
I Medecine individualisee
I Publicite en ligne (retargeting)
![Page 7: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/7.jpg)
”Big Data” - Les opportunites
Ubiquite
De nombreux secteurs d’activite sont concernes:
I (e-) Commerce
I CRM
I Sante
I Defense, renseignement (e.g. cybersecurite, biometrie)
I Banque/Finance
I Transports ”intelligents”
I etc.
![Page 8: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/8.jpg)
Big Data - Recherche
Afin d’exploiter les donnees (prediction, interpetation),developper des technologies mathematiques permettant deresoudre les problemes computationnels lies:
I aux contraintes du quasi-temps reel! apprentissage automatique sequentiel (”on-line”) 6= batch,par renforcement
I au caractere distribue des donnees/ressources! apprentissage automatique distribue
I a la volumetrie des donnees! impact des techniques de sondages sur la performance desalgorithmes
![Page 9: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/9.jpg)
”Big Data” - RechercheDes techniques de visualisation, representation de donneescomplexes
I Graphes (evolutifs) - clustering, graph-mining
I Image, audio, video - filtrage, compression
I Donnees textuelles (e.g. page web, tweet)
Domaines
I Probabilite, Statistique
I Machine-Learning
I Optimisation
I Traitement du signal et de l’image
I Analyse Harmonique Computationnelle
I Analyse semantique
I etc.
![Page 10: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/10.jpg)
Goals of Statistical Learning
I Statistical issues cast as M-estimation problems:I ClassificationI RegressionI Density level set estimationI Compression, sparse representationI ... and their variants
I Minimal assumptions on the distribution
I Build realistic M-estimators for special criteriaI Questions
I Theory: optimal elements, consistency, non-asymptotic excessrisk bounds, fast rates of convergence, oracle inequalities
I Practice: numerical optimization, convexification,randomization, relaxation, constraints (distributedarchitectures, real-time, memory, etc.)
![Page 11: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/11.jpg)
Main Example: Classification (Pattern Recognition)
I (X ,Y ) random pair with unknown distribution P
I X 2 X observation vector
I Y 2 {�1,+1} binary label/class
I A posteriori probability ⇠ regression function
8x 2 X , ⌘(x) = {Y = 1 | X = x}I g : X ! {�1,+1} classifier
I Performance measure = classification error
L(g) = g(X ) 6= Y ! ming
I Solution: Bayes rule
8x 2 X , g⇤(x) = 2{⌘(x) > 1/2}� 1
I Bayes error L⇤ = L(g⇤)
![Page 12: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/12.jpg)
Main Paradigm - Empirical Risk Minimization
I Sample (X1
,Y1
), . . . , (Xn,Yn) with i.i.d. copies of(X ,Y ),class of classifiers
I Empirical Risk Minimization principle
bgn = Argming2
Ln(g) :=1
n
nX
i=1
{g(Xi ) 6= Yi}
I Best classifier in the class
g = Argming2
L(g)
I Concentration inequality
With probability 1� �:
supg2
| Ln(g)� L(g) | C
r
V
n+
r
2 log(1/�)
n
![Page 13: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/13.jpg)
Machine-Learning - Achievements
I Numerous applications:I Supervised anomaly detectionI Handwritten digit recognitionI Face recognitionI Medical diagnosisI Credit-risk screeningI CRMI Speech recognitionI Monitoring of complex systemsI etc.
I Many ”o↵-the-shelf” methodsI Neural NetworksI Support Vector MachinesI BoostingI Vector quantizationI etc.
![Page 14: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/14.jpg)
Machine-Learning - Challenges
I Ongoing intense research activity, motivated by
I Need for increasing performanceI Evolution of computing environments (data centers, clouds,
HDFS)I New applications/problems: recommending systems, search
engines, medical imagery, yield management etc.I The Big Data era
I Mathematical/computational challenges
I Volume (data deluge): ubiquity of sensors, high dimension,distributed storage/processing systems
I Variety (of data structures): text, graphs, images, signalsI Velocity (real-time): on-line prediction, evolutionary
environment, reinforcement learning strategies (exploration vsexploitation)
![Page 15: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/15.jpg)
Statistical Learning - Milestones
I The 30’s - Fisher’s (parametric/Gaussian) statistics
I Linear Discriminant AnalysisI Linear (logistic) regressionI PCA, . . .
![Page 16: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/16.jpg)
Statistical Learning - Milestones
I The 60’s and 70’s - F. Rosenblatt’s perceptron & VC theory
I First ”machine-learning” algorithm (linear binary classification)I Inspired by congnitive sciencesI Convexification, one-pass/on-line (stochastic gradient descent)I Relaxation, large margin linear classifiers, structural ERM
![Page 17: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/17.jpg)
Statistical Learning - Milestones
I The 80’s - Neural Networks & Decision Trees
I Artificial Intelligence ”A theory of learnability” Valiant ’84I The Backpropagation algorithmI The CART algorithm (’84)
![Page 18: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/18.jpg)
Statistical Learning - Milestones
I From the 90’s - Kernels & Boosting
I Kernel trick: SVM, nonlinear PCA, . . .
I AdaBoost (’95)I Lasso, compressed sensingI A comprehensive theory beyond VC conceptsI Rebirth of Q-learning
![Page 19: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/19.jpg)
Applications
Supervised Learning - Pattern Recognition/Regression
I Data with labels, e.g. (Xi ,Yi ) 2 Rd ⇥ {�1,+1}, i = 1, . . . , n.Learn to predict Y based on X
I Example: in Quality Control, X features of the product and/or
production factors, Y = +1 if ”defect” and Y = �1 otherwise.
Build a decision rule C minimizing L(C ) = P{Y 6= C (X )}
![Page 20: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/20.jpg)
Applications
Supervised Learning - Scoring
I Data with labels, e.g. (Xi ,Yi ) 2 Rd ⇥ {�1,+1}, i = 1, . . . , n.
Learn to rank all possible observations X in the same order as
that induced by P{Y = +1 | X} through a scoring functions(X )
![Page 21: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/21.jpg)
Applications
Supervised Learning - Image Recognition
I Objects are assigned to data (pixels), e.g. biometrics
I Goal: learn to assign objects to new data
![Page 22: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/22.jpg)
Empirical Risk Minimization and Stochastic Approximation
I Most learning problems consists of minimizing a functional
L(f ) = E[ (Z , f )]
where Z is the observation, f a decision rule candidate
I In general, a stochastic approximation inductive method mustbe implemented
ft+1
= ft � ⇢tdrf L(ft),
where drf L is a statistical estimate of L’s gradient based ontraining data Z
1
, . . . , Zn
![Page 23: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/23.jpg)
Empirical Risk Minimization and Stochastic Approximation
I Popular algorithms are based on these principles
I Examples: Logit, Neural Networks, linear SVM, etc.
I Computational advantages but too rigid (underfitting)
![Page 24: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/24.jpg)
Kernel Trick
I Apply a simple algorithm but... in a transformed space
K (x , x 0) = h�(x),�(x 0)i
I Examples: Nonlinear SVM, Kernel PCA, SVR
I Kernels for images, text data, biological sequences, etc.
![Page 25: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/25.jpg)
Greedy Algorithms
I Recursive methods exploring exhaustively a structuredspace at each step
I Examples: CART, projection pursuit, matching pursuit, etc.
I Highly interpretable/visualizable but poor performance
![Page 26: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/26.jpg)
No Free Lunch
![Page 27: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/27.jpg)
Ensemble learning
Heuristic: combine predictions output by weak decision rulesAmit & Geman (’97) for image recognition
I Example committee-based binary classification:predict Y 2 {�1,+1} based on X
Cagg (X ) = sgn
MX
m=1
!mCm(X )
!
,
where !m controls the impact of the vote of weak rule Cm
I The Bootstrap Aggregating method - Breiman (’96)
The Cm’s re learnt from bootstrap versions of the trainingdata and !m ⌘ 1) Bagging reduces instability of prediction rules
![Page 28: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/28.jpg)
Ensemble learning
I The Adaptive Boosting algorithm for binary classificationFreund & Shapire (’95) - Slow learning
I AdaBoost can be interpreted as a forward additivestagewise modelling strategy to minimize a convexifiedversion of the risk
E[exp(�YMX
m=1
↵mCm(X ))]
I A serious competitor: Random Forest, Breiman (’01)Bagging applied to randomized decision trees
I Boosting methods and Random Forests outperform oldermethods in most cases
![Page 29: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/29.jpg)
Applications
Unsupervised Learning - Anomaly/Novelty Detection
I Data with no labels, e.g. Xi 2 Rd , i = 1, . . . , n
I Example: monitoring of complex systems,e.g. aircraft systems, fraud detection, predictive maintenance,cybersecurity
I Detect abnormal observations - Rarity replaces labeling
I 1-class SVM: bG↵ = {x 2 X :Pn
i=1
↵iK (x ,Xi ) � tµ}
![Page 30: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/30.jpg)
Applications
Unsupervised Learning - Anomaly/Novelty Ranking
I Data with no labels, e.g. Xi 2 Rd , i = 1, . . . , n
I Rank data by degree of novelty/abnormality
I Distributed Fleet Monitoring: check the 5 % the mostabnormal, then the next 5 %, etc.
10
Optimal
Good
Bad
(Extremal behavior)
MASS
VOLUME
(Modes)
![Page 31: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/31.jpg)
Feature Selection
A quick algorithm has been proposed by Efron et al (2002) tocompute
b�n(�) = Argmin�
"
1
n
nX
i=1
(y i � xTi �)2 + �
pX
i=1
|�i |#
,
for all � > 0.As � decreases one obtains more and more active (i.e. non zero)coe�cients. The regularization path
� 7! b�n(�)
thus defines as � # 0 a sequence of models with increasingdimension.
![Page 32: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/32.jpg)
Variable selectionUnder `1 constraints, the points that are the `2-furthest away fromthe origin are on the axes (zero coe�cient):
−1.0 −0.5 0.0 0.5 1.0
−1.0
−0.5
0.0
0.5
1.0
l1 unit ballcontaining l2 ballcontained l2 ball
![Page 33: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/33.jpg)
Industrial example (Renault Technocentre)Regularization path: �1,2, . . . ,�1,p VS �
0 0.5 1 1.5 2 2.5 3−1.2
−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
v1v2v3v4v5v6v7v8v9v10v11v12v13v14v15
![Page 34: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/34.jpg)
Lasso - L1 penalty
I Many variants: group Lasso, lasso and elastic net
I L1
penalty ensures sparsity
I Compressed sensing: Candes & Tao (’04), Donoho (’04)
I Numerous applications, e.g. matrix completion, recommendersystems
![Page 35: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/35.jpg)
Spectral clustering - Ng et al. (’01)
I Partition the vertices of a graph, clustering
I Graph Laplacian L = D �W , D is the degree matrix and Wis the adjacency/weight matrix
I Spectral Clustering using the normalised versionL = D�1/2LD�1/2
(i) Find k-smallest eigenvectors Vk = (v1
, . . . , vk) of L(ii) Normalise Vk ’s rows: Vk diag(VkV t
k )�1/2Vk
(iii) Cluster rows of Vk with the k-means algorithm
![Page 36: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/36.jpg)
Spectral clustering - Ng et al. (’01)
![Page 37: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/37.jpg)
Spectral clustering - Ng et al. (’01)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
3536
37
38
39
![Page 38: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/38.jpg)
Community Detection
![Page 39: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/39.jpg)
Spectral clustering - Ng et al. (’01)
I Partition the vertices of a graph, clustering
I Graph Laplacian L = D �W , D is the degree matrix and Wis the adjacency/weight matrix
I Spectral Clustering using the normalised versionL = D�1/2LD�1/2
(i) Find k-smallest eigenvectors Vk = (v1
, . . . , vk) of L(ii) Normalise Vk ’s rows: Vk diag(VkV t
k )�1/2Vk
(iii) Cluster rows of Vk with the k-means algorithm
![Page 40: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/40.jpg)
Scaling-up Machine-Learning Algorithms
I ”Smart Randomization/Sampling”
I Massive Parallelization: break a large optimization probleminto smaller problemse.g. Cascade SVM, parallel large-scale feature selection,parallel clustering
I Distributed Optimization
I Many frameworks are available:MapReduce (+In Memory=PLANET, IBM PML, Mahout),DryadLINQ, MADlib, Storm, etc.
![Page 41: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/41.jpg)
How to apply the ERM paradigm to Big Data?
I Suppose that n is too large to evaluate the empirical riskLn(g)
I Common sense: run your preferred learning algorithm using asubsample of ”reasonable” size B << n, e.g. by drawing withreplacement in the original training data set...
I ... but of course, statistical performance is downgraded
1/p
n << 1/p
B
![Page 42: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/42.jpg)
How to apply the ERM paradigm to Big Data?
I Suppose that n is too large to evaluate the empirical riskLn(g)
I Common sense: run your preferred learning algorithm using asubsample of ”reasonable” size B << n, e.g. by drawing withreplacement in the original training data set...
I ... but of course, statistical performance is downgraded
1/p
n << 1/p
B
![Page 43: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/43.jpg)
”Smart sampling”
I Use side information and implement your mini-batch SGDwith a Horvitz-Thompson estimate of the local gradient
I In various situations, the performance criterion is not a basicsample mean statistic any more but a U-statistic
I Examples:
I Clustering: within cluster point scatter related to a partition P2
n(n � 1)
X
i<j
D(Xi ,Xj)X
C2PI{(Xi ,Xj) 2 C2}
I Graph inference (link prediction)I RankingI · · ·
![Page 44: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/44.jpg)
Example: Ranking
I Data with ordinal label:(X
1
,Y1
), . . . , (Xn,Yn) 2�
X ⇥ {1, . . . , K}�⌦n
I Want to: rank X1
, . . . ,Xn through a scoring functions : X ! s.t.
s(X ) and Y tend to increase/decrease together with highprobability
I Quantitative formulation: maximize the criterion
L(s) = P{s(X (1)) < . . . < s(X (k)) | Y (1) = 1, . . . , Y (K) = K}
I Observations: nk i.i.d. copies of X given Y = k,
X (k)
1
, . . . , X (k)
nk
n = n1
+ . . . + nK
![Page 45: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/45.jpg)
Example: Ranking
I Data with ordinal label:(X
1
,Y1
), . . . , (Xn,Yn) 2�
X ⇥ {1, . . . , K}�⌦n
I Want to: rank X1
, . . . ,Xn through a scoring functions : X ! s.t.
s(X ) and Y tend to increase/decrease together with highprobability
I Quantitative formulation: maximize the criterion
L(s) = P{s(X (1)) < . . . < s(X (k)) | Y (1) = 1, . . . , Y (K) = K}
I Observations: nk i.i.d. copies of X given Y = k,
X (k)
1
, . . . , X (k)
nk
n = n1
+ . . . + nK
![Page 46: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/46.jpg)
Example: Ranking
I Data with ordinal label:(X
1
,Y1
), . . . , (Xn,Yn) 2�
X ⇥ {1, . . . , K}�⌦n
I Want to: rank X1
, . . . ,Xn through a scoring functions : X ! s.t.
s(X ) and Y tend to increase/decrease together with highprobability
I Quantitative formulation: maximize the criterion
L(s) = P{s(X (1)) < . . . < s(X (k)) | Y (1) = 1, . . . , Y (K) = K}
I Observations: nk i.i.d. copies of X given Y = k,
X (k)
1
, . . . , X (k)
nk
n = n1
+ . . . + nK
![Page 47: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/47.jpg)
Example: Ranking
I Data with ordinal label:(X
1
,Y1
), . . . , (Xn,Yn) 2�
X ⇥ {1, . . . , K}�⌦n
I Want to: rank X1
, . . . ,Xn through a scoring functions : X ! s.t.
s(X ) and Y tend to increase/decrease together with highprobability
I Quantitative formulation: maximize the criterion
L(s) = P{s(X (1)) < . . . < s(X (k)) | Y (1) = 1, . . . , Y (K) = K}
I Observations: nk i.i.d. copies of X given Y = k,
X (k)
1
, . . . , X (k)
nk
n = n1
+ . . . + nK
![Page 48: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/48.jpg)
Example: Ranking
I A natural empirical counterpart of L(s) is
bLn(s) =
Pn1
i1
=1
· · ·PnKiK=1
In
s(X (1)
i1
) < . . . < s(X (K)
iK)o
n1
⇥ · · ·⇥ nK,
I But the number of terms to be summed is prohibitive!
n1
⇥ . . .⇥ nK
I Maximization of bLn(s) is computationally unfeasible...
![Page 49: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/49.jpg)
Example: Ranking
I A natural empirical counterpart of L(s) is
bLn(s) =
Pn1
i1
=1
· · ·PnKiK=1
In
s(X (1)
i1
) < . . . < s(X (K)
iK)o
n1
⇥ · · ·⇥ nK,
I But the number of terms to be summed is prohibitive!
n1
⇥ . . .⇥ nK
I Maximization of bLn(s) is computationally unfeasible...
![Page 50: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/50.jpg)
Example: Ranking
I A natural empirical counterpart of L(s) is
bLn(s) =
Pn1
i1
=1
· · ·PnKiK=1
In
s(X (1)
i1
) < . . . < s(X (K)
iK)o
n1
⇥ · · ·⇥ nK,
I But the number of terms to be summed is prohibitive!
n1
⇥ . . .⇥ nK
I Maximization of bLn(s) is computationally unfeasible...
![Page 51: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/51.jpg)
Generalized U-statistics
I K � 1 samples and degrees (d1
, . . . , dK ) 2 N⇤K
I (X (k)
1
, . . . , X (k)
nk ), 1 k K , K independent i.i.d. samplesdrawn from Fk(dx) on X k respectively
I Kernel H : X d1
1
⇥ · · ·⇥ X dKK ! R, square integrable w.r.t.
µ = F⌦d1
1
⌦ · · ·⌦ F⌦dKK
![Page 52: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/52.jpg)
Generalized U-statistics
DefinitionThe K -sample U-statistic of degrees (d
1
, . . . , dK ) with kernel H is
Un(H) =
P
I1
. . .P
IKH(X(1)
I1
;X(2)
I2
; . . . ;X(K)
IK)
�n1
d1
�⇥ · · · �nKdK
� ,
whereP
Ikrefers to summation over all
�nkdk
�
subsets
X(k)
Ik= (X (k)
i1
, . . . , X (k)
idk) related to a set Ik of dk indexes
1 i1
< . . . < idk nk
It is said symmetric when H is permutation symmetric in each set
of dk arguments X(k)
Ik.
References: Lee (1990)
![Page 53: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/53.jpg)
Generalized U-statistics
I Unbiased estimator of
✓(H) = E[H(X (1)
1
, . . . , X (1)
d1
, . . . , X (K)
1
, . . . , X (K)
dk)]
with minimum variance
I Asymptotically Gaussian as nk/n! �k > 0 fork = 1, . . . , K
I Its computation requires the summation of
KY
k=1
✓
nk
dk
◆
terms
I K -partite ranking: dk = 1 for 1 k K
Hs(x1
, . . . , xK ) = I {s(x1
) < s(x2
) < · · · < s(xK )}
![Page 54: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/54.jpg)
Incomplete U-statistics
I Replace Un(H) by an incomplete version, involving much lessterms
I Build a set DB of cardinality B built by sampling withreplacement in the set ⇤ of indexes
((i (1)
1
, . . . , i (1)
d1
), . . . , (i (K)
1
, . . . , i (K)
dK))
with 1 i (k)
1
< . . . < i (k)
dk nk , 1 k K
I Compute the Monte-Carlo version based on B terms
eUB(H) =1
B
X
(I1
, ..., IK )2DB
H(X (1)
I1
, . . . , X (K)
IK)
I An incomplete U-statistic is NOT a U-statistic
![Page 55: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/55.jpg)
M-Estimation based on incomplete U-statistics
I Replace the criterion by a tractable incomplete version basedon B = O(n) terms
minH2H
eUB(H)
I This leads to investigate the maximal deviations
supH2H
�
�
�
eUB(H)� Un(H)�
�
�
![Page 56: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/56.jpg)
Main Result
TheoremLet H be a VC major class of bounded symmetric kernels of finiteVC dimension V < +1. Set MH = sup
(H,x)2H⇥X |H(x)|. Then,
(i) Pn
supH2H�
�
�
eUB(H)� Un(H)�
�
�
> ⌘o
2(1 + #⇤)V ⇥ e�B⌘2/M2
H
(ii) for all � 2 (0, 1), with probability at least 1� �, we have:
1
MHsupH2H
�
�
�
eUB(H)� Eh
eUB(H)i
�
�
�
2
r
2V log(1 + )
+
r
log(2/�)
+
r
V log(1 + #⇤) + log(4/�)
B,
where = min{bn1
/d1
c, . . . , bnK/dKc}
![Page 57: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/57.jpg)
Consequences
I Empirical risk sampling with B = O(n) yields a rate bound ofthe order O(
p
log n/n)
I One su↵ers no loss in terms of learning rate, while drasticallyreducing computational cost
![Page 58: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/58.jpg)
Example: Ranking
Empirical ranking performance for SVMrank based on 1%, 5%,10%, 20% and 100% of the ”LETOR 2007” dataset.
![Page 59: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/59.jpg)
Context
Consider a network composed of N agents
IAgents process local data
IAgents cooperate to estimate some global parameter
1/5
![Page 60: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/60.jpg)
A regression example
Data set formed by n samples (X
i
,Yi
) (i = 1 . . . n)I
Y
i
= variable to be explained
IX
i
= explanatory features
Looking for a model.
Linear regression example:
min
x
nX
i=1
kYi
� x
T
X
i
k2
Distributed processing: the problem is separable
min
x
X
v2V
nX
i=1
`(xT
X
i,v ,Yi,v ) + r(x)
[Boyd’11, Agarwal’11]
2/5
![Page 61: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/61.jpg)
A regression example
Data set formed by n samples (X
i
,Yi
) (i = 1 . . . n)I
Y
i
= variable to be explained
IX
i
= explanatory features
Looking for a model.
Linear regression example:
min
x
nX
i=1
`(xT
X
i
,Yi
) + r(x)
Distributed processing: the problem is separable
min
x
X
v2V
nX
i=1
`(xT
X
i,v ,Yi,v ) + r(x)
[Boyd’11, Agarwal’11]
2/5
![Page 62: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/62.jpg)
A regression example
Data set formed by n samples (X
i
,Yi
) (i = 1 . . . n)I
Y
i
= variable to be explained
IX
i
= explanatory features
Looking for a model.
Linear regression example:
min
x
nX
i=1
`(xT
X
i
,Yi
) + r(x)
Distributed processing: the problem is separable
min
x
X
v2V
nX
i=1
`(xT
X
i,v ,Yi,v ) + r(x)
[Boyd’11, Agarwal’11]
2/5
![Page 63: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/63.jpg)
Formally
min
x
X
v2V
f
v
(x)
I G = (V ,E) is the graph modelling the network
If
v
is the cost function of agent v
Di�culty :
Pv
f
v
is nowhere observed.
Methods : from distributed gradient algorithms to advanced proximal methods
Common principle :
1. process local data
2. exchange information with neighbors
3. iterate.
3/5
![Page 64: Université des Actuaires Machine-Learning pour les …...Machine-Learning pour les donn´ees massives: algorithmes randomis´es, en ligne et distribu´es St´ephan Cl´emen¸con Institut](https://reader034.fdocuments.net/reader034/viewer/2022043019/5f3b9dfdc1fe7649d53e3a3a/html5/thumbnails/64.jpg)
Key issues
IDistribute cutting-edge optimization algorithms (eq. primal-dual methods,
fast-admm, etc.)
IInclude stochastic perturbations:
I On-line algorithmsI Asynchronism
IInvestigate specific ML application, e.g. ranking
4/5