Learning with Matrix Factorizations

Nati Srebro
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology

Adviser: Tommi Jaakkola (MIT)
Committee: Alan Willsky (MIT), Tali Tishby (HUJI), Josh Tenenbaum (MIT)

Joint work with Shie Mannor (McGill), Noga Alon (TAU) and Jason Rennie (MIT)
Lots of help from John Barnett, Karen Livescu, Adi Shraibman, Michael Collins, Erik Demaine, Michel Goemans, David Karger and Dmitry Panchenko
Dimensionality Reduction

Low dimensional representation capturing important aspects of high dimensional data: observations → small representation.
– image → varying aspects of the scene
– gene expression levels → description of a biological process
– a user's preferences → characterization of the user
– document → topics

• Compression (mostly to reduce processing time)
• Reconstructing a latent signal
  – biological processes through gene expression
• Capturing structure in a corpus
  – documents, images, etc.
• Prediction: collaborative filtering
Linear Dimensionality Reduction

(figure: y ≈ u1 × v1 + u2 × v2 + u3 × v3)
Linear Dimensionality Reduction

(figure: y ≈ u1 × v1 + u2 × v2 + u3 × v3, over titles)

y: preferences of a specific user (a real-valued preference level for each title)
v1, v2, v3: comic value, dramatic value, violence of each title
u1, u2, u3 (e.g. +1, +2, -1): characteristics of the user
Linear Dimensionality Reduction

(figure: y ≈ u × V', where y is a preference vector over titles)
Linear Dimensionality Reduction

(figure: data Y ≈ U × V', where Y is a users × titles matrix)
Matrix Factorization

(figure: data Y ≈ X = U × V', with X of rank k)

• Non-Negativity [LeeSeung99]
• Stochasticity (convexity) [LeeSeung97] [Barnett04]
• Sparsity
  – Clustering as an extreme (when rows of U are sparse)
• Unconstrained: Low Rank Approximation
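The non-negative variant is commonly fit with the Lee–Seung multiplicative updates. A minimal numpy sketch of those updates for the Frobenius objective (this is illustrative; the function name and defaults are my own, not from the talk):

```python
import numpy as np

def nmf(Y, k, iters=200, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing ||Y - U V'||_F^2.
    Y must be entrywise nonnegative; U is n x k, V is m x k."""
    rng = np.random.default_rng(0)
    n, m = Y.shape
    U = rng.random((n, k)) + eps
    V = rng.random((m, k)) + eps
    for _ in range(iters):
        # Multiplicative updates keep U, V nonnegative and do not
        # increase the Frobenius objective.
        U *= (Y @ V) / (U @ (V.T @ V) + eps)
        V *= (Y.T @ U) / (V @ (U.T @ U) + eps)
    return U, V
```

Because each update is a componentwise rescaling, nonnegativity is preserved automatically, with no projection step.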
Matrix Factorization

(figure: Y = U × V' + Z, with Z i.i.d. Gaussian)

• Additive Gaussian noise → minimize |Y-UV'|Fro

log L(UV'; Y) = ∑ij log P(Yij|UV'ij) = -∑ij (Yij-UV'ij)²/(2σ²) + const
Matrix Factorization

(figure: Y = U × V' + Z, with Zij i.i.d. with density p(Zij))

• Additive Gaussian noise → minimize |Y-UV'|Fro
• General additive noise → minimize ∑ -log p(Yij-UV'ij)
Matrix Factorization

(figure: data Y with Yij ~ p(Yij|UV'ij))

• Additive Gaussian noise → minimize |Y-UV'|Fro
• General additive noise → minimize ∑ -log p(Yij-UV'ij)
• General conditional models → minimize ∑ -log p(Yij|UV'ij)
  – Multiplicative noise
  – Binary observations: Logistic LRA
  – Exponential PCA [Collins+01]
  – Multinomial (pLSA [Hofmann01])
• Other loss functions [Gordon02] → minimize ∑ loss(UV'ij;Yij)
The Aspect Model (pLSA) [Hofmann+99]

Y: document–word co-occurrence counts (documents i = 1..I, words j = 1..J)
X: rank-k matrix of joint probabilities p(i,j), factored through topics t = 1..k as p(i,j) = ∑t p(i,t) p(j|t)

Y|X ~ Multinomial(N, X)
Yij|Xij ~ Binomial(N, Xij)
N = ∑ Yij
Low-Rank Models for Matrices of Counts, Occurrences or Frequencies

                        Mean parameterization:             Natural parameterization:
                        0 ≤ Xij ≤ 1, E[Yij|Xij] = Xij      unconstrained Xij, g(x) = 1/(1+e^(-x))
Independent Bernoulli   P(Yij=1) = Xij                     Logistic Low Rank Approximation (≈ hinge loss)
Independent Binomials   Yij|Xij ~ Bin(N, Xij):             Yij|Xij ~ Bin(N, g(Xij)) [Schein+03]
                        Aspect Model (pLSA) [Hofmann+99]
Multinomial             ≡ NMF if ∑Xij = 1, ≈ NMF [Lee+01]  Sufficient Dimensionality Reduction [Globerson+02]

The natural-parameter models are instances of Exponential PCA [Collins+01]: p(Yij|Xij) ∝ exp(YijXij + F(Yij))
Outline

• Finding Low Rank Approximations
  – Weighted Low Rank Approx: minimize ∑ij Wij(Yij-Xij)²
  – Use WLRA as a basis for other losses / conditional models
• Consistency of Low Rank Approximation
  When more data is available, do we converge to the correct solution? Not always…
• Matrix Factorization for Collaborative Filtering
  – Maximum Margin Matrix Factorization
  – Generalization Error Bounds (Low Rank & MMMF)
Finding Low Rank Approximations

Find rank-k X minimizing ∑ loss(Xij;Yij) (= -log p(Y|X))

• Non-convex:
  – "X is rank-k" is a non-convex constraint
  – ∑ loss((UV')ij;Yij) is not convex in U,V
• rank-k X minimizing ∑(Yij-Xij)²:
  – non-convex, but no (non-global) local minima
  – solution: leading components of the SVD
• For other loss functions, or with missing data:
  cannot use the SVD; local minima; a difficult problem
• Weighted Low Rank Approximation: minimize ∑ Wij(Yij-Xij)²
  with arbitrarily specified weights (part of the input)
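The unweighted squared-loss case is the classical Eckart–Young result: truncating the SVD gives the best rank-k approximation in Frobenius norm. A short numpy sketch (illustrative helper name):

```python
import numpy as np

def low_rank_approx(Y, k):
    """Best rank-k approximation of Y in Frobenius norm (Eckart-Young):
    keep only the k leading components of the SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]
```

With missing entries, arbitrary weights, or non-quadratic losses this shortcut is unavailable, which is what motivates the weighted formulation.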
WLRA: Optimization

J(U,V) = ∑ij Wij (Yij - (UV')ij)²

Alternate: for fixed V, find the optimal U; for fixed U, find the optimal V. (U is n×k, V is m×k.)

J*(V) = minU J(U,V)
∂J*(V)/∂V = -2 (W ⊗ (Y - U*V'))' U*    (⊗ = elementwise product)

Conjugate gradient descent on J*: optimize the km parameters of V instead of k(n+m).

EM approach:
X ← LowRankApprox(W ⊗ Y + (1-W) ⊗ X)
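The EM update can be sketched directly in numpy: fill low-weight entries with the current estimate, then take an unweighted rank-k SVD. This is a minimal sketch assuming weights Wij ∈ [0,1]; the function names are illustrative:

```python
import numpy as np

def svd_rank_k(A, k):
    # Unweighted best rank-k approximation via truncated SVD.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def wlra_em(Y, W, k, iters=100):
    """EM-style weighted low-rank approximation:
    X <- SVD_k(W*Y + (1-W)*X), assuming W entries in [0, 1]."""
    X = np.zeros_like(Y, dtype=float)
    for _ in range(iters):
        X = svd_rank_k(W * Y + (1.0 - W) * X, k)
    return X
```

Each iteration mixes the data where weights are high with the current model where weights are low, so the weighted objective is non-increasing.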
Newton Optimization for Non-Quadratic Loss Functions

minimize ∑ij loss(Xij;Yij), with loss(Xij;Yij) convex in Xij

• Iteratively optimize quadratic approximations of the objective
• Each such quadratic optimization is a weighted low rank approximation
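Concretely, a second-order expansion of the loss at the current Xij yields a WLRA with curvature weights Wij = loss″(Xij) and targets Xij − loss′(Xij)/loss″(Xij). A sketch for the logistic loss, solving each inner WLRA approximately with an EM-style update (function names and iteration counts are illustrative, not from the talk):

```python
import numpy as np

def svd_rank_k(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def wlra(T, W, k, iters=50):
    # EM-style weighted low-rank approximation, weights scaled into [0, 1].
    Wn = W / W.max()
    X = np.zeros_like(T)
    for _ in range(iters):
        X = svd_rank_k(Wn * T + (1.0 - Wn) * X, k)
    return X

def logistic_lra_newton(Y, k, steps=10):
    """Newton optimization of the logistic loss -log p(Y|X), with
    p(Yij=+1|Xij) = 1/(1+exp(-Xij)) and Y in {-1,+1}.  Each Newton
    step is a WLRA with weights loss''(X) and targets X - loss'/loss''."""
    X = np.zeros_like(Y, dtype=float)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X))              # P(Y=+1 | X)
        grad = p - (Y + 1) / 2                    # d/dX of -log p(Y|X)
        hess = np.clip(p * (1 - p), 1e-6, None)   # curvature, in (0, 1/4]
        X = wlra(X - grad / hess, hess, k, iters=20)
    return X
```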
Maximum Likelihood Estimation with Gaussian Mixture Noise

Y (observed) = X (rank k) + Z (noise), with a hidden component C and a mixture of Gaussians {pc, µc, σc}

E step: calculate the posteriors of C
M step: WLRA with weights and adjusted targets
  Wij = ∑c Pr(Cij=c)/σc²
  Aij = Yij - (∑c Pr(Cij=c) µc/σc²)/Wij
Outline

• Finding Low Rank Approximations
  – Weighted Low Rank Approx: minimize ∑ij Wij(Yij-Xij)²
  – Use WLRA as a basis for other losses / conditional models
• Consistency of Low Rank Approximation
  When more data is available, do we converge to the correct solution? Not always…
• Matrix Factorization for Collaborative Filtering
  – Maximum Margin Matrix Factorization
  – Generalization Error Bounds (Low Rank & MMMF)
Asymptotic Behavior of Low Rank Approximation

Y = X + Z   (X: rank-k parameters; Z: random, independent)

A single observation: the number of parameters is linear in the number of observables, so we can never approach correct estimation of the parameters.
What can be estimated is the row-space of X.
Y = U × V' + Z   (Z random, independent)

The number of parameters is linear in the number of observables.
What can be estimated is the row-space of X.
Y = U × V' + Z   (U random; Z random, independent)

What can be estimated is the row-space of X.
![Page 22: Learning with Matrix Factorizations - pdfs.semanticscholar.org fileLearning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts](https://reader031.fdocuments.net/reader031/viewer/2022022805/5ca5176f88c9935a308b9506/html5/thumbnails/22.jpg)
V×
What can be estimated is row-space of X
Multiple samples of random variable y
uyy z= +
Probabilistic PCA [Tipping Bishop 97]

y = u × V' + z,  u ~ Gaussian,  z ~ Gaussian(σ²I)
x = u × V' ~ Gaussian(ΣX), with ΣX of rank k
Estimated parameters: V (equivalently ΣX) and σ².

Maximum likelihood ≡ PCA
Probabilistic PCA [Tipping Bishop 97]:
y = u × V' + z,  u ~ Gaussian,  z ~ Gaussian(σ²I)
estimated parameters: V, σ²

Latent Dirichlet Allocation [Blei Ng Jordan 03]:
y ~ Multinomial(N, u × V'),  u ~ Dirichlet(α)
estimated parameters: V, α
Generative and Non-Generative Low Rank Models

Parametric generative models, e.g. "Probabilistic PCA" (Y = X + Gaussian) and pLSA / Latent Dirichlet Allocation (Y ~ Binom(N,X)): consistency of Maximum Likelihood estimation is guaranteed if the model assumptions hold (both on Y|X and on u).

Non-parametric generative models?
y = u × V' + z,  z ~ Gaussian;  u ~ Pu, nonparametric and unconstrained (a nuisance parameter)
Estimated parameter: the row-space of V, not the entries of V.

A non-parametric model, with estimation of a parametric part of the model.
Maximum Likelihood ≡ "Single Observed Y"
Consistency of ML Estimation

(figure: true V versus estimated V)
Consistency of ML Estimation

Ψ(V; x) = Ez[ maxu log pZ(x + z - uV) ]
(the expected contribution of x to the likelihood of V)

If, for all x, V maximizes Ψ(V; x) iff V spans x, then the ML estimator is consistent for any Pu.

When Z ~ Gaussian, -Ψ(V; x) ∝ E[squared L2 distance of x+z from V] (up to constants), so for i.i.d. Gaussian noise, ML estimation (PCA) of the low-rank subspace is consistent.
Consistency of ML Estimation

Ψ(V; x) = Ez[ maxu log pZ(x + z - uV) ]
(the expected contribution of x to the likelihood of V)

If, for all x, V maximizes Ψ(V; x) iff V spans x, then the ML estimator is consistent for any Pu.
In particular, for x = 0: Ψ(V; 0) must be constant for all V.
Consistency of ML Estimation

Laplace noise: pZ(z[i]) = ½ e^(-|z[i]|)

(figure: -Ψ(V; 0) as a function of the angle θ of V, for θ ∈ [0, 2π]; the value varies between roughly 1 and 1.5, so Ψ(V; 0) is not constant in V)
Consistency of ML Estimation

General conditional model for Y|X, X = UV':
Ψ(V; x) = E[ maxu log pY|X(Y|uV) | x ]
(the expected contribution of x to the likelihood of V)

If, for all x, V maximizes Ψ(V; x) iff V spans x, then the ML estimator is consistent for any Pu.
In particular, Ψ(V; 0) must be constant for all V.
Consistency of ML Estimation

Logistic model: pY|X(Y[i]=1 | x[i]) = 1/(1+e^(-x[i]))
Ψ(V; x) = E[ maxu log pY|X(Y|uV) | x ]

(figure: -Ψ(V; 0) as a function of the angle θ of V, for θ ∈ [0, 2π]; the value is not constant, so ML is not consistent)
Consistent Estimation

• Additive i.i.d. noise Y = X + Z:
  Maximum Likelihood generally not consistent; PCA is consistent (for any noise distribution):
  recover the span of the k leading eigenvectors of ΣY = ΣX + σ²I, with ΣX of rank k.
Consistent Estimation

• Additive i.i.d. noise Y = X + Z:
  Maximum Likelihood generally not consistent; PCA is consistent (for any noise distribution):
  ΣY = ΣX + ΣZ = ΣX + σ²I
  eigenvalues of ΣX: s1, s2, …, sk, 0, 0, …, 0
  eigenvalues of ΣY: s1+σ², s2+σ², …, sk+σ², σ², σ², …, σ²
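This eigenvalue shift is easy to verify numerically: adding σ²I moves every eigenvalue of the signal covariance up by σ² while leaving the leading-k eigenspace (what PCA recovers) unchanged. A minimal sketch with illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, sigma2 = 6, 2, 0.5
G = rng.standard_normal((m, k))
Sigma_X = G @ G.T                        # rank-k signal covariance
Sigma_Y = Sigma_X + sigma2 * np.eye(m)   # add i.i.d. noise variance

ex, Vx = np.linalg.eigh(Sigma_X)
ey, Vy = np.linalg.eigh(Sigma_Y)

# every eigenvalue shifts by sigma^2: s_i + sigma^2, zeros -> sigma^2
assert np.allclose(ey, ex + sigma2)
# the leading-k eigenspace is unchanged (compare orthogonal projectors)
Pk = lambda V: V[:, -k:] @ V[:, -k:].T
assert np.allclose(Pk(Vx), Pk(Vy))
```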
Consistent Estimation

• Additive i.i.d. noise Y = X + Z:
  Maximum Likelihood generally not consistent; PCA is consistent (for any noise distribution)

(figure: distance to the correct subspace versus sample size (number of observed rows, 10 to 10000) for ML and PCA, with noise 0.99 N(0,1) + 0.01 N(0,100))
Consistent Estimation

• Additive i.i.d. noise Y = X + Z:
  Maximum Likelihood generally not consistent; PCA is consistent (for any noise distribution)
• Unbiased noise E[Y|X] = X:
  Maximum Likelihood generally not consistent; PCA not consistent
  – can be corrected by ignoring the diagonal of the covariance
• Exponential PCA (X are natural parameters):
  Maximum Likelihood generally not consistent; covariance methods not consistent; open question
Outline

• Finding Low Rank Approximations
  – Weighted Low Rank Approx: minimize ∑ij Wij(Yij-Xij)²
  – Use WLRA as a basis for other losses / conditional models
• Consistency of Low Rank Approximation
  When more data is available, do we converge to the correct solution? Not always…
• Matrix Factorization for Collaborative Filtering
  – Maximum Margin Matrix Factorization
  – Generalization Error Bounds (Low Rank & MMMF)
Collaborative Filtering

Based on preferences so far, and the preferences of others, predict further preferences.

(figure: a partially filled users × movies matrix of ratings 1–5)

Implicit or explicit preferences? Type of queries?
Matrix Completion

Based on a partially observed matrix, predict the unobserved entries: "Will user i like movie j?"

(figure: the users × movies ratings matrix with unobserved entries marked "?")
Matrix Completion with Matrix Factorization
[Sarwar+00] [Azar+01] [Hofmann04] [Marlin+04]

Fit a factorizable (low-rank) matrix X = UV' to the observed entries:
minimize ∑ loss(Xij;Yij) over the observations.
Use the matrix X to predict the unobserved entries.

(figure: the partially observed ratings matrix ≈ U × V', yielding a complete ±1 prediction matrix)
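One common way to fit X = UV' to the observed entries with squared loss is alternating least squares. This is a sketch of that generic approach (with a small ridge term for stability), not necessarily the procedure of any of the cited papers:

```python
import numpy as np

def als_complete(Y, M, k, iters=30, reg=0.1):
    """Fit X = U @ V.T to the observed entries of Y (boolean mask M,
    True where observed) by ridge-regularized alternating least
    squares; X then predicts the unobserved entries."""
    rng = np.random.default_rng(0)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    I = reg * np.eye(k)
    for _ in range(iters):
        for i in range(n):            # solve each row of U given V
            Vo = V[M[i]]
            U[i] = np.linalg.solve(Vo.T @ Vo + I, Vo.T @ Y[i, M[i]])
        for j in range(m):            # solve each row of V given U
            Uo = U[M[:, j]]
            V[j] = np.linalg.solve(Uo.T @ Uo + I, Uo.T @ Y[M[:, j], j])
    return U @ V.T
```

Each half-step is a set of independent least-squares problems, one per row or column, touching only the observed entries.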
Matrix Completion with Matrix Factorization

When U is fixed, each column is a separate linear classification problem:
• rows of U are feature vectors
• columns of V' are linear classifiers

Fitting both U and V: learning features that work well across all of the classification problems.

(figure: a ±1 label matrix ≈ U × V', with rows of U such as (-1.5, 1.3, 0.4) as features and columns v1…v12 as classifiers)
Max-Margin Matrix Factorization

Instead of bounding the dimensionality of U and V, bound the norms of U and V:
X = U V' with U, V of low norm, Xij = ⟨Ui, Vj⟩
For observed Yij ∈ ±1: Yij Xij ≥ Margin

(figure: a ±1 label matrix ≈ U × V')
Max-Margin Matrix Factorization

For observed Yij ∈ ±1: Yij Xij ≥ Margin, with Xij = ⟨Ui, Vj⟩ and U, V of low norm.
When U is fixed, each column of V is an SVM.

Bound the norms uniformly: (maxi |Ui|²)(maxj |Vj|²) ≤ 1
Bound the norms on average: (∑i |Ui|²)(∑j |Vj|²) ≤ 1
Geometric Interpretation

(figure: rows of U (users) as points and columns of V (items) as separating hyperplanes, under (maxi |Ui|²)(maxj |Vj|²) ≤ 1)
Finding Max-Margin Matrix Factorizations

maximize M subject to Yij Xij ≥ M, X = UV', with either
(maxi |Ui|²)(maxj |Vj|²) ≤ 1   or   (∑i |Ui|²)(∑j |Vj|²) ≤ 1

Unlike rank(X) ≤ k, these are convex constraints!
Finding Max-Margin Matrix Factorizations

The average-norm constraint X = UV', (∑i |Ui|²)(∑j |Vj|²) ≤ 1 corresponds to bounding the trace norm |X|tr = ∑ (singular values of X).

Primal (soft margin):
  minimize tr(A) + tr(B) + c ∑ ξij
  subject to [A X; X' B] p.s.d. and Yij Xij ≥ 1 - ξij

Dual, with a variable Qij for each observed (i,j):
  maximize ∑ Qij
  subject to 0 ≤ Qij ≤ c and ||Q ⊗ Y||2 ≤ 1
  (⊗ = sparse elementwise product, zero for unobserved entries)
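The trace-norm characterization underlying the average-norm constraint, |X|tr = min over factorizations X = UV' of ½(∑i|Ui|² + ∑j|Vj|²), can be checked numerically: the balanced factorization from the SVD attains the minimum, any other factorization only upper-bounds it. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))

trace_norm = np.linalg.svd(X, compute_uv=False).sum()

# balanced factorization from the SVD attains the minimum
U0, s, Vt = np.linalg.svd(X, full_matrices=False)
U = U0 * np.sqrt(s)
V = Vt.T * np.sqrt(s)
assert np.allclose(U @ V.T, X)
assert np.allclose(0.5 * ((U ** 2).sum() + (V ** 2).sum()), trace_norm)

# an unbalanced factorization of the same X only gives an upper bound
U2, V2 = 3.0 * U, V / 3.0
assert np.allclose(U2 @ V2.T, X)
assert 0.5 * ((U2 ** 2).sum() + (V2 ** 2).sum()) >= trace_norm
```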
Finding Max-Margin Matrix Factorizations

• A semi-definite program with a sparse dual (one variable Qij per observed (i,j)): limited by the number of observations, not the matrix size (for both the average-norm and the max-norm)
• Current implementation: CSDP (an off-the-shelf solver), up to 30k observations (e.g. 1000×1000, 3% observed)
• For large-scale problems: updates on the dual alone?
Loss Functions for Rankings

(figure: loss(Xij;Yij) as a function of Xij for a binary label Yij = +1)
Loss Functions for Rankings

(figure: loss(Xij;Yij) as a function of Xij for ratings Yij = 1…7, with thresholds θ1 … θ6)
Loss Functions for Rankings

(figure: the immediate-threshold loss [Shashua Levin 03] and the all-threshold loss for Yij = 3, with thresholds θ1 … θ6)

• The all-threshold loss is a bound on the absolute rank-difference
• For both loss functions: learn per-user θ's
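The two losses can be sketched for a single entry as follows (a minimal sketch; `theta` is assumed to hold the R−1 per-user thresholds separating ratings 1..R, and the hinge form is one common choice):

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def immediate_threshold(x, y, theta):
    """Hinge penalties only at the two thresholds bracketing rating y."""
    loss = 0.0
    if y >= 2:                    # x must lie above the threshold below y
        loss += hinge(x - theta[y - 2])
    if y <= len(theta):           # x must lie below the threshold above y
        loss += hinge(theta[y - 1] - x)
    return loss

def all_threshold(x, y, theta):
    """Hinge penalties at every threshold, each required to be on the
    correct side; this sum bounds the absolute rank-difference."""
    loss = 0.0
    for r, t in enumerate(theta, start=1):   # threshold between r and r+1
        s = +1 if y > r else -1              # required side of threshold r
        loss += hinge(s * (x - t))
    return loss
```

Every threshold on the wrong side of x contributes a hinge of at least 1, which is why the all-threshold sum dominates the absolute rank-difference.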
Experimental Results on MovieLens Subset

100 users × 100 movies subset of MovieLens; 3515 training ratings, 3515 test ratings.

                   all-threshold MMMF   immediate-threshold MMMF   K-medians K=2   Rank-1   Rank-2
Rank Difference    0.670                0.715                      0.674           0.698    0.714
Zero One Error     0.553                0.542                      0.558           0.559    0.553
Outline

• Finding Low Rank Approximations
  – Weighted Low Rank Approx: minimize ∑ij Wij(Yij-Xij)²
  – Use WLRA as a basis for other losses / conditional models
• Consistency of Low Rank Approximation
  When more data is available, do we converge to the correct solution? Not always…
• Matrix Factorization for Collaborative Filtering
  – Maximum Margin Matrix Factorization
  – Generalization Error Bounds (Low Rank & MMMF)
![Page 55: Learning with Matrix Factorizations - pdfs.semanticscholar.org fileLearning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts](https://reader031.fdocuments.net/reader031/viewer/2022022805/5ca5176f88c9935a308b9506/html5/thumbnails/55.jpg)
Generalization Error Bounds
generalization error: D(X;Y) = ∑ij loss(Xij;Yij) / nm

Prior work assuming a low-rank structure (eigengap):
– asymptotic behavior [Azar+01]
– sample complexity, query strategy [Drineas+02]

[Illustration: a matrix X approximating a ±1 target matrix Y that is unknown and assumption-free]
![Page 56: Learning with Matrix Factorizations - pdfs.semanticscholar.org fileLearning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts](https://reader031.fdocuments.net/reader031/viewer/2022022805/5ca5176f88c9935a308b9506/html5/thumbnails/56.jpg)
Generalization Error Bounds
generalization error: D(X;Y) = ∑ij loss(Xij;Yij) / nm
empirical error: DS(X;Y) = ∑ij∈S loss(Xij;Yij) / |S|

[Illustration: a random training set S of observed entries from an unknown, assumption-free ±1 matrix Y, approximated by X]

∀Y  PrS( ∀ rank-k X  D(X;Y) < DS(X;Y) + ε ) > 1 − δ

(S plays the role of the training set — random; Y of the source distribution — unknown, assumption-free; X of the hypothesis)
![Page 57: Learning with Matrix Factorizations - pdfs.semanticscholar.org fileLearning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts](https://reader031.fdocuments.net/reader031/viewer/2022022805/5ca5176f88c9935a308b9506/html5/thumbnails/57.jpg)
Generalization Error Bounds

0/1 loss: loss(Xij;Yij) = 1 when sign(Xij) ≠ Yij

generalization error: D(X;Y) = ∑ij loss(Xij;Yij) / nm
empirical error: DS(X;Y) = ∑ij∈S loss(Xij;Yij) / |S|

For a particular X, Y, over a random training entry, loss(Xij;Yij) ∼ Bernoulli(D(X;Y)), so
Pr( DS(X;Y) < D(X;Y) − ε ) < e^(−2|S|ε²)

Union bound over all possible sign configurations X of rank-k matrices:
∀Y  PrS( ∀X  D(X;Y) < DS(X;Y) + ε ) > 1 − δ   with
ε = √( (k(n+m) log(8em/k) + log(1/δ)) / (2|S|) )
where k(n+m) log(8em/k) is the log of the number of possible Xs.
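The bound ε = √( (k(n+m) log(8em/k) + log(1/δ)) / (2|S|) ) is easy to evaluate numerically; the sketch below is mine (assuming natural log and m ≥ n, with an illustrative function name).

```python
import math

def rank_k_eps(n, m, k, s, delta):
    """Union-bound generalization gap for rank-k n x m matrices,
    with s observed entries and confidence 1 - delta."""
    log_configs = k * (n + m) * math.log(8 * math.e * m / k)
    return math.sqrt((log_configs + math.log(1 / delta)) / (2 * s))
```

As expected, the gap shrinks as the number of observed entries |S| grows and widens with the rank k.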
![Page 58: Learning with Matrix Factorizations - pdfs.semanticscholar.org fileLearning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts](https://reader031.fdocuments.net/reader031/viewer/2022022805/5ca5176f88c9935a308b9506/html5/thumbnails/58.jpg)
Number of Sign Configurations of Rank-k Matrices
[Illustration: an n×m matrix X (here 15×5) factored as U (15×2) times V (2×5), and its sign pattern]

Xi,j = ∑r Ui,r Vr,j — the entries of X are nm polynomials of degree 2 in the k(n+m) variables of U (nk variables) and V (mk variables).

Warren (1968): the number of connected components of { x | ∀i Pi(x) ≠ 0 } is at most
( 4e · (degree) · (#polys) / (#variables) )^(#variables)

[Illustration: curves P1(x1,x2)=0, P2(x1,x2)=0, P3(x1,x2)=0 partitioning the plane into 8 sign-pattern regions]

Based on [Alon95]
![Page 59: Learning with Matrix Factorizations - pdfs.semanticscholar.org fileLearning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts](https://reader031.fdocuments.net/reader031/viewer/2022022805/5ca5176f88c9935a308b9506/html5/thumbnails/59.jpg)
Number of Sign Configurations of Rank-k Matrices
Applying Warren’s theorem to the nm degree-2 polynomials Xi,j = ∑r Ui,r Vr,j in the k(n+m) variables of U and V, the number of sign configurations of n×m rank-k matrices is at most
( 4e · 2 · nm / (k(n+m)) )^(k(n+m)) = ( 8enm / (k(n+m)) )^(k(n+m))

Based on [Alon95]
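To see how much this buys, one can compare the log of Warren’s bound (8enm/(k(n+m)))^(k(n+m)) against the log of all 2^(nm) sign matrices; the helper below is my own sketch.

```python
import math

def log_sign_configs(n, m, k):
    """Log of Warren's bound on the number of sign configurations
    of n x m rank-k matrices."""
    d = k * (n + m)               # number of variables in U and V
    return d * math.log(8 * math.e * n * m / d)
```

For k much smaller than n and m this is far below nm·log 2, which is what makes the union bound over sign configurations nontrivial.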
![Page 60: Learning with Matrix Factorizations - pdfs.semanticscholar.org fileLearning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts](https://reader031.fdocuments.net/reader031/viewer/2022022805/5ca5176f88c9935a308b9506/html5/thumbnails/60.jpg)
Generalization Error Bounds: Low Rank Matrix Factorization

generalization error: D(X;Y) = ∑ij loss(Xij;Yij) / nm
empirical error: DS(X;Y) = ∑ij∈S loss(Xij;Yij) / |S|

∀Y  PrS( ∀X of rank k  D(X;Y) < DS(X;Y) + ε ) > 1 − δ

0/1 loss (loss(Xij;Yij) = 1 when sign(Xij) ≠ Yij):
ε = √( (k(n+m) log(8em/k) + log(1/δ)) / (2|S|) )

any loss with loss(Xij;Yij) ≤ 1 (by bounding the pseudodimension):
ε = √( 6 (k(n+m) log(8em/k) log(|S|/(k(n+m))) + log(1/δ)) / |S| )
![Page 61: Learning with Matrix Factorizations - pdfs.semanticscholar.org fileLearning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts](https://reader031.fdocuments.net/reader031/viewer/2022022805/5ca5176f88c9935a308b9506/html5/thumbnails/61.jpg)
Generalization Error Bounds: Large Margin Matrix Factorization

generalization error: D(X;Y) = ∑ij loss(Xij;Yij) / nm, with 0/1 loss loss(Xij;Yij) = 1 when XijYij < 0
empirical margin error: DS(X;Y) = ∑ij∈S loss1(Xij;Yij) / |S|, with margin loss loss1(Xij;Yij) = 1 when XijYij < 1

∀Y  PrS( ∀X  D(X;Y) < DS(X;Y) + ε ) > 1 − δ

average-norm bound, (∑i |Ui|²/n)(∑j |Vj|²/m) ≤ R²:
ε = K R ⁴√(ln m) √( (n+m)/|S| ) + √( log(1/δ)/|S| )
(K is a universal constant from [Seginer00]’s bound on the spectral norm of a random matrix)

max-norm bound, (maxi |Ui|²)(maxj |Vj|²) ≤ R²:
ε = 12 R √( (n+m)/|S| ) + √( log(1/δ)/|S| )
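A max-norm gap of the form ε = 12R√((n+m)/|S|) + √(log(1/δ)/|S|) depends only on the dimensions and the norm bound R, not on any rank parameter; the quick evaluation below is my own sketch of that expression, with an illustrative function name.

```python
import math

def max_norm_eps(n, m, s, r, delta):
    """Max-norm generalization gap: 12*R*sqrt((n+m)/|S|) + sqrt(log(1/delta)/|S|)."""
    return 12 * r * math.sqrt((n + m) / s) + math.sqrt(math.log(1 / delta) / s)
```

Note the gap scales linearly with the margin-complexity bound R and, as with the rank-based bound, shrinks as |S| grows.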
![Page 62: Learning with Matrix Factorizations - pdfs.semanticscholar.org fileLearning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts](https://reader031.fdocuments.net/reader031/viewer/2022022805/5ca5176f88c9935a308b9506/html5/thumbnails/62.jpg)
Maximum Margin Matrix Factorization as a Convex Combination of Classifiers

{ UV | (∑i |Ui|²)(∑j |Vj|²) ≤ 1 } = convex-hull( { uv’ | u ∈ Rⁿ, v ∈ Rᵐ, |u| = |v| = 1 } )

conv( { uv’ | u ∈ {±1}ⁿ, v ∈ {±1}ᵐ } ) ⊂ { UV | (maxi |Ui|²)(maxj |Vj|²) ≤ 1 }
  ⊂ 2·conv( { uv’ | u ∈ {±1}ⁿ, v ∈ {±1}ᵐ } )

(the second inclusion is Grothendieck’s inequality)
![Page 63: Learning with Matrix Factorizations - pdfs.semanticscholar.org fileLearning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts](https://reader031.fdocuments.net/reader031/viewer/2022022805/5ca5176f88c9935a308b9506/html5/thumbnails/63.jpg)
Summary
• Finding Low Rank Approximations
  – Weighted Low Rank Approximations
  – Basis for other loss functions: Newton, Gaussian mixtures
• Consistency of Low Rank Approximation
  – ML for popular low-rank models is not consistent!
  – PCA consistent for additive noise; diagonal-ignoring for unbiased noise
  – Efficient estimators?
  – Consistent estimators for Exponential-PCA?
• Maximum Margin Matrix Factorization
  – Correspondence with large-margin linear classification
  – Sparse SDPs for both average-norm and max-norm formulations
  – Direct optimization of the dual would enable large-scale applications
• Generalization Error Bounds for Collaborative Prediction
  – First “assumption free” bounds for matrix completion
  – Both for Low-Rank and for Max-Margin
  – Observation process?
![Page 64: Learning with Matrix Factorizations - pdfs.semanticscholar.org fileLearning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts](https://reader031.fdocuments.net/reader031/viewer/2022022805/5ca5176f88c9935a308b9506/html5/thumbnails/64.jpg)
Average February Temperature (centigrade)

[Bar chart on a −20 °C to 30 °C scale: average February temperatures for Tel-Aviv, Haifa (Technion), Boston, and Toronto, with one bar marked “????”]