Thesis Defense
Transcript of Thesis Defense
![Page 1: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/1.jpg)
Carnegie Mellon
Thesis Defense
Joseph K. Bradley
Learning Large-Scale Conditional Random Fields
Committee: Carlos Guestrin (U. of Washington, Chair), Tom Mitchell, John Lafferty (U. of Chicago), Andrew McCallum (U. of Massachusetts at Amherst)
1 / 18 / 2013
![Page 2: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/2.jpg)
Modeling Distributions
2
Goal: Model distribution P(X) over random variables X. E.g.: model the life of a grad student.
X2: deadline?
X1: losing sleep?
X3: sick?
X4: losing hair?
X5: overeating?
X6: loud roommate?
X7: taking classes?
X8: cold weather?
X9: exercising?
X10: gaining weight?
X11: single?
![Page 3: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/3.jpg)
Modeling Distributions
3
X2: deadline?
X1: losing sleep?
X5: overeating?
X7: taking classes?
= P( losing sleep, overeating | deadline, taking classes )
Goal: Model distribution P(X) over random variables X. E.g.: model the life of a grad student.
![Page 4: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/4.jpg)
Markov Random Fields (MRFs)
4
X2: deadline?
X1: losing sleep?
X3: sick?
X4: losing hair?
X5: overeating?
X6: loud roommate?
X7: taking classes?
X8: cold weather?
X9: exercising?
X10: gaining weight?
X11: single?
Goal: Model distribution P(X) over random variables X. E.g.: model the life of a grad student.
![Page 5: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/5.jpg)
Markov Random Fields (MRFs)
5
X2
X1
X3
X4
X5
X6
X7
X8
X9
X10
X11
graphical structure
factor (parameters)
Goal: Model distribution P(X) over random variables X
![Page 6: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/6.jpg)
Conditional Random Fields (CRFs)
6
X2
Y1
Y3
Y4
Y5
X1
X3
X4
X5
X6
Y2
MRFs model P(X); CRFs model P(Y|X) (Lafferty et al., 2001).
CRFs do not model P(X) and have simpler structure (over Y only).
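As a hedged reconstruction of the model family named on this slide (a standard log-linear CRF in the sense of Lafferty et al., 2001; the factor index c and feature functions φc are assumed notation, not taken from the slide):

$$
P_\theta(Y \mid X) \;=\; \frac{1}{Z(X,\theta)}\,\exp\!\Big(\sum_{c}\theta_c^\top\,\phi_c(Y_c, X)\Big),
\qquad
Z(X,\theta) \;=\; \sum_{y}\exp\!\Big(\sum_{c}\theta_c^\top\,\phi_c(y_c, X)\Big).
$$

The partition function Z depends only on X, so the graphical structure lives over Y alone, matching the "simpler structure (over Y only)" point above.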
![Page 7: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/7.jpg)
MRFs & CRFs
7
Benefits
• Principled statistical and computational framework
• Large body of literature

Applications
• Natural language processing (e.g., Lafferty et al., 2001)
• Vision (e.g., Tappen et al., 2007)
• Activity recognition (e.g., Vail et al., 2007)
• Medical applications (e.g., Schmidt et al., 2008)
• ...
![Page 8: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/8.jpg)
Challenges
8
Goal: Given data, learn CRF structure and parameters.
X2
Y1
Y3
Y4
Y5
X1
X5
X6Y2
Many learning methods require inference, i.e., answering queries P(A|B)
NP-hard in general (Srebro, 2003)
Big structured optimization problem
NP-hard to approximate (Roth, 1996)
Approximations often lack strong guarantees.
![Page 9: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/9.jpg)
Thesis Statement
CRFs offer statistical and computational advantages, but traditional learning methods are often impractical for large problems.
We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.
9
![Page 10: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/10.jpg)
Outline
10
Scaling core methods:
• Parameter Learning: learning without intractable inference
• Structure Learning: learning tractable structures
Parallel scaling:
• Parallel Regression: multicore sparse regression
(Parameter and structure learning are solved via parallel regression.)
![Page 11: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/11.jpg)
Outline
11
Scaling core methods: Parameter Learning (learning without intractable inference).
![Page 12: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/12.jpg)
Log-linear MRFs
12
X2
X1
X3
X4
X5
X6
X7
X8
X9
X10
X11
Goal: Model distribution P(X) over random variables X
Parameters, features. All results generalize to CRFs.
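The parameter/feature formula on this slide is an image; as a hedged sketch of the standard log-linear MRF it refers to (parameters θ, features φ; the clique index c is assumed notation):

$$
P_\theta(X) \;=\; \frac{1}{Z(\theta)}\,\exp\!\Big(\sum_{c}\theta_c^\top\,\phi_c(X_c)\Big),
\qquad
Z(\theta) \;=\; \sum_{x}\exp\!\Big(\sum_{c}\theta_c^\top\,\phi_c(x_c)\Big).
$$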
![Page 13: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/13.jpg)
Parameter Learning: MLE
13
Parameter Learning: Given structure Φ and samples from Pθ*(X), learn parameters θ.
Traditional method: max-likelihood estimation (MLE). Minimize the loss objective.
Gold Standard: MLE is (optimally) statistically efficient.
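The MLE objective on the slide is an image; a hedged sketch, assuming the log-linear form above and n training samples x^(1), ..., x^(n):

$$
\min_\theta\;\; -\frac{1}{n}\sum_{s=1}^{n}\log P_\theta\big(x^{(s)}\big)
\;=\;\min_\theta\;\; \log Z(\theta)\;-\;\frac{1}{n}\sum_{s=1}^{n}\sum_c \theta_c^\top \phi_c\big(x^{(s)}_c\big).
$$

The log Z(θ) term is what forces inference during learning, which is the difficulty discussed on the next slides.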
![Page 14: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/14.jpg)
Parameter Learning: MLE
14
![Page 15: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/15.jpg)
Parameter Learning: MLE
15
MLE requires inference. Provably hard for general MRFs (Roth, 1996).
Inference makes learning hard.
Can we learn without intractable inference?
![Page 16: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/16.jpg)
Parameter Learning: MLE
16
Inference makes learning hard.
Can we learn without intractable inference?
Approximate inference & objectives
• Many works: Hinton (2002), Sutton & McCallum (2005), Wainwright (2006), ...
• Many lack strong theory.
• Almost no guarantees for general MRFs or CRFs.
![Page 17: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/17.jpg)
Our Solution
17
| Method | Sample complexity | Computational complexity | Parallel optimization |
|---|---|---|---|
| Max Likelihood Estimation (MLE) | Optimal | High | Difficult |
| Max Pseudolikelihood Estimation (MPLE) | High | Low | Easy |

PAC learnability for many MRFs!
Bradley, Guestrin (2012)
![Page 18: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/18.jpg)
Our Solution
18
![Page 19: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/19.jpg)
Our Solution
19
| Method | Sample complexity | Computational complexity | Parallel optimization |
|---|---|---|---|
| Max Likelihood Estimation (MLE) | Optimal | High | Difficult |
| Max Pseudolikelihood Estimation (MPLE) | High | Low | Easy |
| Max Composite Likelihood Estimation (MCLE) | Low | Low | Easy |

Choose MCLE structure to optimize trade-offs.
Bradley, Guestrin (2012)
![Page 20: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/20.jpg)
Deriving Pseudolikelihood (MPLE)
20
X2
X1
X3
X4
X5
MLE:
Hard to compute. So replace it!
![Page 21: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/21.jpg)
Deriving Pseudolikelihood (MPLE)
21
X1
MLE:
Estimate via regression:
MPLE:
(Besag, 1975)
Tractable inference!
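A hedged sketch of the pseudolikelihood objective referenced here (Besag, 1975), in the same assumed notation, with each variable regressed on the rest:

$$
\min_\theta\;\; -\frac{1}{n}\sum_{s=1}^{n}\sum_{j}\log P_\theta\big(x^{(s)}_j \,\big|\, x^{(s)}_{\setminus j}\big).
$$

Each conditional normalizes over a single variable X_j, so no intractable partition function appears.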
![Page 22: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/22.jpg)
Pseudolikelihood (MPLE)
22
Pros:
• No intractable inference!
• Consistent estimator

Cons:
• Less statistically efficient than MLE (Liang & Jordan, 2008)
• No PAC bounds

PAC = Probably Approximately Correct (Valiant, 1984)
MPLE:
(Besag, 1975)
![Page 23: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/23.jpg)
Sample Complexity: MLE
23
Our Theorem: Bound on n (# training examples needed), in terms of:
• the number of parameters (length of θ)
• Λmin: min eigenvalue of the Hessian of the loss at θ*
• the parameter error (L1)
• the probability of failure
Recall: MLE requires intractable inference.
![Page 24: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/24.jpg)
Sample Complexity: MPLE
24
Our Theorem: Bound on n (# training examples needed), in terms of:
• the number of parameters (length of θ)
• Λmin: min_i [ min eigenvalue of the Hessian of component i at θ* ]
• the parameter error (L1)
• the probability of failure
Recall: MPLE needs only tractable inference.
PAC learnability for many MRFs!
![Page 25: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/25.jpg)
Sample Complexity: MPLE
25
Our Theorem: Bound on n (# training examples needed)
PAC learnability for many MRFs!
Related Work
Ravikumar et al. (2010)
• Regression Yi ~ X with Ising models
• Basis of our theory
Liang & Jordan (2008)
• Asymptotic analysis of MLE, MPLE
• Our bounds match theirs
Abbeel et al. (2006)
• Only previous method with PAC bounds for high-treewidth MRFs
• We extend their work: extension to CRFs, algorithmic improvements, analysis
• Their method is very similar to MPLE.
![Page 26: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/26.jpg)
Trade-offs: MLE & MPLE
26
Our Theorem: Bound on n (# training examples needed)
Trade-off between sample complexity and computational complexity:
• MLE: larger Λmin, so lower sample complexity; higher computational complexity.
• MPLE: smaller Λmin, so higher sample complexity; lower computational complexity.
![Page 27: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/27.jpg)
Trade-offs: MPLE
27
Joint optimization for MPLE: lower sample complexity.
Disjoint optimization for MPLE: two estimates of each shared parameter, which are averaged; data-parallel.
Trade-off between sample complexity and parallelism.
![Page 28: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/28.jpg)
Synthetic CRFs
28
Factor types: Random, Associative.
Structures: Chains, Stars, Grids.
Factor strength = strength of variable interactions
![Page 29: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/29.jpg)
Predictive Power of Bounds
29
Errors should be ordered: MLE < MPLE < MPLE-disjoint
[Plot: L1 parameter error ε vs. # training examples, for MLE, MPLE, and MPLE-disjoint. Factors: random, fixed strength. Length-4 chains. Lower is better.]
![Page 30: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/30.jpg)
Predictive Power of Bounds
30
MLE & MPLE Sample Complexity:
[Plot: actual ε against the sample-complexity bound, for MLE and MPLE. Factors: random. Length-6 chains. 10,000 training examples.]
![Page 31: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/31.jpg)
Failure Modes of MPLE
31
How do Λmin(MLE) and Λmin(MPLE) vary for different models?
Sample complexity:
• Model diameter
• Factor strength
• Node degree
![Page 32: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/32.jpg)
Λmin: Model Diameter
32
Λmin ratio: MLE/MPLE (higher = MLE better)
[Plot: Λmin ratio vs. model diameter. Factors: associative, fixed strength. Chains.]
Relative MPLE performance is independent of diameter in chains. (Same for random factors.)
![Page 33: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/33.jpg)
Λmin: Factor Strength
33
Λmin ratio: MLE/MPLE (higher = MLE better)
[Plot: Λmin ratio vs. factor strength. Factors: associative. Length-8 chains.]
MPLE performs poorly with strong factors. (Same for random factors, and for star & grid models.)
![Page 34: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/34.jpg)
Λmin: Node Degree
34
Λmin ratio: MLE/MPLE (higher = MLE better)
[Plot: Λmin ratio vs. node degree. Factors: associative, fixed strength. Stars.]
MPLE performs poorly with high-degree nodes. (Same for random factors.)
![Page 35: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/35.jpg)
Failure Modes of MPLE
35
How do Λmin(MLE) and Λmin(MPLE) vary for different models?
Sample complexity:
• Model diameter
• Factor strength
• Node degree
We can often fix this!
![Page 36: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/36.jpg)
Composite Likelihood (MCLE)
36
MLE: Estimate P(Y) all at once
![Page 37: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/37.jpg)
Composite Likelihood (MCLE)
37
MLE: Estimate P(Y) all at once
MPLE: Estimate P(Yi|Y\i) separately
Yi
![Page 38: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/38.jpg)
Composite Likelihood (MCLE)
38
MLE: Estimate P(Y) all at once
MPLE: Estimate P(Yi|Y\i) separately
YAi
Something in between?
Composite Likelihood (MCLE):
Estimate P(YAi | Y\Ai) separately. (Lindsay, 1988)
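A hedged sketch of the composite likelihood objective named here (Lindsay, 1988), in the same assumed notation, with node-disjoint components Y_{A_i}:

$$
\min_\theta\;\; -\frac{1}{n}\sum_{s=1}^{n}\sum_{i}\log P_\theta\big(y^{(s)}_{A_i} \,\big|\, y^{(s)}_{\setminus A_i}\big).
$$

Taking every A_i to be a single node recovers MPLE, and taking one component containing all of Y recovers MLE, which is the generalization stated on the next slide.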
![Page 39: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/39.jpg)
Generalizes MLE, MPLE; analogous: objective, sample complexity, joint & disjoint optimization.
Composite Likelihood (MCLE)
39
MCLE class: node-disjoint subgraphs which cover the graph.
![Page 40: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/40.jpg)
Composite Likelihood (MCLE)
40
MCLE class: node-disjoint subgraphs which cover the graph.
• Trees (tractable inference)
• Follow structure of P(X)
• Cover star structures
• Cover strong factors
• Choose large components
Combs
Generalizes MLE, MPLE; analogous: objective, sample complexity, joint & disjoint optimization.
![Page 41: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/41.jpg)
Structured MCLE on a Grid
41
[Plots: (1) log-loss ratio (other/MLE) vs. grid size |X| for MCLE (combs) and MPLE; (2) training time (sec) vs. grid size |X| for MCLE (combs), MPLE, and MLE. Grid, associative factors, 10,000 training examples, Gibbs sampling. Lower is better.]
MCLE (combs) lowers sample complexity... without increasing computation!
MCLE tailored to model structure. Also in thesis: tailoring to correlations in data.
![Page 42: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/42.jpg)
Summary: Parameter Learning
42
| Method | Sample complexity | Computational complexity | Parallel optimization |
|---|---|---|---|
| Likelihood (MLE) | Optimal | High | Difficult |
| Pseudolikelihood (MPLE) | High | Low | Easy |
| Composite Likelihood (MCLE) | Low | Low | Easy |

• Finite sample complexity bounds for general MRFs, CRFs
• PAC learnability for certain classes
• Empirical analysis
• Guidelines for choosing MCLE structures: tailor to model, data
![Page 43: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/43.jpg)
Outline
43
Scaling core methods: Structure Learning (learning tractable structures).
![Page 44: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/44.jpg)
CRF Structure Learning
44
X3: deadline?
Y1: losing sleep?
Y3: sick? Y2: losing hair?
X1: loud roommate?
X2: taking classes?
Structure learning: Choose YC
I.e., learn conditional independence
Evidence selection: Choose XD
I.e., select X relevant to each YC
![Page 45: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/45.jpg)
Related Work

| Previous work | Method | Structure learning? | Tractable inference? | Evidence selection? |
|---|---|---|---|---|
| Torralba et al. (2004) | Boosted Random Fields | Yes | No | Yes |
| Schmidt et al. (2008) | Block-L1 regularized pseudolikelihood | Yes | No | No |
| Shahaf et al. (2009) | Edge weights + low-treewidth model | Yes | Yes | No |

Most similar to our work: they focus on selecting treewidth-k structures; we focus on the choice of edge weight.
45
![Page 46: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/46.jpg)
Tree CRFs with Local Evidence
Goal: Given data and local evidence (Xi relevant to each Yi), learn a tree CRF structure (fast inference at test time), via a scalable method.
Bradley, Guestrin (2010)
46
![Page 47: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/47.jpg)
Chow-Liu for MRFs
47
Chow & Liu (1968)
Y1, Y2, Y3
Algorithm: Weight edges with mutual information.
![Page 48: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/48.jpg)
Chow-Liu for MRFs
48
Chow & Liu (1968)
Algorithm: Weight edges with mutual information; choose the max-weight spanning tree.
Chow-Liu finds a max-likelihood structure.
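A minimal sketch of the Chow-Liu procedure described here, assuming discrete data in a NumPy array and a NetworkX dependency; the helper names are illustrative, not from the thesis. It estimates pairwise mutual information from empirical counts and takes a maximum-weight spanning tree:

```python
import itertools
import numpy as np
import networkx as nx

def empirical_mutual_information(xi, xj):
    """Estimate I(Xi; Xj) from two aligned columns of discrete samples."""
    n = len(xi)
    joint = {}
    for a, b in zip(xi, xj):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    p_i = {a: np.mean(xi == a) for a in set(xi)}
    p_j = {b: np.mean(xj == b) for b in set(xj)}
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * np.log(p_ab / (p_i[a] * p_j[b]))
    return mi

def chow_liu_tree(data):
    """data: (n_samples, n_vars) array of discrete values.
    Weight every pair of variables by empirical mutual information,
    then return the max-weight spanning tree (Chow & Liu, 1968)."""
    n_vars = data.shape[1]
    g = nx.Graph()
    for i, j in itertools.combinations(range(n_vars), 2):
        g.add_edge(i, j, weight=empirical_mutual_information(data[:, i], data[:, j]))
    return nx.maximum_spanning_tree(g, weight="weight")
```

For the CRF setting on the following slides, the mutual-information weight would be replaced by one of the conditional edge weights discussed there (Global CMI, Local CMI, or DCI).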
![Page 49: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/49.jpg)
Chow-Liu for CRFs? What edge weight? (Must be efficient to compute.)
Algorithm: Weight each possible edge; choose the max-weight spanning tree.
Global Conditional Mutual Information (CMI)
• Pro: Finds a max-likelihood structure (with enough data)
• Con: Intractable for large |X|
49
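A hedged sketch of the global CMI edge weight described on this slide (notation assumed):

$$
w(i,j) \;=\; I\big(Y_i;\,Y_j \,\big|\, X\big),
$$

i.e., conditional mutual information given all of X, which is why it becomes intractable to estimate for large |X|.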
![Page 50: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/50.jpg)
Generalized Edge Weights
50
Global CMI
Local Linear Entropy Scores (LLES): w(i,j) = linear combination of entropies over Yi, Yj, Xi, Xj.
Theorem: No LLES can recover all tree CRFs (even with non-trivial parameters and exact entropies).
![Page 51: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/51.jpg)
Heuristic Edge Weights
| Method | Guarantees | Compute w(i,j) tractably | Comments |
|---|---|---|---|
| Global CMI | Recovers true tree | No | Shahaf et al. (2009) |
| Local CMI | Lower-bounds likelihood gain | Yes | Fails with strong Yi-Xi potentials |
| DCI (Decomposable Conditional Influence) | Exact likelihood gain for some edges | Yes | Best empirically |

51
![Page 52: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/52.jpg)
Synthetic Tests
Trees w/ associative factors. |Y| = 40. 1000 test samples. Error bars: 2 std. errors.
[Plot: fraction of true CRF edges recovered vs. # training examples (0-500), for DCI, Global CMI, Local CMI, and Schmidt et al.]
52
![Page 53: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/53.jpg)
Synthetic Tests
Trees w/ associative factors. |Y| = 40. 1000 test samples. Error bars: 2 std. errors.
[Plot: training time (seconds) vs. # training examples (0-500), for Global CMI, DCI, Local CMI, and Schmidt et al. Lower is better.]
53
![Page 54: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/54.jpg)
fMRI Tests
X: fMRI voxels (500)
Y: semantic features (218)
Predict Y from X. (Application & data from Palatucci et al., 2009)
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg
[Plot: test E[log P(Y|X)] for Disconnected (Palatucci et al., 2009), DCI 1, and DCI 2.]
54
![Page 55: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/55.jpg)
Summary: Structure Learning
55
• Analyzed generalizing Chow-Liu to CRFs
• Proposed class of edge weights: Local Linear Entropy Scores
• Negative result: insufficient for recovering trees
• Discovered useful heuristic edge weights: Local CMI, DCI
• Promising empirical results on synthetic & fMRI data

Generalized Chow-Liu: compute edge weights (w12, w23, w24, w25, w45); choose max-weight spanning tree.
![Page 56: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/56.jpg)
Outline
56
Parallel scaling: Parallel Regression (multicore sparse regression).
Both core methods are solved via regression:
• Parameter Learning (pseudolikelihood, canonical parameterization): regress each variable on its neighbors, P(Xi | X\i).
• Structure Learning (generalized Chow-Liu): compute edge weights via P(Yi, Yj | Xij).
![Page 57: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/57.jpg)
Sparse (L1) Regression (Bradley, Kyrola, Bickson, Guestrin, 2011)
Lasso (Tibshirani, 1996): bias towards sparse solutions.
Goal: predict a response from input features, given samples. Objective: squared loss plus an L1 penalty.
Useful in the high-dimensional setting (# features >> # examples). Covers Lasso and sparse logistic regression.
57
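A hedged sketch of the Lasso objective referenced on this slide (Tibshirani, 1996), assuming a design matrix X in R^(n×d), response y in R^n, and weights w:

$$
\min_{w \in \mathbb{R}^d}\;\; \tfrac{1}{2}\,\|Xw - y\|_2^2 \;+\; \lambda\,\|w\|_1.
$$

The L1 penalty is the bias towards sparse solutions; replacing the squared loss with a logistic loss gives sparse logistic regression.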
![Page 58: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/58.jpg)
Parallelizing Lasso

Many Lasso optimization algorithms:
• Gradient descent, interior point, stochastic gradient, shrinkage, hard/soft thresholding
• Coordinate descent (a.k.a. Shooting; Fu, 1998): one of the fastest algorithms (Yuan et al., 2010)

Parallel optimization:
• Matrix-vector ops (e.g., interior point): not great empirically
• Stochastic gradient (e.g., Zinkevich et al., 2010): best for many samples, not large d
• Shooting: inherently sequential?

Shotgun: parallel coordinate descent for L1 regression. Simple algorithm, elegant analysis.
58
![Page 59: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/59.jpg)
Shooting: Sequential SCD
Stochastic Coordinate Descent (SCD): While not converged, choose a random coordinate j and update wj (closed-form minimization).
59
![Page 60: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/60.jpg)
Shotgun: Parallel SCD
Shotgun Algorithm (Parallel SCD): While not converged, on each of P processors, choose a random coordinate j and update wj (same update as Shooting).
Nice case: uncorrelated features. Bad case: correlated features.
Is SCD inherently sequential?
60
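A minimal sketch of the Shooting update and its Shotgun-style use, assuming the Lasso objective above; the function names, the default P, and the convergence check are illustrative, and the P updates per round are simulated sequentially here rather than run on separate cores:

```python
import numpy as np

def soft_threshold(c, lam):
    """Soft-thresholding operator: the closed-form 1-D Lasso solution."""
    return np.sign(c) * max(abs(c) - lam, 0.0)

def shotgun_lasso(X, y, lam, P=4, max_rounds=1000, tol=1e-6):
    """Sketch of parallel stochastic coordinate descent (Shotgun-style).
    Each round picks P random coordinates and applies the Shooting update
    to each; the 'parallel' updates run sequentially here for clarity."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq_norms = (X ** 2).sum(axis=0)           # ||x_j||^2 for each column
    residual = y - X @ w                           # maintained incrementally
    for _ in range(max_rounds):
        coords = np.random.randint(0, d, size=P)   # P coordinates per round
        max_change = 0.0
        for j in coords:
            if col_sq_norms[j] == 0.0:
                continue
            c_j = X[:, j] @ residual + col_sq_norms[j] * w[j]
            w_new = soft_threshold(c_j, lam) / col_sq_norms[j]
            delta = w_new - w[j]
            if delta != 0.0:
                residual -= delta * X[:, j]        # keep residual consistent
                w[j] = w_new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return w
```

In Shotgun proper, the P coordinate updates in each round run concurrently on separate cores; the theory on the following slides bounds how large P can safely be relative to the feature correlation (the spectral radius of XᵀX).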
![Page 61: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/61.jpg)
Shotgun: Theory
Convergence Theorem (formula on slide): bounds the gap between the final objective and the optimal objective after a given number of iterations, assuming a limit on the number of parallel updates P, where ρ = spectral radius of XᵀX.
Generalizes bounds for Shooting (Shalev-Shwartz & Tewari, 2009).
61
![Page 62: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/62.jpg)
Shotgun: Theory
Convergence Theorem (formula on slide): the final minus optimal objective is bounded in terms of the number of iterations and the number of parallel updates P, where ρ = spectral radius of XᵀX.
Nice case: uncorrelated features (small ρ, many useful parallel updates). Bad case: correlated features (large ρ; at worst, almost no parallelism helps).
62
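A heavily hedged sketch of the shape of the bound these slides describe, written from memory of the Shotgun analysis; the exact constants and form are in the paper and are not recoverable from this transcript:

$$
\mathbb{E}\big[F(w^{(T)})\big] - F(w^\star)
\;\lesssim\;
\frac{d\left(\tfrac{1}{2}\|w^\star\|_2^2 + F(w^{(0)})\right)}{T\,P},
\qquad \text{provided } P \;<\; \frac{d}{\rho} + 1,
$$

so the iterations needed shrink roughly linearly in P up to a threshold around d/ρ + 1, where ρ is the spectral radius of XᵀX; this matches the "linear speedups predicted, up to a threshold" summary on the next slide.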
![Page 63: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/63.jpg)
Shotgun: Theory
Convergence Theorem: ... linear speedups predicted, up to a threshold.
Experiments match our theory!
63
[Plots: T (iterations) vs. P (parallel updates) for Mug32_singlepixcam (Pmax = 79) and SparcoProblem7 (Pmax = 284).]
![Page 64: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/64.jpg)
Lasso Experiments
Compared many algorithms:
• Interior point (L1_LS)
• Shrinkage (FPC_AS, SpaRSA)
• Projected gradient (GPSR_BB)
• Iterative hard thresholding (Hard_IO)
• Also ran: GLMNET, LARS, SMIDAS
• Shooting and Shotgun with P = 8 (multicore)
35 datasets; λ = 0.5, 10. Dataset groups: Single-Pixel Camera, Sparco (van den Berg et al., 2009), Sparse Compressed Imaging, Large Sparse Datasets.
64
Shotgun proves most scalable & robust.
![Page 65: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/65.jpg)
Shotgun: Speedup
Aggregated results from all tests.
[Plot: speedup vs. # cores, with the optimal line, for Lasso iteration speedup, Lasso time speedup, and logistic regression time speedup.]
Lasso time speedup is not so great, but we are doing fewer iterations!
Explanation: the memory wall (Wulf & McKee, 1995); the memory bus gets flooded.
Logistic regression uses more FLOPS/datum, so extra computation hides memory latency: better speedups on average!
65
![Page 66: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/66.jpg)
Summary: Parallel Regression
66
• Shotgun: parallel coordinate descent on multicore
• Analysis: near-linear speedups, up to a problem-dependent limit
• Extensive experiments (37 datasets, 7 other methods)
• Our theory predicts empirical behavior well.
• Shotgun is one of the most scalable methods.

Shotgun: decompose computation by coordinate updates; trade a little extra computation for a lot of parallelism.
![Page 67: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/67.jpg)
Recall: Thesis Statement
We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.
67
Parameter Learning: structured composite likelihood (MLE, MCLE, MPLE)
Structure Learning: generalized Chow-Liu (edge weights w12, w23, w24, w25, w45)
Parallel Regression: Shotgun, parallel coordinate descent
Decompositions use model structure & locality. Trade-offs use model- and data-specific methods.
![Page 68: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/68.jpg)
Future Work: Unified System
68
Parameter Learning: Structured MCLE. Automatically choose the MCLE structure & parallelization strategy to optimize trade-offs, tailored to model & data.
Structure Learning: L1 structure learning; learning trees. Use structured MCLE? Learn trees for parameter estimators?
Parallel Regression: from Shotgun (multicore) to distributed. Limited communication in the distributed setting. Handle complex objectives (e.g., MCLE).
![Page 69: Thesis Defense](https://reader035.fdocuments.net/reader035/viewer/2022062811/56815f5f550346895dce44e1/html5/thumbnails/69.jpg)
Summary
69
We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

Parameter learning: structured composite likelihood
• Finite sample complexity bounds
• Empirical analysis
• Guidelines for choosing MCLE structures: tailor to model, data
• Analyzed canonical parameterization of Abbeel et al. (2006)

Structure learning: generalizing Chow-Liu to CRFs
• Proposed class of edge weights: Local Linear Entropy Scores
• Insufficient for recovering trees
• Discovered useful heuristic edge weights: Local CMI, DCI
• Promising empirical results on synthetic & fMRI data

Parallel regression: Shotgun, parallel coordinate descent on multicore
• Analysis: near-linear speedups, up to a problem-dependent limit
• Extensive experiments (37 datasets, 7 other methods)
• Our theory predicts empirical behavior well.
• Shotgun is one of the most scalable methods.

Thank you!