Sample Complexity of Composite Likelihood
Transcript of Sample Complexity of Composite Likelihood
Sample Complexity of Composite Likelihood
Joseph K. Bradley & Carlos Guestrin
PAC-learning parameters for general MRFs & CRFs via practical methods: pseudolikelihood & structured composite likelihood.
Example MRF: the health of a grad student
Binary variables X:
X1: deadline?
X2: bags under eyes?
X3: sick?
X4: losing hair?
X5: overeating?
Example query: P( deadline | bags under eyes, losing hair )
Markov Random Fields (MRFs)
Model distribution P(X) over random variables X as a log-linear MRF:
Pθ(X) = exp( Σj θj φj(X) ) / Z(θ)
with parameters θ and features φ. Computing the partition function Z(θ) requires inference, which is provably hard for general MRFs.
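As a concrete illustration of the log-linear form and of why Z is the bottleneck, here is a minimal brute-force sketch. The tiny 3-node chain, its agreement features, and the parameter values are invented for illustration; the sum over all assignments below is exactly the inference step that becomes intractable for general MRFs.

```python
import itertools
import math

# Hypothetical tiny MRF over binary X = (X1, X2, X3): one agreement
# feature per edge, weighted by theta. Values are illustrative only.
edges = [(0, 1), (1, 2)]
theta = [0.8, -0.5]

def score(x):
    # theta . phi(x): feature j fires when edge j's endpoints agree
    return sum(t for t, (i, j) in zip(theta, edges) if x[i] == x[j])

# Partition function Z(theta): a sum over all 2^|X| assignments.
# Fine for 3 variables, provably hard for general MRFs.
Z = sum(math.exp(score(x)) for x in itertools.product([0, 1], repeat=3))

def prob(x):
    # P_theta(x) = exp(theta . phi(x)) / Z(theta)
    return math.exp(score(x)) / Z

total = sum(prob(x) for x in itertools.product([0, 1], repeat=3))
```

With these weights, fully agreeing assignments score higher than the all-disagreeing ones, and the probabilities sum to 1 by construction.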
Conditional Random Fields (CRFs)
Model conditional distribution P(X|E) over random variables X, given evidence variables E:
Pθ(X | E = e) = exp( Σj θj φj(X, e) ) / Z(θ, e)
Pro: Model X, not E. Inference is exponential only in |X|, not in |E|.
Con: Z depends on E, so Z(e) must be computed for every training example!
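Because the normalizer depends on the evidence, CRF training computes one partition function per training example, summing over X only. A minimal brute-force sketch; the two-variable model, its features, and the parameter values are invented for illustration.

```python
import itertools
import math

# Hypothetical CRF over binary X = (X0, X1) with scalar evidence bit e.
def score(x, e, theta):
    # one feature coupling x0 to the evidence, one coupling x0 to x1
    return theta[0] * (x[0] * e) + theta[1] * (1.0 if x[0] == x[1] else 0.0)

def log_Z(e, theta):
    # Partition function for this evidence value: sum over X only,
    # so the cost is exponential in |X|, not in |E|.
    return math.log(sum(math.exp(score(x, e, theta))
                        for x in itertools.product([0, 1], repeat=2)))

theta = [1.0, 0.3]
z0, z1 = log_Z(0, theta), log_Z(1, theta)  # different e, different Z(e)
```

Here z0 and z1 differ, which is the point: each training example's evidence requires its own normalizer.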
Maximum Likelihood Estimation (MLE)
Given data: n i.i.d. samples from Pθ*(X).
MLE objective: minimize loss plus regularization:
minθ [ -(1/n) Σn log Pθ(x^(n)) + λ‖θ‖1 ]
(L2 regularization is more common. Our analysis applies to L1 & L2.)
Gold Standard: MLE is (optimally) statistically efficient.
But the loss is hard to compute (inference).
MLE Algorithm: Iterate:
• Compute gradient.
• Step along gradient.
Can we learn without intractable inference?
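The "compute gradient, step" loop can be sketched exactly on a model small enough for brute-force inference: for a log-linear model the gradient of the average log-likelihood is E_data[φ] - E_model[φ]. Everything below (the one-feature model, the toy data, the step size) is illustrative, not from the paper.

```python
import itertools
import math

# One-feature log-linear model over two binary variables:
# phi(x) = 1 if the variables agree. Toy data; illustrative only.
def feature(x):
    return 1.0 if x[0] == x[1] else 0.0

data = [(0, 0), (1, 1), (0, 0), (0, 1)]

def model_expectation(theta):
    # E_theta[phi] by brute-force enumeration (the inference step)
    xs = list(itertools.product([0, 1], repeat=2))
    ws = [math.exp(theta * feature(x)) for x in xs]
    Z = sum(ws)
    return sum(w * feature(x) for w, x in zip(ws, xs)) / Z

emp = sum(feature(x) for x in data) / len(data)  # E_data[phi]

theta = 0.0
for _ in range(200):
    grad = emp - model_expectation(theta)  # gradient of avg log-likelihood
    theta += 0.5 * grad                    # step along the gradient
```

At convergence the model expectation matches the empirical one (moment matching); in this toy case the model expectation is the logistic function of theta and the data agree 3/4 of the time, so the optimum is theta = ln 3.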
Sample Complexity Bounds
Related Work
Ravikumar et al. (2010)
• PAC bounds for regression Yi ~ X with Ising factors.
• Our theory is largely derived from this work.
Maximum Pseudolikelihood (MPLE) (Besag, 1975)
The MLE loss, -(1/n) Σn log Pθ(x^(n)), is hard to compute (inference), so replace it!
Pseudolikelihood (MPLE) loss: -(1/n) Σn Σi log Pθ(xi^(n) | x-i^(n))
Intuition: Approximate the distribution as a product of local conditionals, e.g. over X1 (deadline?) and X2 (bags under eyes?).
Pros: No intractable inference required; consistent estimator.
Cons: Less statistically efficient than MLE; no PAC bounds (before this work).
(For CRFs, see Lafferty et al., 2001.)
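The local conditionals are cheap because normalization runs only over one variable's values, so no global Z is needed. A minimal sketch of one sample's contribution to the pseudolikelihood loss, on an invented 3-node chain with illustrative parameters:

```python
import math

edges = [(0, 1), (1, 2)]
theta = [0.8, -0.5]  # illustrative parameters

def score(x):
    # theta . phi(x) with agreement features on the edges
    return sum(t for t, (i, j) in zip(theta, edges) if x[i] == x[j])

def log_cond(x, i):
    # log P(x_i | x_-i): normalize over x_i in {0, 1} only.
    # Factors not touching variable i cancel in the ratio.
    flipped = list(x)
    flipped[i] = 1 - x[i]
    a, b = score(x), score(tuple(flipped))
    return a - math.log(math.exp(a) + math.exp(b))

def neg_pseudo_loglik(x):
    # one sample's contribution to the MPLE loss
    return -sum(log_cond(x, i) for i in range(len(x)))

loss = neg_pseudo_loglik((0, 0, 1))
```

Each local conditional is a proper distribution over the flipped variable, so the two values of any x_i get probabilities summing to 1.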
Bound on Parameter Error: MLE, MPLE
Theorem: MLE or MPLE using L1 or L2 regularization achieves average per-parameter error ε with probability ≥ 1 - δ, using n i.i.d. samples from Pθ*(X), where the required n scales as 1/(Λmin² ε²), up to logarithmic factors in 1/δ (the probability of failure).
Λmin for MLE: min eigenvalue of the Hessian of the loss at θ*.
Λmin for MPLE: mini [ min eigenvalue of the Hessian of loss component i at θ* ].
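For intuition about Λmin: for MLE, the Hessian of the loss at θ* is the feature covariance under Pθ*, so Λmin can be brute-forced on a tiny model. The 3-node chain and parameter values below stand in for θ* and are invented for illustration.

```python
import itertools
import math

edges = [(0, 1), (1, 2)]
theta = [0.8, -0.5]  # stands in for theta*; illustrative values

def phi(x):
    # one agreement feature per edge
    return [1.0 if x[i] == x[j] else 0.0 for (i, j) in edges]

xs = list(itertools.product([0, 1], repeat=3))
ws = [math.exp(sum(t * f for t, f in zip(theta, phi(x)))) for x in xs]
Z = sum(ws)
ps = [w / Z for w in ws]

# Hessian of the MLE loss at theta* = Cov[phi] under P_theta*.
mean = [sum(p * phi(x)[k] for p, x in zip(ps, xs)) for k in range(2)]
cov = [[sum(p * (phi(x)[a] - mean[a]) * (phi(x)[b] - mean[b])
            for p, x in zip(ps, xs))
        for b in range(2)] for a in range(2)]

# Lambda_min: smallest eigenvalue, closed form for a symmetric 2x2 matrix.
a, b, c = cov[0][0], cov[0][1], cov[1][1]
lambda_min = 0.5 * ((a + c) - math.sqrt((a - c) ** 2 + 4 * b ** 2))
```

Since the two features are not linearly dependent here, the covariance is positive definite and lambda_min is strictly positive; a small lambda_min would mean a flat loss direction and a large sample-complexity bound.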
Bound on Log Loss
Theorem: If the parameter estimation error ε is small, then the log loss converges quadratically in ε; otherwise the log loss converges linearly in ε, with constants depending on the max feature magnitude.
Joint vs. Disjoint Optimization
Joint MPLE: optimize all loss components together.
Disjoint MPLE: optimize each component separately, then average θ over the separate estimates.
Pro: Data parallel.
Con: Worse bound (extra factors of |X|).
Theorem: Sample Complexity Bound for Disjoint MPLE: as for joint MPLE, but with extra factors of |X|; here r denotes the number of parameters (the length of θ).
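A minimal sketch of the disjoint scheme on a single-edge binary model with one shared parameter: each node's conditional is fit on its own (these fits could run in parallel on separate machines), and the final estimate averages the per-node parameters. The model, data, and step size are invented for illustration.

```python
import math

data = [(0, 0), (1, 1), (1, 1), (0, 1)]  # toy samples

def fit_conditional(i):
    # Fit theta by gradient ascent on the local objective
    # sum_n log P(x_i^(n) | x_-i^(n)). For this one-edge model,
    # P(x_i = x_-i | x_-i) = sigmoid(theta), so both nodes' local
    # losses coincide and depend only on the agreement rate.
    agree = sum(1.0 for x in data if x[0] == x[1]) / len(data)
    theta = 0.0
    for _ in range(300):
        p = 1.0 / (1.0 + math.exp(-theta))  # model P(agree)
        theta += 0.5 * (agree - p)          # local gradient step
    return theta

# Disjoint MPLE: average theta over the separately fit components.
theta_hat = 0.5 * (fit_conditional(0) + fit_conditional(1))
```

With an agreement rate of 3/4 the local optimum is theta = ln 3; in general the components disagree and the averaging is what costs the extra factors in the bound.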
MLE: Estimate P(Y) all at once.
MPLE: Estimate each P(Yi | Y-i) separately.
Something in between? Estimate a larger component, but keep inference tractable.
Composite Likelihood (MCLE) (Lindsay, 1988): Estimate P(YAi | Y-Ai) separately, for subsets YAi in Y.
E.g., for a grid model with weak horizontal factors and strong vertical factors, a good choice of components is vertical combs.
Choosing MCLE components YAi:
• Larger is better.
• Keep inference tractable.
• Use model structure.
Bound on Parameter Error: MCLE
ρmin = minj [ sum over components Ai which estimate θj of ( min eigenvalue of the Hessian of component Ai's loss at θ* ) ]
Mmax = maxj [ number of components Ai which estimate θj ]
MPLE: One bad estimator P(Xi | X-i) can give bad results.
MCLE: The effect of a bad estimator P(XAi | X-Ai) can be averaged out by other good estimators.
Theorem intuition: ρmin / Mmax = average Λmin over the multiple components estimating each parameter.
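A composite-likelihood component generalizes the pseudolikelihood conditional from a single variable to a block: normalize over all joint assignments of the block, holding the rest fixed. A minimal brute-force sketch on an invented 4-node chain with illustrative parameters:

```python
import itertools
import math

edges = [(0, 1), (1, 2), (2, 3)]
theta = [0.8, -0.5, 1.2]  # illustrative parameters

def score(x):
    # theta . phi(x) with agreement features on the edges
    return sum(t for t, (i, j) in zip(theta, edges) if x[i] == x[j])

def log_cond_block(x, block):
    # log P(x_A | x_-A): normalize over joint assignments of the block
    # only; inference cost is exponential in |A|, not in |X|.
    total = 0.0
    for vals in itertools.product([0, 1], repeat=len(block)):
        y = list(x)
        for i, v in zip(block, vals):
            y[i] = v
        total += math.exp(score(tuple(y)))
    return score(x) - math.log(total)

x = (0, 0, 1, 1)
# MPLE component: block of size 1. MCLE component: a larger block.
lp_mple = log_cond_block(x, [1])
lp_mcle = log_cond_block(x, [1, 2])
```

Larger blocks capture more of the joint structure (larger is better) while keeping the normalization sum tractable, which is exactly the trade-off in choosing MCLE components.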
Experimental Setup
Structures: chains, stars, grids.
Factors over binary (X1, X2):
• Random: random weights, scaled by the factor strength.
• Associative: equal to the factor strength if X1 = X2, 0 otherwise.
Learning:
• 10 runs with separate datasets.
• Optimized with conjugate gradient.
• MLE on big grids: stochastic gradient with Gibbs sampling.
[Figure: L1 parameter error and its bound, and log (base e) loss and its bound (given params), vs. training set size, for MLE, MPLE, and MPLE-disjoint. Chain, |X| = 4, random factors.]
Predictive Power of Bounds
Parameter estimation error ≤ f(sample size) (looser bound)
Log loss ≤ f(param estimation error) (tighter bound)
[Figure: L1 parameter error and its bound vs. 1/Λmin, for r = 5, 11, and 23. Chains, random factors, 10,000 training examples, MLE (similar results for MPLE).]
Is the bound still useful (predictive)?
Yes! Actual error vs. bound:
• Different constants.
• Similar behavior.
• Nearly independent of r.
Background
Structured Composite Likelihood
Composite Likelihood (MCLE)
Tightness of Bounds
Future Work
Liang and Jordan (2008)
• Asymptotic bounds for pseudolikelihood, composite likelihood.
• Our finite sample bounds are of the same order.
Learning with approximate inference
• No previous PAC-style bounds for general MRFs, CRFs.
• Cf. Hinton (2002), Koller & Friedman (2009), Wainwright (2006).
[Figure: Λmin ratio vs. grid width (fixed factor strength) and vs. factor strength (fixed |Y| = 8); log loss ratio (other/MLE) and training time (sec) vs. grid size |X|. Curves: combs, MPLE, MLE. Grid, associative factors (fixed strength), 10,000 training samples.]
Combs (MCLE) lower sample complexity, without increasing computation!
[Figure: Λmin vs. grid width for MLE, MPLE, and combs (both, vertical only, horizontal only). Grid with strong vertical (associative) factors.]
Averaging MCLE Components
Best: Component structure matches model structure.
Average: Reasonable choice without prior knowledge of θ*.
Worst: Component structure does not match model structure.
[Figure panels: chains, stars, grids.]
All plots are for associative factors. (Random factors behave similarly.)
Plotted: Ratio (Λmin for MLE) / (Λmin for other method)
[Figure: Λmin ratio vs. factor strength (fixed |Y| = 8) and vs. model size |Y| (fixed factor strength), for MPLE.]
Model diameter is not important.
MPLE is worse for high-degree nodes.
MPLE is worse for big grids.
MPLE is worse for strong factors. Combs (Structured MCLE) improve upon MPLE.
How do the bounds vary w.r.t. model properties?
Λmin for Various Models
• Theoretical understanding of how Λmin varies with model properties.
• Choosing MCLE structure on natural graphs.
• Parallel learning: lowering sample complexity of disjoint optimization via limited communication.
• Comparing with MLE using approximate inference.
Acknowledgements
• Thanks to John Lafferty, Geoff Gordon, and our reviewers for helpful feedback.
• Funded by NSF Career IIS-0644225, ONR YIP N00014-08-1-0752, and ARO MURI W911NF0810242.
Abbeel et al. (2006)
• Only previous method for PAC-learning high-treewidth discrete MRFs (low-degree factor graphs over discrete X).
• Main idea (their "canonical parameterization"): re-write P(X) as a ratio of many small factors P( XCi | X-Ci ).
• Fine print: Each factor is instantiated 2^|Ci| times using a reference assignment.
• Estimate each small factor P( XCi | X-Ci ) from data.
Theorem: If the canonical parameterization uses the factorization of P(X), it is equivalent to MPLE with disjoint optimization.
Computing MPLE directly is faster. Our analysis covers their learning method.