Sample Complexity of Composite Likelihood


Description

Sample Complexity of Composite Likelihood. Joseph K. Bradley & Carlos Guestrin. PAC-learning parameters for general MRFs & CRFs via practical methods: pseudolikelihood & structured composite likelihood.

Transcript of Sample Complexity of Composite Likelihood

Page 1: Sample Complexity of Composite Likelihood

Sample Complexity of Composite Likelihood
Joseph K. Bradley & Carlos Guestrin

PAC-learning parameters for general MRFs & CRFs via practical methods: pseudolikelihood & structured composite likelihood.

X1: deadline?

X2: bags under eyes?

X3: sick?

X4: losing hair?

X5: overeating?

Example MRF: the health of a grad student

Example query: P( deadline | bags under eyes, losing hair )

(Figure: a factor graph over the binary variables X1, ..., X5, with a pairwise factor on each edge.)

Markov Random Fields (MRFs)

Model the distribution P(X) over random variables X as a log-linear MRF:

P_θ(x) = (1/Z(θ)) exp( Σ_j θ_j φ_j(x) ),   where Z(θ) = Σ_x exp( Σ_j θ_j φ_j(x) )

with parameters θ and features φ. Computing the partition function Z(θ) requires inference, which is provably hard for general MRFs.
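To make the definition concrete, here is a minimal sketch (not from the paper; the chain structure, agreement features, and parameter values are illustrative assumptions) of a tiny log-linear MRF whose partition function is computed by brute-force enumeration:

```python
import itertools
import math

# Pairwise parameters theta_j, one per edge of a 4-node chain (illustrative values).
theta = {(0, 1): 1.2, (1, 2): -0.5, (2, 3): 0.8}
n_vars = 4

def score(x):
    """Sum_j theta_j * phi_j(x), with phi_j(x) = 1 if the j-th pair of variables agrees."""
    return sum(t for (i, j), t in theta.items() if x[i] == x[j])

# Z(theta) sums over all 2^n assignments: this enumeration is exactly the
# inference step that is provably hard for general (large) MRFs.
Z = sum(math.exp(score(x)) for x in itertools.product([0, 1], repeat=n_vars))

def prob(x):
    return math.exp(score(x)) / Z

print(prob((1, 1, 0, 0)))
```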

Conditional Random Fields (CRFs) (Lafferty et al., 2001)

Model the conditional distribution P(X|E) over random variables X, given variables E:

P_θ(x | e) = (1/Z(e; θ)) exp( Σ_j θ_j φ_j(x, e) )

Pro: Model X, not E. Inference exponential only in |X|, not in |E|.

Con: Z depends on E!

Compute Z(e) for every training example!
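A minimal sketch of that point (the feature definitions and parameter values below are illustrative assumptions, not the paper's model): for a CRF, the partition function Z(e) must be recomputed for each conditioning value e.

```python
import itertools
import math

theta = [1.0, -0.5]                        # hypothetical parameters

def feats(x, e):
    """Illustrative features coupling two binary outputs x to one binary input e."""
    return [1.0 if x[0] == e else 0.0, 1.0 if x[0] == x[1] else 0.0]

def Z(e):
    """Partition function for one conditioning value e (brute force over x)."""
    return sum(math.exp(sum(t * f for t, f in zip(theta, feats(x, e))))
               for x in itertools.product([0, 1], repeat=2))

def cond_prob(x, e):
    return math.exp(sum(t * f for t, f in zip(theta, feats(x, e)))) / Z(e)

for e in (0, 1):                           # one Z(e) per distinct evidence value
    print(e, Z(e), cond_prob((1, 1), e))
```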

Maximum Likelihood Estimation (MLE)

Given data: n i.i.d. samples x^(1), ..., x^(n) from P_θ*(X).

MLE objective, minimized over θ (loss plus regularization):

-(1/n) Σ_i log P_θ( x^(i) ) + λ ||θ||_1

L2 regularization is more common; our analysis applies to both L1 & L2.

Gold standard: MLE is (optimally) statistically efficient.

But the loss is hard to compute (it requires inference). Can we learn without intractable inference?

MLE algorithm. Iterate:
• Compute gradient.
• Step along gradient.
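A minimal sketch of this loop (illustrative, not the paper's code; the toy data, pair features, step size, and L2 penalty are assumptions): gradient descent on the regularized negative log-likelihood of a tiny brute-force MRF like the one sketched above. The gradient of the negative log-likelihood is E_model[φ_j] minus E_data[φ_j], and the model expectation is exactly the inference step that becomes intractable for large MRFs.

```python
import itertools
import math

pairs = [(0, 1), (1, 2), (2, 3)]           # hypothetical chain structure
n_vars = 4
data = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)]  # toy "training samples"
lam = 0.01                                 # L2 regularization strength (assumption)

def feats(x):
    """phi_j(x) = 1 if the j-th pair of variables agrees, else 0."""
    return [1.0 if x[i] == x[j] else 0.0 for i, j in pairs]

def model_expectations(theta):
    """E_model[phi_j] by brute-force enumeration (the step that is intractable in general)."""
    xs = list(itertools.product([0, 1], repeat=n_vars))
    ws = [math.exp(sum(t * f for t, f in zip(theta, feats(x)))) for x in xs]
    Z = sum(ws)
    return [sum(w * feats(x)[j] for w, x in zip(ws, xs)) / Z for j in range(len(pairs))]

emp = [sum(feats(x)[j] for x in data) / len(data) for j in range(len(pairs))]
theta = [0.0] * len(pairs)
for _ in range(200):                       # iterate: compute gradient, step along it
    mod = model_expectations(theta)
    grad = [mod[j] - emp[j] + 2 * lam * theta[j] for j in range(len(pairs))]
    theta = [t - 0.5 * g for t, g in zip(theta, grad)]
print(theta)
```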

Sample Complexity Bounds

Related Work

Ravikumar et al. (2010)
• PAC bounds for regression Y_i ~ X with Ising factors.
• Our theory is largely derived from this work.

Maximum Pseudolikelihood (MPLE)

MLE loss: -(1/n) Σ_i log P_θ( x^(i) ). Hard to compute, so replace it!

Pseudolikelihood (MPLE) loss (Besag, 1975): -(1/n) Σ_i Σ_j log P_θ( x_j^(i) | x_-j^(i) )

Intuition: approximate the distribution as a product of local conditionals, e.g. P( deadline | rest ) · P( bags under eyes | rest ) · ... for the example MRF.

Pro: No intractable inference required.
Pro: Consistent estimator.
Con: Less statistically efficient than MLE.
Con: No PAC bounds (prior to this work).
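A minimal sketch of the pseudolikelihood objective (illustrative, not the paper's code; the pairwise model and toy data are assumptions carried over from the earlier snippets): each local conditional P(x_i | x_-i) only requires summing over the two values of x_i, so no global partition function is needed.

```python
import math

pairs = [(0, 1), (1, 2), (2, 3)]           # same hypothetical chain as above
data = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)]

def local_score(theta, x, i, v):
    """Score of x with x_i set to v, keeping only the factors that touch variable i."""
    y = list(x)
    y[i] = v
    return sum(t for (a, b), t in zip(pairs, theta) if i in (a, b) and y[a] == y[b])

def neg_pseudo_loglik(theta, data):
    """-(1/n) sum over samples and variables of log P_theta(x_i | x_-i)."""
    total = 0.0
    for x in data:
        for i in range(len(x)):
            scores = [math.exp(local_score(theta, x, i, v)) for v in (0, 1)]
            total -= math.log(scores[x[i]] / sum(scores))
    return total / len(data)

print(neg_pseudo_loglik([0.5, 0.5, 0.5], data))
```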

Bound on Parameter Error: MLE, MPLE

Λ_min for MLE: the minimum eigenvalue of the Hessian of the loss at θ*.

Λ_min for MPLE: min_i [ minimum eigenvalue of the Hessian of loss component i at θ* ].

Theorem. MLE or MPLE using L1 or L2 regularization achieve a target average per-parameter error with probability ≥ 1-δ using n i.i.d. samples from P_θ*(X), where the required n depends on the corresponding Λ_min. (In the bound, δ is the probability of failure and the guarantee is on the average per-parameter error.)

Bound on Log Loss

Theorem. If the parameter estimation error ε is small, then the log loss converges quadratically in ε; otherwise, the log loss converges linearly in ε, with a constant involving the maximum feature magnitude.
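A schematic rendering of the two regimes (the constants C_1, C_2 and the small-ε threshold are placeholders, not the paper's values; φ_max stands for the maximum feature magnitude mentioned above):

```latex
% Schematic only: C_1, C_2 and the small-\varepsilon threshold are placeholders,
% not the paper's constants; \phi_{\max} is the maximum feature magnitude.
\[
\underbrace{\mathbb{E}_{P_{\theta^*}}\!\big[\log P_{\theta^*}(X) - \log P_{\hat\theta}(X)\big]}_{\text{excess log loss}}
\;\lesssim\;
\begin{cases}
C_1\,\varepsilon^{2} & \text{if } \varepsilon \text{ is small},\\[4pt]
C_2\,\phi_{\max}\,\varepsilon & \text{otherwise.}
\end{cases}
\]
```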

Joint vs. Disjoint Optimization

Joint MPLE: optimize all local conditionals together, over one shared parameter vector.

Disjoint MPLE: optimize each local conditional P(X_i | X_-i) separately, then average the separate estimates of each parameter.

Pro: Data parallel.
Con: Worse bound (extra factors of |X|).

Theorem. Sample complexity bound for disjoint MPLE: the same form as the joint MPLE bound, but with extra factors of |X|; the bound is stated in terms of the number of parameters (the length of θ).
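A minimal sketch of the disjoint estimate-then-average step (the per-component numbers below are made up for illustration; in practice each component's estimate comes from its own regression-style fit):

```python
from collections import defaultdict

# Hypothetical per-component estimates: component i fits P(X_i | X_-i) and
# returns values for the parameters of the factors touching variable i.
estimates = {
    0: {(0, 1): 1.10},
    1: {(0, 1): 1.30, (1, 2): -0.40},
    2: {(1, 2): -0.60, (2, 3): 0.75},
    3: {(2, 3): 0.85},
}

sums, counts = defaultdict(float), defaultdict(int)
for component in estimates.values():       # each component can be fit on its own machine
    for param, value in component.items():
        sums[param] += value
        counts[param] += 1

theta_hat = {p: sums[p] / counts[p] for p in sums}
print(theta_hat)   # roughly {(0, 1): 1.2, (1, 2): -0.5, (2, 3): 0.8}
```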

MLE: Estimate P(Y) all at once

MPLE: Estimate P(Yi|Y-i) separately

Something in between? Estimate a larger component, but keep inference tractable.

Composite Likelihood (MCLE) (Lindsay, 1988): Estimate P(Y_Ai | Y_-Ai) separately, for subsets Y_Ai of Y.


E.g., a model with:
• Weak horizontal factors
• Strong vertical factors

Good choice: vertical combs

Choosing MCLE components Y_Ai:
• Larger is better.
• Keep inference tractable.
• Use model structure (one way to form comb-shaped components is sketched below).
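A minimal sketch of one possible comb construction on a grid (an illustrative assumption, not the paper's exact definition): each component is one "spine" row plus every other column, so the induced subgraph is a tree and inference within a component, conditioned on the remaining variables, stays tractable.

```python
def grid_index(r, c, width):
    """Index of the variable at row r, column c of a width-wide grid."""
    return r * width + c

def vertical_comb(spine_row, offset, width, height):
    """Variables of one comb: row `spine_row` plus columns offset, offset+2, ..."""
    spine = {grid_index(spine_row, c, width) for c in range(width)}
    teeth = {grid_index(r, c, width)
             for c in range(offset, width, 2) for r in range(height)}
    return sorted(spine | teeth)

# Two combs (offsets 0 and 1) together cover every column, hence every vertical factor.
for offset in (0, 1):
    print(vertical_comb(spine_row=0, offset=offset, width=4, height=3))
```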

Bound on Parameter Error: MCLE

ρ_min = min_j [ sum, over the components A_i that estimate θ_j, of the minimum eigenvalue of the Hessian of that component's loss at θ* ].

M_max = max_j [ number of components A_i that estimate θ_j ].

MPLE: one bad estimator P(X_i | X_-i) can give bad results.
MCLE: the effect of a bad estimator P(X_Ai | X_-Ai) can be averaged out by other, good estimators.

Theorem. Parameter-error bound for MCLE, in which ρ_min / M_max plays the role that Λ_min plays for MLE and MPLE.

Intuition: ρ_min / M_max is an average Λ_min, averaged over the multiple components estimating each parameter.
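A tiny numeric illustration of that intuition (the component-level values are made up): the worst single component drives an MPLE-style quantity, while averaging over components smooths out one bad component.

```python
# Made-up local curvature values (Lambda_min) for three components that all
# estimate the same parameter; one component is much weaker than the others.
lambda_min_per_component = [0.50, 0.05, 0.45]

# MPLE-style quantity: the single worst component dominates.
worst = min(lambda_min_per_component)                                    # 0.05

# MCLE-style quantity (rho_min / M_max for this parameter): an average,
# so the weak component is partly compensated by the strong ones.
average = sum(lambda_min_per_component) / len(lambda_min_per_component)  # ~0.33

print(worst, average)   # a larger effective Lambda_min gives a better bound
```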

Experimental Setup

Structures: chains, stars, grids.

Factors (over pairs of binary variables, e.g. X1, X2):
• Associative: the factor equals the factor strength if the two variables agree, and 0 otherwise.
• Random: factor values chosen at random.

Learning:
• 10 runs with separate datasets.
• Optimized with conjugate gradient.
• MLE on big grids: stochastic gradient with Gibbs sampling.

(Plots: L1 parameter error and the L1 parameter-error bound vs. training set size, and log (base e) loss and the log-loss bound given the estimated parameters vs. training set size, for MLE, MPLE, and MPLE-disjoint. Chain, |X|=4, random factors.)

Predictive Power of Bounds

Parameter estimation error ≤ f(sample size) (looser bound).
Log loss ≤ f(parameter estimation error) (tighter bound).

(Plots: L1 parameter error and the L1 parameter-error bound vs. 1/Λ_min, for chains with random factors and 10,000 training examples; MLE shown (similar results for MPLE), for r = 5, 11, 23.)

Is the bound still useful (predictive)?

Yes! Actual error vs. bound:
• Different constants
• Similar behavior
• Nearly independent of r

Outline: Background; Structured Composite Likelihood; Composite Likelihood (MCLE); Tightness of Bounds; Future Work.

Liang and Jordan (2008)
• Asymptotic bounds for pseudolikelihood, composite likelihood.
• Our finite-sample bounds are of the same order.

Learning with approximate inference
• No previous PAC-style bounds for general MRFs, CRFs.
• Cf. Hinton (2002), Koller & Friedman (2009), Wainwright (2006).

(Plots: Λ_min ratio vs. grid width (fixed factor strength) and vs. factor strength (fixed |Y|=8), for combs and MPLE; log loss ratio (other/MLE) and training time (sec) vs. grid size |X|, for combs, MPLE, and MLE. Grid, associative factors (fixed strength), 10,000 training samples.)

Combs (MCLE) lower sample complexity, without increasing computation!

(Plots: Λ_min vs. grid width for MLE, MPLE, and combs (both, vertical, horizontal), with learning and test panels, on a grid with strong vertical (associative) factors.)

Averaging MCLE Components
• Best: component structure matches model structure.
• Average: a reasonable choice without prior knowledge of θ*.
• Worst: component structure does not match model structure.

(Panels: Chains, Stars, Grids.)

All plots are for associative factors. (Random factors behave similarly.)

Plotted: Ratio (Λmin for MLE) / (Λmin for other method)

(Plots: Λ_min ratio for MPLE vs. factor strength (fixed |Y|=8) and vs. model size |Y| (fixed factor strength).)

Model diameter is not important.

MPLE is worse for high-degree nodes.

MPLE is worse for big grids.

MPLE is worse for strong factors. Combs (Structured MCLE) improve upon MPLE.

How do the bounds vary w.r.t. model properties?

Λmin for Various Models

Future Work
• Theoretical understanding of how Λ_min varies with model properties.
• Choosing MCLE structure on natural graphs.
• Parallel learning: lowering the sample complexity of disjoint optimization via limited communication.
• Comparing with MLE using approximate inference.

Acknowledgements
• Thanks to John Lafferty, Geoff Gordon, and our reviewers for helpful feedback.
• Funded by NSF CAREER IIS-0644225, ONR YIP N00014-08-1-0752, and ARO MURI W911NF0810242.

Abbeel et al. (2006)

Theorem. If the canonical parameterization uses the factorization of P(X), it is equivalent to MPLE with disjoint optimization.

• The only previous method for PAC-learning high-treewidth discrete MRFs (low-degree factor graphs over discrete X).
• Main idea (their "canonical parameterization"): rewrite P(X) as a ratio of many small factors P( X_Ci | X_-Ci ), and estimate each small factor P( X_Ci | X_-Ci ) from data.
• Fine print: each factor is instantiated 2^|Ci| times using a reference assignment.

Computing MPLE directly is faster. Our analysis covers their learning method.