Sample Complexity of Composite Likelihood


Description

Sample Complexity of Composite Likelihood. Joseph K. Bradley & Carlos Guestrin. PAC-learning parameters for general MRFs & CRFs via practical methods: pseudolikelihood & structured composite likelihood.

Transcript of Sample Complexity of Composite Likelihood

Page 1: Sample Complexity of Composite Likelihood

Sample Complexity of Composite Likelihood
Joseph K. Bradley & Carlos Guestrin

PAC-learning parameters for general MRFs & CRFs via practical methods: pseudolikelihood & structured composite likelihood.

X1: deadline?

X2: bags under eyes?

X3: sick?

X4: losing hair?

X5: overeating?

Example MRF: the health of a grad student

Example query: P( deadline | bags under eyes, losing hair )

(Figure: a factor graph over the binary variables X1, ..., X5, with a pairwise factor on each edge.)

Markov Random Fields (MRFs)

Model the distribution P(X) over random variables X as a log-linear MRF:

P_θ(x) = (1/Z(θ)) exp( Σ_j θ_j φ_j(x) ),   where Z(θ) = Σ_x exp( Σ_j θ_j φ_j(x) )

with parameters θ and features φ. Computing the partition function Z(θ) requires inference, which is provably hard for general MRFs.
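To make the definition concrete, here is a minimal sketch (not from the paper; the chain structure, agreement features, and parameter values are illustrative assumptions) of a tiny log-linear MRF whose partition function is computed by brute-force enumeration:

```python
import itertools
import math

# Pairwise parameters theta_j, one per edge of a 4-node chain (illustrative values).
theta = {(0, 1): 1.2, (1, 2): -0.5, (2, 3): 0.8}
n_vars = 4

def score(x):
    """Sum_j theta_j * phi_j(x), with phi_j(x) = 1 if the j-th pair of variables agrees."""
    return sum(t for (i, j), t in theta.items() if x[i] == x[j])

# Z(theta) sums over all 2^n assignments: this enumeration is exactly the
# inference step that is provably hard for general (large) MRFs.
Z = sum(math.exp(score(x)) for x in itertools.product([0, 1], repeat=n_vars))

def prob(x):
    return math.exp(score(x)) / Z

print(prob((1, 1, 0, 0)))
```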

Conditional Random Fields (CRFs) (Lafferty et al., 2001)

Model the conditional distribution P(X|E) over random variables X, given variables E:

P_θ(x | e) = (1/Z(e; θ)) exp( Σ_j θ_j φ_j(x, e) )

Pro: Model X, not E. Inference exponential only in |X|, not in |E|.

Con: Z depends on E!

Compute Z(e) for every training example!
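A minimal sketch of that point (the feature definitions and parameter values below are illustrative assumptions, not the paper's model): for a CRF, the partition function Z(e) must be recomputed for each conditioning value e.

```python
import itertools
import math

theta = [1.0, -0.5]                        # hypothetical parameters

def feats(x, e):
    """Illustrative features coupling two binary outputs x to one binary input e."""
    return [1.0 if x[0] == e else 0.0, 1.0 if x[0] == x[1] else 0.0]

def Z(e):
    """Partition function for one conditioning value e (brute force over x)."""
    return sum(math.exp(sum(t * f for t, f in zip(theta, feats(x, e))))
               for x in itertools.product([0, 1], repeat=2))

def cond_prob(x, e):
    return math.exp(sum(t * f for t, f in zip(theta, feats(x, e)))) / Z(e)

for e in (0, 1):                           # one Z(e) per distinct evidence value
    print(e, Z(e), cond_prob((1, 1), e))
```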

Maximum Likelihood Estimation (MLE)

Given data: n i.i.d. samples x^(1), ..., x^(n) from P_θ*(X).

MLE objective, minimized over θ (loss plus regularization):

-(1/n) Σ_i log P_θ( x^(i) ) + λ ||θ||_1

L2 regularization is more common; our analysis applies to both L1 & L2.

Gold standard: MLE is (optimally) statistically efficient.

But the loss is hard to compute (it requires inference). Can we learn without intractable inference?

MLE algorithm. Iterate:
• Compute gradient.
• Step along gradient.
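A minimal sketch of this loop (illustrative, not the paper's code; the toy data, pair features, step size, and L2 penalty are assumptions): gradient descent on the regularized negative log-likelihood of a tiny brute-force MRF like the one sketched above. The gradient of the negative log-likelihood is E_model[φ_j] minus E_data[φ_j], and the model expectation is exactly the inference step that becomes intractable for large MRFs.

```python
import itertools
import math

pairs = [(0, 1), (1, 2), (2, 3)]           # hypothetical chain structure
n_vars = 4
data = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)]  # toy "training samples"
lam = 0.01                                 # L2 regularization strength (assumption)

def feats(x):
    """phi_j(x) = 1 if the j-th pair of variables agrees, else 0."""
    return [1.0 if x[i] == x[j] else 0.0 for i, j in pairs]

def model_expectations(theta):
    """E_model[phi_j] by brute-force enumeration (the step that is intractable in general)."""
    xs = list(itertools.product([0, 1], repeat=n_vars))
    ws = [math.exp(sum(t * f for t, f in zip(theta, feats(x)))) for x in xs]
    Z = sum(ws)
    return [sum(w * feats(x)[j] for w, x in zip(ws, xs)) / Z for j in range(len(pairs))]

emp = [sum(feats(x)[j] for x in data) / len(data) for j in range(len(pairs))]
theta = [0.0] * len(pairs)
for _ in range(200):                       # iterate: compute gradient, step along it
    mod = model_expectations(theta)
    grad = [mod[j] - emp[j] + 2 * lam * theta[j] for j in range(len(pairs))]
    theta = [t - 0.5 * g for t, g in zip(theta, grad)]
print(theta)
```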

Sample Complexity Bounds

Related Work

Ravikumar et al. (2010)
• PAC bounds for regression Y_i ~ X with Ising factors.
• Our theory is largely derived from this work.

Maximum Pseudolikelihood (MPLE)

MLE loss: -(1/n) Σ_i log P_θ( x^(i) ). Hard to compute, so replace it!

Pseudolikelihood (MPLE) loss (Besag, 1975): -(1/n) Σ_i Σ_j log P_θ( x_j^(i) | x_-j^(i) )

Intuition: approximate the distribution as a product of local conditionals, e.g. P( deadline | rest ) · P( bags under eyes | rest ) · ... for the example MRF.

Pro: No intractable inference required.
Pro: Consistent estimator.
Con: Less statistically efficient than MLE.
Con: No PAC bounds (prior to this work).
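A minimal sketch of the pseudolikelihood objective (illustrative, not the paper's code; the pairwise model and toy data are assumptions carried over from the earlier snippets): each local conditional P(x_i | x_-i) only requires summing over the two values of x_i, so no global partition function is needed.

```python
import math

pairs = [(0, 1), (1, 2), (2, 3)]           # same hypothetical chain as above
data = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)]

def local_score(theta, x, i, v):
    """Score of x with x_i set to v, keeping only the factors that touch variable i."""
    y = list(x)
    y[i] = v
    return sum(t for (a, b), t in zip(pairs, theta) if i in (a, b) and y[a] == y[b])

def neg_pseudo_loglik(theta, data):
    """-(1/n) sum over samples and variables of log P_theta(x_i | x_-i)."""
    total = 0.0
    for x in data:
        for i in range(len(x)):
            scores = [math.exp(local_score(theta, x, i, v)) for v in (0, 1)]
            total -= math.log(scores[x[i]] / sum(scores))
    return total / len(data)

print(neg_pseudo_loglik([0.5, 0.5, 0.5], data))
```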

Bound on Parameter Error: MLE, MPLE

Λ_min for MLE: the minimum eigenvalue of the Hessian of the loss at θ*.

Λ_min for MPLE: min_i [ minimum eigenvalue of the Hessian of loss component i at θ* ].

Theorem. MLE or MPLE using L1 or L2 regularization achieve a target average per-parameter error with probability ≥ 1-δ using n i.i.d. samples from P_θ*(X), where the required n depends on the corresponding Λ_min. (In the bound, δ is the probability of failure and the guarantee is on the average per-parameter error.)

Bound on Log Loss

Theorem. If the parameter estimation error ε is small, then the log loss converges quadratically in ε; otherwise, the log loss converges linearly in ε, with a constant involving the maximum feature magnitude.
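A schematic rendering of the two regimes (the constants C_1, C_2 and the small-ε threshold are placeholders, not the paper's values; φ_max stands for the maximum feature magnitude mentioned above):

```latex
% Schematic only: C_1, C_2 and the small-\varepsilon threshold are placeholders,
% not the paper's constants; \phi_{\max} is the maximum feature magnitude.
\[
\underbrace{\mathbb{E}_{P_{\theta^*}}\!\big[\log P_{\theta^*}(X) - \log P_{\hat\theta}(X)\big]}_{\text{excess log loss}}
\;\lesssim\;
\begin{cases}
C_1\,\varepsilon^{2} & \text{if } \varepsilon \text{ is small},\\[4pt]
C_2\,\phi_{\max}\,\varepsilon & \text{otherwise.}
\end{cases}
\]
```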

Joint vs. Disjoint Optimization

Joint MPLE: optimize all local conditionals together, over one shared parameter vector.

Disjoint MPLE: optimize each local conditional P(X_i | X_-i) separately, then average the separate estimates of each parameter.

Pro: Data parallel.
Con: Worse bound (extra factors of |X|).

Theorem. Sample complexity bound for disjoint MPLE: the same form as the joint MPLE bound, but with extra factors of |X|; the bound is stated in terms of the number of parameters (the length of θ).
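A minimal sketch of the disjoint estimate-then-average step (the per-component numbers below are made up for illustration; in practice each component's estimate comes from its own regression-style fit):

```python
from collections import defaultdict

# Hypothetical per-component estimates: component i fits P(X_i | X_-i) and
# returns values for the parameters of the factors touching variable i.
estimates = {
    0: {(0, 1): 1.10},
    1: {(0, 1): 1.30, (1, 2): -0.40},
    2: {(1, 2): -0.60, (2, 3): 0.75},
    3: {(2, 3): 0.85},
}

sums, counts = defaultdict(float), defaultdict(int)
for component in estimates.values():       # each component can be fit on its own machine
    for param, value in component.items():
        sums[param] += value
        counts[param] += 1

theta_hat = {p: sums[p] / counts[p] for p in sums}
print(theta_hat)   # roughly {(0, 1): 1.2, (1, 2): -0.5, (2, 3): 0.8}
```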

MLE: Estimate P(Y) all at once

MPLE: Estimate P(Yi|Y-i) separately

Something in between? Estimate a larger component, but keep inference tractable.

Composite Likelihood (MCLE) (Lindsay, 1988): Estimate P(Y_Ai | Y_-Ai) separately, for subsets Y_Ai of Y.


E.g., a model with:
• Weak horizontal factors
• Strong vertical factors

Good choice: vertical combs

Choosing MCLE components Y_Ai:
• Larger is better.
• Keep inference tractable.
• Use model structure (one way to form comb-shaped components is sketched below).
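A minimal sketch of one possible comb construction on a grid (an illustrative assumption, not the paper's exact definition): each component is one "spine" row plus every other column, so the induced subgraph is a tree and inference within a component, conditioned on the remaining variables, stays tractable.

```python
def grid_index(r, c, width):
    """Index of the variable at row r, column c of a width-wide grid."""
    return r * width + c

def vertical_comb(spine_row, offset, width, height):
    """Variables of one comb: row `spine_row` plus columns offset, offset+2, ..."""
    spine = {grid_index(spine_row, c, width) for c in range(width)}
    teeth = {grid_index(r, c, width)
             for c in range(offset, width, 2) for r in range(height)}
    return sorted(spine | teeth)

# Two combs (offsets 0 and 1) together cover every column, hence every vertical factor.
for offset in (0, 1):
    print(vertical_comb(spine_row=0, offset=offset, width=4, height=3))
```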

Bound on Parameter Error: MCLE

ρ_min = min_j [ sum, over the components A_i that estimate θ_j, of the minimum eigenvalue of the Hessian of that component's loss at θ* ].

M_max = max_j [ number of components A_i that estimate θ_j ].

MPLE: one bad estimator P(X_i | X_-i) can give bad results.
MCLE: the effect of a bad estimator P(X_Ai | X_-Ai) can be averaged out by other, good estimators.

Theorem. Parameter-error bound for MCLE, in which ρ_min / M_max plays the role that Λ_min plays for MLE and MPLE.

Intuition: ρ_min / M_max is an average Λ_min, averaged over the multiple components estimating each parameter.
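A tiny numeric illustration of that intuition (the component-level values are made up): the worst single component drives an MPLE-style quantity, while averaging over components smooths out one bad component.

```python
# Made-up local curvature values (Lambda_min) for three components that all
# estimate the same parameter; one component is much weaker than the others.
lambda_min_per_component = [0.50, 0.05, 0.45]

# MPLE-style quantity: the single worst component dominates.
worst = min(lambda_min_per_component)                                    # 0.05

# MCLE-style quantity (rho_min / M_max for this parameter): an average,
# so the weak component is partly compensated by the strong ones.
average = sum(lambda_min_per_component) / len(lambda_min_per_component)  # ~0.33

print(worst, average)   # a larger effective Lambda_min gives a better bound
```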

Experimental Setup

Structures: chains, stars, grids.

Factors (over pairs of binary variables, e.g. X1, X2):
• Associative: the factor equals the factor strength if the two variables agree, and 0 otherwise.
• Random: factor values chosen at random.

Learning:
• 10 runs with separate datasets.
• Optimized with conjugate gradient.
• MLE on big grids: stochastic gradient with Gibbs sampling.

(Plots: L1 parameter error and the L1 parameter-error bound vs. training set size, and log (base e) loss and the log-loss bound given the estimated parameters vs. training set size, for MLE, MPLE, and MPLE-disjoint. Chain, |X|=4, random factors.)

Predictive Power of Bounds

Parameter estimation error ≤ f(sample size) (looser bound).
Log loss ≤ f(parameter estimation error) (tighter bound).

(Plots: L1 parameter error and the L1 parameter-error bound vs. 1/Λ_min, for chains with random factors and 10,000 training examples; MLE shown (similar results for MPLE), for r = 5, 11, 23.)

Is the bound still useful (predictive)?

Yes! Actual error vs. bound:
• Different constants
• Similar behavior
• Nearly independent of r

Outline: Background; Structured Composite Likelihood; Composite Likelihood (MCLE); Tightness of Bounds; Future Work.

Liang and Jordan (2008)
• Asymptotic bounds for pseudolikelihood, composite likelihood.
• Our finite-sample bounds are of the same order.

Learning with approximate inference
• No previous PAC-style bounds for general MRFs, CRFs.
• Cf. Hinton (2002), Koller & Friedman (2009), Wainwright (2006).

(Plots: Λ_min ratio vs. grid width (fixed factor strength) and vs. factor strength (fixed |Y|=8), for combs and MPLE; log loss ratio (other/MLE) and training time (sec) vs. grid size |X|, for combs, MPLE, and MLE. Grid, associative factors (fixed strength), 10,000 training samples.)

Combs (MCLE) lower sample complexity, without increasing computation!

(Plots: Λ_min vs. grid width for MLE, MPLE, and combs (both, vertical, horizontal), with learning and test panels, on a grid with strong vertical (associative) factors.)

Averaging MCLE Components
• Best: component structure matches model structure.
• Average: a reasonable choice without prior knowledge of θ*.
• Worst: component structure does not match model structure.

(Panels: Chains, Stars, Grids.)

All plots are for associative factors. (Random factors behave similarly.)

Plotted: Ratio (Λmin for MLE) / (Λmin for other method)

(Plots: Λ_min ratio for MPLE vs. factor strength (fixed |Y|=8) and vs. model size |Y| (fixed factor strength).)

Model diameter is not important.

MPLE is worse for high-degree nodes.

MPLE is worse for big grids.

MPLE is worse for strong factors. Combs (Structured MCLE) improve upon MPLE.

How do the bounds vary w.r.t. model properties?

Λmin for Various Models

Future Work
• Theoretical understanding of how Λ_min varies with model properties.
• Choosing MCLE structure on natural graphs.
• Parallel learning: lowering the sample complexity of disjoint optimization via limited communication.
• Comparing with MLE using approximate inference.

Acknowledgements
• Thanks to John Lafferty, Geoff Gordon, and our reviewers for helpful feedback.
• Funded by NSF CAREER IIS-0644225, ONR YIP N00014-08-1-0752, and ARO MURI W911NF0810242.

Abbeel et al. (2006)

Theorem. If the canonical parameterization uses the factorization of P(X), it is equivalent to MPLE with disjoint optimization.

• The only previous method for PAC-learning high-treewidth discrete MRFs (low-degree factor graphs over discrete X).
• Main idea (their "canonical parameterization"): rewrite P(X) as a ratio of many small factors P( X_Ci | X_-Ci ), and estimate each small factor P( X_Ci | X_-Ci ) from data.
• Fine print: each factor is instantiated 2^|Ci| times using a reference assignment.

Computing MPLE directly is faster. Our analysis covers their learning method.