PAC-Bayes Risk Bounds for Sample-Compressed Gibbs Classifiers



ICML 2005

François Laviolette and Mario Marchand, Université Laval

PLAN

The “traditional” PAC-Bayes theorem (for the usual data-independent setting)

The “generalized” PAC-Bayes theorem (for the more general sample compression setting)

Implications and follow-ups

A result from folklore:
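In one standard form (a sketch assuming a countable hypothesis class H, a data-independent prior P over H, m i.i.d. training examples, and confidence parameter δ; constants may differ from the slide’s own statement): with probability at least 1 − δ over the draw of the training set S, for every h in H,

    R(h) ≤ R_S(h) + sqrt( ( ln(1/P(h)) + ln(1/δ) ) / (2m) ),

where R(h) is the true risk and R_S(h) the empirical risk on S.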

In particular, for Gibbs classifiers:
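Since the folklore bound holds simultaneously for all h, averaging it under any posterior Q gives a bound for the Gibbs classifier G_Q, which classifies each example with a fresh h drawn according to Q (again a sketch, not necessarily the slide’s exact form):

    R(G_Q) = E_{h∼Q} R(h) ≤ R_S(G_Q) + E_{h∼Q} sqrt( ( ln(1/P(h)) + ln(1/δ) ) / (2m) ).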

What if we choose P after observing the data?

The “traditional” PAC-Bayes Theorem
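One standard statement (the Langford-Seeger form, assumed here to match Theorem 1 up to constants): for any data-independent prior P over H, with probability at least 1 − δ over the draw of the m-sample S, for all posteriors Q,

    kl( R_S(G_Q) ‖ R(G_Q) ) ≤ ( KL(Q‖P) + ln((m+1)/δ) ) / m,

where kl(q‖p) = q ln(q/p) + (1−q) ln((1−q)/(1−p)) and KL(Q‖P) is the Kullback-Leibler divergence between the posterior Q and the prior P.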

The Gibbs and the majority vote

We have a bound for G_Q, but in practice we normally use the Bayes classifier B_Q instead (the Q-weighted majority vote classifier).

Consequently, R(B_Q) ≤ 2 R(G_Q) (this can be improved with the “de-randomization” technique of Langford and Shawe-Taylor, 2003).
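The factor of 2 comes from a standard Markov-inequality argument: let W_Q(x, y) = E_{h∼Q} I[h(x) ≠ y] be the Q-fraction of voters that err on (x, y). The majority vote B_Q errs only when W_Q(x, y) ≥ 1/2, so

    R(B_Q) ≤ Pr[ W_Q(x, y) ≥ 1/2 ] ≤ 2 E[ W_Q(x, y) ] = 2 R(G_Q).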

So the PAC-Bayes theorem also gives a bound on the majority vote classifier.

The sample compression setting

Theorem 1 is valid in the usual data-independent setting, where H is defined without reference to the training data.

Example: H = the set of all linear classifiers h: R^n → {-1, +1}

In the more general sample compression setting, each classifier is identified by 2 different sources of information:

The compression set: an (ordered) subset of the training set

A message string of additional information needed to identify a classifier

Theorem 1 is not valid in this more general setting

To be more precise: in the sample compression setting, there exists a “reconstruction” function R that gives a classifier

h = R(σ, S_i)

when given a compression set S_i and a message string σ.

Recall that S_i is an ordered subset of the training set S, where the order is specified by the index vector i = (i_1, i_2, …, i_|i|).

Examples

Set Covering Machines (SCM) [Marchand and Shawe-Taylor, JMLR 2002]

Decision List Machines (DLM) [Marchand and Sokolova JMLR 2005]

Support Vector Machines (SVM)

Nearest neighbour classifiers (NNC) (a minimal sketch follows below)

…
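As a concrete illustration of the reconstruction-function view, here is a minimal sketch (a toy under stated assumptions, not from the paper; the name reconstruct_nn and the data are invented) of a sample-compressed 1-nearest-neighbour classifier: the compression set S_i is the list of stored training examples, and no message string is needed.

import numpy as np

def reconstruct_nn(message, compressed_examples):
    # Reconstruction function R(sigma, S_i) for a 1-nearest-neighbour classifier.
    # compressed_examples: the compression set S_i, a list of (x, y) pairs kept
    # from the training set; message: unused, since 1-NN needs no message string.
    xs = np.array([x for x, _ in compressed_examples], dtype=float)
    ys = [y for _, y in compressed_examples]

    def h(x):
        # Predict the label of the closest stored example.
        dists = np.linalg.norm(xs - np.asarray(x, dtype=float), axis=1)
        return ys[int(np.argmin(dists))]

    return h

# Toy usage: S is the training set, i the compression-set index vector.
S = [([0.0, 0.0], -1), ([1.0, 1.0], +1), ([0.9, 1.1], +1), ([0.1, -0.2], -1)]
i = (0, 1)                           # keep only two training examples
S_i = [S[k] for k in i]              # the compression set
h = reconstruct_nn(message=None, compressed_examples=S_i)
print(h([0.2, 0.1]), h([0.8, 0.9]))  # expected output: -1 1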

We will thus use priors defined over the set of all the parameters (i, σ) needed by the reconstruction function R, once a training set S is given.

The priors should be written as:
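One typical way to write such a prior (an assumed form common in the sample-compression literature, not necessarily the paper’s exact notation) factorizes it over index vectors and messages:

    P(i, σ) = P_I(i) · P_{M_i}(σ),

where P_I is a distribution over the possible compression-set index vectors i, P_{M_i} is a distribution over the message strings σ that can accompany i, and both are fixed before seeing the training data.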

Priors in the sample compression setting

The priors must be data-independent.

The “generalized” PAC-Bayes Theorem
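Schematically (a sketch of the bound’s structure only, under the assumption that the empirical risk of a sample-compressed classifier is measured on the examples outside its compression set; the paper’s exact statement and constants may differ): for any data-independent prior P over the pairs (i, σ), with probability at least 1 − δ over S, for all posteriors Q,

    kl( R_S(G_Q) ‖ R(G_Q) ) ≲ ( KL(Q‖P) + ln((m+1)/δ) ) / ( m − d_Q ),

where d_Q denotes the Q-average compression-set size; the traditional theorem is recovered when no training examples are used for compression (d_Q = 0).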

The KL(Q‖P) term in the bound incorporates Occam’s principle of parsimony.

The new PAC-Bayes theorem shows that the risk bound for the Gibbs classifier G_Q can be lower than the risk bound for any of its members.

The PAC-Bayes theorem for bounded compression set size
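For example (a standard accounting, assuming P_I is chosen uniform over the C(m, d) possible compression sets of size d), a posterior Q concentrated on a single pair (i, σ) pays

    KL(Q‖P) = ln(1/P(i, σ)) = ln C(m, d) + ln(1/P_{M_i}(σ)) ≤ d ln(em/d) + ln(1/P_{M_i}(σ)),

so smaller compression sets and shorter (more probable) message strings yield smaller risk bounds; this is how Occam’s principle of parsimony enters the bound.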

Conclusion

The new PAC-Bayes bound:

is valid in the more general sample compression setting;

automatically incorporates Occam’s principle of parsimony.

A sample-compressed Gibbs classifier can have a smaller risk bound than any of its members.

The next steps

Finding derived bounds for particular sample-compressed classifiers such as majority votes of SCMs and DLMs, SVMs, and NNCs.

Developing new learning algorithms based on the theoretical information given by the bound.

A tight risk bound for majority vote classifiers?