PAC-Bayes Risk Bounds for
Sample-Compressed Gibbs Classifiers
ICML 2005
François Laviolette and Mario Marchand, Université Laval
PLAN
The “traditional” PAC-Bayes theorem (for the usual data-independent setting)
The “generalized” PAC-Bayes theorem (for the more general sample compression setting)
Implications and follow-ups
A result from folklore:
In particular, for Gibbs classifiers:
What if we choose P after observing the data?
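As a sketch of the folklore result (the exact constants on the original slide may differ): for any data-independent prior distribution P over a countable set H of classifiers, with probability at least 1 − δ over the random draw of an m-sample S,
\[
R(h) \;\le\; R_S(h) \;+\; \sqrt{\frac{\ln\frac{1}{P(h)} + \ln\frac{1}{\delta}}{2m}}
\quad \text{simultaneously for all } h \in \mathcal{H}.
\]
For a Gibbs classifier G_Q, which draws a fresh h ∼ Q for each prediction, R(G_Q) = E_{h∼Q} R(h) and R_S(G_Q) = E_{h∼Q} R_S(h), so the bound can be averaged over Q. The argument relies on P being fixed before the data are seen, which is exactly the issue raised by the last question.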
The “traditional” PAC-Bayes Theorem
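One common formulation (the Langford–Seeger form, presumably close to the Theorem 1 shown on the slide): for any data-independent prior P over H and any δ ∈ (0, 1], with probability at least 1 − δ over the draw of S ∼ D^m, simultaneously for all posteriors Q,
\[
\mathrm{kl}\!\left(R_S(G_Q)\,\middle\|\,R(G_Q)\right) \;\le\; \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{m+1}{\delta}}{m},
\]
where kl(q‖p) = q ln(q/p) + (1 − q) ln((1 − q)/(1 − p)) is the binary KL divergence and KL(Q‖P) is the KL divergence between the posterior and the prior.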
The Gibbs and the majority vote
We have a bound for G_Q, but we normally use instead the Bayes classifier B_Q (the Q-weighted majority vote classifier).
Consequently R(B_Q) ≤ 2 R(G_Q) (this can be improved with the “de-randomization” technique of Langford and Shawe-Taylor, 2003).
So the PAC-Bayes theorem also gives a bound on the Majority vote classifier.
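The factor of 2 comes from a standard Markov-inequality argument (a sketch, writing W_Q(x, y) = E_{h∼Q} 1[h(x) ≠ y] for the Q-weighted disagreement): B_Q errs on (x, y) only when at least half of the Q-mass errs there, so
\[
R(B_Q) \;\le\; \Pr_{(x,y)\sim D}\!\left[ W_Q(x,y) \ge \tfrac12 \right]
\;\le\; 2 \mathop{\mathbf{E}}_{(x,y)\sim D} W_Q(x,y) \;=\; 2\,R(G_Q).
\]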
The sample compression setting
Theorem 1 is valid in the usual data-independent setting where H is defined without reference to the training data.
Example: H = the set of all linear classifiers h: R^n → {−1, +1}.
In the more general sample compression setting, each classifier is identified by 2 different sources of information:
The compression set: an (ordered) subset of the training set.
A message string of additional information needed to identify a classifier.
Theorem 1 is not valid in this more general setting
To be more precise: in the sample compression setting, there exists a “reconstruction” function R that gives a classifier
h = R(σ, S_i)
when given a compression set S_i and a message string σ.
Recall that S_i is an ordered subset of the training set S where the order is specified by i = (i_1, i_2, …, i_|i|).
Examples
Set Covering Machines (SCM) [Marchand and Shawe-Taylor, JMLR 2002]
Decision List Machines (DLM) [Marchand and Sokolova, JMLR 2005]
Support Vector Machines (SVM), nearest neighbour classifiers (NNC), …
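As a purely illustrative sketch (not taken from the slides) of how reconstruction works for one of these examples: for a hard-margin SVM, the compression set can be taken to be the support vectors and the message string can be empty, since retraining on the support vectors alone recovers the same separating hyperplane; schematically,
\[
h \;=\; R(\sigma, S_{\mathbf{i}}),
\qquad S_{\mathbf{i}} = \{\text{support vectors of } h\},
\qquad \sigma = \text{(empty string)}.
\]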
We will thus use priors defined over the set of all the parameters (i, σ) needed by the reconstruction function R, once a training set S is given.
Priors in the sample compression setting
The priors should be written as distributions P(i, σ) over the pairs (i, σ).
The priors must be data-independent.
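A typical data-independent choice (an illustration of the kind of prior meant here, not necessarily the one used on the slides) factorizes over the index vector and the message string:
\[
P(\mathbf{i}, \sigma) \;=\; P_I(\mathbf{i})\, P_M(\sigma \mid \mathbf{i}),
\]
where P_I is a fixed distribution over index vectors, chosen so that longer vectors receive smaller mass (for instance, uniform over the index vectors of each size d, weighted by ζ(d) ∝ 2^{−d}), and P_M(· | i) is a fixed distribution over message strings. Because such a prior assigns smaller mass to larger compression sets and longer messages, the KL(Q‖P) term in the bound automatically penalizes unparsimonious classifiers.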
The “generalized” PAC-Bayes Theorem
The divergence term KL(Q‖P) (suitably rescaled) incorporates Occam’s principle of parsimony.
The new PAC-Bayes theorem states that the risk bound for the sample-compressed Gibbs classifier G_Q is lower than the risk bound for any single classifier in its support.
The PAC-Bayes theorem for bounded compression set size
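As a hedged sketch of the general shape such a bound takes when every compression set has size at most d (the exact constants, and the precise definition of the empirical risk, which is computed on the examples outside the compression set, follow the paper and may differ from this sketch):
\[
\mathrm{kl}\!\left(R_S(G_Q)\,\middle\|\,R(G_Q)\right) \;\lesssim\;
\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{m+1}{\delta}}{m - d}.
\]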
Conclusion
The new PAC-Bayes bound
is valid in the more general sample compression setting;
automatically incorporates Occam’s principle of parsimony.
A sample-compressed Gibbs classifier can have a smaller risk bound than any of its members.
The next steps
Finding derived bounds for particular sample-compressed classifiers such as majority votes of SCMs and DLMs, SVMs, and NNCs.
Developing new learning algorithms based on the theoretical information given by the bound.
A tight risk bound for majority-vote classifiers?