Nonparametric Bayesian Classification
Marc A. Coram, University of Chicago
http://galton.uchicago.edu/~coram
Persi Diaconis, Steve Lalley
Related Approaches
• Chipman, George, McCulloch: Bayesian CART (1998 a,b)
  • Nested, CART-like
  • Coordinate-aligned splits
  • Good “search” ability
• Denison, Mallick, Smith: Bayesian CART; Bayesian splines and “MARS”
Outline
• Medical example
• Theoretical framework
• Bayesian proposal
• Implementation
• Simulation experiments
• Theoretical results
• Extensions to a general setting
Example: AIDS Data (1-dimensional)
• AIDS patients
• Covariate of interest: viral resistance level in blood sample
• Goal: estimate conditional probability of response
Idealized Setting
• (X, Y) iid pairs
• X (covariate): X ∈ [0,1]
• Y (response): Y ∈ {0,1}
• f0 (true parameter): f0(x) = P(Y = 1 | X = x)

What, then, is a straightforward way to proceed, thinking like a Bayesian?
Prior on f: 1-dimension
• Pick a non-negative integer M at randomSay, choose M=0 with prob 1/2
M=1 with prob 1/4M=2 with prob 1/8….
• Conditional on M=m, Randomly choose a step functionfrom [0,1] into [0,1] with m jumps
• (i.e. locate the m jumps and (m+1) valuesindependently and uniformly)
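As a concrete illustration, here is a minimal Python sketch of drawing from this prior (the function and variable names are mine, not from the talk):

```python
import numpy as np

def sample_prior_step_function(rng):
    """Draw a step function f: [0,1] -> [0,1] from the prior:
    P(M = m) = 2^-(m+1); given M = m, the m jump locations and
    the m+1 values are independent Uniform(0,1)."""
    m = rng.geometric(0.5) - 1            # numpy's geometric starts at 1
    jumps = np.sort(rng.uniform(size=m))  # u: jump locations
    values = rng.uniform(size=m + 1)      # v: one value per interval
    # searchsorted maps each x to the interval it falls in
    return lambda x: values[np.searchsorted(jumps, x)]

rng = np.random.default_rng(0)
f = sample_prior_step_function(rng)
print(f(np.array([0.1, 0.5, 0.9])))
```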
Perspective
• Simple prior on stepwise functions
• Functions are parameterized by:
  • m ∈ {0, 1, 2, …}, the number of jumps (m + 1 regions)
  • u ∈ [0,1]^m, the jump locations
  • v ∈ [0,1]^{m+1}, the function values
• Goal: Get samples from the posterior; average to estimate the posterior mean curve
• Idea: Use MCMC, but prefer analytical calculations whenever possible
Observations
• The joint distribution of U, V, and the data has density proportional to:

  2^{-(m+1)} \prod_{j=1}^{m+1} v_j^{n_j^1} (1 - v_j)^{n_j^0}

  where:
  n_j^1(u, x, y) = # heads in the j'th interval
  n_j^0(u, x, y) = # tails in the j'th interval

• Conditional on u, the counts are sufficient for v.
Observations II
• The marginal of the posterior on U has density proportional to:

  2^{-(|u|+1)} \prod_{j=1}^{|u|+1} \beta(n_j^1, n_j^0)

• Conditional on U = u and the data, the V's are independent Beta random variables:

  V_j | (u, data) ~ Beta(n_j^1 + 1, n_j^0 + 1)

  and

  E(V_j | u, data) = (n_j^1 + 1) / (n_j^1 + n_j^0 + 2)

where:

  \beta(a, b) = \int_0^1 u^a (1 - u)^b \, du = a! \, b! / (a + b + 1)!

(A code sketch of these formulas follows.)
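These formulas are easy to compute. A minimal Python sketch (names are mine; x and y are numpy arrays, with y the 0/1 responses) of the marginal log posterior on u, using log-gamma for the factorials:

```python
import numpy as np
from math import lgamma

def interval_counts(u, x, y):
    """n_j^1 (heads) and n_j^0 (tails) in each of the |u|+1 intervals."""
    j = np.searchsorted(np.sort(u), x)        # interval index of each x
    k = len(u) + 1
    n1 = np.bincount(j, weights=y, minlength=k)
    n0 = np.bincount(j, weights=1 - y, minlength=k)
    return n1, n0

def log_beta(a, b):
    """log beta(a, b) = log( a! b! / (a+b+1)! ), via log-gamma."""
    return lgamma(a + 1) + lgamma(b + 1) - lgamma(a + b + 2)

def log_posterior_u(u, x, y):
    """Unnormalized log of 2^-(|u|+1) * prod_j beta(n_j^1, n_j^0)."""
    n1, n0 = interval_counts(u, x, y)
    return (-(len(u) + 1) * np.log(2.0)
            + sum(log_beta(a, b) for a, b in zip(n1, n0)))
```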
Consequently…
• In principle:• We put a prior on piecewise constant curves• The curves are specified by
•u, a vector in [0,1]m
•v, a vector in [0,1]m+1
• for some m• We sample curves from the posterior using MCMC• We take the posterior mean (pointwise) of the
sampled curves
• In practice:• We need only sample from the posterior on u• We can then compute the conditional mean of all
the curves with this u.
Implementation
• Build a reversible base chain to sample U from the prior
  • E.g., start with an empty vector and add, delete, and move coordinates randomly
• Apply Metropolis-Hastings to construct a new chain which samples from the posterior on U
• Compute:

  \hat{f}_u(x) = E(f(x) | u, data) = \sum_{j=1}^{m+1} \frac{n_j^1 + 1}{n_j^1 + n_j^0 + 2} \, 1\{x \in I_j\}

  \hat{f}(x) = E(f(x) | data): average \hat{f}_u(x) over the posterior samples of u

(A minimal code sketch follows.)
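Continuing the earlier sketch (reusing interval_counts and log_posterior_u), here is a hedged Metropolis-Hastings implementation. The move mix and tuning are illustrative, not the thesis's exact chain; inserting at, and deleting from, a uniformly random position makes the add/delete proposal densities cancel in the acceptance ratio:

```python
import numpy as np

def mh_sample_u(x, y, n_iter=20000, seed=0):
    """Metropolis-Hastings on the jump locations u."""
    rng = np.random.default_rng(seed)
    u = np.array([])                      # start with no jumps
    logp = log_posterior_u(u, x, y)
    samples = []
    for _ in range(n_iter):
        move = rng.integers(3)
        if move == 0:                     # add a jump at a random slot
            prop = np.insert(u, rng.integers(len(u) + 1), rng.uniform())
        elif move == 1 and len(u):        # delete a random jump
            prop = np.delete(u, rng.integers(len(u)))
        elif move == 2 and len(u):        # move (redraw) a random jump
            prop = u.copy()
            prop[rng.integers(len(u))] = rng.uniform()
        else:                             # nothing to delete or move
            samples.append(u)
            continue
        logp_prop = log_posterior_u(prop, x, y)
        if np.log(rng.uniform()) < logp_prop - logp:   # accept/reject
            u, logp = prop, logp_prop
        samples.append(u)
    return samples

def posterior_mean_curve(samples, x, y, grid):
    """Pointwise average of E[f(t) | u, data] over the sampled u's."""
    fbar = np.zeros_like(grid, dtype=float)
    for u in samples:
        n1, n0 = interval_counts(u, x, y)
        vals = (n1 + 1) / (n1 + n0 + 2)   # E[V_j | u, data]
        fbar += vals[np.searchsorted(np.sort(u), grid)]
    return fbar / len(samples)
```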
Simulation Experiment (a)
[Figure: true curve vs. posterior mean, n = 1024]
Predictive Probability Surface
Posterior on #-jumps
Stable w.r.t Prior
Decomposition
Classification and Regression Trees
(CART)• Consider splitting the data into the set with
X<x and the set with X>x• Choose x to maximize the fit• Recurse on each subset• “Prune” away splits according to a
complexity criterion whose parameter is determined by cross-validation
• Splits that do not “explain” enough variability get pruned off
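For comparison, here is a hedged sketch of that recipe using scikit-learn's CART implementation (assuming scikit-learn is available; its ccp_alpha plays the role of the complexity parameter cp, chosen here by cross-validation; the toy data are mine):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 1))            # covariate in [0,1]
y = (rng.uniform(size=500) < 0.3 + 0.4 * X[:, 0]).astype(int)

# Grow trees at several complexity penalties; pick one by 5-fold CV.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"ccp_alpha": [0.0, 0.001, 0.005, 0.01, 0.05]},
                      cv=5)
search.fit(X, y)
print(search.best_params_, search.best_estimator_.get_n_leaves())
```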
Simulation Experiment (b)
[Figure: true curve vs. posterior mean vs. CART]
Bagging
• To “bag” an estimator you treat the estimator as a black box
• Repeatedly, generate bootstrap resamples from the data set and run the estimator on these new “data sets.”
• Average the resulting estimates (a generic code sketch follows)
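In code, bagging is a short wrapper around any fit function. A generic sketch (names mine; the base learner just needs a predict_proba method, e.g. a fitted decision tree):

```python
import numpy as np

def bag(fit, X, y, n_boot=100, seed=0):
    """Bag a black-box estimator: fit it on bootstrap resamples of the
    data and return a predictor that averages the fitted models."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_boot):
        idx = rng.integers(len(X), size=len(X))  # resample n rows with replacement
        models.append(fit(X[idx], y[idx]))
    def predict(X_new):
        # average predicted P(Y = 1) across the bootstrap fits
        return np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)
    return predict
```

For example, bag(lambda X, y: DecisionTreeClassifier().fit(X, y), X, y) would give a bagged CART of the kind compared below.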
Simulation Experiment (c)
[Figure: true curve vs. posterior mean vs. CART vs. bagged CART, full trees]

Simulation Experiment (d)
[Figure: true curve vs. posterior mean vs. CART vs. bagged CART, cp = 0.005]

Simulation Experiment (e)
[Figure: true curve vs. posterior mean vs. CART vs. bagged CART, cp = 0.01]
Simulations 2-10
• CART
• Bagged CART: cp = 0.01
• Bagged Bayes??
• Smoothers?
• Boosting? (Lasso Stumps)
• Dyadic Bayes [Diaconis, Freedman]
• Monotone Invariance?
Bayesian Consistency
• Consistent at f0 if: the posterior probability of N_ε = {f : ||f - f0||_1 < ε} tends to 1 a.s. for any ε > 0
• Since all f are bounded, consistency implies a fortiori that:

  \hat{f}_n → f0 in L^1, a.s.
Sample Size 8192
Related Work: Diaconis and Freedman (1995)
• Similar hierarchical prior, but:
  • Aggressive splitting
  • Fixed split points
• Strong results:
  • If π dies off at a specific geometric rate: consistency for all f0
  • If π dies off just slower than this: the posterior will be inconsistent at f0 = 1/2
• Consistency results cannot be taken for granted

DF prior: K ~ π; given K = k, split [0,1] into 2^k equal pieces.
[Figure: the dyadic partition for k = 3]
Consistency Theorem: Thesis
If (Xi, Yi), i = 1…n, are drawn iid via:
  X ~ U(0,1)
  Y | X = x ~ Bernoulli(f0(x))
and if Π is the specified prior on f, chosen so that the tails of the prior on the hierarchy level M decay like exp(-n log(n)), then Π_n, the posterior, is a consistent estimate of f0, for any measurable f0.
Method of Proof
• Barron, Schervish, Wasserman (1999)
• Need to show:
  • Lemma 1: The prior puts positive mass on all Kullback-Leibler information neighborhoods of f0
  • Choose sieves: F_n = {f : f has no more than n/log(n) splits}
  • Lemma 2: The ε-upper metric entropy of F_n is o(n)
  • Lemma 3: Π(F_n^c) decays exponentially
New Result
• Coram and Lalley 2004/5 (hopefully)
  • Consistency holds for any prior with infinite support, if the true function is not identically ½
  • Consistency for the ½ case depends on the tail decay*
• Proof revolves around a large-deviation question:
  • How does the predictive probability behave as n → ∞ for a model with m = an splits? (0 < a < ∞)
• Proof uses the subadditive ergodic theorem to take advantage of self-similarity in the problem
A Guessing Game
With probability 1/2 each:
• Flip a fair coin repeatedly
• Pick p in [0,1] at random; flip that p-coin repeatedly
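The calculation behind the game (my gloss, using the β defined earlier): a sequence with h heads and t tails has probability 2^{-(h+t)} under the fair coin, but β(h, t) under the uniformly random p-coin:

```latex
P(\text{seq} \mid p = \tfrac12) = 2^{-(h+t)},
\qquad
P(\text{seq} \mid p \sim U(0,1)) = \int_0^1 p^{h}(1-p)^{t}\,dp
  = \frac{h!\,t!}{(h+t+1)!} = \beta(h,t).
```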
[Figure: results at sample sizes 64, 128, 256, 512, 1024, 2048, 4096, 8192]
A Voronoi Prior for [0,1]d
[Figure: five centers, 1 through 5, partition the square into Voronoi cells V1 through V5]
A Modified Voronoi Prior for General Spaces
• Choose M, as before
• Draw V = (V1, V2, …, Vk)
  • with each Vj drawn without replacement from an a priori fixed set A
• In practice, I take A = {X1, …, Xn}
  • This approximates drawing the V's from the marginal distribution of X
(A minimal sketch follows.)
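A minimal sketch of drawing one partition from this prior, working only from the pairwise distance matrix D (the naming, and the choice of k = M + 1 cells to parallel the m + 1 intervals of the 1-d prior, are my assumptions):

```python
import numpy as np

def sample_voronoi_partition(D, rng):
    """Draw a random Voronoi partition of the data points.
    D: n x n pairwise distance matrix; candidate centers A = the data.
    M ~ Geometric as before; given M = m, draw k = m + 1 centers
    without replacement and assign each point to its nearest center."""
    n = D.shape[0]
    m = rng.geometric(0.5) - 1
    k = min(m + 1, n)                     # can't exceed n distinct centers
    centers = rng.choice(n, size=k, replace=False)
    # cell of point i = the chosen center nearest to i
    cells = centers[np.argmin(D[:, centers], axis=1)]
    return centers, cells
```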
Discussion
• CON:
  • Not quite Bayesian: A depends on the data
• PRO:
  • Only partitions the relevant subspace
  • Applies in general metric spaces
  • Only depends on D, the pairwise distance matrix
  • Intuitive content
Intuition
[Figure: samples from the prior with k parts, k = 2, 3, 4, 5, 6]
2-dimensional Simulated Data
Posterior Samples
Posterior Mean
Bagged CART
Weighted Voronoi
Acknowledgements
• Steve Lalley
• Persi Diaconis
• National Science Foundation; Lieberman Fellowship
• AIDS data: Andrew Zolopa, Howard Rice
Future Directions
• Theoretical:
  • Extend theoretical results to a more general setting
  • Tighten results to determine where inconsistency first appears
  • Determine the rate of convergence
• Practical:
  • Refine MCMC mixing using better simulated tempering
  • Improve computational speed
  • Explore weighted Voronoi and “smoothed” Voronoi priors
  • Compare with SVMs and Boosting
  • Use the posterior to produce confidence statements
Smoothing
Highlights
• Straightforward Bayesian motivation• Implementation actually works• Prior can be adjusted to utilize
domain knowledge• Provides a framework for inference• Compares favorably with
CART/Bagged CART• Theoretically tractable• Targets high dimensional problems
Background
• Enormous literature
• Theoretical results starting from the consistency of nearest neighbors
• Methodologies:
  • CART
  • Logistic Regression
  • Wavelets
  • SVMs
  • Neural Nets
• Bayesian literature:
  • Bayesian CART
  • Image Segmentation
• Bayesian theory:
  • Diaconis and Freedman
  • Barron, Schervish, Wasserman
Posterior Calculation (2-dimensional example)
Spatial Adaptation
Stephane Nullins, PRISME
Nonparametric Prior: 1-dimension
1. Pick K = k from Geometric(1/2)
2. Partition [0,1] into k intervals
3. Assign each j an S_j iid U(0,1)

  f(x) = \sum_{j=1}^{k} S_j \, 1_{I_j}(x)

[Figure: [0,1] partitioned into intervals 1 through 4 with values v1, v2, v3]
Consistency Results (1-dimensional)

Setup:
• X's iid U(0,1)
• Y | X = x ~ Bernoulli(f0(x))
• π is the prior on k

Result:
• If the tails of π decay geometrically, then for any measurable f0, the posterior Π_n is consistent at f0.

Key tools:
• Kullback-Leibler inequalities, Weierstrass approximation (prior is “dense”)
• Sieves (prior is “almost” finite dimensional)
• Upper brackets (prior is “almost” finite)
• Large deviations (each likelihood ratio test is asymptotically powerful)