
Nonparametric Bayesian Classification

Marc A. Coram, University of Chicago

http://galton.uchicago.edu/~coram

Persi Diaconis Steve Lalley

Related Approaches

• Chipman, George, McCulloch
  • Bayesian CART (1998 a, b)
  • Nested
  • CART-like
  • Coordinate-aligned splits
  • Good “search” ability

• Denison, Mallick, Smith
  • Bayesian CART
  • Bayesian splines and “MARS”

Outline

• Medical example

• Theoretical framework

• Bayesian proposal

• Implementation

• Simulation experiments

• Theoretical results

• Extensions to a general setting

Example: AIDS Data (1-dimensional)

• AIDS patients

• Covariate of interest: viral resistance level in blood sample

• Goal: estimate conditional probability of response

Idealized Setting

(X,Y) iid pairs

X (covariate): X ∈ [0, 1]

Y (response): Y ∈ {0, 1}

f0 (true parameter): f0(x) = P(Y = 1 | X = x)

What, then, is a straightforward way to proceed, thinking like a Bayesian?

Prior on f: 1-dimension

• Pick a non-negative integer M at random. Say, choose M = 0 with prob 1/2, M = 1 with prob 1/4, M = 2 with prob 1/8, ….

• Conditional on M = m, randomly choose a step function from [0, 1] into [0, 1] with m jumps

• (i.e., locate the m jumps and the (m + 1) values independently and uniformly; see the sketch below)
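A minimal sketch of drawing one such step function from this prior, assuming NumPy; the helper name `sample_prior_step_function` is illustrative, not from the talk:

```python
import numpy as np

def sample_prior_step_function(rng=np.random.default_rng()):
    """Draw one step function f: [0,1] -> [0,1] from the prior described above."""
    # M has the geometric prior P(M = m) = 2^{-(m+1)}, m = 0, 1, 2, ...
    m = rng.geometric(0.5) - 1
    u = np.sort(rng.uniform(0.0, 1.0, size=m))   # the m jump locations
    v = rng.uniform(0.0, 1.0, size=m + 1)        # the m+1 function values

    def f(x):
        # value of the step function on the piece of [0,1] containing x
        return v[np.searchsorted(u, x)]

    return u, v, f

# Example: evaluate one prior draw on a grid
u, v, f = sample_prior_step_function()
print(f(np.linspace(0, 1, 11)))
```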

Perspective

• Simple prior on stepwise functions
• Functions are parameterized by:

• Goal: Get samples from the posterior; average to estimate posterior mean curve

• Idea: Use MCMC, but prefer analytical calculations whenever possible

Parameterization:
• m ∈ {0, 1, 2, …}: # of jumps (m + 1 regions)
• u ∈ [0, 1]^m: jump locations
• v ∈ [0, 1]^(m+1): function values

Observations

•The joint distribution of U, V, and the data has density proportional to:

• Conditional on u, the counts are sufficient for v.

$$\pi(u, v, x, y) \propto 2^{-(m+1)} \prod_{j=1}^{m+1} v_j^{\,n_j^1} (1 - v_j)^{\,n_j^0}$$

where:

n_j^1(u, x, y) = # heads in the j'th interval
n_j^0(u, x, y) = # tails in the j'th interval

Observations II

The marginal of the posterior on U has density proportional to:

$$\pi(u \mid \text{data}) \propto 2^{-(|u|+1)} \prod_{j=1}^{|u|+1} \beta\big(n_j^1, n_j^0\big)$$

Conditional on U=u and the data, V’s are independent Beta random variables

$$V_j \mid (u, \text{data}) \sim \text{Beta}\big(n_j^1 + 1,\; n_j^0 + 1\big)$$

and

$$E\big(V_j \mid u, \text{data}\big) = \frac{n_j^1 + 1}{n_j^1 + n_j^0 + 2}$$

where:

$$\beta(a, b) = \int_0^1 u^a (1 - u)^b \, du = \frac{a!\, b!}{(a + b + 1)!}$$
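A minimal sketch of evaluating the log of this unnormalized marginal posterior for a given jump vector u, assuming NumPy/SciPy; the function name `log_marginal_posterior` is illustrative:

```python
import numpy as np
from scipy.special import betaln  # log of the standard Beta function B(a, b)

def log_marginal_posterior(u, x, y):
    """Unnormalized log posterior of the jump locations u given data (x, y).

    u : sorted array of jump locations in [0, 1] (may be empty)
    x : covariates in [0, 1];  y : 0/1 responses (NumPy arrays)
    """
    m = len(u)
    bins = np.searchsorted(u, x)          # interval index j(x_i) for each observation
    logp = -(m + 1) * np.log(2.0)         # prior factor 2^{-(m+1)}
    for j in range(m + 1):
        n1 = np.sum(y[bins == j])         # heads in the j'th interval
        n0 = np.sum(bins == j) - n1       # tails in the j'th interval
        # beta(n1, n0) above equals the standard Beta function B(n1 + 1, n0 + 1)
        logp += betaln(n1 + 1, n0 + 1)
    return logp
```

This is the quantity the Metropolis-Hastings step sketched later needs for each proposed u.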

Consequently…

• In principle:
  • We put a prior on piecewise constant curves
  • The curves are specified by:
    • u, a vector in [0, 1]^m
    • v, a vector in [0, 1]^(m+1)
    • for some m
  • We sample curves from the posterior using MCMC
  • We take the posterior mean (pointwise) of the sampled curves

• In practice:
  • We need only sample from the posterior on u
  • We can then compute the conditional mean of all the curves with this u (see the sketch below)
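A minimal sketch of that conditional-mean computation, assuming NumPy: given u and the data, the pointwise posterior mean over all v's is available in closed form via the Beta posterior means derived above. The name `conditional_mean_curve` is illustrative:

```python
import numpy as np

def conditional_mean_curve(u, x, y, grid):
    """E[f(t) | u, data] at each point t in grid."""
    bins_data = np.searchsorted(u, x)        # interval index for each observation
    bins_grid = np.searchsorted(u, grid)     # interval index for each grid point
    means = np.empty(len(u) + 1)
    for j in range(len(u) + 1):
        n1 = np.sum(y[bins_data == j])       # heads in interval j
        n0 = np.sum(bins_data == j) - n1     # tails in interval j
        means[j] = (n1 + 1) / (n1 + n0 + 2)  # posterior mean of V_j
    return means[bins_grid]
```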

Implementation

• Build a reversible base chain to sample U from the prior
  • E.g., start with an empty vector and add, delete, and move coordinates randomly

• Apply Metropolis-Hastings to construct a new chain which samples from the posterior on U

• Compute:

$$\hat f_u(x) := E\big(f(x) \mid u, \text{data}\big) = \frac{n_{j(x)}^1 + 1}{n_{j(x)}^1 + n_{j(x)}^0 + 2}$$

where j(x) is the interval of u containing x, and then average over the posterior samples of U:

$$\hat f(x) = E\big(\hat f_U(x) \mid \text{data}\big)$$
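A minimal sketch of this sampler, assuming NumPy and the `log_marginal_posterior` function sketched earlier; the add/delete/move proposals and their Hastings corrections are illustrative choices, not necessarily the thesis implementation:

```python
import numpy as np
# uses log_marginal_posterior(u, x, y) from the earlier sketch

def sample_posterior_u(x, y, n_iter=5000, rng=np.random.default_rng()):
    """Metropolis-Hastings over jump vectors u, with add/delete/move proposals."""
    u = np.array([])                                  # start from the empty jump vector
    lp = log_marginal_posterior(u, x, y)
    samples = []
    for _ in range(n_iter):
        m = len(u)
        move = rng.choice(["add", "delete", "move"])
        if move == "add":
            u_new = np.sort(np.append(u, rng.uniform()))
            log_hastings = -np.log(m + 1)             # reverse move: delete that point
        elif move == "delete" and m > 0:
            u_new = np.delete(u, rng.integers(m))
            log_hastings = np.log(m)                  # reverse move: re-insert a uniform point
        elif move == "move" and m > 0:
            u_new = u.copy()
            u_new[rng.integers(m)] = rng.uniform()
            u_new = np.sort(u_new)
            log_hastings = 0.0                        # symmetric proposal
        else:
            samples.append(u)                         # nothing to delete/move: self-loop
            continue
        lp_new = log_marginal_posterior(u_new, x, y)
        if np.log(rng.uniform()) < lp_new - lp + log_hastings:
            u, lp = u_new, lp_new                     # accept
        samples.append(u)
    return samples
```

Averaging `conditional_mean_curve(u, x, y, grid)` over these samples (after burn-in) gives the pointwise estimate of f on the grid.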

Simulation Experiment (a)

(Figure: true curve vs. posterior mean, n = 1024)

Predictive Probability Surface

Posterior on #-jumps

Stable w.r.t. Prior

Decomposition

Classification and Regression Trees (CART)

• Consider splitting the data into the set with X < x and the set with X > x
• Choose x to maximize the fit
• Recurse on each subset
• “Prune” away splits according to a complexity criterion whose parameter is determined by cross-validation
• Splits that do not “explain” enough variability get pruned off (a minimal pruning example follows below)
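For reference, a minimal sketch of growing and cost-complexity-pruning a CART classifier with scikit-learn; the simulated data here are illustrative, and ccp_alpha plays the role of the complexity parameter cp referred to below:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# illustrative 1-d data: P(Y=1|X=x) jumps from 0.2 to 0.8 at x = 0.5
rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = (rng.uniform(size=200) < np.where(x < 0.5, 0.2, 0.8)).astype(int)
X = x.reshape(-1, 1)

# grow a full tree, then choose the pruning level by cross-validation
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(tree.get_n_leaves(), "leaves after pruning")
```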

Simulation Experiment (b): True, Posterior Mean, CART

Bagging

• To “bag” an estimator, you treat the estimator as a black box

• Repeatedly generate bootstrap resamples of the data set and run the estimator on these new “data sets”

• Average the resulting estimates (see the sketch below)
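A minimal sketch of bagging an arbitrary estimator, assuming NumPy; `fit_predict` is a hypothetical black-box that fits on a resample and predicts on a grid:

```python
import numpy as np

def bag(fit_predict, x, y, grid, n_boot=100, rng=np.random.default_rng()):
    """Average an estimator's predictions over bootstrap resamples of (x, y)."""
    n = len(x)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(n, size=n)                 # bootstrap resample, with replacement
        preds.append(fit_predict(x[idx], y[idx], grid))
    return np.mean(preds, axis=0)                     # the bagged estimate on the grid
```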

Simulation Experiment (c): True, Posterior Mean, CART, Bagged CART (full trees)

Simulation Experiment (d): True, Posterior Mean, CART, Bagged CART (cp = 0.005)

Simulation Experiment (e): True, Posterior Mean, CART, Bagged CART (cp = 0.01)

Simulations 2-10

(Figure: CART vs. Bagged CART, cp = 0.01)

Bagged Bayes??

Smoothers?

Boosting? (Lasso Stumps)

Dyadic Bayes [Diaconis, Freedman]

Monotone Invariance?

Bayesian Consistency

• Consistent at f0 if: the posterior probability of N_ε tends to 1 a.s. for any ε > 0, where

$$N_\varepsilon = \{ f : \| f - f_0 \|_1 < \varepsilon \}$$

• Since all f are bounded in L1, consistency implies a fortiori that:

$$\hat f_n \to f_0 \ \text{in } L^1 \ \text{a.s. as } n \to \infty$$

Sample Size 8192

Related Work: Diaconis and Freedman (1995)

• Similar hierarchical prior, but:
  • Aggressive splitting
  • Fixed split points

• Strong results:
  • If the prior on the hierarchy level dies off at a specific geometric rate: consistency for all f0
  • If it dies off just slower than this: the posterior will be inconsistent at f0 = 1/2

Consistency results cannot be taken for granted

DF: K ~ prior on hierarchy level; given K = k, split [0, 1] into 2^k equal pieces.

(Figure: the k = 3 case)

Consistency Theorem: Thesis

If (Xi, Yi), i = 1..n, are drawn iid via

X ~ U(0, 1)
Y | X = x ~ Bernoulli(f0(x))

and if the specified prior on f is chosen so that the tails of the prior on hierarchy level M decay like exp(−n log(n)), then the posterior is a consistent estimate of f0, for any measurable f0.

Method of Proof

• Barron, Schervish, Wasserman (1999)

• Need to show:
  • Lemma 1: The prior puts positive mass on all Kullback-Leibler information neighborhoods of f0
  • Choose sieves: Fn = {f : f has no more than n/log(n) splits}
  • Lemma 2: The ε-upper metric entropy of Fn is o(n)
  • Lemma 3: The prior mass of Fn's complement decays exponentially

New Result

• Coram and Lalley 2004/5 (hopefully)
  • Consistency holds for any prior with infinite support, if the true function is not identically ½
  • Consistency for the ½ case depends on the tail decay*

• Proof revolves around a large-deviation question:
  • How does the predictive probability behave as n → ∞ for a model with m = a·n splits? (0 < a < ∞)

• Proof uses the subadditive ergodic theorem to take advantage of self-similarity in the problem

A Guessing Game

• With probability 1/2: flip a fair coin repeatedly
• With probability 1/2: pick p in [0, 1] at random, then flip that p-coin repeatedly

(Figure: behavior of the game at sample sizes 64, 128, 256, 512, 1024, 2048, 4096, 8192)
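A minimal sketch of the calculation behind this game, assuming NumPy/SciPy: for flips that actually come from a fair coin, compare the marginal likelihood of the "fair coin" model against the "p ~ Uniform(0,1)" model (equal prior weights assumed here for illustration):

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
for n in [64, 128, 256, 512, 1024, 2048, 4096, 8192]:
    k = rng.binomial(n, 0.5)                   # heads from a genuinely fair coin
    log_m_fair = -n * np.log(2.0)              # P(flips | fair coin) = 2^{-n}
    log_m_unif = betaln(k + 1, n - k + 1)      # P(flips | p ~ U(0,1)) = B(k+1, n-k+1)
    post_fair = 1.0 / (1.0 + np.exp(log_m_unif - log_m_fair))
    print(n, round(float(post_fair), 3))       # posterior probability of "fair coin"
```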

A Voronoi Prior for [0, 1]^d

(Figure: five centers, labeled 1-5, and their Voronoi cells V1-V5)

A Modified Voronoi Prior for General Spaces

• Choose M, as before

• Draw V=(V1, V2, … Vk)

• With each Vj drawn without replacement from an a priori fixed set A

• In practice, I take A={X1, …, Xn}

• This approximates drawing the V’s from the marginal dist of X
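A minimal sketch of drawing from this modified Voronoi prior and reading off the induced partition, assuming NumPy; the helper name and the exact prior on the number of centers are illustrative:

```python
import numpy as np

def sample_voronoi_partition(X, rng=np.random.default_rng()):
    """Draw centers from A = {X_1, ..., X_n} and assign each point to its Voronoi cell."""
    n = len(X)
    k = min(rng.geometric(0.5), n)                        # number of centers (prior on M, as before)
    centers = X[rng.choice(n, size=k, replace=False)]     # V_j drawn without replacement from A
    # each point belongs to the cell of its nearest center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    return centers, d.argmin(axis=1)

# Example on 2-d simulated covariates
X = np.random.default_rng(1).uniform(size=(200, 2))
centers, cells = sample_voronoi_partition(X)
print(len(centers), "cells with sizes", np.bincount(cells))
```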

Discussion

• CON:
  • Not quite Bayesian
    • A depends on the data

• PRO:
  • Only partitions the relevant subspace
  • Applies in general metric spaces
  • Only depends on D, the pairwise distance matrix
  • Intuitive content

Intuition

(Figure: samples from the prior with k = 2, 3, 4, 5, 6 parts)

2-dimensional Simulated Data

Posterior Samples

Posterior Mean

Bagged CART

Weighted Voronoi

Acknowledgements

• Steve Lalley
• Persi Diaconis

National Science Foundation Lieberman Fellowship

• AIDS Data: Andrew Zolopa, Howard Rice

Future Directions

• Theoretical
  • Extend theoretical results to a more general setting
  • Tighten results to determine where inconsistency first appears
  • Determine the rate of convergence

• Practical
  • Refine MCMC mixing using better simulated tempering
  • Improve computational speed
  • Explore weighted Voronoi and “smoothed” Voronoi priors
  • Compare with SVMs and boosting
  • Use the posterior to produce confidence statements

Smoothing

Highlights

• Straightforward Bayesian motivation
• Implementation actually works
• Prior can be adjusted to utilize domain knowledge
• Provides a framework for inference
• Compares favorably with CART/Bagged CART
• Theoretically tractable
• Targets high-dimensional problems

Background

• Enormous literature
• Theoretical results starting from the consistency of nearest neighbors
• Methodologies
  • CART
  • Logistic regression
  • Wavelets
  • SVMs
  • Neural nets
• Bayesian literature
  • Bayesian CART
  • Image segmentation
• Bayesian theory
  • Diaconis and Freedman
  • Barron, Schervish, Wasserman

Posterior Calculation (2-dimensional example)

Spatial Adaptation

Stephane Nullins, PRISME

Nonparametric Prior: 1-dimension

1. Pick K = k from Geometric(1/2)

2. Partition [0, 1] into k intervals

3. Assign each interval j a value Sj, iid U(0, 1)

(Figure: [0, 1] partitioned into intervals 1, …, k with values v1, v2, v3, …)

$$f(x) = \sum_{j=1}^{k} S_j \, \mathbf{1}_{I_j}(x)$$

Consistency Results(1-dimensional)

Setup:
• X's iid U(0, 1)
• Y | X = x ~ Bernoulli(f0(x))
• A prior is placed on k

Result:
• If the tails of the prior on k decay geometrically, then for any measurable f0, the posterior is consistent at f0.

Key tools:
• Kullback-Leibler inequalities, Weierstrass approximation (prior is “dense”)
• Sieves (prior is “almost” finite dimensional)
• Upper brackets (prior is “almost” finite)
• Large deviations (each likelihood ratio test is asymptotically powerful)