
Page 1:

Nonparametric Bayesian Classification

Marc A. Coram, University of Chicago
http://galton.uchicago.edu/~coram

Persi Diaconis, Steve Lalley

Page 2:

Related Approaches

• Chipman, George, McCulloch: Bayesian CART (1998a, b)
  • Nested, CART-like
  • Coordinate-aligned splits
  • Good "search" ability
• Denison, Mallick, Smith: Bayesian CART; Bayesian splines and "MARS"

Page 3:

Outline

• Medical example

• Theoretical framework

• Bayesian proposal

• Implementation

• Simulation experiments

• Theoretical results

• Extensions to a general setting

Page 4:

Example: AIDS Data (1-dimensional)

• AIDS patients

• Covariate of interest: viral resistance level in blood sample

• Goal: estimate conditional probability of response

Page 5:

Idealized Setting

(X, Y) iid pairs

X (covariate): X ∈ [0,1]

Y (response): Y ∈ {0,1}

f0 (true parameter): f0(x) = P(Y = 1 | X = x)

What, then, is a straightforward way to proceed, thinking like a Bayesian?
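This setting is easy to simulate; a minimal sketch in Python (the particular f0 below is an arbitrary stand-in, not the talk's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
x = rng.uniform(size=n)                         # X ~ U(0,1)
f0 = lambda t: np.where(t < 0.5, 0.2, 0.8)      # arbitrary stand-in for the true curve
y = (rng.uniform(size=n) < f0(x)).astype(int)   # Y | X = x ~ Bernoulli(f0(x))
```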

Page 6:

Prior on f: 1-dimension

• Pick a non-negative integer M at random; say, choose M = 0 with prob 1/2, M = 1 with prob 1/4, M = 2 with prob 1/8, …
• Conditional on M = m, randomly choose a step function from [0,1] into [0,1] with m jumps
• (i.e., locate the m jumps and the (m+1) values independently and uniformly)
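A hedged sketch of drawing one curve from this prior (function names are illustrative, not from the talk):

```python
import numpy as np

def sample_prior_step_function(rng):
    """Draw a step function f: [0,1] -> [0,1] from the prior:
    P(M = m) = 2^-(m+1); given M = m, the m jump locations and
    the m+1 values are independent U(0,1)."""
    m = rng.geometric(0.5) - 1            # numpy's geometric starts at 1
    u = np.sort(rng.uniform(size=m))      # jump locations
    v = rng.uniform(size=m + 1)           # one value per interval
    return lambda x: v[np.searchsorted(u, x)]   # value of the interval containing x

rng = np.random.default_rng(0)
f = sample_prior_step_function(rng)
print(f(np.linspace(0, 1, 5)))
```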

Page 7:

Perspective

• Simple prior on stepwise functions
• Functions are parameterized by:
  • m ∈ {0, 1, 2, …}: the number of jumps (hence m + 1 regions)
  • u ∈ [0,1]^m: the jump locations
  • v ∈ [0,1]^{m+1}: the function values
• Goal: get samples from the posterior; average to estimate the posterior mean curve
• Idea: use MCMC, but prefer analytical calculations whenever possible

Page 8:

Observations

• The joint distribution of U, V, and the data has density proportional to:

2^{-(m+1)} ∏_{j=1}^{m+1} v_j^{n_{j1}} (1 - v_j)^{n_{j0}}

where:

n_{j1}(u, x, y) = # heads in the j'th interval
n_{j0}(u, x, y) = # tails in the j'th interval

• Conditional on u, the counts are sufficient for v.

Page 9:

Observations II

The marginal of the posterior on U has density proportional to:

2^{-(|u|+1)} ∏_{j=1}^{|u|+1} β(n_{j1}, n_{j0})

Conditional on U = u and the data, the V_j's are independent Beta random variables:

V_j | (u, data) ~ Beta(n_{j1} + 1, n_{j0} + 1)

and

E(V_j | u, data) = (n_{j1} + 1) / (n_{j1} + n_{j0} + 2)

where:

β(a, b) = ∫_0^1 u^a (1 - u)^b du = a! b! / (a + b + 1)!
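Because v integrates out in closed form, the marginal posterior on u can be evaluated directly. A sketch, assuming the 2^{-(m+1)} prior above (`log_posterior_u` is an illustrative name; note the slide's β(a, b) is the standard Beta function B(a + 1, b + 1)):

```python
import numpy as np
from scipy.special import betaln   # betaln(a, b) = log of the standard B(a, b)

def log_posterior_u(u, x, y):
    """Log unnormalized marginal posterior density of the jump locations u."""
    edges = np.concatenate(([0.0], np.sort(u), [1.0]))
    logp = -(len(u) + 1) * np.log(2.0)       # prior: P(M = m) = 2^-(m+1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (x >= lo) & (x < hi)
        n1 = int(np.sum(y[in_bin]))          # heads in this interval
        n0 = int(np.sum(in_bin)) - n1        # tails in this interval
        logp += betaln(n1 + 1, n0 + 1)       # log beta(n_j1, n_j0) in slide notation
    return logp
```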

Page 10:

Consequently…

• In principle:
  • We put a prior on piecewise constant curves
  • The curves are specified by u, a vector in [0,1]^m, and v, a vector in [0,1]^{m+1}, for some m
  • We sample curves from the posterior using MCMC
  • We take the posterior mean (pointwise) of the sampled curves
• In practice:
  • We need only sample from the posterior on u
  • We can then compute the conditional mean of all the curves with this u

Page 11:

Implementation

• Build a reversible base chain to sample U from the prior
  • E.g., start with an empty vector and add, delete, and move coordinates randomly
• Apply Metropolis-Hastings to construct a new chain which samples from the posterior on U
• Compute, for each sampled u:

\hat f_u(x) = E(f(x) | data, u) = (n_{j(x)1} + 1) / (n_{j(x)1} + n_{j(x)0} + 2),

where j(x) indexes the interval of u containing x

• Average:

\hat f(x) = E(f(x) | data) ≈ the average of \hat f_u(x) over the sampled u's
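A minimal Metropolis-Hastings sketch over u, reusing `log_posterior_u` from the earlier sketch (for brevity the three moves are treated as a symmetric proposal; a careful implementation would include the exact Hastings ratio for add/delete):

```python
import numpy as np

def mh_sample_u(x, y, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    u = np.array([])                          # start from the empty vector (m = 0)
    logp = log_posterior_u(u, x, y)
    samples = []
    for _ in range(n_iter):
        move = rng.integers(3)
        if move == 0:                         # add a jump point
            prop = np.append(u, rng.uniform())
        elif move == 1 and len(u) > 0:        # delete a jump point
            prop = np.delete(u, rng.integers(len(u)))
        elif len(u) > 0:                      # move a jump point
            prop = u.copy()
            prop[rng.integers(len(u))] = rng.uniform()
        else:
            prop = u
        logp_prop = log_posterior_u(prop, x, y)
        if np.log(rng.uniform()) < logp_prop - logp:   # Metropolis accept/reject
            u, logp = prop, logp_prop
        samples.append(u.copy())
    return samples
```

Each sampled u then yields \hat f_u via the closed-form conditional means above, and averaging those curves gives \hat f.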

Page 12:

Simulation Experiment (a)

[Figure: true curve and posterior mean, n = 1024]

Pages 13-15:

[Figures: true curve and posterior mean, n = 1024]

Page 16:

Predictive Probability Surface

Page 17:

Posterior on #-jumps

Page 18:

Stable w.r.t. Prior

Page 19:

Decomposition

Page 20:

Classification and Regression Trees (CART)

• Consider splitting the data into the set with X < x and the set with X > x
• Choose x to maximize the fit
• Recurse on each subset
• "Prune" away splits according to a complexity criterion whose parameter is determined by cross-validation
• Splits that do not "explain" enough variability get pruned off
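For comparison, a hedged scikit-learn sketch of a pruned CART fit to 1-d binary data (sklearn's ccp_alpha is an analogue of rpart's cp, not the identical parameter; in practice it would be chosen by cross-validation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.uniform(size=(1024, 1))
f0 = lambda t: 0.2 + 0.6 * (t > 0.5)             # stand-in true curve with one jump
y = rng.uniform(size=1024) < f0(x[:, 0])

tree = DecisionTreeClassifier(ccp_alpha=0.005).fit(x, y)
p_hat = tree.predict_proba(x)[:, 1]              # estimated P(Y = 1 | X = x)
```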

Page 21:

Simulation Experiment (b)

[Figure: true curve, posterior mean, and CART]

Page 22:

Bagging

• To "bag" an estimator, you treat the estimator as a black box
• Repeatedly generate bootstrap resamples from the data set and run the estimator on these new "data sets"
• Average the resulting estimates
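A minimal bagging sketch around any black-box estimator with fit/predict_proba (here the CART from the previous sketch; names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag(make_estimator, x, y, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    n, preds = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(n, size=n)                 # bootstrap resample, with replacement
        est = make_estimator().fit(x[idx], y[idx])    # run the black box on the resample
        preds.append(est.predict_proba(x)[:, 1])
    return np.mean(preds, axis=0)                     # average the resulting estimates

# e.g., bagged CART with pruning:
# p_bagged = bag(lambda: DecisionTreeClassifier(ccp_alpha=0.005), x, y)
```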

Page 23:

Simulation Experiment (c)

[Figure: true curve, posterior mean, CART, and bagged CART (full trees)]

Page 24:

Simulation Experiment (d)

[Figure: true curve, posterior mean, CART, and bagged CART (cp = 0.005)]

Page 25:

Simulation Experiment (e)

[Figure: true curve, posterior mean, CART, and bagged CART (cp = 0.01)]

Page 26:

[Figure: Simulations 2-10]

Page 27:

[Figure: CART vs. Bagged CART, cp = 0.01]

Page 28:

Bagged Bayes??

Page 29:

Smoothers?

Page 30:

Boosting? (Lasso Stumps)

Page 31:

Dyadic Bayes [Diaconis, Freedman]

Page 32:

Monotone Invariance?

Page 33:

Bayesian Consistency

• Consistent at f0 if: the posterior probability of N_ε = {f : ||f − f0||_1 < ε} tends to 1 a.s., for every ε > 0
• Since all such f are bounded, consistency implies a fortiori that:

\hat f_n → f0 in L^1, a.s., as n → ∞

Page 34:

Sample Size 8192

Page 35:

Related Work: Diaconis and Freedman (1995)

• Similar hierarchical prior, but:
  • Aggressive splitting
  • Fixed split points
• Strong results:
  • If the prior on k dies off at a specific geometric rate: consistency for all f0
  • If it dies off just slower than this: the posterior is inconsistent at f0 ≡ 1/2
• Consistency results cannot be taken for granted

DF: K ~ π; given K = k, split [0,1] into 2^k equal pieces. [Figure: the k = 3 case]

Page 36:

Consistency Theorem: Thesis

If (Xi, Yi), i = 1..n, are drawn iid via

X ~ U(0,1)
Y | X = x ~ Bernoulli(f0(x)),

and if π, the specified prior on f, is chosen so that the tails of the prior on the hierarchy level M decay like exp(−m log m), then Π_n, the posterior, is a consistent estimate of f0, for any measurable f0.

Page 37:

Method of Proof

• Barron, Schervish, Wasserman (1999)
• Need to show:
  • Lemma 1: the prior puts positive mass on all Kullback-Leibler information neighborhoods of f0
  • Choose sieves: F_n = {f : f has no more than n/log(n) splits}
  • Lemma 2: the ε-upper metric entropy of F_n is o(n)
  • Lemma 3: Π(F_n^c) decays exponentially

Page 38:

New Result

• Coram and Lalley 2004/5 (hopefully)
• Consistency holds for any prior with infinite support, if the true function is not identically 1/2
• Consistency for the 1/2 case depends on the tail decay*
• The proof revolves around a large-deviations question: how does the predictive probability behave as n → ∞ for a model with m = αn splits (0 < α < ∞)?
• The proof uses the subadditive ergodic theorem to take advantage of self-similarity in the problem

Page 39:

A Guessing Game

• With probability 1/2: flip a fair coin repeatedly
• With probability 1/2: pick p in [0,1] at random, then flip that p-coin repeatedly
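The guess reduces to a Bayes factor. Under the uniform-p branch, the marginal probability of h heads and t tails is ∫_0^1 p^h (1-p)^t dp = h! t! / (h + t + 1)!. A sketch of the posterior probability of "fair coin" under the equal prior weights:

```python
from math import lgamma, log, exp

def prob_fair(h, t):
    n = h + t
    log_fair = n * log(0.5)                                   # (1/2)^n
    log_unif = lgamma(h + 1) + lgamma(t + 1) - lgamma(n + 2)  # h! t! / (n + 1)!
    return 1.0 / (1.0 + exp(log_unif - log_fair))             # equal 1/2-1/2 prior

print(prob_fair(40, 24))   # e.g., after 64 flips
```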

Pages 40-47:

[Figures: the game after n = 64, 128, 256, 512, 1024, 2048, 4096, 8192 flips]

Page 48:

A Voronoi Prior for [0,1]^d

[Figure: five centers (1-5) and their Voronoi cells V1-V5]

Page 49:

A Modified Voronoi Prior for General Spaces

• Choose M, as before
• Draw V = (V1, V2, …, VM)
  • with each Vj drawn without replacement from an a priori fixed set A
• In practice, I take A = {X1, …, Xn}
  • This approximates drawing the V's from the marginal distribution of X
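A hedged sketch of drawing one function from this modified Voronoi prior (Euclidean distance and a geometric prior on the number of cells are illustrative assumptions; only pairwise distances are really needed):

```python
import numpy as np

def sample_voronoi_function(X, rng):
    """Centers drawn without replacement from A = {X_1, ..., X_n};
    one iid U(0,1) value per Voronoi cell."""
    n = len(X)
    m = min(rng.geometric(0.5), n)                   # number of cells (illustrative prior)
    centers = rng.choice(n, size=m, replace=False)   # draw centers from A
    values = rng.uniform(size=m)                     # value on each cell
    def f(x):
        d = np.linalg.norm(X[centers] - x, axis=1)   # distances to the centers
        return values[np.argmin(d)]                  # nearest center's value
    return f

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
f = sample_voronoi_function(X, rng)
print(f(np.array([0.5, 0.5])))
```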

Page 50:

Discussion

• CON:
  • Not quite Bayesian: A depends on the data
• PRO:
  • Only partitions the relevant subspace
  • Applies in general metric spaces
  • Only depends on D, the pairwise distance matrix
  • Intuitive content

Page 51:

Intuition

[Figure: samples from the prior with k parts, for k = 2, 3, 4, 5, 6]

Page 52:

2-dimensional Simulated Data

Page 53:

Posterior Samples

Page 54:

Posterior Mean

Page 55:

Bagged CART

Page 56:

Weighted Voronoi

Page 57:

Acknowledgements

• Steve Lalley
• Persi Diaconis
• National Science Foundation; Lieberman Fellowship
• AIDS data: Andrew Zolopa, Howard Rice

Page 58:

Page 59:

Future Directions

• Theoretical:
  • Extend theoretical results to a more general setting
  • Tighten results to determine where inconsistency first appears
  • Determine the rate of convergence
• Practical:
  • Refine MCMC mixing using better simulated tempering
  • Improve computational speed
  • Explore weighted Voronoi and "smoothed" Voronoi priors
  • Compare with SVMs and boosting
  • Use the posterior to produce confidence statements

Page 60:

Smoothing

Page 61:

Highlights

• Straightforward Bayesian motivation
• Implementation actually works
• Prior can be adjusted to utilize domain knowledge
• Provides a framework for inference
• Compares favorably with CART/bagged CART
• Theoretically tractable
• Targets high-dimensional problems

Page 62:

Background

• Enormous literature
  • Theoretical results starting from the consistency of nearest neighbors
• Methodologies:
  • CART
  • Logistic regression
  • Wavelets
  • SVMs
  • Neural nets
• Bayesian literature:
  • Bayesian CART
  • Image segmentation
• Bayesian theory:
  • Diaconis and Freedman
  • Barron, Schervish, Wasserman

Page 63:

Posterior Calculation (2-dimensional example)

Page 64:

Spatial Adaptation

Stephane Nullins, PRISME

Page 65:

Nonparametric Prior: 1-dimension

1. Pick K = k from Geometric(1/2)
2. Partition [0,1] into k intervals I_1, …, I_k
3. Assign each interval j a value S_j, iid U(0,1)

f(x) = Σ_{j=1}^k S_j 1{x ∈ I_j}

[Figure: [0,1] partitioned into labeled intervals with values v1, v2, v3]

Page 66:

Consistency Results (1-dimensional)

Setup:
• X's iid U(0,1)
• Y | X = x ~ Bernoulli(f0(x))
• π is the prior on k

Result:
• If the tails of π decay geometrically, then for any measurable f0, Π_n is consistent at f0.

Key tools:
• Kullback-Leibler inequalities, Weierstrass approximation (prior is "dense")
• Sieves (prior is "almost" finite dimensional)
• Upper brackets (prior is "almost" finite)
• Large deviations (each likelihood ratio test is asymptotically powerful)