Transcript: How to do Bayes-Optimal Classification with Massive Datasets: Large-scale Quasar Discovery (42 slides)

Page 1:

How to do Bayes-Optimal Classification with Massive Datasets: Large-scale Quasar Discovery

Alexander Gray
Georgia Institute of Technology, College of Computing

Joint work with Gordon Richards (Princeton), Robert Nichol (Portsmouth ICG), Robert Brunner (UIUC/NCSA), Andrew Moore (CMU)

Page 2:

What I do

Often the most general and powerful statistical (or “machine learning”) methods are computationally infeasible.

I design machine learning methods and fast algorithms to make such statistical methods possible on massive datasets (without sacrificing accuracy).

Page 3:

Quasar detection

• Science motivation: use quasars to trace the distant/old mass in the universe

• Thus we want lots of sky: SDSS DR1, 2099 square degrees, to g = 21

• Biggest quasar catalog to date: tens of thousands

• Should be ~1.6M z<3 quasars to g=21

Page 4:

Classification

• Traditional approach: look at a 2-d color-color plot (UVX method)
  – doesn't use all available information
  – not particularly accurate (~60% for relatively bright magnitudes)

• Statistical approach: pose as classification.
  1. Training: train a classifier on a large set of known stars and quasars (the 'training set')
  2. Prediction: the classifier will label an unknown set of objects (the 'test set')

Page 5:

Which classifier?

1. Statistical question: Must handle arbitrary nonlinear decision boundaries, noise/overlap

2. Computational question: We have 16,713 quasars from [Schneider et al. 2003] (0.08 < z < 5.4) and 478,144 stars (semi-cleaned sky sample) – way too big for many classifiers

3. Scientific question: We must be able to understand what it’s doing and why, and inject scientific knowledge

Page 6:

Which classifier?

• Popular answers:
  – logistic regression: fast but linear only
  – naïve Bayes classifier: fast but quadratic only
  – decision tree: fast but not the most accurate
  – support vector machine: accurate but O(N³)
  – boosting: accurate but requires thousands of classifiers
  – neural net: reasonable compromise but awkward/human-intensive to train

• The good nonparametric methods are also black boxes – hard/impossible to interpret

Page 7:

Main points of this talk

1. nonparametric Bayes classifier

2. can be made fast (algorithm design)

3. accurate and tractable science

Page 8:

Main points of this talk

1. nonparametric Bayes classifier

2. can be made fast (algorithm design)

3. accurate and tractable science

Page 9:

Optimal decision theory

[Figure: star density and quasar density plotted as density f(x) versus x, with the optimal decision boundary marked.]

Page 10:

Bayes' rule for classification

$$P(C_1 \mid x_q) \;=\; \frac{P(C_1)\,\hat{f}(x_q \mid C_1)}{P(C_1)\,\hat{f}(x_q \mid C_1) + P(C_2)\,\hat{f}(x_q \mid C_2)}$$
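For instance, with purely illustrative (made-up) numbers: if $P(C_1) = 0.03$ for quasars, $P(C_2) = 0.97$ for stars, and the estimated densities at $x_q$ are $\hat{f}(x_q \mid C_1) = 2.0$ and $\hat{f}(x_q \mid C_2) = 0.05$, then $P(C_1 \mid x_q) = \frac{0.03 \cdot 2.0}{0.03 \cdot 2.0 + 0.97 \cdot 0.05} \approx 0.55$, so the object would be labeled a quasar.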

Page 11:

$$P(C_1 \mid x_q) \;=\; \frac{P(C_1)\,\hat{f}(x_q \mid C_1)}{P(C_1)\,\hat{f}(x_q \mid C_1) + P(C_2)\,\hat{f}(x_q \mid C_2)}$$

(The two factors highlighted on this slide are the class-conditional density estimates $\hat{f}(x_q \mid C_1)$ and $\hat{f}(x_q \mid C_2)$.)

Page 12:

$$P(C_1 \mid x_q) \;=\; \frac{P(C_1)\,\hat{f}(x_q \mid C_1)}{P(C_1)\,\hat{f}(x_q \mid C_1) + P(C_2)\,\hat{f}(x_q \mid C_2)}$$

So how do you estimate an arbitrary density?

Page 13:

Kernel Density Estimation (KDE)

$$\hat{f}(x_q) \;=\; \frac{1}{N} \sum_{r=1}^{N} K_h(x_q - x_r)$$

for example (Gaussian kernel):

$$\hat{f}(x_q) \;=\; \frac{1}{N} \sum_{r=1}^{N} \frac{1}{C_h} \exp\!\left\{ -\frac{\lVert x_q - x_r \rVert^2}{2h^2} \right\}$$

where $C_h$ is the kernel's normalizing constant.
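A minimal NumPy sketch of this estimator (an illustration of the formula above, not the talk's code); it evaluates the brute-force O(NM) sum that later slides are about speeding up:

```python
import numpy as np

def kde(xq, x, h):
    """Gaussian-kernel density estimate at query points xq, shape (M, D),
    from training points x, shape (N, D), with bandwidth h.
    Direct O(NM) evaluation of the formula above, for illustration only."""
    d2 = ((xq[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)   # (M, N) squared distances
    norm = (2.0 * np.pi * h**2) ** (x.shape[1] / 2.0)           # Gaussian normalizing constant C_h
    return np.exp(-d2 / (2.0 * h**2)).sum(axis=1) / (len(x) * norm)
```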

Page 14:

Kernel Density Estimation (KDE)

$$\hat{f}(x_q) \;=\; \frac{1}{N} \sum_{r=1}^{N} K_h(x_q - x_r)$$

• There is a principled way to choose the optimal smoothing parameter h

• Guaranteed to converge to the true underlying density (consistency)

• Nonparametric – distribution need not be known

Page 15:

Nonparametric Bayes Classifier (NBC) [1951]

$$P(C_1 \mid x_q) \;=\; \frac{P(C_1)\,\hat{f}(x_q \mid C_1)}{P(C_1)\,\hat{f}(x_q \mid C_1) + P(C_2)\,\hat{f}(x_q \mid C_2)}$$

• Nonparametric – distribution can be arbitrary
• This is Bayes-optimal, given the right densities
• Very clear interpretation
• Parameter choices are easy to understand, automatable
• There's a way to enter prior information

Main obstacle: O(NM)
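As an illustration of the classifier in code, here is a minimal sketch with hypothetical names; it uses SciPy's KDE, which picks its own bandwidth, rather than the cross-validated bandwidth the talk uses:

```python
import numpy as np
from scipy.stats import gaussian_kde

def quasar_posterior(xq, star_train, quasar_train, p_star, p_quasar):
    """P(quasar | x_q) via Bayes' rule with one KDE per class.
    xq: (M, D) query colors; *_train: (N, D) training colors (made-up names).
    scipy chooses its own bandwidth here; the talk instead tunes h by
    cross-validated classification error."""
    f_star = gaussian_kde(star_train.T)(xq.T)      # f-hat(x | star)
    f_quasar = gaussian_kde(quasar_train.T)(xq.T)  # f-hat(x | quasar)
    num = p_quasar * f_quasar
    return num / (num + p_star * f_star)
```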

Page 16:

Main points of this talk

1. nonparametric Bayes classifier

2. can be made fast (algorithm design)

3. accurate and tractable science

Page 17:

kd-trees: the most widely-used space-partitioning tree
[Bentley 1975], [Friedman, Bentley & Finkel 1977]

• Univariate axis-aligned splits
• Split on widest dimension
• O(N log N) to build, O(N) space
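A minimal sketch of such a build, assuming a simple dict-based node and a leaf-size cutoff (an illustration, not the implementation used in the talk; the O(N log N) build on the slide uses a linear-time median select, while this sketch just sorts for brevity):

```python
import numpy as np

def build_kdtree(points, leaf_size=16):
    """Build a kd-tree node over `points`, an (N, D) array: split on the
    widest dimension at the median, recurse, and stop at small leaves."""
    lo, hi = points.min(axis=0), points.max(axis=0)    # node bounding box
    if len(points) <= leaf_size:
        return {"lo": lo, "hi": hi, "points": points, "left": None, "right": None}
    dim = int(np.argmax(hi - lo))                      # widest dimension
    order = np.argsort(points[:, dim])                 # median split
    mid = len(points) // 2
    return {"lo": lo, "hi": hi, "points": None,
            "left": build_kdtree(points[order[:mid]], leaf_size),
            "right": build_kdtree(points[order[mid:]], leaf_size)}
```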

Page 18:

A kd-tree: level 1

Page 19:

A kd-tree: level 2

Page 20:

A kd-tree: level 3

Page 21:

A kd-tree: level 4

Page 22:

A kd-tree: level 5

Page 23:

A kd-tree: level 6

Page 24:

For higher dimensions: ball-trees (computational geometry)

Page 25:

We have a fast algorithm for Kernel Density Estimation (KDE)

$$\hat{f}(x_q) \;=\; \frac{1}{N} \sum_{r=1}^{N} K_h(x_q - x_r)$$

• Generalization of N-body algorithms (multipole expansions optional)

• Dual kd-tree traversal: O(N)

• Works in arbitrary dimension

• The fastest method to date [Gray & Moore 2003]
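A toy sketch of the dual-tree idea (my own simplified illustration, not the authors' algorithm): recurse on pairs of node bounding boxes and prune whole pairs whose boxes are farther apart than the kernel support. To keep the pruning exact, the sketch uses a finite-support (flat) kernel rather than the Gaussian kernel above, and median splits in place of a prebuilt tree; all names are hypothetical.

```python
import numpy as np

def dual_tree_counts(query, ref, h, leaf=32):
    """For each query point, count the reference points within distance h
    (a flat kernel), recursing on median splits of both point sets and
    pruning any pair of bounding boxes separated by more than h."""
    out = np.zeros(len(query))

    def recurse(qi, ri):
        q, r = query[qi], ref[ri]
        # Smallest possible distance between the two bounding boxes.
        gap = np.maximum(0.0, np.maximum(q.min(0) - r.max(0), r.min(0) - q.max(0)))
        if np.linalg.norm(gap) > h:             # exclusion: prune the whole pair
            return
        if len(qi) <= leaf or len(ri) <= leaf:  # base case: brute force
            d = np.linalg.norm(q[:, None, :] - r[None, :, :], axis=-1)
            out[qi] += (d <= h).sum(axis=1)
            return
        def split(pts, idx):                    # median split on widest dimension
            dim = int(np.argmax(pts.max(0) - pts.min(0)))
            order = idx[np.argsort(pts[:, dim])]
            return order[:len(order) // 2], order[len(order) // 2:]
        qa, qb = split(q, qi)
        ra, rb = split(r, ri)
        for a in (qa, qb):
            for b in (ra, rb):
                recurse(a, b)

    recurse(np.arange(len(query)), np.arange(len(ref)))
    return out
```

Dividing each count by N and by the kernel's volume would turn it into a density estimate.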

Page 26:

We could just use the KDE algorithm for each class. But:

• for the Gaussian kernel this is approximate
• choosing the smoothing parameter to minimize (cross-validated) classification error is more accurate

But we need a fast algorithm for the Nonparametric Bayes Classifier (NBC)

$$P(C_1 \mid x_q) \;=\; \frac{P(C_1)\,\hat{f}(x_q \mid C_1)}{P(C_1)\,\hat{f}(x_q \mid C_1) + P(C_2)\,\hat{f}(x_q \mid C_2)}$$

Page 27:

Leave-one-out cross-validation

$$\mathrm{CVscore}(h) \;=\; \sum_{r=1}^{N} I\!\left( \hat{c}_h(x_r) = c(x_r) \right)$$

Observations:
1. Doing bandwidth selection requires only prediction.
2. To predict the class label, we don't need to compute the full densities. Just which one is higher.

We can make a fast exact algorithm for prediction
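A sketch of this observation in code (hypothetical names; `predict` stands in for whichever NBC prediction routine is available, naive or the fast exact one on the next slides):

```python
import numpy as np

def cv_score(points, labels, priors, h, predict):
    """Leave-one-out CV score for bandwidth h: the fraction of training points
    whose held-out prediction matches the true label.  The only thing this
    needs from the classifier is a prediction."""
    correct = 0
    idx = np.arange(len(points))
    for r in idx:
        keep = idx != r                                  # leave point r out
        pred = predict(points[r], points[keep], labels[keep], priors, h)
        correct += int(pred == labels[r])
    return correct / len(points)

# Bandwidth selection is then just a search over candidate values:
# best_h = max(candidate_hs, key=lambda h: cv_score(points, labels, priors, h, predict))
```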

Page 28:

Fast NBC prediction algorithm

1. Build a tree for each class
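Step 1 can be sketched in a couple of lines; here with SciPy's kd-tree and made-up array names and dimensions, rather than the talk's own tree code:

```python
import numpy as np
from scipy.spatial import cKDTree

# Random stand-ins for the training colors (478,144 stars, 16,713 quasars;
# the 4-dimensional color space is an assumption for illustration).
star_colors = np.random.randn(478_144, 4)
quasar_colors = np.random.randn(16_713, 4)

star_tree = cKDTree(star_colors)      # one tree per class, as on this slide
quasar_tree = cKDTree(quasar_colors)
```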

Page 29:

Fast NBC prediction algorithm

2. Obtain bounds on P(C)f(xq|C) for each class

[Figure: upper and lower bounds on P(C1)f(xq|C1) and P(C2)f(xq|C2) around the query point xq.]

Page 30:

Fast NBC prediction algorithm

3. Choose the next node-pair with priority = bound difference

[Figure: bounds on P(C1)f(xq|C1) and P(C2)f(xq|C2) at the query point xq.]

Page 31:

Fast NBC prediction algorithm

3. Choose the next node-pair with priority = bound difference

[Figure: bounds on P(C1)f(xq|C1) and P(C2)f(xq|C2) for the query point.]

50-100x speedup, exact
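A self-contained toy sketch of the bound-and-refine idea (my own simplified single-query version, not the authors' dual-tree algorithm): keep lower and upper bounds on each class's prior-weighted kernel sum, and split whichever block contributes the loosest bound until the leading class's lower bound clears the runner-up's upper bound. It recomputes bounds on every pass for clarity, where the real algorithm maintains a priority queue over node-pairs; all names are hypothetical, and it assumes two or more classes with a shared bandwidth.

```python
import numpy as np

def kernel(d, h):
    """Unnormalized Gaussian kernel profile; the normalizing constant cancels
    when all classes use the same bandwidth."""
    return np.exp(-0.5 * (d / h) ** 2)

def box_dist_range(x, lo, hi):
    """Min and max Euclidean distance from point x to the axis-aligned box [lo, hi]."""
    nearest = np.clip(x, lo, hi)
    farthest = np.where(np.abs(x - lo) > np.abs(x - hi), lo, hi)
    return np.linalg.norm(x - nearest), np.linalg.norm(x - farthest)

def classify_with_bounds(x, classes, priors, h):
    """Return the index of the class with the provably largest prior-weighted
    kernel sum at x.  Each class starts as one block (bounding box plus point
    count); blocks are split only as far as needed for the bounds to separate."""
    blocks = [[pts] for pts in classes]          # one block per class to start
    while True:
        lo_tot, hi_tot, gaps = [], [], []
        for c, blist in enumerate(blocks):
            lo = hi = 0.0
            blk_gaps = []
            for pts in blist:
                dmin, dmax = box_dist_range(x, pts.min(0), pts.max(0))
                l, u = len(pts) * kernel(dmax, h), len(pts) * kernel(dmin, h)
                lo, hi = lo + l, hi + u
                blk_gaps.append(u - l)
            lo_tot.append(priors[c] * lo)
            hi_tot.append(priors[c] * hi)
            gaps.append(blk_gaps)
        order = np.argsort(hi_tot)
        best, runner_up = order[-1], order[-2]
        if lo_tot[best] >= hi_tot[runner_up]:    # bounds separate: decision is exact
            return int(best)
        # Otherwise split the loosest block (largest bound gap) of the two
        # contending classes on its widest dimension.
        c = best if max(gaps[best]) >= max(gaps[runner_up]) else runner_up
        i = int(np.argmax(gaps[c]))
        pts = blocks[c].pop(i)
        dim = int(np.argmax(pts.max(0) - pts.min(0)))
        pts = pts[np.argsort(pts[:, dim])]
        blocks[c] += [pts[:len(pts) // 2], pts[len(pts) // 2:]]
```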

Page 32:

Main points of this talk

1. nonparametric Bayes classifier

2. can be made fast (algorithm design)

3. accurate and tractable science

Page 33:

Resulting quasar catalog

• 100,563 UVX quasar candidates
• Of 22,737 objects with spectra, 97.6% are quasars. We estimate 95.0% efficiency overall (aka "purity": good/all)

• 94.7% completeness w.r.t. g<19.5 UVX quasars from DR1 (good/all true)

• Largest mag. range ever: 14.2<g<21.0

• [Richards et al. 2004, ApJ]

• More recently, 195k quasars

Page 34:

Cosmic magnification [Scranton et al. 2005]

13.5M galaxies, 195,000 quasars

Most accurate measurement of cosmic magnification to date

[Nature, April 2005]

[Figure labels: more flux, more area.]

Page 35:

Next steps (in progress)

• better accuracy via coordinate-dependent priors

• 5 magnitudes

• use simulated quasars to push to higher redshift

• use DR4 higher-quality data

• faster bandwidth search

• 500k quasars easily, then 1M

Pages 36-41:


Bigger picture

• nearest neighbor (1-, k-, all-, approx., clsf.) [Gray & Moore 2000], [Miller et al. 2003], etc. (fastest alg)
• n-point correlation functions [Gray & Moore 2000], [Moore et al. 2000], [Scranton et al. 2003], [Gray & Moore 2004], [Nichol et al. 2005 in prep.] (fastest alg)
• density estimation (nonparametric) [Gray & Moore 2000], [Gray & Moore 2003], [Balogh et al. 2003] (fastest alg)
• Bayes classification (nonparametric) [Richards et al. 2004], [Gray et al. 2005 PhyStat] (fastest alg)
• nonparametric regression
• clustering: k-means and mixture models, others
• support vector machines, maybe (we'll see…)

Page 42:

Take-home messages

• Estimating a density? Use kernel density estimation (KDE).

• Classification problem? Consider the nonparametric Bayes classifier (NBC).

• Want to do these on huge datasets? Talk to us, use our software.

• Different computational/statistical problem? Grab me after the talk!

[email protected]