Data Mining Sky Surveys Jan 21 2014


Transcript of Data Mining Sky Surveys Jan 21 2014

Page 1

Data Mining In Modern Astronomy Sky Surveys:

Supervised Learning & Astronomy Applications

Ching-Wa Yip

[email protected]; Bloomberg 518

Page 2

Machine Learning

• We want computers to perform tasks.

• It is difficult for computers to "learn" the way humans do.

• We use algorithms:

– Supervised, e.g.:
  • Classification
  • Regression

– Unsupervised, e.g.:
  • Density Estimation
  • Clustering
  • Dimension Reduction

Page 4

Unsupervised vs. Supervised Learning

β€’ Unsupervised:

– Given data {𝒙1, 𝒙2, 𝒙3, ⋯, 𝒙n}, find patterns.

– The description of a pattern may come in the form of a function (say, g(𝒙)).

• Supervised:

– Given data {(𝒙1, 𝒚1), (𝒙2, 𝒚2), (𝒙3, 𝒚3), ⋯, (𝒙n, 𝒚n)}, find a function such that f(𝒙) = 𝒚.

– The 𝒚 are the labels.

Page 5

Types of Label

• Class

– Binary: 0, 1

– Galaxy types: E, S, Sa, etc.

• Physical Quantities

– E.g., the redshift of a galaxy

Page 6

Basic Concepts in Machine Learning

• Labeled and unlabeled data

β€’ Datasets: training set and test set

β€’ Feature space

β€’ Distance between points

• Cost Function (also called Error Function)

β€’ Shape of the data distribution

β€’ Outliers

Page 7

Training Set & Test Set

β€’ Training Set

– Data that are used to build the model.

β€’ Validation Set

– Data that are used to evaluate the model.

– These data were not used in training.

– (Sometimes omitted.)

β€’ Test Set

– Similar to Validation Set but for the final model.
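
To make the split concrete, here is a minimal sketch in R (the language of the course's later demonstration); the 70/30 split fraction and the use of the built-in iris data are illustrative choices, not from the slides:

```r
# Split a data frame into a 70% training set and a 30% test set.
set.seed(42)                                   # for reproducibility
n         <- nrow(iris)                        # iris ships with base R
train_idx <- sample(n, size = round(0.7 * n))  # random 70% of the row indices
train_set <- iris[train_idx, ]                 # used to build the model
test_set  <- iris[-train_idx, ]                # held out for evaluation
```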

Page 8

Aspects in Supervised Learning

• What are the popular algorithms?

• How do we minimize the cost function (numerically)?

– Assuming a model is already given.

• How do we select the appropriate model?

– When we have many candidate models.

Page 9

Supervised Learning: Find Decision Boundary in Labeled Data

[Figure: labeled data points of two classes plotted in the feature space (x1, x2).]

Page 10

Supervised Learning: Find Decision Boundary in Labeled Data

[Figure: the same labeled data in the feature space (x1, x2), with a decision boundary separating the two classes.]

Page 11

Algorithms for Supervised Learning: Classification and Regression

β€’ Principal Component Analysis

β€’ KNN

β€’ Support Vector Machine

β€’ Decision Tree

β€’ Regression Tree

β€’ Random Forest

β€’ Etc.

Page 12

K Nearest Neighbor (KNN)

β€’ One of the simplest supervised algorithms.

• Idea: use the k nearest neighbors to estimate 𝒚 for a given 𝒙.

• The value of k can be determined by model selection (see "Model Selection").

Page 13

KNN

β€’ Given {(𝒙1, π’š1), (𝒙2, π’š2), (𝒙3, π’š3), β‹―, (𝒙𝑛, π’šπ‘›)}, find π’š for a given 𝒙.

β€’ E.g., 𝒙 has two components (𝒙 ∈ 𝑅2), π’š is 0 or 1.

• Estimated class: the majority vote of the k nearest neighbors.

[Figure: points labeled 0 and 1 in the feature space (x1, x2); with k = 1, the query point 𝒙 takes the class of its single nearest neighbor.]

Page 14

KNN

β€’ Given {(𝒙1, π’š1), (𝒙2, π’š2), (𝒙3, π’š3), β‹―, (𝒙𝑛, π’šπ‘›)}, find π’š for a given 𝒙.

β€’ E.g., 𝒙 has two components (𝒙 ∈ 𝑅2), π’š is 0 or 1.

• Estimated class: the majority vote of the k nearest neighbors.

[Figure: the same labeled points; with k = 3, the query point 𝒙 is assigned the majority class of its three nearest neighbors.]
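
A minimal KNN sketch in R using knn() from the class package; the toy coordinates, labels, and query point below are illustrative, not the data in the figure:

```r
library(class)  # provides knn()

# Toy training data: two features (x1, x2) and binary labels.
train  <- data.frame(x1 = c(0.2, 0.4, 0.5, 1.4, 1.6, 1.8),
                     x2 = c(0.3, 0.6, 0.2, 1.5, 1.3, 1.7))
labels <- factor(c(0, 0, 0, 1, 1, 1))

# Query point whose class we want to estimate.
query <- data.frame(x1 = 1.0, x2 = 1.1)

# Majority vote among the k = 3 nearest neighbors (Euclidean distance).
knn(train, query, cl = labels, k = 3)
```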

Page 15

Support Vector Machine (SVM): Finding Hyperplanes in the Feature Space

• Map the data into a higher-dimensional feature space.

• The decision boundaries, or hyperplanes, separate the feature space into classes:

– 1D data: a point
– 2D data: a line
– Higher-dimensional data: a hyperplane

• More than one hyperplane can do the job.

• Support vectors are the data points located closest to the hyperplane.

– They are the most difficult to classify.

• SVM chooses the hyperplane which maximizes the margin around the support vectors.
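
A minimal sketch of a linear SVM in R, using svm() from the e1071 package; the simulated two-class data and the linear kernel are illustrative choices:

```r
library(e1071)  # provides svm()

# Two-class toy data in a 2D feature space.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
y <- factor(rep(c(0, 1), each = 20))

# Fit a maximum-margin classifier with a linear kernel.
model <- svm(x, y, kernel = "linear")

# The support vectors are the points closest to the hyperplane.
model$SV

# Classify new points.
predict(model, rbind(c(0, 0), c(3, 3)))
```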

Page 16

Many Hyperplanes can Separate the Data

[Figure: two classes of labeled points in the feature space (x1, x2); several different hyperplanes can separate them.]

Page 17

SVM Finds the Hyperplane which Maximizes the Margin (Perpendicular to the Hyperplane)

[Figure: the maximum-margin hyperplane separating the two classes in the feature space (x1, x2); the support vectors are the points closest to the hyperplane on either side.]

Page 18

SVM in Galaxy Classification

(Classification of LAMOST galaxy spectra, Haijun Tian, 2014 in prep.)

πΆπ‘œπ‘Ÿπ‘Ÿ

𝑒𝑐𝑑

πΆπ‘™π‘Ž

𝑠𝑠𝑖𝑓

π‘–π‘π‘Žπ‘‘π‘–

π‘œπ‘›

πΆπ‘œπ‘’π‘›π‘‘

π‘‡π‘œπ‘‘π‘Ž

𝑙 𝐷

π‘Žπ‘‘π‘Ž

πΆπ‘œπ‘’π‘›π‘‘

Signal-to-Noise Ratio of Data

Page 19

SVM in Galaxy Classification

[Figure: the same classification performance, with the Rose Criterion marked.]

(Classification of LAMOST galaxy spectra, Haijun Tian, 2014 in prep.)

Page 20

Decision Tree

β€’ A decision tree partitions the data by features.

β€’ A decision tree has

– Root Node: the topmost decision node.

– Decision Nodes: have two or more branches (e.g., "YES" or "NO"; or "x > 0.8", "x = 0.8", "x < 0.8").

– Leaf Nodes: where we make the decision (classification or regression).

Page 21

Decision Tree

β€’ Given {(𝒙1, π’š1), (𝒙2, π’š2), (𝒙3, π’š3), β‹―, (𝒙𝑛, π’šπ‘›)}, find π’š for a given 𝒙.

β€’ E.g., 𝒙 has two components, π’š is 0 or 1.

[Figure: points labeled 0 and 1 in the feature space (x1, x2).]

Page 22

Decision Tree

β€’ Given {(𝒙1, π’š1), (𝒙2, π’š2), (𝒙3, π’š3), β‹―, (𝒙𝑛, π’šπ‘›)}, find π’š for a given 𝒙.

β€’ E.g., 𝒙 has two components, π’š is 0 or 1.

[Figure: the same labeled points, partitioned by splits at x1 = 1, x1 = 1.5, and x2 = 1.2.]

Page 23

Binary-Decision Tree

[Figure: the labeled data in the feature space (x1, x2), partitioned at x1 = 1, x1 = 1.5, and x2 = 1.2.]

x1 < 1?
  Yes → (7, 0)
  No  → x2 > 1.2?
    Yes → (3, 1)
    No  → x1 < 1.5?
      Yes → (1, 2)
      No  → (0, 4)

Here (3, 1) represents 3 "0" votes and 1 "1" vote.

Page 24

Regression Tree

β€’ E.g., 𝒙 has one component (= π‘₯); π’š is real.

[Figure: scatter of (x, y) data points to be fit by a regression tree.]

Page 25

Regression Tree

β€’ E.g., 𝒙 has one component (= π‘₯); π’š is real.

[Figure: the same scatter, partitioned at x = 1 and x = 1.5, with a constant prediction (0.8, 1.6, or 1.7) in each region.]

x < 1?
  Yes → y = 1.7
  No  → x > 1.5?
    Yes → y = 1.6
    No  → y = 0.8
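
A minimal regression-tree sketch in R, again with rpart; method = "anova" grows a tree whose leaves predict the mean y in each region (data simulated for illustration):

```r
library(rpart)

# Toy 1D regression data: a piecewise-constant signal plus noise.
set.seed(3)
x <- runif(200, 0, 2)
y <- ifelse(x < 1, 1.7, ifelse(x > 1.5, 1.6, 0.8)) + rnorm(200, sd = 0.1)

# Fit a regression tree; each leaf predicts the mean of y in its region.
fit <- rpart(y ~ x, method = "anova")
print(fit)

# Predict y for new x values.
predict(fit, data.frame(x = c(0.5, 1.2, 1.8)))
```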

Page 26

Random Forest: Reduce the Variance of the Estimator

• Idea: use many trees instead of just one for estimation.

• Algorithm:

– Sample a subset from the data.
– Construct the i-th tree from the subset.
– At each node, choose a random subset of m features. Then split the data based on those features.
– Repeat for i = 1, 2, …, B trees.
– Given 𝒙 (see the sketch below):
  • Take the majority vote over the trees (classification).
  • Take the average over the trees (regression).
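
A minimal sketch with the randomForest package in R; the iris data, ntree = 500, and mtry = 2 are illustrative choices:

```r
library(randomForest)

# Classification forest on the built-in iris data: B = 500 trees,
# each split considering a random subset of mtry = 2 features.
set.seed(4)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
print(rf)  # out-of-bag error estimate and confusion matrix

# Majority vote over the trees classifies new data.
predict(rf, iris[c(1, 51, 101), ])
```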

Page 27

Photometric Redshift Using Random Forest

• Use photometry (e.g., the five SDSS filters u, g, r, i, z, i.e., 5 data points per galaxy) and galaxy inclination to estimate the redshift (distance) of galaxies (Yip, Carliles & Szalay et al. 2011).

z(spec) is the true redshift, obtained from galaxy spectra. z(photo) is the estimated redshift, obtained from galaxy photometry.

[Figure: z(photo) vs. z(spec), labeled by galaxy inclination.]

Page 28

Comparison of Supervised Learning Algorithms

β€’ Caruana & Niculescu-Mizil (2006) compared algorithms over 11 problems and 8 performance metrics.

Columns give the probability that an algorithm performs 1st, 2nd, 3rd, etc.

Page 29

Cost Function

• In many machine learning algorithms, the idea is to find the model parameters 𝜽 which minimize the cost function J(𝜽):

$J(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \left[ \mathrm{model}(\mathrm{data}_i) - \mathrm{data}_i \right]^2$

where m is the size of the training set.

• That is, we want the model to be as close to the data as possible.

• Note that the model depends on 𝜽.
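
As a concrete illustration, a minimal R sketch of this cost for a hypothetical one-parameter model y = θx (the model, data, and noise level are all illustrative):

```r
# Mean-squared-error cost J(theta) for a toy model y = theta * x.
cost <- function(theta, x, y) {
  model <- theta * x    # model predictions, which depend on theta
  mean((model - y)^2)   # (1/m) * sum of squared residuals
}

# Illustrative data generated from theta = 2 plus noise.
set.seed(5)
x <- runif(50); y <- 2 * x + rnorm(50, sd = 0.1)
cost(1.0, x, y)   # larger cost: wrong parameter
cost(2.0, x, y)   # smaller cost: near the true parameter
```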

Page 30

Cost Function Examples: Unsupervised vs. Supervised Learning

$J(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \left[ \mathrm{model}(\mathrm{data}_i) - \mathrm{data}_i \right]^2$

Supervised:

$J(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \left( f(\boldsymbol{x}_i) - \boldsymbol{y}_i \right)^2$

m = size of the training set
𝒙 = data
f(𝒙) = model to the training set

Unsupervised:

$J(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \left( g(\boldsymbol{x}_i) - \boldsymbol{x}_i \right)^2$

m = number of bins in the data histogram
𝒙 = histogram data value
g(𝒙) = model to the histogram

Page 31

Minimization of the Cost Function in Order to Find the Model Parameters

β€’ Maximum Likelihood Estimate

β€’ Bayesian Parameter Estimate

β€’ Gradient Descent

β€’ Etc.

Page 32

Gradient Descent: Idea

(Source: Andrew Ng)

The minimum is local: for a different starting point, we may get a different local minimum.

Page 33

Minimization of Cost Function: Gradient Descent Algorithm

β€’ Start at a point in the parameter space.

• Look around locally and take a baby step in the direction of steepest descent.

• Repeat until we find the minimum cost.

• Mathematically, we keep refining θ_j until convergence, according to:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\boldsymbol{\theta})$

• α is the learning rate.

• j runs from 1 to the number of parameters in the model.
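
A minimal gradient-descent sketch in R for the one-parameter cost J(θ) = mean((θx − y)²); the learning rate, starting point, and step count are illustrative choices:

```r
# Gradient descent on J(theta) = mean((theta * x - y)^2).
set.seed(6)
x <- runif(50); y <- 2 * x + rnorm(50, sd = 0.1)

theta <- 0      # starting point in parameter space
alpha <- 0.1    # learning rate

for (step in 1:200) {
  # Analytic gradient: dJ/dtheta = mean(2 * (theta*x - y) * x).
  grad  <- mean(2 * (theta * x - y) * x)
  theta <- theta - alpha * grad   # baby step downhill
}
theta   # should end up close to the true value of 2
```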

Page 34

Example: Gradient Descent in 1D Data

β€’ The learning rate is positive.

πœƒ ≔ πœƒ βˆ’ α𝑑

𝑑θ𝐽(πœƒ)

Start from the right of the minimum. πœ•

πœ•ΞΈπ½(πœƒ) = slope > 0.

πœƒ decreases as the algorithm progresses.

πœƒ

𝐽(πœƒ)

. πœƒπ‘ π‘‘π‘Žπ‘Ÿπ‘‘

. πœƒπ‘ π‘‘π‘’π‘1

πœƒ

𝐽(πœƒ)

. πœƒπ‘ π‘‘π‘Žπ‘Ÿπ‘‘

. πœƒπ‘ π‘‘π‘’π‘1

Start from the left of the minimum.

. πœƒπ‘’π‘›π‘‘ πœƒπ‘’π‘›π‘‘

.

Page 35

Model Selection in Machine Learning

β€’ The goal of Cost Function minimization is to select a model.

β€’ Occam’s Razor: the best explanation is usually the simplest one.

• Some model selection approaches:

– Bayesian Approach

– Cross Validation (see the sketch after this list)

– Akaike Information Criterion (AIC)

– Bayesian Information Criterion (BIC)
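
A minimal cross-validation sketch in R for choosing k in KNN; a single validation split is used for brevity, and the candidate k values are illustrative:

```r
library(class)  # knn()

# Illustrative data: iris features and species labels.
set.seed(7)
n   <- nrow(iris)
idx <- sample(n, round(0.7 * n))
train_x <- iris[idx, 1:4];  train_y <- iris$Species[idx]
valid_x <- iris[-idx, 1:4]; valid_y <- iris$Species[-idx]

# Validation error for each candidate complexity parameter k.
ks  <- c(1, 3, 5, 7, 9, 15)
err <- sapply(ks, function(k) {
  pred <- knn(train_x, valid_x, cl = train_y, k = k)
  mean(pred != valid_y)
})
ks[which.min(err)]  # the selected k
```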

Page 36

Model Selection: How to Select from Models with Different Complexity?

• Notice that the complexity parameters are not always well defined, and choosing them may require the user's judgment.

β€’ Complexity parameter of a model, e.g.:

– The number of nearest neighbors (k) in KNN.

– The number of basis functions in regression.

Page 37

Example: Fitting a combination of Basis Functions

• E.g., polynomial functions $\{\phi_0, \phi_1, \phi_2, \cdots, \phi_B\}$, where $\phi_k = x^k$

[Figure: scatter of (x, y) data points to be fit by a combination of basis functions.]

Page 38

Schematics: Fitting a combination of Basis Functions

• E.g., polynomial functions $\{\phi_0, \phi_1, \phi_2, \cdots, \phi_B\}$, where $\phi_k = x^k$

[Figure: the scatter with a straight-line fit (B = 1).]

Page 39

Schematics: Fitting a combination of Basis Functions

• E.g., polynomial functions $\{\phi_0, \phi_1, \phi_2, \cdots, \phi_B\}$, where $\phi_k = x^k$

[Figure: the scatter with fitted curves of order B = 1 and B = 3.]

Page 40

Schematics: Fitting a combination of Basis Functions

• E.g., polynomial functions $\{\phi_0, \phi_1, \phi_2, \cdots, \phi_B\}$, where $\phi_k = x^k$

[Figure: the scatter with fitted curves of order B = 1 and B = 10.]

Large Model Complexity: Potential Overfitting
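
A minimal R sketch of fitting polynomial bases of increasing order with lm() and poly(); the simulated sine-plus-noise data are illustrative:

```r
# Simulated data: a smooth curve plus noise.
set.seed(8)
x <- seq(0, 1, length.out = 30)
y <- sin(2 * pi * x) + rnorm(30, sd = 0.2)

# Least-squares fits with polynomial bases phi_k = x^k, k = 0..B.
fit1  <- lm(y ~ poly(x, 1))    # B = 1: a straight line, likely underfits
fit3  <- lm(y ~ poly(x, 3))    # B = 3: follows the curve
fit10 <- lm(y ~ poly(x, 10))   # B = 10: high complexity, may overfit

# In-sample error always shrinks as B grows; it cannot select the model alone.
sapply(list(fit1, fit3, fit10), function(f) mean(residuals(f)^2))
```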

Page 41

Bias-Variance Decomposition

β€’ Suppose we have an estimator of the model parameters.

• An estimator is denoted as $\hat{\theta}$.

• The Mean Square Error (MSE) of an estimator can be decomposed as:

MSE = bias² + var

where

$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{\theta}_i - \theta_{\mathrm{True}} \right)^2$
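
A minimal Monte Carlo check of this decomposition in R, for a hypothetical estimator (the sample mean of 10 draws estimating θ_True = 2):

```r
# Monte Carlo check that MSE = bias^2 + variance for an estimator.
set.seed(9)
theta_true <- 2
est <- replicate(10000, mean(rnorm(10, mean = theta_true)))  # estimator draws

mse  <- mean((est - theta_true)^2)
bias <- mean(est) - theta_true
vari <- var(est)

mse            # ~ 0.1 here (variance of the sample mean, sigma^2 / n)
bias^2 + vari  # matches the MSE up to simulation noise
```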

Page 42

The Best Model: Tradeoff between Bias and Variance

[Figure: bias², variance, and their sum (the MSE) as functions of model complexity; the best model minimizes the MSE between small and large complexity.]

Page 43

Intuition: Bias-Variance Tradeoff

• E.g., polynomial functions $\{\phi_0, \phi_1, \phi_2, \cdots, \phi_B\}$, where $\phi_k = x^k$

[Figure: B = 1 models fitted to different random draws of the data (dotted: model from a different, random draw); at a test point x′, the spread of the estimated y′ values around y(true) illustrates the variance of the estimate.]

Page 44

Intuition: Bias-Variance Tradeoff

• E.g., polynomial functions $\{\phi_0, \phi_1, \phi_2, \cdots, \phi_B\}$, where $\phi_k = x^k$

[Figure: fits of order B = 1 and B = 10 to different random draws of the data; at x′, the B = 10 estimates of y′ scatter more widely around y(true), illustrating the larger variance of the more complex model.]

Page 45

Demonstration of Overfitting in R
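
The live R demonstration is not captured in the transcript; the following minimal stand-in compares training and test error as the polynomial order B grows (all data and settings are illustrative):

```r
# Overfitting demo: training error keeps falling, test error turns back up.
set.seed(10)
x_tr <- runif(30);  y_tr <- sin(2 * pi * x_tr) + rnorm(30, sd = 0.2)
x_te <- runif(200); y_te <- sin(2 * pi * x_te) + rnorm(200, sd = 0.2)

for (B in c(1, 3, 10)) {
  fit    <- lm(y_tr ~ poly(x_tr, B))
  tr_mse <- mean(residuals(fit)^2)
  te_mse <- mean((predict(fit, data.frame(x_tr = x_te)) - y_te)^2)
  cat(sprintf("B = %2d   train MSE = %.3f   test MSE = %.3f\n",
              B, tr_mse, te_mse))
}
```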

Page 46

Some Variations of Machine Learning

β€’ Semi-Supervised Learning:

– Given data {(𝒙1, 𝒚1), (𝒙2, 𝒚2), (𝒙3, 𝒚3), ⋯, (𝒙k, 𝒚k), 𝒙k+1, ⋯, 𝒙n}, predict the labels 𝒚k+1, ⋯, 𝒚n for 𝒙k+1, ⋯, 𝒙n.

β€’ Active Learning:

– Similar to Semi-Supervised Learning, but we can ask for extra labels 𝒚i for particular data points 𝒙i as the algorithm runs.
