Data Mining In Modern Astronomy Sky Surveys:
Supervised Learning & Astronomy Applications
Ching-Wa Yip
[email protected]; Bloomberg 518
1/21/2014 JHU Intersession Course - C. W. Yip
Machine Learning
• We want computers to perform tasks.
• It is difficult for computers to "learn" like humans do.
• We use algorithms:
  – Supervised, e.g.: Classification, Regression
  – Unsupervised, e.g.: Density Estimation, Clustering, Dimension Reduction
Unsupervised vs. Supervised Learning
• Unsupervised:
  – Given data {x_1, x_2, x_3, ⋯, x_N}, find patterns.
  – The description of a pattern may come in the form of a function (say, f(x)).
• Supervised:
  – Given data {(x_1, y_1), (x_2, y_2), (x_3, y_3), ⋯, (x_N, y_N)}, find a function such that f(x) = y.
  – The y are the labels.
Types of Label
• Class
  – Binary: 0, 1
  – Galaxy types: E, S, Sa, … etc.
• Physical Quantities
  – E.g., the redshift of a galaxy
Basic Concepts in Machine Learning
• Labeled and unlabeled data
• Datasets: training set and test set
• Feature space
• Distance between points
• Cost function (also called the error function)
• Shape of the data distribution
• Outliers
Training Set & Test Set
• Training Set
  – Data that are used to build the model.
• Validation Set
  – Data that are used to evaluate the model.
  – These data were not used in training.
  – (Sometimes omitted.)
• Test Set
  – Similar to the validation set, but for evaluating the final model.
Aspects in Supervised Learning
• What are the popular algorithms?
• How do we minimize the cost function (numerically)?
  – When a model is already given.
• How do we select the appropriate model?
  – When we have many candidate models.
Supervised Learning: Find Decision Boundary in Labeled Data

[Schematic: labeled points of two classes in the feature space with axes x_1 and x_2; a decision boundary separates the classes.]
Algorithms for Supervised Learning: Classification and Regression
• Principal Component Analysis
• KNN
• Support Vector Machine
• Decision Tree
• Regression Tree
• Random Forest
• Etc.
K Nearest Neighbor (KNN)
• One of the simplest supervised algorithms.
• Idea: use the k nearest neighbors to estimate y for a given x.
• The value of k can be determined by model selection (see "Model Selection").
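The transcript contains no code, but a minimal KNN sketch in R (assuming the 'class' package; the toy data below are invented for illustration) could look like:

  library(class)

  set.seed(42)
  # Toy two-feature training data with binary labels
  train_x <- data.frame(x1 = c(rnorm(20, 0), rnorm(20, 3)),
                        x2 = c(rnorm(20, 0), rnorm(20, 3)))
  train_y <- factor(rep(c(0, 1), each = 20))

  # A new point whose label we want to estimate
  new_x <- data.frame(x1 = 1.5, x2 = 1.5)

  # Majority vote among the k = 3 nearest neighbors
  knn(train = train_x, test = new_x, cl = train_y, k = 3)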
KNN
• Given {(x_1, y_1), (x_2, y_2), (x_3, y_3), ⋯, (x_N, y_N)}, find y for a given x.
• E.g., x has two components (x ∈ R²), y is 0 or 1.
• Estimated class: from the majority vote of the k nearest neighbors.

[Schematic: points labeled 0 and 1 in the (x_1, x_2) plane with a query point x; with k = 1 the estimate follows the single nearest neighbor, while with k = 3 it follows the majority vote of the three nearest neighbors.]
Support Vector Machine (SVM): Finding Hyperplanes in the Feature Space
• Map the data into a higher-dimensional feature space.
• The decision boundaries, or hyperplanes, separate the feature space into classes.
  – 1D data: a point
  – 2D data: a line
  – Higher-dimensional data: a hyperplane
• More than one hyperplane can do the job.
• Support vectors are the data points located closest to the hyperplane.
  – They are the most difficult to classify.
• SVM chooses the hyperplane which maximizes the margin around the support vectors.
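As a hedged illustration (not the course's own code), an SVM with a linear kernel can be fitted in R with the 'e1071' package on made-up two-class data:

  library(e1071)

  set.seed(1)
  d <- data.frame(x1 = c(rnorm(30, 0), rnorm(30, 3)),
                  x2 = c(rnorm(30, 0), rnorm(30, 3)),
                  y  = factor(rep(c(0, 1), each = 30)))

  # Linear kernel: the fitted hyperplane maximizes the margin around the support vectors
  fit <- svm(y ~ x1 + x2, data = d, kernel = "linear")

  nrow(fit$SV)                                   # number of support vectors
  predict(fit, data.frame(x1 = 1.0, x2 = 2.0))   # classify a new point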
Many Hyperplanes can Separate the Data

[Schematic: two classes of labeled points in the feature space (x_1, x_2); several different lines (hyperplanes) can separate them.]
SVM Finds the Hyperplane which Maximizes the Margin (Perpendicular to the Hyperplane)

[Schematic: the same feature space (x_1, x_2); the chosen hyperplane lies between the two classes, and the support vectors are the labeled points closest to it on either side.]
SVM in Galaxy Classification
(Classification of LAMOST galaxy spectra, Haijun Tian, 2014 in prep.)
[Figure: classification performance (correct-classification rate, counts, SD) as a function of the signal-to-noise ratio of the data.]
SVM in Galaxy Classification
[Figure: as above, with the Rose criterion indicated.]
(Classification of LAMOST galaxy spectra, Haijun Tian, 2014, in prep.)
Decision Tree
• A decision tree partitions the data by features.
• A decision tree has:
  – Root node: the topmost decision node.
  – Decision nodes: each has two or more branches ("YES" or "NO"; "x > 0.8", "x = 0.8", or "x < 0.8").
  – Leaf nodes: where we make the decision (classification or regression).
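For illustration only (not from the slides), a classification tree can be grown in R with the 'rpart' package on invented data:

  library(rpart)

  set.seed(7)
  d <- data.frame(x1 = runif(100, 0, 2),
                  x2 = runif(100, 0, 2))
  d$y <- factor(ifelse(d$x1 < 1 & d$x2 > 1.2, 0, 1))   # hypothetical labeling rule

  # The root node holds the first split; leaf nodes carry the majority class
  tree <- rpart(y ~ x1 + x2, data = d, method = "class")
  print(tree)                                            # text view of the splits
  predict(tree, data.frame(x1 = 0.5, x2 = 1.5), type = "class")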
Decision Tree
• Given {(x_1, y_1), (x_2, y_2), (x_3, y_3), ⋯, (x_N, y_N)}, find y for a given x.
• E.g., x has two components, y is 0 or 1.

[Schematic: points labeled 0 and 1 in the (x_1, x_2) plane; the tree partitions the plane with axis-aligned cuts at 1 and 1.5 along one feature and 1.2 along the other.]
Binary-Decision Tree

[Schematic: the same labeled points, partitioned by the tree below.]

x_1 < 1?
  Yes: leaf (a, b)
  No:  x_2 > 1.2?
    Yes: leaf (a, b)
    No:  x_1 < 1.5?
      Yes: leaf (a, b)
      No:  leaf (a, b)

Here (a, b) represents the class votes at a leaf, e.g. 3 "0" votes and 1 "1" vote.
Regression Tree
• E.g., x has one component (= x); y is real.

[Schematic: a scatter of (x, y) points, partitioned at x = 1 and x = 1.5, with an approximately constant y in each region (about 1.7, 0.8, and 1.6 from left to right).]

x < 1?
  Yes: y = 1.7
  No:  x > 1.5?
    Yes: y = 1.6
    No:  y = 0.8
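A minimal regression-tree sketch in R (again with 'rpart'; the invented data roughly mirror the schematic above):

  library(rpart)

  set.seed(3)
  x <- runif(200, 0, 2)
  y <- ifelse(x < 1, 1.7, ifelse(x > 1.5, 1.6, 0.8)) + rnorm(200, sd = 0.1)
  d <- data.frame(x = x, y = y)

  tree <- rpart(y ~ x, data = d, method = "anova")   # "anova" = regression tree
  predict(tree, data.frame(x = c(0.5, 1.2, 1.8)))    # piecewise-constant predictions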
Random Forest: Reduce the Variance of the Estimator
• Idea: use many trees instead of just one for estimation.
• Algorithm:
  – Sample a subset from the data.
  – Construct the i-th tree from that subset.
  – At each node, choose a random subset of m features, then split the data based on those features.
  – Repeat for i = 1, 2, …, B trees.
  – Given x:
    • Take the majority vote over the trees (classification).
    • Take the average over the trees (regression).
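A minimal sketch in R with the 'randomForest' package (toy data; ntree plays the role of B and mtry the role of m):

  library(randomForest)

  set.seed(11)
  d <- data.frame(x1 = c(rnorm(50, 0), rnorm(50, 3)),
                  x2 = c(rnorm(50, 0), rnorm(50, 3)),
                  y  = factor(rep(c(0, 1), each = 50)))

  # Each of the B = 500 trees is grown on a bootstrap sample,
  # considering m = 1 randomly chosen feature at each split
  rf <- randomForest(y ~ x1 + x2, data = d, ntree = 500, mtry = 1)
  predict(rf, data.frame(x1 = 2.5, x2 = 2.8))   # majority vote over the trees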
Photometric Redshift Using Random Forest
• Use photometry (e.g., the SDSS filters u, g, r, i, z, i.e. 5 data points per galaxy) and galaxy inclination to estimate the redshift (distance) of galaxies (Yip, Carliles & Szalay et al. 2011).
z(spec) is the true redshift, the redshift obtained from the galaxy spectrum. z(photo) is the estimated redshift, the redshift obtained from the galaxy photometry.
[Figure; label: Galaxy Inclination.]
Comparison of Supervised Learning Algorithms
• Caruana & Niculescu-Mizil (2006) compared algorithms over 11 problems and 8 performance metrics.
[Table: columns give the probability that an algorithm ranks 1st, 2nd, 3rd, etc.]
Cost Function
• In many machine learning algorithms, the idea is to find the model parameters θ which minimize the cost function J(θ):

  J(θ) = (1/m) Σ_{i=1}^{m} [ y_model(x_i) − y_data,i ]²

  where m is the size of the training set.
• That is, we want y_model to be as close to y_data as possible.
• Note that y_model depends on θ.
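As a concrete (invented) example, the cost for a straight-line model y_model(x) = θ_1 + θ_2 x can be evaluated in R as:

  # Toy data
  x      <- c(0.5, 1.0, 1.5, 2.0, 2.5)
  y_data <- c(1.1, 1.9, 3.2, 3.9, 5.1)

  # J(theta) = (1/m) * sum over the training set of squared differences
  cost <- function(theta) {
    y_model <- theta[1] + theta[2] * x
    mean((y_model - y_data)^2)
  }

  cost(c(0, 2))   # the cost for theta = (0, 2)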
Cost Function Examples: Unsupervised vs. Supervised Learning
General form:

  J(θ) = (1/m) Σ_{i=1}^{m} [ y_model(x_i) − y_data,i ]²

Supervised:

  J(θ) = (1/m) Σ_{i=1}^{m} [ f(x_i) − y_i ]²

  m = size of the training set; y = data labels; f(x) = model fit to the training set.

Unsupervised:

  J(θ) = (1/n) Σ_{i=1}^{n} [ f(x_i) − h_i ]²

  n = number of bins in the data histogram; h = histogram data values; f(x) = model fit to the histogram.
Minimization of the Cost Function in order to Find the Model Parameters
• Maximum Likelihood Estimation
• Bayesian Parameter Estimation
• Gradient Descent
• Etc.
Gradient Descent: Idea
(Source: Andrew Ng)
The minimum is local: for a different starting point, we may end up at a different local minimum.
Minimization of Cost Function: Gradient Descent Algorithm
• Start at a point in the parameter space.
• Look around locally and take a baby step in the direction of steepest descent.
• Repeat until we find the minimum cost.
• Mathematically, we keep refining θ_j until convergence, according to:

  θ_j := θ_j − α ∂J(θ)/∂θ_j

• α is the learning rate.
• j runs from 1 to the number of parameters in the model.
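A minimal one-parameter sketch in R (the cost function, starting point, and learning rate are illustrative choices, not from the slides):

  J_grad <- function(theta) 2 * (theta - 2)     # dJ/dtheta for J(theta) = (theta - 2)^2

  theta <- 5                                    # starting point
  alpha <- 0.1                                  # learning rate
  for (step in 1:100) {
    theta <- theta - alpha * J_grad(theta)      # theta := theta - alpha * dJ/dtheta
  }
  theta                                         # converges toward the minimum at theta = 2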
Example: Gradient Descent in 1D Data
• The learning rate α is positive.

  θ := θ − α dJ(θ)/dθ

• Starting from the right of the minimum: dJ(θ)/dθ = slope > 0, so θ decreases as the algorithm progresses.
• Starting from the left of the minimum: the slope is negative, so θ increases toward the minimum.

[Schematic: two panels of J(θ) versus θ, marking θ_start, θ_step1, and the local minimum, for descent starting on either side of the minimum.]
Model Selection in Machine Learning
• The goal of cost function minimization is to select a model.
• Occam's Razor: the best explanation is usually the simplest one.
• Some model selection approaches:
  – Bayesian approach
  – Cross validation
  – Akaike Information Criterion (AIC)
  – Bayesian Information Criterion (BIC)
Model Selection: How to Select from Models with Different Complexity?
• Notice that the complexity parameters are not well defined and may need the user's judgment.
• Complexity parameter of a model, e.g.:
  – The number of nearest neighbors (k) in KNN.
  – The number of basis functions in regression.
Example: Fitting a Combination of Basis Functions
• E.g., polynomial functions {h_0, h_1, h_2, ⋯, h_B} where h_i = x^i.

[Schematics: the same scatter of (x, y) points fitted with polynomials of increasing order. B = 1 gives a straight line; B = 3 follows the overall trend; B = 10 bends through the individual points. Large model complexity: potential overfitting.]
Bias-Variance Decomposition
• Suppose we have an estimator of the model parameters.
• The estimator is denoted θ̂.
• The mean square error (MSE) of the estimator can be expressed as:

  MSE = bias² + var,

  where MSE = (1/n) Σ_{i=1}^{n} (θ̂_i − θ_true)².
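One way to see where this decomposition comes from is to write the MSE as an expectation over repeated estimates (an idealization of the finite average above):

  MSE = E[(θ̂ − θ_true)²]
      = E[(θ̂ − E[θ̂] + E[θ̂] − θ_true)²]
      = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ_true)²
      = var + bias²,

since the cross term vanishes: E[θ̂ − E[θ̂]] = 0.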
The Best Model: Tradeoff between Bias and Variance

[Schematic: bias², variance, and their sum (the MSE) as functions of model complexity, from small to large; the best model sits at the minimum of the MSE curve.]
Intuition: Bias-Variance Tradeoff
• E.g., polynomial functions {h_0, h_1, h_2, ⋯, h_B} where h_i = x^i.

[Schematic: a B = 1 fit to the data; the dotted line is the model from a different, random draw of the data. At a fixed x′, the spread of the estimated y′ values around y(true) illustrates the variance of the estimator.]
Intuition: Bias-Variance Tradeoff
• E.g., polynomial functions {h_0, h_1, h_2, ⋯, h_B} where h_i = x^i.

[Schematic: the same data fitted with B = 1 and B = 10; at a fixed x′, the B = 10 model shows a much larger variance of the estimated y′ values around y(true).]
Demonstration of Overfitting in R
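The demonstration script itself is not included in this transcript; a minimal sketch of what such a demonstration might look like (made-up data, degrees B = 1 and B = 10) is:

  set.seed(2014)
  x <- runif(20, 0, 1)
  y <- sin(2 * pi * x) + rnorm(20, sd = 0.2)

  fit_low  <- lm(y ~ poly(x, 1))    # B = 1: rigid model, higher bias
  fit_high <- lm(y ~ poly(x, 10))   # B = 10: flexible model, potential overfitting

  mean(resid(fit_low)^2)            # training error of the simple fit
  mean(resid(fit_high)^2)           # smaller training error, but a wiggly fit

  grid <- data.frame(x = seq(0, 1, length.out = 200))
  plot(x, y)
  lines(grid$x, predict(fit_low,  grid), lty = 2)   # straight line
  lines(grid$x, predict(fit_high, grid), lty = 1)   # overfitted curve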
Some Variations of Machine Learning
• Semi-Supervised Learning:
  – Given data {(x_1, y_1), (x_2, y_2), (x_3, y_3), ⋯, (x_k, y_k), x_{k+1}, ⋯, x_N}, predict the labels y_{k+1}, ⋯, y_N for x_{k+1}, ⋯, x_N.
• Active Learning:
  – Similar to semi-supervised learning, but we can ask for extra labels y_i for particular data points x_i as the algorithm runs.