INTRODUCTION TO MACHINE LEARNING - Amazon S3 · 2016-05-04
INTRODUCTION TO MACHINE LEARNING
Clustering with k-means
Clustering, what?
● Similar within cluster
● Dissimilar between clusters
● Clustering: grouping objects in clusters
● No labels: unsupervised classification
● Plenty of possible clusterings
● Cluster: collection of objects
Clustering, why?
● Pattern Analysis
● Visualise Data
● Pre-processing Step
● Outlier Detection
● …
● Targeted Marketing Programs
● Student Segmentations
● Data Mining
● …
Clustering, how?
● Measure of Similarity: d(…, …)
• Numerical variables: metrics such as Euclidean, Manhattan, …
• Categorical variables: construct your own distance
● Clustering Methods
• k-means
• Hierarchical
• …
Many variations
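As a quick illustration (a toy example, not from the slides), base R's `dist()` computes both of the metrics mentioned above:

```r
# Two points in the plane: the distance depends on the chosen metric
m <- rbind(a = c(0, 0), b = c(3, 4))

d_euc <- dist(m, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
d_man <- dist(m, method = "manhattan")  # |3| + |4| = 7
```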
Compactness and Separation
● Within Cluster Sums of Squares (WSS): measure of compactness

  WSS = Σ_{i=1}^{k} Σ_{x ∈ C_i} d(x, c_i)²

  (k = #clusters, C_i = cluster, x = object, c_i = cluster centroid)

● Between Cluster Sums of Squares (BSS): measure of separation

  BSS = Σ_{i=1}^{k} n_i · d(c_i, x̄)²

  (n_i = #objects in cluster i, x̄ = sample mean)

Goal: minimise WSS and maximise BSS.
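The two sums of squares can be checked numerically. The sketch below (on made-up data) computes WSS and BSS by hand and compares them with the values `kmeans()` reports:

```r
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)         # 20 made-up objects in 2D
km <- kmeans(x, centers = 2, nstart = 10)

# WSS: squared distance from every object to its cluster centroid
wss <- sum((x - km$centers[km$cluster, ])^2)

# BSS: cluster size times squared distance from centroid to the sample mean
grand <- colMeans(x)
bss <- sum(km$size * rowSums(sweep(km$centers, 2, grand)^2))

# Compactness + separation add up to the total sum of squares (TSS)
tss <- sum(sweep(x, 2, grand)^2)
```

Since WSS + BSS = TSS is constant for a fixed dataset, minimising WSS automatically maximises BSS.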
k-Means Algorithm
Goal: Partition the data into k disjoint subsets
[Scatter plot of the data in the x–y plane]
Let’s take k = 3
k-Means Algorithm (k = 3)
1. Randomly assign k centroids
[Scatter plot: the k randomly assigned centroids]
k-Means Algorithm (k = 3)
1. Randomly assign k centroids
2. Assign data to the closest centroid
[Scatter plot: objects assigned to their closest centroid]
k-Means Algorithm (k = 3)
1. Randomly assign k centroids
2. Assign data to the closest centroid
3. Move centroids to the average location
[Scatter plot: centroids moved to the mean of their clusters]
k-Means Algorithm (k = 3)
1. Randomly assign k centroids
2. Assign data to the closest centroid
3. Move centroids to the average location
4. Repeat steps 2 and 3
[Scatter plot: reassignment and centroid updates repeated]
k-Means Algorithm (k = 3)
1. Randomly assign k centroids
2. Assign data to the closest centroid
3. Move centroids to the average location
4. Repeat steps 2 and 3
[Scatter plot: the final clustering]
The algorithm has converged!
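The four steps can be sketched directly in R. This is a minimal Lloyd-style version on made-up data; `my_kmeans` and the demo points are illustrative only, and base R's `kmeans()` uses the more refined Hartigan–Wong algorithm by default:

```r
my_kmeans <- function(x, k, iters = 100) {
  # 1. Randomly assign k centroids (here: k randomly chosen objects)
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(iters)) {
    # 2. Assign each object to the closest centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    cluster <- max.col(-d2)
    # 3. Move each centroid to the average location of its objects
    new_centers <- t(sapply(seq_len(k), function(j)
      colMeans(x[cluster == j, , drop = FALSE])))
    # 4. Repeat steps 2 and 3 until the centroids stop moving
    if (all(abs(new_centers - centers) < 1e-9)) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}

# Demo on two well-separated blobs
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
res <- my_kmeans(x, k = 2)
```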
Choosing k
● Goal: Find the k that minimises WSS
● Problem: WSS keeps decreasing as k increases!
● Solution: Look for the point where WSS starts decreasing slowly
● Rule of thumb: fix k such that WSS / TSS < 0.2
Choosing k: Scree Plot
Scree Plot: visualising the ratio WSS / TSS as a function of k
[Scree plot: WSS / TSS against k = 1, …, 7]
Look for the elbow in the plot
Choose k = 3
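A scree plot takes only a short loop. This sketch uses made-up data with three obvious groups (the variable names are illustrative):

```r
set.seed(2)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2),
           matrix(rnorm(50, mean = 10), ncol = 2))

# WSS / TSS ratio for k = 1, ..., 7
ratio <- sapply(1:7, function(k) {
  km <- kmeans(x, centers = k, nstart = 20)
  km$tot.withinss / km$totss
})

plot(1:7, ratio, type = "b", xlab = "k", ylab = "WSS / TSS")
```

For k = 1 the ratio is exactly 1, and at the true number of groups it drops below the 0.2 rule of thumb.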
k-Means in R
> my_km <- kmeans(data, centers, nstart)
● centers: starting centroids or #clusters
● nstart: #times R restarts with different centroids
● Distance: Euclidean metric
> my_km$tot.withinss  # WSS
> my_km$betweenss     # BSS
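Putting it together on made-up data (the identity WSS + BSS = TSS also holds for `kmeans()` output):

```r
set.seed(3)
data <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
              matrix(rnorm(40, mean = 6), ncol = 2))

my_km <- kmeans(data, centers = 2, nstart = 20)

my_km$tot.withinss  # WSS
my_km$betweenss     # BSS
my_km$cluster       # cluster membership of every object
my_km$centers       # final centroids
```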
INTRODUCTION TO MACHINE LEARNING
Let’s practice!
INTRODUCTION TO MACHINE LEARNING
Performance and Scaling
Cluster Evaluation
Not trivial! There is no ground truth:
● No true labels
● No true response
Evaluation method? Depends on the goal.
Goal: compact and separated clusters → measurable!
Cluster Measures
● WSS and BSS. Underlying idea: compare the variance within clusters with the separation between clusters
● Alternative:
• Diameter
• Intercluster Distance
Both give a good indication.
Diameter
[Scatter plot: the diameter of a cluster is the largest distance between two of its objects]
Measure of Compactness
Intercluster Distance
[Scatter plot: the intercluster distance is the smallest distance between objects of two different clusters]
Measure of Separation
Dunn’s Index
Dunn’s index = minimal intercluster distance / maximal diameter
[Scatter plot: the two distances that define Dunn’s index]
Higher Dunn → better separated / more compact clusters
Notes:
● High computational cost
● Worst-case indicator
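Dunn's index can be written out directly from its definition. `dunn_index` below is an illustrative helper (clusters are assumed to be labelled 1, …, k), not the `clValid` implementation:

```r
# Dunn's index: minimal intercluster distance / maximal diameter
dunn_index <- function(x, cluster) {
  d <- as.matrix(dist(x))
  k <- max(cluster)
  # Diameter: largest distance between two objects in the same cluster
  diam <- sapply(seq_len(k), function(j) max(d[cluster == j, cluster == j]))
  # Intercluster distance: smallest distance between objects in different clusters
  inter <- min(unlist(lapply(seq_len(k - 1), function(i)
    sapply((i + 1):k, function(j) min(d[cluster == i, cluster == j])))))
  inter / max(diam)
}

# Demo: two far-apart groups give a high (good) Dunn's index
set.seed(4)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 20), ncol = 2))
dunn_val <- dunn_index(x, rep(1:2, each = 20))
```

The nested loop over all object pairs is what makes the index expensive, and taking the worst diameter and the worst gap is what makes it a worst-case indicator.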
Alternative Measures
● Internal validation: based on intrinsic knowledge
• BIC Index
• Silhouette’s Index
● External validation: based on previous knowledge
• Hubert’s Correlation
• Jaccard’s Coefficient
Evaluating in R
Libraries: cluster and clValid
Dunn’s Index:
> dunn(clusters = my_km$cluster, Data = ...)
● clusters: cluster partitioning vector
● Data: original dataset
Scale Issues
Metrics are often scale dependent!
Which pair is most similar? (Age, Income, IQ)
● X1 = (28, 72000, 120)
● X2 = (56, 73000, 80)
● X3 = (29, 74500, 118)
● Intuition: (X1, X3)
● Euclidean: (X1, X2)
Solution: Rescale, e.g. income in thousands of dollars
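The three profiles above make this concrete; with base R's `dist()`:

```r
x <- rbind(X1 = c(28, 72000, 120),
           X2 = c(56, 73000,  80),
           X3 = c(29, 74500, 118))

# Raw Euclidean distances: income dominates, so X1 and X2 come out closest
d_raw <- as.matrix(dist(x))

# Rescale income to thousands of dollars: now X1 and X3 are closest
x2 <- x
x2[, 2] <- x2[, 2] / 1000
d_rescaled <- as.matrix(dist(x2))
```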
Standardizing
Problem: multiple variables on different scales
Solution: standardize your data
1. Subtract the mean
2. Divide by the standard deviation
> scale(data)
Note: standardizing → a different interpretation of the variables
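A sketch showing that `scale()` is exactly these two steps (on made-up data):

```r
set.seed(5)
data <- matrix(rnorm(30, mean = 10, sd = 4), ncol = 3)

std <- scale(data)  # 1. subtract the mean, 2. divide by the sd, per column

# The same thing by hand
manual <- sweep(sweep(data, 2, colMeans(data), "-"),
                2, apply(data, 2, sd), "/")
```

Each column of the result has mean 0 and standard deviation 1, so no single variable dominates the distance.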
INTRODUCTION TO MACHINE LEARNING
Let’s practice!
INTRODUCTION TO MACHINE LEARNING
Hierarchical Clustering
Hierarchical Clustering
Hierarchy:
● Which objects cluster first?
● Which cluster pairs merge? When?
Bottom-up:
● Starts from the objects
● Builds a hierarchy of clusters
Bottom-Up: Algorithm
Pre: calculate the distances between all objects
[Diagram: objects with their pairwise distances]
Bottom-Up: Algorithm
1. Put every object in its own cluster
2. Find the closest pair of clusters and merge them
3. Compute the distances between the new cluster and the old ones
4. Repeat steps 2 and 3 until one cluster remains
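On a tiny made-up dataset, `hclust()` records exactly this merge sequence: negative entries in `$merge` are single objects, positive entries refer to clusters formed in earlier steps, and `$height` gives the distance at which each merge happened.

```r
x <- c(1, 2, 9, 10, 30)            # five objects on a line
d <- dist(x)                       # pre-step: all pairwise distances
hc <- hclust(d, method = "single")

hc$merge   # row i: the two clusters merged in step i
hc$height  # distance at which each merge took place
```

Here 1 and 2 merge at distance 1, then 9 and 10 at distance 1, those two clusters merge at distance 7, and finally 30 joins at distance 20.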
Linkage Methods
● Single-Linkage: minimal distance between clusters
● Complete-Linkage: maximal distance between clusters
● Average-Linkage: average distance between clusters
Different linkage methods → different clusterings
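The three definitions, computed on two tiny hand-made clusters A = {0, 1} and B = {4, 7} on a line:

```r
A <- c(0, 1)
B <- c(4, 7)
d_pairs <- outer(A, B, function(a, b) abs(a - b))  # all 4 object-to-object distances

single_d   <- min(d_pairs)   # minimal distance:  |1 - 4| = 3
complete_d <- max(d_pairs)   # maximal distance:  |0 - 7| = 7
average_d  <- mean(d_pairs)  # average distance:  (4 + 7 + 3 + 6) / 4 = 5
```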
Single-Linkage
Minimal distance between objects in each cluster
Complete-Linkage
Maximal distance between objects in each cluster
Single-Linkage: Chaining
● Often undesired
● Can be a great outlier detector
Dendrogram
[Diagram: dendrogram with height on the vertical axis and the leaves (objects) at the bottom; clusters merge as the height increases, and a horizontal cut selects the final clusters]
Hierarchical Clustering in R
Library: stats
> dist(x, method)
● x: dataset
● method: distance
> hclust(d, method)
● d: distance matrix
● method: linkage
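A complete pipeline on made-up data, cutting the dendrogram into k clusters with `cutree()` (also in the `stats` library):

```r
set.seed(7)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 6), ncol = 2))

d  <- dist(x, method = "euclidean")   # x: dataset, method: distance
hc <- hclust(d, method = "complete")  # d: distance matrix, method: linkage

plot(hc)                       # draws the dendrogram
clusters <- cutree(hc, k = 2)  # cut the tree into k = 2 clusters
```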
Hierarchical: Pros and Cons
● Pros
• In-depth analysis
• Linkage methods → different patterns
● Cons
• High computational cost
• Can never undo merges
k-Means: Pros and Cons
● Pros
• Can revise cluster assignments at every step
• Fast computations
● Cons
• Fixed #Clusters
• Dependent on the starting centroids
INTRODUCTION TO MACHINE LEARNING
Let’s practice!