K-means Clustering with Scikit-Learn

K-Means Clustering with Scikit-Learn Sarah Guido PyData SV 2014

description

Given at PyData SV 2014. In machine learning, clustering is a good way to explore your data and pull out patterns and relationships. Scikit-learn has some great clustering functionality, including the k-means clustering algorithm, which is among the easiest to understand. Let's take an in-depth look at k-means clustering and how to use it. This mini-tutorial/talk will cover what sort of problems k-means clustering is good at solving, how the algorithm works, how to choose k, how to tune the algorithm's parameters, and how to implement it on a set of data.

Transcript of K-means Clustering with Scikit-Learn

Page 1: K-means Clustering with Scikit-Learn

K-Means Clustering with Scikit-Learn

Sarah Guido

PyData SV 2014

Page 2: K-means Clustering with Scikit-Learn

About Me

• Today: graduated from the University of Michigan!
• Soon: data scientist at Reonomy
• PyLadies co-organizer
• @sarah_guido

Page 3: K-means Clustering with Scikit-Learn

Outline

• What is k-means clustering?
  • How it works
  • When to use it
• K-means clustering in scikit-learn
  • Basic implementation
  • Implementation with tuned parameters

Page 4: K-means Clustering with Scikit-Learn

Clustering

• Unsupervised learning
• Unlabeled data

• Split observations into groups
• Distance between data points
• Exploring the data

Page 5: K-means Clustering with Scikit-Learn

K-means clustering

• Formally: a method of vector quantization
• Partition space into Voronoi cells

• Separate samples into n groups of equal variance

• Uses the Euclidean distance metric

Page 6: K-means Clustering with Scikit-Learn

K-means clustering

• Iterative refinement
• Three basic steps
  • Step 1: Choose k
  • Iterate over:
    • Step 2: Assignment
    • Step 3: Update
• Repeats until convergence has been reached
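
The three steps above can be sketched in plain NumPy. This is an illustrative sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, the random choice of starting centroids, and the stopping test are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Illustrative assignment/update loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k, then pick k distinct observations as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2 (assignment): label each point with its nearest centroid (Euclidean).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3 (update): move each centroid to the mean of its assigned points.
        # A cluster that received no points keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two tight groups of points: five at the origin, five at (10, 10).
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
centroids, labels = kmeans_sketch(X, k=2)
print(centroids)
print(labels)
```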

Page 7: K-means Clustering with Scikit-Learn

K-means clustering

• Assignment

• Update
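
The formulas for these two steps did not survive in the transcript. As an assumption about what the slide showed, the standard textbook formulation is given here, with $m_i^{(t)}$ the centroid of cluster $i$ at iteration $t$ and $S_i^{(t)}$ the set of observations assigned to it:

```latex
% Assignment: each observation goes to the cluster with the nearest centroid
S_i^{(t)} = \left\{ x_p : \lVert x_p - m_i^{(t)} \rVert^2 \le \lVert x_p - m_j^{(t)} \rVert^2 \;\; \forall j,\ 1 \le j \le k \right\}

% Update: each centroid becomes the mean of the observations assigned to it
m_i^{(t+1)} = \frac{1}{\lvert S_i^{(t)} \rvert} \sum_{x_p \in S_i^{(t)}} x_p
```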

Page 8: K-means Clustering with Scikit-Learn

K-means clustering

• Advantages
  • Scales well
  • Efficient
  • Will always converge
• Disadvantages
  • Choosing the wrong k
  • Convergence to local minimum

Page 9: K-means Clustering with Scikit-Learn

K-means clustering

• When to use
  • Normally distributed data
  • Large number of samples
  • Not too many clusters
  • Distance can be measured in a linear fashion

Page 10: K-means Clustering with Scikit-Learn

Scikit-Learn

• Machine learning module
• Open-source
• Built-in datasets
• Good resources for learning

Page 11: K-means Clustering with Scikit-Learn

Scikit-Learn

• Model = EstimatorObject()
• Unsupervised:
  • Model.fit(dataset.data)
  • dataset.data = the dataset's feature array
• Supervised would use the labels as a second parameter
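
The estimator pattern above might look like this with `KMeans`. The built-in iris dataset and `n_clusters=3` are assumptions chosen for illustration; they are not the data or settings used in the talk.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

dataset = load_iris()   # built-in dataset; stands in for the talk's data
X = dataset.data        # feature array, shape (n_samples, n_features)

# Model = EstimatorObject()
model = KMeans(n_clusters=3, n_init=10, random_state=0)

# Unsupervised: fit on the features only, no labels passed.
# A supervised estimator would instead be called as fit(dataset.data, dataset.target).
model.fit(X)

print(model.labels_[:10])             # cluster index assigned to each observation
print(model.cluster_centers_.shape)   # one centroid per cluster
```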

Page 12: K-means Clustering with Scikit-Learn

K-means in scikit-learn

• Efficient and fast
• You: pick n clusters, kmeans: finds n initial centroids

• Run clustering jobs in parallel

Page 13: K-means Clustering with Scikit-Learn

Dataset

• UCI (University of California, Irvine) Machine Learning Repository

• Individual household power consumption

Page 14: K-means Clustering with Scikit-Learn

K-means in scikit-learn

Page 15: K-means Clustering with Scikit-Learn

K-means in scikit-learn

• Results

Page 16: K-means Clustering with Scikit-Learn

K-means parameters

• n_clusters
• max_iter
• n_init
• init
• precompute_distances
• tol
• n_jobs
• random_state
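
A minimal sketch of setting these parameters explicitly. The synthetic `make_blobs` data and the specific values are assumptions for illustration. `precompute_distances` and `n_jobs` are left out because later scikit-learn releases removed them from `KMeans`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption; not the talk's dataset).
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

model = KMeans(
    n_clusters=4,       # k: number of clusters, and so of centroids
    init="k-means++",   # how the initial centroids are chosen
    n_init=10,          # number of runs from different starting centroids; lowest inertia wins
    max_iter=300,       # cap on assignment/update iterations within one run
    tol=1e-4,           # tolerance on the change in cluster centers used to declare convergence
    random_state=0,     # fixes the random number generation so results are reproducible
)
model.fit(X)

print(model.n_iter_)    # iterations used by the best run
print(model.inertia_)   # sum of squared distances of samples to their closest centroid
```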

Page 17: K-means Clustering with Scikit-Learn

n_clusters: choosing k

• Graphing the variance
• Information criterion
• Cross-validation

Page 18: K-means Clustering with Scikit-Learn

n_clusters: choosing k

• Graphing the variance
  • from scipy.spatial.distance import cdist, pdist
  • cdist: distance computation between sets of observations
  • pdist: pairwise distances between observations in the same set
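
The code behind the variance graphs is not in the transcript. One common way to use `cdist` and `pdist` for this, shown here as a sketch on synthetic data (the `make_blobs` data and the range of k are assumptions), is to compute the fraction of variance explained for each k and look for the "elbow" where adding clusters stops helping much.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption; not the talk's dataset).
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

k_values = range(1, 10)
models = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X) for k in k_values]

# cdist: distance from every observation to every centroid; keep the nearest one.
nearest = [cdist(X, m.cluster_centers_, "euclidean").min(axis=1) for m in models]
within_ss = np.array([np.sum(d ** 2) for d in nearest])   # within-cluster sum of squares

# pdist: all pairwise distances inside X. The sum of their squares divided by
# the number of observations equals the total sum of squares around the mean.
total_ss = np.sum(pdist(X) ** 2) / X.shape[0]
explained = (total_ss - within_ss) / total_ss   # fraction of variance explained

for k, frac in zip(k_values, explained):
    print(k, round(float(frac), 3))
```

Plotting `explained` against `k_values` gives the kind of curve the following slides refer to.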

Page 19: K-means Clustering with Scikit-Learn

n_clusters: choosing k

• Graphing the variance

Page 20: K-means Clustering with Scikit-Learn

n_clusters: choosing k

• Graphing the variance

Page 21: K-means Clustering with Scikit-Learn

n_clusters: choosing k

• Graphing the variance

Page 22: K-means Clustering with Scikit-Learn

n_clusters: choosing k

n_clusters = 4
n_clusters = 7

Page 23: K-means Clustering with Scikit-Learn

n_clusters: choosing k

• n_clusters = 8 (default)

Page 24: K-means Clustering with Scikit-Learn

init

• k-means++
  • Default
  • Selects initial cluster centers in a way that speeds up convergence
• random
  • Choose k rows at random for initial centroids
• Ndarray that gives initial centers
  • Shape (n_clusters, n_features)
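
The three `init` options could be tried side by side as follows. This is a sketch on synthetic data; the `make_blobs` data and the use of the first three rows as explicit starting centers are arbitrary assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption; not the talk's dataset).
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Default: k-means++ seeding.
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

# random: k observations drawn at random as the initial centroids.
km_rand = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

# ndarray of shape (n_clusters, n_features); here simply the first three rows of X.
# With explicit centers there is only one starting point, so n_init=1.
start = X[:3].copy()
km_arr = KMeans(n_clusters=3, init=start, n_init=1).fit(X)

for name, km in [("k-means++", km_pp), ("random", km_rand), ("ndarray", km_arr)]:
    print(name, km.n_iter_, round(km.inertia_, 2))
```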

Page 25: K-means Clustering with Scikit-Learn

K-means revised

• Set n_clusters
  • 7, 8
• Set init
  • k-means++, random

Page 26: K-means Clustering with Scikit-Learn

K-means revised

n_clusters = 8, init = k-means++
n_clusters = 8, init = random

Page 27: K-means Clustering with Scikit-Learn

K-means revised

n_clusters = 7, init = k-means++
n_clusters = 7, init = random

Page 28: K-means Clustering with Scikit-Learn

Comparing results: silhouette score

• Silhouette coefficient
  • No ground truth
  • Mean distance between an observation and all other points in its cluster
  • Mean distance between an observation and all other points in the next nearest cluster
• Silhouette score in scikit-learn
  • Mean of silhouette coefficient for all of the observations
  • Closer to 1, the better the fit
  • Large dataset == long time
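
Writing a for the first mean distance and b for the second, the silhouette coefficient of an observation is (b - a) / max(a, b), which lies between -1 and 1. A minimal sketch of computing the score with scikit-learn follows. The synthetic data is an assumption, and `sample_size` is shown as one way to cut the cost on a large dataset by scoring a random subsample.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption; not the talk's dataset).
X, _ = make_blobs(n_samples=2000, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Full score: mean silhouette coefficient over every observation.
full = silhouette_score(X, labels, metric="euclidean")

# Subsampled score: same measure on 500 randomly chosen observations,
# which avoids computing pairwise distances over the whole dataset.
sampled = silhouette_score(
    X, labels, metric="euclidean", sample_size=500, random_state=0
)

print(round(float(full), 4), round(float(sampled), 4))
```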

Page 29: K-means Clustering with Scikit-Learn

Comparing results: silhouette score

• n_clusters=8, init=k-means++
  • 0.8117
• n_clusters=8, init=random
  • 0.6511
• n_clusters=7, init=k-means++
  • 0.7719
• n_clusters=7, init=random
  • 0.7037

Page 30: K-means Clustering with Scikit-Learn

What does this tell us?

• Patterns exist
• Groups of similar observations exist
• Sometimes, the defaults work
• We need more exploration!

Page 31: K-means Clustering with Scikit-Learn

A few tips

• Clustering is a good way to explore your data
• Intuition fails in high dimensions
  • Use dimensionality reduction
• Combine with other models
• Know your data
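
One way to act on the dimensionality reduction tip is to run PCA before k-means. The talk does not name a technique, so PCA, the built-in digits dataset, and the choice of 10 components and 10 clusters are all assumptions made for this sketch.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_digits().data   # 64 features per observation

pipeline = make_pipeline(
    StandardScaler(),                        # put features on a comparable scale
    PCA(n_components=10, random_state=0),    # keep 10 principal components (arbitrary choice)
    KMeans(n_clusters=10, n_init=10, random_state=0),
)

# Scale, project, then cluster in the reduced space.
labels = pipeline.fit_predict(X)
print(labels[:20])
```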

Page 32: K-means Clustering with Scikit-Learn

Materials and resources

• Scikit-learn documentation
  • scikit-learn.org/stable/documentation.html
• Datasets
  • http://archive.ics.uci.edu/ml/datasets.html
  • Mldata.org
• Blogs
  • http://datasciencelab.wordpress.com/

Page 33: K-means Clustering with Scikit-Learn

Contact me!

• Twitter: @sarah_guido
• www.linkedin.com/in/sarahguido/
• https://github.com/sarguido