Machine Learning - Matt Moloney

description

Introduction to machine learning covering k-means clustering and support vector machines.

Transcript of Machine Learning - Matt Moloney

Page 1: Machine Learning - Matt Moloney

Social @tsunamiide tsunami.io Earthquake Enterprises

Machine Learning: K-Means Clustering

Page 2: Machine Learning - Matt Moloney

Agenda

Two parts

Simple Clustering Algorithm

Using ML with Large Datasets

Page 3: Machine Learning - Matt Moloney

K-Means Clustering

Very elegant

Scales to large datasets

It is simple and easy to learn

Works with unsupervised data

Page 4: Machine Learning - Matt Moloney

Clustering Applications

Competitive Analysis: compare products from Company A with Company B by clustering them into groups

Semi-Structured Search Engine: show different results to different users depending on how they are classified

What Google thinks about you: https://www.google.com/settings/ads/onweb/

Page 5: Machine Learning - Matt Moloney

Iris Dataset

Multivariate data set (i.e. each row is a float[])

Each row is labeled with its class (species)

Not linearly separable (one species separates cleanly; the other two overlap)

Popular for testing ML Algorithms

Page 6: Machine Learning - Matt Moloney

N-dimensional space

Iris data in n(n-1)/2 pairwise charts (6 charts for the 4 Iris features)
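To make the chart count concrete, the sketch below enumerates every pair of features that a 2-D scatter plot could show. The feature names are the four standard Iris measurements; the variable names are only illustrative.

```python
from itertools import combinations

# The four measurements recorded for each flower in the Iris dataset.
features = ["sepal length", "sepal width", "petal length", "petal width"]

# Each 2-D scatter plot shows one unordered pair of features.
pairs = list(combinations(features, 2))

for x_axis, y_axis in pairs:
    print(f"{x_axis} vs {y_axis}")

print(len(pairs))  # 6 charts for 4 features, i.e. n * (n - 1) / 2
```

The count grows quadratically with the number of features, which is one reason charting stops being practical in high-dimensional spaces.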

Page 7: Machine Learning - Matt Moloney

Infinite dimensional space

E.g. Classifying text documents

Charting no longer makes sense

Need to rely on derived metrics

Page 8: Machine Learning - Matt Moloney

Distance Functions

Euclidean

Manhattan Distance

Angle between vectors

Correlation
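The four distance functions listed above can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the talk: the function names are hypothetical, "correlation" is assumed to mean 1 minus the Pearson correlation, and the inputs are assumed to be equal-length, non-constant, non-zero rows (otherwise the last two functions divide by zero).

```python
import math

def euclidean(a, b):
    """Straight-line distance between two rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute per-feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def angle_between(a, b):
    """Angle in radians between the two rows seen as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cosine)))

def correlation_distance(a, b):
    """1 - Pearson correlation: near 0 when rows rise and fall together."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    cov = sum(x * y for x, y in zip(centered_a, centered_b))
    spread = math.sqrt(
        sum(x * x for x in centered_a) * sum(y * y for y in centered_b)
    )
    return 1.0 - cov / spread

# Two rows that point in the same direction but differ in magnitude.
p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]
for name, fn in [("euclidean", euclidean), ("manhattan", manhattan),
                 ("angle", angle_between), ("correlation", correlation_distance)]:
    print(name, fn(p, q))
```

Note how the first two functions see `p` and `q` as far apart while the angle and correlation measures treat them as nearly identical, so the choice of function changes which items end up clustered together.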

Page 9: Machine Learning - Matt Moloney

Feature Scaling

Many ML algorithms rely on features being in the range [-1,1] or [0,1]

K-means will work with any range, but for many distance functions features with larger ranges will crowd out those with smaller ranges

We can use this to emphasize some factors over others
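As a rough sketch of both ideas, the code below rescales each column to [0, 1] with min-max scaling and then multiplies columns by weights to emphasize some features. The function names, the sample rows, and the weights are all hypothetical; min-max is only one of several common scaling schemes.

```python
def min_max_scale(rows):
    """Rescale every column of `rows` (a list of equal-length lists) to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread, so map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled

def apply_weights(rows, weights):
    """Multiply each column by a weight; larger weights count for more in a distance."""
    return [[value * w for value, w in zip(row, weights)] for row in rows]

# Hypothetical data: column 0 has a small range, column 1 a large one.
rows = [[1.0, 200.0], [2.0, 400.0], [3.0, 1000.0]]

scaled = min_max_scale(rows)                    # both columns now span [0, 1]
emphasized = apply_weights(scaled, [2.0, 1.0])  # column 0 now counts double

print(scaled)
print(emphasized)
```

Without scaling, the second column would dominate a Euclidean or Manhattan distance simply because its numbers are larger.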

Page 10: Machine Learning - Matt Moloney

K-Means Algorithm

select the number of clusters (K)
select a seed (centroid) for each cluster
do {
    assign each item in the training set to the closest centroid
    update each centroid to the mean of the assigned items
} while (any of the centroids have moved)
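The loop above can be sketched in Python as follows. This is a minimal illustration rather than the implementation shown in the talk: the function names are hypothetical, Euclidean distance is assumed as the default, an empty cluster simply keeps its old centroid, and `max_iterations` is an added safety cap that the slide's pseudocode does not have.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    """Column-wise mean of a non-empty list of rows."""
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(items, seeds, distance=euclidean, max_iterations=100):
    """Return (centroids, assignments); K is implied by the number of seeds."""
    centroids = [list(seed) for seed in seeds]
    assignments = []
    for _ in range(max_iterations):
        # Step 1: assign each item to its closest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: distance(item, centroids[c]))
            for item in items
        ]
        # Step 2: move each centroid to the mean of its assigned items.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [item for item, a in zip(items, assignments) if a == c]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else old)
        # Stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments

# Hypothetical 2-D data with two obvious groups.
items = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
centroids, assignments = k_means(items, seeds=[items[0], items[2]])
print(centroids)
print(assignments)
```

Passing the distance function as a parameter makes it easy to experiment with the alternatives without touching the loop itself.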

Page 11: Machine Learning - Matt Moloney

K-Means for the Iris Dataset

Number of clusters is known (3)

Pick seed by randomly selecting 3 rows from dataset

We intentionally pick 3 close together for demonstration
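Random seeding can be sketched as below. The helper name `pick_seeds` is hypothetical, and the six rows are only a small illustrative sample in the Iris layout (sepal length, sepal width, petal length, petal width), not the full dataset. A fixed random seed is used so that runs are repeatable.

```python
import random

def pick_seeds(rows, k, rng):
    """Return k distinct rows chosen uniformly at random as initial centroids."""
    return [list(row) for row in rng.sample(rows, k)]

# Illustrative rows in the Iris layout (four measurements per flower).
rows = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.4, 3.2, 4.5, 1.5],
    [6.3, 3.3, 6.0, 2.5],
    [5.8, 2.7, 5.1, 1.9],
]

rng = random.Random(42)  # fixed seed so the same rows are picked every run
seeds = pick_seeds(rows, 3, rng)
print(seeds)
```

Because the result depends on which rows are drawn, a poor draw (such as three rows that sit close together) can take more iterations to converge or settle on a different clustering.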

Page 12: Machine Learning - Matt Moloney

Experimentation

Number of clusters

Distance functions

Feature scaling

Datasets (e.g. the included abalone and breast cancer datasets)

Page 13: Machine Learning - Matt Moloney

Applying ML to Large Datasets

Page 14: Machine Learning - Matt Moloney

More Data Beats Better Algorithms

Faster algorithms with more data will often beat slower algorithms with less data.

Page 15: Machine Learning - Matt Moloney

The importance of speed

Some algorithms do not scale well, e.g. layered neural networks (NN) can take many days to train (not suited to tutorials)

ML algorithms need to be run repeatedly: tuning hyper-parameters, k-fold cross validation, feature discovery
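K-fold cross validation shows why run time multiplies: the model is trained once per fold, and again for every hyper-parameter setting being tried. The sketch below only generates the train/test index splits; the function name is hypothetical, and in practice the rows would usually be shuffled first.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k folds over range(n_items)."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 10 items split into 3 folds: every item is used for testing exactly once.
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

With 3 folds and, say, 5 candidate hyper-parameter values, the model is already trained 15 times, so an algorithm that takes days per run quickly becomes impractical.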

Page 16: Machine Learning - Matt Moloney

Ranking Features

Random Forest: built in, popular and effective

Leave one out: my preferred
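A leave-one-out ranking can be sketched as follows, assuming that "leave one out" here means dropping one feature at a time and measuring how much the score falls. Everything in the block is hypothetical: the function names, the toy data, and the use of training-set accuracy from a nearest-centroid classifier as a deliberately fast stand-in model. A real run would more likely score with cross validation.

```python
def nearest_centroid_accuracy(rows, labels):
    """Training-set accuracy of a nearest-centroid classifier (fast stand-in model)."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    centroids = {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in groups.items()
    }

    def predict(row):
        # Pick the label whose centroid has the smallest squared distance.
        return min(
            centroids,
            key=lambda label: sum((x - c) ** 2 for x, c in zip(row, centroids[label])),
        )

    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def rank_features_leave_one_out(rows, labels, score=nearest_centroid_accuracy):
    """Return (feature_index, score_drop) pairs, largest drop first."""
    baseline = score(rows, labels)
    drops = []
    for f in range(len(rows[0])):
        # Remove column f from every row and re-score.
        reduced = [row[:f] + row[f + 1:] for row in rows]
        drops.append((f, baseline - score(reduced, labels)))
    return sorted(drops, key=lambda pair: pair[1], reverse=True)

# Toy data: feature 0 separates the classes, feature 1 carries no signal.
rows = [[0.0, 5.0], [1.0, 3.0], [0.5, 4.0], [10.0, 4.0], [11.0, 5.0], [10.5, 3.0]]
labels = ["a", "a", "a", "b", "b", "b"]

ranking = rank_features_leave_one_out(rows, labels)
print(ranking)
```

The model is re-scored once per feature, which is why this approach benefits from a fast algorithm.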

Page 17: Machine Learning - Matt Moloney

In practice

Use a fast algorithm for factor discovery

Use a slow algorithm for final solution

Many competitions are won by starting the slow algorithm as soon as possible