Machine Learning - Matt Moloney

Social @tsunamiide tsunami.io Earthquake Enterprises

Machine Learning: K-Means Clustering


Agenda

Two parts

Simple Clustering Algorithm

Using ML with Large Datasets


K-Means Clustering

Very elegant

Scales to large datasets

It is simple and easy to learn

Works with unlabeled data (unsupervised learning)


Clustering Applications

Competitive Analysis: compare products from Company A with Company B by clustering them into groups

Semi-Structured Search Engine: show different results to different users depending on how they are classified

What Google thinks about you: https://www.google.com/settings/ads/onweb/


Iris Dataset

Multivariate data set (i.e. each row is a float[])

Classification is labeled

Not fully linearly separable: one species separates cleanly, the other two overlap

Popular for testing ML Algorithms


N-dimensional space

Iris data in n(n-1)/2 pairwise charts (6 for the 4 Iris features)
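
As a rough illustration of how the pairwise charts for an n-dimensional dataset can be counted, the sketch below enumerates every unordered pair of features. The feature names are the commonly published Iris column names and are an assumption here, since the slide does not list them.

```python
from itertools import combinations

# The four Iris measurements (names assumed, not taken from the slides).
features = ["sepal length", "sepal width", "petal length", "petal width"]

# Each unordered pair of features gives one 2-D scatter chart.
pairs = list(combinations(features, 2))

for x_axis, y_axis in pairs:
    print(f"{x_axis} vs {y_axis}")

# n features give n * (n - 1) / 2 charts, which is 6 for n = 4.
n = len(features)
chart_count = n * (n - 1) // 2
```

The count grows quadratically, which is one reason charting stops being practical as dimensions are added.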


Infinite dimensional space

E.g. Classifying text documents

Charting no longer makes sense

Need to rely on derived metrics
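
One possible derived metric for text documents is the cosine similarity between word-count vectors. The sketch below is only an illustration under that assumption: the helper names `word_counts` and `cosine_similarity` and the sample sentences are hypothetical, and splitting on whitespace is a deliberately crude tokenizer.

```python
import math
from collections import Counter

def word_counts(text):
    """Sparse vector: only the words that actually occur are stored."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical documents, each reduced to a bag of words.
doc1 = word_counts("clustering groups similar items")
doc2 = word_counts("clustering groups similar documents")
doc3 = word_counts("earthquakes cause tsunamis")

close = cosine_similarity(doc1, doc2)  # three of four words shared
far = cosine_similarity(doc1, doc3)    # no words shared
```

Storing only the words that occur keeps each vector small even though the vocabulary, and therefore the number of dimensions, is unbounded.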


Distance Functions

Euclidean

Manhattan Distance

Angle between

Correlation
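
A minimal sketch of the four distance functions listed above, written in plain Python. The function names are illustrative, and "Correlation" is interpreted here as one minus the Pearson correlation, which is an assumption about what the slide intends.

```python
import math

def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def angle_between(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.acos(max(-1.0, min(1.0, cosine)))

def correlation_distance(a, b):
    """1 - Pearson correlation; assumes neither vector is constant."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    covariance = sum(x * y for x, y in zip(centered_a, centered_b))
    var_a = sum(x * x for x in centered_a)
    var_b = sum(y * y for y in centered_b)
    return 1.0 - covariance / math.sqrt(var_a * var_b)
```

Euclidean and Manhattan distance depend on magnitude, while the angle and correlation measures depend only on direction or shape, so the choice changes which items count as "close".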


Feature Scaling

Many ML algorithms rely on features being in the range [-1, 1] or [0, 1]

K-means will work with any range, but for many distance functions features with larger ranges will crowd out those with smaller ranges

We can use this to emphasize some factors over others
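
To illustrate the points above, here is a small sketch of min-max scaling to [0, 1] followed by optional per-feature weights. The rows, the column meanings, and the helper names `min_max_scale` and `apply_weights` are hypothetical.

```python
def min_max_scale(rows):
    """Rescale every column of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]

def apply_weights(rows, weights):
    """Stretch each column by a weight to emphasize it in distance calculations."""
    return [[value * w for value, w in zip(row, weights)] for row in rows]

# Hypothetical rows: [income in dollars, age in years].
# Unscaled, the income column dominates any Euclidean distance.
rows = [[30000.0, 25.0], [31000.0, 60.0], [90000.0, 26.0]]

scaled = min_max_scale(rows)

# Make age count twice as much as income (an arbitrary choice for illustration).
weighted = apply_weights(scaled, [1.0, 2.0])
```

After scaling, both columns span the same range, so neither dominates unless a weight deliberately makes it do so.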


K-Means Algorithm

select the number of clusters (K)
select a seed for each cluster (centroid)
do {
    assign each item in the training set to the closest centroid
    update each centroid to the mean of the assigned items
} while (any of the centroids have moved)
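
The loop above can be sketched in plain Python as follows. This is a minimal illustration rather than the presenter's implementation: it assumes Euclidean distance, takes the seeds as an argument, keeps the old centroid when a cluster ends up empty, and adds an iteration cap as a safety net. The toy data is hypothetical.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(items, seeds, max_iterations=100):
    centroids = [list(seed) for seed in seeds]
    assignments = []
    for _ in range(max_iterations):
        # Assign each item to the closest centroid.
        assignments = [closest(item, centroids) for item in items]
        # Update each centroid to the mean of its assigned items.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [item for item, a in zip(items, assignments) if a == k]
            # An empty cluster keeps its old centroid (one possible convention).
            new_centroids.append(mean(members) if members else old)
        # Stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments

# Hypothetical 2-D data with two obvious groups; seeds start close together.
items = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [9.0, 10.0]]
centroids, assignments = k_means(items, seeds=[[1.0, 1.0], [1.0, 2.0]])
```

Even with both seeds inside the same group, the centroids drift apart over a few iterations and settle on the two groups.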


K-Means for the Iris Dataset

Number of clusters is known (3)

Pick seed by randomly selecting 3 rows from dataset

We intentionally pick 3 close together for demonstration
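
Seed selection by random sampling might look like the sketch below. Only six well-known Iris rows are listed to keep it short (the full dataset has 150), and the helper name `pick_seeds` and the fixed random seed are assumptions made for repeatability.

```python
import random

# Six rows from the Iris dataset:
# sepal length, sepal width, petal length, petal width (cm).
rows = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.4, 3.2, 4.5, 1.5],
    [6.3, 3.3, 6.0, 2.5],
    [5.8, 2.7, 5.1, 1.9],
]

def pick_seeds(rows, k, rng):
    """Choose k distinct rows at random to act as the initial centroids."""
    return [list(row) for row in rng.sample(rows, k)]

rng = random.Random(42)  # fixed seed so the run is repeatable
seeds = pick_seeds(rows, 3, rng)
```

To reproduce the demonstration case of seeds that start close together, three neighbouring rows can be passed in directly instead of sampling.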


Experimentation

Number of clusters

Distance functions

Feature scaling

Datasets

E.g. the included abalone and breast cancer datasets


Applying ML to Large Datasets


More Data Beats Better Algorithms

Faster algorithms with more data will often beat slower algorithms with less data.


The importance of speed

Some algorithms do not scale well

E.g. layered NN can take many days (not suited to tutorials)

ML algorithms need to be run repeatedly

Tuning hyper-parameters

K-fold cross validation

Feature discovery
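
K-fold cross validation shows why speed matters: the model is trained k times, once per fold. Below is a minimal sketch of the index splitting only. The function name `k_fold_indices` is hypothetical, and a real run would normally shuffle the items first.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 10 items in 3 folds: every item is held out for testing exactly once.
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), "train /", len(test), "test")
```

Combining this with a grid of hyper-parameter values multiplies the number of training runs again, which is where a fast algorithm pays off.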


Ranking Features

Random Forest: built in, popular and effective

Leave one out: my preferred approach
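
The slides do not spell out the leave-one-out procedure, so the sketch below assumes it means dropping one feature at a time and measuring how much the score falls. The nearest-class-mean scorer, the training-set accuracy, the toy data, and the function names are all hypothetical simplifications.

```python
import math

def accuracy(rows, labels, columns):
    """Training accuracy of a nearest-class-mean classifier using only `columns`."""
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append([row[c] for c in columns])
    means = {label: [sum(col) / len(points) for col in zip(*points)]
             for label, points in by_label.items()}
    correct = 0
    for row, label in zip(rows, labels):
        point = [row[c] for c in columns]
        guess = min(means, key=lambda name: math.dist(point, means[name]))
        correct += guess == label
    return correct / len(rows)

def rank_features(rows, labels):
    """Rank feature indices by the score drop when each one is left out."""
    all_columns = list(range(len(rows[0])))
    baseline = accuracy(rows, labels, all_columns)
    drops = {}
    for column in all_columns:
        kept = [c for c in all_columns if c != column]
        drops[column] = baseline - accuracy(rows, labels, kept)
    return sorted(drops, key=drops.get, reverse=True)

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
rows = [[0.0, 0.5], [0.1, 0.4], [1.0, 0.5], [0.9, 0.4]]
labels = ["a", "a", "b", "b"]
ranking = rank_features(rows, labels)
```

Each feature costs one extra training run, so this approach is only practical when the underlying algorithm is fast.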


In practice

Use a fast algorithm for factor discovery

Use a slow algorithm for final solution

Many competitions are won by starting the slow algorithm as soon as possible
