Machine Learning - Matt Moloney
Social @tsunamiide tsunami.io Earthquake Enterprises
Machine Learning: K-Means Clustering
Agenda
Two parts
Simple Clustering Algorithm
Using ML with Large Datasets
K-Means Clustering
Very elegant
Scales to large datasets
It is simple and easy to learn
Works with unlabeled data (unsupervised learning)
Clustering Applications
Competitive Analysis: Compare products from Company A with Company B by clustering them into groups
Semi-Structured Search Engine: Show different results to different users depending on how they are classified
▪ What Google thinks about you: https://www.google.com/settings/ads/onweb/
Iris Dataset
Multivariate data set (i.e. each row is a float[])
Each row is labeled with its class (species)
Not fully linearly separable: one class separates cleanly, the other two overlap
Popular for testing ML Algorithms
N-dimensional space
Iris data in n(n-1)/2 pairwise charts (6 charts for the 4 Iris features)
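A small sketch of how the chart count arises, assuming one scatter chart per unordered pair of features. The function name `chart_pairs` is made up for this illustration; the four feature names are the standard Iris measurements.

```python
from itertools import combinations

# The four measurements recorded for each flower in the Iris dataset.
FEATURES = ["sepal length", "sepal width", "petal length", "petal width"]

def chart_pairs(features):
    """Return every unordered pair of features, one pair per 2-D chart."""
    return list(combinations(features, 2))

for x_axis, y_axis in chart_pairs(FEATURES):
    print(f"{x_axis} vs {y_axis}")
```

With 4 features this prints 6 pairs. The count grows quadratically with the number of features, which is one reason charting stops being practical in high-dimensional spaces.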
Infinite dimensional space
E.g. Classifying text documents
Charting no longer makes sense
Need to rely on derived metrics
Distance Functions
Euclidean
Manhattan Distance
Angle between vectors
Correlation
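The four distance functions listed above can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the talk: the function names are chosen here, rows are assumed to be equal-length sequences of floats, and the angle and correlation versions assume non-zero, non-constant vectors (otherwise they divide by zero).

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def angle_between(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)

def correlation_distance(a, b):
    """1 minus the Pearson correlation of the two vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    dot = sum(x * y for x, y in zip(centered_a, centered_b))
    norm_a = math.sqrt(sum(x * x for x in centered_a))
    norm_b = math.sqrt(sum(y * y for y in centered_b))
    return 1.0 - dot / (norm_a * norm_b)
```

Euclidean and Manhattan distance depend on the magnitude of each feature, while the angle and correlation measures only compare direction or shape, so they respond differently to feature scaling.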
Feature Scaling
Many ML algorithms rely on features being in the range [-1,1] or [0,1]
K-means will work with any range, but for many distance functions features with larger ranges will crowd out those with smaller ranges
We can use this to emphasize some factors over others
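One common way to scale features is min-max scaling to [0,1]. The sketch below is an assumption about how this could be done, not the talk's code. The `weights` parameter is a hypothetical addition that stretches chosen columns after scaling, to show how a wider range gives a feature more influence on the distance.

```python
def min_max_scale(rows, weights=None):
    """Scale each column to [0, 1], then multiply it by its weight."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    if weights is None:
        weights = [1.0] * len(columns)
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread, so map it to 0.0.
            weight * ((value - low) / (high - low) if high > low else 0.0)
            for value, low, high, weight in zip(row, lows, highs, weights)
        ])
    return scaled

# Made-up rows: the second column has a much larger range than the first.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
print(min_max_scale(rows))
print(min_max_scale(rows, weights=[1.0, 2.0]))
```

Without scaling, the second column would dominate a Euclidean distance. After scaling both columns contribute equally, and a weight of 2.0 deliberately emphasizes the second column again.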
K-Means Algorithm
select the number of clusters (K)
select a seed for each cluster (centroid)
do {
    assign each item in the training set to the closest centroid
    update each centroid to the mean of the assigned items
} while (any of the centroids have moved)
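The loop above can be sketched in plain Python as follows. This is an illustrative version under a few assumptions that are not on the slide: Euclidean distance is the default, an empty cluster keeps its previous centroid, and `max_iterations` is a safety cap added here.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(items, seeds, distance=euclidean, max_iterations=100):
    """Return (centroids, assignments) for the given items and seed centroids."""
    centroids = [list(seed) for seed in seeds]
    assignments = []
    for _ in range(max_iterations):
        # Assign each item to the index of its closest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: distance(item, centroids[k]))
            for item in items
        ]
        # Move each centroid to the mean of the items assigned to it.
        new_centroids = []
        for k, old_centroid in enumerate(centroids):
            members = [item for item, a in zip(items, assignments) if a == k]
            if members:
                new_centroids.append(
                    [sum(column) / len(members) for column in zip(*members)]
                )
            else:
                new_centroids.append(old_centroid)  # assumption: keep empty clusters in place
        # Stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments

# Made-up points forming two obvious groups.
items = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, assignments = k_means(items, seeds=[items[0], items[2]])
print(centroids, assignments)
```

The `distance` parameter makes it easy to swap in another distance function when experimenting.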
K-Means for the Iris Dataset
Number of clusters is known (3)
Pick seeds by randomly selecting 3 rows from the dataset
We intentionally pick 3 rows close together for demonstration
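Picking seeds by sampling rows can be sketched as below. The helper name `pick_seeds` and its `rng_seed` parameter are assumptions made for this illustration, and the rows are synthetic placeholders rather than real Iris measurements.

```python
import random

def pick_seeds(rows, k, rng_seed=None):
    """Pick k distinct rows at random to use as initial centroids."""
    rng = random.Random(rng_seed)  # a fixed rng_seed makes the choice repeatable
    return [list(row) for row in rng.sample(rows, k)]

# Synthetic placeholder rows, not real Iris data.
rows = [[float(i), float(2 * i)] for i in range(10)]
print(pick_seeds(rows, k=3, rng_seed=42))
```

Because the result of K-means depends on the seeds, fixing `rng_seed` is useful when a demonstration needs to be reproducible.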
Experimentation
Number of clusters
Distance functions
Feature scaling
Datasets, e.g. the included abalone and breast cancer datasets
Applying ML to Large Datasets
More Data beats Better Algorithms
Faster algorithms with more data will often beat slower algorithms with less data.
The importance of speed
Some algorithms do not scale well, e.g. layered NN can take many days (not suited to tutorials)
ML algorithms need to be run repeatedly:
▪ Tuning hyper-parameters
▪ K-fold cross validation
▪ Feature discovery
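To show why speed matters, here is a minimal sketch of K-fold cross validation: the model has to be trained K times, once per fold. The generator name `k_fold_indices` is made up for this example, and it assumes the rows have already been shuffled.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists, using each fold as the test set once."""
    indices = list(range(n_items))
    # Deal the indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(n_items=10, k=5):
    # A real experiment would train on `train` and score on `test` here.
    print(len(train), "training rows,", len(test), "test rows")
```

Combining this with a search over hyper-parameters multiplies the number of training runs, so a slow algorithm quickly becomes impractical.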
Ranking Features
Random Forest: built in, popular and effective
Leave one out: my preferred
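A rough sketch of leave-one-out ranking, assuming it means dropping one feature at a time and measuring how much the model's score falls. The function name and the `score` callback are hypothetical, and the toy score below only stands in for training and evaluating a real model.

```python
def rank_features_leave_one_out(n_features, score):
    """Rank feature indices by how much the score drops when each is removed."""
    all_features = list(range(n_features))
    baseline = score(all_features)
    drops = []
    for feature in all_features:
        remaining = [f for f in all_features if f != feature]
        drops.append((baseline - score(remaining), feature))
    # The largest drop marks the most important feature.
    drops.sort(key=lambda pair: pair[0], reverse=True)
    return [feature for _, feature in drops]

# Toy stand-in for a model: each feature adds a fixed amount to the score.
TOY_CONTRIBUTION = [0.1, 0.5, 0.3]

def toy_score(feature_indices):
    return sum(TOY_CONTRIBUTION[f] for f in feature_indices)

print(rank_features_leave_one_out(3, toy_score))
```

Each feature requires one extra run of the scoring algorithm, which again favors a fast algorithm for this stage.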
In practice
Use a fast algorithm for factor discovery
Use a slow algorithm for final solution
Many competitions are won by starting the slow algorithm as soon as possible