Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

34
Machine Learning on Spark Shivaram Venkataraman UC Berkeley

Transcript of Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

Page 1: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

Machine Learning on Spark

Shivaram VenkataramanUC Berkeley

Page 2: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

Computer Science

Machine learning

Statistics

Page 3: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

Machine learning

Spam filters

Recommendations

Click prediction

Search ranking

Page 4: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

Machine learningtechniques

Classification

Regression

Clustering

Active learning

Collaborative filtering

Page 5: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

Implementing Machine Learning

Machine learning algorithms are

- Complex, multi-stage

- Iterative

MapReduce/Hadoop unsuitable Need efficient primitives for data sharing

Page 6: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

Spark RDDs efficient data sharing

In-memory caching accelerates performance

- Up to 20x faster than Hadoop

Easy to use high-level programming interface

- Express complex algorithms ~100 lines.

Machine Learning using Spark

Page 7: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

Machine learningtechniques

Classification

Regression

Clustering

Active learning

Collaborative filtering

Page 8: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Clustering using Spark

Focus: Implementation and Performance

Page 9: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

Clustering

Grouping data according to similarity

Distance EastD

ista

nce

Nort

h E.g. archaeological dig

Page 10: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

Clustering

Grouping data according to similarity

Distance EastD

ista

nce

Nort

h E.g. archaeological dig

Page 11: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Benefits

• Popular• Fast• Conceptually

straightforward

Distance EastD

ista

nce

Nort

h E.g. archaeological dig

Page 12: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means: preliminaries

Feature 1Fe

atu

re 2

Data: Collection of values

data = lines.map(line=> parseVector(line))

Page 13: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means: preliminaries

Feature 1Fe

atu

re 2

Dissimilarity: Squared Euclidean distance

dist = p.squaredDist(q)

Page 14: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means: preliminaries

Feature 1Fe

atu

re 2

K = Number of clusters

Data assignments to clustersS1, S2,. . ., SK

Page 15: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means: preliminaries

Feature 1Fe

atu

re 2

K = Number of clusters

Data assignments to clustersS1, S2,. . ., SK

Page 16: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers• Repeat until convergence:

Assign each data point to the cluster with the closest center.Assign each cluster center to be the mean of its cluster’s data points.

Page 17: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers• Repeat until convergence:

Assign each data point to the cluster with the closest center.Assign each cluster center to be the mean of its cluster’s data points.

Page 18: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers

• Repeat until convergence:

Assign each data point to the cluster with the closest center.Assign each cluster center to be the mean of its cluster’s data points.

centers = data.takeSample( false, K, seed)

Page 19: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers

• Repeat until convergence:

Assign each data point to the cluster with the closest center.Assign each cluster center to be the mean of its cluster’s data points.

centers = data.takeSample( false, K, seed)

Page 20: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers

• Repeat until convergence:

Assign each data point to the cluster with the closest center.Assign each cluster center to be the mean of its cluster’s data points.

centers = data.takeSample( false, K, seed)

Page 21: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers

• Repeat until convergence:

Assign each cluster center to be the mean of its cluster’s data points.

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))

Page 22: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers

• Repeat until convergence:

Assign each cluster center to be the mean of its cluster’s data points.

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))

Page 23: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers

• Repeat until convergence:

Assign each cluster center to be the mean of its cluster’s data points.

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))

Page 24: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers

• Repeat until convergence:

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))

pointsGroup = closest.groupByKey()

Page 25: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers

• Repeat until convergence:

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))

pointsGroup = closest.groupByKey()

newCenters = pointsGroup.mapValues( ps => average(ps))

Page 26: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers

• Repeat until convergence:

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))

pointsGroup = closest.groupByKey()

newCenters = pointsGroup.mapValues( ps => average(ps))

Page 27: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers

• Repeat until convergence:

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))pointsGroup = closest.groupByKey()

newCenters = pointsGroup.mapValues( ps => average(ps))

Page 28: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers

• Repeat until convergence:

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))pointsGroup = closest.groupByKey()

newCenters =pointsGroup.mapValues( ps => average(ps))

while (dist(centers, newCenters) > ɛ)

Page 29: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Algorithm

Feature 1

Featu

re 2

• Initialize K cluster centers

• Repeat until convergence:

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))pointsGroup = closest.groupByKey()

newCenters =pointsGroup.mapValues( ps => average(ps))

while (dist(centers, newCenters) > ɛ)

Page 30: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K-Means Source

Feature 1

Featu

re 2

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))pointsGroup = closest.groupByKey()

newCenters =pointsGroup.mapValues( ps => average(ps))

while (d > ɛ){

}

d = distance(centers, newCenters)

centers = newCenters.map(_)

Page 31: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

Ease of use

Interactive shell:

Useful for featurization, pre-processing data

Lines of code for K-Means

- Spark ~ 90 lines – (Part of hands-on tutorial !)

- Hadoop/Mahout ~ 4 files, > 300 lines

Page 32: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

25 50 1000

50

100

150

200

250

300274

157

106

197

121

87

143

61

33

Hadoop HadoopBinMemSpark

Number of machines

Itera

tio

n t

ime (

s)

K-Means

25 50 1000

50

100

150

200

250

184

111

76

116

80

62

15

6 3

HadoopHadoopBinMemSpark

Number of machines

Itera

tio

n t

ime (

s)

Logistic Regression

Performance

[Zaharia et. al, NSDI’12]

Page 33: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

K means clustering using Spark Hands-on exercise this afternoon !

Examples and more: www.spark-project.org

Spark: Framework for cluster computing Fast and easy machine learning programs

Conclusion

Page 34: Machine Learning on Spark Shivaram Venkataraman UC Berkeley.