Machine Learning on Spark

34
Machine Learning on Spark Shivaram Venkataraman UC Berkeley

description

Machine Learning on Spark. Shivaram Venkataraman UC Berkeley. Machine learning. Computer Science . Statistics. Spam filters. Click prediction. Machine learning. Recommendations. Search ranking. Classification. Clustering. Regression. Machine learning techniques. Active learning. - PowerPoint PPT Presentation

Transcript of Machine Learning on Spark

Page 1: Machine Learning on Spark

Machine Learning on Spark

Shivaram VenkataramanUC Berkeley

Page 2: Machine Learning on Spark

Computer Science Machine learning

Statistics

Page 3: Machine Learning on Spark

Machine learning

Spam filters

Recommendations

Click prediction

Search ranking

Page 4: Machine Learning on Spark

Machine learningtechniques

Classification

Regression

Clustering

Active learning

Collaborative filtering

Page 5: Machine Learning on Spark

Implementing Machine Learning Machine learning algorithms are

- Complex, multi-stage- Iterative

MapReduce/Hadoop unsuitable Need efficient primitives for data sharing

Page 6: Machine Learning on Spark

Spark RDDs efficient data sharing

In-memory caching accelerates performance- Up to 20x faster than Hadoop

Easy to use high-level programming interface- Express complex algorithms ~100 lines.

Machine Learning using Spark

Page 7: Machine Learning on Spark

Machine learningtechniques

Classification

Regression

Clustering

Active learning

Collaborative filtering

Page 8: Machine Learning on Spark

K-Means Clustering using Spark

Focus: Implementation and Performance

Page 9: Machine Learning on Spark

Clustering

Grouping data according to similarity

Distance EastDi

stan

ce N

orth E.g. archaeological dig

Page 10: Machine Learning on Spark

Clustering

Grouping data according to similarity

Distance EastDi

stan

ce N

orth E.g. archaeological dig

Page 11: Machine Learning on Spark

K-Means Algorithm

Benefits

• Popular• Fast• Conceptually

straightforward

Distance EastDi

stan

ce N

orth E.g. archaeological dig

Page 12: Machine Learning on Spark

K-Means: preliminaries

Feature 1Fe

atur

e 2

Data: Collection of values

data = lines.map(line=> parseVector(line))

Page 13: Machine Learning on Spark

K-Means: preliminaries

Feature 1Fe

atur

e 2

Dissimilarity: Squared Euclidean distance

dist = p.squaredDist(q)

Page 14: Machine Learning on Spark

K-Means: preliminaries

Feature 1Fe

atur

e 2

K = Number of clusters

Data assignments to clustersS1, S2,. . ., SK

Page 15: Machine Learning on Spark

K-Means: preliminaries

Feature 1Fe

atur

e 2

K = Number of clusters

Data assignments to clustersS1, S2,. . ., SK

Page 16: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers• Repeat until convergence:

Assign each data point to the cluster with the closest center.Assign each cluster center to be the mean of its cluster’s data points.

Page 17: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers• Repeat until convergence:

Assign each data point to the cluster with the closest center.Assign each cluster center to be the mean of its cluster’s data points.

Page 18: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers

• Repeat until convergence:

Assign each data point to the cluster with the closest center.Assign each cluster center to be the mean of its cluster’s data points.

centers = data.takeSample( false, K, seed)

Page 19: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers

• Repeat until convergence:

Assign each data point to the cluster with the closest center.Assign each cluster center to be the mean of its cluster’s data points.

centers = data.takeSample( false, K, seed)

Page 20: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers

• Repeat until convergence:

Assign each data point to the cluster with the closest center.Assign each cluster center to be the mean of its cluster’s data points.

centers = data.takeSample( false, K, seed)

Page 21: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers

• Repeat until convergence:

Assign each cluster center to be the mean of its cluster’s data points.

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))

Page 22: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers

• Repeat until convergence:

Assign each cluster center to be the mean of its cluster’s data points.

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))

Page 23: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers

• Repeat until convergence:

Assign each cluster center to be the mean of its cluster’s data points.

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))

Page 24: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers

• Repeat until convergence:

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))pointsGroup = closest.groupByKey()

Page 25: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers

• Repeat until convergence:

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))pointsGroup = closest.groupByKey()newCenters = pointsGroup.mapValues( ps => average(ps))

Page 26: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers

• Repeat until convergence:

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))pointsGroup = closest.groupByKey()newCenters = pointsGroup.mapValues( ps => average(ps))

Page 27: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers

• Repeat until convergence:

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))pointsGroup = closest.groupByKey()newCenters = pointsGroup.mapValues( ps => average(ps))

Page 28: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers

• Repeat until convergence:

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))pointsGroup = closest.groupByKey()newCenters =pointsGroup.mapValues( ps => average(ps))

while (dist(centers, newCenters) > ɛ)

Page 29: Machine Learning on Spark

K-Means Algorithm

Feature 1

Feat

ure

2

• Initialize K cluster centers

• Repeat until convergence:

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))pointsGroup = closest.groupByKey()newCenters =pointsGroup.mapValues( ps => average(ps))

while (dist(centers, newCenters) > ɛ)

Page 30: Machine Learning on Spark

K-Means Source

Feature 1

Feat

ure

2

centers = data.takeSample( false, K, seed)

closest = data.map(p => (closestPoint(p,centers),p))pointsGroup = closest.groupByKey()newCenters =pointsGroup.mapValues( ps => average(ps))

while (d > ɛ){

}

d = distance(centers, newCenters)

centers = newCenters.map(_)

Page 31: Machine Learning on Spark

Ease of use Interactive shell: Useful for featurization, pre-processing data

Lines of code for K-Means- Spark ~ 90 lines – (Part of hands-on tutorial !)- Hadoop/Mahout ~ 4 files, > 300 lines

Page 32: Machine Learning on Spark

25 50 1000

50

100

150

200

250

300 274

157

106

197

121

87

143

61

33

Hadoop HadoopBinMemSpark

Number of machines

Iter

atio

n ti

me

(s)

K-Means

25 50 1000

50

100

150

200

250

184

111

76

116

80

62

15 6 3

HadoopHadoopBinMemSpark

Number of machines

Iter

atio

n ti

me

(s)

Logistic Regression

Performance

[Zaharia et. al, NSDI’12]

Page 33: Machine Learning on Spark

K means clustering using Spark Hands-on exercise this afternoon !

Examples and more: www.spark-project.org

Spark: Framework for cluster computing Fast and easy machine learning programs

Conclusion

Page 34: Machine Learning on Spark