Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark...

22
Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Transcript of Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark...

Page 1: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Machine Learning in the Cloud with Spark

R212 Data Centric Systems and NetworkingOpen Source Project Study

by Haikal Pribadi

Page 2: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Machine Learning in the Cloud with Spark

Problem domain

Motivation and Contribution

Project Goal

Project Evaluation

Page 3: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Converging Trends

Page 4: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Converging Trends

Page 5: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Converging Trends

Page 6: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Converging Trends

Page 7: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Converging Trends

Page 8: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Motivation and Contribution

Page 9: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Why Scale Up to the Cloud?

Input data size– Large training data instances

● e.g DryadLINQ and MapReduce

– High input dimensionality● e.g GraphLab and GraphChi

Page 10: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Why Scale Up to the Cloud?

Complexity of data and computation– Data complexity brings algorithm comlexity

● e.g. PLANET (on top of MapReduce)

Page 11: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Why Scale Up to the Cloud?

Time constraint and parameter tuning– Distribute system usage to increase throughput– Model and hyper-parameter selection are repetitive

and independent

Page 12: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Why Scale Up to the Cloud?

Data Parallelism– MapReduce, GraphLab

Task Parallelism– Multicores, GPUs (CUDA), MPI

or perhaps Hybrid?– Spark, GraphLab, DryadLINQ

Page 13: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Problem?

With the various option of distributed architectures, implementing different machine systems become very task-specific

Different architectures brings different benefits and constraints

Page 14: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Spark + MLbase

Unified scalable machine learning

Suitable for many common Machine Learning problem

(project hypothesis)

Page 15: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Project Goal

Page 16: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Develop Mainstream ML

Evaluate Spark+MLbase on developing common Machine Learning problems

Page 17: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Develop Mainstream ML

Classification– e.g. Bayesian classifier for Spam Filtering

Clustering– e.g. k-means clustering for market segmentation

Regression– e.g. Linear regression on weather forecasting

Collaborative Filtering– Alternating Least Square for recommendation systems

Page 18: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Project Goal

Platform– Amazon EC2

Run time– Spark

Application– MLbase

Page 19: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Project Evaluation

Page 20: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Evaluation and Analysis

Parallelism– Granularity of data parallelism and task parallelism

Algorithm complexity– Complexity of customization

Programming paradigm– Learning curve, expressiveness and dataflow

Dataset distribution– Management of large dataset

Page 21: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Performance Comparison

Learn a most suitable algorithm to be come a benchmark

Compare performance with [e.g.] GraphLab– Speed (sequential runs)– Scalability (efficiency increase with parallelism)– Throughput (t ime / input size)

Page 22: Machine Learning in the Cloud with Spark - cl.cam.ac.uk · Machine Learning in the Cloud with Spark R212 Data Centric Systems and Networking Open Source Project Study by Haikal Pribadi

Thank you!Questions are very welcome

Haikal [email protected]