[Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

description

Rakuten Technology Conference 2014 "Leveraging Spark for Cluster Computing" Robin M.E. Swezey (Rakuten)

Transcript of [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Page 1: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.

All Spark images and code are from http://spark.apache.org/

Leveraging Spark for Cluster Computing

Robin M. E. Swezey

Rakuten Institute of Technology, Tokyo

Intelligence Domain Group

[email protected]

Page 2: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

What is Spark?

Page 3: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

In short, Spark is the future of open-source MapReduce

Page 4: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Why Spark?

Current Hadoop stack is heterogeneous

Spark = Fully integrated analytics suite and cluster computing framework

Berkeley AMPLab + Apache Software Foundation

Apache Hadoop, Hive, Mahout, Pig and their respective logos are trademarks of The Apache Software Foundation.

Page 5: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

On the surface, very similar to Hadoop

• Relies on HDFS

• Runs on YARN, Mesos, or standalone

• MapReduce + general cluster computing

Page 6: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

1. Resilient Distributed Dataset (RDD)

Central to Spark (R dataframe-ish)

[Diagram: RDD, RDD, RDD]

Page 7: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

1. Resilient Distributed Dataset (RDD)

Central to Spark (R dataframe-ish)

[Diagram: RDD<String>, RDD<Tuple>, RDD<Tuple>]

Page 8: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

2. Better resource utilization

Disk is slow. Memory is fast. Several levels of persistence.

Read blocks from disk

Cache aggregates in memory
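
The idea of reading blocks from disk once and then serving cached aggregates from memory can be tried without a cluster. What follows is a plain-Python sketch, not Spark code: `LazyDataset` and `CountingSource` are hypothetical names made up for this illustration. In Spark itself the corresponding calls are `rdd.cache()` and `rdd.persist(...)` with a storage level.

```python
class CountingSource:
    """Hypothetical stand-in for a slow disk read; counts how often it runs."""

    def __init__(self, blocks):
        self.blocks = blocks
        self.reads = 0

    def __call__(self):
        self.reads += 1
        return list(self.blocks)


class LazyDataset:
    """Hypothetical dataset that is recomputed on every use unless cached."""

    def __init__(self, compute):
        self._compute = compute      # how to (re)build the data
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only marks the dataset; nothing is materialized until first use.
        self._use_cache = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached      # served from memory
        data = self._compute()       # simulated expensive read
        if self._use_cache:
            self._cached = data
        return data


# Without caching, every use goes back to the source.
plain_source = CountingSource([1, 2, 3, 4])
plain = LazyDataset(plain_source)
plain_total = sum(plain.collect()) + sum(plain.collect())

# With caching, only the first use touches the source.
cached_source = CountingSource([1, 2, 3, 4])
cached = LazyDataset(cached_source).cache()
cached_total = sum(cached.collect()) + sum(cached.collect())
```

After running this, `plain_source.reads` and `cached_source.reads` show how many simulated disk reads each variant needed for the same result.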

Page 9: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

2. Better resource utilization

More cores > more machines. Resource locality.

Each node x each core / each local block

Page 10: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

3. Easier development & operations

Scala, Java, Python API

(Logistic Regression) 8 lines
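
The code shown on this slide is not part of the transcript. As a rough idea of why logistic regression fits in a few lines of map and reduce, here is a plain-Python sketch that assumes labels in {-1, +1}, a toy data set, and a step size of 1; all names (`points`, `point_gradient`, `add`) are made up for illustration, and nothing here calls Spark.

```python
import math
from functools import reduce

# Toy, linearly separable data: (features, label) with labels in {-1, +1}.
points = [
    ([1.0, 2.0], 1.0),
    ([2.0, 1.5], 1.0),
    ([-1.0, -1.5], -1.0),
    ([-2.0, -1.0], -1.0),
]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def point_gradient(w, point):
    """Gradient of the logistic loss log(1 + exp(-y * w.x)) for one point."""
    x, y = point
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


w = [0.0, 0.0]
for _ in range(20):
    # "map" computes one gradient per point, "reduce" sums them.
    gradient = reduce(add, map(lambda p: point_gradient(w, p), points))
    w = [wi - gi for wi, gi in zip(w, gradient)]

# Classify the training points with the learned weights.
predictions = [1.0 if dot(w, x) > 0 else -1.0 for x, _ in points]
```

On a cluster, the per-point gradient would run in parallel over the partitions of a cached data set, and only the summed gradient would travel back to the driver.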

Page 11: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

3. Easier Analytics

Interactive shells in Scala, Python

Easy to connect with SparkContext (e.g. IPython Notebook)

Page 12: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

4. Integrated Solution

• Easy MapReduce

• DBMS-like Functionality

• Streaming

• Machine Learning

Page 13: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Applications

Easy MapReduce

Thanks to RDD Distributed Operators

• map()

• reduce()

• reduceByKey()

• groupBy()

• sample()

• pipe()

• foreach()

• fold()

• histogram()
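
To make the operator names above concrete, here is a plain-Python sketch of a word count that mimics the semantics of `map()`, `reduceByKey()` and `reduce()` on an in-memory list. It is not Spark code: `reduce_by_key` is a hypothetical helper written for this example, and the input lines are invented. In PySpark the same chain would typically be written with `flatMap`, `map` and `reduceByKey` on an RDD.

```python
from functools import reduce


def reduce_by_key(pairs, func):
    """Merge the values of each key with func, like a keyed reduce."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


lines = ["spark is fast", "hadoop is big", "spark is integrated"]

# map-style steps: lines -> words -> (word, 1) pairs
words = [word for line in lines for word in line.split()]
pairs = list(map(lambda word: (word, 1), words))

# reduceByKey-style step: sum the counts per word
counts = reduce_by_key(pairs, lambda a, b: a + b)

# reduce-style step: total number of words
total = reduce(lambda a, b: a + b, counts.values())
```

The shape of the data follows the earlier RDD slide: strings first, then tuples, then tuples again with merged values.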

Page 14: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Applications

Sped-up Analytics with DBMS-like SQL Functionality

Cf.

• Integrated

• Unified Data Access

• Hive Compatible

• Standard Connectivity

Page 15: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Applications

Streaming

Cf.

Page 16: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Applications

Machine Learning

Cf.

• Statistics

• Classification / Regression

• Collaborative Filtering

• Clustering

• Dimensionality Reduction

• Feature Extraction

Image: http://en.wikipedia.org/wiki/Machine_learning#mediaviewer/File:Linear-svm-scatterplot.svg

Page 17: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Applications

Graph Processing

Flexible

Fast

• PageRank

• Connected components

• Label propagation

• SVD++

• Triangle count
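
As an illustration of the first algorithm named on this slide, here is a plain-Python sketch of iterative PageRank on a tiny graph. It does not use Spark's graph library: the `pagerank` function, the damping factor of 0.85, the iteration count and the four-node graph are all assumptions chosen for the example, and the sketch assumes every node has at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank on a dict {node: [outgoing neighbours]}.

    Assumes every node has at least one outgoing link.
    """
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Each node splits its current rank evenly among its targets.
        contributions = {node: 0.0 for node in nodes}
        for node, targets in links.items():
            share = ranks[node] / len(targets)
            for target in targets:
                contributions[target] += share
        # Mix the received contributions with a uniform jump.
        ranks = {
            node: (1.0 - damping) / n + damping * contributions[node]
            for node in nodes
        }
    return ranks


# Hypothetical graph: "c" is linked to by every other node.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
```

The inner loop is a per-edge "send a share to each neighbour" step followed by a per-node sum, which is the same map-then-aggregate pattern as the other examples in this deck.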

Page 18: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

More

How does it scale?

There are deployed clusters of 1,000+ nodes

Page 19: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

More

There’s open-source, and there’s highly supported open-source

Spark 1.1.0 had 171 contributors!

Page 20: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

In Conclusion

Hadoop Cluster Computing Hype Cycle

Image: http://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Gartner_Hype_Cycle.svg/2000px-Gartner_Hype_Cycle.svg.png

Page 21: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Thank you!

http://spark.apache.org