Apache Spark

Transcript of Apache Spark

Page 1: Apache Spark

Resilient Distributed Dataset: A Fault-Tolerant Abstraction For In-Memory Cluster Computing

Mahdi Esmail oghli

Dr. Bagheri

Amirkabir University of Technology

SPARK

http://BigData.ceit.aut.ac.ir

Page 2: Apache Spark

A BigData processing framework.

Page 3: Apache Spark

“What is BigData?”


Page 4: Apache Spark

Dealing With BigData


Sampling

Hashing

Approximation methods

Map-Reduce Model

Page 5: Apache Spark

Map-Reduce Model


A BigData Programming Model

Potential for parallelism

Can be executed on a cluster
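The model above can be sketched in plain Python (an illustrative single-machine word count, not Spark's or Hadoop's API): map emits key-value pairs, a shuffle groups them by key, and reduce aggregates each group independently, which is what makes the model parallelizable.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently (parallelizable).
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big cluster"])))
# counts == {"big": 2, "data": 1, "cluster": 1}
```

On a cluster, each map and reduce call would run as a task on a different machine; here the phases simply run in sequence.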

Page 6: Apache Spark

Map-Reduce Model

[Diagram: Input flows through parallel Map tasks, then Reduce tasks, to Output.]

Page 7: Apache Spark

Problems with current computing frameworks, especially Map-Reduce

They provide abstractions for accessing a cluster's computational resources,

but lack an abstraction for distributed memory.


Page 8: Apache Spark

Problems with current computing frameworks, especially Map-Reduce

This makes them inefficient for applications that reuse intermediate results across multiple computations.


Page 9: Apache Spark

SPARK Motivation

Problems with current computing frameworks (e.g., Map-Reduce) for:

Iterative algorithms

Interactive data mining tools


Page 10: Apache Spark

Data reuse examples

Iterative machine learning and graph algorithms:

PageRank

K-means clustering

Logistic regression


Page 11: Apache Spark

Data reuse examples

Interactive data mining (runs multiple queries on the same subset of data):

Statistical queries

Fraud detection

Stream queries


Page 12: Apache Spark

Current Solution

The only way to reuse data between computations with current frameworks:

Write it to an external stable storage system.

Page 13: Apache Spark

Map-Reduce Model

[Diagram: Input → Map tasks → Reduce tasks → Output, with intermediate results written to stable storage between stages.]

Page 14: Apache Spark

Systems developed for reusing intermediate data

Pregel: iterative graph computation

HaLoop: iterative Map-Reduce interface


Page 15: Apache Spark

Systems developed for reusing intermediate data

These support only specific computation patterns.

We need an abstraction for more general reuse.


Page 16: Apache Spark

RDD: Resilient Distributed Dataset


Page 17: Apache Spark

RDD

A read-only, partitioned collection of records.

Can be created only from data in stable storage or from other RDDs (via transformations).

Users can control persistence and partitioning.
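The two core properties above can be illustrated with a toy collection in plain Python (a hypothetical `ToyRDD` class for illustration only, not Spark's implementation): the data is split into partitions, and transformations build new collections instead of mutating the old one.

```python
class ToyRDD:
    # A read-only collection of records split into fixed partitions.
    def __init__(self, partitions):
        self.partitions = [tuple(p) for p in partitions]  # immutable partitions

    @staticmethod
    def from_list(records, num_partitions):
        # Partitioning: deal records round-robin into num_partitions slices.
        parts = [records[i::num_partitions] for i in range(num_partitions)]
        return ToyRDD(parts)

    def map(self, f):
        # A transformation returns a *new* ToyRDD; the source stays unchanged.
        return ToyRDD([[f(x) for x in p] for p in self.partitions])

    def collect(self):
        # Flatten all partitions back into one list.
        return [x for p in self.partitions for x in p]

rdd = ToyRDD.from_list([1, 2, 3, 4], num_partitions=2)
doubled = rdd.map(lambda x: x * 2)
# rdd.collect() is still [1, 3, 2, 4]; doubled.collect() is [2, 6, 4, 8]
```

In Spark each partition would live on a different worker and `map` would run on all of them in parallel; the read-only property is what lets lineage-based recovery work.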


Page 18: Apache Spark

RDD

Efficient data reuse

A parallel data structure

Lets users explicitly persist results

In-memory computation

Scales to large clusters

Fault-tolerant


Page 19: Apache Spark

Current fault-tolerance approaches

Replicating data across machines

Logging updates across machines


Page 20: Apache Spark

Two main interfaces for RDDs

[Diagram: an RDD exposes Transformations and Actions.]

Page 21: Apache Spark

Transformations

e.g., map, filter, and join

The interface used for fault tolerance in RDDs

Page 22: Apache Spark

Actions

Spark computes RDDs lazily (this helps pipelining).

Actions return a value.

e.g., count, collect, save
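Lazy evaluation can be illustrated with plain Python generators (an analogy, not Spark internals): transformations only describe work, and records stream through the whole pipeline only when an action finally consumes it.

```python
log = ["ERROR disk", "INFO ok", "ERROR net"]

trace = []  # records when each line is actually read

def tag(lines):
    # A lazy "transformation": nothing runs until something consumes it.
    for line in lines:
        trace.append("read " + line.split()[1])
        yield line

# Building the pipeline executes nothing yet.
errors = (l for l in tag(log) if l.startswith("ERROR"))
assert trace == []

# The "action" forces the pipeline; lines stream through one at a time.
count = sum(1 for _ in errors)
assert count == 2
assert trace == ["read disk", "read ok", "read net"]
```

Because each record flows through the whole chain before the next one is read, the stages are pipelined rather than materialized one stage at a time.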


Page 23: Apache Spark

RDDs can express other models

Map-Reduce

SQL

Pregel

HaLoop


Page 24: Apache Spark


rdd1.join(rdd2).groupby(…).filter(…)

[Diagram: the query builds an operator graph (join, then groupby, then filter), which the task scheduler turns into tasks executed by workers.]
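What a chain like this computes can be mimicked eagerly in plain Python on hypothetical tiny datasets (no scheduler or cluster involved, and the grouping/filtering criteria below are invented for illustration):

```python
from collections import defaultdict

rdd1 = [("a", 1), ("b", 2)]    # (key, value) pairs
rdd2 = [("a", 10), ("a", 20)]

# join: pair up the values of rdd1 and rdd2 that share a key
joined = [(k, (v1, v2)) for k, v1 in rdd1 for k2, v2 in rdd2 if k == k2]

# groupby: collect all joined pairs under their key
groups = defaultdict(list)
for k, v in joined:
    groups[k].append(v)

# filter: keep only keys with more than one joined pair
result = {k: vs for k, vs in groups.items() if len(vs) > 1}
# result == {"a": [(1, 10), (1, 20)]}
```

In Spark, none of this would run when the chain is written down; the scheduler would instead compile the operator graph into pipelined tasks when an action is called.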

Page 25: Apache Spark

SPARK Runtime

[Diagram: a driver program sends tasks to workers; each worker reads input data, caches partitions in RAM, and returns results to the driver.]

Page 26: Apache Spark

An Example

lines = spark.textFile("hdfs://…")

errors = lines.filter(_.startsWith("ERROR"))

errors.persist()

errors.filter(_.contains("HDFS")).map(_.split('\t')(3)).count()


Page 27: Apache Spark

An Example


[Lineage diagram: lines → filter(_.startsWith("ERROR")) → errors → filter(_.contains("HDFS")) → HDFS errors → map(_.split('\t')(3)) → part 3]
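Fault recovery via this lineage can be sketched in plain Python (an illustration of the idea with made-up log records, not Spark's code): if a derived partition is lost, it is rebuilt by replaying the recorded transformations over the surviving base data.

```python
# Hypothetical base partition of log lines (tab-separated fields).
base = ["ERROR HDFS x\ty\tz\tpart3", "INFO ok", "ERROR net down"]

# The lineage recorded for the derived dataset: an ordered list of
# transformations from the base data, mirroring the chain on the slide.
lineage = [
    lambda recs: [r for r in recs if r.startswith("ERROR")],
    lambda recs: [r for r in recs if "HDFS" in r],
    lambda recs: [r.split("\t")[3] for r in recs],
]

def recompute(base_partition, lineage):
    # Recovery: replay each lineage step over the base partition in order.
    data = base_partition
    for step in lineage:
        data = step(data)
    return data

recovered = recompute(base, lineage)
# recovered == ["part3"]
```

Because RDDs are read-only and the transformations are deterministic, replaying the lineage reproduces exactly the lost partition, so no replication of the derived data is needed.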

Page 28: Apache Spark

Persist function

Indicates which RDDs we want to reuse in future actions.

Other persistence strategies:

Store the RDD only on disk

Replicate it across machines

Set persistence priorities on RDDs


Page 29: Apache Spark

SPARK

RDD has been implemented in a system called Spark,

in the Scala language.

Page 30: Apache Spark

What benchmarks show about SPARK

Up to 20x faster than Hadoop for iterative applications (100 GB of data on 100 nodes)

Can scan a 1 TB dataset with 5-7 s latency

Page 31: Apache Spark

Evaluation (Logistic Regression)

Iteration time in seconds:

              First Iteration   Later Iterations
Hadoop              80                76
HadoopBM           139                62
Spark               46                 3

Page 32: Apache Spark

Evaluation (K-means)

Iteration time in seconds:

              First Iteration   Later Iterations
Hadoop             115               106
HadoopBM           182                87
Spark               82                33

Page 33: Apache Spark

SPARK STACK

[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX run on top of Apache Spark, which runs on a distributed file system, e.g., HDFS or GlusterFS.]

Page 34: Apache Spark

SPARK Won Daytona GraySort Contest 2014

Spark officially sets a new record in large-scale sorting


Page 35: Apache Spark

Spark: the fastest open-source engine for sorting


                          Hadoop MR             Spark (100 TB)     Spark (1 PB)
Data Size                 102.5 TB              100 TB             1000 TB
Elapsed Time              72 mins               23 mins            234 mins
# Nodes                   2100                  206                190
# Cores                   50400 physical        6592 virtualized   6080 virtualized
Cluster disk throughput   3150 GB/s             618 GB/s           570 GB/s
Environment               Dedicated datacenter  EC2 (i2.8xlarge)   EC2 (i2.8xlarge)
Sort rate                 1.42 TB/min           4.27 TB/min        4.27 TB/min

Without using Spark's in-memory cache

Page 36: Apache Spark

Current committers


Page 37: Apache Spark

SPARK > HADOOP MR


Page 38: Apache Spark

References


Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.

Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.

http://Spark.apache.org

https://databricks.com

Page 39: Apache Spark

Thank you