Apache Spark
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Mahdi Esmailoghli
Dr. Bagheri
Amirkabir University of Technology
SPARK
http://BigData.ceit.aut.ac.ir
A big-data processing framework.
“What is big data?”
Dealing with Big Data
Sampling
Hashing
Approximation methods
Map-Reduce Model
…
Map-Reduce Model
A big-data programming model
Offers the potential for parallelism
Can be executed on a cluster
Map-Reduce Model
[Diagram: Input → Map tasks → Reduce tasks → Output]
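The data flow above can be sketched as a toy word count, the canonical Map-Reduce example. This is plain, single-process Python standing in for a distributed runtime; the function names are illustrative, not any framework's API.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by key and sum the counts."""
    groups = defaultdict(int)
    for word, count in pairs:
        groups[word] += count
    return dict(groups)

lines = ["spark makes big data simple", "big data big results"]
counts = reduce_phase(map_phase(lines))
print(counts["big"])   # 3
```

In a real cluster the map calls run in parallel on different machines and the shuffle moves each key's pairs to one reducer; the logic per record is the same.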
Problems with current computing frameworks, especially Map-Reduce
They provide abstractions for accessing a cluster’s computational resources,
but lack an abstraction for distributed memory.
This makes them inefficient for applications that reuse intermediate results across multiple computations.
SPARK Motivation
Workloads poorly served by current frameworks (e.g., Map-Reduce):
Iterative algorithms
Interactive data mining tools
Data reuse examples
Iterative machine learning and graph algorithms:
PageRank
K-means clustering
Logistic regression
Data reuse examples
Interactive data mining (running multiple queries on the same subset of data):
Statistical queries
Fraud detection
Stream queries
Current Solution
The only way to reuse data between computations with current frameworks is to write it to an external stable storage system.
Map-Reduce Model
[Diagram: Input → Map tasks → Reduce tasks → Output, with intermediate data written to stable storage between jobs]
Systems developed for reusing intermediate data
Pregel: iterative graph computation
HaLoop: iterative map-reduce interface
These systems support only specific computation patterns;
we need an abstraction for more general reuse.
RDD: Resilient Distributed Dataset
RDD
A read-only, partitioned collection of records.
Can be created from data in stable storage or from other RDDs (via transformations).
Users can control persistence and partitioning.
RDD
RDDs enable efficient data reuse: they are parallel data structures that let users explicitly persist results of in-memory computations on large clusters, in a fault-tolerant manner.
Existing fault-tolerance approaches
Replicating data across machines
Logging updates across machines
Both are expensive for data-intensive workloads.
Two main interfaces for RDDs
[Diagram: RDD operations split into transformations and actions]

Transformations
Define a new RDD from an existing one, e.g., map, filter, and join.
SPARK computes RDDs lazily (this helps pipelining).
Lineage, the recorded chain of transformations, is the interface used for fault tolerance in RDDs.

Actions
Return a value to the program, e.g., count, collect, and save.
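A rough single-process analogy in plain Python: generator expressions defer work the way transformations do, and only an “action” (here, materializing with list) forces the whole pipeline to run. The names and data are illustrative, not Spark's API.

```python
log = ["ERROR disk full", "INFO ok", "ERROR timeout"]

# "Transformations": building the pipeline executes nothing yet;
# the generators only record what to do, like an RDD's lineage.
errors = (line for line in log if line.startswith("ERROR"))
codes = (line.split()[1] for line in errors)

# "Action": iterating finally runs the filter and map steps
# together, in one pipelined pass over the data.
result = list(codes)
print(result)   # ['disk', 'timeout']
```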
RDDs can express other models
Map-Reduce
SQL
Pregel
HaLoop
…
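To see this for Map-Reduce: a Map-Reduce job can be written with RDD-style operators as a flatMap followed by a grouped per-key reduce. A single-process Python sketch of that composition, with illustrative names (not Spark's actual API):

```python
from itertools import groupby

def map_reduce(records, map_fn, reduce_fn):
    """Express Map-Reduce with collection operators:
    flatMap -> group by key ("shuffle") -> reduce per key."""
    pairs = [kv for rec in records for kv in map_fn(rec)]   # flatMap
    pairs.sort(key=lambda kv: kv[0])                        # shuffle
    return {key: reduce_fn([v for _, v in group])           # reduce
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

counts = map_reduce(["a b a", "b c"],
                    map_fn=lambda line: [(w, 1) for w in line.split()],
                    reduce_fn=sum)
print(counts)   # {'a': 2, 'b': 2, 'c': 1}
```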
rdd1.join(rdd2).groupBy(…).filter(…)
[Diagram: the operator graph (join → groupBy → filter) is handed to the task scheduler; tasks are executed by workers, and results are returned.]

SPARK Runtime
[Diagram: a driver program sends tasks to workers; each worker loads its partition of the input data into RAM and returns results to the driver.]
An Example
lines = spark.textFile("hdfs://…")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.filter(_.contains("HDFS")).map(_.split('\t')(3)).count
An Example: lineage graph
[Diagram: lines --filter(_.startsWith("ERROR"))--> errors --filter(_.contains("HDFS"))--> HDFS errors --map(_.split('\t')(3))--> field 3]
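The lineage graph is what makes recovery cheap: a lost partition is rebuilt by replaying its transformations from data in stable storage, rather than restored from a replica. A toy, hypothetical sketch of that idea in plain Python (TinyRDD and its methods are invented for illustration):

```python
class TinyRDD:
    """A toy 'RDD': remembers its parent and its transformation
    (its lineage) rather than its data, so it can always be rebuilt."""
    def __init__(self, source=None, parent=None, op=None):
        self.source, self.parent, self.op = source, parent, op

    def map(self, f):
        return TinyRDD(parent=self, op=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return TinyRDD(parent=self, op=lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        # Replay the lineage from the stable source; this is the
        # same work recovery does after a partition is lost.
        if self.parent is None:
            return list(self.source)
        return self.op(self.parent.compute())

lines = TinyRDD(source=["ERROR a\tb\tc\td", "INFO x", "ERROR e\tf\tg\th"])
fields = (lines.filter(lambda l: l.startswith("ERROR"))
               .map(lambda l: l.split("\t")[3]))
print(fields.compute())   # ['d', 'h']
```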
The persist function
Indicates which RDDs we want to reuse in future actions.
Other persistence strategies are available, such as:
Storing the RDD only on disk
Replicating it across machines
Users can also set persistence priorities on RDDs.
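Persistence can be sketched the same way: compute the result once on the first action, then serve later actions from memory instead of replaying the lineage. A hypothetical single-process stand-in in plain Python (CachedResult is invented for illustration):

```python
class CachedResult:
    """A toy persist(): compute once, serve later actions from memory."""
    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._cached = None
        self.computations = 0   # how many times we actually recomputed

    def get(self):
        if self._cached is None:          # first action: compute and keep
            self.computations += 1
            self._cached = self._compute_fn()
        return self._cached               # later actions: reuse from memory

errors = CachedResult(lambda: [l for l in ["ERROR a", "INFO b", "ERROR c"]
                               if l.startswith("ERROR")])
print(len(errors.get()))      # 2 (computed)
print(len(errors.get()))      # 2 (served from cache)
print(errors.computations)    # 1
```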
SPARK
RDDs have been implemented in a system called SPARK, written in the Scala language.
What benchmarks show about SPARK
Up to 20x faster than Hadoop for iterative applications
Can scan a 1 TB dataset with 5-7 s latency
Evaluation: Logistic Regression (100 GB data, 100 nodes)
[Bar chart: running time in seconds of the first and later iterations for Hadoop, HadoopBinMem, and SPARK; SPARK's later iterations are by far the fastest.]
Evaluation: K-means (100 GB data, 100 nodes)
[Bar chart: running time in seconds of the first and later iterations for Hadoop, HadoopBinMem, and SPARK.]
SPARK Stack
Spark SQL, Spark Streaming, MLlib, GraphX
Apache Spark (core)
Distributed file system, e.g., HDFS, GlusterFS
SPARK won the Daytona GraySort contest in 2014
Spark officially set a new record in large-scale sorting.
Spark: the fastest open source engine for sorting

                          Hadoop MR record      Spark record       Spark 1 PB
Data size                 102.5 TB              100 TB             1000 TB
Elapsed time              72 min                23 min             234 min
# Nodes                   2100                  206                190
# Cores                   50400 physical        6592 virtualized   6080 virtualized
Cluster disk throughput   3150 GB/s             618 GB/s           570 GB/s
Environment               Dedicated datacenter  EC2 (i2.8xlarge)   EC2 (i2.8xlarge)
Sort rate                 1.42 TB/min           4.27 TB/min        4.27 TB/min

(Without using Spark's in-memory cache.)
Current committers
SPARK > HADOOP MR
References
Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM 51.1 (2008): 107-113.
Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.
http://Spark.apache.org
https://databricks.com
Thank you