BDAS RDD study report v1.2
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Motivation
• RDDs are motivated by two types of applications that current computing frameworks handle inefficiently:
  1. Iterative algorithms: iterative machine learning, graph algorithms
  2. Interactive data mining: ad-hoc queries
• In MapReduce, the only way to share data across jobs is stable storage, which is slow!
Examples
Slow due to replication and disk I/O, but necessary for fault tolerance
Goal: In-Memory Data Sharing
Solution: Resilient Distributed Datasets (RDDs)
• Restricted form of distributed shared memory
  -- Immutable, partitioned collections of records
  -- Can only be built through coarse-grained deterministic transformations (map, filter, join, …)
• Efficient fault recovery using lineage
  -- Log one operation to apply to many elements
  -- Recompute lost partitions on failure
  -- No cost if nothing fails
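To make the abstraction concrete, here is a minimal plain-Python sketch (not the Spark API) of an immutable, partitioned collection that can only be built through coarse-grained transformations applied to every element:

```python
# Plain-Python sketch (not the Spark API): an immutable, partitioned
# collection of records, built only via coarse-grained transformations.
class Dataset:
    def __init__(self, partitions):
        # tuple of tuples: the collection is immutable once built
        self.partitions = tuple(tuple(p) for p in partitions)

    def map(self, f):
        # coarse-grained: f is applied to every element of every partition
        return Dataset([[f(x) for x in p] for p in self.partitions])

    def filter(self, pred):
        return Dataset([[x for x in p if pred(x)] for p in self.partitions])

    def collect(self):
        return [x for p in self.partitions for x in p]

d = Dataset([[1, 2], [3, 4]])
result = d.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(result.collect())  # [20, 40]
```

Each transformation returns a new Dataset rather than mutating the old one, mirroring how RDD transformations define new RDDs.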
Solution: Resilient Distributed Datasets (RDDs)
• Allow apps to keep working sets in memory for efficient reuse
• Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability
• Support a wide range of applications
  -- Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)
RDD Operations
Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey
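The split between transformations and actions can be sketched in plain Python (a conceptual sketch, not the Spark API): transformations only record what should be done, and an action forces evaluation and returns a result to the caller.

```python
# Sketch of the transformation/action split: transformations are recorded
# lazily; an action triggers computation and returns a result.
class LazyRDD:
    def __init__(self, data, ops=()):
        self.data = list(data)
        self.ops = ops  # recorded operations, not yet executed

    def map(self, f):            # transformation: defines a new RDD
        return LazyRDD(self.data, self.ops + (("map", f),))

    def filter(self, pred):      # transformation
        return LazyRDD(self.data, self.ops + (("filter", pred),))

    def _compute(self):
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):             # action: returns a result to the driver
        return len(self._compute())

    def collect(self):           # action
        return self._compute()

r = LazyRDD(range(10)).filter(lambda x: x % 3 == 0).map(lambda x: x * x)
print(r.count())    # 4   (from 0, 3, 6, 9)
print(r.collect())  # [0, 9, 36, 81]
```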
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")          // Base RDD
errors = lines.filter(_.startsWith("ERROR"))  // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count    // Action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to three workers; each worker reads an HDFS block (Block 1-3), keeps its partition of cachedMsgs in memory (Cache 1-3), and returns results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Recovery
• RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
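The idea can be sketched in plain Python (a conceptual sketch, not the Spark API): each derived dataset remembers its parent and the transformation that built it, so a lost partition can be recomputed from its parent instead of being restored from a replica.

```python
# Sketch of lineage-based recovery: a lost partition is rebuilt by
# re-applying the recorded transformation to the parent's partition.
class LineageRDD:
    def __init__(self, partitions=None, parent=None, op=None):
        self.cached = partitions   # list of partitions; None entries are lost
        self.parent = parent       # lineage: the RDD this one was built from
        self.op = op               # per-partition transformation

    def map_partitions(self, f):
        return LineageRDD(parent=self, op=f)

    def get_partition(self, i):
        if self.cached is not None and self.cached[i] is not None:
            return self.cached[i]
        # partition lost (or never computed): recompute from lineage
        return self.op(self.parent.get_partition(i))

base = LineageRDD(partitions=[[1, 2], [3, 4], [5, 6]])
doubled = base.map_partitions(lambda p: [x * 2 for x in p])
doubled.cached = [[2, 4], None, [10, 12]]   # partition 1 lost in a failure
print(doubled.get_partition(1))  # [6, 8] -- rebuilt from partition 1 of base
```

Note that only the lost partition is recomputed; the surviving cached partitions are untouched, which is why recovery is cheap when nothing fails.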
Example: PageRank
Optimizing Placement
• links & ranks are repeatedly joined
• Can co-partition them (e.g., hash both on URL) to avoid shuffles
• Can also use app knowledge, e.g., hash on DNS name

links = links.partitionBy(new URLPartitioner())
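A plain-Python sketch of why co-partitioning helps (illustrative only; the link data is made up): if links and ranks are hash-partitioned with the same key function and partition count, matching keys always land in the same partition, so the join can run partition-by-partition with no data movement.

```python
# Sketch of co-partitioning: hash both datasets on the same key so a join
# needs no shuffle -- matching keys are already in the same partition.
def hash_partition(pairs, n):
    parts = [dict() for _ in range(n)]
    for k, v in pairs:
        parts[hash(k) % n][k] = v
    return parts

# hypothetical link graph and initial ranks
links = [("a.com", ["b.com"]), ("b.com", ["a.com", "c.com"]), ("c.com", ["a.com"])]
ranks = [("a.com", 1.0), ("b.com", 1.0), ("c.com", 1.0)]

N = 4
link_parts = hash_partition(links, N)
rank_parts = hash_partition(ranks, N)

# local (per-partition) join: no data crosses partition boundaries
joined = []
for lp, rp in zip(link_parts, rank_parts):
    for url in lp:
        joined.append((url, (lp[url], rp[url])))

print(sorted(joined)[0])  # ('a.com', (['b.com'], 1.0))
```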
PageRank Performance
Representing RDDs
• a set of partitions, which are atomic pieces of the dataset
• a set of dependencies on parent RDDs
• a function for computing the dataset based on its parents
• metadata about its partitioning scheme
• data placement
04/11/23
Representing RDDs
Operation | Meaning
partitions() | Return a list of Partition objects
preferredLocations(p) | List nodes where partition p can be accessed faster due to data locality
dependencies() | Return a list of dependencies
iterator(p, parentIters) | Compute the elements of partition p given iterators for its parent partitions
partitioner() | Return metadata specifying whether the RDD is hash/range partitioned
Interface used to represent RDDs in Spark
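The interface in the table can be mirrored as a plain-Python sketch (method names follow the table, not a real Spark class): a concrete RDD only has to describe its partitions and how to compute one of them.

```python
# Sketch of the five-method RDD interface from the table above.
class SketchRDD:
    def partitions(self):
        raise NotImplementedError
    def preferred_locations(self, p):
        return []        # default: no locality hints
    def dependencies(self):
        return []        # default: a leaf RDD with no parents
    def iterator(self, p, parent_iters):
        raise NotImplementedError
    def partitioner(self):
        return None      # default: not hash/range partitioned

class ParallelCollection(SketchRDD):
    """Leaf RDD backed by an in-memory list, split into n partitions."""
    def __init__(self, data, n):
        self.chunks = [data[i::n] for i in range(n)]
    def partitions(self):
        return list(range(len(self.chunks)))
    def iterator(self, p, parent_iters):
        return iter(self.chunks[p])

rdd = ParallelCollection([1, 2, 3, 4, 5, 6], 2)
print(rdd.partitions())           # [0, 1]
print(list(rdd.iterator(0, [])))  # [1, 3, 5]
```

The point of the narrow interface is that schedulers and fault recovery work the same way for every RDD, regardless of how its data is produced.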
Dependencies
• Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD
• Wide dependencies: multiple child partitions may depend on a single parent partition
• For example, map leads to a narrow dependency, while join leads to wide dependencies (unless the parents are hash-partitioned)
Dependencies
Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles
Narrow vs. Wide Dependencies
• Narrow dependencies
  --- allow pipelined execution on one cluster node, which can compute all the parent partitions
  --- make recovery after a node failure more efficient: only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes
• Wide dependencies
  --- require data from all parent partitions to be available and shuffled across the nodes using a MapReduce-like operation
  --- mean that, in a lineage graph, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution
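The contrast can be sketched in plain Python (illustrative data, not the Spark API): under a narrow dependency each output partition reads exactly one input partition, so the work pipelines per partition; under a wide dependency (e.g., groupByKey) every output may need records from every input partition, which is the shuffle.

```python
# Sketch of narrow vs. wide dependencies on a partitioned key-value dataset.
from collections import defaultdict

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: filter-then-map runs per partition, in one pass, on one node;
# output partition i depends only on input partition i.
narrow = [[(k, v * 10) for (k, v) in p if v % 2 == 1] for p in partitions]

# Wide (groupByKey-like): every output group may need records from every
# input partition, so all parents must be read and routed by key (a shuffle).
shuffled = defaultdict(list)
for p in partitions:          # read ALL parent partitions
    for k, v in p:
        shuffled[k].append(v)  # route by key

print(narrow)          # [[('a', 10)], [('a', 30)]]
print(dict(shuffled))  # {'a': [1, 3], 'b': [2], 'c': [4]}
```

Losing one narrow output partition costs one recomputed input partition; losing a shuffled group can require re-reading every parent partition.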
Job Scheduler
• Similar to Dryad’s, but takes into account which partitions of persistent RDDs are available in memory
• When the user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD’s lineage graph to build a DAG of stages to execute
• Each stage contains as many pipelined transformations with narrow dependencies as possible
• Stage boundaries are:
  --- the shuffle operations required for wide dependencies
  --- any already-computed partitions (which short-circuit the computation of a parent RDD)
• The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD
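A much-simplified sketch of stage formation (plain Python, hypothetical lineage; a real scheduler would recurse into the parent stages as well): walk the lineage graph backwards from the target RDD and cut a stage boundary at every wide dependency, so a stage holds only pipelined narrow transformations.

```python
# Sketch: build the final stage of a DAG by following narrow dependencies
# backwards and stopping at wide (shuffle) dependencies.
def build_final_stage(rdd, deps):
    """deps maps an RDD name to a list of (parent, 'narrow' | 'wide')."""
    stage, boundaries = [rdd], []
    frontier = [rdd]
    while frontier:
        node = frontier.pop()
        for parent, kind in deps.get(node, []):
            if kind == "narrow":
                stage.append(parent)       # pipeline into the current stage
                frontier.append(parent)
            else:
                boundaries.append(parent)  # a new stage ends at this shuffle
    return stage, boundaries

# hypothetical lineage: textFile -> map -> groupByKey (wide) -> mapValues -> target
deps = {
    "target":     [("mapValues", "narrow")],
    "mapValues":  [("groupByKey", "wide")],
    "groupByKey": [("map", "narrow")],
    "map":        [("textFile", "narrow")],
}
stage, boundaries = build_final_stage("target", deps)
print(stage)       # ['target', 'mapValues'] -- one pipelined stage
print(boundaries)  # ['groupByKey'] -- stage boundary at the shuffle
```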
Job Scheduler
• Dryad-like DAGs
• Pipelines functions within a stage
• Locality & data-reuse aware
• Partitioning-aware to avoid shuffles
Task Assignment
• The scheduler assigns tasks to machines based on data locality, using delay scheduling
  --- if a task needs to process a partition that is available in memory on a node, send it to that node
  --- otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), send it to those nodes
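The preference order above can be sketched as a small placement function (plain Python, hypothetical node and partition names; real delay scheduling also waits briefly for a preferred node to free up):

```python
# Sketch of locality-aware placement: prefer the node caching the partition
# in memory, then the RDD's preferred locations (e.g., HDFS block hosts),
# then any free node.
def place_task(partition, in_memory_on, preferred, free_nodes):
    if in_memory_on.get(partition) in free_nodes:
        return in_memory_on[partition]       # partition cached in memory there
    for node in preferred.get(partition, []):
        if node in free_nodes:
            return node                       # e.g., node holding the HDFS block
    return sorted(free_nodes)[0]              # fall back to any free node

in_memory_on = {"p0": "node1"}                # p0 is cached on node1
preferred = {"p1": ["node2", "node3"]}        # p1's HDFS block lives here

free = {"node1", "node2", "node3"}
print(place_task("p0", in_memory_on, preferred, free))  # node1
print(place_task("p1", in_memory_on, preferred, free))  # node2
```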
Memory Management
• In-memory storage as deserialized Java objects
  --- the fastest option, because the Java VM can access each RDD element natively
• In-memory storage as serialized data
  --- lets users choose a more memory-efficient representation than Java object graphs when space is limited, at the cost of lower performance
• On-disk storage
  --- useful for RDDs that are too large to keep in RAM but costly to recompute on each use
Not Suitable for RDDs
• RDDs are best suited for batch applications that apply the same operation to all elements of a dataset
• RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler
Programming Models Implemented on Spark
RDDs can express many existing parallel models
Open Source Community
15 contributors, 5+ companies using Spark, 3+ application projects at Berkeley
User applications:
» Data mining 40x faster than Hadoop (Conviva)
» Exploratory log analysis (Foursquare)
» Traffic prediction via EM (Mobile Millennium)
» Twitter spam classification (Monarch)
» DNA sequence analysis (SNAP)
Conclusion
• RDDs offer a simple and efficient programming model for a broad range of applications (their immutable nature and coarse-grained transformations suit a wide class of applications)
• They leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery
• They let the user control each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)