BDAS RDD study report v1.2
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Motivation
• RDDs are motivated by two types of applications that current computing frameworks handle inefficiently:
  1. Iterative algorithms: iterative machine learning, graph algorithms
  2. Interactive data mining: ad-hoc queries
• In MapReduce, the only way to share data across jobs is stable storage, which is slow!
Examples
Slow due to replication and disk I/O, but necessary for fault tolerance
Goal: In-Memory Data Sharing
Solution: Resilient Distributed Datasets (RDDs)
• Restricted form of distributed shared memory
  -- Immutable, partitioned collections of records
  -- Can only be built through coarse-grained deterministic transformations (map, filter, join, …)
• Efficient fault recovery using lineage
  -- Log one operation to apply to many elements
  -- Recompute lost partitions on failure
  -- No cost if nothing fails
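To make the abstraction concrete, here is a minimal plain-Python sketch (not the Spark API) of an immutable, partitioned collection that can only be built through coarse-grained transformations applied to every element:

```python
# Plain-Python sketch (not the Spark API): an immutable, partitioned
# collection of records, built only via coarse-grained transformations.
class Dataset:
    def __init__(self, partitions):
        # tuple of tuples: the collection is immutable once built
        self.partitions = tuple(tuple(p) for p in partitions)

    def map(self, f):
        # coarse-grained: f is applied to every element of every partition
        return Dataset([[f(x) for x in p] for p in self.partitions])

    def filter(self, pred):
        return Dataset([[x for x in p if pred(x)] for p in self.partitions])

    def collect(self):
        return [x for p in self.partitions for x in p]

d = Dataset([[1, 2], [3, 4]])
result = d.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(result.collect())  # [20, 40]
```

Each transformation returns a new Dataset rather than mutating the old one, mirroring how RDD transformations define new RDDs.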
Solution: Resilient Distributed Datasets (RDDs)
• Allow apps to keep working sets in memory for efficient reuse
• Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability
• Support a wide range of applications
  -- Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)
RDD Operations
Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey
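The split between transformations and actions can be sketched in plain Python (a conceptual sketch, not the Spark API): transformations only record what should be done, and an action forces evaluation and returns a result to the caller.

```python
# Sketch of the transformation/action split: transformations are recorded
# lazily; an action triggers computation and returns a result.
class LazyRDD:
    def __init__(self, data, ops=()):
        self.data = list(data)
        self.ops = ops  # recorded operations, not yet executed

    def map(self, f):            # transformation: defines a new RDD
        return LazyRDD(self.data, self.ops + (("map", f),))

    def filter(self, pred):      # transformation
        return LazyRDD(self.data, self.ops + (("filter", pred),))

    def _compute(self):
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):             # action: returns a result to the driver
        return len(self._compute())

    def collect(self):           # action
        return self._compute()

r = LazyRDD(range(10)).filter(lambda x: x % 3 == 0).map(lambda x: x * x)
print(r.count())    # 4   (from 0, 3, 6, 9)
print(r.collect())  # [0, 9, 36, 81]
```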
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")          // Base RDD
errors = lines.filter(_.startsWith("ERROR"))  // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count    // Action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to three workers; each worker reads an HDFS block (Block 1-3), keeps its partition of cachedMsgs in memory (Cache 1-3), and returns results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Recovery
• RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
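The idea can be sketched in plain Python (a conceptual sketch, not the Spark API): each derived dataset remembers its parent and the transformation that built it, so a lost partition can be recomputed from its parent instead of being restored from a replica.

```python
# Sketch of lineage-based recovery: a lost partition is rebuilt by
# re-applying the recorded transformation to the parent's partition.
class LineageRDD:
    def __init__(self, partitions=None, parent=None, op=None):
        self.cached = partitions   # list of partitions; None entries are lost
        self.parent = parent       # lineage: the RDD this one was built from
        self.op = op               # per-partition transformation

    def map_partitions(self, f):
        return LineageRDD(parent=self, op=f)

    def get_partition(self, i):
        if self.cached is not None and self.cached[i] is not None:
            return self.cached[i]
        # partition lost (or never computed): recompute from lineage
        return self.op(self.parent.get_partition(i))

base = LineageRDD(partitions=[[1, 2], [3, 4], [5, 6]])
doubled = base.map_partitions(lambda p: [x * 2 for x in p])
doubled.cached = [[2, 4], None, [10, 12]]   # partition 1 lost in a failure
print(doubled.get_partition(1))  # [6, 8] -- rebuilt from partition 1 of base
```

Note that only the lost partition is recomputed; the surviving cached partitions are untouched, which is why recovery is cheap when nothing fails.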
Example: PageRank
Optimizing Placement
• links & ranks are repeatedly joined
• Can co-partition them (e.g., hash both on URL) to avoid shuffles
• Can also use app knowledge, e.g., hash on DNS name

links = links.partitionBy(new URLPartitioner())
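A plain-Python sketch of why co-partitioning helps (illustrative only; the link data is made up): if links and ranks are hash-partitioned with the same key function and partition count, matching keys always land in the same partition, so the join can run partition-by-partition with no data movement.

```python
# Sketch of co-partitioning: hash both datasets on the same key so a join
# needs no shuffle -- matching keys are already in the same partition.
def hash_partition(pairs, n):
    parts = [dict() for _ in range(n)]
    for k, v in pairs:
        parts[hash(k) % n][k] = v
    return parts

# hypothetical link graph and initial ranks
links = [("a.com", ["b.com"]), ("b.com", ["a.com", "c.com"]), ("c.com", ["a.com"])]
ranks = [("a.com", 1.0), ("b.com", 1.0), ("c.com", 1.0)]

N = 4
link_parts = hash_partition(links, N)
rank_parts = hash_partition(ranks, N)

# local (per-partition) join: no data crosses partition boundaries
joined = []
for lp, rp in zip(link_parts, rank_parts):
    for url in lp:
        joined.append((url, (lp[url], rp[url])))

print(sorted(joined)[0])  # ('a.com', (['b.com'], 1.0))
```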
PageRank Performance
Representing RDDs
• a set of partitions, which are atomic pieces of the dataset
• a set of dependencies on parent RDDs
• a function for computing the dataset based on its parents
• metadata about its partitioning scheme
• data placement
04/11/23
Representing RDDs
Operation | Meaning
partitions() | Return a list of Partition objects
preferredLocations(p) | List nodes where partition p can be accessed faster due to data locality
dependencies() | Return a list of dependencies
iterator(p, parentIters) | Compute the elements of partition p given iterators for its parent partitions
partitioner() | Return metadata specifying whether the RDD is hash/range partitioned
Interface used to represent RDDs in Spark
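The interface in the table can be mirrored as a plain-Python sketch (method names follow the table, not a real Spark class): a concrete RDD only has to describe its partitions and how to compute one of them.

```python
# Sketch of the five-method RDD interface from the table above.
class SketchRDD:
    def partitions(self):
        raise NotImplementedError
    def preferred_locations(self, p):
        return []        # default: no locality hints
    def dependencies(self):
        return []        # default: a leaf RDD with no parents
    def iterator(self, p, parent_iters):
        raise NotImplementedError
    def partitioner(self):
        return None      # default: not hash/range partitioned

class ParallelCollection(SketchRDD):
    """Leaf RDD backed by an in-memory list, split into n partitions."""
    def __init__(self, data, n):
        self.chunks = [data[i::n] for i in range(n)]
    def partitions(self):
        return list(range(len(self.chunks)))
    def iterator(self, p, parent_iters):
        return iter(self.chunks[p])

rdd = ParallelCollection([1, 2, 3, 4, 5, 6], 2)
print(rdd.partitions())           # [0, 1]
print(list(rdd.iterator(0, [])))  # [1, 3, 5]
```

The point of the narrow interface is that schedulers and fault recovery work the same way for every RDD, regardless of how its data is produced.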
Dependencies
• Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD
• Wide dependencies: multiple child partitions may depend on a single parent partition
• For example, map leads to a narrow dependency, while join leads to wide dependencies (unless the parents are hash-partitioned)
Dependencies
Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles
Narrow vs. Wide Dependencies
• Narrow dependencies
  --- allow pipelined execution on one cluster node, which can compute all the parent partitions
  --- make recovery after a node failure more efficient: only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes
• Wide dependencies
  --- require data from all parent partitions to be available and shuffled across the nodes using a MapReduce-like operation
  --- mean that, in a lineage graph, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution
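The contrast can be sketched in plain Python (illustrative data, not the Spark API): under a narrow dependency each output partition reads exactly one input partition, so the work pipelines per partition; under a wide dependency (e.g., groupByKey) every output may need records from every input partition, which is the shuffle.

```python
# Sketch of narrow vs. wide dependencies on a partitioned key-value dataset.
from collections import defaultdict

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: filter-then-map runs per partition, in one pass, on one node;
# output partition i depends only on input partition i.
narrow = [[(k, v * 10) for (k, v) in p if v % 2 == 1] for p in partitions]

# Wide (groupByKey-like): every output group may need records from every
# input partition, so all parents must be read and routed by key (a shuffle).
shuffled = defaultdict(list)
for p in partitions:          # read ALL parent partitions
    for k, v in p:
        shuffled[k].append(v)  # route by key

print(narrow)          # [[('a', 10)], [('a', 30)]]
print(dict(shuffled))  # {'a': [1, 3], 'b': [2], 'c': [4]}
```

Losing one narrow output partition costs one recomputed input partition; losing a shuffled group can require re-reading every parent partition.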
Job Scheduler
• Similar to Dryad’s, but takes into account which partitions of persistent RDDs are available in memory
• When the user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD’s lineage graph to build a DAG of stages to execute
• Each stage contains as many pipelined transformations with narrow dependencies as possible
• Stage boundaries are:
  --- the shuffle operations required for wide dependencies
  --- any already-computed partitions (which short-circuit the computation of a parent RDD)
• The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD
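A much-simplified sketch of stage formation (plain Python, hypothetical lineage; a real scheduler would recurse into the parent stages as well): walk the lineage graph backwards from the target RDD and cut a stage boundary at every wide dependency, so a stage holds only pipelined narrow transformations.

```python
# Sketch: build the final stage of a DAG by following narrow dependencies
# backwards and stopping at wide (shuffle) dependencies.
def build_final_stage(rdd, deps):
    """deps maps an RDD name to a list of (parent, 'narrow' | 'wide')."""
    stage, boundaries = [rdd], []
    frontier = [rdd]
    while frontier:
        node = frontier.pop()
        for parent, kind in deps.get(node, []):
            if kind == "narrow":
                stage.append(parent)       # pipeline into the current stage
                frontier.append(parent)
            else:
                boundaries.append(parent)  # a new stage ends at this shuffle
    return stage, boundaries

# hypothetical lineage: textFile -> map -> groupByKey (wide) -> mapValues -> target
deps = {
    "target":     [("mapValues", "narrow")],
    "mapValues":  [("groupByKey", "wide")],
    "groupByKey": [("map", "narrow")],
    "map":        [("textFile", "narrow")],
}
stage, boundaries = build_final_stage("target", deps)
print(stage)       # ['target', 'mapValues'] -- one pipelined stage
print(boundaries)  # ['groupByKey'] -- stage boundary at the shuffle
```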
Job Scheduler
• Dryad-like DAGs
• Pipelines functions within a stage
• Locality & data-reuse aware
• Partitioning-aware to avoid shuffles
Task Assignment
• The scheduler assigns tasks to machines based on data locality, using delay scheduling
  --- if a task needs to process a partition that is available in memory on a node, send it to that node
  --- otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), send it to those nodes
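The preference order above can be sketched as a small placement function (plain Python, hypothetical node and partition names; real delay scheduling also waits briefly for a preferred node to free up):

```python
# Sketch of locality-aware placement: prefer the node caching the partition
# in memory, then the RDD's preferred locations (e.g., HDFS block hosts),
# then any free node.
def place_task(partition, in_memory_on, preferred, free_nodes):
    if in_memory_on.get(partition) in free_nodes:
        return in_memory_on[partition]       # partition cached in memory there
    for node in preferred.get(partition, []):
        if node in free_nodes:
            return node                       # e.g., node holding the HDFS block
    return sorted(free_nodes)[0]              # fall back to any free node

in_memory_on = {"p0": "node1"}                # p0 is cached on node1
preferred = {"p1": ["node2", "node3"]}        # p1's HDFS block lives here

free = {"node1", "node2", "node3"}
print(place_task("p0", in_memory_on, preferred, free))  # node1
print(place_task("p1", in_memory_on, preferred, free))  # node2
```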
Memory Management
• In-memory storage as deserialized Java objects
  --- the fastest option, because the Java VM can access each RDD element natively
• In-memory storage as serialized data
  --- lets users choose a more memory-efficient representation than Java object graphs when space is limited, at the cost of lower performance
• On-disk storage
  --- useful for RDDs that are too large to keep in RAM but costly to recompute on each use
Not Suitable for RDDs
• RDDs are best suited for batch applications that apply the same operation to all elements of a dataset
• RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler
Programming Models Implemented on Spark
RDDs can express many existing parallel models
Open Source Community
15 contributors, 5+ companies using Spark, 3+ application projects at Berkeley
User applications:
» Data mining 40x faster than Hadoop (Conviva)
» Exploratory log analysis (Foursquare)
» Traffic prediction via EM (Mobile Millennium)
» Twitter spam classification (Monarch)
» DNA sequence analysis (SNAP)
Conclusion
• RDDs offer a simple and efficient programming model for a broad range of applications (their immutable nature and coarse-grained transformations suit a wide class of applications)
• They leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery
• They let the user control each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)