Apache Spark overview

Apache Spark

Agenda
  • Hadoop vs Spark: Big Big Data question
  • Spark Ecosystem
  • What is RDD
  • Operations on RDD: Actions vs Transformations
  • Running in cluster
  • Task schedulers
  • Spark Streaming
  • Dataframes API

Let's remember: MapReduce

Apache Hadoop MapReduce

Hadoop VS/AND Spark

Hadoop: DFS
Spark: Speed (RAM)

Spark ecosystem


Simple Example

RDD: Resilient Distributed Dataset
Represents an immutable, partitioned collection of elements that can be operated on in parallel, with the possibility of recovery after failure.

Example
Hadoop RDD
  • getPartitions = HDFS blocks
  • getDependencies = None
  • compute = load block in memory
  • getPreferredLocations = HDFS block locations
  • partitioner = None
MapPartitions RDD
  • getPartitions = same as parent
  • getDependencies = parent RDD
  • compute = compute parent and apply map()
  • getPreferredLocations = same as parent
  • partitioner = None
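The five-part interface in the example above can be sketched in plain Python. This is a toy model, not Spark's actual classes: the names `ToyRDD`, `BlockRDD`, `MappedRDD` and the host names are assumptions made up for illustration, and in-memory lists stand in for HDFS blocks.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for the five-part RDD interface (not Spark's real classes)."""

    partitioner: Optional[Callable] = None  # None in both examples on the slide

    def get_partitions(self) -> List[int]:
        raise NotImplementedError

    def get_dependencies(self) -> List["ToyRDD"]:
        return []

    def compute(self, partition: int) -> List:
        raise NotImplementedError

    def get_preferred_locations(self, partition: int) -> List[str]:
        return []


class BlockRDD(ToyRDD):
    """Plays the role of the Hadoop RDD: one partition per storage block."""

    def __init__(self, blocks, locations):
        self.blocks = blocks        # list of lists: the data held in each block
        self.locations = locations  # list of host-name lists, one per block

    def get_partitions(self):
        return list(range(len(self.blocks)))

    def compute(self, partition):
        # "load block in memory"
        return list(self.blocks[partition])

    def get_preferred_locations(self, partition):
        return self.locations[partition]


class MappedRDD(ToyRDD):
    """Plays the role of the MapPartitions RDD: everything derives from the parent."""

    def __init__(self, parent, fn):
        self.parent = parent
        self.fn = fn

    def get_partitions(self):
        return self.parent.get_partitions()  # same as parent

    def get_dependencies(self):
        return [self.parent]                 # parent RDD

    def compute(self, partition):
        # compute parent and apply map()
        return [self.fn(x) for x in self.parent.compute(partition)]

    def get_preferred_locations(self, partition):
        return self.parent.get_preferred_locations(partition)


blocks = BlockRDD([[1, 2], [3, 4, 5]], [["host-a"], ["host-b"]])
doubled = MappedRDD(blocks, lambda x: x * 2)
result = [doubled.compute(p) for p in doubled.get_partitions()]
```

Because `MappedRDD` keeps only a reference to its parent and a function, a lost partition can be rebuilt by calling `compute` again, which is the idea behind the "resilient" part of the name.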

RDD Example

RDD Operations
Transformations
  • Apply user function to every element in a partition
  • Apply aggregation function to a whole dataset (groupBy, sortBy)
  • Provide functionality for repartitioning (repartition, partitionBy)
Actions
  • Materialize computation results (collect, count, take)
  • Store RDDs in memory or on disk (cache, persist); these calls are lazy and take effect when the next action runs
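The split between transformations and actions can be illustrated with a small pure-Python model. This is a sketch of the lazy-evaluation idea only, not Spark's API: `LazyDataset` and `square` are hypothetical names, and the method names merely mirror the operations listed above.

```python
from itertools import islice


class LazyDataset:
    """Toy dataset: transformations record steps, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Apply every recorded step to each element, one element at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return a concrete result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


calls = []


def square(x):
    calls.append(x)  # record every invocation to make laziness visible
    return x * x


ds = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)  # nothing has run yet
collected = ds.collect()          # the action triggers the whole pipeline
```

Note that each action re-runs the recorded steps from the source, which is why persisting an intermediate result pays off when several actions reuse it.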

RDD Dependencies

DAG: Directed Acyclic Graph

All the operators in a job are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible.

DAG Example

DAG Scheduler

The DAG scheduler divides operators into stages of tasks. A stage is composed of tasks based on partitions of the input data. The scheduler pipelines operators together within a stage.
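A rough way to picture stage construction is to walk a linear chain of operators and start a new stage whenever an operator needs data from all partitions. The sketch below assumes a simplified model: the `WIDE` set, the operator names, and the rule "same partition count in every stage" are illustrative assumptions, not the scheduler's real logic.

```python
# Operators assumed to need a shuffle (hypothetical classification for this toy).
WIDE = {"groupBy", "sortBy", "repartition", "partitionBy"}


def split_into_stages(operators):
    """Group a linear operator chain into stages, cutting before each wide operator."""
    stages = [[]]
    for op in operators:
        if op in WIDE and stages[-1]:
            stages.append([])  # shuffle boundary: begin a new stage
        stages[-1].append(op)  # narrow operators are pipelined into the current stage
    return stages


def tasks_for(stages, num_partitions):
    """One task per (stage, partition) pair in this simplified model."""
    return [(s, p) for s in range(len(stages)) for p in range(num_partitions)]


job = ["textFile", "map", "filter", "groupBy", "map", "sortBy", "map"]
stages = split_into_stages(job)
tasks = tasks_for(stages, num_partitions=4)
```

Here `map` and `filter` end up in the same stage as `textFile`, so each task reads its partition and applies both operators in one pass.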

DAG Scheduler example

RDD Persistence: persist() & cache()
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).


Removing data: Spark drops old partitions in least-recently-used (LRU) fashion, or you can remove an RDD manually with the RDD.unpersist() method.

Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
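The combination of LRU eviction and recomputation from lineage can be modelled in a few lines. This is a toy sketch under stated assumptions: `PartitionCache` is a hypothetical class, capacity is counted in partitions rather than bytes, and a plain function stands in for the chain of transformations.

```python
from collections import OrderedDict


class PartitionCache:
    """Toy LRU cache of partitions; misses are recomputed from a lineage function."""

    def __init__(self, capacity, compute_partition):
        self.capacity = capacity                    # max number of cached partitions
        self.compute_partition = compute_partition  # stands in for the RDD lineage
        self.store = OrderedDict()                  # oldest entry first
        self.computations = 0

    def get(self, pid):
        if pid in self.store:
            self.store.move_to_end(pid)             # mark as most recently used
            return self.store[pid]
        # Not cached (never computed, evicted, or lost): rebuild from lineage.
        self.computations += 1
        data = self.compute_partition(pid)
        self.store[pid] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)          # evict the least recently used
        return data

    def unpersist(self):
        self.store.clear()                          # manual removal


source = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
cache = PartitionCache(2, lambda pid: [x * 10 for x in source[pid]])

cache.get(0)
cache.get(1)
cache.get(0)              # hit: no recomputation
cache.get(2)              # cache is full, so partition 1 is evicted
recovered = cache.get(1)  # the evicted partition is transparently recomputed
```

The caller never sees a failure when a partition is missing; it only pays the cost of recomputing it.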

Job execution

Task Schedulers

Standalone
  • Default
  • FIFO strategy
  • Controls number of CPU cores and executor memory

YARN
  • Hadoop oriented
  • Takes all available resources
  • Was designed for stateless batch jobs that can be restarted easily if they fail

Mesos
  • Resource oriented
  • Dynamic sharing of CPU cores
  • Less predictable latency

Spark Driver (application)

Running in cluster

Memory usage

Execution memory
  • Storage for data needed during tasks execution
  • Shuffle-related data
Storage memory
  • Cached RDDs
  • Possible to borrow from execution memory
User memory
  • User data structures and internal metadata
  • Safeguarding against OOM
Reserved memory
  • Memory needed for running executor itself
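To get a feel for how an executor heap is divided among these regions, the arithmetic can be sketched as below. The numbers are assumptions, not taken from the slides: a 300 MiB reserved region, a unified fraction of 0.6 (`spark.memory.fraction`) and a storage share of 0.5 (`spark.memory.storageFraction`) are the defaults in recent Spark versions as far as I know, so check the configuration documentation for the version you run. The function name `memory_regions` is hypothetical.

```python
def memory_regions(heap_mib, reserved_mib=300.0, memory_fraction=0.6, storage_fraction=0.5):
    """Split an executor heap into the four regions listed above (sizes in MiB)."""
    usable = heap_mib - reserved_mib       # what is left after reserved memory
    unified = usable * memory_fraction     # shared by execution and storage
    storage = unified * storage_fraction   # initial storage share; the boundary is soft
    return {
        "reserved": reserved_mib,
        "user": usable - unified,
        "storage": storage,
        "execution": unified - storage,
    }


regions = memory_regions(4096)  # a 4 GiB executor heap
```

With these assumed defaults, a 4 GiB heap leaves roughly 1.1 GiB each for storage and execution and about 1.5 GiB for user memory. The storage and execution figures are starting points, since either side can borrow from the other when it is idle.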

Spark Streaming

Spark Streaming: Basic Concept

Spark Streaming: Architecture
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
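The micro-batch idea described above can be sketched without Spark: timestamped events are cut into fixed-interval batches, and an ordinary batch function runs on each one. The function `discretize`, the sample events, and the one-second interval are illustrative assumptions.

```python
def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    if not events:
        return []
    last_index = int(max(ts for ts, _ in events) // batch_interval)
    batches = [[] for _ in range(last_index + 1)]  # empty intervals still yield a batch
    for ts, value in events:
        batches[int(ts // batch_interval)].append(value)
    return batches


# Timestamps in seconds since the stream started (made-up sample data).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_interval=1.0)

# The "engine" processes every batch with the same batch logic: here, a count.
counts = [len(batch) for batch in batches]
```

The output is itself a sequence of per-batch results, matching the "final stream of results in batches" wording above.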

Discretized Streams (DStreams)

Windowed computations

window length - The duration of the window (3 in the figure).
sliding interval - The interval at which the window operation is performed (2 in the figure).
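With a window length of 3 and a sliding interval of 2, every second batch triggers a computation over the last three batches. The sketch below models that over a list of batches. It is a simplification, not the DStream API: `windowed` is a hypothetical helper, both parameters are counted in batches, and only full windows are emitted.

```python
def windowed(batches, window_length, sliding_interval):
    """Return the merged contents of each full window over a list of batches."""
    windows = []
    # 'end' is the number of batches received when the window fires.
    for end in range(window_length, len(batches) + 1, sliding_interval):
        window = batches[end - window_length:end]
        windows.append([x for batch in window for x in batch])
    return windows


batches = [[1], [2], [3], [4], [5]]  # five consecutive batches
windows = windowed(batches, window_length=3, sliding_interval=2)
sums = [sum(w) for w in windows]     # a windowed reduction
```

Since the window is longer than the slide, consecutive windows overlap: batch 3 contributes to both results.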

Spark Streaming checkpoints
  • Create heavy objects in foreachRDD
  • Default persistence level of DStreams keeps the data serialized in memory
  • Checkpointing (metadata and received data)
  • Automatic restart (task manager)
  • Max receiving rate
  • Level of Parallelism
  • Kryo serialization

Spark Streaming Example

Spark Dataframes (SQL)

Apache Hive
  • Hadoop product
  • Stores metadata in the relational database, but data only in HDFS
  • Is not suited for real time data processing
  • Best used for batch jobs over large datasets of immutable data (web logs)

Is a good choice if you:
  • Want to query the data
  • Are familiar with SQL

About Spark SQL
  • Part of Spark core since April 2014
  • Works with structured data
  • Mixes SQL queries with Spark programs
  • Connects to any data source (files, Hive tables, external databases, RDDs)

Spark Dataframes

Spark SQL

Spark SQL with schema

Dataframes benchmark