Apache Spark overview


Transcript of Apache Spark overview

Page 1: Apache Spark overview

Apache Spark

Page 2: Apache Spark overview

Agenda

Hadoop vs Spark: Big ‘Big Data’ question

Spark Ecosystem

What is an RDD

Operations on RDD: Actions vs Transformations

Running in a cluster

Task schedulers

Spark Streaming

DataFrames API

Page 3: Apache Spark overview

Let’s remember: MapReduce

Page 4: Apache Spark overview

Apache Hadoop MapReduce
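As a refresher, here is word count expressed with plain Scala collections to mirror MapReduce's map, shuffle, and reduce phases (a conceptual sketch, not Hadoop API code; the input data is made up):

object MapReduceWordCount {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to do or not to do")

    // Map phase: emit a (word, 1) pair for every word.
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle phase: group all pairs that share a key.
    val shuffled = mapped.groupBy(_._1)

    // Reduce phase: sum the counts for each key.
    val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    reduced.foreach(println)
  }
}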

Page 5: Apache Spark overview

Hadoop vs/and Spark

Hadoop: distributed storage (HDFS)

Spark: speed (in-memory processing, RAM)

Page 6: Apache Spark overview

Spark ecosystem

Page 7: Apache Spark overview

Glossary

Job
RDD
Stages
Tasks
DAG
Executor
Driver

Page 8: Apache Spark overview

Simple Example
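The transcript does not preserve the slide's code; a minimal word count is the usual simple example, sketched here in Scala (the input path is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///data/input.txt") // illustrative path
      .flatMap(_.split("\\s+")) // transformation: lines to words
      .map(word => (word, 1))   // transformation: word to (word, 1)
      .reduceByKey(_ + _)       // wide transformation: shuffle + sum

    counts.take(10).foreach(println) // action: triggers the computation
    sc.stop()
  }
}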

Page 9: Apache Spark overview

RDD: Resilient Distributed Dataset

Represents an immutable, partitioned collection of elements that can be operated on in parallel, with failure recovery possibilities.

Page 10: Apache Spark overview

Example: HadoopRDD and MapPartitionsRDD

HadoopRDD
• getPartitions = HDFS blocks
• getDependencies = none
• compute = load block in memory
• getPreferredLocations = HDFS block locations
• partitioner = none

MapPartitionsRDD
• getPartitions = same as parent
• getDependencies = parent RDD
• compute = compute parent and apply map()
• getPreferredLocations = same as parent
• partitioner = none
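These five attributes are the entire contract an RDD implements; HadoopRDD and MapPartitionsRDD are just two concrete answers to the same five questions. A simplified sketch modeled on Spark's abstract RDD class (SketchRDD is a made-up name):

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

abstract class SketchRDD[T] {
  def compute(split: Partition, context: TaskContext): Iterator[T] // how to produce one partition
  protected def getPartitions: Array[Partition]                    // how the data is split up
  protected def getDependencies: Seq[Dependency[_]]                // parent RDDs, if any
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil // locality hints
  val partitioner: Option[Partitioner] = None                      // how keys map to partitions
}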

Page 11: Apache Spark overview

RDD: Resilient Distributed Dataset

Page 12: Apache Spark overview

RDD Example

Page 13: Apache Spark overview

RDD Example

Page 14: Apache Spark overview

RDD Operations

● Transformations

○ Apply user function to every element in a partition

○ Apply aggregation function to a whole dataset (groupBy, sortBy)

○ Provide functionality for repartitioning (repartition, partitionBy)

● Actions

○ Materialize computation results (collect, count, take)

○ Mark RDDs for storage in memory or on disk (cache, persist); note that persistence itself is lazy and takes effect at the next action (see the sketch after this list)
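A minimal sketch of the distinction (data and names are illustrative): transformations only record lineage, while an action actually launches a job.

import org.apache.spark.{SparkConf, SparkContext}

object LazyVsEager {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LazyVsEager").setMaster("local[*]"))
    val numbers = sc.parallelize(1 to 1000000)

    // Transformations: nothing executes yet, only lineage is recorded.
    val evens   = numbers.filter(_ % 2 == 0)
    val doubled = evens.map(_ * 2)

    // Actions: materialize results and launch jobs.
    println(doubled.count())                 // 500000
    println(doubled.take(3).mkString(", "))  // 4, 8, 12

    sc.stop()
  }
}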

Page 15: Apache Spark overview

RDD Dependencies

Page 16: Apache Spark overview

DAG: Directed Acyclic Graph

All the operators in a job are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible.

Page 17: Apache Spark overview

DAG Example

Page 18: Apache Spark overview

DAG Scheduler

The DAG scheduler divides operators into stages of tasks. A stage comprises tasks based on partitions of the input data, and operators within a stage are pipelined together.
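To see the stage split concretely, an RDD's lineage (with its shuffle boundaries) can be printed with toDebugString; a small sketch, assuming an existing SparkContext sc and an illustrative input path:

// flatMap and map are pipelined into a single stage;
// reduceByKey introduces a shuffle and starts a new one.
val words = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// Prints the lineage; indentation shifts mark stage boundaries.
println(words.toDebugString)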

Page 19: Apache Spark overview

DAG Scheduler example

Page 20: Apache Spark overview

RDD Persistence: persist() & cache()

When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).

Storage levels: MEMORY_ONLY (default), MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

Removing data: partitions are evicted in least-recently-used (LRU) fashion, or dropped explicitly with the RDD.unpersist() method.
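A minimal sketch of persistence in use (assuming an existing SparkContext sc; the path and filter predicates are illustrative):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///logs/")                // illustrative path
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_AND_DISK)           // lazy: applied at the first action
println(errors.count())                                // computes and stores the partitions
println(errors.filter(_.contains("timeout")).count())  // reuses the stored partitions

errors.unpersist()                                     // explicitly free the storage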

Page 21: Apache Spark overview

Job execution

Page 22: Apache Spark overview

Task Schedulers

Standalone
• Default
• FIFO strategy
• Controls the number of CPU cores and executor memory

YARN
• Hadoop oriented
• Takes all available resources
• Was designed for stateless batch jobs that can be restarted easily if they fail

Mesos
• Resource oriented
• Dynamic sharing of CPU cores
• Less predictable latency
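In practice the scheduler choice mostly shows up as the master URL plus a few resource settings. A hedged sketch (host names, ports, and values are made up; setMaster("yarn") is the Spark 2.x form, 1.x used "yarn-client" or "yarn-cluster"):

import org.apache.spark.SparkConf

// Standalone: cap total cores and set executor memory explicitly.
val standalone = new SparkConf()
  .setMaster("spark://master-host:7077")
  .set("spark.cores.max", "8")
  .set("spark.executor.memory", "4g")

// YARN: resources are negotiated with the Hadoop cluster.
val yarn = new SparkConf().setMaster("yarn")

// Mesos: fine-grained, dynamic sharing of CPU cores.
val mesos = new SparkConf().setMaster("mesos://master-host:5050")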

Page 23: Apache Spark overview

Spark Driver (application)

Page 24: Apache Spark overview

Running in a cluster

Page 25: Apache Spark overview

Memory usage

• Execution memory
○ Storage for data needed during task execution
○ Shuffle-related data

• Storage memory
○ Cached RDDs
○ Possible to borrow from execution memory

• User memory
○ User data structures and internal metadata
○ Safeguarding against OOM

• Reserved memory
○ Memory needed for running the executor itself
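These regions correspond to a few configuration properties of the unified memory manager (Spark 1.6+); a sketch with illustrative values, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")          // total heap per executor
  .set("spark.memory.fraction", "0.6")         // share used for execution + storage
  .set("spark.memory.storageFraction", "0.5")  // storage share protected from eviction
// The rest of the heap is user memory; about 300 MB stays reserved for the executor itself.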

Page 26: Apache Spark overview

Spark Streaming

Page 27: Apache Spark overview

Spark Streaming: Basic Concept

Page 28: Apache Spark overview

Spark Streaming: Architecture

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

Page 29: Apache Spark overview

Discretized Streams (DStreams)

Page 30: Apache Spark overview

Windowed computations
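A typical windowed computation, as a sketch: assuming an existing DStream pairs of (word, 1) tuples, count words over the last 30 seconds, sliding every 10 seconds.

import org.apache.spark.streaming.Seconds

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // aggregate values inside the window
  Seconds(30),               // window length
  Seconds(10)                // slide interval
)
windowedCounts.print()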

Page 31: Apache Spark overview

Spark Streaming checkpoints

• Create heavy objects in foreachRDD
• Default persistence level of DStreams keeps the data serialized in memory
• Checkpointing (metadata and received data)
• Automatic restart (task manager)
• Max receiving rate
• Level of parallelism
• Kryo serialization
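A sketch of how some of these knobs are set (the checkpoint path, batch interval, and rate are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("StreamingTuning")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo serialization
  .set("spark.streaming.receiver.maxRate", "10000") // max records/sec per receiver

val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///checkpoints/app") // enables metadata and data checkpointing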

Page 32: Apache Spark overview

Spark Streaming Example
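The slide's code is not in the transcript; the canonical example is a socket word count, roughly as follows (host, port, and batch interval are illustrative; local[2] leaves a core free for the receiver):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // one RDD per 10-second batch

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // block until stopped
  }
}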

Page 33: Apache Spark overview

Spark DataFrames (SQL)

Page 34: Apache Spark overview

Apache Hive

• Hadoop product
• Stores metadata in a relational database, but data only in HDFS
• Is not suited for real-time data processing
• Best used for batch jobs over large datasets of immutable data (web logs)

Is a good choice if you:
• Want to query the data
• Are familiar with SQL
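Spark can query Hive tables directly; a sketch using a SparkSession with Hive support (the table and query are made up; older Spark 1.x code used HiveContext instead):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveQuery")
  .enableHiveSupport() // reads table metadata from the Hive metastore
  .getOrCreate()

// Aggregate server errors per day from a (hypothetical) Hive table.
val dailyErrors = spark.sql(
  "SELECT dt, COUNT(*) AS errors FROM web_logs WHERE status >= 500 GROUP BY dt")
dailyErrors.show()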

Page 35: Apache Spark overview

About Spark SQL

Part of Spark core since April 2014

Works with structured data

Mixes SQL queries with Spark programs

Connects to many data sources (files, Hive tables, external databases, RDDs)
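A small sketch of mixing SQL with a program (the file and field names are illustrative; Spark 1.x used SQLContext and registerTempTable for the same flow):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SqlExample")
  .master("local[*]")
  .getOrCreate()

val people = spark.read.json("people.json") // schema is inferred from the data
people.createOrReplaceTempView("people")    // make the DataFrame visible to SQL

val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()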

Page 36: Apache Spark overview

Spark DataFrames

Page 37: Apache Spark overview

Spark DataFrames

Page 38: Apache Spark overview

Spark SQL

Page 39: Apache Spark overview

Spark SQL with schema
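The slide's code is missing from the transcript; a hedged sketch of supplying an explicit schema with StructType instead of relying on inference (reuses the SparkSession spark from the previous sketch; fields and rows are made up):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Declare the schema explicitly.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true)
))

// Pair it with an RDD of Rows to build a DataFrame.
val rows = spark.sparkContext.parallelize(Seq(Row("Alice", 34), Row("Bob", 28)))
val df = spark.createDataFrame(rows, schema)
df.printSchema()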

Page 40: Apache Spark overview

DataFrames benchmark

Page 41: Apache Spark overview

Q&A