Apache Spark overview
Transcript of Apache Spark overview
Agenda
- Hadoop vs Spark: the Big Data question
- Spark ecosystem
- What is an RDD
- Operations on RDDs: actions vs transformations
- Running in a cluster
- Task schedulers
- Spark Streaming
- DataFrames API
Let's remember: MapReduce
Apache Hadoop MapReduce
Hadoop VS/AND Spark
Hadoop: DFS (distributed storage)
Spark: speed (in-memory processing, RAM)
RDD: Resilient Distributed Dataset
Represents an immutable, partitioned collection of elements that can be operated on in parallel, with failure-recovery capabilities.
Example
HadoopRDD:
- getPartitions = HDFS blocks
- getDependencies = None
- compute = load block in memory
- getPreferredLocations = HDFS block locations
- partitioner = None

MapPartitionsRDD:
- getPartitions = same as parent
- getDependencies = parent RDD
- compute = compute parent and apply map()
- getPreferredLocations = same as parent
- partitioner = None
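The two examples above boil down to a small per-RDD contract. A toy sketch of that contract in plain Python (illustrative only; Spark's real RDDs are Scala classes, and the names below just mirror the slide):

```python
# Illustrative sketch of the RDD contract from the slide (not Spark's code).
class SourceRDD:
    """Plays the role of HadoopRDD: partitions come from storage blocks."""
    def __init__(self, blocks):
        self.blocks = blocks                 # stand-in for HDFS blocks

    def get_partitions(self):
        return list(range(len(self.blocks)))

    def get_dependencies(self):
        return None                          # no parent RDD

    def compute(self, partition):
        return list(self.blocks[partition])  # "load block in memory"


class MapPartitionsRDD:
    """Plays the role of MapPartitionsRDD: same partitions as the parent."""
    def __init__(self, parent, fn):
        self.parent, self.fn = parent, fn

    def get_partitions(self):
        return self.parent.get_partitions()

    def get_dependencies(self):
        return self.parent

    def compute(self, partition):
        # compute the parent partition, then apply map()
        return [self.fn(x) for x in self.parent.compute(partition)]


source = SourceRDD([[1, 2], [3, 4]])
mapped = MapPartitionsRDD(source, lambda x: x * 10)
result = [x for p in mapped.get_partitions() for x in mapped.compute(p)]
# result == [10, 20, 30, 40]
```

Because `compute` walks the dependency chain, a lost partition can always be rebuilt from its parent — which is exactly the failure-recovery property named above.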
RDD: Resilient Distributed Dataset
RDD Operations
Transformations:
- Apply a user function to every element in a partition
- Apply an aggregation function to the whole dataset (groupBy, sortBy)
- Repartition the data (repartition, partitionBy)
Actions:
- Materialize computation results (collect, count, take)
- Store RDDs in memory or on disk (cache, persist — strictly, these only mark the RDD for storage; the data is materialized by the next action)
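The key behavioral difference is that transformations are lazy while actions trigger execution. A minimal plain-Python sketch of that idea (not Spark code):

```python
# Illustrative: a transformation only records a plan; an action executes it.
calls = []

def lazy_map(data, fn):
    """Transformation: returns a generator; nothing runs yet."""
    def gen():
        for x in data:
            calls.append(x)      # record that work actually happened
            yield fn(x)
    return gen()

plan = lazy_map([1, 2, 3], lambda x: x * 2)
assert calls == []               # lazy: no element processed yet
result = list(plan)              # list() plays the role of collect(), an action
assert result == [2, 4, 6]
assert calls == [1, 2, 3]        # work happened only when the action ran
```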
DAG: Directed Acyclic Graph
All the operators in a job are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible.
The DAG scheduler divides the operators into stages of tasks. A stage is comprised of tasks based on partitions of the input data, and narrow operators are pipelined together within a single stage.
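The stage-cutting rule can be sketched as: pipeline narrow operators, and start a new stage at each wide (shuffle) operator. A toy illustration in Python, where the wide/narrow flags are assumed inputs rather than anything Spark exposes:

```python
# Illustrative: pipeline narrow operators into one stage; a wide (shuffle)
# operator starts a new stage, mirroring the DAG scheduler described above.
def split_into_stages(ops):
    stages, current = [], []
    for name, is_wide in ops:
        if is_wide and current:
            stages.append(current)   # cut the stage at the shuffle boundary
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

ops = [("map", False), ("filter", False), ("groupBy", True), ("map", False)]
print(split_into_stages(ops))  # [['map', 'filter'], ['groupBy', 'map']]
```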
DAG Scheduler example
RDD Persistence: persist() & cache()
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).
Storage levels: MEMORY_ONLY (default), MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Removing data: least-recently-used (LRU) fashion or RDD.unpersist() method.
Spark's cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it.
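The eviction order can be illustrated with a small LRU cache in plain Python (a simplification of Spark's block-eviction behavior, not its actual code):

```python
from collections import OrderedDict

# Illustrative LRU eviction: the policy Spark applies when the cache fills up.
class LruCache:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, key):
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

cache = LruCache(2)
cache.put("rdd1", "partition data")
cache.put("rdd2", "partition data")
cache.get("rdd1")             # touch rdd1, so rdd2 becomes the LRU entry
cache.put("rdd3", "partition data")
print(list(cache.data))       # ['rdd1', 'rdd3']: rdd2 was evicted
```

In Spark an evicted partition is not lost for good: the lineage above means it is simply recomputed if needed again.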
Standalone:
- Default cluster manager
- FIFO scheduling strategy
- Controls the number of CPU cores and executor memory

YARN:
- Hadoop oriented
- Takes all available resources
- Was designed for stateless batch jobs that can be restarted easily if they fail

Mesos:
- Resource oriented
- Dynamic sharing of CPU cores
- Less predictable latency
Spark Driver (application)
Running in cluster
Execution memory:
- Storage for data needed during task execution
- Shuffle-related data

Storage memory:
- Cached RDDs
- Can borrow from execution memory

User memory:
- User data structures and internal metadata
- Safeguard against OOM

Reserved memory:
- Memory needed for running the executor itself
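How these regions carve up an executor heap can be estimated with a quick calculation. The numbers below assume the post-1.6 unified memory model with Spark 2.x defaults (300 MB reserved, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5); your cluster's settings may differ:

```python
# Back-of-envelope split of an executor heap under the unified memory model.
# Assumes Spark 2.x defaults; the storage/execution boundary is soft
# (storage can borrow from execution, as noted above).
def memory_regions(heap_mb, fraction=0.6, storage_fraction=0.5, reserved_mb=300):
    usable = heap_mb - reserved_mb
    unified = usable * fraction                          # execution + storage
    return {
        "reserved": reserved_mb,                         # executor internals
        "user": round(usable * (1 - fraction), 1),       # user data, metadata
        "storage": round(unified * storage_fraction, 1), # cached RDDs
        "execution": round(unified * (1 - storage_fraction), 1),  # shuffles
    }

print(memory_regions(4096))
# A 4 GB heap: 300 MB reserved, ~1518 MB user,
# ~1139 MB each for storage and execution
```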
Spark Streaming: Basic Concept
Spark Streaming: Architecture
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
Discretized Streams (DStreams)
- window length: the duration of the window (3 in the figure)
- sliding interval: the interval at which the window operation is performed (2 in the figure)
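The window/slide mechanics can be sketched in plain Python, using the figure's values (window length 3, sliding interval 2) over a stream of micro-batches:

```python
# Illustrative windowing over a stream of micro-batches: each window spans
# `window_length` batches and a new window is emitted every
# `sliding_interval` batches, matching the figure described above.
def windows(batches, window_length=3, sliding_interval=2):
    out = []
    for end in range(window_length, len(batches) + 1, sliding_interval):
        out.append(batches[end - window_length:end])
    return out

batches = [[1], [2], [3], [4], [5]]      # one list of records per batch
print(windows(batches))  # [[[1], [2], [3]], [[3], [4], [5]]]
```

Note the overlap: batch `[3]` appears in both windows, because the window length (3) is larger than the sliding interval (2).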
Spark Streaming checkpoints and tuning
- Create heavy objects in foreachRDD
- The default persistence level of DStreams keeps the data serialized in memory
- Checkpointing (metadata and received data)
- Automatic restart (task manager)
- Max receiving rate
- Level of parallelism
- Kryo serialization
Spark Streaming Example
Spark Dataframes (SQL)
Apache Hive
- Hadoop product
- Stores metadata in a relational database, but data only in HDFS
- Not suited for real-time data processing
- Best used for batch jobs over large datasets of immutable data (web logs)
Hive is a good choice if you:
- want to query the data
- are familiar with SQL
About Spark SQL
- Part of Spark core since April 2014
- Works with structured data
- Mixes SQL queries with Spark programs
- Connects to any data source (files, Hive tables, external databases, RDDs)
Spark SQL with schema
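Running Spark SQL itself requires a Spark runtime, so as a stand-in for the "SQL mixed with programs over structured data" idea, here is the same pattern with Python's built-in sqlite3 module (the `people` table and its columns are invented for illustration):

```python
import sqlite3

# Stand-in for Spark SQL's pattern: declare a schema, query it with SQL,
# then process the result with ordinary program code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")  # the schema
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ann", 25), ("Bob", 31), ("Eve", 17)])

# Declarative SQL query...
rows = conn.execute("SELECT name FROM people WHERE age >= 18").fetchall()
# ...mixed with ordinary program logic on the result.
adults = sorted(name for (name,) in rows)
print(adults)  # ['Ann', 'Bob']
```

In Spark the equivalent would register a DataFrame with a schema as a temporary view and run the same kind of SQL against it, with the results coming back as a DataFrame.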