Apache Spark


Spark
Majid Hajibaba

Outline
- An Overview on Spark
- Spark Programming Guide
- An Example on Spark
- Running Applications on Spark
- Spark Streaming
- Spark Streaming Programming Guide
- An Example on Spark Streaming
- Spark and Storm: A Comparison
- Spark SQL

15 January 2015, Majid Hajibaba

An Overview

Cluster Mode Overview
- Spark applications run as independent sets of processes on a cluster
- Executor processes run tasks in multiple threads
- The driver should be close to the workers
- For operating remotely, open an RPC to the driver rather than running the driver itself remotely



http://spark.apache.org/docs/1.0.1/cluster-overview.html

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark's own standalone cluster manager or Mesos/YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster: processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run.

Because the driver schedules tasks on the cluster, it should run close to the worker nodes, preferably on the same local area network. If you'd like to send requests to the cluster remotely, it's better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.

Spark Core is a computational engine that is responsible for scheduling, distributing, and monitoring applications in a cluster
- Higher-level components (Shark; GraphX; Streaming; ) are like libraries in a software project
- Tight integration has several benefits: simple improvements, minimized costs, combined processing models


Spark - A Unified Stack

The Spark project contains multiple closely integrated components. At its core, Spark is a computational engine that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning. These components are designed to interoperate closely, letting you combine them like libraries in a software project.

A philosophy of tight integration has several benefits. First, all libraries and higher-level components in the stack benefit from improvements at the lower layers. Second, the costs (deployment, maintenance, testing, support) associated with running the stack are minimized: instead of running 5-10 independent software systems, an organization only needs to run one. Moreover, each time a new component is added to the Spark stack, every organization that uses Spark can immediately try it. Finally, tight integration makes it possible to build applications that seamlessly combine different processing models. For example, in Spark you can write one application that uses machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously, analysts can query the resulting data, also in real time, via SQL, e.g. to join the data with unstructured log files.

Spark Streaming

Spark Streaming is a Spark component that enables processing live streams of data. Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service. Spark Streaming provides an API for manipulating data streams that closely matches the Spark Core's RDD API, making it easy for programmers to learn the project and move between applications that manipulate data stored in memory, on disk, or arriving in real time. Underneath its API, Spark Streaming was designed to provide the same degree of fault tolerance, throughput, and scalability as the Spark Core.
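As a minimal sketch of how closely the streaming API mirrors the RDD API (assuming the 1.x-era JavaStreamingContext; the host, port, and batch interval here are illustrative, not from the slides):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSketch {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]");
    // Cut the live stream into one-second micro-batches
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // One possible source: lines of text arriving on a TCP socket
    JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

    // DStream operators look just like RDD operators: filter, map, count, ...
    JavaDStream<String> errors = lines.filter(line -> line.contains("ERROR"));
    errors.print();

    jssc.start();              // begin receiving and processing
    jssc.awaitTermination();   // block until the streaming job is stopped
  }
}
```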

Spark SQL

Spark SQL provides support for interacting with Spark via SQL as well as the Apache Hive variant of SQL, called the Hive Query Language (HiveQL). Spark SQL represents database tables as Spark RDDs and translates SQL queries into Spark operations. Beyond providing the SQL interface to Spark, Spark SQL allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java and Scala, all within a single application. This tight integration with the rich and sophisticated computing environment provided by the rest of the Spark stack makes Spark SQL unlike any other open source data warehouse tool.

Spark Processing Model

- MapReduce processing model
- In-memory iterative MapReduce

Spark Goal
- Provide distributed memory abstractions for clusters to support apps with working sets
- Retain the attractive properties of MapReduce:
  - Fault tolerance (for crashes & stragglers)
  - Data locality
  - Scalability

Solution: augment data flow model with resilient distributed datasets (RDDs)

Resilient Distributed Datasets (RDDs)

- Immutable collection of elements that can be operated on in parallel
- Created by transforming data using data flow operators (e.g. map)
- Parallel operations on RDDs

Benefits
- Consistency is easy due to immutability
- Inexpensive fault tolerance: log the lineage instead of replicating or checkpointing data
- Locality-aware scheduling of tasks on partitions
- Applicable to a broad variety of applications
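The data-flow operators above can be sketched in Java (a local-mode sketch using Java 8 lambdas; the class name and thread count are illustrative):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddOperators {
  public static void main(String[] args) {
    // Local mode with 2 threads; no cluster needed for this sketch
    SparkConf conf = new SparkConf().setAppName("RddOperators").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // An immutable RDD of integers
    JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4));

    // map is a transformation: it returns a new RDD and leaves nums unchanged
    JavaRDD<Integer> squares = nums.map(x -> x * x);

    // reduce is an action: it triggers parallel computation across partitions
    int sum = squares.reduce((a, b) -> a + b);
    System.out.println(sum);  // 1 + 4 + 9 + 16 = 30

    sc.stop();
  }
}
```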


Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
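Both creation paths can be sketched as follows (the HDFS path is illustrative; textFile accepts any Hadoop-supported URI):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCreation {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("RddCreation").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // 1) Parallelize an existing collection in the driver program
    JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "rdd", "cluster"));

    // 2) Reference a dataset in an external storage system; any source
    //    offering a Hadoop InputFormat works (shared filesystem, HDFS,
    //    HBase, ...). The path below is illustrative.
    JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/data/input.txt");

    System.out.println(words.count());  // 3

    sc.stop();
  }
}
```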

RDDs

- Immutable collection of objects
- Partitioned and distributed

Spark Programming Guide

Linking with Spark
- Spark 1.2.0 works with Java 6 and higher
- To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at:

Importing Spark classes into the program:

groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.2.0

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;

Spark 1.2.0 works with Java 6 and higher. If you are using Java 8, Spark supports lambda expressions for concisely writing functions; otherwise you can use the classes in the org.apache.spark.api.java.function package.

To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at:

groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.2.0
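In a Maven pom.xml, these coordinates correspond to the following dependency entry (a sketch; the surrounding project definition is omitted):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
</dependency>
```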

In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS. Some common HDFS version tags are listed on the third party distributions page.

groupId = org.apache.hadoop
artifactId = hadoop-client
version =

Finally, you need to import some Spark classes into your program; add the import lines shown above.

Initializing Spark - Creating a SparkContext
- Tells Spark how to access a cluster
- The entry point: the first thing a Spark program creates
- This is done through the following constructor:


new SparkContext(master, appName, [sparkHome], [jars])

Or through SparkConf for advanced configuration

import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext ctx = new JavaSparkContext("master_url", "application name", ["path_to_spark_home", "path_to_jars"]);

A SparkContext class represents the connection to a Spark cluster and provides the entry point for interacting with Spark. We need to create a SparkContext instance so that we can interact with Spark and distribute our jobs.

master: a string specifying a Spark or Mesos cluster URL to connect to, or a special "local" string to run in local mode.
appName: a name for your application, which will be shown in the cluster web UI.
sparkHome: the path at which Spark is installed on your worker machines (it should be the same on all of them).
jars: a list of JAR files on the local machine containing your application's code and any dependencies, which Spark will deploy to all the worker nodes. You'll need to package your application into a set of JARs using your build system.
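For local testing, the optional sparkHome and jars arguments can be dropped; a minimal sketch (the application name here is illustrative):

```java
import org.apache.spark.api.java.JavaSparkContext;

public class CreateContext {
  public static void main(String[] args) {
    // "local[2]" runs Spark in local mode with two worker threads;
    // sparkHome and jars are omitted since nothing needs deploying
    JavaSparkContext ctx = new JavaSparkContext("local[2]", "context-demo");

    System.out.println(ctx.appName());
    ctx.stop();
  }
}
```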

or through new SparkContext(conf), which takes a SparkConf object for more advanced configuration.

SparkConf
- Configuration for a Spark application
- Sets various Spark parameters as key-value pairs
- A SparkConf object contains information about the application
- The constructor will load values from any spark.* Java system properties set in the application


import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new Spark