Apache spark - Installation



Martin Zapletal Cake Solutions

Apache Spark

Apache Spark and Big Data

1) History and market overview
2) Installation
3) MLlib and machine learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Core, SQL, GraphX, Streaming
6) Spark’s distributed programming model
7) Deployment

Table of Contents

● Spark architecture
● download, versions, install, startup
● Cluster managers
    ○ Local
    ○ Standalone
    ○ Mesos
    ○ YARN
● Spark shell
● Job deployment
● Streaming job deployment
● Integration with other tools

● after this session you should be able to install Spark, run a Spark cluster and deploy basic jobs


● prebuilt packages for different versions of Hadoop, CDH (Cloudera’s distribution of Hadoop) and MapR (MapR’s distribution of Hadoop)
    ○ currently support Scala 2.10 only
● build from source
    ○ uses mvn, but has an sbt wrapper
    ○ need to specify the Hadoop version to build against
    ○ can be built with Scala 2.11 support

Spark architecture
● the cluster is persistent; users submit Jobs to it
● SparkContext (driver) contacts the Cluster Manager, which assigns cluster resources
● it then sends application code to the assigned Executors (distributing computation, not data!)
● finally it sends tasks to the Executors to run
● each master and worker runs a web UI that displays task progress and results
● each application (SparkContext) has its own Executors (not shared), which live for the whole duration of the program and run in separate JVMs using multiple threads
● Cluster Manager agnostic: Spark only needs to acquire Executors and have them communicate with each other

Spark streaming
● mostly similar
● Receiver components consume from the data source
● a Receiver sends information to the driver program, which then schedules tasks (discretized streams, small batches) to run in the cluster
    ○ number of assigned cores must be higher than the number of Receivers
● different job lifecycle
    ○ potentially unbounded
    ○ needs to be stopped explicitly by calling stop() on the StreamingContext (ssc.stop())
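The streaming lifecycle above can be sketched roughly as follows, assuming Spark Streaming is on the classpath. The object name, application name and input data are made up for illustration. To stay runnable without an external source, the sketch feeds in-memory RDDs through queueStream, which does not use a Receiver; with a Receiver-based source the local master would need more threads than Receivers, which is why local[2] is used here anyway.

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical example object, not part of the original slides.
object StreamingLifecycleSketch {
  // Runs a short streaming job and returns the sum of all elements processed.
  def run(batches: Int, timeoutMs: Long): Long = {
    val conf = new SparkConf().setAppName("streaming-lifecycle-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val total = new AtomicLong(0L)

    // Each queued RDD becomes one small batch of the discretized stream.
    val queue = new mutable.Queue[RDD[Int]]()
    (1 to batches).foreach(_ => queue += ssc.sparkContext.parallelize(1 to 10))

    ssc.queueStream(queue).foreachRDD { rdd =>
      // fold is safe on the empty batches produced once the queue is drained.
      total.addAndGet(rdd.fold(0)(_ + _).toLong)
    }

    ssc.start()
    // The job is potentially unbounded, so wait for a bounded time only...
    ssc.awaitTerminationOrTimeout(timeoutMs)
    // ...and then stop it explicitly, together with the underlying SparkContext.
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    total.get
  }

  def main(args: Array[String]): Unit =
    println(s"sum of processed elements: ${run(batches = 3, timeoutMs = 5000L)}")
}
```

How many batches complete before the timeout depends on timing, so the printed sum is not fixed.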


● passing configuration
● accessing cluster
● SparkContext is then used to create an RDD from input data
    ○ various sources

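// batch: appName and master are supplied by the application
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)

// streaming (a separate program): same configuration, 1 second batch interval
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))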
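As a small illustration of what happens once a SparkContext exists, the sketch below creates an RDD from a driver-side collection, applies a transformation and triggers an action. The object name, application name and the local[2] master are assumptions chosen so that it runs without a cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, not part of the original slides.
object RddFromCollectionSketch {
  // Sums the squares of 1..n using an RDD.
  def sumOfSquares(sc: SparkContext, n: Int): Long =
    sc.parallelize(1 to n)      // RDD created from a driver-side collection
      .map(x => x.toLong * x)   // transformation, executed in tasks on the Executors
      .reduce(_ + _)            // action, the result is returned to the driver

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(sumOfSquares(sc, 10))
    finally sc.stop()
  }
}
```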

Spark architecture

● 5 modes:
    1. local
    2. standalone
    3. YARN
    4. Mesos
    5. Amazon EC2

Local mode

● for application development purposes, no cluster required
● local
    ○ Run Spark locally with one worker thread (i.e. no parallelism at all).
● local[K]
    ○ Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
● local[*]
    ○ Run Spark locally with as many worker threads as logical cores on your machine.

● example local
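To make the three local master strings concrete, here is a plain Scala helper that maps each of them to the number of worker threads it implies. It is only an illustration of the rules listed above, not Spark's own parsing code, and the names LocalMaster and workerThreads are invented for this sketch.

```scala
// Hypothetical helper, not part of Spark or the original slides.
object LocalMaster {
  private val Threads = """local\[(\d+)\]""".r

  // Number of worker threads implied by a local master string, if it is one.
  def workerThreads(master: String): Option[Int] = master match {
    case "local"    => Some(1)                                          // no parallelism
    case "local[*]" => Some(Runtime.getRuntime.availableProcessors())   // one per logical core
    case Threads(k) => Some(k.toInt)                                    // local[K]
    case _          => None                                             // not a local master
  }
}
```

The resulting string is what would be passed to setMaster, for example setMaster("local[4]") for four worker threads.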

Standalone mode
● place a compiled version of Spark on each node
● deployment scripts
    ○ sbin/start-master.sh
    ○ sbin/start-slaves.sh
    ○ sbin/stop-all.sh
● various settings, e.g. port, web UI port, memory, cores, Java opts
● drivers use spark://HOST:PORT as master
● only supports a simple FIFO scheduler across applications
    ○ application or global config decides how many cores and how much memory each application is assigned
● resilient to Worker failures, but the Master is a single point of failure
● supports ZooKeeper for multiple Masters, leader election and state recovery; running applications are unaffected
● alternatively, local filesystem recovery mode just restarts the Master if it goes down; single node only, works better with an external monitor

● example 2 start cluster
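On the driver side, pointing an application at a standalone cluster and limiting its share of resources might look like the sketch below. The host name, the resource values and the object name are placeholders; spark.cores.max and spark.executor.memory are standard Spark configuration keys. Only the configuration is built here, so the snippet runs without a cluster.

```scala
import org.apache.spark.SparkConf

// Hypothetical example object, not part of the original slides.
object StandaloneConfSketch {
  // host and port are wherever sbin/start-master.sh was run (7077 is the default port).
  def conf(host: String, port: Int): SparkConf =
    new SparkConf()
      .setAppName("standalone-sketch")
      .setMaster(s"spark://$host:$port")
      .set("spark.cores.max", "4")         // cap on the total cores this application takes
      .set("spark.executor.memory", "2g")  // memory per executor
}
```

Because the scheduler is FIFO across applications, capping the cores per application is what lets several applications run at the same time.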

YARN mode
● YARN: Yet Another Resource Negotiator
● decouples resource management and scheduling from the data processing framework
● exclusive to the Hadoop ecosystem
● requires a binary distribution of Spark built with YARN support
● uses the Hadoop configuration found via HADOOP_CONF_DIR or YARN_CONF_DIR
● master is set to either yarn-client or yarn-cluster

Mesos mode
● Mesos is a cluster operating system
● abstracts CPU, memory, storage and other resources, enabling fault-tolerant and elastic distributed systems
● can run Spark along with other applications (Hadoop, Kafka, ElasticSearch, Jenkins, ...) and manage resources and scheduling across the whole cluster and all the applications
● Mesos master replaces Spark Master as Cluster Manager
● Spark binary must be accessible by Mesos (config)
● mesos://HOST:PORT for single-master Mesos, or mesos://zk://HOST:PORT for multi-master Mesos using ZooKeeper for failover
● in “fine-grained” mode (default), each Spark task runs as a separate Mesos task. This allows multiple instances of Spark (and other frameworks) to share machines at a very fine granularity
● the “coarse-grained” mode will instead launch only one long-running Spark task on each Mesos machine, and dynamically schedule its own “mini-tasks” within it
● project Myriad
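A driver-side configuration for Mesos could be sketched as below. The master URL and the tarball location are placeholders supplied by the caller, and the object name is invented; spark.mesos.coarse and spark.executor.uri are the Spark settings for choosing the scheduling mode and for telling Mesos where to fetch the Spark binary. The mode is set explicitly here rather than relying on the default.

```scala
import org.apache.spark.SparkConf

// Hypothetical example object, not part of the original slides.
object MesosConfSketch {
  // master:       mesos://HOST:PORT or mesos://zk://HOST:PORT
  // coarse:       true for coarse-grained mode, false for fine-grained mode
  // sparkTarball: location of a Spark distribution that Mesos slaves can download
  def conf(master: String, coarse: Boolean, sparkTarball: String): SparkConf =
    new SparkConf()
      .setAppName("mesos-sketch")
      .setMaster(master)
      .set("spark.mesos.coarse", coarse.toString)
      .set("spark.executor.uri", sparkTarball)
}
```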

Spark shell

● utility to connect to a cluster or to local Spark
● no need to write a program
● constructs and provides a SparkContext
● similar to the Scala console
● example 3 shell

Job deployment

● client or cluster mode
● spark-submit script
● Spark driver program
● both allow writing the same programs; they differ only in how the job is deployed to the cluster

Spark submit script

● need to build and submit a jar with all dependencies (the dependencies need to be available at worker nodes)
● all other jars need to be specified using --jars
● Spark and Hadoop dependencies can be marked as provided
● ./bin/spark-submit
    ○ --class <main class>
    ○ --master <master>
    ○ --deploy-mode <deploy mode>
    ○ --conf <key>=<value>
    ○ <application jar>
    ○ <application arguments>
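An application meant to be launched through spark-submit could look like the following sketch. The object name WordCountJob and the word-count logic are illustrative only. The point is that main does not call setMaster, because the master is supplied by the --master option of the script.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example job, not part of the original slides.
object WordCountJob {
  // Counts word occurrences in the given lines.
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Long] =
    sc.parallelize(lines)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .countByValue()   // action: brings the counts back to the driver
      .toMap

  def main(args: Array[String]): Unit = {
    // No setMaster here: spark-submit provides the master via --master.
    val sc = new SparkContext(new SparkConf().setAppName("word-count-job"))
    try countWords(sc, args.toSeq).foreach { case (word, n) => println(s"$word: $n") }
    finally sc.stop()
  }
}
```

Packaged into a jar, it would be submitted with --class WordCountJob, followed by the jar path and the lines to count as application arguments.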

Spark submit script

● non-trivial automation
● need to build the application jar, have it available at the driver, submit the job with arguments and collect the result
● deployment pipeline necessary

Spark driver program

● can be part of a Scala/Akka application and execute Spark jobs
● needs its Spark dependencies bundled; they cannot be marked as provided
● jars need to be specified using the .setJars() method
● running Spark applications, passing parameters and retrieving results is the same as running any other code
● dependency management, versions, compatibility, jar size
● one SparkContext per JVM

● example 3 submit script
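The driver-program approach can be sketched as below: the host application builds the SparkContext itself, names its jars with setJars, and calls a Spark job like any other method. All names are invented for illustration, and the example runs with a local master and no extra jars; against a real cluster the master would be spark://HOST:PORT and appJars would list the application jar.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, not part of the original slides.
object EmbeddedDriverSketch {
  // Configuration built inside the host application instead of by spark-submit.
  def conf(master: String, appJars: Seq[String]): SparkConf =
    new SparkConf()
      .setAppName("embedded-driver-sketch")
      .setMaster(master)
      .setJars(appJars)   // jars shipped to the Executors

  // A Spark job invoked like any other method: parameters in, result out.
  def countEven(sc: SparkContext, upTo: Int): Long =
    sc.parallelize(1 to upTo).filter(_ % 2 == 0).count()

  def main(args: Array[String]): Unit = {
    // Only one SparkContext per JVM, so create it once and stop it when done.
    val sc = new SparkContext(conf("local[2]", Seq.empty))
    try println(countEven(sc, 100))
    finally sc.stop()
  }
}
```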


● streaming
    ○ Kafka, Flume, Kinesis, Twitter, ZeroMQ, MQTT
● batch
    ○ HDFS, Cassandra, HBase, Amazon S3, …
    ○ text files, SequenceFiles, any Hadoop InputFormat
    ○ when loading a local file, the file must be present on the worker nodes at the given path; either copy it there or use a distributed file system
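Reading a text file as a batch source might be sketched as follows. The object name and the temporary file are made up for the example. It works as written only because local mode runs the driver and the Executors on the same machine; on a cluster the same path would have to exist on every worker, or point at a shared file system.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, not part of the original slides.
object TextFileSketch {
  // Counts the lines of a text file; path may be a local or a distributed file system URI.
  def countLines(sc: SparkContext, path: String): Long =
    sc.textFile(path).count()

  def main(args: Array[String]): Unit = {
    // Write a small input file so that the example is self-contained.
    val tmp = Files.createTempFile("spark-input", ".txt")
    Files.write(tmp, "first\nsecond\nthird\n".getBytes(StandardCharsets.UTF_8))

    val sc = new SparkContext(new SparkConf().setAppName("text-file-sketch").setMaster("local[2]"))
    try println(countLines(sc, tmp.toUri.toString))
    finally sc.stop()
  }
}
```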


● getting started with Spark is relatively simple
● tools simplifying development (console, local mode)
● cluster deployment is fragile and difficult to troubleshoot
● networking uses Akka remoting