Apache Spark - Installation

Transcript of Apache Spark - Installation

  • Apache Spark - Installation

    Martin Zapletal, Cake Solutions

  • Apache Spark and Big Data

    1) History and market overview
    2) Installation
    3) MLlib and machine learning on Spark
    4) Porting R code to Scala and Spark
    5) Concepts - Core, SQL, GraphX, Streaming
    6) Spark's distributed programming model
    7) Deployment

  • Table of Contents

    Spark architecture
    download, versions, install, startup
    Cluster managers
      Local
      Standalone
      Mesos
      YARN
    Spark shell
    Job deployment
    Streaming job deployment
    Integration with other tools

    After this session you should be able to install Spark, run a Spark cluster and deploy basic jobs.

  • Installation

    prebuilt packages for different versions of Hadoop, CDH (Cloudera's distribution of Hadoop) and MapR (MapR's distribution of Hadoop); currently only Scala 2.10 is supported

    build from source: uses mvn, but has a sbt wrapper; the Hadoop version to build against needs to be specified; can be built with Scala 2.11 support
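
    If you consume Spark as a library in your own sbt project instead of (or in addition to) a prebuilt distribution, a minimal build.sbt sketch might look like this (the version numbers are illustrative; match them to your cluster):

    // build.sbt - minimal sketch for depending on Spark from sbt (versions are illustrative)
    scalaVersion := "2.10.4"                                            // prebuilt Spark packages target Scala 2.10
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"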

  • Spark architecture

    cluster is persistent, users submit Jobs
    the SparkContext (driver) contacts the Cluster Manager, which assigns cluster resources
    it then sends application code to the assigned Executors (distributing computation, not data!)
    finally it sends tasks to the Executors to run
    each master and worker runs a webUI that displays task progress and results
    each application (SparkContext) has its own executors (not shared), living for the whole duration of the program and running in separate JVMs using multiple threads
    Cluster Manager agnostic: Spark only needs to acquire executors and have them communicate with each other
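
    To illustrate the driver/executor split, a minimal self-contained driver program might look like the sketch below (the master URL is a placeholder); the SparkContext is the driver, and the map/reduce work runs in the executors assigned by the cluster manager:

    // minimal sketch of a driver program; the master URL is a placeholder
    import org.apache.spark.{SparkConf, SparkContext}

    object ArchitectureExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("architecture-example")
          .setMaster("local[2]")                  // replace with your cluster manager URL
        val sc = new SparkContext(conf)           // the driver; contacts the Cluster Manager

        val data = sc.parallelize(1 to 1000)      // RDD partitions are distributed to the executors
        val sum = data.map(_ * 2).reduce(_ + _)   // tasks run on the executors, the result returns to the driver
        println(s"sum = $sum")

        sc.stop()                                 // release the executors
      }
    }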

  • Spark streaming

    mostly similar to the batch architecture
    Receiver components consume from the data source
    the Receiver sends information to the driver program, which then schedules tasks (discretized streams, small batches) to run in the cluster
    the number of assigned cores must be higher than the number of Receivers
    different job lifecycle: potentially unbounded, needs to be stopped by calling ssc.stop()
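
    A minimal streaming sketch, assuming a plain text source on localhost:9999 (e.g. started with nc -lk 9999); local[2] gives one core to the Receiver and one to batch processing, matching the "more cores than Receivers" rule:

    // minimal streaming sketch; host and port are placeholders
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))        // 1-second micro-batches (discretized stream)

    val lines = ssc.socketTextStream("localhost", 9999)     // Receiver consuming from the data source
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()                                             // start the Receiver and batch scheduling
    ssc.awaitTermination()                                  // runs until stopped, e.g. by ssc.stop()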

  • SparkContext

    passing configuration, accessing the cluster
    the SparkContext is then used to create RDDs from input data from various sources

    val conf = new SparkConf().setAppName(appName).setMaster(master)
    val sc = new SparkContext(conf)

    val conf = new SparkConf().setAppName(appName).setMaster(master)
    val ssc = new StreamingContext(conf, Seconds(1))
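
    Once the context exists, it is used to create RDDs from various sources, for example (paths are placeholders):

    // creating RDDs from different sources, reusing the sc defined above; paths are placeholders
    val numbers = sc.parallelize(Seq(1, 2, 3, 4))              // from an in-memory collection
    val logs = sc.textFile("hdfs://namenode/data/input.txt")   // from HDFS (file:// and s3n:// paths work as well)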

  • Spark architecture

    5 modes:
    1. local
    2. standalone
    3. YARN
    4. Mesos
    5. Amazon EC2

  • Local mode

    for application development purposes, no cluster required

    local
    Run Spark locally with one worker thread (i.e. no parallelism at all).

    local[K]
    Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).

    local[*]
    Run Spark locally with as many worker threads as logical cores on your machine.

    example local
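
    A local-mode configuration might look like this sketch (the application name is illustrative):

    // local mode: no cluster required, everything runs in one JVM
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("local-example")
      .setMaster("local[*]")        // one worker thread per logical core; use local or local[K] to fix the number
    val sc = new SparkContext(conf)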

  • Standalone mode

    place a compiled version of Spark at each node
    deployment scripts:

    sbin/start-master.sh
    sbin/start-slaves.sh
    sbin/stop-all.sh

    various settings, e.g. port, webUI port, memory, cores, java opts
    drivers use spark://HOST:PORT as master
    only supports a simple FIFO scheduler

    application or global config decides how many cores and how much memory will be assigned to it
    resilient to Worker failures; the Master is a single point of failure
    supports ZooKeeper for multiple Masters, leader election and state recovery; running applications are unaffected
    alternatively, local filesystem recovery mode just restarts the Master if it goes down (single node, better with an external monitor)

    example 2 start cluster
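
    A driver pointing at a standalone cluster might be configured like this sketch (the host name is a placeholder; 7077 is the default Master port):

    // standalone mode: connect to a running Master; host name is a placeholder
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("standalone-example")
      .setMaster("spark://master-host:7077")
      .set("spark.executor.memory", "2g")     // example per-executor memory setting
      .set("spark.cores.max", "4")            // example cap on total cores used by this application
    val sc = new SparkContext(conf)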

  • YARN mode

    Yet Another Resource Negotiator
    decouples resource management and scheduling from the data processing framework
    exclusive to the Hadoop ecosystem
    requires a binary distribution of Spark built with YARN support
    uses the Hadoop configuration via HADOOP_CONF_DIR or YARN_CONF_DIR
    master is set to either yarn-client or yarn-cluster
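
    A rough sketch of yarn-client mode set programmatically, assuming HADOOP_CONF_DIR (or YARN_CONF_DIR) points at the cluster configuration and the YARN-enabled Spark jars are on the classpath; in practice jobs are usually submitted with spark-submit instead:

    // hypothetical sketch: yarn-client mode from code; configuration comes from HADOOP_CONF_DIR
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("yarn-client-example")
      .setMaster("yarn-client")       // yarn-cluster mode has to go through spark-submit
    val sc = new SparkContext(conf)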

  • Mesos mode

    Mesos is a cluster operating system
    abstracts CPU, memory, storage and other resources, enabling fault tolerant and elastic distributed systems
    can run Spark along with other applications (Hadoop, Kafka, ElasticSearch, Jenkins, ...) and manage resources and scheduling across the whole cluster and all the applications
    the Mesos master replaces the Spark Master as Cluster Manager
    the Spark binary must be accessible by Mesos (config)
    mesos://HOST:PORT for single master Mesos, or mesos://zk://HOST:PORT for multi master Mesos using ZooKeeper for failover
    in fine-grained mode (default), each Spark task runs as a separate Mesos task; this allows multiple instances of Spark (and other frameworks) to share machines at a very fine granularity
    the coarse-grained mode will instead launch only one long-running Spark task on each Mesos machine, and dynamically schedule its own mini-tasks within it
    project Myriad
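
    A Mesos configuration might look like this sketch (the ZooKeeper quorum and the executor URI are placeholders):

    // Mesos mode sketch; hosts and paths are placeholders
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("mesos-example")
      .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")    // multi-master Mesos via ZooKeeper
      .set("spark.executor.uri", "hdfs://namenode/dist/spark-1.3.1-bin-hadoop2.4.tgz")  // Spark binary accessible by Mesos
      .set("spark.mesos.coarse", "true")                    // coarse-grained mode; the default is fine-grained
    val sc = new SparkContext(conf)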

  • Spark shell

    utility to connect to a cluster/local Spark
    no need to write a program
    constructs and provides a SparkContext
    similar to the Scala console

    example 3 shell
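
    Inside the shell a SparkContext is already available as sc, so you can experiment directly, for example (the path is a placeholder):

    // typed at the spark-shell prompt; sc is provided by the shell
    val lines = sc.textFile("README.md")          // any path reachable by the workers
    lines.filter(_.contains("Spark")).count()     // runs the job and returns the count to the shell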

  • Job deployment

    client or cluster mode
    spark-submit script
    Spark driver program

    allows writing the same programs; they only differ in how they are deployed to the cluster

  • Spark submit script

    need to build and submit a jar with all dependencies (the dependencies need to be available at the worker nodes)

    all other jars need to be specified using --jars; Spark and Hadoop dependencies can be marked as provided

    ./bin/spark-submit --class <main-class> --master <master-url> --deploy-mode <deploy-mode> --conf <key>=<value> <application-jar> [application-arguments]

  • Spark submit script

    non-trivial automation: need to build the application jar, have it available at the driver, submit the job with arguments and collect the result

    a deployment pipeline is necessary

  • Spark driver program

    can be part of a Scala/Akka application and execute Spark jobs
    needs its dependencies; they cannot be marked as provided
    jars need to be specified using the .setJars() method
    running Spark applications, passing parameters and retrieving results works the same as just running any other code

    watch out for dependency management, versions, compatibility and jar size
    one SparkContext per JVM

    example 3 submit script
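
    An embedded driver might be configured like this sketch, assuming the application has been assembled into a jar at the (placeholder) path below:

    // driver embedded in a Scala/Akka application; master URL and jar path are placeholders
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("embedded-driver")
      .setMaster("spark://master-host:7077")
      .setJars(Seq("target/scala-2.10/myapp-assembly-1.0.jar"))   // ship the application jar (and any others) to the executors
    val sc = new SparkContext(conf)                               // remember: one SparkContext per JVM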

  • Integration

    streaming: Kafka, Flume, Kinesis, Twitter, ZeroMQ, MQTT

    batch: HDFS, Cassandra, HBase, Amazon S3, text files, SequenceFiles, any Hadoop InputFormat
    when loading a local file, the file must be present on the worker nodes at the given path; you need to either copy it there or use a distributed filesystem
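
    For example, batch sources are read through the SparkContext, while streaming sources use connector libraries such as spark-streaming-kafka (hosts, paths and topic names below are placeholders):

    // batch: any Hadoop-supported filesystem or InputFormat; paths are placeholders
    val fromHdfs = sc.textFile("hdfs://namenode/data/events")
    val fromS3 = sc.textFile("s3n://my-bucket/logs/*.gz")

    // streaming: Kafka via the spark-streaming-kafka connector (a separate dependency), using an existing StreamingContext ssc
    import org.apache.spark.streaming.kafka.KafkaUtils
    val kafkaStream = KafkaUtils.createStream(
      ssc,                        // existing StreamingContext
      "zk1:2181",                 // ZooKeeper quorum
      "my-consumer-group",        // consumer group id
      Map("events" -> 1))         // topics and number of receiver threads per topic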

  • Conclusion

    getting started with Spark is relatively simple
    tools simplify development (console, local mode)
    cluster deployment is fragile and difficult to troubleshoot
    networking uses Akka remoting

  • Questions