COSC 6339
Big Data Analytics
Introduction to Spark
Edgar Gabriel
Spring 2017
What is SPARK?
• In-Memory Cluster Computing for Big Data Applications
• Fixes the weaknesses of MapReduce
– Iterative applications
– Streaming data processing
– Keep data in memory across different functions
• Spark works across many environments
– Standalone
– Hadoop
– Mesos
• Spark supports accessing data from diverse sources (HDFS,
HBase, Cassandra, …)
What is SPARK (II)
• Three modes of execution
– Spark shell
– Spark scripts
– Spark code
• API defined for multiple languages
– Scala
– Python
– Java
– R
A couple of words on Scala
• Object-oriented language: everything is an object and
every operation is a method-call.
• Scala is also a functional language
– Functions are first class values
– Can be passed as arguments to functions
– Functions should be free of side effects
– Functions can be defined inside other functions
• Scala runs on the JVM
– Java and Scala classes can be freely mixed
Spark Essentials
• A Spark program has to create a SparkContext object,
which tells Spark how to access a cluster
• In the Scala or Python shell this is done automatically; the context is accessible through the sc variable
• Standalone programs must instantiate a new SparkContext via its constructor (see the sketch after the shell session below)
gabriel@whale:> pyspark
…
Using Python version 2.7.6 (default, Nov 21 2013 15:55:38)
SparkSession available as 'spark'.
>>> sc
<pyspark.context.SparkContext object at 0x2609ed0>
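In a standalone program the context has to be created explicitly. A minimal sketch, where the application name "MyApp" and the master URL local[4] are placeholders chosen for this example:

from pyspark import SparkConf, SparkContext

# configure the application; name and master URL are example values
conf = SparkConf().setAppName("MyApp").setMaster("local[4]")
# create the context that connects the program to the cluster
sc = SparkContext(conf=conf)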
Spark Essentials
• The master parameter for a SparkContext determines which cluster manager and resources to use, e.g.
whale> pyspark --master yarn
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
SPARK cluster utilization
1. the driver (SparkContext) connects to a cluster manager to
allocate resources across applications
2. Spark acquires executors on cluster nodes – processes that run
compute tasks and cache data
3. the driver sends application code to the executors
4. the driver sends tasks for the executors to run
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
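For a standalone application this connection is usually established by launching it with spark-submit, which carries the master and resource settings. A sketch, where wordcount.py and the resource numbers are only placeholders:

gabriel@whale:~> spark-submit --master yarn \
    --num-executors 4 --executor-memory 2g \
    wordcount.py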
SPARK master parameter
Master URL          Meaning
local               Run Spark locally with one worker thread
                    (i.e. no parallelism at all).
local[K]            Run Spark locally with K worker threads
                    (ideally, set this to the number of cores on
                    your machine).
spark://HOST:PORT   Connect to the given Spark standalone cluster
                    master. The port must be whichever one your
                    master is configured to use (7077 by default).
mesos://HOST:PORT   Connect to the given Mesos cluster. The port
                    must be whichever one your master is
                    configured to use (5050 by default). For a
                    Mesos cluster using ZooKeeper, use
                    mesos://zk://....
yarn                Connect to a YARN cluster in cluster mode. The
                    cluster location will be found based on
                    HADOOP_CONF_DIR.
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
Programming Model
• Resilient distributed datasets (RDDs)
– Immutable collections partitioned across cluster nodes that can be rebuilt if a partition is lost
– Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
• Two types of RDDs defined today:
– parallelized collections – take an existing collection and
run functions on it in parallel
– Hadoop datasets – run functions on each record of a file
in the Hadoop distributed file system or any other storage
system supported by Hadoop
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
Programming Model (II)
• Two types of operations on RDDs: transformations and
actions
– transformations are lazy (not computed immediately);
instead, Spark remembers the transformations applied to
some base dataset (see the shell sketch below)
– the transformed RDD is (re)computed when an action is
run on it (the default behavior)
• optimize the required calculations
• recover from lost data partitions
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
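A short shell sketch of this lazy behavior: the map below is only recorded, and nothing is computed until the count() action runs.

>>> numbers = sc.parallelize(range(10))
>>> doubled = numbers.map(lambda x: x * 2)  # transformation: recorded, not executed
>>> doubled.count()                         # action: triggers the computation
10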
Programming Model (III)
• Spark can create RDDs from any file stored in HDFS or other
storage systems supported by Hadoop (local file system,
Amazon S3, Hypertable, HBase, etc.)
• PySpark supports only text files
• Scala Spark also supports SequenceFiles and any other Hadoop
InputFormat
Creating a simple RDD
>>> numbers = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> numbers.collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> numbers.count()
10
>>> numbers2 = sc.parallelize(range(0, 1000000))
>>> numbers2.getNumPartitions()
4
>>> quit()
collect(): returns a list that contains all elements of this RDD
(should only be used if the resulting list is expected to be small)
count(): returns the number of elements in this RDD
getNumPartitions(): returns the number of splits (partitions) of this RDD
Creating an RDD from an Input File
gabriel@whale:~> hdfs dfs -cat /gabriel/simple-input.txt
line1 value1
line2 value2
line3 value3
line4 value4
gabriel@whale:~> pyspark
>>> text=sc.textFile("/gabriel/simple-input.txt")
>>> text.collect()
[u'line1\tvalue1', u'line2\tvalue2', u'line3\tvalue3',
u'line4\tvalue4']
>>> text.count()
4
>>>
>>> lines = text.map(lambda w : (w, 1) )
>>> lines.collect()
[(u'line1\tvalue1', 1), (u'line2\tvalue2', 1),
(u'line3\tvalue3', 1), (u'line4\tvalue4', 1)]
>>> words = text.flatMap(lambda l: l.split())
>>> words.collect()
[u'line1', u'value1', u'line2', u'value2', u'line3',
u'value3', u'line4', u'value4']
>>> wcount = words.map(lambda w: (w, 1))
>>> wcount.collect()
[(u'line1', 1), (u'value1', 1), (u'line2', 1), (u'value2', 1),
(u'line3', 1), (u'value3', 1), (u'line4', 1), (u'value4', 1)]
Shorter version:
>>> wcounts = text.flatMap(lambda l: l.split()).map(lambda w: (w, 1))
Notes:
• lambda: anonymous function in Python
• map: maps every word w to a pair (w, 1)
• flatMap: maps every line l to the sequence of words returned by its split() operation
Transformations
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
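A few common transformations, sketched in the PySpark shell; each one returns a new RDD and is only executed once an action (here collect()) is called:

>>> rdd = sc.parallelize([1, 2, 2, 3, 4])
>>> rdd.map(lambda x: x * 10).collect()         # apply a function to every element
[10, 20, 20, 30, 40]
>>> rdd.filter(lambda x: x % 2 == 0).collect()  # keep only elements matching a predicate
[2, 2, 4]
>>> rdd.flatMap(lambda x: [x, x]).collect()     # map every element to a sequence, then flatten
[1, 1, 2, 2, 2, 2, 3, 3, 4, 4]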
Actions
Slide based on a talk found at http://cdn.liber118.com/workshop/itas_workshop.pdf
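A few common actions, sketched in the PySpark shell; the output path in the last line is just a placeholder in the style of the other examples:

>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.count()                     # number of elements in the RDD
5
>>> rdd.take(2)                     # first two elements, returned to the driver
[1, 2]
>>> rdd.reduce(lambda a, b: a + b)  # combine all elements with a function
15
>>> rdd.saveAsTextFile("/gabriel/numbers.out")  # write one output file per partition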
Pyspark wordcount example
>>> from operator import add
>>> text=sc.textFile("/gabriel/simple-input.txt")
>>> words = text.flatMap(lambda l: l.split()).map(lambda w: (w, 1))
>>> counts = words.reduceByKey(add)
>>> counts.saveAsTextFile("/gabriel/wordcount.out")
reduceByKey(func, numPartitions=None, partitionFunc=<f>)
Merge the values for each key using an associative and commutative reduce
function.
• This will also perform the merging locally on each mapper before sending
results to a reducer, similarly to a “combiner” in MapReduce.
• Output will be partitioned with numPartitions partitions, or the default parallelism level if numPartitions is not specified.
gabriel@whale:~> hdfs dfs -ls /gabriel/wordcount.out
Found 5 items
-rw-r--r-- 3 gabriel hadoop 0 /gabriel/wordcount.out/_SUCCESS
-rw-r--r-- 3 gabriel hadoop 29 /gabriel/wordcount.out/part-00000
-rw-r--r-- 3 gabriel hadoop 29 /gabriel/wordcount.out/part-00001
-rw-r--r-- 3 gabriel hadoop 29 /gabriel/wordcount.out/part-00002
-rw-r--r-- 3 gabriel hadoop 29 /gabriel/wordcount.out/part-00003
gabriel@whale:~> hdfs dfs -cat /gabriel/wordcount.out/part-00000
(u'value3', 1)
(u'line3', 1)
>>> from operator import add
>>> text=sc.textFile("/gabriel/simple-input.txt")
>>> words = text.flatMap(lambda line: line.split()).map(lambda w: (w, 1))
>>> counts = words.reduceByKey(add, numPartitions=1)
>>> counts.saveAsTextFile("/gabriel/wordcount2.out")
gabriel@whale:~> hdfs dfs -ls /gabriel/wordcount2.out
Found 2 items
-rw-r--r-- 3 gabriel hadoop 0 /gabriel/wordcount2.out/_SUCCESS
-rw-r--r-- 3 gabriel hadoop 116 /gabriel/wordcount2.out/part-00000
Persistence
• Spark can persist (or cache) a dataset in memory across
operations
• Each node stores in memory any slices of it that it
computes and reuses them in other actions on that
dataset – often making future actions more than 10x
faster
• The cache is fault-tolerant: if any partition of an RDD
is lost, it will automatically be recomputed using the
transformations that originally created it
>>> text=sc.textFile("/gabriel/simple-input.txt")
>>> lines=text.flatMap(lambda l: l.split()).cache()
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
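Continuing the example above: the first action materializes and caches the flattened RDD, and later actions on it can reuse the cached partitions instead of re-reading the file (the counts follow from the four two-word lines of simple-input.txt):

>>> lines.count()   # first action: reads the file, computes the RDD, and caches it
8
>>> lines.count()   # second action: served from the cached partitions
8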
Broadcast variables
• Broadcast variables let the programmer keep a read-only
variable cached on each machine rather than shipping a
copy of it with tasks
• For example, to give every node a copy of required
accuracy, number of iterations, etc.
• Spark attempts to distribute broadcast variables using
efficient broadcast algorithms to reduce
communication cost
>>> b = sc.broadcast([1, 2, 3])
>>> b.value
[1, 2, 3]
>>>
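Tasks read a broadcast variable through its value attribute; a small sketch reusing b from above inside a transformation:

>>> rdd = sc.parallelize([10, 20, 30])
>>> rdd.map(lambda x: x + sum(b.value)).collect()  # every task reads the broadcast list
[16, 26, 36]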
Literature
• http://spark.apache.org/docs/latest/api/python/index.html
• https://www.codementor.io/jadianes/spark-python-rdd-basics-du107x2ra
• A large number of books on Spark and related projects is available by now.