Distributed Computing with Apache Spark
Convex and distributed optimization (3 ECTS)
Master of Science in Industrial and Applied Mathematics
2016
Original motivation
Google Inc. - Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI, 2004
• Data and processing colocation.
• Process the data where it is.
• Avoid networks and I/Os.
This is the birth certificate of Big Data (i.e. MapReduce)
Original context
• Simple processing: indexing, statistics, queries or frequent words.
• Huge amounts of data: web pages, logs, documents (texts, images, videos).
Original challenges
• Data distribution
• Parallel processing
• Fault management
• Cost reduction (commodity PCs)
MapReduce ∼ Functional programming
Characteristics
• Operations sequenced by composition: (f ◦ g)(x) = f(g(x)). No order in the declarations.
• The result of a function depends only on its inputs (pure functions, no state).
• Data/variables are not modifiable: no assignment, no explicit management of the memory.
Functional Inspiration of MapReduce
• MapReduce pipeline: reduce(⊕) ◦ grp ◦ map(f)
• Can be automatically parallelized on several computation units.
Map function
map : (A → B) → ([A] → [B])
map(f)[x0, ..., xn] = [f(x0), ..., f(xn)]
map(∗2)[2, 3, 6] = [4, 6, 12]
Map prototype in MapReduce
• Map documentation: (K1, V1) → [(K2, V2)]. Map is a particular prototype of the f of map(f).
• Apply f to a collection of key/value pairs: for each pair (k, v), compute f(k, v).
Pseudocode example
function map(uri, document):
    for each distinct term in document:
        output(term, count(term, document))
Map function
Algebraic properties of Map
• map(id) = id with id(x) = x
• map(f ◦ g) = map(f) ◦ map(g)
• map(f)[x] = [f(x)]
• map(f)(xs ++ ys) = map(f)(xs) ++ map(f)(ys)
Application
• Simplification and automatic program rewriting.
• (Algebraic) proofs of equivalence.
• Automatic parallelization of computations.
Sort/Group/Shuffle function
grp : [(A × B)] → [(A × [B])]
grp[..., (w, a0), ..., (w, an), ...] = [..., (w, [a0, ..., an]), ...]
grp[('a', 2), ('z', 2), ('ab', 3), ('a', 4)] = [('a', [2, 4]), ('z', [2]), ('ab', [3])]
Sort/Group/Shuffle prototype in MapReduce
• Documentation: grp : [(K2, V2)] → [(K2, [V2])]
• Recalls the GROUP BY/ORDER BY instructions in SQL.
• grp is called transparently between the Map and Reduce phases.
Reduce function
reduce : (A × A → A) → ([A] → A)
reduce(⊕)[x0, ..., xn] = x0 ⊕ x1 ⊕ ... ⊕ xn−1 ⊕ xn
reduce(+)[2, 1, 3] = 2 + 1 + 3 = 6
Reduce prototype in MapReduce
• Documentation: reduce : (K2, [V2]) → [(K3, V3)]
• Reduce is a particular prototype for reduce(⊕): we apply ⊕ to the collection of values associated with each key.
Pseudocode example
function reduce(term, counts):
    output(term, sum(counts))
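As a concrete illustration, here is a minimal single-machine Python sketch of the reduce(⊕) ◦ grp ◦ map(f) pipeline applied to word counting; the function names (wc_map, grp, wc_reduce) and the documents are illustrative, not part of any framework.

from collections import defaultdict

def wc_map(uri, document):
    # emit one (term, count) pair per distinct term of the document
    words = document.split()
    return [(term, words.count(term)) for term in set(words)]

def grp(pairs):
    # group values by key, as the shuffle phase would
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def wc_reduce(term, counts):
    return (term, sum(counts))

documents = {"doc1": "to be or not to be", "doc2": "to do is to be"}
mapped = [pair for uri, doc in documents.items() for pair in wc_map(uri, doc)]
print([wc_reduce(term, counts) for term, counts in grp(mapped)])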
Example : Matrix-Vector Multiplication
Let A be an m × n matrix and v be a vector of length n:

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}, \qquad v = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}
The product Av is a vector of length m
Av = \begin{pmatrix} \sum_{j=1}^{n} a_{1j} v_j \\ \sum_{j=1}^{n} a_{2j} v_j \\ \vdots \\ \sum_{j=1}^{n} a_{mj} v_j \end{pmatrix}
Example : Matrix-Vector Multiplication
MapReduce pseudocode for computing the matrix-vector product
map(key, value):
    for (i, j, a_ij) in value:
        emit(i, a_ij * v[j])

reduce(key, values):
    result = 0
    for value in values:
        result += value
    emit(key, result)
Communication costs
• Map tasks: O(mn + n)
• Reduce tasks: O(mn)
Example : Logistic Regression
We choose the hypothesis hθ(x) = 1/(1 + exp(−θᵀx)) and fit θ using Newton-Raphson:
θ := θ − H⁻¹ ∇θ l(θ), where l(θ) is the log-likelihood function.
∇θ l(θ) is computed in parallel: each mapper sums, over the examples i of its subgroup,

\sum_{i \in \text{subgroup}} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}
The Hessian matrix is computed by the mappers with the summation

H(j, k) := H(j, k) + h_\theta(x^{(i)}) \left( h_\theta(x^{(i)}) - 1 \right) x_j^{(i)} x_k^{(i)}
The reducer sums up the gradient and Hessian values to perform the θ update.
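A minimal NumPy sketch of one such distributed Newton-Raphson step, simulating the mappers by splitting the rows into chunks (the data, the number of chunks and all names are illustrative, not the course's reference implementation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def partial_stats(X_chunk, y_chunk, theta):
    # one "mapper": partial gradient and Hessian sums over its subgroup
    h = sigmoid(X_chunk.dot(theta))
    grad = X_chunk.T.dot(y_chunk - h)                      # sum_i (y_i - h_i) x_i
    H = X_chunk.T.dot(X_chunk * (h * (h - 1))[:, None])    # sum_i h_i (h_i - 1) x_i x_i^T
    return grad, H

np.random.seed(0)
X = np.random.randn(1000, 3)
y = (X.dot(np.array([1.0, -2.0, 0.5])) > 0).astype(float)
theta = np.zeros(3)

chunks = np.array_split(np.arange(len(y)), 4)              # 4 "mappers"
parts = [partial_stats(X[idx], y[idx], theta) for idx in chunks]
grad = sum(g for g, _ in parts)                            # "reducer": sum the partials
H = sum(h for _, h in parts)
theta = theta - np.linalg.solve(H, grad)                   # theta := theta - H^{-1} grad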
Example : Support Vector Machine
Linear SVM’s goal is to optimize the primal problem
\arg\min_{\omega, b} \ \|\omega\|^2 + C \sum_{i : \zeta_i \geq 0} \zeta_i^p \quad \text{s.t.} \quad y^{(i)} (\omega^T x^{(i)} + b) \geq 1 - \zeta_i
where p is either 1 (hinge loss) or 2 (quadratic loss).
The primal problem for the quadratic loss can be solved by batch gradient descent (sv denotes the set of support vectors):

\nabla = 2\omega + 2C \sum_{i \in sv} (\omega^T x_i - y_i)\, x_i \qquad H = I + C \sum_{i \in sv} x_i x_i^T
The mappers calculate the partial gradients and the reducer sums up the partial results to update ω.
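A minimal NumPy sketch of this scheme for the quadratic loss, simulating the mappers by splitting the rows into chunks and taking plain gradient steps (the data, learning rate and number of chunks are illustrative assumptions):

import numpy as np

def partial_gradient(X_chunk, y_chunk, w, C):
    # one "mapper": gradient contribution of this chunk's support vectors
    sv = y_chunk * X_chunk.dot(w) < 1
    return 2 * C * X_chunk[sv].T.dot(X_chunk[sv].dot(w) - y_chunk[sv])

np.random.seed(0)
X = np.random.randn(500, 3)
y = np.sign(X.dot(np.array([1.0, -1.0, 0.5])))
w, C, lr = np.zeros(3), 1.0, 0.01
chunks = np.array_split(np.arange(len(y)), 4)    # 4 "mappers"

for _ in range(100):
    grad = 2 * w + sum(partial_gradient(X[idx], y[idx], w, C) for idx in chunks)
    w = w - lr * grad                            # "reducer": sum partials, update w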
Apache Hadoop
Distributed data storage + MapReduce processing
Traditional network programming
Message passing between nodes (e.g. MPI)

Very difficult to do at scale:
• How to split the problem across nodes?
• Must consider network & data locality.
• How to deal with failures? (inevitable at scale)
• Even worse: stragglers (a node has not failed, but is slow).
• Ethernet networking is not fast.
• Have to write programs for each machine.
Rarely used in commodity datacenters.
MapReduce limitations
Difficulty of programming directly in MapReduce

Constrained model
A Map phase, then a Reduce phase.

For complex and iterative algorithms we need to chain several MapReduce phases.
Data transfer between these phases: disk storage.

Most optimization algorithms are iterative!
Result & Verdict
While MapReduce is simple, it can require asymptotically more communication or I/O.
Research on MapReduce algorithms doesn't go to waste, it just gets sped up and easier to use.
Still useful to study as an algorithmic framework, silly to use directly.

Therefore, people built specialized systems...
Why Apache Spark?
Spark's goal was to generalize MapReduce to support new applications within the same engine.
Benefit for users - the same engine performs data extraction, model training and interactive queries.

Two small additions are enough to express the previous models:
• Fast data sharing.
• General directed acyclic execution graphs (DAGs).
This allows for an approach which is more efficient for the engine,and much simpler for the end users.
Disk vs Memory
L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 100 ns
Main memory reference: 100 ns
Disk seek: 10,000,000 ns
In-Memory Computing
Hadoop MapReduce: shares data on disk
Apache Spark: speeds up processing by using memory
History
Lightning-fast cluster computing
http://spark.apache.org
Originally developed at UC Berkeley (AMPLab).
Open sourced in 2009, implemented in Scala.
Adoption and use cases
eBay: uses Spark for log processing (aggregation) and analytics, ...
Kelkoo: uses Spark and Spark Streaming for product recommendation, BI, real-time filtering of malicious activity, data mining.
Moody's Analytics: uses Spark for its credit risk calculation platform, (C)VaR calculation, ...
Amazon, Yahoo!, TripAdvisor, Hitachi, NASA, Ooyala, Shopify, Samsung, Socialmetrix, ...
http://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Spark is Hadoop compatible
Integration with Hadoop and its ecosystem,
HBase, Cassandra, MongoDB ...
Spark is Fast
In-Memory Computing
Suitable for iterative algorithms
Holds the record for sorting 100 TB of data on disk.
Spark is Simple
Ease of development - simple & intuitive APIs
APIs in Java, Scala, Python (+ SQL, Clojure, R)
Spark is Interactive
Interactive mode (Spark Shell, PySpark), standalone mode

Spark UI
Application monitoring

Spark is Streaming
Real-time processing (micro-batching)
Spark Streaming is easier to use than Apache Storm
Spark is (very) Active
Most active open source community in big data
+500 contributors
... is well Documented
One can find many examples, presentations, videos, MOOCs, events, meetups, ...
https://sparkhub.databricks.com
... with a large open-source community
cf. http://spark-packages.org
cf. Github
Spark Ecosystem
SparkContext
The first thing a Spark program should do is create a SparkContext object, which tells Spark how to access a cluster.
In the shell (Scala or Python), a variable sc is automatically created.
Other programs must use a constructor to instantiate a newSparkContext.
SparkContext can be used to create other variables.
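A minimal sketch of instantiating a SparkContext in a standalone Python program (the application name and master URL below are just examples):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[4]")
sc = SparkContext(conf=conf)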
Master URLs
The master parameter determines which cluster to use.
Master               Description
local                Run Spark locally with one worker thread (i.e. no parallelism at all)
local[K]             Run Spark locally with K worker threads (ideally set to # cores on your machine)
spark://HOST:PORT    Connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT    Connect to a Mesos cluster; PORT depends on config (5050 by default)
yarn                 Connect to a YARN cluster in client or cluster mode
Shell Python (PySpark) locally with 4 cores
$ pyspark --master local[4]

Shell Python (PySpark) to a standalone cluster, i.e. cluster1
$ pyspark --master spark://cluster1:7077

Submit a job (Python script example.py) locally with 4 cores
$ spark-submit --master local[4] example.py

Submit a job to a standalone cluster, i.e. cluster1
$ spark-submit --master spark://cluster1:7077 example.py
RDD
Resilient Distributed Datasets (RDD)
Collections of objects across a cluster
• User-controlled partitioning.
• Stored in memory or on disk.
• Built via parallel transformations (map, filter, ...).
• Automatically rebuilt on failure.

There are two types of RDD:
• Parallelized collections - take an existing collection and run functions on it in parallel.
• Hadoop datasets - run functions on each record of a file in the Hadoop distributed file system or any other storage system supported by Hadoop (Cassandra, MongoDB, Amazon S3, Hypertable, HBase, ...).
http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
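A small sketch of the first type, a parallelized collection (assuming the shell variable sc; the data and the number of slices are arbitrary):

data = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)
print(data.reduce(lambda a, b: a + b))   # 15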
Create an RDD from a local file

alice = sc.textFile("Alice_in_Wonderland.txt")

Create an RDD from a Hadoop/HDFS distributed file

alice = sc.textFile("hdfs://books/les_miserables.txt")
Create a RDD from a Cassandra table (DataFrame)
# add the Cassandra connection host to the sqlContext
sqlContext.setConf("spark.cassandra.connection.host", "172.16.0.161")

metrics = sqlContext.read.format("org.apache.spark.sql.cassandra") \
                    .options(keyspace="ssc", table="metrics").load()
Operations on RDDs (two types)

Transformations
Manipulate an RDD, return another RDD.
Transformations are lazy, not computed immediately.
Parallel execution.
Optimize the required calculations.
Recover from lost data partitions.
Examples: map(), filter(), join(), groupByKey(), ...

Actions
These are the final operations (compute, persistence, ...).
Do not return an RDD.
Persistence in memory or on disk.
Examples: reduce(), count(), foreach(), saveAsHadoopFile(), ...
A transformed RDD is calculated when an action is executed.
Counting words in "Alice in Wonderland"
alice = sc.textFile("Alice_in_Wonderland.txt")
counts = alice.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://counts.txt")
Create RDD alice from a local file.
Transformations: RDD alice is transformed (a flatMap(), a map() and a reduceByKey()) into a new RDD counts.
Actions : writing RDD counts into HDFS.
Transformations
Transformation              Description
map(func)                   return a new distributed dataset formed by passing each element
                            of the source through a function func
filter(func)                return a new dataset formed by selecting those elements of the
                            source on which func returns true
flatMap(func)               similar to map, but each input item can be mapped to 0 or more
                            output items (so func should return a Seq rather than a single item)
sample(withReplacement,     sample a fraction fraction of the data, with or without replacement,
  fraction, seed)           using a given random generator seed
union(otherDataset)         return a new dataset that contains the union of the elements in
                            the source dataset and the argument
distinct([numTasks])        return a new dataset that contains the distinct elements of the
                            source dataset
...                         ...
Actions
Action                   Description
reduce(func)             aggregate the elements of the dataset using a function func
collect()                return all the elements of the dataset as an array
count()                  return the number of elements in the dataset
first()                  return the first element of the dataset - similar to take(1)
take(n)                  return an array with the first n elements of the dataset
saveAsTextFile(path)     write the elements of the dataset as a text file, in HDFS or
                         any other Hadoop-supported file system
countByKey()             only available on RDDs of type (K, V); returns a Map of (K, Int)
                         pairs with the count of each key
foreach(func)            run a function func on each element of the dataset
...                      ...
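A small sketch combining a few of these transformations and actions (it assumes the RDD alice created earlier; the length threshold is arbitrary):

long_lines = alice.filter(lambda line: len(line) > 50)   # transformation (lazy)
print(long_lines.count())                                # action: triggers the computation
print(long_lines.take(2))                                # action: first 2 elements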
RDD partitions
An RDD is divided into partitions. We can control the partitioning of an RDD.

Number of partitions of an RDD:

alice = sc.textFile("Alice_in_Wonderland.txt")
alice.getNumPartitions()
2
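A sketch of controlling the number of partitions (the values 4 and 8 are arbitrary):

alice = sc.textFile("Alice_in_Wonderland.txt", minPartitions=4)
alice.getNumPartitions()      # 4
more = alice.repartition(8)   # reshuffles the data into 8 partitions (returns a new RDD)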
RDD lineage
Two types of dependencies based on transformations:
Narrow Dependencies vs. Wide Dependencies
Impact on performance in case of failure.
RDD lineage
An RDD records all the transformations necessary to build it.
Show the lineage of an RDD:
print counts.toDebugString()
(2) PythonRDD[7] at RDD at PythonRDD.scala:43 []
 |  MapPartitionsRDD[6] at mapPartitions at PythonRDD.scala:374 []
 |  ShuffledRDD[5] at partitionBy at NativeMethodAccessorImpl.java:-2 []
 +-(2) PairwiseRDD[4] at reduceByKey at <stdin>:1 []
    |  PythonRDD[3] at reduceByKey at <stdin>:1 []
    |  Alice_in_Wonderland.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
    |  Alice_in_Wonderland.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []
Effective recovery in case of failure.
DAG (Directed Acyclic Graph)
Transformations define a directed acyclic graph (DAG) of RDDs that will be used later when an action is called.
Operations on RDDs create new RDDs that refer back to their parents, thereby creating a graph.

How Spark works
RDD → DAG
Spark translates the transformations on RDDs into a DAG (directed acyclic graph).

DAG scheduler
The DAG scheduler divides operations into stages of tasks. A stage consists of tasks based on the partitions of the RDD (distributed data).

Task Scheduler
The stages are passed to the Task Scheduler. The Task Scheduler launches the tasks through the Cluster Manager.
Cluster Overview (master and workers)

We submit an application - execution is driven by a driver.
1. The driver connects to a cluster manager to allocate resources across applications.
2. It acquires executors on cluster nodes - processes that run compute tasks and cache data.
3. It sends the application code to the executors.
4. It sends tasks for the executors to run.
http://spark.apache.org/docs/latest/cluster-overview.html
RDD persistence
Spark can persist (or cache) a dataset in memory across operations.
Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset (often making future actions more than 10x faster).
The cache is fault-tolerant: if any partition of an RDD is lost, itwill automatically be recomputed using the transformations thatoriginally created it.
RDD Persistence Control
Each persisted RDD can be stored using a different storage level.
RDD.persist()

Storage level        Description
MEMORY_ONLY          Store RDD as deserialized Java objects in the JVM. This is the default level.
MEMORY_AND_DISK      Persist in memory and on disk.
DISK_ONLY            Store the RDD partitions only on disk.
...                  ...
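A small sketch of choosing an explicit storage level (using the RDD counts from the word-count example):

from pyspark import StorageLevel

counts.persist(StorageLevel.MEMORY_AND_DISK)   # explicit storage level
# counts.cache() would be the shorthand for persist(StorageLevel.MEMORY_ONLY)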
Shared Variables
Broadcast Variables
Keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Used to give every node a copy of a large input dataset efficiently. Spark uses efficient broadcast algorithms to reduce communication cost.
Accumulators
Used to implement counters and sums efficiently in parallel. Spark natively supports accumulators of numeric value types and standard mutable collections. Only the driver program can read an accumulator's value, not the tasks.
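A minimal sketch using both kinds of shared variables (the lookup table and the input lines are made up):

lookup = sc.broadcast({"a": 1, "b": 2})   # read-only copy shipped once to every node
blank_lines = sc.accumulator(0)           # counter the tasks can only add to

def parse(line):
    if line == "":
        blank_lines.add(1)
    return [lookup.value.get(w, 0) for w in line.split()]

parsed = sc.parallelize(["a b", "", "b b"]).map(parse)
parsed.count()                            # action: runs the tasks
print(blank_lines.value)                  # 1, readable only in the driver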
Spark Built-in Libraries
http://spark.apache.org/sql/
SQL execution engine - Using a DataFrame in SQL
Structured data processing
Queries structured data using SQL.
Standard connectivity via JDBC or ODBC. APIs in Java, Scala, Python and R.
DataFrames
DataFrames = RDD + named columns
DSL: select(), where(), groupBy(), ...
Hold tabular data.
Describe the schema → DataFrame.
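A small sketch of the DataFrame DSL and of the SQL entry point (Spark 1.x style, via sqlContext; the JSON file and column names are hypothetical):

df = sqlContext.read.json("people.json")
df.where(df["age"] > 21).select("name", "age").show()
df.groupBy("age").count().show()

df.registerTempTable("people")            # make the DataFrame queryable in SQL
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()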
http://spark.apache.org/mllib/
Machine Learning
MLlib is the Machine Learning library (based on Breeze, netlib-java, JBlas, BLAS/LAPACK).

Performance
High-quality algorithms, 100x faster than MapReduce.
MLlib Algorithms 1/2
Classification
SVMs, logistic regression, decision trees, naive Bayes, random forests and gradient-boosted trees.

Clustering
K-means, bisecting K-means, Gaussian mixtures (GMM) and power iteration clustering.

Dimensionality reduction & decomposition
Singular value decomposition (SVD), QR and principal component analysis (PCA).

Collaborative filtering - Recommendation
Alternating least squares (ALS), non-negative matrix factorization (NMF).
MLlib Algorithms 2/2
Basic statistics
Summary statistics, correlations, stratified sampling, hypothesis testing, and random data generation.

Feature extraction and transformation
Topic modeling via latent Dirichlet allocation (LDA), TF-IDF, Word2Vec, StandardScaler, and Normalizer.

Optimization
Stochastic gradient descent (SGD), limited-memory BFGS.

Regression
Generalized linear models (GLMs), regression trees with L1, L2, and elastic-net regularization...
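A minimal sketch of the RDD-based MLlib API, here with k-means (the data points and parameters are illustrative):

from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [0.1, 0.2], [9.0, 8.0], [8.5, 9.1]])
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)
print(model.predict([9.0, 9.0]))   # index of the closest center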
http://spark.apache.org/graphx/
GraphX is Spark's library for graphs and graph-parallel computation.

Flexibility
GraphX unifies ETL, exploratory analysis, and iterative graph computation within a single system.

Performance - Speed
Comparable to specialized (faster) graph processing systems.
GraphX supported algorithms
Page Rank
Connected components
Label propagation
SVD++
Strongly connected components
Triangle count
http://spark.apache.org/streaming/
Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.

Ease of Use
Build applications through high-level operators.

Batch + Streaming
Combine streaming with batch and interactive queries.
Micro-batches
Splits a continuous stream into batches.
API similar to Spark.
Spark Streaming ≠ Apache Storm
Spark Streaming ∼ Apache Storm + Trident
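A minimal Spark Streaming sketch: word counts over 1-second micro-batches read from a socket (the host and port are placeholders):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=1)    # 1-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()
ssc.start()
ssc.awaitTermination()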