Distributed Computing with Apache Spark


Transcript of Distributed Computing with Apache Spark

Page 1: Distributed Computing with Apache Spark

Distributed Computing with Apache Spark
Convex and distributed optimization (3 ECTS)

Master of Science in Industrial and Applied Mathematics

2016

Page 2: Distributed Computing with Apache Spark

Original motivation

Google Inc. - Jeffrey Dean, Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI, 2004

• Data and processing colocation.
• Process the data where it is.
• Avoid networks and I/Os.

This is the birth certificate of Big Data (i.e. MapReduce)

Page 3: Distributed Computing with Apache Spark

Original context

• Simple processing: indexing, statistics, queries or frequent words.

• Huge amounts of data: web pages, logs, documents (texts, images, videos).

Original challenges

• Data distribution
• Parallel processing
• Fault management
• Cost reduction (commodity PCs)

Page 4: Distributed Computing with Apache Spark

MapReduce ∼ Functional programming

Characteristics
• Operations sequenced by composition:

(f ◦ g)(x) = f(g(x))

No order in the declarations.
• The result of a function depends only on its inputs (purely functional: no state).

• Data/variables are not modifiable: no assignment, no explicit memory management.

Functional inspiration of MapReduce
• MapReduce pipeline: reduce(⊕) ◦ grp ◦ map(f) (sketched below).
• Can be automatically parallelized over several computation units.
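To make the pipeline concrete, here is a minimal single-machine sketch of reduce(⊕) ◦ grp ◦ map(f) in plain Python (the helper names and the word-count payload are illustrative, not part of MapReduce itself):

from itertools import groupby
from functools import reduce
from operator import add

def map_phase(f, records):
    # apply f to every record; f returns a list of (key, value) pairs
    return [pair for record in records for pair in f(record)]

def grp_phase(pairs):
    # group the values by key, as the shuffle/sort step would
    pairs = sorted(pairs, key=lambda kv: kv[0])
    return [(k, [v for _, v in group]) for k, group in groupby(pairs, key=lambda kv: kv[0])]

def reduce_phase(op, grouped):
    # combine the values of each key with the associative operator op
    return [(k, reduce(op, values)) for k, values in grouped]

# word count expressed as reduce(+) o grp o map(f)
f = lambda line: [(word, 1) for word in line.split()]
print(reduce_phase(add, grp_phase(map_phase(f, ["to be or not to be"]))))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]

Because each phase depends only on its inputs, the phases can be split across machines without changing the result.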

Page 5: Distributed Computing with Apache Spark

Map function
map : (A → B) → ([A] → [B])

map(f)[x0, ..., xn] = [f(x0), ..., f(xn)]

map(∗2)[2, 3, 6] = [4, 6, 12]

Map prototype in MapReduce
• Map documentation: (K1, V1) → [(K2, V2)]. Map is a particular prototype of the f of map(f).

• Apply f to a collection of key/value pairs: for each pair (k, v), compute f(k, v).

Pseudocode example
function map(uri, document):
    for each distinct term in document:
        output(term, count(term, document))

Page 6: Distributed Computing with Apache Spark

Map function

Algebraic properties of Map
• map(id) = id, with id(x) = x
• map(f ◦ g) = map(f) ◦ map(g)
• map(f)[x] = [f(x)]
• map(f)(xs ++ ys) = map(f)(xs) ++ map(f)(ys)

Applications
• Simplification and automatic program rewriting.
• Algebraic proofs of equivalence.
• Automatic parallelization of computations.
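As a quick check of the second property above, map(f ◦ g) = map(f) ◦ map(g), in Python (f and g are arbitrary example functions):

f = lambda x: x + 1
g = lambda x: 2 * x
xs = [2, 3, 6]

lhs = list(map(lambda x: f(g(x)), xs))   # map(f o g)
rhs = list(map(f, map(g, xs)))           # map(f) o map(g)
assert lhs == rhs == [5, 7, 13]

Laws of this kind are what justify automatically rewriting two passes over the data as a single pass (or vice versa).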

Page 7: Distributed Computing with Apache Spark

Sort/Group/Shuffle function

grp : [(A × B)] → [(A × [B])]

grp[..., (w, a0), ..., (w, an), ...] = [..., (w, [a0, ..., an]), ...]

grp[('a', 2), ('z', 2), ('ab', 3), ('a', 4)] = [('a', [2, 4]), ('z', [2]), ('ab', [3])]

Sort/Group/Shuffle prototype in MapReduce
• Documentation: grp : [(K2, V2)] → [(K2, [V2])]
• Recalls the GROUP BY / ORDER BY instructions in SQL.
• grp is called transparently between the Map and Reduce phases.

Page 8: Distributed Computing with Apache Spark

Reduce function

reduce : (A × A → B) → ([A] → B)

reduce(⊕)[x0, ..., xn] = x0 ⊕ x1 ⊕ ... ⊕ xn−1 ⊕ xn

reduce(+)[2, 1, 3] = 2 + 1 + 3 = 6

Reduce prototype in MapReduce
• Documentation: reduce : [(K2, V2)] → [(K3, [V3])]
• Reduce is a particular prototype of reduce(⊕): we apply ⊕ to the collection of values associated with each key.

Pseudocode example
function reduce(term, counts):
    output(term, sum(counts))

Page 9: Distributed Computing with Apache Spark

Example : Matrix-Vector Multiplication

Let A be an m × n matrix and v be a vector of length n

A =
⎛ a11  a12  ···  a1n ⎞
⎜ a21  a22  ···  a2n ⎟
⎜  ⋮    ⋮    ⋱    ⋮  ⎟
⎝ am1  am2  ···  amn ⎠

v = (v1, v2, ..., vn)ᵀ

The product Av is a vector of length m

Av = ( Σ_{j=1}^{n} a1j vj,  Σ_{j=1}^{n} a2j vj,  ...,  Σ_{j=1}^{n} amj vj )ᵀ,   i.e. (Av)_i = Σ_{j=1}^{n} a_ij v_j

Page 10: Distributed Computing with Apache Spark

Example : Matrix-Vector Multiplication

MapReduce pseudocode for computing the matrix-vector product

map(key, value):
    for (i, j, a_ij) in value:
        emit(i, a_ij * v[j])

reduce(key, values):
    result = 0
    for value in values:
        result += value
    emit(key, result)

Communication costs
• Map tasks: O(mn + n)
• Reduce tasks: O(mn)
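A minimal single-machine simulation of this map/reduce pair in plain Python, assuming the matrix is given as a list of (i, j, a_ij) triples and v is a list indexed by j (the toy data is illustrative):

from collections import defaultdict

A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]  # 2x2 matrix as (i, j, a_ij)
v = [1.0, 1.0]

# map phase: emit (row index, partial product) for every entry
mapped = [(i, a_ij * v[j]) for (i, j, a_ij) in A]

# shuffle/group phase: collect the partial products of each row
groups = defaultdict(list)
for i, value in mapped:
    groups[i].append(value)

# reduce phase: sum the partial products of each row
Av = {i: sum(values) for i, values in groups.items()}
print(Av)  # {0: 3.0, 1: 7.0}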

Page 11: Distributed Computing with Apache Spark

Example : Logistic Regression

We choose the hypothesis of the form h_θ(x) = 1 / (1 + exp(−θᵀx)) and fit θ using Newton-Raphson:

θ := θ − H⁻¹ ∇_θ ℓ(θ), where ℓ(θ) is the log-likelihood function.

∇_θ ℓ(θ) is computed in parallel: each mapper sums, over its subgroup of examples i,

Σ (y^(i) − h_θ(x^(i))) x_j^(i)

The Hessian matrix is computed by the mappers with the summation

H(j, k) := H(j, k) + h_θ(x^(i)) (h_θ(x^(i)) − 1) x_j^(i) x_k^(i)

The reducer sums up the gradient and Hessian values to perform the θ update.
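A minimal NumPy sketch of one Newton-Raphson step organized MapReduce-style, where each "mapper" handles a chunk of the examples and the "reducer" sums the partial gradients and Hessians (the data and the chunking are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mapper(X_chunk, y_chunk, theta):
    # partial gradient and Hessian over one subgroup of examples
    h = sigmoid(X_chunk @ theta)
    grad = X_chunk.T @ (y_chunk - h)                        # sum of (y - h(x)) x
    hess = (X_chunk * (h * (h - 1))[:, None]).T @ X_chunk   # sum of h(x)(h(x)-1) x x^T
    return grad, hess

def newton_step(chunks, theta):
    # "reducer": sum the partial gradients and Hessians, then update theta
    parts = [mapper(X, y, theta) for X, y in chunks]
    grad = sum(g for g, _ in parts)
    H = sum(h for _, h in parts)
    return theta - np.linalg.solve(H, grad)

# toy data split into two "mapper" chunks (illustrative)
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
y = (X[:, 1] + rng.normal(size=100) > 0).astype(float)
chunks = [(X[:50], y[:50]), (X[50:], y[50:])]

theta = np.zeros(3)
for _ in range(5):
    theta = newton_step(chunks, theta)
print(theta)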

Page 12: Distributed Computing with Apache Spark

Example : Support Vector Machine

Linear SVM’s goal is to optimize the primal problem

argmin_{ω,b}  ‖ω‖² + C Σ_{i: ζ_i ≥ 0} ζ_i^p    s.t.   y^(i) (ωᵀ x^(i) + b) ≥ 1 − ζ_i

where p is either 1 (hinge loss) or 2 (quadratic loss).

The primal problem for the quadratic loss can be solved by batch gradient descent (sv denotes the set of support vectors):

∇ = 2ω + 2C Σ_{i ∈ sv} (ωᵀ x_i − y_i) x_i    and    H = I + C Σ_{i ∈ sv} x_i x_iᵀ

The mappers calculate the partial gradients and the reducer sums up the partial results to update ω.
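In the same spirit, a minimal NumPy sketch of batch gradient descent for the quadratic-loss SVM, with "mappers" computing partial gradients over chunks and a "reducer" summing them (data, step size, chunking and the omitted bias b are illustrative simplifications):

import numpy as np

def partial_gradient(X_chunk, y_chunk, w, C):
    # mapper: gradient contribution of the support vectors in this chunk
    margins = y_chunk * (X_chunk @ w)
    sv = margins < 1                                  # support vectors: y w^T x < 1
    return 2 * C * (X_chunk[sv].T @ (X_chunk[sv] @ w - y_chunk[sv]))

def gradient_step(chunks, w, C, lr):
    # reducer: sum the partial gradients, add the regularization term, update w
    grad = 2 * w + sum(partial_gradient(X, y, w, C) for X, y in chunks)
    return w - lr * grad

# toy data split into two "mapper" chunks (illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200))
chunks = [(X[:100], y[:100]), (X[100:], y[100:])]

w = np.zeros(2)
for _ in range(200):
    w = gradient_step(chunks, w, C=1.0, lr=0.001)
print(w)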

Page 13: Distributed Computing with Apache Spark

Apache Hadoop
Distributed Data Storage + MapReduce Processing

Page 14: Distributed Computing with Apache Spark

Traditional network programming
Message-passing between nodes (e.g. MPI)

Very difficult to do at scale:
• How to split the problem across nodes?
• Must consider network & data locality
• How to deal with failures? (inevitable at scale)
• Even worse: stragglers (node not failed, but slow)
• Ethernet networking not fast
• Have to write programs for each machine

Rarely used in commodity datacenters.

Page 15: Distributed Computing with Apache Spark

MapReduce limitations

Difficulty of programming directly in MapReduce

Constrained model
A Map phase then a Reduce phase.

For complex and iterative algorithms we need to chain several MapReduce phases.

Data transfer between these phases: disk storage.

Most optimization algorithms are iterative!

Page 16: Distributed Computing with Apache Spark

Result & Verdict

While MapReduce is simple, it can require asymptotically more communication or I/O.

MapReduce algorithms research doesn't go to waste; it just gets sped up and easier to use.

Still useful to study as an algorithmic framework, but silly to use directly.

Page 17: Distributed Computing with Apache Spark

Therefore, people built specialized systems...

Page 18: Distributed Computing with Apache Spark

Why Apache Spark?

Spark's goal was to generalize MapReduce to support new applications within the same engine.
Benefit for users: the same engine performs data extraction, model training and interactive queries.

Two small additions are enough to express the previous models:
• Fast data sharing.
• General directed acyclic execution graphs (DAGs).

This allows for an approach which is more efficient for the engine, and much simpler for the end users.

Page 19: Distributed Computing with Apache Spark

Disk vs Memory

L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 100 ns
Main memory reference: 100 ns
Disk seek: 10,000,000 ns

Page 20: Distributed Computing with Apache Spark

In-Memory Computing

Hadoop MapReduce : Share data on disk

Apache Spark : Speed up processing using the memory

Page 21: Distributed Computing with Apache Spark

History

Page 22: Distributed Computing with Apache Spark

Lightning-fast cluster computing

http://spark.apache.org

Originally developed at UC Berkeley (AMPLab)
Open sourced in 2009 and implemented in Scala

Page 23: Distributed Computing with Apache Spark

Adoption and use cases

eBay: uses Spark for log processing (aggregation), analytics, ...

Kelkoo: uses Spark and Spark Streaming for product recommendation, BI, real-time filtering of malicious activity, data mining.

Moody's Analytics: uses Spark for its credit risk calculation platform, (C)VaR calculation, ...

Amazon, Yahoo!, TripAdvisor, Hitachi, NASA, Ooyala, Shopify, Samsung, Socialmetrix, ...

http://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Page 24: Distributed Computing with Apache Spark

Spark is Hadoop compatible
Integration with Hadoop and its ecosystem:

HBase, Cassandra, MongoDB ...

Page 25: Distributed Computing with Apache Spark

Spark is Fast
In-Memory Computing

Suitable for iterative algorithms

Holds the record for sorting 100 TB of data on disk.

Page 26: Distributed Computing with Apache Spark

Spark is Simple

Ease of development - simple & intuitive APIs

APIs in Java, Scala, Python (+ SQL, Clojure, R)

Page 27: Distributed Computing with Apache Spark

Spark is Interactive
Interactive mode (Spark Shell, PySpark), standalone mode

Page 28: Distributed Computing with Apache Spark

Spark UI
Application Monitoring

Page 29: Distributed Computing with Apache Spark

Spark is Streaming
Real-time processing (micro-batching)

Spark Streaming is easier to use than Apache Storm

Page 30: Distributed Computing with Apache Spark

Spark is (very) Active
Most active open source community in big data

500+ contributors

Page 31: Distributed Computing with Apache Spark

... is well Documented
One can find many examples, presentations, videos, MOOCs, events, meetups, ...

https://sparkhub.databricks.com

Page 32: Distributed Computing with Apache Spark

... with a large open-source community

cf. http://spark-packages.org

cf. Github

Page 33: Distributed Computing with Apache Spark

Spark Ecosystem

Page 34: Distributed Computing with Apache Spark

SparkContext

The first thing a Spark program should do is create a SparkContext object, which tells Spark how to access a cluster.

In the shell (Scala or Python), a variable sc is automatically created.

Other programs must use a constructor to instantiate a new SparkContext.

SparkContext can be used to create other variables.
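For example, a standalone PySpark program typically builds its own context like this (a minimal sketch; the application name and master URL are placeholders):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[4]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(10))   # a first RDD created from the context
print(rdd.sum())                  # 45

sc.stop()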

Page 35: Distributed Computing with Apache Spark

Master URLs
The master parameter determines which cluster to use.

• local: run Spark locally with one worker thread (i.e. no parallelism at all)
• local[K]: run Spark locally with K worker threads (ideally set to the number of cores on your machine)
• spark://HOST:PORT: connect to a Spark standalone cluster; PORT depends on the config (7077 by default)
• mesos://HOST:PORT: connect to a Mesos cluster; PORT depends on the config (5050 by default)
• yarn: connect to a YARN cluster in client or cluster mode

Page 36: Distributed Computing with Apache Spark

Python shell (PySpark) locally with 4 cores

$ pyspark --master local[4]

Python shell (PySpark) connected to a standalone cluster, e.g. cluster1

$ pyspark --master spark://cluster1:7077

Submit a job (Python script example.py) locally with 4 cores

$ spark-submit --master local[4] example.py

Submit a job to a standalone cluster, e.g. cluster1

$ spark-submit --master spark://cluster1:7077 example.py

Page 37: Distributed Computing with Apache Spark

RDD: Resilient Distributed Datasets
Collections of objects distributed across a cluster:
• User-controlled partitioning.
• Stored in memory or on disk.
• Built via parallel transformations (map, filter, ...).
• Automatically rebuilt on failure.

There are two types of RDD:
• Parallelized collections: take an existing collection and run functions on it in parallel.
• Hadoop datasets: run functions on each record of a file in the Hadoop distributed file system or any other storage system supported by Hadoop (Cassandra, MongoDB, Amazon S3, Hypertable, HBase, ...).

http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
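A quick illustration of the first kind: a parallelized collection built from an in-memory Python list (the values are arbitrary):

data = [1, 2, 3, 4, 5]
dist_data = sc.parallelize(data, 4)   # distribute the list over 4 partitions

print(dist_data.map(lambda x: x * x).reduce(lambda a, b: a + b))   # 55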

Page 38: Distributed Computing with Apache Spark

Create a RDD from a local file

alice = sc.textFile("Alice_in_Wonderland.txt")

Create a RDD from a Hadoop/HDFS distributed file

alice = sc.textFile("hdfs :// books/les_miserables.txt")

Create a RDD from a Cassandra table (DataFrame)

# add the Cassandra connection host to the sqlContext
sqlContext.setConf("spark.cassandra.connection.host", "172.16.0.161")

metrics = sqlContext.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace="ssc", table="metrics").load()

Page 39: Distributed Computing with Apache Spark

Operations on RDDs (two types)

Transformations
Manipulate an RDD and return another RDD.
Transformations are lazy: they are not computed immediately.
Parallel execution.
Optimize the required calculations.
Recover lost data partitions.
Examples: map(), filter(), join(), groupByKey(), ...

Actions
These are the final operations (compute, persistence, ...).
Do not return an RDD.
Persistence in memory or on disk.
Examples: reduce(), count(), foreach(), saveAsHadoopFile(), ...

A transformed RDD is calculated when an action is executed.
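Laziness can be observed directly in the shell: transformations return immediately, and nothing is read or computed until an action is called (a small sketch reusing the Alice_in_Wonderland.txt file from the previous pages):

alice = sc.textFile("Alice_in_Wonderland.txt")           # transformation: nothing is read yet
long_lines = alice.filter(lambda line: len(line) > 50)   # transformation: still nothing computed

print(long_lines.count())   # action: the file is read and the whole pipeline runs now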

Page 40: Distributed Computing with Apache Spark

Counting words in "Alice in Wonderland"

alice = sc.textFile("Alice_in_Wonderland.txt")

counts = alice.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

counts.saveAsTextFile("hdfs://counts.txt")

Create RDD alice from a local file.

Transformations: the RDD alice is transformed (a flatMap(), a map() and a reduceByKey()) into a new RDD counts.

Action: writing the RDD counts to HDFS.

Page 41: Distributed Computing with Apache Spark

Transformations

• map(func): return a new distributed dataset formed by passing each element of the source through the function func
• filter(func): return a new dataset formed by selecting those elements of the source on which func returns true
• flatMap(func): similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
• sample(withReplacement, fraction, seed): sample a fraction fraction of the data, with or without replacement, using a given random generator seed
• union(otherDataset): return a new dataset that contains the union of the elements in the source dataset and the argument
• distinct([numTasks]): return a new dataset that contains the distinct elements of the source dataset
• ...

Page 42: Distributed Computing with Apache Spark

Actions

• reduce(func): aggregate the elements of the dataset using a function func
• collect(): return all the elements of the dataset as an array
• count(): return the number of elements in the dataset
• first(): return the first element of the dataset (similar to take(1))
• take(n): return an array with the first n elements of the dataset
• saveAsTextFile(path): write the elements of the dataset as a text file, on HDFS or any other Hadoop-supported file system
• countByKey(): only available on RDDs of type (K, V); returns a map of (K, Int) pairs with the count of each key
• foreach(func): run a function func on each element of the dataset
• ...

Page 43: Distributed Computing with Apache Spark

RDD partitions

An RDD is divided into partitions. We can control the partitioning of an RDD.

Number of partitions of an RDD:

alice = sc.textFile("Alice_in_Wonderland.txt")alice.getNumPartitions ()2

Page 44: Distributed Computing with Apache Spark

RDD lineage
Two types of dependencies based on transformations:

• Narrow dependencies
• Wide dependencies

Impact on performance in case of failure.

Page 45: Distributed Computing with Apache Spark

RDD lineage
An RDD records all the transformations necessary to build it.

Show the lineage of an RDD:

print counts.toDebugString()
(2) PythonRDD[7] at RDD at PythonRDD.scala:43 []
 |  MapPartitionsRDD[6] at mapPartitions at PythonRDD.scala:374 []
 |  ShuffledRDD[5] at partitionBy at NativeMethodAccessorImpl.java:-2 []
 +-(2) PairwiseRDD[4] at reduceByKey at <stdin>:1 []
    |  PythonRDD[3] at reduceByKey at <stdin>:1 []
    |  Alice_in_Wonderland.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
    |  Alice_in_Wonderland.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []

Effective recovery in case of failure.

Page 46: Distributed Computing with Apache Spark

DAG (Directed Acyclic Graph)
Transformations define a directed acyclic graph (DAG) of RDDs that will be used later, when an action is called.

Operations on RDDs create new RDDs that refer back to their parents, thereby creating a graph.

Page 47: Distributed Computing with Apache Spark

How Spark works
RDD → DAG: Spark translates the transformations on RDDs into a DAG (directed acyclic graph).

DAG scheduler
The DAG scheduler divides operations into stages of tasks. A stage consists of tasks based on the partitions of the RDD (distributed data).

Task scheduler
The stages are passed to the task scheduler. The task scheduler launches jobs through the cluster manager.

Page 48: Distributed Computing with Apache Spark

Cluster Overview (master and workers)

We submit an application; execution is driven by a driver.
1. The driver connects to a cluster manager to allocate resources across applications.
2. It acquires executors on cluster nodes: processes that run compute tasks and cache data.
3. It sends the application code to the executors.
4. It sends tasks for the executors to run.

http://spark.apache.org/docs/latest/cluster-overview.html

Page 49: Distributed Computing with Apache Spark

RDD persistence

Spark can persist (or cache) a dataset in memory across operations.

Each node stores in memory any slices of the dataset that it computes and reuses them in other actions on that dataset (often making future actions more than 10x faster).

The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
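A minimal caching sketch: the first action materializes the RDD and caches it, later actions reuse the in-memory copy (file name reused from the earlier examples):

words = sc.textFile("Alice_in_Wonderland.txt") \
          .flatMap(lambda line: line.split(" "))
words.cache()                    # mark the RDD as persistent in memory

print(words.count())             # first action: reads the file and fills the cache
print(words.distinct().count())  # subsequent actions reuse the cached partitions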

Page 50: Distributed Computing with Apache Spark

RDD Persistence Control

Each persisted RDD can be stored using a different storage level.

RDD.persist()

• MEMORY_ONLY: store the RDD as deserialized Java objects in the JVM. This is the default level.
• MEMORY_AND_DISK: persist in memory and on disk.
• DISK_ONLY: store the RDD partitions only on disk.
• ...
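For example, to fall back to disk when a partition does not fit in memory (StorageLevel is exposed by the pyspark module; the RDD is illustrative):

from pyspark import StorageLevel

words = sc.textFile("Alice_in_Wonderland.txt").flatMap(lambda line: line.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK)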

Page 51: Distributed Computing with Apache Spark

Shared Variables

Broadcast variables
Keep a read-only variable cached on each machine rather than shipping a copy of it with tasks, to give every node a copy of a large input dataset efficiently, using efficient broadcast algorithms to reduce communication cost.

Accumulators
Used to implement counters and sums efficiently in parallel. Spark natively supports accumulators of numeric value types and standard mutable collections. Only the driver program can read an accumulator's value, not the tasks.
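A short PySpark sketch of both mechanisms (the lookup table and the counting rule are illustrative):

# Broadcast: ship a read-only lookup table to every node once
countries = sc.broadcast({"fr": "France", "de": "Germany", "it": "Italy"})

codes = sc.parallelize(["fr", "de", "xx", "it", "fr"])
names = codes.map(lambda c: countries.value.get(c, "unknown"))

# Accumulator: count the unknown codes seen by the tasks
unknown = sc.accumulator(0)

def check(code):
    if code not in countries.value:
        unknown.add(1)

codes.foreach(check)

print(names.collect())   # ['France', 'Germany', 'unknown', 'Italy', 'France']
print(unknown.value)     # 1, readable only in the driver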

Page 52: Distributed Computing with Apache Spark

Spark Built-in Libraries

Page 53: Distributed Computing with Apache Spark

http://spark.apache.org/sql/

SQL execution engine - Using a DataFrame in SQL

Structured data processing
Queries structured data using SQL. Standard connectivity via JDBC or ODBC. APIs in Java, Scala, Python and R.

DataFrames
DataFrames = RDD + named columns. DSL: select(), where(), groupBy(), ... Tabular data. Describe the schema → DataFrame.
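A small sketch combining the DataFrame DSL and SQL (assuming a sqlContext is available, as in the PySpark shell; the data is illustrative):

df = sqlContext.createDataFrame(
    [("Alice", 31), ("Bob", 25), ("Carol", 40)], ["name", "age"])

# DataFrame DSL
df.select("name").where(df.age > 30).show()

# The same query in SQL
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()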

Page 54: Distributed Computing with Apache Spark

http://spark.apache.org/mllib/

Machine Learning
MLlib is the Machine Learning library (based on Breeze, netlib-java, JBlas, BLAS/LAPACK).

Performance
High-quality algorithms, up to 100x faster than MapReduce.
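A minimal sketch with the RDD-based pyspark.mllib API: train a logistic regression model on a handful of LabeledPoint examples and measure the training error (the toy data is illustrative):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(0.0, [0.1, 0.9]),
    LabeledPoint(1.0, [0.9, 0.2]),
])

model = LogisticRegressionWithLBFGS.train(data)

predictions = data.map(lambda p: (p.label, model.predict(p.features)))
error = predictions.filter(lambda lp: lp[0] != lp[1]).count() / float(data.count())
print(error)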

Page 55: Distributed Computing with Apache Spark

MLlib Algorithms 1/2
Classification
SVMs, logistic regression, decision trees, naive Bayes, random forests and gradient-boosted trees.

Clustering
K-means, bisecting K-means, Gaussian mixtures (GMM) and power iteration clustering.

Dimensionality reduction & decomposition
Singular value decomposition (SVD), QR and principal component analysis (PCA).

Collaborative filtering - Recommendation
Alternating least squares (ALS), non-negative matrix factorization (NMF).

Page 56: Distributed Computing with Apache Spark

MLlib Algorithms 2/2

Basic statistics
Summary statistics, correlations, stratified sampling, hypothesis testing, and random data generation.

Feature extraction and transformation
Topic modeling via latent Dirichlet allocation (LDA), TF-IDF, Word2Vec, StandardScaler, and Normalizer.

Optimization
Stochastic Gradient Descent (SGD), Limited-memory BFGS.

Regression
Generalized linear models (GLMs) with L1, L2, and elastic-net regularization, regression trees...

Page 57: Distributed Computing with Apache Spark

http://spark.apache.org/graphx/

GraphX is Spark's library for graphs and graph-parallel computation.

Flexibility
GraphX unifies ETL, exploratory analysis, and iterative graph computation within a single system.

Performance - Speed
Comparable to the fastest specialized graph processing systems.

Page 58: Distributed Computing with Apache Spark

GraphX supported algorithms

PageRank

Connected components

Label propagation

SVD++

Strongly connected components

Triangle count

Page 59: Distributed Computing with Apache Spark

http://spark.apache.org/streaming/

Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.

Ease of Use
Build applications through high-level operators.

Batch + Streaming
Combine streaming with batch and interactive queries.
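A minimal Spark Streaming sketch: a streaming word count over 1-second micro-batches read from a socket (host and port are placeholders):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()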

Page 60: Distributed Computing with Apache Spark

Micro-batches

Splits a continuous stream into micro-batches.
API similar to Spark.
Spark Streaming ≠ Apache Storm.
Spark Streaming ∼ Apache Storm + Trident.