Distributed Computing with Apache Spark
Convex and distributed optimization (3 ECTS)
Master of Science in Industrial and Applied Mathematics
2016
Original motivation
Google Inc. - Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI, 2004
• Data and processing colocation.
• Process the data where it is.
• Avoid networks and I/Os.
This is the birth certificate of Big Data (i.e. MapReduce)
Original context
• Simple processing: indexing, statistics, queries or frequent words.
• Huge amounts of data: web pages, logs, documents (texts, images, videos).
Original challenges
• Data distribution
• Parallel processing
• Fault management
• Cost reduction (commodity PCs)
MapReduce ∼ Functional programming
Characteristics
• Operations sequenced by composition: (f ◦ g)(x) = f(g(x)). No order in the declarations.
• The result of a function depends only on its inputs (pure functions, no state).
• Data/variables are not modifiable: no assignment, no explicit management of the memory.
Functional Inspiration of MapReduce
• MapReduce pipeline: reduce(⊕) ◦ grp ◦ map(f)
• Can be automatically parallelized on several computation units.
Map function
map : (A → B) → ([A] → [B])
map(f)[x0, ..., xn] = [f(x0), ..., f(xn)]
map(∗2)[2, 3, 6] = [4, 6, 12]
Map prototype in MapReduce
• Map documentation: (K1, V1) → [(K2, V2)]. Map is a particular prototype of the f of map(f).
• Apply f to a collection of key/value pairs: for each pair (k, v), compute f(k, v).
Pseudocode example
function map(uri, document):
    for each distinct term in document:
        output(term, count(term, document))
Map function
Algebraic properties of Map
• map(id) = id with id(x) = x
• map(f ◦ g) = map(f) ◦ map(g)
• map(f)[x] = [f(x)]
• map(f)(xs ++ ys) = map(f)(xs) ++ map(f)(ys)
Application
• Simplification and automatic program rewriting.
• (Algebraic) proofs of equivalence.
• Automatic parallelization of computations.
Sort/Group/Shuffle function
grp : [(A × B)] → [(A × [B])]
grp[..., (w, a0), ..., (w, an), ...] = [..., (w, [a0, ..., an]), ...]
grp[('a', 2), ('z', 2), ('ab', 3), ('a', 4)] = [('a', [2, 4]), ('z', [2]), ('ab', [3])]
Sort/Group/Shuffle prototype in MapReduce
• Documentation: grp : [(K2, V2)] → [(K2, [V2])]
• Recalls the GROUP BY/ORDER BY instructions in SQL.
• grp is called transparently between the Map and Reduce phases.
Reduce function
reduce : (A × A → A) → ([A] → A)
reduce(⊕)[x0, ..., xn] = x0 ⊕ x1 ⊕ ... ⊕ xn−1 ⊕ xn
reduce(+)[2, 1, 3] = 2 + 1 + 3 = 6
Reduce prototype in MapReduce
• Documentation: reduce : (K2, [V2]) → [(K3, V3)]
• Reduce is a particular prototype for reduce(⊕): we apply ⊕ to the collection of values associated with each key.
Pseudocode example
function reduce(term, counts):
    output(term, sum(counts))
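As a concrete illustration, here is a minimal single-machine Python sketch of the reduce(⊕) ◦ grp ◦ map(f) pipeline applied to word counting; the function names (wc_map, grp, wc_reduce) and the documents are illustrative, not part of any framework.

from collections import defaultdict

def wc_map(uri, document):
    # emit one (term, count) pair per distinct term of the document
    words = document.split()
    return [(term, words.count(term)) for term in set(words)]

def grp(pairs):
    # group values by key, as the shuffle phase would
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def wc_reduce(term, counts):
    return (term, sum(counts))

documents = {"doc1": "to be or not to be", "doc2": "to do is to be"}
mapped = [pair for uri, doc in documents.items() for pair in wc_map(uri, doc)]
print([wc_reduce(term, counts) for term, counts in grp(mapped)])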
Example : Matrix-Vector Multiplication
Let A be an m × n matrix and v be a vector of length n:

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}, \qquad v = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}
The product Av is a vector of length m
Av = \begin{pmatrix} \sum_{j=1}^{n} a_{1j} v_j \\ \sum_{j=1}^{n} a_{2j} v_j \\ \vdots \\ \sum_{j=1}^{n} a_{mj} v_j \end{pmatrix}
Example : Matrix-Vector Multiplication
MapReduce pseudocode for computing the matrix-vector product
map(key, value):
    for (i, j, a_ij) in value:
        emit(i, a_ij * v[j])

reduce(key, values):
    result = 0
    for value in values:
        result += value
    emit(key, result)
Communication costs
• Map tasks: O(mn + n)
• Reduce tasks: O(mn)
Example : Logistic Regression
We choose the hypothesis hθ(x) = 1/(1 + exp(−θᵀx)) and fit θ using Newton-Raphson:
θ := θ − H⁻¹ ∇θ l(θ), where l(θ) is the log-likelihood function.
∇θ l(θ) is computed in parallel: each mapper sums, over the examples i of its subgroup,

\sum_{i \in \text{subgroup}} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}
The Hessian matrix is computed by the mappers with the summation

H(j, k) := H(j, k) + h_\theta(x^{(i)}) \left( h_\theta(x^{(i)}) - 1 \right) x_j^{(i)} x_k^{(i)}
The reducer sums up the gradient and Hessian values to perform the θ update.
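A minimal NumPy sketch of one such distributed Newton-Raphson step, simulating the mappers by splitting the rows into chunks (the data, the number of chunks and all names are illustrative, not the course's reference implementation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def partial_stats(X_chunk, y_chunk, theta):
    # one "mapper": partial gradient and Hessian sums over its subgroup
    h = sigmoid(X_chunk.dot(theta))
    grad = X_chunk.T.dot(y_chunk - h)                      # sum_i (y_i - h_i) x_i
    H = X_chunk.T.dot(X_chunk * (h * (h - 1))[:, None])    # sum_i h_i (h_i - 1) x_i x_i^T
    return grad, H

np.random.seed(0)
X = np.random.randn(1000, 3)
y = (X.dot(np.array([1.0, -2.0, 0.5])) > 0).astype(float)
theta = np.zeros(3)

chunks = np.array_split(np.arange(len(y)), 4)              # 4 "mappers"
parts = [partial_stats(X[idx], y[idx], theta) for idx in chunks]
grad = sum(g for g, _ in parts)                            # "reducer": sum the partials
H = sum(h for _, h in parts)
theta = theta - np.linalg.solve(H, grad)                   # theta := theta - H^{-1} grad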
Example : Support Vector Machine
Linear SVM’s goal is to optimize the primal problem
\arg\min_{\omega, b} \ \|\omega\|^2 + C \sum_{i : \zeta_i \geq 0} \zeta_i^p \quad \text{s.t.} \quad y^{(i)} (\omega^T x^{(i)} + b) \geq 1 - \zeta_i
where p is either 1 (hinge loss) or 2 (quadratic loss).
The primal problem for the quadratic loss can be solved by batch gradient descent (sv denotes the set of support vectors):

\nabla = 2\omega + 2C \sum_{i \in sv} (\omega^T x_i - y_i)\, x_i \qquad H = I + C \sum_{i \in sv} x_i x_i^T
The mappers calculate the partial gradients and the reducer sums up the partial results to update ω.
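A minimal NumPy sketch of this scheme for the quadratic loss, simulating the mappers by splitting the rows into chunks and taking plain gradient steps (the data, learning rate and number of chunks are illustrative assumptions):

import numpy as np

def partial_gradient(X_chunk, y_chunk, w, C):
    # one "mapper": gradient contribution of this chunk's support vectors
    sv = y_chunk * X_chunk.dot(w) < 1
    return 2 * C * X_chunk[sv].T.dot(X_chunk[sv].dot(w) - y_chunk[sv])

np.random.seed(0)
X = np.random.randn(500, 3)
y = np.sign(X.dot(np.array([1.0, -1.0, 0.5])))
w, C, lr = np.zeros(3), 1.0, 0.01
chunks = np.array_split(np.arange(len(y)), 4)    # 4 "mappers"

for _ in range(100):
    grad = 2 * w + sum(partial_gradient(X[idx], y[idx], w, C) for idx in chunks)
    w = w - lr * grad                            # "reducer": sum partials, update w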
Apache Hadoop
Distributed data storage + MapReduce processing
Traditional network programming
Message passing between nodes (e.g. MPI)

Very difficult to do at scale:
• How to split the problem across nodes?
• Must consider network & data locality.
• How to deal with failures? (inevitable at scale)
• Even worse: stragglers (a node has not failed, but is slow).
• Ethernet networking is not fast.
• Have to write programs for each machine.
Rarely used in commodity datacenters.
MapReduce limitations
Difficulty of programming directly in MapReduce

Constrained model
A Map phase, then a Reduce phase.

For complex and iterative algorithms we need to chain several MapReduce phases.
Data transfer between these phases: disk storage.

Most optimization algorithms are iterative!
Result & Verdict
While MapReduce is simple, it can require asymptotically more communication or I/O.
Research on MapReduce algorithms doesn't go to waste, it just gets sped up and easier to use.
Still useful to study as an algorithmic framework, silly to use directly.

Therefore, people built specialized systems...
Why Apache Spark?
Spark's goal was to generalize MapReduce to support new applications within the same engine.
Benefit for users - the same engine performs data extraction, model training and interactive queries.

Two small additions are enough to express the previous models:
• Fast data sharing.
• General directed acyclic execution graphs (DAGs).
This allows for an approach which is more efficient for the engine,and much simpler for the end users.
Disk vs Memory
L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 100 ns
Main memory reference: 100 ns
Disk seek: 10,000,000 ns
In-Memory Computing
Hadoop MapReduce: shares data on disk
Apache Spark: speeds up processing by using memory
History
Lightning-fast cluster computing
http://spark.apache.org
Originally developed at UC Berkeley (AMPLab).
Open sourced in 2009, implemented in Scala.
Adoption and use cases
eBay: uses Spark for log processing (aggregation) and analytics, ...
Kelkoo: uses Spark and Spark Streaming for product recommendation, BI, real-time filtering of malicious activity, data mining.
Moody's Analytics: uses Spark for its credit risk calculation platform, (C)VaR calculation, ...
Amazon, Yahoo!, TripAdvisor, Hitachi, NASA, Ooyala, Shopify, Samsung, Socialmetrix, ...
http://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Spark is Hadoop compatible
Integration with Hadoop and its ecosystem,
HBase, Cassandra, MongoDB ...
Spark is Fast
In-Memory Computing
Suitable for iterative algorithms
Holds the record for sorting 100 TB of data on disk.
Spark is Simple
Ease of development - simple & intuitive APIs
APIs in Java, Scala, Python (+ SQL, Clojure, R)
Spark is Interactive
Interactive mode (Spark Shell, PySpark), standalone mode

Spark UI
Application monitoring

Spark is Streaming
Real-time processing (micro-batching)
Spark Streaming is easier to use than Apache Storm
Spark is (very) Active
Most active open source community in big data
+500 contributors
... is well Documented
One can find many examples, presentations, videos, MOOCs, events, meetups, ...
https://sparkhub.databricks.com
... with a large open-source community
cf. http://spark-packages.org
cf. Github
Spark Ecosystem
SparkContext
The first thing a Spark program should do is create a SparkContext object, which tells Spark how to access a cluster.
In the shell (Scala or Python), a variable sc is automatically created.
Other programs must use a constructor to instantiate a newSparkContext.
SparkContext can be used to create other variables.
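A minimal sketch of instantiating a SparkContext in a standalone Python program (the application name and master URL below are just examples):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[4]")
sc = SparkContext(conf=conf)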
Master URLs
The master parameter determines which cluster to use.
Master               Description
local                Run Spark locally with one worker thread (i.e. no parallelism at all)
local[K]             Run Spark locally with K worker threads (ideally set to # cores on your machine)
spark://HOST:PORT    Connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT    Connect to a Mesos cluster; PORT depends on config (5050 by default)
yarn                 Connect to a YARN cluster in client or cluster mode
Shell Python (PySpark) locally with 4 cores
$ pyspark --master local[4]

Shell Python (PySpark) to a standalone cluster, i.e. cluster1
$ pyspark --master spark://cluster1:7077

Submit a job (Python script example.py) locally with 4 cores
$ spark-submit --master local[4] example.py

Submit a job to a standalone cluster, i.e. cluster1
$ spark-submit --master spark://cluster1:7077 example.py
RDD
Resilient Distributed Datasets (RDD)
Collections of objects across a cluster
• User-controlled partitioning.
• Stored in memory or on disk.
• Built via parallel transformations (map, filter, ...).
• Automatically rebuilt on failure.

There are two types of RDD:
• Parallelized collections - take an existing collection and run functions on it in parallel.
• Hadoop datasets - run functions on each record of a file in the Hadoop distributed file system or any other storage system supported by Hadoop (Cassandra, MongoDB, Amazon S3, Hypertable, HBase, ...).
http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
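A small sketch of the first type, a parallelized collection (assuming the shell variable sc; the data and the number of slices are arbitrary):

data = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)
print(data.reduce(lambda a, b: a + b))   # 15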
Create an RDD from a local file

alice = sc.textFile("Alice_in_Wonderland.txt")

Create an RDD from a Hadoop/HDFS distributed file

alice = sc.textFile("hdfs://books/les_miserables.txt")
Create a RDD from a Cassandra table (DataFrame)
# add the Cassandra connection host to the sqlContext
sqlContext.setConf("spark.cassandra.connection.host", "172.16.0.161")

metrics = sqlContext.read.format("org.apache.spark.sql.cassandra") \
                    .options(keyspace="ssc", table="metrics").load()
Operations on RDDs (two types)

Transformations
Manipulate an RDD, return another RDD.
Transformations are lazy, not computed immediately.
Parallel execution.
Optimize the required calculations.
Recover from lost data partitions.
Examples: map(), filter(), join(), groupByKey(), ...

Actions
These are the final operations (compute, persistence, ...).
Do not return an RDD.
Persistence in memory or on disk.
Examples: reduce(), count(), foreach(), saveAsHadoopFile(), ...
A transformed RDD is calculated when an action is executed.
Counting words in "Alice in Wonderland"
alice = sc.textFile("Alice_in_Wonderland.txt")
counts = alice.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://counts.txt")
Create RDD alice from a local file.
Transformations: RDD alice is transformed (a flatMap(), a map() and a reduceByKey()) into a new RDD counts.
Actions : writing RDD counts into HDFS.
Transformations
Transformation              Description
map(func)                   return a new distributed dataset formed by passing each element
                            of the source through a function func
filter(func)                return a new dataset formed by selecting those elements of the
                            source on which func returns true
flatMap(func)               similar to map, but each input item can be mapped to 0 or more
                            output items (so func should return a Seq rather than a single item)
sample(withReplacement,     sample a fraction fraction of the data, with or without replacement,
  fraction, seed)           using a given random generator seed
union(otherDataset)         return a new dataset that contains the union of the elements in
                            the source dataset and the argument
distinct([numTasks])        return a new dataset that contains the distinct elements of the
                            source dataset
...                         ...
Actions
Action                   Description
reduce(func)             aggregate the elements of the dataset using a function func
collect()                return all the elements of the dataset as an array
count()                  return the number of elements in the dataset
first()                  return the first element of the dataset - similar to take(1)
take(n)                  return an array with the first n elements of the dataset
saveAsTextFile(path)     write the elements of the dataset as a text file, in HDFS or
                         any other Hadoop-supported file system
countByKey()             only available on RDDs of type (K, V); returns a Map of (K, Int)
                         pairs with the count of each key
foreach(func)            run a function func on each element of the dataset
...                      ...
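A small sketch combining a few of these transformations and actions (it assumes the RDD alice created earlier; the length threshold is arbitrary):

long_lines = alice.filter(lambda line: len(line) > 50)   # transformation (lazy)
print(long_lines.count())                                # action: triggers the computation
print(long_lines.take(2))                                # action: first 2 elements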
RDD partitions
An RDD is divided into partitions. We can control the partitioning of an RDD.

Number of partitions of an RDD:

alice = sc.textFile("Alice_in_Wonderland.txt")
alice.getNumPartitions()
2
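A sketch of controlling the number of partitions (the values 4 and 8 are arbitrary):

alice = sc.textFile("Alice_in_Wonderland.txt", minPartitions=4)
alice.getNumPartitions()      # 4
more = alice.repartition(8)   # reshuffles the data into 8 partitions (returns a new RDD)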
RDD lineage
Two types of dependencies based on transformations:
Narrow Dependencies vs. Wide Dependencies
Impact on performance in case of failure.
RDD lineage
An RDD records all the transformations necessary to build it.
Show the lineage of an RDD:
print counts.toDebugString()
(2) PythonRDD[7] at RDD at PythonRDD.scala:43 []
 |  MapPartitionsRDD[6] at mapPartitions at PythonRDD.scala:374 []
 |  ShuffledRDD[5] at partitionBy at NativeMethodAccessorImpl.java:-2 []
 +-(2) PairwiseRDD[4] at reduceByKey at <stdin>:1 []
    |  PythonRDD[3] at reduceByKey at <stdin>:1 []
    |  Alice_in_Wonderland.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
    |  Alice_in_Wonderland.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []
Effective recovery in case of failure.
DAG (Directed Acyclic Graph)
Transformations define a directed acyclic graph (DAG) of RDDs that will be used later when an action is called.
Operations on RDDs create new RDDs that refer back to their parents, thereby creating a graph.

How Spark works
RDD → DAG
Spark translates the transformations on RDDs into a DAG (directed acyclic graph).

DAG scheduler
The DAG scheduler divides operations into stages of tasks. A stage consists of tasks based on the partitions of the RDD (distributed data).

Task Scheduler
The stages are passed to the Task Scheduler. The Task Scheduler launches the tasks through the Cluster Manager.
Cluster Overview (master and workers)

We submit an application - execution is driven by a driver.
1. The driver connects to a cluster manager to allocate resources across applications.
2. It acquires executors on cluster nodes - processes that run compute tasks and cache data.
3. It sends the application code to the executors.
4. It sends tasks for the executors to run.
http://spark.apache.org/docs/latest/cluster-overview.html
RDD persistence
Spark can persist (or cache) a dataset in memory across operations.
Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset (often making future actions more than 10x faster).
The cache is fault-tolerant: if any partition of an RDD is lost, itwill automatically be recomputed using the transformations thatoriginally created it.
RDD Persistence Control
Each persisted RDD can be stored using a different storage level.
RDD.persist()

Storage level        Description
MEMORY_ONLY          Store RDD as deserialized Java objects in the JVM. This is the default level.
MEMORY_AND_DISK      Persist in memory and on disk.
DISK_ONLY            Store the RDD partitions only on disk.
...                  ...
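A small sketch of choosing an explicit storage level (using the RDD counts from the word-count example):

from pyspark import StorageLevel

counts.persist(StorageLevel.MEMORY_AND_DISK)   # explicit storage level
# counts.cache() would be the shorthand for persist(StorageLevel.MEMORY_ONLY)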
Shared Variables
Broadcast Variables
Keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Used to give every node a copy of a large input dataset efficiently. Spark uses efficient broadcast algorithms to reduce communication cost.
Accumulators
Used to implement counters and sums efficiently in parallel. Spark natively supports accumulators of numeric value types and standard mutable collections. Only the driver program can read an accumulator's value, not the tasks.
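A minimal sketch using both kinds of shared variables (the lookup table and the input lines are made up):

lookup = sc.broadcast({"a": 1, "b": 2})   # read-only copy shipped once to every node
blank_lines = sc.accumulator(0)           # counter the tasks can only add to

def parse(line):
    if line == "":
        blank_lines.add(1)
    return [lookup.value.get(w, 0) for w in line.split()]

parsed = sc.parallelize(["a b", "", "b b"]).map(parse)
parsed.count()                            # action: runs the tasks
print(blank_lines.value)                  # 1, readable only in the driver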
Spark Built-in Libraries
http://spark.apache.org/sql/
SQL execution engine - Using a DataFrame in SQL
Structured data processing
Queries structured data using SQL.
Standard connectivity via JDBC or ODBC. APIs in Java, Scala, Python and R.
DataFrames
DataFrames = RDD + named columns
DSL: select(), where(), groupBy(), ...
Hold tabular data.
Describe the schema → DataFrame.
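A small sketch of the DataFrame DSL and of the SQL entry point (Spark 1.x style, via sqlContext; the JSON file and column names are hypothetical):

df = sqlContext.read.json("people.json")
df.where(df["age"] > 21).select("name", "age").show()
df.groupBy("age").count().show()

df.registerTempTable("people")            # make the DataFrame queryable in SQL
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()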
http://spark.apache.org/mllib/
Machine Learning
MLlib is the Machine Learning library (based on Breeze, netlib-java, JBlas, BLAS/LAPACK).

Performance
High-quality algorithms, 100x faster than MapReduce.
MLlib Algorithms 1/2
Classification
SVMs, logistic regression, decision trees, naive Bayes, random forests and gradient-boosted trees.

Clustering
K-means, bisecting K-means, Gaussian mixtures (GMM) and power iteration clustering.

Dimensionality reduction & decomposition
Singular value decomposition (SVD), QR and principal component analysis (PCA).

Collaborative filtering - Recommendation
Alternating least squares (ALS), non-negative matrix factorization (NMF).
MLlib Algorithms 2/2
Basic statistics
Summary statistics, correlations, stratified sampling, hypothesis testing, and random data generation.

Feature extraction and transformation
Topic modeling via latent Dirichlet allocation (LDA), TF-IDF, Word2Vec, StandardScaler, and Normalizer.

Optimization
Stochastic gradient descent (SGD), limited-memory BFGS.

Regression
Generalized linear models (GLMs), regression trees with L1, L2, and elastic-net regularization...
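A minimal sketch of the RDD-based MLlib API, here with k-means (the data points and parameters are illustrative):

from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [0.1, 0.2], [9.0, 8.0], [8.5, 9.1]])
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)
print(model.predict([9.0, 9.0]))   # index of the closest center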
http://spark.apache.org/graphx/
GraphX is Spark's library for graphs and graph-parallel computation.

Flexibility
GraphX unifies ETL, exploratory analysis, and iterative graph computation within a single system.

Performance - Speed
Comparable to specialized (faster) graph processing systems.
GraphX supported algorithms
Page Rank
Connected components
Label propagation
SVD++
Strongly connected components
Triangle count
http://spark.apache.org/streaming/
Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.

Ease of Use
Build applications through high-level operators.

Batch + Streaming
Combine streaming with batch and interactive queries.
Micro-batches
Splits a continuous stream into batches.
API similar to Spark.
Spark Streaming ≠ Apache Storm
Spark Streaming ∼ Apache Storm + Trident
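A minimal Spark Streaming sketch: word counts over 1-second micro-batches read from a socket (the host and port are placeholders):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=1)    # 1-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()
ssc.start()
ssc.awaitTermination()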