Distributed Computing with Apache Spark
Convex and distributed optimization (3 ECTS)
Master of Science in Industrial and Applied Mathematics
Jeffrey Dean, Sanjay Ghemawat (Google Inc.) - MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004
Data and processing colocation: process the data where it is, avoiding network transfers and I/O.
This is the birth certificate of Big Data (i.e. MapReduce)
Simple processing: indexing, statistics, queries or frequent words.
Huge amounts of data: web pages, logs, documents (texts, images, videos).
Data distribution. Parallel processing. Fault management. Cost reduction (commodity PCs).
MapReduce Functional programming
Characteristics: operations sequenced by composition
(f ∘ g)(x) = f(g(x))
No order in the declarations. The result of a function depends only on its inputs (purely functional: no state).
Data/variables are not modifiable: no assignment, no explicit memory management.
Functional Inspiration of MapReduce
MapReduce pipeline: reduce(⊕) ∘ grp ∘ map(f)
Can be automatically parallelized on several computation units.
Map function
map : (A → B) → ([A] → [B])
map(f)[x_0, ..., x_n] = [f(x_0), ..., f(x_n)]
map(×2)[2, 3, 6] = [4, 6, 12]
Map Prototype in MapReduce
Map signature: (K1, V1) → [(K2, V2)]. Map is a particular prototype of the f of map(f).
Apply f to a collection of key/value pairs: for each pair (k, v), compute f(k, v).
Pseudocode example
function map(uri, document)
    foreach distinct term in document
        output (term, count(term, document))
Algebraic properties of Map
map(id) = id, with id(x) = x
map(f ∘ g) = map(f) ∘ map(g)
map(f)[x] = [f(x)]
map(f)(xs ++ ys) = map(f)(xs) ++ map(f)(ys)
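These algebraic laws can be checked directly in Python with the built-in map (a small sketch; the helper names compose and identity are mine):

```python
# Checking the algebraic laws of map on concrete lists.
f = lambda x: x + 1
g = lambda x: 2 * x
identity = lambda x: x
compose = lambda u, v: (lambda x: u(v(x)))  # (u . v)(x) = u(v(x))

xs = [1, 2, 3]
ys = [4, 5]

# map(id) = id
assert list(map(identity, xs)) == xs

# map(f . g) = map(f) . map(g)
assert list(map(compose(f, g), xs)) == list(map(f, map(g, xs)))

# map(f)[x] = [f(x)]
assert list(map(f, [7])) == [f(7)]

# map(f)(xs ++ ys) = map(f)(xs) ++ map(f)(ys)
assert list(map(f, xs + ys)) == list(map(f, xs)) + list(map(f, ys))
```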
Applications: simplification and automatic program rewriting; (algebraic) proofs of equivalence; automatic parallelization of computations.
grp : [(A × B)] → [(A × [B])]
grp[..., (w, a_0), ..., (w, a_n), ...] = [..., (w, [a_0, ..., a_n]), ...]
grp[(a, 2), (z, 2), (ab, 3), (a, 4)] = [(a, [2, 4]), (z, [2]), (ab, [3])]
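grp has no direct Python built-in; a minimal single-machine sketch, assuming insertion-ordered dicts (Python 3.7+):

```python
from collections import defaultdict

def grp(pairs):
    """Group a list of (key, value) pairs into (key, [values]),
    keeping the order in which keys first appear."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return list(groups.items())

print(grp([("a", 2), ("z", 2), ("ab", 3), ("a", 4)]))
# [('a', [2, 4]), ('z', [2]), ('ab', [3])]
```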
Sort/Group/Shuffle Prototype in MapReduce
Signature: grp : [(K2, V2)] → [(K2, [V2])]. Recalls the GROUP BY/ORDER BY instructions in SQL. grp is called transparently between the Map and Reduce phases.
reduce : (A × A → A) → ([A] → A)
reduce(⊕)[x_0, ..., x_n] = x_0 ⊕ x_1 ⊕ ... ⊕ x_{n-1} ⊕ x_n
reduce(+)[2, 1, 3] = 2 + 1 + 3 = 6
Reduce Prototype in MapReduce
Signature: reduce : [(K2, [V2])] → [(K3, V3)]. Reduce is a particular prototype for reduce(⊕): we apply ⊕ to the collection of values associated with each key.
Pseudocode example
function reduce(term, counts)
    output (term, sum(counts))
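Putting the three phases together, a minimal single-machine sketch of the word-count pipeline (the function names map_phase, grp and reduce_phase are mine, not part of any MapReduce API):

```python
from collections import defaultdict

def map_phase(uri, document):
    """Emit (term, count_in_document) pairs for one document."""
    counts = defaultdict(int)
    for term in document.split():
        counts[term] += 1
    return list(counts.items())

def grp(pairs):
    """Shuffle phase: group values by key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups.items()

def reduce_phase(term, counts):
    """Sum the per-document counts of one term."""
    return (term, sum(counts))

docs = {"doc1": "big data big compute", "doc2": "big data"}
intermediate = [kv for uri, d in docs.items() for kv in map_phase(uri, d)]
result = dict(reduce_phase(t, cs) for t, cs in grp(intermediate))
print(result)  # {'big': 3, 'data': 2, 'compute': 1}
```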
Example : Matrix-Vector Multiplication
Let A be an m × n matrix and v be a vector of length n:

    A = (a_ij), 1 ≤ i ≤ m, 1 ≤ j ≤ n,    v = (v_1, ..., v_n)^T

The product Av is a vector x of length m with components

    x_i = Σ_{j=1}^{n} a_ij v_j,    i = 1, ..., m
Example : Matrix-Vector Multiplication
MapReduce pseudocode for computing the matrix-vector product Av
map(key, value):
    for (i, j, a_ij) in value:
        emit(i, a_ij * v[j])

reduce(key, values):
    result = 0
    for value in values:
        result += value
    emit(key, result)
Communication costs: the Map tasks cost O(mn + n) and the Reduce tasks cost O(mn).
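The same pseudocode can be simulated on a single machine; a sketch assuming the matrix is given as sparse (i, j, a_ij) triples:

```python
from collections import defaultdict

def mapreduce_matvec(entries, v):
    """entries: (i, j, a_ij) triples of a sparse m x n matrix A.
    Returns the product Av as a dict {row index: value}."""
    # Map phase: emit (i, a_ij * v[j]) for each nonzero entry
    emitted = [(i, a_ij * v[j]) for (i, j, a_ij) in entries]
    # Shuffle/group phase: gather partial products by row index
    groups = defaultdict(list)
    for i, x in emitted:
        groups[i].append(x)
    # Reduce phase: sum the partial products of each row
    return {i: sum(xs) for i, xs in groups.items()}

A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]  # [[1, 2], [0, 3]]
v = [10.0, 1.0]
print(mapreduce_matvec(A, v))  # {0: 12.0, 1: 3.0}
```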
Example : Logistic Regression
We choose the hypothesis of the form h_θ(x) = 1 / (1 + exp(−θ^T x)) and fit θ by using Newton-Raphson:

    θ := θ − H^{-1} ∇_θ ℓ(θ),    where ℓ(θ) is the log-likelihood

∇_θ ℓ(θ) is computed in parallel by mappers, each summing over its subgroup of examples the terms

    (y^(i) − h_θ(x^(i))) x_j^(i)

The Hessian matrix is computed by the mappers with the summation

    H(j, k) := H(j, k) + h_θ(x^(i)) (h_θ(x^(i)) − 1) x_j^(i) x_k^(i)

The reducer sums up the gradient and Hessian values to perform the update.
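A single-machine sketch of one Newton-Raphson step, with mappers computing partial gradients and Hessians over data shards and a reducer summing them (all function names are mine; the Newton solve is shown for the one-dimensional case only):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mapper(shard, theta):
    """Partial gradient of the log-likelihood and partial Hessian
    computed on one shard of (x, y) examples."""
    n = len(theta)
    grad = [0.0] * n
    hess = [[0.0] * n for _ in range(n)]
    for x, y in shard:
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        for j in range(n):
            grad[j] += (y - h) * x[j]                      # (y - h(x)) x_j
            for k in range(n):
                hess[j][k] += h * (h - 1.0) * x[j] * x[k]  # h(x)(h(x)-1) x_j x_k
    return grad, hess

def reducer(partials):
    """Sum the mappers' partial gradients and Hessians."""
    grads, hesses = zip(*partials)
    n = len(grads[0])
    grad = [sum(g[j] for g in grads) for j in range(n)]
    hess = [[sum(h[j][k] for h in hesses) for k in range(n)]
            for j in range(n)]
    return grad, hess

# Toy dataset split into two shards, one feature, starting at theta = [0.0]
shards = [[([1.0], 1)], [([1.0], 0), ([2.0], 1)]]
theta = [0.0]
grad, hess = reducer([mapper(s, theta) for s in shards])
# Newton update theta := theta - H^{-1} grad (H is 1x1 here)
theta = [theta[0] - grad[0] / hess[0][0]]
```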
Example : Support Vector Machine
The linear SVM's goal is to optimize the primal problem

    min_{w,b}  ||w||² + C Σ_i ξ_i^p    s.t.  y^(i) (w^T x^(i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

where p is either 1 (hinge loss) or 2 (quadratic loss).

The primal problem for quadratic loss can be solved by batch gradient descent (sv denotes the set of support vectors):

    ∇ = 2w + 2C Σ_{i∈sv} (w^T x_i − y_i) x_i    and    H = I + C Σ_{i∈sv} x_i x_i^T

The mappers calculate the partial gradients and the reducer sums up the partial results to update w.
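A single-machine sketch of this scheme for the quadratic loss, with the support-vector test y_i (w^T x_i) < 1 applied inside each mapper (function names are mine; the bias b is omitted for brevity):

```python
def partial_gradient(shard, w, C):
    """Mapper: gradient contribution of one shard for the quadratic loss.
    An example is a support vector when y_i * (w . x_i) < 1."""
    n = len(w)
    g = [0.0] * n
    for x, y in shard:
        wx = sum(wj * xj for wj, xj in zip(w, x))
        if y * wx < 1:  # support vector
            for j in range(n):
                g[j] += 2.0 * C * (wx - y) * x[j]
    return g

def full_gradient(shards, w, C):
    """Reducer: sum the partial gradients and add the 2w regularization term."""
    total = [2.0 * wj for wj in w]
    for g in (partial_gradient(s, w, C) for s in shards):
        for j in range(len(w)):
            total[j] += g[j]
    return total

# Two one-example shards; with w = 0 both examples are support vectors
shards = [[([1.0, 0.0], 1)], [([0.0, 1.0], -1)]]
g = full_gradient(shards, [0.0, 0.0], 1.0)  # [-2.0, 2.0]
```

A gradient descent step w := w − η∇ then moves w toward positive weight on the +1 example's feature and negative weight on the −1 example's feature, as expected.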
Apache Hadoop
Distributed Data Storage + MapReduce Processing
Traditional network programming
Message-passing between nodes (e.g. MPI)
Very difficult to do at scale:
How to split the problem across nodes? Must consider network & data locality.
How to deal with failures? (inevitable at scale)
Even worse: stragglers (a node that has not failed, but is slow).
Ethernet networking is not fast.
Have to write programs for each machine.
Rarely used in commodity datacenters.
Difficulty of programming directly in MapReduce
Constrained model: a Map phase then a Reduce phase.
For complex and iterative algorithms we need to chain several MapReduce phases.
Data transfer between these phases: disk storage.
Most optimization algorithms are iterative!
Result & Verdict
While MapReduce is simple, it can require asymptotically more communication or I/O.
MapReduce algorithms research doesn't go to waste, it just gets sped up and easier to use.
Still useful to study as an algorithmic framework, silly to use directly.
Therefore, people built specialized systems...
Why Apache Spark?
Spark's goal was to generalize MapReduce to support new apps within the same engine.
Benefit for users: the same engine performs data extraction, model training and interactive queries.
Two small additions are enough to express the previous models:
Fast data sharing.
General directed acyclic execution graphs (DAGs).
This allows for an approach which is more efficient for the engine,and much simpler for the end users.
Disk vs Memory
L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 100 ns
Main memory reference: 100 ns
Disk seek: 10,000,000 ns
Hadoop MapReduce: shares data on disk.
Apache Spark: speeds up processing by using memory.
Lightning-fast cluster computing
Originally developed by UC Berkeley (AMPLab)
Open sourced in 2009 and implemented in Scala
Adoption and use cases
eBay: uses Spark for log processing (aggregation), analytics, ...
Kelkoo: uses Spark and Spark Streaming for product recommendation, BI, real-time filtering of malicious activity, data mining.
Moody's Analytics: uses Spark for its credit risk calculation platform, (C)VaR calculation, ...
Amazon, Yahoo!, TripAdvisor, Hitachi, NASA, Ooyala, Shopify, Samsung, Socialmetrix, ...
Spark is Hadoop compatible
Integration with Hadoop and its ecosystem: HBase, Cassandra, MongoDB, ...
Spark is Fast
In-Memory Computing
Suitable for iterative algorithms
Holds the record for sorting 100 TB on disk.
Spark is Simple
Ease of development: simple & intuitive APIs
APIs in Java, Scala, Python (+ SQL, Clojure, R)
Spark is Interactive
Interactive mode (Spark Shell, PySpark), standalone mode
Spark UI: application monitoring
Spark is Streaming
Real-time processing (micro-batching)
Spark Streaming is easier to use than Apache Storm
Spark is (very) Active
The most active open source community in big data
... and is well documented
One can find many examples, presentations, videos, MOOCs, events, meetups, ...
... with a large open-source community
The first thing a Spark program must do is create a SparkContext object, which tells Spark how to access a cluster.
In the shell (Scala or Python), a variable sc is automatically created.
Other programs must use a constructor to instantiate a new SparkContext.
The SparkContext can then be used to create other variables.
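A minimal sketch of creating a SparkContext explicitly (requires a local Spark installation, e.g. pip install pyspark; the application name and master URL below are arbitrary examples):

```python
from pyspark import SparkConf, SparkContext

# Configure the application and the cluster master to use
conf = SparkConf().setAppName("example").setMaster("local[4]")
sc = SparkContext(conf=conf)

# Use the context to create an RDD and run a small computation
rdd = sc.parallelize(range(10))
print(rdd.map(lambda x: x * x).reduce(lambda a, b: a + b))  # 285

sc.stop()
```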
Master URLs
The master parameter determines which cluster to use.
Master              Description
local               Run Spark locally with one worker thread (i.e. no parallelism at all)
local[K]            Run Spark locally with K worker threads (ideally set to the number of cores on your machine)
spark://HOST:PORT   Connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT   Connect to a Mesos cluster; PORT depends on config (5050 by default)
yarn                Connect to a YARN cluster in client or cluster mode
Shell Python (PySpark) locally with 4 cores
$ pyspark --master local[4]
Shell Python (PySpark) on a standalone cluster, e.g. cluster1
$ pyspark --master spark://cluster1:7077
Submit a job (Python script example.py) locally with 4 cores
$ spark-submit --master local[4] example.py
Submit a job to a standalone cluster, e.g. cluster1
$ spark-submit --master spark://cluster1:7077 example.py
RDD
Resilient Distributed Datasets (RDD)
Collections of objects distributed across a cluster.
User-controlled partitioning.
Stored in memory or on disk.
Built via parallel transformations (map, filter, ...).
Automatically rebuilt on failure.