Distributed Computing with Apache Spark


  • Distributed Computing with Apache Spark

    Convex and distributed optimization (3 ECTS)

    Master of Science in Industrial and Applied Mathematics


  • Original motivation

    Jeffrey Dean and Sanjay Ghemawat (Google Inc.), "MapReduce: Simplified Data Processing on Large Clusters", OSDI, 2004.

    Data and processing colocation: process the data where it is stored, and avoid network transfers and I/O.

    This paper is the birth certificate of Big Data (i.e. MapReduce).

  • Original context

    Simple processing: indexing, statistics, queries or frequent words.

    Huge amounts of data: web pages, logs, documents (texts, images, videos).

    Original challenges

      - Data distribution
      - Parallel processing
      - Fault management
      - Cost reduction (commodity PCs)

  • MapReduce: Functional programming

    Characteristics

    Operations are sequenced by composition:

    $(f \circ g)(x) = f(g(x))$

    Declarations have no order. The result of a function depends only on its inputs (purely functional, no state).

    Data and variables are not modifiable: no assignment, no explicit memory management.

    Functional inspiration of MapReduce

    MapReduce pipeline: $\mathrm{reduce}(\oplus) \circ \mathrm{grp} \circ \mathrm{map}(f)$

    The pipeline can be automatically parallelized over several computing units.

  • Map function

    $\mathrm{map} : (A \to B) \to ([A] \to [B])$

    $\mathrm{map}(f)[x_0, \ldots, x_n] = [f(x_0), \ldots, f(x_n)]$

    $\mathrm{map}(\times 2)[2, 3, 6] = [4, 6, 12]$

    Map prototype in MapReduce

    Map documentation: $(K_1, V_1) \to [(K_2, V_2)]$

    Map is a particular prototype of the f in map(f). It applies f to a collection of key/value pairs: for each pair (k, v), it computes f(k, v).

    Pseudocode example

        function map(uri, document)
            foreach distinct term in document
                output (term, count(term, document))
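    The pseudocode above can be sketched in plain Python. This is only an illustration: the name `map_word_count`, the whitespace tokenization and the sorted output are assumptions made for the example, not part of the MapReduce API.

```python
from collections import Counter


def map_word_count(uri, document):
    """Emit one (term, count) pair per distinct term of the document.

    `uri` is the input key; it is unused here, as in the pseudocode.
    Splitting on whitespace is an assumption made for this sketch.
    """
    counts = Counter(document.split())
    # Sort the pairs only to make the output deterministic for the example.
    return sorted(counts.items())


pairs = map_word_count("doc1", "to be or not to be")
print(pairs)
```

    Each call handles a single (key, value) input and depends on nothing else, which is what lets a framework run many such calls in parallel.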

  • Map function

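    Algebraic properties of map

      - $\mathrm{map}(\mathrm{id}) = \mathrm{id}$, with $\mathrm{id}(x) = x$
      - $\mathrm{map}(f \circ g) = \mathrm{map}(f) \circ \mathrm{map}(g)$
      - $\mathrm{map}(f)[x] = [f(x)]$
      - $\mathrm{map}(f)(xs \mathbin{+\!\!+} ys) = \mathrm{map}(f)(xs) \mathbin{+\!\!+} \mathrm{map}(f)(ys)$

    Applications

      - Simplification and automatic program rewriting.
      - Algebraic proofs of equivalence.
      - Automatic parallelization of computations.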
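    The four properties can be checked on small lists. The helpers `fmap` and `compose` below are hypothetical names introduced for this sketch; the last property is the one that justifies splitting a list across workers and concatenating the partial results.

```python
def fmap(f):
    """Curried map: turns a function on elements into a function on lists."""
    return lambda xs: [f(x) for x in xs]


def compose(f, g):
    """Function composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: f(g(x))


def double(x):
    return 2 * x


def inc(x):
    return x + 1


xs, ys = [2, 3, 6], [1, 4]

# map(id) = id
identity_law = fmap(lambda x: x)(xs) == xs
# map(f . g) = map(f) . map(g)
fusion_law = fmap(compose(double, inc))(xs) == compose(fmap(double), fmap(inc))(xs)
# map(f)[x] = [f(x)]
singleton_law = fmap(double)([5]) == [double(5)]
# map(f)(xs ++ ys) = map(f)(xs) ++ map(f)(ys)
concat_law = fmap(double)(xs + ys) == fmap(double)(xs) + fmap(double)(ys)

print(identity_law, fusion_law, singleton_law, concat_law)
```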

  • Sort/Group/Shuffle function

    $\mathrm{grp} : [(A \times B)] \to [(A \times [B])]$

    $\mathrm{grp}[\ldots, (w, a_0), \ldots, (w, a_n), \ldots] = [\ldots, (w, [a_0, \ldots, a_n]), \ldots]$

    $\mathrm{grp}[(a, 2), (z, 2), (ab, 3), (a, 4)] = [(a, [2, 4]), (z, [2]), (ab, [3])]$

    Sort/Group/Shuffle prototype in MapReduce

    Documentation: $\mathrm{grp} : [(K_2, V_2)] \to [(K_2, [V_2])]$

    It is reminiscent of the GROUP BY / ORDER BY clauses in SQL. grp is called transparently between the Map and Reduce phases.
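    A minimal single-machine sketch of grp, assuming keys are hashable and that groups are returned in order of first appearance (a real shuffle would also partition and sort keys across machines):

```python
def grp(pairs):
    """Group a list of (key, value) pairs into (key, [values]) pairs."""
    groups = {}
    for key, value in pairs:
        # dicts preserve insertion order, so keys keep their first-seen order
        groups.setdefault(key, []).append(value)
    return list(groups.items())


grouped = grp([("a", 2), ("z", 2), ("ab", 3), ("a", 4)])
print(grouped)
```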

  • Reduce function

    $\mathrm{reduce} : (A \times A \to B) \to ([A] \to B)$

    $\mathrm{reduce}(\oplus)[x_0, \ldots, x_n] = x_0 \oplus x_1 \oplus \cdots \oplus x_{n-1} \oplus x_n$

    $\mathrm{reduce}(+)[2, 1, 3] = 2 + 1 + 3 = 6$

    Reduce prototype in MapReduce

    Documentation: $\mathrm{reduce} : [(K_2, V_2)] \to [(K_3, [V_3])]$

    Reduce is a particular prototype of $\mathrm{reduce}(\oplus)$: it is applied to the collection of values associated with a key.

    Pseudocode example

        function reduce(term, counts)
            output (term, sum(counts))
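    Putting the three phases together gives a complete word count. The sketch below runs on one machine; the function names and the three sample documents are made up for the illustration.

```python
from collections import Counter
from functools import reduce
from operator import add


def map_phase(uri, document):
    """Map: emit (term, count) for each distinct term of one document."""
    return list(Counter(document.split()).items())


def group_phase(pairs):
    """Shuffle: gather all values emitted for the same key."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups


def reduce_phase(term, counts):
    """Reduce: fold the counts of one term with +."""
    return (term, reduce(add, counts))


# Hypothetical input collection, keyed by document URI.
documents = {"d1": "spark is fast", "d2": "spark is simple", "d3": "fast spark"}

intermediate = [pair for uri, doc in documents.items() for pair in map_phase(uri, doc)]
word_counts = dict(reduce_phase(t, c) for t, c in group_phase(intermediate).items())
print(word_counts)
```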

  • Example : Matrix-Vector Multiplication

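    Let $A$ be an $m \times n$ matrix and $v$ be a vector of length $n$:

    $$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}, \quad v = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}$$

    The product $Av$ is a vector of length $m$:

    $$Av = \begin{pmatrix} \sum_{j=1}^{n} a_{1j} v_j \\ \sum_{j=1}^{n} a_{2j} v_j \\ \vdots \\ \sum_{j=1}^{n} a_{mj} v_j \end{pmatrix}$$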

  • Example : Matrix-Vector Multiplication

    MapReduce pseudocode for computing the matrix-vector product

        map(key, value):
            for (i, j, a_ij) in value:
                emit(i, a_ij * v[j])

        reduce(key, values):
            result = 0
            for value in values:
                result += value
            emit(key, result)

    Communication costs: O(mn + n) for the Map tasks, O(mn) for the Reduce tasks.
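    The same computation, simulated in plain Python. It assumes that each map task receives a chunk of (i, j, a_ij) entries and that the whole vector v is available to every mapper; the 3 x 2 matrix and the split into two chunks are arbitrary choices for the example.

```python
def map_block(value, v):
    """Map: for each entry (i, j, a_ij) of the block, emit (i, a_ij * v[j])."""
    return [(i, a_ij * v[j]) for (i, j, a_ij) in value]


def reduce_row(key, values):
    """Reduce: sum all partial products emitted for row `key`."""
    return (key, sum(values))


A = [[1, 2], [3, 4], [5, 6]]
v = [10, 1]

# Flatten A into (i, j, a_ij) triples and split them into two map inputs.
entries = [(i, j, a) for i, row in enumerate(A) for j, a in enumerate(row)]
chunks = [entries[:3], entries[3:]]

emitted = [pair for chunk in chunks for pair in map_block(chunk, v)]

groups = {}
for i, product in emitted:
    groups.setdefault(i, []).append(product)

result = [total for _, total in sorted(reduce_row(i, ps) for i, ps in groups.items())]
print(result)
```

    Note that a row may be spread over several chunks (row 1 here); the grouping step is what brings its partial products back together.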

  • Example : Logistic Regression

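    We choose the hypothesis $h_\theta(x) = 1 / (1 + \exp(-\theta^T x))$ and fit it with the Newton-Raphson method:

    $$\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)$$

    where $\ell(\theta)$ is the likelihood function.

    $\nabla_\theta \ell(\theta)$ is computed in parallel by the mappers, each summing over its subgroup at every Newton-Raphson step:

    $$\sum_{\text{subgroup}} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

    The Hessian matrix is computed by the mappers with the summation

    $$H(j, k) := H(j, k) + h_\theta(x^{(i)}) \left( h_\theta(x^{(i)}) - 1 \right) x_j^{(i)} x_k^{(i)}$$

    The reducer sums up the gradient and Hessian values to perform the update.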
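    A minimal sketch of one such Newton-Raphson iteration in plain Python, restricted to two parameters (an intercept and one feature) so that the Hessian can be inverted by hand. The six training points, the split into two subgroups and all function names are assumptions made for the illustration.

```python
import math


def h(theta, x):
    """Logistic hypothesis h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))


def mapper(theta, subgroup):
    """Partial gradient and partial Hessian over one subgroup of (x, y) pairs."""
    n = len(theta)
    grad = [0.0] * n
    hess = [[0.0] * n for _ in range(n)]
    for x, y in subgroup:
        p = h(theta, x)
        for j in range(n):
            grad[j] += (y - p) * x[j]
            for k in range(n):
                hess[j][k] += p * (p - 1.0) * x[j] * x[k]
    return grad, hess


def reducer(partials):
    """Sum the partial gradients and Hessians produced by the mappers."""
    n = len(partials[0][0])
    grad = [sum(g[j] for g, _ in partials) for j in range(n)]
    hess = [[sum(H[j][k] for _, H in partials) for k in range(n)] for j in range(n)]
    return grad, hess


def newton_step(theta, subgroups):
    """theta := theta - H^{-1} grad, with an explicit 2x2 inverse."""
    g, H = reducer([mapper(theta, s) for s in subgroups])
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    step0 = (H[1][1] * g[0] - H[0][1] * g[1]) / det
    step1 = (-H[1][0] * g[0] + H[0][0] * g[1]) / det
    return [theta[0] - step0, theta[1] - step1]


# Toy data: x = (1, feature), labels chosen so the classes overlap.
subgroups = [
    [((1.0, 0.0), 0), ((1.0, 1.0), 0), ((1.0, 2.0), 1)],
    [((1.0, 3.0), 0), ((1.0, 4.0), 1), ((1.0, 5.0), 1)],
]

theta = [0.0, 0.0]
first_step = newton_step(theta, subgroups)
for _ in range(10):
    theta = newton_step(theta, subgroups)
final_gradient, _ = reducer([mapper(theta, s) for s in subgroups])
print(first_step, theta, final_gradient)
```

    Only the partial sums travel from the mappers to the reducer, so the amount of data exchanged depends on the number of parameters, not on the number of training examples.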

  • Example : Support Vector Machine

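    The goal of a linear SVM is to optimize the primal problem

    $$\min_{\theta, b} \; \|\theta\|^2 + C \sum_{i : \xi_i > 0} \xi_i^p \quad \text{s.t.} \quad y^{(i)} (\theta^T x^{(i)} + b) \geq 1 - \xi_i$$

    where $p$ is either 1 (hinge loss) or 2 (quadratic loss).

    The primal problem for the quadratic loss can be solved by batch gradient descent ($sv$ denotes the set of support vectors):

    $$\nabla = 2\theta + 2C \sum_{i \in sv} (\theta \cdot x_i - y_i) x_i \quad \text{and} \quad H = I + C \sum_{i \in sv} x_i x_i^T$$

    The mappers calculate the partial gradients and the reducer sums up the partial results to update $\theta$.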
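    One batch gradient step for the quadratic loss can be sketched as follows. Several simplifications are assumed here and are not from the slides: the bias b is dropped, a point counts as a support vector when its margin y (theta . x) is below 1, and the data, C and the learning rate are arbitrary.

```python
def dot(u, w):
    return sum(a * b for a, b in zip(u, w))


def mapper(theta, subgroup):
    """Partial sum of (theta . x_i - y_i) x_i over the support vectors of a subgroup."""
    partial = [0.0] * len(theta)
    for x, y in subgroup:
        if y * dot(theta, x) < 1:  # margin violated: treated as a support vector
            residual = dot(theta, x) - y
            for j in range(len(theta)):
                partial[j] += residual * x[j]
    return partial


def reducer(theta, partials, C):
    """Full gradient: 2 theta + 2 C * (sum of the partial results)."""
    total = [sum(p[j] for p in partials) for j in range(len(theta))]
    return [2 * theta[j] + 2 * C * total[j] for j in range(len(theta))]


# Toy data split into two subgroups, one per mapper.
subgroups = [
    [((1.0, 2.0), 1), ((2.0, 1.0), 1)],
    [((-1.0, -1.0), -1), ((-2.0, -1.0), -1)],
]

theta = [0.0, 0.0]
C, learning_rate = 1.0, 0.05

gradient = reducer(theta, [mapper(theta, s) for s in subgroups], C)
theta = [t - learning_rate * g for t, g in zip(theta, gradient)]
print(gradient, theta)
```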

  • Apache Hadoop

    Distributed Data Storage + MapReduce Processing

  • Traditional network programming

    Message-passing between nodes (e.g. MPI)

    Very difficult to do at scale:

      - How to split the problem across nodes? Network and data locality must be considered.
      - How to deal with failures? (inevitable at scale)
      - Even worse: stragglers (a node that has not failed, but is slow)
      - Ethernet networking is not fast
      - Programs have to be written for each machine

    Rarely used in commodity datacenters.

  • MapReduce limitations

    Difficulty of programming directly in MapReduce

    Constrained model: a Map phase, then a Reduce phase.

    Complex and iterative algorithms require chaining several MapReduce phases.

    Data transfer between these phases goes through disk storage.

    Most optimization algorithms are iterative!

  • Result & Verdict

    While MapReduce is simple, it can require asymptotically more communication or I/O.

    Research on MapReduce algorithms doesn't go to waste; it just gets sped up and becomes easier to use.

    Still useful to study as an algorithmic framework, silly to use directly.

  • Therefore, people built specialized systems...

  • Why Apache Spark?

    Spark's goal was to generalize MapReduce to support new applications within the same engine.

    Benefit for users: the same engine performs data extraction, model training and interactive queries.

    Two small additions are enough to express the previous models:

      - Fast data sharing.
      - General directed acyclic execution graphs (DAGs).

    This allows for an approach that is more efficient for the engine and much simpler for the end users.

  • Disk vs Memory

    L1 cache reference:              0.5 ns
    L2 cache reference:                7 ns
    Mutex lock/unlock:               100 ns
    Main memory reference:           100 ns
    Disk seek:                10,000,000 ns

  • In-Memory Computing

    Hadoop MapReduce : Share data on disk

    Apache Spark : Speed up processing using the memory

  • Historical

  • Lightning-fast cluster computing


    Originally developed at UC Berkeley (AMPLab)

    Started in 2009, open sourced in 2010, and implemented in Scala

  • Adoption and use cases

    eBay: uses Spark for log processing (aggregation) and analytics, ...

    Kelkoo: uses Spark and Spark Streaming for product recommendation, BI, real-time filtering of malicious activity, data mining.

    Moody's Analytics: uses Spark for its credit risk calculation platform, (C)VaR calculation, ...

    Amazon, Yahoo!, TripAdvisor, Hitachi, NASA, Ooyala, Shopify, Samsung, Socialmetrix, ...


  • Spark is Hadoop compatible

    Integration with Hadoop and its ecosystem: HBase, Cassandra, MongoDB ...

  • Spark is Fast

    In-Memory Computing

    Suitable for iterative algorithms

    Set a record for sorting 100 TB of data on disk.

  • Spark is Simple

    Ease of development: simple and intuitive APIs

    APIs in Java, Scala, Python (+ SQL, Clojure, R)

  • Spark is Interactive

    Interactive mode (Spark Shell, PySpark), standalone mode

  • Spark UI

    Application Monitoring

  • Spark is Streaming

    Real-time processing (Micro-Batching)

    Spark Streaming is easier to use than Apache Storm

  • Spark is (very) Active

    Most active open source community in big data

    500+ contributors

  • ... is well Documented

    One can find many examples, presentations, videos, MOOCs, events, meetups, ...

  • ... with a large open-source community

    cf. http://spark-packages.org

    cf. Github

  • Spark Ecosystem

  • SparkContext

    The first thing a Spark program should do is create a SparkContext object, which tells Spark how to access a cluster.

    In the shell (Scala or Python), a variable sc is automatically created.

    Other programs must use a constructor to instantiate a new SparkContext.

    SparkContext can be used to create other variables.

  • Master URLs

    The master parameter determines which cluster to use.

    Master              Description
    local               Run Spark locally with one worker thread (i.e. no parallelism at all)
    local[K]            Run Spark locally with K worker threads (ideally set to # cores on your machine)
    spark://HOST:PORT   Connect to a Spark standalone cluster; PORT depends on config (7077 by default)
    mesos://HOST:PORT   Connect to a Mesos cluster; PORT depends on config (5050 by default)
    yarn                Connect to a YARN cluster in client or cluster mode

  • Python shell (PySpark), locally with 4 cores

    $ pyspark --master local[4]

    Python shell (PySpark) connected to a standalone cluster, e.g. cluster1

    $ pyspark --master spark://cluster1:7077

    Submit a job (Python script example.py) locally with 4 cores

    $ spark-submit --master local[4] example.py

    Submit a job to a standalone cluster, e.g. cluster1

    $ spark-submit --master spark://cluster1:7077 example.py

  • RDD

    Resilient Distributed Datasets (RDD)

    Collections of objects across a cluster

      - User-controlled partitioning.
      - Stored in memory or on disk.
      - Built via parallel transformations (map, filter, ...).
      - Aut