### Transcript of Distributed Computing with Apache Spark

• Distributed Computing with Apache Spark

Convex and distributed optimization (3 ECTS)

Master of Science in Industrial and Applied Mathematics

2016

• Original motivation

Google Inc. Jeffrey Dean, Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters", OSDI, 2004.

Data and processing colocation: process the data where it is stored, avoiding network transfers and disk I/O.

This is the birth certificate of Big Data (i.e. MapReduce)

• Original context

Simple processing: indexing, statistics, queries, or frequent words.

Huge amounts of data: web pages, logs, documents (texts, images, videos).

Original challenges:
- Data distribution
- Parallel processing
- Fault management
- Cost reduction (commodity PCs)

• MapReduce: Functional programming

Characteristics: operations are sequenced by composition

    (f ∘ g)(x) = f(g(x))

No order in the declarations: the result of a function depends only on its inputs (pure functions, no state).

Data/variables are immutable: no assignment, no explicit memory management.

Functional inspiration of MapReduce. The MapReduce pipeline is a composition: reduce(⊕) ∘ grp ∘ map(f). It can be automatically parallelized over several compute units.
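The composition principle above can be sketched in plain Python (an illustrative example; the function names are ours, not from the slides):

```python
# Minimal sketch of function composition, the sequencing
# principle behind the MapReduce pipeline.

def compose(f, g):
    """(f . g)(x) = f(g(x))"""
    return lambda x: f(g(x))

double = lambda x: 2 * x
increment = lambda x: x + 1

# (double . increment)(3) = double(increment(3)) = 8
assert compose(double, increment)(3) == 8
# Composition is not commutative: (increment . double)(3) = 7
assert compose(increment, double)(3) == 7
```

The MapReduce pipeline is exactly such a composition of three stages, which is what makes it amenable to automatic parallelization.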

• Map function

    map : (A → B) → ([A] → [B])

    map(f)[x₀, ..., xₙ] = [f(x₀), ..., f(xₙ)]

    map(×2)[2, 3, 6] = [4, 6, 12]

Map prototype in MapReduce. Map : (K1, V1) → [(K2, V2)]. Map is a particular prototype of the f in map(f).

Apply f to a collection of key/value pairs, computing f(k, v) for each pair (k, v).

Pseudocode example:

    function map(uri, document)
      foreach distinct term in document
        output (term, count(term, document))

• Map function

Algebraic properties of map:
- map(id) = id, with id(x) = x
- map(f ∘ g) = map(f) ∘ map(g)
- map(f)[x] = [f(x)]
- map(f)(xs ++ ys) = map(f)(xs) ++ map(f)(ys)

Applications: simplification and automatic program rewriting, (algebraic) proofs of equivalence, automatic parallelization of computations.
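These laws can be checked on small examples with Python's built-in map (an illustrative sketch; the names f, g, xs, ys are ours):

```python
# Spot-checking the algebraic properties of map on small lists.
def compose(f, g):
    return lambda x: f(g(x))

f = lambda x: x + 1
g = lambda x: 2 * x
xs, ys = [1, 2, 3], [4, 5]

# map(id) = id
assert list(map(lambda x: x, xs)) == xs
# map(f . g) = map(f) . map(g)
assert list(map(compose(f, g), xs)) == list(map(f, map(g, xs)))
# map(f)[x] = [f(x)]
assert list(map(f, [7])) == [f(7)]
# map distributes over concatenation (++): this last law is the
# basis for parallelization, since each chunk can be mapped separately.
assert list(map(f, xs + ys)) == list(map(f, xs)) + list(map(f, ys))
```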

• Sort/Group/Shuffle function

    grp : [(A × B)] → [(A × [B])]

    grp[..., (w, a₀), ..., (w, aₙ), ...] = [..., (w, [a₀, ..., aₙ]), ...]

    grp[(a, 2), (z, 2), (ab, 3), (a, 4)] = [(a, [2, 4]), (z, [2]), (ab, [3])]

Sort/Group/Shuffle prototype in MapReduce. grp : [(K2, V2)] → [(K2, [V2])]. Recalls the GROUP BY/ORDER BY instructions in SQL. grp is called transparently between the Map and Reduce phases.
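A minimal Python sketch of grp over an in-memory list of pairs (in a real MapReduce job the shuffle runs across the cluster):

```python
from collections import defaultdict

def grp(pairs):
    """grp : [(A, B)] -> [(A, [B])], grouping values by key.
    Models the shuffle phase between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())

# The example from the slide above:
assert grp([("a", 2), ("z", 2), ("ab", 3), ("a", 4)]) == \
    [("a", [2, 4]), ("z", [2]), ("ab", [3])]
```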

• Reduce function

    reduce : (A × A → B) → ([A] → B)

    reduce(⊕)[x₀, ..., xₙ] = x₀ ⊕ x₁ ⊕ ... ⊕ xₙ₋₁ ⊕ xₙ

    reduce(+)[2, 1, 3] = 2 + 1 + 3 = 6

Reduce prototype in MapReduce. Reduce : (K2, [V2]) → [(K3, V3)]. Reduce is a particular prototype for reduce(⊕): we apply ⊕ to the collection of values associated with each key.

Pseudocode example:

    function reduce(term, counts)
      output (term, sum(counts))
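Putting the three phases together, the classic word-count job can be simulated in plain Python (a local sketch; in a real MapReduce job each phase runs distributed, and the helper names here are ours):

```python
from collections import defaultdict

def map_phase(uri, document):
    # Emit (term, count) per distinct term, as in the pseudocode above.
    counts = defaultdict(int)
    for term in document.split():
        counts[term] += 1
    return list(counts.items())

def grp(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups.items()

def reduce_phase(term, counts):
    return (term, sum(counts))

docs = {"doc1": "to be or not to be", "doc2": "to do"}
mapped = [kv for uri, d in docs.items() for kv in map_phase(uri, d)]
result = dict(reduce_phase(t, cs) for t, cs in grp(mapped))
assert result == {"to": 3, "be": 2, "or": 1, "not": 1, "do": 1}
```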

• Example : Matrix-Vector Multiplication

Let A be an m × n matrix and v be a vector of length n:

    A = | a₁₁ a₁₂ ... a₁ₙ |        v = (v₁, v₂, ..., vₙ)ᵀ
        | a₂₁ a₂₂ ... a₂ₙ |
        |  ⋮   ⋮   ⋱   ⋮  |
        | aₘ₁ aₘ₂ ... aₘₙ |

The product Av is a vector of length m whose i-th component is

    (Av)ᵢ = Σⱼ₌₁ⁿ aᵢⱼ vⱼ,   i = 1, ..., m

• Example : Matrix-Vector Multiplication

MapReduce pseudocode for computing the matrix-vector product:

    map(key, value):
        for (i, j, a_ij) in value:
            emit(i, a_ij * v[j])

    reduce(key, values):
        result = 0
        for value in values:
            result += value
        emit(key, result)

Communication costs: Map tasks O(mn + n), Reduce tasks O(mn).
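The pseudocode above can be simulated locally in Python, storing A as (i, j, a_ij) triples and assuming v is broadcast to every mapper (a sketch, not a distributed implementation):

```python
from collections import defaultdict

# A 2x2 example matrix as (i, j, a_ij) triples; v known to all mappers.
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [10.0, 20.0]

def map_phase(triples):
    # Each mapper emits (row index, partial product a_ij * v_j).
    for i, j, a_ij in triples:
        yield (i, a_ij * v[j])

def reduce_phase(pairs):
    # The reducer sums the partial products per row index.
    sums = defaultdict(float)
    for i, partial in pairs:
        sums[i] += partial
    return sums

result = reduce_phase(map_phase(A))
# Av = (1*10 + 2*20, 3*10 + 4*20) = (50, 110)
assert result[0] == 50.0 and result[1] == 110.0
```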

• Example : Logistic Regression

We choose a hypothesis of the form h_θ(x) = 1/(1 + exp(−θᵀx)) and fit it using Newton-Raphson:

    θ := θ − H⁻¹ ∇θ ℓ(θ),   where ℓ(θ) is the log-likelihood

The gradient ∇θ ℓ(θ) is computed in parallel by the mappers, each summing over its subgroup of examples:

    ∇θ ℓ(θ)ⱼ = Σᵢ (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾)) xⱼ⁽ⁱ⁾

The Hessian matrix is likewise computed by the mappers with the summation

    H(j, k) := H(j, k) + h_θ(x⁽ⁱ⁾)(h_θ(x⁽ⁱ⁾) − 1) xⱼ⁽ⁱ⁾ xₖ⁽ⁱ⁾

The reducer sums up the gradient and Hessian values to perform the update.
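A minimal NumPy sketch of one Newton-Raphson step, with data partitions standing in for the mappers' subgroups (the function names and synthetic data are ours, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(theta, X_parts, y_parts):
    """One Newton-Raphson step. Each (X_k, y_k) plays the role of a
    mapper's data subgroup; the final sums play the role of the reducer."""
    d = theta.size
    grad = np.zeros(d)
    H = np.zeros((d, d))
    for X, y in zip(X_parts, y_parts):            # "mappers"
        h = sigmoid(X @ theta)
        grad += X.T @ (y - h)                     # sum (y_i - h(x_i)) x_i
        H -= X.T @ ((h * (1 - h))[:, None] * X)   # sum h(1-h)(h-1 sign) x x^T
    return theta - np.linalg.solve(H, grad)       # "reducer" update

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
theta = np.zeros(3)
for _ in range(5):
    theta = newton_step(theta, np.array_split(X, 4), np.array_split(y, 4))
```

After a few steps the fitted θ separates the synthetic classes; only the per-partition sums would travel over the network in a real job.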

• Example : Support Vector Machine

The linear SVM's goal is to optimize the primal problem

    argmin_{w,b}  ‖w‖² + C Σᵢ ξᵢᵖ   s.t.  y⁽ⁱ⁾(wᵀx⁽ⁱ⁾ + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0

where p is either 1 (hinge loss) or 2 (quadratic loss).

The primal problem for the quadratic loss can be solved by batch gradient descent (sv are the support vectors):

    ∇ = 2w + 2C Σ_{i∈sv} (wᵀxᵢ − yᵢ) xᵢ   and   H = I + C Σ_{i∈sv} xᵢ xᵢᵀ

The mappers calculate the partial gradients and the reducer sums up the partial results to update w.
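A minimal NumPy sketch of one batch-gradient step for the quadratic-loss primal (the bias b is dropped for simplicity; partitions stand in for mappers, and the synthetic data and names are ours):

```python
import numpy as np

def svm_step(w, X_parts, y_parts, C=1.0, lr=0.01):
    """One batch-gradient step, grad = 2w + 2C * sum_sv (w.x_i - y_i) x_i.
    Partial gradients are computed per partition, as the mappers would."""
    grad = 2 * w
    for X, y in zip(X_parts, y_parts):               # "mappers"
        margins = y * (X @ w)
        sv = margins < 1                             # support vectors
        grad += 2 * C * (X[sv].T @ (X[sv] @ w - y[sv]))
    return w - lr * grad                             # "reducer" update

# Two well-separated Gaussian clouds as synthetic data.
rng = np.random.default_rng(1)
X_pos = rng.normal(size=(200, 2)) + np.array([2.0, 2.0])
X_neg = rng.normal(size=(200, 2)) - np.array([2.0, 2.0])
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(200), -np.ones(200)])

w = np.zeros(2)
for _ in range(100):
    w = svm_step(w, np.array_split(X, 4), np.array_split(y, 4))
```

As with logistic regression, only the per-partition gradient sums would cross the network in a distributed run.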

• Apache Hadoop

Distributed Data Storage + MapReduce Processing

• Traditional network programming

Message-passing between nodes (e.g. MPI)

Very difficult to do at scale:
- How to split the problem across nodes? Must consider network & data locality.
- How to deal with failures? (inevitable at scale)
- Even worse: stragglers (a node that has not failed, but is slow).
- Ethernet networking is not fast.
- Have to write programs for each machine.

Rarely used in commodity datacenters.

• MapReduce limitations

Difficulty of programming directly in MapReduce.

Constrained model: a Map phase, then a Reduce phase.

For complex and iterative algorithms we need to chain several MapReduce phases.

Data transfer between these phases goes through disk storage.

Most optimization algorithms are iterative!

• Result & Verdict

While MapReduce is simple, it can require asymptotically more communication or I/O.

MapReduce algorithms research doesn't go to waste; it just gets sped up and easier to use.

Still useful to study as an algorithmic framework, but silly to use directly.

• Therefore, people built specialized systems...

• Why Apache Spark?

Spark's goal was to generalize MapReduce to support new apps within the same engine. Benefit for users: the same engine performs data extraction, model training, and interactive queries.

Two small additions are enough to express the previous models:
- Fast data sharing.
- General directed acyclic execution graphs (DAGs).

This allows for an approach which is more efficient for the engine,and much simpler for the end users.

• Disk vs Memory

L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 100 ns
Main memory reference: 100 ns
Disk seek: 10,000,000 ns
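The arithmetic behind the comparison, as a quick Python check:

```python
# Ratios from the latency numbers above: a single disk seek costs
# as much time as ~100,000 main-memory references.
latencies_ns = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "Main memory reference": 100,
    "Disk seek": 10_000_000,
}
ratio = latencies_ns["Disk seek"] / latencies_ns["Main memory reference"]
assert ratio == 100_000
```

This five-orders-of-magnitude gap is why keeping working sets in memory matters for iterative algorithms.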

• In-Memory Computing

Hadoop MapReduce : Share data on disk

Apache Spark : Speed up processing using the memory

• Historical

• Lightning-fast cluster computing

http://spark.apache.org

Originally developed by UC Berkeley (AMPLab). Open sourced in 2009 and implemented in Scala.

eBay: uses Spark for log processing (aggregation), analytics, ...

Kelkoo: uses Spark and Spark Streaming for product recommendation, BI, real-time filtering of malicious activity, and data mining.

Moody's Analytics: uses Spark for its credit risk calculation platform, (C)VaR calculations, ...

Amazon, Yahoo!, TripAdvisor, Hitachi, NASA, Ooyala, Shopify, Samsung, Socialmetrix, ...

http://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

HBase, Cassandra, MongoDB ...

• Spark is Fast

In-Memory Computing

Suitable for iterative algorithms.

Holds the record for sorting 100 TB of data on disk.

• Spark is Simple

Easy development: simple & intuitive APIs

APIs in Java, Scala, Python (+ SQL, Clojure, R)

• Spark is Interactive

Interactive mode (Spark Shell, PySpark), standalone mode

• Spark UI

Application Monitoring

• Spark is Streaming

Real-time processing (micro-batching)

Spark Streaming is easier to use than Apache Storm

• Spark is (very) Active

Most active open source community in big data: 500+ contributors.

• ... is well Documented

One can find many examples, presentations, videos, MOOCs, events, meetups, ...

https://sparkhub.databricks.com

• ... with a large open-source community

cf. http://spark-packages.org

cf. Github

• Spark Ecosystem

• SparkContext

The first thing a Spark program should do is create a SparkContext object, which tells Spark how to access a cluster.

In the shell (Scala or Python), a variable sc is created automatically.

Other programs must use a constructor to instantiate a new SparkContext.

SparkContext can be used to create other variables.

• Master URLs

The master parameter determines which cluster to use.

| Master | Description |
| --- | --- |
| local | Run Spark locally with one worker thread (i.e. no parallelism at all) |
| local[K] | Run Spark locally with K worker threads (ideally set to the number of cores on your machine) |
| spark://HOST:PORT | Connect to a Spark standalone cluster; PORT depends on config (7077 by default) |
| mesos://HOST:PORT | Connect to a Mesos cluster; PORT depends on config (5050 by default) |
| yarn | Connect to a YARN cluster in client or cluster mode |

• Shell Python (PySpark) locally with 4 cores:

    $ pyspark --master local[4]

Shell Python (PySpark) on a standalone cluster, e.g. cluster1:

    $ pyspark --master spark://cluster1:7077

Submit a job (Python script example.py) locally with 4 cores:

    $ spark-submit --master local[4] example.py

Submit a job to a standalone cluster, e.g. cluster1:

    $ spark-submit --master spark://cluster1:7077 example.py

• RDD

Resilient Distributed Datasets (RDD): collections of objects distributed across a cluster.
- User-controlled partitioning.
- Stored in memory or on disk.
- Built via parallel transformations (map, filter, ...).
- Aut