What is Distributed Computing, Why we use Apache Spark


  • Big Data: newborn technologies evolving fast. Why Apache Spark outruns Apache Hadoop

    Andy Petrella, Nextlab
    Xavier Tordoir, SilicoCloud

  • Who are we?

    Andy
    @Noootsab, I am
    @NextLab_be, owner
    @SparkNotebook, creator
    @Wajug, co-driver
    @Devoxx4Kids, organizer
    Maths & CS
    Data lover: geo, open, massive
    Fool

    Xavier
    @xtordoir
    SilicoCloud
    -> Physics
    -> Data analysis -> genomics
    -> scalable systems
    -> ...

  • So what...

    Part I
      What: distributed resources, data managers
      Why: fastest, smartest, biggest
      How: MapReduce, limitations, extensions

    Part II: Spark
      Model: caching and lineage, master and workers, core example
      Beyond processing: Streaming, SQL, GraphX, MLlib, example
      Use cases: parallel batch processing of timeseries, ADAM

  • Part I: The Distributed Age

  • What is a distributed environment?

    Computations need three kinds of resources: CPU, memory, and data storage.

    However, it is hard to extend each of them at will on a single machine.

  • What is a distributed environment?

    Lacking any one of these results in higher response times or reduced accuracy. Unfortunately, it doesn't matter how parallelized the algorithm is or how optimized the computations are.

    If the solution can't be inside, it must be outside.

  • What is a distributed environment?

  • Distributed File System

    You have 100 nodes in your cluster, but only 1 dataset. Will you replicate it on all nodes?

    Extended case: what if your dataset is 1 zettabyte (10^9 TB)?

    The lonesome solution: split the file across the nodes, asking the algorithm to access local data subsets.

  • HDFS, towards Tachyon

    Hadoop Distributed File System: implements GoogleFS. Stores and reads files split and replicated across nodes. A 1 ZB file = 8 x 10^12 blocks of 128 MB.

    Disk IOPS are expensive and require more CPU clocks than DRAM access. Hence... Tachyon: a memory-centric distributed file system.
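    A quick sanity check on that block count: 1 ZB = 10^21 bytes and 128 MB = 1.28 x 10^8 bytes, so 10^21 / (1.28 x 10^8) ≈ 7.8 x 10^12, i.e. roughly 8 x 10^12 blocks.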

  • Management

    Nodes will fail, but jobs cannot: we need resilience.

    Resources are generally fewer than the algorithms require: we need scheduling.

    Requirements fluctuate: we need elasticity.

  • Mesos and Marathon

    Mesos: a highly available cluster manager. Nodes can be attached or removed on the fly. Nodes offer resources; applications accept them. On a node crash, the application restarts the assigned tasks.

    Marathon: a meta-application on Mesos. On an application crash, it is automatically restarted on a different node.

  • Why: for everybody, and now?

    Fastest:
    1. Time to result
    2. Near-real-time processing

  • Why: for everybody, and now?

    The runtime is smaller and the dev lifecycle is shorter: no synchronization hell.

    It can even be truly interactive, with console or notebook tools.

  • Why: for everybody, and now?

    No bottlenecks: newly arriving data is readily available for processing.

    This opens the door to online models!

  • Why: for everybody, and now?

    Smartest: train more and more models; ensembling lots of them is no longer a problem.

    More complex modelling can be tackled if required.

  • Why: for everybody, and now?

    Reaching a higher level of accuracy is tricky and might require lots and lots of models.

    Running a model takes quite some time, especially if the data has to be read every single time.

    Example: the Netflix Prize winner (AT&T Labs) ensembled 500 models to gain 10% accuracy. Although in 2009 it wasn't possible to use it in production, today this could change.

  • Why: for everybody, and now?

    Biggest: no need to sample big datasets.

    That's it!

  • How!?

    Google's papers stimulated the open software community, hence competitive tools now exist.

    In the area of computation in distributed environments, there are two disruptive papers: Google's MapReduce and Berkeley's Spark.

  • How!?

    MapReduce (Google white paper, 2004):

    A programming model for distributed, data-intensive computations.

    It helps deal with parallelization, fault tolerance, data distribution, and load balancing.

  • How!?

    Functions:
    Map: transform data into key-value pairs.
    Reduce: aggregate key-value pairs per key (e.g. sum, max, count).

    Mappers and reducers are sent to the data's location (the nodes). A minimal sketch of the model follows.
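    To make the two functions concrete, here is a minimal word-count sketch of the model on plain Scala collections (a hypothetical single-machine illustration, not a cluster implementation):

```scala
// Hypothetical input: a tiny corpus standing in for a distributed file.
val lines = Seq("to be or not", "to be")

// Map: transform data into (key, value) pairs.
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Reduce: aggregate the values per key with a binary associative operator (sum).
val counts = pairs.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
// counts: Map(to -> 2, be -> 2, or -> 1, not -> 1)
```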

  • How!?

    Map: transform each element.

    Reduce: apply a binary associative operator on all elements.

    Image from RxJava: https://github.com/ReactiveX/RxJava/wiki/Transforming-Observables

  • How!?

    The Hadoop implementation has some limitations:

    Mappers and reducers ship functions to the data, yet Java is not a functional language.

    Composability is difficult, and more IO/network operations are required.

    Iterative algorithms (e.g. stochastic gradient) have to read the data at each step (even though the data hasn't changed, only the parameters).

  • How!?

    MapReduce on steroids:

    I) Functional paradigm: the process is built lazily from simple concepts; Map and Reduce are two of them.

    II) Cache data in memory. No more IO. A sketch of both ideas follows.
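    A minimal sketch of both ideas in Spark, assuming a local installation; the app name and data are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyAndCached {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))

    // I) Functional, lazily built process: map only describes the computation.
    val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

    // II) Cache in memory: iterative reuse no longer re-reads the source.
    squares.cache()
    val sum = squares.reduce(_ + _)                    // first action materializes and caches
    val max = squares.reduce((a, b) => math.max(a, b)) // served from memory, no more IO
    println(s"sum=$sum, max=$max")

    sc.stop()
  }
}
```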

  • So what...

    Part I
      What: distributed resources, data managers
      Why: fastest, smartest, biggest
      How: MapReduce, limitations, extensions

    Part II: Spark
      Model: caching and lineage, master and workers, core example
      Beyond processing: Streaming, SQL, GraphX, MLlib, example (notebook)
      Use cases: parallel batch processing of timeseries, ADAM

  • Part II: Spark to the Rescue

  • RDDs

    Think of an RDD[T] as an immutable, distributed collection of objects of type T.

    Resilient => can be reconstructed in case of failure
    Distributed => transformations are parallelizable operations
    Dataset => data loaded and partitioned across cluster nodes (executors)
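    A minimal core example, assuming a SparkContext `sc` is in scope (as in a shell or notebook); the data and partition count are illustrative:

```scala
import org.apache.spark.rdd.RDD

// An immutable, distributed collection of Ints, split into 4 partitions.
val rdd: RDD[Int] = sc.parallelize(1 to 17, numSlices = 4)

val doubled = rdd.map(_ * 2)   // transformation: lazy, parallelizable
println(doubled.reduce(_ + _)) // action: triggers the distributed computation
```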

  • RDD[T]

    Data distribution hierarchy: RDD[T] -> partitions -> elements, with the partitions spread across executors.

    (Diagram: 17 elements x1..x17 grouped into 10 partitions, distributed over Executor 1 to Executor 4.)

  • Execution

    Execution is split into fundamental units: tasks.

    Tasks running in parallel are grouped into stages.

  • Execution

    (Diagram: Stage0, Stage1, and Stage2 each run Task0 on Core1, Task1 on Core2, and Task2 on Core3 in parallel; every task reads, processes, and writes. See the sketch below.)
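    To see where stage boundaries come from, here is a minimal sketch, again assuming `sc` is in scope; the input path and partition count are hypothetical:

```scala
// Stage 0: each of the 3 partitions is one task (read, process, write shuffle files).
val pairs = sc.textFile("hdfs:///data/text", minPartitions = 3)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// The shuffle required by reduceByKey starts Stage 1,
// whose tasks read the shuffled data and aggregate it.
val counts = pairs.reduceByKey(_ + _)
counts.count() // the action triggers both stages
```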

  • Master and Workers

  • Spark Streaming

    When you have big fat streams behaving as one single collection.

    DStreams: Discretized Streams (= a sequence of RDDs).

    (Diagram: a DStream[T] sliced along time t into successive RDD[T]s.)
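    A minimal sketch, assuming `sc` is in scope and a text source listening on localhost:9999 (a hypothetical setup):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))      // 1-second batches: one RDD per second
val lines = ssc.socketTextStream("localhost", 9999) // DStream[String]

// Per-batch word count: each operation applies to every underlying RDD.
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```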

  • Spark SQL

    Mapping: RDD -> table, element -> row, field -> column.
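    A minimal sketch using the SQLContext API of that Spark generation, assuming `sc` is in scope; the Person class and rows are illustrative:

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int) // element type: its fields become columns

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// RDD -> table: each Person element becomes a row.
val people = sc.parallelize(Seq(Person("Ada", 36), Person("Alan", 41)))
people.toDF().registerTempTable("people")

sqlContext.sql("SELECT name FROM people WHERE age > 40").show()
```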

  • MLlib: Distributed ML

    Classification: linear SVM, logistic regression, classification trees, naive Bayes models

    Regression: SVM, regression trees, linear regression (regularized)

    Clustering & dimensionality reduction: singular value decomposition, PCA, k-means clustering

    The library to teach them all

  • GraphX

    Connecting the dots.

    Graph processing at scale:
    > Take edges
    > Link nodes
    > Combine/send messages
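    A minimal sketch of those three steps with GraphX's aggregateMessages, assuming `sc` is in scope; the toy graph is hypothetical:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Link nodes: vertices with ids and attributes.
val nodes = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
// Take edges: directed links between vertex ids.
val links = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0)))

val graph = Graph(nodes, links)

// Combine/send messages: count incoming links per node.
val inDegrees = graph.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)
inDegrees.collect().foreach(println)
```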

  • Use cases examples

    - Parallel batch processing of time series
    - Bayesian network in financial markets
    - IoT platform (Lambda architecture)
    - OpenStreetMap cities topologies classification
    - Markov chain in Land Use/Land Cover prediction
    - Genomics: ADAM

  • Genomics

    Biological systems are very complex. One human sequence is 60 GB.

  • ADAM

    Credits: AMPLab (UC Berkeley)

  • Stratification using 1000Genomes

    http://www.1000genomes.org/

    ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg

  • Machine Learning model

    Clustering: KMeans

    ref: http://en.wikipedia.org/wiki/K-means_clustering

  • Machine Learning model: MLlib KMeans

    MLlib provides machine learning algorithms and data structures (e.g. Vector).
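    A minimal sketch of MLlib's KMeans, assuming `sc` is in scope; the two-dimensional vectors stand in for the real genotype features:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical numeric features, one Vector per sample.
val features = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0), // one population
  Vectors.dense(9.0, 8.8), Vectors.dense(9.2, 9.1)  // another population
))

val model = KMeans.train(features, k = 2, maxIterations = 20)
println(model.predict(Vectors.dense(0.1, 0.1))) // cluster id for a new sample
```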

  • Mashup: prediction

    Sample [NA20332] is in cluster #0 for population Some(ASW)
    Sample [NA20334] is in cluster #2 for population Some(ASW)
    Sample [HG00120] is in cluster #2 for population Some(GBR)
    Sample [NA18560] is in cluster #1 for population Some(CHB)

  • Mashup

           #0   #1   #2
    GBR     0    0   89
    ASW    54    0    7
    CHB     0   97    0

  • Cluster

    40 x m3.xlarge: 160 cores + 600 GB of memory

  • Eggo project (public genomics data in ADAM format on S3)

    1000genomes in ADAM format on S3. Open source GA4GH interop services implementation. Machine learning on 1000genomes.

    Genomic data and distributed computing.

  • The end (of the slides)

    Thanks for your attention!

    Xavier Tordoir: xavier@silicocloud.eu

    Andy Petrella: andy.petrella@nextlab.be