Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

Post on 17-Feb-2018

229 views 0 download

Transcript of Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    1/67

    Flink vs. Spark

    Slim Baltagi @SlimBaltagiDirector of Big Data Engineering, Fellow

    Capital One

    https://twitter.com/SlimBaltagihttps://twitter.com/SlimBaltagi
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    2/67

    2

    Agenda

    I. otivation for t!is talkII. Apac!e Flink vs. Apac!e Spark"

    III. #ow Flink is $sed at Capital One"

    I%. &!at are some ke' takeawa's"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    3/67

    3

    I. otivation for t!is talk

    (. arketing fl$ff

    ). Conf$sing statements

    *. B$rning +$estions incorrect or

    o$tdated answers

    -. #elping ot!ers eval$ating Flink vs.Spark

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    4/67

    4

    (. arketing fl$ff

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    5/67

    5

    (. arketing fl$ff

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    6/67

    6

    ). Conf$sing statements

    Sparkis alread' an e/cellent piece of software and is

    advancing ver' +$ickl'.0o vendor 1 no new pro2ect 1is likel' to catc! $p. C!asing Spark wo$ld 3e a waste of

    time, and wo$ld dela' availa3ilit' of real4time anal'tic

    and processing services for no good reason.5 So$rce6

    ap7ed$ce and Spark, ike Olson. C!ief Strateg'

    Officer, Clo$dera. Decem3er, *8t!)8(*!ttp699vision.clo$dera.com9mapred$ce4spark9

    :oal6 one engine forall data so$rces, workloadsand

    environments. So$rce6 Slide (; of a!aria. C?O, Data3ricks.Fe3r$ar' )8t!, )8(;. !ttp699www.slides!are.net9data3ricks9new4directions4for4apac!e4spark4in4)8(;

    http://vision.cloudera.com/mapreduce-spark/http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015http://vision.cloudera.com/mapreduce-spark/http://vision.cloudera.com/mapreduce-spark/
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    7/67

    7

    *. B$rning +$estions incorrect or

    o$tdated answersro2ects t!at depend on smart optimiersrarel' work

    well in real life.5 C$rt onas!, onas! 7esearc!.an$ar' (, )8(;!ttp699www.comp$terworld.com9article9)(893ig4data4digest4!ow4man'4!adoop

    s4do4we4reall'4need.!tml

    Flink is 3asicall' a Spark alternative o$t of :erman',w!ic! I=ve 3een dismissing as $nneeded5. C$rt

    onas!, onas! 7esearc!, arc! ;, )8(;. !ttp699www.d3ms).com9)8(;98*98;9cask4and4cdap9

    Of co$rse, t!is is all a 3$llis! arg$ment for Spark Gor

    Flink, if I=m wrong to dismiss its c!ances as a SparkcompetitorH.5 C$rt onas!, onas! 7esearc!,Septem3er ), )8(;. !ttp699www.d3ms).com9)8(;989)9t!e4potential4significance4of4clo$dera4k$d$9

    http://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.htmlhttp://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.htmlhttp://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.htmlhttp://www.dbms2.com/2015/03/05/cask-and-cdap/http://www.dbms2.com/2015/03/05/cask-and-cdap/http://www.dbms2.com/2015/09/28/the-potential-significance-of-cloudera-kudu/http://www.dbms2.com/2015/09/28/the-potential-significance-of-cloudera-kudu/http://www.dbms2.com/2015/09/28/the-potential-significance-of-cloudera-kudu/http://www.dbms2.com/2015/09/28/the-potential-significance-of-cloudera-kudu/http://www.dbms2.com/2015/09/28/the-potential-significance-of-cloudera-kudu/http://www.dbms2.com/2015/03/05/cask-and-cdap/http://www.dbms2.com/2015/03/05/cask-and-cdap/http://www.dbms2.com/2015/03/05/cask-and-cdap/http://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.htmlhttp://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.htmlhttp://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.html
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    8/67

    8

    *. B$rning +$estions incorrect or

    o$tdated answers

    ?!e 3enefit of SparkJs micro43atc! model is t!at 'o$get f$ll fa$lt4tolerance and e/actl'4once processing

    for t!e entire comp$tation, meaning it can recover all

    state and res$lts even if a node cras!es. Flink and

    Storm donJt provide t!isK5 atei >a!aria. C?O,

    Data3ricks. a' )8(;!ttp699www.kdn$ggets.com9)8(;98;9interview4matei4a!aria4creator4apac!e4

    spark.!tml

    I $nderstand Spark Streaming $ses micro43atc!ing.

    Does t!is increase latenc'" &!ile Spark does $se a

    micro43atc! e/ec$tion model, t!is does not !avem$c! impact on applicationsK5 !ttp699spark.apac!e.org9fa+.!tml

    http://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.htmlhttp://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.htmlhttp://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.htmlhttp://spark.apache.org/faq.htmlhttp://spark.apache.org/faq.htmlhttp://spark.apache.org/faq.htmlhttp://spark.apache.org/faq.htmlhttp://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.htmlhttp://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.htmlhttp://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.html
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    9/67

    9

    -. #elp ot!ers eval$ating Flink vs. SparkBesides t!e marketing fl$ff, t!e conf$sing statements,

    t!e incorrect or o$tdated answers to 3$rning

    +$estions, t!e little information on t!e s$32ect of Flinkvs. Spark is availa3le piecemealL

    &!ile eval$ating different stream processing tools at

    Capital One, we 3$ilt a framework listing categories

    and over (88 criteria to assess t!ese streamprocessing tools.

    In t!e ne/t section, I=ll 3e s!aring t!is framework and

    $se it to compare Spark and Flink on a few ke'

    criteria.

    &e !ope t!is will 3e 3eneficial to 'o$ as well w!en

    selecting Flink and9or Spark for stream processing.

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    10/67

    10

    Agenda

    I. otivation for t!is talkII. Apac!e Flink vs. Apac!e Spark"

    III. #ow Flink is $sed at Capital One"

    I%. &!at are some ke' takeawa's"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    11/67

    11

    II. Apac!e Flink vs. Apac!e Spark"

    (. &!at is Apac!e Flink"

    ). &!at is Apac!e Spark"

    *. Framework to eval$ate Flink and Spark

    -. Flink vs. Spark on a few ke' criteria

    ;. F$t$re work

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    12/67

    12

    (.&!at is Apac!e Flink"

    S+$irrel6 Animal. In !armon' wit! ot!er animals in t!e

    #adoop ecos'stem G>ooH6 elep!ant, pig, p't!on,

    camel,...S+$irrel6 reflects t!e meaning of t!e word

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    13/67

    13

    (.&!at is Apac!e Flink"

    Apac!e Flink is an open so$rce platform for

    distri3$ted stream and 3atc! data processing.5!ttps699flink.apac!e.org9

    See also t!e definition in &ikipedia6!ttps

    699en.wikipedia.org9wiki9Apac!eMFlink

    https://flink.apache.org/https://en.wikipedia.org/wiki/Apache_Flinkhttps://en.wikipedia.org/wiki/Apache_Flinkhttps://en.wikipedia.org/wiki/Apache_Flinkhttps://en.wikipedia.org/wiki/Apache_Flinkhttps://en.wikipedia.org/wiki/Apache_Flinkhttps://flink.apache.org/https://flink.apache.org/
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    14/67

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    15/67

    15

    ). &!at is Apac!e Spark"

    Apac!e SparkR is a fast and general engine for

    large4scale data processing.5 !ttp699spark.apac!e.org9See also definition in &ikipedia6 !ttps

    699en.wikipedia.org9wiki9Apac!eMSpark

    Nogo was picked to reflect Nig!tning4fast cl$stercomp$ting

    http://spark.apache.org/http://spark.apache.org/https://en.wikipedia.org/wiki/Apache_Sparkhttps://en.wikipedia.org/wiki/Apache_Sparkhttps://en.wikipedia.org/wiki/Apache_Sparkhttps://en.wikipedia.org/wiki/Apache_Sparkhttps://en.wikipedia.org/wiki/Apache_Sparkhttp://spark.apache.org/http://spark.apache.org/
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    16/67

    16

    ). &!at is Apac!e Spark"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    17/67

    17

    *. Framework to eval$ate Flink and Spark

    (. Backgro$nd). Fit4for4p$rpose Categories

    *. Organiational4fit Categories

    -. iscellaneo$s9Ot!er Categories

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    18/67

    18

    (. Backgro$nd

    (.( Definition

    (.) Origin

    (.* at$rit'

    ).- %ersion

    (.; :overnance model

    (. Nicense model

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    19/67

    19

    ). Fit4for4p$rpose Categories

    ).( Sec$rit'

    ).) rovisioning onitoring Capa3ilities).* Natenc' rocessing Arc!itect$re

    ).- State anagement

    ).; rocessing Deliver' Ass$rance

    ). Data3ase Integrations, 0ative vs. ?!ird

    part' connector

    ). #ig! Availa3ilit' 7esilienc'

    ). Ease of Development

    ). Scala3ilit'

    ).(8 ni+$e Capa3ilities9Qe' Differentiators

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    20/67

    20

    ). Fit4for4p$rpose Categories

    ).( Sec$rit'

    ).(.( A$t!entication, A$t!oriation

    ).(.) Data at rest encr'ption Gdata persisted in t!e

    frameworkH

    ).(.* Data in motion encr'ption Gprod$cer

    4Tframework 4T cons$merH

    ).(.- Data in motion encr'ption Ginter4node

    comm$nicationH

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    21/67

    21

    ). Fit4for4p$rpose Categories

    ).) rovisioning onitoring Capa3ilities

    2.2.1 Robustness of Administration

    2.2.2 Ease of maintenance Does tec!nolo"#

    provide con$"uration% deplo#ment% scalin"%

    monitorin"% performance tunin" and auditin"

    capabilities&2.2.' onitorin" Alertin"

    2.2.* +o""in"

    2.2., Audit

    2.2.- ransparent p"rade 0ersion up"rade

    it! minimum dontime

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    22/67

    22

    ). Fit4for4p$rpose Categories).* Natenc' rocessing Arc!itect$re

    2.'.1 Supports tuple at a time% microbatc!%

    transactional updates and batc! processin"

    2.'.2 3omputational model

    2.'.' Abilit# to reprocess !istorical data from

    source

    2.'.* Abilit# to reprocess !istorical data from

    native en"ine

    2.'., 3all e4ternal source (A56/database calls)

    2.'.- 6nte"ration it! 7atc! (static) source2.'.8 Data #pes (ima"es% sound etc.)

    2.'.9 Supports comple4 event processin" and

    pattern detection vs. continuous operator

    model (lo latenc#% :o control)

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    23/67

    23

    ). Fit4for4p$rpose Categories

    ).* Natenc' rocessing Arc!itect$re

    2.'.;

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    24/67

    24

    ). Fit4for4p$rpose Categories

    ).- State anagement

    ).-.( Statef$l vs. Stateless

    ).-.) Is statef$l data ersisted locall' vs.

    e/ternal data3ase vs. Ep!emeral).-.* 0ative rolling, t$m3ling and !opping

    window s$pport

    ).-.- 0ative s$pport for integrated data store

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    25/67

    25

    ). Fit4for4p$rpose Categories

    ).; rocessing Deliver' Ass$rance

    ).;.( :$arantee GAt least onceH).;.) :$arantee GAt most onceH

    ).;.* :$arantee GE/actl' onceH

    ).;.- :lo3al Event order g$aranteed).;.; >uarantee predictable and

    repeatable outcomes( deterministic or

    not)

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    26/67

    26

    ). Fit4for4p$rpose Categories

    ). Data3ase Integrations, 0ative vs. ?!ird

    part' connector

    )..( 0oSPN data3ase integration

    )..) File Format GAvro, ar+$et and ot!er

    format s$pportH

    )..* 7DBS integration

    )..- In4memor' data3ase integration9 Cac!ing

    integration

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    27/67

    27

    ). Fit4for4p$rpose Categories

    ). #ig! Availa3ilit' 7esilienc')..( Can t!e s'stem avoid slowdown d$e to straggler node

    )..) Fa$lt4?olerance Gdoes t!e tool !andlenode9operator9messaging fail$res wit!o$t catastrop!icall' failingH

    )..* State recover' from in4memor'

    )..- State recover' from relia3le storage

    )..; Over!ead of fa$lt tolerance mec!anism GDoes failure!andlin" introduce additional latenc# or ne"ativel#

    impact t!rou"!put&)

    ).. $lti4site s$pport Gm$lti4regionH

    ).. Flow control6 3ackpress$re tolerance from slow operators or

    cons$mers

    ).. Fast parallel recover' vs. replication or serial recover' on

    one node at a time

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    28/67

    28

    ). Fit4for4p$rpose Categories

    ). Ease of Development)..( SPN Interface

    )..) 7eal4?ime de3$gging option)..* B$ilt4in stream oriented a3straction Gstreams, windows, operators , iterators

    4 e/pressive AIs t!at ena3le programmers to +$ickl' develop streaming data

    applicationsH

    )..- Separation of application logic from fa$lt tolerance

    )..; ?esting tools and framework

    2.9.- 3!an"e mana"ement multiple model deplo#ment ( E.".

    separate cluster or can one create multiple independent redundant

    streams internall#)

    2.9.8 D#namic model sappin" (Support d#namic updatin" of

    operators/topolo"#/DA> it!out restart or service interruption)

    2.9.9 Re?uired @noled"e of s#stem internals to develop anapplication

    ).. ?ime to market for applications

    2.9.1= Supports plu"in of e4ternal libraries

    2.9.11 A56

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    29/67

    29

    ). Fit4for4p$rpose Categories

    ). Scala3ilit'

    )..( S$pports m$lti4t!read across m$ltiple

    processors9cores

    )..) Distri3$ted across m$ltiple mac!ines9servers

    )..* artition Algorit!m)..- D'namic elasticit' 4 Scaling wit! minim$m impact9

    performance penalt'

    )..; #oriontal scaling wit! linear

    performance9t!ro$g!p$t

    ).. %ertical scaling G:H

    ).. Scaling wit!o$t downtime

    O f C

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    30/67

    30

    *. Organiational4fit Categories

    *.( at$rit' Comm$nit' S$pport

    *.) S$pport Nang$ages for Development

    *.* Clo$d orta3ilit'

    *.- Compati3ilit' wit! 0ative #adoop

    Arc!itect$re

    *.; Adoption of Comm$nit' vs. Enterprise

    Edition

    *. Integration wit! essage Brokers

    * O i ti l fit C t i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    31/67

    31

    *. Organiational4fit Categories

    *.( at$rit' Comm$nit' S$pport

    *.(.( Open So$rce S$pport *.(.) at$rit' G'earsH

    *.(.* Sta3le

    *.(.- Centralied doc$mentation wit! versioning

    s$pport

    *.(.; Documentation of pro"rammin" A56 it!

    "ood code e4amples

    '.1.- 3entralied visible roadmap

    '.1.8 3ommunit# acceptance vs. 0endor

    driven

    *.(. Contri3$tors

    * O i ti l fit C t i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    32/67

    32

    *. Organiational4fit Categories

    *.) S$pport Nang$ages for Development

    *.).( Nang$age tec!nolog' was 3$ilt on

    *.).) Nang$age s$pported to access

    tec!nolog'

    * O i ti l fit C t i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    33/67

    33

    *. Organiational4fit Categories

    *.* Clo$d orta3ilit'

    *.*.( Ease of migration 3etween clo$d vendors

    *.*.) Ease of migration 3etween on premise to clo$d

    *.*.* Ease of migration from on premise to complete

    clo$d services

    *.*.- Clo$d compati3ilit' GA&S, :oogle, A$reH

    * O i ti l fit C t i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    34/67

    34

    *. Organiational4fit Categories

    *.- Compati3ilit' wit! 0ative #adoop

    Arc!itect$re

    *.-.( Implement on top of #adoop A70 vs.

    Standalone

    *.-.) esos

    *.-.* Coordination wit! Apac!e >ookeeper

    * O i ti l fit C t i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    35/67

    35

    *. Organiational4fit Categories

    *.; Adoption of Comm$nit' vs. Enterprise

    Edition

    *.;.( Open So$rce

    *.;.) Enterprise S$pport

    - i ll 9Ot! C t i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    36/67

    36

    -. iscellaneo$s9Ot!er Categories

    -.( Best S$ited for

    -.) Qe' $se case scenarios

    -.* Companies $sing tec!nolog'

    - Fli k S k f k it i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    37/67

    37

    -. Flink vs. Spark on a few ke' criteria

    (. Streaming Engine

    ). Iterative rocessing*. emor' anagement

    -. Optimiation

    ;. Config$ration. ?$ning

    . erformance

    - ( St i E i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    38/67

    38

    -.(. Streaming Engine

    an' time4critical applications need to process large

    streams of live dataand provide res$lts in real4time.For e/ample6Financial Fra$d detectionFinancial Stock monitoringAnomal' detection

    ?raffic management applicationsatient monitoringOnline recommenders

    Some claim t!at ;U of streaming $se cases can

    3e !andled wit! micro43atc!esL" 7eall'LLL

    - ( St i E i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    39/67

    39

    -.(. Streaming Engine

    Spark=s micro43atc!ing isn=t good eno$g!L?ed D$nning, C!ief Applications Arc!itect at ap7,

    talk at t!e Ba' Area Apac!e Flink eet$p on A$g$st), )8(;

    !ttp699www.meet$p.com9Ba'4Area4Apac!e4Flink4eet$p9events9))-(;)-

    9

    ?ed descri3ed several $se cases w!ere 3atc! and micro

    3atc! processing is not appropriate and descri3ed w!'.#e also descri3ed w!at a tr$e streaming sol$tion needs

    to provide for solving t!ese pro3lems.?!ese $se cases were taken fromreal ind$strial

    sit$ations, 3$t t!e descriptions drove down to tec!nical

    details as well.

    - ( Streaming Engine

    http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/224189524/http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/224189524/http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/224189524/http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/224189524/
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    40/67

    40

    -.(. Streaming Engine

    I wo$ld consider stream data anal'sis to 3e a ma2or

    $ni+$e selling propositionfor Flink. D$e to its

    pipelined arc!itect$re, Flink is a perfect matc! for 3ig

    data stream processing in t!e Apac!e stack. %olker

    arkl

    7ef.6 On Apac!e Flink. Interview wit! %olker arkl, $ne )-t!)8(;

    !ttp699www.od3ms.org93log9)8(;989on4apac!e4flink4interview4wit!4volker4markl9

    Apac!e Flink $ses streams for all workloads6

    streaming, SPN, micro43atc! and 3atc!. Batc! is 2$st

    treated as a finite set of streamed data. ?!is makes

    Flinkt!e most sop!isticated distri3$ted open so$rceBig Data processing engine Gnott!e most mat$reone

    'etLH.

    - ) Iterative rocessing

    http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    41/67

    41

    -.). Iterative rocessing

    &!' Iterations" an' ac!ine Nearning and :rap!

    processing algorit!ms need iterationsL For e/ample6

    ac!ine Nearning Algorit!msCl$stering GQ4eans, Canop', KH:radient descent GNogistic 7egression, atri/

    FactoriationH

    :rap! rocessing Algorit!msage47ank, Nine47ankat! algorit!ms on grap!s Gs!ortest pat!s,

    centralities, KH:rap! comm$nities 9 dense s$34componentsInference GBelief propagationH

    - ) Iterative rocessing

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    42/67

    42

    -.). Iterative rocessing

    FlinkJs AI offers two dedicated iteration operations6

    Iterateand Delta Iterate.

    Flink e/ec$tes programs wit! iterationsas c'clicdata flows6 a data flow program Gand all its operatorsH

    is sc!ed$led2$st once.In eac! iteration, t!e step f$nction cons$mest!e

    entire inp$t Gt!e res$lt of t!e previo$s iteration, or t!einitial data setH, and comp$test!e ne/t version of t!e

    partial sol$tion

    - ) Iterative rocessing

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    43/67

    43

    -.). Iterative rocessing

    Delta iterations r$n onl' onparts of t!e datat!at is

    c!anging and can significantl' speed $p man'

    mac!ine learning andgrap! algorit!ms 3eca$se t!ework in eac! iteration decreases as t!e n$m3er of

    iterations goes on.

    Doc$mentation on iterations wit! Apac!e Flink!ttp699ci.apac!e.org9pro2ects9flink9flink4docs4master9apis9iterations.!tml

    - ) Iterative rocessing

    http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.htmlhttp://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    44/67

    44

    -.). Iterative rocessing

    StepStep

    Step Step Step

    3lient

    for (int i = 0; i < m axIterations; i+ + ) {// Execute M apReduce job

    0on4native iterations in #adoop and Sparkare

    implemented as reg$lar for4loops o$tside t!e s'stem.

    - ) Iterative rocessing

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    45/67

    45

    -.). Iterative rocessing

    Alt!o$g!Spark cac!es data across iterations, it still

    needs to sc!ed$leand e/ec$tea new set of tasks foreac! iteration.

    Spinning Fast Iterative Data Flows 4 Ewen et al. )8() 6

    !ttp699vld3.org9pvld39vol;9p()Mstep!anewenMvld3)8().pdf ?!e

    Apac!e Flink model for incremental iterative dataflow

    processing. Academic paper.7ecap of t!e paper, $ne (, )8(;!ttp

    6993log.acol'er.org9)8(;989(9spinning4fast4iterative4dataflows9

    Doc$mentationon iterationswit! Apac!e Flink!ttp699ci.apac!e.org9pro2ects9flink9flink4docs4master9apis9iterations.!tml

    - * emor' anagement

    http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdfhttp://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/http://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.htmlhttp://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.htmlhttp://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.htmlhttp://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.htmlhttp://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.htmlhttp://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/http://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    46/67

    46

    -.*. emor' anagement

    P$estion6 Spark vs. Flink low memor' availa3le"

    P$estion answered on stackoverflow.com!ttp699stackoverflow.com9+$estions9*(*;)9spark4vs4flink4low4memor'4

    availa3le

    ?!e same +$estion still $nanswered on t!e Apac!e

    Spark ailing NistLL !ttp699apac!e4flink4$ser4mailing4list4arc!ive.)**8;8.n-.na33le.com9spark4

    vs4flink4low4memor'4availa3le4td)*-.!tml

    - * emor' anagement

    http://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-availablehttp://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-availablehttp://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-availablehttp://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.htmlhttp://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.htmlhttp://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.htmlhttp://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.htmlhttp://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.htmlhttp://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.htmlhttp://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-availablehttp://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-availablehttp://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-available
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    47/67

    47

    -.*. emor' anagement

    Feat$res6CVV st'le memor' management inside t!e %

    ser data stored in serialied 3'te arra's in %emor' is allocated, de4allocated, and $sed strictl'

    $sing an internal3$ffer pool implementation.

    Advantages6

    (. Flink will not t!row an OO e/ception on 'o$.). 7ed$ctionof :ar3age Collection G:CH

    *. %er' efficient disk spilling and network transfers

    -. 0o 0eed for r$ntime t$ning

    ;. ore relia3le and sta3le performance

    - * emor' anagement

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    48/67

    48

    -.*. emor' anagement

    pub!ic c!ass " # {pub!ic $trin%& ord;pub!ic intcount;empt

    #pa"e

    ool of emor' ages

    Sorting,

    !as!ing,

    cac!ing

    S!$ffles9

    3roadcasts

    ser code

    o32ects

    anagednman

    aged

    Flink contains its own memor' management stack.

    ?o do t!at, Flink contains its own t'pe e/traction

    and serialiationcomponents.

    % #eap

    0

    etwork

    B$ffers

    - * emor' anagement

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    49/67

    49

    -.*. emor' anagement

    eeking into Apac!e FlinkJs Engine 7oom 4 3' Fa3ian

    #Wske, arc! (*, )8(; !ttp

    699flink.apac!e.org9news9)8(;98*9(*9peeking4into4Apac!e4Flinks4Engine47oom.!tml$ggling wit! Bits and B'tes 4 3' Fa3ian #Wske, a'

    ((,)8(;!ttps699flink.apac!e.org9news9)8(;98;9((9$ggling4wit!4Bits4and4B'tes.!tml

    emor' anagement GBatc! AIH 3' Step!an Ewen4

    a' (, )8(;!ttps699cwiki.apac!e.org9confl$ence9pages9viewpage.action"pageIdX

    ;*-(;);

    Flink added anOff4#eap optionfor its memor'

    management component in Flink 8.(86!ttps699iss$es.apac!e.org92ira93rowse9FNI0Q4(*)8

    - * emor' anagement

    http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.htmlhttp://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.htmlhttps://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.htmlhttps://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525https://issues.apache.org/jira/browse/FLINK-1320https://issues.apache.org/jira/browse/FLINK-1320https://issues.apache.org/jira/browse/FLINK-1320https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.htmlhttps://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.htmlhttps://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.htmlhttp://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.htmlhttp://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.htmlhttp://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    50/67

    50

    -.*. emor' anagement

    Compared to Flink, Spark is still 3e!ind in c$stom

    memor' management3$t is catc!ing $p wit! itspro2ect ?$ngstenfor emor' anagement and Binar'

    rocessing:manage memor' e/plicitl' and eliminate

    t!e over!ead of % o32ect model and gar3age

    collection. April ), )8(-!ttps699data3ricks.com93log9)8(;98-9)9pro2ect4t$ngsten43ringing4spark4closer4to43are4metal.!tml

    It seems t!at Spark is adopting somet!ing similar to

    Flinkand t!e initial ?$ngsten anno$ncement read

    almost like Flink doc$mentationLL

    - - Optimiation

    https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    51/67

    51

    -.- Optimiation

    Apac!e Flink comes wit! an optimiert!at is

    independent of t!e act$al programming interface.

    It c!ooses a fitting e/ec$tion strateg' depending ont!e inp$ts and operations.

    E/ample6 t!e oin operator will c!oose 3etween

    partitioning and 3roadcasting t!e data, as well as

    3etween r$nning a sort4merge42oin or a !'3rid !as!2oin algorit!m.

    ?!is !elps 'o$ foc$s on 'o$r application logic

    rat!er t!an parallel e/ec$tion. P$ick introd$ction to t!e Optimier6 section of t!e

    paper6

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    52/67

    52

    -.- Optimiation

    7$n locall' on a data

    sample

    on t!e laptop7$n a mont! later

    after t!e data evolved

    #as! vs. Sort

    artition vs. Broadcast

    Cac!ing

    7e$sing partition9sortE/ec$tion

    lan A

    E/ec$tion

    lan B

    7$n on large files

    on t!e cl$ster

    E/ec$tion

    lan C

    &!at is A$tomatic Optimiation" ?!e s'stemJs 3$ilt4in

    optimier takes care of finding t!e 3est wa' to

    e/ec$te t!e program in an' environment.

    - - Optimiation

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    53/67

    53

    -.- Optimiation

    In contrast to Flink=s 3$ilt4in a$tomatic optimiation,

    Spark 2o3s !ave to 3e man$all' optimied and

    adapted to specific datasets 3eca$se 'o$ need toman$all' control partitioning and cac!ing if 'o$

    want to get it rig!t.Spark SPN $ses t!e Catal'st optimier t!at

    s$pports 3ot! r$le43ased and cost43asedoptimiation. 7eferences6Spark SPN6 7elational Data rocessing in Spark

    !ttp699people.csail.mit.ed$9matei9papers9)8(;9sigmodMsparkMs+l.pdf

    Deep Dive into Spark SPN=s Catal'st Optimier!ttps699data3ricks.com93log9)8(;98-9(*9deep4dive4into4spark4s+ls4catal'st4

    optimier.!tml

    - ; Config$ration

    http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdfhttps://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.htmlhttps://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.htmlhttps://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.htmlhttps://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.htmlhttp://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    54/67

    54

    -.;. Config$ration

    Flinkre+$ires no memor' t!res!olds to

    config$reFlink manages its own memor'Flinkre+$ires no complicated network

    config$rations

    ipelining engine re+$ires m$c! lessmemor' for data e/c!angeFlinkre+$ires no serialiers to 3e config$redFlink !andles its own t'pe e/traction and

    data representation

    - ?$ning

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    55/67

    55

    -.. ?$ning

    According to ike Olsen, C!ief Strateg' Officer of

    Clo$deraInc. Spark is too kno33' 1 it !as too man't$ning parameters, and t!e' need constant

    ad2$stment as workloads, data vol$mes, $ser co$nts

    c!ange. 7eference6

    !ttp699vision.clo$dera.com9one4platform9?$ning Spark Streaming for ?!ro$g!p$t B' :erard

    aas from %irdata. Decem3er )), )8(- !ttp

    699www.virdata.com9t$ning4spark9Spark ?$ning6

    !ttp699spark.apac!e.org9docs9latest9t$ning.!tml

    - ?$ning

    http://vision.cloudera.com/one-platform/http://www.virdata.com/tuning-spark/http://www.virdata.com/tuning-spark/http://spark.apache.org/docs/latest/tuning.htmlhttp://spark.apache.org/docs/latest/tuning.htmlhttp://www.virdata.com/tuning-spark/http://www.virdata.com/tuning-spark/http://www.virdata.com/tuning-spark/http://vision.cloudera.com/one-platform/http://vision.cloudera.com/one-platform/
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    56/67

    56

    -.. ?$ning

    7$n locall' on a data

    sample

    on t!e laptop7$n a mont! later

    after t!e data evolved

    #as! vs. Sort

    artition vs. Broadcast

    Cac!ing

    7e$sing partition9sortE/ec$tion

    lan A

    E/ec$tion

    lan B

    7$n on large files

    on t!e cl$ster

    E/ec$tion

    lan C

    &!at is A$tomatic Optimiation" ?!e s'stemJs 3$ilt4in

    optimier takes care of finding t!e 3est wa' to

    e/ec$te t!e program in an' environment.

    erformance

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    57/67

    57

    . erformance

    &!' Flink provides a 3etter performance"C$stom memor' manager0ativeclosed4loop iteration operators makegrap!

    and mac!ine learning applications r$n m$c! faster.7ole of t!e 3$ilt4in a$tomatic optimier. For

    e/ample6 more efficient 2oin processing.

    ipeliningdata to t!e ne/t operator in Flink is moreefficient t!an in Spark.

    See 3enc!marking res$lts against Flink !ere6!ttp699www.slides!are.net9s3altagi9w!'4apac!e4flink4is4t!e4-g4of43ig4da

    ta4anal'tics4frameworks9

    http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    58/67

    Agenda

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    59/67

    59

    Agenda

    I. otivation for t!is talk

    II. Apac!e Flink vs. Apac!e Spark"

    III. #ow Flink is $sed at Capital

    One"

    I%. &!at are some ke' takeawa's"

    III. #ow Flink is $sed at Capital One"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    60/67

    60

    III. #ow Flink is $sed at Capital One"

    &e started o$r 2o$rne' wit! Apac!e Flink at Capital

    One w!ile researc!ing and contrasting stream

    processing tools in t!e #adoop ecos'stem wit! apartic$lar interest in t!e ones providing real4time

    stream processing capa3ilitiesand not 2$st micro4

    3atc!ing as in Apac!e Spark.

    &!ile learning more a3o$t Apac!e Flink, wediscovered some $ni+$e capa3ilities of Flink w!ic!

    differentiate it from ot!er Big Data anal'tics tools not

    onl' for 7eal4?ime streaming3$t also for Batc!

    processing.&e eval$ated Apac!e Flink 7eal4?ime stream

    processing capa3ilities in a OC.

    III. #ow Apac!e Flink is $sed at Capital One"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    61/67

    61

    p p

    &!ere are we in o$r Flink 2o$rne'"S$ccessf$l installation of Apac!e Flink 8. in o$r

    re4rod$ction cl$ster r$nning on CD# ;.- wit!sec$rit' and #ig! Availa3ilit' ena3led.

    S$ccessf$l installation of Apac!e Flink 8. in a (8

    nodes 7D cl$ster r$nning #D.

    S$ccessf$l completion of Flink OCfor real4timestream processing. ?!e OC proved t!at propriet'

    s'stem can 3e replaced 3' a com3ination of tools6

    Apac!e Qafka, Apac!e Flink,Elasticsearc! and

    Qi3ana in addition to advanced real4time streaming

    anal'tics.

    III. #ow Apac!e Flink is $sed at Capital One"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    62/67

    62

    p p

    &!at are t!e opport$nities for $sing Apac!e

    Flinkat Capital One"(. 7eal4?ime streaming anal'tics

    ). Cascadingon Flink

    *. Flink=s ap7ed$ce Compati3ilit' Na'er

    -. Flink=s Storm Compati3ilit' Na'er

    ;. Ot!er Flink li3raries Gac!ine Nearning

    and :rap! processingH once t!e' come

    o$t of 3eta.

    III. #ow Apac!e Flink is $sed at Capital One"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    63/67

    63

    p p

    Cascading on Flink6 First release of Cascading on Flink was anno$nced

    recentl' 3' Data Artisans and Conc$rrent. It will 3es$pported in $pcoming Cascading *.(. Capital One is t!e firstcompan' verif'ingt!is release

    on real4world Cascading data flows wit! a simple

    config$ration switc! andno code re4work neededL

    ?!is is a good e/ample ofdoing anal'tics on3o$ndeddata sets GCascadingH $singa stream processor GFlinkH

    E/pected advantagesof performance 3oost and less

    reso$rce cons$mption. F$t$re work is to s$pport

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    64/67

    64

    p p

    Flink=s compati3ilit' la'er for Storm6&e can e/ec$te e/isting Storm topologies

    $sing Flink as t!e $nderl'ing engine.&e can re$seo$r application code G3oltsand

    spo$tsH inside Flink programs.

    Flink=s li3raries GFlinkNfor ac!ineNearning and :ell'for Narge scale grap!

    processingH can 3e $sed along Flink=s

    DataStream AI and DataSet AI for o$r end to

    end 3ig data anal'tics needs.

    Agenda

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    65/67

    65

    Agenda

    I. otivation for t!is talk

    II. Apac!e Flink vs. Apac!e Spark"

    III. #ow Flink is $sed at Capital One"

    I%. &!at are some ke' takeawa's"

    III. &!at are some ke' takeawa's"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    66/67

    66

    ' '0eit!er Flink nor Spark will 3e t!e singleanal'tics

    framework t!at will solve ever' Big Data pro3lemL

    B' design, Spark is not for real4time stream processingw!ile Flinkprovides a tr$e low latenc' streaming

    engine and advanced DataStream AI for real4time

    streaming anal'tics.Alt!o$g! Spark is a!ead in pop$larit' and adoption,

    Flinkis a!ead intec!nolog' innovation and is growingfast.

    It is not alwa's t!e most innovative tool t!at gets t!e

    largest market s!are, t!e Flink comm$nit' needs to

    take into acco$nt t!e market d'namicsLBot! Sparkand Flinkwill !ave t!eir sweet spots

    despite t!eir e too s'ndrome5.

    ?! k L

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    67/67

    ?!anksL ?o all of 'o$ for attendingL ?o Capital One for giving me t!e

    opport$nit' to meet wit! t!e growing

    Apac!e Flink famil'. ?o t!e Apac!e Flink comm$nit' for t!e

    great spirit of colla3oration and !elp.)8( will 3e t!e 'ear of Apac!e FlinkLSee 'o$ at FlinkForward )8(L