Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

download Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

of 67

Transcript of Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    1/67

    Flink vs. Spark

    Slim Baltagi @SlimBaltagiDirector of Big Data Engineering, Fellow

    Capital One

    https://twitter.com/SlimBaltagihttps://twitter.com/SlimBaltagi
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    2/67

    2

    Agenda

    I. otivation for t!is talkII. Apac!e Flink vs. Apac!e Spark"

    III. #ow Flink is $sed at Capital One"

    I%. &!at are some ke' takeawa's"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    3/67

    3

    I. otivation for t!is talk

    (. arketing fl$ff

    ). Conf$sing statements

    *. B$rning +$estions incorrect or

    o$tdated answers

    -. #elping ot!ers eval$ating Flink vs.Spark

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    4/67

    4

    (. arketing fl$ff

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    5/67

    5

    (. arketing fl$ff

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    6/67

    6

    ). Conf$sing statements

    Sparkis alread' an e/cellent piece of software and is

    advancing ver' +$ickl'.0o vendor 1 no new pro2ect 1is likel' to catc! $p. C!asing Spark wo$ld 3e a waste of

    time, and wo$ld dela' availa3ilit' of real4time anal'tic

    and processing services for no good reason.5 So$rce6

    ap7ed$ce and Spark, ike Olson. C!ief Strateg'

    Officer, Clo$dera. Decem3er, *8t!)8(*!ttp699vision.clo$dera.com9mapred$ce4spark9

    :oal6 one engine forall data so$rces, workloadsand

    environments. So$rce6 Slide (; of a!aria. C?O, Data3ricks.Fe3r$ar' )8t!, )8(;. !ttp699www.slides!are.net9data3ricks9new4directions4for4apac!e4spark4in4)8(;

    http://vision.cloudera.com/mapreduce-spark/http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015http://vision.cloudera.com/mapreduce-spark/http://vision.cloudera.com/mapreduce-spark/
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    7/67

    7

    *. B$rning +$estions incorrect or

    o$tdated answersro2ects t!at depend on smart optimiersrarel' work

    well in real life.5 C$rt onas!, onas! 7esearc!.an$ar' (, )8(;!ttp699www.comp$terworld.com9article9)(893ig4data4digest4!ow4man'4!adoop

    s4do4we4reall'4need.!tml

    Flink is 3asicall' a Spark alternative o$t of :erman',w!ic! I=ve 3een dismissing as $nneeded5. C$rt

    onas!, onas! 7esearc!, arc! ;, )8(;. !ttp699www.d3ms).com9)8(;98*98;9cask4and4cdap9

    Of co$rse, t!is is all a 3$llis! arg$ment for Spark Gor

    Flink, if I=m wrong to dismiss its c!ances as a SparkcompetitorH.5 C$rt onas!, onas! 7esearc!,Septem3er ), )8(;. !ttp699www.d3ms).com9)8(;989)9t!e4potential4significance4of4clo$dera4k$d$9

    http://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.htmlhttp://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.htmlhttp://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.htmlhttp://www.dbms2.com/2015/03/05/cask-and-cdap/http://www.dbms2.com/2015/03/05/cask-and-cdap/http://www.dbms2.com/2015/09/28/the-potential-significance-of-cloudera-kudu/http://www.dbms2.com/2015/09/28/the-potential-significance-of-cloudera-kudu/http://www.dbms2.com/2015/09/28/the-potential-significance-of-cloudera-kudu/http://www.dbms2.com/2015/09/28/the-potential-significance-of-cloudera-kudu/http://www.dbms2.com/2015/09/28/the-potential-significance-of-cloudera-kudu/http://www.dbms2.com/2015/03/05/cask-and-cdap/http://www.dbms2.com/2015/03/05/cask-and-cdap/http://www.dbms2.com/2015/03/05/cask-and-cdap/http://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.htmlhttp://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.htmlhttp://www.computerworld.com/article/2871760/big-data-digest-how-many-hadoops-do-we-really-need.html
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    8/67

    8

    *. B$rning +$estions incorrect or

    o$tdated answers

    ?!e 3enefit of SparkJs micro43atc! model is t!at 'o$get f$ll fa$lt4tolerance and e/actl'4once processing

    for t!e entire comp$tation, meaning it can recover all

    state and res$lts even if a node cras!es. Flink and

    Storm donJt provide t!isK5 atei >a!aria. C?O,

    Data3ricks. a' )8(;!ttp699www.kdn$ggets.com9)8(;98;9interview4matei4a!aria4creator4apac!e4

    spark.!tml

    I $nderstand Spark Streaming $ses micro43atc!ing.

    Does t!is increase latenc'" &!ile Spark does $se a

    micro43atc! e/ec$tion model, t!is does not !avem$c! impact on applicationsK5 !ttp699spark.apac!e.org9fa+.!tml

    http://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.htmlhttp://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.htmlhttp://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.htmlhttp://spark.apache.org/faq.htmlhttp://spark.apache.org/faq.htmlhttp://spark.apache.org/faq.htmlhttp://spark.apache.org/faq.htmlhttp://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.htmlhttp://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.htmlhttp://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.html
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    9/67

    9

    -. #elp ot!ers eval$ating Flink vs. SparkBesides t!e marketing fl$ff, t!e conf$sing statements,

    t!e incorrect or o$tdated answers to 3$rning

    +$estions, t!e little information on t!e s$32ect of Flinkvs. Spark is availa3le piecemealL

    &!ile eval$ating different stream processing tools at

    Capital One, we 3$ilt a framework listing categories

    and over (88 criteria to assess t!ese streamprocessing tools.

    In t!e ne/t section, I=ll 3e s!aring t!is framework and

    $se it to compare Spark and Flink on a few ke'

    criteria.

    &e !ope t!is will 3e 3eneficial to 'o$ as well w!en

    selecting Flink and9or Spark for stream processing.

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    10/67

    10

    Agenda

    I. otivation for t!is talkII. Apac!e Flink vs. Apac!e Spark"

    III. #ow Flink is $sed at Capital One"

    I%. &!at are some ke' takeawa's"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    11/67

    11

    II. Apac!e Flink vs. Apac!e Spark"

    (. &!at is Apac!e Flink"

    ). &!at is Apac!e Spark"

    *. Framework to eval$ate Flink and Spark

    -. Flink vs. Spark on a few ke' criteria

    ;. F$t$re work

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    12/67

    12

    (.&!at is Apac!e Flink"

    S+$irrel6 Animal. In !armon' wit! ot!er animals in t!e

    #adoop ecos'stem G>ooH6 elep!ant, pig, p't!on,

    camel,...S+$irrel6 reflects t!e meaning of t!e word

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    13/67

    13

    (.&!at is Apac!e Flink"

    Apac!e Flink is an open so$rce platform for

    distri3$ted stream and 3atc! data processing.5!ttps699flink.apac!e.org9

    See also t!e definition in &ikipedia6!ttps

    699en.wikipedia.org9wiki9Apac!eMFlink

    https://flink.apache.org/https://en.wikipedia.org/wiki/Apache_Flinkhttps://en.wikipedia.org/wiki/Apache_Flinkhttps://en.wikipedia.org/wiki/Apache_Flinkhttps://en.wikipedia.org/wiki/Apache_Flinkhttps://en.wikipedia.org/wiki/Apache_Flinkhttps://flink.apache.org/https://flink.apache.org/
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    14/67

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    15/67

    15

    ). &!at is Apac!e Spark"

    Apac!e SparkR is a fast and general engine for

    large4scale data processing.5 !ttp699spark.apac!e.org9See also definition in &ikipedia6 !ttps

    699en.wikipedia.org9wiki9Apac!eMSpark

    Nogo was picked to reflect Nig!tning4fast cl$stercomp$ting

    http://spark.apache.org/http://spark.apache.org/https://en.wikipedia.org/wiki/Apache_Sparkhttps://en.wikipedia.org/wiki/Apache_Sparkhttps://en.wikipedia.org/wiki/Apache_Sparkhttps://en.wikipedia.org/wiki/Apache_Sparkhttps://en.wikipedia.org/wiki/Apache_Sparkhttp://spark.apache.org/http://spark.apache.org/
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    16/67

    16

    ). &!at is Apac!e Spark"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    17/67

    17

    *. Framework to eval$ate Flink and Spark

    (. Backgro$nd). Fit4for4p$rpose Categories

    *. Organiational4fit Categories

    -. iscellaneo$s9Ot!er Categories

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    18/67

    18

    (. Backgro$nd

    (.( Definition

    (.) Origin

    (.* at$rit'

    ).- %ersion

    (.; :overnance model

    (. Nicense model

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    19/67

    19

    ). Fit4for4p$rpose Categories

    ).( Sec$rit'

    ).) rovisioning onitoring Capa3ilities).* Natenc' rocessing Arc!itect$re

    ).- State anagement

    ).; rocessing Deliver' Ass$rance

    ). Data3ase Integrations, 0ative vs. ?!ird

    part' connector

    ). #ig! Availa3ilit' 7esilienc'

    ). Ease of Development

    ). Scala3ilit'

    ).(8 ni+$e Capa3ilities9Qe' Differentiators

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    20/67

    20

    ). Fit4for4p$rpose Categories

    ).( Sec$rit'

    ).(.( A$t!entication, A$t!oriation

    ).(.) Data at rest encr'ption Gdata persisted in t!e

    frameworkH

    ).(.* Data in motion encr'ption Gprod$cer

    4Tframework 4T cons$merH

    ).(.- Data in motion encr'ption Ginter4node

    comm$nicationH

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    21/67

    21

    ). Fit4for4p$rpose Categories

    ).) rovisioning onitoring Capa3ilities

    2.2.1 Robustness of Administration

    2.2.2 Ease of maintenance Does tec!nolo"#

    provide con$"uration% deplo#ment% scalin"%

    monitorin"% performance tunin" and auditin"

    capabilities&2.2.' onitorin" Alertin"

    2.2.* +o""in"

    2.2., Audit

    2.2.- ransparent p"rade 0ersion up"rade

    it! minimum dontime

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    22/67

    22

    ). Fit4for4p$rpose Categories).* Natenc' rocessing Arc!itect$re

    2.'.1 Supports tuple at a time% microbatc!%

    transactional updates and batc! processin"

    2.'.2 3omputational model

    2.'.' Abilit# to reprocess !istorical data from

    source

    2.'.* Abilit# to reprocess !istorical data from

    native en"ine

    2.'., 3all e4ternal source (A56/database calls)

    2.'.- 6nte"ration it! 7atc! (static) source2.'.8 Data #pes (ima"es% sound etc.)

    2.'.9 Supports comple4 event processin" and

    pattern detection vs. continuous operator

    model (lo latenc#% :o control)

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    23/67

    23

    ). Fit4for4p$rpose Categories

    ).* Natenc' rocessing Arc!itect$re

    2.'.;

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    24/67

    24

    ). Fit4for4p$rpose Categories

    ).- State anagement

    ).-.( Statef$l vs. Stateless

    ).-.) Is statef$l data ersisted locall' vs.

    e/ternal data3ase vs. Ep!emeral).-.* 0ative rolling, t$m3ling and !opping

    window s$pport

    ).-.- 0ative s$pport for integrated data store

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    25/67

    25

    ). Fit4for4p$rpose Categories

    ).; rocessing Deliver' Ass$rance

    ).;.( :$arantee GAt least onceH).;.) :$arantee GAt most onceH

    ).;.* :$arantee GE/actl' onceH

    ).;.- :lo3al Event order g$aranteed).;.; >uarantee predictable and

    repeatable outcomes( deterministic or

    not)

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    26/67

    26

    ). Fit4for4p$rpose Categories

    ). Data3ase Integrations, 0ative vs. ?!ird

    part' connector

    )..( 0oSPN data3ase integration

    )..) File Format GAvro, ar+$et and ot!er

    format s$pportH

    )..* 7DBS integration

    )..- In4memor' data3ase integration9 Cac!ing

    integration

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    27/67

    27

    ). Fit4for4p$rpose Categories

    ). #ig! Availa3ilit' 7esilienc')..( Can t!e s'stem avoid slowdown d$e to straggler node

    )..) Fa$lt4?olerance Gdoes t!e tool !andlenode9operator9messaging fail$res wit!o$t catastrop!icall' failingH

    )..* State recover' from in4memor'

    )..- State recover' from relia3le storage

    )..; Over!ead of fa$lt tolerance mec!anism GDoes failure!andlin" introduce additional latenc# or ne"ativel#

    impact t!rou"!put&)

    ).. $lti4site s$pport Gm$lti4regionH

    ).. Flow control6 3ackpress$re tolerance from slow operators or

    cons$mers

    ).. Fast parallel recover' vs. replication or serial recover' on

    one node at a time

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    28/67

    28

    ). Fit4for4p$rpose Categories

    ). Ease of Development)..( SPN Interface

    )..) 7eal4?ime de3$gging option)..* B$ilt4in stream oriented a3straction Gstreams, windows, operators , iterators

    4 e/pressive AIs t!at ena3le programmers to +$ickl' develop streaming data

    applicationsH

    )..- Separation of application logic from fa$lt tolerance

    )..; ?esting tools and framework

    2.9.- 3!an"e mana"ement multiple model deplo#ment ( E.".

    separate cluster or can one create multiple independent redundant

    streams internall#)

    2.9.8 D#namic model sappin" (Support d#namic updatin" of

    operators/topolo"#/DA> it!out restart or service interruption)

    2.9.9 Re?uired @noled"e of s#stem internals to develop anapplication

    ).. ?ime to market for applications

    2.9.1= Supports plu"in of e4ternal libraries

    2.9.11 A56

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    29/67

    29

    ). Fit4for4p$rpose Categories

    ). Scala3ilit'

    )..( S$pports m$lti4t!read across m$ltiple

    processors9cores

    )..) Distri3$ted across m$ltiple mac!ines9servers

    )..* artition Algorit!m)..- D'namic elasticit' 4 Scaling wit! minim$m impact9

    performance penalt'

    )..; #oriontal scaling wit! linear

    performance9t!ro$g!p$t

    ).. %ertical scaling G:H

    ).. Scaling wit!o$t downtime

    O f C

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    30/67

    30

    *. Organiational4fit Categories

    *.( at$rit' Comm$nit' S$pport

    *.) S$pport Nang$ages for Development

    *.* Clo$d orta3ilit'

    *.- Compati3ilit' wit! 0ative #adoop

    Arc!itect$re

    *.; Adoption of Comm$nit' vs. Enterprise

    Edition

    *. Integration wit! essage Brokers

    * O i ti l fit C t i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    31/67

    31

    *. Organiational4fit Categories

    *.( at$rit' Comm$nit' S$pport

    *.(.( Open So$rce S$pport *.(.) at$rit' G'earsH

    *.(.* Sta3le

    *.(.- Centralied doc$mentation wit! versioning

    s$pport

    *.(.; Documentation of pro"rammin" A56 it!

    "ood code e4amples

    '.1.- 3entralied visible roadmap

    '.1.8 3ommunit# acceptance vs. 0endor

    driven

    *.(. Contri3$tors

    * O i ti l fit C t i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    32/67

    32

    *. Organiational4fit Categories

    *.) S$pport Nang$ages for Development

    *.).( Nang$age tec!nolog' was 3$ilt on

    *.).) Nang$age s$pported to access

    tec!nolog'

    * O i ti l fit C t i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    33/67

    33

    *. Organiational4fit Categories

    *.* Clo$d orta3ilit'

    *.*.( Ease of migration 3etween clo$d vendors

    *.*.) Ease of migration 3etween on premise to clo$d

    *.*.* Ease of migration from on premise to complete

    clo$d services

    *.*.- Clo$d compati3ilit' GA&S, :oogle, A$reH

    * O i ti l fit C t i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    34/67

    34

    *. Organiational4fit Categories

    *.- Compati3ilit' wit! 0ative #adoop

    Arc!itect$re

    *.-.( Implement on top of #adoop A70 vs.

    Standalone

    *.-.) esos

    *.-.* Coordination wit! Apac!e >ookeeper

    * O i ti l fit C t i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    35/67

    35

    *. Organiational4fit Categories

    *.; Adoption of Comm$nit' vs. Enterprise

    Edition

    *.;.( Open So$rce

    *.;.) Enterprise S$pport

    - i ll 9Ot! C t i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    36/67

    36

    -. iscellaneo$s9Ot!er Categories

    -.( Best S$ited for

    -.) Qe' $se case scenarios

    -.* Companies $sing tec!nolog'

    - Fli k S k f k it i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    37/67

    37

    -. Flink vs. Spark on a few ke' criteria

    (. Streaming Engine

    ). Iterative rocessing*. emor' anagement

    -. Optimiation

    ;. Config$ration. ?$ning

    . erformance

    - ( St i E i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    38/67

    38

    -.(. Streaming Engine

    an' time4critical applications need to process large

    streams of live dataand provide res$lts in real4time.For e/ample6Financial Fra$d detectionFinancial Stock monitoringAnomal' detection

    ?raffic management applicationsatient monitoringOnline recommenders

    Some claim t!at ;U of streaming $se cases can

    3e !andled wit! micro43atc!esL" 7eall'LLL

    - ( St i E i

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    39/67

    39

    -.(. Streaming Engine

    Spark=s micro43atc!ing isn=t good eno$g!L?ed D$nning, C!ief Applications Arc!itect at ap7,

    talk at t!e Ba' Area Apac!e Flink eet$p on A$g$st), )8(;

    !ttp699www.meet$p.com9Ba'4Area4Apac!e4Flink4eet$p9events9))-(;)-

    9

    ?ed descri3ed several $se cases w!ere 3atc! and micro

    3atc! processing is not appropriate and descri3ed w!'.#e also descri3ed w!at a tr$e streaming sol$tion needs

    to provide for solving t!ese pro3lems.?!ese $se cases were taken fromreal ind$strial

    sit$ations, 3$t t!e descriptions drove down to tec!nical

    details as well.

    - ( Streaming Engine

    http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/224189524/http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/224189524/http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/224189524/http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/224189524/
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    40/67

    40

    -.(. Streaming Engine

    I wo$ld consider stream data anal'sis to 3e a ma2or

    $ni+$e selling propositionfor Flink. D$e to its

    pipelined arc!itect$re, Flink is a perfect matc! for 3ig

    data stream processing in t!e Apac!e stack. %olker

    arkl

    7ef.6 On Apac!e Flink. Interview wit! %olker arkl, $ne )-t!)8(;

    !ttp699www.od3ms.org93log9)8(;989on4apac!e4flink4interview4wit!4volker4markl9

    Apac!e Flink $ses streams for all workloads6

    streaming, SPN, micro43atc! and 3atc!. Batc! is 2$st

    treated as a finite set of streamed data. ?!is makes

    Flinkt!e most sop!isticated distri3$ted open so$rceBig Data processing engine Gnott!e most mat$reone

    'etLH.

    - ) Iterative rocessing

    http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    41/67

    41

    -.). Iterative rocessing

    &!' Iterations" an' ac!ine Nearning and :rap!

    processing algorit!ms need iterationsL For e/ample6

    ac!ine Nearning Algorit!msCl$stering GQ4eans, Canop', KH:radient descent GNogistic 7egression, atri/

    FactoriationH

    :rap! rocessing Algorit!msage47ank, Nine47ankat! algorit!ms on grap!s Gs!ortest pat!s,

    centralities, KH:rap! comm$nities 9 dense s$34componentsInference GBelief propagationH

    - ) Iterative rocessing

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    42/67

    42

    -.). Iterative rocessing

    FlinkJs AI offers two dedicated iteration operations6

    Iterateand Delta Iterate.

    Flink e/ec$tes programs wit! iterationsas c'clicdata flows6 a data flow program Gand all its operatorsH

    is sc!ed$led2$st once.In eac! iteration, t!e step f$nction cons$mest!e

    entire inp$t Gt!e res$lt of t!e previo$s iteration, or t!einitial data setH, and comp$test!e ne/t version of t!e

    partial sol$tion

    - ) Iterative rocessing

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    43/67

    43

    -.). Iterative rocessing

    Delta iterations r$n onl' onparts of t!e datat!at is

    c!anging and can significantl' speed $p man'

    mac!ine learning andgrap! algorit!ms 3eca$se t!ework in eac! iteration decreases as t!e n$m3er of

    iterations goes on.

    Doc$mentation on iterations wit! Apac!e Flink!ttp699ci.apac!e.org9pro2ects9flink9flink4docs4master9apis9iterations.!tml

    - ) Iterative rocessing

    http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.htmlhttp://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    44/67

    44

    -.). Iterative rocessing

    StepStep

    Step Step Step

    3lient

    for (int i = 0; i < m axIterations; i+ + ) {// Execute M apReduce job

    0on4native iterations in #adoop and Sparkare

    implemented as reg$lar for4loops o$tside t!e s'stem.

    - ) Iterative rocessing

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    45/67

    45

    -.). Iterative rocessing

    Alt!o$g!Spark cac!es data across iterations, it still

    needs to sc!ed$leand e/ec$tea new set of tasks foreac! iteration.

    Spinning Fast Iterative Data Flows 4 Ewen et al. )8() 6

    !ttp699vld3.org9pvld39vol;9p()Mstep!anewenMvld3)8().pdf ?!e

    Apac!e Flink model for incremental iterative dataflow

    processing. Academic paper.7ecap of t!e paper, $ne (, )8(;!ttp

    6993log.acol'er.org9)8(;989(9spinning4fast4iterative4dataflows9

    Doc$mentationon iterationswit! Apac!e Flink!ttp699ci.apac!e.org9pro2ects9flink9flink4docs4master9apis9iterations.!tml

    - * emor' anagement

    http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdfhttp://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/http://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.htmlhttp://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.htmlhttp://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.htmlhttp://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.htmlhttp://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.htmlhttp://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/http://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    46/67

    46

    -.*. emor' anagement

    P$estion6 Spark vs. Flink low memor' availa3le"

    P$estion answered on stackoverflow.com!ttp699stackoverflow.com9+$estions9*(*;)9spark4vs4flink4low4memor'4

    availa3le

    ?!e same +$estion still $nanswered on t!e Apac!e

    Spark ailing NistLL !ttp699apac!e4flink4$ser4mailing4list4arc!ive.)**8;8.n-.na33le.com9spark4

    vs4flink4low4memor'4availa3le4td)*-.!tml

    - * emor' anagement

    http://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-availablehttp://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-availablehttp://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-availablehttp://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.htmlhttp://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.htmlhttp://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.htmlhttp://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.htmlhttp://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.htmlhttp://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-td2364.htmlhttp://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-availablehttp://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-availablehttp://stackoverflow.com/questions/31935299/spark-vs-flink-low-memory-available
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    47/67

    47

    -.*. emor' anagement

    Feat$res6CVV st'le memor' management inside t!e %

    ser data stored in serialied 3'te arra's in %emor' is allocated, de4allocated, and $sed strictl'

    $sing an internal3$ffer pool implementation.

    Advantages6

    (. Flink will not t!row an OO e/ception on 'o$.). 7ed$ctionof :ar3age Collection G:CH

    *. %er' efficient disk spilling and network transfers

    -. 0o 0eed for r$ntime t$ning

    ;. ore relia3le and sta3le performance

    - * emor' anagement

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    48/67

    48

    -.*. emor' anagement

    pub!ic c!ass " # {pub!ic $trin%& ord;pub!ic intcount;empt

    #pa"e

    ool of emor' ages

    Sorting,

    !as!ing,

    cac!ing

    S!$ffles9

    3roadcasts

    ser code

    o32ects

    anagednman

    aged

    Flink contains its own memor' management stack.

    ?o do t!at, Flink contains its own t'pe e/traction

    and serialiationcomponents.

    % #eap

    0

    etwork

    B$ffers

    - * emor' anagement

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    49/67

    49

    -.*. emor' anagement

    eeking into Apac!e FlinkJs Engine 7oom 4 3' Fa3ian

    #Wske, arc! (*, )8(; !ttp

    699flink.apac!e.org9news9)8(;98*9(*9peeking4into4Apac!e4Flinks4Engine47oom.!tml$ggling wit! Bits and B'tes 4 3' Fa3ian #Wske, a'

    ((,)8(;!ttps699flink.apac!e.org9news9)8(;98;9((9$ggling4wit!4Bits4and4B'tes.!tml

    emor' anagement GBatc! AIH 3' Step!an Ewen4

    a' (, )8(;!ttps699cwiki.apac!e.org9confl$ence9pages9viewpage.action"pageIdX

    ;*-(;);

    Flink added anOff4#eap optionfor its memor'

    management component in Flink 8.(86!ttps699iss$es.apac!e.org92ira93rowse9FNI0Q4(*)8

    - * emor' anagement

    http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.htmlhttp://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.htmlhttps://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.htmlhttps://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525https://issues.apache.org/jira/browse/FLINK-1320https://issues.apache.org/jira/browse/FLINK-1320https://issues.apache.org/jira/browse/FLINK-1320https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.htmlhttps://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.htmlhttps://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.htmlhttp://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.htmlhttp://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.htmlhttp://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    50/67

    50

    -.*. emor' anagement

    Compared to Flink, Spark is still 3e!ind in c$stom

    memor' management3$t is catc!ing $p wit! itspro2ect ?$ngstenfor emor' anagement and Binar'

    rocessing:manage memor' e/plicitl' and eliminate

    t!e over!ead of % o32ect model and gar3age

    collection. April ), )8(-!ttps699data3ricks.com93log9)8(;98-9)9pro2ect4t$ngsten43ringing4spark4closer4to43are4metal.!tml

    It seems t!at Spark is adopting somet!ing similar to

    Flinkand t!e initial ?$ngsten anno$ncement read

    almost like Flink doc$mentationLL

    - - Optimiation

    https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlhttps://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    51/67

    51

    -.- Optimiation

    Apac!e Flink comes wit! an optimiert!at is

    independent of t!e act$al programming interface.

    It c!ooses a fitting e/ec$tion strateg' depending ont!e inp$ts and operations.

    E/ample6 t!e oin operator will c!oose 3etween

    partitioning and 3roadcasting t!e data, as well as

    3etween r$nning a sort4merge42oin or a !'3rid !as!2oin algorit!m.

    ?!is !elps 'o$ foc$s on 'o$r application logic

    rat!er t!an parallel e/ec$tion. P$ick introd$ction to t!e Optimier6 section of t!e

    paper6

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    52/67

    52

    -.- Optimiation

    7$n locall' on a data

    sample

    on t!e laptop7$n a mont! later

    after t!e data evolved

    #as! vs. Sort

    artition vs. Broadcast

    Cac!ing

    7e$sing partition9sortE/ec$tion

    lan A

    E/ec$tion

    lan B

    7$n on large files

    on t!e cl$ster

    E/ec$tion

    lan C

    &!at is A$tomatic Optimiation" ?!e s'stemJs 3$ilt4in

    optimier takes care of finding t!e 3est wa' to

    e/ec$te t!e program in an' environment.

    - - Optimiation

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    53/67

    53

    -.- Optimiation

    In contrast to Flink=s 3$ilt4in a$tomatic optimiation,

    Spark 2o3s !ave to 3e man$all' optimied and

    adapted to specific datasets 3eca$se 'o$ need toman$all' control partitioning and cac!ing if 'o$

    want to get it rig!t.Spark SPN $ses t!e Catal'st optimier t!at

    s$pports 3ot! r$le43ased and cost43asedoptimiation. 7eferences6Spark SPN6 7elational Data rocessing in Spark

    !ttp699people.csail.mit.ed$9matei9papers9)8(;9sigmodMsparkMs+l.pdf

    Deep Dive into Spark SPN=s Catal'st Optimier!ttps699data3ricks.com93log9)8(;98-9(*9deep4dive4into4spark4s+ls4catal'st4

    optimier.!tml

    - ; Config$ration

    http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdfhttps://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.htmlhttps://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.htmlhttps://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.htmlhttps://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.htmlhttp://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    54/67

    54

    -.;. Config$ration

    Flinkre+$ires no memor' t!res!olds to

    config$reFlink manages its own memor'Flinkre+$ires no complicated network

    config$rations

    ipelining engine re+$ires m$c! lessmemor' for data e/c!angeFlinkre+$ires no serialiers to 3e config$redFlink !andles its own t'pe e/traction and

    data representation

    - ?$ning

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    55/67

    55

    -.. ?$ning

    According to ike Olsen, C!ief Strateg' Officer of

    Clo$deraInc. Spark is too kno33' 1 it !as too man't$ning parameters, and t!e' need constant

    ad2$stment as workloads, data vol$mes, $ser co$nts

    c!ange. 7eference6

    !ttp699vision.clo$dera.com9one4platform9?$ning Spark Streaming for ?!ro$g!p$t B' :erard

    aas from %irdata. Decem3er )), )8(- !ttp

    699www.virdata.com9t$ning4spark9Spark ?$ning6

    !ttp699spark.apac!e.org9docs9latest9t$ning.!tml

    - ?$ning

    http://vision.cloudera.com/one-platform/http://www.virdata.com/tuning-spark/http://www.virdata.com/tuning-spark/http://spark.apache.org/docs/latest/tuning.htmlhttp://spark.apache.org/docs/latest/tuning.htmlhttp://www.virdata.com/tuning-spark/http://www.virdata.com/tuning-spark/http://www.virdata.com/tuning-spark/http://vision.cloudera.com/one-platform/http://vision.cloudera.com/one-platform/
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    56/67

    56

    -.. ?$ning

    7$n locall' on a data

    sample

    on t!e laptop7$n a mont! later

    after t!e data evolved

    #as! vs. Sort

    artition vs. Broadcast

    Cac!ing

    7e$sing partition9sortE/ec$tion

    lan A

    E/ec$tion

    lan B

    7$n on large files

    on t!e cl$ster

    E/ec$tion

    lan C

    &!at is A$tomatic Optimiation" ?!e s'stemJs 3$ilt4in

    optimier takes care of finding t!e 3est wa' to

    e/ec$te t!e program in an' environment.

    erformance

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    57/67

    57

    . erformance

    &!' Flink provides a 3etter performance"C$stom memor' manager0ativeclosed4loop iteration operators makegrap!

    and mac!ine learning applications r$n m$c! faster.7ole of t!e 3$ilt4in a$tomatic optimier. For

    e/ample6 more efficient 2oin processing.

    ipeliningdata to t!e ne/t operator in Flink is moreefficient t!an in Spark.

    See 3enc!marking res$lts against Flink !ere6!ttp699www.slides!are.net9s3altagi9w!'4apac!e4flink4is4t!e4-g4of43ig4da

    ta4anal'tics4frameworks9

    http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-data-analytics-frameworks/87
  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    58/67

    Agenda

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    59/67

    59

    Agenda

    I. otivation for t!is talk

    II. Apac!e Flink vs. Apac!e Spark"

    III. #ow Flink is $sed at Capital

    One"

    I%. &!at are some ke' takeawa's"

    III. #ow Flink is $sed at Capital One"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    60/67

    60

    III. #ow Flink is $sed at Capital One"

    &e started o$r 2o$rne' wit! Apac!e Flink at Capital

    One w!ile researc!ing and contrasting stream

    processing tools in t!e #adoop ecos'stem wit! apartic$lar interest in t!e ones providing real4time

    stream processing capa3ilitiesand not 2$st micro4

    3atc!ing as in Apac!e Spark.

    &!ile learning more a3o$t Apac!e Flink, wediscovered some $ni+$e capa3ilities of Flink w!ic!

    differentiate it from ot!er Big Data anal'tics tools not

    onl' for 7eal4?ime streaming3$t also for Batc!

    processing.&e eval$ated Apac!e Flink 7eal4?ime stream

    processing capa3ilities in a OC.

    III. #ow Apac!e Flink is $sed at Capital One"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    61/67

    61

    p p

    &!ere are we in o$r Flink 2o$rne'"S$ccessf$l installation of Apac!e Flink 8. in o$r

    re4rod$ction cl$ster r$nning on CD# ;.- wit!sec$rit' and #ig! Availa3ilit' ena3led.

    S$ccessf$l installation of Apac!e Flink 8. in a (8

    nodes 7D cl$ster r$nning #D.

    S$ccessf$l completion of Flink OCfor real4timestream processing. ?!e OC proved t!at propriet'

    s'stem can 3e replaced 3' a com3ination of tools6

    Apac!e Qafka, Apac!e Flink,Elasticsearc! and

    Qi3ana in addition to advanced real4time streaming

    anal'tics.

    III. #ow Apac!e Flink is $sed at Capital One"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    62/67

    62

    p p

    &!at are t!e opport$nities for $sing Apac!e

    Flinkat Capital One"(. 7eal4?ime streaming anal'tics

    ). Cascadingon Flink

    *. Flink=s ap7ed$ce Compati3ilit' Na'er

    -. Flink=s Storm Compati3ilit' Na'er

    ;. Ot!er Flink li3raries Gac!ine Nearning

    and :rap! processingH once t!e' come

    o$t of 3eta.

    III. #ow Apac!e Flink is $sed at Capital One"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    63/67

    63

    p p

    Cascading on Flink6 First release of Cascading on Flink was anno$nced

    recentl' 3' Data Artisans and Conc$rrent. It will 3es$pported in $pcoming Cascading *.(. Capital One is t!e firstcompan' verif'ingt!is release

    on real4world Cascading data flows wit! a simple

    config$ration switc! andno code re4work neededL

    ?!is is a good e/ample ofdoing anal'tics on3o$ndeddata sets GCascadingH $singa stream processor GFlinkH

    E/pected advantagesof performance 3oost and less

    reso$rce cons$mption. F$t$re work is to s$pport

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    64/67

    64

    p p

    Flink=s compati3ilit' la'er for Storm6&e can e/ec$te e/isting Storm topologies

    $sing Flink as t!e $nderl'ing engine.&e can re$seo$r application code G3oltsand

    spo$tsH inside Flink programs.

    Flink=s li3raries GFlinkNfor ac!ineNearning and :ell'for Narge scale grap!

    processingH can 3e $sed along Flink=s

    DataStream AI and DataSet AI for o$r end to

    end 3ig data anal'tics needs.

    Agenda

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    65/67

    65

    Agenda

    I. otivation for t!is talk

    II. Apac!e Flink vs. Apac!e Spark"

    III. #ow Flink is $sed at Capital One"

    I%. &!at are some ke' takeawa's"

    III. &!at are some ke' takeawa's"

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    66/67

    66

    ' '0eit!er Flink nor Spark will 3e t!e singleanal'tics

    framework t!at will solve ever' Big Data pro3lemL

    B' design, Spark is not for real4time stream processingw!ile Flinkprovides a tr$e low latenc' streaming

    engine and advanced DataStream AI for real4time

    streaming anal'tics.Alt!o$g! Spark is a!ead in pop$larit' and adoption,

    Flinkis a!ead intec!nolog' innovation and is growingfast.

    It is not alwa's t!e most innovative tool t!at gets t!e

    largest market s!are, t!e Flink comm$nit' needs to

    take into acco$nt t!e market d'namicsLBot! Sparkand Flinkwill !ave t!eir sweet spots

    despite t!eir e too s'ndrome5.

    ?! k L

  • 7/23/2019 Flink vs Spark by Slim Baltagi 151016065205 Lva1 App6891

    67/67

    ?!anksL ?o all of 'o$ for attendingL ?o Capital One for giving me t!e

    opport$nit' to meet wit! t!e growing

    Apac!e Flink famil'. ?o t!e Apac!e Flink comm$nit' for t!e

    great spirit of colla3oration and !elp.)8( will 3e t!e 'ear of Apac!e FlinkLSee 'o$ at FlinkForward )8(L