21.04.2016 Meetup: Spark vs. Flink

  • Spark vs. Flink: Rumble in the (Big Data) Jungle

    München, 2016-04-20

    Konstantin Knauf, Michael Pisula

    mailto:[email protected] | mailto:[email protected] | http://www.tngtech.com/

  • Background

  • The Big Data Ecosystem: Apache Top-Level Projects over Time

    (Timeline of Apache top-level projects: 2008, 2010, 2013, 2014, 2015)

  • The New Guard

                            Spark                               Flink
    Origin                  Berkeley University                 TU Berlin
    Apache Incubator        2013                                04/2014
    Apache Top-Level        02/2014                             01/2015
    Company                 databricks                          data Artisans
    Supported languages     Scala, Java, Python, R              Java, Scala, Python
    Implemented in          Scala                               Java
    Cluster                 Stand-Alone, Mesos, EC2, YARN       Stand-Alone, Mesos, EC2, YARN
    Teaser                  "Lightning-fast cluster computing"  "Scalable Batch and Stream Data Processing"

  • The Challenge

  • Real-Time Analysis of a Superhero Fight Club

    Stream:  Fight (hitter: Int, hittee: Int, hitpoints: Int)

    Batch:   Segment (id: Int, name: String, segment: String)
             Detail (name: String, gender: Int, birthYear: Int, noOfAppearances: Int)
             combined into Hero (id: Int, name: String, segment: String, gender: Int, birthYear: Int, noOfAppearances: Int)
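
    The records above map naturally to plain Java value classes. Below is a minimal, illustrative
    sketch of the stream-side record; the field names come from the slide, while the class itself
    is an assumption (in the actual setup these records are Avro-generated).

    // Illustrative POJO for the stream-side record; hypothetical, for orientation only.
    public class FightEvent {
        private int hitterId;    // id of the hero dealing the hit
        private int hitteeId;    // id of the hero taking the hit
        private long hitPoints;  // damage dealt (listed as Int on the slide, widened to long here)

        public FightEvent() {}   // no-arg constructor, needed for (de)serialization

        public int getHitterId()   { return hitterId; }
        public int getHitteeId()   { return hitteeId; }
        public long getHitPoints() { return hitPoints; }

        public void setHitterId(int hitterId)    { this.hitterId = hitterId; }
        public void setHitteeId(int hitteeId)    { this.hitteeId = hitteeId; }
        public void setHitPoints(long hitPoints) { this.hitPoints = hitPoints; }
    }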

  • The Setup

    (Diagram: a data generator writes Fight events as Avro to a Kafka cluster; an AWS cluster
    reads them for stream processing; the Segment and Detail data go through batch processing
    to produce the Heroes data set; stream and batch results are then combined.)

  • Round 1: Setting Up

  • Dependencies

    compile "org.apache.flink:flink-java:1.0.0"compile "org.apache.flink:flink-streaming-java_2.11:1.0.0"//For Local Execution from IDEcompile "org.apache.flink:flink-clients_2.11:1.0.0"

    Skeleton

    // Batch (DataSet API)
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    // Stream (DataStream API)
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Processing logic

    // For streaming
    env.execute();

  • Dependencies

    compile 'org.apache.spark:spark-core_2.10:1.5.0'
    compile 'org.apache.spark:spark-streaming_2.10:1.5.0'

    Skeleton

    SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
    // Batch
    JavaSparkContext sparkContext = new JavaSparkContext(conf);
    // Stream
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Processing logic

    jssc.start(); // For streaming

  • First Impressions

    Practically no boilerplate
    Easy to get started and play around
    Runs in the IDE
    Hadoop MapReduce is much harder to get into
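
    To illustrate the point about running in the IDE: both engines can execute a job inside a
    single local JVM without any cluster. A minimal sketch, assuming the Flink 1.0 and Spark 1.5
    APIs used above; the app name is a placeholder.

    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Flink: a local environment runs the job in the current JVM.
    ExecutionEnvironment flinkEnv = ExecutionEnvironment.createLocalEnvironment();

    // Spark: "local[*]" uses all local cores as the master, no cluster required.
    SparkConf conf = new SparkConf().setAppName("ide-playground").setMaster("local[*]");
    JavaSparkContext sparkContext = new JavaSparkContext(conf);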

  • Round 2: Static Data Analysis

    Combine both static data parts

    Read the CSV file and transform it

    JavaRDD<String> segmentFile = sparkContext.textFile("s3://...");
    JavaPairRDD<String, SegmentTableRecord> segmentTable = segmentFile
        .map(line -> line.split(","))
        .filter(array -> array.length == 3)
        .mapToPair((String[] parts) -> {
            int id = Integer.parseInt(parts[0]);
            String name = parts[1], segment = parts[2];
            return new Tuple2<>(name, new SegmentTableRecord(id, name, segment));
        });

    Join with detail data, filter for humans, and write the output

    segmentTable.join(detailTable)
        .mapValues(tuple -> {
            SegmentTableRecord s = tuple._1();
            DetailTableRecord d = tuple._2();
            return new Hero(s.getId(), s.getName(), s.getSegment(),
                            d.getGender(), d.getBirthYear(), d.getNoOfAppearances());
        })
        .map(tuple -> tuple._2())
        .filter(hero -> hero.getSegment().equals(HUMAN_SEGMENT))
        .saveAsTextFile("s3://...");

  • Loading Files from S3 into POJOs

    DataSource<SegmentTableRecord> segmentTable = env.readCsvFile("s3://...")
        .ignoreInvalidLines()
        .pojoType(SegmentTableRecord.class, "id", "name", "segment");

    Join and Filter

    DataSet<Hero> humanHeros = segmentTable.join(detailTable)
        .where("name")
        .equalTo("name")
        .with((s, d) -> new Hero(s.id, s.name, s.segment,
                                 d.gender, d.birthYear, d.noOfAppearances))
        .filter(hero -> hero.segment.equals("Human"));

    Write back to S3

    humanHeros.writeAsFormattedText(outputTablePath, WriteMode.OVERWRITE, h -> h.toCsv());

  • Performance

    Terasort 1: Flink ca. 66% of Spark's runtime
    Terasort 2: Flink ca. 68% of Spark's runtime
    HashJoin: Flink ca. 32% of Spark's runtime
    (Iterative processes: Flink ca. 50% of Spark's runtime, ca. 7% with delta iterations)

  • 2nd Round Points

    Generally similar abstraction and feature set
    Flink has a nicer syntax, more sugar
    Spark is pretty bare-metal
    Flink is faster

  • Round 3: Simple Real-Time Analysis

    Total Hitpoints over Last Minute

  • Configuring Environment for EventTime

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    ExecutionConfig config = env.getConfig();
    config.setAutoWatermarkInterval(500);

    Creating Stream from Kafka

    Properties properties = new Properties();
    properties.put("bootstrap.servers", KAFKA_BROKERS);
    properties.put("zookeeper.connect", ZOOKEEPER_CONNECTION);
    properties.put("group.id", KAFKA_GROUP_ID);

    DataStreamSource<FightEvent> hitStream = env.addSource(
        new FlinkKafkaConsumer08<>("FightEventTopic", new FightEventDeserializer(), properties));

  • Processing Logic

    hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
        .timeWindowAll(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
        .apply(new SumAllWindowFunction() {
            @Override
            public long getSummand(FightEvent fightEvent) {
                return fightEvent.getHitPoints();
            }
        })
        .writeAsCsv("s3://...");

    Example output:

    3> (1448130670000,1448130730000,290789)
    4> (1448130680000,1448130740000,289395)
    5> (1448130690000,1448130750000,291768)
    6> (1448130700000,1448130760000,292634)
    7> (1448130710000,1448130770000,293869)
    8> (1448130720000,1448130780000,293356)
    1> (1448130730000,1448130790000,293054)
    2> (1448130740000,1448130800000,294209)

  • Create Context and get Avro Stream from Kafka

    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    HashSet<String> topicsSet = Sets.newHashSet("FightEventTopic");

    HashMap<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "xxx:11211");
    kafkaParams.put("group.id", "spark");

    JavaPairInputDStream<String, FightEvent> kafkaStream = KafkaUtils.createDirectStream(
        jssc, String.class, FightEvent.class, StringDecoder.class, AvroDecoder.class,
        kafkaParams, topicsSet);

    Analyze number of hit points over a sliding window

    kafkaStream.map(tuple -> tuple._2().getHitPoints())
        .reduceByWindow((hit1, hit2) -> hit1 + hit2,
                        Durations.seconds(60), Durations.seconds(10))
        .foreachRDD((rdd, time) -> {
            rdd.saveAsTextFile(outputPath + "/round1-" + time.milliseconds());
            LOGGER.info("Hitpoints in the last minute {}", rdd.take(5));
            return null;
        });

  • Output

    20:19:32 Hitpoints in the last minute [80802]
    20:19:42 Hitpoints in the last minute [101019]
    20:19:52 Hitpoints in the last minute [141012]
    20:20:02 Hitpoints in the last minute [184759]
    20:20:12 Hitpoints in the last minute [215802]

  • 3rd Round Points

    Flink supports event-time windows
    Kafka and Avro worked seamlessly in both
    Spark uses micro-batches, not a real stream
    Both have at-least-once delivery guarantees
    Exactly-once depends a lot on the sink/source
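
    Because exactly-once state handling hinges on checkpointing plus replayable sources and sinks,
    the lines below show how checkpointing is switched on in each engine. This is a minimal sketch,
    assuming the Flink 1.0 and Spark 1.5 APIs and the env/jssc objects created above; the interval
    and the checkpoint directory are placeholder values.

    // Flink: periodic checkpoints (interval in ms); combined with a replayable source
    // such as Kafka this gives exactly-once guarantees for operator state.
    env.enableCheckpointing(5000);

    // Spark Streaming: checkpoint metadata and generated RDDs to a (placeholder) directory.
    jssc.checkpoint("s3://.../checkpoints");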

  • Round 4: Connecting Static Data with Real-Time Data

    Total Hitpoints over Last Minute per Gender

  • Read static data using objectFile and map genders

    JavaRDD staticRdd = jssc.sparkContext().objectFile(lookupPath);
    JavaPairRDD genderLookup = staticRdd.mapToPair(user -> {
        int genderIndicator = user.getGender();
        String gender;
        switch (genderIndicator) {
            case 1:  gender = "MALE";   break;
            case 2:  gender = "FEMALE"; break;
            default: gender = "OTHER";  break;
        }
        return new Tuple2(user.getId(), gender);
    });

    Analyze number of hit points per hitter over a sliding window

    JavaPairDStream hitpointWindowedStream = kafkaStream
        .mapToPair(tuple -> {
            FightEvent fight = tuple._2();
            return new Tuple2(fight.getHitterId(), fight.getHitPoints());
        })
        .reduceByKeyAndWindow((hit1, hit2) -> hit1 + hit2,
                              Durations.seconds(60), Durations.seconds(10));

  • Join with static data to find gender for each hitter

    hitpointWindowedStream.foreachRDD((rdd, time) -> {
        JavaPairRDD hpg = rdd.leftOuterJoin(genderLookup)
            .mapToPair(joinedTuple -> {
                Optional maybeGender = joinedTuple._2()._2();
                Long hitpoints = joinedTuple._2()._1();
                return new Tuple2(maybeGender.or("UNKNOWN"), hitpoints);
            })
            .reduceByKey((hit1, hit2) -> hit1 + hit2);
        hpg.saveAsTextFile(outputPath + "/round2-" + time.milliseconds());
        LOGGER.info("Hitpoints per gender {}", hpg.take(5));
        return null;
    });

    Output:

    20:30:44 Hitpoints [(FEMALE,35869), (OTHER,435), (MALE,66226)]
    20:30:54 Hitpoints [(FEMALE,48805), (OTHER,644), (MALE,87014)]
    20:31:04 Hitpoints [(FEMALE,55332), (OTHER,813), (MALE,99722)]
    20:31:14 Hitpoints [(FEMALE,65543), (OTHER,813), (MALE,116416)]
    20:31:24 Hitpoints [(FEMALE,67507), (OTHER,813), (MALE,123750)]

  • Loading Static Data in Every Map

    public FightEventEnricher(String bucket, String keyPrefix) {
        this.bucket = bucket;
        this.keyPrefix = keyPrefix;
    }

    @Override
    public void open(Configuration parameters) {
        populateHeroMapFromS3(bucket, keyPrefix);
    }

    @Override
    public EnrichedFightEvent map(FightEvent event) throws Exception {
        return new EnrichedFightEvent(event,
                                      idToHero.get(event.getHitterId()),
                                      idToHero.get(event.getHitteeId()));
    }

    private void populateHeroMapFromS3(String bucket, String keyPrefix) {
        // Omitted
    }
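
    The slide shows only the methods of FightEventEnricher; the enclosing class declaration is not
    shown. A plausible sketch, assuming it extends Flink's RichMapFunction and keeps the heroes
    loaded in open() in a map field:

    // Hedged sketch of the surrounding class (not on the slide). RichMapFunction's open()
    // runs once per parallel task instance, so the lookup map is filled before the first
    // FightEvent is mapped.
    public class FightEventEnricher extends RichMapFunction<FightEvent, EnrichedFightEvent> {

        private String bucket;
        private String keyPrefix;
        private final Map<Integer, Hero> idToHero = new HashMap<>();

        // constructor, open(), map() and populateHeroMapFromS3() as shown above
    }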

  • Processing Logic

    hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
        .map(new FightEventEnricher("s3_bucket", "output/heros"))
        .filter(value -> value.getHittingHero() != null)
        .keyBy(enrichedFightEvent -> enrichedFightEvent.getHittingHero().getGender())
        .timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
        .apply(new SumWindowFunction() {
            @Override
            public long getSummand(EnrichedFightEvent value) {
                return value.getFightEvent().getHitPoints();
            }
        })

    Example output:

    2> (1448191350000,1448191410000,1,28478)
    3> (1448191350000,1448191410000,2,264650)
    2> (1448191360000,1448191420000,1,28290)
    3> (1448191360000,1448191420000,2,263521)
    2> (1448191370000,1448191430000,1,29327)
    3> (1448191370000,1448191430000,2,265526)

  • 4th Round Points

    Spark makes combining batch and stream easier
    Windowing by key works well in both
    The Java API of Spark can be annoying

  • Round 5: More Advanced Real-Time Analysis

    Best Hitter over Last Minute per Gender

  • Processing Logic

    hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
        .map(new FightEventEnricher("s3_bucket", "output/heros"))
        .filter(value -> value.getHittingHero() != null)
        .keyBy(fightEvent -> fightEvent.getHittingHero().getName())
        .timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
        .apply(new SumWindowFunction() {
            @Override
            public long getSummand(EnrichedFightEvent value) {
                return value.getFightEvent().getHitPoints();
            }
        })
        .assignTimestamps(new AscendingTimestampExtractor() {
            @Override
            public long extractAscendingTimestamp(Tuple4

  • Example Output

    1> (1448200070000,1448200130000,Tengu,546)
    2> (1448200080000,1448200140000,Louis XIV,621)
    3> (1448200090000,1448200150000,Louis XIV,561)
    4> (1448200100000,1448200160000,Louis XIV,552)
    5> (1448200110000,1448200170000,Phil Dexter,620)
    6> (1448200120000,1448200180000,Phil Dexter,552)
    7> (1448200130000,1448200190000,Kalamity,648)
    8> (1448200140000,1448200200000,Jakita Wagner,656)
    1> (1448200150000,1448200210000,Jakita Wagner,703)

  • Read static data using objectFile and map names

    JavaRDD staticRdd = jssc.sparkContext().objectFile(lookupPath);
    JavaPairRDD userNameLookup = staticRdd
        .mapToPair(user -> new Tuple2(user.getId(), user.getName()));

    Analyze number of hit points per hitter over a sliding window

    JavaPairDStream hitters = kafkaStream
        .mapToPair(kafkaTuple -> new Tuple2(kafkaTuple._2().getHitterId(),
                                            kafkaTuple._2().getHitPoints()))
        .reduceByKeyAndWindow((accum, current) -> accum + current,
                              (accum, remove) -> accum - remove,
                              Durations.seconds(60), Durations.seconds(10));

  • Join with static data to find username for each hitter

    hitters.foreachRDD((rdd, time) -> {
        JavaRDD namedHitters = rdd
            .leftOuterJoin(userNameLookup)
            .map(joinedTuple -> {
                String username = joinedTuple._2()._2().or("No name");
                Long hitpoints = joinedTuple._2()._1();
                return new Tuple2(username, hitpoints);
            })
            .sortBy(Tuple2::_2, false, PARTITIONS);
        namedHitters.saveAsTextFile(outputPath + "/round3-" + time);
        LOGGER.info("Five highest hitters (total: {}){}",
                    namedHitters.count(), namedHitters.take(5));
        return null;
    });

    Output

    15/11/25 20:34:23 Five highest hitters (total: 200) [(Nick Fury,691), (Lady Blackhawk,585), (Choocho Colon,585), (Purple Man,539),
    15/11/25 20:34:33 Five highest hitters (total: 378) [(Captain Dorja,826), (Choocho Colon,773), (Nick Fury,691), (Kari Limbo,646),
    15/11/25 20:34:43 Five highest hitters (total: 378) [(Captain Dorja,1154), (Choocho Colon,867), (Wendy Go,723), (Kari Limbo,699),
    15/11/25 20:34:53 Five highest hitters (total: 558) [(Captain Dorja,1154), (Wendy Go,931), (Choocho Colon,867), (Fyodor Dostoyevsky,

  • Performance: Yahoo Streaming Benchmark

  • 5th Round Points

    Spark makes some things easier
    But Flink is real streaming
    In Spark you often have to specify partitions
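
    To make the point about partitions concrete: many Spark shuffle operations take an explicit
    partition count (the round-5 job passes PARTITIONS to sortBy). A minimal sketch, assuming the
    Spark 1.5 Java API; someRdd, somePairRdd and the value of PARTITIONS are placeholders.

    int PARTITIONS = 8; // placeholder value

    // Explicitly rebalance an RDD across a fixed number of partitions.
    JavaRDD<String> balanced = someRdd.repartition(PARTITIONS);

    // reduceByKey with an explicit partition count for the shuffle.
    JavaPairRDD<String, Long> sums = somePairRdd.reduceByKey((a, b) -> a + b, PARTITIONS);

    Flink, by contrast, takes its parallelism from the environment or a per-operator
    setParallelism() setting, so per-operation partition arguments are rarely needed.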

  • The Judges' Call

  • Development

    Compared to Hadoop, both are awesome
    Both provide a unified programming model for diverse scenarios
    The comfort level of the abstraction varies with the use case
    Spark's Java API is cumbersome compared to the Scala API
    Working with both is fun
    Docs are OK, but spotty

  • Testing

    Testing distributed systems will always be hard
    Functionally, both can be tested nicely
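
    One way this plays out in practice: the processing logic can be run against a local engine
    inside a plain unit test. The following is a minimal sketch, assuming JUnit and the Spark 1.5
    Java API used elsewhere; the class name, test data, and the reduction under test are
    illustrative placeholders rather than the talk's actual tests.

    import static org.junit.Assert.assertEquals;

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.junit.Test;

    public class HitPointSumTest {

        @Test
        public void sumsHitPointsLocally() {
            // Local master: the whole test runs inside this JVM, no cluster needed.
            JavaSparkContext sc = new JavaSparkContext("local[2]", "unit-test");
            try {
                JavaRDD<Long> hitPoints = sc.parallelize(Arrays.asList(10L, 20L, 30L));

                // In a real test this would invoke the production transformation instead.
                long total = hitPoints.reduce((a, b) -> a + b);

                assertEquals(60L, total);
            } finally {
                sc.stop(); // always release the local context
            }
        }
    }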

  • Monitoring

  • Community

  • The Judges' Call: It depends...

  • Use Spark, if

    You have Cloudera, Hortonworks, etc. support and depend on it
    You want to heavily use the Graph and ML libraries
    You want to use the more mature project

  • Use Flink, if

    Real-time processing is important for your use case
    You want more complex window operations
    You develop in Java only
    You want to support a German project

  • Benchmark References

    [1] http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/
    [2] http://eastcirclek.blogspot.de/2015/06/terasort-for-spark-and-flink-with-range.html
    [3] http://eastcirclek.blogspot.de/2015/07/hash-join-on-tez-spark-and-flink.html
    [4] https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
    [5] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

  • Thank You! Questions?

    [email protected] [email protected]

    http://www.tngtech.com/