21.04.2016 Meetup: Spark vs. Flink


  • Spark vs. Flink: Rumble in the (Big Data) Jungle

    München, 2016-04-20

    Konstantin Knauf, Michael Pisula


  • Background

  • The Big Data Ecosystem: Apache Top-Level Projects over Time

    (timeline graphic spanning 2008, 2010, 2013, 2014, 2015)

  • The New Guard

                            Spark                           Flink
    Origin                  Berkeley University             TU Berlin
    Apache Incubator        2013                            04/2014
    Apache Top-Level        02/2014                         12/2014
    Company                 databricks                      data Artisans
    Supported languages     Scala, Java, Python, R          Java, Scala, Python
    Implemented in          Scala                           Java
    Cluster                 Stand-Alone, Mesos, EC2, YARN   Stand-Alone, Mesos, EC2, YARN
    Teaser                  Lightning-fast cluster          Scalable batch and stream
                            computing                       data processing

  • The Challenge

  • Real-Time Analysis of a Superhero Fight Club


    (Data model from the diagram: a continuous stream of fight events plus two
    static tables, joined into one hero record)

    FightEvent           hitter: Int, hittee: Int, hitpoints: Int   (streamed continuously)
    SegmentTableRecord   id: Int, name: String, segment: String
    DetailTableRecord    name: String, gender: Int, birthYear: Int, noOfAppearances: Int
    Hero (joined)        id: Int, name: String, segment: String, gender: Int, birthYear: Int, noOfAppearances: Int
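
    The FightEvent record never appears as a class definition in the slides; what
    follows is a minimal Java sketch of what it presumably looks like, inferred from
    the getHitterId()/getHitPoints() accessors used later. In the original setup the
    class is generated from an Avro schema, and the getTime() event-time field is an
    assumption of ours (it is needed for the event-time windows in round 3):

    public class FightEvent {
        private int hitter;      // id of the attacking hero
        private int hittee;      // id of the hero taking the hit
        private int hitpoints;   // damage dealt by this hit
        private long timestamp;  // assumed event-time field, not shown in the diagram

        public int getHitterId()  { return hitter; }
        public int getHitteeId()  { return hittee; }
        public long getHitPoints() { return hitpoints; }  // widened; later code sums these as long
        public long getTime()      { return timestamp; }
    }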



  • The Setup

    (Architecture sketch: a data generator and the static segment/detail data feed
    an AWS cluster, which is used for stream processing, batch processing, and
    combining stream and batch)



  • Round 1: Setting Up

  • Dependencies

    compile "org.apache.flink:flink-java:1.0.0"compile "org.apache.flink:flink-streaming-java_2.11:1.0.0"//For Local Execution from IDEcompile "org.apache.flink:flink-clients_2.11:1.0.0"


    //Batch (DataSetAPI)ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();//Stream (DataStream API)StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    //Processing Logic

    //For Streamingenv.execute()
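
    Put together, such a job runs straight from the IDE; a minimal, self-contained
    sketch (the class name and the trivial sample pipeline are ours, just to show
    the round trip):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class MinimalStreamingJob {
        public static void main(String[] args) throws Exception {
            // Local environment when started from the IDE, cluster environment otherwise
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Processing logic: trivial placeholder pipeline
            env.fromElements(1, 2, 3).print();
            // Streaming jobs only start running on execute()
            env.execute("minimal-job");
        }
    }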

  • Dependencies

    compile 'org.apache.spark:spark-core_2.10:1.5.0'
    compile 'org.apache.spark:spark-streaming_2.10:1.5.0'

    Skeleton

    SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
    // Batch
    JavaSparkContext sparkContext = new JavaSparkContext(conf);
    // Stream
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Processing logic

    jssc.start(); // For streaming
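
    The same round trip on the Spark side, again as a self-contained sketch of our
    own (local master and sample data are assumptions, not from the slides):

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MinimalBatchJob {
        public static void main(String[] args) {
            // local[2] runs the job inside the IDE with two worker threads
            SparkConf conf = new SparkConf().setAppName("minimal-job").setMaster("local[2]");
            JavaSparkContext sparkContext = new JavaSparkContext(conf);
            // Processing logic: parallelize a local collection and sum it
            int sum = sparkContext.parallelize(Arrays.asList(1, 2, 3, 4))
                                  .reduce((a, b) -> a + b);
            System.out.println("sum = " + sum);
            sparkContext.stop();
        }
    }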

  • First Impressions

    Practically no boilerplate
    Easy to get started and play around
    Runs in the IDE
    Hadoop MapReduce is much harder to get into

  • Round 2: Static Data Analysis

    Combine both static data parts

  • Read the CSV file and transform it

    JavaRDD<String> segmentFile = sparkContext.textFile("s3://...");
    JavaPairRDD<String, SegmentTableRecord> segmentTable = segmentFile
            .map(line -> line.split(","))
            .filter(array -> array.length == 3)
            .mapToPair((String[] parts) -> {
                int id = Integer.parseInt(parts[0]);
                String name = parts[1], segment = parts[2];
                return new Tuple2<>(name, new SegmentTableRecord(id, name, segment));
            });

    Join with the detail data, keep only the human heroes, and write the output

    segmentTable.join(detailTable)
            .mapValues(tuple -> {
                SegmentTableRecord s = tuple._1();
                DetailTableRecord d = tuple._2();
                return new Hero(s.getId(), s.getName(), s.getSegment(),
                        d.getGender(), d.getBirthYear(), d.getNoOfAppearances());
            })
            .map(tuple -> tuple._2())
            .filter(hero -> hero.getSegment().equals(HUMAN_SEGMENT))
            .saveAsTextFile("s3://...");

  • Loading Files from S3 into POJO

    DataSource<SegmentTableRecord> segmentTable = env.readCsvFile("s3://...")
            .ignoreInvalidLines()
            .pojoType(SegmentTableRecord.class, "id", "name", "segment");
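
    pojoType only works if SegmentTableRecord follows Flink's POJO rules (a public
    no-argument constructor plus public fields or getters/setters). A sketch of what
    the class presumably looks like; public fields are an assumption, suggested by
    the join below accessing s.id, s.name, s.segment directly:

    public class SegmentTableRecord {
        public int id;
        public String name;
        public String segment;

        // Flink's POJO rules require a public no-argument constructor
        public SegmentTableRecord() {}

        public SegmentTableRecord(int id, String name, String segment) {
            this.id = id;
            this.name = name;
            this.segment = segment;
        }
    }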

    Join and Filter

    DataSet<Hero> humanHeros = segmentTable.join(detailTable)
            .where("name")
            .equalTo("name")
            .with((s, d) -> new Hero(s.id, s.name, s.segment,
                    d.gender, d.birthYear, d.noOfAppearances))
            .filter(hero -> hero.segment.equals("Human"));

    Write back to S3

    humanHeros.writeAsFormattedText(outputTablePath, WriteMode.OVERWRITE, h -> h.toCsv());

  • Performance

    TeraSort 1: Flink ca. 66% of Spark's runtime
    TeraSort 2: Flink ca. 68% of Spark's runtime
    HashJoin: Flink ca. 32% of Spark's runtime
    (Iterative processes: Flink ca. 50% of Spark's runtime, ca. 7% with delta iterations)

  • 2nd Round Points

    Generally similar abstraction and feature set
    Flink has a nicer syntax, more sugar
    Spark is pretty bare-metal
    Flink is faster

  • Round 3: Simple Real-Time Analysis

    Total Hitpoints over Last Minute

  • Configuring Environment for EventTime

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    ExecutionConfig config = env.getConfig();
    config.setAutoWatermarkInterval(500);
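
    The FightEventTimestampExtractor handed to assignTimestamps further down is not
    shown on the slides. A minimal sketch, assuming Flink 1.0's (since deprecated)
    TimestampExtractor interface and the assumed getTime() accessor on FightEvent;
    the 6000 ms delay matches the constructor argument used later:

    public class FightEventTimestampExtractor implements TimestampExtractor<FightEvent> {

        private final long maxDelay;                   // tolerated out-of-orderness in ms
        private long latestTimestamp = Long.MIN_VALUE;

        public FightEventTimestampExtractor(long maxDelay) {
            this.maxDelay = maxDelay;
        }

        @Override
        public long extractTimestamp(FightEvent element, long currentTimestamp) {
            latestTimestamp = Math.max(latestTimestamp, element.getTime());
            return element.getTime();
        }

        @Override
        public long extractWatermark(FightEvent element, long currentTimestamp) {
            return Long.MIN_VALUE;                     // no per-element watermarks
        }

        @Override
        public long getCurrentWatermark() {
            // Called every 500 ms (the auto-watermark interval configured above):
            // everything older than the latest seen event time minus the tolerated
            // delay is declared complete
            return latestTimestamp - maxDelay;
        }
    }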

    Creating Stream from Kafka

    Properties properties = new Properties();
    properties.put("bootstrap.servers", KAFKA_BROKERS);
    properties.put("zookeeper.connect", ZOOKEEPER_CONNECTION);
    properties.put("group.id", KAFKA_GROUP_ID);

    DataStreamSource<FightEvent> hitStream = env.addSource(
            new FlinkKafkaConsumer08<>("FightEventTopic", new FightEventDeserializer(), properties));

  • Processing Logic

    hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
            .timeWindowAll(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
            .apply(new SumAllWindowFunction<FightEvent>() {
                @Override
                public long getSummand(FightEvent fightEvent) {
                    return fightEvent.getHitPoints();
                }
            })
            .writeAsCsv("s3://...");
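
    SumAllWindowFunction is a small helper of the presenters, not a Flink class. A
    sketch of what it plausibly does, assuming Flink 1.0's AllWindowFunction: sum
    getSummand() over all events in the window and emit (windowStart, windowEnd,
    sum), which matches the tuple format of the example output below:

    public abstract class SumAllWindowFunction<T>
            implements AllWindowFunction<T, Tuple3<Long, Long, Long>, TimeWindow> {

        public abstract long getSummand(T event);

        @Override
        public void apply(TimeWindow window, Iterable<T> events,
                          Collector<Tuple3<Long, Long, Long>> out) {
            long sum = 0;
            for (T event : events) {
                sum += getSummand(event);
            }
            out.collect(new Tuple3<>(window.getStart(), window.getEnd(), sum));
        }
    }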

    Example Output

    3> (1448130670000,1448130730000,290789)
    4> (1448130680000,1448130740000,289395)
    5> (1448130690000,1448130750000,291768)
    6> (1448130700000,1448130760000,292634)
    7> (1448130710000,1448130770000,293869)
    8> (1448130720000,1448130780000,293356)
    1> (1448130730000,1448130790000,293054)
    2> (1448130740000,1448130800000,294209)

  • Create Context and get Avro Stream from Kafka

    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    HashSet<String> topicsSet = Sets.newHashSet("FightEventTopic");

    HashMap<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "xxx:11211");
    kafkaParams.put("group.id", "spark");

    JavaPairInputDStream<String, FightEvent> kafkaStream = KafkaUtils.createDirectStream(
            jssc, String.class, FightEvent.class, StringDecoder.class, AvroDecoder.class,
            kafkaParams, topicsSet);
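
    AvroDecoder is also the presenters' own glue code. A sketch under the assumption
    that it implements Kafka 0.8's kafka.serializer.Decoder using an Avro
    SpecificDatumReader; the one-argument constructor is what Kafka's reflective
    instantiation expects, and the real class may well differ:

    import java.io.IOException;
    import kafka.serializer.Decoder;
    import kafka.utils.VerifiableProperties;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.specific.SpecificDatumReader;

    public class AvroDecoder implements Decoder<FightEvent> {

        private final SpecificDatumReader<FightEvent> reader =
                new SpecificDatumReader<>(FightEvent.class);

        public AvroDecoder(VerifiableProperties props) {
            // no configuration needed for this decoder
        }

        @Override
        public FightEvent fromBytes(byte[] bytes) {
            try {
                // Decode one Avro-serialized FightEvent from the raw Kafka message
                BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
                return reader.read(null, decoder);
            } catch (IOException e) {
                throw new RuntimeException("Could not deserialize FightEvent", e);
            }
        }
    }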

    Analyze number of hit points over a sliding window

    kafkaStream.map(tuple -> tuple._2().getHitPoints())
            .reduceByWindow((hit1, hit2) -> hit1 + hit2,
                    Durations.seconds(60), Durations.seconds(10))
            .foreachRDD((rdd, time) -> {
                rdd.saveAsTextFile(outputPath + "/round1-" + time.milliseconds());
                LOGGER.info("Hitpoints in the last minute {}", rdd.take(5));
                return null;
            });

  • Output

    20:19:32 Hitpoints in the last minute [80802]
    20:19:42 Hitpoints in the last minute [101019]
    20:19:52 Hitpoints in the last minute [141012]
    20:20:02 Hitpoints in the last minute [184759]
    20:20:12 Hitpoints in the last minute [215802]

  • 3rd Round Points

    Flink supports event-time windows
    Kafka and Avro worked seamlessly in both
    Spark uses micro-batches, no real streaming
    Both have at-least-once delivery guarantees
    Exactly-once depends a lot on sink/source

  • Round 4: Connecting Static Data with Real-Time Data

    Total Hitpoints over Last Minute per Gender

  • Read static data using objectFile and map genders

    JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
    JavaPairRDD<Integer, String> genderLookup = staticRdd.mapToPair(user -> {
        int genderIndicator = user.getGender();
        String gender;
        switch (genderIndicator) {
            case 1:  gender = "MALE";   break;
            case 2:  gender = "FEMALE"; break;
            default: gender = "OTHER";  break;
        }
        return new Tuple2<>(user.getId(), gender);
    });

    Analyze number of hit points per hitter over a sliding window

    JavaPairDStream<Integer, Long> hitpointWindowedStream = kafkaStream
            .mapToPair(tuple -> {
                FightEvent fight = tuple._2();
                return new Tuple2<>(fight.getHitterId(), fight.getHitPoints());
            })
            .reduceByKeyAndWindow((hit1, hit2) -> hit1 + hit2,
                    Durations.seconds(60), Durations.seconds(10));

  • Join with static data to find gender for each hitter

    hitpointWindowedStream.foreachRDD((rdd, time) -> {
        JavaPairRDD<String, Long> hpg = rdd.leftOuterJoin(genderLookup)
                .mapToPair(joinedTuple -> {
                    Optional<String> maybeGender = joinedTuple._2()._2();
                    Long hitpoints = joinedTuple._2()._1();
                    return new Tuple2<>(maybeGender.or("UNKNOWN"), hitpoints);
                })
                .reduceByKey((hit1, hit2) -> hit1 + hit2);
        hpg.saveAsTextFile(outputPath + "/round2-" + time.milliseconds());
        LOGGER.info("Hitpoints per gender {}", hpg.take(5));
        return null;
    });

    Output

    20:30:44 Hitpoints [(FEMALE,35869), (OTHER,435), (MALE,66226)]
    20:30:54 Hitpoints [(FEMALE,48805), (OTHER,644), (MALE,87014)]
    20:31:04 Hitpoints [(FEMALE,55332), (OTHER,813), (MALE,99722)]
    20:31:14 Hitp