21.04.2016 Meetup: Spark vs. Flink

Page 1: 21.04.2016 Meetup: Spark vs. Flink

Spark vs Flink: Rumble in the (Big Data) Jungle

München, 2016-04-20

Konstantin Knauf Michael Pisula

Page 2: 21.04.2016 Meetup: Spark vs. Flink

Background

Page 3: 21.04.2016 Meetup: Spark vs. Flink

The Big Data Ecosystem: Apache Top-Level Projects over Time

(Timeline axis: 2008, 2010, 2013, 2014, 2015)

Page 4: 21.04.2016 Meetup: Spark vs. Flink

The New Guard

Page 5: 21.04.2016 Meetup: Spark vs. Flink

                      Spark                            Flink
Origin                Berkeley University              TU Berlin
Apache Incubator      2013                             04/2014
Apache Top-Level      02/2014                          01/2015
Company               databricks                       data Artisans
Supported languages   Scala, Java, Python, R           Java, Scala, Python
Implemented in        Scala                            Java
Cluster               Stand-Alone, Mesos, EC2, YARN    Stand-Alone, Mesos, EC2, YARN
Teaser                "Lightning-fast cluster          "Scalable Batch and Stream
                       computing"                       Data Processing"

Page 6: 21.04.2016 Meetup: Spark vs. Flink
Page 7: 21.04.2016 Meetup: Spark vs. Flink
Page 8: 21.04.2016 Meetup: Spark vs. Flink

The Challenge

Page 9: 21.04.2016 Meetup: Spark vs. Flink

Real-Time Analysis of a Superhero Fight Club

Stream data:

Fight
  hitter: Int
  hittee: Int
  hitpoints: Int

Batch (static) data:

Segment
  id: Int
  name: String
  segment: String

Detail
  name: String
  gender: Int
  birthYear: Int
  noOfAppearances: Int

Hero (Segment joined with Detail)
  id: Int
  name: String
  segment: String
  gender: Int
  birthYear: Int
  noOfAppearances: Int

Page 10: 21.04.2016 Meetup: Spark vs. Flink

The Setup

(Architecture diagram: a Data Generator produces the Segment and Detail data plus Avro-encoded Fight events into a Kafka cluster; batch processing turns the static data into Heroes; stream processing consumes the Kafka stream and combines it with the batch output; everything runs on an AWS cluster.)

Page 11: 21.04.2016 Meetup: Spark vs. Flink

Round 1: Setting up

Page 12: 21.04.2016 Meetup: Spark vs. Flink

Dependencies

compile "org.apache.flink:flink-java:1.0.0"compile "org.apache.flink:flink-streaming-java_2.11:1.0.0"//For Local Execution from IDEcompile "org.apache.flink:flink-clients_2.11:1.0.0"

Skeleton

// Batch (DataSet API)
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Stream (DataStream API)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Processing logic

// For streaming
env.execute();

Page 13: 21.04.2016 Meetup: Spark vs. Flink

Dependencies

compile 'org.apache.spark:spark-core_2.10:1.5.0'
compile 'org.apache.spark:spark-streaming_2.10:1.5.0'

Skeleton Batch

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
// Batch
JavaSparkContext sparkContext = new JavaSparkContext(conf);
// Stream
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

// Processing logic

jssc.start(); // For streaming

Page 14: 21.04.2016 Meetup: Spark vs. Flink

First Impressions
- Practically no boilerplate
- Easy to get started and play around
- Runs in the IDE (see the sketch below)
- Hadoop MapReduce is much harder to get into
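To give a feel for how little boilerplate is involved, here is a minimal, self-contained Flink job that runs straight from the IDE. This is not from the talk; the class name and data are made up for illustration.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BoilerplateDemo {
    public static void main(String[] args) throws Exception {
        // Local execution environment; no cluster or configuration files needed
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Integer> numbers = env.fromElements(1, 2, 3, 4, 5);

        // Square each element and print the result to the console
        numbers.map(new MapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer n) {
                return n * n;
            }
        }).print();
    }
}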

Page 15: 21.04.2016 Meetup: Spark vs. Flink

Round 2: Static Data Analysis

Combine both static data parts

Page 16: 21.04.2016 Meetup: Spark vs. Flink

Read the CSV file and transform it

JavaRDD<String> segmentFile = sparkContext.textFile("s3://...");
JavaPairRDD<String, SegmentTableRecord> segmentTable = segmentFile
        .map(line -> line.split(","))
        .filter(array -> array.length == 3)
        .mapToPair((String[] parts) -> {
            int id = Integer.parseInt(parts[0]);
            String name = parts[1], segment = parts[2];
            // Keyed by name, since the join with the detail table is on the hero name
            return new Tuple2<>(name, new SegmentTableRecord(id, name, segment));
        });

Join with detail data, filter out humans and write output

segmentTable.join(detailTable)
        .mapValues(tuple -> {
            SegmentTableRecord s = tuple._1();
            DetailTableRecord d = tuple._2();
            return new Hero(s.getId(), s.getName(), s.getSegment(),
                    d.getGender(), d.getBirthYear(), d.getNoOfAppearances());
        })
        .map(tuple -> tuple._2())
        .filter(hero -> hero.getSegment().equals(HUMAN_SEGMENT))
        .saveAsTextFile("s3://...");
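The detailTable used in the join above is not shown on the slide. A sketch of how it could be built from the detail CSV, assuming four comma-separated columns (name, gender, birthYear, noOfAppearances) and a hypothetical DetailTableRecord class with a matching constructor:

JavaRDD<String> detailFile = sparkContext.textFile("s3://...");
JavaPairRDD<String, DetailTableRecord> detailTable = detailFile
        .map(line -> line.split(","))
        .filter(parts -> parts.length == 4)
        .mapToPair((String[] parts) -> {
            String name = parts[0];
            int gender = Integer.parseInt(parts[1]);
            int birthYear = Integer.parseInt(parts[2]);
            int noOfAppearances = Integer.parseInt(parts[3]);
            // Keyed by name so it lines up with segmentTable for the join
            return new Tuple2<>(name,
                    new DetailTableRecord(name, gender, birthYear, noOfAppearances));
        });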

Page 17: 21.04.2016 Meetup: Spark vs. Flink

Loading Files from S3 into POJO

DataSource<SegmentTableRecord> segmentTable = env.readCsvFile("s3://...")
        .ignoreInvalidLines()
        .pojoType(SegmentTableRecord.class, "id", "name", "segment");
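The detail table is not shown on the slide; it can presumably be loaded the same way. A sketch, assuming a DetailTableRecord POJO with the fields from the data model:

DataSource<DetailTableRecord> detailTable = env.readCsvFile("s3://...")
        .ignoreInvalidLines()
        .pojoType(DetailTableRecord.class, "name", "gender", "birthYear", "noOfAppearances");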

Join and Filter

DataSet<Hero> humanHeros = segmentTable.join(detailTable)
        .where("name")
        .equalTo("name")
        .with((s, d) -> new Hero(s.id, s.name, s.segment,
                d.gender, d.birthYear, d.noOfAppearances))
        .filter(hero -> hero.segment.equals("Human"));

Write back to S3

humanHeros.writeAsFormattedText(outputTablePath, WriteMode.OVERWRITE, h -> h.toCsv());
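Unlike print(), DataSet sinks such as writeAsFormattedText are lazy, so the batch job only runs once the environment is executed (the job name below is made up):

// The sink above only registers work; the batch job runs when execute() is called
env.execute("Hero batch preparation");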

Page 18: 21.04.2016 Meetup: Spark vs. Flink

Performance
- TeraSort 1: Flink approx. 66% of Spark's runtime
- TeraSort 2: Flink approx. 68% of Spark's runtime
- HashJoin: Flink approx. 32% of Spark's runtime
- (Iterative processes: Flink approx. 50% of Spark's runtime, approx. 7% with delta iterations)

Page 19: 21.04.2016 Meetup: Spark vs. Flink

2nd Round Points
- Generally similar abstraction and feature set
- Flink has a nicer syntax, more sugar
- Spark is pretty bare-metal
- Flink is faster

Page 20: 21.04.2016 Meetup: Spark vs. Flink

Round 3: Simple Real-Time Analysis

Total Hitpoints over Last Minute

Page 21: 21.04.2016 Meetup: Spark vs. Flink

Configuring Environment for EventTime

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

ExecutionConfig config = env.getConfig();
config.setAutoWatermarkInterval(500);

Creating Stream from Kafka

Properties properties = new Properties();
properties.put("bootstrap.servers", KAFKA_BROKERS);
properties.put("zookeeper.connect", ZOOKEEPER_CONNECTION);
properties.put("group.id", KAFKA_GROUP_ID);

DataStreamSource<FightEvent> hitStream =
        env.addSource(new FlinkKafkaConsumer08<>("FightEventTopic",
                new FightEventDeserializer(), properties));
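FightEventDeserializer is a custom class that is not on the slides. A minimal sketch of what it could look like, assuming FightEvent is an Avro-generated SpecificRecord and using Flink's AbstractDeserializationSchema (package names vary slightly across Flink versions):

import java.io.IOException;

import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.flink.streaming.util.serialization.AbstractDeserializationSchema;

public class FightEventDeserializer extends AbstractDeserializationSchema<FightEvent> {

    @Override
    public FightEvent deserialize(byte[] message) throws IOException {
        // Avro binary decoding into the generated FightEvent class
        SpecificDatumReader<FightEvent> reader = new SpecificDatumReader<>(FightEvent.class);
        return reader.read(null, DecoderFactory.get().binaryDecoder(message, null));
    }
}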

Page 22: 21.04.2016 Meetup: Spark vs. Flink

Processing Logic

hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
        .timeWindowAll(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
        .apply(new SumAllWindowFunction<FightEvent>() {
            @Override
            public long getSummand(FightEvent fightEvent) {
                return fightEvent.getHitPoints();
            }
        })
        .writeAsCsv("s3://...");
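SumAllWindowFunction is one of the presenters' helper classes and is not shown on the slides. A hedged sketch of what it might look like, written against the AllWindowFunction interface of later Flink 1.x releases (the 1.0 signature used in the talk differed slightly), emitting (window start, window end, sum) tuples as seen in the output below:

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public abstract class SumAllWindowFunction<T>
        implements AllWindowFunction<T, Tuple3<Long, Long, Long>, TimeWindow> {

    // Subclasses decide which field of the event is summed up
    public abstract long getSummand(T event);

    @Override
    public void apply(TimeWindow window, Iterable<T> events,
                      Collector<Tuple3<Long, Long, Long>> out) {
        long sum = 0;
        for (T event : events) {
            sum += getSummand(event);
        }
        out.collect(new Tuple3<>(window.getStart(), window.getEnd(), sum));
    }
}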

Example Output
3> (1448130670000,1448130730000,290789)
4> (1448130680000,1448130740000,289395)
5> (1448130690000,1448130750000,291768)
6> (1448130700000,1448130760000,292634)
7> (1448130710000,1448130770000,293869)
8> (1448130720000,1448130780000,293356)
1> (1448130730000,1448130790000,293054)
2> (1448130740000,1448130800000,294209)

Page 23: 21.04.2016 Meetup: Spark vs. Flink

Create Context and get Avro Stream from Kafka

JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

HashSet<String> topicsSet = Sets.newHashSet("FightEventTopic");

HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "xxx:11211");
kafkaParams.put("group.id", "spark");

JavaPairInputDStream<String, FightEvent> kafkaStream = KafkaUtils.createDirectStream(jssc,
        String.class, FightEvent.class, StringDecoder.class, AvroDecoder.class,
        kafkaParams, topicsSet);
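AvroDecoder is likewise a custom class not shown on the slides. A sketch against Kafka 0.8's Decoder interface, again assuming an Avro-generated FightEvent; the VerifiableProperties constructor is needed because KafkaUtils instantiates decoders reflectively:

import java.io.IOException;

import kafka.serializer.Decoder;
import kafka.utils.VerifiableProperties;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

public class AvroDecoder implements Decoder<FightEvent> {

    public AvroDecoder(VerifiableProperties properties) {
        // No configuration needed; constructor exists for Kafka's reflective instantiation
    }

    @Override
    public FightEvent fromBytes(byte[] bytes) {
        try {
            SpecificDatumReader<FightEvent> reader = new SpecificDatumReader<>(FightEvent.class);
            return reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null));
        } catch (IOException e) {
            throw new RuntimeException("Could not deserialize FightEvent", e);
        }
    }
}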

Analyze number of hit points over a sliding window

kafkaStream.map(tuple -> tuple._2().getHitPoints())
        .reduceByWindow((hit1, hit2) -> hit1 + hit2,
                Durations.seconds(60), Durations.seconds(10))
        .foreachRDD((rdd, time) -> {
            rdd.saveAsTextFile(outputPath + "/round1-" + time.milliseconds());
            LOGGER.info("Hitpoints in the last minute {}", rdd.take(5));
            return null;
        });

Page 24: 21.04.2016 Meetup: Spark vs. Flink

Output
20:19:32 Hitpoints in the last minute [80802]
20:19:42 Hitpoints in the last minute [101019]
20:19:52 Hitpoints in the last minute [141012]
20:20:02 Hitpoints in the last minute [184759]
20:20:12 Hitpoints in the last minute [215802]

Page 25: 21.04.2016 Meetup: Spark vs. Flink

3rd Round Points
- Flink supports event-time windows
- Kafka and Avro worked seamlessly in both
- Spark uses micro-batches, no real stream
- Both have at-least-once delivery guarantees
- Exactly-once depends a lot on sink/source

Page 26: 21.04.2016 Meetup: Spark vs. Flink

Round 4: Connecting Static Data with Real-Time Data

Total Hitpoints over Last Minute per Gender

Page 27: 21.04.2016 Meetup: Spark vs. Flink

Read static data using objectFile and map genders

JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
JavaPairRDD<String, String> genderLookup = staticRdd.mapToPair(user -> {
    int genderIndicator = user.getGender();
    String gender;
    switch (genderIndicator) {
        case 1:  gender = "MALE";   break;
        case 2:  gender = "FEMALE"; break;
        default: gender = "OTHER";  break;
    }
    return new Tuple2<>(user.getId(), gender);
});

Analyze number of hit points per hitter over a sliding window

JavaPairDStream<String, Long> hitpointWindowedStream = kafkaStream
        .mapToPair(tuple -> {
            FightEvent fight = tuple._2();
            return new Tuple2<>(fight.getHitterId(), fight.getHitPoints());
        })
        .reduceByKeyAndWindow((hit1, hit2) -> hit1 + hit2,
                Durations.seconds(60), Durations.seconds(10));

Page 28: 21.04.2016 Meetup: Spark vs. Flink

Join with static data to find gender for each hitter

hitpointWindowedStream.foreachRDD((rdd, time) -> {
    JavaPairRDD<String, Long> hpg = rdd.leftOuterJoin(genderLookup)
            .mapToPair(joinedTuple -> {
                Optional<String> maybeGender = joinedTuple._2()._2();
                Long hitpoints = joinedTuple._2()._1();
                return new Tuple2<>(maybeGender.or("UNKNOWN"), hitpoints);
            })
            .reduceByKey((hit1, hit2) -> hit1 + hit2);
    hpg.saveAsTextFile(outputPath + "/round2-" + time.milliseconds());
    LOGGER.info("Hitpoints per gender {}", hpg.take(5));
    return null;
});

Output
20:30:44 Hitpoints [(FEMALE,35869), (OTHER,435), (MALE,66226)]
20:30:54 Hitpoints [(FEMALE,48805), (OTHER,644), (MALE,87014)]
20:31:04 Hitpoints [(FEMALE,55332), (OTHER,813), (MALE,99722)]
20:31:14 Hitpoints [(FEMALE,65543), (OTHER,813), (MALE,116416)]
20:31:24 Hitpoints [(FEMALE,67507), (OTHER,813), (MALE,123750)]

Page 29: 21.04.2016 Meetup: Spark vs. Flink

Loading Static Data in Every Map

// FightEventEnricher is (presumably) a Flink RichMapFunction that enriches
// each FightEvent with the Hero data loaded from S3.
public FightEventEnricher(String bucket, String keyPrefix) {
    this.bucket = bucket;
    this.keyPrefix = keyPrefix;
}

// open() is called once per parallel task instance before any records are processed,
// so the hero lookup map is loaded from S3 only once per instance
@Override
public void open(Configuration parameters) {
    populateHeroMapFromS3(bucket, keyPrefix);
}

@Override
public EnrichedFightEvent map(FightEvent event) throws Exception {
    return new EnrichedFightEvent(event,
            idToHero.get(event.getHitterId()),
            idToHero.get(event.getHitteeId()));
}

private void populateHeroMapFromS3(String bucket, String keyPrefix) {
    // Omitted
}

Page 30: 21.04.2016 Meetup: Spark vs. Flink

Processing Logic

hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
        .map(new FightEventEnricher("s3_bucket", "output/heros"))
        .filter(value -> value.getHittingHero() != null)
        .keyBy(enrichedFightEvent -> enrichedFightEvent.getHittingHero().getGender())
        .timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
        .apply(new SumWindowFunction<EnrichedFightEvent, Integer>() {
            @Override
            public long getSummand(EnrichedFightEvent value) {
                return value.getFightEvent().getHitPoints();
            }
        })

Example Output
2> (1448191350000,1448191410000,1,28478)
3> (1448191350000,1448191410000,2,264650)
2> (1448191360000,1448191420000,1,28290)
3> (1448191360000,1448191420000,2,263521)
2> (1448191370000,1448191430000,1,29327)
3> (1448191370000,1448191430000,2,265526)

Page 31: 21.04.2016 Meetup: Spark vs. Flink

4th Round Points
- Spark makes combining batch and streaming easier
- Windowing by key works well in both
- Java API of Spark can be annoying

Page 32: 21.04.2016 Meetup: Spark vs. Flink

Round 5: More Advanced Real-Time Analysis

Best Hitter over Last Minute per Gender

Page 33: 21.04.2016 Meetup: Spark vs. Flink

Processing Logic

hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
        .map(new FightEventEnricher("s3_bucket", "output/heros"))
        .filter(value -> value.getHittingHero() != null)
        .keyBy(fightEvent -> fightEvent.getHittingHero().getName())
        .timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
        .apply(new SumWindowFunction<EnrichedFightEvent, String>() {
            @Override
            public long getSummand(EnrichedFightEvent value) {
                return value.getFightEvent().getHitPoints();
            }
        })
        .assignTimestamps(new AscendingTimestampExtractor<...>() {
            @Override
            public long extractAscendingTimestamp(Tuple4<...> tuple, long l) {
                return tuple.f0;
            }
        })
        .timeWindowAll(Time.of(10, TimeUnit.SECONDS))
        .maxBy(3)
        .print();

Page 34: 21.04.2016 Meetup: Spark vs. Flink

Example Output
1> (1448200070000,1448200130000,Tengu,546)
2> (1448200080000,1448200140000,Louis XIV,621)
3> (1448200090000,1448200150000,Louis XIV,561)
4> (1448200100000,1448200160000,Louis XIV,552)
5> (1448200110000,1448200170000,Phil Dexter,620)
6> (1448200120000,1448200180000,Phil Dexter,552)
7> (1448200130000,1448200190000,Kalamity,648)
8> (1448200140000,1448200200000,Jakita Wagner,656)
1> (1448200150000,1448200210000,Jakita Wagner,703)

Page 35: 21.04.2016 Meetup: Spark vs. Flink

Read static data using objectFile and map names

JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
JavaPairRDD<String, String> userNameLookup = staticRdd
        .mapToPair(user -> new Tuple2<>(user.getId(), user.getName()));

Analyze number of hit points per hitter over a sliding window

JavaPairDStream<String, Long> hitters = kafkaStream
        .mapToPair(kafkaTuple -> new Tuple2<>(kafkaTuple._2().getHitterId(),
                kafkaTuple._2().getHitPoints()))
        .reduceByKeyAndWindow((accum, current) -> accum + current,
                (accum, remove) -> accum - remove,
                Durations.seconds(60), Durations.seconds(10));

Page 36: 21.04.2016 Meetup: Spark vs. Flink

Join with static data to find username for each hitter

hitters.foreachRDD((rdd, time) -> {
    JavaRDD<Tuple2<String, Long>> namedHitters = rdd
            .leftOuterJoin(userNameLookup)
            .map(joinedTuple -> {
                String username = joinedTuple._2()._2().or("No name");
                Long hitpoints = joinedTuple._2()._1();
                return new Tuple2<>(username, hitpoints);
            })
            .sortBy(Tuple2::_2, false, PARTITIONS);
    namedHitters.saveAsTextFile(outputPath + "/round3-" + time);
    LOGGER.info("Five highest hitters (total: {}){}",
            namedHitters.count(), namedHitters.take(5));
    return null;
});

Output

15/11/25 20:34:23 Five highest hitters (total: 200) [(Nick Fury,691), (Lady Blackhawk,585), (Choocho Colon,585), (Purple Man,539),
15/11/25 20:34:33 Five highest hitters (total: 378) [(Captain Dorja,826), (Choocho Colon,773), (Nick Fury,691), (Kari Limbo,646),
15/11/25 20:34:43 Five highest hitters (total: 378) [(Captain Dorja,1154), (Choocho Colon,867), (Wendy Go,723), (Kari Limbo,699),
15/11/25 20:34:53 Five highest hitters (total: 558) [(Captain Dorja,1154), (Wendy Go,931), (Choocho Colon,867), (Fyodor Dostoyevsky,

Page 38: 21.04.2016 Meetup: Spark vs. Flink

Performance: Yahoo Streaming Benchmark

Page 39: 21.04.2016 Meetup: Spark vs. Flink

5th Round Points
- Spark makes some things easier
- But Flink is real streaming
- In Spark you often have to specify partitions

Page 40: 21.04.2016 Meetup: Spark vs. Flink

The Judges' Call

Page 41: 21.04.2016 Meetup: Spark vs. Flink

Development
- Compared to Hadoop, both are awesome
- Both provide a unified programming model for diverse scenarios
- Comfort level of abstraction varies with use case
- Spark's Java API is cumbersome compared to the Scala API
- Working with both is fun
- Docs are OK, but spotty

Page 42: 21.04.2016 Meetup: Spark vs. Flink

Testing
- Testing distributed systems will always be hard
- Functionally both can be tested nicely (see the sketch below)
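As an illustration of functional testing, here is a minimal sketch of a JUnit test that runs Spark logic in local mode. The test class and the summing logic are made up; in practice the production logic would be factored into a method the test can call.

import static org.junit.Assert.assertEquals;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.Test;

public class HitpointSumTest {

    @Test
    public void sumsHitpointsLocally() {
        // Local master, so the test needs no cluster
        SparkConf conf = new SparkConf().setAppName("test").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            JavaRDD<Integer> hitpoints = sc.parallelize(Arrays.asList(100, 250, 75));

            // Stand-in for the logic under test: sum all hit points
            int total = hitpoints.reduce((a, b) -> a + b);

            assertEquals(425, total);
        } finally {
            sc.stop();
        }
    }
}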

Page 43: 21.04.2016 Meetup: Spark vs. Flink

Monitoring

Page 44: 21.04.2016 Meetup: Spark vs. Flink

Monitoring

Page 45: 21.04.2016 Meetup: Spark vs. Flink

Community

Page 46: 21.04.2016 Meetup: Spark vs. Flink

The Judges' Call: It depends...

Page 47: 21.04.2016 Meetup: Spark vs. Flink

Use Spark, if
- You have Cloudera, Hortonworks, etc. support and depend on it
- You want to heavily use Graph and ML libraries
- You want to use the more mature project

Page 48: 21.04.2016 Meetup: Spark vs. Flink

Use Flink, if
- Real-time processing is important for your use case
- You want more complex window operations
- You develop in Java only
- You want to support a German project

Page 49: 21.04.2016 Meetup: Spark vs. Flink

Benchmark References
[1] http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/
[2] http://eastcirclek.blogspot.de/2015/06/terasort-for-spark-and-flink-with-range.html
[3] http://eastcirclek.blogspot.de/2015/07/hash-join-on-tez-spark-and-flink.html
[4] https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
[5] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/