21.04.2016 Meetup: Spark vs. Flink
comsysto-gmbh
Spark vs Flink: Rumble in the (Big Data) Jungle
München, 2016-04-20
Konstantin Knauf, Michael Pisula
Background
The Big Data Ecosystem
Apache Top-Level Projects over Time
2008 2010 2013 2014 2015
The New Guard
                      Spark                              Flink
Origin                Berkeley University                TU Berlin
Apache Incubator      2013                               04/2014
Apache Top-Level      02/2014                            01/2015
Company               databricks                         data Artisans
Supported languages   Scala, Java, Python, R             Java, Scala, Python
Implemented in        Scala                              Java
Cluster               Stand-Alone, Mesos, EC2, YARN      Stand-Alone, Mesos, EC2, YARN
Teaser                Lightning-fast cluster computing   Scalable Batch and Stream Data Processing
The Challenge
Real-Time Analysis of a Superhero Fight Club
Fight (stream)
hitter: Int, hittee: Int, hitpoints: Int
Segment (batch)
id: Int, name: String, segment: String
Detail (batch)
name: String, gender: Int, birthYear: Int, noOfAppearances: Int
Hero (batch)
id: Int, name: String, segment: String, gender: Int, birthYear: Int, noOfAppearances: Int
The Setup
[Diagram: AWS Cluster with a Data Generator, a Kafka Cluster (Avro), Stream Processing, Batch Processing (Segment, Detail, Heroes), and Combining Stream and Batch]
Round 1: Setting up
Dependencies (Flink)
compile "org.apache.flink:flink-java:1.0.0"
compile "org.apache.flink:flink-streaming-java_2.11:1.0.0"
// For local execution from the IDE
compile "org.apache.flink:flink-clients_2.11:1.0.0"
Skeleton (Flink)
// Batch (DataSet API)
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Stream (DataStream API), used instead of the batch environment above
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Processing Logic

// For Streaming
env.execute();
Dependencies (Spark)
compile 'org.apache.spark:spark-core_2.10:1.5.0'
compile 'org.apache.spark:spark-streaming_2.10:1.5.0'
Skeleton (Spark)
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
// Batch
JavaSparkContext sparkContext = new JavaSparkContext(conf);
// Stream
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

// Processing Logic

// For Streaming
jssc.start();
jssc.awaitTermination(); // keep the driver alive while the stream runs
First Impressions
Practically no boilerplate
Easy to get started and play around
Runs in the IDE
Hadoop MapReduce is much harder to get into
Round 2: Static Data Analysis
Combine both static data parts
Read the csv file and transform it (Spark)
JavaRDD<String> segmentFile = sparkContext.textFile("s3://...");
// Keyed by hero name (a String) so that it can be joined with the detail table
JavaPairRDD<String, SegmentTableRecord> segmentTable = segmentFile
    .map(line -> line.split(","))
    .filter(array -> array.length == 3)
    .mapToPair((String[] parts) -> {
        int id = Integer.parseInt(parts[0]);
        String name = parts[1], segment = parts[2];
        return new Tuple2<>(name, new SegmentTableRecord(id, name, segment));
    });
Join with detail data, keep only humans and write output
segmentTable.join(detailTable)
    .mapValues(tuple -> {
        SegmentTableRecord s = tuple._1();
        DetailTableRecord d = tuple._2();
        return new Hero(s.getId(), s.getName(), s.getSegment(),
                        d.getGender(), d.getBirthYear(), d.getNoOfAppearances());
    })
    .map(tuple -> tuple._2())
    .filter(hero -> hero.getSegment().equals(HUMAN_SEGMENT))
    .saveAsTextFile("s3://...");
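To make the semantics of this step concrete, here is a minimal plain-Java sketch of an inner join on the hero name followed by a segment filter. It does not use Spark at all. The classes Segment, Detail and Hero are simplified, hypothetical stand-ins for the record classes on the slide, and the sketch assumes that names are unique in the detail data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of "join on name, then keep one segment".
// Segment, Detail and Hero are hypothetical, reduced versions of the slide's classes.
class HeroJoinSketch {
    static final class Segment {
        final int id; final String name; final String segment;
        Segment(int id, String name, String segment) {
            this.id = id; this.name = name; this.segment = segment;
        }
    }

    static final class Detail {
        final String name; final int gender;
        Detail(String name, int gender) { this.name = name; this.gender = gender; }
    }

    static final class Hero {
        final int id; final String name; final String segment; final int gender;
        Hero(int id, String name, String segment, int gender) {
            this.id = id; this.name = name; this.segment = segment; this.gender = gender;
        }
    }

    static List<Hero> joinAndKeepSegment(List<Segment> segments, List<Detail> details,
                                         String wantedSegment) {
        // Index the detail rows by name (assumes names are unique in the detail data).
        Map<String, Detail> detailByName = new HashMap<>();
        for (Detail d : details) {
            detailByName.put(d.name, d);
        }
        List<Hero> heroes = new ArrayList<>();
        for (Segment s : segments) {
            Detail d = detailByName.get(s.name);
            // Inner join: segment rows without a matching detail row are dropped.
            if (d != null && s.segment.equals(wantedSegment)) {
                heroes.add(new Hero(s.id, s.name, s.segment, d.gender));
            }
        }
        return heroes;
    }
}
```

Calling joinAndKeepSegment(segments, details, "Human") keeps only joined rows whose segment equals the given value, which mirrors the filter on HUMAN_SEGMENT.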
Loading Files from S3 into POJO (Flink)
DataSource<SegmentTableRecord> segmentTable = env.readCsvFile("s3://...")
    .ignoreInvalidLines()
    .pojoType(SegmentTableRecord.class, "id", "name", "segment");
Join and Filter
DataSet<Hero> humanHeros = segmentTable.join(detailTable)
    .where("name")
    .equalTo("name")
    .with((s, d) -> new Hero(s.id, s.name, s.segment, d.gender, d.birthYear, d.noOfAppearances))
    .filter(hero -> hero.segment.equals("Human"));
Write back to S3
humanHeros.writeAsFormattedText(outputTablePath, WriteMode.OVERWRITE, h -> h.toCsv());
Performance
Terasort1: Flink needs ca. 66% of Spark's runtime
Terasort2: Flink needs ca. 68% of Spark's runtime
HashJoin: Flink needs ca. 32% of Spark's runtime
(Iterative Processes: Flink needs ca. 50% of Spark's runtime, ca. 7% with Delta-Iterations)
2nd Round Points
Generally similar abstraction and feature set
Flink has a nicer syntax, more sugar
Spark is pretty bare-metal
Flink is faster
Round 3: Simple Real Time Analysis
Total Hitpoints over Last Minute
Configuring Environment for EventTime (Flink)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

ExecutionConfig config = env.getConfig();
config.setAutoWatermarkInterval(500);
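Event time processing relies on watermarks, which tell the system how far event time has progressed. The following plain-Java sketch shows one common way to derive a watermark: track the highest timestamp seen so far and subtract a fixed maximum delay. It is only an illustration and not the implementation of the FightEventTimestampExtractor used in this deck. In particular, reading the 6000 passed to that extractor as a maximum delay in milliseconds is an assumption that the slides do not confirm.

```java
// Illustrative watermark tracker with a bounded delay (hypothetical class, not a Flink API).
class BoundedDelayWatermark {
    private final long maxDelayMillis;
    private long maxSeenTimestamp = Long.MIN_VALUE;

    BoundedDelayWatermark(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
    }

    // Called for every incoming event with its event-time timestamp.
    void onEvent(long eventTimestamp) {
        maxSeenTimestamp = Math.max(maxSeenTimestamp, eventTimestamp);
    }

    // Would be polled periodically, e.g. at the auto watermark interval of 500 ms.
    long currentWatermark() {
        if (maxSeenTimestamp == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // no event seen yet
        }
        return maxSeenTimestamp - maxDelayMillis;
    }

    // An event whose timestamp is at or below the watermark arrived later than expected.
    boolean isLate(long eventTimestamp) {
        return eventTimestamp <= currentWatermark();
    }
}
```

With a delay of 6000 ms, an event stamped 10000 moves the watermark to 4000, and a subsequent out-of-order event stamped 8000 leaves it unchanged.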
Creating Stream from Kafka
Properties properties = new Properties();
properties.put("bootstrap.servers", KAFKA_BROKERS);
properties.put("zookeeper.connect", ZOOKEEPER_CONNECTION);
properties.put("group.id", KAFKA_GROUP_ID);

DataStreamSource<FightEvent> hitStream = env.addSource(
    new FlinkKafkaConsumer08<>("FightEventTopic", new FightEventDeserializer(), properties));
Processing Logic
hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
    .timeWindowAll(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
    .apply(new SumAllWindowFunction<FightEvent>() {
        @Override
        public long getSummand(FightEvent fightEvent) {
            return fightEvent.getHitPoints();
        }
    })
    .writeAsCsv("s3://...");
Example Output
3> (1448130670000,1448130730000,290789)
4> (1448130680000,1448130740000,289395)
5> (1448130690000,1448130750000,291768)
6> (1448130700000,1448130760000,292634)
7> (1448130710000,1448130770000,293869)
8> (1448130720000,1448130780000,293356)
1> (1448130730000,1448130790000,293054)
2> (1448130740000,1448130800000,294209)
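With a 60-second window that slides every 10 seconds, each event belongs to six overlapping windows. The sketch below computes the start timestamps of those windows in plain Java. It assumes non-negative timestamps, windows aligned to multiples of the slide, and half-open windows [start, start + size). The class name is hypothetical, and reading the output tuples as (window start, window end, sum) is an interpretation of the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative assignment of a timestamp to sliding windows (hypothetical helper).
class SlidingWindowAssigner {
    // Returns the start of every window [start, start + sizeMillis) containing the timestamp,
    // newest window first. Assumes timestamp >= 0 and windows aligned to the slide.
    static List<Long> windowStarts(long timestamp, long sizeMillis, long slideMillis) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slideMillis);
        for (long start = lastStart; start > timestamp - sizeMillis; start -= slideMillis) {
            starts.add(start);
        }
        return starts;
    }
}
```

For example, windowStarts(1448130735000L, 60000L, 10000L) yields six starts, from 1448130730000 down to 1448130680000.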
Create Context and get Avro Stream from Kafka (Spark)
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

HashSet<String> topicsSet = Sets.newHashSet("FightEventTopic");

HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "xxx:11211");
kafkaParams.put("group.id", "spark");

JavaPairInputDStream<String, FightEvent> kafkaStream = KafkaUtils.createDirectStream(
    jssc, String.class, FightEvent.class, StringDecoder.class, AvroDecoder.class,
    kafkaParams, topicsSet);
Analyze number of hit points over a sliding window
kafkaStream.map(tuple -> tuple._2().getHitPoints())
    .reduceByWindow((hit1, hit2) -> hit1 + hit2, Durations.seconds(60), Durations.seconds(10))
    .foreachRDD((rdd, time) -> {
        rdd.saveAsTextFile(outputPath + "/round1-" + time.milliseconds());
        LOGGER.info("Hitpoints in the last minute {}", rdd.take(5));
        return null;
    });
Output
20:19:32 Hitpoints in the last minute [80802]
20:19:42 Hitpoints in the last minute [101019]
20:19:52 Hitpoints in the last minute [141012]
20:20:02 Hitpoints in the last minute [184759]
20:20:12 Hitpoints in the last minute [215802]
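The micro-batch model can be pictured as a queue of per-batch partial sums. The plain-Java sketch below keeps the most recent batches of a window and emits a total every few batches. It is a simplified illustration with a hypothetical class name, not Spark's implementation. The rising totals in the output above are consistent with a window that is still filling up after startup, although the slides do not state this.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sliding-window sum over micro-batches (hypothetical helper, no Spark involved).
class MicroBatchWindowSum {
    // batchSums: one pre-aggregated value per batch interval (e.g. per second).
    // windowBatches: window length in batches; slideBatches: emit a result every n batches.
    static List<Long> slidingSums(long[] batchSums, int windowBatches, int slideBatches) {
        Deque<Long> window = new ArrayDeque<>();
        List<Long> results = new ArrayList<>();
        for (int i = 0; i < batchSums.length; i++) {
            window.addLast(batchSums[i]);
            if (window.size() > windowBatches) {
                window.removeFirst(); // oldest batch leaves the window
            }
            if ((i + 1) % slideBatches == 0) {
                long sum = 0;
                for (long value : window) {
                    sum += value; // recompute the total from all batches in the window
                }
                results.add(sum);
            }
        }
        return results;
    }
}
```

With a one-second batch interval, the slide's configuration would correspond to windowBatches = 60 and slideBatches = 10.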
3rd Round Points
Flink supports event time windows
Kafka and Avro worked seamlessly in both
Spark uses micro-batches, no real stream
Both have at-least-once delivery guarantees
Exactly-once depends a lot on sink/source
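One way to see why exactly-once results depend on the sink: under at-least-once delivery, a record may be written again after a failure and replay. The hypothetical plain-Java sketch below contrasts an append-only sink, where the replayed record appears twice, with a keyed sink, where it simply overwrites itself. The record layout (a key and a value) is an assumption made for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative comparison of two sink styles under replay (hypothetical helper).
class SinkReplaySketch {
    // Append-only sink: every delivery is stored, so replayed records are duplicated.
    static List<String> appendAll(List<String[]> records) {
        List<String> out = new ArrayList<>();
        for (String[] record : records) {
            out.add(record[0] + "=" + record[1]);
        }
        return out;
    }

    // Keyed (idempotent) sink: a replayed record overwrites its earlier copy.
    static Map<String, String> upsertAll(List<String[]> records) {
        Map<String, String> out = new HashMap<>();
        for (String[] record : records) {
            out.put(record[0], record[1]);
        }
        return out;
    }
}
```

If the same window result is delivered twice, appendAll stores it twice, while upsertAll ends up with a single entry per key.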
Round 4: Connecting Static Data with Real Time Data
Total Hitpoints over Last Minute Per Gender
Read static data using objectFile and map genders (Spark)
JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
JavaPairRDD<String, String> genderLookup = staticRdd.mapToPair(user -> {
    int genderIndicator = user.getGender();
    String gender;
    switch (genderIndicator) {
        case 1: gender = "MALE"; break;
        case 2: gender = "FEMALE"; break;
        default: gender = "OTHER"; break;
    }
    return new Tuple2<>(user.getId(), gender);
});
Analyze number of hit points per hitter over a sliding window
JavaPairDStream<String, Long> hitpointWindowedStream = kafkaStream
    .mapToPair(tuple -> {
        FightEvent fight = tuple._2();
        return new Tuple2<>(fight.getHitterId(), fight.getHitPoints());
    })
    .reduceByKeyAndWindow((hit1, hit2) -> hit1 + hit2, Durations.seconds(60), Durations.seconds(10));
Join with static data to find gender for each hitter
hitpointWindowedStream.foreachRDD((rdd, time) -> {
    JavaPairRDD<String, Long> hpg = rdd.leftOuterJoin(genderLookup)
        .mapToPair(joinedTuple -> {
            Optional<String> maybeGender = joinedTuple._2()._2();
            Long hitpoints = joinedTuple._2()._1();
            return new Tuple2<>(maybeGender.or("UNKNOWN"), hitpoints);
        })
        .reduceByKey((hit1, hit2) -> hit1 + hit2);
    hpg.saveAsTextFile(outputPath + "/round2-" + time.milliseconds());
    LOGGER.info("Hitpoints per gender {}", hpg.take(5));
    return null;
});
Output
20:30:44 Hitpoints [(FEMALE,35869), (OTHER,435), (MALE,66226)]
20:30:54 Hitpoints [(FEMALE,48805), (OTHER,644), (MALE,87014)]
20:31:04 Hitpoints [(FEMALE,55332), (OTHER,813), (MALE,99722)]
20:31:14 Hitpoints [(FEMALE,65543), (OTHER,813), (MALE,116416)]
20:31:24 Hitpoints [(FEMALE,67507), (OTHER,813), (MALE,123750)]
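The join step can be summarized as: look up each hitter's gender, fall back to "UNKNOWN" when the lookup has no entry, then add up the hit points per gender. Here is a minimal plain-Java sketch of that logic for a single window, using ordinary maps instead of RDDs. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative left outer join plus per-gender reduction for one window (hypothetical helper).
class GenderJoinSketch {
    static Map<String, Long> hitpointsPerGender(Map<String, Long> hitpointsPerHitter,
                                                Map<String, String> genderLookup) {
        Map<String, Long> perGender = new TreeMap<>();
        for (Map.Entry<String, Long> entry : hitpointsPerHitter.entrySet()) {
            // Left outer join: hitters missing from the lookup are kept as "UNKNOWN".
            String gender = genderLookup.getOrDefault(entry.getKey(), "UNKNOWN");
            // Equivalent of reduceByKey with (hit1, hit2) -> hit1 + hit2.
            perGender.merge(gender, entry.getValue(), Long::sum);
        }
        return perGender;
    }
}
```

The TreeMap only makes the iteration order of the result deterministic; it has no counterpart in the Spark code.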
Loading Static Data in Every Map (Flink)
public FightEventEnricher(String bucket, String keyPrefix) {
    this.bucket = bucket;
    this.keyPrefix = keyPrefix;
}

@Override
public void open(Configuration parameters) {
    populateHeroMapFromS3(bucket, keyPrefix);
}

@Override
public EnrichedFightEvent map(FightEvent event) throws Exception {
    return new EnrichedFightEvent(event,
        idToHero.get(event.getHitterId()),
        idToHero.get(event.getHitteeId()));
}

private void populateHeroMapFromS3(String bucket, String keyPrefix) {
    // Omitted
}
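The pattern here is to load the lookup data once in open() and then answer every map() call from memory. The plain-Java sketch below mimics that lifecycle without any Flink classes. The Supplier is a hypothetical stand-in for the omitted populateHeroMapFromS3, and hero records are reduced to a name for brevity.

```java
import java.util.Map;
import java.util.function.Supplier;

// Illustrative open-once, map-many lifecycle (hypothetical class, no Flink involved).
class EnricherLifecycleSketch {
    private final Supplier<Map<String, String>> loader; // stand-in for the S3 loading step
    private Map<String, String> idToHeroName;
    int loadCount = 0; // counts how often the static data was loaded

    EnricherLifecycleSketch(Supplier<Map<String, String>> loader) {
        this.loader = loader;
    }

    // Called once before any event is processed.
    void open() {
        idToHeroName = loader.get();
        loadCount++;
    }

    // Called for every event; returns null when the hitter is not in the static data.
    String map(String hitterId) {
        return idToHeroName.get(hitterId);
    }
}
```

A null result for an unknown hitter is what a downstream null check, such as a filter on the hitting hero, would then discard.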
Processing Logic
hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
    .map(new FightEventEnricher("s3_bucket", "output/heros"))
    .filter(value -> value.getHittingHero() != null)
    .keyBy(enrichedFightEvent -> enrichedFightEvent.getHittingHero().getGender())
    .timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
    .apply(new SumWindowFunction<EnrichedFightEvent, Integer>() {
        @Override
        public long getSummand(EnrichedFightEvent value) {
            return value.getFightEvent().getHitPoints();
        }
    })
Example Output
2> (1448191350000,1448191410000,1,28478)
3> (1448191350000,1448191410000,2,264650)
2> (1448191360000,1448191420000,1,28290)
3> (1448191360000,1448191420000,2,263521)
2> (1448191370000,1448191430000,1,29327)
3> (1448191370000,1448191430000,2,265526)
4th Round Points
Spark makes combining batch and stream easier
Windowing by key works well in both
Java API of Spark can be annoying
Round 5: More Advanced Real Time Analysis
Best Hitter over Last Minute Per Gender
Processing Logic (Flink)
hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
    .map(new FightEventEnricher("s3_bucket", "output/heros"))
    .filter(value -> value.getHittingHero() != null)
    .keyBy(fightEvent -> fightEvent.getHittingHero().getName())
    .timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
    .apply(new SumWindowFunction<EnrichedFightEvent, String>() {
        @Override
        public long getSummand(EnrichedFightEvent value) {
            return value.getFightEvent().getHitPoints();
        }
    })
    .assignTimestamps(new AscendingTimestampExtractor<...>() {
        @Override
        public long extractAscendingTimestamp(Tuple4<...> tuple, long l) {
            return tuple.f0;
        }
    })
    .timeWindowAll(Time.of(10, TimeUnit.SECONDS))
    .maxBy(3)
    .print();
Example Output
1> (1448200070000,1448200130000,Tengu,546)
2> (1448200080000,1448200140000,Louis XIV,621)
3> (1448200090000,1448200150000,Louis XIV,561)
4> (1448200100000,1448200160000,Louis XIV,552)
5> (1448200110000,1448200170000,Phil Dexter,620)
6> (1448200120000,1448200180000,Phil Dexter,552)
7> (1448200130000,1448200190000,Kalamity,648)
8> (1448200140000,1448200200000,Jakita Wagner,656)
1> (1448200150000,1448200210000,Jakita Wagner,703)
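The second stage of this pipeline picks, for each window, the hero with the highest total. A plain-Java sketch of that selection is shown below. It groups per-hero totals by window start and keeps the maximum, which is what the maxBy on the hit point field is understood to do here. WindowTotal is a hypothetical class standing in for the four-field tuples of the output.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative "best hitter per window" selection (hypothetical helper, no Flink involved).
class BestHitterSketch {
    static final class WindowTotal {
        final long windowStart; final String name; final long hitpoints;
        WindowTotal(long windowStart, String name, long hitpoints) {
            this.windowStart = windowStart; this.name = name; this.hitpoints = hitpoints;
        }
    }

    // Keeps, for every window start, the entry with the highest hit point total.
    // On ties the entry seen first wins.
    static Map<Long, WindowTotal> bestPerWindow(List<WindowTotal> totals) {
        Map<Long, WindowTotal> best = new TreeMap<>();
        for (WindowTotal total : totals) {
            WindowTotal current = best.get(total.windowStart);
            if (current == null || total.hitpoints > current.hitpoints) {
                best.put(total.windowStart, total);
            }
        }
        return best;
    }
}
```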
Read static data using objectFile and map names (Spark)
JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
JavaPairRDD<String, String> userNameLookup = staticRdd
    .mapToPair(user -> new Tuple2<>(user.getId(), user.getName()));
Analyze number of hit points per hitter over a sliding window
JavaPairDStream<String, Long> hitters = kafkaStream
    .mapToPair(kafkaTuple -> new Tuple2<>(kafkaTuple._2().getHitterId(), kafkaTuple._2().getHitPoints()))
    .reduceByKeyAndWindow((accum, current) -> accum + current,
                          (accum, remove) -> accum - remove,
                          Durations.seconds(60), Durations.seconds(10));
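This variant of reduceByKeyAndWindow takes a second, inverse function, so the window total can be updated incrementally: new data is added and data leaving the window is subtracted, instead of re-summing the whole window. The plain-Java sketch below shows the idea for a single key with a hypothetical class. It is not Spark's implementation. In Spark itself, this variant additionally requires checkpointing to be enabled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative incremental window sum with an inverse step (hypothetical helper, single key).
class InverseReduceWindow {
    private final int windowBatches;
    private final Deque<Long> batches = new ArrayDeque<>();
    private long runningSum = 0;

    InverseReduceWindow(int windowBatches) {
        this.windowBatches = windowBatches;
    }

    // Adds the newest batch and returns the total over the last windowBatches batches.
    long addBatch(long batchSum) {
        batches.addLast(batchSum);
        runningSum += batchSum;                  // (accum, current) -> accum + current
        if (batches.size() > windowBatches) {
            runningSum -= batches.removeFirst(); // (accum, remove) -> accum - remove
        }
        return runningSum;
    }
}
```

Each update costs constant time regardless of the window length, which is the benefit of supplying the inverse function.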
Join with static data to find username for each hitter
hitters.foreachRDD((rdd, time) -> {
    JavaRDD<Tuple2<String, Long>> namedHitters = rdd
        .leftOuterJoin(userNameLookup)
        .map(joinedTuple -> {
            String username = joinedTuple._2()._2().or("No name");
            Long hitpoints = joinedTuple._2()._1();
            return new Tuple2<>(username, hitpoints);
        })
        .sortBy(Tuple2::_2, false, PARTITIONS);
    namedHitters.saveAsTextFile(outputPath + "/round3-" + time);
    LOGGER.info("Five highest hitters (total: {}){}", namedHitters.count(), namedHitters.take(5));
    return null;
});
Output
15/11/25 20:34:23 Five highest hitters (total: 200)[(Nick Fury,691), (Lady Blackhawk,585), (Choocho Colon,585), (Purple Man,539),
15/11/25 20:34:33 Five highest hitters (total: 378)[(Captain Dorja,826), (Choocho Colon,773), (Nick Fury,691), (Kari Limbo,646),
15/11/25 20:34:43 Five highest hitters (total: 378)[(Captain Dorja,1154), (Choocho Colon,867), (Wendy Go,723), (Kari Limbo,699),
15/11/25 20:34:53 Five highest hitters (total: 558)[(Captain Dorja,1154), (Wendy Go,931), (Choocho Colon,867), (Fyodor Dostoyevsky,
Performance
Yahoo Streaming Benchmark
5th Round Points
Spark makes some things easier
But Flink is real streaming
In Spark you often have to specify partitions
The Judges' Call
Development
Compared to Hadoop, both are awesome
Both provide a unified programming model for diverse scenarios
Comfort level of abstraction varies with use-case
Spark's Java API is cumbersome compared to the Scala API
Working with both is fun
Docs are ok, but spotty
Testing
Testing distributed systems will always be hard
Functionally both can be tested nicely
Monitoring
Community
The Judges' Call
It depends...
Use Spark, if
You have Cloudera, Hortonworks, etc. support and depend on it
You want to heavily use Graph and ML libraries
You want to use the more mature project
Use Flink, if
Real-Time processing is important for your use case
You want more complex window operations
You develop in Java only
You want to support a German project
Benchmark References
[1] http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/
[2] http://eastcirclek.blogspot.de/2015/06/terasort-for-spark-and-flink-with-range.html
[3] http://eastcirclek.blogspot.de/2015/07/hash-join-on-tez-spark-and-flink.html
[4] https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
[5] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/