Download - Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

Transcript
Page 1: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

Spark 2

Alexey Zinovyev, Java/BigData Trainer in EPAM

Page 2: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

About

With IT since 2007

With Java since 2009

With Hadoop since 2012

With EPAM since 2015

Page 3: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

3Big Data Training

Secret Word from EPAM

itsubbotnik

Page 4: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

4Big Data Training

Contacts

E-mail : [email protected]

Twitter : @zaleslaw @BigDataRussia

Facebook: https://www.facebook.com/zaleslaw

vk.com/big_data_russia Big Data Russia

vk.com/java_jvm Java & JVM langs

Page 5: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

5Joker’16: Spark 2 from Zinoviev Alexey

Sprk Dvlprs! Let’s start!

Page 6: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

6Joker’16: Spark 2 from Zinoviev Alexey

< SPARK 2.0

Page 7: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

7Big Data Training

Modern Java in 2016Big Data in 2014

Page 8: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

8Big Data Training

Big Data in 2017

Page 9: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

9Big Data Training

Machine Learning EVERYWHERE

Page 10: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

10Big Data Training

Machine Learning vs Traditional Programming

Page 11: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

11Big Data Training

Page 12: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

12Big Data Training

Something wrong with HADOOP

Page 13: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

13Joker’16: Spark 2 from Zinoviev Alexey

Hadoop is not SEXY

Page 14: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

14Joker’16: Spark 2 from Zinoviev Alexey

Whaaaat?

Page 15: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

15Joker’16: Spark 2 from Zinoviev Alexey

Map Reduce Job Writing

Page 16: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

16Joker’16: Spark 2 from Zinoviev Alexey

MR code

Page 17: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

17Joker’16: Spark 2 from Zinoviev Alexey

Hadoop Developers Right Now

Page 18: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

18Joker’16: Spark 2 from Zinoviev Alexey

Iterative Calculations

10x – 100x

Page 19: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

19Joker’16: Spark 2 from Zinoviev Alexey

MapReduce vs Spark

Page 20: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

20Joker’16: Spark 2 from Zinoviev Alexey

MapReduce vs Spark

Page 21: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

21Joker’16: Spark 2 from Zinoviev Alexey

MapReduce vs Spark

Page 22: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

22Joker’16: Spark 2 from Zinoviev Alexey

SPARK 2.0 DISCUSSION

Page 23: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

23Joker’16: Spark 2 from Zinoviev Alexey

Spark

Family

Page 24: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

24Joker’16: Spark 2 from Zinoviev Alexey

Spark

Family

Page 25: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

25Joker’16: Spark 2 from Zinoviev Alexey

Case #0 : How to avoid DStreams with RDD-like API?

Page 26: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

26Joker’16: Spark 2 from Zinoviev Alexey

Continuous Applications

Page 27: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

27Joker’16: Spark 2 from Zinoviev Alexey

Continuous Applications cases

• Updating data that will be served in real time

• Extract, transform and load (ETL)

• Creating a real-time version of an existing batch job

• Online machine learning

Page 28: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

28Joker’16: Spark 2 from Zinoviev Alexey

Write Batches

Page 29: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

29Joker’16: Spark 2 from Zinoviev Alexey

Catch Streaming

Page 30: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

30Joker’16: Spark 2 from Zinoviev Alexey

The main concept of Structured Streaming

You can express your streaming computation the

same way you would express a batch computation

on static data.

Page 31: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

31Joker’16: Spark 2 from Zinoviev Alexey

Batch

// Read JSON once from S3

logsDF = spark.read.json("s3://logs")

// Transform with DataFrame API and save

logsDF.select("user", "url", "date")

.write.parquet("s3://out")

Page 32: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

32Joker’16: Spark 2 from Zinoviev Alexey

Real

Time

// Read JSON continuously from S3

logsDF = spark.readStream.json("s3://logs")

// Transform with DataFrame API and save

logsDF.select("user", "url", "date")

.writeStream.parquet("s3://out")

.start()

Page 33: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

33Joker’16: Spark 2 from Zinoviev Alexey

Unlimited Table

Page 34: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

34Joker’16: Spark 2 from Zinoviev Alexey

WordCount

from

Socket

val lines = spark.readStream

.format("socket")

.option("host", "localhost")

.option("port", 9999)

.load()

val words = lines.as[String].flatMap(_.split(" "))

val wordCounts = words.groupBy("value").count()

Page 35: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

35Joker’16: Spark 2 from Zinoviev Alexey

WordCount

from

Socket

val lines = spark.readStream

.format("socket")

.option("host", "localhost")

.option("port", 9999)

.load()

val words = lines.as[String].flatMap(_.split(" "))

val wordCounts = words.groupBy("value").count()

Page 36: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

36Joker’16: Spark 2 from Zinoviev Alexey

WordCount

from

Socket

val lines = spark.readStream

.format("socket")

.option("host", "localhost")

.option("port", 9999)

.load()

val words = lines.as[String].flatMap(_.split(" "))

val wordCounts = words.groupBy("value").count()

Page 37: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

37Joker’16: Spark 2 from Zinoviev Alexey

WordCount

from

Socket

val lines = spark.readStream

.format("socket")

.option("host", "localhost")

.option("port", 9999)

.load()

val words = lines.as[String].flatMap(_.split(" "))

val wordCounts = words.groupBy("value").count()

Don’t forget

to start

Streaming

Page 38: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

38Joker’16: Spark 2 from Zinoviev Alexey

WordCount with Structured Streaming

Page 39: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

39Joker’16: Spark 2 from Zinoviev Alexey

Structured Streaming provides …

• fast & scalable

• fault-tolerant

• end-to-end with exactly-once semantic

• stream processing

• ability to use DataFrame/DataSet API for streaming

Page 40: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

40Joker’16: Spark 2 from Zinoviev Alexey

Structured Streaming provides (in dreams) …

• fast & scalable

• fault-tolerant

• end-to-end with exactly-once semantic

• stream processing

• ability to use DataFrame/DataSet API for streaming

Page 41: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

41Joker’16: Spark 2 from Zinoviev Alexey

Let’s UNION streaming and static DataSets

Page 42: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

42Joker’16: Spark 2 from Zinoviev Alexey

Let’s UNION streaming and static DataSets

org.apache.spark.sql.

AnalysisException:

Union between streaming

and batch

DataFrames/Datasets is not

supported;

Page 43: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

43Joker’16: Spark 2 from Zinoviev Alexey

Let’s UNION streaming and static DataSets

Go to UnsupportedOperationChecker.scala and check your

operation

Page 44: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

44Joker’16: Spark 2 from Zinoviev Alexey

Case #1 : We should think about optimization in RDD terms

Page 45: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

45Joker’16: Spark 2 from Zinoviev Alexey

Single

Thread

collection

Page 46: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

46Joker’16: Spark 2 from Zinoviev Alexey

No perf

issues,

right?

Page 47: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

47Joker’16: Spark 2 from Zinoviev Alexey

The main concept

more partitions = more parallelism

Page 48: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

48Joker’16: Spark 2 from Zinoviev Alexey

Do it

parallel

Page 49: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

49Joker’16: Spark 2 from Zinoviev Alexey

I’d like

NARROW

Page 50: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

50Joker’16: Spark 2 from Zinoviev Alexey

Map, filter, filter

Page 51: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

51Joker’16: Spark 2 from Zinoviev Alexey

GroupByKey, join

Page 52: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

52Joker’16: Spark 2 from Zinoviev Alexey

Case #2 : DataFrames suggest mix SQL and Scala functions

Page 53: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

53Joker’16: Spark 2 from Zinoviev Alexey

History of Spark APIs

Page 54: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

54Joker’16: Spark 2 from Zinoviev Alexey

RDD

rdd.filter(_.age > 21) // RDD

df.filter("age > 21") // DataFrame SQL-style

df.filter(df.col("age").gt(21)) // Expression style

dataset.filter(_.age < 21); // Dataset API

Page 55: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

55Joker’16: Spark 2 from Zinoviev Alexey

History of Spark APIs

Page 56: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

56Joker’16: Spark 2 from Zinoviev Alexey

SQL

rdd.filter(_.age > 21) // RDD

df.filter("age > 21") // DataFrame SQL-style

df.filter(df.col("age").gt(21)) // Expression style

dataset.filter(_.age < 21); // Dataset API

Page 57: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

57Joker’16: Spark 2 from Zinoviev Alexey

Expression

rdd.filter(_.age > 21) // RDD

df.filter("age > 21") // DataFrame SQL-style

df.filter(df.col("age").gt(21)) // Expression style

dataset.filter(_.age < 21); // Dataset API

Page 58: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

58Joker’16: Spark 2 from Zinoviev Alexey

History of Spark APIs

Page 59: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

59Joker’16: Spark 2 from Zinoviev Alexey

DataSet

rdd.filter(_.age > 21) // RDD

df.filter("age > 21") // DataFrame SQL-style

df.filter(df.col("age").gt(21)) // Expression style

dataset.filter(_.age < 21); // Dataset API

Page 60: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

60Joker’16: Spark 2 from Zinoviev Alexey

Case #2 : DataFrame is referring to data attributes by name

Page 61: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

61Joker’16: Spark 2 from Zinoviev Alexey

DataSet = RDD’s types + DataFrame’s Catalyst

• RDD API

• compile-time type-safety

• off-heap storage mechanism

• performance benefits of the Catalyst query optimizer

• Tungsten

Page 62: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

62Joker’16: Spark 2 from Zinoviev Alexey

DataSet = RDD’s types + DataFrame’s Catalyst

• RDD API

• compile-time type-safety

• off-heap storage mechanism

• performance benefits of the Catalyst query optimizer

• Tungsten

Page 63: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

63Joker’16: Spark 2 from Zinoviev Alexey

Structured APIs in SPARK

Page 64: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

64Joker’16: Spark 2 from Zinoviev Alexey

Unified API in Spark 2.0

DataFrame = Dataset[Row]

Dataframe is a schemaless (untyped) Dataset now

Page 65: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

65Joker’16: Spark 2 from Zinoviev Alexey

Define

case class

case class User(email: String, footSize: Long, name: String)

// DataFrame -> DataSet with Users

val userDS =

spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()

userDS.filter(_.footSize > 38).collect()

ds.rdd // IF YOU REALLY WANT

Page 66: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

66Joker’16: Spark 2 from Zinoviev Alexey

Read JSON

case class User(email: String, footSize: Long, name: String)

// DataFrame -> DataSet with Users

val userDS =

spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()

userDS.filter(_.footSize > 38).collect()

ds.rdd // IF YOU REALLY WANT

Page 67: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

67Joker’16: Spark 2 from Zinoviev Alexey

Filter by

Field

case class User(email: String, footSize: Long, name: String)

// DataFrame -> DataSet with Users

val userDS =

spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()

userDS.filter(_.footSize > 38).collect()

ds.rdd // IF YOU REALLY WANT

Page 68: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

68Joker’16: Spark 2 from Zinoviev Alexey

Case #3 : Spark has many contexts

Page 69: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

69Joker’16: Spark 2 from Zinoviev Alexey

Spark Session

• New entry point in spark for creating datasets

• Replaces SQLContext, HiveContext and StreamingContext

• Move from SparkContext to SparkSession signifies move

away from RDD

Page 70: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

70Joker’16: Spark 2 from Zinoviev Alexey

Spark

Session

val sparkSession = SparkSession.builder

.master("local")

.appName("spark session example")

.getOrCreate()

val df = sparkSession.read

.option("header","true")

.csv("src/main/resources/names.csv")

df.show()

Page 71: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

71Joker’16: Spark 2 from Zinoviev Alexey

No, I want to create my lovely RDDs

Page 72: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

72Joker’16: Spark 2 from Zinoviev Alexey

Where’s parallelize() method?

Page 73: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

73Joker’16: Spark 2 from Zinoviev Alexey

RDD?

case class User(email: String, footSize: Long, name: String)

// DataFrame -> DataSet with Users

val userDS =

spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()

userDS.filter(_.footSize > 38).collect()

ds.rdd // IF YOU REALLY WANT

Page 74: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

74Joker’16: Spark 2 from Zinoviev Alexey

Case #4 : Spark uses Java serialization A LOT

Page 75: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

75Joker’16: Spark 2 from Zinoviev Alexey

Two choices to distribute data across cluster

• Java serialization

By default with ObjectOutputStream

• Kryo serialization

Should register classes (no support of Serialazible)

Page 76: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

76Joker’16: Spark 2 from Zinoviev Alexey

The main problem: overhead of serializing

Each serialized object contains the class structure as

well as the values

Page 77: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

77Joker’16: Spark 2 from Zinoviev Alexey

The main problem: overhead of serializing

Each serialized object contains the class structure as

well as the values

Don’t forget about GC

Page 78: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

78Joker’16: Spark 2 from Zinoviev Alexey

Tungsten Compact Encoding

Page 79: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

79Joker’16: Spark 2 from Zinoviev Alexey

Encoder’s concept

Generate bytecode to interact with off-heap

&

Give access to attributes without ser/deser

Page 80: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

80Joker’16: Spark 2 from Zinoviev Alexey

Encoders

Page 81: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

81Joker’16: Spark 2 from Zinoviev Alexey

No custom encoders

Page 82: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

82Joker’16: Spark 2 from Zinoviev Alexey

Case #5 : Not enough storage levels

Page 83: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

83Joker’16: Spark 2 from Zinoviev Alexey

Caching in Spark

• Frequently used RDD can be stored in memory

• One method, one short-cut: persist(), cache()

• SparkContext keeps track of cached RDD

• Serialized or deserialized Java objects

Page 84: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

84Joker’16: Spark 2 from Zinoviev Alexey

Full list of options

• MEMORY_ONLY

• MEMORY_AND_DISK

• MEMORY_ONLY_SER

• MEMORY_AND_DISK_SER

• DISK_ONLY

• MEMORY_ONLY_2, MEMORY_AND_DISK_2

Page 85: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

85Joker’16: Spark 2 from Zinoviev Alexey

Spark Core Storage Level

• MEMORY_ONLY (default for Spark Core)

• MEMORY_AND_DISK

• MEMORY_ONLY_SER

• MEMORY_AND_DISK_SER

• DISK_ONLY

• MEMORY_ONLY_2, MEMORY_AND_DISK_2

Page 86: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

86Joker’16: Spark 2 from Zinoviev Alexey

Spark Streaming Storage Level

• MEMORY_ONLY (default for Spark Core)

• MEMORY_AND_DISK

• MEMORY_ONLY_SER (default for Spark Streaming)

• MEMORY_AND_DISK_SER

• DISK_ONLY

• MEMORY_ONLY_2, MEMORY_AND_DISK_2

Page 87: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

87Joker’16: Spark 2 from Zinoviev Alexey

Developer API to make new Storage Levels

Page 88: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

88Joker’16: Spark 2 from Zinoviev Alexey

What’s the most popular file format in BigData?

Page 89: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

89Joker’16: Spark 2 from Zinoviev Alexey

Case #6 : External libraries to read CSV

Page 90: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

90Joker’16: Spark 2 from Zinoviev Alexey

Easy to

read CSV

data = sqlContext.read.format("csv")

.option("header", "true")

.option("inferSchema", "true")

.load("/datasets/samples/users.csv")

data.cache()

data.createOrReplaceTempView(“users")

display(data)

Page 91: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

91Joker’16: Spark 2 from Zinoviev Alexey

Case #7 : How to measure Spark performance?

Page 92: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

92Joker’16: Spark 2 from Zinoviev Alexey

You'd measure performance!

Page 93: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

93Joker’16: Spark 2 from Zinoviev Alexey

TPCDS

99 Queries

http://bit.ly/2dObMsH

Page 94: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

94Joker’16: Spark 2 from Zinoviev Alexey

Page 95: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

95Joker’16: Spark 2 from Zinoviev Alexey

How to benchmark Spark

Page 96: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

96Joker’16: Spark 2 from Zinoviev Alexey

Special Tool from Databricks

Benchmark Tool for SparkSQL

https://github.com/databricks/spark-sql-perf

Page 97: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

97Joker’16: Spark 2 from Zinoviev Alexey

Spark 2 vs Spark 1.6

Page 98: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

98Joker’16: Spark 2 from Zinoviev Alexey

Case #8 : What’s faster: SQL or DataSet API?

Page 99: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

99Joker’16: Spark 2 from Zinoviev Alexey

Job Stages in old Spark

Page 100: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

100Joker’16: Spark 2 from Zinoviev Alexey

Scheduler Optimizations

Page 101: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

101Joker’16: Spark 2 from Zinoviev Alexey

Catalyst Optimizer for DataFrames

Page 102: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

102Joker’16: Spark 2 from Zinoviev Alexey

Unified Logical Plan

DataFrame = Dataset[Row]

Page 103: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

103Joker’16: Spark 2 from Zinoviev Alexey

Bytecode

Page 104: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

104Joker’16: Spark 2 from Zinoviev Alexey

DataSet.explain()

== Physical Plan ==Project [avg(price)#43,carat#45]+- SortMergeJoin [color#21], [color#47]

:- Sort [color#21 ASC], false, 0: +- TungstenExchange hashpartitioning(color#21,200), None: +- Project [avg(price)#43,color#21]: +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as

bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43]): +- TungstenExchange hashpartitioning(cut#20,color#21,200), None: +- TungstenAggregate(key=[cut#20,color#21],

functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])

: +- Scan CsvRelation(-----)+- Sort [color#47 ASC], false, 0

+- TungstenExchange hashpartitioning(color#47,200), None+- ConvertToUnsafe

+- Scan CsvRelation(----)

Page 105: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

105Joker’16: Spark 2 from Zinoviev Alexey

Case #9 : Why does explain() show so many Tungsten things?

Page 106: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

106Joker’16: Spark 2 from Zinoviev Alexey

How to be effective with CPU

• Runtime code generation

• Exploiting cache locality

• Off-heap memory management

Page 107: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

107Joker’16: Spark 2 from Zinoviev Alexey

Tungsten’s goal

Push performance closer to the limits of modern

hardware

Page 108: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

108Joker’16: Spark 2 from Zinoviev Alexey

Maybe something UNSAFE?

Page 109: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

109Joker’16: Spark 2 from Zinoviev Alexey

UnsafeRowFormat

• Bit set for tracking null values

• Small values are inlined

• For variable-length values are stored relative offset into the

variablelength data section

• Rows are always 8-byte word aligned

• Equality comparison and hashing can be performed on raw

bytes without requiring additional interpretation

Page 110: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

110Joker’16: Spark 2 from Zinoviev Alexey

Case #10 : Can I influence on Memory Management in Spark?

Page 111: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

111Joker’16: Spark 2 from Zinoviev Alexey

Case #11 : Should I tune generation’s stuff?

Page 112: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

112Joker’16: Spark 2 from Zinoviev Alexey

Cached

Data

Page 113: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

113Joker’16: Spark 2 from Zinoviev Alexey

During

operations

Page 114: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

114Joker’16: Spark 2 from Zinoviev Alexey

For your

needs

Page 115: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

115Joker’16: Spark 2 from Zinoviev Alexey

For Dark

Lord

Page 116: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

116Joker’16: Spark 2 from Zinoviev Alexey

IN CONCLUSION

Page 117: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

117Joker’16: Spark 2 from Zinoviev Alexey

We have no ability…

• join structured streaming and other sources to handle it

• one unified ML API

• GraphX rethinking and redesign

• Custom encoders

• Datasets everywhere

• integrate with something important

Page 118: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

118Joker’16: Spark 2 from Zinoviev Alexey

Roadmap

• Support other data sources (not only S3 + HDFS)

• Transactional updates

• Dataset is one DSL for all operations

• GraphFrames + Structured MLLib

• Tungsten: custom encoders

• The RDD-based API is expected to be removed in Spark

3.0

Page 119: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

119Joker’16: Spark 2 from Zinoviev Alexey

And we can DO IT!

Real-Time Data-Marts

Batch Data-Marts

Relations Graph

Ontology Metadata

Search Index

Events & Alarms

Real-time Dashboarding

Events & Alarms

All Raw Data backupis stored here

Real-time DataIngestion

Batch DataIngestion

Real-Time ETL & CEP

Batch ETL & Raw Area

Scheduler

Internal

External

Social

HDFS → CFSas an option

Time-Series Data

Titan & KairosDBstore data in Cassandra

Push Events & Alarms (Email, SNMP etc.)

Page 120: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

120Joker’16: Spark 2 from Zinoviev Alexey

First Part

Page 121: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

121Joker’16: Spark 2 from Zinoviev Alexey

Second

Part

Page 122: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

122Joker’16: Spark 2 from Zinoviev Alexey

Contacts

E-mail : [email protected]

Twitter : @zaleslaw @BigDataRussia

Facebook: https://www.facebook.com/zaleslaw

vk.com/big_data_russia Big Data Russia

vk.com/java_jvm Java & JVM langs

Page 123: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured

123Joker’16: Spark 2 from Zinoviev Alexey

Any questions?