Spark 2: Joker 2016 talk by Alexey Zinovyev (slide transcript)

  • Slide 1: Spark 2

    Alexey Zinovyev, Java/Big Data trainer at EPAM

  • Slide 2: About

    With IT since 2007

    With Java since 2009

    With Hadoop since 2012

    With EPAM since 2015

  • Slide 3: Secret Word from EPAM

    itsubbotnik

  • Slide 4: Contacts

    E-mail: Alexey_Zinovyev@epam.com

    Twitter: @zaleslaw @BigDataRussia

    Facebook: https://www.facebook.com/zaleslaw

    VK: vk.com/big_data_russia (Big Data Russia)

    VK: vk.com/java_jvm (Java & JVM langs)

  • Slide 5: Sprk Dvlprs! Let's start!

  • Slide 6: < SPARK 2.0

  • Slide 7: Modern Java in 2016 / Big Data in 2014

  • Slide 8: Big Data in 2017

  • Slide 9: Machine Learning EVERYWHERE

  • Slide 10: Machine Learning vs Traditional Programming

  • Slide 11 (image only)

  • Slide 12: Something wrong with HADOOP

  • Slide 13: Hadoop is not SEXY

  • Slide 14: Whaaaat?

  • Slide 15: MapReduce Job Writing

  • Slide 16: MR code

  • Slide 17: Hadoop Developers Right Now

  • Slide 18: Iterative Calculations

    10x to 100x faster in Spark, which keeps intermediate results in
    memory instead of writing them to disk between jobs

  • Slides 19-21: MapReduce vs Spark

  • Slide 22: SPARK 2.0 DISCUSSION

  • Slides 23-24: Spark Family

    Case #0 : How to avoid DStreams with RDD-like API?

  • 26Joker16: Spark 2 from Zinoviev Alexey

    Continuous Applications

  • 27Joker16: Spark 2 from Zinoviev Alexey

    Continuous Applications cases

    Updating data that will be served in real time

    Extract, transform and load (ETL)

    Creating a real-time version of an existing batch job

    Online machine learning

  • Slide 28: Write Batches

  • Slide 29: Catch Streaming

  • Slide 30: The main concept of Structured Streaming

    You can express your streaming computation the same way you would
    express a batch computation on static data.

  • Slide 31: Batch

    // Read JSON once from S3
    val logsDF = spark.read.json("s3://logs")

    // Transform with the DataFrame API and save
    logsDF.select("user", "url", "date")
      .write.parquet("s3://out")

  • Slide 32: Real Time

    // Read JSON continuously from S3
    val logsDF = spark.readStream.json("s3://logs")

    // Same transformation; a streaming sink also needs a checkpoint location
    logsDF.select("user", "url", "date")
      .writeStream.format("parquet")
      .option("checkpointLocation", "s3://checkpoints")
      .start("s3://out")
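    One caveat (not on the slide): unlike the batch reader, a streaming
    file source will not infer the JSON schema by default, so in practice
    you pass one explicitly. A minimal sketch, with an assumed schema:

    import org.apache.spark.sql.types.{StringType, StructType}

    val schema = new StructType()
      .add("user", StringType)
      .add("url", StringType)
      .add("date", StringType)

    val logsDF = spark.readStream.schema(schema).json("s3://logs")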

  • Slide 33: Unlimited Table

  • Slides 34-37: WordCount from Socket (the same code built up step by step)

    import spark.implicits._ // needed for .as[String]

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val words = lines.as[String].flatMap(_.split(" "))

    val wordCounts = words.groupBy("value").count()

    Don't forget to start the streaming query (see the sketch below).
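    A minimal sketch of that missing start step, following the standard
    Structured Streaming example (the console sink and "complete" output
    mode are assumptions, not shown on the slide):

    // Start the query: print the full word counts on every update
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination() // block until the query is stopped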

  • Slide 38: WordCount with Structured Streaming

  • Slide 39: Structured Streaming provides

    fast & scalable

    fault-tolerant

    end-to-end, exactly-once semantics

    stream processing

    the ability to use the DataFrame/Dataset API for streaming

  • Slide 40: Structured Streaming provides (in dreams)

    (the same list as on slide 39)

  • Slide 41: Let's UNION streaming and static Datasets

  • Slide 42: Let's UNION streaming and static Datasets

    org.apache.spark.sql.AnalysisException: Union between streaming and
    batch DataFrames/Datasets is not supported;

  • Slide 43: Let's UNION streaming and static Datasets

    Go to UnsupportedOperationChecker.scala and check your operation
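    A minimal sketch that reproduces the error (the socket source and toy
    data are illustrative, not from the deck):

    import spark.implicits._

    // Static (batch) DataFrame with a single "value" column
    val staticDF = Seq("a", "b").toDF("value")

    // Streaming DataFrame with the same schema
    val streamDF = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Rejected during analysis in Spark 2.0 with the exception above
    val combined = streamDF.union(staticDF)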

  • Slide 44: Case #1: We should think about optimization in RDD terms

  • Slide 45: Single-thread collection

  • Slide 46: No perf issues, right?

  • Slide 47: The main concept

    more partitions = more parallelism
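    A minimal sketch of steering parallelism through the partition count
    (the path and numbers are illustrative):

    // More partitions -> more tasks that can run at the same time
    val rdd = spark.sparkContext.textFile("hdfs:///logs", 8) // >= 8 partitions
    println(rdd.getNumPartitions)

    val wider = rdd.repartition(16) // more parallelism, full shuffle
    val fewer = rdd.coalesce(4)     // fewer partitions, avoids a shuffle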

  • Slide 48: Do it in parallel

  • Slide 49: I'd like NARROW

  • Slide 50: Map, filter, filter (narrow transformations)

  • Slide 51: GroupByKey, join (wide transformations; see the sketch below)
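    A minimal sketch of the narrow vs wide distinction (the toy pair RDD
    is illustrative):

    val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Narrow: each output partition depends on one input partition, no shuffle
    val doubled  = pairs.map { case (k, v) => (k, v * 2) }
    val filtered = doubled.filter { case (_, v) => v > 2 }

    // Wide: rows with the same key must meet, so data is shuffled
    val grouped = filtered.groupByKey()
    val joined  = filtered.join(grouped)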

  • Slide 52: Case #2: DataFrames suggest mixing SQL and Scala functions

  • Slides 53-59: History of Spark APIs (the same age filter, shown in
    each style in turn)

    rdd.filter(_.age > 21)          // RDD

    df.filter("age > 21")           // DataFrame, SQL-style string

    df.filter(df.col("age").gt(21)) // DataFrame, expression style

    dataset.filter(_.age > 21)      // Dataset API, typed lambda
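    The practical difference (illustrative, not from the deck): the typed
    Dataset variant fails at compile time, the string variant only at runtime.

    dataset.filter(_.agee > 21) // compile error: value agee is not a member
    df.filter("agee > 21")      // compiles; AnalysisException only at runtime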

  • Slide 60: Case #2: A DataFrame refers to data attributes by name

  • Slides 61-62: Dataset = RDD's types + DataFrame's Catalyst

    RDD API

    compile-time type safety

    off-heap storage mechanism

    performance benefits of the Catalyst query optimizer

    Tungsten

  • Slide 63: Structured APIs in SPARK

  • Slide 64: Unified API in Spark 2.0

    DataFrame = Dataset[Row]

    A DataFrame is now simply an untyped Dataset: it still has a schema,
    but its rows are generic Row objects rather than a user class
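    A minimal sketch of the unification (User is the case class defined on
    the next slide; the SparkSession is assumed to be named spark):

    import org.apache.spark.sql.{DataFrame, Dataset}
    import spark.implicits._

    case class User(email: String, footSize: Long, name: String)

    val df: DataFrame = spark.read.json("/home/tmp/datasets/users.json")
    val ds: Dataset[User] = df.as[User] // attach a type: DataFrame -> Dataset
    val back: DataFrame = ds.toDF()     // drop the type again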

  • Slide 65: Define case class

    case class User(email: String, footSize: Long, name: String)

    // DataFrame -> Dataset of Users
    val userDS =
      spark.read.json("/home/tmp/datasets/users.json").as[User]

    userDS.map(_.name).collect()

    userDS.filter(_.footSize > 38).collect()

    userDS.rdd // IF YOU REALLY WANT

  • Slide 66: Read JSON (repeats the code from slide 65, highlighting the
    spark.read.json step)
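    Worth noting: spark.read.json expects one JSON object per line (JSON
    Lines). A hypothetical users.json matching the User case class:

    {"email": "alice@example.com", "footSize": 39, "name": "Alice"}
    {"email": "bob@example.com", "footSize": 44, "name": "Bob"}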