Kafka pours and Spark resolves

Transcript of Kafka pours and Spark resolves

  • Kafka pours and Spark resolves!

    Alexey Zinovyev, Java/BigData Trainer at EPAM

  • About

    In IT since 2007

    With Java since 2009

    With Hadoop since 2012

    With Spark since 2014

    With EPAM since 2015

  • Spark Streaming from Zinoviev Alexey

    Contacts

    E-mail : Alexey_Zinovyev@epam.com

    Twitter : @zaleslaw @BigDataRussia

    Facebook: https://www.facebook.com/zaleslaw

    vk.com/big_data_russia (Big Data Russia)

    vk.com/java_jvm (Java & JVM langs)

  • Spark Family

  • Pre-summary

    Before Real-Time

    Spark + Cassandra

    Sending messages with Kafka

    DStream Kafka Consumer

    Structured Streaming in Spark 2.1

    Kafka Writer in Spark 2.2

  • < REAL-TIME

  • Modern Java in 2016 / Big Data in 2014

  • Batch jobs produce reports. More and more...

  • But the customer can wait forever (ok, 1 hour)

  • Big Data in 2017

  • Machine Learning EVERYWHERE

  • Data Lake in the promotional brochure

  • Data Lake in production

  • Simple Flow in Reporting/BI systems

  • Let's use Spark. It's fast!

  • MapReduce vs Spark

  • Simple Flow in Reporting/BI systems with Spark

  • Spark handles last year's logs with ease

  • Where can we store events?

  • Let's use Cassandra to store events!

  • Let's use Cassandra to read events!

  • Cassandra

    CREATE KEYSPACE mySpace WITH replication =
      {'class': 'SimpleStrategy', 'replication_factor': 1};

    USE mySpace;

    CREATE TABLE logs (
      application TEXT,
      time TIMESTAMP,
      message TEXT,
      PRIMARY KEY (application, time)
    );
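    For completeness, a couple of sample statements against this table (the application name and timestamp are illustrative):

    ```cql
    -- Write one log event for a hypothetical 'billing' application.
    INSERT INTO logs (application, time, message)
    VALUES ('billing', '2017-05-01 12:00:00', 'Log message');

    -- Read back all events for that application (efficient: application
    -- is the partition key, time is the clustering column).
    SELECT * FROM logs WHERE application = 'billing';
    ```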

  • Cassandra to Spark

    val dataSet = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "logs", "keyspace" -> "mySpace"))
      .load()

    dataSet
      .filter("message = 'Log message'")
      .show()

  • Simple Flow in Pre-Real-Time systems

  • Spark cluster over Cassandra cluster

  • More events every second!

  • SENDING MESSAGES

  • Your Father's Messaging System

    InitialContext ctx = new InitialContext();

    QueueConnectionFactory f =
      (QueueConnectionFactory) ctx.lookup("qCFactory");

    QueueConnection con = f.createQueueConnection();

    con.start();

  • KAFKA

  • Kafka

    messaging system

    distributed

    supports the Publish-Subscribe model

    persists messages on disk

    replicates within the cluster (integrated with ZooKeeper)
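    The publish side of that model can be sketched with the kafka-clients library (broker address and topic name are illustrative, and assume a running broker):

    ```scala
    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ProducerSketch extends App {
      val props = new Properties()
      // The broker address and the key/value serializers are the only required settings.
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

      val producer = new KafkaProducer[String, String](props)
      // send() is asynchronous; close() flushes any pending records.
      producer.send(new ProducerRecord[String, String]("messages", "app1", "Log message"))
      producer.close()
    }
    ```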

  • The main benefits of Kafka

    Scalability with zero downtime

    Zero data loss due to replication

  • A Kafka cluster consists of

    brokers (leader or follower)

    topics (>= 1 partition each)

    partitions

    partition offsets

    replicas of partitions

    producers/consumers
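    These pieces show up together when a topic is created programmatically; a sketch with Kafka's AdminClient (broker address, topic name, and partition/replica counts are illustrative):

    ```scala
    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

    object CreateTopicSketch extends App {
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")

      val admin = AdminClient.create(props)
      // A topic "messages" with 2 partitions, each replicated to 3 brokers;
      // one replica of each partition becomes the leader.
      val topic = new NewTopic("messages", 2, 3.toShort)
      admin.createTopics(Collections.singleton(topic)).all().get()
      admin.close()
    }
    ```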

  • Kafka Components with topic messages #1

    [Diagram: producer threads #1-#3 writing Data into the topic Messages; ZooKeeper coordinates the cluster]

  • Kafka Components with topic messages #2

    [Diagram: producer threads #1-#3 writing Data into the topic Messages, split into partitions #1 and #2; Broker #1 holds the leader of both partitions; ZooKeeper coordinates the cluster]

  • Kafka Components with topic messages #3

    [Diagram: producer threads #1-#3 writing Data into the topic Messages; partitions #1 and #2 are replicated across Broker #1 and Broker #2; ZooKeeper coordinates the cluster]

  • Why do we need ZooKeeper?

  • Kafka Demo

  • REAL TIME WITH DSTREAMS

  • RDD Factory

  • From socket to console with DStreams

  • DStream

    val conf = new SparkConf().setMaster("local[2]")

    .setAppName("NetworkWordCount")

    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)

    val words = lines.flatMap(_.split(" "))

    val pairs = words.map(word => (word, 1))

    val wordCounts = pairs.reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()

    ssc.awaitTermination()


  • Kafka as a main entry point for Spark
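    The deck's "DStream Kafka Consumer" step can be sketched with the spark-streaming-kafka-0-10 direct stream; broker address, group id, and topic name are illustrative:

    ```scala
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    object KafkaDStreamSketch extends App {
      val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaWordCount")
      val ssc = new StreamingContext(conf, Seconds(1))

      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "localhost:9092",
        "key.deserializer"  -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "spark-demo")

      // Subscribe to the "messages" topic and count words per 1-second batch,
      // mirroring the socket word-count example above.
      val stream = KafkaUtils.createDirectStream[String, String](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Seq("messages"), kafkaParams))

      stream.map(_.value)
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)
        .print()

      ssc.start()
      ssc.awaitTermination()
    }
    ```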

  • DStreams Demo

  • How to avoid DStreams with an RDD-like API?

  • SPARK 2.2 DISCUSSION

  • Continuous Applications

  • Continuous Applications cases

    Updating data that will be served in real time

    Extract, transform and load (ETL)

    Creating a real-time version of an existing batch job

    Online machine learning
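    The "real-time version of an existing batch job" case is the one Structured Streaming (Spark 2.1+) targets directly: a Kafka topic is read with the same DataFrame API a batch job would use. A minimal sketch, assuming the spark-sql-kafka-0-10 package and an illustrative broker and topic:

    ```scala
    import org.apache.spark.sql.SparkSession

    object StructuredKafkaSketch extends App {
      val spark = SparkSession.builder
        .master("local[2]")
        .appName("StructuredKafka")
        .getOrCreate()

      // The same DataFrame operations work on the stream as on a static table.
      val messages = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "messages")
        .load()
        .selectExpr("CAST(value AS STRING) AS message")

      // Print each micro-batch to the console.
      val query = messages.writeStream
        .format("console")
        .start()

      query.awaitTermination()
    }
    ```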

  • The main concept of Structured Streaming

    You can