Feeding Cassandra with Spark-Streaming and Kafka


  • Feeding Cassandra with Spark Streaming & Kafka

    Cary Bourgeois, Solutions Engineer, DataStax, Central Region

  • Who Am I?

    DataStax < 2 years; not a developer; legacy BI/database background
    (Business Objects, SAP)

    Demo development: R, Java (if I have to), Scala (someday)

  • Cassandra Summit 2015

    September 22-24, Santa Clara Convention Center
    7,000 attendees

  • Last Week - Mission Impossible? A stretch, but possible.

  • Sunday Afternoon - I'm getting my A$$ kicked

  • Monday Afternoon - Arghhhhh!

  • Monday Night - I got this!

  • Capture Raw Data

    Analyze & Summarize

  • Why Mess with Success?

    Spark 1.3+: new/improved Kafka support, DataFrames
    DataStax Enterprise 4.8: Spark 1.4 support

    https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
    https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
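The deck itself stops at the bullet points, but a minimal sketch of the DataFrames angle might look like the following: reading a Cassandra table (demo.data, defined later in the talk) into a DataFrame through the spark-cassandra-connector's data source. The app name and host address are placeholders, and connector 1.4.x on the classpath is assumed.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object DataFrameSketch {
      def main(args: Array[String]): Unit = {
        // Placeholder host; assumes the spark-cassandra-connector is on the classpath.
        val conf = new SparkConf()
          .setAppName("DataFrameSketch")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

        // Read the demo.data table (defined later in the deck) as a DataFrame.
        val df = sqlContext.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "demo", "table" -> "data"))
          .load()

        // DataFrame-style aggregate: average sensor value per edge.
        df.groupBy("edge_id").avg("value").show()
      }
    }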

  • Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

    Fast: a single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.

    Scalable: Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.

    Durable: messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.

    Distributed by design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.

  • Producers, Consumers, Persistence, Topics, Partitions, Replication

    http://kafka.apache.org/documentation.html

  • Create a Kafka topic

        bin/kafka-topics.sh --zookeeper localhost:2181 --create --replication-factor 1 --partitions 1 --topic stream_ts

    List all topics

        bin/kafka-topics.sh --zookeeper localhost:2181 --list

    Monitor a topic

        bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic stream_ts --from-beginning

  • Kafka and the Producer

  • The Producer App

    Lots of options; I chose Scala (not steep enough) and Akka.

    Producing this message:

        Edge 1;1;401843;2015-11-04 06:23:49.001;64.44286233060423;82.79653847181152
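The talk's actual producer is an Akka app (code in the repo linked at the end). As a rough stand-in, here is a minimal Kafka 0.8-era producer that pushes that same semicolon-delimited string to the stream_ts topic created in the CLI examples above. The field order (edge_id;sensor;epoch_hr;ts;depth;value) is an assumption based on the table definitions on the next slide.

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    object ProducerSketch {
      def main(args: Array[String]): Unit = {
        // Assumes a Kafka 0.8.x broker on localhost:9092.
        val props = new Properties()
        props.put("metadata.broker.list", "localhost:9092")
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        val producer = new Producer[String, String](new ProducerConfig(props))

        // Assumed field layout: edge_id;sensor;epoch_hr;ts;depth;value
        val msg = "Edge 1;1;401843;2015-11-04 06:23:49.001;64.44286233060423;82.79653847181152"
        producer.send(new KeyedMessage[String, String]("stream_ts", msg))
        producer.close()
      }
    }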

  • Destination - Cassandra Tables

    CREATE TABLE demo.data (
        edge_id text,
        sensor text,
        epoch_hr text,
        ts timestamp,
        depth double,
        value double,
        PRIMARY KEY ((edge_id, sensor, epoch_hr), ts)
    )

    CREATE TABLE demo.last (
        edge_id text,
        sensor text,
        ts timestamp,
        depth double,
        value double,
        PRIMARY KEY ((edge_id, sensor))
    )

    CREATE TABLE demo.count (
        pk int,
        ts timestamp,
        count bigint,
        count_ma double,
        PRIMARY KEY (pk, ts)
    )
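Not from the deck, but for context: once a message is parsed into a case class whose field names mirror these columns, the spark-cassandra-connector writes it with a one-line saveToCassandra. A minimal sketch, assuming connector 1.4.x:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Field names mirror the demo.data columns; the connector's default
    // mapper translates camelCase (epochHr) to snake_case (epoch_hr).
    case class Reading(edgeId: String, sensor: String, epochHr: String,
                       ts: java.util.Date, depth: Double, value: Double)

    object WriteSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("WriteSketch")
          .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
        val sc = new SparkContext(conf)

        val readings = sc.parallelize(Seq(
          Reading("Edge 1", "1", "401843", new java.util.Date(), 64.44, 82.79)))

        // Each Reading becomes one row in demo.data.
        readings.saveToCassandra("demo", "data")
      }
    }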

  • DSE Analytics => Spark

    No ETL
    Spark 1.4.1 certification
    Simplified map and reduce; very developer-friendly
    SparkSQL, Spark Streaming, Machine Learning
    DSE Analytics and Search integration
    Cassandra benefits (scaling, availability)

    I want to do processing on data before it hits Cassandra. I need my sums, avgs, group-bys, etc. I want to run real-time analytics on my Cassandra data (see the sketch below).
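As a hedged illustration of the "sums, avgs, group-bys" point (not the author's code), the connector's CassandraSQLContext lets SparkSQL address Cassandra tables as keyspace.table; the aggregation below runs in Spark, not in Cassandra:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.cassandra.CassandraSQLContext

    object SqlSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("SqlSketch")
          .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
        val sc = new SparkContext(conf)

        // CassandraSQLContext (connector 1.4 era) resolves demo.data
        // directly against Cassandra.
        val csc = new CassandraSQLContext(sc)

        val perSensor = csc.sql(
          """SELECT edge_id, sensor, avg(value) AS avg_value, sum(value) AS sum_value
            |FROM demo.data
            |GROUP BY edge_id, sensor""".stripMargin)

        perSensor.show()
      }
    }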

  • Processing the Stream

    Simple Scala job to deal with the raw flow: capture the raw data, capture the latest sensor reading, summarize and analyze.

    Windowing the stream: count records every x seconds; calculate a moving average of every x seconds over a number of periods. (Sketched below.)
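The real job lives in the repo linked at the end; this is a condensed sketch of the same shape, assuming a Kafka broker on localhost:9092, the stream_ts topic, and the demo.* tables above. It combines the Spark 1.3+ direct Kafka stream with the connector's DStream writes; the moving-average calculation is omitted.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import com.datastax.spark.connector._
    import com.datastax.spark.connector.streaming._

    // Field names mirror the demo.data columns (assumed message layout).
    case class Reading(edgeId: String, sensor: String, epochHr: String,
                       ts: java.util.Date, depth: Double, value: Double)

    object StreamSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("StreamSketch")
          .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
        val ssc = new StreamingContext(conf, Seconds(1))

        // Spark 1.3+ direct (receiver-less) Kafka stream -- the "new/improved"
        // integration referenced earlier in the deck.
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, Map("metadata.broker.list" -> "localhost:9092"), Set("stream_ts"))

        // Parse the semicolon-delimited messages produced above.
        val readings = stream.map(_._2.split(";")).map { f =>
          val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
          Reading(f(0), f(1), f(2), fmt.parse(f(3)), f(4).toDouble, f(5).toDouble)
        }

        // Capture the raw data.
        readings.saveToCassandra("demo", "data")

        // Latest reading per (edge_id, sensor): demo.last's primary key makes
        // each new write overwrite the previous row.
        readings.saveToCassandra("demo", "last",
          SomeColumns("edge_id", "sensor", "ts", "depth", "value"))

        // Count records over a sliding 10-second window, every 5 seconds.
        readings.window(Seconds(10), Seconds(5)).count()
          .map(c => (1, new java.util.Date(), c))
          .saveToCassandra("demo", "count", SomeColumns("pk", "ts", "count"))

        ssc.start()
        ssc.awaitTermination()
      }
    }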

  • Full Demo

  • Next Steps

    SparkR, MLlib workflows
    Notebooks (Spark, Jupyter)

  • If you would like the code:

    https://github.com/CaryBourgeois/KafkaSparkCassandraDemo