Apache Cassandra and Spark for simple, distributed, near real-time stream processing


  • Data at the Speed of your Users: Apache Cassandra and Spark for simple, distributed, near real-time stream processing.

    GOTO Copenhagen 2014

  • Rustam Aliyev

    Solution Architect

    @rstml

  • Big Data?

    Photo: Flickr / Watches En Masse

  • Volume

    Variety

    Velocity

  • Velocity = Near Real Time

  • Near Real Time?

  • Near Real Time: roughly 0.5 sec to 60 sec

  • Use Cases

    Photo: Flickr / Swiss Army / Jim Pennucci

    Web Analytics, Dynamic Pricing,

    Recommendation, Fraud Detection

  • Architecture

    Photo: Ilkin Kangarli / Baku Haydar Aliyev Center

  • Architecture Goals

    Low Latency

    High Availability

    Horizontal Scalability

    Simplicity

  • Stream Processing

    Collection → Processing → Storing → Delivery

  • Stream Processing

    Collection → Spark (processing) → Cassandra (storing) → Delivery

  • Cassandra Distributed Database

    Photo: Flickr / Hypostyle Hall / Jorge Láscar

  • Data Model

  • Partition

    Partition Key → Cell 1 | Cell 2 | Cell 3

  • Partition

    Nexus 5 (partition key) → os: Android | storage: 32GB | version: 4.4 | weight: 130g

    (cells kept in sort order on disk)

  • Table

    Nexus 5  → os: Android | storage: 32GB | version: 4.4 | weight: 130g

    iPhone 6 → os: iOS | storage: 64GB | version: 8.0 | weight: 129g

    Other    → os: … | memory: … | version: … | weight: …
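
    The phone rows above map onto a CQL schema roughly like the following; a minimal sketch, assuming a pre-existing keyspace named demo_ks and a table named phones (both hypothetical), created here through the DataStax connector's CassandraConnector helper, which is introduced later in the deck:

      import com.datastax.spark.connector.cql.CassandraConnector
      import org.apache.spark.SparkConf

      val conf = new SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")

      // One partition per phone model; the remaining columns become the cells of that
      // partition and are kept in sorted order on disk (os, storage, version, weight).
      CassandraConnector(conf).withSessionDo { session =>
        session.execute("""
          CREATE TABLE IF NOT EXISTS demo_ks.phones (
            model   text PRIMARY KEY,   -- partition key, e.g. 'Nexus 5'
            os      text,
            storage text,
            version text,
            weight  text
          )""")
        session.execute(
          "INSERT INTO demo_ks.phones (model, os, storage, version, weight) " +
          "VALUES ('Nexus 5', 'Android', '32GB', '4.4', '130g')")
      }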

  • Distribution

  • [Diagram: token ring with positions 0000, 2000, 4000, 6000, 8000, A000, C000, E000; the partition key "Nexus 5" hashes to token 3D97 and is placed on the node owning that range.]

  • [Diagram: the same ring; "iPhone 6" hashes to token 9C4F, so the two partitions land on different nodes.]

  • Replication

  • [Diagram: token ring with 1 replica; partitions 3D97 and 9C4F are each stored on a single node.]

  • [Diagram: token ring with 2 replicas; each partition (3D97, 9C4F) is also copied to a second node on the ring.]

  • Spark Distributed Data Processing Engine

    Photo: Flickr / Sparklers / Alexandra Compo / CreativeCommons

  • Fast

    In-memory

  • Logistic Regression

    [Chart: logistic regression running time (s) vs. number of iterations (1, 5, 10, 20, 30), comparing Spark and Hadoop; y-axis up to 4000 s. Hadoop's running time grows with the iteration count while Spark's stays much lower thanks to in-memory caching.]

  • Easy

  • map, reduce

  • map filter groupBy sort union join leftOuterJoin rightOuterJoin

    reduce count fold reduceByKey groupByKey cogroup cross zip

    sample take first partitionBy mapWith pipe save ...
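
    A minimal sketch of a few of these operations, assuming a local SparkContext (the data and names here are illustrative, not from the talk):

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.SparkContext._   // pair-RDD implicits on Spark versions contemporary with this talk

      val sc = new SparkContext(
        new SparkConf().setMaster("local[2]").setAppName("rdd-sketch"))

      // Word count using map, reduceByKey and filter from the list above
      val lines  = sc.parallelize(Seq("spark and cassandra", "spark streaming"))
      val words  = lines.flatMap(_.split(" "))
      val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
      val common = counts.filter { case (_, n) => n > 1 }.collect()
      // common: Array((spark,2))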

  • RDD Resilient Distributed Datasets

    [Diagram: an RDD's partitions spread across Node 1, Node 2 and Node 3; if a node fails, its partitions can be recomputed on the remaining nodes.]

  • Operator DAG

    [Diagram: operator DAG with map and filter feeding join and groupBy, mixing disk-backed RDDs and in-memory RDDs.]

  • Spark Streaming

    Micro-batching

  • Data Stream → DStream (a sequence of RDDs, one per micro-batch)
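
    A minimal sketch of micro-batching, assuming a local socket source on port 9999 (source and port are illustrative): every 2-second batch arrives as one RDD of the DStream.

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-sketch")
      val ssc  = new StreamingContext(conf, Seconds(2))   // 2-second micro-batches

      val lines = ssc.socketTextStream("localhost", 9999)
      lines.foreachRDD { (rdd, time) =>
        // Each micro-batch is handed to the application as an ordinary RDD
        println(s"batch at $time contains ${rdd.count()} records")
      }

      ssc.start()
      ssc.awaitTermination()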

  • Spark + Cassandra

    DataStax Spark Cassandra Connector

  • https://github.com/datastax/spark-cassandra-connector
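
    A minimal read/write round trip with the connector; a sketch assuming the demo_ks.hashtags table defined later in the talk and a connector version where the column list is given as SomeColumns:

      import com.datastax.spark.connector._
      import org.apache.spark.{SparkConf, SparkContext}

      val conf = new SparkConf()
        .setMaster("local[2]")
        .setAppName("connector-sketch")
        .set("spark.cassandra.connection.host", "127.0.0.1")
      val sc = new SparkContext(conf)

      // Expose a Cassandra table as an RDD[CassandraRow] ...
      val rows = sc.cassandraTable("demo_ks", "hashtags")
      rows.take(3).foreach(println)

      // ... and write an RDD of tuples back, mapping fields to columns by name
      val update = sc.parallelize(Seq(("#iphone", 1L, "ALL")))
      update.saveToCassandra("demo_ks", "hashtags",
        SomeColumns("hashtag", "mentions", "interval"))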

  • [Diagram: co-located deployment; a Spark Worker runs on every Cassandra node, and the Spark Master runs alongside one of the workers.]

  • Demo! Twitter Analytics

  • Cassandra Data Model

  • #hashtag (partition key) → ALL: 7139 | 2014-09-21: 220 | 2014-09-20: 309 | 2014-09-19: 129

    (intervals in clustering sort order, descending)

  • CREATE TABLE hashtags (
        hashtag  text,
        interval text,
        mentions counter,
        PRIMARY KEY ((hashtag), interval)
    ) WITH CLUSTERING ORDER BY (interval DESC);
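
    A sketch of reading the counters back, for example for a dashboard, assuming the same SparkContext configuration as the streaming job below; the query shape follows the table's primary key:

      import com.datastax.spark.connector._

      // All interval counters for one hashtag, newest first (per the clustering order)
      val iphoneCounts = sc.cassandraTable("demo_ks", "hashtags")
        .select("interval", "mentions")
        .where("hashtag = ?", "#iphone")

      iphoneCounts.collect().foreach { row =>
        println(s"${row.getString("interval")}: ${row.getLong("mentions")}")
      }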

  • Processing Data Stream

  • import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils
    import com.datastax.spark.connector.streaming._

    val sc = new SparkConf()
      .setMaster("spark://127.0.0.1:7077")
      .setAppName("Twitter-Demo")
      .setJars(Seq("demo-assembly-1.0.jar"))
      .set("spark.cassandra.connection.host", "127.0.0.1")

    val ssc = new StreamingContext(sc, Seconds(2))

    val stream = TwitterUtils.createStream(
      ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2)

    // Keep only the hashtags we are tracking
    val tags = Seq("#iphone", "#android")
    val hashTags = stream.flatMap(tweet =>
      tweet.getText.toLowerCase.split(" ").filter(word => tags.contains(word)))

    // Mentions per tag within each 2-second micro-batch
    val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _)

    // All-time counters are written with interval = "ALL"
    val tagCountsAll = tagCounts.map { case (tag, mentions) =>
      (tag, mentions, "ALL")
    }
    tagCountsAll.saveToCassandra(
      "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval"))

    ssc.start()
    ssc.awaitTermination()

  • The per-day counters reuse the same pipeline; only the value written to the interval column changes:

    import org.joda.time.DateTime

    val tagCountsByDay = tagCounts.map { case (tag, mentions) =>
      (tag, mentions, DateTime.now.toString("yyyyMMdd"))
    }
    tagCountsByDay.saveToCassandra(
      "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval"))


  • Questions?