Real-Time Analytics with Apache Cassandra and Apache Spark

Transcript of Real-Time Analytics with Apache Cassandra and Apache Spark

  • Real-Time Analytics with Apache Cassandra and Apache Spark

    Guido Schmutz

  • Guido Schmutz

    Working for Trivadis for more than 18 years
    Oracle ACE Director for Fusion Middleware and SOA
    Author of several books
    Consultant, trainer, software architect for Java, Oracle, SOA and Big Data / Fast Data
    Technology Manager @ Trivadis
    More than 25 years of software development experience

    Contact: [email protected] Blog: http://guidoschmutz.wordpress.com Twitter: gschmutz

  • Agenda

    1. Introduction
    2. Apache Spark
    3. Apache Cassandra
    4. Combining Spark & Cassandra
    5. Summary

  • Big Data Definition (4 Vs)

    Characteristics of Big Data: its Volume, Velocity and Variety in combination

    + Time to action? Big Data + Real-Time = Stream Processing

  • What is Real-Time Analytics?

    What is it? Why do we need it?

    How does it work? Collect real-time data and process it as it flows in:
    Data in Motion over Data at Rest; reports and dashboards access the
    processed data (Events -> Analyze -> Respond, over Time)

    Short time to analyze & respond

    Required - for new business models

    Desired - for competitive advantage

  • Real Time Analytics Use Cases

    Algorithmic Trading

    Online Fraud Detection

    Geo Fencing

    Proximity/Location Tracking

    Intrusion detection systems

    Traffic Management

    Recommendations

    Churn detection

    Internet of Things (IoT) / Intelligent Sensors

    Social Media/Data Analytics

    Gaming Data Feed

  • Apache Spark

  • Motivation Why Apache Spark?

    Hadoop MapReduce: Data Sharing on Disk

    Spark: Speed up processing by using Memory instead of Disks

    [Diagram: MapReduce chains map and reduce steps through an HDFS write and
    read between every step; Spark chains operations (op1, op2, ...) on the
    input in memory and writes only the final output]

  • Apache Spark

    Apache Spark is a fast and general engine for large-scale data processing
    The hot trend in Big Data!
    Originally developed in 2009 in UC Berkeley's AMPLab
    Based on the 2007 Microsoft Dryad paper
    Written in Scala; supports Java, Python, SQL and R
    Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
    One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
    Open sourced in 2010; part of the Apache Software Foundation since 2014

  • Apache Spark

    Libraries: Spark SQL (batch processing), BlinkDB (approximate querying),
    Spark Streaming (real-time), MLlib / SparkR (machine learning),
    GraphX (graph processing)

    Core runtime: Spark Core API and execution model

    Cluster resource managers: Spark Standalone, Mesos, YARN

    Data stores: HDFS, S3, Elasticsearch, NoSQL

  • Resilient Distributed Dataset (RDD)

    Are immutable, re-computable, fault tolerant, reusable

    Have transformations, which produce a new RDD; a rich set of
    transformations is available: filter(), flatMap(), map(), distinct(),
    groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ...

    Have actions, which start cluster computing operations; a rich set of
    actions is available: collect(), count(), fold(), reduce(), ...
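
    A minimal sketch of these two concepts (names and data are made up for
    illustration, not from the talk): the transformations only describe new
    RDDs, and the action at the end triggers the actual computation.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-demo").setMaster("local[2]"))

    val numbers = sc.parallelize(1 to 100)    // create an RDD from a collection
    val evens   = numbers.filter(_ % 2 == 0)  // transformation: new RDD, nothing runs yet
    val doubled = evens.map(_ * 2)            // transformation: still lazy

    println(doubled.count())                  // action: executes the whole lineage -> 50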

  • RDD

    [Diagram: an RDD is created from an input source (file, database, stream
    or collection); an action such as .count() over its data returns a
    result, e.g. 100]

  • Partitions RDD

    [Diagrams: an RDD's data is split into partitions (Partition 0-9) that
    are distributed across the servers of the cluster (Server 1-5); when a
    server fails, its partitions are recomputed and redistributed across the
    remaining servers]
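
    A small sketch of how the partition count can be requested and inspected
    (the numbers are illustrative):

    // Explicitly request 10 partitions when creating the RDD
    val data = sc.parallelize(1 to 1000, numSlices = 10)

    println(data.partitions.length)   // -> 10; spread across the executors

    // Repartitioning redistributes the data across the cluster (a shuffle)
    val fewer = data.repartition(4)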

  • Spark Workflow

    [Diagram: sc.hadoopFile() creates a HadoopRDD from an input HDFS file;
    flatMap() and map() turn it into MappedRDDs (Stage 1); reduceByKey()
    produces a ShuffledRDD (Stage 2); saveAsTextFile() writes the text file
    output. The transformations are lazy; the action executes them. The
    DAGScheduler on the master splits the lineage into stages along the
    partition boundaries]
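
    The classic word count expresses exactly this workflow; a minimal sketch
    (the paths are hypothetical):

    val lines  = sc.textFile("hdfs:///input/text")      // HadoopRDD
    val words  = lines.flatMap(_.split(" "))            // Stage 1: narrow transformations
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)               // Stage 2: shuffle across the cluster

    counts.saveAsTextFile("hdfs:///output/wordcount")   // action: triggers the whole job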

  • Spark Workflow

    [Diagram: two HDFS input files are read via SparkContext.hadoopFile();
    the first is passed through filter() and map(), the second through
    map(); join() combines the two MappedRDDs into a ShuffledRDD, which
    saveAsHadoopFile() writes to an HDFS output file. Again the
    transformations are lazy and the action executes them]
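
    A sketch of such a two-input job (paths and the "key,value" line layout
    are made up):

    val left  = sc.textFile("hdfs:///input/file1")
                  .filter(_.nonEmpty)
                  .map { line => val f = line.split(","); (f(0), f(1)) }

    val right = sc.textFile("hdfs:///input/file2")
                  .map { line => val f = line.split(","); (f(0), f(1)) }

    val joined = left.join(right)                     // wide transformation: shuffle
    joined.saveAsTextFile("hdfs:///output/joined")    // action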

  • Spark Execution Model

    [Diagram: a master coordinates worker servers; each worker runs one or
    more executors that process the data in the local data storage]

  • Spark Execution Model

    [Diagram: Stage 1 applies narrow transformations such as filter(), map(),
    sample() and flatMap() to the partitions of an RDD (P0, P1, P3); each
    executor works on its local partition, no data moves between workers]

  • Spark Execution Model

    [Diagram: Stage 2 applies wide transformations such as join(),
    reduceByKey(), union() and groupByKey(); these trigger a shuffle, i.e.
    data is redistributed across the workers of the cluster]
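
    A short sketch contrasting the two kinds of transformations (the data is
    illustrative):

    val rdd = sc.parallelize(1 to 1000)

    // Narrow: each partition is processed independently, no shuffle
    val narrow = rdd.filter(_ % 2 == 0).map(_ * 10)

    // Wide: grouping by key forces a shuffle so that equal keys meet
    val wide = rdd.map(n => (n % 3, n)).reduceByKey(_ + _)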

  • Batch vs. Real-Time Processing

    Batch: Petabytes of Data

    Real-Time: Gigabytes per Second

  • Various Input Sources

  • Apache Kafka

    Distributed publish-subscribe messaging system

    Designed for processing of real-time activity stream data (logs, metrics
    collections, social media streams, ...)

    Initially developed at LinkedIn, now part of Apache

    Does not use the JMS API and standards

    Kafka maintains feeds of messages in topics

    [Diagram: producers publish messages to a Kafka cluster; consumers
    subscribe to topics and read the messages]
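
    A minimal producer sketch using the Kafka Java client from Scala (the
    broker address, the "temperature" topic and the station key are made up):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // hypothetical broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Publish a reading to the topic, keyed by station id
    producer.send(new ProducerRecord[String, String]("temperature", "station-1", "21.5"))
    producer.close()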

  • Apache Kafka

    [Diagram: a weather station publishes to a Temperature topic and a
    Rainfall topic on the Kafka broker; messages 1-6 are appended in order;
    a Temperature processor and a Rainfall processor each consume their topic]

  • Apache Kafka

    [Diagram: the Temperature topic is split into two partitions (Partition 0
    and Partition 1), each an ordered sequence of messages 1-6; two instances
    of the Temperature processor consume the partitions in parallel, while
    the Rainfall topic keeps a single partition]

  • Apache Kafka

    [Diagram: the partitions of the Temperature topic (P0, P1) and of the
    Rainfall topic (P0) are distributed over two Kafka brokers; the
    Temperature processors and the Rainfall processor consume from whichever
    broker hosts their partition]
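
    A matching consumer sketch (broker, group id and topic name are again
    made up); consumers that share a group id split the topic's partitions
    among themselves:

    import java.util.{Arrays, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.collection.JavaConverters._

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // hypothetical broker
    props.put("group.id", "temperature-processor")   // one group shares the partitions
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("temperature"))

    while (true) {                                   // endless poll loop (sketch)
      for (record <- consumer.poll(100).asScala)
        println(s"partition ${record.partition()}: ${record.key()} = ${record.value()}")
    }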

  • Discretized Stream (DStream)

    [Diagrams: weather stations send individual events to Kafka; Spark
    Streaming reads them and discretizes the stream by time, so that each
    time slice of events becomes one RDD: DStream = sequence of RDDs]

  • Discretized Stream (DStream)

    [Diagram: every X seconds a transformation such as map(), join(),
    countByValue() or reduceByKey() is applied to the incoming DStream batch,
    producing a new DStream]

  • Discretized Stream (DStream)

    [Diagram: over increasing time (time 1, time 2, time 3, ..., time n) the
    input stream is cut into batches of messages; each batch forms an RDD of
    the Event DStream; map() turns it into a Mapped DStream, with the
    function applied to every message producing the results;
    saveAsHadoopFiles() is the action. Transformations build up a lineage
    per batch; actions trigger the Spark jobs. Adapted from Chris Fregly:
    http://slidesha.re/11PP7FV]

  • Apache Spark Streaming Core concepts

    Discretized Stream (DStream): the core Spark Streaming abstraction;
    micro-batches of RDDs, with operations similar to those on an RDD

    Input DStreams represent the stream of raw data received from streaming
    sources; data can be ingested from many sources: Kafka, Kinesis, Flume,
    Twitter, ZeroMQ, TCP sockets, Akka actors, etc. Custom sources can easily
    be written for custom data sources

    Operations: same as Spark Core, plus additional stateful transformations
    (window(), reduceByWindow())
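
    A minimal Spark Streaming sketch of these concepts (host, port and the
    interval lengths are made up); a TCP socket source stands in for Kafka
    and the other sources above, and reduceByKeyAndWindow() is one of the
    stateful window transformations:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5s micro-batch interval

    // Input DStream from a (hypothetical) socket source
    val lines = ssc.socketTextStream("localhost", 9999)

    // Same operations as on RDDs, plus a stateful window: word counts over
    // the last 30 seconds, recomputed every 10 seconds
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKeyAndWindow((a: Int, b: Int) => a + b,
                                            Seconds(30), Seconds(10))
    counts.print()

    ssc.start()                // start receiving and processing
    ssc.awaitTermination()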

  • Apache Cassandra