Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra

  • Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra

    Natalino Busa, Data Platform Architect at ING

  • ING group

    http://www.ing.com/About-us/Purpose-Strategy.htm

    https://twitter.com/natbusa

    https://nl.linkedin.com/in/natalinobusa

  • ING group

    Empowering people to stay a step ahead in life and in business.

    http://www.ing.com/About-us/Purpose-Strategy.htm

  • ING group

    http://www.ing.com/About-us/Purpose-Strategy.htm

    Clear and Easy

    Anytime, Anywhere

    Empower

    Keep getting better

  • Apply advanced, predictive analytics on live data

    Event-Driven and exposed via APIs

    Lean Architecture, Easy to integrate

    Available, Consistent, Streaming, Real-time Data

    Resilient, Distributed, Scalable, Maintainable

    Clear and Easy

    Anytime, Anywhere

    Empower

    Keep getting better

    Data Principles

    ING group

  • Big Data and Fast Data

    population: events, transactions, sessions, customers, etc.

  • Why Fast Data?

    1. Relevant up-to-date information.
    2. Delivers actionable events.

  • Why Big Data?

    1. Analyze and model
    2. Learn, cluster, categorize, organize facts

  • Training, Scoring and Exposing models

    Streaming Data: Real Time APIs

    Batched Data: Data Sources, Files, DB extracts

  • Cassandra+Akka+Spark: Machine Learning

    Cassandra (C*): fast writes, 2D data structure, replicated, tunable consistency, multi-data centers

    Akka: very fast processing, distributed and scalable computing, actor-based pipelines, actor state can be persisted, supervision strategies

    Spark: ad-hoc queries, joins and aggregates, user-defined functions, machine learning, advanced stats and analytics

  • Akka-Cassandra-Spark Stack

    Cassandra-Spark Connector

    Cassandra

    Spark: Streaming, SQL, MLlib, GraphX

    Extract Data

    Create Models, Enrich, Transform

    Fetch from other Sources: Kafka

    Fetch from other Sources: DBs, Files

    Akka

    Analytics, Statistics, Data Science, Model Training

    Access Model

    Persist Actors State

  • Cassandra-Spark Connector

    Cassandra: Store all the data
    Spark: Analyze all the data

    DC1: replication factor 3
    DC2: replication factor 3
    DC3: replication factor 3 + Spark Executors
    Storage! Analytics!

    Data

  • Data Science: Anomaly Detection

    An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.

    Hawkins, 1980

  • Data Science: Anomaly Detection

    Distance Based

    Density Based

  • Example: Analyze Gowalla check-ins

    Check-ins dataset

     year | month | day | time | uid  | lat      | lon       | ts                       | vid
    ------+-------+-----+------+------+----------+-----------+--------------------------+--------
     2010 |     9 |  14 |   91 |  853 | 40.73474 | -73.87434 | 2010-09-14 00:01:31+0000 | 917955
     2010 |     9 |  14 |  328 | 4516 | 40.72585 | -73.99289 | 2010-09-14 00:05:28+0000 |  37160
     2010 |     9 |  14 |  344 | 2964 | 40.67621 | -73.98405 | 2010-09-14 00:05:44+0000 | 956870

    Venues dataset

     vid     | name                       | lat      | long
    ---------+----------------------------+----------+-----------
      754108 | My Suit NY                 | 40.73474 | -73.87434
      249755 | UA Court Street Stadium 12 | 40.72585 | -73.99289
     6919688 | Sky Asian Bistro           | 40.67621 | -73.98405

  • Data Science: clustering venues

    Intuition: weekly visitor patterns!

    Madison Square, Apple Store, Radio City Music Hall: Thursdays, Fridays and Saturdays are busy

    Statue of Liberty, Jacob K. Javits Convention Center, Whole Foods Market (Columbus Circle): not popular midweek

  • Data Science: clustering with k-means

    Intuition:

    Histogram components as dimensions

    Similar histograms occupy nearby places in the feature space

    How do I compare histograms?
    - Earth mover's distance (EMD)
    - Chi-squared distance
    - Space transformation (discrete cosine transform, DCT)
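
    Of the comparison options listed, the chi-squared distance is the shortest to write down. A minimal sketch in plain Scala, assuming the common symmetric form 1/2 * sum((p_i - q_i)^2 / (p_i + q_i)); the function names and the sample visit counts are illustrative and do not come from the talk.

```scala
// Chi-squared distance between two histograms with the same number of bins.
// Bins that are empty in both histograms contribute nothing.
def chiSquared(p: Seq[Double], q: Seq[Double]): Double = {
  require(p.length == q.length, "histograms must have the same number of bins")
  0.5 * p.zip(q).map { case (a, b) =>
    if (a + b == 0.0) 0.0 else (a - b) * (a - b) / (a + b)
  }.sum
}

// Scale a histogram of raw counts so that its bins sum to 1.
def normalize(h: Seq[Double]): Seq[Double] = {
  val total = h.sum
  if (total == 0.0) h else h.map(_ / total)
}

// Hypothetical weekly visit counts, Monday to Sunday.
val weekendVenue = normalize(Seq(1.0, 1.0, 1.0, 2.0, 5.0, 6.0, 4.0))
val midweekVenue = normalize(Seq(5.0, 6.0, 5.0, 4.0, 1.0, 1.0, 1.0))
```

    Identical histograms are at distance 0, and the distance grows as the weekly shapes diverge. Note that k-means itself compares points with Euclidean distance, so a measure like this would be used alongside it, not inside it.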

  • K-Means: Featurize data + cluster

    // PairRDD: weekly pattern per venue
    // vectorize_time and featurize_histogram are the talk's own helpers (not shown)
    val weekly_visits = checkins_venues.select("vid", "ts")
      .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
      .reduceByKey(_ + _)
      .mapValues(v => featurize_histogram(v._1))

    val numClusters = 15
    val numIterations = 100

    // Cluster similar weekly patterns.
    // KMeans.train expects an RDD[Vector], so the venue keys are dropped here.
    val clusters = KMeans.train(weekly_visits.values, numClusters, numIterations)
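
    The talk does not show vectorize_time or featurize_histogram. The following is a minimal sketch of what such helpers might look like, written against plain Scala collections so that it runs without Spark. The names vectorizeTime, add, featurizeHistogram and weeklyVisits are hypothetical, and the 7-bin day-of-week layout in UTC is an assumption about the feature space.

```scala
import java.time.{Instant, ZoneOffset}

// Hypothetical stand-in for the talk's vectorize_time: a 7-bin one-hot vector,
// index 0 = Monday ... 6 = Sunday, computed in UTC.
def vectorizeTime(ts: Instant): Vector[Double] = {
  val day = ts.atZone(ZoneOffset.UTC).getDayOfWeek.getValue - 1
  Vector.tabulate(7)(i => if (i == day) 1.0 else 0.0)
}

// Element-wise sum, the role played by reduceByKey(_ + _) on the slide.
def add(a: Vector[Double], b: Vector[Double]): Vector[Double] =
  a.zip(b).map { case (x, y) => x + y }

// Hypothetical stand-in for featurize_histogram: scale counts to fractions.
def featurizeHistogram(counts: Vector[Double]): Vector[Double] = {
  val total = counts.sum
  if (total == 0.0) counts else counts.map(_ / total)
}

// Weekly pattern per venue from (vid, timestamp) pairs.
def weeklyVisits(checkins: Seq[(Long, Instant)]): Map[Long, Vector[Double]] =
  checkins
    .groupBy(_._1)
    .map { case (vid, rows) =>
      (vid, featurizeHistogram(rows.map(r => vectorizeTime(r._2)).reduce(add)))
    }
```

    Normalizing to fractions keeps busy and quiet venues comparable, so that clusters reflect the shape of the week and not the total number of visits.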

  • How to use it

    1) Classification: classify venues into given groups

    2) Anomaly detection: detect a shift in the cluster assignment of a given venue in a given week. Keep monitoring the weekly change in patterns; when a shift happens, trigger a signal.

    week 26 week 27

    Action
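
    The week-over-week check described here can be sketched as follows, assuming the centroids of a previously trained model are available as plain sequences. sqDist, assign and shiftedVenues are illustrative names and not part of MLlib.

```scala
// Squared Euclidean distance between two feature vectors of equal length.
def sqDist(a: Seq[Double], b: Seq[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the nearest centroid, the same assignment rule k-means uses.
def assign(centroids: Seq[Seq[Double]], features: Seq[Double]): Int =
  centroids.indices.minBy(i => sqDist(centroids(i), features))

// Venues whose cluster assignment differs between two weeks.
// Returns vid -> (cluster in previous week, cluster in current week).
def shiftedVenues(
    centroids: Seq[Seq[Double]],
    previous: Map[Long, Seq[Double]],
    current: Map[Long, Seq[Double]]): Map[Long, (Int, Int)] =
  (for {
    (vid, features) <- current.toSeq
    old <- previous.get(vid).toSeq
    before = assign(centroids, old)
    after = assign(centroids, features)
    if before != after
  } yield (vid, (before, after))).toMap
```

    Each entry in the result is a candidate signal: the venue behaved like one cluster last week and like another this week. Venues that appear in only one of the two weeks are skipped in this sketch.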

  • Data Science: clustering users venues

    Intuition:

    Users tend to stick to the same places: people have habits

    By clustering the places together, we can identify anomalous locations

    The size of the cluster matters: more points means less anomalous

    Mini-clusters and single anomalies are treated in similar ways ...
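
    One way to turn this intuition into a number is sketched below. The scoring formula and the minSize threshold are assumptions made for illustration; the talk does not give a formula.

```scala
// Cluster labels for one user's check-ins: Some(id) for a clustered point,
// None for a point the clustering step marked as noise.
// Score in [0, 1]: 1.0 for noise and for clusters below minSize (mini-clusters
// are treated like single anomalies); otherwise 1 - (cluster size / total).
def anomalyScores(labels: Seq[Option[Int]], minSize: Int): Seq[Double] = {
  val sizes: Map[Int, Int] =
    labels.flatten.groupBy(identity).map { case (id, xs) => (id, xs.size) }
  labels.map {
    case Some(id) if sizes(id) >= minSize => 1.0 - sizes(id).toDouble / labels.size
    case _ => 1.0
  }
}
```

    A check-in in the user's largest cluster gets the lowest score, while an isolated check-in or one in a tiny cluster gets the maximum score of 1.0.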

  • Data Science: clustering with DBSCAN

    DBSCAN finds clusters based on neighbouring density

    Does not require the number of clusters k beforehand

    Clusters need not be spherical
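
    To make the density idea concrete, here is a textbook-style DBSCAN sketch over 2D points in plain Scala. It is not the API of any particular library; the parameter names eps and minPts follow common usage, and the quadratic neighbour search is only reasonable for the small per-user point sets discussed in this talk.

```scala
import scala.collection.mutable

def dist(a: (Double, Double), b: (Double, Double)): Double =
  math.hypot(a._1 - b._1, a._2 - b._2)

// Returns one label per input point: a cluster id >= 0, or -1 for noise.
// eps is the neighbourhood radius; minPts is the minimum neighbourhood size
// (the point itself included) for a point to count as a core point.
def dbscan(points: IndexedSeq[(Double, Double)], eps: Double, minPts: Int): IndexedSeq[Int] = {
  val unvisited = -2
  val noise = -1
  val labels = Array.fill(points.length)(unvisited)

  def neighbours(i: Int): Seq[Int] =
    points.indices.filter(j => dist(points(i), points(j)) <= eps)

  var cluster = -1
  for (i <- points.indices) {
    if (labels(i) == unvisited) {
      val seeds = neighbours(i)
      if (seeds.size < minPts) {
        labels(i) = noise // may later be claimed as a border point
      } else {
        cluster += 1
        labels(i) = cluster
        val queue = mutable.Queue.empty[Int]
        queue ++= seeds
        while (queue.nonEmpty) {
          val j = queue.dequeue()
          if (labels(j) == noise) labels(j) = cluster // border point
          if (labels(j) == unvisited) {
            labels(j) = cluster
            val next = neighbours(j)
            if (next.size >= minPts) queue ++= next // j is a core point: expand
          }
        }
      }
    }
  }
  labels.toIndexedSeq
}
```

    The number of clusters falls out of the data: dense groups become clusters and isolated points are labelled -1, which is what makes the algorithm a natural fit for spotting unusual locations.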

  • Data Science: clustering users venues

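    // Collect each user's (lat, lon) points, then cluster them locally per user.
    // dbscan is a local clustering function, not shown on the slide.
    val locs = checkins_venues.select("uid", "lat", "lon")
      .map(s => (s.getLong(0), Seq((s.getDouble(1), s.getDouble(2)))))
      .reduceByKey(_ ++ _)
      .mapValues(dbscan(_))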

    Have a look at: scalanlp/nak (https://github.com/scalanlp/nak)

  • Data Science:

    Two ways to find anomalies with clustering

    - Cluster a large amount of data with k-means and histograms

    - Apply clustering independently to millions of users, identifying each user's patterns with the DBSCAN algorithm

  • MLlib vs PairRDDs

    MLlib: KMeans.train(FeaturesRDD, numClusters, numIterations)
    - Truly distributed algorithm
    - Classify venues to given groups
    - Millions of datapoints, limited number of clusters

    PairRDDs: UserFeaturesPairRDD.groupByKey().mapValues( dbscan(_) )
    - RDD map functions: parallelism is easy to exploit
    - The function runs locally for each key
    - Pick your favorite machine learning algorithm
    - Limited number of points per key, running in parallel for millions of keys
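
    The PairRDD pattern has the same shape on ordinary Scala collections, which makes it easy to try out locally. In this sketch perKey and centroid are hypothetical names; centroid merely stands in for whatever local algorithm is run per key (DBSCAN in the talk).

```scala
// Run an arbitrary local algorithm once per key.
// On a Spark pair RDD the same shape is groupByKey().mapValues(f).
def perKey[K, V, R](rows: Seq[(K, V)])(f: Seq[V] => R): Map[K, R] =
  rows.groupBy(_._1).map { case (k, kvs) => (k, f(kvs.map(_._2))) }

// Hypothetical local algorithm: centroid of one user's (lat, lon) points.
def centroid(pts: Seq[(Double, Double)]): (Double, Double) =
  (pts.map(_._1).sum / pts.size, pts.map(_._2).sum / pts.size)
```

    Because the function only ever sees the values of a single key, any single-machine algorithm can be plugged in, provided each key's data fits comfortably in memory.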
