Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra by Natalino Busa


  • Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra

    Natalino Busa

    Data Platform Architect at ING

  • Distributed Computing, Machine Learning, Statistics, Big/Fast Data, Streaming Computing

    @natbusa | linkedin.com/in/natalinobusa

  • ING group

    http://www.ing.com/About-us/Purpose-Strategy.htm

    https://twitter.com/natbusa

    https://nl.linkedin.com/in/natalinobusa

  • ING group

    Empowering people to stay a step ahead in life and in business.

    http://www.ing.com/About-us/Purpose-Strategy.htm

  • ING group

    http://www.ing.com/About-us/Purpose-Strategy.htm

    Clear and Easy

    Anytime, Anywhere

    Empower

    Keep getting better

  • Data Principles

    Apply advanced, predictive analytics on live data

    Event-Driven and exposed via APIs

    Lean Architecture, Easy to integrate

    Available, Consistent, Streaming, Real-time Data

    Resilient, Distributed, Scalable, Maintainable

    ING group

    Clear and Easy

    Anytime, Anywhere

    Empower

    Keep getting better

  • Big Data and Fast Data

    [Figure: population (events, transactions, sessions, customers, etc.) versus time. Time axis: 10 yrs, 5 yrs, 1 yr, 1 month, 1 day, 1 hour, 1m. Labels: event streams; recent data; historical; big data]

  • Why Fast Data?

    1. Relevant, up-to-date information.

    2. Delivers actionable events.

  • Why Big Data?

    1. Analyze and model

    2. Learn, cluster, categorize, organize facts

  • Training, Scoring and Exposing models

    Distributed Data Store

    Real-Time APIs

    Streaming Data

    Data Sources, Files, DB extracts

    Batched Data

    API for mobile and web

  • Training, Scoring and Exposing models

    Distributed Data Store

    Fast Analytics

    Real-Time APIs

    Streaming Data

    Data Modeling

    Data Sources, Files, DB extracts

    Batched Data

    API for mobile and web

    read the data

    write the model

  • Training, Scoring and Exposing models

    Distributed Data Store

    Fast Analytics

    Event Processing

    Real-Time APIs

    Streaming Data

    Data Modeling

    Data Sources, Files, DB extracts

    Batched Data

    Alerts and Notifications

    API for mobile and web

    read the model

    read the data

    write the model

  • Cassandra+Akka+Spark: Machine Learning

    C*: Fast writes; 2D Data Structure; Replicated; Tunable consistency; Multi-Data centers

    Akka: Very Fast processing; Distributed, Scalable computing; Actor-based Pipelines; Actor state can be persisted; Supervision strategies

    Spark: Ad-Hoc Queries; Joins, Aggregate; User Defined Functions; Machine Learning, Advanced Stats and Analytics

  • Akka-Cassandra-Spark Stack

    Cassandra-Spark Connector

    Cassandra

    Spark

    Streaming, SQL, MLlib, GraphX

    Extract Data

    Create Models, Enrich, Transform

    Fetch from other Sources: Kafka

    Fetch from other Sources: DBs, Files

    Akka

    Analytics, Statistics, Data Science, Model Training

    Access Model

    Persist Actors State

  • Cassandra-Spark Connector

    Cassandra: Store all the data

    Spark: Analyze all the data

    DC1: replication factor 3

    DC2: replication factor 3

    DC3: replication factor 3 + Spark Executors

    Storage! Analytics!

    Data

  • Data Science: Anomaly Detection

    An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.

    Hawkins, 1980

  • Data Science: Anomaly Detection

    Distance Based

    Density Based
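
To make the "Distance Based" family concrete, here is a minimal sketch in plain Scala (no Spark), assuming a small in-memory data set. It scores each point by the distance to its k-th nearest neighbour and flags points whose score exceeds a threshold. The object name `DistanceOutlier`, the functions `knnScore` and `outliers`, and the threshold are illustrative assumptions, not taken from the talk.

```scala
object DistanceOutlier {
  type Point = Vector[Double]

  // Euclidean distance between two points of equal dimension.
  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Distance from point i to its k-th nearest other point.
  // A larger score means the point is more isolated.
  def knnScore(points: IndexedSeq[Point], i: Int, k: Int): Double = {
    require(k >= 1 && k < points.length, "k must be between 1 and points.length - 1")
    val dists = points.indices
      .filter(_ != i)
      .map(j => distance(points(i), points(j)))
      .sorted
    dists(k - 1)
  }

  // Indices of points whose k-NN score exceeds the (assumed) threshold.
  def outliers(points: IndexedSeq[Point], k: Int, threshold: Double): Seq[Int] =
    points.indices.filter(i => knnScore(points, i, k) > threshold)
}
```

This brute-force version compares every pair of points, so it only suits small samples; it is meant to show the idea, not a scalable implementation.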

  • Example: Analyze Gowalla check-ins

    Check-ins dataset

    year | month | day | time | uid  | lat      | lon       | ts                       | vid
    -----+-------+-----+------+------+----------+-----------+--------------------------+-------
    2010 | 9     | 14  | 91   | 853  | 40.73474 | -73.87434 | 2010-09-14 00:01:31+0000 | 917955
    2010 | 9     | 14  | 328  | 4516 | 40.72585 | -73.99289 | 2010-09-14 00:05:28+0000 | 37160
    2010 | 9     | 14  | 344  | 2964 | 40.67621 | -73.98405 | 2010-09-14 00:05:44+0000 | 956870

    Venues dataset

    vid     | name                       | lat      | long
    --------+----------------------------+----------+----------
    754108  | My Suit NY                 | 40.73474 | -73.87434
    249755  | UA Court Street Stadium 12 | 40.72585 | -73.99289
    6919688 | Sky Asian Bistro           | 40.67621 | -73.98405

  • Data Science: clustering venues

  • Data Science: clustering venues

    Intuition: weekly visitor patterns!

    Madison Square, Apple Store, Radio City Music Hall: Thursdays, Fridays, Saturdays are busy

    Statue of Liberty, Jacob K. Javits Convention Center, Whole Foods Market (Columbus Circle): not popular midweek

  • Data Science: clustering with k-means

    Intuition:

    Histogram components as dimensions

    Similar histograms occupy nearby regions of the feature space

    How do I compare histograms?

    - EMD (earth mover's distance)

    - Chi-squared distance

    - Space transformation (DCT)
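
As one illustration of the histogram comparisons listed on this slide, the sketch below computes a chi-squared distance between two histograms in plain Scala. It uses the common symmetric form, 0.5 * sum((a_i - b_i)^2 / (a_i + b_i)); the talk does not say which variant was used, so the formula choice and the names `HistogramDistance` and `chiSquared` are assumptions.

```scala
object HistogramDistance {
  // Symmetric chi-squared distance between two histograms with the same bins:
  //   0.5 * sum_i (a_i - b_i)^2 / (a_i + b_i)
  // Bins that are empty in both histograms contribute nothing and are skipped
  // to avoid dividing by zero.
  def chiSquared(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.length == b.length, "histograms must have the same number of bins")
    val terms = a.zip(b).collect {
      case (x, y) if x + y > 0.0 => (x - y) * (x - y) / (x + y)
    }
    0.5 * terms.sum
  }
}
```

For normalized histograms this distance is 0 for identical inputs and 1 for histograms with no overlapping mass.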

  • K-Means: Featurize data + cluster

    import org.apache.spark.mllib.clustering.KMeans

    // PairRDD of (vid, features): weekly patterns per venue.
    // vectorize_time and featurize_histogram are helpers not shown on the slide;
    // featurize_histogram is assumed to return an MLlib Vector.
    val weekly_visits = checkins_venues.select("vid", "ts")
      .map(row => (row.getLong("vid"), vectorize_time(row.getTimestamp("ts"))))
      .reduceByKey(_ + _)
      .mapValues(featurize_histogram)

    val numClusters = 15
    val numIterations = 100

    // Cluster similar weekly patterns. KMeans.train takes an RDD[Vector], hence .values.
    val clusters = KMeans.train(weekly_visits.values, numClusters, numIterations)
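
The slide's code relies on two helpers, `vectorize_time` and `featurize_histogram`, whose bodies are not shown. The sketch below is one guess at what they might do, written in plain Scala without Spark: a timestamp becomes a one-hot day-of-week vector, vectors are summed per venue, and the counts are normalized into a histogram. The seven-bin day-of-week layout, the UTC time zone, and all names in `WeeklyHistogram` are assumptions; the original may well use finer bins.

```scala
import java.time.{Instant, ZoneOffset}

object WeeklyHistogram {
  // One-hot vector for the day of the week in UTC (index 0 = Monday, 6 = Sunday).
  def vectorizeTime(ts: Instant): Vector[Double] = {
    val day = ts.atZone(ZoneOffset.UTC).getDayOfWeek.getValue - 1
    Vector.tabulate(7)(i => if (i == day) 1.0 else 0.0)
  }

  // Element-wise sum, standing in for the `_ + _` reducer on the slide.
  def add(a: Vector[Double], b: Vector[Double]): Vector[Double] =
    a.zip(b).map { case (x, y) => x + y }

  // Turn raw counts into relative frequencies so that busy and quiet venues
  // with the same weekly shape end up close together.
  def featurizeHistogram(counts: Vector[Double]): Vector[Double] = {
    val total = counts.sum
    if (total == 0.0) counts else counts.map(_ / total)
  }
}
```

For example, the three sample check-ins from 2010-09-14 all fall on a Tuesday, so each maps to a vector with a single 1.0 at index 1.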

  • How to use it

    1) Classification: classify venues into given groups

    2) Anomaly Detection: detect a shift in the cluster assignment of a given venue in a given week. Keep monitoring weekly changes in patterns; when a change happens, trigger a signal

    week 26, week 27

    Action
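
The week-over-week check described on this slide can be sketched as follows, in plain Scala with the cluster centres passed in as ordinary vectors. Each venue's weekly feature vector is assigned to its nearest centre, and venues whose assignment differs between two weeks are reported. With MLlib, `KMeansModel.predict` would normally do the assignment step; the nearest-centre function and all names in `ClusterShift` are illustrative assumptions.

```scala
object ClusterShift {
  type Point = Vector[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to p (what a k-means model's predict step does).
  def assign(centroids: Seq[Point], p: Point): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), p))

  // Venue ids present in both weeks whose cluster assignment changed.
  def shifted(centroids: Seq[Point],
              before: Map[Long, Point],
              after: Map[Long, Point]): Set[Long] =
    before.keySet.intersect(after.keySet).filter { vid =>
      assign(centroids, before(vid)) != assign(centroids, after(vid))
    }
}
```

The returned set is what would feed the "Action" step: each id in it is a venue whose weekly pattern moved to a different cluster.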

  • Data Science: clustering users' venues

  • Data Science: clustering users' venues

    Users tend to stick to the same places

    People have habits

    By clustering the places together, we can identify anomalous locations
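
One simple way to act on this idea, sketched in plain Scala: treat a user's usual places (for example, the centres of their location clusters) as given, and flag a new check-in when it is farther than some radius from all of them. The haversine formula is standard; the radius, the `LatLon` type, and the function names are assumptions made for illustration, not the method used in the talk.

```scala
object LocationAnomaly {
  final case class LatLon(lat: Double, lon: Double)

  // Great-circle distance in kilometres (haversine formula, mean Earth radius 6371 km).
  def haversineKm(a: LatLon, b: LatLon): Double = {
    val dLat = math.toRadians(b.lat - a.lat)
    val dLon = math.toRadians(b.lon - a.lon)
    val h = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(a.lat)) * math.cos(math.toRadians(b.lat)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * 6371.0 * math.asin(math.sqrt(h))
  }

  // A check-in is anomalous when it lies farther than maxKm from every usual place.
  // With no known places there is nothing to compare against, so nothing is flagged.
  def isAnomalous(usualPlaces: Seq[LatLon], checkin: LatLon, maxKm: Double): Boolean =
    usualPlaces.nonEmpty && usualPlaces.map(haversineKm(_, checkin)).min > maxKm
}
```

With usual places taken from the New York venues in the sample data, a check-in elsewhere in New York stays within a 50 km radius, while one in Amsterdam does not.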

    Siz