Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra

Click here to load reader

Embed Size (px)

Transcript of Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra

  • Real-Time Anomaly Detection with Spark MLlib, Akka and CassandraNatalino BusaData Platform Architect at Ing

  • ING group

    http://www.ing.com/About-us/Purpose-Strategy.htm

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusahttp://www.ing.com/About-us/Purpose-Strategy.htmhttp://www.ing.com/About-us/Purpose-Strategy.htm

  • ING group

    Empowering people to stay a step ahead in life and in business.

    http://www.ing.com/About-us/Purpose-Strategy.htm

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusahttp://www.ing.com/About-us/Purpose-Strategy.htmhttp://www.ing.com/About-us/Purpose-Strategy.htm

  • ING group

    http://www.ing.com/About-us/Purpose-Strategy.htm

    Clear and Easy

    Anytime, Anywhere

    Empower

    Keep getting better

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusahttp://www.ing.com/About-us/Purpose-Strategy.htmhttp://www.ing.com/About-us/Purpose-Strategy.htm

  • Apply advanced, predictive analytics on live dataEvent-Driven and exposed via APIs

    Lean Architecture, Easy to integrate

    Available, Consistent, Streaming, Real-time Data

    Resilient, Distributed, Scalable, Maintainable

    Clear and Easy

    Anytime, Anywhere

    Empower

    Keep getting better

    Data Principles

    ING group

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Big Data and Fast Datapo

    pula

    tion:

    eve

    nts,

    tran

    sact

    ions

    , se

    ssio

    ns, c

    usto

    mer

    s, et

    c

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Why Fast Data?

    1. Relevant up-to-date information.2. Delivers actionable events.

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Why Big Data?

    1. Analyze and model2. Learn, cluster, categorize, organize facts

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • 10

    Real Time APIsStreaming Data

    Data Sources,Files, DB extractsBatched Data

    Training, Scoring and Exposing models

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • 11

    Real Time APIsStreaming Data

    Data Sources,Files, DB extractsBatched Data

    Training, Scoring and Exposing models

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • 12

    Real Time APIsStreaming Data

    Data Sources,Files, DB extractsBatched Data

    Training, Scoring and Exposing models

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Cassandra+Akka+Spark: Machine Learning

    Fast writes2D Data StructureReplicatedTunable consistencyMulti-Data centers

    C* Akka SparkVery Fast processingDistributed, Scalable computingActor-based PipelinesActor state can be persistedSupervision strategies

    Ad-Hoc QueriesJoins, AggregateUser Defined FunctionsMachine Learning, Advanced Stats and Analytics

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Akka-Cassandra-Spark Stack

    Cassandra-Spark Connector

    Cassandra

    Spark

    Streaming SQL MLlib Graphx

    Extract Data

    Create Models,Enrich, Transform

    Fetch from other Sources: Kafka

    Fetch from other Sources: DBs, Files

    Akka

    Analytics, Statistics, Data Science, Model Training

    AccessModel

    PersistActors State

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Cassandra-Spark Connector

    Cassandra: Store all the dataSpark: Analyze all the data

    DC1: replication factor 3 DC2: replication factor 3 DC3: replication factor 3 + Spark Executors Storage! Analytics!

    Data

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Data Science: Anomaly Detection

    An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.

    Hawkins, 1980

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Data Science: Anomaly Detection

    Distance Based Density Based

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Example: Analyze gowalla check-ins

    year | month | day | time | uid | lat | lon | ts | vid------+-------+-----+------+--------+----------+-----------+--------------------------+--------- 2010 | 9 | 14 | 91 | 853 | 40.73474 | -73.87434 | 2010-09-14 00:01:31+0000 | 917955 2010 | 9 | 14 | 328 | 4516 | 40.72585 | -73.99289 | 2010-09-14 00:05:28+0000 | 37160 2010 | 9 | 14 | 344 | 2964 | 40.67621 | -73.98405 | 2010-09-14 00:05:44+0000 | 956870

    Check-ins dataset

    Venues dataset

    vid | name | lat | long ------+-------+-----+------+--------+----------+-----------+--------------------------+---------754108 | My Suit NY | 40.73474 | -73.87434249755 | UA Court Street Stadium 12 | 40.72585 | -73.99289 6919688 | Sky Asian Bistro | 40.67621 | -73.98405

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Data Science: clustering venues

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Data Science: clustering venues

    Weekly visitors patterns!

    Madison Square, Apple Store, Radio City Music HallThursdays, Fridays, Saturdays are busy

    Statue of Liberty, Jacob K. Javits Convention Center, Whole Foods Market (Columbus Circle)Not popular on midweek

    Intuition:

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Data Science: clustering with k-means

    Histograms components as dimensions

    Similar histograms would occupy similar places in the feature space

    How do I compare histograms:- EMD- Chi-squared distance- Space transformation (DCT)

    Intuition:

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • K-Means: Featurize data + cluster

    val weekly_visits = checkins_venues.select("vid","ts") .map(row => (row.getLong("vid"), vectorize_time(s.getTimestamp("ts")) .reduceByKey(_ + _) .mapValues(_ => featurize_histogram(_._1))

    val numClusters = 15val numIterations = 100

    val clusters = KMeans.train(weekly_visits, numClusters, numIterations)

    PairRDDs, weekly patterns per venue

    cluster similar weekly patterns

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • How to use it

    1) ClassificationClassify venues to given groups

    2) Anomaly DetectionDetect shift in the clustering assignment for a given venue for a given weekKeep monitoring weekly change in patterns, when it happens trigger a signal

    week 26 week 27

    Action

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Data Science: clustering users venues

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Data Science: clustering users venues

    Users tend to stick in the same placesPeople have habits

    By clustering the places togetherWe can identify anomalous locations

    Size of the cluster mattersMore points means less anomalous

    Mini-clusters and single anomalies are treated in similar ways ...

    Intuition:

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Data Science: clustering with DBSCANDBSCAN find clusters based on neighbouring density

    Does not require the number of cluster k beforehand.Clusters are not spherical

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Data Science: clustering users venues

    val locs = checkins_venues.select("uid", "lat","lon") .map(s => (s.getLong(0), Seq( (s.getDouble(1), s.getDouble(2)) ))

    .reduceByKey(_ + _) .mapValues( dbscan (_) )

    Have a look at: scalanlp/nak

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusahttps://github.com/scalanlphttps://github.com/scalanlp/nak

  • Data Science:

    Two ways to find anomalies with clustering

    - Cluster big amount of data with k-means and histograms

    - Apply clustering independently to million of users,to each identify the patterns with dbscan algorithm

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • MLlib vs PairRDDs

    KMeans.train(FeaturesRDD, numClusters, numIterations)

    UserFeaturesPairRDD.GroupbyKey().mapValues( dbscan(_) )

    RDDs map functionsParallelism easy to exploitThe function runs locally for each KeyPick your fav machine learning algorithms

    Limited nr of pointsRunning in parallel for millions of Keys

    MLlibTruly distributed algorithmClassify venues to given groups

    Millions of datapointsLimited amount of clusters

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • 30

    Real Time APIsStreaming Data

    Data Sources,Files, DB extractsBatched Data

    Training, Scoring and Exposing models

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Training vs Scoring: Latency budget

    Akka: millisecond response

    Spark: in-memory data modelsTrain: SparkScore: Spark

    Train: SparkScore: Akka

    slow: minutes fast: millisecs

    Model Scoring

    Mod

    el T

    rain

    ing

    slow

    : min

    utes

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Akka

    Mixed Load Cassandra Cluster

    Coral: Web API for dynamic data flows

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Akka

    Web API for dynamic data flows a web api to define/manage/run streaming data-flows open source and community managed event processing as a service

    coral-streaming/coral

    Steven RaemaekersJasper van ZandbeekGer van RossumHoda AlemiKoen Verschuren

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • 34

    Real Time APIsStreaming Data

    Data Sources,Files, DB extractsBatched Data

    Summary:

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Akka

    Feedback to the community:

    More Algorithms for machine learning!- DBSCAN, OPTICS, PAM- More metrics, non-euclidean spaces, etc- Non distributed algorithms: more scalanlp integration?

    Streaming all the way:Unify batch (Spark) and event streaming (Akka) computing

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • Thanks!

    - Vision and strategy on an event-driven bank- ING CIO management team and awesome colleagues

    Spark, Cassandra, Akka communities !

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • webinar + live dem

    o: Dec 9th

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusa

  • ResourcesCoral: event processing webapihttps://github.com/coral-streaming/coral

    Spark + Cassandra: Clustering Eventshttp://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.html

    Spark: Machine Learning, SQL frameshttps://spark.apache.org/docs/latest/mllib-guide.html

    https://spark.apache.org/docs/latest/sql-programming-guide.html

    Datastax: Analytics and Spark connectorhttp://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecases

    http://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.html

    Anomaly DetectionChandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey"(PDF). ACM Computing Surveys 41 (3): 1. doi:10.1145/1541880.1541882.

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusahttps://github.com/coral-streaming/coralhttps://github.com/coral-streaming/coralhttp://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.htmlhttp://www.natalinobusa.com/2015/07/clustering-check-ins-with-spark-and.htmlhttps://spark.apache.org/docs/latest/mllib-guide.htmlhttps://spark.apache.org/docs/latest/mllib-guide.htmlhttps://spark.apache.org/docs/latest/sql-programming-guide.htmlhttps://spark.apache.org/docs/latest/sql-programming-guide.htmlhttp://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecaseshttp://www.slideshare.net/doanduyhai/spark-cassandra-connector-api-best-practices-and-usecaseshttp://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.htmlhttp://docs.datastax.com/en/datastax_enterprise/4.7/datastax_enterprise/anaHome/anaHome.htmlhttp://www.cs.umn.edu/tech_reports_upload/tr2007/07-017.pdfhttps://en.wikipedia.org/wiki/ACM_Computing_Surveyshttps://en.wikipedia.org/wiki/Digital_object_identifierhttps://dx.doi.org/10.1145%2F1541880.1541882

  • ResourcesDatasetshttps://snap.stanford.edu/data/loc-gowalla.htmlE. Cho, S. A. Myers, J. Leskovec. Friendship and Mobility: Friendship and Mobility: User Movement in Location-Based Social Networks ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011

    https://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.zipThe project is being developed in the context of the SInteliGIS project financed by the Portuguese Foundation for Science and Technology (FCT) through project grant PTDC/EIA-EIA/109840/2009. .

    Pictures:"DBSCAN-density-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-density-data.svg#/media/File:DBSCAN-density-data.svg

    "DBSCAN-Illustration" by Chire - Own work. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File:DBSCAN-Illustration.svg

    "Multimodal" by Visnut - Own work. Licensed under CC BY-SA 4.0 via Commons - https://commons.wikimedia.org/wiki/File:Multimodal.png#/media/File:Multimodal.png

    "Standard deviation diagram" by Mwtoews - Own work, based (in concept) on figure by Jeremy Kemp, on 2005-02-09. Licensed under CC BY 2.5 via Commons - https://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svg

    "Michelsonmorley-boxplot" by User:Schutz - Own work. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot.svg#/media/File:Michelsonmorley-boxplot.svg

    https://twitter.com/natbusahttps://nl.linkedin.com/in/natalinobusahttps://twitter.com/natbusahttps://snap.stanford.edu/data/loc-gowalla.htmlhttps://snap.stanford.edu/data/loc-gowalla.htmlhttps://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.ziphttps://code.google.com/p/locrec/downloads/detail?name=gowalla-dataset.ziphttps://commons.wikimedia.org/wiki/File:DBSCAN-density-data.svg#/media/File:DBSCAN-density-data.svghttps://commons.wikimedia.org/wiki/File:DBSCAN-density-data.svg#/media/File:DBSCAN-density-data.svghttps://commons.wikimedia.org/wiki/File:DBSCAN-density-data.svg#/media/File:DBSCAN-density-data.svghttps://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File:DBSCAN-Illustration.svghttps://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File:DBSCAN-Illustration.svghttps://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg#/media/File:DBSCAN-Illustration.svghttps://commons.wikimedia.org/wiki/File:Multimodal.png#/media/File:Multimodal.pnghttps://commons.wikimedia.org/wiki/File:Multimodal.png#/media/File:Multimodal.pnghttps://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svghttps://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svghttps://commons.wikimedia.org/wiki/File:Standard_deviation_diagram.svg#/media/File:Standard_deviation_diagram.svghttps://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot.svg#/media/File:Michelsonmorley-boxplot.svghttps://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot.svg#/media/File:Michelsonmorley-boxplot.svghttps://commons.wikimedia.org/wiki/File:Michelsonmorley-boxplot.svg#/media/File:Michelsonmorley-boxplot.svg