Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath


Petabyte scale data science using Spark & R

Sridhar Alla, Kiran Muglurmath, Comcast

Who we are

• Sridhar Alla
Director, Solution Architecture, Comcast. Focuses on architecting and building solutions to meet the needs of the Enterprise Business Intelligence initiatives.

• Kiran Muglurmath
Executive Director, Data Science, Comcast. Focuses on architecting and building solutions to meet the needs of the Enterprise Business Intelligence initiatives.

Top Initiatives

• Customer Churn Prediction
• Clickthru Analytics
• Personalization
• Customer Journey
• Modeling

Spark Stack

SparkR

• Enables using R packages to process data
• Can run Machine Learning and Statistical Analysis algorithms
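The deck does not show SparkR code on this slide; as a minimal sketch of the idea, assuming a Spark 2.x session and that the IPDR data is registered as a Hive table named ebidatascience.ipdr (the partition column local_day_id appears in the Hadoop command later in the deck):

library(SparkR)
sparkR.session(appName = "ipdr-anomaly-detection")

# Load a Hive table into a SparkDataFrame (the table name is an assumption)
ipdr <- sql("SELECT * FROM ebidatascience.ipdr")

# Aggregate with SparkR, then pull a small result back into local R
daily <- summarize(groupBy(ipdr, "local_day_id"), n = count(ipdr$local_day_id))
head(daily)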

Spark MLlib

• Implements various Machine Learning Algorithms: Classification, Regression, Collaborative Filtering, Clustering, Decomposition
• Works with Streaming, Spark SQL, GraphX, or with SparkR.
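The slide lists the algorithm families only. As one hedged illustration of calling MLlib from SparkR (Spark 2.x exposes wrappers such as spark.kmeans; the column names download_bytes and upload_bytes are hypothetical, not taken from the deck):

library(SparkR)
sparkR.session()

# Hypothetical columns; the deck does not give the IPDR schema
usage <- sql("SELECT download_bytes, upload_bytes FROM ebidatascience.ipdr")

# K-means clustering through the MLlib wrapper
model <- spark.kmeans(usage, ~ download_bytes + upload_bytes, k = 3)
summary(model)

# Score data; the result carries a 'prediction' column with the cluster id
head(predict(model, usage))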

Using PySpark & SparkR

Hidden Markov Model (HMM)


Dataset Preparation: Training Data


Dataset Preparation: Raw Data


Baum-Welch algorithm for state detection

1. Given the download/upload levels (observations) for a given time interval, the model detects the hidden streaming state for that interval.

2. Given a set of observations (i = 1 .. n), the ith hidden state depends only on the (i-1)th hidden state, i.e. P(X_i | X_{i-1}, ..., X_1) = P(X_i | X_{i-1}). For a discrete random variable X_t with N possible values, assume that P(X_t | X_{t-1}) does not depend on the time t (time-homogeneous transition probabilities).

3. From the observations, estimate the transition probabilities for the N possible states. Then recursively compute maximum likelihoods over all observations, forwards and backwards, to identify the most probable state for each observation.

Sample Code (R):

library('RHmm')

# Read training and test datasets (column V4 holds download bytes)
indata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")
testdata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")

# Fit a 3-state HMM to the download observations
downloads <- c(as.numeric(indata$V4))
downloadModel <- HMMFit(downloads, nStates = 3)

# Most probable state sequence for the test observations (Viterbi)
testdownloads <- c(as.numeric(testdata$V4))
tVitPath <- viterbi(downloadModel, testdownloads)

# Forward-backward procedure, compute probabilities
tfb <- forwardBackward(downloadModel, testdownloads)

# Plot implied states
layout(1:3)
plot(testdownloads[1:100], ylab = "Down Bandwidth", type = "l", main = "Download bytes")
plot(tVitPath$states[1:100], ylab = "Download States", type = "l", main = "Download States")

Output for a test dataset


Parallelizing in Hadoop

Steps:

• Create a sample dataset to build the model. This can be a small sample (~2,000-5,000 rows), or a size sufficient to build a generalized model.

• Script the model as an R file, except that it should consume streamed input from stdin instead of reading from CSV files. Separate map.R and reduce.R scripts can be created if a reduce stage is required to produce unified output datasets (see the map.R sketch after these steps).

• Test that the code works from the command line with the structure below, where dataset.csv is the input dataset with the structure shown before:

cat dataset.csv | map.R | reduce.R > output.csv

• Ensure that the Hive tables are in delimited text format. Deploy and run the model using Hadoop Streaming with the sample command line below:

hadoop jar /usr/hdp/2.2.6.4-1/hadoop-mapreduce/hadoop-streaming.jar \
  -D mapred.min.split.size=268435456 \
  -D mapreduce.task.timeout=300000000 \
  -D mapreduce.map.memory.mb=3584 \
  -D mapreduce.reduce.memory.mb=8092 \
  -input /user/hive/warehouse/ebidatascience.db/ipdr/local_day_id=$NEXT_DATE \
  -output /user/hive/warehouse/ebidatascience.db/ipdr_flagged/ \
  -file ./map.R \
  -file <sample dataset to build model.csv> \
  -mapper ./map.R
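The deck does not show map.R itself. Below is a minimal sketch of what such a streaming mapper might look like, assuming comma-delimited input rows, download bytes in the 4th field (as in the sample R code), and a bundled sample file named sample.csv for rebuilding the model; those specifics are assumptions, not taken from the slides:

#!/usr/bin/env Rscript
# Hypothetical streaming mapper: rebuild the HMM from a shipped sample,
# then tag every streamed row with its most probable download state.
library('RHmm')

# Sample file name is an assumption; it is shipped to the job with -file
sampledata <- read.csv("sample.csv", header = FALSE)
model <- HMMFit(as.numeric(sampledata$V4), nStates = 3)

# Read the rows streamed by Hadoop from stdin
con <- file("stdin", open = "r")
rows <- readLines(con)
close(con)

# Assume comma-delimited rows with download bytes in field 4
fields <- strsplit(rows, ",")
downloads <- as.numeric(sapply(fields, `[`, 4))

# Viterbi path gives one state per observation; append it to each row
states <- viterbi(model, downloads)$states
cat(paste(rows, states, sep = ","), sep = "\n")

If no reduce stage is needed, the mapper output can be written directly, as in the deck's command line above (which passes only -mapper).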

Flagged output


Performance

• 1.7B observations/day
• About 30 minutes processing time/day
• 380 shared nodes
• 92% accuracy in detecting streaming events


We are hiring!

• Big Data Engineers (Hadoop, Spark, Kafka…)
• Data Analysts (R, SAS…)
• Big Data Analysts (Hive, Pig…)

sridhar_alla@cable.comcast.com

THANK YOU.
