Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath

Petabyte scale data science using Spark & R

Sridhar Alla, Kiran Muglurmath
Comcast

Who we are

• Sridhar Alla
Director, Solution Architecture, Comcast. Focuses on architecting and building solutions to meet the needs of the Enterprise Business Intelligence initiatives.

• Kiran Muglurmath
Executive Director, Data Science, Comcast. Focuses on architecting and building solutions to meet the needs of the Enterprise Business Intelligence initiatives.

Top Initiatives

• Customer Churn Prediction
• Clickthru Analytics
• Personalization
• Customer Journey
• Modeling

Spark Stack

SparkR

• Enables using R packages to process data
• Can run Machine Learning and Statistical Analysis algorithms

Spark MLlib

• Implements various Machine Learning Algorithms
• Classification, Regression, Collaborative Filtering, Clustering, Decomposition
• Works with Streaming, Spark SQL, GraphX or with SparkR

Using PySpark & SparkR

Hidden Markov Model (HMM)

Dataset Preparation: Training Data

Dataset Preparation: Raw Data

Baum-Welch algorithm for state detection

1. Given the download/upload levels (observations) for a given time interval, the model detects the hidden streaming state for that interval.

2. Given a set of observations (i = 1 .. n), the i-th hidden variable depends only on the (i - 1)-th hidden variable, not on earlier ones (the Markov property). For a discrete random variable X_t with N possible values, assume that P(X_t | X_{t-1}) is independent of time t.

3. From the observations, calculate transition probabilities for the N possible states. Then recursively compute maximum likelihoods for all observations, backwards and forwards, to identify the most probable state for each observation.
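The forward and backward passes in the last step above can be sketched in plain Python (used here in place of the deck's R so the example has no package dependencies). This is a minimal sketch, assuming the transition and emission probabilities are already known; Baum-Welch would re-estimate them by repeating these same passes. The two-state model, the three observation levels, and every name below are hypothetical and not taken from the talk.

```python
def forward_backward(obs, start, trans, emit):
    """Posterior P(state at t | all observations) for a discrete HMM.

    obs   : list of observation indices
    start : start[i]    = P(X_0 = i)
    trans : trans[i][j] = P(X_t = j | X_{t-1} = i), the same for every t
    emit  : emit[i][k]  = P(observation k | state i)
    """
    n, T = len(start), len(obs)

    # Forward pass: alpha[t][i] is proportional to P(obs[0..t], X_t = i).
    # Each step is rescaled to sum to 1 to avoid underflow on long sequences.
    alpha = []
    prev = [start[i] * emit[i][obs[0]] for i in range(n)]
    for t in range(T):
        if t > 0:
            prev = [
                emit[j][obs[t]] * sum(alpha[t - 1][i] * trans[i][j] for i in range(n))
                for j in range(n)
            ]
        total = sum(prev)
        alpha.append([p / total for p in prev])

    # Backward pass: beta[t][i] is proportional to P(obs[t+1..] | X_t = i).
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        raw = [
            sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
            for i in range(n)
        ]
        total = sum(raw)
        beta[t] = [r / total for r in raw]

    # Combine and normalize: gamma[t][i] = P(X_t = i | all observations).
    gamma = []
    for t in range(T):
        joint = [alpha[t][i] * beta[t][i] for i in range(n)]
        total = sum(joint)
        gamma.append([g / total for g in joint])
    return gamma


# Hypothetical model: state 0 = idle, state 1 = streaming;
# observations 0 = low, 1 = medium, 2 = high download level.
start = [0.8, 0.2]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 0, 2, 2, 2, 1, 0]

posteriors = forward_backward(obs, start, trans, emit)
most_probable = [max(range(2), key=lambda i: row[i]) for row in posteriors]
print(most_probable)
```

Picking the highest posterior at each step gives the most probable state per observation, which is not necessarily the same as the single most likely state sequence (the Viterbi path).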

Sample Code (R):

    library('RHmm')

    # Load training and test data (interactive file pickers)
    indata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")
    testdata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")

    # Fit a 3-state HMM to the download values (column V4) of the training data
    downloads <- as.numeric(indata$V4)
    downloadModel <- HMMFit(downloads, nStates = 3)

    # Most likely hidden state sequence for the test data
    testdownloads <- as.numeric(testdata$V4)
    tVitPath <- viterbi(downloadModel, testdownloads)

    # Forward-backward procedure, compute probabilities
    tfb <- forwardBackward(downloadModel, testdownloads)

    # Plot implied states
    layout(1:3)
    plot(testdownloads[1:100], ylab = "Down Bandwidth", type = "l", main = "Download bytes")
    plot(tVitPath$states[1:100], ylab = "Download States", type = "l", main = "Download States")
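For readers without RHmm installed, the decoding step performed by the viterbi() call above can be sketched in plain Python. This is a generic log-space Viterbi for Gaussian emissions, not RHmm's implementation. The three states and all of their parameters are made up for illustration; in the R sample they would come from the model fitted by HMMFit.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def viterbi(obs, start, trans, means, stds):
    """Most likely hidden state sequence for a Gaussian-emission HMM."""
    n = len(start)
    # score[i]: best log probability of any path ending in state i so far.
    score = [
        math.log(start[i]) + gaussian_logpdf(obs[0], means[i], stds[i])
        for i in range(n)
    ]
    back = []  # back[t][j]: best predecessor of state j at step t + 1
    for x in obs[1:]:
        pointers, new_score = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + math.log(trans[i][j]))
            pointers.append(best_i)
            new_score.append(
                score[best_i]
                + math.log(trans[best_i][j])
                + gaussian_logpdf(x, means[j], stds[j])
            )
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical parameters: low, medium and high download levels (arbitrary units).
start = [0.6, 0.3, 0.1]
trans = [[0.8, 0.15, 0.05], [0.1, 0.8, 0.1], [0.05, 0.15, 0.8]]
means = [1.0, 10.0, 50.0]
stds = [1.0, 3.0, 10.0]

test_downloads = [0.8, 1.2, 9.5, 11.0, 48.0, 55.0, 52.0, 1.1]
states = viterbi(test_downloads, start, trans, means, stds)
print(states)
```

Working in log space keeps the scores from underflowing on long sequences, which matters when a day of observations is decoded in one pass.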

Output for a test dataset

Parallelizing in Hadoop

Steps:

• Create a sample dataset to build the model. This can be a small sample (~2000 – 5000 rows), or a size sufficient to build a generalized model.

• Script the model as an R file, except that it should use streamed input instead of reading from CSV files. Separate map.R and reduce.R can be created if a reduction stage is required to create unified output datasets.

• Test that the code works from the command line with the structure below, where dataset.csv is the input dataset with the structure shown before:

    cat dataset.csv | ./map.R | ./reduce.R > output.csv

• Ensure that Hive tables are in delimited text format. Deploy and run the model using Hadoop streaming with the sample command line below:

    hadoop jar /usr/hdp/2.2.6.4-1/hadoop-mapreduce/hadoop-streaming.jar \
        -D mapred.min.split.size=268435456 \
        -D mapreduce.task.timeout=300000000 \
        -D mapreduce.map.memory.mb=3584 \
        -D mapreduce.reduce.memory.mb=8092 \
        -input /user/hive/warehouse/ebidatascience.db/ipdr/local_day_id=$NEXT_DATE \
        -output /user/hive/warehouse/ebidatascience.db/ipdr_flagged/ \
        -file ./map.R \
        -file <sample dataset to build model.csv> \
        -mapper ./map.R
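The "streamed input" requirement in the steps above comes down to a script that reads delimited rows from standard input and writes flagged rows to standard output. Below is a minimal sketch of that shape in Python rather than R. The threshold rule is a hypothetical stand-in for the HMM scoring that map.R would perform, the comma delimiter is an assumption, and the fourth column is used only because the R sample reads V4.

```python
import io

THRESHOLD = 1_000_000  # hypothetical cutoff, stands in for the fitted model


def run_mapper(lines, out, delimiter=","):
    """Read delimited rows, append a flag column, and write them to out."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        try:
            download = float(fields[3])  # fourth column, as V4 in the R sample
        except (IndexError, ValueError):
            continue  # drop malformed rows rather than failing the whole task
        flag = "1" if download > THRESHOLD else "0"
        out.write(delimiter.join(fields + [flag]) + "\n")


# Under Hadoop streaming the call would be run_mapper(sys.stdin, sys.stdout);
# a small in-memory sample is used here so the sketch runs on its own.
sample = io.StringIO("a,b,c,2500000\nd,e,f,120\nbad,row\n")
result = io.StringIO()
run_mapper(sample, result)
print(result.getvalue(), end="")
```

Because each mapper only sees its own input split, a script of this shape needs no coordination between tasks, which is why a reduce stage is optional.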

Flagged output

Performance

• 1.7B observations/day
• About 30 minutes processing time/day
• 380 shared nodes
• 92% accuracy in detecting streaming events
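As a back-of-the-envelope reading of these figures, the implied throughput can be worked out directly. The sketch below assumes the 30 minutes is wall-clock time and that all 380 nodes take an equal share of the work. Neither is stated on the slide, and since the nodes are shared, the per-node number is only a rough average.

```python
observations_per_day = 1.7e9
processing_minutes = 30
nodes = 380

processing_seconds = processing_minutes * 60
per_second = observations_per_day / processing_seconds
per_node_per_second = per_second / nodes

print(f"{per_second:,.0f} observations/s across the cluster")
print(f"{per_node_per_second:,.0f} observations/s per node")
```

Under those assumptions the cluster handles a little under a million observations per second, or roughly 2,500 per node per second.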

We are hiring!

• Big Data Engineers (Hadoop, Spark, Kafka…)
• Data Analysts (R, SAS…)
• Big Data Analysts (Hive, Pig…)

[email protected]

THANK YOU.
