Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath


Petabyte scale data science using Spark & R

Sridhar Alla, Kiran Muglurmath, Comcast

Who we are

• Sridhar Alla
Director, Solution Architecture, Comcast. Focuses on architecting and building solutions to meet the needs of the Enterprise Business Intelligence initiatives.

• Kiran Muglurmath
Executive Director, Data Science, Comcast. Focuses on architecting and building solutions to meet the needs of the Enterprise Business Intelligence initiatives.

Top Initiatives

• Customer Churn Prediction
• Clickthru Analytics
• Personalization
• Customer Journey
• Modeling

Spark Stack

SparkR

• Enables using R packages to process data
• Can run Machine Learning and Statistical Analysis algorithms
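The deck does not show SparkR code on this slide; as a minimal sketch of the idea, assuming a Spark 2.x session and that the IPDR data is registered as a Hive table named ebidatascience.ipdr (the partition column local_day_id appears in the Hadoop command later in the deck):

library(SparkR)
sparkR.session(appName = "ipdr-anomaly-detection")

# Load a Hive table into a SparkDataFrame (the table name is an assumption)
ipdr <- sql("SELECT * FROM ebidatascience.ipdr")

# Aggregate with SparkR, then pull a small result back into local R
daily <- summarize(groupBy(ipdr, "local_day_id"), n = count(ipdr$local_day_id))
head(daily)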

Spark MLlib

• Implements various Machine Learning Algorithms: Classification, Regression, Collaborative Filtering, Clustering, Decomposition
• Works with Streaming, Spark SQL, GraphX, or with SparkR.
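The slide lists the algorithm families only. As one hedged illustration of calling MLlib from SparkR (Spark 2.x exposes wrappers such as spark.kmeans; the column names download_bytes and upload_bytes are hypothetical, not taken from the deck):

library(SparkR)
sparkR.session()

# Hypothetical columns; the deck does not give the IPDR schema
usage <- sql("SELECT download_bytes, upload_bytes FROM ebidatascience.ipdr")

# K-means clustering through the MLlib wrapper
model <- spark.kmeans(usage, ~ download_bytes + upload_bytes, k = 3)
summary(model)

# Score data; the result carries a 'prediction' column with the cluster id
head(predict(model, usage))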

Using PySpark & SparkR

Hidden Markov Model (HMM)


Dataset Preparation: Training Data


Dataset Preparation: Raw Data


Baum-Welch algorithm for state detection

1. Given the download/upload levels (observations) for a given time interval, the model detects the hidden streaming state for that interval.

2. Given a set of observations (i = 1 .. n), the ith hidden state depends only on the (i-1)th hidden state, i.e. P(X_i | X_{i-1}, ..., X_1) = P(X_i | X_{i-1}). For a discrete random variable X_t with N possible values, assume that P(X_t | X_{t-1}) does not depend on the time t (time-homogeneous transition probabilities).

3. From the observations, estimate the transition probabilities for the N possible states. Then recursively compute maximum likelihoods over all observations, forwards and backwards, to identify the most probable state for each observation.

Sample Code (R):

library('RHmm')

# Read training and test datasets (column V4 holds download bytes)
indata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")
testdata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")

# Fit a 3-state HMM to the download observations
downloads <- c(as.numeric(indata$V4))
downloadModel <- HMMFit(downloads, nStates = 3)

# Most probable state sequence for the test observations (Viterbi)
testdownloads <- c(as.numeric(testdata$V4))
tVitPath <- viterbi(downloadModel, testdownloads)

# Forward-backward procedure, compute probabilities
tfb <- forwardBackward(downloadModel, testdownloads)

# Plot implied states
layout(1:3)
plot(testdownloads[1:100], ylab = "Down Bandwidth", type = "l", main = "Download bytes")
plot(tVitPath$states[1:100], ylab = "Download States", type = "l", main = "Download States")

Output for a test dataset


Parallelizing in Hadoop

Steps:

• Create a sample dataset to build the model. This can be a small sample (~2,000-5,000 rows), or a size sufficient to build a generalized model.

• Script the model as an R file, except that it should consume streamed input from stdin instead of reading from CSV files. Separate map.R and reduce.R scripts can be created if a reduce stage is required to produce unified output datasets (see the map.R sketch after these steps).

• Test that the code works from the command line with the structure below, where dataset.csv is the input dataset with the structure shown before:

cat dataset.csv | map.R | reduce.R > output.csv

• Ensure that the Hive tables are in delimited text format. Deploy and run the model using Hadoop Streaming with the sample command line below:

hadoop jar /usr/hdp/2.2.6.4-1/hadoop-mapreduce/hadoop-streaming.jar \
  -D mapred.min.split.size=268435456 \
  -D mapreduce.task.timeout=300000000 \
  -D mapreduce.map.memory.mb=3584 \
  -D mapreduce.reduce.memory.mb=8092 \
  -input /user/hive/warehouse/ebidatascience.db/ipdr/local_day_id=$NEXT_DATE \
  -output /user/hive/warehouse/ebidatascience.db/ipdr_flagged/ \
  -file ./map.R \
  -file <sample dataset to build model.csv> \
  -mapper ./map.R
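The deck does not show map.R itself. Below is a minimal sketch of what such a streaming mapper might look like, assuming comma-delimited input rows, download bytes in the 4th field (as in the sample R code), and a bundled sample file named sample.csv for rebuilding the model; those specifics are assumptions, not taken from the slides:

#!/usr/bin/env Rscript
# Hypothetical streaming mapper: rebuild the HMM from a shipped sample,
# then tag every streamed row with its most probable download state.
library('RHmm')

# Sample file name is an assumption; it is shipped to the job with -file
sampledata <- read.csv("sample.csv", header = FALSE)
model <- HMMFit(as.numeric(sampledata$V4), nStates = 3)

# Read the rows streamed by Hadoop from stdin
con <- file("stdin", open = "r")
rows <- readLines(con)
close(con)

# Assume comma-delimited rows with download bytes in field 4
fields <- strsplit(rows, ",")
downloads <- as.numeric(sapply(fields, `[`, 4))

# Viterbi path gives one state per observation; append it to each row
states <- viterbi(model, downloads)$states
cat(paste(rows, states, sep = ","), sep = "\n")

If no reduce stage is needed, the mapper output can be written directly, as in the deck's command line above (which passes only -mapper).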

Flagged output


Performance

• 1.7B observations/day
• About 30 minutes processing time/day
• 380 shared nodes
• 92% accuracy in detecting streaming events


We are hiring!

• Big Data Engineers (Hadoop, Spark, Kafka…)
• Data Analysts (R, SAS…)
• Big Data Analysts (Hive, Pig…)

sridhar_alla@cable.comcast.com

THANK YOU.
