Big Data Analytics with Storm, Spark and GraphLab


Dr. Vijay Srinivas Agneeswaran, Director and Head, Big-data R&D, Impetus Technologies Inc.

Contents

• Big Data Computations
  • Introduction to ML
  • Characterization
• Berkeley data analytics stack
  • Spark
• Real-time Analytics with Storm
• PMML Scoring for Naïve Bayes
  • PMML Primer
  • Naïve Bayes Primer
• GraphLab
• Hadoop 2.0 (Hadoop YARN)
• Programming Abstractions

Introduction to Machine Learning

• What is it?
  • Learn patterns in data
  • Improve accuracy by learning
• Examples
  • Speech recognition systems
  • Recommender systems
  • Medical decision aids
  • Robot navigation systems

• Attributes and their values:

• Outlook: Sunny, Overcast, Rain

• Humidity: High, Normal

• Wind: Strong, Weak

• Temperature: Hot, Mild, Cool

• Target prediction - Play Tennis: Yes, No


Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No

Source: Tom Mitchell, Machine Learning, Tata McGraw Hill Publications.

Introduction to Machine Learning: Decision Trees

Outlook
• Sunny: test Humidity
  • High: No
  • Normal: Yes
• Overcast: Yes
• Rain: test Wind
  • Strong: No
  • Weak: Yes
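As a rough illustration of how such a tree is evaluated, the sketch below hand-codes the Play Tennis tree as nested conditionals. The object and method names (`PlayTennisTree`, `predict`) are made up for this example, and the leaf labels under Wind follow Mitchell's textbook tree (Strong gives No, Weak gives Yes), which is an assumption about the slide's figure. Temperature does not appear because the tree never tests it.

```scala
// Hand-coded Play Tennis decision tree; names are illustrative, not from the talk.
object PlayTennisTree {
  // Walk from the root (Outlook) down to a Yes/No leaf.
  def predict(outlook: String, humidity: String, wind: String): String =
    outlook match {
      case "Sunny"    => if (humidity == "High") "No" else "Yes" // Humidity subtree
      case "Overcast" => "Yes"                                   // pure leaf
      case "Rain"     => if (wind == "Strong") "No" else "Yes"   // Wind subtree
      case other      => throw new IllegalArgumentException(s"Unknown outlook: $other")
    }

  def main(args: Array[String]): Unit = {
    println(predict("Sunny", "High", "Weak"))    // attributes of day D1
    println(predict("Overcast", "High", "Weak")) // attributes of day D3
    println(predict("Rain", "Normal", "Strong")) // attributes of day D6
  }
}
```

A learned tree would be built from the training table by an algorithm such as ID3; this sketch only shows the prediction step.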

Decision Trees to Random Forests

Decision trees
• Pros: handling of mixed data, robustness to outliers, computational scalability
• Cons: low prediction accuracy, high variance, size vs. goodness of fit

Random forests
• Can we have an ensemble of trees? Yes: random forests.
• Final prediction is the mean of the trees' outputs (regression) or the class with the most votes (classification).
• Does not need tree pruning for generalization.
• Greater accuracy across domains.
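The aggregation step of a random forest can be sketched as follows. This is a minimal illustration that assumes each tree's prediction has already been computed; `EnsembleVote`, `majorityVote` and `meanPrediction` are hypothetical names, and training the individual trees is not shown.

```scala
// Combining per-tree predictions into one forest prediction (illustrative names).
object EnsembleVote {
  // Classification: return the class predicted by the most trees.
  // Ties are resolved arbitrarily by maxBy.
  def majorityVote(votes: Seq[String]): String = {
    require(votes.nonEmpty, "need at least one vote")
    votes.groupBy(identity).maxBy { case (_, vs) => vs.size }._1
  }

  // Regression: average the numeric predictions of the trees.
  def meanPrediction(predictions: Seq[Double]): Double = {
    require(predictions.nonEmpty, "need at least one prediction")
    predictions.sum / predictions.size
  }

  def main(args: Array[String]): Unit = {
    println(majorityVote(Seq("Yes", "No", "Yes")))
    println(meanPrediction(Seq(1.0, 2.0, 3.0)))
  }
}
```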

K-means Clustering

Support Vector Machines

Introduction to Machine Learning

Machine learning tasks
• Learning associations – market basket analysis
• Supervised learning (classification/regression) – random forests, support vector machines (SVMs), logistic regression (LR), Naïve Bayes
• Unsupervised learning (clustering) – k-means, sentiment analysis
• Prediction – random forests, SVMs, LR

Data mining
• Application of machine learning to large data
• Knowledge Discovery in Databases (KDD)
• Credit scoring, fraud detection, market basket analysis, medical diagnosis, manufacturing optimization

Big Data Computations

Computations/Operations (the "giants" are the computational classes identified in [1])
• Giant 1 (simple stats) is perfect for Hadoop 1.0.
• Giants 2 (linear algebra), 3 (N-body), 4 (optimization): Spark from UC Berkeley is efficient.
  • Logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares.
  • Example is social group-first approach for consumer churn analysis [2]
• Interactive/on-the-fly data processing – Storm.
• OLAP – data cube operations. Dremel/Drill
• Data sets – not embarrassingly parallel?
  • Deep Learning Artificial Neural Networks
  • Machine vision from Google [3]
  • Speech analysis from Microsoft
• Giant 5 – Graph processing – GraphLab, Pregel, Giraph

[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Yossi Richter, Elad Yom-Tov, Noam Slonim: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of SIAM International Conference on Data Mining, 2010, pp. 732-741.
[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012: 1232-1240.


Iterative ML Algorithms

What are iterative algorithms?
• Those that need communication among the computing entities
• Examples – neural networks, PageRank algorithms, network traffic analysis

Conjugate gradient descent
• Commonly used to solve systems of linear equations
• [CB09] tried implementing CG on dense matrices
• DAXPY – multiplies vector x by constant a and adds y.
• DDOT – dot product of 2 vectors
• MatVec – multiply matrix by vector, produce a vector.

Communication overhead
• 1 MR per primitive – 6 MRs per CG iteration, hundreds of MRs per CG computation, leading to tens of GBs of communication even for small matrices.

Other iterative algorithms
• Fast Fourier transform, block tridiagonal

[CB09] C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific computing, Technical Report, University of California, Computer Science Department, 2009.
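To make the primitive count concrete, here is a minimal single-machine sketch of conjugate gradient written only in terms of DAXPY, DDOT and MatVec. It is not the [CB09] implementation: the object name, the dense `Array[Array[Double]]` matrix layout, and the stopping tolerance are assumptions for illustration. In this common formulation each iteration issues one MatVec, two DDOTs and three DAXPYs, which is consistent with the six MapReduce jobs per iteration quoted above.

```scala
// Conjugate gradient for A x = b (A symmetric positive definite), built from
// the three primitives named on the slide. Illustrative sketch only.
object ConjugateGradient {
  type Vec = Array[Double]

  // DAXPY: a * x + y
  def daxpy(a: Double, x: Vec, y: Vec): Vec =
    Array.tabulate(x.length)(i => a * x(i) + y(i))

  // DDOT: dot product of two vectors
  def ddot(x: Vec, y: Vec): Double = {
    var s = 0.0
    var i = 0
    while (i < x.length) { s += x(i) * y(i); i += 1 }
    s
  }

  // MatVec: dense matrix times vector
  def matVec(a: Array[Vec], x: Vec): Vec = a.map(row => ddot(row, x))

  def solve(a: Array[Vec], b: Vec, maxIter: Int = 1000, tol: Double = 1e-12): Vec = {
    var x: Vec = Array.fill(b.length)(0.0)
    var r: Vec = daxpy(-1.0, matVec(a, x), b) // initial residual r = b - A x
    var p: Vec = r.clone()                    // initial search direction
    var rr = ddot(r, r)
    var k = 0
    while (k < maxIter && rr > tol) {
      val ap = matVec(a, p)        // primitive 1: MatVec
      val alpha = rr / ddot(p, ap) // primitive 2: DDOT
      x = daxpy(alpha, p, x)       // primitive 3: DAXPY, x = x + alpha p
      r = daxpy(-alpha, ap, r)     // primitive 4: DAXPY, r = r - alpha A p
      val rrNew = ddot(r, r)       // primitive 5: DDOT
      p = daxpy(rrNew / rr, p, r)  // primitive 6: DAXPY, p = r + beta p
      rr = rrNew
      k += 1
    }
    x
  }

  def main(args: Array[String]): Unit = {
    val a = Array(Array(4.0, 1.0), Array(1.0, 3.0))
    println(solve(a, Array(1.0, 2.0)).mkString(", "))
  }
}
```

On one machine each primitive is a cheap loop; the slide's point is that mapping every one of these calls to a separate MapReduce job forces the vectors through the distributed file system on every step.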

ML realizations: 3 Generational view

Examples
• First generation: SAS, R, Weka, SPSS in native form
• Second generation: Mahout, Pentaho, Revolution R, SAS In-memory Analytics (Hadoop)
• Third generation: Spark, HaLoop, GraphLab, Pregel, SAS In-memory Analytics (Greenplum/Teradata), Giraph, Golden ORB, Stanford GPS, ML over Storm

Scalability
• First generation: Vertical
• Second generation: Horizontal (over Hadoop)
• Third generation: Horizontal (beyond Hadoop)

Algorithms available
• First generation: Huge collection of algorithms
• Second generation: Small subset – sequential logistic regression, linear SVMs, Stochastic Gradient Descent, k-means clustering, Random Forests etc.
• Third generation: Much wider – including Conjugate Gradient Descent (CGD), Alternating Least Squares (ALS), collaborative filtering, kernel SVM, belief propagation, matrix factorization, Gibbs sampling etc.

Algorithms not available
• First generation: Practically nothing
• Second generation: Vast no. – Kernel SVMs, Multivariate Logistic Regression, Conjugate Gradient Descent, ALS etc.
• Third generation: Multivariate logistic regression in general form, K-means clustering etc. – work in progress to expand the set of algorithms available.

Fault-tolerance
• First generation: Single point of failure
• Second generation: Most tools are FT, as they are built on top of Hadoop
• Third generation: FT – HaLoop, Spark. Not FT – Pregel, GraphLab, Giraph

Giants
• First generation: All 7 giants – for small data sets
• Second generation: Giants 1 and 2.
• Third generation: Spark – giants 2, 3 and 4. GraphLab – giant 5.

Vijay Srinivas Agneeswaran, Pranay Tonpay and Jayati Tiwari, "Paradigms for Realizing Machine Learning Algorithms", Big Data Journal (Liebertpub), 1(4), 207-214.





Data Flow in Spark and Hadoop


Berkeley Big-data Analytics Stack (BDAS)

BDAS: Use Cases

Ooyala
• Uses Cassandra for video data personalization.
• Pre-compute aggregates vs. on-the-fly queries.
• Moved to Spark for ML and computing views.
• Moved to Shark for on-the-fly queries – C* OLAP aggregate queries take 130 secs on Cassandra, 60 ms in Spark.

Conviva
• Uses Hive for repeatedly running ad-hoc queries on video data.
• Optimized ad-hoc queries using Spark RDDs – found Spark is 30 times faster than Hive.
• ML for connection analysis and video streaming optimization.

Yahoo
• Advertisement targeting: 30K nodes on Hadoop YARN
• Hadoop – batch processing; Spark – iterative processing; Storm – on-the-fly processing
• Content recommendation – collaborative filtering

BDAS: Spark

Transformations/Actions and their descriptions

• map(function f1): Pass each element of the RDD through f1 in parallel and return the resulting RDD.
• filter(function f2): Select elements of the RDD that return true when passed through f2.
• flatMap(function f3): Similar to map, but f3 returns a sequence to facilitate mapping a single input to multiple outputs.
• union(RDD r1): Returns the union of the RDD r1 with the self RDD.
• sample(flag, p, seed): Returns a randomly sampled (with seed) p percentage of the RDD.
• groupByKey(noTasks): Can only be invoked on key-value paired data – returns data grouped by key. No. of parallel tasks is given as an argument (default is 8).
• reduceByKey(function f4, noTasks): Aggregates the result of applying f4 on elements with the same key. No. of parallel tasks is the second argument.
• join(RDD r2, noTasks): Joins RDD r2 with self – computes all possible pairs for a given key.
• groupWith(RDD r3, noTasks): Joins RDD r3 with self and groups by key.
• sortByKey(flag): Sorts the self RDD in ascending or descending order based on flag.
• reduce(function f5): Aggregates the result of applying function f5 on all elements of the self RDD.
• collect(): Returns all elements of the RDD as an array.
• count(): Counts the no. of elements in the RDD.
• take(n): Gets the first n elements of the RDD.
• first(): Equivalent to take(1).
• saveAsTextFile(path): Persists the RDD in a file in HDFS or another Hadoop-supported file system at the given path.
• saveAsSequenceFile(path): Persists the RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs that implement the Hadoop Writable interface or equivalent.
• foreach(function f6): Runs f6 in parallel on elements of the self RDD.

[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.
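Several of these operations share names with methods on ordinary Scala collections, so their chaining can be tried locally without a cluster. The sketch below is plain Scala, not Spark: it runs on a `Seq` in a single JVM, `groupBy` followed by a per-key sum stands in for reduceByKey, and the object and method names are invented for the example.

```scala
// Word count as a chain of collection operations, mirroring the RDD
// transformations flatMap, filter, map and a reduceByKey-like aggregation.
// Plain Scala collections only; nothing here is distributed.
object CollectionPipeline {
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+").toSeq) // flatMap: one line to many words
      .filter(word => word.nonEmpty)             // filter: drop empty tokens
      .map(word => (word, 1))                    // map: word to (key, value) pair
      .groupBy { case (word, _) => word }        // group pairs that share a key
      .map { case (word, pairs) =>               // sum the values for each key
        (word, pairs.map(_._2).sum)
      }

  def main(args: Array[String]): Unit =
    println(wordCounts(Seq("a b a", "b c")))
}
```

On an RDD the same chain would be evaluated lazily and in parallel across partitions, with the grouping step requiring a shuffle.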

Representation of an RDD

Information                                     HadoopRDD                          FilteredRDD                   JoinedRDD
Set of partitions                               1 per HDFS block                   Same as parent                1 per reduce task
Set of dependencies                             None                               1-to-1 on parent              Shuffle on each parent
Function to compute data set based on parents   Read corresponding block           Compute parent and filter it  Read and join shuffled data
Meta-data on location (preferredLocations)      HDFS block location from namenode  None (parent)                 None
Meta-data on partitioning (partitioningScheme)  None                               None                          HashPartitioner

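Some Spark(ling) examples

Scala code (serial)

var count = 0
for (i <- 1 to 100000) {
  // Draw a random point in the square [-1, 1] x [-1, 1]
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  // Count it if it falls inside the unit circle
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count / 100000.0)

Sample random points in the square that encloses the unit circle and count how many fall inside the circle (a fraction of roughly PI/4). Hence, you get an approximate value for PI.

With PS and PC the number of points in the square and in the circle, and AS and AC the areas of the square and the circle: PS/PC = AS/AC = 4/PI, so PI = 4 * (PC/PS).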

Some Spark(ling) examples

Spark code (parallel)

val spark = new SparkContext(<Mesos master>)
var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 100000, 12)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
// The accumulated total is read on the driver through .value
println("Pi is roughly " + 4 * count.value / 100000.0)

Notable points:
1. Spark context created – talks to the Mesos master (footnote 1).
2. Count becomes a shared variable – an accumulator.
3. The for loop runs over an RDD – parallelize breaks the Scala range object (1 to 100000) into 12 slices.
4. The for loop over the parallelized range invokes the foreach method of the RDD.

Footnote 1: Mesos is an Apache incubated clustering system – http://mesosproject.org

Logistic Regression in Spark: Serial Code

// Read data file and convert it into Point objects.
// Vector, parsePoint, D and ITERATIONS are assumed to be defined elsewhere.
val lines = scala.io.Source.fromFile("data.txt").getLines()
// getLines() returns an Iterator, which can be traversed only once,
// so materialize the points before looping over them repeatedly.
val points = lines.map(x => parsePoint(x)).toArray

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = Vector.zeros(D)
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Result: " + w)

Logistic Regression in Spark

// Read data file and transform it into Point objects
val spark = new SparkContext(<Mesos master>)
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
val points = lines.map(x => parsePoint(x)).cache()

// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)
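The `scale` expression in both versions can be read as the gradient of the logistic loss. The derivation below is a sketch that assumes labels are encoded as y in {-1, +1} and that the update uses a step size of 1; neither is stated on the slides, but both are consistent with the code.

```latex
% Logistic loss for one point (x, y), with y \in \{-1, +1\} (assumed encoding)
\ell(w) = \log\bigl(1 + e^{-y\, w \cdot x}\bigr)

% Its gradient with respect to w
\nabla_w \ell(w)
  = \frac{-y\, e^{-y\, w \cdot x}}{1 + e^{-y\, w \cdot x}}\; x
  = \Bigl(\frac{1}{1 + e^{-y\, w \cdot x}} - 1\Bigr)\, y\, x

% One iteration of the loop: sum over all points, then step with rate 1
w \leftarrow w - \sum_{p} \Bigl(\frac{1}{1 + e^{-y_p\, w \cdot x_p}} - 1\Bigr)\, y_p\, x_p
```

The factor in parentheses multiplied by y is exactly `scale`, and `gradient += scale * p.x` accumulates the sum.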

Logistic Regression: Spark VS Hadoop

http://spark-project.org






Real-time Analytics with Storm

Solution to Internet Traffic Analysis Use Case


PMML Primer

• Predictive Model Markup Language
• Developed by DMG (Data Mining Group)
• XML representation of a model.
• PMML offers a standard to define a model, so that a model generated in tool A can be directly used in tool B.
• May contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.

Naïve Bayes Primer

• A simple probabilistic classifier based on Bayes' theorem.
• Given features X1, X2, …, Xn, predict a label Y by calculating the probability for every possible value of Y.
• [Figure: Bayes' rule with the likelihood, prior and normalization constant labelled]
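The rule behind this slide can be written out as follows. This is the standard Naïve Bayes formulation, given here as a reconstruction rather than a copy of the slide's figure; the conditional independence of the features given the label is the "naïve" assumption.

```latex
% Posterior = likelihood x prior / normalization constant
P(Y = y \mid X_1, \dots, X_n)
  = \frac{\overbrace{\prod_{i=1}^{n} P(X_i \mid Y = y)}^{\text{likelihood}}
          \;\overbrace{P(Y = y)}^{\text{prior}}}
         {\underbrace{P(X_1, \dots, X_n)}_{\text{normalization constant}}}

% Predicted label: the normalization constant does not depend on y
\hat{y} = \arg\max_{y} \; P(Y = y) \prod_{i=1}^{n} P(X_i \mid Y = y)
```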

PMML Scoring for Naïve Bayes

• Wrote a PMML-based scoring engine for the Naïve Bayes algorithm.
• This can theoretically be used in any data processing framework by invoking the API.
• Deployed a Naïve Bayes PMML generated from R into the Storm, Spark and Samza frameworks.
• Real-time predictions with the above APIs.

Header
• Version and timestamp
• Model development environment information

Data Dictionary
• Variable types, missing, valid and invalid values

Data Munging/Transformation
• Normalization, mapping, discretization

Model
• Model-specific attributes
• Mining Schema
  • Treatment for missing and outlier values
• Targets
  • Prior probability and default
• Outputs
  • List of computed output fields
  • Post-processing
• Definition of model architecture/parameters.

<DataDictionary numberOfFields="4">
  <DataField name="Class" optype="categorical" dataType="string">
    <Value value="democrat"/>
    <Value value="republican"/>
  </DataField>
  <DataField name="V1" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V2" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
  <DataField name="V3" optype="categorical" dataType="string">
    <Value value="n"/>
    <Value value="y"/>
  </DataField>
</DataDictionary>

(continued on the next slide)

<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
  <MiningSchema>
    <MiningField name="Class" usageType="predicted"/>
    <MiningField name="V1" usageType="active"/>
    <MiningField name="V2" usageType="active"/>
    <MiningField name="V3" usageType="active"/>
  </MiningSchema>
  <Output>
    <OutputField name="Predicted_Class" feature="predictedValue"/>
    <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
    <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
  </Output>
  <BayesInputs>

(continued on the next slide)

  <BayesInputs>
    <BayesInput fieldName="V1">
      <PairCounts value="n">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="51"/>
          <TargetValueCount value="republican" count="85"/>
        </TargetValueCounts>
      </PairCounts>
      <PairCounts value="y">
        <TargetValueCounts>
          <TargetValueCount value="democrat" count="73"/>
          <TargetValueCount value="republican" count="23"/>
        </TargetValueCounts>
      </PairCounts>
    </BayesInput>
    <BayesInput fieldName="V2"> *
    <BayesInput fieldName="V3"> *
  </BayesInputs>
  <BayesOutput fieldName="Class">
    <TargetValueCounts>
      <TargetValueCount value="democrat" count="124"/>
      <TargetValueCount value="republican" count="108"/>
    </TargetValueCounts>
  </BayesOutput>

Definition of Elements

• DataDictionary: definitions for fields as used in mining models (Class, V1, V2, V3).
• NaiveBayesModel: indicates that this is a Naïve Bayes PMML model.
• MiningSchema: lists fields as used in that model. Class is the "predicted" field; V1, V2, V3 are "active" predictor fields.
• Output: describes a set of result values that can be returned from a model.
• BayesInputs: for each value of each input field, contains the counts per target value.
• BayesOutput: contains the counts associated with the values of the target field.
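To show how the counts in BayesInputs and BayesOutput turn into a prediction, here is a minimal scoring sketch. It is not the scoring engine described in the talk: the counts are hard-coded rather than parsed from the PMML file, only field V1 is used because the V2 and V3 counts are elided on the slide, and the object and method names are invented. Replacing a zero-count likelihood with the model's `threshold` attribute follows my reading of the PMML NaiveBayesModel description and should be checked against the specification.

```scala
// Naïve Bayes scoring from PMML-style counts (V1 only); illustrative sketch.
object NaiveBayesScore {
  // Class counts, as in the BayesOutput element.
  val classCounts: Map[String, Double] =
    Map("democrat" -> 124.0, "republican" -> 108.0)

  // PairCounts for field V1: input value -> (class -> count).
  val v1Counts: Map[String, Map[String, Double]] = Map(
    "n" -> Map("democrat" -> 51.0, "republican" -> 85.0),
    "y" -> Map("democrat" -> 73.0, "republican" -> 23.0)
  )

  // Posterior over classes given the value of V1.
  def posterior(v1: String, threshold: Double = 0.003): Map[String, Double] = {
    val total = classCounts.values.sum
    val unnormalized = classCounts.map { case (cls, n) =>
      val count = v1Counts(v1).getOrElse(cls, 0.0)
      // Likelihood P(V1 = v1 | class); a zero count falls back to the threshold.
      val likelihood = if (count == 0.0) threshold else count / n
      (cls, (n / total) * likelihood) // prior times likelihood
    }
    val z = unnormalized.values.sum   // normalization constant
    unnormalized.map { case (cls, p) => (cls, p / z) }
  }

  // Predicted class: the one with the highest posterior.
  def predict(v1: String): String = posterior(v1).maxBy(_._2)._1

  def main(args: Array[String]): Unit = {
    println(posterior("y"))
    println(predict("n"))
  }
}
```

With more input fields, the likelihood becomes the product of one such term per field, which is the part a full engine would read from each BayesInput element.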

Sample Input

Eg1 - n y y n y y n n n n n n y y y y
Eg2 - n y n y y y n n n n n y y y n y

• 1st, 2nd and 3rd columns: predictor variables (attribute "name" in element MiningField)
• Using these we predict whether the output is Democrat or Republican (PMML element BayesOutput)

• 3-node Xeon Storm cluster (8 quad-core CPUs, 32 GB RAM, 32 GB swap space, 1 Nimbus, 2 Supervisors)

Number of records (in millions)   Time taken (seconds)
0.1                               4
0.4                               7
1.0                               12
2.0                               21
10                                129
25                                310

• 3-node Xeon Spark cluster (8 quad-core CPUs, 32 GB RAM and 32 GB swap space)

Number of records (in millions)   Time taken
0.1                               1 min 47 sec
0.2                               3 min 35 sec
0.4                               6 min 40 sec
1.0                               35 min 17 sec
10                                More than 3 hrs





GraphLab: Ideal Engine for Processing Natural Graphs [YL12]

• Goals – targeted at machine learning.
  • Model graph dependencies, be asynchronous, iterative, dynamic.
• Data associated with edges (weights, for instance) and vertices (user profile data, current interests etc.).
• Update functions – live on each vertex.
  • Transform data in scope of vertex.
  • Can choose to trigger neighbours (for example, only if the rank changes drastically).
  • Run asynchronously till convergence – no global barrier.
• Consistency is important in ML algorithms (some, such as collaborative filtering, do not even converge when there are inconsistent updates).
  • GraphLab provides varying levels of consistency. Parallelism vs. consistency.
• Implemented several algorithms, including ALS, K-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM etc.
  • Co-EM (Expectation Maximization) algorithm 15x faster than Hadoop MR – on distributed GraphLab, only 0.3% of Hadoop execution time.

[YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.

GraphLab 2: PowerGraph – Modeling Natural Graphs [1]

• GraphLab could not scale to the Altavista web graph (2002): 1.4B vertices, 6.7B edges.
  • Most graph-parallel abstractions assume small neighbourhoods – low degree vertices.
  • But natural graphs (LinkedIn, Facebook, Twitter) are power law graphs.
  • Hard to partition power law graphs; high degree vertices limit parallelism.
• PowerGraph provides a new way of partitioning power law graphs.
  • Edges are tied to machines; vertices (esp. high degree ones) span machines.
  • Execution split into 3 phases: gather, apply and scatter.
• Triangle counting on the Twitter graph
  • Hadoop MR took 423 minutes on 1536 machines.
  • GraphLab 2 took 1.5 minutes on 1024 cores (64 machines).

[1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12).
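The gather, apply and scatter phases can be illustrated with a single-machine PageRank sketch. This is not the GraphLab or PowerGraph API: the object name, the edge-list representation, the damping factor of 0.85 and the tolerance are all assumptions made for the example, and the graph is assumed to have no vertices without out-edges. The point is the control flow: each active vertex gathers from its in-neighbours, applies an update to its own value, and scatters by re-activating its out-neighbours only when its rank changed noticeably.

```scala
// PageRank in gather-apply-scatter style on one machine (illustrative only).
object GasPageRank {
  def pageRank(numVertices: Int,
               edges: Seq[(Int, Int)], // (source, target) pairs
               damping: Double = 0.85,
               tol: Double = 1e-10,
               maxSweeps: Int = 1000): Array[Double] = {
    val outDegree = Array.fill(numVertices)(0)
    val inNbrs = Array.fill(numVertices)(List.empty[Int])
    val outNbrs = Array.fill(numVertices)(List.empty[Int])
    edges.foreach { case (s, t) =>
      outDegree(s) += 1
      inNbrs(t) = s :: inNbrs(t)
      outNbrs(s) = t :: outNbrs(s)
    }

    val rank = Array.fill(numVertices)(1.0)
    var active: Set[Int] = (0 until numVertices).toSet
    var sweeps = 0
    while (active.nonEmpty && sweeps < maxSweeps) {
      var next = Set.empty[Int]
      for (v <- active) {
        // Gather: sum the rank contributions of in-neighbours.
        val acc = inNbrs(v).map(u => rank(u) / outDegree(u)).sum
        // Apply: compute and store the new rank of v.
        val newRank = (1 - damping) + damping * acc
        val delta = math.abs(newRank - rank(v))
        rank(v) = newRank
        // Scatter: wake up out-neighbours only if v changed enough.
        if (delta > tol) next = next ++ outNbrs(v)
      }
      active = next
      sweeps += 1
    }
    rank
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq((0, 1), (0, 2), (1, 2), (2, 0))
    println(pageRank(3, edges).mkString(", "))
  }
}
```

Because updates are written in place and only changed vertices trigger further work, there is no global barrier between rounds, which mirrors the asynchronous, dynamic execution the slides describe.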





Hadoop YARN Requirements or 1.0 shortcomings

• R1: Scalability
  • Single cluster limitation
• R2: Multi-tenancy
  • Addressed by Hadoop-on-Demand
  • Security, quotas
• R3: Locality awareness
  • Shuffle of records
• R4: Shared cluster utilization
  • Hogging by users
  • Typed slots
• R5: Reliability/Availability
  • Job Tracker bugs
• R6: Iterative Machine Learning

Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler, "Apache Hadoop YARN: Yet Another Resource Negotiator", ACM Symposium on Cloud Computing, Oct 2013, ACM Press.

Hadoop YARN Architecture

YARN Internals


Application Master

• Sends ResourceRequests to the YARN RM

• Captures containers, resources per container, locality preferences.

YARN RM

• Generates tokens and containers

• Global view of cluster – monolithic scheduling.

Node Manager

• Node health monitoring, advertise available resources through heartbeats to RM.


Programming Abstractions

• PMML: XML-based representation of the analytical model
• Spark: Scala collection over a distributed shared memory system
• GraphLab: Gather-Apply-Scatter
• Forge: Domain Specific Language

Forge: Approach to build high performance Domain Specific Languages

• Domain specific language approach from Stanford.
• Forge [AKS13] – a meta DSL for high performance DSLs.
• 40X faster than Spark!
• OptiML – DSL for machine learning

[AKS13] Arvind K. Sujeeth, Austin Gibbons, Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Martin Odersky, and Kunle Olukotun. 2013. Forge: generating a high performance DSL implementation from a declarative specification. In Proceedings of the 12th international conference on Generative programming: concepts & experiences (GPCE '13). ACM, New York, NY, USA, 145-154.

Conclusions

• Beyond Hadoop Map-Reduce philosophy
• Optimization and other problems.
• Real-time computation
• Processing specialized data structures
• PMML scoring
• Spark for batch computations
• Spark streaming and Storm for real-time.
• Allows traditional analytical tools/algorithms to be re-used.

Thank You!

Mail • [email protected]

LinkedIn

• www.linkedin.com/company/impetus

Blogs • http://blogs.impetus.com/

Twitter • @impetustech
