Introduction to the Spark MLLib Toolkit in IBM Streams V4.1

© 2015 IBM Corporation

Ankit Pasricha

Toolkits Team Lead

Important Disclaimer

THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY.

WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.

IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE.

IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION.

NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE EFFECT OF:

• CREATING ANY WARRANTY OR REPRESENTATION FROM IBM (OR ITS AFFILIATES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS); OR

• ALTERING THE TERMS AND CONDITIONS OF THE APPLICABLE LICENSE AGREEMENT GOVERNING THE USE OF IBM SOFTWARE.

IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.



Agenda

Introduction

– Spark

– MLLib library

Spark MLLib Toolkit Overview

Using Spark models in Streams

Example

Streams + Spark Demo

Resources

Q&A


What is Spark?

Spark is a next-generation framework for large-scale, in-memory distributed data processing, following in the footsteps of Hadoop.


Introduction to Spark

Allows processing of static distributed data in memory

Advantages
• Speed (in-memory computing)
• Ease of use
  – Program in Java, Scala or Python
  – Provides interactivity through Scala or Python shells
• Tools support
  – Spark SQL
  – MLLib
  – GraphX
  – Spark Streaming

Spark Streaming
• Handles streaming data in micro-batches
• Not real-time in the way Streams is: Streams processes each tuple as it arrives
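The difference between micro-batching and tuple-at-a-time processing can be sketched in plain Python. This is only an illustration, not Spark Streaming or Streams code: the function names and the `double` scoring function are hypothetical, and batches are cut by count here for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from typing import Callable, Iterable, Iterator, List


def per_tuple(source: Iterable[float],
              score: Callable[[float], float]) -> Iterator[float]:
    """Tuple-at-a-time: each reading is scored as soon as it arrives."""
    for reading in source:
        yield score(reading)


def micro_batch(source: Iterable[float],
                score: Callable[[float], float],
                batch_size: int) -> Iterator[List[float]]:
    """Micro-batching: readings are buffered, then scored one batch at a time."""
    batch: List[float] = []
    for reading in source:
        batch.append(reading)
        if len(batch) == batch_size:
            yield [score(r) for r in batch]
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield [score(r) for r in batch]


def double(x: float) -> float:
    """Hypothetical stand-in for real analytics."""
    return 2 * x


readings = [1.0, 2.0, 3.0, 4.0, 5.0]
tuple_results = list(per_tuple(readings, double))
batch_results = list(micro_batch(readings, double, batch_size=2))

print(tuple_results)
print(batch_results)
```

In the micro-batch version the first result is not available until the whole batch has been collected, which is the source of the extra latency.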


Introduction to MLLib

The Spark MLLib library provides various analytics:
• Basic statistics
  – Mean, median
  – Correlation
  – Hypothesis testing
• Classification and regression
  – Linear, logistic, isotonic regression
  – Naive Bayes
  – Decision trees
• Collaborative filtering
• Clustering
  – K-means
• Dimensionality reduction
• Feature extraction and transformation
• Optimization


Spark MLLib Toolkit Overview

Combines the power of Spark MLLib and the real-time streaming capabilities of Streams

Allows scoring of real-time streaming data using Spark models

GitHub project: http://ibmstreams.github.io/streamsx.sparkMLLib/

Support for a number of MLLib models
• Classification
  – Linear SVM
  – Naive Bayes
• Clustering
  – KMeans
• Collaborative Filtering
• Regression
  – Isotonic
  – Linear
  – Logistic
• Tree
  – Decision Tree
  – Gradient Boosted Trees
  – Random Forest


Using Spark models in Streams

Step 1: Create Spark Application
• Use any supported language: Scala, Java or Python
• Train the analytics model using historical data
• Save the model (to file or HDFS for example) using Spark’s save/load API

Step 2: Create Streams Application
• Use the appropriate Spark Analytics Operator in the Spark MLLib toolkit
• Specify the location where the trained model was stored
• Specify the incoming attributes to use as inputs to the model
• Run the application
• The score from the analysis will be output as a tuple by the operator


Example

Spark Application
• Written in Java
• Creates a Spark Context
• Uses a CSV file to train a K-means clustering model
• Saves the model to disk/HDFS
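The train-then-save pattern of the Spark application can be sketched without Spark. The deck's application is written in Java against Spark MLLib; what follows is a plain-Python stand-in that trains a tiny K-means model from CSV text and writes the cluster centres to a file. The sample data, the `train_kmeans` helper, and the JSON file layout are all assumptions made for illustration and are not Spark's save format.

```python
import csv
import io
import json
import os
import tempfile
from typing import List

Point = List[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centers: List[Point]) -> int:
    """Index of the centre closest to the point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def train_kmeans(points: List[Point], k: int, iterations: int = 10) -> List[Point]:
    """Basic Lloyd's algorithm; real libraries use better initialisation."""
    centers = [list(p) for p in points[:k]]  # naive init: first k points
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Hypothetical "historical data" standing in for the CSV file.
HISTORICAL_CSV = "0.0,0.0\n0.2,0.1\n9.0,9.0\n9.2,8.8\n0.1,0.3\n8.9,9.1\n"
points = [[float(v) for v in row] for row in csv.reader(io.StringIO(HISTORICAL_CSV))]

centers = train_kmeans(points, k=2)

# Persist the model so a separate scoring application can load it later.
model_path = os.path.join(tempfile.mkdtemp(), "kmeans_model.json")
with open(model_path, "w") as f:
    json.dump({"centers": centers}, f)

print(centers)
```

Training happens once, offline, on historical data; only the saved model location needs to be shared with the scoring side.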


Example

Streams Application
• Written in SPL
• Add a SparkClusteringKMeans operator
• Specify an attribute on the input port to be used for scoring
• Specify the saved model location as a parameter
• The score is output as the int32 attribute analysisResult on the output port
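What the scoring side does with a K-means model can be sketched in plain Python: load the saved cluster centres, then, for each incoming tuple, emit the index of the nearest centre as `analysisResult`. This is not SPL and not the toolkit's implementation. The JSON model file, the `testData` attribute name, and the helper functions are hypothetical; only the `analysisResult` output name comes from the slide.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, Iterator, List

# Hypothetical saved model: two cluster centres in a JSON file.
model_path = os.path.join(tempfile.mkdtemp(), "kmeans_model.json")
with open(model_path, "w") as f:
    json.dump({"centers": [[0.1, 0.1], [9.0, 9.0]]}, f)


def load_centers(path: str) -> List[List[float]]:
    """Read the cluster centres back from the model location."""
    with open(path) as f:
        return json.load(f)["centers"]


def score_stream(tuples: Iterable[Dict],
                 centers: List[List[float]],
                 input_attr: str) -> Iterator[Dict]:
    """Score tuples one at a time, adding an integer analysisResult."""
    for tup in tuples:
        features = tup[input_attr]
        cluster = min(
            range(len(centers)),
            key=lambda i: sum((x - c) ** 2 for x, c in zip(features, centers[i])),
        )
        yield {**tup, "analysisResult": cluster}


centers = load_centers(model_path)
incoming = [
    {"testData": [0.3, 0.2]},
    {"testData": [8.5, 9.4]},
    {"testData": [1.0, 0.0]},
]
results = list(score_stream(incoming, centers, input_attr="testData"))

for r in results:
    print(r)
```

The model is loaded once up front, so the per-tuple work is a nearest-centre lookup, which keeps scoring cheap enough to run on every arriving tuple.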



[Diagram: Historical City Data Sets (Incidents, Calls for Service (911, etc.), 311, Code Violations, Permits, Buildings) are held in HDFS and used with Apache Spark MLlib to build a model: "Is this call for service a false alarm?" IBM Streams takes in Real-time Calls for Service, applies the model, and sends Real-time Predictions & Relevant Context to a Real-time Dashboard.]



Demo


Resources

Getting Started Guide: https://developer.ibm.com/streamsdev/docs/getting-started-with-the-spark-mllib-toolkit/

Documentation: http://ibmstreams.github.io/streamsx.sparkMLLib/com.ibm.streamsx.sparkmllib/doc/spldoc/html/

MLLib Guide: https://spark.apache.org/docs/latest/mllib-guide.html

Samples: https://github.com/IBMStreams/streamsx.sparkMLLib/tree/master/samples


Questions?
