Introduction to the Spark MLLib Toolkit in IBM Streams V4.1
© 2015 IBM Corporation
Introduction to Spark MLLib Toolkit in IBM Streams
IBM Streams Version 4.1
Ankit Pasricha
Toolkits Team Lead
Important Disclaimer
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY.
WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE.
IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION.
NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE EFFECT OF:
• CREATING ANY WARRANTY OR REPRESENTATION FROM IBM (OR ITS AFFILIATES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS); OR
• ALTERING THE TERMS AND CONDITIONS OF THE APPLICABLE LICENSE AGREEMENT GOVERNING THE USE OF IBM SOFTWARE.
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Agenda
Introduction
– Spark
– MLLib library
Spark MLLib Toolkit Overview
Using Spark models in Streams
Example
Streams + Spark Demo
Resources
Q&A
What is Spark?
Spark is a next-generation framework for large-scale, in-memory distributed data processing, following in the footsteps of Hadoop
Introduction to Spark
Allows processing of static distributed data in memory
Advantages
Speed (In memory computing)
Ease of Use
Program in Java, Scala or Python
Provides interactivity through Scala or Python shells
Tools support
Spark SQL
MLlib
GraphX
Spark Streaming
Spark Streaming
Handle streaming data in micro-batches
Not real-time like Streams
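The difference between micro-batching and per-tuple processing can be illustrated with a toy latency model. This is a sketch only, written in plain Python rather than against the Spark Streaming or Streams APIs; the function names, arrival times, and the 1-second batch interval are assumptions made for illustration, and processing time is ignored.

```python
import math

def micro_batch_delays(event_times, batch_interval):
    """Delay of each event when results appear only once its batch closes."""
    delays = []
    for t in event_times:
        # The batch containing t closes at the next multiple of the interval.
        batch_end = (math.floor(t / batch_interval) + 1) * batch_interval
        delays.append(batch_end - t)
    return delays

def per_tuple_delays(event_times):
    """Delay of each event when every tuple is handled as soon as it arrives."""
    return [0.0 for _ in event_times]

# Hypothetical arrival times (seconds) and a hypothetical 1-second interval.
arrivals = [0.25, 0.5, 1.75, 2.0]
batch_delays = micro_batch_delays(arrivals, batch_interval=1.0)
tuple_delays = per_tuple_delays(arrivals)
print(batch_delays)
print(tuple_delays)
```

In this model an event waits up to one full batch interval before its result can be produced, which is the added latency the slide refers to.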
Introduction to MLLib
Spark MLLib library provides various analytics
Basic statistics
Mean, Median
Correlation
Hypothesis testing
Classification and regression
Linear, logistic, isotonic regression
Naive Bayes
Decision trees
Collaborative filtering
Clustering
K-means
Dimensionality reduction
Feature extraction and transformation
Optimization
Spark MLLib Toolkit Overview
Combines the power of Spark MLLib and real-time streaming capabilities of Streams
Allows scoring of real-time streaming data using Spark models
GitHub project
http://ibmstreams.github.io/streamsx.sparkMLLib/
Support for a number of MLLib models
• Classification
– Linear SVM
– Naive Bayes
• Clustering
– KMeans
• Collaborative Filtering
• Regression
– Isotonic
– Linear
– Logistic
• Tree
– Decision Tree
– Gradient Boosted Trees
– Random Forest
Using Spark models in Streams
Step 1: Create Spark Application
Use any supported language: Scala, Java or Python
Train the analytics model using historical data
Save the model (to file or HDFS for example) using Spark’s save/load API
Step 2: Create Streams Application
Use the appropriate Spark Analytics Operator in the Spark MLLib toolkit
Specify the location where the trained model was stored
Specify the incoming attributes to use as inputs to the model
Run the application
The score from the analysis will be output as a tuple by the operator
Example
Spark Application
Written in Java
Creates a Spark Context
Uses a CSV file to train a K-means clustering model
Saves the model to disk/HDFS
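The training side of this example can be sketched in plain Python. This is not the Java Spark application from the slide and does not use Spark: it is a small stand-in that runs Lloyd's K-means algorithm on in-memory CSV text and writes the resulting cluster centers to a JSON file. The data values, the JSON layout, and all function names are assumptions made for illustration. In Spark itself, the corresponding steps would go through MLlib's K-means training and the model's save method.

```python
import csv
import io
import json
import os
import tempfile

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))

def train_kmeans(points, k, iterations=10):
    """Lloyd's algorithm, seeded with the first k points so runs are repeatable."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the previous center if a cluster ends up empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers

# Hypothetical training data standing in for the CSV file on the slide.
csv_text = "1.0,1.0\n9.0,9.0\n1.0,2.0\n8.0,9.0\n"
points = [[float(v) for v in row] for row in csv.reader(io.StringIO(csv_text))]

centers = train_kmeans(points, k=2)

# Persist the model so that a separate scoring application can load it later.
model_path = os.path.join(tempfile.mkdtemp(), "kmeans_model.json")
with open(model_path, "w") as f:
    json.dump({"centers": centers}, f)
print(centers)
```

Seeding with the first k points is a simplification chosen to keep the sketch deterministic; real K-means implementations typically use randomized or more careful initialization.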
Example
Streams Application
Written in SPL
Add a SparkClusteringKMeans operator
Specify an attribute on the input port to be used for scoring.
Specify the saved model location as a parameter.
The score is output as the int32 attribute analysisResult on the output port.
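What the scoring step amounts to for a K-means model can be sketched in plain Python. This is not SPL and not the SparkClusteringKMeans operator: the JSON model file, the `features` attribute name, and the helper functions are assumptions made for illustration. Only the `analysisResult` output attribute name is taken from the slide. For K-means, the score is the index of the cluster whose center is nearest to the incoming point.

```python
import json
import os
import tempfile

def load_centers(path):
    """Load cluster centers from the (hypothetical) JSON model file."""
    with open(path) as f:
        return json.load(f)["centers"]

def score(tuple_in, centers, feature_attr="features"):
    """Return a copy of the tuple with the nearest cluster index added."""
    point = tuple_in[feature_attr]
    distances = [sum((x - c) ** 2 for x, c in zip(point, center))
                 for center in centers]
    result = dict(tuple_in)
    result["analysisResult"] = distances.index(min(distances))
    return result

# Hypothetical saved model, written here only so the sketch is self-contained.
model_path = os.path.join(tempfile.mkdtemp(), "kmeans_model.json")
with open(model_path, "w") as f:
    json.dump({"centers": [[1.0, 1.5], [8.5, 9.0]]}, f)

centers = load_centers(model_path)

# Hypothetical incoming tuples; each is scored as it arrives.
incoming = [{"id": 1, "features": [0.5, 1.0]},
            {"id": 2, "features": [9.0, 8.0]}]
scored = [score(t, centers) for t in incoming]
print(scored)
```

The model is loaded once and then applied to each tuple individually, which mirrors the split between offline training and per-tuple scoring described on the slides.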
Streams + Spark Demo
Historical City Data Sets: Incidents, Calls for Service (911, etc.), 311, Code Violations, Permits, Buildings
Apache Spark MLlib, hdfs
Model: Is this call for service a false alarm?
Real-time Calls for Service
IBM Streams
Real-time Predictions & Relevant Context
Real-time Dashboard
Demo
Resources
Getting Started Guide: https://developer.ibm.com/streamsdev/docs/getting-started-with-the-spark-mllib-toolkit/
Documentation: http://ibmstreams.github.io/streamsx.sparkMLLib/com.ibm.streamsx.sparkmllib/doc/spldoc/html/
MLLib Guide: https://spark.apache.org/docs/latest/mllib-guide.html
Samples: https://github.com/IBMStreams/streamsx.sparkMLLib/tree/master/samples
Questions?