Introduction to the Spark MLLib Toolkit in IBM Streams V4.1

Transcript of Introduction to the Spark MLLib Toolkit in IBM Streams V4.1

  1. Introduction to Spark MLLib Toolkit in IBM Streams. IBM Streams Version 4.1. Ankit Pasricha, Toolkits Team Lead, ankitp@ca.ibm.com. 2015 IBM Corporation.
  2. Important Disclaimer: THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE EFFECT OF: CREATING ANY WARRANTY OR REPRESENTATION FROM IBM (OR ITS AFFILIATES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS); OR ALTERING THE TERMS AND CONDITIONS OF THE APPLICABLE LICENSE AGREEMENT GOVERNING THE USE OF IBM SOFTWARE. IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
  3. Agenda: Introduction; Spark MLLib library; Spark MLLib Toolkit Overview; Using Spark models in Streams; Example; Streams + Spark Demo; Resources; Q&A.
  4. What is Spark? Spark is a next-generation framework for large-scale, in-memory distributed data processing, following in the footsteps of Hadoop.
  5. Introduction to Spark: Allows processing of static distributed data in memory. Advantages: speed (in-memory computing); ease of use (program in Java, Scala or Python; interactivity through the Scala or Python shells). Tools support: Spark SQL, MLlib, GraphX, Spark Streaming. Spark Streaming handles streaming data in micro-batches; it is not real-time like Streams.
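To make the micro-batch point on the slide above concrete, here is a minimal plain-Java sketch of the idea: events are collected per fixed batch interval and can only be processed once their interval has closed. This is a simplified model, not Spark Streaming code; the class and method names, the 1000 ms interval, and the sample timestamps are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model of micro-batching; not the Spark Streaming API.
class MicroBatchSketch {

    // Index of the batch interval that a (non-negative) arrival time falls into.
    static long batchIndex(long timestampMs, long intervalMs) {
        return timestampMs / intervalMs;
    }

    // Group arrival times into micro-batches, keyed by batch index.
    static Map<Long, List<Long>> groupIntoBatches(long[] timestampsMs, long intervalMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestampsMs) {
            batches.computeIfAbsent(batchIndex(ts, intervalMs), k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }

    // Time an event waits before its batch interval closes and the batch can be processed.
    static long waitUntilBatchCloses(long timestampMs, long intervalMs) {
        return (batchIndex(timestampMs, intervalMs) + 1) * intervalMs - timestampMs;
    }

    public static void main(String[] args) {
        long[] arrivals = {0, 400, 999, 1000, 2500}; // hypothetical arrival times in ms
        System.out.println(groupIntoBatches(arrivals, 1000));
        System.out.println(waitUntilBatchCloses(400, 1000));
    }
}
```

With these inputs the program prints `{0=[0, 400, 999], 1=[1000], 2=[2500]}` and then `600`: the event arriving at 400 ms waits 600 ms for its batch to close, whereas a per-tuple engine would handle it on arrival.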
  6. Introduction to MLLib: The Spark MLLib library provides various analytics: basic statistics (mean, median; correlation; hypothesis testing); classification and regression (linear, logistic and isotonic regression; naive Bayes; decision trees); collaborative filtering; clustering (K-means); dimensionality reduction; feature extraction and transformation; optimization.
  7. Spark MLLib Toolkit Overview: Combines the power of Spark MLLib and the real-time streaming capabilities of Streams. Allows scoring of real-time streaming data using Spark models. GitHub project: http://ibmstreams.github.io/streamsx.sparkMLLib/ Support for a number of MLLib models: classification (linear SVM, naive Bayes); clustering (K-means); collaborative filtering; regression (isotonic, linear, logistic); tree (decision tree, gradient boosted trees, random forest).
  8. Using Spark models in Streams. Step 1: Create Spark application: use any supported language (Scala, Java or Python); train the analytics model using historical data; save the model (to file or HDFS, for example) using Spark's save/load API. Step 2: Create Streams application: use the appropriate Spark analytics operator in the Spark MLLib toolkit; specify the location where the trained model was stored; specify the incoming attributes to use as inputs to the model; run the application. The score from the analysis is output as a tuple by the operator.
  9. Example: Spark application. Written in Java. Creates a Spark context. Uses a CSV file to train a K-means clustering model. Saves the model to disk/HDFS.
  10. Example: Streams application. Written in SPL. Adds a SparkClusteringKMeans operator. Specifies an attribute on the input port to be used for scoring. Specifies the saved model location as a parameter. The score is output as analysisResult (int32) on the output port.
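The two example slides above describe training a K-means model and then scoring incoming tuples with it. As a rough illustration of what those two steps compute, the following plain-Java sketch trains cluster centers with Lloyd's algorithm and scores a point by returning the index of its nearest center. It deliberately uses neither the Spark MLlib API nor the SparkClusteringKMeans operator, and it omits saving and loading the model. The class name, method names, sample CSV rows, and the choice to seed with the first k points are all assumptions made for illustration; reading the integer cluster index as the counterpart of the slide's int32 analysisResult is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of K-means training and scoring; not the Spark MLlib API.
class KMeansSketch {

    // Parse one CSV line of numbers into a feature vector.
    static double[] parseCsvLine(String line) {
        String[] parts = line.trim().split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Double.parseDouble(parts[i].trim());
        }
        return v;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Scoring: index of the center closest to the given features.
    static int score(double[][] centers, double[] features) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(centers[c], features) < squaredDistance(centers[best], features)) {
                best = c;
            }
        }
        return best;
    }

    // Training: Lloyd's algorithm, seeded with the first k points (an assumption
    // that keeps the result deterministic). Requires points.size() >= k.
    static double[][] train(List<double[]> points, int k, int maxIterations) {
        int dim = points.get(0).length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points.get(c).clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = score(centers, p);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int i = 0; i < dim; i++) {
                    double updated = sums[c][i] / counts[c];
                    if (updated != centers[c][i]) {
                        changed = true;
                    }
                    centers[c][i] = updated;
                }
            }
            if (!changed) {
                break; // centers are stable, so assignments will not change
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        // Hypothetical historical data, one CSV row per point.
        String[] csv = {"1.0,1.0", "9.0,9.0", "1.0,2.0", "9.0,8.0"};
        List<double[]> points = new ArrayList<>();
        for (String line : csv) {
            points.add(parseCsvLine(line));
        }
        double[][] centers = train(points, 2, 20);
        System.out.println(Arrays.deepToString(centers));
        System.out.println(score(centers, new double[] {0.5, 1.0}));
        System.out.println(score(centers, new double[] {8.0, 9.5}));
    }
}
```

For the sample rows the centers converge to `[[1.0, 1.5], [9.0, 8.5]]`, and the two new points score as cluster `0` and cluster `1`. In the workflow on the slides, the training half would run once in the Spark application and only the scoring half would run per tuple in the Streams application.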
  11. Streams + Spark Demo
  12. Streams + Spark Demo. Historical city data sets: incidents, calls for service (911, etc.), 311, code violations, permits, buildings.
  13. Streams + Spark Demo. Historical city data sets: incidents, calls for service (911, etc.), 311, code violations, permits, buildings. HDFS. Apache Spark MLlib.
  14. Streams + Spark Demo. Historical city data sets: incidents, calls for service (911, etc.), 311, code violations, permits, buildings. HDFS. Apache Spark MLlib. Model: Is this call for service a false alarm?
  15. Streams + Spark Demo. Historical city data sets: incidents, calls for service (911, etc.), 311, code violations, permits, buildings. HDFS. Apache Spark MLlib. Model: Is this call for service a false alarm? Real-time calls for service. IBM Streams. Real-time predictions and relevant context. Real-time dashboard.
  16. Demo
  17. Resources. Getting Started Guide: https://developer.ibm.com/streamsdev/docs/getting-started-with-the-spark-mllib-toolkit/ Documentation: http://ibmstreams.github.io/streamsx.sparkMLLib/com.ibm.streamsx.sparkmllib/doc/spldoc/html/ MLLib Guide: https://spark.apache.org/docs/latest/mllib-guide.html Samples: https://github.com/IBMStreams/streamsx.sparkMLLib/tree/master/samples
  18. Questions?