Using Apache Spark with IBM SPSS Modeler
Transcript of Using Apache Spark with IBM SPSS Modeler
Dr. Steve R. Poulin
© Global Knowledge Training LLC. All rights reserved. Page 2
Dr. Steve Poulin, Principal Data Scientist & Manager of Predictive Analytics
Over 20 years of experience as an SPSS trainer and consultant
Holds a Ph.D. in Social Policy, Planning, and Policy Analysis from Columbia University
IBM Master Instructor with Global Knowledge
Worked with over 250 organizations that have used SPSS
Currently more heavily involved in consulting
Agenda
Intro Concepts
Enabling Apache Spark Applications
Gradient Boosted Trees with MLlib
K-Means with MLlib
Multinomial Naive Bayes with MLlib
Q&A
Follow-Ons & Additional References
Intro Concepts
What is Apache Spark?
Apache Spark1 is an open-source cluster-computing framework whose in-memory processing can run analytic applications up to 100 times faster than comparable disk-based technologies.
Apache Spark works within Hadoop and is an alternative to MapReduce.
Hadoop
Hadoop is a collection of open-source modules that are part of the Apache Project.
o The Apache Project is managed by the volunteer-run Apache Software Foundation.
One of the major components of Hadoop is the Hadoop Distributed File System (HDFS™), which is a distributed file system providing high-throughput access to application data.
MapReduce
MapReduce2 is the processing engine for Apache Hadoop:
o A parallel processing system composed of a map procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue per name) and a reduce procedure that performs a summary operation (such as counting the students in each queue, yielding name frequencies).
It is designed for the analysis of large datasets.
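The map/sort/reduce flow described above can be illustrated with a plain-Python sketch of the student-queue example (the names `map_phase` and `reduce_phase` are hypothetical, chosen only for this example; real MapReduce distributes these steps across a cluster):

```python
from collections import defaultdict

def map_phase(students):
    # Map: emit (first_name, 1) pairs, then sort them into queues by name
    pairs = [(name, 1) for name in students]
    queues = defaultdict(list)
    for name, one in sorted(pairs):
        queues[name].append(one)
    return queues

def reduce_phase(queues):
    # Reduce: summarize each queue by counting its entries (name frequencies)
    return {name: sum(ones) for name, ones in queues.items()}

students = ["Ada", "Bo", "Ada", "Cy", "Bo", "Ada"]
print(reduce_phase(map_phase(students)))  # {'Ada': 3, 'Bo': 2, 'Cy': 1}
```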
MapReduce and Apache Spark
Apache Spark performs in-memory processing, whereas MapReduce moves data in and out of disk.3
As a result, Apache Spark can run programs up to 100x faster than MapReduce in memory, or 10x faster on disk.
Enabling Apache Spark Applications
IBM SPSS Modeler
Apache Spark is well-suited for running complex machine learning techniques using machine learning libraries (MLlib) with large datasets.
Although Apache Spark applications will run with any data source, they will only achieve these efficiencies when connected to the Analytic Server node, which enables IBM SPSS Modeler to use data from a Hadoop environment.
The following applications, which can be accessed from within IBM SPSS Modeler, will be demonstrated during this seminar:
o Gradient Boosted Trees with MLlib
o K-Means with MLlib
o Multinomial Naive Bayes with MLlib
IBM SPSS Analytic Server
IBM SPSS Analytic Server enables IBM SPSS Modeler to use data from Hadoop distributions.
This feature is found as a node in the Sources palette:
Although Apache Spark applications will run with data accessed from many data sources (e.g. SQL databases and text files), they will not achieve their full potential efficiency unless they are connected to a Hadoop data environment through IBM SPSS Analytic Server.4
Enabling IBM SPSS Modeler to Run Apache Spark Applications
Install a copy of Python 2.7 that includes NumPy, a Python component for scientific computing.
o Anaconda is a free package manager that includes Python with the NumPy component.
o The Python 2.7 Anaconda package can be downloaded from Continuum Analytics at: www.continuum.io/downloads
The following line of text must be added to your options.cfg file:
o eas_pyspark_python_path, "[location of the python.exe file in the Python installation with NumPy]"
o For example: eas_pyspark_python_path, "C:/Program Files/Anaconda2/python.exe"
The options.cfg file is located in the config folder of your IBM SPSS Modeler Program Files.
o For example: C:\Program Files\IBM\SPSS\Modeler\18.0\config
Adding Spark Applications through IBM SPSS Modeler Extension Hub
The Extension Hub automatically connects to the IBM SPSS Predictive Analytics Gallery (http://ibmpredictiveanalytics.github.io) and presents the extensions in a dialog box.
IBM SPSS Modeler Extension Hub Dialog Box
Demos on extensions can be obtained at: https://github.com/IBMPredictiveAnalytic
Gradient Boosted Trees with MLlib
Introduction
Like the Random Trees procedure, this procedure generates ensembles of decision trees, but it also trains the trees iteratively in order to minimize a "loss function" (a penalty for mispredictions).5
The algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label.
The dataset is re-labeled to put more emphasis on training instances with poor predictions.
Thus, in the next iteration, the decision tree will help correct for previous mistakes.
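The boosting loop described above can be sketched in plain Python for the squared-error loss, using one-split "stump" trees fit to the current residuals (the helpers `stump` and `boost` are illustrative names, not MLlib APIs; MLlib's implementation is distributed and far more general):

```python
def stump(xs, residuals):
    # Fit a depth-1 regression tree: find the split on x that minimizes
    # the squared error of the two leaf means.
    best = None
    for s in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= s]
        right = [r for x, r in zip(xs, residuals) if x > s]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, s, lm, rm)
    _, s, lm, rm = best
    return lambda x: lm if x <= s else rm

def boost(xs, ys, rounds=20, lr=0.5):
    # Each round fits a stump to the current residuals (the "poorly
    # predicted" part of the data) and adds it to the ensemble, so the
    # next tree helps correct for previous mistakes.
    pred = [0.0] * len(xs)
    trees = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        t = stump(xs, residuals)
        trees.append(t)
        pred = [p + lr * t(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * t(x) for t in trees)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 3.0, 3.0]
f = boost(xs, ys)
print(round(f(1.5), 2), round(f(3.5), 2))  # 1.0 3.0
```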
Loss Functions
Loss Task Description
Log Loss Classification Twice binomial negative log likelihood
Squared Error Regression Also called L2 loss. Default loss for regression tasks
Absolute Error Regression Also called L1 loss. Can be more robust to outliers than Squared Error
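As a rough illustration, the three losses can be computed directly for a single observation (the function names here are hypothetical; Log Loss is written as twice the binomial negative log likelihood, per the table):

```python
import math

def log_loss(y, p):
    # Twice the binomial negative log likelihood for label y in {0, 1}
    # and predicted probability p.
    return -2.0 * (y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error(y, pred):   # L2 loss
    return (y - pred) ** 2

def absolute_error(y, pred):  # L1 loss; grows linearly, so a single
    return abs(y - pred)      # outlier is penalized less than under L2

print(round(log_loss(1, 0.9), 4))  # small penalty for a confident, correct prediction
print(squared_error(3.0, 2.5), absolute_error(3.0, 2.5))  # 0.25 0.5
```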
Gradient Boosted Trees with MLlib Dialog Boxes
Gradient Boosted Trees with MLlib Dialog Boxes
One of the three Loss Functions is selected here
Gradient Boosted Trees with MLlib Output
Confidence scores
Gradient Boosted Trees with MLlib Stream:
LIVE DEMO
K-Means with MLlib
Introduction
The K-Means clustering technique has long been part of IBM SPSS Modeler and IBM SPSS Statistics.
The user specifies the number of clusters (the "K" value) to test.
In the traditional method, K individual records are selected based on their distinctive profiles, although there is some randomness in which records are selected.
The remaining records are assigned to the K clusters based on which of the initial records they are most similar to, as determined by the squared Euclidean distance measure.
Records can be re-assigned to make the clusters more distinctive.
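The traditional procedure described above can be sketched in a few lines of plain Python, assuming random selection of the K initial records and the squared Euclidean distance measure (the `kmeans` helper is illustrative, not the Modeler or MLlib implementation):

```python
import random

def kmeans(points, k, iters=20, seed=42):
    # Pick K initial records, assign every record to its nearest center
    # by squared Euclidean distance, then re-assign until stable.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Move each center to the mean of its cluster (keep it if empty)
        centers = [
            tuple(sum(vals) / len(vals) for vals in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```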
K-Means with MLlib
The K-Means with MLlib procedure uses a machine-learning process to build the clusters.6
The convergence threshold that controls when the cluster centers have stopped moving is labeled Epsilon.
Although the user still provides the K value, the final result may contain fewer than K clusters.
K-Means with MLlib Dialog Boxes
K-Means with MLlib Dialog Boxes
When an additional iteration improves the cluster centers by less than this Epsilon value, the cluster-building process stops. Lowering this value will increase processing time.
K-Means with MLlib Dialog Boxes
This only needs to be increased if there is an indication that the convergence threshold was not met.
K-Means with MLlib Dialog Boxes
This does not need to be changed for more recent versions of Spark.
K-Means with MLlib Dialog Boxes
The Initialization Mode determines how individual records are selected for the training process. The Random option randomly selects these records.
Without the use of a Random Seed, varying distributions of random numbers will be generated, resulting in the selection of different records each time the procedure is run. If this box is checked, the Random Seed value will ensure that the same initial records are selected.
K-Means with MLlib Dialog Boxes
The K-Means || option (also known as K-Means++) in the Initialization Mode section of the dialog box provides an alternative way to select the first records for the cluster-building process.
This option builds clusters more quickly than the use of randomly selected records but may not scale up well for large datasets.
The Initialization Steps setting applies only to this option.
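The K-Means++ seeding idea can be sketched as follows: after a random first center, each subsequent center is drawn with probability proportional to its squared distance from the nearest center already chosen, so the initial records tend to be spread apart (the `kmeans_pp_init` helper is an illustrative single-machine version, not MLlib's distributed K-Means ||):

```python
import random

def kmeans_pp_init(points, k, seed=1):
    # K-Means++ seeding: first center at random, later centers drawn
    # with probability proportional to squared distance from the
    # nearest center chosen so far.
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        r = rng.uniform(0, sum(d2))
        cum = 0.0
        for p, w in zip(points, d2):
            cum += w
            if cum >= r:
                centers.append(p)
                break
    return centers

points = [(0.0, 0.0), (0.1, 0.0), (9.0, 9.0), (9.1, 9.0)]
print(kmeans_pp_init(points, k=2))
```

Because already-chosen centers have zero distance to themselves, they carry zero weight in the draw, which is why far-apart records dominate the selection.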
K-Means with MLlib Output
Cluster membership values
K-Means with MLlib Stream:
LIVE DEMO
Multinomial Naive Bayes with MLlib
Multinomial Naive Bayes with MLlib
Naive Bayes is a classification algorithm with the assumption of independence (hence the term “naïve”) between every pair of predictors (called “features” in this procedure).7
As is the case for all classification procedures, it requires one target field and any number of predictors.
In a single pass over the training data, it computes the conditional probability distribution of each feature given the label, and then applies Bayes' theorem (the probability of an event based on prior knowledge of conditions that might be related to the event) to compute the conditional probability distribution of the label given an observation, which it uses for prediction.
Multinomial Naive Bayes (in contrast to other forms of Bayesian methods) uses fields representing the number of times items, such as words, have been found in a document.
This procedure is often used for document classification.
Multinomial Naive Bayes with MLlib
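A minimal single-machine sketch of a multinomial Naive Bayes classifier on word counts, assuming Laplace smoothing with the default value of 1 (the `train_mnb` and `predict` helpers are illustrative names, not the MLlib API):

```python
import math
from collections import defaultdict

def train_mnb(docs, alpha=1.0):
    # docs: list of (label, word-count dict). alpha is the smoothing
    # parameter (default 1), which keeps unseen words from forcing a
    # conditional probability of zero.
    class_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for label, counts in docs:
        class_counts[label] += 1
        for w, n in counts.items():
            word_counts[label][w] += n
            vocab.add(w)
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        prior = math.log(class_counts[label] / len(docs))
        likelihood = {w: math.log((word_counts[label][w] + alpha) /
                                  (total + alpha * len(vocab)))
                      for w in vocab}
        model[label] = (prior, likelihood)
    return model

def predict(model, counts):
    # Bayes' theorem in log space: prior + sum of per-word log likelihoods
    scores = {label: prior + sum(n * lik[w] for w, n in counts.items() if w in lik)
              for label, (prior, lik) in model.items()}
    return max(scores, key=scores.get)

docs = [("sports", {"ball": 3, "goal": 2}),
        ("politics", {"vote": 3, "law": 2})]
model = train_mnb(docs)
print(predict(model, {"ball": 2, "goal": 1}))  # sports
```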
The Smoothing parameter addresses categories that would otherwise have a conditional probability of zero; it should probably be left at its default value of 1.
Multinomial Naive Bayes with MLlib Dialog Box
Predicted outcomes
Multinomial Naive Bayes with MLlib Output
Multinomial Naive Bayes with MLlib Stream:
LIVE DEMO
Questions?
Steve Poulin
Still have questions? [email protected]
References: Further Reading
1. www.spark.apache.org
2. https://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
3. https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
4. http://www-03.ibm.com/software/products/en/spss-analytic-server
5. http://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts
6. http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
7. http://spark.apache.org/docs/1.5.2/mllib-naive-bayes.html
Next Steps
For a deeper dive into the concepts and tactics presented here, take a look at our available training:
Introduction to IBM SPSS Modeler and Data Mining (v18)
Predictive Modeling for Categorical Targets Using IBM SPSS Modeler (v18)
Advanced Predictive Modeling Using IBM SPSS Modeler (v18)
For more information contact us at: www.globalknowledge.com | 1-800-COURSES