Using Apache Spark with IBM SPSS Modeler
Transcript of Using Apache Spark with IBM SPSS Modeler
Dr. Steve R. Poulin
© Global Knowledge Training LLC. All rights reserved. Page 2
Dr. Steve Poulin, Principal Data Scientist & Manager of Predictive Analytics
Over 20 years of experience as an SPSS trainer and consultant
Holds a Ph.D. in Social Policy, Planning, and Policy Analysis from Columbia University
IBM Master Instructor with Global Knowledge
Worked with over 250 organizations that have used SPSS
Currently more heavily involved in consulting
Agenda
Intro Concepts
Enabling Apache Spark Applications
Gradient Boosted Trees with MLlib
K-Means with MLlib
Multinomial Naive Bayes with MLlib
Q&A
Follow-Ons & Additional References
Intro Concepts
What is Apache Spark?
Apache Spark1 is an open-source cluster-computing framework whose in-memory processing can run analytic applications up to 100 times faster than comparable disk-based technologies.
Apache Spark works within Hadoop and is an alternative to MapReduce.
Hadoop
Hadoop is a collection of open-source modules that are part of the Apache Project.
o The Apache Project is managed by the volunteer-run Apache Software Foundation.
One of the major components of Hadoop is the Hadoop Distributed File System (HDFS™), which is a distributed file system providing high-throughput access to application data.
MapReduce
MapReduce2 is the processing engine for Apache Hadoop:
o A parallel processing system composed of a map procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue per name) and a reduce procedure that performs a summary operation (such as counting the students in each queue, yielding name frequencies).
It is designed for the analysis of large datasets.
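The map/sort/reduce flow described above can be illustrated with a plain-Python sketch of the student-queue example (the names `map_phase` and `reduce_phase` are hypothetical, chosen only for this example; real MapReduce distributes these steps across a cluster):

```python
from collections import defaultdict

def map_phase(students):
    # Map: emit (first_name, 1) pairs, then sort them into queues by name
    pairs = [(name, 1) for name in students]
    queues = defaultdict(list)
    for name, one in sorted(pairs):
        queues[name].append(one)
    return queues

def reduce_phase(queues):
    # Reduce: summarize each queue by counting its entries (name frequencies)
    return {name: sum(ones) for name, ones in queues.items()}

students = ["Ada", "Bo", "Ada", "Cy", "Bo", "Ada"]
print(reduce_phase(map_phase(students)))  # {'Ada': 3, 'Bo': 2, 'Cy': 1}
```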
MapReduce and Apache Spark
Apache Spark performs in-memory processing, whereas MapReduce moves data in and out of disk.3
As a result, Apache Spark can run programs up to 100x faster than MapReduce in memory, or 10x faster on disk.
Enabling Apache Spark Applications
IBM SPSS Modeler
Apache Spark is well-suited for running complex machine learning techniques using machine learning libraries (MLlib) with large datasets.
Although Apache Spark applications will run with any data source, they will only achieve these efficiencies when connected to the Analytic Server node, which enables IBM SPSS Modeler to use data from a Hadoop environment.
The following applications, which can be accessed from within IBM SPSS Modeler, will be demonstrated during this seminar:
o Gradient Boosted Trees with MLlib
o K-Means with MLlib
o Multinomial Naive Bayes with MLlib
IBM SPSS Analytic Server
IBM SPSS Analytic Server enables IBM SPSS Modeler to use data from Hadoop distributions.
This feature is found as a node in the Sources palette:
Although Apache Spark applications will run with data accessed from many data sources (e.g. SQL databases and text files), they will not achieve their full potential efficiency unless they are connected to a Hadoop data environment through IBM SPSS Analytic Server.4
Enabling IBM SPSS Modeler to Run Apache Spark Applications
Install a copy of Python 2.7 that includes NumPy, a Python component for scientific computing.
o Anaconda is a free package manager that includes Python with the NumPy component.
o The Python 2.7 Anaconda package can be downloaded from Continuum Analytics at: www.continuum.io/downloads
The following line of text must be added to your options.cfg file:
o eas_pyspark_python_path, "[location of the python.exe file in the Python installation with NumPy]"
o For example: eas_pyspark_python_path, "C:/Program Files/Anaconda2/python.exe"
The options.cfg file is located in the config folder of your IBM SPSS Modeler Program Files.
o For example: C:\Program Files\IBM\SPSS\Modeler\18.0\config
Adding Spark Applications through IBM SPSS Modeler Extension Hub
The Extension Hub automatically connects to the IBM SPSS Predictive Analytics Gallery (http://ibmpredictiveanalytics.github.io) and presents the extensions in a dialog box.
IBM SPSS Modeler Extension Hub Dialog Box
Demos on extensions can be obtained at: https://github.com/IBMPredictiveAnalytic
Gradient Boosted Trees with MLlib
Introduction
Like the Random Trees procedure, this procedure generates ensembles of decision trees, but it also trains the trees iteratively in order to minimize a "loss function" (a penalty for mispredictions).5
The algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label.
The dataset is re-labeled to put more emphasis on training instances with poor predictions.
Thus, in the next iteration, the decision tree will help correct for previous mistakes.
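The boosting loop described above can be sketched in plain Python for the squared-error loss, using one-split "stump" trees fit to the current residuals (the helpers `stump` and `boost` are illustrative names, not MLlib APIs; MLlib's implementation is distributed and far more general):

```python
def stump(xs, residuals):
    # Fit a depth-1 regression tree: find the split on x that minimizes
    # the squared error of the two leaf means.
    best = None
    for s in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= s]
        right = [r for x, r in zip(xs, residuals) if x > s]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, s, lm, rm)
    _, s, lm, rm = best
    return lambda x: lm if x <= s else rm

def boost(xs, ys, rounds=20, lr=0.5):
    # Each round fits a stump to the current residuals (the "poorly
    # predicted" part of the data) and adds it to the ensemble, so the
    # next tree helps correct for previous mistakes.
    pred = [0.0] * len(xs)
    trees = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        t = stump(xs, residuals)
        trees.append(t)
        pred = [p + lr * t(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * t(x) for t in trees)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 3.0, 3.0]
f = boost(xs, ys)
print(round(f(1.5), 2), round(f(3.5), 2))  # 1.0 3.0
```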
Loss Functions
Loss Task Description
Log Loss Classification Twice binomial negative log likelihood
Squared Error Regression Also called L2 loss. Default loss for regression tasks
Absolute Error Regression Also called L1 loss. Can be more robust to outliers than Squared Error
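As a rough illustration, the three losses can be computed directly for a single observation (the function names here are hypothetical; Log Loss is written as twice the binomial negative log likelihood, per the table):

```python
import math

def log_loss(y, p):
    # Twice the binomial negative log likelihood for label y in {0, 1}
    # and predicted probability p.
    return -2.0 * (y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error(y, pred):   # L2 loss
    return (y - pred) ** 2

def absolute_error(y, pred):  # L1 loss; grows linearly, so a single
    return abs(y - pred)      # outlier is penalized less than under L2

print(round(log_loss(1, 0.9), 4))  # small penalty for a confident, correct prediction
print(squared_error(3.0, 2.5), absolute_error(3.0, 2.5))  # 0.25 0.5
```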
Gradient Boosted Trees with MLlib Dialog Boxes
Gradient Boosted Trees with MLlib Dialog Boxes
One of the three Loss Functions is selected here
Gradient Boosted Trees with MLlib Output
Confidence scores
Gradient Boosted Trees with MLlib Stream:
LIVE DEMO
K-Means with MLlib
Introduction
The K-Means clustering technique has long been part of IBM SPSS Modeler and IBM SPSS Statistics.
The user specifies the number of clusters (the "K" value) to test.
In the traditional method, K individual records are selected based on their distinctive profiles, although there is some randomness in which records are selected.
The remaining records are assigned to the K clusters based on which of the initial records they are most similar to, as determined by the squared Euclidean distance measure.
Records can be re-assigned to make the clusters more distinctive.
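The traditional procedure described above can be sketched in a few lines of plain Python, assuming random selection of the K initial records and the squared Euclidean distance measure (the `kmeans` helper is illustrative, not the Modeler or MLlib implementation):

```python
import random

def kmeans(points, k, iters=20, seed=42):
    # Pick K initial records, assign every record to its nearest center
    # by squared Euclidean distance, then re-assign until stable.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Move each center to the mean of its cluster (keep it if empty)
        centers = [
            tuple(sum(vals) / len(vals) for vals in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```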
K-Means with MLlib
The K-Means with MLlib procedure uses a machine-learning process to build the clusters.6
The convergence threshold that controls when the cluster centers have stopped moving is labeled Epsilon.
Although the user still provides the K value, the final result may contain fewer than K clusters.
K-Means with MLlib Dialog Boxes
K-Means with MLlib Dialog Boxes
When an additional iteration improves the cluster centers by less than this Epsilon value, the cluster-building process stops. Lowering this value will increase processing time.
K-Means with MLlib Dialog Boxes
This only needs to be increased if there is an indication that the convergence threshold was not met.
K-Means with MLlib Dialog Boxes
This does not need to be changed for more recent versions of Spark.
K-Means with MLlib Dialog Boxes
The Initialization Mode determines how individual records are selected for the training process. The Random option randomly selects these records.
Without the use of a Random Seed, varying distributions of random numbers will be generated, resulting in the selection of different records each time the procedure is run. If this box is checked, the Random Seed value will ensure that the same initial records are selected.
K-Means with MLlib Dialog Boxes
The K-Means || option (also known as K-Means++) in the Initialization Mode section of the dialog box provides an alternative way to select the first records for the cluster-building process.
This option builds clusters more quickly than the use of randomly selected records but may not scale up well for large datasets.
The Initialization Steps setting applies only to this option.
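The K-Means++ seeding idea can be sketched as follows: after a random first center, each subsequent center is drawn with probability proportional to its squared distance from the nearest center already chosen, so the initial records tend to be spread apart (the `kmeans_pp_init` helper is an illustrative single-machine version, not MLlib's distributed K-Means ||):

```python
import random

def kmeans_pp_init(points, k, seed=1):
    # K-Means++ seeding: first center at random, later centers drawn
    # with probability proportional to squared distance from the
    # nearest center chosen so far.
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        r = rng.uniform(0, sum(d2))
        cum = 0.0
        for p, w in zip(points, d2):
            cum += w
            if cum >= r:
                centers.append(p)
                break
    return centers

points = [(0.0, 0.0), (0.1, 0.0), (9.0, 9.0), (9.1, 9.0)]
print(kmeans_pp_init(points, k=2))
```

Because already-chosen centers have zero distance to themselves, they carry zero weight in the draw, which is why far-apart records dominate the selection.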
K-Means with MLlib Output
Cluster membership values
K-Means with MLlib Stream:
LIVE DEMO
Multinomial Naive Bayes with MLlib
Multinomial Naive Bayes with MLlib
Naive Bayes is a classification algorithm with the assumption of independence (hence the term “naïve”) between every pair of predictors (called “features” in this procedure).7
As is the case for all classification procedures, it requires one target field and any number of predictors.
In a single pass over the training data, it computes the conditional probability distribution of each feature given the label, and then applies Bayes' theorem (the probability of an event based on prior knowledge of conditions that might be related to the event) to compute the conditional probability distribution of the label given an observation, which it uses for prediction.
Multinomial Naive Bayes (in contrast to other forms of Bayesian methods) uses fields representing the number of times items, such as words, have been found in a document.
This procedure is often used for document classification.
Multinomial Naive Bayes with MLlib
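A minimal single-machine sketch of a multinomial Naive Bayes classifier on word counts, assuming Laplace smoothing with the default value of 1 (the `train_mnb` and `predict` helpers are illustrative names, not the MLlib API):

```python
import math
from collections import defaultdict

def train_mnb(docs, alpha=1.0):
    # docs: list of (label, word-count dict). alpha is the smoothing
    # parameter (default 1), which keeps unseen words from forcing a
    # conditional probability of zero.
    class_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for label, counts in docs:
        class_counts[label] += 1
        for w, n in counts.items():
            word_counts[label][w] += n
            vocab.add(w)
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        prior = math.log(class_counts[label] / len(docs))
        likelihood = {w: math.log((word_counts[label][w] + alpha) /
                                  (total + alpha * len(vocab)))
                      for w in vocab}
        model[label] = (prior, likelihood)
    return model

def predict(model, counts):
    # Bayes' theorem in log space: prior + sum of per-word log likelihoods
    scores = {label: prior + sum(n * lik[w] for w, n in counts.items() if w in lik)
              for label, (prior, lik) in model.items()}
    return max(scores, key=scores.get)

docs = [("sports", {"ball": 3, "goal": 2}),
        ("politics", {"vote": 3, "law": 2})]
model = train_mnb(docs)
print(predict(model, {"ball": 2, "goal": 1}))  # sports
```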
The Smoothing parameter addresses categories that would otherwise have a conditional probability of zero; it should probably be left at its default value of 1.
Multinomial Naive Bayes with MLlib Dialog Box
Predicted outcomes
Multinomial Naive Bayes with MLlib Output
Multinomial Naive Bayes with MLlib Stream:
LIVE DEMO
Questions?
Steve Poulin
Still have questions? [email protected]
References: Further Reading
1. www.spark.apache.org
2. https://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
3. https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
4. http://www-03.ibm.com/software/products/en/spss-analytic-server
5. http://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts
6. http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
7. http://spark.apache.org/docs/1.5.2/mllib-naive-bayes.html
Next Steps
For a deeper dive into the concepts and tactics presented here, take a look at our available training:
Introduction to IBM SPSS Modeler and Data Mining (v18)
Predictive Modeling for Categorical Targets Using IBM SPSS Modeler (v18)
Advanced Predictive Modeling Using IBM SPSS Modeler (v18)
For more information contact us at: www.globalknowledge.com | 1-800-COURSES