Post on 23-Dec-2016
Introduction to R in IBM SPSS Modeler
A guide for SPSS Users
Wannes Rosius, Belgium, IBM
Goal of this guide
Although there are several very good articles and blogs related to IBM SPSS Modeler, in my role as technical professional for IBM Analytical solutions we still see lots of people struggling with both R and the integration between IBM SPSS Modeler and R.
The idea of this document is certainly not to replace the very useful links listed below, but to enhance them so that people who know IBM SPSS Modeler and have only a very limited knowledge of R can use this integration.
Going through sections 2, 3 and 4, the reader should be able to understand the R integration within SPSS at a high level and to (re)create some very basic R models within SPSS, even with only a basic knowledge of R.
In section 5 you will learn more detailed tips, tricks and other things. This part is for the experienced user and can be read as a loose list of items which might help you get up to speed with some more detailed functionalities of the integration and understand some pitfalls.
At every point in the document we try to include R examples that can easily be copied into the appropriate R node in IBM SPSS Modeler. Unless specified otherwise, these code snippets are always based on the telco.sav dataset, which can be found in the demo folder of your SPSS Modeler installation. After the source node, attach a type node and thereafter the appropriate R node. Sometimes, however, there are just fragments of code to show you the idea; it will be clearly mentioned when the code is incomplete. You will find this code in several code frames throughout this document. Furthermore, all the SPSS streams and assets are embedded in the PDF. You can access them by right-clicking within the PDF document.
Some useful links
• Essentials for R - Installation Instructions
• User Guide IBM SPSS Modeler 18 R Nodes
• Modeler Essentials for R Downloads
• SPSS Modeler and R integration - Getting started
IBM SPSS Modeler and R
Contents
1 System Setup
    1.1 Installing R
    1.2 Enabling the R nodes
2 R basics
3 The basics of R nodes in IBM SPSS Modeler
    3.1 The R nodes
    3.2 Simple R code example
        3.2.1 modelerData
        3.2.2 modelerDataModel
        3.2.3 modelerModel
    3.3 Some general remarks
    3.4 Read data options
4 Custom Dialog builder
    4.1 Tools
    4.2 Custom dialog
    4.3 Simple example
5 Tips & tricks: Some more detailed
    5.1 R code
        5.1.1 ibmspsscf70 library
        5.1.2 Some useful parts of R code
    5.2 Custom Dialog builder
        5.2.1 How to save and share a custom dialog
        5.2.2 Link to dialog and script
    5.3 What about SQL Pushback / Hadoop pushback
    5.4 What about real-time scoring and Solution Publisher
    5.5 Something more about the metadata in modeler and the consequences on R integration
1 System Setup
Let us start with the setup of your system. For now we assume that you have a valid installation of IBM SPSS Modeler on your machine. For more installation topics we refer to the Installation Instructions.
1.1 Installing R
Depending on the version of your IBM SPSS Modeler, you will have to install a different version of R:

SPSS Version   R version   R download link
16.0           2.15.2      https://cran.r-project.org/bin/windows/base/old/2.15.2
17.0           3.1         https://cran.r-project.org/bin/windows/base/old/3.1.0
17.1           3.1         https://cran.r-project.org/bin/windows/base/old/3.1.0
18.0           3.2         https://cran.r-project.org/bin/windows/base/old/3.2.0
Once you have downloaded and installed it, you will have a working R instance on your machine/server. As with SPSS Modeler, you can have several versions of R installed on your machine without any problem.
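To check which R version a given installation actually provides, you can ask R itself. The following is only a minimal sketch: the variable names `required` and `series` are made up for this illustration, and the value 3.2 assumes Modeler 18.0 as listed in the table above.

```r
# Version series expected by the Modeler release in use
# (3.2 for Modeler 18.0 according to the table; adjust for other releases).
required <- "3.2"

# R.version holds the major part (e.g. "3") and the minor part (e.g. "2.0")
# of the running R installation as character strings.
series <- paste(R.version$major, sub("\\..*$", "", R.version$minor), sep = ".")

if (series == required) {
  message("R ", series, " matches the version expected by this Modeler release.")
} else {
  message("Running R ", series, ", but this Modeler release expects R ", required, ".")
}
```

Run this in the R console of the installation whose path you plan to give to the Essentials installer.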
1.2 Enabling the R nodes
You will need to install the IBM SPSS Modeler Essentials for R. You can find these on the SPSS Community Downloads page. Click "2 Get Essentials for SPSS" and then click the button "Get R Essentials for SPSS Modeler". This will take you to GitHub, where you can select and download the Modeler 18 Essentials for R for a variety of platforms. If you require Essentials for R for earlier Modeler versions, there is also a link to legacy versions.
Run this executable file. The installation will ask you for the path of your R installation and the path to the bin files of your SPSS Modeler installation. (Note that the prefilled path is the default path to a Modeler Server; you will need to change this if you want to configure your client.) This installation will place the R nodes in your SPSS Modeler node palette and will also add the necessary R libraries to your R installation folder.
2 R basics
There is already an overflow of R courses (publicly) available through several channels, so we certainly do not want to replace these. It is also not very important that you are an R expert to follow this document. However, there are still some basics of R code and R terminology that users need to understand in order to exploit the integration of R and IBM SPSS Modeler. For this section, let us open R in its original GUI: go to the R installation folder and open bin\x64\Rgui.exe. This opens the R GUI window.
This is the R console, ready for commands to run. You might often hear the term RStudio, which is nothing more than a development environment on top of this R GUI. Installation of RStudio is not required for this introduction but might be handy for further use.
We will start the R introduction by stating: "R is a powerful programming language and environment for statistical computing and graphics". An important part of that phrase is that R is a programming language, unlike IBM SPSS Modeler. That means it is built on objects that are defined by the user. As an example, consider the following R code (feel free to type it into the R console to see the R output):
    x <- 1 + 1
    y <- 2 * x
    xyVector <- c(x, y)
    z <- mean(xyVector)
    print(z)
Here x is an object. The first statement fills the object x with the value of the evaluated formula 1 + 1, being 2. So whenever the program refers to x, it will be interpreted as 2. In the second line we define y as twice the value of x. In the third line we create a vector containing the content of x and y, so that the fourth line can calculate the mean of these 2 objects and place it in an object z.
The operator <- could also be replaced by =, but for various reasons lots of R users prefer this way of writing (actually the two are not exactly the same, but that can be ignored for the purpose of this document). If you feel more comfortable using =, please do so.
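For readers curious about where the two operators differ, here is a small sketch. It is purely illustrative and uses made-up object names (`a`, `b`, `v`).

```r
# At the top level both forms assign a value to an object.
a <- 5
b = 5
identical(a, b)        # TRUE

# Inside a function call they behave differently:
# '=' only names the argument x of mean() and creates no new object ...
mean(x = c(1, 2, 3))   # 2

# ... while '<-' first creates the object v and then passes it to mean().
mean(v <- c(1, 2, 3))  # 2
v                      # the vector 1 2 3 now exists in the workspace
```

For the snippets in this document the difference never matters, which is why either operator can be used.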
Just as we filled x, y and z with some numbers, any R object can be filled with a variety of types. Here is a list of the most important ones for our purposes:
Vector: a sequence of data elements of the same type (e.g. numeric or character). This includes vectors of length 1, which can be interpreted as just being numbers. You can create a vector with the R function c(). So in the example code above, the values of x, y and z are all vectors of length one; xyVector is a vector of length 2 containing the values of (the vector) x followed by (the vector) y. Linking it back to SPSS, you can interpret a vector as the values of a single data column.
Data frame: a list of vectors of equal length. If you look at a vector as the values of a variable, a data frame can be interpreted as a 2-dimensional dataset with columns (the number of vectors) and lines (the size of each vector).
    n <- c(2, 3, 5, 3, 9)                      # A first vector of 5 numeric values
    n2 <- c(1, 3, 2, 5, 4)                     # A second vector of 5 numeric values
    s <- c("aa", "bb", "cc", "aa", "zz")       # A third vector of 5 string values
    b <- c(TRUE, FALSE, TRUE, TRUE, TRUE)      # A fourth vector of 5 flag values
    Data <- data.frame(n, s, b, New = n + n2)  # A data frame containing 4 vectors
    # Note: n + n2 will be a new vector called New with the sum n + n2: c(3, 6, 7, 8, 13)

    dim(Data)       # Will show you it is a 5x4 dataset
    Data[2, 4]      # Will give back the value on the 2nd line, the 4th column
    colnames(Data)  # Will give the column names as a vector (n, s, b, New)
    Data$n[1]       # Will give back the first value of the vector n within the data frame

    iris            # predefined data frame
There are also several predefined data frames installed within R. One of them is called iris. Sometimes this document will refer back to iris.
Model class: a specific list containing predefined objects that define a statistical model. For example, a linear model class will be a list containing, among others, the coefficients of the regression model.
List: an ordered collection of objects. As an example, you can have a list where the first element is a vector, the second is a data frame and the third is a model. Note that a data frame is a special type of list in which all the elements are vectors of equal size.
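The last two object types can be explored with the predefined iris data frame. The sketch below is only an illustration; the choice of model formula and the object names `fit` and `mixed` are arbitrary.

```r
# Fit a simple linear regression on the predefined iris data frame.
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)

class(fit)        # the model class: "lm"
is.list(fit)      # TRUE: a model object is a specific kind of list
names(fit)        # the predefined elements, among them "coefficients"
fit$coefficients  # the coefficients of the regression model

# A list may mix object types: a vector, a data frame and a model.
mixed <- list(numbers = c(1, 2, 3), data = head(iris), model = fit)
class(mixed$data) # "data.frame"

# A data frame is itself a special kind of list.
is.list(iris)     # TRUE
```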
3 The basics of R nodes in IBM SPSS Modeler
3.1 The R nodes
Once the installation of the R Essentials is done, you will see 3 new nodes in your node palettes. There is also a 4th R node, which is the R nugget. Understanding the differences between these 4 objects is essential.
Output: with this node, data will be sent to R but will never go back to SPSS (as it is a terminal node). The only thing that can go back to SPSS is the output generated by R, which will be presented within an SPSS output window.
Transform: data will go from SPSS to R but will also go back to SPSS, after which the SPSS process can be continued.
Model: like the output node, this is a terminal node, so data will not go back to SPSS. However, a reusable R object will be created within a nugget.
Nugget: similar to the transform node, with the difference that there is a reusable R object that can be used in the R code.
Name                R output node   R transform node   R model node   R syntax node
Palette tab         Output          Record Ops         Model          N/A
Data back to SPSS   No              Yes                No             Yes
Reusable R object   No              No                 Create         Use
3.2 Simple R code example
Let us start by saying that all the examples in this section are intentionally kept very simple, so as to explain the interaction in a functional and structured way and to be simple enough for non-R programmers. We are certainly aware that most of the R code snippets we show in this chapter could also easily be implemented using standard SPSS Modeler functionality.
There are 3 very important, reserved R objects that you should keep in mind when you use the SPSS Modeler R integration. Here is a brief description of these 3, after which we will go into more detail for each of them.
modelerData: an R data frame that is filled with the data entering the R node. This data frame can be used and changed within your R code. Eventually it will also be the data frame that is sent back to SPSS Modeler as a dataset. Note that it will only carry the content of the data, not (necessarily) the data column names and other metadata items.
modelerDataModel: also an R data frame, containing the metadata of the data that is sent to R and back to SPSS Modeler. It contains most of the information that you would expect within an SPSS Modeler type node. This will be the object that feels most strange to experienced R users.
modelerModel: an R object that the user can fill with any type of object. It does not need to have a certain structure. It is calculated in the R model node, after which it is saved within the R nugget, where it can be used in the R syntax.
Note that R code is case-sensitive, and therefore so are these object names. In the following sections we will explain the usage of these objects.
3.2.1 modelerData
modelerData is the R data frame that is filled with the dataset coming from the SPSS Modeler node upstream of the R node. You can use this data frame to perform the desired calculations, transformations and outputs in R.
Place the following code in an R output node:
    # Print the first 6 lines of the data
    head(modelerData)

    # Give a summary of the data
    summary(modelerData)

    # create a histogram of the variable tenure (in months)
    hist(modelerData$tenure, xlab = "months", main = "Tenure histogram")

    # change the tenure unit from months to years
    modelerData$tenure <- modelerData$tenure / 12

    # recreate the histogram, now in years
    hist(modelerData$tenure, xlab = "years", main = "Tenure histogram")
Execution of this node will result in an SPSS Modeler output window in which all the R outputs are assembled. These are always divided over 2 tabs: Text output and Graph output.
In this case the text output is linked to the code on lines 2 and 5: first it prints the first 6 lines (head) of the data, next it gives summary statistics for each column.
The graph output consists of two histograms: one for the tenure in months, the other for the same column after redefining it by dividing the original value by twelve to give the tenure in years (note the X-axis scale).
As shown in the example stream Explain modelerData.str, you can also copy exactly the same code into a transform node and attach a table node to it. After running this table node you will not see any R output (as none is expected): even though the output code has run, no outputs are shown. However, the modelerData data frame is sent back to SPSS Modeler. In this case you will see the value of tenure divided by 12.
3.2.2 modelerDataModel
Metadata is very important in SPSS Modeler. Let us for simplicity say that within Modeler, metadata is represented by the type node. By metadata we mean the type of each of the variables in the dataset (numeric, flag, string, storage, ...). At all times Modeler knows exactly all the metadata at every step in the stream.
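For comparison, plain R keeps only a storage class per column and has no notion of a measurement level or role. The sketch below shows this on a small made-up data frame; the name `Data` and its values are purely illustrative.

```r
# A small data frame with a numeric, a string and a flag column.
Data <- data.frame(
  n = c(2, 3, 5),
  s = c("aa", "bb", "cc"),
  b = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# The only type information R stores per column is its class.
sapply(Data, class)
# n: "numeric", s: "character", b: "logical"
```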
R does not handle metadata in the same way as SPSS Modeler. As explained earlier, modelerDataModel takes over the role of the type node. This is done by a data frame (= dataset) of the following structure:

               X1                     X2                    ...   Xn
fieldName      region                 tenure                ...   age
fieldLabel     Geographic indicator   Months with service   ...   Age in years
fieldStorage   real                   real                  ...   real
fieldMeasure   nominal                continuous            ...   continuous
fieldFormat    standard               standard              ...   standard
fieldRole      input                  input                 ...   input
So this dataset will always have 6 lines with fixed names (yes, in R the lines also have names). The important thing about this dataset is that it is completely the responsibility of the user to align this metadata with the appropriate data. That means that if we add a variable with R, we must also manually add a column to modelerDataModel to make sure modelerData correctly goes back to SPSS Modeler. In the earlier example we did not make any changes to modelerDataModel, and none were needed, as the metadata did not change (dividing a number by 12 does not change the metadata). Now let us continue with the previous example, but rather than changing the value of tenure in the same variable, we will create another one. As a result we have to update the metadata:
    # Create the vector of tenure in years
    Rcolumn <- modelerData$tenure / 12

    # Paste this vector to the right of the dataset
    modelerData <- cbind(modelerData, Rcolumn)

    # create the metadata for the column to add
    newVar <- c(fieldName = "tenureYears", fieldLabel = "", fieldStorage = "real",
                fieldMeasure = "", fieldFormat = "", fieldRole = "")

    # paste the new column metadata to the existing metadata
    modelerDataModel <- cbind(modelerDataModel, newVar)
Running a table node downstream of this transform node will show you the new variable with the name tenureYears. There are some important things to realize here:
• fieldName and fieldStorage are the only 2 required rows that need to be filled in for any new column. In the code we left all the other lines empty, meaning they will be filled in by the stream default. For a list of available values we refer to the user guide.
• As modelerDataModel is only useful when you go back to SPSS Modeler, you will generally only use/change this object in non-terminal R nodes. It might still be handy to use it in terminal nodes if the value of modelerDataModel is important for your output (e.g. to run a histogram of all continuous variables).
• When data goes back to SPSS Modeler, only the content of the data frame is sent, excluding the column and row names. That means that even though the column in modelerData is called Rcolumn, the name of the column in SPSS is defined only by the metadata within the row fieldName. In this case it is called tenureYears.
• The only link between modelerData and modelerDataModel is the order of the columns. Columns are not matched by name: the first column in the data is given the metadata of the first column of modelerDataModel. In case the metadata (modelerDataModel) does not match modelerData, an error is thrown. The table below shows schematically how this works.
modelerData (in R):

    RName1    ...   RNamen
    x11       ...   x1n
    x21       ...   x2n
    ...
    x(m-1)1   ...   x(m-1)n
    xm1       ...   xmn

modelerDataModel (in R):

                   X1      ...   Xn
    fieldName      Name1   ...   Namen
    fieldLabel
    fieldStorage   xxx     ...   xxx
    fieldMeasure
    fieldFormat
    fieldRole

Resulting dataset in SPSS:

    Name1   ...   Namen
    x11     ...   x1n
    x21     ...   x2n
    ...
    xm1     ...   xmn

Note that only the names of the modelerDataModel are used.
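The positional matching can be tried out in plain R, outside Modeler. The sketch below is only a simulation under stated assumptions: the two reserved objects are replaced by small hand-made stand-ins, and the last step (the hypothetical object `backToModeler`) merely imitates the behaviour described above; it is not the code Modeler itself runs.

```r
# Stand-in for the data frame that Modeler would hand to the R node.
modelerData <- data.frame(tenure = c(12, 24, 60))

# Stand-in for the metadata: one column per field, six named rows.
rows <- c("fieldName", "fieldLabel", "fieldStorage",
          "fieldMeasure", "fieldFormat", "fieldRole")
modelerDataModel <- data.frame(
  X1 = c("tenure", "Months with service", "real",
         "continuous", "standard", "input"),
  row.names = rows,
  stringsAsFactors = FALSE
)

# Add a data column and, in the same position, a metadata column.
Rcolumn <- modelerData$tenure / 12
modelerData <- cbind(modelerData, Rcolumn)
newVar <- c(fieldName = "tenureYears", fieldLabel = "", fieldStorage = "real",
            fieldMeasure = "", fieldFormat = "", fieldRole = "")
modelerDataModel <- cbind(modelerDataModel, newVar, stringsAsFactors = FALSE)

# Imitate the hand-back: the column counts must agree, and the names come
# from the fieldName row by position, not from the R column names.
stopifnot(ncol(modelerData) == ncol(modelerDataModel))
backToModeler <- modelerData
names(backToModeler) <- unname(
  vapply(modelerDataModel["fieldName", ], as.character, character(1))
)

names(modelerData)    # "tenure" "Rcolumn"
names(backToModeler)  # "tenure" "tenureYears"
```

Checking that both objects have the same number of columns before a transform node finishes is a cheap way to catch a forgotten metadata update.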
Since this concept is very strange to standard R users, we found this part the most difficult to explain. To people who know SPSS, you can summarize it as modelerDataModel taking over the role of the type node.
You can find all streams and R scripts explaining modelerDataModel here.
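As mentioned in the remarks above, the metadata can also be read in a terminal node, for example to draw a histogram of every continuous variable. The following is a minimal sketch: the stand-in objects built from iris are assumptions that only exist so the code also runs outside Modeler, and `measures` and `continuous` are made-up names.

```r
# Stand-ins so the sketch also runs outside Modeler; inside an R node
# these two objects already exist and the stand-ins are skipped.
if (!exists("modelerData")) {
  modelerData <- iris
}
if (!exists("modelerDataModel")) {
  modelerDataModel <- data.frame(
    rbind(
      fieldName    = names(iris),
      fieldLabel   = "",
      fieldStorage = c("real", "real", "real", "real", "string"),
      fieldMeasure = c(rep("continuous", 4), "nominal"),
      fieldFormat  = "standard",
      fieldRole    = "input"
    ),
    stringsAsFactors = FALSE
  )
}

# Read the measurement level of every field from the metadata.
measures <- vapply(modelerDataModel["fieldMeasure", ], as.character, character(1))
continuous <- which(measures == "continuous")

# Columns are linked by position, so the same index is used for the data.
for (i in continuous) {
  fieldName <- as.character(modelerDataModel["fieldName", i])
  hist(modelerData[[i]], main = paste("Histogram of", fieldName), xlab = fieldName)
}
```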
3.2.3 modelerModel
modelerModel is the R object that is stored within the R nugget. This object is populated within the R model node, after which you can use modelerModel within the R nugget for scoring. This works very much the same way as IBM SPSS Modeler itself: you ask a model node to calculate a formula, after which that formula is stored within the nugget together with the way it should be used to calculate a score.
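The build-then-score idea can be sketched in plain R as follows. This is only an illustration under assumptions: iris stands in for the data entering the node, the linear model and its field names are arbitrary, and both steps are shown in one script although in Modeler they belong to the model node and the nugget respectively.

```r
# Stand-in for the data entering the node; inside Modeler this object is
# provided for you and the field names will differ.
modelerData <- iris

# Build step: fit a model and store it in the reserved object.
modelerModel <- lm(Sepal.Length ~ Petal.Length, data = modelerData)

# Scoring step: reuse the stored model on the incoming data and append
# the result as a new column.
prediction <- predict(modelerModel, newdata = modelerData)
modelerData <- cbind(modelerData, prediction)

head(modelerData)
# In Modeler, a matching column would also have to be added to
# modelerDataModel, as described in section 3.2.2.
```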
You will only use this object within the R model node and nugget. Note that within the R model node there are 2 syntax windows