SAS to R Migration Richard Pugh Commercial Director rpugh@mango-solutions.com.

Post on 22-Dec-2015


SAS to R Migration

Richard Pugh

Commercial Director

rpugh@mango-solutions.com

rich@mango-solutions.com

Agenda

• What is SAS?
• Why Migrate from SAS to R?
• Case Study: Major Financial Company
• How to Migrate from SAS to R?
• Questions


> fortune("SUV")

When talking about user friendliness of computer software I like the analogy of cars vs. busses: [...]

Using this analogy programs like SPSS are busses, easy to use for the standard things, but very frustrating if you want to do something that is not already preprogrammed.

R is a 4-wheel drive SUV (though environmentally friendly) with a bike on the back, a kayak on top, good walking and running shoes in the passenger seat, and mountain climbing and spelunking gear in the back. R can take you anywhere you want to go if you take time to learn how to use the equipment, but that is going to take longer than learning where the bus stops are in SPSS. -- Greg Snow R-help (May 2006)


Why Migrate to R?


Why NOT migrate?


Case Study: Major Financial Firm


Background

• SAS User Base
• Mature SAS Processes
• Leverage Oracle Investment

Can Oracle R Enterprise replace SAS?


Initial PoC

• Key SAS process: Credit Risk
• 1,625 lines of SAS Code
   • 79 “data steps”
   • 66 “procedure” calls
   • 29 macros

• Passionate SAS User Community


Initial PoC

Questions by theme

Theme: Capabilities
• Does R/ORE provide all the SAS capabilities required?
• What gaps, if any, exist between R and SAS?
• Where, and why, do results differ between R and SAS?

Theme: Workflow
• How does the “style” of coding differ between R and SAS?
• How easy, or not, was it to implement the existing SAS workflow?

Theme: Skills
• How much learning is required in order to become proficient in R?
• How much learning is required to take on and manage the R implementation of the modelling macros?

Theme: Extensions
• What areas of “value add” arise from using R for these tasks?
• What areas of “value add” outside of the current scope are enabled by using R?


Oracle R Enterprise

• In-database implementation of R
• Very appealing: take R to the database
• Features of ORE
   • ROracle Implementation
   • Transparency Layer
   • Publish functions to database (access via R, SQL)

• Learn more at www.oracle.com/goto/R


How to Migrate from SAS to R?


What to Migrate?

• Doesn’t happen overnight!
• Choose a key first step
   • Functional Area
   • Capability (e.g. graphics, time series)


Step 1: Analyse SAS Code


SAS Code Analysis

%macro Macro1;
   data a; set b; run;
%mend;

[1,500 lines of code]

%macro Macro2;
   data c; set a; run;
%mend;

• Can be complex
• Scoping rules in particular can be a challenge


Use R to Analyse SAS Code


SAS Dependencies with functionMap
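
The analysis code itself is not in the transcript. As a rough sketch of the idea, in base R only (this is not the functionMap API), the snippet below scans SAS source for macro definitions and one-line data steps, then reports which macro reads a data set that another macro writes. The input lines are the schematic example from the earlier slide. The regular expressions and the one-statement-per-line assumption are illustrative simplifications, so real SAS code would need a more careful parser.

```r
# Hypothetical SAS source held as a character vector, one element per line
sasLines <- c(
  "%macro Macro1;",
  "  data a; set b; run;",
  "%mend;",
  "%macro Macro2;",
  "  data c; set a; run;",
  "%mend;"
)

# Locate macro definitions and macro ends
isDef <- grepl("^\\s*%macro\\s+\\w+", sasLines, ignore.case = TRUE, perl = TRUE)
isEnd <- grepl("^\\s*%mend", sasLines, ignore.case = TRUE, perl = TRUE)
macroNames <- sub("^\\s*%macro\\s+(\\w+).*", "\\1", sasLines[isDef],
                  ignore.case = TRUE, perl = TRUE)

# Attribute every line to the enclosing macro (NA when outside any macro).
# Nested macro definitions are not handled in this sketch.
owner <- c(NA, macroNames)[cumsum(isDef) + 1]
owner[cumsum(isDef) == cumsum(isEnd) & !isEnd] <- NA

# Pick out one-line data steps and the data sets they write and read
isStep <- grepl("\\bdata\\s+\\w+\\s*;.*\\bset\\s+\\w+", sasLines, perl = TRUE)
io <- data.frame(
  macro  = owner[isStep],
  writes = sub(".*\\bdata\\s+(\\w+)\\s*;.*", "\\1", sasLines[isStep], perl = TRUE),
  reads  = sub(".*\\bset\\s+(\\w+).*", "\\1", sasLines[isStep], perl = TRUE),
  stringsAsFactors = FALSE
)

# A macro depends on whichever macro writes the data set it reads
producer <- io$macro[match(io$reads, io$writes)]
deps <- data.frame(macro = io$macro, dependsOn = producer, via = io$reads,
                   stringsAsFactors = FALSE)[!is.na(producer), ]
print(deps)
```

Here Macro2 is reported as depending on Macro1 through data set `a`, even though neither macro calls the other. This kind of implicit coupling is what makes SAS scoping hard to follow by eye.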


Step 2: Tame the SAS Code


Tame your SAS Code

• Version Control
• Unit Tests
• Continuous Integration


Step 3: Translate the Code


Translate the Code

• Translate the Unit Tests first
• Then, translate macros one at a time
• Translation of procs can be partially automated, but care must be taken


%macro sampler(DS=);
   data random; set datalib.&DS.;
      xxx=ranuni(54321);
      origorder + 1;
   run;
   proc sort data=random ; by xxx; run;
   data datalib.&DS.;
      set random nobs=numg;
      if _n_ le &DEVPERC.*numg  then Holdout=0;
      else Holdout=1;
   run;
   proc sort data=datalib.&DS.; by origorder; run;
   proc freq data= datalib.&DS.;
      tables Holdout  /missing;
      weight weight;
   run;
%mend sampler;

sampler <- function(ds, DEVPERC = .8, hCol = "HOLDOUT") {
   N <- nrow(ds)
   holdTest <- runif(N) > DEVPERC
   ds[[hCol]] <- as.numeric(holdTest)
   outDf <- aggregate(list(Freq = ds[[hCol]]), ds[hCol], length)
   print(transform(outDf, Percent = round(100 * Freq / N, 2)))
   invisible(ds)
}


17 SAS Lines → 8 R Lines
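
Since the advice is to translate the unit tests first, here is a minimal sketch of what a test for the R `sampler` might check, using base R `stopifnot` rather than any particular testing framework. The test data and the tolerance are hypothetical. Note one difference visible from the two listings: the SAS macro marks exactly the first `DEVPERC` share of a random ordering as `Holdout=0`, while the R version draws each row independently and does not use the `weight` variable, so the holdout share in R is only approximately `1 - DEVPERC`. The test therefore checks the share within a tolerance.

```r
# The translated function from the slide, with plain ASCII quotes
sampler <- function(ds, DEVPERC = .8, hCol = "HOLDOUT") {
   N <- nrow(ds)
   holdTest <- runif(N) > DEVPERC
   ds[[hCol]] <- as.numeric(holdTest)
   outDf <- aggregate(list(Freq = ds[[hCol]]), ds[hCol], length)
   print(transform(outDf, Percent = round(100 * Freq / N, 2)))
   invisible(ds)
}

# Hypothetical test data (not from the case study)
set.seed(54321)
testData <- data.frame(id = 1:10000, weight = 1)

# Run the function, capturing the printed frequency table
printed <- capture.output(result <- sampler(testData, DEVPERC = 0.8))

stopifnot(
   identical(result$id, testData$id),       # original row order is preserved
   all(result$HOLDOUT %in% c(0, 1)),        # flag is binary
   abs(mean(result$HOLDOUT) - 0.2) < 0.05   # roughly 20% held out
)
```

A test written this way against the SAS output first, and then against the R translation, makes differences such as the exact-versus-approximate split explicit instead of leaving them to surface later.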


Step 4: Use Oracle R Enterprise


Oracle R Enterprise

• Remove code to import/export from database
• Replace with links to the database
• Look for other opportunities (e.g. using in-database GLM vs standard)


Oracle R Enterprise 

library(ORE)      # Load the library
ore.connect(…)    # Make the connection

…

ore.create(newData, table = "X")   # Create new db table
X[1:5, ]                           # Simple Command

# Define function to run
theFun <- function(x, F, ...) step(ore.glm(F, data = x,
                                           family = "binomial"),
                                   direction = "both")

# Run the model ("DV ~ ." regresses DV on all other columns)
stepOut <- ore.tableApply(X, theFun, F = as.formula("DV ~ ."))

…

ore.disconnect()


Review

[Diagram. Labels: Analysis Code and Unit Tests (each shown twice), Oracle R Enterprise, SQL Interface]


Findings

• A formal migration process allows for a clear and accurate transition

• SAS code conversion to R at a rate of ~200 lines per day

• Code base reduces by ~55%


Challenges

• SAS scoping is more relaxed than R's
• Differences in statistical algorithms
• The danger of migrating poor code flows
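
One commonly cited illustration of "differences in statistical algorithms" (not taken from the slides) is the default percentile definition. R's `quantile()` defaults to type 7, while type 2 is the definition usually described as matching the SAS default (`PCTLDEF=5`). The sketch below uses made-up data and an assumed reference value to show how a migrated test might pin the definition down rather than rely on defaults.

```r
# Hypothetical data, chosen so that the two definitions disagree
x <- c(1, 2, 3, 4)

# R's default quantile definition (type 7): linear interpolation
qDefaultR <- quantile(x, probs = 0.25, names = FALSE)

# Type 2: empirical distribution function with averaging at discontinuities,
# usually described as matching the SAS default (PCTLDEF=5)
qSasLike <- quantile(x, probs = 0.25, type = 2, names = FALSE)

print(c(R_default = qDefaultR, SAS_like = qSasLike))

# A migrated unit test can compare against a SAS reference value with a tolerance
sasReference <- 1.5   # assumed value for illustration, not real SAS output
matchesSas <- isTRUE(all.equal(qSasLike, sasReference, tolerance = 1e-8))
print(matchesSas)
```

The two calls return 1.75 and 1.5 for the same input, which is the kind of small discrepancy that otherwise shows up as an unexplained mismatch when results are reconciled.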


Code Migration isn’t just technical …


SAS Migration is more about people …


“Why are these business users so defensive? It’s just a computer language!!”

“Taking away SAS means taking away my ability to do analysis!!”


Convincing People to move to R

• Concede some ground …
• Show quick wins
• Teach the basic data structures early
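
As a small illustration of the last point, the sketch below shows one way the first data-structure lesson might look to a SAS user: a data frame standing in for a SAS data set, and a simple data step expressed as whole-column operations. The data and names are hypothetical.

```r
# A data frame plays the role of a SAS data set: named columns of equal length
a <- data.frame(id = 1:4, x = c(0.5, 1.5, 2.5, 3.5))

# Rough analogue of the SAS step:  data c; set a; if x > 1; y = x * 2; run;
c_ds <- transform(subset(a, x > 1), y = x * 2)

# Each column is a vector, so operations apply to the whole column at once
# rather than through an implicit row-by-row loop
str(c_ds)
```

Starting from a mapping like this lets SAS users reuse what they already know about data sets before meeting vectors and lists on their own terms.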


SAS to R Brain Dump …

Questions?