SAS to R Best Practices - Convert to R...
1 of 36 Confidential and Proprietary © 2012 Boston Decision, LLC
SAS to R Best Practices in SAS to R Conversion
2 of 36 www.bostondecision.com [email protected]
1 Broadway, 14th Fl Cambridge, MA 02142 Phone 617 500 0093
© 2012 Boston Decision, LLC
About Rconvert.com
Division of Boston Decision, LLC
Founded 2010 - Cambridge, MA
Finance, Marketing, Technology
Located in the Cambridge Innovation Center
www.BostonDecision.com
SAS & R Compared
SAS – Circa 1966
• 4th Generation Language for Data Analysis
• Mostly written in the C language
• Proprietary
R – Circa 1993
• Origins in S language, circa 1976
• 4th Generation Language for Data Analysis
• Mostly written in the C language
• Open-source
SAS Components
• Data Step
– Functions
• Procedure Step
• Macro Language
• SAS ODS
• Component Product Languages
– SAS/IML
R Components
• Functions
A Paradigm Shift
• In R, all work is performed by functions
– Data steps = expressions with functions
– Procedures = expressions with functions
– Macros = expressions with functions
– SAS functions = R functions
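To make the "macros = expressions with functions" point concrete, here is a minimal sketch of how a simple SAS wrapper macro might be expressed as an R function. The function name `summarize_var`, its arguments, and the sample data are hypothetical and chosen only for illustration.

```r
# Hypothetical helper standing in for a SAS wrapper macro that takes a
# data set name and a variable name; all names here are illustrative.
summarize_var <- function(data, var) {
  x <- data[[var]]
  c(n = sum(!is.na(x)), mean = mean(x, na.rm = TRUE))
}

# Made-up data with one missing value.
scores <- data.frame(score = c(10, 20, NA, 30))

summarize_var(scores, "score")
```

The arguments play the role of macro parameters, and the return value is an ordinary R object that later expressions can use directly.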
SAS Data Step – Implied Loop
• Data steps leverage an implied loop
– Reads one row (obs), passes it through the code line by line, hits RUN, then starts the next row.
• R does not make use of an implicit loop
– R applies functions at the “column-level”
SAS Example
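[Figure: data set a with columns A, B and rows (1, 2), (4, 5), (7, 8) is processed one row at a time. After row 1 the output holds (A, B, C) = (1, 2, 3); after row 2 it also holds (4, 5, 9).]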
R Example
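[Figure: data set a with columns A, B and rows (1, 2), (4, 5), (7, 8). Column C is computed in one step as the column sum (1, 4, 7) + (2, 5, 8), giving rows (1, 2, 3), (4, 5, 9), (7, 8, 15).]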
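The column-level addition in the R example can be sketched as follows. The data values come from the slide; the object name `a` and the use of a data frame are assumptions about how the example was coded.

```r
# Data frame 'a' mirrors the slide's data set: columns A and B, three rows.
a <- data.frame(A = c(1, 4, 7), B = c(2, 5, 8))

# One vectorized expression computes C for every row at once;
# no row-by-row loop is written.
a$C <- a$A + a$B

print(a)
```

The equivalent SAS data step would state `C = A + B;` once and rely on the implied loop to run it for each observation.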
Data Storage
• SAS data sets
– Arrays = special group of columns in a SAS data set
• In R, more variety.
– Data frame (similar to SAS data set)
– Vectors – single “column” of values (all same type)
– Matrices
– Lists – collection of objects
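A short sketch of the four R structures listed above, using made-up values. The object names are illustrative only.

```r
# Vector: a single "column" of values, all of one type.
ages <- c(34, 51, 28)

# Matrix: a two-dimensional block of values, all of one type.
m <- matrix(1:6, nrow = 2, ncol = 3)

# List: a collection of arbitrary objects, possibly of different types.
info <- list(name = "study1", ages = ages, grid = m)

# Data frame: columns of equal length that may differ in type,
# the closest analog to a SAS data set.
people <- data.frame(id = 1:3, age = ages, stringsAsFactors = FALSE)

str(people)
```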
Object Orientation (OOP)
• Both SAS and R can be used in OOP fashion
– In practice, we don’t see this much with SAS
• In R, everything is an object
– Variables are objects
– Functions are objects
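As a small illustration of "functions are objects", the sketch below stores a function in a variable and passes it to another function. The names `square` and `apply_twice` are hypothetical.

```r
# A function is an ordinary object: it can be stored in a variable...
square <- function(x) x^2

# ...inspected like any other object...
class(square)

# ...and passed as an argument to another function.
apply_twice <- function(f, x) f(f(x))
apply_twice(square, 3)
```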
Native Memory Usage
• SAS
– Hard drive (I/O) Intensive
• R
– RAM Intensive
• RevoScaleR
– External memory algorithms with data on disk
Extensibility
• SAS
– New procedures added with additional SAS purchases.
• R
– New functions added by installing and loading packages from CRAN.
Conversion Process
Quick Start
• Developing tools and libraries to expedite conversion.
Hybrid Agile Conversion
• Implement in R
• Test & Feedback
• Document
• Conversion Design
• SAS Code Review
• Master Requirements
• Iteration Plan
Impact
• Faster results.
• Surfaces challenges quickly.
• TRANSPARENCY!
Order of Review
• SAS Macro Language
• Procedures
• Data Steps
Re-Factorization Strategy
• SAS and R are not just different languages.
– They are different frameworks.
• Planning must ensure appropriate vectorization to conform to the R framework.
SAS Macro Assessment
• Assess business usage
– Level 1: Wrapper Macros
– Level 2: Macro Variables
– Level 3: Code Generation
Areas of Caution - Macros
• Macros to create or name SAS data sets
• Loops and iterations
• Specific syntax
– Call Symput
– %GOTO
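Two of the cautions above have fairly direct R counterparts. The sketch below is one possible approach, not a prescribed conversion; the data and object names are made up for illustration.

```r
# SAS pattern: CALL SYMPUT stores a value computed from data in a macro
# variable. In R the value is simply assigned to an ordinary variable.
sales <- data.frame(region = c("east", "west", "east"),
                    amount = c(100, 250, 50))
max_amount <- max(sales$amount)

# SAS pattern: a macro loop that creates data sets named by region.
# In R, a named list of data frames avoids generating object names.
by_region <- split(sales, sales$region)

names(by_region)
nrow(by_region[["east"]])
```

Keeping the pieces in one list means later steps can iterate over them with `lapply` instead of reconstructing data set names.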
SAS Procedure Conversion
• SAS procedures map to R functions
– Most common Base SAS, SAS/STAT, and SAS/GRAPH procedures can be found in Base R.
• Most common SAS procedures have approximate 1:1 analogs in R
– E.g. univariate, means, freq, rank, corr, import
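A rough sketch of how some of the procedures named above might map to functions that ship with R. The pairings are approximate and the data are made up for illustration.

```r
# Small illustrative data set (values are made up for the sketch).
d <- data.frame(group = c("a", "b", "a", "b", "a"),
                x = c(1, 2, 3, 4, 5),
                y = c(2, 4, 6, 8, 10))

summary(d$x)                          # roughly PROC UNIVARIATE / PROC MEANS
aggregate(x ~ group, data = d, mean)  # PROC MEANS with a CLASS variable
table(d$group)                        # PROC FREQ
rank(d$x)                             # PROC RANK
cor(d$x, d$y)                         # PROC CORR
```

Default output differs between the two systems, so results should be compared statistic by statistic rather than by eye.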
Areas of Caution - PROCS
• Some advanced SAS procedures may require a literature review of how they are implemented.
• R implements new capabilities faster – consequence of open source.
Data Step Conversion
• Many SAS functions have similar R syntax.
– E.g. String and data manipulation
• Consider each tool’s preferred analysis target
– SAS = data sets = row
– R = vector = column
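To illustrate the string-function similarity and the column orientation together, the sketch below applies a few base R string functions to a whole vector at once. The SAS functions named in the comments are approximate counterparts, and the sample values are made up.

```r
name <- c("  smith ", "jones")

trimmed <- trimws(name)          # similar in spirit to SAS STRIP
upper   <- toupper(trimmed)      # SAS UPCASE
first3  <- substr(upper, 1, 3)   # SAS SUBSTR(x, 1, 3)

# Concatenate pieces; nchar is comparable to SAS LENGTH.
label <- paste0(first3, "_", nchar(trimmed))

label
```

Each call operates on every element of the vector, so no loop over rows is needed.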
Areas of Caution – Data Steps
• Missing Data Handling
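– SAS uses “.” for missing numeric values and a blank (“ ”) for missing strings.
– SAS also has 27 special missing values (._ and .A through .Z).
– R uses NA for both numeric and string data; there is no separate marker per type.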
• Date & Time
– SAS date – number of days since 1/1/1960
– R date – number of days since 1/1/1970
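Both cautions can be seen in a few lines of R. This is a minimal sketch with made-up values; `sas_days` stands in for raw numeric SAS date values.

```r
# Missing values: NA propagates through arithmetic and comparisons.
x <- c(5, NA, -2)
x < 0                   # the NA stays NA; in SAS a missing numeric compares
                        # as smaller than any number, so the test would be true
is.na(x)                # explicit missing check
mean(x, na.rm = TRUE)   # drop missing values explicitly

# Dates: a SAS date value counts days from 1960-01-01, so supply that
# origin when converting the raw number.
sas_days <- c(0, 3653, 19000)   # hypothetical SAS date values
r_dates  <- as.Date(sas_days, origin = "1960-01-01")

r_dates
as.numeric(r_dates)             # R's own count, days from 1970-01-01
```

The two epochs are 3653 days apart, so a SAS date read as a plain number without the origin would land ten years late in R.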
Best Practices
• Best practice in SAS may be poor practice in R due to the paradigm shift.
• E.g. a SAS loop should not convert directly to an R loop
– Loops in SAS should generally be reconstructed as apply-family calls in R.
• SAS is procedural, R is functional.
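The loop-to-apply advice can be sketched as follows. Both versions compute the same totals; the data and names are made up for illustration.

```r
values <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# Loop-style translation: works, but mirrors the SAS structure.
totals_loop <- numeric(length(values))
for (i in seq_along(values)) {
  totals_loop[i] <- sum(values[[i]])
}

# Apply-style: the function is applied to every element in one call.
# vapply also checks that each result is a single numeric value.
totals_apply <- vapply(values, sum, numeric(1))

totals_apply
```

The apply version states what is computed for each element and leaves the iteration to R, which is closer to the functional style the slide describes.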
Conversion Samples
Reading a CSV File
• SAS
• R
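The slide's own code is not preserved in this transcript. As a minimal sketch of the R side, assuming a small comma-separated file with a header row (the file contents and column names below are made up):

```r
# Write a small temporary CSV so the sketch is self-contained;
# in practice the path would point at an existing file.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,ann,90.5", "2,bob,78"), path)

# read.csv plays a role similar to PROC IMPORT for CSV input.
scores <- read.csv(path, stringsAsFactors = FALSE)

str(scores)
```

Column types are inferred from the file contents, so it is worth checking them with `str` after import.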
Merging Data
• SAS
• R
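The slide's own code is not preserved in this transcript. As a minimal sketch of the R side, assuming two data frames that share a key column `id` (the data and names are made up):

```r
customers <- data.frame(id = c(1, 2, 3), name = c("ann", "bob", "cy"))
orders    <- data.frame(id = c(1, 1, 3, 4), amount = c(20, 35, 10, 99))

# Inner join: only ids present in both data frames.
inner <- merge(customers, orders, by = "id")

# Left join: keep every customer; unmatched rows get NA for amount.
left <- merge(customers, orders, by = "id", all.x = TRUE)

left
```

Unlike a SAS data step MERGE with a BY statement, `merge` does not need the inputs sorted by the key beforehand.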
Contingency Tables
• SAS
• R
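The slide's own code is not preserved in this transcript. As a minimal sketch of the R side, assuming two categorical columns (the survey data below are made up):

```r
survey <- data.frame(
  sex    = c("F", "M", "F", "M", "F", "M"),
  smoker = c("no", "no", "yes", "no", "no", "yes")
)

# Two-way table of counts, comparable to a PROC FREQ cross-tabulation.
tab <- table(survey$sex, survey$smoker)
tab

# Row proportions: each row sums to 1.
prop.table(tab, margin = 1)
```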
Logistic Regression
• SAS
• R
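The slide's own code is not preserved in this transcript. As a minimal sketch of the R side, assuming a binary outcome `y` coded 0/1 and one continuous predictor `x` (the data are made up):

```r
# Made-up data for illustration: outcome y (0/1) and one predictor x.
d <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(0, 0, 1, 0, 1, 0, 1, 1)
)

# glm with a binomial family fits a logistic regression,
# comparable to PROC LOGISTIC with a single continuous predictor.
fit <- glm(y ~ x, data = d, family = binomial)

coef(fit)                        # intercept and slope on the logit scale
predict(fit, type = "response")  # fitted probabilities
```

One point worth checking during conversion: PROC LOGISTIC by default models the lower ordered response value unless EVENT= or DESCENDING is specified, whereas `glm` here models the probability that y equals 1, so coefficient signs may flip between the two.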
Non-Linear Optimization
• SAS
• R
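The slide's own code is not preserved in this transcript. As a minimal sketch of the R side, the example below minimizes the Rosenbrock function with `optim`. The objective and starting point are standard test choices, not taken from the slides.

```r
# Objective: the Rosenbrock function, a standard test problem
# (chosen here for illustration; it is not from the slides).
rosenbrock <- function(p) {
  x <- p[1]
  y <- p[2]
  (1 - x)^2 + 100 * (y - x^2)^2
}

# optim minimizes by default; BFGS is a quasi-Newton method that
# approximates gradients numerically when none is supplied.
result <- optim(par = c(-1.2, 1), fn = rosenbrock, method = "BFGS")

result$par          # estimated minimizer, close to c(1, 1)
result$convergence  # 0 indicates successful convergence
```

As with any numerical optimizer, the answer depends on the starting values and method, so converted code should be checked against the SAS results within a tolerance rather than for exact equality.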
Thank you
Timothy D'Auria
Disclaimers
• SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries.
• R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License. R is hosted at www.r-project.org.
• Boston Decision LLC and Rconvert.com are not affiliated with SAS Institute nor the core R development team.