From SAS to Python: Road to Analytics
From SAS to PySpark: Road to Analytics
Contents

1. SAS vs Spark
2. SAS Proc SQL vs Spark SQL
3. Advantage Analytics
1. SAS vs Spark
OVERVIEW
SAS
○ The largest independent vendor in “advanced analytics”
○ Founded in 1976 as the SAS Institute in Cary, North Carolina
○ Commercial software product
SPARK
○ A fast and general engine for large-scale data processing
○ Started in 2009 as a research project at UC Berkeley’s AMPLab
○ Open source
CODE
SAS
Basic programming model consists of code blocks:
○ SAS DATA step
  ■ generation of data
  ■ concatenation of data
○ SAS PROCedures
  ■ special functionalities
SPARK
“Line based” programming. The native language is Scala, but the programming model is flexible:
○ Scala
○ Java
○ Python
○ R
DATA
SAS: DATASET
○ Computed in memory (RAM)
○ A data set contains:
  ● observations: organized in rows
  ● variables: organized in columns
SPARK: DATAFRAME
○ A distributed collection of data organized into named columns.
○ Conceptually equivalent to a table in a relational database or a dataframe in R/Python
○ It is a programming abstraction
IMMUTABLE, PARTITIONED, DISTRIBUTED DATA STRUCTURE
Transformations such as map, filter, union, join, groupBy… result in another dataset.
SAS:

```sas
data sasData;
  set sasData;
  Fare2 = Fare + 2;
run;
```

Python Pandas:

```python
pandasDF['Fare2'] = pandasDF['Fare'] + 2
```

Spark:

```python
sparkDF = sparkDF.withColumn('Fare2', sparkDF['Fare'] + 2)
```
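The contrast above can be run end-to-end in plain pandas. A minimal sketch (the `Fare` column follows the slide's example; the values and the names `Fare3`/`newDF` are made up here) showing that pandas mutates the frame in place, while an `assign`-style call returns a new frame, analogous to Spark's immutable `withColumn`:

```python
import pandas as pd

# 'Fare' column as in the slide's example; values are made up.
pandasDF = pd.DataFrame({'Fare': [7.25, 71.5, 8.0]})

# pandas mutates the existing frame in place:
pandasDF['Fare2'] = pandasDF['Fare'] + 2

# Spark's withColumn never mutates: it returns a new DataFrame.
# pandas can mimic that immutable style with assign():
newDF = pandasDF.assign(Fare3=pandasDF['Fare'] + 3)

print(list(newDF.columns))           # ['Fare', 'Fare2', 'Fare3']
print('Fare3' in pandasDF.columns)   # False -- original frame unchanged
```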
NOTEBOOK
READ SAS DATASETS
The SAS file (sas7bdat) is a binary file with a special internal structure, created by SAS
● PYTHON: SAS7BDAT PACKAGE
● R: HAVEN LIBRARY
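A minimal sketch of reading a sas7bdat file from Python. This uses pandas' `read_sas` (the sas7bdat package offers a similar reader); the file path shown is a placeholder, not a real dataset:

```python
import pandas as pd

def load_sas_dataset(path):
    """Read a binary SAS dataset (.sas7bdat) into a pandas DataFrame."""
    return pd.read_sas(path, format='sas7bdat')

# Hypothetical usage -- 'train.sas7bdat' is a placeholder path:
# df = load_sas_dataset('train.sas7bdat')
```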
2. SAS Proc SQL vs Spark SQL
SQL sentences
SAS PROC SQL
A SAS procedure that combines the functionality of the DATA and PROC steps. It can sort, summarize, subset, join, and concatenate datasets, and create new variables...
Spark SQL
○ Spark’s interface for working with structured and semi-structured data, queried with SQL
○ Loads data from JSON, Hive, Parquet
○ Evaluated “lazily”
SQL sentences
SAS PROC SQL

```sas
PROC SQL;
  CREATE TABLE newTable AS
  SELECT Columns
  FROM Table
  WHERE Column > Value
  GROUP BY Columns;
QUIT;
```

Spark SQL

```python
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
newTable = sqlContext.sql("""
  SELECT Columns
  FROM Table
  WHERE Column > Value
  GROUP BY Columns
""")
```
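The SQL sentence itself is engine-agnostic, which is what makes the PROC SQL habit transfer well. As a runnable sketch, the same SELECT/WHERE/GROUP BY shape can be exercised with Python's built-in sqlite3 (the `fares` table and its columns are made up for illustration; in Spark you would pass the same string to `sqlContext.sql`):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE fares (pclass INTEGER, fare REAL)")
conn.executemany("INSERT INTO fares VALUES (?, ?)",
                 [(1, 80.0), (1, 60.0), (2, 20.0), (3, 8.0), (3, 7.0)])

# Same shape as the PROC SQL / Spark SQL sentence above:
rows = conn.execute("""
    SELECT pclass, AVG(fare)
    FROM fares
    WHERE fare > 7.5
    GROUP BY pclass
""").fetchall()

print(rows)
```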
NOTEBOOK
AGGREGATE FUNCTIONS IN SPARK SQL
sum, avg, mean, count, max, min, first, last, stddev, variance, skewness, kurtosis…

Act on each group of data and return a single value per group as a result.
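The per-group semantics can be sketched with a pandas groupby analogue, since the behavior mirrors Spark's `df.groupBy(...).agg(...)` (the `Pclass`/`Fare` columns and values are made up here):

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [1, 1, 2, 3, 3],
                   'Fare':   [80.0, 60.0, 20.0, 8.0, 7.0]})

# Each aggregate collapses a whole group into a single value,
# mirroring Spark's df.groupBy('Pclass').agg(...):
result = df.groupby('Pclass')['Fare'].agg(['sum', 'mean', 'max'])
print(result)
```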
WINDOW FUNCTION IN SPARK SQL
Ranking: rank, dense_rank, percent_rank, ntile, row_number
Analytics: cume_dist, lag, lead, first_value, last_value
Aggregate: the aggregate functions above

Calculate a return value over a set of rows, called a window, that are somehow related to the current row.

NOTEBOOK
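Unlike aggregates, a window function keeps one output per input row. A pandas analogue of `rank` and `lag` over a partition gives the idea (column names and data are made up; in Spark SQL you would write `rank() OVER (PARTITION BY Pclass ORDER BY Fare)`):

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [1, 1, 1, 2, 2],
                   'Fare':   [80.0, 60.0, 70.0, 20.0, 30.0]})

# rank() over (partition by Pclass order by Fare):
df['fare_rank'] = df.groupby('Pclass')['Fare'].rank(method='dense')

# lag(Fare) over (partition by Pclass order by Fare):
df = df.sort_values(['Pclass', 'Fare'])
df['prev_fare'] = df.groupby('Pclass')['Fare'].shift(1)

print(df[['Pclass', 'Fare', 'fare_rank', 'prev_fare']])
```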
EXTEND SPARK SQL
Over 100 standard functions are available (pyspark):

```python
from pyspark.sql.functions import *
```
BUILT-IN FUNCTIONS, UDFs
“User Defined Functions” define new column-based functions that extend the vocabulary of Spark SQL. They act on a single row as input and return a single value for every input row.
NOTEBOOK
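A UDF is just a per-row Python function. In Spark you would wrap it with `pyspark.sql.functions.udf`; the same one-row-in, one-value-out semantics can be sketched with pandas' `map` (the `fare_band` function and `FareBand` column are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Fare': [7.0, 80.0, 20.0]})

# Per-row function, as a Spark UDF would be: one input row, one return value.
def fare_band(fare):
    return 'high' if fare >= 50 else 'low'

df['FareBand'] = df['Fare'].map(fare_band)
print(df['FareBand'].tolist())  # ['low', 'high', 'low']
```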
TIPS
○ Don’t think in terms of sorted data: in a parallel process we can’t access rows by position
○ Cache tables/DataFrames when they are used more than once
○ Merge doesn’t need ordered data, unlike SAS
○ Use the functions already defined instead of creating your own UDFs
○ Save data in a columnar format such as Parquet
○ Avoid collecting data when working with Big Data; take a sample instead
3. Advantage Analytics
ADVANTAGE ANALYTICS
SAS Stats
Traditional Add-on package to SAS for Statistics
○ Analysis of variance
○ Bayesian analysis
○ Categorical data analysis
○ Distribution analysis
○ Mixed models
○ Predictive modeling...
Spark MLlib
Scalable machine learning library
○ Basic statistics
○ Classification and regression
○ Collaborative filtering
○ Clustering
○ Dimensionality reduction
○ Feature extraction and transformation...
BIBLIOGRAPHY
SPARK DOCUMENTATION: https://spark.apache.org/docs/2.0.0/
PYSPARK API: https://spark.apache.org/docs/2.0.0/api/python/index.html
PYSPARK FUNCTIONS: https://spark.apache.org/docs/2.0.0/api/python/_modules/pyspark/sql/functions.html