R Studio Server on Amazon EMR

44

Transcript of R Studio Server on Amazon EMR

Page 1: R Studio Server on Amazon EMR
Page 2: R Studio Server on Amazon EMR

2

R Studio Server on Amazon EMR

Chad DvoracekBrandon Veber

Page 3: R Studio Server on Amazon EMR

3

• Leading IDE for R• Integrated console• Code completion• Syntax highlighting editor• Direct code execution• Plotting, history, debugging• Open source

Page 4: R Studio Server on Amazon EMR

4

EMR (Elastic MapReduce)Managed Hadoop Framework

• Easy to use

• Quick set up

• Low cost (Spot Instance)

• Elastic

• Flexible

Page 5: R Studio Server on Amazon EMR

5

GUI Interactive Environments

Page 6: R Studio Server on Amazon EMR

6

Case StudySpeeding data deliver to data scientists.

Page 7: R Studio Server on Amazon EMR

7

Data Science Team•Temporary projects•Long term data persistence not needed•Data sets received in .zip files•Multiple complex joins•Aggregation•Process predictive algorithms

Page 8: R Studio Server on Amazon EMR

8

Initial Challenges•Joins in RDBMS taking to long•Moving data between systems•Time to delivery on changes•Data scientists only use R

Page 9: R Studio Server on Amazon EMR

9

Page 10: R Studio Server on Amazon EMR

10

Page 11: R Studio Server on Amazon EMR

11

Reduced time for data deliveryMoving to EMR allowed for faster processing and preparing of the data. By incorporating automating data load scripts and utilizing Hive for batch processing and Spark for in memory processing it allowed the data scientist to focus on solutions rather than time constraints.

Confidence utilizing Big Data systemsBy providing assistance in set up and training for data processing it allowed the data engineering team the ability to gain confidence in preparing data on distributed systems.

Case Study: Results

Process changeHaving the option to transform data in distributed systems provided an opportunity to re-think the process and time needed for data and solution delivery.

Future considerationsWith the data scientist working primarily in R, exploring R Studio Server and SparkR may be a logical next step in improving the teams workflow.

Page 12: R Studio Server on Amazon EMR

12

Bootstrap Problems

?

Page 13: R Studio Server on Amazon EMR

13

What about Spark R?

Page 14: R Studio Server on Amazon EMR

14

EMR Set UpQuick & Simple

Page 15: R Studio Server on Amazon EMR

15

Simple Start Up Script

Page 16: R Studio Server on Amazon EMR

16

EMR Configuration: Step 1

Page 17: R Studio Server on Amazon EMR

17

EMR Configuration: Step 2

Page 18: R Studio Server on Amazon EMR

18

EMR Configuration: Step 2

Page 19: R Studio Server on Amazon EMR

19

EMR Configuration: Step 3

Page 20: R Studio Server on Amazon EMR

20

EMR Configuration: Step 4

Page 21: R Studio Server on Amazon EMR

21

EMR Web Connection

Page 22: R Studio Server on Amazon EMR

22

EMR Web Connection

Page 23: R Studio Server on Amazon EMR

23

Final Step

Page 24: R Studio Server on Amazon EMR

24

R Studio ServerR and SparkR

Page 25: R Studio Server on Amazon EMR

25

R Studio Server: Sign In

Page 26: R Studio Server on Amazon EMR

26

R Studio Server

Page 27: R Studio Server on Amazon EMR

27

Environment Set Up

Page 28: R Studio Server on Amazon EMR

28

Data Analysis

Page 29: R Studio Server on Amazon EMR

29

Data Set

Page 30: R Studio Server on Amazon EMR

30

Data Set

Page 31: R Studio Server on Amazon EMR

31

Create RDD Data Frame

Page 32: R Studio Server on Amazon EMR

32

SparkR API

Page 33: R Studio Server on Amazon EMR

33

sqlContext

Page 34: R Studio Server on Amazon EMR

34

Access Hive

Page 35: R Studio Server on Amazon EMR

35

Full Power of R

Page 36: R Studio Server on Amazon EMR

36

Full Power of R

Page 37: R Studio Server on Amazon EMR

37

SparkR: In Memory Processing

Page 38: R Studio Server on Amazon EMR

38

Page 39: R Studio Server on Amazon EMR

39

SparkR

Limitations

• Machine Learning limited to glm

• Distributed processing constrained by existing API.

Future

Databricks

Alteryx

Page 40: R Studio Server on Amazon EMR

40

Machine Learning in R*1. e1071 Functions for latent class analysis, short time Fourier

transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier etc (142479 downloads) 

2. rpart Recursive Partitioning and Regression Trees. (135390)3. igraph A collection of network analysis tools. (122930)4. nnet Feed-forward Neural Networks and Multinomial Log-Linear

Models. (108298)5. randomForest Breiman and Cutler's random forests for

classification and regression. (105375)6. caret package (short for Classification And Regression Training) is

a set of functions that attempt to streamline the process for creating predictive models. (87151)

7. kernlab Kernel-based Machine Learning Lab. (62064)8. glmnet Lasso and elastic-net regularized generalized linear

models. (56948)9. ROCR Visualizing the performance of scoring classifiers. (51323)10. gbm Generalized Boosted Regression Models. (44760)

11. party A Laboratory for Recursive Partitioning. (43290)12. arules Mining Association Rules and Frequent Itemsets. (39654)13. tree Classification and regression trees. (27882)14. klaR Classification and visualization. (27828)15. RWeka R/Weka interface. (26973)16. ipred Improved Predictors. (22358)17. lars Least Angle Regression, Lasso and Forward Stagewise.

(19691)18. earth Multivariate Adaptive Regression Spline Models. (15901)19. CORElearn Classification, regression, feature evaluation and

ordinal evaluation. (13856)20. mboost Model-Based Boosting. (13078)

* KDnuggets Top 20 R Machine Learning Packages and Data Science Packages. Source: http://www.kdnuggets.com/2015/06/top-20-r-machine-learning-packages.html

Page 41: R Studio Server on Amazon EMR

41

Machine Learning in Spark• Classification and regression

• linear models (SVMs, logistic regression, linear regression)

• naive Bayes• decision trees• ensembles of trees (Random Forests and Gradient-

Boosted Trees)• isotonic regression

• Collaborative filtering• alternating least squares (ALS)

• Clustering• k-means• Gaussian mixture• power iteration clustering (PIC)• latent Dirichlet allocation (LDA)• bisecting k-means• streaming k-means

• Dimensionality reduction• singular value decomposition (SVD)• principal component analysis (PCA)

• Feature extraction and transformation• Frequent pattern mining

• FP-growth• association rules• PrefixSpan

• Evaluation metrics• PMML model export• Optimization (developer)

• stochastic gradient descent• limited-memory BFGS (L-BFGS)

Page 42: R Studio Server on Amazon EMR

42

Resources

Slides: @the_nerdery

Code: https://github.com/thenerdery/SparkRTalk

Page 43: R Studio Server on Amazon EMR

43

Questions?

Page 44: R Studio Server on Amazon EMR

44

Contact

The [email protected](877) 664.6373