R Studio Server on Amazon EMR
-
Upload
the-nerdery -
Category
Technology
-
view
339 -
download
1
Transcript of R Studio Server on Amazon EMR
2
R Studio Server on Amazon EMR
Chad DvoracekBrandon Veber
3
• Leading IDE for R• Integrated console• Code completion• Syntax highlighting editor• Direct code execution• Plotting, history, debugging• Open source
4
EMR (Elastic MapReduce)Managed Hadoop Framework
• Easy to use
• Quick set up
• Low cost (Spot Instance)
• Elastic
• Flexible
5
GUI Interactive Environments
6
Case StudySpeeding data deliver to data scientists.
7
Data Science Team•Temporary projects•Long term data persistence not needed•Data sets received in .zip files•Multiple complex joins•Aggregation•Process predictive algorithms
8
Initial Challenges•Joins in RDBMS taking to long•Moving data between systems•Time to delivery on changes•Data scientists only use R
9
10
11
Reduced time for data deliveryMoving to EMR allowed for faster processing and preparing of the data. By incorporating automating data load scripts and utilizing Hive for batch processing and Spark for in memory processing it allowed the data scientist to focus on solutions rather than time constraints.
Confidence utilizing Big Data systemsBy providing assistance in set up and training for data processing it allowed the data engineering team the ability to gain confidence in preparing data on distributed systems.
Case Study: Results
Process changeHaving the option to transform data in distributed systems provided an opportunity to re-think the process and time needed for data and solution delivery.
Future considerationsWith the data scientist working primarily in R, exploring R Studio Server and SparkR may be a logical next step in improving the teams workflow.
12
Bootstrap Problems
?
13
What about Spark R?
14
EMR Set UpQuick & Simple
15
Simple Start Up Script
16
EMR Configuration: Step 1
17
EMR Configuration: Step 2
18
EMR Configuration: Step 2
19
EMR Configuration: Step 3
20
EMR Configuration: Step 4
21
EMR Web Connection
22
EMR Web Connection
23
Final Step
24
R Studio ServerR and SparkR
25
R Studio Server: Sign In
26
R Studio Server
27
Environment Set Up
28
Data Analysis
29
Data Set
30
Data Set
31
Create RDD Data Frame
32
SparkR API
33
sqlContext
34
Access Hive
35
Full Power of R
36
Full Power of R
37
SparkR: In Memory Processing
38
39
SparkR
Limitations
• Machine Learning limited to glm
• Distributed processing constrained by existing API.
Future
Databricks
Alteryx
40
Machine Learning in R*1. e1071 Functions for latent class analysis, short time Fourier
transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier etc (142479 downloads)
2. rpart Recursive Partitioning and Regression Trees. (135390)3. igraph A collection of network analysis tools. (122930)4. nnet Feed-forward Neural Networks and Multinomial Log-Linear
Models. (108298)5. randomForest Breiman and Cutler's random forests for
classification and regression. (105375)6. caret package (short for Classification And Regression Training) is
a set of functions that attempt to streamline the process for creating predictive models. (87151)
7. kernlab Kernel-based Machine Learning Lab. (62064)8. glmnet Lasso and elastic-net regularized generalized linear
models. (56948)9. ROCR Visualizing the performance of scoring classifiers. (51323)10. gbm Generalized Boosted Regression Models. (44760)
11. party A Laboratory for Recursive Partitioning. (43290)12. arules Mining Association Rules and Frequent Itemsets. (39654)13. tree Classification and regression trees. (27882)14. klaR Classification and visualization. (27828)15. RWeka R/Weka interface. (26973)16. ipred Improved Predictors. (22358)17. lars Least Angle Regression, Lasso and Forward Stagewise.
(19691)18. earth Multivariate Adaptive Regression Spline Models. (15901)19. CORElearn Classification, regression, feature evaluation and
ordinal evaluation. (13856)20. mboost Model-Based Boosting. (13078)
* KDnuggets Top 20 R Machine Learning Packages and Data Science Packages. Source: http://www.kdnuggets.com/2015/06/top-20-r-machine-learning-packages.html
41
Machine Learning in Spark• Classification and regression
• linear models (SVMs, logistic regression, linear regression)
• naive Bayes• decision trees• ensembles of trees (Random Forests and Gradient-
Boosted Trees)• isotonic regression
• Collaborative filtering• alternating least squares (ALS)
• Clustering• k-means• Gaussian mixture• power iteration clustering (PIC)• latent Dirichlet allocation (LDA)• bisecting k-means• streaming k-means
• Dimensionality reduction• singular value decomposition (SVD)• principal component analysis (PCA)
• Feature extraction and transformation• Frequent pattern mining
• FP-growth• association rules• PrefixSpan
• Evaluation metrics• PMML model export• Optimization (developer)
• stochastic gradient descent• limited-memory BFGS (L-BFGS)
42
Resources
Slides: @the_nerdery
Code: https://github.com/thenerdery/SparkRTalk
43
Questions?