R and Spark: How to Analyze Data Using RStudio's Sparklyr and H2O's Rsparkling Packages: Spark...
-
Upload
spark-summit -
Category
Data & Analytics
-
view
40 -
download
0
Transcript of R and Spark: How to Analyze Data Using RStudio's Sparklyr and H2O's Rsparkling Packages: Spark...
ANALYZE DATA USING RSTUDIO'S SPARKLYR
R AND SPARKhttps://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast
Apache Spark• Huge investments in big data and Hadoop• Data scientists wanting to analyze data at scale• Rapid progress and adoption in Spark libraries
R and RStudio• Wide range of tools and packages• Powerful ways to share insights• Interactive notebooks• Great visualizations
What we hear from our customers
Best of both worldsIf you are investing in Spark, then there is nothing stopping you from using it with the full power of R
Using R with Spark
Benefits of Spark for the R user
Apache Spark…• Can integrate with Hadoop• Supports familiar SQL syntax• Has built-in machine learning• Is designed for performance• Great for interactive data analysis
R users can take advantage of all these investments
New! Open-source R package from RStudio• Integrated with the RStudio IDE• Sparklyr is a dplyr back-end for Spark• Extensible foundation for Spark
applications and R
sparklyrhttp://spark.rstudio.com/
Create your own R packages with interfaces to Spark•Interfaces to custom machine learning pipelines•Interfaces to 3rd party Spark packages•Many other R interfaces
sparklyr extensionsExampleCount the number of lines in a file
Extensionlibrary(sparklyr)count_lines <- function(sc, file) { spark_context(sc) %>% invoke("textFile", file, 1L) %>% invoke("count")}
Callsc <- spark_connect(master = "local")count_lines(sc, "hdfs://path/data.csv")
R for data science toolchain
“You’ll learn how to get your data into R [with Spark], get it into the most useful structure, transform it, visualize it and model it [with Spark].”
Import
Create a connectionsc <- spark_connect()
Import data from file/S3/HDFS/Rspark_read_csv(sc, “table”, “hdfs://<path>”)sdf_copy_to(sc, table, “table”)nyct2010_tbl <- tbl(sc, “table")
Write dataspark_write_parquet(table, “hdfs://<path>”)
SparklyrConnect to Spark.
Read and write data in CSV, JSON, and Parquet
formats. Data can be stored in HDFS, S3, or on the
local filesystem.
Wrangle
dplyrmy_tbl %>% filter(Petal_Width < 0.3) %>% select(Petal_Length, Petal_Width)
Spark SQLselect Petal_Length, Petal_Widthfrom mytablewhere Petal_Width < 0.3
Use dplyr to write Spark SQL
A fast, consistent tool for working with data
frame like objects, both in memory and
out of memory.
Visualize
ggplot2collect(mpg_tbl) %>%ggplot() + aes(displ, hwy, color = class) + geom_point()
Use ggplot2 to visualize data
collected from Spark
A plotting system for R that makes it easy to
produce complex multi-layered graphics.
Model
ModelsK-means
Linear regressionLogistic regressionSurvival regression
Generalized linear regressionDecision trees
Random forestsGradient boosted trees
Principal component analysisNaive Bayes
Multilayer perceptronLatent Dirichlet allocation
One vs rest
Industry SpecificChemometricsClinical TrialsEconometricsEnvironmetrics
FinanceGenetics
PharmacokineticsPhylogeneticsPsychometricsSocial Sciences
ModelsGLMNet
Bayesian regressionMultinomial regression
Random ForestGradient boosted machine
Decision treesMulti-Layer Perceptron
Auto-encoderRestricted Boltzmann
K-MeansLSHSVDALS
ARIMAForecasting
Collaborative filteringSolvers and optimization
General TopicsMachine Learning
BayesianCluster
Design of experimentsExtreme ValueMeta AnalsisMultivariate
NLPRobust methods
SpatialSurvival
Time SeriesGraphical models
No data movement required. Native ML algorithms.
Fast growing ecosystem.
Over 10,000 packages. Time tested, industry specific models.
Integrated with other R packages
Scalable, high performance models. Wide variety of algorithms.
Useful diagnostics.
MLlib
CommunicateR MarkdownNotebooks
Make decisionsTake actionsSee results
Weave together text and code to produce
high quality documents, apps, and plots.
Share
DemoAnalyzing 1 billion records with Spark and R
http://colorado.rstudio.com:3939/content/262/
https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast
rsparkling extension
Spark is extensible…sparklyr is extensible
https://github.com/h2oai/rsparkling/blob/master/R/h2o_context.R#L53
Spark
R H2O
rsparkling
sparklyr
h2o
sparklingwater
Benefits Limitations
No data movement required. Native ML algorithms.
Fast growing ecosystem.
Comparatively fewer algorithms and fewer diagnostics.
Scalable, high performance models. Wide variety of algorithms.
Useful diagnostics.
Data conversion requires 3-4X memory. Added complexity around introducing and
learning another tool.
Access to CRAN packages, visualization, reporting tools, and time tested algorithms.
Data collection is expensive and collection size is limited (< 10 GB).
Where should I model my data?
Others…
MLlib
spark.rstudio.com
https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast