Transcript
Page 1: SparkR + Zeppelin

SparkR + Zeppelin
Seattle Spark Meetup

Sept 9, 2015
Felix Cheung

Page 2: SparkR + Zeppelin

Agenda
• R & SparkR
• SparkR DataFrame
• SparkR in Zeppelin
• What’s next

Page 3: SparkR + Zeppelin

R
• A programming language for statistical computing and graphics
• S – 1975
  • S4 - advanced object-oriented features
• R – 1993
  • S + lexical scoping
• Interpreted
• Matrix arithmetic
• Comprehensive R Archive Network (CRAN) – 7000+ packages
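The three language features named on this slide (S4 classes, lexical scoping, matrix arithmetic) can be illustrated with a minimal base-R sketch. The names `Person`, `describe`, and `make_counter` are hypothetical and chosen only for illustration; they do not come from the talk.

```r
library(methods)  # S4 support (attached by default in interactive sessions)

# S4: a formally declared class with typed slots, plus a generic and a method.
# `Person` and `describe` are hypothetical names used only for this sketch.
setClass("Person", representation(name = "character", age = "numeric"))
setGeneric("describe", function(obj) standardGeneric("describe"))
setMethod("describe", "Person", function(obj) {
  paste(obj@name, "is", obj@age)
})
p <- new("Person", name = "Smith", age = 30)
desc <- describe(p)

# Lexical scoping: the returned function keeps access to `count`,
# which lives in the environment where the function was created.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}
counter <- make_counter()
first <- counter()
second <- counter()

# Matrix arithmetic: `*` is element-wise, `%*%` is the matrix product.
m <- matrix(1:4, nrow = 2)  # filled column by column
elementwise <- m * m
product <- m %*% m
```

S4 is relevant to the rest of the talk because SparkR's DataFrame is itself exposed through S4 classes and methods.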

Page 4: SparkR + Zeppelin

Fast!

Scalable

Flexible

Statistical!

Interactive

Packages

Page 5: SparkR + Zeppelin

SparkR
• R language APIs for Spark and Spark SQL
• Exposes Spark functionality in an R-friendly DataFrame API
• Runs as its own REPL: sparkR
• or as a standard R package imported in tools like RStudio:
    library(SparkR)
    sc <- sparkR.init()
    sqlContext <- sparkRSQL.init(sc)
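Once the contexts are initialized, a local R data frame can be turned into a SparkR DataFrame and queried. The following is a minimal sketch, assuming a Spark 1.4/1.5 installation that `sparkR.init()` can find; the choice of R's built-in `faithful` data set is an assumption made for illustration and is not taken from the talk.

```r
library(SparkR)

# Assumes a local Spark 1.4/1.5 installation is available to sparkR.init()
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Convert R's built-in `faithful` data set into a distributed DataFrame
df <- createDataFrame(sqlContext, faithful)

# Inspect the first rows and the inferred schema
head(df)
printSchema(df)

# Column selection and row filtering run on the Spark side;
# head() brings only the first rows back into the R session
head(select(df, df$eruptions))
head(filter(df, df$waiting < 50))

sparkR.stop()
```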

Page 6: SparkR + Zeppelin

History
• Shivaram Venkataraman & Zongheng Yang, AMPLab – UC Berkeley
• RDD APIs in a standalone package (Jan/2014)
• Spark SQL and SchemaRDD -> DataFrame
• Spark 1.4 – first Spark release with SparkR APIs
• Spark 1.5 (today!)

Page 7: SparkR + Zeppelin

Architecture

[Diagram labels: Native S4 classes & methods, socket, RBackend]

• A set of native S4 classes and methods that live inside a standard R package
• A backend that passes data structures and method calls to Spark Scala/JVM
• A collection of “helper” methods written in Scala

Page 8: SparkR + Zeppelin

Advantages
• R-like syntax extending DataFrame API
• JVM processing with full access to Spark’s DAG capabilities and Catalyst engine, e.g. execution plan optimization, constant-folding, predicate pushdown, and code generation

Page 10: SparkR + Zeppelin

SparkR in Zeppelin

Page 11: SparkR + Zeppelin

Architecture

RR adaptor

Page 12: SparkR + Zeppelin

Demo

Page 14: SparkR + Zeppelin

(extracted from the demo)

Native R

Page 15: SparkR + Zeppelin

(extracted from the demo)

Native R and dplyr...

Similarly SparkR DataFrame…

Page 16: SparkR + Zeppelin

(extracted from the demo)

SparkR DataFrame…

Page 17: SparkR + Zeppelin

What’s new
• Zeppelin - run with provided Spark (SPARK_HOME)
• Spark 1.5.0 release
• SparkR new APIs

Page 18: SparkR + Zeppelin

SparkR in Spark 1.5.0
Get this today:
• R formula
• Machine learning like GLM
    model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
• More R-like
    df[df$age %in% c(19, 30), 1:2]
    transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
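The GLM call on this slide can be placed in a fuller workflow. The sketch below assumes Spark 1.5 with SparkR available and assumes that `df` was created from R's built-in `iris` data set (the slide does not say where `df` comes from); it fits the model, then inspects coefficients and predictions.

```r
library(SparkR)

# Assumes a local Spark 1.5 installation is available to sparkR.init()
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Assumption: df is built from iris. Its column names contain dots
# (e.g. Sepal.Length), which SparkR replaces with underscores, with a warning.
df <- createDataFrame(sqlContext, iris)

# Fit a Gaussian GLM using the R formula interface, as on the slide
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")

# Show the fitted coefficients
summary(model)

# Score the training data and compare actual values with predictions
predictions <- predict(model, newData = df)
head(select(predictions, "Sepal_Length", "prediction"))

sparkR.stop()
```

The fitting runs in Spark on the JVM side, so only the summary and the first few prediction rows are brought back into the R session.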

Page 19: SparkR + Zeppelin

Zeppelin
• Stay tuned! More to come with R/SparkR
• Lots of updates in the upcoming 0.5.x/0.6.0 release

Page 20: SparkR + Zeppelin

Question?
https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7
blog: http://bit.ly/1E2z6OI

Page 22: SparkR + Zeppelin

subset
# Columns can be selected using `[[` and `[`
df[[2]] == df[["age"]]
df[,2] == df[,"age"]
df[,c("name", "age")]
# Or to filter rows
df[df$age > 20,]
# DataFrame can be subset on both rows and Columns
df[df$name == "Smith", c(1,2)]
df[df$age %in% c(19, 30), 1:2]
subset(df, df$age %in% c(19, 30), 1:2)
subset(df, df$age %in% c(19), select = c(1,2))

Page 23: SparkR + Zeppelin

Transform/mutate
newDF <- mutate(df, newCol = df$col1 * 5, newCol2 = df$col1 * 2)

newDF2 <- transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)