SparkR - Play Spark Using R (20160909 HadoopCon)


Transcript of SparkR - Play Spark Using R (20160909 HadoopCon)

Page 1: SparkR - Play Spark Using R (20160909 HadoopCon)

SparkR - Play Spark Using R

Gil Chen @HadoopCon 2016

Demo: http://goo.gl/VF77ad

Page 2: SparkR - Play Spark Using R (20160909 HadoopCon)

about me

• R, Python & Matlab User

• Taiwan R User Group

• Taiwan Spark User Group

• Co-founder

• Data Scientist @

Page 3: SparkR - Play Spark Using R (20160909 HadoopCon)

HadoopCon 2015

Page 4: SparkR - Play Spark Using R (20160909 HadoopCon)
Page 5: SparkR - Play Spark Using R (20160909 HadoopCon)

Outline

• Introduction to SparkR

• Demo

• Starting to use SparkR

• DataFrames: dplyr style, SQL style

• RDD vs. DataFrames

• MLlib: GLM, K-means

• Use Cases

• Median: approxQuantile()

• ID Match: dplyr style, SQL style, SparkR function

• SparkR + Shiny

• The Future of SparkR

Page 6: SparkR - Play Spark Using R (20160909 HadoopCon)

Introduction to SparkR

Page 7: SparkR - Play Spark Using R (20160909 HadoopCon)

Spark Origin

• Apache Spark is an open source cluster computing framework

• Originally developed at the University of California, Berkeley's AMPLab

• The first two contributors to SparkR: Shivaram Venkataraman & Zongheng Yang

https://amplab.cs.berkeley.edu/

Page 8: SparkR - Play Spark Using R (20160909 HadoopCon)

Spark History

https://en.wikipedia.org/wiki/Apache_Spark

(Timeline graphic showing the releases that introduced PySpark, SparkR, and the DataFrames API.)

Page 9: SparkR - Play Spark Using R (20160909 HadoopCon)

Key Advantages of Spark & R

Spark (fast, flexible, scalable) + R (statistical, interactive, rich packages)

https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf

Page 10: SparkR - Play Spark Using R (20160909 HadoopCon)

ggplot2

(Slide image: Google search results for "ggplot2".)

ggplot2 is a plotting system for R, based on the grammar of graphics.
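A minimal sketch of that grammar-of-graphics style, using R's built-in mtcars data (illustrative only, not from the talk):

library(ggplot2)

# map weight to x and mpg to y, then add a point layer and a linear trend layer
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")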

Page 11: SparkR - Play Spark Using R (20160909 HadoopCon)

Shiny

http://shiny.rstudio.com/gallery/ — and more impressive dashboards…

A web application framework for R: turn your analyses into interactive web applications, with no HTML, CSS, or JavaScript knowledge required.
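A minimal Shiny sketch (not from the talk): a slider re-draws a histogram, with no HTML, CSS, or JavaScript written by hand:

library(shiny)

# UI: one slider input and one plot output
ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# server: re-render the histogram whenever the slider changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins, main = "Old Faithful eruptions")
  })
}

shinyApp(ui = ui, server = server)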

Page 12: SparkR - Play Spark Using R (20160909 HadoopCon)

Performance

https://amplab.cs.berkeley.edu/announcing-sparkr-r-on-spark/

The runtime performance of running group-by aggregation on 10 million integer pairs on a single machine in R, Python and Scala. (using the same dataset as https://goo.gl/iMLXnh)

https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf

Page 13: SparkR - Play Spark Using R (20160909 HadoopCon)

RDD (Resilient Distributed Dataset)

https://spark.apache.org/docs/2.0.0/api/scala/#org.apache.spark.rdd.RDD

Internally, each RDD is characterized by five main properties:

1. A list of partitions

2. A function for computing each split

3. A list of dependencies on other RDDs

4. Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

5. Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

https://docs.cloud.databricks.com/docs/latest/courses

Page 14: SparkR - Play Spark Using R (20160909 HadoopCon)

RDD Dependencies

• Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD, so the task can be executed locally and no shuffle is needed (e.g. map, flatMap, filter, sample).

• Wide dependency: multiple child partitions may depend on one partition of the parent RDD, so data has to be shuffled unless the parents are already hash-partitioned (e.g. sortByKey, reduceByKey, groupByKey, join). See the sketch below.

http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
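A hedged sketch contrasting the two dependency types, using the internal RDD API that appears later in this talk (SparkR::: functions are internal, so names may vary between versions):

# narrow dependencies: filter and map work partition-by-partition, no shuffle
rdd   <- SparkR:::textFile(sc, "data_word_count.txt")
lines <- SparkR:::filterRDD(rdd, function(x) { nchar(x) > 0 })
pairs <- SparkR:::lapply(lines, function(x) { list(substr(x, 1, 1), 1) })

# wide dependency: reduceByKey has to shuffle all values sharing a key onto one partition
counts <- SparkR:::reduceByKey(pairs, "+", 2L)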

Page 15: SparkR - Play Spark Using R (20160909 HadoopCon)

Job Scheduling

http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

(In the scheduling diagram, black marks partitions that are already in memory.)

Page 16: SparkR - Play Spark Using R (20160909 HadoopCon)

RDD Operations

Transformations: map(), flatMap(), filter(), mapPartitions(), sample(), union(), intersection(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), cogroup(), …

Actions: reduce(), collect(), count(), first(), take(num), takeSample(), takeOrdered(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), countByValue(), countByKey(), foreach(), …

http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

(Transformations split into narrow-dep. and wide-dep. operations; all transformations are lazily evaluated — see the sketch below.)
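Transformations only record lineage; nothing is computed until an action runs. A hedged sketch using the same internal RDD API as the word-count example later in this talk:

# transformations: nothing is read or computed yet, only the lineage is recorded
rdd   <- SparkR:::textFile(sc, "data_word_count.txt")
upper <- SparkR:::lapply(rdd, function(line) toupper(line))

# action: this is the point where Spark actually schedules and runs the job
op <- SparkR:::collect(upper)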

Page 17: SparkR - Play Spark Using R (20160909 HadoopCon)

RDD Example

(Diagram: a chain of RDDs connected by transformations, ending with an action that returns a value to the driver.)

rdd <- SparkR:::textFile(sc, "data_word_count.txt")

words <- SparkR:::flatMap(rdd, function(line) { strsplit(line, " ")[[1]] })

wordCount <- SparkR:::lapply(words, function(word) { list(word, 1) })

counts <- SparkR:::reduceByKey(wordCount, "+", 2L)

op <- SparkR:::collect(counts)

Page 18: SparkR - Play Spark Using R (20160909 HadoopCon)

RDD & DataFrames

(Diagram: the R shell drives SparkR. Before v1.6 the working abstraction was the RDD, roughly analogous to an R array; the DataFrame, an R data.frame plus a schema, is renamed SparkDataFrame since v2.0. Operations fall into general, transformation, and action types.)

Page 19: SparkR - Play Spark Using R (20160909 HadoopCon)

DataFrames are Faster!

http://scala-phase.org/talks/rdds-dataframes-datasets-2016-06-16/#/

Beyond SQL: Speeding up Spark with DataFrames — http://www.slideshare.net/databricks/spark-sqlsse2015public

Page 20: SparkR - Play Spark Using R (20160909 HadoopCon)

Spark Stack

https://www.safaribooksonline.com/library/view/data-analytics-with/9781491913734/ch04.html

Stack layers (bottom to top): Storage, Cluster Manager, Processing Engine, Access & Interfaces.

Page 21: SparkR - Play Spark Using R (20160909 HadoopCon)

How does SparkR work?

https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf

Page 22: SparkR - Play Spark Using R (20160909 HadoopCon)

Upgrading From SparkR 1.6 to 2.0

                                     Before 1.6.2              Since 2.0.0
data type naming                     DataFrame                 SparkDataFrame
read CSV                             package from Databricks   built-in
functions such as approxQuantile()   X (not available)         O (available)
ML functions                         glm                       more (or use sparklyr)
SQLContext / HiveContext             sparkRSQL.init(sc)        merged into sparkR.session()
execution messages                   very detailed             simple
launch on EC2                        API provided              X (removed)

https://spark.apache.org/docs/latest/sparkr.html

Page 23: SparkR - Play Spark Using R (20160909 HadoopCon)

Demo: http://goo.gl/VF77ad

Page 24: SparkR - Play Spark Using R (20160909 HadoopCon)

Easy Setting

1. Download a pre-built Spark release

2. Decompress it and note the installation path

3. Set the path and launch SparkR from R (see the sketch below)
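A minimal sketch of those three steps from inside R (the download URL and paths are illustrative; any pre-built Spark release works):

# 1. download a pre-built Spark release
download.file("http://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz",
              destfile = "spark-2.0.0-bin-hadoop2.7.tgz")

# 2. decompress it and note the resulting path
untar("spark-2.0.0-bin-hadoop2.7.tgz", exdir = "~/spark")

# 3. point SPARK_HOME at that path and load SparkR from it
Sys.setenv(SPARK_HOME = "~/spark/spark-2.0.0-bin-hadoop2.7")
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))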

Page 25: SparkR - Play Spark Using R (20160909 HadoopCon)

Documents

• If you have to use RDDs, refer to the AMPLab GitHub docs (http://amplab-extras.github.io/SparkR-pkg/rdocs/1.2/) and call the internal functions with ":::", e.g. SparkR:::textFile, SparkR:::lapply

• Otherwise, refer to SparkR official documents: https://spark.apache.org/docs/2.0.0/api/R/index.html

Page 26: SparkR - Play Spark Using R (20160909 HadoopCon)

Starting to Use SparkR (v1.6.2)

# Set Spark path
Sys.setenv(SPARK_HOME = "/usr/local/spark-1.6.2-bin-hadoop2.6/")

# Load SparkR library into your R session
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# Initialize SparkContext (sc)
sc <- sparkR.init(appName = "Demo_SparkR")

# Initialize SQLContext
sqlContext <- sparkRSQL.init(sc)

# your SparkR script
# ...
# ...

sparkR.stop()

Page 27: SparkR - Play Spark Using R (20160909 HadoopCon)

Starting to Use SparkR (v2.0.0)

# Set Spark path
Sys.setenv(SPARK_HOME = "/usr/local/spark-2.0.0-bin-hadoop2.7/")

# Load SparkR library into your R session
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# Initialize the Spark session (replaces sparkR.init)
sc <- sparkR.session(appName = "Demo_SparkR")

# Initializing an SQLContext is no longer needed since 2.0.0
# sqlContext <- sparkRSQL.init(sc)

# your SparkR script
# ...
# ...

sparkR.stop()

Page 28: SparkR - Play Spark Using R (20160909 HadoopCon)

DataFrames

# Load the flights CSV file using read.df
sdf <- read.df(sqlContext, "data_flights.csv", "com.databricks.spark.csv", header = "true")

# Filter flights from JFK
jfk_flights <- filter(sdf, sdf$origin == "JFK")

# Group and aggregate flights to each destination
dest_flights <- summarize(
  groupBy(jfk_flights, jfk_flights$dest),
  count = n(jfk_flights$dest))

# Running SQL queries
registerTempTable(sdf, "tempTable")
training <- sql(sqlContext,
  "SELECT dest, count(dest) as cnt FROM tempTable WHERE dest = 'JFK' GROUP BY dest")

Page 29: SparkR - Play Spark Using R (20160909 HadoopCon)

Word Count

# read data into an RDD
rdd <- SparkR:::textFile(sc, "data_word_count.txt")

# split each line into words
words <- SparkR:::flatMap(rdd, function(line) { strsplit(line, " ")[[1]] })

# map: emit (word, 1) for each word
wordCount <- SparkR:::lapply(words, function(word) { list(word, 1) })

# reduce: sum the values by key (word)
counts <- SparkR:::reduceByKey(wordCount, "+", 2L)

# convert the RDD to a local list
op <- SparkR:::collect(counts)

Page 30: SparkR - Play Spark Using R (20160909 HadoopCon)

RDD vs. DataFrames

# DataFrames
flights_SDF <- read.df(sqlContext, "data_flights.csv", source = "com.databricks.spark.csv", header = "true")

SDF_op <- flights_SDF %>%
  group_by(flights_SDF$hour) %>%
  summarize(sum(flights_SDF$dep_delay)) %>%
  collect()

# RDD
flights_RDD <- SparkR:::textFile(sc, "data_flights.csv")

RDD_op <- flights_RDD %>%
  SparkR:::filterRDD(function(x) { x >= 1 }) %>%
  SparkR:::lapply(function(x) {
    y1 <- as.numeric(unlist(strsplit(x, ","))[2])
    y2 <- as.numeric(unlist(strsplit(x, ","))[6])
    return(list(y1, y2))
  }) %>%
  SparkR:::reduceByKey(function(x, y) x + y, 1) %>%
  SparkR:::collect()

Page 31: SparkR - Play Spark Using R (20160909 HadoopCon)

SparkR on MLlib

SparkR supports a subset of the R formula operators for model fitting, including ~, ., :, + and -,

e.g. y ~ x1 + x2
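A few formula sketches using those operators (column names follow the flights data on the next slides; illustrative, not exhaustive):

arr_delay ~ dep_delay + dist        # '+' adds predictors
arr_delay ~ .                       # '.' uses all remaining columns as predictors
arr_delay ~ . - cancelled           # '-' drops a column from the predictor set
arr_delay ~ dep_delay : dist        # ':' creates an interaction term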

Page 32: SparkR - Play Spark Using R (20160909 HadoopCon)

Generalized Linear Model, GLM

# read data and cache
flights_SDF <- read.df("data_flights.csv", source = "csv", header = "true", schema) %>% cache()

# drop NA
flights_SDF_2 <- dropna(flights_SDF, how = "any")

# split train/test dataset
train <- sample(flights_SDF_2, withReplacement = FALSE, fraction = 0.5, seed = 42)
test <- except(flights_SDF_2, train)

# build the model
gaussianGLM <- spark.glm(train, arr_delay ~ dep_delay + dist, family = "gaussian")
summary(gaussianGLM)

# prediction
preds <- predict(gaussianGLM, test)
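A small follow-up to inspect the fit on the held-out set (assuming the column names above; predict() appends a "prediction" column to the test SparkDataFrame):

# compare actual arrival delays against the model's predictions
head(select(preds, "arr_delay", "prediction"))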

Page 33: SparkR - Play Spark Using R (20160909 HadoopCon)

K-means

# read data and cache
flights_SDF <- read.df("data_flights.csv", source = "csv", header = "true", schema) %>% cache()

# drop NA
flights_SDF_2 <- dropna(flights_SDF, how = "any")

# clustering
kmeansModel <- spark.kmeans(flights_SDF_2,
  ~ arr_delay + dep_delay + dist + flight + dest + cancelled + time,
  k = 15)
summary(kmeansModel)

cluster_op <- fitted(kmeansModel)

# clustering result
kmeansPredictions <- predict(kmeansModel, flights_SDF_2)

Page 34: SparkR - Play Spark Using R (20160909 HadoopCon)

Use Cases

Page 35: SparkR - Play Spark Using R (20160909 HadoopCon)

Median (approxQuantile)

gdf <- seq(1, 10, 1) %>% data.frame()
colnames(gdf) <- "seq"

sdf <- createDataFrame(gdf)

median_val <- approxQuantile(sdf, "seq", 0.5, 0) %>% print()

Calculating a median with a plain SQL query is much more complicated…

http://www.1keydata.com/tw/sql/sql-median.html

Page 36: SparkR - Play Spark Using R (20160909 HadoopCon)

ID Match

##### method 1: dplyr-style + pipeline
join_id_m1 <- join(sdf_1, sdf_2, sdf_1$id1 == sdf_2$id2, "inner") %>%
  select("id2") %>%
  collect()

##### method 2: SQL query
createOrReplaceTempView(sdf_1, "table1")
createOrReplaceTempView(sdf_2, "table2")

qry_str <- "SELECT table2.id2 FROM table1 JOIN table2 ON table1.id1 = table2.id2"

join_id_m2 <- sql(qry_str)

##### method 3: SparkR function
join_id_m3 <- intersect(sdf_1, sdf_2) %>% collect()

Page 37: SparkR - Play Spark Using R (20160909 HadoopCon)

Play Pokemon Go Data with SparkR !!

Page 38: SparkR - Play Spark Using R (20160909 HadoopCon)

Application on SparkR

Interactive Maps + Web Framework + Compute Engine

Where is the Dragonite nest?

Page 39: SparkR - Play Spark Using R (20160909 HadoopCon)

Port 8080: cluster monitor — shows the capacity of each worker.

Page 40: SparkR - Play Spark Using R (20160909 HadoopCon)

Port 4040: jobs monitor — shows the status of each worker.

cache(SparkDataFrame): the first action after caching takes a long time, but later actions run much faster (see the sketch below).
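A quick way to see that effect (a sketch, reusing the demo's data_flights.csv): the first action pays the load cost, later actions reuse the in-memory data.

flights_SDF <- cache(read.df("data_flights.csv", source = "csv", header = "true"))

system.time(count(flights_SDF))   # first action: reads the CSV and fills the cache (slow)
system.time(count(flights_SDF))   # second action: served from memory (much faster)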

Page 41: SparkR - Play Spark Using R (20160909 HadoopCon)
Page 42: SparkR - Play Spark Using R (20160909 HadoopCon)

Some Tricks

• Customize the Spark config at launch (see the sketch below)

• cache() SparkDataFrames you reuse

• Some code cannot be run from RStudio; try the terminal instead

• Third-party packages can be pulled in at launch, e.g. the Databricks CSV reader
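A hedged sketch of the first and last tricks in one launch call (the memory sizes and the spark-csv version are illustrative, not from the talk):

# pass extra Spark config and third-party packages when creating the session
sparkR.session(appName       = "Demo_SparkR",
               sparkConfig   = list(spark.driver.memory   = "4g",
                                    spark.executor.memory = "4g"),
               sparkPackages = "com.databricks:spark-csv:1.4.0")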

Page 43: SparkR - Play Spark Using R (20160909 HadoopCon)

The Future of SparkR

• More MLlib APIs

• Advanced user-defined functions (UDFs)

• The sparklyr package from RStudio

Page 44: SparkR - Play Spark Using R (20160909 HadoopCon)

References

• SparkR: Scaling R Programs with Spark. Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, and Matei Zaharia. SIGMOD 2016, June 2016. https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf

• SparkR: Interactive R Programs at Scale. Shivaram Venkataraman, Zongheng Yang. Spark Summit, June 2014, San Francisco. https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf

• Apache Spark Official Research: http://spark.apache.org/research.html — Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

• Apache Spark Official Documentation: http://spark.apache.org/docs/latest/api/scala/

• AMPLab UC Berkeley, SparkR Project: https://github.com/amplab-extras/SparkR-pkg

• Databricks Official Blog: https://databricks.com/blog/category/engineering/spark

• R-bloggers: Launch Apache Spark on AWS EC2 and Initialize SparkR Using RStudio: https://www.r-bloggers.com/launch-apache-spark-on-aws-ec2-and-initialize-sparkr-using-rstudio-2/

Page 45: SparkR - Play Spark Using R (20160909 HadoopCon)

RStudio on Amazon EC2

Page 46: SparkR - Play Spark Using R (20160909 HadoopCon)

Join Us

• Fansboard

• Web Designer (PHP & JavaScript)

• Editor w/ Facebook & Instagram

• Vpon - Data Scientist

• Taiwan Spark User Group

• Taiwan R User Group

Page 47: SparkR - Play Spark Using R (20160909 HadoopCon)

Thanks for your attention

— Taiwan Spark User Group & Vpon Data Team