Beneath RDD in Apache Spark by Jacek Laskowski


Transcript of Beneath RDD in Apache Spark by Jacek Laskowski

  • BENEATH RDD IN APACHE SPARK

    USING SPARK-SHELL AND WEBUI / JACEK LASKOWSKI / @JACEKLASKOWSKI / GITHUB / MASTERING APACHE SPARK NOTES

    https://blog.jaceklaskowski.pl/
    https://twitter.com/jaceklaskowski
    https://github.com/jaceklaskowski
    http://bit.ly/mastering-apache-spark

  • Jacek Laskowski is an independent consultant. Contact me at jacek@japila.pl.

    Delivering Development Services | Consulting | Training

    Building and leading development teams

    Mostly Apache Spark and Scala these days

    Leader of Warsaw Scala Enthusiasts and Warsaw Apache Spark

    Java Champion

    Blogger at blog.jaceklaskowski.pl and jaceklaskowski.pl

    @JacekLaskowski

    https://twitter.com/jaceklaskowski
    http://spark.apache.org/
    http://www.scala-lang.org/
    http://warsawscala.pl/
    http://www.meetup.com/Warsaw-Spark/
    https://java.net/website/java-champions/bios.html#Laskowski
    http://blog.jaceklaskowski.pl/
    http://jaceklaskowski.pl/

  • http://bit.ly/mastering-apache-spark

  • SPARKCONTEXT: THE LIVING SPACE FOR RDDS

  • SPARKCONTEXT AND RDDS

    An RDD belongs to one and only one SparkContext.

    You cannot share RDDs between contexts.

    SparkContext tracks how many RDDs were created.

    You can see it in an RDD's toString output (the number in square brackets).

  • SPARKCONTEXT AND RDDS (2)

  • RDD: RESILIENT DISTRIBUTED DATASET

  • CREATING RDD - SC.PARALLELIZE

    sc.parallelize(col, slices) distributes a local collection of any elements.

    scala> val rdd = sc.parallelize(0 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

    Alternatively, sc.makeRDD(col, slices)
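Under the hood, parallelize simply chops the local collection into numSlices contiguous chunks, one per partition. A minimal sketch of that splitting (the slice function below is an illustrative stand-in for Spark's internal logic, not its actual code):

```scala
// Sketch: split a local collection into numSlices contiguous chunks,
// roughly the way sc.parallelize assigns elements to partitions.
def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  require(numSlices > 0, "numSlices must be positive")
  val n = seq.length
  (0 until numSlices).map { i =>
    seq.slice((i * n) / numSlices, ((i + 1) * n) / numSlices)
  }
}

// 11 elements over 3 slices => chunk sizes 3, 4, 4
val chunks = slice(0 to 10, numSlices = 3)
```

Every element lands in exactly one chunk and chunk sizes differ by at most one, which is why the number of slices directly bounds the available parallelism.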

  • CREATING RDD - SC.RANGE

    sc.range(start, end, step, slices) creates an RDD of long numbers.

    scala> val rdd = sc.range(0, 100)
    rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[14] at range at <console>:24

  • CREATING RDD - SC.TEXTFILE

    sc.textFile(name, partitions) creates an RDD of lines from a file.

    scala> val rdd = sc.textFile("README.md")
    rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[16] at textFile at <console>:24

  • CREATING RDD - SC.WHOLETEXTFILES

    sc.wholeTextFiles(name, partitions) creates an RDD of pairs of a file name and its content from a directory.

    scala> val rdd = sc.wholeTextFiles("tags")
    rdd: org.apache.spark.rdd.RDD[(String, String)] = tags MapPartitionsRDD[18] at wholeTextFiles at <console>:24

  • There are many more advanced functions in SparkContext to create RDDs.

  • PARTITIONS (AND SLICES)

    Did you notice the words slices and partitions as parameters?

    Partitions (aka slices) are the level of parallelism.

    We're going to talk about the level of parallelism later.

  • CREATING RDD - DATAFRAMES

    RDDs are so last year :-) Use DataFrames... early and often!

    A DataFrame is a higher-level abstraction over RDDs and semi-structured data.

    DataFrames require a SQLContext.

  • FROM RDDS TO DATAFRAMES

    scala> val rdd = sc.parallelize(0 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24

    scala> val df = rdd.toDF
    df: org.apache.spark.sql.DataFrame = [_1: int]

    scala> val df = rdd.toDF("numbers")
    df: org.apache.spark.sql.DataFrame = [numbers: int]

  • ...AND VICE VERSA

    scala> val rdd = sc.parallelize(0 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24

    scala> val df = rdd.toDF("numbers")
    df: org.apache.spark.sql.DataFrame = [numbers: int]

    scala> df.rdd
    res23: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[70] at rdd at <console>:29

  • CREATING DATAFRAMES - SQLCONTEXT.CREATEDATAFRAME

    sqlContext.createDataFrame(rowRDD, schema)

  • CREATING DATAFRAMES - SQLCONTEXT.READ

    sqlContext.read is the modern yet experimental way.

    sqlContext.read.format(f).load(path), where f is:

    jdbc, json, orc, parquet, text

  • EXECUTION ENVIRONMENT

  • PARTITIONS AND LEVEL OF PARALLELISM

    The number of partitions of an RDD is (roughly) the number of tasks.

    Partitions are the hint to size jobs.

    Tasks are the smallest unit of execution.

    Tasks belong to TaskSets. TaskSets belong to Stages. Stages belong to Jobs.

    Jobs, stages, and tasks are displayed in the web UI.

    We're going to talk about the web UI later.
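The containment hierarchy can be sketched as a toy data model (these case classes are purely illustrative, not Spark's real scheduler types):

```scala
// Toy model of the execution hierarchy: a job is cut into stages at
// shuffle boundaries; each stage submits one TaskSet with one task
// per partition. (Illustrative only - not Spark's real classes.)
case class Task(partition: Int)
case class TaskSet(tasks: Seq[Task])
case class Stage(taskSet: TaskSet)
case class Job(stages: Seq[Stage])

// A job over 4 partitions with one shuffle => 2 stages, 8 tasks total.
val numPartitions = 4
val job = Job(Seq.fill(2)(Stage(TaskSet((0 until numPartitions).map(Task(_))))))
val totalTasks = job.stages.map(_.taskSet.tasks.size).sum
```

The point of the model: the partition count sets the task count per stage, which is why partitions are the knob for sizing jobs.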

  • PARTITIONS AND LEVEL OF PARALLELISM CTD.

    In local[*] mode, the number of partitions equals the number of cores (the default in spark-shell).

    scala> sc.defaultParallelism
    res0: Int = 8

    scala> sc.master
    res1: String = local[*]

    Not necessarily true when you use local or local[n] master URLs.
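What local[*] boils down to is the core count the JVM reports. A quick way to see the number sc.defaultParallelism should give you in spark-shell (assuming no spark.default.parallelism override):

```scala
// The number behind local[*]: one task slot per core the JVM sees.
// In spark-shell's local[*] mode, sc.defaultParallelism should match
// this (unless spark.default.parallelism is set explicitly).
val cores = Runtime.getRuntime.availableProcessors
```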

  • LEVEL OF PARALLELISM IN SPARK CLUSTERS

    TaskScheduler controls the level of parallelism.

    DAGScheduler, TaskScheduler, and SchedulerBackend work in tandem.

    DAGScheduler manages a "DAG" of RDDs (aka RDD lineage).

    SchedulerBackends manage TaskSets.

  • DAGSCHEDULER

  • TASKSCHEDULER AND SCHEDULERBACKEND

  • RDD LINEAGE

    RDD lineage is a graph of RDD dependencies.

    Use toDebugString to know the lineage.

    Be careful with the hops - they introduce shuffle barriers.

    Why is the RDD lineage important? This is the R in RDD - resiliency.

    But deep lineage costs processing time, doesn't it? Persist (aka cache) it early and often!

  • RDD LINEAGE - DEMO

    What does the following do?

    val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)

  • RDD LINEAGE - DEMO CTD.

    How many stages are there?

    // val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
    scala> rdd.toDebugString
    res2: String =
    (2) ShuffledRDD[3] at groupBy at <console>:24 []
     +-(2) MapPartitionsRDD[2] at groupBy at <console>:24 []
        |  MapPartitionsRDD[1] at map at <console>:24 []
        |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

    Nothing happens yet - processing time-wise.
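To see what the lineage will eventually compute, the same pipeline can be run eagerly on a plain Scala collection (no shuffle, no laziness, but identical results per key):

```scala
// Local-collection equivalent of the RDD pipeline above: pair each
// number with its parity, then group by parity. In Spark, the groupBy
// is the shuffle boundary that splits the job into two stages.
val grouped = (0 to 10).map(n => (n % 2, n)).groupBy(_._1)

val evens = grouped(0).map(_._2)  // numbers with parity 0
val odds  = grouped(1).map(_._2)  // numbers with parity 1
```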

  • SPARK CLUSTERS

    Spark supports the following clusters:

    one-JVM local cluster
    Spark Standalone
    Apache Mesos
    Hadoop YARN

    You use --master to select the cluster.

    spark://hostname:port is for Spark Standalone.

    And you know the local master URLs, don't you? local, local[n], or local[*]
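How the local master URLs map to a thread count can be sketched with a small, hypothetical parser (Spark does this internally when it starts the scheduler; localThreads is not a real Spark API):

```scala
// Hypothetical parser for local master URLs (Spark parses these
// internally): local => 1 thread, local[n] => n, local[*] => all cores.
val LocalN = """local\[(\d+)\]""".r

def localThreads(master: String): Option[Int] = master match {
  case "local"    => Some(1)
  case "local[*]" => Some(Runtime.getRuntime.availableProcessors)
  case LocalN(n)  => Some(n.toInt)
  case _          => None  // a cluster URL: spark://, mesos://, yarn, ...
}
```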

  • MANDATORY PROPERTIES OF SPARK APP

    Your task: Fill in the gaps below.

    Any Spark application must specify the application name (aka appName) and master URL.

    Demo time! => spark-shell is a Spark app, too!

  • SPARK STANDALONE CLUSTER

    The built-in Spark cluster.

    Start standalone Master with sbin/start-master

    Use -h to control the host name to bind to.

    Start standalone Worker with sbin/start-slave

    Run a single worker per machine (aka node).

    http://localhost:8080/ = web UI for Standalone cluster

    Don't confuse it with the web UI of a Spark application.

    Demo time! => Run Standalone cluster

  • SPARK-SHELL: SPARK REPL APPLICATION

  • SPARK-SHELL AND SPARK STANDALONE

    You can connect to Spark Standalone using spark-shell through the --master command-line option.

    Demo time! => we've already started the Standalone cluster.

  • WEBUI: WEB USER INTERFACE FOR SPARK APPLICATION

  • WEBUI

    It is available under http://localhost:4040/

    You can disable it using the spark.ui.enabled flag.

    All the events are captured by Spark listeners.

    You can register your own Spark listener.

    Demo time! => webUI in action with different master URLs

  • QUESTIONS?

    - Visit Jacek Laskowski's blog
    - Follow @jaceklaskowski at twitter
    - Use Jacek's projects at GitHub
    - Read Mastering Apache Spark notes

    https://blog.jaceklaskowski.pl/
    https://twitter.com/jaceklaskowski
    https://github.com/jaceklaskowski
    http://bit.ly/mastering-apache-spark