Beneath RDD in Apache Spark by Jacek Laskowski


Transcript of Beneath RDD in Apache Spark by Jacek Laskowski

Page 1: Beneath RDD in Apache Spark by Jacek Laskowski

BENEATH RDD IN APACHE SPARK

USING SPARK-SHELL AND WEBUI / JACEK LASKOWSKI / @JACEKLASKOWSKI / GITHUB / MASTERING APACHE SPARK NOTES

Page 2: Beneath RDD in Apache Spark by Jacek Laskowski

Jacek Laskowski is an independent consultant. Contact me at [email protected] or @JacekLaskowski.

Delivering Development Services | Consulting | Training

Building and leading development teams

Mostly Apache Spark and Scala these days

Leader of Warsaw Scala Enthusiasts and Warsaw Apache Spark

Java Champion

Blogger at blog.jaceklaskowski.pl and jaceklaskowski.pl

Page 3: Beneath RDD in Apache Spark by Jacek Laskowski

http://bit.ly/mastering-apache-spark

Page 4: Beneath RDD in Apache Spark by Jacek Laskowski

http://bit.ly/mastering-apache-spark

Page 5: Beneath RDD in Apache Spark by Jacek Laskowski

SPARKCONTEXT - THE LIVING SPACE FOR RDDS

Page 6: Beneath RDD in Apache Spark by Jacek Laskowski

SPARKCONTEXT AND RDDS

An RDD belongs to one and only one Spark context.

You cannot share RDDs between contexts.

SparkContext tracks how many RDDs were created.

You may see it in toString output.
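A minimal spark-shell illustration of that counter (output is indicative only; ids and res numbers will differ in your session): every new RDD gets the next id from its SparkContext, and that id appears in brackets in the RDD's toString.

scala> val a = sc.parallelize(0 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val b = sc.parallelize(0 to 10)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> b.id
res0: Int = 1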

Page 7: Beneath RDD in Apache Spark by Jacek Laskowski

SPARKCONTEXT AND RDDS (2)

Page 8: Beneath RDD in Apache Spark by Jacek Laskowski

RDD - RESILIENT DISTRIBUTED DATASET

Page 9: Beneath RDD in Apache Spark by Jacek Laskowski

CREATING RDD - SC.PARALLELIZE

sc.parallelize(col, slices) to distribute a local collection of any elements.

scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

Alternatively, sc.makeRDD(col, slices)
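A quick sketch of the alternative (output indicative only): makeRDD behaves like parallelize and also takes the number of slices.

scala> val rdd = sc.makeRDD(0 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at makeRDD at <console>:24

scala> rdd.partitions.size
res1: Int = 4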

Page 10: Beneath RDD in Apache Spark by Jacek Laskowski

CREATING RDD - SC.RANGE

sc.range(start, end, step, slices) to create an RDD of long numbers.

scala> val rdd = sc.range(0, 100) rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[14] at range at <console>:24
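The step and slices arguments are optional; a sketch with an explicit step of 2 and 4 slices (output indicative only):

scala> val rdd = sc.range(0, 100, 2, 4)
rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[15] at range at <console>:24

scala> rdd.count
res2: Long = 50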

Page 11: Beneath RDD in Apache Spark by Jacek Laskowski

CREATING RDD - SC.TEXTFILE

sc.textFile(name, partitions) to create an RDD of lines from a file.

scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[16] at textFile at <console>:24
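The second argument is the minimum number of partitions; a sketch using the same README.md (output indicative only, the actual partition count depends on the input splits):

scala> val rdd = sc.textFile("README.md", 4)
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[17] at textFile at <console>:24

scala> rdd.partitions.size
res3: Int = 4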

Page 12: Beneath RDD in Apache Spark by Jacek Laskowski

CREATING RDD - SC.WHOLETEXTFILES

sc.wholeTextFiles(name, partitions) to create an RDD of pairs of a file name and its content from a directory.

scala> val rdd = sc.wholeTextFiles("tags")
rdd: org.apache.spark.rdd.RDD[(String, String)] = tags MapPartitionsRDD[18] at wholeTextFiles at <console>:24
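Each element is a (fileName, content) pair, so you can pattern-match on it; a sketch continuing with the rdd above (output indicative only):

scala> val sizes = rdd.map { case (file, content) => (file, content.length) }
sizes: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[19] at map at <console>:26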

Page 13: Beneath RDD in Apache Spark by Jacek Laskowski

There are many more advanced functions in SparkContext to create RDDs.

Page 14: Beneath RDD in Apache Spark by Jacek Laskowski

PARTITIONS (AND SLICES)

Did you notice the words slices and partitions as parameters?

Partitions (aka slices) are the level of parallelism.

We're going to talk about the level of parallelism later.
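You can already see the effect in spark-shell: the slices argument decides how many partitions (and hence tasks) an RDD has. A small sketch (output indicative only):

scala> sc.parallelize(0 to 10, 2).partitions.size
res4: Int = 2

scala> sc.parallelize(0 to 10, 8).partitions.size
res5: Int = 8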

Page 15: Beneath RDD in Apache Spark by Jacek Laskowski

CREATING RDD - DATAFRAMES

RDDs are so last year :-) Use DataFrames... early and often!

A DataFrame is a higher-level abstraction over RDDs and semi-structured data.

DataFrames require a SQLContext.
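In spark-shell a SQLContext is already created for you as sqlContext; in a standalone Spark 1.x application you would build it yourself, roughly like this (a minimal sketch, not taken from the slides):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc is the existing SparkContext
import sqlContext.implicits._        // brings toDF into scope (used on the next slides)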

Page 16: Beneath RDD in Apache Spark by Jacek Laskowski

FROM RDDS TO DATAFRAMES

scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24

scala> val df = rdd.toDF
df: org.apache.spark.sql.DataFrame = [_1: int]

scala> val df = rdd.toDF("numbers")
df: org.apache.spark.sql.DataFrame = [numbers: int]

Page 17: Beneath RDD in Apache Spark by Jacek Laskowski

...AND VICE VERSA

scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24

scala> val df = rdd.toDF("numbers")
df: org.apache.spark.sql.DataFrame = [numbers: int]

scala> df.rdd
res23: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[70] at rdd at <console>:29

Page 18: Beneath RDD in Apache Spark by Jacek Laskowski

CREATING DATAFRAMES - SQLCONTEXT.CREATEDATAFRAME

sqlContext.createDataFrame(rowRDD, schema)
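A minimal sketch of what rowRDD and schema look like (the column names and values are made up for illustration):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val rowRDD = sc.parallelize(Seq(Row(0, "zero"), Row(1, "one")))
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

val df = sqlContext.createDataFrame(rowRDD, schema)
// df: org.apache.spark.sql.DataFrame = [id: int, name: string]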

Page 19: Beneath RDD in Apache Spark by Jacek Laskowski

CREATING DATAFRAMES - SQLCONTEXT.READ

sqlContext.read is the modern yet experimental way.

sqlContext.read.format(f).load(path), where f is:

jdbc, json, orc, parquet, text
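A sketch with json as f (the file name people.json is made up; any path with JSON records works):

scala> val df = sqlContext.read.format("json").load("people.json")

scala> val df2 = sqlContext.read.json("people.json")  // shortcut for the built-in formats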

Page 20: Beneath RDD in Apache Spark by Jacek Laskowski

EXECUTION ENVIRONMENT

Page 21: Beneath RDD in Apache Spark by Jacek Laskowski

PARTITIONS AND LEVEL OF PARALLELISM

The number of partitions of an RDD is (roughly) the number of tasks.

Partitions are the hint to size jobs.

Tasks are the smallest unit of execution.

Tasks belong to TaskSets. TaskSets belong to Stages. Stages belong to Jobs.

Jobs, stages, and tasks are displayed in the web UI.

We're going to talk about the web UI later.

Page 22: Beneath RDD in Apache Spark by Jacek Laskowski

PARTITIONS AND LEVEL OF PARALLELISM CD.

In local[*] mode, the number of partitions equals the number of cores (the default in spark-shell).

scala> sc.defaultParallelism
res0: Int = 8

scala> sc.master
res1: String = local[*]

Not necessarily true when you use local or local[n] master URLs.
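For example, starting spark-shell with an explicit master URL pins the default parallelism to n rather than the number of cores (a sketch; output indicative only):

// started with: ./bin/spark-shell --master local[2]

scala> sc.master
res0: String = local[2]

scala> sc.defaultParallelism
res1: Int = 2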

Page 23: Beneath RDD in Apache Spark by Jacek Laskowski

LEVEL OF PARALLELISM IN SPARK CLUSTERS

TaskScheduler controls the level of parallelism.

DAGScheduler, TaskScheduler, and SchedulerBackend work in tandem.

DAGScheduler manages a "DAG" of RDDs (aka RDD lineage).

SchedulerBackends manage TaskSets.

Page 24: Beneath RDD in Apache Spark by Jacek Laskowski

DAGSCHEDULER

Page 25: Beneath RDD in Apache Spark by Jacek Laskowski

TASKSCHEDULER AND SCHEDULERBACKEND

Page 26: Beneath RDD in Apache Spark by Jacek Laskowski

RDD LINEAGE

RDD lineage is a graph of RDD dependencies.

Use toDebugString to know the lineage.

Be careful with the hops - they introduce shuffle barriers.

Why is the RDD lineage important? This is the R in RDD - resiliency.

But deep lineage costs processing time, doesn't it? Persist (aka cache) it early and often!
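A sketch of persisting (caching) early, assuming the RDD from the upcoming demo; once an action has materialized it, toDebugString also annotates the cached RDD with its storage level (output omitted as it varies):

scala> val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)

scala> rdd.cache          // marks it for caching; nothing is computed yet

scala> rdd.count          // the first action materializes and caches the partitions

scala> rdd.toDebugString  // the ShuffledRDD is now annotated as cached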

Page 27: Beneath RDD in Apache Spark by Jacek Laskowski

RDD LINEAGE - DEMO

What does the following do?

val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)

Page 28: Beneath RDD in Apache Spark by Jacek Laskowski

RDD LINEAGE - DEMO CD.

How many stages are there?

// val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
scala> rdd.toDebugString
res2: String =
(2) ShuffledRDD[3] at groupBy at <console>:24 []
 +-(2) MapPartitionsRDD[2] at groupBy at <console>:24 []
    |  MapPartitionsRDD[1] at map at <console>:24 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

Nothing happens yet - processing time-wise.
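To actually run the two stages, trigger an action (a sketch; numbers indicative only):

scala> rdd.count
res3: Long = 2

// groupBy on n % 2 yields two keys (0 and 1), hence two groups; the job shows up
// in the web UI with one shuffle boundary between its two stages.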

Page 29: Beneath RDD in Apache Spark by Jacek Laskowski

SPARK CLUSTERS

Spark supports the following clusters:

one-JVM local cluster
Spark Standalone
Apache Mesos
Hadoop YARN

You use --master to select the cluster.

spark://hostname:port is for Spark Standalone.

And you know the local master URLs, don't you? local, local[n], or local[*]

Page 30: Beneath RDD in Apache Spark by Jacek Laskowski

MANDATORY PROPERTIES OF SPARK APP

Your task: Fill in the gaps below.

Any Spark application must specify application name (aka appName) and master URL.

Demo time! => spark-shell is a Spark app, too!
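Filling in the gaps, a minimal sketch of how a Spark 1.x application sets the two mandatory properties (the app name "Beneath RDD" is made up; spark-shell sets both for you):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Beneath RDD")  // mandatory application name (appName)
  .setMaster("local[*]")      // mandatory master URL
val sc = new SparkContext(conf)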

Page 31: Beneath RDD in Apache Spark by Jacek Laskowski

SPARK STANDALONE CLUSTER

The built-in Spark cluster.

Start standalone Master with sbin/start-master

Use -h to control the host name to bind to.

Start standalone Worker with sbin/start-slave

Run a single worker per machine (aka node).

http://localhost:8080/ = web UI for Standalone cluster

Don't confuse it with the web UI of the Spark application.

Demo time! => Run Standalone cluster

Page 32: Beneath RDD in Apache Spark by Jacek Laskowski

SPARK-SHELL - SPARK REPL APPLICATION

Page 33: Beneath RDD in Apache Spark by Jacek Laskowski

SPARK-SHELL AND SPARK STANDALONE

You can connect to Spark Standalone using spark-shell through the --master command-line option.

Demo time! => we've already started the Standalone cluster.
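A sketch of the connection, assuming the Standalone Master runs on localhost on the default port 7077 (output indicative only):

// started with: ./bin/spark-shell --master spark://localhost:7077

scala> sc.master
res0: String = spark://localhost:7077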

Page 34: Beneath RDD in Apache Spark by Jacek Laskowski

WEBUI - WEB USER INTERFACE FOR SPARK APPLICATION

Page 35: Beneath RDD in Apache Spark by Jacek Laskowski

WEBUI

It is available under http://localhost:4040/

You can disable it using the spark.ui.enabled flag.

All the events are captured by Spark listeners.

You can register your own Spark listener (see the sketch below).

Demo time! => webUI in action with different master URLs
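A minimal sketch of registering your own listener from spark-shell (addSparkListener is a developer API; the println is just for illustration):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

sc.addSparkListener(new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished")
})

sc.parallelize(0 to 10).count  // triggers a job; the listener prints when it ends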

Page 36: Beneath RDD in Apache Spark by Jacek Laskowski

QUESTIONS?

- Visit Jacek Laskowski's blog
- Follow @jaceklaskowski at twitter
- Use Jacek's projects at GitHub
- Read Mastering Apache Spark notes.