Beneath RDD in Apache Spark by Jacek Laskowski


Transcript of Beneath RDD in Apache Spark by Jacek Laskowski

  • BENEATH RDD IN APACHE SPARK

    USING SPARK-SHELL AND WEBUI / JACEK LASKOWSKI / @JACEKLASKOWSKI / GITHUB / MASTERING APACHE SPARK NOTES

    https://blog.jaceklaskowski.pl/
    https://twitter.com/jaceklaskowski
    https://github.com/jaceklaskowski
    http://bit.ly/mastering-apache-spark

  • Jacek Laskowski is an independent consultant. Contact me at jacek@japila.pl.

    Delivering Development Services | Consulting | Training

    Building and leading development teams

    Mostly Apache Spark and Scala these days

    Leader of Warsaw Scala Enthusiasts and Warsaw Apache Spark

    Java Champion

    Blogger at blog.jaceklaskowski.pl and jaceklaskowski.pl

    @JacekLaskowski

    https://twitter.com/jaceklaskowski
    http://spark.apache.org/
    http://www.scala-lang.org/
    http://warsawscala.pl/
    http://www.meetup.com/Warsaw-Spark/
    https://java.net/website/java-champions/bios.html#Laskowski
    http://blog.jaceklaskowski.pl/
    http://jaceklaskowski.pl/

  • http://bit.ly/mastering-apache-spark

  • SPARKCONTEXT: THE LIVING SPACE FOR RDDS

  • SPARKCONTEXT AND RDDS

    An RDD belongs to one and only one SparkContext.

    You cannot share RDDs between contexts.

    SparkContext tracks how many RDDs were created.

    You can see it in an RDD's toString output (the number in square brackets).

  • SPARKCONTEXT AND RDDS (2)

  • RDD: RESILIENT DISTRIBUTED DATASET

  • CREATING RDD - SC.PARALLELIZE

    sc.parallelize(col, slices) distributes a local collection of any elements.

    scala> val rdd = sc.parallelize(0 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

    Alternatively, sc.makeRDD(col, slices)
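Under the hood, parallelize simply chops the local collection into numSlices contiguous chunks, one per partition. A minimal sketch of that splitting (the slice function below is an illustrative stand-in for Spark's internal logic, not its actual code):

```scala
// Sketch: split a local collection into numSlices contiguous chunks,
// roughly the way sc.parallelize assigns elements to partitions.
def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  require(numSlices > 0, "numSlices must be positive")
  val n = seq.length
  (0 until numSlices).map { i =>
    seq.slice((i * n) / numSlices, ((i + 1) * n) / numSlices)
  }
}

// 11 elements over 3 slices => chunk sizes 3, 4, 4
val chunks = slice(0 to 10, numSlices = 3)
```

Every element lands in exactly one chunk and chunk sizes differ by at most one, which is why the number of slices directly bounds the available parallelism.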

  • CREATING RDD - SC.RANGE

    sc.range(start, end, step, slices) creates an RDD of long numbers.

    scala> val rdd = sc.range(0, 100)
    rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[14] at range at <console>:24

  • CREATING RDD - SC.TEXTFILE

    sc.textFile(name, partitions) creates an RDD of lines from a file.

    scala> val rdd = sc.textFile("README.md")
    rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[16] at textFile at <console>:24

  • CREATING RDD - SC.WHOLETEXTFILES

    sc.wholeTextFiles(name, partitions) creates an RDD of pairs of a file name and its content from a directory.

    scala> val rdd = sc.wholeTextFiles("tags")
    rdd: org.apache.spark.rdd.RDD[(String, String)] = tags MapPartitionsRDD[18] at wholeTextFiles at <console>:24

  • There are many more advanced functions in SparkContext to create RDDs.

  • PARTITIONS (AND SLICES)

    Did you notice the words slices and partitions as parameters?

    Partitions (aka slices) are the level of parallelism.

    We're going to talk about the level of parallelism later.

  • CREATING RDD - DATAFRAMES

    RDDs are so last year :-) Use DataFrames... early and often!

    A DataFrame is a higher-level abstraction over RDDs and semi-structured data.

    DataFrames require a SQLContext.

  • FROM RDDS TO DATAFRAMES

    scala> val rdd = sc.parallelize(0 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24

    scala> val df = rdd.toDF
    df: org.apache.spark.sql.DataFrame = [_1: int]

    scala> val df = rdd.toDF("numbers")
    df: org.apache.spark.sql.DataFrame = [numbers: int]

  • ...AND VICE VERSA

    scala> val rdd = sc.parallelize(0 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24

    scala> val df = rdd.toDF("numbers")
    df: org.apache.spark.sql.DataFrame = [numbers: int]

    scala> df.rdd
    res23: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[70] at rdd at <console>:29

  • CREATING DATAFRAMES - SQLCONTEXT.CREATEDATAFRAME

    sqlContext.createDataFrame(rowRDD, schema)

  • CREATING DATAFRAMES - SQLCONTEXT.READ

    sqlContext.read is the modern yet experimental way.

    sqlContext.read.format(f).load(path), where f is:

    jdbc, json, orc, parquet, text

  • EXECUTION ENVIRONMENT

  • PARTITIONS AND LEVEL OF PARALLELISM

    The number of partitions of an RDD is (roughly) the number of tasks.

    Partitions are the hint to size jobs.

    Tasks are the smallest unit of execution.

    Tasks belong to TaskSets. TaskSets belong to Stages. Stages belong to Jobs.

    Jobs, stages, and tasks are displayed in the web UI.

    We're going to talk about the web UI later.
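The containment hierarchy can be sketched as a toy data model (these case classes are purely illustrative, not Spark's real scheduler types):

```scala
// Toy model of the execution hierarchy: a job is cut into stages at
// shuffle boundaries; each stage submits one TaskSet with one task
// per partition. (Illustrative only - not Spark's real classes.)
case class Task(partition: Int)
case class TaskSet(tasks: Seq[Task])
case class Stage(taskSet: TaskSet)
case class Job(stages: Seq[Stage])

// A job over 4 partitions with one shuffle => 2 stages, 8 tasks total.
val numPartitions = 4
val job = Job(Seq.fill(2)(Stage(TaskSet((0 until numPartitions).map(Task(_))))))
val totalTasks = job.stages.map(_.taskSet.tasks.size).sum
```

The point of the model: the partition count sets the task count per stage, which is why partitions are the knob for sizing jobs.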

  • PARTITIONS AND LEVEL OF PARALLELISM CTD.

    In local[*] mode, the number of partitions equals the number of cores (the default in spark-shell).

    scala> sc.defaultParallelism
    res0: Int = 8

    scala> sc.master
    res1: String = local[*]

    Not necessarily true when you use local or local[n] master URLs.
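What local[*] boils down to is the core count the JVM reports. A quick way to see the number sc.defaultParallelism should give you in spark-shell (assuming no spark.default.parallelism override):

```scala
// The number behind local[*]: one task slot per core the JVM sees.
// In spark-shell's local[*] mode, sc.defaultParallelism should match
// this (unless spark.default.parallelism is set explicitly).
val cores = Runtime.getRuntime.availableProcessors
```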

  • LEVEL OF PARALLELISM IN SPARK CLUSTERS

    TaskScheduler controls the level of parallelism.

    DAGScheduler, TaskScheduler, and SchedulerBackend work in tandem.

    DAGScheduler manages a "DAG" of RDDs (aka RDD lineage).

    SchedulerBackends manage TaskSets.

  • DAGSCHEDULER

  • TASKSCHEDULER AND SCHEDULERBACKEND

  • RDD LINEAGE

    RDD lineage is a graph of RDD dependencies.

    Use toDebugString to know the lineage.

    Be careful with the hops - they introduce shuffle barriers.

    Why is the RDD lineage important? This is the R in RDD - resiliency.

    But deep lineage costs processing time, doesn't it? Persist (aka cache) it early and often!

  • RDD LINEAGE - DEMO

    What does the following do?

    val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)

  • RDD LINEAGE - DEMO CTD.

    How many stages are there?

    // val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
    scala> rdd.toDebugString
    res2: String =
    (2) ShuffledRDD[3] at groupBy at <console>:24 []
     +-(2) MapPartitionsRDD[2] at groupBy at <console>:24 []
        |  MapPartitionsRDD[1] at map at <console>:24 []
        |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

    Nothing happens yet - processing time-wise.
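To see what the lineage will eventually compute, the same pipeline can be run eagerly on a plain Scala collection (no shuffle, no laziness, but identical results per key):

```scala
// Local-collection equivalent of the RDD pipeline above: pair each
// number with its parity, then group by parity. In Spark, the groupBy
// is the shuffle boundary that splits the job into two stages.
val grouped = (0 to 10).map(n => (n % 2, n)).groupBy(_._1)

val evens = grouped(0).map(_._2)  // numbers with parity 0
val odds  = grouped(1).map(_._2)  // numbers with parity 1
```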

  • SPARK CLUSTERS

    Spark supports the following clusters:

    one-JVM local cluster
    Spark Standalone
    Apache Mesos
    Hadoop YARN

    You use --master to select the cluster.

    spark://hostname:port is for Spark Standalone.

    And you know the local master URLs, don't you? local, local[n], or local[*]
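How the local master URLs map to a thread count can be sketched with a small, hypothetical parser (Spark does this internally when it starts the scheduler; localThreads is not a real Spark API):

```scala
// Hypothetical parser for local master URLs (Spark parses these
// internally): local => 1 thread, local[n] => n, local[*] => all cores.
val LocalN = """local\[(\d+)\]""".r

def localThreads(master: String): Option[Int] = master match {
  case "local"    => Some(1)
  case "local[*]" => Some(Runtime.getRuntime.availableProcessors)
  case LocalN(n)  => Some(n.toInt)
  case _          => None  // a cluster URL: spark://, mesos://, yarn, ...
}
```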

  • MANDATORY PROPERTIES OF SPARK APP

    Your task: Fill in the gaps below.

    Any Spark application must specify the application name (aka appName) and master URL.

    Demo time! => spark-shell is a Spark app, too!

  • SPARK STANDALONE CLUSTER

    The built-in Spark cluster.

    Start standalone Master with sbin/start-master

    Use -h to control the host name to bind to.

    Start standalone Worker with sbin/start-slave

    Run a single worker per machine (aka node).

    http://localhost:8080/ = web UI for Standalone cluster

    Don't confuse it with the web UI of a Spark application.

    Demo time! => Run Standalone cluster

  • SPARK-SHELL: SPARK REPL APPLICATION

  • SPARK-SHELL AND SPARK STANDALONE

    You can connect to Spark Standalone using spark-shell through the --master command-line option.

    Demo time! => we've already started the Standalone cluster.

  • WEBUI: WEB USER INTERFACE FOR SPARK APPLICATION

  • WEBUI

    It is available under http://localhost:4040/

    You can disable it using the spark.ui.enabled flag.

    All the events are captured by Spark listeners.

    You can register your own Spark listener.

    Demo time! => webUI in action with different master URLs

  • QUESTIONS?

    - Visit Jacek Laskowski's blog
    - Follow @jaceklaskowski at twitter
    - Use Jacek's projects at GitHub
    - Read Mastering Apache Spark notes

    https://blog.jaceklaskowski.pl/
    https://twitter.com/jaceklaskowski
    https://github.com/jaceklaskowski
    http://bit.ly/mastering-apache-spark