Spark RDD Operations in Scala, Part 2

  • date post

    14-Apr-2017
  • Category

    Education

  • ACADGILD

    In our previous post (https://acadgild.com/blog/spark-rdds-scala/), we discussed the basic RDD operations in Scala. Now, let's discuss some of the more advanced RDD operations in Scala.

    We use two datasets, dept and emp, to work through these operations. Their columns are:

    dept: [DeptNo DeptName]
    emp: [Emp_no DOB FName Lname gender HireDate DeptNo]

    Both datasets are tab-delimited.

    Union:

    The union operation returns an RDD that contains the elements of both RDDs; duplicates are not removed. You can refer to the screenshot below to see how the union operation behaves.
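    As a minimal sketch of the union described above: the records below are made-up stand-ins for the dept and emp files, and the local SparkContext setup is an assumption (in spark-shell, getOrCreate simply reuses the existing context).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("union-sketch").setMaster("local[*]"))

// Hypothetical tab-delimited records standing in for the real files.
val dept = sc.parallelize(Seq("d001\tMarketing", "d002\tFinance"))
val emp = sc.parallelize(Seq(
  "10001\t1953-09-02\tGeorgi\tFacello\tM\t1986-06-26\td001"))

// union keeps every element of both RDDs, duplicates included.
val combined = dept.union(emp)
combined.take(10).foreach(println)
```

    With real files, sc.textFile(path) would replace the parallelize calls.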

    In the screenshot, we have created two RDDs and loaded the two datasets into them. We have then performed the union operation on them. From the result you can see that both datasets are combined; we have printed the first 10 records of the resulting RDD, and the 10th record is the first record of the second dataset.

    Intersection:

    The intersection operation returns only the elements that are present in both RDDs, with duplicates removed. Refer to the screenshot below to see how to perform an intersection.
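    A minimal sketch of intersection on the department numbers, assuming a local SparkContext and made-up records in place of the real files:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("intersection-sketch").setMaster("local[*]"))

// Hypothetical tab-delimited records.
val dept = sc.parallelize(Seq("d001\tMarketing", "d002\tFinance"))
val emp = sc.parallelize(Seq(
  "10001\t1953-09-02\tGeorgi\tFacello\tM\t1986-06-26\td001",
  "10002\t1964-06-02\tBezalel\tSimmel\tF\t1985-11-21\td003"))

// 1st column of dept and 7th column of emp both hold DeptNo.
val deptNos = dept.map(_.split("\t")(0))
val empDeptNos = emp.map(_.split("\t")(6))

// Only values found in both RDDs survive.
val common = deptNos.intersection(empDeptNos)
common.collect().foreach(println)
```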

    Here, we have split the datasets using the tab delimiter and extracted the 1st column from the first dataset and the 7th column from the second dataset. We have then performed the intersection on the two RDDs, and the result is displayed.

    Cartesian:

    The cartesian operation returns an RDD containing the Cartesian product of the elements of both RDDs, that is, every possible pair with one element from each. You can refer to the screenshot below.
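    A minimal sketch of the Cartesian product, assuming a local SparkContext; the department numbers are made up and stand for the extracted columns:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("cartesian-sketch").setMaster("local[*]"))

// Hypothetical DeptNo values, as if already extracted from dept and emp.
val deptNos = sc.parallelize(Seq("d001", "d002"))
val empDeptNos = sc.parallelize(Seq("d001", "d003"))

// Every element of the first RDD is paired with every element of the second.
val pairs = deptNos.cartesian(empDeptNos)
pairs.collect().foreach(println)
```

    The result has (size of first) x (size of second) elements, so this operation grows quickly on large inputs.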

    Here, we have split the datasets using the tab delimiter and extracted the 1st column from the first dataset and the 7th column from the second dataset. Then, we have performed the cartesian operation on the RDDs, and the results are displayed.

    Subtract:

    The subtract operation returns the elements of the first RDD that are not present in the second RDD. You can refer to the screenshot below.
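    A minimal sketch of subtract, assuming a local SparkContext and made-up department numbers in place of the extracted columns:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("subtract-sketch").setMaster("local[*]"))

// Hypothetical DeptNo values, as if already extracted from dept and emp.
val deptNos = sc.parallelize(Seq("d001", "d002"))
val empDeptNos = sc.parallelize(Seq("d001", "d003"))

// Keeps elements of deptNos that do not occur in empDeptNos.
val onlyInDept = deptNos.subtract(empDeptNos)
onlyInDept.collect().foreach(println)
```

    Note that the operation is not symmetric: empDeptNos.subtract(deptNos) would keep "d003" instead.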

    Here, we have split the datasets using the tab delimiter and extracted the 1st column from the first dataset and the 7th column from the second dataset. Then, we have performed the subtract operation on the RDDs, and the results are displayed.

    Foreach:

    The foreach operation is used to iterate over every element in the RDD and apply a function to it. You can refer to the screenshot below.
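    A minimal sketch of foreach, assuming a local SparkContext and made-up emp records. In local mode the output appears in the console; on a cluster, foreach runs on the executors, so collecting first is one way to print on the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("foreach-sketch").setMaster("local[*]"))

// Hypothetical tab-delimited emp records.
val emp = sc.parallelize(Seq(
  "10001\t1953-09-02\tGeorgi\tFacello\tM\t1986-06-26\td001",
  "10002\t1964-06-02\tBezalel\tSimmel\tF\t1985-11-21\td003"))

// Applies println to each element where the partition is processed.
emp.foreach(println)

// Brings the (small) RDD to the driver first, then prints there.
emp.collect().foreach(println)
```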

    In the above screenshot, you can see that every element of the emp RDD is printed on a separate line.

    Operations on Pair RDDs:

    Creating Pair RDD:

    Here, we will create a pair RDD, which consists of key-value pairs. To name the RDD type in our code, we import the RDD class using the statement below. (On Spark versions earlier than 1.3, the pair RDD operations also require import org.apache.spark.SparkContext._.)

    import org.apache.spark.rdd.RDD

    You can refer to the screenshot below.
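    A minimal sketch of building a pair RDD from the dept records, assuming a local SparkContext and made-up data; the name deptPairs is illustrative only:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]"))

// Hypothetical tab-delimited dept records.
val dept = sc.parallelize(Seq("d001\tMarketing", "d002\tFinance"))

// Split each line on tab and emit (DeptNo, DeptName) as key and value.
val deptPairs: RDD[(String, String)] = dept.map { line =>
  val fields = line.split("\t")
  (fields(0), fields(1))
}
deptPairs.collect().foreach(println)
```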

    Here, we have split the dataset using tab as the delimiter and built the key-value pairs, as shown in the above screenshot.

    Keys:

    The keys operation returns an RDD containing all the keys of the pair RDD, which can then be printed. You can refer to the screenshot below.

    Values:

    The values operation returns an RDD containing all the values of the pair RDD, which can then be printed. You can refer to the screenshot below.
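    A minimal sketch of keys and values on a pair RDD, assuming a local SparkContext and made-up (DeptNo, DeptName) pairs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("keys-values-sketch").setMaster("local[*]"))

// Hypothetical (DeptNo, DeptName) pairs.
val deptPairs = sc.parallelize(Seq(("d001", "Marketing"), ("d002", "Finance")))

// keys and values each return a new RDD; collect brings it to the driver.
val deptNos = deptPairs.keys
val deptNames = deptPairs.values

deptNos.collect().foreach(println)
deptNames.collect().foreach(println)
```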

    SortByKey:

    The sortByKey operation returns an RDD that contains the key-value pairs sorted by key. It accepts a Boolean argument: true (the default) sorts the keys in ascending order, and false sorts them in descending order. You can refer to the screenshot below.
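    A minimal sketch of sortByKey in both directions, assuming a local SparkContext and made-up (DeptNo, DeptName) pairs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("sortbykey-sketch").setMaster("local[*]"))

// Hypothetical pairs, deliberately out of order.
val deptPairs = sc.parallelize(Seq(
  ("d002", "Finance"), ("d003", "Human Resources"), ("d001", "Marketing")))

// Ascending is the default; pass false for descending.
val ascending = deptPairs.sortByKey()
val descending = deptPairs.sortByKey(false)

ascending.collect().foreach(println)
descending.collect().foreach(println)
```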

    RDDs Holding Objects:

    Here, we use a case class to declare a record type and use that case class as the element type of the RDD, so that each record becomes an object with named fields. You can refer to the screenshot below.
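    A minimal sketch of an RDD whose elements are case class instances, assuming a local SparkContext; the Dept class and its field names are illustrative, not taken from the original screenshot:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type for one line of the dept dataset.
case class Dept(deptNo: String, deptName: String)

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("case-class-sketch").setMaster("local[*]"))

// Hypothetical tab-delimited dept records.
val lines = sc.parallelize(Seq("d001\tMarketing", "d002\tFinance"))

// Parse each line into a Dept object; the result is an RDD[Dept].
val depts = lines.map { line =>
  val fields = line.split("\t")
  Dept(fields(0), fields(1))
}
depts.collect().foreach(d => println(s"${d.deptNo} -> ${d.deptName}"))
```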

    Join:

    The join operation joins two pair RDDs on their keys. The default join is an inner join, so only keys present in both RDDs appear in the result. You can refer to the screenshot below.
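    A minimal sketch of an inner join keyed on DeptNo, assuming a local SparkContext; the case classes and sample records are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record types for the two datasets.
case class Dept(deptNo: String, deptName: String)
case class Emp(empNo: String, firstName: String, deptNo: String)

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("join-sketch").setMaster("local[*]"))

// Key both RDDs by the common column, DeptNo.
val deptByNo = sc.parallelize(Seq(Dept("d001", "Marketing"), Dept("d002", "Finance")))
  .map(d => (d.deptNo, d))
val empByDept = sc.parallelize(Seq(Emp("10001", "Georgi", "d001"), Emp("10002", "Bezalel", "d003")))
  .map(e => (e.deptNo, e))

// Inner join: the result type is RDD[(String, (Dept, Emp))].
val joined = deptByNo.join(empByDept)
joined.collect().foreach(println)
```

    Here "d002" has no employee and "d003" has no department, so neither key appears in the output.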

    Here, we have defined two case classes for the two datasets and created two pair RDDs from them, using the common column (DeptNo) as the key and the rest of the record as the value. We have then performed the join operation on the RDDs, and the result is displayed on the screen.

    RightOuterJoin:

    The rightOuterJoin operation returns the joined elements of both RDDs and keeps every key that is present in the second RDD (the one passed as the argument). Where the first RDD has no matching key, its side of the result is None. You can refer to the screenshot below.
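    A minimal sketch of rightOuterJoin, assuming a local SparkContext. Plain strings stand in for the case class values to keep the example short, and all sample data is made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("right-outer-join-sketch").setMaster("local[*]"))

// Hypothetical (DeptNo, DeptName) and (DeptNo, FName) pairs.
val deptByNo = sc.parallelize(Seq(("d001", "Marketing"), ("d002", "Finance")))
val empByDept = sc.parallelize(Seq(("d001", "Georgi"), ("d003", "Bezalel")))

// All keys of empByDept are kept; the dept side becomes an Option.
val rightJoined = deptByNo.rightOuterJoin(empByDept)
rightJoined.collect().foreach(println)
```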

    Here, we have defined two case classes for the two datasets and created two pair RDDs from them, using the common column (DeptNo) as the key and the rest of the record as the value. We have then performed the rightOuterJoin operation on the RDDs, and the result is displayed on the screen.

    LeftOuterJoin:

    The leftOuterJoin operation returns the joined elements of both RDDs and keeps every key that is present in the first RDD (the one the method is called on). Where the second RDD has no matching key, its side of the result is None. You can refer to the screenshot below.
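    A minimal sketch of leftOuterJoin, assuming a local SparkContext. Plain strings stand in for the case class values to keep the example short, and all sample data is made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("left-outer-join-sketch").setMaster("local[*]"))

// Hypothetical (DeptNo, DeptName) and (DeptNo, FName) pairs.
val deptByNo = sc.parallelize(Seq(("d001", "Marketing"), ("d002", "Finance")))
val empByDept = sc.parallelize(Seq(("d001", "Georgi"), ("d003", "Bezalel")))

// All keys of deptByNo are kept; the emp side becomes an Option.
val leftJoined = deptByNo.leftOuterJoin(empByDept)
leftJoined.collect().foreach(println)
```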

    Here, we have defined two case classes for the two datasets and created two pair RDDs from them, using the common column (DeptNo) as the key and the rest of the record as the value. We have then performed the leftOuterJoin operation on the RDDs, and the result is displayed on the screen.

    CountByKey:

    The countByKey operation returns the number of elements present for each key, as a map on the driver. You can refer to the screenshot below.
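    A minimal sketch of countByKey, assuming a local SparkContext. The pairs are made up, and one DeptNo is repeated on purpose so that the counts differ:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("countbykey-sketch").setMaster("local[*]"))

// Hypothetical (DeptNo, name) pairs with a repeated key.
val pairs = sc.parallelize(Seq(
  ("d001", "Marketing"), ("d001", "Marketing EU"), ("d002", "Finance")))

// countByKey is an action: it returns a Map[String, Long] to the driver.
val counts = pairs.countByKey()
counts.foreach { case (deptNo, n) => println(s"$deptNo: $n") }
```

    Because the whole map is returned to the driver, this is best suited to datasets with a modest number of distinct keys.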

    Here, we have loaded the dataset, split the records using tab as the delimiter, and created pairs of DeptNo and DeptName. Then, we have performed the countByKey operation, and the result is displayed.

    SaveAsTextFile:

    The saveAsTextFile operation writes the elements of the RDD as text under the given output path. Spark creates that path as a directory of part files, so it should not already exist. You can refer to the screenshot below.
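    A minimal sketch of saveAsTextFile, assuming a local SparkContext and a local filesystem. The temporary directory and the "dept_output" name are illustrative choices, not paths from the original post:

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup; reuses the running context if one already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("save-sketch").setMaster("local[*]"))

// Hypothetical tab-delimited dept records.
val dept = sc.parallelize(Seq("d001\tMarketing", "d002\tFinance"))

// Pick an output path that does not exist yet (inside a fresh temp dir).
val outputPath = Files.createTempDirectory("rdd-sketch").resolve("dept_output").toString

// Writes one part file per partition under outputPath.
dept.saveAsTextFile(outputPath)

// Reading the directory back returns the saved lines.
val reloaded = sc.textFile(outputPath)
reloaded.collect().foreach(println)
```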

    We hope this post has been helpful in understanding the advanced RDD operations in Scala. In case of any queries, feel free to drop us a comment below or email us at support@acadgild.com.

    Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
