Introduction to Spark SQL 1.3.0


1. Introduce to Spark SQL 1.3.0

2. Optimization

3. (image slide, no transcript text)

4. Scala API docs:
   https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.package

5. spark-shell creates sc and a SQL context automatically:

   Spark context available as sc.
   15/03/22 02:09:11 INFO SparkILoop: Created sql context (with Hive support)..
   SQL context available as sqlContext.

6. In a standalone JAR, create the SQLContext yourself:

   val sc: SparkContext // An existing SparkContext.
   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
   // this is used to implicitly convert an RDD to a DataFrame.
   import sqlContext.implicits._

7. DF from RDD

   Load the RDD:

   scala> val data = sc.textFile("hdfs://localhost:54310/user/hadoop/ml-100k/u.data")

   Define a case class:

   case class Rattings(userId: Int, itemID: Int, rating: Int, timestmap: String)

   Convert to a DataFrame (u.data is tab-separated):

   scala> val ratting = data.map(_.split("\t")).map(p => Rattings(p(0).trim.toInt, p(1).trim.toInt, p(2).trim.toInt, p(3))).toDF()
   ratting: org.apache.spark.sql.DataFrame = [userId: int, itemID: int, rating: int, timestmap: string]

8. DF from JSON

   {"movieID":242,"name":"test1"}
   {"movieID":307,"name":"test2"}

   scala> val movie = sqlContext.jsonFile("hdfs://localhost:54310/user/hadoop/ml-100k/movies.json")

9. DataFrame Operations

   show()

   userId itemID rating timestmap
   196    242    3      881250949
   186    302    3      891717742
   22     377    1      878887116
   244    51     2      880606923
   253    465    5      891628467

   head(5)

   res11: Array[org.apache.spark.sql.Row] = Array([196,242,3,881250949], [186,302,3,891717742], [22,377,1,878887116], [244,51,2,880606923], [166,346,1,886397596])

10. printSchema()

   scala> ratting.printSchema()
   root
    |-- userId: integer (nullable = false)
    |-- itemID: integer (nullable = false)
    |-- rating: integer (nullable = false)
    |-- timestmap: string (nullable = true)

11. Select

   Select a column:

   scala> ratting.select("userId").show()

   Select with a condition (yields a boolean column):

   scala> ratting.select(ratting("itemID") > 100).show()

   (itemID > 100)
   true
   true
   true

12. filter

   scala> ratting.filter(ratting("rating") > 3).show()

   userId itemID rating timestmap
   298    474    4      884182806
   253    465    5      891628467
   286    1014   5      879781125
   200    222    5      876042340
   122    387    5      879270459
   291    1042   4      874834944
   119    392    4

13. Symbol syntax and chaining

   scala> ratting.filter('rating > 3).show()

   scala> ratting.filter(ratting("rating") > 3).select("userId", "itemID").show()

   userId itemID
   298    474
   286    1014

   Aggregate functions come from org.apache.spark.sql.functions._:

   scala> import org.apache.spark.sql.functions._
   scala> ratting.filter('userId > 500).select(avg('rating), max('rating), sum('rating)).show()

14. GROUP BY

   count():

   scala> ratting.groupBy("userId").count().show()

   userId count
   831    73
   631    20

   agg():

   scala> ratting.groupBy("userId").agg("rating" -> "avg", "userId" -> "count").show()

   scala> ratting.groupBy("userId").count().sort("count", "userId").show()

15. GROUP BY aggregate functions

   o avg
   o max
   o min
   o mean
   o sum

   Full function list:
   https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

16. Several aggregates at once:

   scala> ratting.groupBy('userId).agg(('userId), avg('rating), max('rating), sum('rating), count('rating)).show()

   userId AVG('rating)       MAX('rating) SUM('rating) COUNT('rating)
   831    3.5205479452054793 5            257          73
   631    3.1                4            62           20
   31     3.9166666666666665 5            141          36
   431    3.380952380952381  5            71           21
   231    3.6666666666666665 5            77           21
   832    2.96               5            74           25
   632    3.6610169491525424 5            432          118

17. UnionAll

   scala> val ratting1_3 = ratting.filter(ratting("rating") <= 3)
   scala> ratting1_3.count()                        // res79: Long = 44625
   scala> val ratting4_5 = ratting.filter(ratting("rating") > 3)
   scala> ratting4_5.count()                        // res80: Long = 55375
   scala> ratting1_3.unionAll(ratting4_5).count()   // res81: Long = 100000

   unionAll requires both sides to have the same schema; unioning with a DataFrame whose schema differs fails (a sketch of this failure mode follows below):

   scala> ratting1_3.unionAll(test).count()
   java.lang.AssertionError: assertion failed
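The assertion failure on slide 17 is unionAll insisting that both sides share one schema. A minimal sketch of the failure mode, runnable in the same 1.3.0 shell; the deck never shows its test DataFrame, so badSchema below is a stand-in with a deliberately different column list:

   // Same 4-column schema on both sides: unionAll succeeds.
   val low  = ratting.filter(ratting("rating") <= 3)
   val high = ratting.filter(ratting("rating") > 3)
   low.unionAll(high).count()        // 100000

   // Different schema: badSchema has only two columns, so the
   // planner's schema assertion fails before anything runs.
   val badSchema = ratting.select("userId", "itemID")
   low.unionAll(badSchema).count()   // java.lang.AssertionError: assertion failed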
18. JOIN

   scala> ratting.join(movie, $"itemID" === $"movieID", "inner").show()

   userId itemID rating timestmap movieID name
   196    242    3      881250949 242     test1
   63     242    3      875747190 242     test1

   Supported join types: inner, outer, left_outer, right_outer, semijoin.

19. TABLE

   Register a DataFrame as a temporary table, then query it with SQL:

   scala> ratting.registerTempTable("ratting_table")
   scala> sqlContext.sql("SELECT userId FROM ratting_table").show()

20. DF to RDD

   map (result below is the DataFrame returned by a query like the one on the previous slide):

   scala> result.map(t => "user:" + t(0)).collect().foreach(println)

   Row fields come back as Any:

   scala> ratting.map(t => t(2)).take(5)

   Convert string to int explicitly:

   scala> ratting.map(t => Array(t(0), t(2).toString.toInt * 10)).take(5)
   res130: Array[Array[Any]] = Array(Array(196, 30), Array(186, 30), Array(22, 10), Array(244, 20), Array(166, 10))

21. SAVE DATA

   save():

   ratting.select("itemID").save("hdfs://localhost:54310/test2.json", "json")

   saveAsParquetFile
   saveAsTable (Hive table)

22. DataType

   o Numeric types
   o String type
   o Binary type
   o Boolean type
   o Datetime type
     o TimestampType: Represents values comprising values of fields year, month, day, hour, minute, and second.
     o DateType: Represents values comprising values of fields year, month, day.
   o Complex types

   (A schema sketch using these types follows the reference list.)

23. (image slide, no transcript text)

24. Reference

   1. https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
   2. https://www.youtube.com/watch?v=vxeLcoELaP4
   3. http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science
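Appendix: to make slide 22's type list concrete, here is a minimal sketch of building the ratings DataFrame with an explicit schema instead of a case class. It is not from the deck; it assumes the Spark 1.3 createDataFrame(RDD[Row], schema) API and reuses the data RDD from slide 7:

   import org.apache.spark.sql.Row
   import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

   // Explicit schema built from the numeric and string types listed on slide 22.
   val schema = StructType(Seq(
     StructField("userId",    IntegerType, nullable = false),
     StructField("itemID",    IntegerType, nullable = false),
     StructField("rating",    IntegerType, nullable = false),
     StructField("timestmap", StringType,  nullable = true)))

   // Shape each tab-separated line into a Row matching the schema.
   val rowRDD = data.map(_.split("\t")).map(p =>
     Row(p(0).trim.toInt, p(1).trim.toInt, p(2).trim.toInt, p(3)))

   // Pair the RDD[Row] with the schema to get a DataFrame.
   val rattingWithSchema = sqlContext.createDataFrame(rowRDD, schema)
   rattingWithSchema.printSchema()   // same root/|-- layout as slide 10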