Spark SQL and DataFrame

2015. 8.

Namgee Lee (이남기)

Soongsil University

2 / 30

Programming Interface

3 / 30

DataFrame

DataFrame = RDD + Schema

Introduced in Spark 1.3
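The "RDD + Schema" idea can be seen directly on a DataFrame object. The following is a minimal sketch, assuming Spark 1.4 or later is on the classpath and a local master is acceptable; the object name RddPlusSchemaSketch and the Person case class are hypothetical names used only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object RddPlusSchemaSketch {
  // Hypothetical record type used only for this illustration.
  case class Person(name: String, age: Int)

  // Returns the column names taken from the schema and the row count
  // taken from the underlying RDD.
  def demo(): (List[String], Long) = {
    // Assumption: a local master is good enough for a demonstration.
    val conf = new SparkConf().setAppName("rdd-plus-schema").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      val df = sqlContext.createDataFrame(Seq(Person("Andy", 30), Person("Justin", 19)))

      df.printSchema()                          // the schema half: column names and types
      val columns = df.schema.fieldNames.toList
      val rows = df.rdd.count()                 // the RDD half: an RDD[Row] holding the data
      (columns, rows)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

df.schema exposes the column metadata, while df.rdd exposes the same data as a plain RDD of Row objects.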

4 / 30

DataFrame

A distributed collection of rows organized into named columns

An abstraction for selecting, filtering, aggregating and plotting structured data

5 / 30

DataFrame

Write Less Code: Input & Output

Input (JSON) → DataFrame → Output (Parquet)

6 / 30

DataFrame

Spark SQL's Data Source API can read and write DataFrames in a variety of formats.
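As an illustration of the JSON-in, Parquet-out case, here is a minimal sketch. It assumes Spark 1.4 or later (where sqlContext.read and df.write are available) and a local master; the object name JsonToParquetSketch and the temporary file locations are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonToParquetSketch {
  // Writes a small JSON file, converts it to Parquet through a DataFrame,
  // reads the Parquet data back, and returns the number of rows read.
  def demo(): Long = {
    val conf = new SparkConf().setAppName("json-to-parquet").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)

      // Hypothetical input: the same three records as people.json, one object per line.
      val dir = Files.createTempDirectory("df-sketch")
      val jsonPath = dir.resolve("people.json")
      val records = Seq(
        """{"name":"Michael"}""",
        """{"name":"Andy", "age":30}""",
        """{"name":"Justin", "age":19}""")
      Files.write(jsonPath, records.mkString("\n").getBytes("UTF-8"))

      // Input: JSON. The schema is inferred from the records.
      val df = sqlContext.read.json(jsonPath.toString)

      // Output: Parquet. The target directory must not exist yet.
      val parquetPath = dir.resolve("people.parquet").toString
      df.write.parquet(parquetPath)

      // Read the Parquet output back to confirm the round trip.
      sqlContext.read.parquet(parquetPath).count()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

The reading and writing code stays the same shape for other formats; only the format-specific call changes.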

7 / 30

DataFrame

Write Less Code

Likely

8 / 30

DataFrame

Write Less Code: Powerful Operations

Common operations can be expressed concisely as calls to the DataFrame API:
• Selecting required columns
• Joining different data sources
• Aggregation (count, sum, average, etc.)
• Filtering
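The four operations listed above can be chained in a single expression. This is a minimal sketch, assuming Spark 1.4 or later and a local master; the Person and Dept case classes, their fields, and the sample values are hypothetical and exist only for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

object DataFrameOpsSketch {
  // Hypothetical record types used only for this illustration.
  case class Person(name: String, age: Int, deptId: Int)
  case class Dept(id: Int, deptName: String)

  // Returns the average age per department for people older than 18.
  def demo(): Map[String, Double] = {
    val conf = new SparkConf().setAppName("df-ops").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      val people = sqlContext.createDataFrame(Seq(
        Person("Michael", 29, 1), Person("Andy", 30, 1), Person("Justin", 19, 2)))
      val depts = sqlContext.createDataFrame(Seq(
        Dept(1, "Engineering"), Dept(2, "Sales")))

      val result = people
        .filter(people("age") > 18)                      // filtering
        .join(depts, people("deptId") === depts("id"))   // joining two data sources
        .groupBy("deptName")                             // aggregation ...
        .agg(avg("age").as("avgAge"))                    // ... average age per group
        .select("deptName", "avgAge")                    // selecting required columns

      result.collect().map(r => r.getString(0) -> r.getDouble(1)).toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

Each call returns a new object describing the query, and nothing is computed until collect() asks for the result.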

9 / 30

Creating DataFrames

With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// age  name
// null Michael
// 30   Andy
// 19   Justin

Contents of people.json:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

10 / 30

Creating DataFrames

[Diagram: SQLContext.read returns a DataFrameReader; DataFrameReader.json takes an input String path and returns a DataFrame.]

12 / 30

DataFrame Operations

// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

// Select people older than 21
df.filter(df("age") > 21).show()
// age name
// 30  Andy

// Count people by age
df.groupBy("age").count().show()
// age  count
// null 1
// 19   1
// 30   1

[Diagram: df("column name") returns a Column object, which DataFrame operations such as select, filter, groupBy and join use to produce their output.]

13 / 30

DataFrame Operations

val sqlContext = ... // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")

The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.

[Diagram: SQLContext.sql takes an input String query and returns a DataFrame.]
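A query string needs a table name to refer to, so a DataFrame is usually registered as a temporary table first. The following is a minimal sketch, assuming Spark 1.4 or later and a local master; the Person case class and the sample rows are hypothetical. registerTempTable is the call used elsewhere in these slides; Spark 2.0 and later offer createOrReplaceTempView for the same purpose.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlQuerySketch {
  // Hypothetical record type used only for this illustration.
  case class Person(name: String, age: Int)

  // Returns the names of the people aged 13 to 19.
  def demo(): List[String] = {
    val conf = new SparkConf().setAppName("sql-query").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      val people = sqlContext.createDataFrame(Seq(
        Person("Michael", 29), Person("Andy", 30), Person("Justin", 19)))

      // Give the DataFrame a name that SQL text can refer to.
      people.registerTempTable("people")

      // Input: a String query. Output: a DataFrame.
      val teenagers = sqlContext.sql(
        "SELECT name FROM people WHERE age >= 13 AND age <= 19")

      teenagers.collect().map(_.getString(0)).toList
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(demo())
}
```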

14 / 30

DataFrame

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

(Continued)

15 / 30

case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

// or by field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)

DataFrame

Inferring the Schema Using Reflection

16 / 30

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

DataFrame

Contents of people.txt:

Michael, 29
Andy, 30
Justin, 19

17 / 30

DataFrame

[Diagram: how the people DataFrame is built from the text file]

Text → RDD, via .map(_.split(",")):

Array(Michael, 29)
Array(Andy, 30)
Array(Justin, 19)

RDD → Person class → DataFrame, via .map(p => Person(p(0), p(1).trim.toInt)).toDF():

Person:name  Person:age
Michael      29
Andy         30
Justin       19

18 / 30

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

[Diagram: the query takes the people DataFrame (columns Name and Age; rows Michael 29, Andy 30, Justin 19) as input and returns its result as a new DataFrame.]

19 / 30

DataFrame

Programmatically Specifying the Schema

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Import Row.
import org.apache.spark.sql.Row

// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Generate the schema based on the string of schema
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

20 / 30

DataFrame

[Diagram: schemaString.split(" ") splits "name age" into (name, age).]


22 / 30

DataFrame

[Diagram: .map(fieldName => StructField(fieldName, StringType, true)) turns (name, age) into Seq(StructField("name"), StructField("age")), which is then passed to StructType(…).]

23 / 30

DataFrame

Programmatically Specifying the Schema

// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrame as a table.
peopleDataFrame.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index or by field name.
results.map(t => "Name: " + t(0)).collect().foreach(println)
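The schema above declares every field as StringType, so age is stored as text. A variation with a typed age column is sketched below so that numeric comparisons work in SQL. This is a minimal sketch, assuming Spark 1.4 or later and a local master; the object name TypedSchemaSketch and the in-memory sample lines (standing in for people.txt) are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object TypedSchemaSketch {
  // Returns, in alphabetical order, the names of people older than 20.
  def demo(): List[String] = {
    val conf = new SparkConf().setAppName("typed-schema").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)

      // Hypothetical stand-in for the lines of people.txt.
      val lines = sc.parallelize(Seq("Michael, 29", "Andy, 30", "Justin, 19"))

      // One StructField per column; age is declared as an integer this time.
      val schema = StructType(Seq(
        StructField("name", StringType, nullable = true),
        StructField("age", IntegerType, nullable = true)))

      // Each Row must match the schema: a String followed by an Int.
      val rowRDD = lines.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))

      val people = sqlContext.createDataFrame(rowRDD, schema)
      people.registerTempTable("people")

      sqlContext.sql("SELECT name FROM people WHERE age > 20")
        .collect().map(_.getString(0)).toList.sorted
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

The values placed in each Row have to agree with the declared types; a mismatch typically surfaces only when the data is actually read.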

24 / 30

DataFrame

[Diagram: people RDD → row RDD]

Row(Michael, 29)
Row(Andy, 30)
Row(Justin, 19)

25 / 30

DataFrame

[Diagram: the row RDD and the schema are combined into a DataFrame]

row RDD:

Row(Michael, 29)
Row(Andy, 30)
Row(Justin, 19)

schema: Seq(StructField("name"), StructField("age"))

DataFrame:

name     age
Michael  29
Andy     30
Justin   19

28 / 30

DataFrame

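results DataFrame

name
Michael
Andy
Justin

For each row t of results, t(0) is the name column. Because results is built from SELECT name FROM people, its rows carry only that column; t(1) would be the age column only if the query also selected age.

collect() returns the rows to the driver:

results.collect() => Array[Row](Row("Michael"), Row("Andy"), Row("Justin"))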
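Row access after collect() can be sketched as follows. This is a minimal sketch, assuming Spark 1.4 or later (for getAs by field name) and a local master; the Person case class and sample rows are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}

object RowAccessSketch {
  // Hypothetical record type used only for this illustration.
  case class Person(name: String, age: Int)

  // Collects the rows to the driver and formats each one as "<name> is <age>".
  def demo(): List[String] = {
    val conf = new SparkConf().setAppName("row-access").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      val df = sqlContext.createDataFrame(Seq(
        Person("Michael", 29), Person("Andy", 30), Person("Justin", 19)))

      // collect() brings every row back to the driver as a local array.
      val rows: Array[Row] = df.collect()

      // Column 0 is read by position, the age column by field name.
      rows.map(r => r.getString(0) + " is " + r.getAs[Int]("age")).toList
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = demo().foreach(println)
}
```

collect() is only suitable for results small enough to fit in the driver's memory.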

29 / 30

Plan Optimization & Execution

