Getting started with SparkSQL - Desert Code Camp 2016


Page 1:

Getting started with SparkSQL

By: Avinash Ramineni

Page 2:

Agenda

• About Clairvoyant
• Spark Concepts
• Spark SQL
  • Project Tungsten
  • Catalyst Optimizer
• Spark SQL as Distributed Query Engine
• Demo
• Questions

Page 3:

Clairvoyant

Page 4:

Clairvoyant Services

Page 5:

Avinash Ramineni

• Principal @ Clairvoyant
• Email: [email protected]
• LinkedIn: https://www.linkedin.com/in/avinashramineni

Page 6:

Apache Spark

• Fast and general-purpose cluster computing system
• Born at UC Berkeley around 2009
• Open sourced in 2010
• Interoperable with Hadoop and included in all the major distributions
• Provides high-level APIs in Scala, Java, and Python

Page 7:

Spark Architecture

Apache Spark, Cluster Mode Overview: http://spark.apache.org/docs/latest/img/cluster-overview.png

Page 8:

Spark UI (Resource Manager): http://{HOST}:8088/cluster

Page 9:

Spark SQL

• Spark module for structured data processing
• The most popular Spark module in the ecosystem
• Use SQLContext to perform operations (see the sketch below)
  • SQL queries
  • DataFrame API
  • Dataset API
• White paper: http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
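As a concrete illustration of those entry points, here is a minimal sketch in the Spark 1.6 style this deck uses. The people.json file and its name/age fields are assumptions for illustration; `sc` is the SparkContext that spark-shell provides.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)            // sc: the SparkContext spark-shell provides
val df = sqlContext.read.json("people.json")   // hypothetical input file

// 1. SQL queries against a registered temporary table
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()

// 2. DataFrame API: the same query expressed as method calls
df.filter(df("age") > 21).select("name").show()

// 3. Dataset API: shown on a later slide
```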

Page 10:

SQLContext

• Used to create DataFrames
• Implementations
  • SQLContext
  • HiveContext
    • An instance of the Spark SQL execution engine that integrates with data stored in Hive
• Spark Shell automatically creates a SQLContext as the sqlContext variable
• Documentation: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
• As of Spark 2.0, use SparkSession (see the sketch below)
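A minimal sketch of constructing each of these entry points, again assuming an existing SparkContext `sc` as in spark-shell; the app name is a hypothetical placeholder:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val sqlContext  = new SQLContext(sc)    // basic implementation
val hiveContext = new HiveContext(sc)   // adds HiveQL, Hive UDFs, and metastore access

// As of Spark 2.0, SparkSession subsumes both contexts:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLIntro")             // hypothetical app name
  .enableHiveSupport()                  // optional; replaces HiveContext
  .getOrCreate()
```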

Page 11:

DataFrame API

• One use of Spark SQL is to execute SQL queries, written in either basic SQL syntax or HiveQL; when run from within a program, the results come back as DataFrames (as sketched below)
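A short illustration, reusing the hypothetical `people` temp table registered earlier: the result of a SQL query is an ordinary DataFrame, so SQL and the DataFrame API compose freely.

```scala
// SQL in, DataFrame out
val teens = sqlContext.sql(
  "SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// Chain DataFrame operations directly on the SQL result
teens.groupBy("age").count().orderBy("age").show()
teens.printSchema()
```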

Page 12:

Dataset API

• Dataset is a new interface added in Spark 1.6 that combines the benefits of RDDs with the benefits of Spark SQL’s optimized execution engine (sketched below)
• Use the SQLContext
  • sqlContext.read.{OPERATION}
• DataFrame is simply a type alias of Dataset[Row]
• The unified Dataset API can be used in both Scala and Java
• Python does not yet have support for the Dataset API
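A minimal Dataset sketch in Spark 1.6 syntax; the Person case class and the people.json input are illustrative assumptions:

```scala
import sqlContext.implicits._           // brings in encoders for case classes

case class Person(name: String, age: Long)

// Strongly typed: each element is a Person, not an untyped Row
val ds = sqlContext.read.json("people.json").as[Person]

// Lambdas are type-checked at compile time, yet execution still runs
// through Spark SQL's optimized engine
ds.filter(_.age > 21).map(_.name).show()
```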

Page 13:

Catalyst Optimizer

• Applied to Spark SQL and the DataFrame API
• Extensible optimizer
• Automatically finds the most efficient plan to execute the data operations specified in the user’s program (see the sketch below)

Databricks, Catalyst Optimizer Workflow: https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
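One easy way to watch Catalyst at work is to ask a DataFrame for its query plans; this reuses the hypothetical `df` from the earlier sketches.

```scala
val query = df.filter(df("age") > 21).select("name")

// explain(true) prints the parsed, analyzed, and optimized logical plans
// plus the physical plan that Catalyst selected
query.explain(true)

// The optimized plan typically shows column pruning and predicate
// pushdown applied automatically, with no change to user code.
```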

Page 14:

Project Tungsten

• Applied to Spark SQL, DataFrame API, Dataset API, and ML
• Effort to improve Spark’s CPU and memory usage through three techniques:
  • Memory management and binary processing: leveraging application semantics to manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection
  • Cache-aware computation: algorithms and data structures that exploit the memory hierarchy
  • Code generation: exploiting modern compilers and CPUs by generating code at runtime (see the sketch below)
• In short: bringing Spark closer to bare metal
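Tungsten works under the hood, but its code-generation technique is visible in query plans. In Spark 2.0+, operators fused by whole-stage code generation are marked with an asterisk in explain() output; a small sketch using the `spark` session from the earlier SparkSession example:

```scala
val agg = spark.range(1000L).selectExpr("sum(id)")

// The physical plan prints operators such as *HashAggregate(...);
// the leading `*` marks stages executed via generated code rather
// than interpreted, row-at-a-time operators.
agg.explain()
```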

Page 15:

Demo
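The live demo itself is not captured in this transcript; below is a minimal sketch of the kind of flow it covered: load a file, register it as a table, and query it. The events.json file and its eventType column are assumptions for illustration.

```scala
val events = sqlContext.read.json("events.json")   // hypothetical input
events.registerTempTable("events")

sqlContext.sql("""
  SELECT eventType, COUNT(*) AS cnt
  FROM events
  GROUP BY eventType
  ORDER BY cnt DESC
""").show()
```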

Page 16:

Learn More (Courses and Videos)

• MapR Academy
  • http://learn.mapr.com/dev-360-apache-spark-essentials
• edX
  • https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x
  • https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x
  • https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x
  • https://www.edx.org/xseries/data-science-engineering-apache-spark
• Coursera
  • https://www.coursera.org/learn/big-data-analysys
• Apache Spark YouTube channel
  • https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
• Spark Summit
  • https://spark-summit.org/2016/schedule/

Page 17:

Certification

• Certification organizations
  • O’Reilly and Databricks
    • http://www.oreilly.com/data/sparkcert.html
• Additional steps to prepare
  • Work with Apache Spark
  • Research the Apache Spark modules
    • Spark SQL, Spark Streaming, MLlib, GraphX
  • Read the RDD and other white papers
  • Read O’Reilly’s “Learning Spark” book

Page 18:

References

• https://en.wikipedia.org/wiki/Apache_Spark
• http://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
• https://www.datanami.com/2016/06/08/apache-spark-adoption-numbers/
• http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf
• http://training.databricks.com/workshop/itas_workshop.pdf
• https://spark.apache.org/docs/latest/api/scala/index.html
• https://spark.apache.org/docs/latest/programming-guide.html
• https://github.com/databricks/learning-spark

Page 19:

Q&A