Getting started with SparkSQL - Desert Code Camp 2016
Page 2: Agenda
• About Clairvoyant
• Spark Concepts
• Spark SQL
• Project Tungsten
• Catalyst Optimizer
• Spark SQL as a Distributed Query Engine
• Demo
• Questions
Page 5: Avinash Ramineni
• Principal @ Clairvoyant
• Email: [email protected]
• LinkedIn: https://www.linkedin.com/in/avinashramineni
Page 6: Apache Spark
• Fast and general-purpose cluster computing system
• Born at UC Berkeley around 2009
• Open sourced in 2010
• Interoperable with Hadoop and included in all the major distributions
• Provides high-level APIs in Scala, Java, and Python
Page 7: Spark Architecture
Apache Spark, Cluster Mode Overview
http://spark.apache.org/docs/latest/img/cluster-overview.png
Page 9: Spark SQL
• Spark module for structured data processing
• The most popular Spark module in the ecosystem
• Use SQLContext to perform operations
  • SQL queries
  • DataFrame API
  • Dataset API
• White paper
  • http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.
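The three ways of operating on structured data listed above can be sketched in a spark-shell session. This sketch assumes Spark 1.x (so `sqlContext` already exists in the shell) and a hypothetical `people.json` file with `name` and `age` fields:

```scala
// Load structured data into a DataFrame; the schema is inferred from the JSON.
val df = sqlContext.read.json("people.json")

// 1. SQL query: register the DataFrame as a temporary table first (Spark 1.x API).
df.registerTempTable("people")
val adultsSql = sqlContext.sql("SELECT name FROM people WHERE age >= 18")

// 2. DataFrame API: the same query expressed as method calls.
val adultsDf = df.filter(df("age") >= 18).select("name")
```

Both forms go through the same optimizer, so they produce the same execution plan.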
Page 10: SQLContext
• Used to create DataFrames
• Implementations
  • SQLContext
  • HiveContext
    • An instance of the Spark SQL execution engine that integrates with data stored in Hive
• Spark Shell automatically creates a SQLContext as the sqlContext variable
• Documentation
  • https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
• As of Spark 2.0, use SparkSession
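A minimal sketch of constructing each implementation in Spark 1.x (the app name and local master are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("sql-demo").setMaster("local[*]"))

// Plain SQLContext -- what spark-shell exposes as the `sqlContext` variable.
val sqlContext = new SQLContext(sc)

// HiveContext extends SQLContext with HiveQL support and Hive metastore access.
val hiveContext = new HiveContext(sc)

// Spark 2.0+: SparkSession subsumes both.
// val spark = org.apache.spark.sql.SparkSession.builder()
//   .appName("sql-demo")
//   .enableHiveSupport()
//   .getOrCreate()
```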
Page 11: DataFrame API
• One use of Spark SQL is to execute SQL queries, written using either basic SQL syntax or HiveQL
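A hedged sketch of the same aggregation expressed both as a SQL string and through the DataFrame API, against a hypothetical registered table `employees` with `dept` and `salary` columns (HiveQL-only constructs would additionally require a HiveContext):

```scala
// Basic SQL syntax via SQLContext:
val byDept = sqlContext.sql(
  "SELECT dept, avg(salary) AS avg_salary FROM employees GROUP BY dept")

// The equivalent expressed with DataFrame method calls instead of a SQL string:
val employees = sqlContext.table("employees")
val byDeptDf = employees.groupBy("dept").avg("salary")

byDept.show()  // prints the result as a small ASCII table
```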
Page 12: Dataset API
• Dataset is a new interface added in Spark 1.6 that combines the benefits of RDDs with the benefits of Spark SQL's optimized execution engine
• Use the SQLContext
  • sqlContext.read.{OPERATION}
• DataFrame is simply a type alias of Dataset[Row]
• The unified Dataset API can be used in both Scala and Java
• Python does not yet have support for the Dataset API
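A sketch of the typed API in Spark 1.6 Scala syntax; the `Person` case class and sample rows are hypothetical:

```scala
import sqlContext.implicits._

// The case class supplies the Dataset's static type and its encoder.
case class Person(name: String, age: Long)

// Build a typed Dataset from a local collection.
val people = Seq(Person("Ann", 34), Person("Bob", 17)).toDS()

// Transformations are checked at compile time: `people` is a Dataset[Person],
// so `_.age` is an ordinary Long field, not a string column name.
val adults = people.filter(_.age >= 18)

// Dropping back to the untyped world -- a DataFrame is just Dataset[Row].
val rows = people.toDF()
```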
Page 13: Catalyst Optimizer
• Applied to Spark SQL and the DataFrame API
• Extensible optimizer
• Automatically finds the most efficient plan to execute the data operations specified in the user's program
Databricks, Catalyst Optimizer Workflow
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
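The plans Catalyst produces can be inspected directly; a sketch assuming `df` is any DataFrame with `name` and `age` columns:

```scala
val query = df.filter(df("age") >= 18).select("name")

// explain(true) prints the parsed, analyzed, and optimized logical plans,
// plus the physical plan Catalyst selected. In the optimized plan, note how
// the filter and the column pruning are pushed as close to the data source
// as possible.
query.explain(true)
```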
Page 14: Project Tungsten
• Applied to Spark SQL, the DataFrame API, the Dataset API, and ML
• Effort to improve Spark's CPU and memory usage with three techniques
  • Memory management and binary processing: leverage application semantics to manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection
  • Cache-aware computation: algorithms and data structures that exploit the memory hierarchy
  • Code generation: use code generation to exploit modern compilers and CPUs
• Closer to bare metal
Page 16: Learn More (Courses and Videos)
• MapR Academy
  • http://learn.mapr.com/dev-360-apache-spark-essentials
• edX
  • https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x
  • https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x
  • https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x
  • https://www.edx.org/xseries/data-science-engineering-apache-spark
• Coursera
  • https://www.coursera.org/learn/big-data-analysys
• Apache Spark YouTube
  • https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
• Spark Summit
  • https://spark-summit.org/2016/schedule/
Page 17: Certification
• Certification organizations
  • O'Reilly and Databricks
    • http://www.oreilly.com/data/sparkcert.html
• Additional steps to prepare
  • Work with Apache Spark
  • Research Apache Spark modules
    • Spark SQL, Spark Streaming, MLlib, GraphX
  • Read the RDD and other white papers
  • Read O'Reilly's "Learning Spark" book
Page 18: References
• https://en.wikipedia.org/wiki/Apache_Spark
• http://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
• https://www.datanami.com/2016/06/08/apache-spark-adoption-numbers/
• http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf
• http://training.databricks.com/workshop/itas_workshop.pdf
• https://spark.apache.org/docs/latest/api/scala/index.html
• https://spark.apache.org/docs/latest/programming-guide.html
• https://github.com/databricks/learning-spark