Spark SQL | Apache Spark

View Apache Spark and Scala course details at www.edureka.co/apache-spark-scala-training Apache Spark | Spark SQL

Objectives

This module covers:

Introduction to Spark

Spark Architecture

What an RDD is

Demo on creating an RDD and running a sample example

Spark SQL

What is Spark?

Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics.

Developed at UC Berkeley

Written in Scala, a functional programming language that runs on the JVM

It generalizes the MapReduce framework

Why Spark?

Speed

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Ease of Use

Write applications in several languages, including Scala, Java, and Python

Generality

Combine SQL, streaming, and complex analytics into one platform

Runs Everywhere

Spark runs on Hadoop, Mesos, standalone, or in the cloud.

MapReduce Limitations

MapReduce is a great solution for one-pass computations, but it is not very efficient for use cases that require multi-pass computations and algorithms (machine learning, etc.)

To run complicated jobs, you have to string together a series of MapReduce jobs and execute them in sequence

Each of those jobs is high-latency, and none can start until the previous job has finished completely

The job output data between each step has to be stored in the distributed file system before the next step can begin

Hadoop requires the integration of several tools for different big data use cases (like Mahout for machine learning and Storm for streaming data processing)

Spark Features

Spark takes MapReduce to the next level with less expensive shuffles in data processing and with capabilities like in-memory data storage

Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing

It’s designed to be an execution engine that works both in-memory and on-disk

Lazy evaluation of big data queries, which helps optimize the overall data processing workflow

Provides concise and consistent APIs in Scala, Java, and Python

Offers an interactive shell for Scala and Python; one is not yet available for Java

Spark supports high-level APIs for developing applications (Scala, Java, Python, Clojure, R)
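
The lazy evaluation listed among the features can be illustrated without Spark. The following is a minimal sketch in plain Python, not Spark code: generator expressions stand in for transformations and `sum` stands in for an action. The names `log`, `traced_square`, `squares`, and `evens` are hypothetical and chosen only for this illustration.

```python
# Plain-Python sketch of lazy evaluation (not the Spark API).
log = []  # records each element that is actually processed


def traced_square(x):
    log.append(x)
    return x * x


numbers = range(1, 6)

# "Transformations": building the pipeline does no work yet.
squares = (traced_square(n) for n in numbers)
evens = (s for s in squares if s % 2 == 0)
work_before_action = len(log)  # nothing has been computed so far

# "Action": consuming the pipeline forces the computation.
total = sum(evens)
work_after_action = len(log)   # every input element has now been squared once
```

Because no work happens until the final step, an engine that sees the whole pipeline first has room to plan it before running anything, which is the optimization benefit the slide refers to.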

Spark Architecture

Spark Core, with Spark Streaming, Spark SQL, BlinkDB, MLlib, GraphX, and SparkR built on top

Cluster management (native Spark cluster, YARN, Mesos)

Distributed storage (HDFS, Cassandra, S3, HBase)

Spark Advantages

EASE OF DEVELOPMENT: Easier APIs; Python, Scala, Java

COMBINE WORKFLOWS: Shark, ML, Streaming, GraphX

IN-MEMORY PERFORMANCE: RDDs; DAGs unify processing

Hadoop Advantages

UNLIMITED SCALE: Multiple data sources, multiple applications, multiple users

WIDE RANGE OF APPLICATIONS: Files, databases, semi-structured data

ENTERPRISE PLATFORM: Reliability, multi-tenancy, security

Spark + Hadoop

UNLIMITED SCALE

WIDE RANGE OF APPLICATIONS

ENTERPRISE PLATFORM

EASE OF DEVELOPMENT

COMBINE WORKFLOWS

IN-MEMORY PERFORMANCE

Operational Applications Augmented by In-Memory Performance

Resilient Distributed Datasets

RDD (Resilient Distributed Dataset)

Resilient – if data in memory is lost, it can be recreated

Distributed – stored in memory across the cluster

Dataset – initial data can come from a file or be created programmatically

RDDs are the fundamental unit of data in Spark
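
The "resilient" property can be sketched as recomputation from a recipe rather than restoration from a backup. This is a plain-Python illustration, not Spark code; `source`, `compute_partition`, `cache`, and `get_partition` are hypothetical names, and the two-way split is an arbitrary assumption.

```python
# Plain-Python sketch of recreating lost in-memory data (not the Spark API).
source = ["a", "bb", "ccc", "dddd"]              # stand-in for data read from a file
partitions_of_source = [source[:2], source[2:]]  # assumed split into two partitions


def compute_partition(index):
    # The recipe that derives a cached partition from the source data.
    return [len(word) for word in partitions_of_source[index]]


cache = {i: compute_partition(i) for i in range(2)}  # "in-memory" copies

del cache[1]  # simulate losing one partition from memory


def get_partition(index):
    # Recompute from the recipe when the in-memory copy is missing.
    if index not in cache:
        cache[index] = compute_partition(index)
    return cache[index]


recovered = get_partition(1)
```

Only the lost partition is recomputed; the surviving one is served from memory unchanged.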

Resilient Distributed Datasets

Core concept of the Spark framework.

RDDs can store any type of data.

Primitive types: Integer, Character, Boolean, etc.

Files: text files, SequenceFiles, etc.

RDDs are fault tolerant.

RDDs are immutable.

Resilient Distributed Datasets

RDDs support two types of operations:

Transformation: a transformation doesn't return a single value; it returns a new RDD.

Some of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.

Action: an action evaluates the RDD and returns a value.

Some of the action operations are reduce, collect, count, first, take, countByKey, and foreach.
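
The split between transformations and actions can be sketched with a toy class. `MiniRDD` is a hypothetical stand-in written for this illustration and is not the Spark API; its method names follow Python conventions (`flat_map` rather than `flatMap`), and it runs on a single machine.

```python
from functools import reduce as fold


class MiniRDD:
    """Hypothetical single-machine stand-in for an RDD; not the Spark API."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning an iterator

    @staticmethod
    def from_list(items):
        return MiniRDD(lambda: iter(items))

    # Transformations: each returns a new MiniRDD and computes nothing.
    def map(self, f):
        return MiniRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._compute() if pred(x)))

    def flat_map(self, f):
        return MiniRDD(lambda: (y for x in self._compute() for y in f(x)))

    # Actions: each runs the pipeline and returns a plain value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, f):
        return fold(f, self._compute())


lines = MiniRDD.from_list(["spark sql", "spark core"])
words = lines.flat_map(str.split)                   # transformation
long_words = words.filter(lambda w: len(w) > 3)     # transformation
lengths = long_words.map(len)                       # transformation

collected = long_words.collect()                    # action
n_words = words.count()                             # action
total_length = lengths.reduce(lambda a, b: a + b)   # action
```

Each transformation hands back another `MiniRDD`, so calls can be chained freely; only the three actions at the end produce ordinary Python values.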

Spark SQL

Spark SQL is a component that runs on top of Spark Core

Spark SQL allows relational queries through Spark

The backbone for all these operations is SchemaRDD

SchemaRDDs are made of row objects along with their metadata information

SchemaRDDs are equivalent to RDBMS tables

They can be constructed from existing RDDs, JSON data sets, Parquet files, or HiveQL queries against the data stored in Apache Hive
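
The idea of row objects plus schema metadata behaving like an RDBMS table can be sketched with Python's built-in `sqlite3` module. This uses SQLite as a stand-in and is not Spark SQL; the table name `people` and the column names `id`, `name`, and `score` are assumptions made for the example.

```python
import sqlite3

# Row objects plus schema metadata, loosely mirroring the SchemaRDD idea.
schema = [("id", "INTEGER"), ("name", "TEXT"), ("score", "REAL")]
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Register the rows as a table in an in-memory database.
conn = sqlite3.connect(":memory:")
columns = ", ".join(f"{name} {sql_type}" for name, sql_type in schema)
conn.execute(f"CREATE TABLE people ({columns})")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)

# A relational query over the registered rows.
result = conn.execute(
    "SELECT name FROM people WHERE score > 4.0 ORDER BY id"
).fetchall()
conn.close()
```

The schema is what lets the engine interpret `score > 4.0`; without it, the rows would be opaque tuples.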

Spark SQL

Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Scala and Java

The Shark project has been discontinued; Spark SQL is used in its place

Shark: development ending; transitioning to Spark SQL

Spark SQL: a new SQL engine designed from the ground up for Spark

Hive on Spark: helps existing Hive users migrate to Spark

Efficient In-Memory Storage

Simply caching Hive records as Java objects is inefficient due to high per-object overhead

Instead, Spark SQL employs column-oriented storage using arrays of primitive types

Row Storage

1  john   4.1
2  mike   3.5
3  sally  6.4

Column Storage

1     2     3
john  mike  sally
4.1   3.5   6.4
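
The two layouts can be compared in a few lines of plain Python. This is only a sketch of the layout idea, not Spark SQL's actual in-memory format; the standard-library `array` type stands in for "arrays of primitive types", and the variable names are hypothetical.

```python
from array import array

# Row-oriented layout: one tuple (one object) per record.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column-oriented layout: one sequence per field.
ids, names, scores = zip(*rows)
id_column = array("i", ids)         # packed C ints
score_column = array("d", scores)   # packed C doubles
name_column = list(names)           # strings remain Python objects in this sketch

# A single-column aggregate only touches that column's array.
mean_score = sum(score_column) / len(score_column)
```

The numeric columns are stored as contiguous primitive values rather than as one boxed object per field, which is the per-object overhead the slide describes avoiding.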

Demo On Spark RDDs

Course Features

LIVE Online Class

Class Recording in LMS

24/7 Post Class Support

Module Wise Quiz

Project Work

Verifiable Certificate

Questions

Course Topics

Module 1 » Introduction to Scala

Module 2 » Scala Essentials

Module 3 » Traits and OOPs in Scala

Module 4 » Functional Programming in Scala

Module 5 » Introduction to Big Data and Spark

Module 6 » Spark Baby Steps

Module 7 » Playing with RDDs

Module 8 » Spark with SQL: When Spark Meets Hive
