Apache Spark II (SparkSQL)


Transcript of Apache Spark II (SparkSQL)

Page 1: Apache Spark II (SparkSQL)


Page 2: Apache Spark II (SparkSQL)

Contents

1. Introduction to Spark
2. Spark modules
3. SparkSQL
4. Workshop

Page 3: Apache Spark II (SparkSQL)

1. Introduction

Page 4: Apache Spark II (SparkSQL)

What is Apache Spark?

● Cluster computing platform
● Extends MapReduce
● Runs in memory

Page 5: Apache Spark II (SparkSQL)

Why Spark

❏ Fast: 10x faster on disk, 100x in memory
❏ Ease of development: easy code, interactive shell
❏ Unified stack: batch and streaming
❏ Multi-language support: Scala, Python, Java, R
❏ Deployment flexibility:
  ❏ Deployment: Mesos, YARN, standalone, local
  ❏ Storage: HDFS, S3, local FS

Page 6: Apache Spark II (SparkSQL)

Rise of the data center: huge amounts of data spread out across many commodity servers.

MapReduce: lots of data → scale out

Data Processing Requirements:
Network bottleneck → Distributed Computing
Hardware failure → Fault Tolerance

MapReduce is an abstraction to organize parallelizable tasks.

Page 7: Apache Spark II (SparkSQL)

MapReduce word count example (Input → Split → Map → [Combine] → Shuffle & Sort → Reduce → Output):

Input: "AA BB AA / AA CC DD / AA EE DD / BB FF AA"

Split: four splits — "AA BB AA", "AA CC DD", "AA EE DD", "BB FF AA"

Map: each split emits (word, 1) pairs, e.g. "AA BB AA" → (AA, 1)(BB, 1)(AA, 1)

Combine (optional, per split): pre-aggregates locally, e.g. (AA, 1)(BB, 1)(AA, 1) → (AA, 2)(BB, 1)

Shuffle & Sort: groups pairs by key across splits — AA: (2)(1)(1)(1); BB: (1)(1); CC: (1); DD: (1)(1); EE: (1); FF: (1)

Reduce: sums each key — (AA, 5)(BB, 2)(CC, 1)(DD, 2)(EE, 1)(FF, 1)

Output: AA 5, BB 2, CC 1, DD 2, EE 1, FF 1
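The word count pipeline above can be sketched in plain Python. This is a single-process simulation of the map, combine, shuffle, and reduce steps, not a distributed implementation:

```python
from collections import defaultdict

splits = ["AA BB AA", "AA CC DD", "AA EE DD", "BB FF AA"]

# Map: each split emits (word, 1) pairs
mapped = [[(w, 1) for w in s.split()] for s in splits]

# Combine (optional): pre-aggregate within each split
def combine(pairs):
    counts = defaultdict(int)
    for w, n in pairs:
        counts[w] += n
    return list(counts.items())

combined = [combine(p) for p in mapped]

# Shuffle & Sort: group values by key across all splits
shuffled = defaultdict(list)
for pairs in combined:
    for w, n in pairs:
        shuffled[w].append(n)

# Reduce: sum the values for each key
result = {w: sum(ns) for w, ns in sorted(shuffled.items())}
print(result)  # {'AA': 5, 'BB': 2, 'CC': 1, 'DD': 2, 'EE': 1, 'FF': 1}
```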

Page 8: Apache Spark II (SparkSQL)

Spark Components

[Diagram: the Driver Program, containing the SparkContext, talks to a Cluster Manager, which assigns work to Worker Nodes; each Worker Node runs an Executor that executes Tasks.]

Page 9: Apache Spark II (SparkSQL)

Spark Components: SparkContext

● Main entry point for Spark functionality
● Represents the connection to a Spark cluster
● Tells Spark how & where to access a cluster
● Can be used to create RDDs, accumulators and broadcast variables on that cluster

Driver program
● "Main" process coordinated by the SparkContext object
● Allows configuring any Spark process with specific parameters
● Spark actions are executed in the Driver
● Spark-shell
● Application → driver program + executors

Page 10: Apache Spark II (SparkSQL)

Spark Components: Cluster Manager

● External service for acquiring resources on the cluster
● Variety of cluster managers:
  ○ Local
  ○ Standalone
  ○ YARN
  ○ Mesos
● Deploy mode:
  ○ Cluster → framework launches the driver inside of the cluster
  ○ Client → submitter launches the driver outside of the cluster

Page 11: Apache Spark II (SparkSQL)

Spark Components: Worker Node

● Any node that can run application code in the cluster
● Key Terms:
  ○ Executor: a process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
  ○ Task: unit of work that will be sent to one executor
  ○ Job: a parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect)
  ○ Stage: a smaller set of tasks inside a job

Page 12: Apache Spark II (SparkSQL)

RDD: Resilient Distributed Datasets

● Collection of objects that is distributed across nodes in a cluster
● Data operations are performed on RDDs
● Once created, RDDs are immutable
● RDDs can be persisted in memory or on disk
● Fault tolerant

Example: numbers = RDD[1,2,3,4,5,6,7,8,9,10], partitioned across three worker nodes — one executor holds [1,5,6,9], another [2,7,8], another [3,4,10].
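The partitioning idea can be sketched in plain Python. This mimics the concept of an RDD split into partitions with per-partition transformations, not the actual Spark API:

```python
# Conceptual sketch: an RDD's data lives in partitions spread across
# executors; transformations run per partition, actions collect results.
numbers = list(range(1, 11))

# The slide's example placement across three executors
partitions = [[1, 5, 6, 9], [2, 7, 8], [3, 4, 10]]
assert sorted(x for p in partitions for x in p) == numbers

# A transformation (like map) runs independently on each partition...
squared_partitions = [[x * x for x in part] for part in partitions]

# ...and an action (like collect) gathers results back to the driver.
collected = [x for part in squared_partitions for x in part]
print(sorted(collected))  # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```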

Page 13: Apache Spark II (SparkSQL)

2. Spark modules

Page 14: Apache Spark II (SparkSQL)

Spark modules

Page 15: Apache Spark II (SparkSQL)

Spark streaming

Page 16: Apache Spark II (SparkSQL)

MLlib

ML algorithms include:
● Classification: logistic regression, naive Bayes, ...
● Regression: generalized linear regression, survival regression, ...
● Decision trees, random forests, and gradient-boosted trees
● Recommendation: alternating least squares (ALS)
● Clustering: K-means, Gaussian mixtures (GMMs), ...
● Topic modeling: latent Dirichlet allocation (LDA)
● Frequent itemsets, association rules, and sequential pattern mining
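To make one of these concrete, here is a minimal single-machine K-means sketch in plain Python (1-D points, hypothetical data). MLlib's KMeans implements this kind of assign-then-update loop distributed over the cluster; this is only the core idea:

```python
# Minimal 1-D k-means: alternate between assigning points to their
# nearest center and moving each center to the mean of its cluster.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to the nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        # Update step: each center becomes the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centers = kmeans_1d(points, [0.0, 5.0])
print([round(c, 6) for c in centers])  # [1.0, 9.5]
```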

Page 17: Apache Spark II (SparkSQL)

GraphX

Page 18: Apache Spark II (SparkSQL)

3. SparkSQL

Page 19: Apache Spark II (SparkSQL)

Spark SQL

Page 20: Apache Spark II (SparkSQL)

Spark SQL

Features:
● Integrated: query data stored in RDDs. Languages: Python, Scala, Java, R.
● Unified data access: Parquet, JSON, CSV, Hive tables.
● Apache Hive compatibility.
● Standard connectivity: JDBC, ODBC.
● Scalability.

Page 21: Apache Spark II (SparkSQL)

DataFrame

[Diagram: a DataFrame is data organized into named columns (Column 1, Column 2, Column 3, ..., Column N), like a table.]

Page 22: Apache Spark II (SparkSQL)

DataFrame

Features:
● Can process data ranging in size from kilobytes to petabytes, on anything from a single-node cluster to a large cluster.
● Supports different data formats (JSON, CSV, Elasticsearch, ...) and storage systems (HDFS, Hive tables, Oracle, ...).
● Easily integrated with other Big Data tools (Spark Core).
● API for Python, Java, Scala, and R.
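To make the DataFrame idea concrete, here is a plain-Python sketch of the kind of filter/group-by query Spark SQL runs over named columns (hypothetical rows; this mimics the concept, not the PySpark API):

```python
from collections import defaultdict

# Hypothetical rows with named columns, as a DataFrame would hold them
rows = [
    {"name": "ana",  "dept": "sales", "salary": 30},
    {"name": "ben",  "dept": "eng",   "salary": 50},
    {"name": "carl", "dept": "sales", "salary": 40},
    {"name": "dana", "dept": "eng",   "salary": 60},
]

# Equivalent of:
#   SELECT dept, AVG(salary) FROM rows WHERE salary > 30 GROUP BY dept
groups = defaultdict(list)
for r in rows:
    if r["salary"] > 30:
        groups[r["dept"]].append(r["salary"])
result = {dept: sum(s) / len(s) for dept, s in groups.items()}
print(result)  # {'eng': 55.0, 'sales': 40.0}
```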

Page 23: Apache Spark II (SparkSQL)

Spark Architecture

Page 24: Apache Spark II (SparkSQL)

4. Workshop

Page 25: Apache Spark II (SparkSQL)

WORKSHOP

In order to practice the main concepts, please complete the exercises proposed at our GitHub repository by clicking the following link:

○ Homework

Page 26: Apache Spark II (SparkSQL)

THANKS!
Any questions?

@datiobd

datio-big-data

Special thanks to Stratio for its theoretical contribution

[email protected]