Transcript of "Unifying Big Data Workloads in Apache Spark" by Hossein Falaki (@mhfalaki)

Page 1

Unifying Big Data Workloads in Apache Spark

Hossein Falaki (@mhfalaki)

Page 2

Outline

• What's Apache Spark
• Why Unification
• Evolution of Unification
• Apache Spark + Databricks
• Q & A

Page 3

What’s Apache Spark

Page 4

What is Apache Spark?

General cluster computing engine that extends MapReduce

Rich set of APIs and libraries

Large community: 1000+ orgs, clusters up to 8000 nodes

Apache Spark, Spark and Apache are trademarks of the Apache Software Foundation

[Diagram: SQL, Streaming, ML, Graph, and DL libraries on top of the Spark engine]

Page 5

Why Unification?

Page 6

Unique Thing about Spark

Unification: same engine and same API for diverse use cases
• Streaming, batch, or interactive
• ETL, SQL, machine learning, or graph
• Scala, Python, R, and SQL
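As a rough sketch of what "same engine, same API" means in practice, the following SparkR snippet mixes SQL, DataFrame operations, and MLlib in one session; the people table and its age, income, and label columns are hypothetical:

> library(SparkR)
> sparkR.session()
> people <- sql("SELECT age, income, label FROM people")    # SQL query (hypothetical table)
> adults <- filter(people, people$age >= 18)                # DataFrame transformation
> model  <- spark.glm(adults, label ~ age + income, family = "binomial")   # MLlib on the same data
> summary(model)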

This talk: Why unification? And how has it evolved in Spark?

Page 7

Why Unification?

Page 8

Why Unification?

MapReduce: a general engine for batch processing

Page 9

Big Data Systems Today

[Diagram: MapReduce provides general batch processing; specialized systems emerged for new workloads: Pregel, Giraph, Dremel, Drill, Impala, Millwheel, Storm, S4, ...]

Hard to manage, tune, deploy; hard to combine in pipelines

Page 10

Big Data Systems Today

[Diagram: the same landscape, MapReduce for general batch processing plus specialized systems for new workloads (Pregel, Giraph, Dremel, Drill, Impala, Millwheel, Storm, S4, ...), contrasted with a single unified engine]

Page 11

Key Idea

MapReduce + data sharing (RDDs) captures most distributed apps, including streaming and iterative workloads
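SparkR does not expose RDDs directly, but caching a DataFrame relies on the same in-memory data sharing. A minimal sketch, assuming an illustrative Parquet path and userid, hour, and latency columns, where several passes reuse the cached data instead of re-reading it:

> events <- read.df("s3://logs/events", source = "parquet")   # illustrative path
> cache(events)                                               # keep the data in memory after the first scan
> nrow(events)                                                # first action materializes the cache
> head(summarize(groupBy(events, "userid"), n = count(events$userid)))           # reuses cached data
> head(summarize(groupBy(events, "hour"), avg_latency = avg(events$latency)))    # reuses cached data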

Page 12

Benefits of Unification

1. Simpler to use and operate

2. Code reuse: e.g. only write monitoring, fault tolerance (FT), etc. once

3. New apps that span processing types: e.g. interactive queries on a stream, online machine learning

Page 13

An Analogy

[Diagram: specialized devices → a unified device → new applications]

Page 14

How Well Does It Work?

• 75% of users use multiple Spark components

• Competitive performance on diverse workloads

[Charts: SQL response time in seconds for Hive, Impala, and Spark; ML response time in minutes for Mahout, GraphLab, and Spark; streaming throughput in MB/s/node for Storm and Spark]

Page 15

Evolution of Unification

Page 16

Challenges

Spark’s original “unifying abstraction” was RDDs: DAG of map/reduce steps given as Java functions on Java objects

Two challenges:
1. Richness of optimizations: hard to understand the computation
2. Changing hardware limits: I/O-bound → compute-bound

Page 17

Solution: Structured APIs

New APIs that act on structured data (known schema) and make semantics of computation more visible (relational ops)
• Spark SQL engine + DataFrames + Datasets

Enable rich optimizations similar to databases and compilers

Page 18

Execution Steps

[Diagram: a SQL AST, DataFrame, or Dataset is turned into a Logical Plan; the Optimizer, consulting the Catalog, produces a Physical Plan; the Code Generator and Data Source API then execute it as RDDs]
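One way to see these steps from SparkR is explain() with extended = TRUE, which prints the parsed, analyzed, and optimized logical plans followed by the physical plan; the query below reuses the users/state example from the DataFrame API slide:

> users <- sql("select * from users")
> ca.users <- users[users$state == "CA"]
> explain(ca.users, extended = TRUE)   # parsed, analyzed, optimized logical plans + physical plan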

Page 19

Example: DataFrame API

DataFrames hold rows with a known schema and offer relational operations on them through a DSL

> sparkR.session()
> users <- sql("select * from users")
> ca.users <- users[users$state == "CA"]
> nrow(ca.users)
> ca.users %>% groupBy("name") %>% avg("age")
> dapply(ca.users, function(df) { df$age + rnorm(nrow(df)) }, ...)

[Diagram note: the filter expression users$state == "CA" is captured as an expression AST rather than an opaque R function]
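The dapply call above elides its schema argument. As an illustrative (not verbatim) completion, dapply needs a structType describing the data.frame returned for each partition; the column types here are assumptions:

> df2 <- select(ca.users, "name", "age")
> out.schema <- structType(structField("name", "string"),
                           structField("age", "double"),
                           structField("noisy_age", "double"))
> noisy <- dapply(df2,
                  function(p) {                      # p is an ordinary R data.frame for one partition
                    p$noisy_age <- p$age + rnorm(nrow(p))
                    p
                  },
                  out.schema)
> head(noisy)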

Page 20

What Structured APIs Enable

1. Compact binary representation
2. Optimization across operators (join order, pushdown, etc.), sketched below
3. Runtime code generation
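A small sketch of item 2: when a query reads a columnar source and filters it, the optimizer can prune columns and push the predicate down into the scan, which explain() makes visible in the physical plan (the path and columns are illustrative):

> events <- read.df("s3://logs/events", source = "parquet")        # illustrative path
> small  <- select(filter(events, events$state == "CA"), "userid", "latency")
> explain(small)   # the physical plan shows column pruning and the pushed-down filter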

[Chart: runtime in seconds of the same computation using RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame R]

Page 21

New in 2.2 & 2.3: Structured Streaming

High-level streaming API built over DataFrames
• Event time, windowing, sessions, transactional sinks
• Arbitrary or customized stateful aggregation
• Stream-to-stream joins

Supports interactive & batch queries
• Ad-hoc queries via JDBC, joins with static data

Works by just incrementalizing static DataFrame queries!
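For example, an event-time windowed aggregation in SparkR reads like a static aggregation plus window(); the JSON schema and column names below are assumptions, and the console sink is just for inspection:

> events <- read.stream("json", path = "s3://logs",
                        schema = structType(structField("userid", "string"),
                                            structField("eventTime", "timestamp"),
                                            structField("latency", "double")))
> byWindow <- count(groupBy(events,
                            window(events$eventTime, "10 minutes"),   # event-time window
                            events$userid))
> q <- write.stream(byWindow, "console", outputMode = "complete")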

Page 22

Example: Batch Aggregation

> logs <- read.df(source = "json", path = "s3://logs")
> logs %>%
    groupBy("userid", "hour") %>%
    avg("latency") %>%
    write.df(source = "jdbc", path = "jdbc:mysql://...")

Page 23

Example: Continuous Aggregation

> logs <- read.stream(source = "json", path = "s3://logs")
> logs %>%
    groupBy("userid", "hour") %>%
    avg("latency") %>%
    write.stream(source = "jdbc", path = "jdbc:mysql://...")
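write.stream() returns a streaming query handle. In practice you would also pass an output mode and a checkpoint location, and use the handle to monitor or stop the query; a hedged sketch with an illustrative console sink and checkpoint path (pipes via magrittr, as in the slides):

> q <- logs %>%
    groupBy("userid", "hour") %>%
    avg("latency") %>%
    write.stream("console",
                 outputMode = "complete",
                 checkpointLocation = "s3://logs/checkpoints")   # illustrative path
> status(q)        # is the query active?
> lastProgress(q)  # metrics for the most recent micro-batch
> stopQuery(q)     # stop the continuous query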

Page 24

Incremental Execution

[Diagram: the batch plan (Scan Files → Aggregate → Write to MySQL) is automatically transformed into the continuous plan (Scan New Files → Stateful Aggregate → Update MySQL)]

Page 25

Benefits of Structured Streaming

1. Results always consistent with a batch job on a prefix of data

2. Same code and libraries run in both modes

3. Ad-hoc queries on consistent snapshots of state
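Point 3 can be tried with the in-memory sink: register the streaming aggregation under a query name and issue ordinary SQL against its latest consistent result. A sketch assuming the streaming logs DataFrame from the earlier example, with an illustrative query name:

> hourly <- avg(groupBy(logs, "userid", "hour"), "latency")
> q <- write.stream(hourly, "memory",
                    queryName = "latency_by_hour",   # illustrative name for the in-memory table
                    outputMode = "complete")
> head(sql("SELECT * FROM latency_by_hour"))          # ad-hoc SQL over the latest consistent result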

Page 26

Apache Spark: The First Unified Analytics Engine

[Diagram: Spark Core Engine, Runtime, and Delta, supporting Big Data Processing (ETL + SQL + Streaming) and Machine Learning (MLlib + SparkR)]

Uniquely combines Data & AI technologies

Page 27

Apache Spark + Databricks

Page 28

Databricks’ Unified Analytics Platform

[Diagram: a Collaborative Workspace for Data Engineers and Data Scientists, on top of the Databricks Runtime (Delta, SQL, Streaming, powered by Apache Spark), delivered as a Cloud Native Service]

Unifies Data Engineers and Data Scientists

Unifies Data and AI Technologies

Eliminates infrastructure complexity

Page 29

Conclusion

Unified computing models are essential for usability & performance
• Same results across batch, interactive & streaming
• Powerful cross-library optimizations
• Same functionality across programming languages

In most areas of computing, we have converged on unified systems
• UNIX OS, general-purpose languages, SQL, etc.
• Why not in cluster computing?
• And that's the Unified Analytics Platform

– With Apache Spark at the core of Databricks Runtime


Page 30

Thank You