
Unifying Big Data Workloads in Apache Spark

Hossein Falaki (@mhfalaki)

Outline

• What’s Apache Spark
• Why Unification
• Evolution of Unification
• Apache Spark + Databricks
• Q & A

What’s Apache Spark

What is Apache Spark?

General cluster computing engine that extends MapReduce

Rich set of APIs and libraries

Large community: 1000+ orgs, clusters up to 8000 nodes

Apache Spark, Spark and Apache are trademarks of the Apache Software Foundation

SQL, Streaming, ML, Graph, DL

Why Unification?

Unique Thing about Spark

Unification: same engine and same API for diverse use cases
• Streaming, batch, or interactive
• ETL, SQL, machine learning, or graph
• Scala, Python, R, and SQL

This talk: Why unification? And how has it evolved in Spark?

Why Unification?


MapReduce: a general engine for batch processing

[Diagram: MapReduce as a general batch processing engine, alongside specialized systems for new workloads: Pregel, Giraph, Dremel, Drill, Impala, Millwheel, Storm, S4, ...]

Big Data Systems Today

Hard to manage, tune, and deploy; hard to combine in pipelines

[Diagram: instead of MapReduce for general batch processing plus specialized systems for new workloads (Pregel, Giraph, Dremel, Drill, Impala, Millwheel, Storm, S4, ...), a single unified engine]

Key Idea

MapReduce + data sharing (RDDs) captures most distributed apps, including streaming and iterative workloads
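The slides keep this at the conceptual level. As a rough illustration of data sharing from SparkR (which exposes DataFrames rather than raw RDDs), a cached dataset can be reused by several downstream computations without re-reading the input; the "events" table and its columns below are hypothetical.

library(SparkR)
sparkR.session()

# Hypothetical input: an "events" table with columns userid, type, and value.
events <- sql("SELECT * FROM events")

# Keep the dataset in memory so the computations below share it
# instead of each one re-scanning the source.
cache(events)

# Reuse 1: a relational aggregation over the cached data.
byType <- count(groupBy(events, "type"))
head(byType)

# Reuse 2: an iterative-style computation over the same cached data.
threshold <- 0
for (i in 1:3) {
  above <- nrow(filter(events, events$value > threshold))
  print(above)
  threshold <- threshold + 10   # refine the threshold and query again
}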

Benefits of Unification

1. Simpler to use and operate

2. Code reuse: e.g., write monitoring, fault tolerance, etc. only once

3. New apps that span processing types: e.g. interactive queries on a stream, online machine learning

An Analogy

[Illustration: specialized devices vs. a unified device, enabling new applications]

How Well Does It Work?

• 75% of users use multiple Spark components

• Competitive performance on diverse workloads

[Charts: SQL response time (sec) for Hive, Impala, and Spark; ML response time (min) for Mahout, GraphLab, and Spark; streaming throughput (MB/s/node) for Storm and Spark]

Evolution of Unification

Challenges

Spark’s original “unifying abstraction” was RDDs: DAG of map/reduce steps given as Java functions on Java objects

Two challenges:
1. Richness of optimizations: hard to understand the computation
2. Changing hardware limits: I/O-bound → compute-bound

Solution: Structured APIs

New APIs that act on structured data (known schema) and make the semantics of the computation more visible (relational operations)
• Spark SQL engine + DataFrames + Datasets

Enable rich optimizations similar to databases and compilers

Execution Steps

[Diagram: SQL ASTs, DataFrames, and Datasets all compile to a logical plan; the optimizer, using the catalog and the Data Source API, produces a physical plan; the code generator then emits code that runs over RDDs]
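As a rough way to see these steps from SparkR (not shown in the slides), explain() prints the logical plans along with the physical plan the engine chose; the "users" table below is hypothetical.

library(SparkR)
sparkR.session()

users <- sql("SELECT * FROM users")            # hypothetical table
ca <- filter(users, users$state == "CA")

# extended = TRUE prints the parsed, analyzed, and optimized logical plans
# in addition to the generated physical plan.
explain(ca, extended = TRUE)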

Example: DataFrame API

DataFrames hold rows with a known schema and offer relational operations on them through a DSL

> sparkR.session()
> users <- sql("select * from users")
> ca.users <- users[users$state == "CA"]
> nrow(ca.users)
> ca.users %>% groupBy("name") %>% avg("age")
> dapply(ca.users, function(df) { df$age + rnorm(nrow(df)) }, ...)

[Diagram: the filter condition is captured as an expression AST]
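The dapply call above elides its schema argument; a minimal self-contained sketch (not from the slides) with an explicit output schema might look like the following. The users table and its age column are assumptions carried over from the example.

library(SparkR)
sparkR.session()

users <- sql("SELECT * FROM users")
ca.users <- users[users$state == "CA", ]

# dapply runs an R function on each partition as a local data.frame;
# the output schema must be declared up front.
noisy <- dapply(
  ca.users,
  function(df) {
    df$age <- df$age + rnorm(nrow(df))   # perturb ages with Gaussian noise
    df
  },
  schema(ca.users)                       # output has the same columns as the input
)
head(noisy)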

What Structured APIs Enable

1. Compact binary representation
2. Optimization across operators (join order, pushdown, etc.)
3. Runtime code generation

[Chart: runtime in seconds for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame R]

New in 2.2 & 2.3: Structured Streaming

High-level streaming API built over DataFrames
• Event time, windowing, sessions, transactional sinks
• Arbitrary or customized stateful aggregation
• Stream-to-stream joins

Supports interactive & batch queries
• Ad-hoc queries via JDBC, joins with static data

Works by just incrementalizing static DataFrame queries!

Example: Batch Aggregation

> logs <- read.df(source = "json", path = "s3://logs")
> logs %>%
    groupBy("userid", "hour") %>%
    avg("latency") %>%
    write.df(source = "jdbc", path = "jdbc:mysql://...")

Example: Continuous Aggregation

> logs <- read.stream(source = "json", path = "s3://logs")
> logs %>%
    groupBy("userid", "hour") %>%
    avg("latency") %>%
    write.stream(source = "jdbc", path = "jdbc:mysql://...")
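The event-time windowing mentioned earlier can be expressed the same way. The following is a hedged sketch, not from the slides: it assumes the JSON records carry an eventTime timestamp and a latency column, and it writes to the console sink purely for illustration.

library(SparkR)
library(magrittr)   # for the %>% pipe used in the slides
sparkR.session()

# Streaming file sources need an explicit schema; this one is assumed.
logsSchema <- structType(
  structField("userid", "string"),
  structField("eventTime", "timestamp"),
  structField("latency", "double")
)

logs <- read.stream(source = "json", path = "s3://logs", schema = logsSchema)

# Average latency per user over 1-hour event-time windows, with a watermark
# to bound the state kept around for late data.
windowed <- logs %>%
  withWatermark("eventTime", "10 minutes") %>%
  groupBy(window(logs$eventTime, "1 hour"), logs$userid) %>%
  avg("latency")

query <- write.stream(windowed, source = "console", outputMode = "append")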

Incremental Execution

[Diagram: the batch plan (Scan Files → Aggregate → Write to MySQL) is automatically transformed into the continuous plan (Scan New Files → Stateful Aggregate → Update MySQL)]

Benefits of Structured Streaming

1. Results always consistent with a batch job on a prefix of data

2. Same code and libraries run in both modes

3. Ad-hoc queries on consistent snapshots of state
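As a rough illustration of point 3 from SparkR (not from the slides), a stream can be written to the in-memory sink and then queried ad hoc with SQL; the rate source and the query name below are used purely as a self-contained toy example.

library(SparkR)
sparkR.session()

# Toy streaming source: the built-in rate source emits timestamp/value rows.
stream <- read.stream(source = "rate")
counts <- count(groupBy(stream, "value"))

# The memory sink keeps the current result in a queryable temporary table.
query <- write.stream(counts,
                      source = "memory",
                      queryName = "value_counts",   # hypothetical table name
                      outputMode = "complete")

# Ad-hoc SQL query against a consistent snapshot of the streaming state.
head(sql("SELECT * FROM value_counts"))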

Apache Spark: The First Unified Analytics Engine

[Diagram: runtime with Delta and the Spark core engine]

Big Data Processing: ETL + SQL + Streaming

Machine Learning: MLlib + SparkR

Uniquely combines Data & AI technologies

Databricks’ Unified Analytics Platform

[Diagram: the Databricks Runtime (Delta, SQL, Streaming), powered by Apache Spark, and a collaborative workspace for data engineers and data scientists, delivered as a cloud-native service]

Unifies Data Engineers and Data Scientists

Unifies Data and AI Technologies

Eliminates infrastructure complexity

Conclusion

Unified computing models are essential for usability & performance
• Same results across batch, interactive & streaming
• Powerful cross-library optimizations
• Same functionality across programming languages

In most areas of computing, we have converged on unified systems
• UNIX OS, general-purpose languages, SQL, etc.
• Why not in cluster computing?
• And that is the Unified Analytics Platform, with Apache Spark at the core of the Databricks Runtime


Thank You