Transcript of "Conquering Big Data with Apache Spark" - College of ... (cci.drexel.edu/bigdata/bigdata2015/IEEEKeynote...)

Page 1:

Conquering Big Data with Apache Spark

Ion Stoica, November 1st, 2015

UC BERKELEY

Page 2:

The Berkeley AMPLab

January 2011 – 2017
•  8 faculty
•  > 50 students
•  3-engineer software team

Organized for collaboration: 3-day retreats (twice a year)

AMP: Algorithms, Machines, People

AMPCamp (since 2012): 400+ campers (100s of companies)

Page 3:

The Berkeley AMPLab

Governmental and industrial funding:

Goal: a next-generation open source data analytics stack for industry & academia:

Berkeley Data Analytics Stack (BDAS)

Page 4:

Generic Big Data Stack

Processing Layer

Resource Management Layer

Storage Layer

Page 5:

Hadoop Stack

Processing Layer: Hadoop MR, Hive, Pig, Impala, Giraph, ...

Resource Management Layer: YARN

Storage Layer: HDFS

Page 6:

BDAS Stack

Processing Layer: Spark Core, Spark Streaming, SparkSQL, GraphX, MLlib, MLBase, BlinkDB, SampleClean, SparkR, Velox

Resource Management Layer: Mesos (BDAS); Hadoop YARN (3rd party)

Storage Layer: Tachyon, Succinct (BDAS); HDFS, S3, Ceph, ... (3rd party)

Page 7:

Today’s Talk

[Same BDAS stack diagram as the previous slide, with the components covered in today's talk highlighted]

Page 8:

Overview

1. Introduction
2. RDDs
3. Generality of RDDs (e.g., streaming)
4. DataFrames
5. Project Tungsten


Page 10:

A Short History

Started at UC Berkeley in 2009
Open source: 2010
Apache project: 2013
Today: the most popular big data project

Page 11:

What Is Spark?

Parallel execution engine for big data processing

Easy to use: 2-5x less code than Hadoop MR
•  High-level APIs in Python, Java, and Scala

Fast: up to 100x faster than Hadoop MR
•  Can exploit in-memory data when available
•  Low-overhead scheduling, optimized engine

General: supports multiple computation models

Page 12:

Analogy

First cellular phones → specialized devices → unified device (smartphone)

Specialized devices beat the phone at individual tasks (better games, better GPS, better phone), but the smartphone unified them all.

Likewise: batch processing → specialized systems → unified system

Specialized systems brought real-time analytics and instant fraud detection; a unified system brings better apps.

Page 16:

General

Unifies batch, interactive, and streaming computation. Easy to build sophisticated applications:
•  Supports iterative, graph-parallel algorithms
•  Powerful APIs in Scala, Python, Java, R

Spark Core + Spark Streaming, SparkSQL, MLlib, GraphX, SparkR

Page 19:

Easy to Write Code

WordCount in 50+ lines of Java MR

WordCount in 3 lines of Spark
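The slide contrasts code volume rather than showing the code itself. As a hedged illustration, here is a plain-Python, single-machine sketch of what Spark's three-line word count does (the flatMap/map/reduceByKey structure is mirrored with ordinary Python; no Spark is required, and the input lines are made up):

```python
# Local sketch of Spark's word count: flatMap lines into words,
# map each word to (word, 1), reduceByKey by summing counts.
from collections import defaultdict

def word_count(lines):
    # flatMap: each line yields many words
    words = (w for line in lines for w in line.split())
    # map + reduceByKey: (word, 1) pairs summed per key, via a dict
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return dict(counts)

print(word_count(["to be or", "not to be"]))  # -> {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In actual Spark the same dataflow would be `sc.textFile(path).flatMap(...).map(...).reduceByKey(...)`, running across a cluster instead of one process.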

Page 20:

Fast: Time to sort 100TB

2013 record (Hadoop): 2100 machines, 72 minutes
2014 record (Spark): 207 machines, 23 minutes

Also sorted 1 PB in 4 hours

Source: Daytona GraySort benchmark, sortbenchmark.org

Page 21:

Community Growth

                     June 2014   June 2015
total contributors        255         730
contributors/month         75         135
lines of code         175,000     400,000

Page 22:

Meetup Groups: January 2015

source: meetup.com

Page 23:

Meetup Groups: October 2015

source: meetup.com

Page 24:

Community Growth

Summit attendees: 1100 (2014) → 3900 (2015)

Meetup members: 12K (2014) → 42K (2015)

Developers contributing: 350 (2014) → 600 (2015)

Page 25:

Large-Scale Usage

Largest cluster: 8000 nodes

Largest single job: 1 petabyte

Top streaming intake: 1 TB/hour

2014 on-disk sort record

Page 26:

Spark Ecosystem: distributions and applications

Page 27:

Overview

1. Introduction
2. RDDs
3. Generality of RDDs (e.g., streaming)
4. DataFrames
5. Project Tungsten

Page 28:

RDD: Resilient Distributed Datasets

Collections of objects distributed across a cluster
•  Stored in RAM or on disk
•  Automatically rebuilt on failure

Operations
•  Transformations
•  Actions

Execution model: similar to SIMD

Page 29:

Operations on RDDs

Transformations: f(RDD) => RDD
•  Lazy (not computed immediately)
•  E.g., map, filter, groupBy

Actions:
•  Trigger computation
•  E.g., count, collect, saveAsTextFile
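The lazy/eager split above can be sketched in a few lines of plain Python, with a generator standing in for an RDD's deferred execution (illustrative only; this is not Spark's implementation):

```python
# A "transformation" builds a recipe; an "action" forces the work.
log = []

def lazy_map(xs, f):
    for x in xs:                 # this body runs only when iterated
        log.append("computed")
        yield f(x)

doubled = lazy_map([1, 2, 3], lambda x: x * 2)  # transformation: no work yet
assert log == []                                # nothing computed so far
result = list(doubled)                          # action: forces computation
assert result == [2, 4, 6] and len(log) == 3
```

Because transformations are lazy, Spark can see the whole chain before running it and schedule the work efficiently; only actions like count or collect trigger execution.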

Page 30:

Example: Log Mining (pages 30–47 build this slide up step by step)

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                    # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()         # action
messages.filter(lambda s: "php" in s).count()

[diagram: the Driver sends tasks to three Workers; each Worker reads its HDFS block (Block 1–3), processes and caches its partition of messages (Cache 1–3), and returns results to the Driver. The second query is served from the caches, with no HDFS reads.]

Cache your data → faster results. Full-text search of Wikipedia:
•  60 GB on 20 EC2 machines
•  0.5 sec from memory vs. 20 sec on disk
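For intuition, the same pipeline can be run on a single machine with plain Python standing in for the RDD operations (the log lines below are made up for illustration; no Spark required):

```python
# Single-machine sketch of the log-mining example: list comprehensions
# stand in for filter/map, and a plain list stands in for cache().
log_lines = [
    "INFO\t12:00\tstartup complete",
    "ERROR\t12:01\tmysql connection refused",
    "ERROR\t12:02\tphp fatal error",
    "ERROR\t12:03\tmysql timeout",
]
errors = [s for s in log_lines if s.startswith("ERROR")]  # filter
messages = [s.split("\t")[2] for s in errors]             # map
cached = list(messages)                                   # stands in for cache()

mysql_count = sum(1 for s in cached if "mysql" in s)      # "action": count
php_count = sum(1 for s in cached if "php" in s)
print(mysql_count, php_count)  # -> 2 1
```

In Spark the point is that `cached` lives in cluster memory, so every query after the first skips the expensive HDFS read.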

Page 48:

Language Support

Standalone programs: Python, Scala, & Java
Interactive shells: Python & Scala
Performance: Java & Scala are faster due to static typing, but Python is often fine

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();

Page 49:

Expressive API

MapReduce offers just two operations: map, reduce

Spark's API is far richer: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...

Page 51:

Fault Recovery: Design Alternatives

Replication:
•  Slow: need to write data over the network
•  Memory inefficient

Backup on persistent storage:
•  Persistent storage is still (much) slower than memory
•  Still need to go over the network to protect against machine failures

Spark's choice:
•  Lineage: track the sequence of operations to efficiently reconstruct lost RDD partitions

Page 52:

Fault Recovery Example

Two-partition RDD A = {A1, A2} stored on disk:
1) filter and cache → RDD B
2) join → RDD C
3) aggregate → RDD D

[lineage diagram: A1, A2 → filter → B1, B2 → join → C1, C2 → aggregate → D]

If C1 is lost due to node failure before the aggregate finishes, Spark uses the lineage to reconstruct C1, eventually, on a different node, and then re-runs the aggregate.
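A toy sketch of the lineage idea, with hypothetical classes (this is not Spark's API): each partition records the function and parent partitions that produced it, so a lost partition is recomputed rather than kept as a replica.

```python
# Each Partition remembers how it was computed; losing its data just
# means recomputing it from its parents on demand.
class Partition:
    def __init__(self, compute, parents=()):
        self.compute = compute
        self.parents = parents
        self.data = None            # None means not materialized / lost

    def get(self):
        if self.data is None:       # recompute from lineage on demand
            inputs = [p.get() for p in self.parents]
            self.data = self.compute(*inputs)
        return self.data

a1 = Partition(lambda: [1, 2, 3, 4])                             # loaded from disk
b1 = Partition(lambda xs: [x for x in xs if x % 2 == 0], (a1,))  # filter
assert b1.get() == [2, 4]
b1.data = None                      # simulate losing B1 to a node failure
assert b1.get() == [2, 4]           # rebuilt via lineage, not replication
```

This is why lineage is cheap in the common case: nothing extra is written anywhere until a failure actually happens.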

Page 55:

Fault Recovery Results

Iteration time (s), iterations 1–10: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59

[chart: iteration time (s) vs. iteration; the failure happens at the 81 s iteration, adding only ~20 s, after which times return to the normal ~57–59 s]

Page 56:

Overview

1. Introduction
2. RDDs
3. Generality of RDDs (e.g., streaming)
4. DataFrames
5. Project Tungsten

Page 57:

Spark Streaming: Motivation

Process large data streams at second-scale latencies
•  Site statistics, intrusion detection, online ML

To build and scale these apps, users want:
•  Integration with the offline analytical stack
•  Fault tolerance, both for crashes and stragglers

Page 58:

Traditional Streaming Systems

Event-driven, record-at-a-time processing:
•  Each node has mutable state
•  For each record, update state & send new records

State is lost if a node dies. Making stateful stream processing fault-tolerant is challenging.

[diagram: input records flowing through nodes 1–3, each holding mutable state]

Page 59:

Spark Streaming

How does it work? Data streams are chopped into batches
•  A batch is an RDD holding a few hundred milliseconds' worth of data

Each batch is processed in Spark; results are pushed out in batches.

[diagram: data streams → receivers → batches → results]
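The chopping step can be sketched in a few lines, with Python lists standing in for the per-interval RDDs that Spark Streaming would actually create (illustrative assumption, not Spark's implementation; Spark batches by time interval, while this sketch batches by count for simplicity):

```python
# Chop a stream of records into fixed-size micro-batches; each yielded
# batch plays the role of one RDD handed to the batch engine.
def chop(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch             # each yielded batch ~ one RDD
            batch = []
    if batch:
        yield batch                 # final partial batch

batches = list(chop(range(7), batch_size=3))
print(batches)  # -> [[0, 1, 2], [3, 4, 5], [6]]
```

Because every batch is an ordinary RDD, the batch engine's fault tolerance (lineage, recomputation) carries over to streaming for free.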

Page 61:

Streaming Word Count

val lines = ssc.socketTextStream("localhost", 9999)         // create DStream from data over a socket
val words = lines.flatMap(_.split(" "))                     // split lines into words
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)  // count the words
wordCounts.print()                                          // print some counts on screen
ssc.start()                                                 // start processing the stream

Page 62:

Word Count

object NetworkWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Page 63:

Word Count

Spark Streaming:

object NetworkWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Storm:

public class WordCountTopology {
  public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() { super("python", "splitsentence.py"); }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
    @Override
    public Map<String, Object> getComponentConfiguration() { return null; }
  }

  public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      Integer count = counts.get(word);
      if (count == null) count = 0;
      count++;
      counts.put(word, count);
      collector.emit(new Values(word, count));
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
    Config conf = new Config();
    conf.setDebug(true);
    if (args != null && args.length > 0) {
      conf.setNumWorkers(3);
      StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
    } else {
      conf.setMaxTaskParallelism(3);
      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("word-count", conf, builder.createTopology());
      Thread.sleep(10000);
      cluster.shutdown();
    }
  }
}

Page 64:

Machine Learning Pipelines

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[diagram: df0 → tokenizer → df1 → hashingTF → df2 → lr → df3, yielding a PipelineModel]

Page 65:

Powerful Stack – Agile Development

[bar chart: non-test, non-example source lines, 0–140,000 scale, comparing Hadoop MapReduce, Storm (Streaming), Impala (SQL), and Giraph (Graph) against Spark; successive slides stack Streaming, SparkSQL, and GraphX as small additions on top of Spark, ending with "Your App?"]

Page 70:

Benefits for Users

High performance data sharing •  Data sharing is the bottleneck in many environments •  RDD’s provide in-place sharing through memory

Applications can compose models •  Run a SQL query and then PageRank the results •  ETL your data and then run graph/ML on it

Benefit from investment in shared functionality •  E.g. re-usable components (shell) and performance optimizations
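The model-composition idea above can be sketched without Spark: a plain-Python pipeline that first filters records (the "SQL query" step) and then runs a few PageRank iterations on the surviving link graph. The tiny dataset, the iteration count, and the 0.15/0.85 damping constants are illustrative choices, not Spark APIs.

```python
# Illustrative sketch (plain Python, not Spark APIs): a "SQL" filter step
# followed by PageRank on the result. Data and constants are made up.
edges = [
    ("a", "b", 2010), ("b", "c", 2012), ("c", "a", 2014),
    ("a", "c", 2015), ("d", "a", 1999),
]

# Step 1: "SELECT src, dst FROM edges WHERE year >= 2010"
links = [(s, d) for s, d, year in edges if year >= 2010]

# Step 2: a few PageRank iterations over the filtered graph
nodes = {n for edge in links for n in edge}
rank = {n: 1.0 for n in nodes}
for _ in range(10):
    contrib = {n: 0.0 for n in nodes}
    for src, dst in links:
        out_degree = sum(1 for s, _ in links if s == src)
        contrib[dst] += rank[src] / out_degree
    rank = {n: 0.15 + 0.85 * c for n, c in contrib.items()}

print(sorted(rank, key=rank.get, reverse=True))  # nodes ordered by rank
```

Because both steps share one in-memory representation, the output of the filter feeds the graph computation directly — the point the slide makes about composing libraries on a single engine.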

Page 71

Overview

1. Introduction
2. RDDs
3. Generality of RDDs (e.g. streaming)
4. DataFrames
5. Project Tungsten

Page 72

Beyond Hadoop Users

Spark early adopters: understand MapReduce & functional APIs
Broader set of users: Data Engineers, Data Scientists, Statisticians, R users, PyData, …

Page 73

Page 74

DataFrames in Spark

Distributed collection of data grouped into named columns (i.e. an RDD with a schema)

Domain-specific functions designed for common tasks
•  Metadata
•  Sampling
•  Projection, filter, aggregation, join, …
•  UDFs

Available in Python, Scala, Java, and R

Page 75

Spark DataFrame

> head(filter(df, df$waiting < 50))   # an example in R
##  eruptions waiting
##1     1.750      47
##2     1.750      47
##3     1.867      48

Similar APIs to single-node tools (Pandas, dplyr), i.e. easy to learn
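As a plain-Python sketch of what that filter expression computes (the rows here are a small illustrative excerpt in the spirit of R's built-in faithful dataset, not the real data):

```python
# Plain-Python sketch of filter(df, df$waiting < 50): keep rows whose
# 'waiting' column is below 50. Rows are illustrative values only.
rows = [
    {"eruptions": 3.600, "waiting": 79},
    {"eruptions": 1.750, "waiting": 47},
    {"eruptions": 1.750, "waiting": 47},
    {"eruptions": 1.867, "waiting": 48},
]

short_waits = [r for r in rows if r["waiting"] < 50]
for r in short_waits:
    print(r["eruptions"], r["waiting"])
```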

Page 76

Spark RDD Execution

Java/Scala frontend  →  JVM backend
Python frontend      →  Python backend

User-defined functions are opaque closures to the engine.

Page 77

Spark DataFrame Execution

DataFrame frontend  →  Logical Plan  →  Catalyst optimizer  →  Physical execution

The logical plan is the intermediate representation for computation.

Page 78

Spark DataFrame Execution

Python DF / Java/Scala DF / R DF  →  Logical Plan  →  Catalyst optimizer  →  Physical execution

The logical plan is the intermediate representation for computation; the language frontends are simple wrappers that create the logical plan.
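The "thin wrapper over a shared logical plan" idea can be sketched in a few lines of plain Python. The node classes and the single optimizer rule (merging adjacent filters, one simplification in the spirit of Catalyst) are illustrative, not Spark's actual classes.

```python
# Illustrative sketch (not Spark's real classes): frontends build a tree
# of logical-plan nodes; an optimizer rewrites the tree before execution.
class Scan:
    def __init__(self, table):
        self.table = table

class Filter:
    def __init__(self, predicate, child):
        self.predicate, self.child = predicate, child

def frontend_filter(plan, predicate):
    """All a thin language binding does: construct a plan node."""
    return Filter(predicate, plan)

def merge_filters(plan):
    """One Catalyst-style rule: collapse Filter(Filter(x)) into one Filter."""
    if isinstance(plan, Filter) and isinstance(plan.child, Filter):
        inner = plan.child
        both = lambda row, p=plan.predicate, q=inner.predicate: p(row) and q(row)
        return Filter(both, merge_filters(inner.child))
    return plan

plan = frontend_filter(
    frontend_filter(Scan("events"), lambda r: r["year"] >= 2010),
    lambda r: r["country"] == "US",
)
optimized = merge_filters(plan)
# the optimizer produced a single Filter directly over the Scan
print(type(optimized).__name__, type(optimized.child).__name__)
```

Because every frontend emits the same plan nodes, the optimizer (and everything downstream) is written once and shared by all languages.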

Page 79

Benefit of Logical Plan: Simpler Frontend

Python: ~2,000 lines of code (built over a weekend)
R: ~1,000 lines of code

i.e. much easier to add new language bindings (Julia, Clojure, …)

Page 80

Performance

[Bar chart: runtime for an example aggregation workload with the RDD API, Java/Scala vs Python; x-axis 0–10]

Page 81

Benefit of Logical Plan: Performance Parity Across Languages

[Bar chart: runtime for an example aggregation workload (secs), x-axis 0–10; RDD API: Java/Scala, Python; DataFrame API: Java/Scala, Python, R, SQL]

Page 82

Overview

1. Introduction
2. RDDs
3. Generality of RDDs (e.g. streaming)
4. DataFrames
5. Project Tungsten

Page 83

Hardware Trends

Storage

Network

CPU

Page 84

Hardware Trends

              2010
Storage       50+MB/s (HDD)
Network       1Gbps
CPU           ~3GHz

Page 85

Hardware Trends

              2010             2015
Storage       50+MB/s (HDD)    500+MB/s (SSD)
Network       1Gbps            10Gbps
CPU           ~3GHz            ~3GHz

Page 86

Hardware Trends

              2010             2015
Storage       50+MB/s (HDD)    500+MB/s (SSD)    10x
Network       1Gbps            10Gbps            10x
CPU           ~3GHz            ~3GHz             (flat)

Page 87

Project Tungsten

Substantially speed up execution by optimizing CPU efficiency, via:
(1) Runtime code generation
(2) Exploiting cache locality
(3) Off-heap memory management
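The first technique can be sketched in Python: instead of interpreting an expression tree row by row, generate source for a specialized function once and compile it. This is a toy illustration of the idea only; Tungsten's actual generator emits Java bytecode.

```python
# Toy illustration of runtime code generation: compile a specialized
# evaluator for the expression (a + b) * 2 instead of walking a generic
# expression tree for every row. Not Tungsten's actual generator.
expr_source = "def evaluate(row):\n    return (row['a'] + row['b']) * 2\n"

namespace = {}
exec(compile(expr_source, "<generated>", "exec"), namespace)
evaluate = namespace["evaluate"]  # an ordinary compiled function now

rows = [{"a": 1, "b": 2}, {"a": 10, "b": 5}]
results = [evaluate(r) for r in rows]
print(results)  # [6, 30]
```

The win is that the per-row interpretation overhead (virtual calls, branching on node types) is paid once at compile time rather than on every row.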

Page 88

From DataFrame to Tungsten

Python DF / Java/Scala DF / R DF  →  Logical Plan  →  Tungsten Execution

Initial phase in Spark 1.5; more work coming in 2016

Page 89

Project Tungsten: Fully Managed Memory

Spark’s core API uses raw Java objects for aggregations and joins
•  GC overhead
•  Memory overhead: 4-8x more memory than the serialized format
•  Computation overhead: little memory locality

DataFrames use a custom binary format and off-heap managed memory
•  GC-free
•  No memory overhead
•  Cache locality
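The serialized-row idea can be sketched with Python's struct module: each row is a fixed-width binary record in one contiguous buffer, so there are no per-object headers or pointers. The particular layout (one 64-bit int plus one 64-bit float per row) is an illustrative choice, not Spark's actual format.

```python
import struct

# Illustrative fixed-width binary row format: an int64 key and a float64
# value per row, packed into one contiguous buffer with no per-row
# object headers (the overhead raw Java objects would add).
ROW = struct.Struct("<qd")   # 8-byte int + 8-byte float = 16 bytes/row

buf = bytearray()
for key, value in [(1, 0.5), (2, 1.25), (3, 2.0)]:
    buf += ROW.pack(key, value)

# Random access by row index is just pointer arithmetic:
def read_row(i):
    return ROW.unpack_from(buf, i * ROW.size)

print(len(buf), read_row(1))  # 48 (2, 1.25)
```

Three rows occupy exactly 48 bytes, and because the buffer is a single allocation there is nothing for a garbage collector to trace.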

Page 90

Example: Hash Table Data Structure

Keep data close to the CPU cache
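The slide's point — keys and values stored inline in a contiguous array avoid pointer-chasing, so probes stay in cache — can be sketched as an open-addressing table over one flat buffer. The slot layout, capacity, and hashing here are illustrative.

```python
import struct

# Sketch of a cache-friendly hash table: (key, value) pairs live inline
# in one flat buffer (open addressing with linear probing), so a probe
# touches adjacent memory instead of chasing pointers. Illustrative only;
# key 0 marks an empty slot, so keys must be nonzero.
SLOT = struct.Struct("<qq")          # int64 key, int64 value
CAPACITY = 8
table = bytearray(SLOT.size * CAPACITY)

def put(key, value):
    i = key % CAPACITY
    while True:
        k, _ = SLOT.unpack_from(table, i * SLOT.size)
        if k == 0 or k == key:       # empty slot, or overwrite same key
            SLOT.pack_into(table, i * SLOT.size, key, value)
            return
        i = (i + 1) % CAPACITY       # linear probe: the very next slot

def get(key):
    i = key % CAPACITY
    while True:
        k, v = SLOT.unpack_from(table, i * SLOT.size)
        if k == key:
            return v
        if k == 0:
            return None
        i = (i + 1) % CAPACITY

put(3, 30)
put(11, 110)                 # 3 and 11 collide (both hash to slot 3)
print(get(3), get(11))       # 30 110
```

Linear probing visits the neighbouring 16-byte slot on a collision, which is very likely already in the same cache line — the locality win the slide refers to.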

Page 91

Example: Aggregation Operation

Page 92

Unified API, One Engine, Automatically Optimized

Language frontends: Python, Java/Scala, R, SQL, …
        ↓
DataFrame Logical Plan
        ↓
Tungsten backend: LLVM, JVM, SIMD, 3D XPoint

Page 93

Refactoring Spark Core

Python / SQL / SparkR / Streaming / Advanced Analytics
        ↓
DataFrame (& Dataset)
        ↓
Tungsten Execution

Page 94

Summary

General engine with libraries for many data analysis tasks

Access to diverse data sources

Simple, unified API

Major focus going forward:
•  Ease of use (DataFrames)
•  Performance (Tungsten)

Libraries: SQL, Streaming, ML, Graph, …