Foundations for Scaling Analytics in Apache Spark


Transcript of Foundations for Scaling Analytics in Apache Spark

Page 1: Foundations for Scaling Analytics in Apache Spark

Foundations for Scaling Analytics in Apache Spark

Joseph K. Bradley September 19, 2016


Page 2: Foundations for Scaling Analytics in Apache Spark

Who am I?

•  Apache Spark committer & PMC member
•  Software Engineer @ Databricks (ML team)
•  Machine Learning Department @ Carnegie Mellon


Page 3: Foundations for Scaling Analytics in Apache Spark

Talk outline

•  Intro: Apache Spark, Machine Learning (and graphs) in Spark
•  Original implementations: RDDs
•  Future implementations: DataFrames


Page 4: Foundations for Scaling Analytics in Apache Spark

Apache Spark
•  General engine for big data computing
•  Fast & scalable
•  Easy to use
•  APIs in Python, Scala, Java & R

Components: Spark SQL, Streaming, MLlib, GraphX

Open source
•  Apache Software Foundation
•  1000+ contributors
•  250+ companies & universities

Page 5: Foundations for Scaling Analytics in Apache Spark

It’s big
•  Spark beat Hadoop’s Gray Sort record by 3x with 1/10 as many machines
•  Largest cluster size of 8000 nodes (Tencent)


Page 6: Foundations for Scaling Analytics in Apache Spark

MLlib: Spark’s ML library


ML tasks: Classification, Regression, Recommendation, Clustering, Frequent itemsets

Data utilities: Featurization, Statistics, Linear algebra

Workflow utilities: Model import/export, Pipelines, DataFrames, Cross validation

Goals: Scale-out, Standard library, Extensible API

Challenges for big data
•  Iterative algorithms
•  Diverse algorithmic patterns
•  Many data types

Page 7: Foundations for Scaling Analytics in Apache Spark

GraphX and GraphFrames


Graph algorithms: Connected components, PageRank, Label propagation, …

Graph queries: Vertex degrees, Subgraphs, Motif finding, …

Goals: Scale-out, Standard library, Extensible API

Challenges for big data
•  Iterative algorithms
•  Many (big) joins
•  Many data types

Page 8: Foundations for Scaling Analytics in Apache Spark

Talk outline

•  Intro: Apache Spark, Machine Learning (and graphs) in Spark
•  Original implementations: RDDs
•  Future implementations: DataFrames


Page 9: Foundations for Scaling Analytics in Apache Spark

Talk outline

•  Intro: Apache Spark, Machine Learning (and graphs) in Spark
•  Original implementations: RDDs
•  Future implementations: DataFrames


Page 10: Foundations for Scaling Analytics in Apache Spark


Resilient Distributed Datasets (RDDs)

[Diagram: a master node coordinating Map and Reduce tasks across the workers.]

val myData: RDD[(String, Vector)]
myData.map { _._2 * 0.5 }
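A minimal runnable version of the snippet above, since MLlib's Vector has no built-in "*" operator (the example values and app name are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

// A pair RDD of (name, feature vector), as on the slide.
val myData: RDD[(String, Vector)] = sc.parallelize(Seq(
  ("a", Vectors.dense(1.0, 2.0)),
  ("b", Vectors.dense(3.0, 4.0))
))

// Scale every vector by 0.5, keeping the keys.
val halved: RDD[(String, Vector)] =
  myData.mapValues(v => Vectors.dense(v.toArray.map(_ * 0.5)))

halved.collect().foreach(println)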

Page 11: Foundations for Scaling Analytics in Apache Spark

Resilient Distributed Datasets (RDDs)


Page 12: Foundations for Scaling Analytics in Apache Spark

Resilient Distributed Datasets (RDDs)


Resiliency
•  Lineage
•  Caching & checkpointing
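A short sketch of how caching and checkpointing keep long lineage chains manageable (it reuses sc and myData from the sketch above; the checkpoint directory is illustrative):

sc.setCheckpointDir("/tmp/spark-checkpoints")

val cleaned = myData.mapValues(v => Vectors.dense(v.toArray.map(_ * 0.5)))
cleaned.cache()        // keep partitions in memory across iterations
cleaned.checkpoint()   // save to reliable storage and truncate the lineage
cleaned.count()        // the first action materializes both the cache and the checkpoint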

Page 13: Foundations for Scaling Analytics in Apache Spark


ML on RDDs

[Diagram: each worker computes a gradient (Vector) for each row (training example); the gradients are aggregated on the master, which then broadcasts the gradient back to the workers.]
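A hedged sketch of this pattern, not MLlib's actual implementation: broadcast the current weights, compute a per-example gradient on the workers, and aggregate the partial gradients back on the master. The least-squares gradient and the name gradientStep are assumptions for illustration; it uses Breeze vectors, which MLlib depends on.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import breeze.linalg.{DenseVector => BDV}

def gradientStep(data: RDD[LabeledPoint], weights: BDV[Double], stepSize: Double): BDV[Double] = {
  val bcWeights = data.sparkContext.broadcast(weights)   // broadcast current state to workers

  // Sum per-row gradients (and count rows) with a tree aggregation back to the master.
  val (gradSum, count) = data.treeAggregate((BDV.zeros[Double](weights.length), 0L))(
    seqOp = { case ((grad, n), point) =>
      val x = BDV(point.features.toArray)
      val err = (bcWeights.value dot x) - point.label
      (grad + x * err, n + 1L)
    },
    combOp = { case ((g1, n1), (g2, n2)) => (g1 + g2, n1 + n2) }
  )

  weights - (gradSum / count.toDouble) * stepSize
}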

Page 14: Foundations for Scaling Analytics in Apache Spark

ML on RDDs: the good

Flexible: GLMs, trees, matrix factorization, etc.

Scalable: E.g., Alternating Least Squares on Spotify data (2014)
•  50+ million users x 30+ million songs
•  50 billion ratings

Cost ~ $10
•  32 r3.8xlarge nodes (spot instances)
•  For rank 10 with 10 iterations, ~1 hour running time
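For concreteness, a minimal RDD-based ALS run looks roughly like this (not the Spotify job; the input path and CSV layout are assumptions, and sc is the SparkContext from the earlier sketches):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

val ratings: RDD[Rating] = sc.textFile("ratings.csv").map { line =>
  val Array(user, song, score) = line.split(",")
  Rating(user.toInt, song.toInt, score.toDouble)
}

// Rank 10 with 10 iterations, mirroring the numbers above; 0.01 regularization is arbitrary.
val model = ALS.train(ratings, 10, 10, 0.01)
val topSongs = model.recommendProducts(42, 5)   // top 5 recommendations for user 42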


Page 15: Foundations for Scaling Analytics in Apache Spark

ML on RDDs: the challenges

Maintaining state

Python API

Iterator model

Data partitioning


Page 16: Foundations for Scaling Analytics in Apache Spark


Maintaining state on master

[Diagram: the current state is kept on the master node.]

Page 17: Foundations for Scaling Analytics in Apache Spark


Maintaining state in RDDs

[Diagram: the current state is distributed across the RDD’s partitions.]

Page 18: Foundations for Scaling Analytics in Apache Spark

Maintaining state


Cons of master
•  Single point of failure
•  Cannot support large state (1 billion parameters)

Cons of RDDs
•  More complex
•  Lineage becomes a problem → cache & checkpoint

Unstated con: Developers have to choose 1 option!

Page 19: Foundations for Scaling Analytics in Apache Spark


Python API (RDD-based)

[Diagram: each Spark worker (JVM) runs Python processes for user code.]

Data stored as Python objects → serialization overhead

Page 20: Foundations for Scaling Analytics in Apache Spark


Iterator model

val rdd0: RDD[(String, Vector)] = …
val rdd1 = rdd0.map { case (name, data) =>
  (name.trim, normalizeVec(data))
}
val rdd2 = rdd1.map {
  ...
}

Arbitrary data types

Black box lambda functions

Iterative processing (especially in ML!)

→ Boxed types → JVM object creation & GC

Page 21: Foundations for Scaling Analytics in Apache Spark


Data partitioning: numPartitions

Selecting numPartitions can be critical.
•  Each task has overhead.
•  Overhead / parallelism trade-off.

Different numPartitions for different jobs:
•  SQL: 200+ is reasonable
•  ML: 1 per compute core
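A hedged sketch of the knobs involved (the file name and values are illustrative; sc is the SparkContext from the earlier sketches):

val numCores = sc.defaultParallelism                       // ML rule of thumb: ~1 partition per core
val trainData = sc.textFile("train.txt").repartition(numCores)

// DataFrame/SQL shuffles are governed by a separate setting (200 by default),
// e.g. spark.conf.set("spark.sql.shuffle.partitions", "200") on a SparkSession.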

Page 22: Foundations for Scaling Analytics in Apache Spark


Data partitioning: co-partitioning

Algorithm
•  Join
•  Map
•  Iterate

Co-partitioning is critical for
•  ALS (matrix factorization)
•  Graph algorithms
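A hedged sketch of co-partitioning two pair RDDs so that a repeated join avoids a full shuffle (the data and names are illustrative, not MLlib's ALS internals):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

val partitioner = new HashPartitioner(sc.defaultParallelism)

val userFactors: RDD[(Int, Array[Double])] =
  sc.parallelize(Seq(1 -> Array(0.1, 0.2), 2 -> Array(0.3, 0.4)))
    .partitionBy(partitioner).cache()

val userRatings: RDD[(Int, Double)] =
  sc.parallelize(Seq(1 -> 4.0, 2 -> 5.0))
    .partitionBy(partitioner).cache()

// Both sides share the same partitioner, so each iteration's join needs no extra shuffle.
val joined = userFactors.join(userRatings)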

Page 23: Foundations for Scaling Analytics in Apache Spark

ML on RDDs: the challenges

Maintaining state (& lineage)

Python API

Iterator model

Data partitioning


Page 24: Foundations for Scaling Analytics in Apache Spark

Talk outline

•  Intro: Apache Spark, Machine Learning (and graphs) in Spark
•  Original implementations: RDDs
•  Future implementations: DataFrames


Page 25: Foundations for Scaling Analytics in Apache Spark

Talk outline

•  Intro: Apache Spark, Machine Learning (and graphs) in Spark
•  Original implementations: RDDs
•  Future implementations: DataFrames


Page 26: Foundations for Scaling Analytics in Apache Spark

Spark DataFrames & Datasets


dept   age   name
Bio    48    HSmith
CS     34    ATuring
Bio    43    BJones
Chem   61    MKennedy

Data grouped into named columns

DSL for common tasks
•  Project, filter, aggregate, join, …
•  Statistics, n/a values, sketching, …
•  User-Defined Functions (UDFs) & Aggregation (UDAFs)

data.groupBy("dept").avg("age")

Datasets: Strongly typed DataFrames
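A small sketch reproducing the slide's table and query, plus the typed Dataset view (the SparkSession setup and the Person case class are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-sketch").master("local[*]").getOrCreate()
import spark.implicits._

case class Person(dept: String, age: Int, name: String)

val data = Seq(
  Person("Bio", 48, "HSmith"),
  Person("CS", 34, "ATuring"),
  Person("Bio", 43, "BJones"),
  Person("Chem", 61, "MKennedy")
).toDF()

data.groupBy("dept").avg("age").show()

// Datasets: the same data with a static type instead of generic Rows.
val ds = data.as[Person]
ds.filter(_.age > 40).show()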

Page 27: Foundations for Scaling Analytics in Apache Spark


Catalyst query optimizer

[Diagram: SQL, DataFrame, and Dataset programs are represented as trees (abstractions of user programs); Catalyst transformations turn the query plan into an optimized query plan, which is executed as RDDs.]
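To watch Catalyst at work, explain() prints the parsed, analyzed, optimized, and physical plans for a query (this reuses the data DataFrame and implicits from the sketch above):

val adults = data.filter($"age" > 40).groupBy($"dept").count()
adults.explain(extended = true)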

Page 28: Foundations for Scaling Analytics in Apache Spark

Project Tungsten

Memory management
•  Off-heap (Java Unsafe API)
•  Avoid JVM GC
•  Compressed format

Code generation
•  Rewrite chain of iterators into single code blocks
•  Operate directly on compressed format
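In Spark 2.0 the effect of code generation shows up in the physical plan: operators fused by whole-stage code generation are prefixed with "*" in explain() output. A small illustrative query, reusing the spark session from the earlier sketch:

spark.range(0, 1000000)
  .selectExpr("id % 10 AS key")
  .groupBy("key").count()
  .explain()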


Page 29: Foundations for Scaling Analytics in Apache Spark

DataFrames in ML and Graphs

API
•  DataFrame-based API in MLlib (spark.ml package)
•  GraphFrames (Spark package)

Transformation & prediction

Training
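A short sketch of the DataFrame-based API: a spark.ml Pipeline that featurizes text and trains a classifier (the toy data and column names are illustrative; it reuses spark.implicits._ from the earlier sketch):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val training = Seq(
  (0L, "spark is fast", 1.0),
  (1L, "hadoop map reduce", 0.0)
).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// Prediction runs as DataFrame transformations.
model.transform(training).select("id", "probability", "prediction").show()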


Page 30: Foundations for Scaling Analytics in Apache Spark

Python API


[Chart: time to aggregate 10^6 Int pairs (seconds) in Spark 1.4, comparing RDD Scala, RDD Python, DataFrame Scala, and DataFrame Python; shorter is better.]

Page 31: Foundations for Scaling Analytics in Apache Spark

Transformation/prediction with DataFrames

User-Defined Types (UDTs)
•  Vector (sparse & dense)
•  Matrix (sparse & dense)


User-Defined Functions (UDFs)
•  Feature transformation
•  Model prediction
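A hedged sketch of a prediction-style UDF over the Vector UDT (the weights, the featurized DataFrame, and its features column are assumptions for illustration; the implicits from the earlier sketch are in scope):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

val weights = Array(0.5, -1.2, 0.3)

// Apply a simple linear score to each feature vector.
val score = udf { (v: Vector) =>
  v.toArray.zip(weights).map { case (x, w) => x * w }.sum
}

val scored = featurized.withColumn("score", score($"features"))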

Page 32: Foundations for Scaling Analytics in Apache Spark

Future work: model training

Goal: Port all ML/graph algorithms to run on DataFrames for better speed & scalability.

Currently:
•  Belief propagation
•  Connected components


Page 33: Foundations for Scaling Analytics in Apache Spark

Catalyst in ML

What’s missing?
•  Concept of iteration
•  Handling caching and checkpointing across many iterations
•  ML/Graph-specific optimizations for the Catalyst query planner


Page 34: Foundations for Scaling Analytics in Apache Spark

Tungsten in ML

Partly done
•  Vector/Matrix UDTs
•  UDFs for some operations

What’s missing?
•  Code generation for critical paths
•  Closer integration of Vector/Matrix types with Tungsten


Page 35: Foundations for Scaling Analytics in Apache Spark

OOMing

java.lang.OutOfMemoryError: a classic pain point of RDDs.

DataFrames automatically spill to disk.

Goal: Smoothly scale, without custom per-algorithm optimizations

Page 36: Foundations for Scaling Analytics in Apache Spark

To summarize...

MLlib on RDDs
•  Required custom optimizations

MLlib with a DataFrame-based API
•  Friendly API
•  Improvements for prediction

In the future
•  Potential for even greater scaling for training
•  Simpler for non-experts to write new algorithms


Page 37: Foundations for Scaling Analytics in Apache Spark

Get started

Get involved
•  JIRA: http://issues.apache.org
•  Mailing lists: http://spark.apache.org
•  GitHub: http://github.com/apache/spark
•  Spark Packages: http://spark-packages.org

Learn more
•  New in Apache Spark 2.0: http://databricks.com/blog/2016/06/01
•  MOOCs on EdX: http://databricks.com/spark/training


Try out Apache Spark 2.0 in Databricks Community Edition http://databricks.com/ce

Many thanks to the community for contributions & support!

Page 38: Foundations for Scaling Analytics in Apache Spark

Databricks

Founded by the creators of Apache Spark

Offers hosted service
•  Spark on EC2
•  Notebooks
•  Visualizations
•  Cluster management
•  Scheduled jobs


We’re hiring!

Page 39: Foundations for Scaling Analytics in Apache Spark

Thank you!

FB group: Databricks at CMU
databricks.com/careers