Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden...
Transcript of Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden...
Benchmarking Data Flow Systems
for Scalable Machine Learning Christoph Boden, Andrea Spina, Tilmann Rabl, Volker Markl
1 SPEC Research Big Data Group call, 19.6.2017
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Motivation
- Hadoop MapReduce inherently ineffecient at executing iterative
computations
- second generation systems (Spark, Flink, GraphLab …) address this
shortcomming
- distributed data flow systems are poular choices to train machine learning
models at scale
- existing benchmarks use non-representative workloads and fail to
address scalability aspects of machine learning models
2
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
3
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Problem
existing (big data) benmcharks:
- use non-representative workloads
(word count, sort …)
- fail to address all dimensions of scalability
- use existing libraries for experiments
4
„[…] Spark obtains the best results for K-Means
thanks to the optimized MLlib library, although it is
expected that the support of K-Means in Flink-ML
can bridge this performance gap. […]”
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Example: Click-Through Rate Prediction
- Goal: predict whether a user will click an ad
- a crucial building block in the multi-billion dollar online advertising industry
- logistic regression models still a „major workhorse“
- Prediction models are trained on
- >100 TB data
- billions of training samples
- up to 100 billion unique features*
* https://users.soe.ucsc.edu/~niejiazhong/slides/chandra.pdf
5
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Dimensions of Scalability
Problem: existing (big data) bencharks fail to address all dimensions of
scalability
- Scaling the data (number of training samples)
- Scaling the model (dimensions)
- Scaling the number of models (ensembles, hyperparameter tuning, …)
6
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Goal
- Introduce a representative workloads and experiments to evaluate the
Performance of distributed data flow systems for machine learning
- Implemented mathematically equivalent workloads on different systems and
assess their scalability w.r.t. Machine Learning
Systems:
Apache Flink Apache Spark Single Thread 7
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Scalability you say …
8
Frank McSherry, Michael Isard, and Derek G. Murray. 2015. Scalability! but at what cost?.
In Proceedings of the 15th USENIX conference on Hot Topics in Operating Systems (HOTOS'15)
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
COST
9
- hardware configuration required before the platform outperforms a
competent single-threaded implementation.
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Experiments
Production Scaling: maximum number of nodes, varying data size
Strong Scaling: varying number of nodes, fixed data size
Model Scaling: varying number of nodes and dimensionality
fixed number of data points
COST: varying number of nodes and dimensions compared
against single threaded implementation
10
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Background: Spark and Flink
Spark:
- data-parallel transformations on Resilient Distributed Datasets (RDDs)
- can be cached and recomputed in case of node failures
Flink:
- distributed streaming data flow engine supporting batch- and streaming
workloads
- native operator for iterative computations
- jobs are compiled and optimized by a cost-based optimizer
11
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Data Sets
Unsupervised Learning: generated 100 dimensional data sampled from k
Gaussians and added uniform random noise (similar to HiBench)
Supervised Learning: used part of the Criteo Click log data set (1 bn data
points) with feature hashing to convert to desired dimensionality for experiments –
(e.g. 530 GB for 1000 dim)
12
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Cluster Setup
• Quadcore Intel Xeon CPU E3-1230 V2 3.30GHz CPU (4 cores, 8 hyperthreads)
• 16 GB RAM
• 3x1TB hard disks (linux software RAID0)
• 1 GBit Ethernet NIC
• Flink Version: 1.0.3
• Spark Version: 1.6.2
• LibLinear Version
13
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Parameter Tuning
• parallelism
• caching
• buffers
• serialization
14
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Workloads
15
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Machine Learning Pipelines
16
raw training data
feature extraction
model training
model evaluation
model
Model performance
Feature Selection Feature Engineering
Model Selection
(Hyper-) parameter tuning
Validation Data
Test Data
Training Data
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Supervised Learning
Objective:
Batch Gradient Descent:
Different parametrizations of loss and regularization function yield a variety of ML methods
A good workload proxy for more sophisticated solvers that share a similar computational footprint
17
𝑤 = 𝑎𝑟𝑔𝑚𝑖𝑛𝑤
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Map-Reduce Implementation
Map1 Map2 MapN compute gradient per data point
Reduce sum up partial gradients
…
18
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Map-Partition Implementation
MapPartition1 compute gradient per data point (per partition)
Reduce
locally sum up partial gradients (in udf)
MapPartitionN … pre-
aggregate pre-
aggregate
aggregate pre-aggregated partial sums 19
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Tree-Aggregate (Spark)
Executor Executor Executor Executor
Executor Executor
driver
20
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Experimental Results
21
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Production Scaling: Implementation Strategies
0
20
40
60
80
100
120
140
160
180
200
0 0,5 1 1,5 2 2,5 3 3,5 4
Ru
nti
me
in M
inu
tes
Data Set Size (linear scaling factor)
Spark MapPartition
Spark MapReduce
Spark TreeAggregate
Flink MapPartition
Flink MapReduce
22
• choice of implementation
stategy matters!
• all implementation scale
gracefully out-of-core
• Spark‘s MapPartition
slightly faster than
TreeAgregate, but not
robust
• unfortunate kryo
serialization bug
penalizing Flink‘s
MapReduce
implementation
5 Iterations of BGD training
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Strong Scaling Experiments
K-Means Clustering Batch Gradient Descent
0
5
10
15
20
25
30
0 5 10 15 20 25 30 35
Ru
nti
me
in M
inu
tes
Number of Nodes
Apache Spark
Apache Flink
0
20
40
60
80
100
120
0 5 10 15 20 25
Ru
nti
me
in M
inu
tes
Number of Nodes
Apache Spark
Apache Flink
23
5 Iterations 5 Iterations
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Batch Gradient Descent on 4 Nodes
Apache Flink Apache Spark 24
CPU
DSK
NET
MMR
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Batch Gradient Descent on 25 Nodes
Apache Flink Apache Spark 25
CPU
DSK
NET
MMR
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Dimensionality Scaling (log-log)
26
two data sets:
• 0.2 = size of combined
main memory
• 0.8 = bigger than
combined main memory
• Spark performance
comparable or better than
flink for small dimensions
5 Iterations of BGD training 1
10
100
1 10 100 1000 10000 100000 1000000 10000000
run
tim
e in
min
ute
s
Dimensionality of the Model
Flink small data set (0.2)
Flink large data set (0.8)
Spark small data set (0.2)
Spark large data set (0.8)
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Dimensionality Scaling
27
0
10
20
30
40
50
60
70
80
90
100
0 2000000 4000000 6000000 8000000 10000000 12000000
run
tim
e in
min
ute
s
Dimensionality of the Model
Flink small data set (0.2)
Flink large data set (0.8)
Spark small data set (0.2)
Spark large data set (0.8)
• spark fails to tain models
beyond 6m dimensions on
0.8 data set
• spark fails to tain models
beyond 8m dimensions on
0.2 data set
• flink robustly scales to
10m dimensions for both
data sets
• flink fails to train models
greater than 10m
dimensions
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
BGD – 0.8 Data Set - 6 Million Dimensions
Apache Flink Apache Spark
28
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
COST: vs. Single Threaded Implementation
0
1
2
3
4
5
6
7
8
10 100 1000 10000 100000 1000000 10000000100000000 1E+09
Ru
nti
me
in M
inu
tes
Dimensionality of Model
LibLinear (Single Thread)
Spark 1 Node (4 Cores)
Flink 1 Node (4 Cores)
Spark 2 Nodes (8 Cores)
Flink 2 Nodes (8 Cores)
29
• 4GB subsample of criteo
data set
• 2 machines (8 cores)
sufficient to outperform
single threaded impl.
• both Flink and Spark fail to
train with 100m
dimensions or beyond
10 iterations of BGD training
Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017
Summary
• Proposed, implemented and evaluate a set of representative workloads and experiments to
evaluate systems for machine learing
• Both systems scale robustly with growing data-set sizes
• Choice of implementation strategy has a noticeable impact on performance
• Spark fails to train high dimensionsal models (beyond 6 million dimensions)
• Both systems did not manage to train a model with 100 million dimensions even on a small data set
• Two nodes (8 cores) are a sufficient hardware configuration to outperform a competent single-
threaded implementation
30
contact: [email protected] [soon] code: https://github.com/bodenc/ml-benchmark