Accelerating Cross-Validation in Spark Using GPU
Minsik Cho, Rajesh Bordawekar
IBM T. J. Watson Research Center
Cross-Validation 101
[Wikipedia]
Popular Model Validation Technique
– helps avoid overfitting and improves generalization
– useful when data is limited
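As a reminder of what k-fold cross-validation does: the data is split into k folds, and each fold is held out once for validation while the model is trained on the rest. A minimal sketch follows. It is not code from the talk; the helper name `kfold_indices` and the contiguous, unshuffled fold layout are assumptions made for brevity.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation.

    Hypothetical helper: folds are contiguous and unshuffled for simplicity.
    """
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


# Example: 10 samples split into 4 folds.
splits = list(kfold_indices(10, 4))
```

Every sample lands in exactly one validation fold, so k models are trained per hyperparameter setting.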
Cross-Validation + Elastic Net Regression
Cross-validation is commonly used with
– Linear/Logistic Regression
– Elastic Net Regularization
A large number of problems to solve
– the number of folds from cross-validation
– many lambdas to search for the best prediction model
– 4 folds x 1000 lambdas = 4000 regressions to fit
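For reference, lambda here is the regularization strength of the elastic net penalty. The slides do not state which parameterization the authors use, so the form below is an assumption. It is one common form (the one used by glmnet) for linear regression, with n samples, coefficients β, and mixing parameter α:

```latex
\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  \;+\; \lambda\left(\alpha\,\lVert\beta\rVert_1
  + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right)
```

For logistic regression the squared-error term is replaced by the negative log-likelihood. Each value of λ defines a separate fit, and each fit is repeated once per fold, which is where the folds x lambdas problem count comes from.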
[Wikipedia]
Tons of problems to crunch
Apache Spark Overview
In-memory engine for large-scale distributed data processing
– Used for databases, streaming, machine/deep learning, and graph processing
– Supports high-level APIs in Java, Scala, Python, and R
RDD: resilient distributed dataset
– Partitioned collection of records
– Spread across the cluster
– Can be cached in memory
Spark GPU Acceleration
[Rajesh, oreilly.com]
Accelerating Compute-Intensive Workloads with GPUs
Cross-Validation in Spark
For each problem
– Create RDD
– Distribute RDD
– Call optimizer
– Return Model
[Berkeley]
Cross-Validation in Spark
Is this best for GPU?
[Diagram: the dataset is split into a partitioned RDD (Dataset i, Dataset j, Dataset k) across workers i, j, and k; the workers' results are reduced into one model]
Proposed Cross-Validation in Spark Using GPU
Broadcast Data
– Cross-validation reuses the same source dataset for every problem
RDD of problem instances, not of data
– Many problems with different folds/lambdas
Use multiple GPU streams to minimize GPU idle time
Cross-Validation in Spark Using GPU
[Diagram: the inputs are the Dataset and the Problems; the full dataset is broadcast as an array to workers i, j, and k]
Cross-Validation in Spark Using GPU
[Diagram: workers i, j, and k each hold the full dataset; the problems are distributed as an RDD, with Problems i, Problems j, and Problems k going to the respective workers]
Code Snippet
Build a problem set
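The code shown on this slide is not preserved in the transcript. As a rough idea of what building a problem set could look like, here is a minimal sketch that pairs every fold with every lambda. The `Problem` type, the `build_problems` helper, and the log-spaced lambda grid are all assumptions for illustration, not the authors' code.

```python
from itertools import product
from typing import NamedTuple


class Problem(NamedTuple):
    """One regression to fit: a held-out fold plus a regularization strength."""
    fold: int
    lam: float


def build_problems(n_folds, lambdas):
    # Cartesian product: every fold is paired with every lambda.
    return [Problem(fold, lam) for fold, lam in product(range(n_folds), lambdas)]


# Hypothetical lambda grid: 1000 values, log-spaced from 1 down to 1e-4.
lambdas = [10 ** (-4 * i / 999) for i in range(1000)]
problems = build_problems(4, lambdas)
print(len(problems))  # 4 folds x 1000 lambdas = 4000
```

Each `Problem` is only a few bytes, so a list of thousands of them is cheap to distribute compared with the dataset itself.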
Code Snippet (cont.)
Input: dataset, problems
Dataset broadcast
Problem RDD
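The slide's code is likewise missing from the transcript. The three annotations (inputs, dataset broadcast, problem RDD) suggest a driver routine along the following lines. This is a sketch under assumptions: `cross_validate` and `solve` are hypothetical names, and `sc` is expected to be a `pyspark.SparkContext` created by the caller. Only the PySpark calls `broadcast`, `parallelize`, `map`, and `collect` are used.

```python
def cross_validate(sc, dataset, problems, solve, num_slices=None):
    """Fit every problem against a broadcast copy of the dataset.

    sc         -- a pyspark.SparkContext (assumed to be created by the caller)
    dataset    -- the full dataset, small enough to fit on every worker
    problems   -- list of problem descriptions, e.g. (fold, lambda) pairs
    solve      -- hypothetical function solve(dataset, problem) -> model;
                  in the talk's setting this would call into a GPU solver
    num_slices -- optional number of RDD partitions for the problems
    """
    # Ship the dataset to every worker once; tasks read it via .value.
    data_bc = sc.broadcast(dataset)
    # The RDD holds problem descriptions, not data.
    problem_rdd = sc.parallelize(problems, num_slices)
    # Each task solves its problems locally, with no worker-to-worker traffic.
    return problem_rdd.map(lambda p: (p, solve(data_bc.value, p))).collect()
```

The dataset crosses the network once per worker, while the per-task payload is just the problem description.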
Cross-Validation in Spark Using GPU
[Diagram: on worker i, holding the Dataset and Problems i, the dataset folds are split across two GPUs: GPU0 holds folds 0 and 2, GPU1 holds folds 1 and 3. Each problem is split by fold (Problem a:0 and a:2 on GPU0, Problem a:1 and a:3 on GPU1; likewise Problem b:0 to b:3), and the fold-level problems run on four cudaStreams]
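The placement in this diagram can be expressed as simple bookkeeping. The sketch below is plain Python with no CUDA calls. The rule "fold f lives on GPU f % n_gpus" and the choice of one stream per fold are readings of the diagram, not something the slides state, and `assign_to_gpus` is a hypothetical name.

```python
from collections import defaultdict


def assign_to_gpus(problem_ids, n_folds, n_gpus):
    """Map each (problem, fold) task to the GPU holding that fold.

    Assumed placement rule, read off the diagram: fold f lives on GPU f % n_gpus.
    Returns {gpu: {fold: [tasks queued on that fold's stream]}}.
    """
    plan = defaultdict(lambda: defaultdict(list))
    for pid in problem_ids:
        for fold in range(n_folds):
            gpu = fold % n_gpus
            # One stream per fold: tasks for the same fold queue up on it.
            plan[gpu][fold].append(f"{pid}:{fold}")
    return {gpu: dict(streams) for gpu, streams in plan.items()}


plan = assign_to_gpus(["a", "b"], n_folds=4, n_gpus=2)
# GPU 0 gets folds 0 and 2, GPU 1 gets folds 1 and 3, as in the diagram.
```

Keeping a fold's data resident on one GPU means every problem that uses that fold can reuse it without another host-to-device copy.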
Cross-Validation in Spark Using GPU
[Diagram: workers i, j, and k each hold the full dataset; the problems are distributed as an RDD (Problems i, Problems j, Problems k); a reduce step gathers the models from all workers into All Models]
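The slides show only that a reduce gathers all models. How the best lambda is then chosen is not stated. A common choice, assumed here, is to average the validation error over folds for each lambda and keep the minimum. The `best_lambda` helper and the error values are made up for illustration.

```python
from collections import defaultdict


def best_lambda(results):
    """Pick the lambda with the lowest mean validation error across folds.

    results -- iterable of (fold, lam, validation_error) tuples, one per problem
    """
    errors = defaultdict(list)
    for _fold, lam, err in results:
        errors[lam].append(err)
    mean_err = {lam: sum(errs) / len(errs) for lam, errs in errors.items()}
    return min(mean_err, key=mean_err.get), mean_err


# Toy values for illustration only: 2 folds x 2 lambdas.
results = [(0, 0.1, 0.30), (1, 0.1, 0.34), (0, 1.0, 0.25), (1, 1.0, 0.27)]
lam, mean_err = best_lambda(results)
```

Because each result is a small tuple, this final step is cheap and can run on the driver after the models are collected.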
Cross-Validation in Spark Using GPU (Advantages)
Dataset Broadcast
– Efficient p2p protocol in Spark
– One-time upfront overhead
– Data reused within GPUs
Problem RDD
– No communication among workers
– Multiple streams to maximize GPU utilization
Multi-level parallelism
– Functional parallelism from Problem RDD
– Multiple GPUs
– Multiple cudaStreams
Experimental Results
System
– 2-node cluster
– Each node with 32 x86 cores
– Each node with two K40m GPUs
Software
– Spark 2.0
– OpenJDK 1.8
Workload
– Real Watson Health dataset
– 5-fold cross-validation
– 1024 lambda exploration
Algorithms
– Logistic regression
– Linear regression
Measured end-to-end runtime, including dataset broadcast
Result: GPU utilization
Sustained over 97% Multi-GPU utilization
Result: Logistic Regression
No help: 2 problems
Help: enough problems
Help: more problems
114x speedup
Result: Linear Regression
94x speedup
Conclusion
Cross-Validation on Spark using GPU
– New way of parallelization in Spark: broadcast the dataset, turn the problems into an RDD
– Reduced communication
About 100x speedup for Logistic/Linear Regression + Elastic Net
Future work
– Support out-of-core execution