High-Performance Distributed Machine Learning using Apache Spark
Celestine Dunner, CDU@ZURICH.IBM.COM, IBM Research Zurich, Switzerland
Thomas Parnell, TPA@ZURICH.IBM.COM, IBM Research Zurich, Switzerland
Kubilay Atasu, KAT@ZURICH.IBM.COM, IBM Research Zurich, Switzerland
Manolis Sifalakis, EMM@ZURICH.IBM.COM, IBM Research Zurich, Switzerland
Haralampos Pozidis, HAP@ZURICH.IBM.COM, IBM Research Zurich, Switzerland
Abstract
In this paper we compare the performance of distributed learning using Apache SPARK and MPI by implementing a distributed linear learning algorithm from scratch on the two programming frameworks. We then explore the performance gap and show how SPARK-based learning can be accelerated, by reducing computational cost as well as communication-related overheads, to reduce the relative loss in performance versus MPI from 20× to 2×. With these different implementations at hand, we will illustrate how the optimal parameters of the algorithm depend strongly on the characteristics of the framework on which it is executed. We will show that carefully tuning a distributed algorithm to trade off communication and computation can improve performance by orders of magnitude. Hence, understanding system aspects of the framework and their implications, and then correctly adapting the algorithm, proves to be the key to performance.
1. Introduction
Machine learning techniques provide consumers, researchers and businesses with valuable insight. However, significant challenges arise when scaling the corresponding training algorithms to massive datasets that do not fit in the memory of a single machine, making the implementation as well as the design of distributed algorithms more demanding than that of single-machine solvers. Going distributed involves a lot of programming effort and system-level knowledge to correctly handle communication and synchronization between single workers, and, in addition, requires carefully designed algorithms that run efficiently in a distributed environment. In the past decades, distributed programming frameworks such as Open MPI have offered rich primitives and abstractions that give flexibility in implementing algorithms across distributed computing resources, often delivering high performance but at the cost of high implementation complexity. In contrast, more modern frameworks such as Hadoop and SPARK have recently emerged which dictate well-defined distributed programming paradigms and offer a powerful set of APIs specially built for distributed processing. These abstractions make the implementation of distributed algorithms more easily accessible to developers, but seem to come with poorly understood overheads associated with communication and data management, which make tight control of computation versus communication cost more difficult.

In this work we analyze these overheads and show that, to design and implement efficient real-world distributed learning algorithms using Apache SPARK, it is important to be aware of them and to adapt the algorithm to the given conditions. Hence, understanding the underlying system and adapting the algorithm remains the key to performance even when using Apache SPARK, and we demonstrate that careful tuning of the algorithm parameters can change performance by several orders of magnitude. The three main contributions of this paper are the following:
We provide a fair analysis and measurements of the overheads inherent in SPARK, relative to an equivalent MPI implementation of the same linear learning algorithm. In contrast to earlier work (Reyes-Ortiz et al., 2015; Gittens et al., 2016; Ousterhout et al., 2015) we clearly decouple framework-related overheads from the computational time. We achieve this by off-loading the critical computations of the SPARK-based learning algorithm into compiled C++ modules. Since the MPI implementation uses exactly the same code, any difference in performance can be solely attributed to the overheads related to the SPARK framework.
We demonstrate that by using such C++ modules we can accelerate SPARK-based learning by an order of magnitude and hence reduce its relative loss in performance over MPI from 20× to an acceptable level of 2×. We achieve this by, firstly, accelerating computationally heavy parts of the algorithm by calling optimized C++ modules from within SPARK, and, secondly, utilizing such C++ modules to extend the functionality of SPARK to reduce overheads for machine learning tasks.
Our clear separation of communication-related and computation-related costs provides new insights into how the communication-computation trade-off on real-world systems impacts the performance of distributed learning. We will illustrate that if the algorithm parameters are not chosen carefully in order to trade off computation versus communication, then this can lead to performance degradation of over an order of magnitude. Furthermore, we will show that the optimal choice of parameters depends strongly on the programming framework the algorithm is executed on: the optimal choice of parameters differs significantly between the MPI and the SPARK implementations, even when running on the same hardware.
For this analysis we will focus on standard distributed learning algorithms using synchronous communication, such as distributed variants of single-machine solvers based on the principle of mini-batches, e.g., mini-batch stochastic gradient descent (SGD) (Dekel et al., 2012; Zinkevich et al., 2010) and mini-batch stochastic coordinate descent (SCD) (Richtarik & Takac, 2015; Shalev-Shwartz & Zhang, 2013), as well as the recent COCOA method (Jaggi et al., 2014; Ma et al., 2015; Smith et al., 2016). In contrast to more complex asynchronous frameworks such as parameter servers (Li et al., 2014; Microsoft, 2015), synchronous schemes have tighter theoretical convergence guarantees and are easier to accurately benchmark, allowing for a more isolated study of system performance measures, such as the communication bottleneck and framework overheads. It is well known that for such methods, the mini-batch size serves as a tuning parameter to efficiently control the trade-off between communication and computation, and we will see that the optimal mini-batch size shows a very sensitive dependence on the algorithm and the programming framework being used.

For our experiments we have chosen the state-of-the-art COCOA algorithm and implemented the same algorithm on the different programming frameworks considered. COCOA is applicable to a wide range of generalized linear machine learning problems. While having the same flexible communication patterns as mini-batch SGD and SCD, COCOA improves training speed by allowing immediate local updates per worker, leading to up to 50× faster training than standard distributed solvers such as those provided by MLlib (Meng et al., 2016). Our results can provide guidance to developers and data scientists regarding the best way to implement machine learning algorithms in SPARK, as well as provide valuable insight into how the optimal parameters of such algorithms depend on the underlying system and implementation.

Figure 1. Data partitioned across multiple machines in a distributed environment. Arrows indicate the synchronous communication per round, that is, sending one vector update from each worker, and receiving back the resulting sum of the updates (AllReduce communication pattern in MPI and SPARK).
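The per-round exchange described in the caption of Figure 1, where each worker sends one update vector and receives back the sum of all updates, can be sketched in a few lines. This is a single-process simulation for illustration only; it does not use the MPI or SPARK APIs, and the function name `all_reduce_sum` is hypothetical.

```python
from typing import List


def all_reduce_sum(updates: List[List[float]]) -> List[List[float]]:
    """Simulate one synchronous AllReduce round (illustrative only).

    updates[k] is the update vector sent by worker k. The return value
    holds, for every worker, a copy of the element-wise sum of all updates.
    """
    if not updates:
        return []
    dim = len(updates[0])
    if any(len(u) != dim for u in updates):
        raise ValueError("all update vectors must have the same length")
    # Aggregate: element-wise sum over the workers' update vectors.
    total = [sum(u[j] for u in updates) for j in range(dim)]
    # Every worker receives its own copy of the aggregated vector.
    return [list(total) for _ in updates]
```

Calling it with three workers that send `[1.0, 2.0]`, `[3.0, 4.0]` and `[5.0, 6.0]` hands each of them the same vector `[9.0, 12.0]`.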
2. The Challenge of Distributed Learning
In distributed learning, we assume that the data is partitioned across multiple worker nodes in a cluster, and these machines are connected over a network to the master node as illustrated in Figure 1. We wish to learn the best classification or regression model from the given training data, where every machine has access to only part of the data and some shared information that is periodically exchanged over the network. This periodic exchange of information is what makes machine learning problems challenging in a distributed setting because worker nodes operate only on a subset of the data, and unless local information from every worker is diffused to every other worker the accuracy of the final model can be compromised. This exchange of information however is usually very expensive relative to computation. This challenge has driven a significant effort in recent years to develop novel methods enabling communication-efficient distributed training of machine learning models.
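To make the cost of the periodic exchange concrete, the following is a toy per-round timing model, not a measurement. It assumes that one synchronous round costs the local compute time plus one fixed communication cost. The names `round_time` and `communication_fraction` and the parameters `t_sample` and `t_comm` are illustrative assumptions, expressed in arbitrary time units.

```python
def round_time(local_samples: int, t_sample: float, t_comm: float) -> float:
    """Time of one synchronous round under the toy model.

    local_samples: samples each worker processes between two exchanges
    t_sample:      assumed compute time per sample (arbitrary units)
    t_comm:        assumed fixed cost of one exchange (arbitrary units)
    """
    if local_samples <= 0:
        raise ValueError("local_samples must be positive")
    return local_samples * t_sample + t_comm


def communication_fraction(local_samples: int, t_sample: float, t_comm: float) -> float:
    """Share of a round that is spent communicating rather than computing."""
    return t_comm / round_time(local_samples, t_sample, t_comm)
```

With `t_sample = 0.5` and `t_comm = 8.0`, a worker that processes 16 samples per round spends half of each round communicating; doing more local work per round lowers that share, at the price of exchanging information less often.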
Distributed variants of single-machine algorithms such as mini-batch SGD and SCD are well-known workhorses in this context, where, in every round, each worker processes a small fixed number H of local data samples in order to compute an update to the parameter vector; see, e.g., (Dekel et al., 2012; Zinkevich et al., 2010) and (Richtarik & Takac, 2015; Shalev-Shwartz & Zhang, 2013). The update of each worker is then communicated to the master node, which aggregates them, computes the new parameter vector and broadcasts this information back to the workers. A useful property of the mentioned algorithms is that the size of the mini-batch, H, can be chosen freely. Hence, H allows control of the amount of work that is done locally between two consecutive rounds of communication. On a real-world compute system, the costs associated with communication and computation are typically very different, and thus the parameter H allows one to optimally trade off these costs to achieve the best overall performance.
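The round structure described above (each worker processes H local samples, the master aggregates the updates and broadcasts the new parameter vector) can be sketched as a single-process simulation. This is a minimal sketch under stated assumptions, not the implementation evaluated in this paper: it assumes a squared loss, noise-free synthetic data, a fixed step size, and averaging as the aggregation rule, and all function names are hypothetical.

```python
import random
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (feature vector, label)


def make_partitions(num_workers: int, per_worker: int, seed: int = 0) -> List[List[Sample]]:
    """Hypothetical noise-free data y = 2*x0 - x1, split across workers."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(num_workers):
        part = []
        for _ in range(per_worker):
            x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
            part.append((x, 2.0 * x[0] - x[1]))
        partitions.append(part)
    return partitions


def local_update(w: List[float], data: List[Sample], H: int, step: float,
                 rng: random.Random) -> List[float]:
    """Worker side: process H local samples and return an update to w."""
    delta = [0.0] * len(w)
    for _ in range(H):
        x, y = rng.choice(data)
        residual = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j, xj in enumerate(x):
            # Squared-loss gradient step, averaged over the mini-batch.
            delta[j] -= step * residual * xj / H
    return delta


def mean_squared_error(w: List[float], partitions: List[List[Sample]]) -> float:
    errors = [(sum(wj * xj for wj, xj in zip(w, x)) - y) ** 2
              for part in partitions for x, y in part]
    return sum(errors) / len(errors)


def train(partitions: List[List[Sample]], dim: int, H: int, rounds: int,
          step: float = 0.5, seed: int = 0) -> List[float]:
    """Run synchronous rounds; H sets the local work between two exchanges."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(rounds):
        # Each worker computes its update from the same broadcast w.
        updates = [local_update(w, part, H, step, rng) for part in partitions]
        # Master side: aggregate (here: average, an assumption) and broadcast.
        w = [wj + sum(u[j] for u in updates) / len(updates)
             for j, wj in enumerate(w)]
    return w
```

For instance, `train(make_partitions(4, 50), dim=2, H=8, rounds=200)` runs 200 communication rounds with 8 local samples per worker per round; raising `H` shifts work from communication to local computation.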
2.1. Algorithmic Framework
Similar to mini-batch SGD and SCD, the state-of-the-art COCOA framework (Jaggi et al., 2014; Ma et al., 2015; Smith et al., 2016) allows the user to fr