
Inductive-bias-driven Reinforcement Learning for Efficient Schedules in Heterogeneous Clusters

Subho S. Banerjee 1 Saurabh Jha 1 Zbigniew T. Kalbarczyk 1 Ravishankar K. Iyer 1

Abstract

The problem of scheduling workloads onto heterogeneous processors (e.g., CPUs, GPUs, FPGAs) is of fundamental importance in modern data centers. Current system schedulers rely on application/system-specific heuristics that have to be built on a case-by-case basis. Recent work has demonstrated ML techniques for automating the heuristic search by using black-box approaches, which require significant training data and time, making them challenging to use in practice. This paper presents Symphony, a scheduling framework that addresses the challenge in two ways: (i) a domain-driven Bayesian reinforcement learning (RL) model for scheduling, which inherently models the resource dependencies identified from the system architecture; and (ii) a sampling-based technique to compute the gradients of a Bayesian model without performing full probabilistic inference. Together, these techniques reduce both the amount of training data and the time required to produce scheduling policies that significantly outperform black-box approaches by up to 2.2×.

1. Introduction

The problem of scheduling workloads on heterogeneous processing fabrics (i.e., accelerated datacenters including GPUs, FPGAs, and ASICs; e.g., Asanovic (2014); Shao & Brooks (2015)) is at its core an intractable NP-hard problem (Mastrolilli & Svensson, 2008; 2009). System schedulers generally rely on application- and system-specific heuristics with extensive domain-expert-driven tuning of scheduling policies (e.g., Isard et al. (2009); Giceva et al. (2014); Lyerly et al. (2018); Mars et al. (2011); Mars & Tang (2013); Ousterhout et al. (2013); Xu et al. (2018); Yang et al. (2013); Zhang et al. (2014); Zhuravlev et al. (2010); Zaharia et al. (2010)). Such heuristics are difficult to generate, as variations across applications and system configurations mean that significant amounts of time and money must be spent in painstaking heuristic searches. Recent work has demonstrated machine learning (ML) techniques (Delimitrou & Kozyrakis, 2013; 2014; Mao et al., 2016; 2018) for automating heuristic searches by using black-box approaches, which require significant training data and time, making them challenging to use in practice.

¹University of Illinois at Urbana-Champaign, USA. Correspondence to: Subho S. Banerjee <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

This paper presents Symphony, a scheduling framework that addresses the challenge in two ways: (i) we use a domain-guided Bayesian-model-based partially observable Markov decision process (POMDP) (Astrom, 1965; Kaelbling et al., 1998) to decrease the amount of training data (i.e., sampled trajectories); and (ii) a sampling-based technique that allows one to compute the gradients of a Bayesian model without performing full probabilistic inference. We thus significantly reduce the costs of (i) running a large heterogeneous computing system that uses an efficient scheduling policy; and (ii) training the policy itself.

Reducing Training Data. State-of-the-art methods for choosing an optimal action in POMDPs rely on training of neural networks (NNs) (Mnih et al., 2016; Dhariwal et al., 2017). As these approaches are model-free, training of the NN requires large quantities of data and time to compute meaningful policies. In contrast, we provide an inductive bias for the reinforcement learning (RL) agent by encoding domain knowledge as a Bayesian model that can infer the latent state from observations, while at the same time leveraging the scalability of deep learning methods through end-to-end gradient descent. In the case of scheduling, our inductive bias is a set of statistical relationships between measurements from microarchitectural monitors (Dreyer & Alpert, 1997). To the best of our knowledge, this is the first paper to exploit those relationships and measurements to infer resource utilization in the system (i.e., latent state) to build RL-based scheduling policies.

Reducing Training Time. The addition of the inductive bias, while making the training process less data-hungry (i.e., requiring fewer workload executions to train the model), comes at the cost of additional training time: the cost of performing full Bayesian inference at every training step (Dagum & Luby, 1993; Russell et al., 1995; Binder et al., 1997).


Figure 1. Performance degradation due to PCIe contention between GPU and NIC (averaged over 10 runs). [Plot: bandwidth (GB/s) vs. message size (2⁸–2²² bytes) for an isolated GPU-to-GPU flow and the same flow under NIC contention (up to a 1.8× gap); inset: PCIe topology with GPUs and a NIC behind a PCIe switch and root complex.]

It is this cost that makes the use of deep RL techniques in dynamic real-world deployments (which require periodic retraining) prohibitively expensive. To address that issue, we have developed a procedure for computing the gradient of variables in the above Bayesian model without requiring full inference computation, unlike prior work (Russell et al., 1995; Binder et al., 1997). The key is to calculate the gradient by generating samples from the model, which is computationally simpler than inferring the posterior distribution.

Need for New Scheduler. Current schedulers prioritize the use of simple generalized heuristics and coarse-grained resource bucketing (e.g., core counts, free memory) to make scheduling decisions. Hence, even though they are perceived to perform well in practice, they do not model complex emergent heterogeneous compute platforms and hence leave a lot to be desired. Consider the case of a distributed data processing framework that uses two GPUs to perform a halo exchange.¹ Fig. 1 shows the performance (here, bandwidth) of the exchange as "isolated" performance. If the application were to concurrently perform distributed network communication, we would observe that the original GPU-to-GPU communication is affected because of PCIe bandwidth contention at shared links (i.e., a "hidden" resource that is not often exposed to the user). Such behavior is shown as "contention" in Fig. 1, and can cause as much as a 0–1.8× slowdown, depending on the size of the transmitted messages. Traditional approaches would either have such a heuristic manually searched and incorporated into a scheduling policy, or would expect it to be found automatically as part of the training of a black-box ML model; both approaches can require significant effort in profiling/training. In contrast, our approach allows the utilization of architectural resources (in this case, of the PCIe network) to serve as an inductive bias for the RL agent, thereby allowing the training process to automatically home in on such resources of interest, without having to identify the resource's importance manually.

Results. The Symphony framework reduces the average job completion time over hand-tuned scheduling heuristics by as much as 32%, and to within 6% of the time taken by an oracle scheduler. It also achieves a training time improvement of 4× compared to full Bayesian inference based on belief propagation. Further, the technique outperforms black-box ML techniques by 2.2× in terms of training time. We believe that Symphony is also representative of RL applied to several other control-related problems (e.g., industrial scheduling, data center network scheduling) where data-driven approaches can be augmented with domain knowledge to build sample-efficient RL-agents.

¹A halo exchange occurs due to communication arising between parallel processors computing overlapping pieces of data, called halo regions, that need to be periodically updated.

2. Background

Partially Observable Markov Decision Processes. A POMDP is a stochastic model that describes the relationship between an agent and its environment. It is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \Omega, O, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and $\Omega$ is the observation space. We use $s_t \in \mathcal{S}$ to denote the hidden state at time $t$. When an action $a_t \in \mathcal{A}$ is executed, the state changes according to the transition distribution $s_{t+1} \sim \mathcal{T}(s_{t+1} \mid s_t, a_t)$. Subsequently, the agent receives a noisy or partially occluded observation $o_{t+1} \in \Omega$ according to the distribution $o_{t+1} \sim O(o_{t+1} \mid s_{t+1}, a_t)$, and a reward $r_{t+1} \in \mathbb{R}$ according to the distribution $r_{t+1} \sim \mathcal{R}(r_{t+1} \mid s_{t+1}, a_t)$.

An agent acts according to its policy $\pi(a_t|s_t)$, which returns the probability of taking action $a_t$ at time $t$. The agent's goal is to learn a policy $\pi$ that maximizes the expected future reward $J = \mathbb{E}_{\tau \sim p(\tau)}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right]$ over trajectories $\tau = (s_0, a_0, \ldots, a_{T-1}, s_T)$ induced by its policy, where $\gamma \in [0, 1)$ is the discount factor. In general, a POMDP agent must infer the belief state $b_t = \Pr(s_t \mid o_1, \ldots, o_t, a_0, \ldots, a_{t-1})$, which is used to calculate $\pi(a_t|s_t)$ where $s_t \sim b_t$. In the remainder of the paper, we use $\pi(a_t|s_t)$ and $\pi(a_t|b_t)$ interchangeably.
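For concreteness, the following minimal sketch (ours, not from the paper) rolls out a toy POMDP and Monte-Carlo-estimates the objective $J$; the `ToyPOMDP` environment and the random policy are hypothetical stand-ins.

```python
import random

class ToyPOMDP:
    """Hypothetical two-state POMDP illustrating the
    s_t -> (o_{t+1}, r_{t+1}) interface defined above."""
    def reset(self):
        self.s = 0                            # hidden state s_t
        return 0.0                            # initial observation
    def step(self, a):
        self.s = (self.s + a) % 2             # s_{t+1} ~ T(.|s_t, a_t)
        o = self.s + random.gauss(0.0, 0.1)   # noisy observation o_{t+1}
        r = 1.0 if self.s == 1 else 0.0       # reward r_{t+1}
        return o, r

def estimate_J(env, policy, gamma=0.99, T=20, n_traj=1000):
    """Monte-Carlo estimate of J = E[sum_{t=1}^{T} gamma^{t-1} r_t]."""
    total = 0.0
    for _ in range(n_traj):
        o = env.reset()
        ret, discount = 0.0, 1.0
        for _ in range(T):
            o, r = env.step(policy(o))
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_traj

print(estimate_J(ToyPOMDP(), policy=lambda o: random.choice([0, 1])))
```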

Related Work. Finding solutions for many POMDPs involves (i) estimating the transition model $\mathcal{T}$ and observation model $O$, (ii) performing inference under this model, and (iii) choosing an action based on the inferred belief state. Prior work in this area has extensively explored the use of NNs, particularly recurrent NNs (RNNs), as universal function approximators for (i) and (iii) above, because they can be easily trained and have efficient inference procedures (e.g., Hausknecht & Stone (2015); Narasimhan et al. (2015); Mnih et al. (2015); Jaderberg et al. (2016); Foerster et al. (2016); Karkus et al. (2017); Zhu et al. (2018)). Neural networks have proven to be extremely effective at learning, but usually require a lot of data (for RL-agents, sampled trajectories, which may be prohibitively expensive to acquire for certain classes of applications, such as scheduling). The ability to incorporate explicit domain knowledge (which in the case of scheduling is based on system design invariants) could significantly reduce the amount of data required. To


that end, other work (Karkus et al., 2017; Silver et al., 2017; Igl et al., 2018) has advocated the integration of probabilistic models (including Bayesian filter models) for (i) above. The significant computational cost of learning and inference in such deep probabilistic models has spurred the use of approximation techniques for training and inference, including NN-based approximations of Bayesian inference (Karkus et al., 2017; Zhu et al., 2018) and variational inference methods (Igl et al., 2018).

In this paper, we too advocate the use of a domain-driven probabilistic model for $b_t$ that can be trained through end-to-end back-propagation to compute a policy. Specifically, the technique handles the gradient descent procedure on a Bayesian network (BN) with known structure and incomplete observations without performing inference on the BN, requiring only the generation of samples from the model. That approach is different from prior work on learning BNs using gradient descent (Russell et al., 1995; Binder et al., 1997) or expectation maximization, both of which require full posterior inference at every training step.

Actor-Critic Methods. Actor-critic methods (Konda & Tsitsiklis, 2000) have previously been proposed for learning the parameters $\rho$ of an agent's policy $\pi_\rho(a_t|s_t)$. Here, (i) the "Critic" estimates the value function $V(s)$, and (ii) the "Actor" updates the policy $\pi(a|s)$ in the direction suggested by the Critic. In this paper, we use $n$-step learning with the asynchronous advantage actor-critic (A3C) method (Mnih et al., 2016). For $n$-step learning, starting at time $t$, the current policy performs $n_s$ consecutive steps in $n_e$ parallel environments. The gradient updates of $\pi$ and $V$ are based on that mini-batch of size $n_e n_s$. The target for the value function $V_\eta(s_{t+i})$, $i \in [0, n_s)$, parameterized by $\eta$, is the discounted sum of on-policy rewards up until $t + n_s$ and the off-policy bootstrapped value $V^*_\eta(s_{t+n_s})$. Using the advantage function $A^{t,i}_\eta = \left(\sum_{j=0}^{n_s-i-1} \gamma^j r_{t+i+j}\right) + \gamma^{n_s-i} V^*_\eta(s_{t+n_s}) - V_\eta(s_{t+i})$, the policy and value losses are

$$L^A_t(\rho) = -\frac{1}{n_e n_s} \sum_{e=0}^{n_e-1} \sum_{i=0}^{n_s-1} \mathbb{E}_{s_{t+i} \sim b_{t+i}}\left[\log \pi_\rho(a_{t+i}|s_{t+i}) \, A^{t,i}_\eta(s_{t+i}, a_{t+i})\right] \tag{1a}$$

$$L^V_t(\eta) = \frac{1}{n_e n_s} \sum_{e=0}^{n_e-1} \sum_{i=0}^{n_s-1} \mathbb{E}_{s_{t+i} \sim b_{t+i}}\left[A^{t,i}_\eta(s_{t+i}, a_{t+i})^2\right]. \tag{1b}$$
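To make the $n$-step losses concrete, here is a minimal NumPy sketch (ours, not the paper's implementation) of the advantage $A^{t,i}_\eta$ and of Eqns. 1a–1b for a single environment ($n_e = 1$); the rollout arrays are hypothetical inputs produced by the actor and critic.

```python
import numpy as np

def a3c_losses(rewards, values, bootstrap, log_pi_taken, gamma=0.99):
    """n-step A3C losses for one environment rollout (n_e = 1).

    rewards:      r_{t+1} .. r_{t+n_s}             shape (n_s,)
    values:       V_eta(s_{t+i}), i in [0, n_s)    shape (n_s,)
    bootstrap:    V*_eta(s_{t+n_s})                scalar
    log_pi_taken: log pi_rho(a_{t+i} | s_{t+i})    shape (n_s,)
    """
    ns = len(rewards)
    returns = np.empty(ns)
    acc = bootstrap
    for i in reversed(range(ns)):   # discounted n-step returns
        acc = rewards[i] + gamma * acc
        returns[i] = acc
    adv = returns - values          # advantage A_eta^{t,i}
    policy_loss = -np.mean(log_pi_taken * adv)   # Eqn. 1a
    value_loss = np.mean(adv ** 2)               # Eqn. 1b
    return policy_loss, value_loss

# Example usage with random rollout data:
rng = np.random.default_rng(0)
print(a3c_losses(rng.uniform(0, 1, 5), rng.uniform(0, 1, 5),
                 0.5, np.log(rng.uniform(0.1, 1, 5))))
```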

3. Training the POMDP RL-Agent with Back-Propagation

We consider a special case of the POMDP formulation presented above (illustrated in Fig. 2). We assume that the domain knowledge about the environment of the RL-agent is presented as a joint probability distribution $\Pr(s_t, a_{t-1}, o_t; \Theta_{BN})$ that can be factorized as a BN (with parameters $\Theta_{BN}$).

Figure 2. The proposed RL architecture. [Diagram: observations $o_t$ and actions $a_{t-1}$ feed the BN model $\Pr(s_t, a_{t-1}, o_t; \Theta_{BN})$; an inference procedure produces the belief state $b_t$, which drives the Actor $f_\pi(b_t; \Theta_\pi)$ and the Critic $f_V(b_t; \Theta_V)$.]

A BN is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). We use probabilistic inference on the BN to calculate an estimate of the belief state $b_t$. $b_t$ is then used in an NN $f_\pi(b_t; \Theta_\pi)$ (with parameters $\Theta_\pi$) to approximate the RL-agent's policy, and an NN $f_V(b_t; \Theta_V)$ (with parameters $\Theta_V$) to approximate the state-based value function. We refer to all the parameters of the model as $\Theta = (\Theta_{BN}, \Theta_\pi, \Theta_V) = (\rho, \eta)$. The model is then trained by propagating the gradient of the total loss $\nabla_\Theta L^{RL}_t = \nabla_\Theta L^A_t(\rho) + \nabla_\Theta L^V_t(\eta)$. Estimating this gradient requires us to compute $\nabla_{\Theta_{BN}} b_t$. Traditional methods for computing the gradient require inference computation (Russell et al., 1995; Binder et al., 1997). However, even approximate inference in such models is known to be NP-hard (Dagum & Luby, 1993). Below, we describe an algorithm for approximating the gradient without requiring computation of full Bayesian inference. All that is required is the ability to generate samples from the BN. Only the subset of the BN necessary for generation of the samples is expanded. The samples are then used as a representation of the distribution of the BN. As a result, the proposed method decouples the training of the BN from the inference procedure used on it to calculate $b_t$.

3.1. The Bayesian Network & Its Gradient

Let the BN described above be a DAG $(V, E)$, and let $X = \{X_v \mid v \in V\}$ be a set of random variables indexed by $V$. Associated with each node $X$ is a conditional probability density function $\Pr(X \mid \wp(X))$, where $\wp(X)$ denotes the parents of $X$ in the graph. We assume that we are given (i) an efficient algorithm for sampling values of $X$ given $\wp(X)$, and (ii) a function $f_X(x, y; \theta_X) = \Pr_{\theta_X}(X = x \mid \wp(X) = y)$ whose partial derivative with respect to $\theta_X$ is known and efficiently computable. The BN can also have deterministic relationships between two random variables, under the assumption that the relationship is a differentiable diffeomorphism. That is, for random variables $X$, $Y$, and diffeomorphism $F$, $\Pr(Y = y) = \Pr(X = F^{-1}(y)) \, |DF^{-1}(y)|$, where $DF^{-1}$ is the Jacobian of $F^{-1}$.
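As a quick numerical check (ours) of the change-of-variables rule above, take the diffeomorphism $F(x) = e^x$ applied to a standard normal $X$, for which the transformed density is the standard log-normal:

```python
import numpy as np
from scipy import stats

# Y = F(X) with F(x) = exp(x) and X ~ N(0, 1):
# Pr(Y = y) = Pr(X = F^{-1}(y)) * |dF^{-1}/dy| = N(log y; 0, 1) / y
y = 2.0
density = stats.norm.pdf(np.log(y)) * abs(1.0 / y)
print(density, stats.lognorm.pdf(y, s=1.0))   # both ~= 0.157
```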

Computing Gradient. For a random variable $X$ in the BN, we define its parents as $\wp(X)$ and its ancestor set as $\Xi(X) = \{Y \mid Y \rightsquigarrow X \wedge Y \notin \wp(X)\}$ (where $\rightsquigarrow$ represents a directed path in the BN). We now define a procedure to approximately compute the gradient of $X$ with respect to $\Theta_{BN}$. We do so in two parts: (i) $\partial \Pr(X = x \mid \xi = a) / \partial \theta_X$, and (ii) $\nabla_{\Theta_{BN} \setminus \theta_X} \Pr(X = x \mid \xi = a)$, for $\xi \subseteq \Xi(X)$. First,

$$\frac{\partial \Pr(X = x \mid \xi = a)}{\partial \theta_X} = \frac{\partial}{\partial \theta_X} \int \Pr(\wp(X) = y \mid \xi = a) \, \Pr(X = x \mid \wp(X) = y, \xi = a) \, dy$$

$$= \frac{\partial}{\partial \theta_X} \int \Pr(\wp(X) = y \mid \xi = a) \, f_X(x, y; \theta_X) \, dy = \int \Pr(\wp(X) = y \mid \xi = a) \, \frac{\partial f_X(x, y; \theta_X)}{\partial \theta_X} \, dy$$

$$\approx \sum_{i=1}^{S} \frac{n_S(a, y_i)}{n_S(a)} \frac{\partial f_X(x, y_i; \theta_X)}{\partial \theta_X}. \tag{2}$$

Here, $S$ samples are drawn from a variable (or set of variables) $Z$ such that $n_S(j)$ is the number of times the value $j$ appears in the set of samples $\{z_i\}$, i.e., $n_S(j) = \sum_{i=1}^{S} \mathbb{1}\{z_i = j\}$. Next,

$$\nabla_{\Theta_{BN} \setminus \theta_X} \Pr(X = x \mid \xi = a) = \nabla_{\Theta_{BN} \setminus \theta_X} \int \Pr(\wp(X) = y \mid \xi = a) \, \Pr(X = x \mid \wp(X) = y, \xi = a) \, dy$$

$$= \int f_X(x, y; \theta_X) \, \nabla_{\Theta_{BN} \setminus \theta_X} \Pr(\wp(X) = y \mid \xi = a) \, dy$$

$$\approx \sum_{i=1}^{S} \frac{n_S(y_i)}{S} \, f_X(x, y_i; \theta_X) \, \nabla_{\Theta_{BN} \setminus \theta_X} \Pr(\wp(X) = y_i \mid \xi = a). \tag{3}$$

When $|\wp(X)| > 1$, the variables in $\wp(X)$ might not be conditionally independent given $\Xi(X)$. Hence, we find a set of nodes $N$ such that $I \perp J \mid \Xi(X) \cup N$ for all $I, J \in \wp(X)$. Then,

$$\Pr(\wp(X) = y_i \mid \xi = a) = \int \Pr(N = n \mid \xi = a) \, \Pr(\wp(X) = y \mid N = n, \xi = a) \, dn$$

$$= \int \Pr(N = n \mid \xi = a) \prod_{j=1}^{m} \Pr(P_j = y_j \mid N = n, \xi = a) \, dn \approx \sum_{k=1}^{S} \frac{n_S(a, n_k)}{n_S(a)} \prod_{j=1}^{m} \Pr(P_j = y_j \mid N = n_k, \xi = a), \tag{4}$$

where $\wp(X) = (P_1, \ldots, P_m)$ and $y_i = (y_{i,1}, \ldots, y_{i,m})$. Thus, we obtain

$$\nabla_{\Theta_{BN} \setminus \theta_X} \Pr(\wp(X) = y_i \mid \xi = a) \approx \sum_{k=1}^{S} \frac{n_S(a, n_k)}{n_S(a)} \, \nabla_{\Theta_{BN} \setminus \theta_X} \prod_{j=1}^{m} \Pr(P_j = y_{i,j} \mid N = n_k, \xi = a)$$

$$= \sum_{k=1}^{S} \frac{n_S(a, n_k)}{n_S(a)} \sum_{l=1}^{m} \left[ \prod_{h=1, h \neq l}^{m} \Pr(P_h = y_{i,h} \mid N = n_k, \xi = a) \right] \nabla_{\Theta_{BN} \setminus \theta_X} \Pr(P_l = y_{i,l} \mid N = n_k, \xi = a)$$

$$\approx \sum_{k=1}^{S} \frac{n_S(a, n_k)}{n_S(a)} \sum_{l=1}^{m} \left[ \prod_{h=1, h \neq l}^{m} \frac{n_S(y_{i,h}, a, n_k)}{n_S(a, n_k)} \right] \underbrace{\nabla_{\Theta_{BN} \setminus \theta_X} \Pr(P_l = y_{i,l} \mid N = n_k, \xi = a)}_{\text{expand by recursion using Eqns. 2, 3, and 5}}. \tag{5}$$

The term $\nabla_{\Theta_{BN} \setminus \theta_X} \Pr(P_l = y_{i,l} \mid N = n_k, \xi = a)$ represents the gradient operator on a subset of the original BN, containing only the ancestors (from the BN's graphical structure) of $X$. Hence, that gradient term can be recursively expanded using Eqns. 2, 3, and 5. Repeating that process for all variables in $b_t$ allows us to calculate $\nabla_{\Theta_{BN}} b_t$.

Computational Complexity. The cost of computing Eqns. 2 and 3 is $O(S)$. The cost of computing Eqn. 5 is $O(mS)$. The cost of finding $N$ is $O(|\wp(s_t)|^2 (|V| + |E|))$ (i.e., the cost of running the Bayes ball algorithm (Shachter, 2013) for every pair of nodes in $\wp(X)$). The total computational complexity of the entire procedure hinges on the number of times Eqns. 2, 4, and 5 are executed, which we refer to as $Q$. $Q$ depends on the size of $N$ and on the graphical structure of the BN. Hence, the total cost of computing $\nabla_{\Theta_{BN}} b_t$ is $O(Q(|\wp(s_t)|^2 (|V| + |E|) + mS))$ (where $|\wp(s_t)| \leq |V| - 1$), which is computed $n_s n_e |b_t|$ times during training. Note that for a polytree BN (the graphical structure of the BN we use in §4), $N = \emptyset$ and $Q \leq |V|$. This is still better than belief propagation on the polytree with the gradient computation technique from Russell et al. (1995); Binder et al. (1997), which is $O(|V| \max_{v \in V} \mathrm{dom}(X_v))$, where $\mathrm{dom}(X)$ is the size of the domain of $X$, which could be exponentially large.
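For intuition about Eqn. 2, the following toy NumPy sketch (ours, under the stated assumptions) applies the sample-frequency estimator to a two-node BN $Z \to X$ in which the derivative is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy BN: Z -> X with Z ~ Bernoulli(0.6) and X | Z=z ~ Bernoulli(theta[z]),
# so f_X(1, z; theta) = theta[z], d f_X / d theta[z] = 1, and the exact
# gradient is d Pr(X=1) / d theta[z] = Pr(Z=z), i.e., (0.4, 0.6).
theta = np.array([0.2, 0.7])
S = 100_000
z = (rng.random(S) < 0.6).astype(int)   # samples of the parent Z

# Eqn. 2 with xi = {}: weight each parent value by its sample frequency.
grad = np.array([np.mean(z == v) * 1.0 for v in (0, 1)])
print(grad)   # ~= [0.4, 0.6]
```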

4. Scheduling Data Center Workloads by Using Reinforcement Learning

We now demonstrate an application of the POMDP model and training methodology presented in §3 to the problem of scheduling tasks on a heterogeneous processing fabric that includes CPUs, GPUs, and FPGAs. The model integrates real-time performance measurements, prior knowledge about workloads, and system architecture to (i) dynamically infer system state (i.e., resource utilization), and (ii) automatically schedule tasks on a heterogeneous processing fabric.


Figure 3. Architecture of the Symphony ML model. [Diagram: PMU measurements $o_t$ feed Bayesian inference on the BN model to produce resource utilizations $b_t$; graph networks embed $b_t$ (on the system topology) and the data flow graph; an LSTM and fully connected layers form the Actor $f_\pi(b_t; \Theta_\pi)$ and Critic $f_V(b_t; \Theta_V)$, emitting an action $a_t \sim \pi(a_t|b_t)$ to the computer systems.]

Workload & Programming Model. The system workload consists of multiple user programs, and each program is expressed as a data flow graph (DFG). A DFG is a DAG in which the nodes represent computations (which we refer to as kernels, e.g., matrix multiplication) and the edges represent input-output relationships between the nodes. Prior work has shown that a large number of applications can be expressed as compositions of such kernels (Asanovic et al., 2009; Banerjee et al., 2016). Prominent examples of such compositions include modern data analytics and ML frameworks that describe workloads as DFGs (Abadi et al., 2016; Chambers et al., 2010; McCool et al., 2012; Zaharia et al., 2012). We assume that the kernels are known ahead of time and have multiple implementations available for different processors and accelerators. That assumption holds for many ML workloads; for other workloads, it is an area of active research in which accelerator designers and architects are trying to decompose larger applications into smaller pieces. Once trained, our approach can schedule any composition (DFG) of the kernels, but requires retraining when the set of available kernels changes.

POMDP Architecture. The overall architecture of the Symphony POMDP model is illustrated in Fig. 3. The first part of the POMDP models the latent state $b_t$ of the computer system. For the scheduling problem, $b_t$ corresponds to the resource utilization of the various components of the computer system. Utilization of some of the resources can be measured directly in software (e.g., the amount of free memory); however, the different layers of abstraction of the computing stack hide others from direct measurement. For example, consider the example in Fig. 1 in §1; there, PCIe link bandwidth cannot be directly measured. However, it can be measured indirectly by using the number of outstanding requests to memory from each PCIe device and the topology of the PCIe network. In essence, we statistically relate the back pressure of one resource on another, until we find a resource that can be directly measured via real-time performance counter (PC) measurements ($o_t$) (Dreyer & Alpert, 1997). We refer to resources whose utilization cannot be directly measured as hidden resources.

PCs are special-purpose registers present in CPUs and other accelerators for characterizing an application's behavior and identifying microarchitectural performance bottlenecks. Specifically, we use a BN to (i) model aleatoric uncertainty in measurements, and (ii) encode our knowledge about the system architecture in terms of invariants or statistical relationships between the measurements. Inference on that BN then gives us an accurate estimate of the latent state of the system. Second, we use an RNN (i.e., $f_\pi(\cdot)$ and $f_V(\cdot)$) to learn scheduling policies for user programs that minimize resource contention and maximize performance. These two ML models effectively decouple the system-architecture-specific and measurement-specific aspects of scheduling (the BN) from its optimization aspects (the NN). The compelling value of this architecture (and its two constituent models) is that it can automatically generate scheduling policies for the deployment of DFGs in truly heterogeneous environments (with CPUs, GPUs, and FPGAs) without requiring configuration specifics or painstakingly tuned heuristics. The model improves overall performance and resource utilization, and enables fine-grained resource sharing across workloads.
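A minimal sketch (ours; `infer_utilizations`, `f_pi`, and `f_v` are hypothetical stand-ins for the trained BN and the actor/critic networks) of how the two decoupled models compose at each scheduler invocation:

```python
import numpy as np

def schedule_step(bn, f_pi, f_v, o_t, a_prev, rng=np.random.default_rng()):
    """One scheduler invocation: BN inference estimates the hidden
    resource utilizations; the NNs pick an action and score the state."""
    b_t = bn.infer_utilizations(o_t, a_prev)   # belief state b_t
    logits = f_pi(b_t)                         # actor: action preferences
    value = f_v(b_t)                           # critic: V(b_t)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over actions
    a_t = rng.choice(len(probs), p=probs)      # sample a_t ~ pi(.|b_t)
    return a_t, value
```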

Performance Counters. PCs are generally relied upon to conduct low-level performance analysis or to tune away performance bottlenecks in applications. As the source of such bottlenecks is generally the unavailability of system resources, performance counters can naturally be used to estimate the resource utilization of a system. Another benefit of using PCs is that it is not necessary to modify an application's source code to make measurements. PCs can be grouped into three categories: (i) those pertaining to the processing fabric (CPU cores or accelerators); (ii) those pertaining to the memory subsystem; and (iii) those pertaining to the system interconnect (in our case, PCIe). Fig. 4 illustrates the organization of a computer system as well as the categories above. Fig. 5 shows a mapping between the system organization and the PCs that are used in the BN model (described below).²

BN Model. Measurements made from PCs have some inherent noise (Weaver & McKee, 2008). The measurements can only be stored in a fixed number of registers; hence, only a fixed number of measurements can be made at any one point in time. As a result, one must make successive measurements that capture marginally different system states. Particular performance counters might become unavailable (or return incorrect values). Finally, if a single scheduling agent is controlling a cluster of machines (which is common in data centers), measurements made on different machines will not be in sync and will often be delayed by network latency. As a result, PCs are often sampled $N$ times between successive scheduler invocations to get around some of the sources of error.

²A complete list of the PCs used in this paper can be found in the supplementary material.


Figure 4. Organization of a multi-CPU computer. [Diagram: two CPU sockets with RAM over a memory interconnect; each socket's PCIe root complex (PCR) and PCIe switch (PCS) connect GPUs, NICs, and FPGAs over the system interconnect; accelerator memory omitted.]

To maximize the performance-estimation fidelity, we apply statistical methods to systematically model the variance of the measurements. For a single performance counter $o_t[c]$, if the error in measurement $e_c$ can be modeled, then the measured value $m_c$ can be modeled in terms of the true value $v_c$ plus measurement noise $e_c$, i.e., $m_c = v_c + e_c$. Here, we focus only on random errors and assume zero systematic error. That is a valid assumption because the only source of systematic errors is hardware or software bugs. We assume that the error can be modeled as $e_c \sim \mathcal{N}(0, \sigma)$ for some unknown variance $\sigma$; hence, $\Pr(m_c \mid v_c) = \mathcal{N}(m_c, \sigma)$. That follows from prior work based on extensive measurement studies (Weaver & McKee, 2008). Now, given $N$ measurements of the value of the performance counter, we compute their sample mean $\mu$ and sample variance $S$. A scaled and shifted t-distribution describes the marginal distribution of the unknown mean of a Gaussian when the dependence on variance has been marginalized out (Gelman et al., 1995); i.e., $v_c \sim \mu + \frac{S}{\sqrt{N}} \, \mathrm{Student}(\nu = N - 1)$. In our experiments, the confidence level of the t-distribution was 95%.
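As an illustration of this error model, a short SciPy sketch (ours, with synthetic counter readings) that computes the 95% interval for the true counter value $v_c$:

```python
import numpy as np
from scipy import stats

# N noisy readings of one performance counter (synthetic data).
samples = np.array([1021., 998., 1005., 1012., 990., 1003., 1008., 995.])
N = len(samples)
mu, s = samples.mean(), samples.std(ddof=1)   # sample mean / std. dev.

# v_c ~ mu + (s / sqrt(N)) * Student(nu = N - 1); 95% interval for v_c.
lo, hi = stats.t.interval(0.95, N - 1, loc=mu, scale=s / np.sqrt(N))
print(f"v_c in [{lo:.1f}, {hi:.1f}] with 95% confidence")
```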

Now, given a distribution of $v_c$ for every element of $o_t$, we describe the construction of the BN model. Our goal is to model resource utilization (a number in $[0, 1]$) for a relevant set of architectural resources $R$. To do so, we compose PC measurements ($v_c$) by using algebraic (deterministic) relationships derived from information about the CPU architecture. Processor performance manuals (Yasin, 2014; Intel Corp., 2016; Hall et al., 2017) and/or vendor contributions in OS codebases (e.g., in the perf module in Linux) provide such information. When available in the latter format (which is indeed the case for all modern Intel, AMD, ARM, and IBM CPUs), these relationships can be automatically parsed and used to construct the BN.

As our error-corrected measurements are defined in terms of distributions, the algebraic models that encode static information about relationships (based on the microarchitecture of the processor or the topology of the system) now define statistical relationships between the $v_c$'s (based on the Jacobian relationships described in §3). Fig. 5 shows an example of the BN model. However, the types and meanings of hardware counters vary from one kind of architecture to another because of the variation in hardware organizations.

Figure 5. Bayesian network (in plate notation) used to estimate resource utilization. [Nodes include FP scalar/vector instructions retired, µops issued, core/thread/socket counts, DRAM latency and bandwidth, and outstanding requests, feeding utilization estimates for the processing fabric (FP arithmetic, backend, ports, divider, core, CPU), memory interconnect (memory bandwidth, cache), and system interconnect (PCR, PCS, direct-attached and switched DMA).]

As a result, the model defined by the BN is parametric, changing with different processors and system topologies (i.e., across all the different types of systems in a data center).

Consider the example of identifying memory-bandwidth utilization for a CPU core. According to the processor documentation, the utilization can be computed by measuring the number of outstanding memory requests (which is available as a PC), i.e., $\text{Outstanding Requests}[\geq \theta_{MB}] / \text{Outstanding Requests}[\geq 1]$.³ That is, we identify the fraction of cycles in some time window for which the CPU core stalls because of insufficient bandwidth. Naturally, in order to sustain maximum performance, it is necessary to ensure that no stalls occur. The value $\theta_{MB}$ is processor-specific and might not always be known. In such cases, we use the training approach described in §3 to learn $\theta_{MB}$. The procedure is repeated for all relevant system utilization counters (marked as "Util." in Fig. 5), which together represent $b_t$. Such a BN model for a 16-core Intel Xeon processor (with all PCIe lanes populated) has 68 nodes, of which 32 are directly measured and the remainder are computed through inference.

³Here, $X[\geq t]$ counts cycles in which $X$ exceeds threshold $t$.
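The following sketch (ours) computes this ratio from hypothetical per-cycle samples of the outstanding-requests counter, given a known $\theta_{MB}$:

```python
import numpy as np

def mem_bw_utilization(outstanding, theta_mb):
    """Fraction of busy cycles stalled on memory bandwidth: cycles with
    >= theta_mb outstanding requests over cycles with >= 1 request."""
    busy = np.count_nonzero(outstanding >= 1)
    stalled = np.count_nonzero(outstanding >= theta_mb)
    return stalled / busy if busy else 0.0

samples = np.array([0, 2, 7, 9, 1, 12, 0, 8, 10, 3])  # per-cycle counts
print(mem_bw_utilization(samples, theta_mb=8))         # 0.5
```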

BN Retraining. The architectural information required to build the BN can be found in processor manuals (Intel Corp., 2016; Sudhakar & Srinivasan, 2019; Hall et al., 2017) as well as in machine-parsable databases in the Linux kernel source code, as part of the perf package. The only human intervention required in the process of building the BN is filtering out the resources that cannot be controlled with software (because they change too quickly). The BN model needs to be rebuilt only when the underlying hardware configuration changes, which Mars & Tang (2013) observe happens every 5–6 years in a data center.

Implementation Details. We collect system-wide (for all processes) performance-counter measurements for a variety of hardware events (described in Table 1). The system-wide collection leads to occasional spurious measurements (e.g., from interrupt handlers); however, it allows us to make holistic measurements (e.g., to capture system calls or drivers that perform memory and DMA operations).


Table 1. Performance counters used in the evaluation. We have disambiguated the names to ensure platform independence.

On-core events: Core clock cycles, reference clock cycles, temperature, instructions (µops for Intel) issued, instructions (µops for Intel) retired, un-utilized slots due to mis-speculation.

Un-core & memory controller events (per socket): #Read/write requests to DRAM (from all channels), #local DRAM accesses, #remote DRAM accesses, #read/write requests to DRAM (from all channels) from I/O sources, #PCIe reads, #PCIe writes, QPI (for Intel) / Nest (for IBM) transactions.

OS/driver events: Free memory (CPU, GPU, FPGA), total memory (CPU, GPU, FPGA).

We make the minimum measurements needed to infer whether a kernel scheduled to a CPU hardware thread is core-bound (floating-point- and integer-intensive). This allows us to make scheduling decisions on co-located kernels, i.e., those that get scheduled to SMT/hyperthreads bound to a core. The majority of the measurements are made at the level of un-core events, which capture the performance of the memory interconnect and the system bus, to identify kernels that are bandwidth-bottlenecked. We do not explicitly model GPU performance counters, as low-level scheduling decisions (e.g., warp-level scheduling) in GPUs are obfuscated by the NVIDIA runtime/driver.

NN Model. The second part of the POMDP-based scheduling model uses an NN (see Fig. 3) to learn the optimal policy with which to schedule user tasks given a belief state. The NN takes two graphs as inputs. The first input is the belief state $b_t$, encoded as vertex labels on a graph that describes the topology of a computer system (i.e., the organization shown in Fig. 4), together with input labels that correspond to the locations of inputs in the topology. The color coding in Figs. 4 and 5 shows a mapping (i.e., vertex labels) between nodes in the topology graph and $b_t$. The second input is the user's program expressed as a DFG. We use graph network (GN) layers (Battaglia et al., 2018) to "embed" the graphs into a set of embedding vectors. GNs have been shown to capture node, edge, and locality information. We chose small, fully connected NNs for modeling the functional transformations in the GN layers. Prior work in scheduling (e.g., Grandl et al. (2016b); Wu et al. (2012)) has shown the benefit of considering temporal information to capture the dependencies of system resources over time as well as the time evolution of the user DFG. We capture those relationships (between the embeddings of the input graphs) by using an RNN, specifically an LSTM layer (Hochreiter & Schmidhuber, 1997).

The action space $\mathcal{A}$ of the model is fixed as the number of kernels/processors available in the system and is known ahead of time. The action space consists of the following types of actions. (i) Execution actions correspond to the execution of a kernel on a processor/accelerator. (ii) Reconfiguration actions correspond to the reconfiguration of a single FPGA context to a kernel. (iii) No-Op actions correspond to not scheduling any task in a particular scheduler invocation. No-Ops are useful when system resources are maximally subscribed and the execution of more tasks would hinder performance. The scheduler is invoked every time there is an idle processor/accelerator in the system (i.e., every time a processor finishes the work assigned to it), causing the system to take one of the above actions.
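One possible encoding of these three action types (ours; all names hypothetical):

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Execute:        # run a kernel on a processor/accelerator
    kernel: int
    processor: int

@dataclass
class Reconfigure:    # reprogram one FPGA context to a kernel
    fpga: int
    kernel: int

@dataclass
class NoOp:           # defer scheduling for this invocation
    pass

Action = Union[Execute, Reconfigure, NoOp]
```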

Reward Function. The reward $r_t$ is based on the objective of minimizing the runtime of a user DFG. At time $t$, $r_t = -\sum_{i=0}^{t} 1/T_i$, where $T_i$ is the wall-clock time taken to execute the $i$-th action executing in the system at time $t$. We picked $r_t$ because it represents the "makespan" of the schedule, a metric that is popularly used in the traditional scheduling literature and accurately represents the user-facing performance of the system. Note that parallel actions are not double-counted in this formulation. The BN and NN models are trained end-to-end by minimizing Eqn. 1 through back-propagation, as described in §3.
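A one-line sketch (ours) of this reward for a set of concurrently executing actions:

```python
def reward(wall_clock_times):
    """r_t = -sum_i 1/T_i over the actions executing at time t,
    where T_i is the wall-clock time of the i-th action."""
    return -sum(1.0 / t_i for t_i in wall_clock_times)

print(reward([2.0, 4.0]))   # -0.75
```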

Implementation details of the BN and NN models are pre-sented in the supplementary material.

5. Evaluation & Discussion

We evaluated Symphony along the following dimensions. (i) How well does Symphony perform compared to the state of the art? (ii) How does Symphony's runtime affect scheduling decisions? (iii) What are the savings in training time compared to traditional methods? The evaluation testbed consisted of a rack-scale cluster of twelve IBM Power8 CPUs, two NVIDIA K40 GPUs, six K80 GPUs, and two FPGAs. We illustrate the generality of the techniques on a variety of real-world workloads that use CPUs, GPUs, and FPGAs: (i) variant calling and genotyping analysis (Van der Auwera et al., 2013) on human genome datasets, using tools presented in Banerjee et al. (2016; 2017; 2019a); Li & Durbin (2009; 2010); Langmead et al. (2009); McKenna et al. (2010); Nothaft et al. (2015); Nothaft (2015); Rimmer et al. (2014); Zaharia et al. (2011); (ii) epilepsy detection and localization (Varatharajah et al., 2017) on intracranial electroencephalography data; and (iii) online security analytics (Cao et al., 2015) for intrusion detection systems.

State of the Art. Traditional dynamic scheduling techniques (e.g., Isard et al. (2009); Giceva et al. (2014); Lyerly et al. (2018); Ousterhout et al. (2013); Zhuravlev et al. (2010); Zaharia et al. (2010)) use manually tuned heuristics


Figure 6. Comparing the performance of Symphony to that of other popular schedulers for kernel executions in DFGs. [Plot: CDF of oracle-normalized runtime for Symphony, Graphene, Paragon, and Sparrow.]

Figure 7. Percentage of application executions that show a degradation in performance. [Bar chart per scheduler (Sparrow, Graphene, Paragon, Symphony, Oracle), bucketed by slowdown: 0, 0–10%, 10–20%, >20%.]

(e.g., fairness, shortest-job-first) that prioritize simplicity and generality over achieving the best-case workload performance, often allocating coarse-grained resources (e.g., GBs of memory, CPU threads) and making simplifying assumptions about the underlying workload. Several ML-based scheduling strategies have also been proposed, wherein the above heuristics are learned from data. They use a variety of black-box ML models, e.g., model-free deep RL (Mao et al., 2016; 2018), collaborative filtering (Delimitrou & Kozyrakis, 2013; 2014), and other traditional ML techniques such as SVMs (e.g., Mars et al. (2011); Mars & Tang (2013); Yang et al. (2013); Zhang et al. (2014)). A common theme in these techniques is treating the system as a black box and scheduling to optimize application-throughput metrics. Such approaches are not well-suited to heterogeneous, accelerator-rich systems, in which architectural diversity necessitates the use of low-level resources that cannot be measured directly and are not semantically comparable across processors. As points of comparison to Symphony, we used Graphene (Grandl et al., 2016b), a heuristic-accelerated job-shop optimization solver⁴; Sparrow (Ousterhout et al., 2013), a randomized scheduler; and Paragon (Delimitrou & Kozyrakis, 2013), a collaborative-filtering-based scheduler.

⁴Graphene was not originally designed to execute on heterogeneous systems. In the supplementary material, we explain the modifications we made to the algorithm.

Baseline for Comparison. We defined the oracle schedule to correspond to the best performance possible for running an application on the evaluation system. It corresponds to a completely isolated execution of an application. Here, different concurrently executing kernels of the same application contend for resources and might cause performance degradation. For the benchmark applications, we accounted for that by exhaustively executing schedules of the application DFGs to find the one with the lowest runtime (i.e., the oracle run). We measured the runtime of kernel $i$ in workload $j$ (in the oracle run) as $t^{oracle}_{i,j}$, across all kernels and workloads. $t^{oracle}_{i,j}$ serves as the baseline for assessing the performance of Symphony.

Effectiveness of Scheduling Model. First, we quantified how well Symphony handles the scheduling of kernels in a DFG, taking into account resource contention and interference

at (i) the intra-DFG level, and (ii) when executing with an unknown co-located workload utilizing compute and I/O resources. To do so, we measured the runtimes of each of the kernels $i$ in workload $j$ (as above) to compute $t^s_{i,j}$ for each scheduler $s$ under test. In Fig. 6, we illustrate the distribution of oracle-normalized runtimes for each of the kernels in the workloads we tested, i.e., a distribution of $t^s_{i,j} / t^{oracle}_{i,j}$ across 500 executions of the three workloads above. In the figure, a distribution whose probability mass is closest to 1 is preferred, as it implies the least slowdown compared to the oracle. We observe that the proposed technique significantly outperformed the state of the art. In our experiments, the median and tail (i.e., 99th-percentile) runtimes of Symphony outperformed the second best (in this case, Paragon) by close to 32%. At the 99th percentile, the generated schedules performed at a 6% loss relative to the oracle. Next, we quantified the performance of end-to-end user workloads, shown in Fig. 7. Here, we calculated $1 - (\sum_i t^s_{i,j}) / (\sum_i t^{oracle}_{i,j})$ for all 500 runs of the DFGs and grouped them into buckets of different kinds of normalized performance. Symphony significantly outperformed the other scheduling techniques, running up to 60% of the applications with no performance loss relative to the oracle execution, and the rest with a performance loss of less than 20%.

Latency. There are two latencies to consider in comparing schedulers: the latency of the entire user workload ("LW", shown in Fig. 6) and the latency of the scheduler execution ("LS", shown in Fig. 8). In Fig. 8, we show two configurations of the Symphony scheduler: (i) "No-Opt", which uses a belief-propagation-based update for the BN (and MCMC-based inference); and (ii) "All-Opt", which uses the sampling technique described in §3, accelerators⁵ to perform inference, and task batching (described below). LW ($\geq$ LS) is the user-facing metric of interest. Symphony outperforms all baselines in terms of LW. In terms of median LS, Symphony is 1.8× and 1.6× faster than Paragon and Graphene, respectively. In contrast, Sparrow, which randomly assigns tasks to processors, has 3.6× lower median latency than Symphony. However, the reduced LS comes at the cost of increased LW (see Fig. 6).

⁵The accelerators include an NVIDIA K80 GPU for NN inference and an FPGA for BN inference using Banerjee et al. (2019b).



Figure 8. Symphony's latency ("All Opt" & "No Opt") compared to prior work. [Plot: CDF of scheduler latency (s, log scale) for Sparrow, Paragon, Graphene, No Opt, and All Opt.]

Figure 9. Symphony's performance (oracle-normalized, in %) with varying batch size. [Plot: normalized runtime (%) vs. batch size (tasks), log-log.]

Figure 10. Training time for Symphony. An iteration is 2 RL episodes of 20 steps. [Plot: loss over oracle vs. iteration count for RNN, No Opt., and All Opt.]

Batching Task Execution. A key concern with Symphony is its large tail latency (100× larger than its median; see Fig. 8) compared to the other schedulers (which have deterministic runtimes). This increased latency is brought about by Symphony having to perform significantly more computation when the RL-policy update is triggered. The scheduler latency adversely affects LW, as time spent executing scheduler calls is time not used to make progress on the user workload. To deal with this issue, our evaluation executed Symphony on batches of tasks instead of single tasks, thereby amortizing the cost of executing Symphony across the batch. Task batching works synergistically with the sampling-based gradient-propagation technique to reduce the tail latency by as much as 12× (see Fig. 8). Fig. 9 shows the average improvement in LW normalized to the oracle over a range of batch sizes. We observe that the optimal batch size is about 128 tasks per batch. This corresponds to the "All Opt" configuration in Figs. 8 and 10, as well as Figs. 6 and 7. The "No Opt" configuration in Fig. 8 is computed at a batch size of one.

Training Time. Finally, we quantified the improvement in training time offered by Symphony using the sampling-based gradient-computation methodology presented in §3. We used the following baselines for evaluation: (i) a model-free RNN (labeled “RNN” in Fig. 10); and (ii) the “All Opt.” and “No Opt.” configurations from above. The RNN model here replaces the BN (and inference) and the system-topology-embedding GN (in Fig. 3) with a 3-layer, fully connected NN to compute an embedding for $o_t$. Fig. 10 illustrates the performance of these configurations in terms of the degradation in performance of the user DFGs relative to the oracle schedule (i.e., $1 - \left(\sum_i t^{s}_{i,j}\right) / \left(\sum_i t^{\mathrm{oracle}}_{i,j}\right)$). We observe that the RNN is significantly less sample-efficient than the proposed POMDP; specifically, it is ~2.2× worse than Symphony. Further, linearly extrapolating time-to-convergence from iteration $12 \times 10^3$, the RNN would need $> 48 \times 10^3$ iterations to achieve the same accuracy as Symphony.

The difference in training time between the “No Opt.” and “All Opt.” configurations in Fig. 10 can be attributed to (i) the time taken to perform back-propagation for policy updates; and (ii) the effective scheduler latency. Linearly extrapolating the training loss, we observe that “All Opt” is at least 4.3× more sample-efficient than “No Opt” in reaching a 30% mean loss relative to the oracle. That reduction is significant because the continuous churn of user workloads and machine configurations in a cloud, as pointed out in Mars et al. (2011), would require that the scheduling model be periodically retrained. In absolute terms, the “All Opt” configuration is able to achieve ~30% mean loss relative to the oracle scheduler in 700 hours of training and ~4400 iterations of workload execution. That corresponds to approximately 500 hours of system execution; hence, the total process takes 1200 hours. Though this might appear to be over 7 weeks of time, in wall-clock time it is approximately 2 weeks, because we use parallel A3C-based training. In fact, the limiting factor here is the availability of FPGAs, of which we have only 2 in the evaluation cluster, limiting the number of RL episodes that can be run in parallel.
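As a sanity check on these figures (our arithmetic, not part of the original text):

\[
700\,\mathrm{h} + 500\,\mathrm{h} = 1200\,\mathrm{h}, \qquad \frac{1200\,\mathrm{h}}{24 \times 7\,\mathrm{h/week}} \approx 7.1\ \mathrm{weeks},
\]

so the roughly 2-week wall-clock figure implies that the parallel A3C workers overlap the serial work by a factor of about 3.6×.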

6. Conclusion

This paper presents (i) a domain-driven Bayesian RL model for scheduling that captures the statistical dependencies between architectural resources; and (ii) a sampling-based technique that allows the computation of gradients of a Bayesian model without performing full probabilistic inference. As data center architectures become more complex (Asanovic, 2014; Shao & Brooks, 2015), techniques like the one proposed here will be critical in the deployment of future accelerated applications.

Acknowledgments

We thank K. Saboo, S. Lumetta, W-M. Hwu, K. Atchley, and J. Applequist for their insightful comments on the early drafts of this manuscript. This research was supported in part by the National Science Foundation (NSF) under Grant Nos. CNS 13-37732 and CNS 16-24790; by IBM under a Faculty Award and through equipment donations; and by Xilinx and Intel through equipment donations. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF, IBM, Xilinx, or Intel.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, pp. 265–283. USENIX Association, 2016.

Asanovic, K. FireBox: A Hardware Building Block for 2020 Warehouse-Scale Computers. Santa Clara, CA, February 2014. USENIX Association.

Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubiatowicz, J., Morgan, N., Patterson, D., Sen, K., Wawrzynek, J., Wessel, D., and Yelick, K. A View of the Parallel Computing Landscape. Commun. ACM, 52(10):56–67, October 2009.

Astrom, K. J. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174–205, 1965.

Banerjee, S. S., Athreya, A. P., Mainzer, L. S., Jongeneel, C. V., Hwu, W.-M., Kalbarczyk, Z. T., and Iyer, R. K. Efficient and scalable workflows for genomic analyses. In Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing, DIDC '16, pp. 27–36, 2016.

Banerjee, S. S., El-Hadedy, M., Tan, C. Y., Kalbarczyk, Z. T., Lumetta, S. S., and Iyer, R. K. On accelerating pair-HMM computations in programmable hardware. In Proc. 27th International Conference on Field Programmable Logic and Applications, FPL 2017, Ghent, Belgium, September 4–8, 2017, pp. 1–8, 2017.

Banerjee, S. S., El-Hadedy, M., Lim, J. B., Kalbarczyk, Z. T., Chen, D., Lumetta, S. S., and Iyer, R. K. ASAP: Accelerated Short-Read Alignment on Programmable Hardware. IEEE Transactions on Computers, 68(3):331–346, March 2019a.

Banerjee, S. S., Kalbarczyk, Z. T., and Iyer, R. K. AcMC²: Accelerating Markov Chain Monte Carlo Algorithms for Probabilistic Models. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 515–528, 2019b.

Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Binder, J., Koller, D., Russell, S., and Kanazawa, K. Adaptive Probabilistic Networks with Hidden Variables. Machine Learning, 29(2/3):213–244, 1997.

Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S., and Namyst, R. hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications. In Proc. 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp. 180–186, Feb 2010.

Cao, P., Badger, E., Kalbarczyk, Z., Iyer, R., and Slagell, A. Preemptive intrusion detection: Theoretical framework and real-world measurements. In Proceedings of the 2015 Symposium and Bootcamp on the Science of Security, HotSoS '15, pp. 5:1–5:12, 2015.

Chambers, C., Raniwala, A., Perry, F., Adams, S., Henry, R., Bradshaw, R., and Weizenbaum, N. FlumeJava: Easy, Efficient Data-Parallel Pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 363–375, 2010.

Chowdhury, M., Zhong, Y., and Stoica, I. Efficient coflow scheduling with Varys. In Proceedings of the 2014 ACM Conference on SIGCOMM, SIGCOMM '14, pp. 443–454, 2014.

Dagum, P. and Luby, M. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60(1):141–153, 1993.

Delimitrou, C. and Kozyrakis, C. Paragon: QoS-aware Scheduling for Heterogeneous Datacenters. SIGPLAN Not., 48(4):77–88, March 2013.

Delimitrou, C. and Kozyrakis, C. Quasar: Resource-efficient and QoS-aware Cluster Management. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, pp. 127–144, 2014.

Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Doweck, J. Inside 6th generation Intel Core codenamed Skylake: New Microarchitecture and Power Management. https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.23-Tuesday-Epub/HC28.23.90-High-Perform-Epub/HC28.23.911-Skylake-Doweck-Intel_SK3-r13b.pdf, 2016. Accessed 2019-03-05.

Dreyer, R. S. and Alpert, D. B. Apparatus for monitoring the performance of a microprocessor, August 1997. US Patent 5,657,253.

Foerster, J. N., Assael, Y. M., de Freitas, N., and Whiteson, S. Learning to communicate to solve riddles with deep distributed recurrent Q-networks. arXiv preprint arXiv:1602.02672, 2016.


Gelman, A., Carlin, J., Stern, H., and Rubin, D. Bayesian Data Analysis. Chapman & Hall, New York, 1995.

Giceva, J., Alonso, G., Roscoe, T., and Harris, T. Deployment of query plans on multicores. Proc. VLDB Endow., 8(3):233–244, November 2014.

Grandl, R., Chowdhury, M., Akella, A., and Ananthanarayanan, G. Altruistic Scheduling in Multi-resource Clusters. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, pp. 65–80. USENIX Association, 2016a.

Grandl, R., Kandula, S., Rao, S., Akella, A., and Kulkarni, J. Graphene: Packing and Dependency-aware Scheduling for Data-parallel Clusters. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, pp. 81–97, 2016b.

Hall, B., Bergner, P., Housfater, A. S., Kandasamy, M., Magno, T., Mericas, A., Munroe, S., Oliveira, M., Schmidt, B., Schmidt, W., et al. Performance optimization and tuning techniques for IBM Power Systems processors including IBM POWER8. IBM Redbooks, 2017.

Hausknecht, M. and Stone, P. Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series, 2015.

Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A. D., Katz, R., Shenker, S., and Stoica, I. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI'11, pp. 295–308. USENIX Association, 2011.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Igl, M., Zintgraf, L., Le, T. A., Wood, F., and Whiteson, S. Deep variational reinforcement learning for POMDPs. arXiv preprint arXiv:1806.02426, 2018.

Intel Corp. Intel 64 and IA-32 Architectures Optimization Reference Manual. Intel Corporation, Sept, 2014.

Intel Corp. Intel® 64 and IA-32 Architectures Software Developer Manuals. https://software.intel.com/en-us/articles/intel-sdm, 2016. Accessed 2019-03-05.

Isard, M., Prabhakaran, V., Currey, J., Wieder, U., Talwar, K., and Goldberg, A. Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pp. 261–276, 2009.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2):99–134, 1998.

Karkus, P., Hsu, D., and Lee, W. S. QMDP-Net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems, pp. 4694–4704, 2017.

Kleen, A. PMU-Tools. https://github.com/andikleen/pmu-tools, 2010. Accessed 2019-03-05.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014, 2000.

Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10(3):R25, 2009.

Li, H. and Durbin, R. Fast and accurate short-read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, May 2009. doi: 10.1093/bioinformatics/btp324. URL http://dx.doi.org/10.1093/bioinformatics/btp324.

Li, H. and Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5):589–595, 2010.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., et al. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):2078–2079, 2009.

Lyerly, R., Murray, A., Barbalace, A., and Ravindran, B. Aira: A framework for flexible compute kernel execution in heterogeneous platforms. IEEE Transactions on Parallel and Distributed Systems, 29(2):269–282, Feb 2018. ISSN 1045-9219. doi: 10.1109/TPDS.2017.2761748.

Mao, H., Alizadeh, M., Menache, I., and Kandula, S. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pp. 50–56. ACM, 2016.

Mao, H., Schwarzkopf, M., Venkatakrishnan, S. B., Meng, Z., and Alizadeh, M. Learning scheduling algorithms for data processing clusters. arXiv preprint arXiv:1810.01963, 2018.

Mars, J. and Tang, L. Whare-map: Heterogeneity in "homogeneous" warehouse-scale computers. SIGARCH Comput. Archit. News, 41(3):619–630, June 2013.

Mars, J., Tang, L., and Hundt, R. Heterogeneity in "homogeneous" warehouse-scale computers: A performance opportunity. IEEE Comput. Archit. Lett., 10(2):29–32, July 2011. ISSN 1556-6056.


Mastrolilli, M. and Svensson, O. (Acyclic) job shops are hard to approximate. In 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pp. 583–592, Oct 2008. doi: 10.1109/FOCS.2008.36.

Mastrolilli, M. and Svensson, O. Improved bounds for flow shop scheduling. In International Colloquium on Automata, Languages, and Programming, pp. 677–688. Springer, 2009.

McCool, M., Reinders, J., and Robison, A. Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2012. ISBN 9780123914439, 9780124159938.

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., and DePristo, M. A. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9):1297–1303, Jul 2010. doi: 10.1101/gr.107524.110.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pp. 1928–1937. JMLR.org, 2016.

Narasimhan, K., Kulkarni, T., and Barzilay, R. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941, 2015.

Nothaft, F. Scalable genome resequencing with ADAM and avocado. Master's thesis, EECS Department, University of California, Berkeley, May 2015.

Nothaft, F., Massie, M., Danford, T., Zhang, Z., Laserson, U., Yeksigian, C., Kottalam, J., Ahuja, A., Hammerbacher, J., Linderman, M., Franklin, M. J., Joseph, A. D., and Patterson, D. A. Rethinking data-intensive science using scalable analytics systems. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pp. 631–646, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-2758-9. doi: 10.1145/2723372.2742787.

Ousterhout, K., Wendell, P., Zaharia, M., and Stoica, I. Sparrow: Distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pp. 69–84, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2388-8. doi: 10.1145/2517349.2522716.

Rimmer, A., Phan, H., Mathieson, I., Iqbal, Z., Twigg, S. R. F., Wilkie, A. O. M., McVean, G., and Lunter, G. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nature Genetics, 46(8):912–918, Jul 2014.

Russell, S., Binder, J., Koller, D., and Kanazawa, K. Local learning in probabilistic networks with hidden variables. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'95, pp. 1146–1152, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc. ISBN 1-55860-363-8.

Shachter, R. D. Bayes-Ball: The rational pastime (for determining irrelevance and requisite information in belief networks and influence diagrams). arXiv preprint arXiv:1301.7412, 2013.

Shao, Y. and Brooks, D. Research Infrastructures for Hardware Accelerators. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2015.

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., et al. The Predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3191–3199. JMLR, 2017.

Stuecheli, J., Blaner, B., Johns, C. R., and Siegel, M. S. CAPI: A Coherent Accelerator Processor Interface. IBM Journal of Research and Development, 59(1):7:1–7:7, Jan 2015. ISSN 0018-8646. doi: 10.1147/JRD.2014.2380198.

Sudhakar, A. T. and Srinivasan, M. IBM POWER in-memory collection counters. https://developer.ibm.com/articles/power9-in-memory-collection-counters/, 2019. Accessed 2019-03-05.

Terpstra, D., Jagode, H., You, H., and Dongarra, J. Collecting Performance Data with PAPI-C. In Müller, M. S., Resch, M. M., Schulz, A., and Nagel, W. E. (eds.), Tools for High Performance Computing 2009, pp. 157–173, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., Jordan, T., Shakir, K., Roazen, D., Thibault, J., Banks, E., Garimella, K. V., Altshuler, D., Gabriel, S., and DePristo, M. A. From FastQ data to high-confidence variant calls: The Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics, 43(1):11.10.1–11.10.33, 2013.


Varatharajah, Y., Chong, M. J., Saboo, K., Berry, B., Brinkmann, B., Worrell, G., and Iyer, R. EEG-GRAPH: A Factor-Graph-Based Model for Capturing Spatial, Temporal, and Observational Relationships in Electroencephalograms. In Advances in Neural Information Processing Systems, pp. 5377–5386, 2017.

Weaver, V. M. and McKee, S. A. Can hardware performance counters be trusted? In 2008 IEEE International Symposium on Workload Characterization, pp. 141–150. IEEE, 2008.

Wu, H., Diamos, G., Cadambi, S., and Yalamanchili, S. Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 107–118, Dec 2012.

Xu, L., Butt, A. R., Lim, S., and Kannan, R. A Heterogeneity-Aware Task Scheduler for Spark. In Proc. 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 245–256, Sep. 2018.

Yang, H., Breslow, A., Mars, J., and Tang, L. Bubble-Flux: Precise Online QoS Management for Increased Utilization in Warehouse Scale Computers. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pp. 607–618, 2013.

Yasin, A. A Top-Down method for performance analysis and counters architecture. In Proc. 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 35–44, March 2014.

Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., and Stoica, I. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. In Proceedings of the 5th European Conference on Computer Systems, pp. 265–278, 2010.

Zaharia, M., Bolosky, W. J., Curtis, K., Fox, A., Patterson, D., Shenker, S., Stoica, I., Karp, R. M., and Sittler, T. Faster and more accurate sequence alignment with SNAP. arXiv preprint arXiv:1111.5572, 2011.

Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI'12, pp. 15–28, 2012.

Zhang, Y., Laurenzano, M. A., Mars, J., and Tang, L. SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 406–418, Dec 2014.

Zhu, P., Li, X., Poupart, P., and Miao, G. On improving deep reinforcement learning for POMDPs. arXiv preprint arXiv:1804.06309, 2018.

Zhuravlev, S., Blagodurov, S., and Fedorova, A. Addressing Shared Resource Contention in Multicore Processors via Scheduling. SIGPLAN Not., 45(3):129–142, March 2010.

Zook, J. M., Catoe, D., McDaniel, J., Vang, L., Spies, N., Sidow, A., Weng, Z., Liu, Y., Mason, C. E., Alexander, N., Henaff, E., McIntyre, A. B., Chandramohan, D., Chen, F., Jaeger, E., Moshrefi, A., Pham, K., Stedman, W., Liang, T., Saghbini, M., Dzakula, Z., Hastie, A., Cao, H., Deikus, G., Schadt, E., Sebra, R., Bashir, A., Truty, R. M., Chang, C. C., Gulbahce, N., Zhao, K., Ghosh, S., Hyland, F., Fu, Y., Chaisson, M., Xiao, C., Trow, J., Sherry, S. T., Zaranek, A. W., Ball, M., Bobe, J., Estep, P., Church, G. M., Marks, P., Kyriazopoulou-Panagiotopoulou, S., Zheng, G. X., Schnall-Levin, M., Ordonez, H. S., Mudivarti, P. A., Giorda, K., Sheng, Y., Rypdal, K. B., and Salit, M. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Scientific Data, 3:160025, Jun 2016.


Supplementary Material

A. Extended Motivational Example

Current schedulers prioritize the use of simple online heuristics (Grandl et al., 2016b) and coarse-grained resource bucketing (e.g., core counts, free memory) and require user labeling of commonly used system resources (Hindman et al., 2011; Grandl et al., 2016a) to make scheduling decisions. Those approaches are untenable in truly heterogeneous settings, as (i) defining such heuristics is difficult over the combinatorial space of application-processor/accelerator configurations; and (ii) user-based resource-usage labeling requires in-depth understanding of the underlying system. This paper demonstrates the use of ML to automatically infer such heuristics and their evolution over time as new user workloads and/or new accelerators are added.

A.1. Dealing with Architectural Heterogeneity

We reiterate that state-of-the-art schedulers do not model the emergent heterogeneous compute platforms that are being widely adopted in data centers and hence leave a lot to be desired (as can also be seen in the performance of our baselines). Consider, for example, the execution of the forward algorithm on PairHMM models (Banerjee et al., 2017), a computation that is commonly performed in computational-genomics workloads. Fig. 11 shows the significant diversity (nearly 100×) in the performance of this single workload across CPUs (from Intel and IBM), GPUs (two models of GPUs from NVIDIA), and FPGA implementations. The increasing heterogeneity necessitates a rethinking of the design and implementation of future schedulers, as the current approach will require an extraordinary amount of manual tuning and expertise to adapt to the emergent systems. In contrast, the proposed technique eliminates that work and automates the whole process of learning the right granularity of resources and scheduling workloads in cloud-based, dynamic, multi-tenant environments, thereby improving application performance and system utilization, all with minimal human supervision. Prior work uses microarchitectural throughput metrics such as clock cycles per instruction (Giceva et al., 2014; Delimitrou & Kozyrakis, 2013; 2014; Mars et al., 2011; Mars & Tang, 2013) as proxies for processor affinities. In our case, such metrics are not usable because of the wide diversity in processors, i.e., CPU-centric units cannot describe the performance of GPUs/FPGAs.

Figure 11. Architectural diversity leading to varied performance for the PairHMM kernel (normalized throughput/watt vs. normalized throughput for Intel Xeon, IBM Power8, NVIDIA K40 and K80 GPUs, and an FPGA).

Figure 12. Degradation in runtime of co-located kernels due to shared resource contention (pairwise normalized runtimes, 0.6–1.0, for kernels including KMerSearch, LandauVishkin, LandauVishkinCIGAR, FilterPairs, Entropy, GaussianThresholding, PairHMM, KMerConst, DeBruijn, SmithWaterman, GenomeDiff, and Genotype).

A.2. Dealing with Resource Granularity

Traditional schedulers use coarse-grained resource bucketing, i.e., they schedule macro-resources like CPU core counts and GBs of memory. That simplifies the design of the scheduling algorithms (both the optimization algorithms and the attached heuristics), resulting in an inability to measure low-level sources of resource contention in the system. Contention for such low-level resources is often the cause of performance degradation and variability. Consider, for example, the concurrent execution of several compute kernels (described in Appendix C.2) on co-located hyper-threads (i.e., threads that share resources on a single core) of an Intel CPU. If we abstract the problem at the level of CPU threads and memory allocated, then those kernels should execute in isolation. The normalized runtime variation is illustrated in Fig. 12. We observe a slowdown of as much as 40% (i.e., the co-located runtime is 60% of the isolated runtime) for some combinations of kernels, and almost no slowdown for others. The problem is further exacerbated by the architectural diversity in processors that we described earlier. The proposed technique accounts for such contention by explicitly collecting information on low-level system state using performance-counter measurements, and by estimating resource usage in the system by explicitly encoding the measurements in its POMDP model.

Figure 13. Proposed POMDP model. (PMU measurements $o_t$ feed Bayesian inference over the BN model; the resulting belief $b_t$, together with a system-topology graph network and a data-flow-graph graph network (edge-, vertex-, and global-level layers), feeds an LSTM whose fully connected actor and critic heads output the action $a_t \sim \pi(a_t \mid b_t)$ and the value $V(b_t)$.)

B. Implementation Details

The scheduling framework functions as follows; a code sketch of this loop follows the list.

1. The scheduler first makes measurements by using the available processor performance counters (e.g., instructions retired, cache misses).

2. When a processor becomes idle (i.e., finishes running the current kernel), it invokes the scheduler.

3. The measurements are fed into the scheduler's BN model as input. Using those measurements, the BN model computes the utilization of different levels of architectural resources in the system (e.g., memory-bandwidth utilization, PCIe-link utilization). We refer to those utilizations as the state of the system.

4. The computed utilization numbers, the user programs represented as a DFG, and a system-topology graph are fed into an NN. The NN produces a scheduling decision that is actuated in the system. The action space consists of kernel-processor pairs.

5. Finally, the scheduler gets feedback from the system (i.e., the reward) in terms of the time it took for the job to run as a result of its scheduling decision.

6. While in training mode, if an incorrect decision is made, Symphony enqueues an update of the policy parameters using back-propagation on the A2C/A3C loss function. An incorrect decision is one in which kernel input-output dependencies are not respected, or in which a kernel-accelerator pair is picked where the accelerator does not provide an implementation of the kernel.
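The six steps above reduce to a compact decision loop. The sketch below is ours: every name (read_counters, infer_utilization, act, dispatch, ...) is an illustrative stand-in, not Symphony's actual API:

    # Minimal sketch of steps 1-6; invoked by an idle processor (step 2).
    def scheduler_step(system, bn_model, policy_nn, dfg, topology,
                       training=False):
        # (1) Read processor performance counters (instructions retired,
        #     cache misses, ...).
        measurements = system.read_counters()
        # (3) The BN turns raw counters into architectural-resource
        #     utilizations (memory bandwidth, PCIe links, ...): the "state".
        utilization = bn_model.infer_utilization(measurements)
        # (4) The NN maps (utilization, user DFG, topology graph) to an
        #     action: a (kernel, processor) pair actuated in the system.
        kernel, processor = policy_nn.act(utilization, dfg, topology)
        system.dispatch(kernel, processor)
        # (5) The reward is the (negative) time the job took to run.
        reward = -system.wait_for_completion(kernel, processor)
        # (6) In training mode, invalid decisions (violated input-output
        #     dependencies, or an accelerator lacking an implementation)
        #     enqueue a policy update via the A2C/A3C loss.
        if training and not system.action_was_valid(kernel, processor):
            policy_nn.enqueue_policy_update()
        return reward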

B.1. Graph Network Details

The structure of the graph network used in the proposed model is illustrated in Fig. 13. The numbers of parameters used in the different layers of the graph network are listed in Table 2.

Table 2. Mapping of the graph-network layer functions in Fig. 13. We use the notation FCNN(a, b) to denote two fully-connected hidden layers with a and b hidden units, respectively.

Function in GN    Function in GN 1    Function in GN 2
φe                FCNN(64, 32)        FCNN(64, 32)
φv                FCNN(32, 16)        –
φu                FCNN(16, 16)        FCNN(32, 16)
ρe→v              ∑e                  –
ρv→u              ∑v                  –
ρe→u              –                   ReLU(e)
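Read together with Fig. 13, the first column of Table 2 corresponds to a standard edge-vertex-global GN pass (Battaglia et al., 2018). Below is a schematic NumPy rendering under assumed toy feature dimensions; the random weights and the exact input wiring are placeholders, not the trained model:

    import numpy as np

    rng = np.random.default_rng(0)

    def fcnn(in_dim, hidden):
        """FCNN(a, b) of Table 2: fully-connected ReLU layers (demo weights)."""
        ws, d = [], in_dim
        for h in hidden:
            ws.append(rng.standard_normal((d, h)) * 0.1)
            d = h
        def apply(x):
            for w in ws:
                x = np.maximum(x @ w, 0.0)  # ReLU
            return x
        return apply

    def gn_block(E, V, u, senders, receivers, phi_e, phi_v, phi_u):
        """One GN pass: edge update, rho_{e->v} = sum, vertex update,
        rho_{v->u} = sum, global update (first column of Table 2)."""
        uE = np.tile(u, (len(E), 1))
        E2 = phi_e(np.concatenate([E, V[senders], V[receivers], uE], axis=1))
        agg_v = np.zeros((len(V), E2.shape[1]))
        np.add.at(agg_v, receivers, E2)          # sum incoming edges per vertex
        uV = np.tile(u, (len(V), 1))
        V2 = phi_v(np.concatenate([agg_v, V, uV], axis=1))
        u2 = phi_u(np.concatenate([V2.sum(axis=0), u]))  # sum vertices -> global
        return E2, V2, u2

    # Toy usage with 8-dim features (dimensions are ours, not the paper's):
    V = rng.standard_normal((3, 8)); E = rng.standard_normal((2, 8))
    u = rng.standard_normal(8)
    senders, receivers = np.array([0, 1]), np.array([1, 2])
    phi_e = fcnn(8 * 4, [64, 32])       # FCNN(64, 32)
    phi_v = fcnn(32 + 8 + 8, [32, 16])  # FCNN(32, 16)
    phi_u = fcnn(16 + 8, [16, 16])      # FCNN(16, 16)
    E, V, u = gn_block(E, V, u, senders, receivers, phi_e, phi_v, phi_u)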

B.2. Hyperparameters

The hyperparameters used to train the proposed POMDP model are listed in Table 3.

B.3. System Measurement Details

Topology Information. Consider the example of a standard NUMA-based computing system with PCIe-based accelerators shown in Fig. 14. The system contains (i) multiple CPUs that have non-uniform access to memory, (ii) several accelerators (including GPUs and FPGAs), each with its own memory, and (iii) a system interconnect that connects all of the components of the system together. Symphony encodes the system topology as a graph T = (P, N) (also shown in Fig. 14). The nodes of the graph P correspond to the processing elements (and attached memory) and the memory/system interconnects. Each of these nodes p ∈ P has an attached resource-utilization vector. For example, in an Intel processor, the utilization vector would include utilizations like those of micro-op issue ports, floating-point units, etc. (Doweck, 2016; Intel Corp., 2014).

Table 3. Hyperparameters used in the model.

Hyperparameter        Value
Learning Rate         0.005
LSTM Unroll Length    20
ns                    20
ne                    2

Figure 14. Example of a dual-socket NUMA-based system topology with a PCIe interconnect and devices (CPUs and memory on a memory bus; GPUs, a CAPI FPGA, and a NIC behind PCIe switches). The figure on the right shows a graph encoding of the topology.

The scheduler queries the system topology and builds the topology graph T (which is used as an input to the RL agent) using hwloc (Broquedis et al., 2010). hwloc provides information about CPU cores, caches, NUMA memory nodes, and the PCIe interconnect layout (i.e., connections between the PCIe root complex and PCIe switches), as well as connection information for peripheral accelerators, storage, and network devices in the system. The scheduler does not explicitly model the rack-scale or data-center network (unlike some previous approaches, e.g., Isard et al. (2009); Chowdhury et al. (2014)), but the BN and RL models can be extended to do so. Our measurements consider the injection bandwidth at the network interface card (NIC) to be a proxy for network performance, i.e., the NIC is modeled as an accelerator that accepts data at min(PCIe Bandwidth, Injection Bandwidth).
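As an illustration of the encoding (not Symphony's actual code), the graph T for a system like that of Fig. 14 could be held in a networkx graph whose nodes carry utilization vectors; all node names and bandwidth numbers below are made up:

    import networkx as nx

    # Toy rendering of T = (P, N); in Symphony the nodes and links come
    # from hwloc, here they are hard-coded for a Fig. 14-like machine.
    T = nx.Graph()
    for node in ["cpu0", "cpu1", "gpu0", "fpga0", "nic0",
                 "membus", "pcie_switch0", "pcie_switch1"]:
        # Each node carries a utilization vector; real vectors hold
        # counter-derived values (issue-port occupancy, FPU utilization, ...).
        T.add_node(node, utilization=[0.0] * 4)
    T.add_edges_from([("cpu0", "membus"), ("cpu1", "membus"),
                      ("cpu0", "pcie_switch0"), ("cpu1", "pcie_switch1"),
                      ("gpu0", "pcie_switch0"), ("fpga0", "pcie_switch1"),
                      ("nic0", "pcie_switch1")])
    # The NIC is modeled as an accelerator that accepts data at
    # min(PCIe bandwidth, injection bandwidth); numbers are illustrative.
    T.nodes["nic0"]["ingest_gbps"] = min(15.75, 12.5)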

Performance Counter Measurements. Performance-counter configuration and access instructions require kernel-mode privileges, and hence those operations are supported by Linux system calls that configure and read the performance-counter data. Symphony uses a combination of user-space tools, e.g., libPAPI (Terpstra et al., 2010), PMU-Tools (Kleen, 2010), and perf, that wrap the system-call interface to make both system-specific and system-independent measurements.
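For instance, a system-wide counting measurement of the kind described here can be scripted around perf's stat mode; the event list and parsing below are illustrative and assume a recent Linux perf:

    import subprocess

    def read_counters_systemwide(events=("instructions", "cache-misses"),
                                 seconds=1):
        """Sample system-wide counters via `perf stat -a` (CSV output, -x,).

        Requires permission to monitor all processes (e.g., a suitable
        kernel.perf_event_paranoid setting); event names are illustrative.
        """
        cmd = ["perf", "stat", "-a", "-x", ",",
               "-e", ",".join(events), "sleep", str(seconds)]
        # perf writes its statistics to stderr.
        out = subprocess.run(cmd, capture_output=True, text=True).stderr
        counts = {}
        for line in out.splitlines():
            fields = line.split(",")  # CSV: value, unit, event, ...
            if len(fields) > 2 and fields[0].strip().isdigit():
                counts[fields[2]] = int(fields[0])
        return counts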

We configure the performance counters to make system-wide measurements (i.e., for all processes). Configuring the performance-counter measurements in that way might incur security risks, particularly by opening up side channels through which attackers could infer workload characteristics. However, analysis or mitigation of such risks is not in the scope of this paper and may form the basis of future work.

All kernel executions are non-preemptive in the context of the proposed runtime; however, the OS scheduler can preempt CPU threads. Further, we prevent the OS scheduler from re-balancing tasks/threads once they have been assigned to a particular CPU. This is achieved by explicitly setting the affinities of threads to cores (i.e., pinning them).

Figure 15. Architecture of the FPGA-based hardware co-processor controlled by Symphony (a grid of systolically connected processing elements PE(i, j) with on-chip memory, a control unit, a memory controller with on-board DRAM, and PCIe-DMA/CAPI controllers connecting to the host CPU, NIC, and storage over PCIe).
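On Linux, such pinning needs no kernel changes; Python's standard library exposes the affinity call directly (a minimal sketch):

    import os

    def pin_current_thread(core_id):
        """Pin the caller to one CPU core so the OS cannot re-balance it.

        On Linux, PID/TID 0 refers to the calling task (thread), so an
        executor thread stays on its assigned SMT hardware thread.
        """
        os.sched_setaffinity(0, {core_id})

    # Example: an executor for core 3 calls pin_current_thread(3) on start-up.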

Performance Penalties. Monitoring of performance counters without having to take interrupts is almost free. In our implementation, we capture on-core performance counters directly before and after a single kernel invocation. Un-core performance counters are measured periodically (every million dynamic instructions on a core) by using a performance-monitoring interrupt. On an IBM PowerPC processor, the interrupt handler initiates a DMA transfer of the performance counters to memory (Sudhakar & Srinivasan, 2019), thereby incurring no performance penalty (other than the time to service the interrupt). On Intel processors, the interrupt handler has to explicitly read the performance-counter registers and write them to memory. In our tests (on Intel processors), we observed a ~3% performance penalty for applications with interrupts enabled. That corresponds to the execution of a user-mode interrupt with an average 900-ns latency.

Distributed Execution. In our evaluation, we deployed Symphony in a rack-scale distributed context (over an EDR Infiniband network fabric) as a centralized scheduler controlling all processing resources. Here, all the performance-counter measurements are sent over the network to a centralized server that makes scheduling decisions. This approach works well at the scale of a rack, where all resources are essentially one hop away at 0.2-µs latency. Extending Symphony to larger or slower networks might present challenges, as network latency could cause stale performance-counter data to reach the scheduler. We will address these challenges in future work.

B.4. Dynamically Reconfigurable FPGA Accelerator

Our implementation and evaluation of Symphony use a custom FPGA accelerator (see Fig. 15). Due to space limitations, we only briefly describe the features of the accelerator here.

Table 4. Hardware specifications of the test cluster.

Name    #    Specifications
M1      2    CPU: IBM Power8 (SMT 8); 870 GB RAM; GPU: NVIDIA K80; FPGA: Alpha Data 7V3
M2      4    CPU: IBM Power8 (SMT 4); 512 GB RAM; GPU: NVIDIA K40; FPGA: Nallatech 385
N       1    Mellanox FDR Infiniband

• Processing Elements (PEs). The co-processor is optimized to execute the computational kernels as a single instruction of the application. Sets of four neighboring PEs are directly connected as a systolic element, thereby enabling high-bandwidth data transfer between PEs and forming the quantum of reconfiguration.

• Host-FPGA Communication. The board interfaces with the host CPU over PCIe and can be configured to communicate with the host processor over this interface in one of two ways: (i) using direct memory access (DMA) to the host's memory over the PCIe bus, or (ii) using IBM's Coherent Accelerator Processor Interface (CAPI) (Stuecheli et al., 2015).

• Dynamic Reconfiguration. The configuration of the accelerator (i.e., which kernel PEs are available at any time) is controlled by Symphony. Symphony treats the reconfiguration of the accelerator as a kernel that has to be dispatched to the FPGA. The state of the accelerator is fed into Symphony along with the system topology T.

• Launching Kernels. Recall that CPU executors (i.e., threads that are given tasks to execute) are pinned or bound to an underlying hardware SMT thread. Accelerators, however, require the CPUs to initiate their execution. As a result, each accelerator in the system is assigned a proxy executor thread that orchestrates (i.e., launches, polls for completion, etc.) its execution. These executors are responsible for managing their own queues of tasks that are “waiting” for execution; a sketch follows.
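In sketch form (our names; launch/poll stand in for the real driver API), a proxy executor is a pinned CPU thread draining a per-accelerator queue:

    import queue
    import threading

    class ProxyExecutor(threading.Thread):
        """CPU thread that orchestrates one accelerator's kernel launches.

        `device` is a placeholder expected to expose launch() and poll()
        (standing in for a GPU/FPGA driver interface).
        """
        def __init__(self, device):
            super().__init__(daemon=True)
            self.device = device
            self.waiting = queue.Queue()  # tasks "waiting" for execution

        def submit(self, task):
            self.waiting.put(task)

        def run(self):
            while True:
                task = self.waiting.get()
                handle = self.device.launch(task)    # initiate execution
                while not self.device.poll(handle):  # poll for completion
                    pass
                task.on_complete()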

C. Evaluation Environment

C.1. Evaluation System

All evaluation experiments are performed on an 11-node rack-scale test-bed of IBM Power8 CPUs, NVIDIA K40 and K80 GPUs, and FPGAs (listed in Table 4). All the machines in the cluster are connected by a single-switch EDR Infiniband network.

C.2. Evaluation Workloads

We illustrate the generality of the proposed approach on a variety of real-world workloads (listed in Table 5) that use CPUs, GPUs, and FPGAs:

1. variant-calling and genotyping analysis (Van der Auwera et al., 2013) on human-genome datasets appropriate for clinical use (consisting of Align, IR, and HC in Table 5);

2. epilepsy detection and localization (Varatharajah et al., 2017) on intra-cranial electroencephalography data; and

3. online security analytics (Cao et al., 2015) on network- and host-level intrusion-detection-system event streams.

For the variant-calling and genotyping workload, we use the NA12878 genome sample from the GIAB consortium (Zook et al., 2016) for all our experiments, as it is representative of human clinical datasets. For the EEG and AT workloads, we use the same datasets as discussed in the original papers.

Table 5. Enumeration of the workloads used in the evaluation.

Application              CPU  GPU  FPGA  Implementations
Alignment (Align)         ✓    ✓    ✓    Li & Durbin (2009; 2010); Langmead et al. (2009); Zaharia et al. (2011); Banerjee et al. (2019a; 2016)
Indel Realignment (IR)    ✓    ✗    ✗    McKenna et al. (2010); Nothaft et al. (2015)
Variant Calling (HC)      ✓    ✓    ✓    Li et al. (2009); McKenna et al. (2010); Nothaft (2015); Rimmer et al. (2014); Banerjee et al. (2017)
EEG-Graph (EEG)           ✓    ✓    ✓    Varatharajah et al. (2017); Banerjee et al. (2019b)
AttackTagger (AT)         ✓    ✓    ✓    Cao et al. (2015); Banerjee et al. (2019b)