Transcript of GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems
Rohan Garg, Tirthak Patel, Devesh Tiwari
The Key Idea Behind GIFT
…but first, some background
Data-intensive Parallel Applications
[Figure: applications alternate between Compute Phases and I/O Phases; Compute Nodes (OSCs) in the Compute System issue I/O to the Parallel Storage System]
[Figure: parallel storage system architecture: the SION network connects compute nodes to OSSes/OSTs and MDSes/MDTs via redundant controllers (CTRL A and CTRL B), HBAs, and network links]
Object Storage Targets (OSTs)
[Figure: GeoScience application example: isovalues on compressed simulation data with bounding error (32 bits, 3200x2400x42, 1.4 GB); error bounds of 0.25, 0.5, 1.0, and 2.0 bits yield 10.8 MB, 21.6 MB, 43.3 MB, and 86.5 MB, respectively]
One application performs I/O concurrently to multiple OSTs.
Parallel applications can cause unmanaged and unpredictable I/O interference!
Inefficient I/O bandwidth utilization
[Figure: bandwidth allocations to apps A–D at times t1 and t2 under a Traditional scheme and under GIFT]
GIFT’s coupon-based I/O bandwidth allocation appears appealing, but…
What are the favorable characteristics? What are the challenges?
GIFT Enablers
Repetitive runs, low periodicity, predictable I/O: HPC applications run repeatedly, are frequent, and exhibit similar I/O behavior across different runs.
Parallel applications suffer from non-synchronous I/O progress, leading to bandwidth waste.
[Figure: per-OST bandwidth shares (0–100%) showing bandwidth wasted when MPI processes of the same application finish I/O at different times]
There is significant variation in I/O finish time among MPI processes of the same application.
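The waste can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only (the function name and numbers are made up, not from the GIFT paper): it treats each OST's share as reserved until the application's slowest process finishes.

```python
# Illustrative only: quantify bandwidth lost to non-synchronous I/O progress.
# An app's I/O phase ends only when its slowest process finishes, so bandwidth
# reserved on the faster OSTs sits idle in the meantime.
def effective_fraction(per_ost_times):
    """per_ost_times: time each OST needs to serve this app's I/O share."""
    useful = sum(per_ost_times)                         # busy bandwidth-time
    reserved = max(per_ost_times) * len(per_ost_times)  # held until slowest ends
    return useful / reserved

# Two OSTs finish in 10 time units, one needs 20: a third of the reserved
# bandwidth-time contributes nothing to synchronous progress.
print(round(effective_fraction([10, 10, 20]), 2))  # → 0.67
```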
GIFT Challenges
The need for synchronous I/O progress in parallel applications poses new challenges in maintaining efficiency and fairness in I/O bandwidth allocation.
Let’s look at some bandwidth allocation policies and compare them.
[Figure: example bandwidth allocations (0–100% B/W) of apps A–E across OST 1, OST 2, and OST 3 under POFS, BSIP, and MBW]
Per-OST Fair Share (POFS): fair, but I/O progress is not synchronous, and bandwidth is wasted.
Basic Synchronous I/O Progress (BSIP): fair and synchronous, but bandwidth is still wasted.
Minimum Bandwidth Wastage (MBW): synchronous with no bandwidth waste, but not fair.
GIFT balances all three goals: fairness, synchronous I/O progress, and minimal bandwidth wastage.
Three Key Ingredients
1. Fairness: GIFT breaks away from instantaneous fairness and instead maintains fairness over a long time window. It uses a barter system for unfair treatment: compute hours are awarded to compensate for unfair I/O bandwidth allocation, drawn from a "system compute hour regret budget".
2. Synchronous I/O Progress: GIFT’s initial allocation is the same as the BSIP scheme, and any subsequent readjustments preserve this property.
3. Minimize B/W Wastage: GIFT designs a "throttle-and-reward" mechanism that picks "throttle-friendly" applications, issues them coupons to reduce bandwidth waste at a given time, and rewards them later (i.e., redeems their coupons).
GIFT Workflow
At every decision instance:
- Perform BSIP bandwidth allocation
- Determine throttle-friendly applications (whom to throttle?)
- Issue coupons to throttled applications
- Redeem coupons (which coupons to redeem?)
- Allocate bandwidth optimally (how much to throttle and expand?)
- Increase or decrease the redemption rate
Identifying Throttle-Friendly Applications
- Careful design leads to a minimal system regret budget (compute hours given out due to unfair treatment in the long term).
- Throttle-friendly apps can also be expanded if deemed beneficial.
- The set of throttle-friendly applications changes over time.
N is the length of the receding window; τ is the minimum redemption rate required for an app to be throttle-eligible.
[Figure: timeline over the receding window: initial redemption, coupons issued, coupons redeemed]
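As a sketch of how the τ threshold might gate eligibility over the receding window (the helper name, window representation, and default values are assumptions for illustration, not GIFT’s actual code):

```python
# Hypothetical sketch: an app stays throttle-eligible only if GIFT has managed
# to redeem at least a fraction tau of the coupon value issued to it over the
# last N decision instances (the receding window).
from collections import deque

def throttle_eligible(history, N=8, tau=0.5):
    """history: per-instance (coupon_issued, coupon_redeemed) values."""
    window = deque(history, maxlen=N)   # keep only the last N instances
    issued = sum(i for i, _ in window)
    redeemed = sum(r for _, r in window)
    if issued == 0:
        return True                     # never throttled: nothing is owed yet
    return redeemed / issued >= tau

print(throttle_eligible([(15, 9), (0, 6), (10, 0)]))  # 15/25 = 0.6 >= 0.5 → True
```

An app whose coupons keep going unredeemed drops below τ and stops being throttled, which is what keeps the system regret budget small.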
Careful Coupon Redemption
GIFT redeems coupons only when doing so does not require throttling other applications.
[Figure: three allocation snapshots on OST 1 and OST 2. Initially, OST 1 serves A (38%), B (25%), and C (38%), and OST 2 serves B, D, E, and F (25% each). When spare B/W becomes available and A holds an outstanding coupon worth 15%, the spare bandwidth could be divided equally (A and C at 33% each), but GIFT does not do this; instead, GIFT partially redeems A’s coupon (A at 42%, C at 33%) without throttling C.]
One may argue that if spare I/O bandwidth is available, applications would have naturally been allocated that I/O bandwidth. So, how does GIFT reduce wasted bandwidth?
GIFT vs. BSIP: a three-instance example
[Figure: bandwidth shares (0–100% B/W) on OST 1 and OST 2 at decision instances k1, k2, and k3.
Instance k1: BSIP allocates A (50%) and B (50%) on OST 1, and B (50%) on OST 2. GIFT instead issues a coupon worth 15% B/W on one OST to app A: A gets 35% and B gets 65% on OST 1, letting B also use 65% on OST 2.
Instance k2: BSIP allocates A (38%), B (25%), and C (38%) on OST 1, and B, D, E, F (25% each) on OST 2. GIFT redeems app A’s coupon with 9% B/W on one OST: A (42%), B (25%), C (33%).
Instance k3: GIFT redeems app A’s coupon with the remaining 6% B/W on one OST: A (39%), B (25%), C (36%).]
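The coupon bookkeeping in this walkthrough can be sketched as a small ledger (the class and method names are assumed for illustration; this is not GIFT’s implementation):

```python
# Hypothetical coupon ledger: tracks bandwidth owed to throttled apps and
# redeems it only from spare bandwidth, never by throttling someone else.
class CouponLedger:
    def __init__(self):
        self.owed = {}   # app -> outstanding coupon value (fraction of one OST)

    def issue(self, app, amount):
        self.owed[app] = self.owed.get(app, 0.0) + amount

    def redeem(self, app, spare):
        """Grant as much of app's coupon as spare bandwidth allows."""
        grant = min(self.owed.get(app, 0.0), spare)
        self.owed[app] = self.owed.get(app, 0.0) - grant
        return grant

ledger = CouponLedger()
ledger.issue("A", 0.15)                    # instance k1: A throttled by 15%
print(round(ledger.redeem("A", 0.09), 2))  # instance k2: 9% spare → 0.09
print(round(ledger.redeem("A", 0.10), 2))  # instance k3: only 6% still owed → 0.06
```

Replaying the slide’s numbers: after issuing a 15% coupon at k1, GIFT redeems 9% at k2 and the remaining 6% at k3, leaving nothing outstanding.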
Optimal I/O Bandwidth Allocation
How much to throttle, and whom to expand by how much? This is formulated as a linear programming optimization problem:

    maximize  Σ_{j ∈ J} Σ_{i ∈ S_j} b_i

where J is the set of all OSTs, S_j is the set of all apps on OST j, and b_i is the B/W allocation of app i, subject to the constraints:
- All I/O requests of an application issued across all OSTs should get the same B/W (for synchronous I/O progress).
- The final B/W allocation should be fair.
- All OSTs are constrained by their full capacity.
[Figure: example placement of apps A–D across OSTs j = 1, 2, 3]
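This LP can be sketched with an off-the-shelf solver. The version below uses scipy.optimize.linprog on an assumed toy app-to-OST placement, and omits the fairness floor constraints for brevity; it keeps the two structural properties named above (one b_i per app across all its OSTs, and per-OST capacity rows).

```python
# Hedged sketch of GIFT's bandwidth-allocation LP (placement and names assumed,
# fairness constraints omitted). Each app i gets a single bandwidth fraction
# b_i applied on every OST it touches (synchronous I/O progress); each OST's
# total allocation is capped at 100%; the objective maximizes the summed
# allocation served across all OSTs.
from scipy.optimize import linprog

apps = ["A", "B", "C", "D"]
osts = {1: ["A", "D"], 2: ["A", "B", "C"], 3: ["A", "D"]}  # assumed placement
idx = {a: k for k, a in enumerate(apps)}

# Objective: maximize sum_j sum_{i in S_j} b_i  ->  minimize its negation.
c = [0.0] * len(apps)
for members in osts.values():
    for a in members:
        c[idx[a]] -= 1.0

# Capacity: allocations on each OST must not exceed 100%.
A_ub, b_ub = [], []
for members in osts.values():
    row = [0.0] * len(apps)
    for a in members:
        row[idx[a]] = 1.0
    A_ub.append(row)
    b_ub.append(1.0)

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0.0, 1.0)] * len(apps), method="highs")
print({a: round(res.x[idx[a]], 3) for a in apps})
```

In GIFT’s full formulation, fairness would appear as additional constraints on each b_i; the capacity rows above correspond only to the "full capacity" constraint.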
GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems
Tirthak Patel, Northeastern University
Rohan Garg, Nutanix
Devesh Tiwari, Northeastern University
Abstract
Large-scale parallel applications are highly data-intensive and perform terabytes of I/O routinely. Unfortunately, on a large-scale system where multiple applications run concurrently, I/O contention negatively affects system efficiency and causes unfair bandwidth allocation among applications. To address these challenges, this paper introduces GIFT, a principled dynamic approach to achieve fairness among competing applications and improve system efficiency.
1 Introduction
Problem Space and Gaps in Existing Approaches. Increase in computing power has enabled scientists to expedite the scientific discovery process, but scientific applications produce more and more analysis and checkpoint data, worsening their I/O bottleneck [7, 45]. Many applications spend 15-40% of their execution time performing I/O, which is expected to increase for exascale systems [12, 15, 22, 31, 53, 55]. Unfortunately, multiple concurrent applications on a large-scale system lead to severe I/O contention, limiting the usability of future HPC systems [11, 45].
Recognizing the importance of the problem, there have been numerous efforts to mitigate I/O contention from both I/O throughput and fairness perspectives [13, 14, 17, 25, 37, 42, 75, 76, 78, 88, 89]. Unfortunately, ensuring fairness and maximizing throughput are conflicting objectives, and it is challenging to strike a balance between them under I/O contention. For parallel HPC applications, the side-effect of I/O contention is further amplified because of the need for synchronous I/O progress. HPC applications are inherently tightly synchronized; during an I/O phase, MPI processes of an HPC application must wait for all processes to finish their I/O before resuming computation (i.e., synchronous I/O progress among MPI processes is required) [28, 31, 39, 57, 90].
MPI processes of an HPC application perform parallel I/O access to multiple back-end storage targets (e.g., an array of disks) concurrently. These back-end storage targets are shared among concurrently running applications and have different degrees of sharing over time and hence, a varying level of contention. A varying level of I/O contention at the shared back-end parallel storage system makes different MPI processes progress at different rates and hence, leads to non-synchronous I/O progress. In Sec. 2, we quantify non-synchronous I/O progress as a key source of inefficiency in shared parallel storage systems. It results in (1) wastage of compute cycles on compute nodes, and (2) reduction in effective system I/O bandwidth (i.e., the bandwidth that contributes toward synchronous I/O progress), since full bandwidth is not utilized toward synchronous I/O progress.
Recent works have noted that non-synchronous I/O progress degrades application and system performance on modern supercomputers like Mira, Edison, Cori, and Titan [9, 31, 32, 39, 69, 83]. Thus, there is an emerging interest in improving the quality-of-service (QoS) of parallel storage systems [24, 80, 86]. Previous works have proposed rule-based or ad-hoc bandwidth allocation strategies for HPC storage [14, 17, 23, 36, 42, 88, 89]. However, existing approaches do not systematically implement synchronous I/O progress to balance the competing objectives: improving effective system I/O bandwidth and improving fairness.
To bridge this solution gap, this paper describes GIFT, a coupon-based bandwidth allocation approach to ensure synchronous I/O progress of HPC applications while maximizing I/O bandwidth utilization and ensuring fairness among concurrent applications on parallel storage systems.
Summary of the GIFT Approach. GIFT introduces two key ideas: (1) Relaxing the fairness window: GIFT breaks away from the traditional concept of instantaneous fairness at each I/O request, and instead, ensures fairness over multiple I/O phases and runs of an application. This opportunity is enabled by exploiting the observation that HPC applications have multiple I/O phases during a run and are highly repetitive, often exhibiting similar behavior across runs; and (2) Throttle-and-reward approach for I/O bandwidth allocation: GIFT opportunistically throttles the I/O bandwidth of certain applications at times in an attempt to improve the overall effective system I/O bandwidth (i.e., it minimizes the wasted I/O bandwidth that does not contribute toward synchronous I/O progress). GIFT’s throttle-and-reward approach intelligently exploits instantaneous opportunities to improve effective system I/O bandwidth. Further, relaxing the fairness window enables GIFT to reward the "throttled" application at a later point to ensure fairness.
More GIFT Design and Implementation Details (it’s in the paper!)
- Mathematical formulation of throttle-friendly application selection
- Balancing system regret budget vs. stability of throttling decisions
- Details of the bandwidth allocation optimization solution
- Design parameters and their impact
- GIFT prototype implementation details
Evaluation and Analysis
Experimental Methodology
A FUSE-based prototype is used for testbed-based evaluation. The testbed evaluation uses job characteristics from the Stampede2, Mira, and Theta supercomputers: number of nodes, compute time, amount of data I/O, I/O interval, job inter-arrival time, backfilling scheduling strategy, etc. Refer to the paper for more details and the simulation-based setup.
Competing Strategies
- Per-OST Fair Share (POFS)
- Basic Synchronous I/O Progress (BSIP)
- Minimum B/W Waste (MBW)
Other selective-throttle/expand-focused heuristics:
- Throttle Randomly (RND)
- Throttle Small App (TSA)
- Throttle Most Frequent App (TMF)
- Expand Small App (ESA)
GIFT improves system I/O bandwidth, mean app I/O time, and runtime.
The GIFT real-system prototype improves system bandwidth by more than 15% and app I/O time by more than 10%, compared to POFS.
GIFT’s fairness is comparable to BSIP and much fairer than MBW.
POFS is the baseline for fairness. The average I/O time degradation for degraded apps is only 1.2% with GIFT.
Simulation-based results confirm the real-system prototype results.
Simulation results show even larger improvements because of (1) a longer time window and (2) a larger system scale. GIFT can even improve the overall system throughput.
GIFT is not inherently biased against certain types of I/O behaviors: applications with different I/O behaviors observe an improvement with GIFT.
GIFT needs to award outstanding compute node-hours for coupons that are not redeemed; it can bound these hours at a low level even under pessimistic scenarios. GIFT’s system regret budget needed to award outstanding hours is low.
Figure 7: GIFT’s implementation provides improvement for both application- and system-level objectives (higher is better). Panels: (a) Mean App I/O Time, (b) Mean App Runtime, (c) Effective System I/O B/W, (d) System Throughput.
Scheduling Policies. We evaluate GIFT against seven competing I/O scheduling policies: Per-OST Fair Share (POFS), Basic Synchronous I/O Progress (BSIP), Minimum Bandwidth Wastage (MBW), Throttle Small Applications (TSA), Expand Small Applications (ESA), Throttle Most Frequent Applications (TMF), and Throttle Randomly (RND). POFS, BSIP, and MBW are implemented as discussed in Sec. 2. TSA attempts to increase the effective system bandwidth by throttling small applications, while ESA attempts to improve the system throughput by increasing the bandwidth allocation for longer-running, smaller applications that generally do small I/O [2, 4, 5]. We also compare against other simple, intuitive strategies such as TMF and RND, which pick the "most frequently appearing" and "random" applications for bandwidth throttling, respectively. POFS is used as the baseline policy.
Objective Metrics. Application I/O Time is the amount of time spent in I/O by an application during its run. Application Run Time is the run time of the application. Effective System Bandwidth is the average effective I/O bandwidth during the run of an application set, defined as overall system bandwidth minus the wasted bandwidth (Sec. 2). System Throughput is the number of jobs completed per unit time.
GIFT’s real-system implementation provides better application- and system-level performance. First, our results show that GIFT outperforms all competing techniques significantly. Fig. 7 (a)-(d) show that GIFT performs better for mean application I/O time, mean application runtime, effective system bandwidth, and system throughput, respectively. The mean application I/O time with GIFT is 10% better than with POFS, and 3.5% better than the next best technique, BSIP. Interestingly, when applications are throttled based on their characteristics (TSA, ESA, and TMF), or are arbitrarily throttled (RND), the performance remains similar to that of BSIP. This shows that naïve, rule-based techniques cannot match the performance delivered by the GIFT approach.
GIFT also improves the effective system bandwidth by more than 17% compared to POFS and other techniques, except MBW. Expectedly, MBW improves the effective system bandwidth the most because it solely focuses on this metric. Next, we note that by compromising fairness one could design techniques that solely focus on improving system throughput (e.g., favor small jobs). GIFT does not compromise fairness,
Figure 8: GIFT implementation bounds outstanding node-hours using application- and system-level redemption rate thresholds.
and it neither directly manipulates nor aims to improve the system job throughput, but by virtue of reducing I/O bandwidth waste and mean application I/O time, GIFT yields a 2% improvement in system throughput. We note that even a small improvement in system throughput leads to large monetary savings in the operational cost of HPC systems [18, 71, 84].
Next, we recall that GIFT gives out compute node-hours as regret, but this is minimal compared to the system throughput improvement it enables (2% savings in total compute node-hours). Fig. 8 shows that GIFT gave out less than 0.06% of total compute node-hours from the system regret budget in a more-than-two-day-long experimental run; this result shows that application- and system-level redemption rate thresholds keep the system regret budget under control. Even if one were to award outstanding node-hours every day, GIFT would give out only 0.12% of node-hours, which is much smaller than the gains in system throughput (2%); this trend is also later supported by simulation results.
Next, we discuss the effectiveness of GIFT in terms of fairness. First, recall that the design of GIFT introduces two ideas: (1) opportunistically rewarding applications, and (2) compensating unfairness in I/O performance via additional compute hours. These ideas do not naturally align with the traditional notion of fairness, where a scheme tends to distribute the "benefits" equally among all applications and the "currency" of fairness measurement remains the same. In contrast, GIFT is designed to distribute the benefit opportunistically among applications because, as discussed earlier, distributing the benefits equally among all applications leads to benefit (system bandwidth) wastage due to non-synchronous I/O progress. GIFT achieves fairness by compensating I/O unfairness with compute resources. Therefore, GIFT’s performance cannot be directly compared with POFS to establish its fairness effectiveness. Nevertheless, we provide this comparison for completeness and to demonstrate that GIFT is not unfair.
GIFT is open-sourced at https://github.com/GoodwillComputingLab/GIFT
Where is my gift in all this?