Scheduling in Heterogeneous Grid Environments:
The Effects of Data Migration

Leonid Oliker, Hongzhang Shan
Future Technologies Group, Lawrence Berkeley National Laboratory

Warren Smith, Rupak Biswas
NASA Advanced Supercomputing Division, NASA Ames Research Center
Motivation
• Geographically distributed resources
• Difficult to schedule and manage efficiently
  – Autonomy (local schedulers)
  – Heterogeneity
  – Lack of perfect global information
  – Conflicting requirements between users and system administrators
Current Status
• Grid initiatives
  – Global Grid Forum, NASA Information Power Grid, TeraGrid, Particle Physics Data Grid, E-Grid, LHC Challenge
• Grid scheduling services
  – Enabling multi-site applications: multi-disciplinary applications, remote visualization, co-scheduling, distributed data mining, parameter studies
  – Job migration: improve time-to-solution, avoid dependency on a single resource provider, optimize application mapping to the target architecture
• But what are the tradeoffs of data migration?
Our Contributions
• Interaction between grid scheduler and local scheduler
• Architecture: distributed, centralized, and ideal
• Real workloads
• Performance metrics
• Job migration overhead
• Superscheduler scalability
• Fault tolerance
• Multi-resource requirements
Distributed Architecture
[Figure: Distributed architecture. In the grid environment, a grid scheduler and grid queue (middleware) exchange job and info messages over the communication infrastructure; in each local environment, a local scheduler and local queue feed a compute server's PEs.]
Interaction between Grid and Local Schedulers
• The local scheduler reports its Approximate Wait Time (AWT) and Current Resource Utilization (CRU) to the grid scheduler; each submitted job carries its Job Requirements (JR)
• If the AWT is below a threshold, the job stays in the local queue
• Else, the job is considered for migration under one of three policies: Sender-Initiated (S-I), Receiver-Initiated (R-I), or Symmetrically-Initiated (Sy-I)
• AWT: Approximate Wait Time • CRU: Current Resource Utilization • JR: Job Requirements
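A minimal sketch of this hand-off, assuming hypothetical names and a made-up wait-time threshold (the deck does not state the actual cutoff):

```python
# Sketch of the grid/local scheduler hand-off described above.
# The function name and the threshold value are illustrative assumptions.

WAIT_TIME_THRESHOLD = 300.0  # seconds; hypothetical cutoff


def place_job(awt: float, threshold: float = WAIT_TIME_THRESHOLD) -> str:
    """Decide whether a job stays in the local queue or is
    considered for migration by the grid scheduler."""
    if awt < threshold:
        return "local"               # expected wait is acceptable
    return "consider-migration"      # hand off to an S-I / R-I / Sy-I policy


print(place_job(60.0))    # short expected wait -> stays local
print(place_job(1200.0))  # long expected wait -> considered for migration
```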
Sender-Initiated (S-I)

[Figure: S-I protocol between the host and its partners. The host broadcasts Job_i's requirements; each machine k (including the host itself) replies with ART_k and CRU_k; Job_i is dispatched to the selected machine, which returns Results_i.]
• Select the machine with the smallest Approximate Response Time (ART); break ties by CRU
• ART = Approximate Wait Time + Estimated Run Time
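The S-I selection rule (smallest ART, ties broken by lower CRU) can be sketched as follows; the reply format and machine names are illustrative:

```python
# Sender-Initiated selection: pick the machine with the smallest
# Approximate Response Time (ART); break ties by the lower
# Current Resource Utilization (CRU). Names are illustrative.

def approximate_response_time(approx_wait: float, est_run: float) -> float:
    """ART = Approximate Wait Time + Estimated Run Time."""
    return approx_wait + est_run


def select_machine(replies: dict) -> str:
    """replies maps machine name -> (ART, CRU); tuple comparison
    orders first by ART, then by CRU."""
    return min(replies, key=lambda m: (replies[m][0], replies[m][1]))


replies = {
    "host":     (approximate_response_time(100.0, 50.0), 0.90),
    "partner1": (approximate_response_time(20.0, 60.0), 0.40),
    "partner2": (approximate_response_time(30.0, 50.0), 0.20),
}
print(select_machine(replies))  # partner1/partner2 tie at ART=80; lower CRU wins
```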
Receiver-Initiated (R-I)

[Figure: R-I protocol between the host and its partners. Idle partners send a free signal to the host; the host then sends Job_i's requirements, each machine k replies with ART_k and CRU_k, and Job_i migrates to the selected machine.]
• Querying begins only after a free signal is received
Symmetrically-Initiated (Sy-I)
• First, work in R-I mode
• Change to S-I mode if no machine volunteers within a time period
• Switch back to R-I mode after the job is scheduled

[Diagram: state machine. R-I -> S-I when there is no volunteer after the time period; S-I -> R-I when volunteers are available.]
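The Sy-I mode switching can be sketched as a tiny state machine; the event names are assumptions, and the deck does not specify the length of the time period:

```python
# Symmetrically-Initiated (Sy-I) mode switching, per the slide:
# start in R-I; fall back to S-I when no machine volunteers within
# a time period; return to R-I once the job is scheduled.
# Event names are illustrative.

class SymmetricScheduler:
    def __init__(self) -> None:
        self.mode = "R-I"  # start in receiver-initiated mode

    def on_event(self, event: str) -> str:
        if self.mode == "R-I" and event == "no-volunteer-timeout":
            self.mode = "S-I"   # nobody volunteered: actively push the job
        elif self.mode == "S-I" and event == "job-scheduled":
            self.mode = "R-I"   # job placed: go back to waiting for volunteers
        return self.mode


s = SymmetricScheduler()
print(s.on_event("no-volunteer-timeout"))  # -> S-I
print(s.on_event("job-scheduled"))         # -> R-I
```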
Centralized Architecture
• Advantages: global view
• Disadvantages: single point of failure, limited scalability

[Figure: Centralized architecture. Jobs submitted through web portals or a super shell enter a single grid queue managed by the grid scheduler via the middleware.]
Performance Metrics
• AverageWaitTime = (1/N) Σ_{j ∈ Jobs} (StartTime_j − SubmitTime_j)
• AverageTurnaroundTime = (1/N) Σ_{j ∈ Jobs} (EndTime_j − SubmitTime_j)
• FractionOfJobsTransferred = NumberOfJobsMigrated / TotalNumberOfJobs
• FractionDataVolumeMigrated = Σ_{K ∈ MigratedJobs} (InputSize_K + OutputSize_K) / Σ_{J ∈ Jobs} (InputSize_J + OutputSize_J)
• DataMigrationOverhead = TotalDataMigrationTime / Σ_{J ∈ Jobs} (ET_J − QT_J), where ET_J is job J's end time and QT_J its submit (queue) time
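These metrics can be computed directly from a job trace. A sketch, assuming a hypothetical record layout of (submit, start, end, data bytes, migrated-flag) and made-up sample numbers:

```python
# Compute the deck's five metrics from a list of job records.
# Each record is (submit, start, end, data_bytes, migrated);
# this layout and the sample values are illustrative assumptions.

def metrics(jobs, total_migration_time):
    n = len(jobs)
    avg_wait = sum(start - submit for submit, start, _, _, _ in jobs) / n
    avg_turnaround = sum(end - submit for submit, _, end, _, _ in jobs) / n
    frac_jobs_migrated = sum(1 for *_, mig in jobs if mig) / n
    total_data = sum(d for _, _, _, d, _ in jobs)
    frac_data_migrated = sum(d for _, _, _, d, mig in jobs if mig) / total_data
    # overhead: migration time relative to total response time (ET - QT)
    overhead = total_migration_time / sum(end - submit
                                          for submit, _, end, _, _ in jobs)
    return avg_wait, avg_turnaround, frac_jobs_migrated, frac_data_migrated, overhead


jobs = [
    # (submit, start, end, input+output bytes, migrated?)
    (0.0, 10.0, 110.0, 4_000, True),
    (0.0, 30.0, 130.0, 1_000, False),
]
print(metrics(jobs, total_migration_time=5.0))
```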
Resource Configuration and Site Assignment

Server ID  Nodes  CPUs/Node  CPU Speed  Site (3 Sites)  Site (6 Sites)  Site (12 Sites)
S1          184     16       375 MHz         0               0               0
S2          305      4       332 MHz         1               1               1
S3          144      8       375 MHz         2               3               2
S4          256      4       600 MHz         1               0               3
S5           32      2       250 MHz         2               2               4
S6          128      4       400 MHz         2               5               5
S7           64      2       250 MHz         2               5               6
S8          144      8       375 MHz         1               2               7
S9          256      4       600 MHz         0               4               8
S10          32      2       250 MHz         0               1               9
S11         128      4       400 MHz         0               3              10
S12          64      2       250 MHz         1               4              11

• Each local site network has a peak bandwidth of 800 Mb/s (gigabit Ethernet LAN)
• The external network has 40 Mb/s available point-to-point (high-performance WAN)
• All data transfers are assumed to share the network equally (network contention is modeled)
• Performance is assumed to be linearly related to CPU speed
• Users are assumed to have pre-compiled code for each of the heterogeneous platforms
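Under the equal-sharing assumption, the WAN transfer time of a migration can be sketched as below. The 40 Mb/s figure restates the slide; the function itself is an illustration, not the paper's simulator:

```python
# WAN transfer time under equal bandwidth sharing, per the slide's
# assumptions: a 40 Mb/s point-to-point WAN, split equally among all
# concurrent transfers. Illustrative model only.

WAN_BANDWIDTH_MBPS = 40.0  # megabits per second, point-to-point


def transfer_time_s(data_megabytes: float, concurrent_transfers: int) -> float:
    """Seconds to move `data_megabytes` when `concurrent_transfers`
    flows split the WAN link equally."""
    share_mbps = WAN_BANDWIDTH_MBPS / concurrent_transfers
    return data_megabytes * 8.0 / share_mbps  # MB -> megabits, then divide


print(transfer_time_s(300.0, 1))  # a lone ~300 MB input transfer
print(transfer_time_s(300.0, 4))  # same data under four-way contention
```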
Job Workloads

Workload ID  Time Period (Start-End)  # of Jobs  Avg. Input Size (MB)
W1           03/2002-05/2002           59,623         312.7
W2           03/2002-05/2002           22,941         300.8
W3           03/2002-05/2002           16,295         305.0
W4           03/2002-05/2002            8,291         237.3
W5           03/2002-05/2002           10,543          28.9
W6           03/2002-05/2002            7,591         236.1
W7           03/2002-05/2002            7,251          86.5
W8           09/2002-11/2002           27,063         293.0
W9           09/2002-11/2002           12,666         328.3
W10          09/2002-11/2002            5,236          29.3
W11          09/2002-11/2002           11,804         226.5
W12          09/2002-11/2002            6,911          53.7

• Systems located at Lawrence Berkeley National Laboratory, NASA Ames Research Center, Lawrence Livermore National Laboratory, and the San Diego Supercomputer Center
• Data volume information is not available; we assume data volume is correlated with the volume of work
• B is the number of KBytes per work unit (CPUs × runtime); our best estimate is B = 1 KB for each CPU-second of application execution
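The deck's work-based estimate of a job's data volume (B KBytes per CPU-second) can be written directly; the function name is an assumption:

```python
# Estimate a job's data volume from its work, per the slide:
# volume = B KBytes per work unit, where a work unit is one
# CPU-second (CPUs * runtime) and the deck's best estimate is B = 1 KB.

B_KBYTES_PER_CPU_SECOND = 1.0  # the deck's best estimate


def estimated_data_kb(cpus: int, runtime_s: float,
                      b: float = B_KBYTES_PER_CPU_SECOND) -> float:
    """Data volume in KBytes for a job using `cpus` CPUs for `runtime_s` seconds."""
    return b * cpus * runtime_s


print(estimated_data_kb(16, 3600.0))        # 16 CPUs for an hour at B
print(estimated_data_kb(16, 3600.0, 10.0))  # the same job under the 10B scenario
```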
Scheduling Policy
• Large potential gain from the grid superscheduler: average wait time reduced by 25X compared with the local scheme
• Sender-Initiated performance is comparable to Centralized
• Inverse relationship between the migration metrics (FOJM, FDVM) and the timing metrics (NAWT, NART)
• Very small fraction of response time is spent moving data (DMOH)
[Chart: normalized values (0.0-1.0) of NAWT, NART, FOJM, FDVM, and DMOH for the S-I, R-I, Sy-I, Centralized, and Local schemes; 12 sites, Workload B]
Data Migration Sensitivity
• NAWT for 100B is almost 8X that of B; NART is 50% higher
• DMOH increases to 28% and 44% for 10B and 100B, respectively
• As B increases, the fraction of data migrated (FDVM) decreases due to the increasing overhead
• FOJM is inconsistent because it measures the number of jobs, not data volume
[Chart: normalized values of NAWT, NART, FOJM, FDVM, and DMOH for data volume settings 0B, 0.1B, B, 10B, and 100B; Sender-Initiated, 12 sites]
Site Number Sensitivity
• At 0.1B, the number of sites has no effect
• At 10B, there is a noticeable effect as the number of sites decreases from 12 to 3:
  – Times (NAWT, NART) decrease due to the increase in intra-site network bandwidth
  – The fraction of data volume migrated (FDVM) increases
  – The fraction of response time spent moving data (DMOH) increases by 40%
[Chart: normalized values of NAWT, NART, FOJM, FDVM, and DMOH for 0.1B and 10B at 12, 6, and 3 sites; Sender-Initiated]
Communication-Oblivious Scheduling
• For 10B, if the data migration cost is not considered in the scheduling algorithm:
  – NART increases 14X and 40X for 12 sites and 3 sites, respectively
  – NAWT increases 28X and 43X for 12 sites and 3 sites, respectively
  – DMOH is over 96% (only 3% for the B setting)
  – 16% of all jobs are blocked from executing while waiting for data, compared with practically 0% for communication-aware scheduling
[Chart: normalized values of NAWT, NART, FOJM, FDVM, and DMOH for 0.1B, B, and 10B at 12 and 3 sites; off-scale bars labeled 2.1, 2.4, 6.3, and 17.1; Sender-Initiated]
Increased Workload Sensitivity
• Grid scheduling executes 40% more jobs than the non-grid local scheme:
  – No increase in NAWT or NART
  – Weighted utilization increases from 66% to 93%
• However, there is a fine line: when the number of jobs increases by 45%, NAWT grows 3.5X and NART grows 2.4X
[Chart: normalized NAWT, NART, FOJM, FDVM, DMOH, and UTIL for the base workload and for 125%, 133%, 140%, and 145% loads; off-scale bars labeled 1.2, 1.1, 3.5, and 2.4; Sender-Initiated, 12 sites, Workload B]
Conclusions
• Studied the impact of data migration, simulating:
  – Compute servers
  – Grouping of servers into sites
  – Inter-server networks
• Results showed huge benefits of grid scheduling
• S-I reduced average turnaround time by 60% compared with local approach, even in the presence of input/output data migration
• Algorithm can execute 40% more jobs in grid environment and deliver same turnaround times as non-grid scenario
• For large data files, critical to consider migration overhead– 43X increase in NART using communication-oblivious scheduling
Future Work
• Superscheduling scalability:– Resource discovery– Fault tolerance
• Multi-resource requirements
• Architectural heterogeneity
• Practical deployment issues