Scheduling in Heterogeneous Grid Environments:
The Effects of Data Migration

Leonid Oliker, Hongzhang Shan
Future Technologies Group, Lawrence Berkeley National Laboratory

Warren Smith, Rupak Biswas
NASA Advanced Supercomputing Division, NASA Ames Research Center
Motivation
• Geographically distributed resources
• Difficult to schedule and manage efficiently
  – Autonomy (local schedulers)
  – Heterogeneity
  – Lack of perfect global information
  – Conflicting requirements between users and system administrators
Current Status
• Grid initiatives
  – Global Grid Forum, NASA Information Power Grid, TeraGrid, Particle Physics Data Grid, E-Grid, LHC Challenge
• Grid scheduling services
  – Enabling multi-site applications: multi-disciplinary applications, remote visualization, co-scheduling, distributed data mining, parameter studies
  – Job migration: improve time-to-solution, avoid dependency on a single resource provider, optimize application mapping to the target architecture
• But what are the tradeoffs of data migration?
Our Contributions
• Interaction between grid scheduler and local scheduler
• Architecture: distributed, centralized, and ideal
• Real workloads
• Performance metrics
• Job migration overhead
• Superscheduler scalability
• Fault tolerance
• Multi-resource requirements
Distributed Architecture
[Figure: Distributed architecture. In the grid environment, a grid scheduler and grid queue (middleware) exchange job and info messages over the communication infrastructure; in each local environment, a local scheduler and local queue feed a compute server's PEs.]
Interaction between Grid and Local Schedulers
• The local scheduler reports its Approximate Wait Time (AWT) and Current Resource Utilization (CRU) to the grid scheduler; each submitted job carries its Job Requirements (JR)
• If the AWT is below a threshold, the job stays in the local queue
• Else, the job is considered for migration under one of three policies: Sender-Initiated (S-I), Receiver-Initiated (R-I), or Symmetrically-Initiated (Sy-I)
• AWT: Approximate Wait Time • CRU: Current Resource Utilization • JR: Job Requirements
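A minimal sketch of this hand-off, assuming hypothetical names and a made-up wait-time threshold (the deck does not state the actual cutoff):

```python
# Sketch of the grid/local scheduler hand-off described above.
# The function name and the threshold value are illustrative assumptions.

WAIT_TIME_THRESHOLD = 300.0  # seconds; hypothetical cutoff


def place_job(awt: float, threshold: float = WAIT_TIME_THRESHOLD) -> str:
    """Decide whether a job stays in the local queue or is
    considered for migration by the grid scheduler."""
    if awt < threshold:
        return "local"               # expected wait is acceptable
    return "consider-migration"      # hand off to an S-I / R-I / Sy-I policy


print(place_job(60.0))    # short expected wait -> stays local
print(place_job(1200.0))  # long expected wait -> considered for migration
```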
Sender-Initiated (S-I)

[Figure: S-I protocol between the host and its partners. The host broadcasts Job_i's requirements; each machine k (including the host itself) replies with ART_k and CRU_k; Job_i is dispatched to the selected machine, which returns Results_i.]
• Select the machine with the smallest Approximate Response Time (ART); break ties by CRU
• ART = Approximate Wait Time + Estimated Run Time
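The S-I selection rule (smallest ART, ties broken by lower CRU) can be sketched as follows; the reply format and machine names are illustrative:

```python
# Sender-Initiated selection: pick the machine with the smallest
# Approximate Response Time (ART); break ties by the lower
# Current Resource Utilization (CRU). Names are illustrative.

def approximate_response_time(approx_wait: float, est_run: float) -> float:
    """ART = Approximate Wait Time + Estimated Run Time."""
    return approx_wait + est_run


def select_machine(replies: dict) -> str:
    """replies maps machine name -> (ART, CRU); tuple comparison
    orders first by ART, then by CRU."""
    return min(replies, key=lambda m: (replies[m][0], replies[m][1]))


replies = {
    "host":     (approximate_response_time(100.0, 50.0), 0.90),
    "partner1": (approximate_response_time(20.0, 60.0), 0.40),
    "partner2": (approximate_response_time(30.0, 50.0), 0.20),
}
print(select_machine(replies))  # partner1/partner2 tie at ART=80; lower CRU wins
```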
Receiver-Initiated (R-I)

[Figure: R-I protocol between the host and its partners. Idle partners send a free signal to the host; the host then sends Job_i's requirements, each machine k replies with ART_k and CRU_k, and Job_i migrates to the selected machine.]
• Querying begins only after a free signal is received
Symmetrically-Initiated (Sy-I)
• First, work in R-I mode
• Change to S-I mode if no machine volunteers within a time period
• Switch back to R-I mode after the job is scheduled

[Diagram: state machine. R-I -> S-I when there is no volunteer after the time period; S-I -> R-I when volunteers are available.]
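The Sy-I mode switching can be sketched as a tiny state machine; the event names are assumptions, and the deck does not specify the length of the time period:

```python
# Symmetrically-Initiated (Sy-I) mode switching, per the slide:
# start in R-I; fall back to S-I when no machine volunteers within
# a time period; return to R-I once the job is scheduled.
# Event names are illustrative.

class SymmetricScheduler:
    def __init__(self) -> None:
        self.mode = "R-I"  # start in receiver-initiated mode

    def on_event(self, event: str) -> str:
        if self.mode == "R-I" and event == "no-volunteer-timeout":
            self.mode = "S-I"   # nobody volunteered: actively push the job
        elif self.mode == "S-I" and event == "job-scheduled":
            self.mode = "R-I"   # job placed: go back to waiting for volunteers
        return self.mode


s = SymmetricScheduler()
print(s.on_event("no-volunteer-timeout"))  # -> S-I
print(s.on_event("job-scheduled"))         # -> R-I
```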
Centralized Architecture
• Advantages: global view
• Disadvantages: single point of failure, limited scalability

[Figure: Centralized architecture. Jobs submitted through web portals or a super shell enter a single grid queue managed by the grid scheduler via the middleware.]
Performance Metrics
• AverageWaitTime = (1/N) Σ_{j ∈ Jobs} (StartTime_j − SubmitTime_j)
• AverageTurnaroundTime = (1/N) Σ_{j ∈ Jobs} (EndTime_j − SubmitTime_j)
• FractionOfJobsTransferred = NumberOfJobsMigrated / TotalNumberOfJobs
• FractionDataVolumeMigrated = Σ_{K ∈ MigratedJobs} (InputSize_K + OutputSize_K) / Σ_{J ∈ Jobs} (InputSize_J + OutputSize_J)
• DataMigrationOverhead = TotalDataMigrationTime / Σ_{J ∈ Jobs} (ET_J − QT_J), where ET_J is job J's end time and QT_J its submit (queue) time
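These metrics can be computed directly from a job trace. A sketch, assuming a hypothetical record layout of (submit, start, end, data bytes, migrated-flag) and made-up sample numbers:

```python
# Compute the deck's five metrics from a list of job records.
# Each record is (submit, start, end, data_bytes, migrated);
# this layout and the sample values are illustrative assumptions.

def metrics(jobs, total_migration_time):
    n = len(jobs)
    avg_wait = sum(start - submit for submit, start, _, _, _ in jobs) / n
    avg_turnaround = sum(end - submit for submit, _, end, _, _ in jobs) / n
    frac_jobs_migrated = sum(1 for *_, mig in jobs if mig) / n
    total_data = sum(d for _, _, _, d, _ in jobs)
    frac_data_migrated = sum(d for _, _, _, d, mig in jobs if mig) / total_data
    # overhead: migration time relative to total response time (ET - QT)
    overhead = total_migration_time / sum(end - submit
                                          for submit, _, end, _, _ in jobs)
    return avg_wait, avg_turnaround, frac_jobs_migrated, frac_data_migrated, overhead


jobs = [
    # (submit, start, end, input+output bytes, migrated?)
    (0.0, 10.0, 110.0, 4_000, True),
    (0.0, 30.0, 130.0, 1_000, False),
]
print(metrics(jobs, total_migration_time=5.0))
```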
Resource Configuration and Site Assignment

Server ID  Nodes  CPUs/Node  CPU Speed  Site (3 Sites)  Site (6 Sites)  Site (12 Sites)
S1          184     16       375 MHz         0               0               0
S2          305      4       332 MHz         1               1               1
S3          144      8       375 MHz         2               3               2
S4          256      4       600 MHz         1               0               3
S5           32      2       250 MHz         2               2               4
S6          128      4       400 MHz         2               5               5
S7           64      2       250 MHz         2               5               6
S8          144      8       375 MHz         1               2               7
S9          256      4       600 MHz         0               4               8
S10          32      2       250 MHz         0               1               9
S11         128      4       400 MHz         0               3              10
S12          64      2       250 MHz         1               4              11

• Each local site network has a peak bandwidth of 800 Mb/s (gigabit Ethernet LAN)
• The external network has 40 Mb/s available point-to-point (high-performance WAN)
• All data transfers are assumed to share the network equally (network contention is modeled)
• Performance is assumed to be linearly related to CPU speed
• Users are assumed to have pre-compiled code for each of the heterogeneous platforms
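Under the equal-sharing assumption, the WAN transfer time of a migration can be sketched as below. The 40 Mb/s figure restates the slide; the function itself is an illustration, not the paper's simulator:

```python
# WAN transfer time under equal bandwidth sharing, per the slide's
# assumptions: a 40 Mb/s point-to-point WAN, split equally among all
# concurrent transfers. Illustrative model only.

WAN_BANDWIDTH_MBPS = 40.0  # megabits per second, point-to-point


def transfer_time_s(data_megabytes: float, concurrent_transfers: int) -> float:
    """Seconds to move `data_megabytes` when `concurrent_transfers`
    flows split the WAN link equally."""
    share_mbps = WAN_BANDWIDTH_MBPS / concurrent_transfers
    return data_megabytes * 8.0 / share_mbps  # MB -> megabits, then divide


print(transfer_time_s(300.0, 1))  # a lone ~300 MB input transfer
print(transfer_time_s(300.0, 4))  # same data under four-way contention
```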
Job Workloads

Workload ID  Time Period (Start-End)  # of Jobs  Avg. Input Size (MB)
W1           03/2002-05/2002           59,623         312.7
W2           03/2002-05/2002           22,941         300.8
W3           03/2002-05/2002           16,295         305.0
W4           03/2002-05/2002            8,291         237.3
W5           03/2002-05/2002           10,543          28.9
W6           03/2002-05/2002            7,591         236.1
W7           03/2002-05/2002            7,251          86.5
W8           09/2002-11/2002           27,063         293.0
W9           09/2002-11/2002           12,666         328.3
W10          09/2002-11/2002            5,236          29.3
W11          09/2002-11/2002           11,804         226.5
W12          09/2002-11/2002            6,911          53.7

• Systems located at Lawrence Berkeley National Laboratory, NASA Ames Research Center, Lawrence Livermore National Laboratory, and the San Diego Supercomputer Center
• Data volume information is not available; we assume data volume is correlated with the volume of work
• B is the number of KBytes per work unit (CPUs × runtime); our best estimate is B = 1 KB for each CPU-second of application execution
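The deck's work-based estimate of a job's data volume (B KBytes per CPU-second) can be written directly; the function name is an assumption:

```python
# Estimate a job's data volume from its work, per the slide:
# volume = B KBytes per work unit, where a work unit is one
# CPU-second (CPUs * runtime) and the deck's best estimate is B = 1 KB.

B_KBYTES_PER_CPU_SECOND = 1.0  # the deck's best estimate


def estimated_data_kb(cpus: int, runtime_s: float,
                      b: float = B_KBYTES_PER_CPU_SECOND) -> float:
    """Data volume in KBytes for a job using `cpus` CPUs for `runtime_s` seconds."""
    return b * cpus * runtime_s


print(estimated_data_kb(16, 3600.0))        # 16 CPUs for an hour at B
print(estimated_data_kb(16, 3600.0, 10.0))  # the same job under the 10B scenario
```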
Scheduling Policy
• Large potential gain from the grid superscheduler: average wait time reduced by 25X compared with the local scheme
• Sender-Initiated performance is comparable to Centralized
• Inverse relationship between the migration metrics (FOJM, FDVM) and the timing metrics (NAWT, NART)
• Very small fraction of response time is spent moving data (DMOH)
[Chart: normalized values (0.0-1.0) of NAWT, NART, FOJM, FDVM, and DMOH for the S-I, R-I, Sy-I, Centralized, and Local schemes; 12 sites, Workload B]
Data Migration Sensitivity
• NAWT for 100B is almost 8X that of B; NART is 50% higher
• DMOH increases to 28% and 44% for 10B and 100B, respectively
• As B increases, the fraction of data migrated (FDVM) decreases due to the increasing overhead
• FOJM is inconsistent because it measures the number of jobs, not data volume
[Chart: normalized values of NAWT, NART, FOJM, FDVM, and DMOH for data volume settings 0B, 0.1B, B, 10B, and 100B; Sender-Initiated, 12 sites]
Site Number Sensitivity
• At 0.1B, the number of sites has no effect
• At 10B, there is a noticeable effect as the number of sites decreases from 12 to 3:
  – Times (NAWT, NART) decrease due to the increase in intra-site network bandwidth
  – The fraction of data volume migrated (FDVM) increases
  – The fraction of response time spent moving data (DMOH) increases by 40%
[Chart: normalized values of NAWT, NART, FOJM, FDVM, and DMOH for 0.1B and 10B at 12, 6, and 3 sites; Sender-Initiated]
Communication-Oblivious Scheduling
• For 10B, if the data migration cost is not considered in the scheduling algorithm:
  – NART increases 14X and 40X for 12 sites and 3 sites, respectively
  – NAWT increases 28X and 43X for 12 sites and 3 sites, respectively
  – DMOH is over 96% (only 3% for the B setting)
  – 16% of all jobs are blocked from executing while waiting for data, compared with practically 0% for communication-aware scheduling
[Chart: normalized values of NAWT, NART, FOJM, FDVM, and DMOH for 0.1B, B, and 10B at 12 and 3 sites; off-scale bars labeled 2.1, 2.4, 6.3, and 17.1; Sender-Initiated]
Increased Workload Sensitivity
• Grid scheduling executes 40% more jobs than the non-grid local scheme:
  – No increase in NAWT or NART
  – Weighted utilization increases from 66% to 93%
• However, there is a fine line: when the number of jobs increases by 45%, NAWT grows 3.5X and NART grows 2.4X
[Chart: normalized NAWT, NART, FOJM, FDVM, DMOH, and UTIL for the base workload and for 125%, 133%, 140%, and 145% loads; off-scale bars labeled 1.2, 1.1, 3.5, and 2.4; Sender-Initiated, 12 sites, Workload B]
Conclusions
• Studied the impact of data migration, simulating:
  – Compute servers
  – Grouping of servers into sites
  – Inter-server networks
• Results showed huge benefits of grid scheduling
• S-I reduced average turnaround time by 60% compared with local approach, even in the presence of input/output data migration
• Algorithm can execute 40% more jobs in grid environment and deliver same turnaround times as non-grid scenario
• For large data files, critical to consider migration overhead– 43X increase in NART using communication-oblivious scheduling
Future Work
• Superscheduling scalability:– Resource discovery– Fault tolerance
• Multi-resource requirements
• Architectural heterogeneity
• Practical deployment issues