1
MS Thesis Defense
Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project
Young Suk Moon
Chair: Dr. Hans-Peter Bischof
Reader: Dr. Gregor von Laszewski
Observer: Dr. Minseok Kwon
2
Outline
Introduction to the Water Threat Management Project
Motivation
Research Objectives
Fault-Tolerant Queue
Evaluation
Conclusion
3
Water Threat Management
Motivation
Urban Water Distribution Systems (WDSs) can be an easy target for terror attacks, e.g. contamination of the water.
Methods
Detect contamination using the sensors located across the WDSs.
Run algorithms (developed by NCSU) that determine sensor locations so as to minimize the time needed to find the contaminant source locations.
4
Existing Water Threat Management System Architecture
Optimization Engine: Runs Evolutionary Algorithm (EA)
Simulation Engine: Runs EPANET
5
Water Threat Management System Requirements
Requirements
Time sensitive
Massive calculation
Dynamic adaptation to a Grid environment
Fault tolerance
Our goal
The current system is not fault-tolerant; develop a fault-tolerant framework for the dynamic environment.
6
Motivation
Resource (Site) Outage
5% down during 2009
Queue Wait Time
TeraGrid User & System News (http://news.teragrid.org/)
7
Research Objectives
Develop a fault-tolerant framework dealing with resource outages
Strategy: generation distribution on multiple sites
Reduce queue wait time
Strategy: dynamic job dependency
8
Water Threat Management Application
Sequential & parallel processing
9
Generation Distribution
Divide generations into multiple parts as multiple jobs.
Distribute them on multiple sites.
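The division of generations into per-site jobs can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `split_generations` and the choice of contiguous, near-equal ranges are assumptions.

```python
def split_generations(total_generations, sites):
    """Split generations 1..total_generations into contiguous ranges, one per site.

    Hypothetical helper: earlier sites receive one extra generation when the
    total does not divide evenly.
    """
    if not sites:
        raise ValueError("at least one site is required")
    base, extra = divmod(total_generations, len(sites))
    ranges = {}
    start = 1
    for index, site in enumerate(sites):
        size = base + (1 if index < extra else 0)
        ranges[site] = (start, start + size - 1)  # inclusive range
        start += size
    return ranges


plan = split_generations(20, ["Abe", "Big Red"])
print(plan)  # {'Abe': (1, 10), 'Big Red': (11, 20)}
```

With 20 generations and two sites, each job receives 10 generations.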
10
Dynamic Job Dependency
Problems of generation distribution on multiple sites
Additional queue wait times
Each job is dependent on another. Cannot submit a job before the prior job finishes.
Solution: determine job dependency at run time.
Submit jobs at the same time.
Whichever job starts first computes the first set of generations.
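A rough timeline model may help show why run-time dependency reduces total time. It is a sketch under stated assumptions, not the thesis implementation: the function names are hypothetical, the queue wait and run times are made-up numbers, and the second part is assumed unable to start before the first part has finished.

```python
def static_finish_time(jobs):
    """Static dependency: each job is submitted only after the prior job finishes,
    so every queue wait is paid in sequence."""
    clock = 0.0
    for job in jobs:
        clock += job["queue_wait"] + job["run_time"]
    return clock


def dynamic_finish_time(jobs):
    """Dynamic dependency: all jobs are submitted at time 0, and the parts are
    handed out in the order in which the jobs actually start."""
    previous_done = 0.0
    for job in sorted(jobs, key=lambda j: j["queue_wait"]):
        # A job starts its part once it leaves the queue AND the prior part is done.
        start = max(job["queue_wait"], previous_done)
        previous_done = start + job["run_time"]
    return previous_done


# Illustrative numbers only (minutes); not measurements from the thesis.
jobs = [
    {"site": "A", "queue_wait": 80.0, "run_time": 5.0},
    {"site": "B", "queue_wait": 40.0, "run_time": 20.0},
]
print(static_finish_time(jobs))   # 145.0
print(dynamic_finish_time(jobs))  # 85.0
```

In this model the queue waits overlap under dynamic dependency, so only the longest relevant wait contributes to the finish time.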
11
Dynamic WTM Workflow Management
Example scenario
12
Fault-tolerant Queue
Most common fault-tolerant strategies in a Grid
Replication
Checkpointing
Limitations of checkpointing under time-criticality
Checkpointing degrades performance
Checkpoints may not be compatible with a different site (heterogeneity)
Cannot reschedule the job on the same site in case of a site outage
Choosing the replication strategy within the fault-tolerant queue
13
Fault-tolerant Queue Design
Components
Command Line Interface
Task Pool
Resource Pool
Scheduler
Resource Checker (integration with the TeraGrid Information Services)
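How the components listed above might fit together can be sketched in a few lines. This is a hypothetical, much simplified model: the class name, method names, and the `is_up` callback standing in for the Resource Checker are all assumptions, and the Command Line Interface is omitted.

```python
from collections import deque


class FaultTolerantQueue:
    """Toy model of a task pool, a resource pool, and a scheduler."""

    def __init__(self, resources, is_up):
        self.task_pool = deque()              # pending tasks
        self.resource_pool = list(resources)  # known sites
        self.is_up = is_up                    # resource checker: site name -> bool
        self.running = {}                     # task -> site

    def submit(self, task):
        self.task_pool.append(task)

    def schedule(self):
        """Assign pending tasks to sites that are up and idle."""
        busy = set(self.running.values())
        idle = [r for r in self.resource_pool if self.is_up(r) and r not in busy]
        while self.task_pool and idle:
            self.running[self.task_pool.popleft()] = idle.pop(0)
        return dict(self.running)

    def handle_outages(self):
        """Return tasks on failed sites to the front of the task pool."""
        failed = [t for t, r in self.running.items() if not self.is_up(r)]
        for task in failed:
            del self.running[task]
            self.task_pool.appendleft(task)
        return failed


status = {"Abe": True, "Big Red": True, "Queen Bee": True}
queue = FaultTolerantQueue(status, lambda name: status[name])
queue.submit("gen 1-10")
queue.submit("gen 11-20")
first = queue.schedule()         # gen 1-10 on Abe, gen 11-20 on Big Red
status["Abe"] = False            # simulated site outage
failed = queue.handle_outages()  # gen 1-10 goes back to the task pool
second = queue.schedule()        # gen 1-10 is rescheduled on Queen Bee
print(first, failed, second)
```

The point of the sketch is that a task on a failed site is rescheduled on a different site rather than retried on the same one.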
14
Fault Detection in Fault-tolerant Queue
Fault detection
Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit
Communicate with GRAM to detect job failure
TeraGrid Information Services
GRAM service may fail when the resource is down
Publishes XML documents containing the outage information
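Checking outage information from an XML document could look roughly like the sketch below. The XML layout, the attribute names, and the sample outage dates are invented for illustration; the actual TeraGrid Information Services schema is not given in these slides and is likely different.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Invented sample data in a hypothetical schema (NOT the real TeraGrid format).
SAMPLE_OUTAGES = """
<outages>
  <outage resource="Abe" start="2009-03-01T08:00:00" end="2009-03-01T20:00:00"/>
  <outage resource="Big Red" start="2009-06-15T00:00:00" end="2009-06-16T00:00:00"/>
</outages>
"""


def parse_outages(xml_text):
    """Return a list of (resource, start, end) tuples from the hypothetical XML."""
    root = ET.fromstring(xml_text.strip())
    return [
        (
            node.get("resource"),
            datetime.fromisoformat(node.get("start")),
            datetime.fromisoformat(node.get("end")),
        )
        for node in root.findall("outage")
    ]


def is_down(outages, resource, when):
    """True if `resource` has an outage window covering the time `when`."""
    return any(
        name == resource and start <= when < end
        for name, start, end in outages
    )


outages = parse_outages(SAMPLE_OUTAGES)
print(is_down(outages, "Abe", datetime(2009, 3, 1, 12, 0)))  # True
print(is_down(outages, "Abe", datetime(2009, 3, 2, 12, 0)))  # False
```

A check of this kind complements GRAM messages, since GRAM itself may be unreachable while the resource is down.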
15
Evaluation – WTM performance
WTM application performance (original)
              Abe  Big Red
#CPUs         16   16
CPU per Node  8    4
16
Evaluation – Queue Wait Time
Queue wait time statistics
            Abe    Big Red
Avg. (min)  82     42
Var.        38513  5354
sd.         196    73
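As a quick consistency check, the standard deviations above are the square roots of the reported variances, rounded to whole minutes. The dictionary name is arbitrary.

```python
import math

# Variances of queue wait time as reported in the statistics above.
variance = {"Abe": 38513, "Big Red": 5354}

# Standard deviation is the square root of the variance.
std_dev = {site: round(math.sqrt(var)) for site, var in variance.items()}
print(std_dev)  # {'Abe': 196, 'Big Red': 73}
```

In both cases the standard deviation exceeds the average wait, which indicates highly variable queue wait times.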
17
Evaluation – Performance Overhead
Performance overhead
Integrating a fault-tolerant framework usually causes performance degradation
No performance loss in our framework
Different type of workflow run time comparison
Original deployment vs. fault-tolerant deployment
Dynamic job dependency vs. static job dependency
Test each type of deployment in the real Grid system including queue wait time
Workflow        Dependency  Site Name     # Jobs  Gen. range
Original        -           Abe           1       1-20
Original        -           Big Red       1       1-20
Fault-tolerant  static      Abe, Big Red  2       1-10 (Abe), 11-20 (Big Red)
Fault-tolerant  dynamic     Abe, Big Red  2       1-10, 11-20
18
Evaluation – Workflow Performance
19
Evaluation – Workflow Performance
Workflow comparison results (figures: Experiment 1, Experiment 2, Experiment 3)
20
Simulation – Worst Case Run Time Comparison
A threat management system must deliver results under any circumstances.
Thus, the worst-case run time is a critical factor in the Water Threat Management system.
21
Simulation – Worst Case Run Time Comparison
Simulation setup
The generations are equally distributed among the machines.
Use the 2009 TeraGrid outage data.
Submit jobs every 5 minutes starting from 1/1/2009 12:00 am EST.
                         Abe   Big Red  Queen Bee
Run Time per Gen. (min)  0.52  2.07     1.02
#CPUs                    16    16       8
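The setup above can be sketched as follows. The per-generation run times come from the table; the total of 30 generations, the assumption that submissions cover the whole of 2009, and the function names are illustrative only. Time zones are ignored (naive datetimes).

```python
from datetime import datetime, timedelta

# Run time per generation in minutes, from the table above.
RUN_TIME_PER_GEN = {"Abe": 0.52, "Big Red": 2.07, "Queen Bee": 1.02}


def compute_minutes(total_generations, run_time_per_gen):
    """Compute time per machine when generations are split equally."""
    share = total_generations / len(run_time_per_gen)
    return {site: share * minutes for site, minutes in run_time_per_gen.items()}


def submission_times(start, end, step_minutes=5):
    """Yield one submission time every `step_minutes` in the range [start, end)."""
    current = start
    while current < end:
        yield current
        current += timedelta(minutes=step_minutes)


# Assumption: 30 generations in total, i.e. 10 per machine.
per_site = compute_minutes(30, RUN_TIME_PER_GEN)
print({site: round(minutes, 2) for site, minutes in per_site.items()})

# Assumption: submissions span all of 2009.
count = sum(1 for _ in submission_times(datetime(2009, 1, 1), datetime(2010, 1, 1)))
print(count)  # 105120
```

Under these assumptions the slowest machine dominates the compute time, which is consistent with the performance difference between machines noted in the conclusion.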
22
Simulation – Worst Case Run Time Comparison
Simulation queue wait time setup (unit: minutes)
23
Simulation – Worst Case Run Time Comparison
TeraGrid User & System News (http://news.teragrid.org/)
24
Simulation – Worst Case Run Time Comparison
25
Simulation – Worst Case Run Time Comparison
26
Simulation – Median Run Time, Worst Case (Max.) Run Time
27
Conclusion
Achievement:
Worst case run time is significantly reduced.
Limitation:
In “general” cases, the dynamic workflow has performance degradation, due to the low failure rate and the compute performance difference between different machines.
Possible improvement:
Migrate the generation process to a faster machine whenever possible.