MS Thesis Defense
Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project
Young Suk Moon
Chair: Dr. Hans-Peter Bischof
Reader: Dr. Gregor von Laszewski
Observer: Dr. Minseok Kwon

Outline
- Introduction to Water Threat Management Project
- Motivation
- Research Objectives
- Fault-Tolerant Queue
- Analysis of Fault-Tolerant Workflow
- Evaluation
- Conclusion

Water Threat Management
- Managing contamination incidents in urban Water Distribution Systems (WDSs)
- Simulating water quality and hydraulic behavior using EPANET
- Determining locations of sensors in WDSs to detect contamination
- Searching for the contaminant source location
Note: Motivation should be here. Security problem, e.g. terror threats; sensors are expensive. Run an algorithm to place sensors so as to minimize time. NCSA developed a statistical algorithm. Fault tolerant. Network graphic.

Existing Water Threat Management System Architecture

EPANET Simulation in the Simulation Engine

Water Threat Management Requirements
- Time sensitivity
- Large computational power
- Dynamic adaptation to a Grid environment
- Fault tolerance
Note: What the sensors obtain are statistical values (90% in that location).

Motivation
- A Grid is a fault-prone environment.
- TeraGrid User & System News (http://news.teragrid.org/)

Motivation
- Outage Rate (TeraGrid) during 2009 (note: add unit)
- TeraGrid User & System News (http://news.teragrid.org/)
Note: It may look reliable, but this is threat management.

Motivation
- Queue wait time

Research Objectives
- Fault-tolerant Water Threat Management system deployment
  - Generation distribution on multiple sites
- Reducing queue wait time
  - Dynamic job dependency

Water Threat Management Application
- Sequential & parallel processing

Generation Distribution
- Divide the generations into multiple parts and run each part as a separate job.

Generation Distribution
- File communication

Dynamic Job Dependency
- Determine job dependency at run time.
Note: Not understandable.

Dynamic Job Dependency
- Without dynamic job dependency

Dynamic Job Dependency
- With dynamic job dependency
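
One way to read the comparison above is as a question of when each job's generation range is fixed. The Python sketch below is a simplified model under stated assumptions, not the thesis implementation: all jobs are submitted at time 0, each job waits in its site's queue for a given time, generation ranges must run in order, and every range takes the same run time. With a static dependency, range i is bound to job i at submission; with a dynamic dependency, whichever job leaves its queue first takes the next unprocessed range. The function names and the 10-minute run time in the usage note are illustrative.

```python
from typing import Sequence


def static_finish_time(queue_waits: Sequence[float], run_time: float) -> float:
    """Finish time when range i is bound to job i at submission time."""
    finished = 0.0
    for wait in queue_waits:
        # Job i starts only after it leaves its queue AND range i-1 is done.
        start = max(wait, finished)
        finished = start + run_time
    return finished


def dynamic_finish_time(queue_waits: Sequence[float], run_time: float) -> float:
    """Finish time when the first job out of its queue takes the next range."""
    # Assigning ranges in order of queue exit is the same as sorting the waits.
    return static_finish_time(sorted(queue_waits), run_time)
```

With queue waits of 82.0 and 42.0 minutes and an assumed run time of 10.0 minutes per range, `static_finish_time` gives 102.0 and `dynamic_finish_time` gives 92.0: in the dynamic case the first range runs while the slower queue is still waiting.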

Fault-tolerant Queue
- Most common fault-tolerant strategies in a Grid
  - Replication
  - Checkpointing
- Limitations of checkpointing for time-critical jobs
  - Checkpointing degrades performance
  - A checkpoint may not be compatible with a different site
  - A job cannot be rescheduled on the same site during a site outage
- Choosing the replication strategy within the fault-tolerant queue
Note: heterogeneous

Fault-tolerant Queue Design
- Architecture

Fault-tolerant Queue Design
- Components
  - Cyberaide Shell Command Line Interface
  - Task Pool
  - Resource Pool
  - Scheduler
  - Resource Checker (integration with the TeraGrid Information Services)

Fault-tolerant Queue Design
- Command Line Interface (CLI)
  - The queue command
      cybershell> queue
  - Setting policy
      queue> policy -task -replicate
  - Submitting a job
      queue> submit -cmd /mydir/mpijob -mpi 16

Fault-tolerant Queue Design
- Lifecycle of a job

Fault-tolerant Queue Design
- Fault detection
  - Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit
    - Communicate with GRAM to detect job failure
  - TeraGrid Information Services
    - Web Application
Note: TeraGrid info service fast; another service

Fault-tolerant Queue Design
- TeraGrid Information Services
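
The replication policy and the resource check described above can be sketched in Python as follows. This is a minimal in-memory illustration, not the Cyberaide Shell implementation: `run_replicated`, `site_status`, and `fake_run` are hypothetical names, the `site_status` dictionary stands in for a lookup against the TeraGrid Information Services, and a job failure is signalled by a Python exception rather than by a GRAM message.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Optional, Tuple


def run_replicated(
    task: str,
    site_status: Dict[str, bool],
    run_on: Callable[[str, str], str],
) -> Optional[Tuple[str, str]]:
    """Submit one replica of `task` to every site that is up.

    Returns (site, result) for the first replica that succeeds,
    or None if no site is up or every replica fails.
    """
    # Resource checker stand-in: skip sites reported as down.
    up_sites = [site for site, is_up in site_status.items() if is_up]
    if not up_sites:
        return None

    with ThreadPoolExecutor(max_workers=len(up_sites)) as executor:
        replicas = {executor.submit(run_on, site, task): site for site in up_sites}
        for replica in as_completed(replicas):
            if replica.exception() is None:
                # Leaving the `with` block still waits for the other replicas;
                # a real queue would cancel them instead.
                return replicas[replica], replica.result()
    return None


def fake_run(site: str, task: str) -> str:
    """Hypothetical job runner in which the site 'abe' always fails."""
    if site == "abe":
        raise RuntimeError(f"{task} failed on {site}")
    return f"{task} finished on {site}"
```

Calling `run_replicated("wtm", {"abe": True, "bigred": True}, fake_run)` submits two replicas; the replica on `abe` fails and the result from `bigred` is returned, so one failed site does not force a resubmission and a second queue wait.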

Dynamic WTM Workflow Management
- Example scenario

Run Time Analysis of WTM Job Distribution
- Hypothesis
  - When failures occur, the total run time can be reduced by the generation distribution.
- The run time of the original WTM job: T
- Divide the original job into N parts (jobs)
  - The run time of each divided job is T/N
  - A job of length T/N has a lower probability of failure than a job of length T
- If N increases, the total run time decreases.

Run Time Analysis of WTM Job Distribution
- Failure probability during time x:
- The expected number of times to run a job until it succeeds (geometric distribution):
- T(n): run time of n generations, k: number of jobs

Run Time Analysis of WTM Job Distribution
- Run time for X:
- Run time when the job is run m times:

Run Time Analysis of WTM Job Distribution
- Run time of k jobs:
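
The following Python sketch illustrates the hypothesis numerically. It is not the thesis's model; it rests on assumptions chosen only for illustration: failures arrive at a constant rate `failure_rate`, so a job of length x fails with probability 1 - exp(-failure_rate * x); a failed job is rerun from the start and is charged its full length; the number of attempts follows a geometric distribution; and queue wait time is ignored. All function and parameter names are hypothetical.

```python
import math


def failure_probability(x: float, failure_rate: float) -> float:
    """Assumed model: probability that a job of length x fails."""
    return 1.0 - math.exp(-failure_rate * x)


def expected_attempts(p_fail: float) -> float:
    """Mean of a geometric distribution with success probability 1 - p_fail."""
    return 1.0 / (1.0 - p_fail)


def expected_total_run_time(total: float, k: int, failure_rate: float) -> float:
    """Expected run time when a job of length `total` is split into k jobs."""
    part = total / k
    p_fail = failure_probability(part, failure_rate)
    # The k jobs run one after another; each attempt is charged the full part length.
    return k * part * expected_attempts(p_fail)
```

With the arbitrary values `total = 100` and `failure_rate = 0.01`, the expected run time falls from about 272 for k = 1 to about 165 for k = 2 and about 128 for k = 4, which is the direction the hypothesis predicts under these assumptions.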
Note: k is the division factor; the analysis did not consider queue wait time.

Evaluation
- WTM application performance (generation)
Note: EPANET

Evaluation
- Queue wait time statistics

              Abe      Big Red
Avg. (min)    82       42
Var. (min^2)  38513    5354
sd. (min)     196      73

Evaluation
- Performance overhead
- More..

Evaluation
- Goal: run time comparison of different types of workflow
- Setup, what to measure, why measure

Version         Workflow  Site Name     Number of Jobs  Generation range
Original        -         Abe           1               1-20
Original        -         Big Red       1               1-20
Fault-tolerant  static    Abe, Big Red  2               1-10 (Abe), 11-20 (Big Red)
Fault-tolerant  dynamic   Abe, Big Red  2               1-10, 11-20
Note: Explain what static and dynamic are.

Evaluation
- Workflow comparison (submitted jobs at different times)
Note: Explain how this figure relates to the previous slide.

Evaluation
- Simulation
  - Statistical model for the original WTM deployment
    t: run time of a job, p: failure rate, q: avg. queue wait time
  - Statistical model for the dynamic WTM deployment
    k: number of jobs, q_i: avg. queue wait time of the i-th job, t_i: run time of the i-th job

                             Abe    Big Red
Avg. queue wait time (min)   82     42
P(failure per year)          .065   .050
Run time per gen. (min)      .52    2.07

Evaluation
- Generation distribution optimization

Evaluation
- Performance simulation
Note: explain more
Note: Explain queue wait time + job run time.

Conclusion
- The queue wait time in the workflow can be reduced by the dynamic job dependency strategy (with the generation distribution on multiple sites).
- Fault tolerance in the WTM deployment can be achieved by the fault-tolerant queue, which integrates GRAM and the TeraGrid Information Services.

References
- L. Rossman, EPANET 2 users manual, US Environmental Protection Agency, Cincinnati, Ohio, Tech. Rep., 2000.
- TeraGrid Information Services, Web Page. [Online]. Available: http://info.teragrid.org/
- A Globus Primer: Describing Globus Toolkit 4, Globus, August 2005. [Online]. Available: http://www.globus.org/toolkit/docs/4.0/key/GT4 Primer 0.6.pdf