MS Thesis Defense: Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Slide 1

MS Thesis Defense
Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project
Young Suk Moon

Chair: Dr. Hans-Peter Bischof
Reader: Dr. Gregor von Laszewski
Observer: Dr. Minseok Kwon

Slide 2

Outline
- Introduction to Water Threat Management Project
- Motivation
- Research Objectives
- Fault-Tolerant Queue
- Analysis of Fault-Tolerant Workflow
- Evaluation
- Conclusion

Slide 3

Water Threat Management
- Managing contamination incidents in urban Water Distribution Systems (WDSs)
- Simulating water quality and hydraulic behavior using EPANET
- Determining locations of sensors in WDSs to detect contamination
- Searching for the contaminant source location

Notes:
- Motivation should be here: security problems (e.g. terror, threats); sensors are expensive.
- Run an algorithm that places sensors so as to minimize time.
- NCSA developed a statistical algorithm.
- Fault tolerance.
- Network graphic.

Slide 4

Existing Water Threat Management System Architecture

Slide 5

EPANET Simulation in the Simulation Engine

Slide 6

Water Threat Management Requirements
- Time sensitivity
- Large computational power
- Dynamic adaptation to a Grid environment
- Fault tolerance

Notes: What the sensors obtain are statistical values (for example, 90% in that location).

Slides 7-8

Motivation
- A Grid is a fault-prone environment.
Source: TeraGrid User & System News (http://news.teragrid.org/)

Slide 9

Motivation
- Outage rate (TeraGrid) during 2009

Notes: Add a unit to the outage rate.

Source: TeraGrid User & System News (http://news.teragrid.org/)

Notes: The Grid may look reliable, but this is threat management.

Slide 10

Motivation
- Queue wait time

Slide 11

Research Objectives
- Fault-tolerant Water Threat Management system deployment
- Generation distribution on multiple sites

- Reducing queue wait time
- Dynamic job dependency

Slide 12

Water Threat Management Application
- Sequential & parallel processing

Slide 13

Generation Distribution
- Divide the generations into multiple parts and run each part as a separate job.
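The division step can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `split_generations` and the choice of contiguous, near-equal ranges are assumptions made for the example.

```python
def split_generations(first, last, k):
    """Split the inclusive generation range [first, last] into k contiguous parts.

    Hypothetical helper: parts are as equal in size as possible, and any
    remainder is spread over the earliest parts.
    """
    total = last - first + 1
    if k < 1 or k > total:
        raise ValueError("k must be between 1 and the number of generations")
    base, extra = divmod(total, k)
    parts = []
    start = first
    for i in range(k):
        size = base + (1 if i < extra else 0)
        parts.append((start, start + size - 1))
        start += size
    return parts


if __name__ == "__main__":
    # 20 generations divided into 2 jobs
    print(split_generations(1, 20, 2))
```

With 20 generations and two jobs this yields the ranges 1-10 and 11-20, the same split used in the evaluation table later in the deck.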

Slide 14

Generation Distribution
- File communication

Slide 15

Dynamic Job Dependency
- Determine job dependency at run time.

Notes: Not understandable.

Slide 16

Dynamic Job Dependency
- Without dynamic job dependency

Slide 17

Dynamic Job Dependency
- With dynamic job dependency
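The difference between the two cases can be sketched with a small timing model. This is one reading of the slides, under stated assumptions: each part is submitted to a different site, a part can start only after its job leaves the queue and the previous part has finished, and with dynamic dependency the job that leaves its queue first takes the next pending part. The function names and the 10-minute run time are hypothetical.

```python
def static_finish(queue_waits, run_times):
    """Finish time when part i is bound to site i before submission."""
    finish = 0.0
    for wait, run in zip(queue_waits, run_times):
        # A part needs both its own queue slot and the previous part's output.
        start = max(wait, finish)
        finish = start + run
    return finish


def dynamic_finish(queue_waits, run_times):
    """Finish time when parts are bound at run time, in order of queue exit."""
    order = sorted(range(len(queue_waits)), key=lambda i: queue_waits[i])
    finish = 0.0
    for i in order:
        start = max(queue_waits[i], finish)
        finish = start + run_times[i]
    return finish


if __name__ == "__main__":
    waits = [82, 42]   # average queue waits (min) for Abe and Big Red from the slides
    runs = [10, 10]    # hypothetical run time (min) of one part on each site
    print(static_finish(waits, runs), dynamic_finish(waits, runs))
```

Under these assumptions the static binding finishes at 102 minutes and the dynamic binding at 92 minutes, because the shorter queue is used for the first part instead of sitting idle.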

Slide 18

Fault-tolerant Queue
- Most common fault-tolerance strategies in a Grid
  - Replication
  - Checkpointing
- Limitations of checkpointing for time-critical work
  - Checkpointing degrades performance
  - A checkpoint may not be compatible with a different site
  - A job cannot be rescheduled on the same site during a site outage
- The fault-tolerant queue uses the replication strategy

Notes: heterogeneous

Slide 19

Fault-tolerant Queue Design
- Architecture
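Why replication suits a time-critical job can be illustrated with a short calculation. This is a sketch under a strong assumption that is not stated in the slides: replicas on different sites fail independently. The function name and the probabilities are hypothetical.

```python
import math


def all_replicas_fail(failure_probabilities):
    """Probability that every replica of a task fails.

    Assumes independent failures across sites, which is an
    illustrative simplification.
    """
    for p in failure_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must be in [0, 1]")
    return math.prod(failure_probabilities)


if __name__ == "__main__":
    # Hypothetical per-site failure probabilities for one task
    print(all_replicas_fail([0.1, 0.2]))
```

The task is lost only if all replicas fail, so adding a site multiplies the loss probability by that site's failure probability, and no checkpoint has to be moved between heterogeneous sites.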

Slide 20

Fault-tolerant Queue Design
Components
- Cyberaide Shell Command Line Interface
- Task Pool
- Resource Pool
- Scheduler
- Resource Checker (integration with the TeraGrid Information Services)

Slide 21

Fault-tolerant Queue Design
Command Line Interface (CLI)

The queue command:
    cybershell> queue

Setting policy:
    queue> policy -task -replicate

Submitting a job:
    queue> submit -cmd /mydir/mpijob -mpi 16

Slide 22

Fault-tolerant Queue Design
- Lifecycle of a job

Slide 23

Fault-tolerant Queue Design
Fault detection
- Messages from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit
  - Communicate with GRAM to detect job failure

- TeraGrid Information Services
  - Web application

Notes: TeraGrid info service: fast; another service.

Slide 24

Fault-tolerant Queue Design
- TeraGrid Information Services

Slide 25

Dynamic WTM Workflow Management
- Example scenario

Slide 26

Run Time Analysis of WTM Job Distribution
Hypothesis
- When failures occur, the total run time can be reduced by generation distribution.
- The run time of the original WTM job is T.
- Divide the original job into N parts (jobs).
- The run time of each divided job is T/N.
- A job of length T/N has a lower probability of failure than a job of length T.
- If N increases, the total run time decreases.

Slide 27

Run Time Analysis of WTM Job Distribution
Failure probability during time x:

The expected number of times to run a job until it succeeds (geometric distribution):

T(n): run time of n generations, k: number of jobs
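The formulas on this slide did not survive extraction, so the following is only a sketch of the kind of calculation described. Two parts are assumptions introduced here: an exponential failure model for the failure probability during time x, and the pessimistic simplification that a failed attempt costs the full run time of its part. The expected number of attempts, 1/(1 - p), is the standard mean of a geometric distribution with success probability 1 - p. Queue wait time is ignored.

```python
import math


def failure_probability(x, rate):
    """P(failure during time x) under an assumed exponential failure model."""
    return 1.0 - math.exp(-rate * x)


def expected_attempts(p):
    """Mean of a geometric distribution: attempts until the first success."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)


def expected_total_run_time(total_time, k, rate):
    """Expected run time of k sequential jobs, each of length total_time / k.

    Assumes every failed attempt costs the full length of its part.
    """
    part = total_time / k
    p = failure_probability(part, rate)
    return k * part * expected_attempts(p)


if __name__ == "__main__":
    # Hypothetical job of 100 minutes with a failure rate of 0.01 per minute
    for k in (1, 2, 4):
        print(k, round(expected_total_run_time(100.0, k, 0.01), 2))
```

Under these assumptions the expected total run time falls as k grows, which is the behavior the hypothesis slide predicts.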

Slide 28

Run Time Analysis of WTM Job Distribution
Run time for X:

Run time when the job is run m times:

Slide 29

Run Time Analysis of WTM Job Distribution
Run time of k jobs:

Notes: k is the division factor; queue wait time is not considered.

Slide 30

Evaluation
- WTM application performance (generation)

Notes: EPANET

Slide 31

Evaluation
Queue wait time statistics

              Abe      Big Red
Avg. (min)    82       42
Var.          38513    5354
sd.           196      73

Slide 32

Evaluation
- Performance overhead
- More..

Slide 33

Evaluation
- Goal: compare the run time of different types of workflow
- Setup: what to measure and why

Version         Workflow  Site Name     Number of Jobs  Generation range
Original        -         Abe           1               1-20
Original        -         Big Red       1               1-20
Fault-tolerant  static    Abe, Big Red  2               1-10 (Abe), 11-20 (Big Red)
Fault-tolerant  dynamic   Abe, Big Red  2               1-10, 11-20

Notes: Explain what static and dynamic mean.

Slide 34

Evaluation
- Workflow comparison (jobs submitted at different times)

Notes: Explain how this figure relates to the previous slide.

Slide 35

Evaluation
Simulation

Statistical model for the original WTM deployment:

t: run time of a job, p: failure rate, q: avg. queue wait time

Statistical model for the dynamic WTM deployment:

k: number of jobs, q_i: avg. queue wait time of the i-th job, t_i: run time of the i-th job
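The model equations themselves were not recovered from the slides. As a rough illustration of how the listed symbols could combine, the sketch below assumes that every attempt first waits in the queue and then runs, that an attempt fails with a fixed probability, and that the k jobs of the dynamic deployment run one after another without overlapping queue waits. These assumptions, the function names, and the per-job failure probabilities `p_values` are hypothetical and may differ from the thesis model.

```python
def expected_time_original(t, p, q):
    """Expected completion time of one job retried until it succeeds.

    t: run time, p: failure probability per attempt, q: avg. queue wait.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return (q + t) / (1.0 - p)


def expected_time_dynamic(run_times, queue_waits, p_values):
    """Expected completion time of k sequential jobs, each retried until success.

    Pessimistic: queue waits of different jobs are not allowed to overlap.
    """
    if not (len(run_times) == len(queue_waits) == len(p_values)):
        raise ValueError("all inputs must have length k")
    return sum(
        expected_time_original(t, p, q)
        for t, q, p in zip(run_times, queue_waits, p_values)
    )


if __name__ == "__main__":
    # Hypothetical run time and failure probability; queue wait from the slides
    print(expected_time_original(t=10.0, p=0.5, q=82.0))
```

A model of this shape makes the trade-off visible: splitting a job shortens each attempt and lowers its failure probability, but every extra job adds its own queue wait unless the waits overlap.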

                             Abe     Big Red
Avg. queue wait time (min)   82      42
P(failure per year)          .065    .050
Run time per gen. (min)      .522.07 (the two column values are fused in the source)

Slide 36

Evaluation
- Generation distribution optimization

Slide 37

Evaluation
- Performance simulation

Notes: Explain more.

Notes: Explain queue wait time + job run time.

Slide 38

Conclusion
- The queue wait time in the workflow can be reduced by the dynamic job dependency strategy, combined with generation distribution on multiple sites.

- Fault tolerance in the WTM deployment can be achieved by the fault-tolerant queue, which integrates GRAM and the TeraGrid Information Services.

Slide 39

References
- L. Rossman, EPANET 2 Users Manual, US Environmental Protection Agency, Cincinnati, Ohio, Tech. Rep., 2000.
- TeraGrid Information Services, Web Page. [Online]. Available: http://info.teragrid.org/
- A Globus Primer: Describing Globus Toolkit 4, Globus, August 2005. [Online]. Available: http://www.globus.org/toolkit/docs/4.0/key/GT4 Primer 0.6.pdf