Young Suk Moon Chair: Dr. Hans-Peter Bischof Reader: Dr. Gregor von Laszewski Observer: Dr. Minseok...

1

MS Thesis Defense

Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Young Suk Moon

Chair: Dr. Hans-Peter Bischof

Reader: Dr. Gregor von Laszewski

Observer: Dr. Minseok Kwon

2

Outline

Introduction to the Water Threat Management Project

Motivation

Research Objectives

Fault-Tolerant Queue

Evaluation

Conclusion

3

Water Threat Management

Motivation

Urban Water Distribution Systems (WDSs) can be an easy target for terror attacks, e.g. contamination of the water.

Methods

Detect contamination using the sensors located across the WDSs.

Run algorithms (developed by NCSU) to determine the sensor locations to minimize the searching time to find the contaminant source locations.

4

Existing Water Threat Management System Architecture

Optimization Engine: Runs Evolutionary Algorithm (EA)

Simulation Engine: Runs EPANET

5

Water Threat Management System Requirements

Requirements

Time sensitive

Massive calculation

Dynamic adaptation to a Grid environment

Fault tolerance

Our goal

The current system is not fault-tolerant: develop a fault-tolerant framework in the dynamic environment.

6

Motivation

Resource (Site) Outage

5% down during 2009

Queue Wait Time

TeraGrid User & System News (http://news.teragrid.org/)

7

Research Objectives

Develop a fault-tolerant framework dealing with resource outages

Strategy: generation distribution on multiple sites

Reduce queue wait time

Strategy: dynamic job dependency

8

Water Threat Management Application

Sequential & parallel processing

9

Generation Distribution

Divide generations into multiple parts as multiple jobs.

Distribute them on multiple sites.
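The splitting step can be illustrated with a minimal sketch. It assumes generations are numbered from 1 and divided into contiguous, near-equal ranges, one range per job; the function names split_generations and assign_to_sites are hypothetical and do not come from the thesis.

```python
def split_generations(total_generations, num_jobs):
    """Split generations 1..total_generations into num_jobs contiguous ranges."""
    if num_jobs < 1 or total_generations < num_jobs:
        raise ValueError("need at least one generation per job")
    base, extra = divmod(total_generations, num_jobs)
    ranges = []
    start = 1
    for job in range(num_jobs):
        # The first `extra` jobs take one additional generation each.
        size = base + (1 if job < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def assign_to_sites(ranges, sites):
    """Pair each range with a site, cycling when there are more jobs than sites."""
    return [(sites[i % len(sites)], r) for i, r in enumerate(ranges)]


if __name__ == "__main__":
    ranges = split_generations(20, 2)
    print(assign_to_sites(ranges, ["Abe", "Big Red"]))
```

With 20 generations and two jobs this yields the ranges 1-10 and 11-20, one per site.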

10

Dynamic Job Dependency

Problems of generation distribution on multiple sites

Additional queue wait times

Each job is dependent on another.

Cannot submit a job before the prior job finishes.

Solution: determine job dependency at run time.

Submit jobs at the same time.

Whichever job starts first computes the first set of generations.
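One way to picture run-time dependency is a shared coordinator that hands out generation ranges in order as jobs leave their queues. The sketch below is only an illustration under that assumption: the class GenerationCoordinator and its claim method are hypothetical names, and the hand-over of intermediate results between jobs is left out.

```python
import threading


class GenerationCoordinator:
    """Hands out generation ranges, in order, to jobs as they start running."""

    def __init__(self, ranges):
        self._ranges = list(ranges)
        self._next = 0
        self._lock = threading.Lock()
        self.assignments = []  # (job_id, range) in the order jobs started

    def claim(self, job_id):
        """Called by a job when it starts; returns its range, or None if none remain."""
        with self._lock:
            if self._next >= len(self._ranges):
                return None
            gen_range = self._ranges[self._next]
            self._next += 1
            self.assignments.append((job_id, gen_range))
            return gen_range


if __name__ == "__main__":
    coordinator = GenerationCoordinator([(1, 10), (11, 20)])
    # Both jobs were submitted at the same time; here the Big Red job starts first.
    print(coordinator.claim("big-red-job"))
    print(coordinator.claim("abe-job"))
```

Because the range is bound to a job only when that job starts, no site is fixed to a particular set of generations at submission time, and the queue waits of the jobs can overlap.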

11

Dynamic WTM Workflow Management

Example scenario

12

Fault-tolerant Queue

Most common fault-tolerant strategies in a Grid

Replication

Checkpointing

Limitation of checkpointing with time-criticality

Checkpointing performance degradation

Checkpointing may not be compatible on a different site (heterogeneity)

Cannot reschedule job on the same site in case of site outage

Choosing the replication strategy within the fault-tolerant queue

13

Fault-tolerant Queue Design

Components

Command Line Interface

Task Pool

Resource Pool

Scheduler

Resource Checker (integration with the TeraGrid Information Services)
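How these components might fit together can be sketched roughly as follows. This is not the thesis implementation: the class FaultTolerantQueue, its methods, and the two callbacks are hypothetical, and the scheduler here uses simple sequential failover to the next available resource rather than running replicas concurrently.

```python
from collections import deque


class FaultTolerantQueue:
    """Toy queue: a task pool, a resource pool, and a failover scheduler."""

    def __init__(self, resources, is_available, run_task):
        self.task_pool = deque()
        self.resource_pool = list(resources)
        self.is_available = is_available  # stands in for the resource checker
        self.run_task = run_task          # returns True when the job succeeds

    def submit(self, task):
        self.task_pool.append(task)

    def schedule(self):
        """Run each pooled task, moving to the next resource on outage or failure."""
        results = {}
        while self.task_pool:
            task = self.task_pool.popleft()
            results[task] = None  # stays None if every resource fails
            for resource in self.resource_pool:
                if not self.is_available(resource):
                    continue
                if self.run_task(task, resource):
                    results[task] = resource
                    break
        return results


if __name__ == "__main__":
    down = {"Abe"}  # pretend Abe is in an outage
    queue = FaultTolerantQueue(
        ["Abe", "Big Red"],
        is_available=lambda r: r not in down,
        run_task=lambda task, r: True,
    )
    queue.submit("gens-1-10")
    print(queue.schedule())
```

In this sketch a task submitted while Abe is down ends up on Big Red, which is the behavior a resource checker is meant to enable.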

14

Fault Detection in Fault-tolerant Queue

Fault detection

Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit

Communicate with GRAM to detect job failure

TeraGrid Information Services

GRAM service may fail when the resource is down

Publishes XML documents containing the outage information
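The two detection paths can be combined along the following lines. Everything specific in this sketch is an assumption: the XML layout, the sample outage dates, and the state string "FAILED" are made up for illustration, since the slides do not show the actual TeraGrid Information Services schema or the GRAM messages used.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical outage document with made-up dates; not the real TeraGrid schema.
SAMPLE_XML = """
<outages>
  <outage resource="Abe" start="2009-03-01T08:00:00" end="2009-03-01T20:00:00"/>
  <outage resource="Big Red" start="2009-05-10T00:00:00" end="2009-05-11T00:00:00"/>
</outages>
"""


def resources_in_outage(xml_text, at_time):
    """Return the set of resources whose outage window contains at_time."""
    root = ET.fromstring(xml_text.strip())
    down = set()
    for outage in root.iter("outage"):
        start = datetime.fromisoformat(outage.get("start"))
        end = datetime.fromisoformat(outage.get("end"))
        if start <= at_time < end:
            down.add(outage.get("resource"))
    return down


def job_has_failed(job_state, resource, down_resources):
    """Treat a job as failed if the job manager reports failure or its site is down.

    job_state is a placeholder for whatever status the job manager reports.
    """
    return job_state == "FAILED" or resource in down_resources


if __name__ == "__main__":
    down = resources_in_outage(SAMPLE_XML, datetime(2009, 3, 1, 12, 0))
    print(down, job_has_failed("ACTIVE", "Abe", down))
```

The second check matters because, as noted above, the job manager itself may be unreachable when the whole resource is down, so a missing failure message cannot be read as success.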

15

Evaluation – WTM performance

WTM application performance (original)

              Abe   Big Red
#CPUs         16    16
CPU per Node  8     4

16

Evaluation – Queue Wait Time

Queue wait time statistics

            Abe     Big Red
Avg. (min)  82      42
Var.        38513   5354
sd.         196     73

17

Evaluation – Performance Overhead

Performance overhead

Integrating a fault-tolerant framework usually causes performance degradation

No performance loss in our framework

Different type of workflow run time comparison

Original deployment vs. fault-tolerant deployment

Dynamic job dependency vs. static job dependency

Test each type of deployment in the real Grid system including queue wait time

Workflow        Dependency  Site Name     # Jobs  Gen. range
Original        -           Abe           1       1-20
Original        -           Big Red       1       1-20
Fault-tolerant  static      Abe, Big Red  2       1-10 (Abe), 11-20 (Big Red)
Fault-tolerant  dynamic     Abe, Big Red  2       1-10, 11-20

18

Evaluation – Workflow Performance

19

Evaluation – Workflow Performance

Workflow comparison results

[Figures: Experiment 1, Experiment 2, Experiment 3]

20

Simulation – Worst Case Run Time Comparison

A threat management system must deliver results under any circumstances.

Thus, the worst-case run time is a critical factor in the Water Threat Management system.

21

Simulation – Worst Case Run Time Comparison

Simulation setup

The generations are equally distributed among the machines.

Use the 2009 TeraGrid outage data.

Submit jobs every 5 minutes starting from 1/1/2009 12:00 am EST.

                         Abe    Big Red  Queen Bee
Run Time per Gen. (min)  0.52   2.07     1.02
#CPUs                    16     16       8
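The per-generation run times above allow a rough estimate of pure compute time for a given distribution. The sketch below assumes the jobs run one after another (each depends on the previous one) and ignores queue wait time and outages. The figure of 20 generations is borrowed from the evaluation setup as an assumption, since the slides do not state the generation count used in the simulation, and compute_time is a hypothetical helper name.

```python
# Run time per generation in minutes, taken from the simulation setup table.
RUN_TIME_PER_GEN = {"Abe": 0.52, "Big Red": 2.07, "Queen Bee": 1.02}


def compute_time(gens_per_site):
    """Total compute minutes when sites process their generations sequentially."""
    return sum(RUN_TIME_PER_GEN[site] * gens for site, gens in gens_per_site.items())


if __name__ == "__main__":
    # Assumed workload: 20 generations, as in the evaluation experiments.
    print(f"{compute_time({'Abe': 20}):.1f}")                 # 10.4
    print(f"{compute_time({'Abe': 10, 'Big Red': 10}):.1f}")  # 25.9
```

Under these assumptions, an equal split across Abe and Big Red takes about 25.9 minutes of compute time against 10.4 minutes on Abe alone, so distribution only pays off when queue waits or outages dominate.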

22

Simulation – Worst Case Run Time Comparison

Simulation queue wait time setup (unit: minutes)

23


TeraGrid User & System News (http://news.teragrid.org/)

24


25


26

Simulation – Median Run Time, Worst Case (Max.) Run Time

27

Conclusion

Achievement:

Worst case run time is significantly reduced.

Limitation:

In “general” cases, the dynamic workflow has performance degradation, due to the low failure rate and the compute performance difference between machines.

Possible improvement:

Migrate the generation process to a faster machine whenever possible.
