PANORAMA: Predictive Modeling and Diagnostic Monitoring of Extreme Science Workflows

[Figure: Flow prioritization control loop] Actors: Pegasus WMS; Mobius++ (Request Manager (RM), AHAB, VirtualSDX API); NIaaS/Infrastructure (VirtualSDX + ExoGENI); Workflow Models (Aspen) + Simulation (CODES) feeding a Resource Provisioner.

1. Pegasus determines “Express flow” requirements for workflow data transfers
2. Pegasus sends “modifyNetwork” QoS requests to the messaging space (a message sketch follows this list)
3. The RM uses AHAB and the VirtualSDX API to send SDN QoS requests to the SDX Tree Network
4. The SDX Tree Network actuates QoS actions using the REST API of the SDN controller
5. The SDN controller contacts the SDX switches for each site
6. The slice is modified with the required bandwidth QoS and is then ready with QoS
7. The RM sends an acknowledgment to Pegasus when the Express flow QoS is set
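To make step 2 concrete, here is a minimal, hypothetical sketch of publishing a “modifyNetwork” QoS request to the messaging space. The poster names RabbitMQ as the message broker, but the queue name and payload fields below are assumptions, not the project's actual schema.

    import json
    import pika  # RabbitMQ client

    # Hypothetical payload: the poster specifies only the "modifyNetwork"
    # request type and a bandwidth QoS for an Express flow.
    request = {
        "action": "modifyNetwork",
        "flow": {"src_site": "Site1", "dst_site": "Site3"},  # hypothetical fields
        "bandwidth_mbps": 500,
    }

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="qos-requests")  # hypothetical queue name
    channel.basic_publish(exchange="", routing_key="qos-requests",
                          body=json.dumps(request))
    connection.close()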

Ewa Deelman, Christopher Carothers, Anirban Mandal, Brian Tierney, Jeffrey Vetter, Ilya Baldin, Mark Blanco, Rafael Ferreira da Silva, Mariam Kiran, Vickie Lynch, Shirley Moore, Paul Ruth

LEARN MORE
Panorama Website: http://sites.google.com/site/panoramaofworkflows/

Panorama is funded by the US Department of Energy under Grant #DE-SC0012636

PANORAMA
Overview of the Project Description

Water is seen as small red and white molecules on large nanodiamond spheres. The colored tRNA can be seen on the nanodiamond surface. Image: Michael Mattheson, ORNL (https://www.ornl.gov/news/diamonds-deliver).

IMPACT ON DOE SCIENCE
Diamonds that deliver!

Panorama enabled cutting-edge domain science research and development that has the potential to solve some of the challenges associated with drug discovery and delivery:

• The motions of a tRNA (or transfer RNA) model system can be enhanced when coupled with nanodiamonds, or diamond nanoparticles approximately 5 to 10 nanometers in size

Flow Prioritization Use Case

NETWORK PROVISIONING
Data Flow Prioritization for Scientific Workflows Using a Virtual SDX

SYSTEM DESIGN
Workflow System, Infrastructure Monitoring, Analytical Modeling, Simulation


[Architecture diagram] Workflow path: Abstract Workflow <DAX> → Workflow Planner <Pegasus> → Executable Workflow <DAG> → Workflow Engine <DAGMan> → Job Manager <Condor> → Application Job, with the Real-Time Job Monitor <Kickstart> wrapping jobs. Monitoring path: monitoring events flow through the Message Broker <RabbitMQ> to the Workflow Monitor <monitord> (which applies DB updates from workflow logs to the Pegasus DB <RDBMS>) and as timeseries to the Monitoring DB <InfluxDB>; Infrastructure Monitoring contributes I/O, network, and resource data. Modeling path: the Workflow Planner exchanges workflow structure, parameters, and workflow and task models with Analytical Modeling <Aspen>, which drives the Simulator <ROSS/CODES> and returns performance estimates. Anomaly Detection <Complex Event Processing> compares actual against estimated performance on the event stream and records anomalies; the Dashboard <web UI> displays workflow state, performance, estimates, and anomalies. Legend: new PANORAMA components, components modified by PANORAMA, unmodified existing components.

• Pegasus interfaces with Aspen to estimate resource requirements of individual workflow tasks as well as of the entire workflow

• Aspen interfaces with ROSS/CODES to simulate network behavior that is not easily modeled using analytical techniques

• Workflow and infrastructure monitoring data are stored in InfluxDB and the Pegasus DB

• The anomaly detection process monitors the data stream and generates anomaly notifications, which are displayed in the web dashboard (see the sketch after this list)
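As a rough illustration of the monitoring path, the following hedged Python sketch stores one monitoring event as an InfluxDB timeseries point and applies a naive threshold check. The real system uses complex event processing; the event field names and database name here are hypothetical.

    from influxdb import InfluxDBClient  # influxdb-python client

    client = InfluxDBClient(host="localhost", port=8086, database="panorama")

    def on_monitoring_event(event):
        # Persist the measurement as a timeseries point.
        client.write_points([{
            "measurement": event["metric"],       # e.g. "io_read_bytes"
            "tags": {"task": event["task_id"]},
            "fields": {"value": float(event["value"])},
        }])
        # Naive stand-in for the CEP detector: flag values deviating from
        # the Aspen estimate by more than 50%.
        estimate = event.get("estimate")
        if estimate and abs(event["value"] - estimate) > 0.5 * estimate:
            print("anomaly: %s on task %s" % (event["metric"], event["task_id"]))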

The PANORAMA project aims to further the understanding of the behavior of scientific workflows as they execute in heterogeneous environments. PANORAMA's approach to modeling and diagnosing the runtime performance of complex scientific workflows is to integrate extreme-scale systems testbed experimentation, structured analytical modeling, and parallel systems simulation into a comprehensive workflow framework that can characterize end-to-end workflow performance on today's and future-generation architectures, and that can be used to improve overall workflow performance and reliability. The PANORAMA architecture brings together the individual framework components, the Aspen analytical application modeling software, the ROSS simulation framework, and the Pegasus workflow management system, and uses them to model the behavior of DOE-relevant applications. With a coupled model of the application and the execution environment, decisions can be made about resource provisioning, application task scheduling, data management within the application, and more. Our approach to correlating real-time application and infrastructure monitoring data can be used to verify application behavior, perform anomaly detection and diagnosis, and support adaptivity during workflow execution.

• We have developed an SNS Pegasus workflow to confirm that nanodiamonds enhance the dynamics of tRNA in the presence of water. The workflow calculates the epsilon that best matches experimental data. These calculations used almost 400,000 CPU hours on a Cray XE6 at NERSC.

• The workflow runs parallel NAMD simulations, varying epsilon between -0.01 and -0.19 for each specified temperature (800 cores; equilibrium runs take ~1.5 hours and production runs 12-16 hours). AMBER's cpptraj removes global translation and rotation, and SASSENA calculates neutron scattering intensities from the trajectories (400 cores, 3-6 hours). This workflow was used to compute 4 temperatures between 260K and 300K, generating ~3TB of data. The sketch below outlines the sweep structure.
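The parameter-sweep structure of this workflow can be sketched with the current Pegasus Python API (the production runs predate this API); the transformation names, file names, and exact temperature values below are illustrative assumptions.

    from Pegasus.api import File, Job, Workflow

    wf = Workflow("sns-nanodiamond-trna")
    temperatures = [260, 280, 290, 300]                     # 4 temps in 260-300K (values assumed)
    epsilons = [round(-0.01 * i, 2) for i in range(1, 20)]  # -0.01 .. -0.19

    for T in temperatures:
        for eps in epsilons:
            tag = "T%d_eps%s" % (T, eps)
            traj = File("prod_%s.dcd" % tag)
            # NAMD equilibration + production (800 cores in the real runs)
            namd = Job("namd").add_args("--temperature", str(T), "--epsilon", str(eps)) \
                              .add_outputs(traj)
            # cpptraj removes global translation and rotation
            centered = File("centered_%s.dcd" % tag)
            cpptraj = Job("cpptraj").add_inputs(traj).add_outputs(centered)
            # SASSENA computes neutron scattering intensities (400 cores)
            sassena = Job("sassena").add_inputs(centered) \
                                    .add_outputs(File("intensities_%s.h5" % tag))
            wf.add_jobs(namd, cpptraj, sassena)

    wf.write("workflow.yml")  # task dependencies are inferred from file usage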

MODELING AND SIMULATION
Scalable Workload Generation for Application Performance Modeling and Simulation

BURST BUFFERS
On the Use of Burst Buffers for Accelerating Scientific Workflow Executions

Major Findings:

• I/O write performance was improved by a factor of 9, and I/O read performance by a factor of 15

• Performance decreased slightly at node counts above 64 (potential I/O ceiling)

• I/O performance must be balanced with parallel efficiency when using burst buffers with highly parallel applications

• I/O contention may limit the broad applicability of burst buffers for all workflow applications (e.g., in situ processing)

Burst buffers (BB) have emerged as a non-volatile storage solution positioned between the processors' memory and the parallel file system (PFS), buffering the large volumes of data produced by an application at a higher rate than the PFS while seamlessly draining the data to the PFS asynchronously.

We explored the impact of burst buffers on scientific workflow applications. Using a software stack including Pegasus WMS and HTCondor, we ran a workflow on the Cori system at NERSC that included provisioning and releasing remote shared BB nodes. Our application wrote and read about 550 GB of data. A minimal sketch of BB-directed I/O appears below.
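For context, a task can direct its I/O at a provisioned DataWarp burst buffer on Cori through the mount point that SLURM exposes in the DW_JOB_STRIPED environment variable (striped access mode); the fallback path and file size in this sketch are invented for illustration.

    import os
    import time

    bb_root = os.environ.get("DW_JOB_STRIPED")   # set only inside a BB job
    out_dir = bb_root if bb_root else "/tmp"     # illustrative fallback

    payload = b"\0" * (64 * 1024 * 1024)         # 64 MiB test block
    start = time.perf_counter()
    with open(os.path.join(out_dir, "chunk.dat"), "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())                     # force the write out of page cache
    elapsed = time.perf_counter() - start
    print("wrote 64 MiB to %s in %.3fs (%.1f MiB/s)" % (out_dir, elapsed, 64 / elapsed))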

[Figure: Virtualized SDX] An ExoGENI slice (or slices) provides dynamic connectivity between an OpenFlow controller, Site 1, Site 2, Site 3, and a data site.

[Figure: SDX Tree Network with Express Flows] A RYU OpenFlow controller (OpenFlow v1.3) manages a tree network with vSDX spanning Site0 through Site6, with bandwidth-provisioned Express Flows. SDX: a meeting point where networks exchange traffic, securely and with QoS, using SDN protocols.

We developed mechanisms to arbitrate and prioritize data flows from competing workflows by leveraging advanced network provisioning technologies like a virtual Software Defined Exchange (SDX).

[Figure: vSDX use case with HTCondor pools and Pegasus] An ExoGENI slice with a virtual SDX, three HTCondor pools, and a data site; an OpenFlow controller provides dynamic connectivity between Site 1, Site 2, Site 3, and the data site.

• Software Defined Exchange (SDX): a meeting point where networks exchange traffic, securely and with QoS, using SDN protocols

• Virtual SDX: a virtual overlay acting as an SDX without a persistent physical location

• An ExoGENI virtual SDX can modify compute, network, and storage resources to support the changing demands of the SDX (see the controller sketch after this list)
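In the same spirit as the vSDX controller (the poster names RYU and OpenFlow v1.3), the fragment below steers one workflow's traffic into a rate-limited switch queue so its bandwidth share reflects the requested priority. The match fields, queue ID, and output port are hypothetical, and the queues themselves would have to be pre-configured on the switches.

    from ryu.base import app_manager
    from ryu.controller import ofp_event
    from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
    from ryu.ofproto import ofproto_v1_3

    class ExpressFlowApp(app_manager.RyuApp):
        OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

        @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
        def on_switch_features(self, ev):
            dp = ev.msg.datapath
            parser = dp.ofproto_parser
            # Match traffic from the data site toward HTCondor pool 1 (addresses assumed).
            match = parser.OFPMatch(eth_type=0x0800,
                                    ipv4_src=("10.0.1.0", "255.255.255.0"),
                                    ipv4_dst=("10.0.2.0", "255.255.255.0"))
            actions = [parser.OFPActionSetQueue(1),   # bandwidth-provisioned queue
                       parser.OFPActionOutput(2)]     # assumed output port
            inst = [parser.OFPInstructionActions(dp.ofproto.OFPIT_APPLY_ACTIONS,
                                                 actions)]
            dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=100,
                                          match=match, instructions=inst))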

[Figure: BB vs. No-BB, 1-128 nodes] MPI-IO module data: average time consumed in I/O read operations per process (seconds, 0-2000).

[Figure: BB vs. No-BB for fx.sgt and fy.sgt, 1-128 nodes] I/O performance estimate for read operations at the MPI-IO layer in MiB/s (solid lines), and average runtime speedup, 0.5-2.0 (dashed lines).

[Figure: stime vs. utime, 1-128 nodes] Ratio between the cumulative time spent in the user (utime) and kernel (stime) spaces, as cumulative CPU time usage (%), for different numbers of nodes.

• Virtual SDX transparently arbitrates workflow data flows communicated by Pegasus

Prioritized Data Flows

The Mobius++ framework can be used by several high-level applications to provision and adapt infrastructure based on their particular requirements.

Experimental Results

[Figure: Transfer time vs. Flow priorities (Different bandwidths)] X-axis: relative data flow priorities (Workflow1-Workflow2), from 10-80 through 45-45 to 80-10; Y-axis: avg. transfer time for "individual" jobs (secs), 0-1800. Series: Workflow1 on HTCondor pool 1 and Workflow2 on HTCondor pool 2, each at vSDX bandwidths of 250Mb/s and 500Mb/s.

[Figure: Execution time vs. Flow priorities (vSDX bandwidth = 250Mb/s)] X-axis: relative data flow priorities (Workflow1-Workflow2); Y-axis: workflow execution time (secs), 2000-6500. Series: Workflow1 on HTCondor pool 1, Workflow2 on HTCondor pool 2.

Effect of relative flow priorities on observed data transfer times for data-intensive workflow tasks for different vSDX provisioned bandwidths

Effect of relative flow priorities on overall workflow execution times

• We have created a new technique for generating scalable workloads from real applications and implemented a prototype, called Durango, using a performance modeling toolkit.

• We demonstrate the efficacy of Durango's direct integration approach, which links Aspen into CODES as part of the running network simulation model. Here, Aspen generates the application-level computation timing events, which in turn drive the start of a network communication phase (a toy sketch of this event flow follows the captions below).

[Figure] Durango in direct integration mode with a 1.3M-node dragonfly network and the Aspen compute node generator for 1K to 32K MPI ranks.

[Figure] Durango in direct integration mode with a 32K-node torus network and the Aspen compute node generator for 1K to 16K MPI ranks.
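To convey the direct-integration idea without the real toolkits (Aspen is an analytical modeling framework; CODES builds on the ROSS parallel discrete-event simulator), here is a toy discrete-event sketch in Python: an analytical compute model emits per-rank computation-end events, and each event triggers a stand-in communication phase. All timings and the rank count are invented.

    import heapq

    def compute_time(rank):
        # Stand-in for an Aspen analytical kernel-runtime estimate.
        return 1.0 + 0.001 * (rank % 8)   # hypothetical load imbalance

    NET_COST = 0.25                        # stand-in for a simulated transfer time
    events = [(compute_time(r), "compute_done", r) for r in range(1024)]
    heapq.heapify(events)

    clock = 0.0
    while events:
        clock, kind, rank = heapq.heappop(events)
        if kind == "compute_done":
            # Computation finished: schedule this rank's communication phase,
            # the role the CODES network model plays in direct integration mode.
            heapq.heappush(events, (clock + NET_COST, "comm_done", rank))

    print("simulated makespan for 1024 ranks: %.3fs" % clock)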