
VLab: A Cyberinfrastructure for Parameter Sampling Computations Suited for Materials Science Calculations

Cesar R. S. da Silva¹

Pedro R. C. da Silveira¹

¹Minnesota Supercomputing Institute, University of Minnesota

Work Sponsored by NSF grant ITR-0426757 and MSI

The VLab

- "A cyberinfrastructure designed to facilitate/enable execution of extensive calculations that can be broken into several decoupled tasks."

Typically parameter-sweeping applications, such as:

- Weather and climate
- Oil exploration
- Stress tests of investment strategies
- Seismology
- Geodynamics
- Everybody's favorite: calculation of thermal properties of materials at high pressures and temperatures

VLab has three main roles

1 - Science Enabler

• Empowering users to manage extensive workflows

- Automatic workflow management
- Ease of use
- Collaborative support
- Diversity of tools for data analysis, visualization, etc.

• Aggregating throughput of scattered resources to cope with huge workloads

- Distributed computations
- Fault tolerance
- Optimal scheduling

… three main roles

2 - Community facility

• Available to the entire planetary-materials community

• Provides a set of tools of common interest

3 - Virtual Organization

• Globally accessible through the WWW

• Strong collaborative support

- Shared access to projects
- Collaborative data analysis with synchronous view of data
- Works combined with teleconferencing software

Allows geographically distributed groups to work on the same project.

However, VLab is not:

1 - A program or Software Distribution

• You can download the sources and create your own VLab

• But there is no advantage in doing so.

2 - A tool to calculate thermal properties of Materials

• This is just one VLab application

• New applications can be developed as users show interest and willingness to participate.

The VLab

- Composed of a set of tools, made available to each other as Web Services distributed across the internet. Currently available tools include:

- Quantum ESPRESSO package tools
- Input preparation for pwscf, phonon, workflows, etc.
- Data analysis and visualization tools (VTK/OpenGL)
- Workflow management and monitoring tools
- and many more to come …

- Automatic generation of task input and recollection of output

- User interface consolidated in an easy-to-use portal
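As an illustration of the input-preparation step, here is a minimal Python sketch of generating pwscf (Quantum ESPRESSO) inputs across a pressure grid. The template values — pseudopotential file names, cutoff, lattice parameter, and the pressure list — are hypothetical placeholders, not VLab's actual templates:

    # Sketch: generate pwscf input files for a pressure sweep over cubic MgO.
    # All numerical values and pseudopotential names are illustrative only.
    TEMPLATE = """&control
        calculation = 'vc-relax', prefix = 'mgo_{tag}', pseudo_dir = './pseudo'
    /
    &system
        ibrav = 2, celldm(1) = {alat}, nat = 2, ntyp = 2, ecutwfc = 70.0
    /
    &electrons
        conv_thr = 1.0d-8
    /
    &ions
    /
    &cell
        press = {press}
    /
    ATOMIC_SPECIES
     Mg 24.305 Mg.pz-n-vbc.UPF
     O  15.999 O.pz-van_ak.UPF
    ATOMIC_POSITIONS crystal
     Mg 0.00 0.00 0.00
     O  0.50 0.50 0.50
    K_POINTS automatic
     4 4 4 0 0 0
    """

    pressures_kbar = [0, 300, 600, 900, 1200]     # hypothetical pressure grid
    for p in pressures_kbar:
        with open(f"mgo_p{p}.in", "w") as f:
            f.write(TEMPLATE.format(tag=f"p{p}", alat=7.9, press=float(p)))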

The VLab

VLab Workflows

Typical VLab workflows, like the high-T Cij calculation, involve iterations through the following steps:

1) Prepare inputs for tasks, and generate execution packages containing required files.

2) Dispatch the execution packages to compute nodes for execution.

3) Gather results for analysis and, if needed, iterate steps 1-3.
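A minimal sketch of this prepare/dispatch/gather loop, assuming hypothetical prepare_input, dispatch, parse_output, refine, and done callables that stand in for the actual VLab services and convergence criteria:

    # Sketch of the prepare -> dispatch -> gather iteration over decoupled
    # tasks. The callables passed in are hypothetical stand-ins.
    from concurrent.futures import ThreadPoolExecutor

    def run_sweep(params, prepare_input, dispatch, parse_output):
        packages = [prepare_input(p) for p in params]       # step 1
        with ThreadPoolExecutor(max_workers=16) as pool:    # step 2: tasks
            raw = list(pool.map(dispatch, packages))        # are independent
        return [parse_output(r) for r in raw]               # step 3

    def iterate(params, prepare_input, dispatch, parse_output, refine, done):
        results = run_sweep(params, prepare_input, dispatch, parse_output)
        while not done(results):                            # iterate steps 1-3
            params = refine(params, results)
            results = run_sweep(params, prepare_input, dispatch, parse_output)
        return results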

Leverages computing capabilities of distributed resources (TeraGrid, OSG, scattered resources, other grids)

- Automatic Task Distribution and Data Recollection

Exploits workflow-level parallelism to increase performance

Optimal scheduling is an open field

VLab - A Distributed System Approach

-Distributed components are replicated for:

- Redundancy
- Performance
- Flexibility

-No central component to fail and bring everything down!

- Flexible scheduling for:
  - Cost
  - Turnaround time
  - Job throughput
  - Workload balance
  - System throughput
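One way such a policy can be expressed is as a weighted score over those criteria. A sketch with made-up weights and resource fields — not VLab's actual scheduler:

    # Sketch: choose a resource by a weighted score over the scheduling
    # criteria above. Field names and weights are illustrative only.
    def best_resource(resources, w_cost=1.0, w_wait=1.0, w_load=1.0):
        def score(r):
            return (w_cost * r["cost_per_cpu_hour"]
                    + w_wait * r["queue_wait_hours"]
                    + w_load * r["load_fraction"])
        return min(resources, key=score)

    resources = [
        {"name": "teragrid-a", "cost_per_cpu_hour": 0.0,
         "queue_wait_hours": 6.0, "load_fraction": 0.9},
        {"name": "local-cluster", "cost_per_cpu_hour": 0.1,
         "queue_wait_hours": 0.5, "load_fraction": 0.4},
    ]
    print(best_resource(resources)["name"])    # -> local-cluster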

VLab - What already works

-Automatic task distribution and data recollection

-Shared access to project monitoring tools and data

-Non-collaborative data analysis and 2D graphs.

-High PT properties workflow and its sub-workflows

•High PT application completes successfully, generating a number of thermodynamic variables from a single input, with no user intervention during execution.

VLab - What has to be done

-Fault tolerance:
  - Registry based.
  - Redundant registry and metadata DB for data persistence.
  - Full journaling of critical transactions for data and metadata integrity.

-Dynamic composition of Web Services:
  - Will facilitate development of new applications.

-Volumetric (3D) data visualization:
  - Has to be rewritten from scratch.

-Collaborative data analysis and visualization:
  - Inconsistent UI.
  - Erratic behavior with 2 or more simultaneous users.
  - Support for synchronous view of data not yet implemented.

… What has to be done - Methodological improvements

-Real-space symmetry operations in ESPRESSO -> reciprocal space

-Numerical instability with Wentzcovitch VCS-MD -> (PR?)

-Constant G-space cutoff in VCS-MD in ESPRESSO -> (?)

-Fitting procedure in the high-PT data analysis tool: the tool currently in use has a serious flaw.

VLab in Action

Live demo at 2nd VLab Workshop 07: http://www.vlab.msi.umn.edu/events/videos/secondworkshop/08082007/Demo.mov

Calculation of high-(P,T) thermodynamic properties:
- Cubic MgO, 2-atom cell
- Static + lattice dynamics calculation
- {Pn} x {qi} sampling

- Shows distributed computing capabilities
- Ability to integrate visualization and data analysis tools

Visit the VLab web site: http://www.vlab.msi.umn.edu/

VLab Service-Oriented Architecture

On the Web: http://dasilveira.msi.umn.edu:8080/vlab/

Usage-oriented view of the VLab SOA => tree-like structure in 4 layers:

1) User interface (portal)
2) Workflow control and monitoring (Project Executor / Interaction)
3) Task dispatching / interaction, task data retrieval, auxiliary services
4) Heavy computation and visualization resources

Cij Workflow

Left: Extensive High-T Cij

Right: Detailed View of Cij and phonon

Scheduling => of fundamental importance for performance

The usual approach:
- Use agents that interact with the broker

Problem: agents are not stateless!
- More complicated to develop
- Persistence must be guaranteed

The VLab approach:
- Use an independent WS to monitor workload.
- Persistence of data is provided by a local DB.
- Compute WS and workload monitor are stateless!
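A sketch of how a stateless workload monitor backed by a local DB can look; the schema and queries here are illustrative assumptions, not VLab's actual service:

    # Sketch: stateless workload monitor. All state lives in a local
    # SQLite DB, so any instance of the service can answer a broker's
    # query; the schema and queries are illustrative only.
    import sqlite3

    def open_db(path="workload.db"):
        db = sqlite3.connect(path)
        db.execute("""CREATE TABLE IF NOT EXISTS tasks
                      (task_id TEXT PRIMARY KEY, node TEXT, status TEXT)""")
        return db

    def record_task(db, task_id, node, status):
        db.execute("INSERT OR REPLACE INTO tasks VALUES (?, ?, ?)",
                   (task_id, node, status))
        db.commit()

    def running_tasks(db, node):
        """What a broker asks the monitor: current load on a node."""
        (n,) = db.execute("SELECT COUNT(*) FROM tasks "
                          "WHERE node = ? AND status = 'running'",
                          (node,)).fetchone()
        return n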

VLab - Not Just a Client/Server

The Client/Server Approach:

-The portal and the supporting modules have access to a large central multi-processor system.

-Can work as a facilitator but lacks other important features found in VLab.

- No flexibility of scheduling
- No redundancy => poor availability
- No choice for cost (usually high)

Fault Tolerance

• Only Project Executor sessions and a few user- and project-interaction sessions are required to be persistent. Therefore, a simple approach to fault tolerance (FT) is possible:

- Reactive: we have not identified any need for proactive FT.

- Registry based: persistent sessions are registered and must periodically inform the registry of their "alive" state.

- Redundant registry and metadata DB for data persistence.

- Full journaling of critical transactions for data and metadata integrity. This guarantees that the state of any persistent session can be restored in case of failure.
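A sketch of the registry-side bookkeeping this implies; the class, names, and timeout value are illustrative assumptions, not the actual VLab registry:

    # Sketch: registry-based liveness tracking. Persistent sessions
    # heartbeat periodically; sessions silent for longer than a timeout
    # are considered failed and eligible for restore from the journal.
    import time

    HEARTBEAT_TIMEOUT_S = 60.0     # assumed value

    class Registry:
        def __init__(self):
            self.last_seen = {}    # session_id -> time of last heartbeat

        def heartbeat(self, session_id):
            self.last_seen[session_id] = time.time()

        def dead_sessions(self):
            now = time.time()
            return [sid for sid, t in self.last_seen.items()
                    if now - t > HEARTBEAT_TIMEOUT_S]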

VLab Requirements

•Workflow management => facilitator/enabler

•Support for distributed computations

•Ease of use

•Support for collaboration

•Flexibility (update/add tools, new features)

•Fault tolerance

•Diversity of tools: analysis, visualization, data reduction, storage, etc.

Compute Performance x Throughput

Leveraging Concurrent Computing for features and performance

High Performance Parallel Computing

High Throughput Distributed Processing

The red line in the accompanying plot shows the predicted optimal performance for up to 16 independent 4-way parallel tasks running concurrently (HTC job).
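To see why the HTC mode wins for decoupled tasks, here is a rough Amdahl-style comparison; the 0.95 parallel fraction is an assumed value for illustration, not measured VLab data:

    # Sketch: aggregate throughput of many small parallel jobs vs. one
    # wide parallel job, under a simple Amdahl model (assumed 0.95
    # parallel fraction).
    def amdahl_speedup(parallel_fraction, n_procs):
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_procs)

    PROCS, N_TASKS = 64, 16
    one_wide = amdahl_speedup(0.95, PROCS)                        # 1 x 64-way
    many_4way = N_TASKS * amdahl_speedup(0.95, PROCS // N_TASKS)  # 16 x 4-way
    print(f"one 64-way job:           {one_wide:.1f}x")   # ~15x
    print(f"16 concurrent 4-way jobs: {many_4way:.1f}x")  # ~56x aggregate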

Basic Problem

Demand for extensive parameter sampling

Typical high-(P,T) study (e.g. thermal properties): {Pn} x {qi} => ~10^2 jobs

Large high-(P,T) study (Cij(P,T)): {Pn} x {εi} x {qj} => ~10^3-10^4 jobs

Future studies: extension to alloys (sampling over configurations):

{{xm}l} x {Pn} x {εi} x {qj} => ~10^5 jobs

• 10^2-10^5 jobs to prepare, submit, monitor, and analyze
• Manual work is prone to human error => unmanageable!

• First principles => sheer number of operations: 10^15-10^20 today, well over 10^22 in 3-5 years
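The job counts follow directly from the sizes of the Cartesian sampling grids; a sketch with assumed, order-of-magnitude set sizes:

    # Sketch: job counts as Cartesian products of the sampling sets.
    # The per-set sizes are assumed order-of-magnitude values.
    from itertools import product

    P = range(10)      # ~10 pressures
    e = range(10)      # ~10 strains (for Cij)
    q = range(10)      # ~10 q-points
    x = range(100)     # ~10^2 alloy configurations

    print(sum(1 for _ in product(P, q)))          # 100     (~10^2)
    print(sum(1 for _ in product(P, e, q)))       # 1000    (~10^3)
    print(sum(1 for _ in product(x, P, e, q)))    # 100000  (~10^5)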

Basic Problem (cont. …)

Fundamental Requirements

• Enable users to manage these extensive workflows

- Automatic workflow management
- Ease of use, collaborative support, diversity of tools, flexibility

• Aggregate throughput to cope with huge workloads

- Distributed computations, fault tolerance, optimal scheduling

The Big Challenge of Performance

MPP systems are not very cost-effective for this class of problems:

• FFT and matrix transposition: limited scalability, or
• Low performance per processor

Examples of Operational Procedure 1 - Cij Workflow Input Preparation

Examples of Operational Procedure 2 - Consolidated view of the distributed workflow