Exploring Distributed Computing Techniques with Cactus and Globus

Page 1: Exploring Distributed Computing Techniques with Cactus and Globus

Albert-Einstein-Institut www.aei-potsdam.mpg.de

Exploring Distributed Computing Techniques with Cactus and Globus

• Solving Einstein’s Equations, Black Holes, and Gravitational Wave Astronomy

• Cactus, a new community simulation code framework: Grid enabling capabilities

• Previous Metacomputing experiments

• What we learned from those

• Current work, improvements

– The present state

• Future development, goals

Thomas Dramlitsch, Albert-Einstein-Institut, MPI-Gravitationsphysik (and AEI-ANL-NCSA-LBL team)

Page 2: Exploring Distributed Computing Techniques with Cactus and Globus

What is Cactus? A new concept in community-developed simulation code infrastructure

• Numerical/computational infrastructure to solve PDEs

• Freely available, open community source code: spirit of GNU/Linux

• Developed as a response to the needs of these projects

• It’s production software

• Cactus is divided into “Flesh” (core) and “Thorns” (modules or collections of subroutines)

– User choice between Fortran, C, C++; automated interface between them

– Parallelism largely automatic and hidden (if desired) from user

– Checkpointing / restart capabilities

• Many parallel utilities / features enabled by Cactus

– Parallel IO: FlexIO, HDF5; data streaming, remote visualization/steering

– Elliptic solvers: PETSc

– And of course metacomputing

• A vision: any application can plug into Cactus to be Grid enabled

• Demo tomorrow night at HPDC
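
The Flesh/Thorn split described on this slide can be pictured with a small sketch. Everything below is hypothetical and is not the Cactus API: the class `Flesh`, the schedule bin names, and the three toy thorns are invented only to show a core that knows nothing about the physics and simply runs whatever routines the modules have registered.

```python
# Hypothetical sketch of a "flesh" (core) that schedules routines
# registered by "thorns" (modules). Not the real Cactus API.

class Flesh:
    def __init__(self):
        # Maps a schedule bin name to a list of (thorn name, routine) pairs.
        self.schedule = {}

    def register(self, thorn, bin_name, routine):
        """A thorn asks the flesh to call `routine` in the given bin."""
        self.schedule.setdefault(bin_name, []).append((thorn, routine))

    def run(self, bins, state):
        """Call every registered routine, bin by bin, on the shared state."""
        log = []
        for bin_name in bins:
            for thorn, routine in self.schedule.get(bin_name, []):
                routine(state)
                log.append(f"{bin_name}:{thorn}")
        return log


# Three toy thorns: initial data, evolution, and output.
def init_wave(state):
    state["u"] = [0.0, 1.0, 0.0]

def evolve_wave(state):
    state["u"] = [0.5 * x for x in state["u"]]

def write_output(state):
    state.setdefault("output", []).append(list(state["u"]))


def demo():
    flesh = Flesh()
    flesh.register("WaveInit", "INITIAL", init_wave)
    flesh.register("WaveEvolve", "EVOLVE", evolve_wave)
    flesh.register("SimpleIO", "ANALYSIS", write_output)
    state = {}
    log = flesh.run(["INITIAL", "EVOLVE", "ANALYSIS"], state)
    return log, state
```

Swapping the output thorn for a different one would not touch the evolution code, which is the kind of modularity the slide is pointing at.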

Page 3: Exploring Distributed Computing Techniques with Cactus and Globus

Modularity of Cactus...

[Diagram: Application 1a, Application 1b, Application 2 ... and a Sub-app plug into the Cactus Flesh. The user selects desired functionality from: AMR (GrACE, etc.), MPI layer, I/O layer, Remote Steer, Globus Metacomputing Services.]

Page 4: Exploring Distributed Computing Techniques with Cactus and Globus

Metacomputing: harnessing power when and where it is needed

• Einstein equations typical of apps that require extreme memory, speed

– Many Flops per grid zone (~10^3 - 10^4)

– Finite differences on regular grids

– Communications of variables through derivatives: ghost zones

• Largest supercomputers too small!

• Networks very fast!

– OC-12 and higher very common in US

– G-Win: 622 Mbit/s Potsdam-Berlin-Garching, connecting multiple supercomputers

– Gigabit networking to US possible

• “Seamless computing and visualization from anywhere”

• Many metacomputing experiments in progress
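
The ghost-zone idea mentioned above can be sketched in a few lines. This is only an illustration, assuming a 1D grid, a 3-point second-derivative stencil, and two subdomains; the function names are made up here, and plain list slicing stands in for the message exchange that MPI would perform between machines.

```python
def laplacian_1d(u, dx):
    """Second derivative at interior points using the 3-point stencil."""
    return [(u[i - 1] - 2.0 * u[i] + u[i + 1]) / dx**2
            for i in range(1, len(u) - 1)]

def split_with_ghosts(u, cut):
    """Split u at index `cut`; each piece gets one ghost zone from its neighbour."""
    left = u[:cut] + [u[cut]]        # last entry: ghost copy of the right piece's first point
    right = [u[cut - 1]] + u[cut:]   # first entry: ghost copy of the left piece's last point
    return left, right

def demo():
    dx = 0.1
    u = [(i * dx) ** 2 for i in range(10)]   # u = x^2, exact second derivative is 2
    whole = laplacian_1d(u, dx)              # computed on one "machine"
    left, right = split_with_ghosts(u, 5)    # computed on two "machines"
    pieces = laplacian_1d(left, dx) + laplacian_1d(right, dx)
    return whole, pieces
```

Because each piece holds a copy of its neighbour's boundary value, the two halves reproduce the single-domain result. In a real run that copy has to cross the network every time step, which is why latency and bandwidth matter so much.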

Page 5: Exploring Distributed Computing Techniques with Cactus and Globus

High performance: Full 3D Einstein Equations solved on NCSA NT Supercluster, Origin 2000, T3E

[Figure: Cactus scaling on T3E-600. Total Mflops/sec versus number of processors, logarithmic axes; data point labels 192, 760, 5980, 47900.]

• Excellent scaling on many architectures

– Origin up to 256 processors

– T3E up to 1024

– NCSA NT cluster up to 128 processors

• Achieved 142 Gflop/s on a 1024-node T3E-1200 (benchmarked for NASA NS Grand Challenge)

• But, of course, we want much more… metacomputing, meaning connected computers...

Page 6: Exploring Distributed Computing Techniques with Cactus and Globus

Metacomputing the Einstein Equations: Connecting T3Es in Berlin, Garching, San Diego

Want to migrate this technology to the generic user...

Page 7: Exploring Distributed Computing Techniques with Cactus and Globus

Scaling of Cactus on two T3Es on different continents

[Figure panels: San Diego & Berlin; Berlin & Munich]

Page 8: Exploring Distributed Computing Techniques with Cactus and Globus

Scaling of Cactus on Multiple SGIs at Remote Sites

[Figure panel: Argonne & NCSA]

Page 9: Exploring Distributed Computing Techniques with Cactus and Globus

Analysis of previous metacomputing experiments

• It worked! (That’s the main thing we wanted at SC98…)

• Cactus was not optimized for metacomputing: messages too small, latency, etc.

• Mpich-G could perform better, e.g. intra-machine communication is one order of magnitude slower than native MPI

– Mpich-G2 improves this...

• Communication is non-trivial (not “embarrassingly parallel”) and very intensive

• Experiments showed:

– For some problems, this is feasible

– We need to improve performance significantly with work on optimization of Cactus and Mpich-G

– That’s what we did!
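
Why "messages too small" hurts on a wide-area link can be seen with a simple cost model: each message pays a fixed latency, and the payload pays for bandwidth. The sketch below is an assumption-laden illustration, not a measurement. The 50 ms latency, the 8 MiB of data per step, and the message counts are invented; only the 622 Mbit/s figure echoes the link speed quoted earlier in the talk.

```python
def transfer_time(total_bytes, n_messages, latency_s, bandwidth_bytes_per_s):
    """Simple model: every message pays the latency once, the payload pays for bandwidth."""
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s

def demo():
    total = 8 * 1024 * 1024   # bytes exchanged per step (illustrative)
    latency = 0.05            # seconds per message (assumed wide-area latency)
    bandwidth = 622e6 / 8     # 622 Mbit/s expressed in bytes per second
    many_small = transfer_time(total, 1000, latency, bandwidth)
    few_large = transfer_time(total, 10, latency, bandwidth)
    return many_small, few_large
```

Under this model the bandwidth term is identical in both cases, so the whole difference comes from the number of messages. That is the motivation for packing data into fewer, larger messages.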

Page 10: Exploring Distributed Computing Techniques with Cactus and Globus

Optimizing Cactus Communication Layers for Metacomputing

• Made the communication layer(s) much more flexible:

– Can specify size and number of messages, in order to achieve best performance with the underlying network (bandwidth, latency)

– Reduced communication to a bare minimum

– Overlapping of communication with other CPUs

– Overlapping of communication and computation

• Made the load balancing of Cactus more flexible (Matei Ripeanu):

– Cactus can now decompose the total problem into pieces of different sizes, according to CPU power, number of CPUs used on one machine, etc.

• Cactus compiles (out of the box) with Globus and MPICH on most common architectures (T3E, IRIX, SP-2,…?)
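
The flexible load balancing described above amounts to giving each machine a share of the grid proportional to its capability. A minimal sketch follows, assuming integer weights (for example CPU count times a relative speed factor) and a 1D count of grid points. The function name and the largest-remainder rounding rule are choices made for this illustration, not the scheme Cactus uses.

```python
def decompose(n_points, weights):
    """Split n_points into pieces proportional to integer weights.

    Points lost to rounding down are handed out one at a time,
    largest remainder first, so the sizes always sum to n_points.
    """
    total = sum(weights)
    sizes = [n_points * w // total for w in weights]
    remainders = [(n_points * w) % total for w in weights]
    leftover = n_points - sum(sizes)
    by_remainder = sorted(range(len(weights)),
                          key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes
```

For example, `decompose(100, [1, 1, 2])` gives the third machine half of the points, reflecting its double weight.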

Page 11: Exploring Distributed Computing Techniques with Cactus and Globus

Optimizing Mpich-G: Used Mpich-G2

• MPICH-G2 is a completely rewritten communication layer

• Can distinguish between inter- and intra-machine communication

– Uses the vendor-supplied MPI for intra-machine communication

– Uses TCP/IP between machines

• This means optimal performance in a metacomputing environment

• Works with Cactus and Globus on all major unix-systems
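
The inter- versus intra-machine distinction can be illustrated with a toy routing rule. This is not how MPICH-G2 is implemented; the rank-to-machine table, the machine names, and the transport labels are all hypothetical and only show the decision being described.

```python
def choose_transport(rank_a, rank_b, machine_of):
    """Pick the fast local path when both ranks live on the same machine."""
    if machine_of[rank_a] == machine_of[rank_b]:
        return "vendor-mpi"
    return "tcp"

def demo():
    # Hypothetical layout: two ranks on each of two machines.
    machine_of = {0: "machine-A", 1: "machine-A", 2: "machine-B", 3: "machine-B"}
    return {(a, b): choose_transport(a, b, machine_of)
            for a in machine_of for b in machine_of if a < b}
```

Only the pairs that span both machines fall back to TCP, so the slower wide-area path is used just where it is unavoidable.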

[Diagram labels: TCP/IP, MPI_COMM_WORLD]

Page 12: Exploring Distributed Computing Techniques with Cactus and Globus

Current experiments and future plans

• Current experiment

– Complete testing and production of tightly coupled simulation between different sites in the USA (NCSA, NERSC, ANL, SDSC and others)

– Want to use advanced software (Portal, co-scheduling systems, etc.)

– Want to run across as many sites and nodes as possible

• More general Grid computing problems

– Distribution of multiple grids

– Dynamic resource acquisition

• Acquiring more memory when needed (AMR)

• Spawning off connected jobs on remote machines

• Cactus thorn would have access to MDS

• …

Page 13: Exploring Distributed Computing Techniques with Cactus and Globus

Cactus Computational Toolkit

Science, Autopilot, AMR, PETSc, HDF, MPI, GrACE, Globus, Remote Steering...

A Portal to Computational Science: The Cactus Collaboratory

1. User has science idea...

2. Composes/builds code components w/interface...

3. Selects appropriate resources...

4. Steers simulation, monitors performance...

5. Collaborators log in to monitor...

Want to integrate and migrate this technology to the generic user...

Page 14: Exploring Distributed Computing Techniques with Cactus and Globus

German Gigabit Project supported by DFN-Verein

• Developing Techniques to Exploit High Speed Networks

• Focus on Remote Steering and Visualization

• OC-12 Testbed between AEI, ZIB, RZG with built-in application groups ready to use it!

• Already closely connected to ANL, NCSA, KDI projects

Page 15: Exploring Distributed Computing Techniques with Cactus and Globus

Metacomputing Experiments, Production

• SC93: remote CM-5 simulation with live viz in CAVE

• SC95: Heroic I-Way experiments lead to development of Globus. Cornell SP-2, Power Challenge, with live viz in San Diego CAVE

• SC97: Garching 512 node T3E, launched, controlled, visualized in San Jose

• SC98: HPC Challenge. SDSC, ZIB, and Garching T3E compute collision of 2 Neutron Stars, controlled from Orlando

• SC99: Colliding Black Holes using Garching, ZIB T3Es, with remote collaborative interaction and viz at ANL and NCSA booths

• April 2000: Attempting to use LANL, NCSA, NERSC, SDSC, ZIB, Garching, NASA-Ames, Maui?, +…? for single simulation!

• All this technology is available in the main production code for different applications!