Albert-Einstein-Institut www.aei-potsdam.mpg.de
Exploring Distributed Computing Techniques with Cactus and Globus
• Solving Einstein’s Equations, Black Holes, and Gravitational Wave Astronomy
• Cactus, a new community simulation code framework: Grid enabling capabilities
• Previous Metacomputing experiments
• What we learned from those
• Current work, improvements
– The present state
• Future development, goals
Thomas Dramlitsch, Albert-Einstein-Institut, MPI-Gravitationsphysik (and AEI-ANL-NCSA-LBL team)
What is Cactus? A new concept in community-developed simulation code infrastructure
• Numerical/computational infrastructure to solve PDEs
• Freely available, open community source code: spirit of GNU/Linux
• Developed as a response to the needs of these projects
• It is production software
• Cactus is divided into “Flesh” (core) and “Thorns” (modules or collections of subroutines)
– User choice between Fortran, C, C++; automated interface between them
– Parallelism largely automatic and hidden (if desired) from user
– Checkpointing / restart capabilities
• Many parallel utilities / features enabled by Cactus
– Parallel I/O: FlexIO, HDF5; data streaming, remote visualization/steering
– Elliptic solvers: PETSc
– And of course metacomputing
• A vision: any application can plug into Cactus to be Grid-enabled
• Demo tomorrow night at HPDC
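The Flesh/Thorn split can be pictured with a small sketch. This is only an illustration of the plug-in idea in Python; the class `Flesh`, its methods, and the two toy thorns are hypothetical names chosen here and are not the Cactus API.

```python
from typing import Callable, Dict, List


class Flesh:
    """Hypothetical stand-in for the core: it only registers and calls thorns."""

    def __init__(self) -> None:
        self._thorns: Dict[str, Callable[[dict], None]] = {}

    def register(self, name: str, routine: Callable[[dict], None]) -> None:
        # Each thorn is a named routine that reads and updates shared state.
        self._thorns[name] = routine

    def run(self, active: List[str], state: dict) -> dict:
        # Only the thorns the user activated are called, in the order given.
        for name in active:
            self._thorns[name](state)
        return state


def evolve(state: dict) -> None:
    # Toy "application" thorn: advance the simulation by one step.
    state["step"] = state.get("step", 0) + 1


def checkpoint(state: dict) -> None:
    # Toy "utility" thorn: remember the step at which state was saved.
    state["saved_step"] = state["step"]


flesh = Flesh()
flesh.register("evolve", evolve)
flesh.register("checkpoint", checkpoint)
result = flesh.run(["evolve", "evolve", "checkpoint"], {})
print(result)
```

The point of the design is that the application thorn never calls the utility thorn directly; the user decides which modules are active.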
Modularity of Cactus...
[Diagram. Labels: Application 1a, Application 1b, Application 2 ..., Sub-app, Cactus Flesh, AMR (GrACE, etc.), MPI layer 1, I/O layer 2, Remote Steer 3, Globus Metacomputing Services. Caption: User selects desired functionality...]
Metacomputing: harnessing power when and where it is needed
• Einstein equations typical of apps that require extreme memory, speed
– Many Flops per grid zone (~10^3 - 10^4)
– Finite differences on regular grids
– Communications of variables through derivatives: ghost zones
• Largest supercomputers too small!
• Networks very fast!
– OC-12 and higher very common in US
– G-WiN: 622 Mbit/s Potsdam-Berlin-Garching, connecting multiple supercomputers
– Gigabit networking to US possible
• “Seamless computing and visualization from anywhere”
• Many metacomputing experiments in progress
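The role of ghost zones in a finite-difference code can be shown with a minimal sketch. It assumes a 1D grid, a second-derivative stencil, and two "processors" simulated as plain Python lists; none of these details come from the slides, and no real message passing takes place.

```python
def second_derivative(u, h):
    """Centered finite difference; u carries one ghost cell at each end."""
    return [(u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h)
            for i in range(1, len(u) - 1)]


h = 0.1
# Global grid of u(x) = x^2; the first and last points act as boundary data.
global_u = [(i * h) ** 2 for i in range(10)]

# Split the 8 interior points between two "processors", 4 points each.
left_owned = global_u[1:5]
right_owned = global_u[5:9]

# Ghost-zone exchange: each side receives its neighbour's edge value.
left_local = [global_u[0]] + left_owned + [right_owned[0]]
right_local = [left_owned[-1]] + right_owned + [global_u[9]]

# Each side differentiates only the points it owns.
split = second_derivative(left_local, h) + second_derivative(right_local, h)
serial = second_derivative(global_u, h)
print(split == serial)
```

Because the stencil reaches one point to each side, one ghost cell per boundary is enough for the split computation to reproduce the single-domain result. Wider stencils need wider ghost zones, which is where the communication volume comes from.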
High performance: Full 3D Einstein Equations solved on NCSA NT Supercluster, Origin 2000, T3E
[Figure: Cactus scaling on T3E-600. Total Mflop/s versus number of processors, both on logarithmic axes (processors 1 to 1000, Mflop/s 100 to 100000); data labels 192, 760, 5980, 47900.]
• Excellent scaling on many architectures
– Origin up to 256 processors
– T3E up to 1024
– NCSA NT cluster up to 128 processors
• Achieved 142 Gflop/s on 1024-node T3E-1200 (benchmarked for NASA NS Grand Challenge)
• But, of course, we want much more… metacomputing, meaning connected computers...
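As a quick check on the headline figure, the sketch below works out the average per-node rate from the quoted 142 Gflop/s on 1024 nodes. The helper names are chosen here for illustration, and the numbers passed to `parallel_efficiency` are placeholders rather than measurements.

```python
def per_node_rate(total_gflops: float, nodes: int) -> float:
    """Average sustained rate per node, in Mflop/s."""
    return total_gflops * 1000.0 / nodes


def parallel_efficiency(rate_small: float, n_small: int,
                        rate_large: float, n_large: int) -> float:
    """Per-processor rate at the larger count relative to the smaller count."""
    return (rate_large / n_large) / (rate_small / n_small)


# Figure quoted on the slide: 142 Gflop/s on a 1024-node T3E-1200.
rate = per_node_rate(142.0, 1024)
print(f"{rate:.1f} Mflop/s per node")

# Placeholder numbers (not measurements): doubling both processors and
# total rate corresponds to an efficiency of exactly 1.
ideal = parallel_efficiency(100.0, 1, 200.0, 2)
```

The quoted figure comes to roughly 139 Mflop/s per node.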
Metacomputing the Einstein Equations: Connecting T3Es in Berlin, Garching, San Diego
Want to migrate this technology to the generic user...
Scaling of Cactus on two T3Es on different continents
[Figure panels: San Diego & Berlin; Berlin & Munich]
Scaling of Cactus on Multiple SGIs at Remote Sites
[Figure panel: Argonne & NCSA]
Analysis of previous metacomputing experiments
• It worked! (That’s the main thing we wanted at SC98…)
• Cactus was not optimized for metacomputing: messages too small, latency, etc.
• MPICH-G could perform better, e.g. intra-machine communication one order of magnitude slower than native MPI
– MPICH-G2 improves this...
• Communication is non-trivial (not “embarrassingly parallel”) and very intensive
• Experiments showed:
– For some problems, this is feasible
– We need to improve performance significantly, with work on optimization of Cactus and MPICH-G
– That’s what we did!
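Why small messages hurt over a wide-area link can be seen from a simple latency-plus-bandwidth cost model. The sketch below uses that textbook model; the latency, throughput, and payload values are assumptions picked for illustration and are not measurements from these experiments.

```python
def transfer_time(total_bytes: float, n_messages: int,
                  latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Every message pays the latency once, plus the time on the wire."""
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s


# Illustrative values only (assumed, not measured).
latency = 0.05           # 50 ms wide-area latency
bandwidth = 10e6         # 10 MB/s effective throughput
payload = 8 * 1_000_000  # one million doubles of boundary data

many_small = transfer_time(payload, 1000, latency, bandwidth)
one_large = transfer_time(payload, 1, latency, bandwidth)
print(many_small, one_large)
```

Under these assumed numbers the same payload takes about 50.8 s as a thousand messages and about 0.85 s as one, so the latency term dominates until messages are aggregated.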
Optimizing Cactus Communication Layers for Metacomputing
• Made the communication layer(s) much more flexible:
– Can specify size and number of messages, in order to achieve best performance with the underlying network (bandwidth, latency)
– Reduced communication to a bare minimum
– Overlapping of communication with other CPUs
– Overlapping of communication and computation
• Made the load balancing of Cactus more flexible (Matei Ripeanu):
– Cactus now allows the total problem to be decomposed into pieces of different size, according to CPU power, number of CPUs used on one machine, etc.
• Cactus compiles (out of the box) with Globus and MPICH on most common architectures (T3E, IRIX, SP-2, …?)
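The idea of decomposing the problem into pieces of different size can be sketched as a proportional split. This is a minimal illustration using largest-remainder rounding; the function name `decompose` and the weights are assumptions made here, and the slides do not say how Cactus computes its decomposition.

```python
from typing import List


def decompose(total_points: int, weights: List[float]) -> List[int]:
    """Split total_points proportionally to weights; sizes sum to total_points."""
    total_weight = sum(weights)
    shares = [total_points * w / total_weight for w in weights]
    sizes = [int(s) for s in shares]  # round every share down first
    # Hand out the points lost to rounding, largest fractional part first.
    leftover = total_points - sum(sizes)
    order = sorted(range(len(weights)),
                   key=lambda i: shares[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


# Hypothetical weights: the third machine is assumed twice as powerful.
sizes = decompose(1000, [1.0, 1.0, 2.0])
print(sizes)
```

A weight could stand for CPU speed multiplied by the number of CPUs on a machine, so that a faster or larger machine receives a correspondingly larger piece of the grid.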
Optimizing MPICH-G: used MPICH-G2
• MPICH-G2 is a completely rewritten communication layer
• Can distinguish between inter- and intra-machine communication
– It uses the vendor-supplied MPI for intra-machine communication
– Uses TCP/IP between machines
• This means optimal performance in a metacomputing environment
• Works with Cactus and Globus on all major Unix systems
[Diagram labels: TCP/IP, MPI_COMM_WORLD]
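The inter- versus intra-machine distinction can be sketched as a lookup on where each rank runs. This illustrates the routing rule stated on the slide, not MPICH-G2's implementation; the function name, the rank layout, and the machine names are hypothetical.

```python
from typing import Dict


def choose_transport(rank_to_machine: Dict[int, str], src: int, dst: int) -> str:
    """Vendor MPI inside one machine, TCP/IP between different machines."""
    if rank_to_machine[src] == rank_to_machine[dst]:
        return "vendor MPI"
    return "TCP/IP"


# Hypothetical layout: ranks 0-1 on one machine, ranks 2-3 on another.
layout = {0: "machine-a", 1: "machine-a", 2: "machine-b", 3: "machine-b"}

print(choose_transport(layout, 0, 1))
print(choose_transport(layout, 1, 2))
```

All four ranks still belong to a single MPI_COMM_WORLD from the application's point of view; only the channel underneath differs per pair of ranks.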
Current experiments and future plans
• Current experiment
– Complete testing and production of tightly coupled simulation between different sites in the USA (NCSA, NERSC, ANL, SDSC and others)
– Want to use advanced software (portal, co-scheduling systems, etc.)
– Want to run across as many sites and nodes as possible
• More general Grid computing problems
– Distribution of multiple grids
– Dynamic resource acquisition
  • Acquiring more memory when needed (AMR)
  • Spawning off connected jobs on remote machines
  • Cactus thorn would have access to MDS
  • …
A Portal to Computational Science: The Cactus Collaboratory
Cactus Computational Toolkit: Science, Autopilot, AMR, PETSc, HDF, MPI, GrACE, Globus, Remote Steering...
1. User has science idea...
2. Composes/builds code components w/ interface...
3. Selects appropriate resources...
4. Steers simulation, monitors performance...
5. Collaborators log in to monitor...
Want to integrate and migrate this technology to the generic user...
German Gigabit Project supported by DFN-Verein
• Developing Techniques to Exploit High Speed Networks
• Focus on Remote Steering and Visualization
• OC-12 Testbed between AEI, ZIB, RZG with built-in application groups ready to use it!
• Already closely connected to ANL, NCSA, KDI projects
Metacomputing Experiments, Production
• SC93: remote CM-5 simulation with live viz in CAVE
• SC95: Heroic I-Way experiments lead to development of Globus. Cornell SP-2, Power Challenge, with live viz in San Diego CAVE
• SC97: Garching 512 node T3E, launched, controlled, visualized in San Jose
• SC98: HPC Challenge. SDSC, ZIB, and Garching T3E compute collision of 2 Neutron Stars, controlled from Orlando
• SC99: Colliding Black Holes using Garching, ZIB T3Es, with remote collaborative interaction and viz at ANL and NCSA booths
• April 2000: Attempting to use LANL, NCSA, NERSC, SDSC, ZIB, Garching, NASA-Ames, Maui?, +…? for single simulation!
• All this technology is available in the main production code for different applications!