“Beowulfery” – Cluster Computing Using GoldSim to Solve Embarrassingly Parallel Problems
OFFICIAL USE ONLY
Presented to: GoldSim Users Conference 2007
October 25, 2007, San Francisco, CA
Presented by: Patrick D. Mattie, M.S., P.G., Senior Member of Technical Staff, Sandia National Laboratories
Contributions by: Stefan Knopf, GTG and Randy Dockter, SNL-YMP
Presentation Outline
• Cluster Computing Defined
– GoldSim and Beowulf?
• ‘COTS’ Cluster Computing using GoldSim
– GoldSim and E.T.?
• Example Cluster – TSPA-Wulf
• What is next? Pushing the limits….
What is Cluster Computing?
What is a Beowulf Cluster?
Background
Cluster Computing Defined
• What is a compute cluster?
– "Cluster" is a widely used term for independent computers combined into a unified system through software and networking. At the most fundamental level, when two or more computers are used together to solve a problem, they are considered a cluster.
– Clusters are typically used for High Availability (HA) for greater reliability or High Performance Computing (HPC) to provide greater computational power than a single computer can provide.
Beowulf Class Cluster
• A Beowulf-class cluster is a simple design for high-performance computing clusters built on inexpensive personal computer hardware.
– Originally developed in 1994 by Thomas Sterling and Donald Becker at NASA
• Beowulf clusters
– are scalable performance clusters
– are based on commodity hardware
– require no custom hardware or software
• A Beowulf cluster is constructed from commodity computer hardware (Dell, HP, IBM, etc.) and can be as simple as two networked computers sharing a file system on the same LAN or as complex as thousands of nodes with high-speed, low-latency interconnects (networking).
• Common uses are traditional technical applications such as simulations, biotechnology, and petroleum, as well as financial market modeling, data mining, and stream processing.
• http://www.beowulf.org
Advantages of a Beowulf Class Cluster
• Less computation time than running a serial process
• COTS (‘Commodity Off the Shelf’)
– Doesn’t require a big budget
– Doesn’t require a specialized skill set
• Can be built using existing computer resources and Local Area Networks (LAN)
• Can be constructed over different system configurations/brands/resources
• Useful for solving embarrassingly parallel problems
Why do I need a cluster?
• An embarrassingly parallel problem is one for which no particular effort is needed to segment the problem into a very large number of parallel tasks, and there is no essential dependency (or communication) between those parallel tasks.
– A Monte Carlo simulation is an embarrassingly parallel problem.
• For example, a 100-realization simulation can be broken into 100 separate problems, each solved independently of the others.
• http://en.wikipedia.org/wiki/Embarrassingly_parallel
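The independence described above can be sketched in Python. This is a minimal illustration, not GoldSim code: `run_realization` is a hypothetical stand-in for one model realization, and each realization seeds its own random number generator from its index, so no realization needs anything from any other one.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def run_realization(index: int) -> float:
    """Hypothetical stand-in for one Monte Carlo realization.

    The RNG is seeded from the realization index, so the result depends only
    on that index: no shared state and no communication with other tasks.
    """
    rng = random.Random(index)
    # Toy "model": the mean of 1,000 uniform draws.
    return sum(rng.random() for _ in range(1000)) / 1000


def run_simulation(n_realizations: int, workers: int) -> list[float]:
    """Farm independent realizations out to a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() returns results in realization order, regardless of which
        # worker finished first.
        return list(pool.map(run_realization, range(n_realizations)))


if __name__ == "__main__":
    results = run_simulation(100, workers=4)
    print(len(results), min(results), max(results))
```

Because the tasks are independent and individually seeded, the same results come back whether `workers` is 1 or 100; only the wall-clock time changes.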
Why do I need a cluster?
• A 100-realization run at 1 minute per realization takes:
– One computer (or core): 100 minutes (~1.7 hours)
– Four computers (or cores): 25 minutes
– Ten computers (or cores): 10 minutes
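The arithmetic on this slide is an idealization: it assumes every realization takes the same time and that transfers and start-up cost nothing. Under those assumptions, a minimal sketch of the calculation (the function name is illustrative):

```python
import math


def wall_clock_minutes(realizations: int, minutes_each: float, cpus: int) -> float:
    """Idealized wall-clock time for equal-length, independent realizations.

    Assumes each CPU runs one realization at a time, starts the next one as
    soon as it finishes, and that transfers and start-up cost nothing. The run
    then needs ceil(realizations / cpus) back-to-back rounds.
    """
    rounds = math.ceil(realizations / cpus)
    return rounds * minutes_each


for cpus in (1, 4, 10):
    print(f"{cpus:>2} CPU(s): {wall_clock_minutes(100, 1.0, cpus):.0f} minutes")
```

The ceiling matters when the realizations do not divide evenly: with 3 CPUs the same run needs 34 rounds (34 minutes), not 33.3, because the last round is only partly full.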
Cluster Computing Using GoldSimPro
• GoldSim Distributed Processing Module
– The Distributed Processing Module uses multiple copies of GoldSim running on multiple machines (and/or multiple processes within a single machine that has a multi-core CPU)
– Grid computing (diagram: one Master, multiple Slaves)
Cluster Computing - Distributed Processing
• "Distributed" or "grid" computing, in general, is a special type of parallel computing that relies on complete computers (with onboard CPU, storage, power supply, network interface, etc.) connected to a network (private, public, or the Internet) by a conventional network interface, such as Ethernet.
• Examples include:
– SETI@home Project: http://setiathome.ssl.berkeley.edu/ (analyzing radio telescope data in search of extraterrestrial intelligence)
Cluster Computing Using GoldSimPro
There are two versions of the Distributed Processing Module:
– GoldSim DP (comes with all versions of GoldSim)
– GoldSim DP Plus (licensed separately)
“Beowulfery” - YMP & GoldSim: A Cluster Computing Example
TSPA-Wulf – Cluster Configuration
• Windows Server 2003 and Windows 2000 Advanced Server (3GB)
• Network simulations (master-slave)
– About 220 Intel Xeon 3.6 GHz dual-processor nodes with 8 GB RAM per machine, on a GigE LAN
– 60 Intel Xeon 3.0 GHz dual-processor dual-core nodes with 16 GB RAM per machine, on a GigE LAN
– One realization per slave CPU: after a slave CPU finishes one realization, it accepts another from the master server
– 680 processors available (plus 62 legacy processors)
– 752 total
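The "one realization per slave CPU" policy is a form of dynamic (self-scheduling) load balancing: a slave that finishes early simply receives more work. The sketch below simulates that policy with a priority queue of slave free-times. It is not GoldSim's scheduler; the function name and the assumption that realization run times are known in advance exist only to make the simulation possible.

```python
import heapq


def dispatch(durations: list[float], n_slaves: int) -> tuple[float, list[int]]:
    """Simulate a master handing out one realization at a time.

    durations[i] is the (assumed known) run time of realization i. Each slave
    CPU holds one realization; when it finishes, the master gives it the next
    unassigned realization. Returns (makespan, slave index per realization).
    """
    # Heap of (time the slave becomes free, slave index); ties go to the
    # lower slave index.
    free_at = [(0.0, s) for s in range(n_slaves)]
    heapq.heapify(free_at)
    assigned = []
    for d in durations:
        t, slave = heapq.heappop(free_at)  # earliest-available slave
        assigned.append(slave)
        heapq.heappush(free_at, (t + d, slave))
    makespan = max(t for t, _ in free_at)
    return makespan, assigned


if __name__ == "__main__":
    # One long realization and three short ones on two slave CPUs.
    print(dispatch([3.0, 1.0, 1.0, 1.0], n_slaves=2))
```

Here slave 1 runs all three short realizations while slave 0 is busy with the long one, so the run ends at time 3.0. Handing out the same realizations in a fixed round-robin order (0, 1, 0, 1) would instead end at 4.0, because slave 0 would run both the 3.0 and a 1.0 realization.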
Running the Model -- Overview
• File Server
– Storage area for completed TSPA cases
– Controlled storage area for: Parameter Database, DLLs, input files
– Storage area for TSPA model file
• Master Computer
– Cases are run by GoldSim as a distributed process from a directory on the Master.
• Slave Computers
– Individual realizations are run by GoldSim processes on the Slaves.
Set-Up On the Master Computer
File Server (storage areas):
• Parameter Database
– parameter values
– links to DLLs
– links to input files
• Input files
• DLLs
• TSPA model file
Master Computer (run directory):
• TSPA model file
• Input files
• DLLs
Steps (transfers occur over LAN):
(1) Manually move the model file to the Master computer.
(2) Set up the model file to run a specific case.
(3a) Global download of parameter values to the model file.
(3b) Global download transfers input files and DLLs to the Master computer.
(4) Document changes: conceptual write-up, checklist, version control file.
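Steps (3b) and (4) amount to copying controlled inputs into the run directory and recording exactly what was copied. The sketch below is a hypothetical illustration of that idea outside GoldSim (it does not reproduce GoldSim's global download): it copies files with `shutil.copy2`, computes SHA-256 checksums in chunks so multi-gigabyte inputs are never read into memory at once, and writes a manifest that could be kept under version control. The helper names, directory names, and the `MANIFEST.sha256` file name are assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in 1 MiB chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_files(sources: list[Path], run_dir: Path) -> dict[str, str]:
    """Copy input files/DLLs into run_dir and write a checksum manifest."""
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for src in sources:
        dst = run_dir / src.name
        shutil.copy2(src, dst)  # copies contents plus timestamps
        manifest[src.name] = sha256_of(dst)
    # "checksum  name" lines, sorted by file name.
    lines = [f"{digest}  {name}" for name, digest in sorted(manifest.items())]
    (run_dir / "MANIFEST.sha256").write_text("\n".join(lines) + "\n")
    return manifest


def demo() -> tuple[dict[str, str], str]:
    """Stage two small files from a scratch 'file server' into a 'run' dir."""
    with tempfile.TemporaryDirectory() as tmp:
        server = Path(tmp) / "file_server"
        server.mkdir()
        (server / "input_a.dat").write_bytes(b"abc")
        (server / "model.dll").write_bytes(b"\x00\x01")
        run_dir = Path(tmp) / "master_run"
        manifest = stage_files(sorted(server.iterdir()), run_dir)
        return manifest, (run_dir / "MANIFEST.sha256").read_text()


if __name__ == "__main__":
    print(demo()[1])
```

Recomputing the checksums later and comparing them with the manifest shows whether any staged input changed after the case was documented.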
Running - Transfers to Slaves
Master Computer (directory on master server):
• TSPA model file
• DLLs
• Input files
Slave Computers:
• PA02: Networked1, Networked2
• PA03: Networked1, Networked2
• PA04: Networked1, Networked2
• 144 other slave computers
(1) At the start of the distributed process:
– A “Networked” directory is created for each processor on each Slave computer.
– A GoldSim slave process is started for each processor on each Slave computer.
– The model file, DLLs, and input files are transferred.
(2) Information for each realization (i.e., from LHS sampling) is transferred to slave processes as they become available.
(Transfers occur over LAN.)
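The per-realization information above comes from Latin Hypercube Sampling (LHS). As a reminder of how LHS works, here is a minimal sketch on the unit hypercube, assuming independent uniform parameters. It is not GoldSim's sampler and ignores the correlations and distribution types a real parameter database would define. Row i of the result is the set of values the master would hand to whichever slave process picks up realization i.

```python
import random


def latin_hypercube(n_samples: int, n_params: int, seed: int = 0) -> list[list[float]]:
    """Latin Hypercube Sample of n_samples points in [0, 1)^n_params.

    Each parameter's range is split into n_samples equal strata and every
    stratum is sampled exactly once; an independent random permutation per
    parameter decides which realization gets which stratum.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(n_params):
        # One random point inside each stratum [k/n, (k+1)/n) ...
        column = [(k + rng.random()) / n_samples for k in range(n_samples)]
        # ... then shuffle so strata are paired randomly across parameters.
        rng.shuffle(column)
        columns.append(column)
    # Transpose: row i holds the parameter values for realization i.
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    for i, row in enumerate(latin_hypercube(5, 2, seed=42), start=1):
        print(f"realization {i}: " + ", ".join(f"{u:.3f}" for u in row))
```

To turn a unit sample u into a parameter that is uniform on [a, b], use a + (b - a) * u; other distributions apply their inverse CDF to u in the same way.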
Running - Transfers from Slaves
Slave Computers:
• PA02: Networked1, Networked2
• PA03: Networked1, Networked2
• PA04: Networked1, Networked2
• 144 other slave computers
Master Computer (directory on master computer):
• TSPA model file
• DLLs
• Input files
• .gsr files, one per realization
(1) .gsr files are transferred as each realization is completed.
(2) GoldSim loads the .gsr files into the model file when all realizations are completed.
(Transfers occur over LAN.)
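Step (2) only happens once every realization's result file has arrived. A minimal sketch of that completeness check, assuming one result file per realization in a single directory; the `real_0001.gsr` naming pattern is hypothetical, since the slides do not show how GoldSim names its .gsr files.

```python
import tempfile
from pathlib import Path


def missing_realizations(result_dir: Path, n_realizations: int,
                         pattern: str = "real_{:04d}.gsr") -> list[int]:
    """Realization numbers (1-based) whose result file has not arrived yet.

    `pattern` is a hypothetical naming scheme used only for illustration.
    """
    return [i for i in range(1, n_realizations + 1)
            if not (result_dir / pattern.format(i)).is_file()]


def ready_to_merge(result_dir: Path, n_realizations: int) -> bool:
    """True once every expected per-realization file is present."""
    return not missing_realizations(result_dir, n_realizations)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        results = Path(tmp)
        for i in (1, 2, 4):
            (results / f"real_{i:04d}.gsr").touch()
        print(missing_realizations(results, 4))  # -> [3]
```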
TSPA Model Architecture
• File size and count
– 645 input files (approximately 5 GB in size)
– 14 DLLs
– GoldSim file with no results (pre-run) is about 200 MB in size
– GoldSim file after a run is about 5 to 6 GB in size (compressed); however, there is no intrinsic limit on file size other than the slowness of file manipulation on a 32-bit operating system
TSPA-Wulf Benchmarks
• 1,000 realizations @ 90 minutes per realization
– 62.5 days to run in serial mode
– 120 processors would take ~12.5 hours
– ~99% less wall-clock time
• A typical 1,000,000-year, 1,000-realization run (about 470 time steps) requires 24 hours on 150 CPUs (75 dual-processor, single-core nodes, 32-bit, 2.8-3.0 GHz)
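The benchmark numbers can be checked with simple arithmetic, assuming every realization takes exactly 90 minutes and ignoring transfer and start-up overhead. The ~12.5-hour figure is the ideal (serial time divided by 120); because 1,000 is not a multiple of 120, equal-length realizations would actually need 9 rounds, or about 13.5 hours. The last lines turn the typical run's 24 hours on 150 CPUs into a CPU-hour budget.

```python
import math

REALIZATIONS = 1_000
MINUTES_EACH = 90
CPUS = 120

serial_minutes = REALIZATIONS * MINUTES_EACH        # 90,000 minutes
serial_days = serial_minutes / (60 * 24)            # 62.5 days
ideal_hours = serial_minutes / CPUS / 60            # 12.5 hours (perfect division)
rounds = math.ceil(REALIZATIONS / CPUS)             # 9 rounds (8 full, 1 partial)
rounds_hours = rounds * MINUTES_EACH / 60           # 13.5 hours
saved = 1 - ideal_hours / (serial_minutes / 60)     # fraction of wall time saved

# Typical run from the slide: 24 hours on 150 CPUs for 1,000 realizations.
cpu_hours_typical = 24 * 150                        # 3,600 CPU-hours in total

print(f"serial: {serial_days} days; ideal on {CPUS} CPUs: {ideal_hours} h; "
      f"whole rounds: {rounds_hours} h; saved: {100 * saved:.1f}%")
print(f"typical run: {cpu_hours_typical} CPU-hours, "
      f"<= {cpu_hours_typical / REALIZATIONS} CPU-hours per realization on average")
```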
What comes next?
SNL/GoldSim HPCC R&D
• GoldSim evolution/migration to Microsoft HPC
– Migration from 32-bit to 64-bit architecture?
• Optimize modeling system for Microsoft HPC
– Combined SNL/Microsoft/GoldSim task
• Link GoldSim with the Microsoft CCS scheduler tool to automatically queue jobs and prioritize or re-prioritize job resources ‘on the fly’.
• Microsoft’s developers working with GoldSim
• True parallel processing?
– Using OpenMP to take advantage of multiple cores
• Optimize HPC software for large compute cluster
– Combined SNL/Microsoft task