System Software for High Performance Computing
Agenda
● Overview of Supercomputers
● Blue Gene/Q System
● LoadLeveler Job Scheduler
● General Parallel File System
● HPC at UR
What is a Supercomputer?
● Lots of other computers
● Closely colocated on a managed network
● Backing store
● IPC
The World's Simplest Supercomputer (Beowulf Cluster)
[Diagram: two machines, each running Linux w/ rsh enabled, connected over a network]
Key Concepts in Supercomputers
● Cluster: a grouping of computers
● Node: a computer within the cluster
● Job: a program instance (a set of processes)
Operating Systems for HPC
● Each computer in the cluster has an operating system
● Off the shelf
  – Red Hat Linux, Windows Server
● Specialized
  – Compute Node Linux, CNK, INK
● But the supercomputer can also have an “operating system,” called the system management software, which manages its component nodes
[Diagram: applications running on the system management software, which sits above each node's OS]
“Operating Systems” for HPC
● System Management Software Components
  ● Node Operating System (Linux, CNK, etc.)
  ● Message Passing (MPI, PVM)
  ● Job Scheduler (Maui Scheduler, LoadLeveler)
  ● Resource Manager (Torque Resource Manager, LSF, SLURM)
  ● Backing Store (AFS, DFS, GPFS)
  ● Front End UI
● Hardware Architecture
Blue Gene/Q Cluster
● IBM flagship supercomputer
● Third generation
● Complete supercomputer system
  ● Architecture
  ● System Management Software
Blue Gene/Q Architecture
[Diagram: Front End UI atop the System Management Software; I/O nodes run INK (I/O Node Kernel); compute nodes run CNK (Compute Node Kernel) on 17 cores; the backing store is GPFS (General Parallel File System); separate File I/O and IPC networks]
Blue Gene System Management Software
● Job Scheduler: LoadLeveler
● Resource Manager: LoadLeveler Central Manager
● IPC: MPICH2
● File System: GPFS
● OS: CNK, INK
Job Scheduling
● Maximize resource usage
  ● CPU cycles, RAM, storage space, software licenses
● Algorithms
  ● SJF, LJF, FIFO, High Priority, etc. (ordering sketch below)
● Considerations
  ● Job type, OS awareness, scalability, efficiency, dynamic capability, preemption, OS scheduling
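A minimal sketch of how an ordering policy such as SJF differs from FIFO. The job record, names, and runtime estimates are illustrative assumptions, not LoadLeveler internals:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical job record: only the fields an ordering policy needs. */
typedef struct {
    const char *name;
    int submit_order;      /* FIFO key: arrival sequence number        */
    double est_runtime_s;  /* SJF/LJF key: user-supplied time estimate */
} job_t;

/* SJF: shortest estimated runtime first. */
static int cmp_sjf(const void *a, const void *b) {
    const job_t *ja = a, *jb = b;
    return (ja->est_runtime_s > jb->est_runtime_s) -
           (ja->est_runtime_s < jb->est_runtime_s);
}

int main(void) {
    job_t queue[] = {
        { "sim-large", 0, 7200.0 },
        { "post-proc", 1,   60.0 },
        { "sim-small", 2,  600.0 },
    };
    size_t n = sizeof queue / sizeof queue[0];

    /* FIFO is simply the queue in arrival order; SJF re-sorts it. */
    qsort(queue, n, sizeof queue[0], cmp_sjf);
    for (size_t i = 0; i < n; i++)
        printf("%zu: %s (%.0f s)\n", i, queue[i].name, queue[i].est_runtime_s);
    return 0;
}
```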
Job Scheduler: LoadLeveler
● Built-in Blue Gene/Q job scheduler
● Checkpoint support
● Priority queues
  ● Priority from user group
  ● FIFO within jobs of equal priority
● Generally nonpreemptible
LoadLeveler: LL_DEFAULT
● Double queue w/ advanced reservation
  ● As nodes are freed, reserve them for the next job
  ● NEGOTIATOR_PARALLEL_HOLD:
    – Specifies the amount of time a job can hold onto a resource
  ● Serial programs queued separately from parallel ones
● Issues
  ● Underutilization
  ● Jobs may never get enough resources within the time allotted
LoadLeveler: BACKFILL
● Double queue w/ advanced reservation w/ wall clock limit
  ● Scheduler can determine when resources will be available
  ● Can “backfill” shorter jobs before large jobs (feasibility check sketched below)
● Issues
  ● Priority inversion
  ● Incorrect wall clock limits
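A minimal sketch of the backfill test (assumed structures, not LoadLeveler's actual code): a waiting job may start now only if it fits in the currently free nodes and, by its own wall clock limit, is guaranteed to finish before the reservation for the blocked top-priority job begins. This also shows why an incorrect wall clock limit breaks the scheme:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical state the scheduler tracks (times in seconds). */
typedef struct {
    int    nodes_needed;
    double wall_clock_limit; /* user-declared maximum runtime */
} job_t;

/* A candidate may backfill only if it fits in the free nodes AND
 * finishes, per its wall clock limit, before the reservation for
 * the blocked top-priority job begins. */
static bool can_backfill(const job_t *cand, int free_nodes,
                         double now, double reservation_start)
{
    return cand->nodes_needed <= free_nodes &&
           now + cand->wall_clock_limit <= reservation_start;
}

int main(void) {
    job_t shortjob = { .nodes_needed = 4, .wall_clock_limit = 600.0 };
    double now = 0.0, big_job_starts = 3600.0; /* big job reserved at t+1h */

    printf("backfill ok? %s\n",
           can_backfill(&shortjob, 16, now, big_job_starts) ? "yes" : "no");
    return 0;
}
```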
LoadLeveler: GANG
● Coordinated time-multiplexed scheduler
● Each time slice is a virtual machine
● Issues
  ● Increased run time
  ● Context switch overhead
  ● RAM limited
General Parallel File System (GPFS)
● Blue Gene/Q default file system
● Parallel access to files and file metadata
● Design considerations:
  ● Highly parallel access
  ● Bandwidth bottleneck
  ● Huge disks and files
[Diagram: compute nodes connected through I/O nodes and an I/O network to a disk array]
GPFS Overview
● Striped files
  ● Files stored in (~256K) blocks per disk
  ● Distributed in round-robin fashion (mapping sketched below)
  ● Massively parallel file retrieval, bandwidth limited
  ● Vulnerable to failure
    ● RAID redundancy on each disk
[Diagram: a file's blocks striped across the disks]
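A toy sketch of how round-robin striping maps a file offset to a disk and a block on that disk. The block size follows the slide; the disk count and function are illustrative assumptions, not GPFS internals:

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed layout parameters for illustration only. */
#define BLOCK_SIZE (256u * 1024u)  /* ~256K stripe unit, as on the slide */
#define NUM_DISKS  8u              /* hypothetical disk count            */

/* Round-robin striping: consecutive file blocks go to consecutive disks. */
static void locate(uint64_t offset, unsigned *disk, uint64_t *disk_block)
{
    uint64_t file_block = offset / BLOCK_SIZE;
    *disk       = (unsigned)(file_block % NUM_DISKS);
    *disk_block = file_block / NUM_DISKS;
}

int main(void) {
    for (uint64_t off = 0; off < 5ull * BLOCK_SIZE; off += BLOCK_SIZE) {
        unsigned disk; uint64_t blk;
        locate(off, &disk, &blk);
        printf("offset %8llu -> disk %u, block %llu\n",
               (unsigned long long)off, disk, (unsigned long long)blk);
    }
    return 0;
}
```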
GPFS: Read/Write
● File parallelism via two methods
  ● Distributed lock manager
    – Locks for byte ranges within a file (conflict rule sketched below)
    – Lock tokens issued to I/O nodes
  ● Data shipping
    – RCU-managed blocks within a single file
● Metadata parallelism
  ● One I/O node is designated as the metanode for a file and maintains its inode information
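A minimal sketch of the check a byte-range lock manager performs (hypothetical structures, not GPFS's token protocol): two ranges conflict exactly when they overlap and at least one holder is a writer.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

/* Hypothetical byte-range lock token held by an I/O node. */
typedef struct {
    uint64_t start, end;  /* half-open range [start, end) within the file */
    bool     is_write;
} range_lock_t;

/* Two byte-range locks conflict iff the ranges overlap and
 * at least one of them is a write lock. */
static bool conflicts(const range_lock_t *a, const range_lock_t *b)
{
    bool overlap = a->start < b->end && b->start < a->end;
    return overlap && (a->is_write || b->is_write);
}

int main(void) {
    range_lock_t held = { 0, 4096, false };    /* reader on first 4 KB */
    range_lock_t req  = { 1024, 2048, true };  /* writer inside it     */
    printf("conflict? %s\n", conflicts(&held, &req) ? "yes" : "no");
    return 0;
}
```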
GPFS: Allocate/Delete
● Allocation Manager
  ● Maintains a bitmap of free blocks (sketch below)
  ● Issues region locks
● File allocation
  ● Get a region lock, check for free space
● File deletion
  ● Requires update of the allocation manager
  ● Requires clearing disk space while holding the region lock
  ● Delayed distributed deletion across I/O nodes based on lock ownership
[Diagram: blocks, files, and regions within the allocation bitmap]
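A toy sketch of a free-block bitmap carved into lockable regions (all sizes assumed; the point is only that a node holding one region's lock can allocate there while other nodes allocate concurrently in other regions):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define TOTAL_BLOCKS 1024u         /* hypothetical disk size in blocks   */
#define NUM_REGIONS  8u            /* bitmap split into lockable regions */
#define REGION_BLOCKS (TOTAL_BLOCKS / NUM_REGIONS)

static uint8_t bitmap[TOTAL_BLOCKS / 8]; /* 1 bit per block, 0 = free */

/* Scan one region for a free block; returns block index or -1.
 * In a real cluster the caller would hold that region's lock. */
static int alloc_in_region(unsigned region)
{
    unsigned lo = region * REGION_BLOCKS, hi = lo + REGION_BLOCKS;
    for (unsigned b = lo; b < hi; b++) {
        if (!(bitmap[b / 8] & (1u << (b % 8)))) {
            bitmap[b / 8] |= 1u << (b % 8);  /* mark allocated */
            return (int)b;
        }
    }
    return -1; /* region full: retry with another region's lock */
}

int main(void) {
    memset(bitmap, 0, sizeof bitmap);
    printf("node A got block %d in region 0\n", alloc_in_region(0));
    printf("node B got block %d in region 3\n", alloc_in_region(3));
    return 0;
}
```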
GPFS: Disk Organization
● Extensible hashing within directories
  ● Use n bits of the hash function to group files (sketch below)
  ● On collision, increase to n+1 bits and reorganize
● Journaling file system on disk
  ● Shared journal, so any node can restore the disk
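A minimal sketch of the extensible-hashing idea (toy hash, not GPFS's directory format): a name's bucket is the low n bits of its hash, and an overflowing bucket forces the directory to widen to n+1 bits and split.

```c
#include <stdio.h>

/* Toy string hash (djb2) for illustration. */
static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

int main(void) {
    const char *names[] = { "data.0", "data.1", "log", "ckpt" };
    for (unsigned n = 1; n <= 3; n++) {          /* directory depth */
        printf("using %u bit(s) of the hash:\n", n);
        for (int i = 0; i < 4; i++)
            printf("  %-6s -> bucket %u\n",
                   names[i], hash(names[i]) & ((1u << n) - 1));
        /* When a bucket overflows, the directory grows to n+1 bits
         * and only the overflowing bucket's entries are rehashed. */
    }
    return 0;
}
```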
Message Passing: MPI (Message Passing Interface)
● Standard (not a library)
● Implementations with compliant compilers:
  – OpenMPI, MPICH, mpiJava, pyMPI, etc.
● “Superfork” to all CPUs available
● MPI_INIT(), MPI_SEND(), MPI_RECV(), MPI_WAIT() (minimal program below)
● (Mostly) OS and cluster manager independent
[Diagram: MPI_INIT() launching a master process and several worker processes]
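A minimal C sketch of the master/worker pattern from the diagram, using the standard MPI calls named on the slide (the greeting protocol itself is just an example):

```c
/* Minimal MPI round-trip: rank 0 hears from every other rank.
 * Compile: mpicc hello.c   Run: mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* the "superfork" entry point */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?                   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes total?   */

    if (rank == 0) {
        int from;
        for (int i = 1; i < size; i++) {
            MPI_Recv(&from, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("master heard from rank %d\n", from);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```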
Resource Manager
● “Resource managers provide the low-level functionality to start, hold, cancel, and monitor jobs. Without these capabilities, a scheduler alone cannot control jobs.”
● Daemon runs on each node on top of the OS
● Layer of abstraction between the OS and the Message Passing Interface
● Interfaces with the job scheduler
● Manages job submission and the admin interface
● Monitors compute resources
HPC at U of R
● Blue Streak
  ● Blue Gene/Q system
  ● SLURM
● BG/P
  ● Blue Gene/P system
  ● LoadLeveler resource manager/scheduler
● BlueHive
  ● Torque Resource Manager, Maui Scheduler
  ● Intel Blade Center system
Works Cited
● Barney, Blaise. Message Passing Interface (MPI). Lawrence Livermore National Laboratory. https://computing.llnl.gov/tutorials/mpi/#Abstract (2012).
● Center for Integrated Research Computing. “Resources.” University of Rochester. http://www.circ.rochester.edu/resources.html (2012).
● Gilge, Megan. IBM System Blue Gene Solution: Blue Gene/Q Application Development. International Technical Support Organization, IBM. March 2012. http://www.redbooks.ibm.com/redpieces/pdfs/sg247948.pdf
● Iqbal, Saeed, Rinku Gupta, and Yung-Chin Fang. Planning Considerations for Job Scheduling in HPC Clusters. Dell Power Solutions, February 2005. http://www.dell.com/downloads/global/power/ps1q05-20040135-fang.pdf
● Lakner, Gary and Brant Knudson. IBM System Blue Gene Solution: Blue Gene/Q System Administration. International Technical Support Organization, IBM. June 2012. http://www.redbooks.ibm.com/redbooks/pdfs/sg247869.pdf
● Kannan, Subramanian, Mark Roberts, Peter Mayes, Dave Brelsford, and Joseph F. Skovira. Workload Management with LoadLeveler. International Technical Support Organization, IBM. November 2001. http://www.redbooks.ibm.com/redbooks/pdfs/sg246038.pdf
● Schmuck, Frank and Roger Haskin. “GPFS: A Shared-Disk File System for Large Computing Clusters.” Proceedings of the Conference on File and Storage Technologies (FAST ’02), 28–30 January 2002, Monterey, CA, pp. 231–244. USENIX, Berkeley, CA.