System Software for High Performance Computing
Agenda
● Overview of Supercomputers
● Blue Gene/Q System
● LoadLeveler Job Scheduler
● General Parallel File System
● HPC at UR
What is a Supercomputer?
● Lots of other computers
● Closely colocated on a managed network
● Backing store
● IPC
The World's Simplest Supercomputer (Beowulf Cluster)
[Diagram: two machines, each running Linux w/ rsh enabled, connected over a network]
Key Concepts in Supercomputers
● Cluster: a grouping of computers
● Node: a computer within the cluster
● Job: a program instance (a set of processes)
Operating Systems for HPC
● Each computer in the cluster has an operating system
● Off the shelf
  – Red Hat Linux, Windows Server
● Specialized
  – Compute Node Linux, CNK, INK
● But the supercomputer can also have an “operating system,” called the system management software, which manages its component nodes
[Diagram: applications running on the system management software, which sits above each node's OS]
“Operating Systems” for HPC
● System Management Software Components
  ● Node Operating System (Linux, CNK, etc.)
  ● Message Passing (MPI, PVM)
  ● Job Scheduler (Maui Scheduler, LoadLeveler)
  ● Resource Manager (Torque Resource Manager, LSF, SLURM)
  ● Backing Store (AFS, DFS, GPFS)
  ● Front End UI
● Hardware Architecture
Blue Gene/Q Cluster
● IBM flagship supercomputer
● Third generation
● Complete supercomputer system
  ● Architecture
  ● System Management Software
Blue Gene/Q Architecture
[Diagram: Front End UI atop the System Management Software; I/O nodes run INK (I/O Node Kernel); compute nodes run CNK (Compute Node Kernel) on 17 cores; the backing store is GPFS (General Parallel File System); separate File I/O and IPC networks]
Blue Gene System Management Software
● Job Scheduler: LoadLeveler
● Resource Manager: LoadLeveler Central Manager
● IPC: MPICH2
● File System: GPFS
● OS: CNK, INK
Job Scheduling
● Maximize resource usage
  ● CPU cycles, RAM, storage space, software licenses
● Algorithms
  ● SJF, LJF, FIFO, High Priority, etc. (ordering sketch below)
● Considerations
  ● Job type, OS awareness, scalability, efficiency, dynamic capability, preemption, OS scheduling
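A minimal sketch of how an ordering policy such as SJF differs from FIFO. The job record, names, and runtime estimates are illustrative assumptions, not LoadLeveler internals:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical job record: only the fields an ordering policy needs. */
typedef struct {
    const char *name;
    int submit_order;      /* FIFO key: arrival sequence number        */
    double est_runtime_s;  /* SJF/LJF key: user-supplied time estimate */
} job_t;

/* SJF: shortest estimated runtime first. */
static int cmp_sjf(const void *a, const void *b) {
    const job_t *ja = a, *jb = b;
    return (ja->est_runtime_s > jb->est_runtime_s) -
           (ja->est_runtime_s < jb->est_runtime_s);
}

int main(void) {
    job_t queue[] = {
        { "sim-large", 0, 7200.0 },
        { "post-proc", 1,   60.0 },
        { "sim-small", 2,  600.0 },
    };
    size_t n = sizeof queue / sizeof queue[0];

    /* FIFO is simply the queue in arrival order; SJF re-sorts it. */
    qsort(queue, n, sizeof queue[0], cmp_sjf);
    for (size_t i = 0; i < n; i++)
        printf("%zu: %s (%.0f s)\n", i, queue[i].name, queue[i].est_runtime_s);
    return 0;
}
```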
Job Scheduler: LoadLeveler
● Built-in Blue Gene/Q job scheduler
● Checkpoint support
● Priority queues
  ● Priority from user group
  ● FIFO within jobs of equal priority
● Generally nonpreemptible
LoadLeveler: LL_DEFAULT
● Double queue w/ advanced reservation
  ● As nodes are freed, reserve them for the next job
  ● NEGOTIATOR_PARALLEL_HOLD:
    – Specifies the amount of time a job can hold onto a resource
  ● Serial programs queued separately from parallel ones
● Issues
  ● Underutilization
  ● Jobs may never get enough resources within the time allotted
LoadLeveler: BACKFILL
● Double queue w/ advanced reservation w/ wall clock limit
  ● Scheduler can determine when resources will be available
  ● Can “backfill” shorter jobs before large jobs (feasibility check sketched below)
● Issues
  ● Priority inversion
  ● Incorrect wall clock limits
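A minimal sketch of the backfill test (assumed structures, not LoadLeveler's actual code): a waiting job may start now only if it fits in the currently free nodes and, by its own wall clock limit, is guaranteed to finish before the reservation for the blocked top-priority job begins. This also shows why an incorrect wall clock limit breaks the scheme:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical state the scheduler tracks (times in seconds). */
typedef struct {
    int    nodes_needed;
    double wall_clock_limit; /* user-declared maximum runtime */
} job_t;

/* A candidate may backfill only if it fits in the free nodes AND
 * finishes, per its wall clock limit, before the reservation for
 * the blocked top-priority job begins. */
static bool can_backfill(const job_t *cand, int free_nodes,
                         double now, double reservation_start)
{
    return cand->nodes_needed <= free_nodes &&
           now + cand->wall_clock_limit <= reservation_start;
}

int main(void) {
    job_t shortjob = { .nodes_needed = 4, .wall_clock_limit = 600.0 };
    double now = 0.0, big_job_starts = 3600.0; /* big job reserved at t+1h */

    printf("backfill ok? %s\n",
           can_backfill(&shortjob, 16, now, big_job_starts) ? "yes" : "no");
    return 0;
}
```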
LoadLeveler: GANG
● Coordinated time-multiplexed scheduler
● Each time slice is a virtual machine
● Issues
  ● Increased run time
  ● Context switch overhead
  ● RAM limited
General Parallel File System (GPFS)
● Blue Gene/Q default file system
● Parallel access to files and file metadata
● Design considerations:
  ● Highly parallel access
  ● Bandwidth bottleneck
  ● Huge disks and files
[Diagram: compute nodes connected through I/O nodes and an I/O network to a disk array]
GPFS Overview
● Striped files
  ● Files stored in (~256K) blocks per disk
  ● Distributed in round-robin fashion (mapping sketched below)
  ● Massively parallel file retrieval, bandwidth limited
  ● Vulnerable to failure
    ● RAID redundancy on each disk
[Diagram: a file's blocks striped across the disks]
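A toy sketch of how round-robin striping maps a file offset to a disk and a block on that disk. The block size follows the slide; the disk count and function are illustrative assumptions, not GPFS internals:

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed layout parameters for illustration only. */
#define BLOCK_SIZE (256u * 1024u)  /* ~256K stripe unit, as on the slide */
#define NUM_DISKS  8u              /* hypothetical disk count            */

/* Round-robin striping: consecutive file blocks go to consecutive disks. */
static void locate(uint64_t offset, unsigned *disk, uint64_t *disk_block)
{
    uint64_t file_block = offset / BLOCK_SIZE;
    *disk       = (unsigned)(file_block % NUM_DISKS);
    *disk_block = file_block / NUM_DISKS;
}

int main(void) {
    for (uint64_t off = 0; off < 5ull * BLOCK_SIZE; off += BLOCK_SIZE) {
        unsigned disk; uint64_t blk;
        locate(off, &disk, &blk);
        printf("offset %8llu -> disk %u, block %llu\n",
               (unsigned long long)off, disk, (unsigned long long)blk);
    }
    return 0;
}
```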
GPFS: Read/Write
● File parallelism via two methods
  ● Distributed lock manager
    – Locks for byte ranges within a file (conflict rule sketched below)
    – Lock tokens issued to I/O nodes
  ● Data shipping
    – RCU-managed blocks within a single file
● Metadata parallelism
  ● One I/O node is designated as the metanode for a file and maintains its inode information
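A minimal sketch of the check a byte-range lock manager performs (hypothetical structures, not GPFS's token protocol): two ranges conflict exactly when they overlap and at least one holder is a writer.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

/* Hypothetical byte-range lock token held by an I/O node. */
typedef struct {
    uint64_t start, end;  /* half-open range [start, end) within the file */
    bool     is_write;
} range_lock_t;

/* Two byte-range locks conflict iff the ranges overlap and
 * at least one of them is a write lock. */
static bool conflicts(const range_lock_t *a, const range_lock_t *b)
{
    bool overlap = a->start < b->end && b->start < a->end;
    return overlap && (a->is_write || b->is_write);
}

int main(void) {
    range_lock_t held = { 0, 4096, false };    /* reader on first 4 KB */
    range_lock_t req  = { 1024, 2048, true };  /* writer inside it     */
    printf("conflict? %s\n", conflicts(&held, &req) ? "yes" : "no");
    return 0;
}
```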
GPFS: Allocate/Delete
● Allocation Manager
  ● Maintains a bitmap of free blocks (sketch below)
  ● Issues region locks
● File allocation
  ● Get a region lock, check for free space
● File deletion
  ● Requires update of the allocation manager
  ● Requires clearing disk space while holding the region lock
  ● Delayed distributed deletion across I/O nodes based on lock ownership
[Diagram: blocks, files, and regions within the allocation bitmap]
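A toy sketch of a free-block bitmap carved into lockable regions (all sizes assumed; the point is only that a node holding one region's lock can allocate there while other nodes allocate concurrently in other regions):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define TOTAL_BLOCKS 1024u         /* hypothetical disk size in blocks   */
#define NUM_REGIONS  8u            /* bitmap split into lockable regions */
#define REGION_BLOCKS (TOTAL_BLOCKS / NUM_REGIONS)

static uint8_t bitmap[TOTAL_BLOCKS / 8]; /* 1 bit per block, 0 = free */

/* Scan one region for a free block; returns block index or -1.
 * In a real cluster the caller would hold that region's lock. */
static int alloc_in_region(unsigned region)
{
    unsigned lo = region * REGION_BLOCKS, hi = lo + REGION_BLOCKS;
    for (unsigned b = lo; b < hi; b++) {
        if (!(bitmap[b / 8] & (1u << (b % 8)))) {
            bitmap[b / 8] |= 1u << (b % 8);  /* mark allocated */
            return (int)b;
        }
    }
    return -1; /* region full: retry with another region's lock */
}

int main(void) {
    memset(bitmap, 0, sizeof bitmap);
    printf("node A got block %d in region 0\n", alloc_in_region(0));
    printf("node B got block %d in region 3\n", alloc_in_region(3));
    return 0;
}
```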
GPFS: Disk Organization
● Extensible hashing within directories
  ● Use n bits of the hash function to group files (sketch below)
  ● On collision, increase to n+1 bits and reorganize
● Journaling file system on disk
  ● Shared journal, so any node can restore the disk
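A minimal sketch of the extensible-hashing idea (toy hash, not GPFS's directory format): a name's bucket is the low n bits of its hash, and an overflowing bucket forces the directory to widen to n+1 bits and split.

```c
#include <stdio.h>

/* Toy string hash (djb2) for illustration. */
static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

int main(void) {
    const char *names[] = { "data.0", "data.1", "log", "ckpt" };
    for (unsigned n = 1; n <= 3; n++) {          /* directory depth */
        printf("using %u bit(s) of the hash:\n", n);
        for (int i = 0; i < 4; i++)
            printf("  %-6s -> bucket %u\n",
                   names[i], hash(names[i]) & ((1u << n) - 1));
        /* When a bucket overflows, the directory grows to n+1 bits
         * and only the overflowing bucket's entries are rehashed. */
    }
    return 0;
}
```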
Message Passing: MPI (Message Passing Interface)
● Standard (not a library)
● Implementations with compliant compilers:
  – OpenMPI, MPICH, mpiJava, pyMPI, etc.
● “Superfork” to all CPUs available
● MPI_INIT(), MPI_SEND(), MPI_RECV(), MPI_WAIT() (minimal program below)
● (Mostly) OS and cluster manager independent
[Diagram: MPI_INIT() launching a master process and several worker processes]
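A minimal C sketch of the master/worker pattern from the diagram, using the standard MPI calls named on the slide (the greeting protocol itself is just an example):

```c
/* Minimal MPI round-trip: rank 0 hears from every other rank.
 * Compile: mpicc hello.c   Run: mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* the "superfork" entry point */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?                   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes total?   */

    if (rank == 0) {
        int from;
        for (int i = 1; i < size; i++) {
            MPI_Recv(&from, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("master heard from rank %d\n", from);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```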
Resource Manager
● “Resource managers provide the low-level functionality to start, hold, cancel, and monitor jobs. Without these capabilities, a scheduler alone cannot control jobs.”
● Daemon runs on each node on top of the OS
● Layer of abstraction between the OS and the Message Passing Interface
● Interfaces with the job scheduler
● Manages job submission and the admin interface
● Monitors compute resources
HPC at U of R
● Blue Streak
  ● Blue Gene/Q system
  ● SLURM
● BG/P
  ● Blue Gene/P system
  ● LoadLeveler resource manager/scheduler
● BlueHive
  ● Torque Resource Manager, Maui Scheduler
  ● Intel Blade Center system
Works Cited
● Barney, Blaise. Message Passing Interface (MPI). Lawrence Livermore National Laboratory. https://computing.llnl.gov/tutorials/mpi/#Abstract (2012).
● Center for Integrated Research Computing. “Resources.” University of Rochester. http://www.circ.rochester.edu/resources.html (2012).
● Gilge, Megan. IBM System Blue Gene Solution: Blue Gene/Q Application Development. International Technical Support Organization, IBM. March 2012. http://www.redbooks.ibm.com/redpieces/pdfs/sg247948.pdf
● Iqbal, Saeed, Rinku Gupta, and Yung-Chin Fang. Planning Considerations for Job Scheduling in HPC Clusters. Dell Power Solutions, February 2005. http://www.dell.com/downloads/global/power/ps1q05-20040135-fang.pdf
● Lakner, Gary and Brant Knudson. IBM System Blue Gene Solution: Blue Gene/Q System Administration. International Technical Support Organization, IBM. June 2012. http://www.redbooks.ibm.com/redbooks/pdfs/sg247869.pdf
● Kannan, Subramanian, Mark Roberts, Peter Mayes, Dave Brelsford, and Joseph F. Skovira. Workload Management with LoadLeveler. International Technical Support Organization, IBM. November 2001. http://www.redbooks.ibm.com/redbooks/pdfs/sg246038.pdf
● Schmuck, Frank and Roger Haskin. “GPFS: A Shared-Disk File System for Large Computing Clusters.” Proceedings of the Conference on File and Storage Technologies (FAST ’02), 28–30 January 2002, Monterey, CA, pp. 231–244. USENIX, Berkeley, CA.