Parallel Scalable Operating Systems
Presented by Dr. Florin Isaila
Universidad Carlos III de Madrid
Visiting Scholar, Argonne National Laboratory
Contents
Preliminaries: Top500, Scalability
Blue Gene History: BG/L, BG/C, BG/P, BG/Q
Scalable OS for BG/L
Scalable file systems
BG/P at Argonne National Lab
Conclusions
Top500
Generalities
Since 1993, twice a year: June and November
Ranking of the most powerful computing systems in the world
Ranking criterion: performance on the LINPACK benchmark
Originated by Jack Dongarra
Web site: www.top500.org
HPL: High-Performance Linpack
Solves a dense system of linear equations
Variant of LU factorization of matrices of size N
Measure of a computer's floating-point rate of execution
Computation done in 64-bit floating-point arithmetic
Rpeak: theoretical system performance, an upper bound for the real performance (in MFLOPS)
Ex: Intel Itanium 2 at 1.5 GHz, 4 FLOP/cycle -> 6 GFLOPS
Nmax: obtained by varying N and choosing the problem size that maximizes performance
Rmax: maximum real performance, achieved for Nmax
N1/2: size of problem needed to achieve ½ of Rmax
Jack Dongarra's slide
Amdahl's law
Suppose a fraction f of your application is not parallelizable
The remaining 1-f is parallelizable on p processors
Speedup(p) = T1 / Tp
          <= T1 / (f*T1 + (1-f)*T1/p) = 1 / (f + (1-f)/p)
          <= 1/f
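As a worked check of the bound (numbers chosen here for illustration, not taken from the slide): with f = 0.01 and p = 1024,
Speedup(1024) <= 1 / (0.01 + 0.99/1024) ≈ 91.2, well below both the 1/f = 100 ceiling and the ideal speedup of 1024.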
[Figure: Amdahl's Law for 1024 processors: speedup (0 to 1024) vs. serial fraction s (0 to 0.04)]
Load Balance
Work: data access, computation
Not just equal work: processors must also be busy at the same time
Speedup <= Sequential Work / Max Work on any Processor
Ex: sequential work = 1000, per-processor work = 200, 100, 400, 300 => Speedup <= 1000/400 = 2.5 (a small sketch follows below)
[Figure: bar chart of work, scale 0 to 1200: Sequential = 1000; Processors 1-4 with 200, 100, 400, and 300 units of work]
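A minimal C sketch (not from the slides) of this load-balance bound, using the per-processor work figures from the example:

```c
#include <stdio.h>

/* Load-balance bound: speedup <= sequential work / max work on any processor */
int main(void) {
    double work[] = {200, 100, 400, 300};        /* example work per processor */
    int p = sizeof(work) / sizeof(work[0]);

    double total = 0, max = 0;
    for (int i = 0; i < p; i++) {
        total += work[i];                        /* sequential work = sum      */
        if (work[i] > max) max = work[i];        /* busiest processor          */
    }

    /* prints: speedup bound = 2.50 (perfect balance would give 4.00) */
    printf("speedup bound = %.2f (perfect balance would give %.2f)\n",
           total / max, (double)p);
    return 0;
}
```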
Communication and synchronization
Communication is expensive! Measure: communication-to-computation ratio
Inherent communication: determined by the assignment of tasks to processes
Actual communication may be larger (artifactual)
One principle: assign tasks that access the same data to the same process
Speedup <= Sequential Work / Max(Work + Synch Wait Time + Comm Cost)
[Figure: timeline for Processes 1-3 showing work, synchronization points, synchronization wait time, and communication]
Blue Gene
Blue Gene partners
IBM: "Blue" is the corporate color of IBM; "Gene" is the intended use of the Blue Gene clusters: computational biology, specifically protein folding
Lawrence Livermore National Lab
Department of Energy
Academia
BG History
December 1999: project created, a supercomputer for biomolecular phenomena
29 September 2004: the Blue Gene/L prototype overtook NEC's Earth Simulator (36.01 TFlops, 8 cabinets)
November 2004: Blue Gene/L reaches 70.82 TFlops (16 cabinets)
24 March 2005: Blue Gene/L broke its own record, reaching 135.5 TFLOPS (32 cabinets)
June 2005: Blue Gene systems at several sites worldwide took 5 of the top 10 positions (top500.org)
27 October 2006: 280.6 TFLOPS (65,536 compute nodes plus an additional 1,024 I/O nodes in 64 air-cooled cabinets)
November 2007: 478.2 TFLOPS
Family
BG/L BG/C BG/P BG/Q
Blue Gene/L
[Figure: BG/L packaging hierarchy]
Chip: 2 processors, 2.8/5.6 GF/s, 4 MB
Compute Card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB
Node Card: 32 chips (4x4x2), 16 compute and 0-2 I/O cards, 90/180 GF/s, 16 GB
Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB
System: 64 racks (64x32x32), 180/360 TF/s, 32 TB
Technical specifications
64 cabinets containing 65,536 high-performance compute nodes (chips)
1,024 I/O nodes
32-bit PowerPC processors
5 networks
Main memory totals 33 terabytes
Maximum performance of 183.5 TFLOPS when using one processor per node for computation and the other for communication, and 367 TFLOPS when using both for computation
Blue Gene / L
Networks:
3D Torus
Collective Network
Global Barrier/Interrupt
Gigabit Ethernet (I/O and connectivity)
Control (system boot, debug, monitoring)
Networks
Three-dimensional torus: compute nodes
Global tree: collective communication and I/O
Ethernet
Control network
[Figure: three-dimensional (3D) torus network in which the nodes (red balls) are connected to their six nearest-neighbor nodes in a 3D mesh]
Blue Gene / L
Processor: PowerPC 440 at 700 MHz; low power allows dense packaging
External memory: 512 MB SDRAM per node / 1 GB
Slow embedded core at a clock speed of 700 MHz
32 KB L1 cache
L2 is a small prefetch buffer
4 MB embedded DRAM L3 cache
PowerPC 440 core
BG/L compute ASIC
Non-cache-coherent L1
Pre-fetch buffer (L2)
Shared 4 MB embedded DRAM (L3)
Interface to external DRAM
5 network interfaces: torus, collective, global barrier, Ethernet, control
Block diagram
Blue Gene / L
Compute Nodes: Dual processor, 1024 per Rack
I/O Nodes: Dual processor, 16-128 per Rack
Blue Gene / L
Compute Nodes: Proprietary kernel (tailored to processor design)
I/O Nodes: Embedded Linux
Front-end and service nodes: SuSE SLES 9 Linux (familiar to users)
Blue Gene / L
Performance:
Peak performance per rack: 5.73 TFlops
Linpack performance per rack: 4.71 TFlops
Blue Gene / C
a.k.a. Cyclops64: massively parallel (first supercomputer on a chip)
Processors connected by a 96-port, 7-stage non-internally-blocking crossbar switch
Theoretical peak performance (chip): 80 GFlops
Blue Gene / C
Cellular architecture
64-bit Cyclops64 chip: 500 MHz, 80 processors (each has 2 thread units and an FP unit)
Software
Cyclops64 exposes much of the underlying hardware to the programmer, allowing the programmer to write very high-performance, finely tuned software.
Blue Gene / C
[Picture of BG/C]
Performance:
Board: 320 GFlops
Rack: 15.76 TFlops
System: 1.1 PFlops
Blue Gene / P
Similar architecture to BG/L, but:
Cache-coherent L1 cache
4 cores per node
10 Gbit Ethernet external I/O infrastructure
Scales up to 3 PFLOPS
More energy efficient
167 TF/s by 2007, 1 PF by 2008
Blue Gene / Q
Continuation of Blue Gene/L and /P
Targeting 10 PF/s by 2010/2011
Higher frequency at similar performance per watt
Similar number of nodes
Many more cores
More generally useful
Aggressive compiler
New network: scalable and cheap
Motivation for a scalable OS
Blue Gene/L is currently the world’s fastest and most scalable supercomputer
Several system components contribute to that scalability; the operating systems for the different nodes of Blue Gene/L are among the components responsible for it.
The OS overhead on one node affects the scalability of the whole system
Goal: design a scalable solution for the OS.
High level view of BG/L
Principle: the structure of the software should reflect the structure of the hardware.
BG/L Partitioning
Space-sharing
Divided along natural boundaries into partitions
Each partition can run only one job
Each node can be in one of these modes:
Coprocessor: one processor assists the other
Virtual node: two separate processors, each with its own memory space
OS
Compute nodes: dedicated OS
I/O nodes: dedicated OS
Service nodes: conventional off-the-shelf OS
Front-end nodes: program compilation, debugging, job submission
File servers: store data, not specific to BG/L
BG/L OS solution
Components: I/O nodes, service nodes, CNK
The compute and I/O nodes are organized into logical entities called processing sets or psets: 1 I/O node + a collection of CNs (8, 16, 64, or 128 CNs)
A logical concept that should reflect physical proximity => fast communication
Job: collection of N compute processes (on CNs), each with its own private address space
Message passing with MPI: ranks 0 to N-1 (see the sketch below)
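A minimal MPI sketch (not from the slides) of how the N compute processes of a job see themselves as ranks 0 to N-1; on BG/L the MPI library maps these ranks onto the torus and collective networks:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* N = processes in the job   */
    printf("compute process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```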
High level view of BG/L
BG/L OS solution: CNK
Compute nodes run only compute processes, and all the compute nodes of a particular partition execute in one of two modes: coprocessor mode or virtual node mode
Compute Node Kernel (CNK): a simple OS
Creates the address space(s)
Loads code and initializes data
Transfers processor control to the loaded executable
CNK
Consumes 1 MB and creates either:
one address space of 511/1023 MB, or
2 address spaces of 255/511 MB
No virtual memory, no paging: the entire mapping fits into the TLB of the PowerPC
Load in push mode: 1 CN reads the executable from the FS and sends it to all the others
One image loaded, and then the kernel stays out of the way!!!
CNK
No OS scheduling (one thread)
No memory management (no TLB overhead)
No local file services
User-level execution until:
the process requests a system call
a hardware interrupt occurs: timer (requested by the application) or abnormal events
System calls (dispatch sketch below):
Simple: handled locally (getting the time, setting an alarm)
Complex: forwarded to the I/O nodes
Unsupported (fork/mmap): error
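A minimal sketch (hypothetical names, not actual CNK code) of this three-way system-call dispatch: simple calls are served on the compute node, complex ones are shipped to the pset's I/O node, unsupported ones fail immediately.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical syscall identifiers for illustration only */
enum sc { SC_GETTIME, SC_ALARM, SC_OPEN, SC_READ, SC_WRITE, SC_FORK, SC_MMAP };

static long handle_locally(enum sc nr)     { (void)nr; return 0; } /* e.g. read the timer */
static long forward_to_io_node(enum sc nr) { (void)nr; return 0; } /* ship call to CIOD   */

long cnk_syscall(enum sc nr) {
    switch (nr) {
    case SC_GETTIME:
    case SC_ALARM:
        return handle_locally(nr);           /* simple: served on the CN      */
    case SC_OPEN:
    case SC_READ:
    case SC_WRITE:
        return forward_to_io_node(nr);       /* complex: forwarded to the ION */
    case SC_FORK:
    case SC_MMAP:
    default:
        return -ENOSYS;                      /* unsupported: error            */
    }
}

int main(void) {
    printf("fork -> %ld (unsupported)\n", cnk_syscall(SC_FORK));
    return 0;
}
```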
Benefits of the simple solution
Robustness: simple design, implementation, test, debugging
Scalability: no interference among compute nodes
Low system noise
Performance measurements
I/O node
Two roles in Blue Gene/L:
act as an effective master of its corresponding pset
serve requests from the compute nodes in its pset, mainly I/O operations on locally mounted FSs
Only one processor is used, due to the lack of memory coherency
Executes an embedded version of the Linux operating system:
does not use any swap space
has an in-memory root file system
uses little memory
lacks the majority of Linux daemons
I/O node
Complete TCP/IP stack
Supported FSs: NFS, GPFS, Lustre, PVFS
Main process: Control and I/O Daemon (CIOD)
Launching a job:
the job manager sends the request to the service node
the service node contacts the CIOD
the CIOD sends the executable to all processes in the pset
System calls
Service nodes
Run the Blue Gene/L control system
Tight integration with CNs and IONs
CNs and IONs: stateless, no persistent memory
Responsible for operating and monitoring the CNs and IONs
Creates system partitions and isolates them
Computes network routing for the torus, collective, and global interrupt networks
Loads OS code onto the CNs and IONs
Problems
Not fully POSIX compliant
Many applications need:
process/thread creation
full server sockets
shared memory segments
memory-mapped files
File systems for BG systems
Need for scalable file systems: NFS is not a solution
Most supercomputers and clusters in the Top500 use one of these parallel file systems: GPFS, Lustre, PVFS2
GPFS/PVFS/Lustre mounted on the I/O nodes
File system servers
File systems for BG systems
All these parallel file systems store files round-robin (striped) over several FS servers, allowing concurrent access to files (see the striping sketch below)
FS calls are forwarded from the compute nodes to the I/O nodes
The I/O nodes execute the calls on behalf of the compute nodes and forward back the result
Problem: data travels over different networks for each FS call (caching becomes critical)
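A minimal, generic sketch of round-robin striping (illustrative only, not the actual layout code of GPFS, Lustre, or PVFS): the file is cut into fixed-size stripes that are dealt out to the servers in turn, so which server holds a byte follows directly from its offset.

```c
#include <stdio.h>
#include <stdint.h>

#define STRIPE_SIZE  (64 * 1024)   /* 64 KB stripes (hypothetical value) */
#define NUM_SERVERS  4             /* number of FS servers (hypothetical) */

/* Server that stores the stripe containing the given file offset */
static int server_for_offset(uint64_t offset) {
    return (int)((offset / STRIPE_SIZE) % NUM_SERVERS);
}

int main(void) {
    uint64_t offsets[] = {0, 70000, 200000, 1 << 20};
    for (int i = 0; i < 4; i++)
        printf("offset %8llu -> server %d\n",
               (unsigned long long)offsets[i], server_for_offset(offsets[i]));
    return 0;
}
```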
Summary
The OS solution for Blue Gene adopts a software architecture reflecting the hardware architecture of the system.
The result of this approach is a lightweight kernel for the compute nodes (CNK) and a port of Linux that implements the file system and TCP/IP functionality for the I/O nodes.
This separation of responsibilities leads to an OS solution that is:
simple
robust
high-performing
scalable
extensible
Problem: limited applicability
Scalable parallel file systems
Further reading
Designing a Highly-Scalable Operating System: The Blue Gene/L Story. Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC'06).
www.research.ibm.com/bluegene/