Parallel Scalable Operating Systems


Presented by Dr. Florin Isaila

Universidad Carlos III de Madrid; Visiting Scholar, Argonne National Lab

Contents

- Preliminaries: Top500, scalability
- Blue Gene: history; BG/L, BG/C, BG/P, BG/Q
- Scalable OS for BG/L
- Scalable file systems
- BG/P at Argonne National Lab
- Conclusions

Top500

Generalities

- Published since 1993, twice a year (June and November)
- A ranking of the most powerful computing systems in the world
- Ranking criterion: performance on the LINPACK benchmark
- Originated by Jack Dongarra
- Website: www.top500.org

HPL: High-Performance Linpack

- Solves a dense system of linear equations using a variant of LU factorization on matrices of size N
- Measures a computer's floating-point rate of execution; the computation is done in 64-bit floating-point arithmetic
- Rpeak: theoretical system performance, an upper bound on the real performance (in MFLOPS). Example: an Intel Itanium 2 at 1.5 GHz executing 4 FLOP/cycle -> 6 GFLOPS (see the sketch below)
- Nmax: obtained by varying N and choosing the size that gives the maximum performance
- Rmax: maximum real performance, achieved for Nmax
- N1/2: problem size needed to achieve half of Rmax
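To make the arithmetic concrete, here is a minimal sketch (not from the slides) of the Rpeak calculation, reproducing the Itanium 2 example above; the function name is ours.

```python
# Rpeak = clock rate x floating-point operations per cycle (x number of cores).
# Reproduces the example above: Itanium 2 at 1.5 GHz, 4 FLOP/cycle -> 6 GFLOPS.

def rpeak_gflops(clock_ghz, flops_per_cycle, cores=1):
    """Theoretical peak; the measured Rmax is always below this bound."""
    return clock_ghz * flops_per_cycle * cores

print(rpeak_gflops(1.5, 4))   # 6.0 GFLOPS per processor
```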

Jack Dongarra's slide

Amdahl´s law

- Suppose a fraction f of your application is not parallelizable
- The remaining fraction 1 - f is parallelizable on p processors

Speedup(p) = T1 / Tp <= T1 / (f T1 + (1 - f) T1 / p) = 1 / (f + (1 - f)/p) <= 1/f
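A small sketch of the formula above, evaluating the bound for 1024 processors at the serial fractions shown in the plot that follows; the function name is ours.

```python
# Amdahl's law: Speedup(p) = 1 / (f + (1 - f) / p), bounded above by 1/f.

def amdahl_speedup(f, p):
    """Upper bound on speedup with serial fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

for f in (0.0, 0.01, 0.02, 0.04):
    print(f"f = {f:.2f}: speedup on 1024 processors <= {amdahl_speedup(f, 1024):.1f}")
# A serial fraction of only 0.01 already caps 1024 processors at about 91x.
```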

Amdahl’s Law (for 1024 processors)

[Plot: speedup on 1024 processors versus the serial fraction s, for s from 0 to 0.04; speedup axis from 0 to 1024.]

Load Balance

- Work: data access, computation
- Not just equal work: processors must also be busy at the same time
- Speedup <= Sequential Work / Max Work on any Processor
- Example (see the sketch below): Speedup <= 1000/400 = 2.5

[Bar chart: 1000 units of sequential work distributed as 200, 100, 400 and 300 units over processors 1-4; the busiest processor does 400 units.]
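A minimal sketch of the load-balance bound, using the work distribution from the bar chart above; the helper name is ours.

```python
# Load-balance bound: Speedup <= Sequential Work / Max Work on any Processor.
# Uses the work distribution from the bar chart above (1000 units total).

def load_balance_bound(per_processor_work):
    sequential_work = sum(per_processor_work)
    return sequential_work / max(per_processor_work)

print(load_balance_bound([200, 100, 400, 300]))   # 2.5, far below 4 processors
```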

Communication and synchronization

- Communication is expensive! Measure: the communication-to-computation ratio
- Inherent communication is determined by the assignment of tasks to processes; actual communication may be larger (artifactual)
- One principle: assign tasks that access the same data to the same process
- Speedup <= Sequential Work / Max (Work + Synch Wait Time + Comm Cost)

[Timeline diagram: three processes alternating work, synchronization wait time and communication between synchronization points.]
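The refined bound can be evaluated the same way; the per-process numbers below are made up for illustration and do not come from the slides, and we assume the parallel version does the same total work as the sequential one.

```python
# Refined bound: Speedup <= Sequential Work / max(Work + Synch Wait + Comm Cost).
# The per-process costs are illustrative numbers, not taken from the slide.

def speedup_bound(per_process_costs):
    """per_process_costs: list of (work, synch_wait, comm_cost) per process."""
    sequential_work = sum(work for work, _, _ in per_process_costs)
    busiest = max(sum(costs) for costs in per_process_costs)
    return sequential_work / busiest

costs = [(250, 30, 20), (250, 10, 40), (250, 50, 10), (250, 20, 30)]
print(f"speedup <= {speedup_bound(costs):.2f}")   # 1000 / 310 ~= 3.23
```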

Blue Gene

Blue Gene partners

- IBM: “Blue” is the corporate color of IBM; “Gene” refers to the intended use of the Blue Gene clusters, computational biology and specifically protein folding
- Lawrence Livermore National Lab
- Department of Energy
- Academia

BG History

- December 1999: project created; a supercomputer for studying biomolecular phenomena
- 29 September 2004: the Blue Gene/L prototype overtook NEC's Earth Simulator (36.01 TFLOPS, 8 cabinets)
- November 2004: Blue Gene/L reaches 70.82 TFLOPS (16 cabinets)
- 24 March 2005: Blue Gene/L broke its own record, reaching 135.5 TFLOPS (32 cabinets)
- June 2005: Blue Gene systems at several sites worldwide took 5 of the top 10 places on top500.org
- 27 October 2006: 280.6 TFLOPS (65,536 compute nodes plus an additional 1,024 I/O nodes in 64 air-cooled cabinets)
- November 2007: 478.2 TFLOPS

Family

BG/L BG/C BG/P BG/Q

Blue Gene/L

- Chip: 2 processors; 2.8/5.6 GF/s; 4 MB
- Compute card: 2 chips (1x2x1); 5.6/11.2 GF/s; 1.0 GB
- Node card: 32 chips (4x4x2), i.e. 16 compute cards and 0-2 I/O cards; 90/180 GF/s; 16 GB
- Rack: 32 node cards; 2.8/5.6 TF/s; 512 GB
- System: 64 racks (64x32x32); 180/360 TF/s; 32 TB

(In each pair the first figure is the peak with one core per chip computing, the second with both cores computing.)
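As a sanity check on these figures, a short sketch (assuming 2.8 GF/s per chip with one core computing and twice that with both) multiplies the per-chip peak up through the hierarchy, using the chip counts from the list above.

```python
# Peak performance at each packaging level, scaled up from the chip.
# Assumes 2.8 GF/s per chip with one core computing, 5.6 GF/s with both.

CHIP_GFLOPS = 2.8
CHIPS_PER_LEVEL = {
    "compute card": 2,       # 2 chips
    "node card": 32,         # 16 compute cards
    "rack": 1024,            # 32 node cards
    "system": 65536,         # 64 racks
}

for level, n in CHIPS_PER_LEVEL.items():
    one_core, both_cores = n * CHIP_GFLOPS, n * 2 * CHIP_GFLOPS
    print(f"{level}: {one_core:,.1f} / {both_cores:,.1f} GF/s")
# system: 183,500.8 / 367,001.6 GF/s, i.e. the 183.5 / 367 TFLOPS quoted below
```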

Technical specifications

- 64 cabinets containing 65,536 high-performance compute nodes (chips)
- 1,024 I/O nodes
- 32-bit PowerPC processors
- 5 networks
- 33 terabytes of main memory
- Maximum performance of 183.5 TFLOPS when using one processor per node for computation and the other for communication, and 367 TFLOPS when using both for computation

Blue Gene / L

Networks:
- 3D torus
- Collective network
- Global barrier/interrupt
- Gigabit Ethernet (I/O and connectivity)
- Control (system boot, debug, monitoring)

Networks

- Three-dimensional torus: compute nodes
- Global tree: collective communication, I/O
- Ethernet
- Control network

Three-dimensional (3D) torus network in which the nodes (red balls) are connected to their six nearest-neighbor nodes in a 3D mesh.

Blue Gene / L

- Processor: PowerPC 440 at 700 MHz; low power allows dense packaging
- External memory: 512 MB (or 1 GB) SDRAM per node
- Slow embedded core at a clock speed of 700 MHz
- 32 KB L1 cache
- L2 is a small prefetch buffer
- 4 MB embedded DRAM L3 cache

PowerPC 440 core

BG/L compute ASIC

- Non-cache-coherent L1
- Prefetch buffer L2
- Shared 4 MB embedded DRAM (L3)
- Interface to external DRAM
- 5 network interfaces: torus, collective, global barrier, Ethernet, control

Block diagram

Blue Gene / L

Compute Nodes: Dual processor, 1024 per Rack

I/O Nodes: Dual processor, 16-128 per Rack

Blue Gene / L

Compute Nodes: Proprietary kernel (tailored to processor design)

I/O Nodes: Embedded Linux

Front-end and service nodes: SUSE SLES 9 Linux (familiar to users)

Blue Gene / L

Performance:
- Peak performance per rack: 5.73 TFlops
- Linpack performance per rack: 4.71 TFlops

Blue Gene / C

- a.k.a. Cyclops64: massively parallel, the first supercomputer on a chip
- Processors connected by a 96-port, 7-stage non-internally-blocking crossbar switch
- Theoretical peak performance per chip: 80 GFlops

Blue Gene / C

- Cellular architecture
- 64-bit Cyclops64 chip: 500 MHz, 80 processors (each with 2 thread units and an FP unit)
- Software: Cyclops64 exposes much of the underlying hardware to the programmer, allowing very high-performance, finely tuned software to be written

Blue Gene / C

[Picture of BG/C]

Performance:
- Board: 320 GFlops
- Rack: 15.76 TFlops
- System: 1.1 PFlops

Blue Gene / P

Similar architecture to BG/L, but:
- Cache-coherent L1 cache
- 4 cores per node
- 10 Gbit Ethernet external I/O infrastructure
- Scales up to 3 PFLOPS
- More energy efficient
- 167 TF/s by 2007, 1 PF by 2008

Blue Gene / Q

- Continuation of Blue Gene/L and /P
- Targeting 10 PF/s by 2010/2011
- Higher frequency at similar performance per watt
- Similar number of nodes, many more cores
- More generally useful
- Aggressive compiler
- New network: scalable and cheap

Motivation for a scalable OS

- Blue Gene/L is currently the world's fastest and most scalable supercomputer
- Several system components contribute to that scalability; the operating systems for the different nodes of Blue Gene/L are among the components responsible for it
- The OS overhead on one node affects the scalability of the whole system
- Goal: design a scalable solution for the OS

High-level view of BG/L

Principle: the structure of the software should reflect the structure of the hardware.

BG/L Partitioning

- Space-sharing: the machine is divided along natural boundaries into partitions
- Each partition can run only one job
- Each node can be in one of these modes:
  - Coprocessor mode: one processor assists the other
  - Virtual node mode: two separate processors, each with its own memory space

OS

- Compute nodes: dedicated OS
- I/O nodes: dedicated OS
- Service nodes: conventional off-the-shelf OS
- Front-end nodes: program compilation, debugging, job submission
- File servers: store data; not specific to BG/L

BG/L OS solution

- Components: I/O nodes, service nodes, CNK
- The compute and I/O nodes are organized into logical entities called processing sets (psets): 1 I/O node plus a collection of 8, 16, 64 or 128 compute nodes
- A pset is a logical concept, but it should reflect physical proximity so that communication is fast (a hypothetical rank-to-pset mapping is sketched below)
- A job is a collection of N compute processes (on compute nodes), each with its own private address space, communicating by message passing (MPI ranks 0 to N-1)
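As an illustration of the pset idea, a hypothetical sketch follows; the contiguous rank layout and the names are assumptions for illustration, not the actual BG/L mapping.

```python
# Hypothetical sketch of the pset idea, assuming ranks are laid out
# contiguously: every PSET_SIZE compute nodes share one I/O node.

PSET_SIZE = 64   # BG/L allows 8, 16, 64 or 128 compute nodes per I/O node

def pset_of(rank):
    """Index of the pset (and hence of the I/O node) serving this MPI rank."""
    return rank // PSET_SIZE

print(pset_of(0), pset_of(63), pset_of(64))   # 0 0 1
```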

High-level view of BG/L

BG/L OS solution: CNK

- Compute nodes run only compute processes
- All the compute nodes of a particular partition execute in one of two modes: coprocessor mode or virtual node mode
- The Compute Node Kernel (CNK) is a simple OS: it creates address spaces, loads code and initializes data, and transfers processor control to the loaded executable

CNK

- Consumes 1 MB of memory and creates either one address space of 511/1023 MB or two address spaces of 255/511 MB (for 512 MB / 1 GB nodes)
- No virtual memory, no paging: the entire mapping fits into the TLB of the PowerPC
- Load in push mode: one compute node reads the executable from the file system and sends it to all the others
- One image is loaded, and then the kernel stays out of the way!

CNK

- No OS scheduling (one thread)
- No memory management (no TLB overhead)
- No local file services
- User-level execution until the process requests a system call or a hardware interrupt arrives: a timer (requested by the application) or an abnormal event
- System calls (see the sketch below):
  - Simple: handled locally (getting the time, setting an alarm)
  - Complex: forwarded to the I/O node
  - Unsupported (fork/mmap): return an error
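The three-way split can be pictured with a short illustrative sketch; this is not actual CNK code, and the call names and the IONodeProxy class are ours.

```python
import time

# Illustrative sketch (not actual CNK code) of the three-way split described
# above: simple calls handled locally, I/O calls forwarded to the pset's I/O
# node, unsupported calls rejected with an error.

SIMPLE = {"gettimeofday"}          # handled entirely inside CNK
UNSUPPORTED = {"fork", "mmap"}     # no process creation or demand paging

class IONodeProxy:
    """Stand-in for the message path from a compute node to its I/O node."""
    def forward(self, name, args):
        return f"forwarded {name}{args} to the pset's I/O node (CIOD)"

def handle_syscall(name, args=(), io_node=IONodeProxy()):
    if name in UNSUPPORTED:
        raise OSError(f"{name} is not supported by the compute node kernel")
    if name in SIMPLE:
        return time.time()                 # handled locally, no communication
    return io_node.forward(name, args)     # e.g. open/read/write on a mounted FS

print(handle_syscall("gettimeofday"))
print(handle_syscall("write", (1, "hello")))
```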

Benefits of the simple solution

- Robustness: simple design, implementation, testing and debugging
- Scalability: no interference among compute nodes, low system noise
- Performance measurements

I/O node

- Two roles in Blue Gene/L: act as an effective master of its corresponding pset, and offer the services requested by the compute nodes in its pset (mainly I/O operations on locally mounted file systems)
- Only one processor is used, due to the lack of memory coherency
- Executes an embedded version of the Linux operating system: it does not use any swap space, has an in-memory root file system, uses little memory, and lacks the majority of Linux daemons

I/O node

- Complete TCP/IP stack
- Supported file systems: NFS, GPFS, Lustre, PVFS
- Main process: the Control and I/O daemon (CIOD)
- Launching a job (sketched below): the job manager sends the request to the service node, the service node contacts the CIOD, and the CIOD sends the executable to all processes in the pset
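A hedged sketch of that launch path; the class and method names are illustrative stand-ins, not the real Blue Gene/L control-system interfaces.

```python
# Hedged sketch of the launch path: job manager -> service node -> CIOD ->
# compute nodes. Names are illustrative, not the real control-system API.

class ComputeNode:
    def __init__(self, rank):
        self.rank = rank

    def load_and_run(self, executable):
        # On the real machine, CNK loads the image and jumps to it.
        print(f"rank {self.rank}: running {executable}")

class CIOD:
    """Control and I/O daemon on the I/O node of one pset."""
    def __init__(self, compute_nodes):
        self.compute_nodes = compute_nodes

    def launch(self, executable):
        for node in self.compute_nodes:
            node.load_and_run(executable)

class ServiceNode:
    def __init__(self, ciods):
        self.ciods = ciods          # one CIOD per pset in the partition

    def submit(self, executable):   # invoked by the job manager
        for ciod in self.ciods:
            ciod.launch(executable)

pset = CIOD([ComputeNode(r) for r in range(4)])
ServiceNode([pset]).submit("a.out")
```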

System calls

Service nodes

- Run the Blue Gene/L control system
- Tight integration with the compute and I/O nodes, which are stateless and have no persistent memory
- Responsible for the operation and monitoring of the compute and I/O nodes
- Create system partitions and isolate them
- Compute the network routing for the torus, collective and global interrupt networks
- Load the OS code onto the compute and I/O nodes

Problems

- Not fully POSIX compliant
- Many applications need: process/thread creation, full server sockets, shared memory segments, memory-mapped files

File systems for BG systems

- Need for scalable file systems: NFS is not a solution
- Most supercomputers and clusters in the Top500 use one of these parallel file systems: GPFS, Lustre, PVFS2
- GPFS/PVFS/Lustre are mounted on the I/O nodes

File system servers

File systems for BG systems

- All these parallel file systems stripe files round-robin over several file system servers, allowing concurrent access to files (see the sketch below)
- File system calls are forwarded from the compute nodes to the I/O nodes
- The I/O nodes execute the calls on behalf of the compute nodes and forward back the result
- Problem: data travels over different networks for each file system call (caching becomes critical)
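A minimal sketch of round-robin striping; the stripe size and server count below are arbitrary illustrative values, not parameters of any particular file system.

```python
# Round-robin striping: stripe unit i of a file lives on server i mod N.
# Stripe size and server count are illustrative values.

STRIPE_SIZE = 64 * 1024   # bytes per stripe unit
NUM_SERVERS = 4           # file system servers

def server_for_offset(offset):
    """File system server holding the byte at the given file offset."""
    return (offset // STRIPE_SIZE) % NUM_SERVERS

for off in (0, 64 * 1024, 128 * 1024, 256 * 1024):
    print(f"offset {off}: server {server_for_offset(off)}")
# Consecutive stripes land on different servers, so a large contiguous
# access is served by all servers in parallel.
```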

Summary

The OS solution for Blue Gene adopts a software architecture reflecting the hardware architecture of the system.

The result of this approach is a lightweight kernel for the compute nodes (CNK) and a port of Linux that implements the file system and TCP/IP functionality for the I/O nodes.

This separation of responsibilities leads to an OS solution that is simple, robust, high-performing, scalable and extensible.

Remaining problems: limited applicability (the CNK is not fully POSIX compliant) and the need for scalable parallel file systems.

Further reading

"Designing a Highly Scalable Operating System: The Blue Gene/L Story," Proceedings of the 2006 ACM/IEEE Conference on Supercomputing

www.research.ibm.com/bluegene/