
Transcript of "Overview of the Blue Gene supercomputers", presented by Dr. Dong Chen, IBM T.J. Watson Research Center, Yorktown Heights, NY

Page 1: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Dr. Dong Chen, IBM T.J. Watson Research Center, Yorktown Heights, NY

Overview of the Blue Gene supercomputers

Page 2: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Supercomputer trends
Blue Gene/L and Blue Gene/P architecture
Blue Gene applications

Terminology:

FLOPS = Floating Point Operations Per Second

Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18

Peak speed vs. sustained speed

Top 500 list (top500.org):

Based on the Linpack Benchmark:

Solve a dense linear system, A x = b

A is an N x N dense matrix; total FP operations ≈ (2/3) N^3 + 2 N^2 (see the sketch at the end of this slide)

Green 500 list (green500.org):

Rate Top 500 supercomputers in FLOPS/Watt
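As a rough illustration of how a Linpack (Rmax-style) rate is derived, here is a minimal C sketch using the FP-operation count above; the problem size N and the wall-clock time are made-up example inputs, not measurements of any system in this talk.

#include <stdio.h>

/* Approximate floating-point operation count for solving a dense
 * N x N system A x = b by LU factorization (Linpack/HPL). */
static double linpack_flop_count(double n)
{
    return (2.0 / 3.0) * n * n * n + 2.0 * n * n;
}

int main(void)
{
    double n = 100000.0;      /* hypothetical matrix dimension N    */
    double seconds = 3600.0;  /* hypothetical wall-clock solve time */

    double flop = linpack_flop_count(n);
    double rate = flop / seconds;   /* sustained FLOPS for the run  */

    printf("Total FP operations: %.3e\n", flop);
    printf("Sustained rate:      %.3f TFLOPS\n", rate / 1e12);
    return 0;
}

Dividing the sustained rate by the machine's peak rate gives the Linpack efficiency, which is exactly the peak vs. sustained distinction above.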

Page 3: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Supercomputer speeds over time

[Chart: peak speed (flops) vs. year, 1940-2020, on a log scale from roughly 10^2 to 10^17 flops, with systems labeled from ENIAC (vacuum tubes) and Univac through the IBM 701/704/7090/Stretch, CDC, Cray, NEC SX, and ASCI machines to Red Storm, Jaguar, Roadrunner, TACC, NASA Pleiades, BG/L, BG/P, BG/Q, and a next-generation super.]

Page 4: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


CMOS Scaling in Petaflop Era

Three decades of exponential clock rate (and electrical power!) growth has ended
Instruction Level Parallelism (ILP) growth has ended
Single-threaded performance improvement is dead (Bill Dally)
Yet Moore's Law continues in transistor count
Industry response: multi-core, i.e. double the number of cores every 18 months instead of the clock frequency (and power!)

Source: “The Landscape of Computer Architecture,” John Shalf, NERSC/LBNL, presented at ISC07, Dresden, June 25, 2007

Page 5: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

TOP500 Performance Trend

[Chart: Rmax performance (GFlops) for each TOP500 list from June 1993 through June 2010, on a log scale, plotting the #1, #10, and #500 systems and the total aggregate performance.]

Over the long haul IBM has demonstrated continued leadership in various TOP500 metrics, even as the performance continues its relentless growth.

Chart annotations: total aggregate performance 32.43 PF; #1 system 1.759 PF; #10 433.2 TF; #500 24.67 TF. Blue square markers indicate IBM leadership.

Source: www.top500.org

IBM has had the most aggregate performance for the last 22 lists
IBM has had the #1 system in 10 of the last 12 lists (13 in total)
IBM has had the most systems in the Top 10 for the last 14 lists
IBM has had the most systems in 14 of the last 22 lists

Page 6: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

President Obama Honors IBM's Blue Gene Supercomputer With National Medal Of Technology And Innovation

Ninth time IBM has received the nation's most prestigious tech award. Blue Gene has led to breakthroughs in science, energy efficiency and analytics.

WASHINGTON, D.C. - 18 Sep 2009: President Obama recognized IBM (NYSE: IBM) and its Blue Gene family of supercomputers with the National Medal of Technology and Innovation, the country's most prestigious award given to leading innovators for technological achievement. President Obama will personally bestow the award at a special White House ceremony on October 7. IBM, which earned the National Medal of Technology and Innovation on eight other occasions, is the only company recognized with the award this year.

Blue Gene's speed and expandability have enabled business and science to address a wide range of complex problems and make more informed decisions -- not just in the life sciences, but also in astronomy, climate, simulations, modeling and many other areas. Blue Gene systems have helped map the human genome, investigated medical therapies, safeguarded nuclear arsenals, simulated radioactive decay, replicated brain power, flown airplanes, pinpointed tumors, predicted climate trends, and identified fossil fuels -- all without the time and money that would have been required to physically complete these tasks.

The system also reflects breakthroughs in energy efficiency. With the creation of Blue Gene, IBM dramatically shrank the physical size and energy needs of a computing system whose processing speed would otherwise have required a dedicated power plant capable of generating power for thousands of homes. The influence of the Blue Gene supercomputer's energy-efficient design and computing model can be seen today across the Information Technology industry. Today, 18 of the top 20 most energy-efficient supercomputers in the world are built on IBM high performance computing technology, according to the latest Supercomputing 'Green500 List' announced by Green500.org in July 2009.

Page 7: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Blue Gene Roadmap

• BG/L (5.7 TF/rack) – 130 nm ASIC (1999-2004 GA)
– 104 racks, 212,992 cores, 596 TF/s, 210 MF/W; dual-core system-on-chip
– 0.5/1 GB/node

• BG/P (13.9 TF/rack) – 90 nm ASIC (2004-2007 GA)
– 72 racks, 294,912 cores, 1 PF/s, 357 MF/W; quad-core SOC, DMA
– 2/4 GB/node
– SMP support, OpenMP, MPI

• BG/Q (209 TF/rack) – 20 PF/s

Page 8: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

IBM Blue Gene/P Solution: Expanding the Limits of Breakthrough Science


Blue Gene Technology Roadmap

[Roadmap chart, performance vs. year: Blue Gene/L (PPC 440 @ 700 MHz), scalable to 595 TFlops, 2004; Blue Gene/P (PPC 450 @ 850 MHz), scalable to 3.56 PF, 2007; Blue Gene/Q (Power multi-core), scalable to 100 PF, 2010.]

Note: All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

Page 9: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


BlueGene/L System Buildup

Chip: 2 processors, 2.8/5.6 GF/s, 4 MB

Compute Card: 2 chips (1x2x1), 5.6/11.2 GF/s, 2.0 GB

Node Card: 16 compute cards, 0-2 I/O cards (32 chips, 4x4x2), 90/180 GF/s, 32 GB

Rack: 32 node cards, 2.8/5.6 TF/s, 1 TB

System: 64 racks (64x32x32), 180/360 TF/s, 64 TB

Page 10: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


BlueGene/L Compute ASIC

[Block diagram: two PPC440 cores (one usable as an I/O processor), each with 32k/32k L1 caches and a "Double FPU", connected through L2 buffers and a multiported shared SRAM buffer (PLB 4:1) to a shared L3 directory (with ECC) for 4 MB of EDRAM used as L3 cache or memory, and a DDR controller with ECC driving 512/1024 MB of external DDR (128 + 16 ECC); on-chip network interfaces: torus (6 out and 6 in, each at 1.4 Gbit/s), collective (3 out and 3 in, each at 2.8 Gbit/s), global interrupt (4 global barriers or interrupts), Gbit Ethernet, and JTAG access.]

Page 11: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


Double Floating-Point Unit

[Diagram: primary FPR register file (P0-P31) and secondary FPR register file (S0-S31), with quadword load and store datapaths.]

– Two replicas of a standard single-pipe PowerPC FPU
– 2 x 32 64-bit registers
– Attached to the PPC440 core using the APU interface
  – Issues instructions across the APU interface
  – Instruction decode performed in the Double FPU
  – Separate APU interface from the LSU provides up to 16 B of data per load or store
  – Datapath width is 16 bytes, feeding the two FPUs with 8 bytes each every cycle
– Two FP multiply-add operations per cycle
  – 2.8 GF/s peak
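A quick back-of-the-envelope check of the 2.8 GF/s figure, assuming the 700 MHz PPC440 clock quoted on the roadmap slide: two fused multiply-adds per cycle count as four floating-point operations per cycle.

#include <stdio.h>

int main(void)
{
    /* BG/L Double FPU: 2 fused multiply-adds per cycle, each counted
     * as 2 floating-point operations (multiply + add). */
    double clock_hz      = 700e6;  /* PPC440 clock, 700 MHz */
    double fma_per_cycle = 2.0;
    double flop_per_fma  = 2.0;

    double per_core = clock_hz * fma_per_cycle * flop_per_fma;
    printf("Per-core peak: %.1f GF/s\n", per_core / 1e9);        /* 2.8 GF/s */
    printf("Per-node peak: %.1f GF/s\n", 2.0 * per_core / 1e9);  /* 5.6 GF/s */
    return 0;
}

The per-node number (two cores) matches the 2.8/5.6 GF/s pair on the system build-up slide.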

Page 12: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Blue Gene/L Memory Characteristics

Memory (per node; system = 64k nodes):
L1: 32 kB / 32 kB
L2: 2 kB per processor
SRAM: 16 kB
L3: 4 MB (ECC) per node
Main store: 512 MB (ECC) per node, 32 TB system

Bandwidth:
L1 to registers: 11.2 GB/s, independent R/W and instruction
L2 to L1: 5.3 GB/s, independent R/W and instruction
L3 to L2: 11.2 GB/s
Main (DDR): 5.3 GB/s

Latency:
L1 miss, L2 hit: 13 processor cycles (pclks)
L2 miss, L3 hit: 28 pclks (EDRAM page hit/EDRAM page miss)
L2 miss (main store): 75 pclks for DDR closed page access (L3 disabled/enabled)

Page 13: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Blue Gene Interconnection Networks

3-Dimensional Torus
– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– Communications backbone for computations
– 0.7/1.4 TB/s bisection bandwidth, 67 TB/s total bandwidth

Global Collective Network
– One-to-all broadcast functionality
– Reduction operations functionality
– 2.8 Gb/s of bandwidth per link; latency of tree traversal 2.5 µs
– ~23 TB/s total binary tree bandwidth (64k machine)
– Interconnects all compute and I/O nodes (1024)

Low Latency Global Barrier and Interrupt
– Round trip latency 1.3 µs

Control Network
– Boot, monitoring and diagnostics

Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external comm. (file I/O, control, user interaction, etc.)
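The torus numbers on this slide can be reproduced with a short sketch; this is my own arithmetic, and the 0.7 vs. 1.4 TB/s bisection pair presumably reflects counting one or both directions across the cut.

#include <stdio.h>

int main(void)
{
    /* BG/L 64-rack torus: 64 x 32 x 32 nodes, 1.4 Gb/s per link direction. */
    const int    X = 64, Y = 32, Z = 32;
    const double link_GBs = 1.4 / 8.0;            /* 0.175 GB/s per direction */
    const long   nodes = (long)X * Y * Z;

    double per_node = 12 * link_GBs;              /* 6 in + 6 out ~ 2.1 GB/s  */
    double total    = nodes * 6.0 * link_GBs;     /* ~68 TB/s (slide: 67)     */

    /* Bisection: halving the longest (X) dimension cuts 2*Y*Z bidirectional
     * links (one at the cut plane plus the wraparound per (y,z) pair). */
    double bisection = 2.0 * Y * Z * 2.0 * link_GBs;  /* both directions */

    printf("nodes:               %ld\n", nodes);
    printf("per-node bandwidth:  %.2f GB/s\n", per_node);
    printf("total bandwidth:     %.1f TB/s\n", total / 1e3);
    printf("bisection bandwidth: %.2f TB/s\n", bisection / 1e3);
    return 0;
}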

Page 14: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

BlueGene/P

Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM

Compute Card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR2 (4.0 GB as of 6/30/08)

Node Card: 32 compute cards, 0-1 I/O cards (32 chips, 4x4x2), 435 GF/s, 64 (128) GB

Rack: 32 node cards, cabled 8x8x16, 13.9 TF/s, 2 (4) TB

System: 72 racks (72x32x32), 1 PF/s, 144 (288) TB

Page 15: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

BlueGene/P compute ASIC

[Block diagram: four PPC450 cores, each with 32k I1 / 32k D1 L1 caches and a Double FPU, each behind a private L2 with snoop filter; multiplexing switches and a DMA engine connect the cores to two 4 MB eDRAM banks (L3 cache or on-chip memory) with shared L3 directories (with ECC) and two DDR-2 controllers with ECC (512b data + 72b ECC, 13.6 GB/s DDR-2 DRAM bus), plus shared SRAM and a hybrid PMU with 256x64b SRAM; network and service interfaces: torus (6 links, 3.4 Gb/s bidirectional each), collective (3 links, 6.8 Gb/s bidirectional each), global barrier (4 global barriers or interrupts), 10 Gbit Ethernet, and JTAG access.]

Page 16: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Blue Gene/P Memory Characteristics

Memory (per node):
L1: 32 kB / 32 kB
L2: 2 kB per processor
L3: 8 MB (ECC) per node
Main store: 2-4 GB (ECC) per node

Bandwidth:
L1 to registers: 6.8 GB/s instruction read, 6.8 GB/s data read, 6.8 GB/s write
L2 to L1: 5.3 GB/s, independent R/W and instruction
L3 to L2: 13.6 GB/s
Main (DDR): 13.6 GB/s

Latency:
L1 hit: 3 processor cycles (pclks)
L1 miss, L2 hit: 13 pclks
L2 miss, L3 hit: 46 pclks (EDRAM page hit/EDRAM page miss)
L2 miss (main store): 104 pclks for DDR closed page access (L3 disabled/enabled)

Page 17: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

BlueGene/P Interconnection Networks

3-Dimensional Torus
– Interconnects all compute nodes (73,728)
– Virtual cut-through hardware routing
– 3.4 Gb/s on all 12 node links (5.1 GB/s per node)
– 0.5 µs latency between nearest neighbors, 5 µs to the farthest
– MPI: 3 µs latency for one hop, 10 µs to the farthest
– Communications backbone for computations
– 1.7/3.9 TB/s bisection bandwidth, 188 TB/s total bandwidth

Collective Network
– One-to-all broadcast functionality
– Reduction operations functionality
– 6.8 Gb/s of bandwidth per link per direction
– Latency of one-way tree traversal 1.3 µs, MPI 5 µs
– ~62 TB/s total binary tree bandwidth (72k machine)
– Interconnects all compute and I/O nodes (1152)

Low Latency Global Barrier and Interrupt
– Latency of one way to reach all 72K nodes 0.65 µs, MPI 1.6 µs
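For intuition about the nearest vs. farthest latency above, here is a small sketch; treating 72 x 32 x 32 as the full-system torus dimensions (consistent with the 73,728-node count) and deriving an implied per-hop time are my own back-of-the-envelope assumptions.

#include <stdio.h>

int main(void)
{
    /* BG/P full-system torus: 72 x 32 x 32 = 73,728 nodes.
     * On a torus the farthest node is half-way around each dimension. */
    const int dims[3] = {72, 32, 32};

    int max_hops = 0;
    for (int i = 0; i < 3; i++)
        max_hops += dims[i] / 2;          /* 36 + 16 + 16 = 68 hops */

    /* Rough per-hop time implied by the 5 us farthest-node figure
     * quoted on the slide (an estimate, not a measured number). */
    double farthest_us = 5.0;
    printf("max hops:        %d\n", max_hops);
    printf("implied per hop: %.0f ns\n", farthest_us * 1000.0 / max_hops);
    return 0;
}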

Page 18: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

November 2007 Green 500 (Linpack GFLOPS/W)

[Bar chart comparing Linpack GFLOPS/W for BG/L, BG/P, SGI 8200, HP cluster, Cray Sandia, Cray ORNL, Cray NERSC, and JS21 BSC; BG/P leads at 0.37, BG/L at 0.21, with the remaining systems between 0.02 and 0.15.]

Page 19: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


Relative power, space and cooling efficiencies (published specs per peak performance)

[Bar chart: Racks/TF, kW/TF, Sq Ft/TF, and Tons/TF (0-400% relative scale) for Sun/Constellation, Cray/XT4, SGI/ICE, and IBM BG/P.]

Page 20: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

System Power Efficiency (Linpack GF/Watt)

BG/L (2005): 0.23
BG/P (2007): 0.37
SGI NASA-Ames (2010): 0.25
RoadRunner (2008): 0.44
Cray XT5 (2009): 0.25
TianHe-1A (2010): 0.635
Fujitsu K (2010): 0.829
Titech (2010): 0.958
BG/Q prototype (2010): 1.68

Source: www.top500.org

Page 21: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

HPCC 2009

IBM BG/P, 0.557 PF peak (40 racks)
– Class 1: Number 1 on G-Random Access (117 GUPS)
– Class 2: Number 1

Cray XT5, 2.331 PF peak
– Class 1: Number 1 on G-HPL (1533 TF/s)
– Class 1: Number 1 on EP-Stream (398 TB/s)
– Number 1 on G-FFT (11 TF/s)

Source: www.top500.org

Page 22: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Main Memory Capacity per Rack

[Bar chart: main memory capacity per rack (scale 0-4500) for LRZ IA64, Cray XT4, ASC Purple, Roadrunner (RR), BG/P, Sun TACC, and SGI ICE.]

Page 23: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Peak Memory Bandwidth per node (byte/flop)

[Bar chart: peak memory bandwidth per node in bytes/flop (scale 0-2) for BG/P 4-core, Roadrunner, Cray XT3 2-core, Cray XT5 4-core, POWER5, Itanium 2, Sun TACC, and SGI ICE.]
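The bytes-per-flop metric in this chart divides a node's peak main-memory bandwidth by its peak floating-point rate. Here is a minimal sketch using the BG/P per-node figures quoted earlier in this deck (13.6 GB/s DDR2 bandwidth, 13.6 GF/s peak); the other systems' inputs are not given in this transcript, so they are left out.

#include <stdio.h>

/* Bytes per flop: peak memory bandwidth divided by peak compute rate. */
static double bytes_per_flop(double mem_GBs, double peak_GFs)
{
    return mem_GBs / peak_GFs;
}

int main(void)
{
    double bgp_mem_GBs  = 13.6;   /* main (DDR2) bandwidth per node        */
    double bgp_peak_GFs = 13.6;   /* 4 cores x 3.4 GF/s (850 MHz, 4 flops) */

    printf("BG/P: %.2f bytes/flop\n",
           bytes_per_flop(bgp_mem_GBs, bgp_peak_GFs));   /* 1.00 */
    return 0;
}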

Page 24: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Main Memory Bandwidth per Rack

[Bar chart: main memory bandwidth per rack (scale 0-14000) for LRZ Itanium, Cray XT5, ASC Purple, Roadrunner (RR), BG/P, Sun TACC, and SGI ICE.]

Page 25: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Interprocessor Peak Bandwidth per node (byte/flop)

[Bar chart: interprocessor peak bandwidth per node in bytes/flop (scale 0-0.8) for BG/L,P, Cray XT5 4c, Cray XT4 2c, NEC ES, Power5, Itanium 2, Sun TACC, x86 cluster, Dell Myrinet, and Roadrunner.]

Page 26: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Failures per Month per TF. From: http://acts.nersc.gov/events/Workshop2006/slides/Simon.pdf

Page 27: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


Execution Modes in BG/P per Node

[Diagram (hardware abstractions in black, software abstractions in blue): one node with four cores, shown in three execution modes. SMP mode: 1 process (P0) with 1-4 threads (T0-T3) per process. Dual mode: 2 processes (P0, P1) with 1-2 threads per process. Quad mode (VNM, virtual node mode): 4 processes (P0-P3) with 1 thread per process.]

Next Generation HPC
– Many core
– Expensive memory
– Two-tiered programming model
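To make the two-tiered (MPI + OpenMP) model above concrete, here is a minimal hybrid sketch. Launched with one rank per node and four OpenMP threads it corresponds to SMP mode, with two ranks of two threads to dual mode, and with four single-threaded ranks to quad/virtual-node mode; the launch commands below are generic examples, not BG/P-specific syntax.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request a threading level that allows OpenMP inside each MPI rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each process (MPI rank) runs its own OpenMP thread team. */
    #pragma omp parallel
    {
        #pragma omp single
        printf("rank %d of %d: %d OpenMP threads\n",
               rank, nranks, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

For example, with a generic launcher, OMP_NUM_THREADS=4 mpirun -np 1 ./a.out mimics SMP mode on a four-core node, and OMP_NUM_THREADS=1 mpirun -np 4 ./a.out mimics quad (virtual node) mode.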

Page 28: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


Blue Gene Software Hierarchical Organization

Compute nodes dedicated to running user application, and almost nothing else - simple compute node kernel (CNK)

I/O nodes run Linux and provide a more complete range of OS services – files, sockets, process launch, signaling, debugging, and termination

Service node performs system management services (e.g., partitioning, heart beating, monitoring errors) - transparent to application software

Front-end nodes and file system, reached over the 10 Gb functional Ethernet; control traffic uses the 1 Gb Ethernet

Page 29: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

Noise measurements (from Adolfy Hoisie)

Page 30: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


Blue Gene/P System Architecture

[System diagram: compute nodes (C-Node 0..n, running CNK and the application) connect over the torus and tree networks to I/O nodes (running Linux, ciod, and a file-system client); I/O nodes reach the file servers and front-end nodes over the 10 Gb functional Ethernet; the service node (MMCS, DB2, LoadLeveler, system console) manages the machine over the 1 Gb control Ethernet, reaching the hardware through FPGA, JTAG, and I2C.]

Page 31: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


BG/P Software Stack Source Availability

[Stack diagram spanning application, system, firmware, and hardware layers.

I/O and compute nodes: ESSL, MPI, GPSHMEM, GA, MPI-IO, XL runtime, open toolchain runtime (application); message layer, messaging SPIs, CNK, CIOD, Linux kernel, totalviewd, node SPIs (system); common node services (hardware init, RAS, recovery, mailbox, diags), bootloader (firmware); compute nodes, I/O nodes, node cards, link cards, service card (hardware).

Service node / front-end nodes: DB2, CSM, LoadLeveler, GPFS (1), PerfMon, mpirun, Bridge API, BG Nav, HPC Toolkit, ISV schedulers and debuggers (user/scheduler); high-level control system (MMCS): partitioning, job management and monitoring, RAS, administrator interfaces, CIODB (system); low-level control system: power on/off, hardware probe, hardware init, parallel monitoring, parallel boot, mailbox (firmware); SN and FEN (hardware).

Key (source availability): new open source community under CPL license, with active IBM participation; new open source reference implementation licensed under CPL; existing open source communities under various licenses, to which BG code will be contributed and/or a new sub-community started; closed, no source provided, not buildable; closed, buildable source, no redistribution of derivative works allowed under license.]

Notes:
1. GPFS does have an open build license available which customers may utilize.

Page 32: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


Areas Where BG is Used

Weather/Climate Modeling (GOVERNMENT / INDUSTRY / UNIVERSITIES)

Computational Fluid Dynamics – Airplane and Jet Engine Design, Chemical Flows, Turbulence (ENGINEERING / AEROSPACE)

Seismic Processing (PETROLEUM, Nuclear industry)

Particle Physics (LATTICE Gauge QCD)

Systems Biology – Classical and Quantum Molecular Dynamics (PHARMA / MED INSURANCE / HOSPITALS / UNIV)

Modeling Complex Systems (PHARMA / BUSINESS / GOVERNMENT / UNIVERSITIES)

Large Database Search

Nuclear Industry

Astronomy (UNIVERSITIES)

Portfolio Analysis via Monte Carlo (BANKING / FINANCE / INSURANCE)

Page 33: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


Page 34: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


LLNL Applications

Page 35: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


IDC Technical Computing Systems Forecast

Bio Sci: Genomics, proteomics, pharmacogenomics, pharma research, bioinformatics, drug discovery
Chem Eng: Chemical engineering: molecular modeling, computational chemistry, process design
CAD: Mechanical CAD, 3D wireframe – mostly graphics
CAE: Computer-aided engineering – finite element modeling, CFD, crash, solid modeling (cars, aircraft, …)
DCC&D: Digital content creation and distribution
Econ Fin: Economic and financial modeling, econometric modeling, portfolio management, stock market modeling
EDA: Electronic design and analysis: schematic capture, logic synthesis, circuit simulation, system modeling
Geo Sci: Geo sciences and geo engineering: seismic analysis, oil services, reservoir modeling
Govt Lab: Government labs and research centers: government-funded R&D
Defense: Surveillance, signal processing, encryption, command, control, communications, intelligence, geospatial image management, weapon design
Software Engineering: Development and testing of technical applications
Technical Management: Product data management, maintenance records management, revision control, configuration management
Academic: University-based R&D
Weather: Atmospheric modeling, meteorology, weather forecasting

Page 36: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


What is driving the need for more HPC cycles?

Materials science, climate modeling, genome sequencing, biological modeling, pandemic research, fluid dynamics, drug discovery, geophysical data processing, financial modeling.

Page 37: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


HPC Use Cases

Capability
– Calculations not possible on small machines
– Usually these calculations involve systems where many disparate scales are modeled
– One scale defines required work per "computation step"
– A different scale determines total time to solution
– Examples: protein folding (timescales from 10^-15 s to 1 s); refined grids in weather forecasting (10 km today -> 1 km in a few years); full simulation of the human brain
– Useful as proofs of concept

Complexity
– Calculations which seek to combine multiple components to produce an integrated model of a complex system
– Individual components can have significant computational requirements
– Coupling between components requires that all components be modeled simultaneously
– As components are modeled, changes in interfaces are constantly transferred between the components
– Examples: water cycle modeling in climate/environment; geophysical modeling for oil recovery; virtual fab; multisystem/coupled systems modeling
– Critical to manage multiple scales in physical systems

Understanding
– Repetition of a basic calculation many times with different model parameters, inputs and boundary conditions
– Goal is to develop a clear understanding of the behavior, dependencies and sensitivities of the solution over a range of parameters
– Examples: multiple independent simulations of hurricane paths to develop probability estimates of possible paths and possible strength; thermodynamics of protein/drug interactions; sensitivity analysis in oil reservoir modeling; optimization of aircraft wing design
– Essential to develop parameter understanding and sensitivity analysis

Page 38: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


Capability

Page 39: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


Complexity: Modern Integrated Water Management

[Diagram: an Advanced Water Management reference IT architecture built from:
– a partner ecosystem: climatologists, environmental observation systems companies, sensors companies, environmental sciences consultants, engineering services companies, subject matter experts, universities
– sensors: physical, chemical, biological, environmental, in-situ, remotely sensed, planning and placement
– physical models: climate, hydrological, meteorological, ecological
– an analyses/model strategy: stochastic models and statistics, machine learning, optimization; model selection, integration and coupling, validation, temporal/spatial scales
– enabling IT: HPC, visualization, data management
spanning time horizons from historical and present through near future, seasonal, long term, and far future.]

Page 40: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


Overall Efficiencies of BG Applications - Major Scientific Advances

1. Qbox (DFT) LLNL: 56.5%; 2006 Gordon Bell Award; 64 L racks, 16 P. CPMD IBM: 30%, highest scaling 64 L. MGDC: highest scaling 32 P.

2. ddcMD (Classical MD) LLNL: 27.6%, 2005 Gordon Bell Award, 64 L. New ddcMD LLNL: 17.4%, 2007 Gordon Bell Award, 104 L. MDCASK LLNL, SPaSM LANL: highest scaling 64 L. LAMMPS SNL: highest scaling 64 L, 32 P. RXFF, GMD: highest scaling 64 L. Rosetta UW: highest scaling 20 L. AMBER: 4 L.

3. Quantum Chromodynamics CPS: 30%; 2006 GB Special Award, 64 L, 32 P. MILC, Chroma: 32 P.

4. sPPM (CFD) LLNL: 18%; highest scaling 64 L. Miranda, Raptor LLNL: highest scaling 64 L. DNS3D: highest scaling 32 P. NEK5 (Thermal Hydraulics) ANL: 22%, 32 P. HYPO4D, PLB (Lattice Boltzmann): 32 P.

5. ParaDis (dislocation dynamics) LLNL: highest scaling 64 L.

6. WRF (Weather) NCAR: 10%; highest scaling 64 L. POP (Oceanography): highest scaling 8 L. HOMME (Climate) NCAR: 12%; highest scaling 32 L, 24Ki P.

7. GTC (Plasma Physics) PPPL: 7%; highest scaling 20 L, 32 P. Nimrod GA: 17%.

8. FLASH (Supernova Ia): highest scaling 64 L, 40 P. Cactus (General Relativity): highest scaling 16 L, 32 P.

9. DOCK5, DOCK6: highest scaling 32 P.

10. Argonne v18 Nuclear Potential: 16%; 2010 Bonner Prize; 32 P.

11. "Cat" Brain: 2009 GB Special Award; 36 P.

Page 41: Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY


High Performance Computing Trends: three distinct phases

– Past: exponential growth in processor performance, mostly through CMOS technology advances
– Near term: exponential (or faster) growth in the level of parallelism
– Long term: power cost = system cost; invention required

The curve is indicative not only of peak performance but also of performance/$.

Supercomputer Peak Performance

[Chart: peak speed (flops) vs. year introduced, 1940-2020, on a log scale from 1E+2 to 1E+17, with a doubling time of 1.5 years; systems labeled from ENIAC (vacuum tubes), UNIVAC, IBM 701, IBM 704, IBM 7090 (transistors), IBM Stretch, CDC 6600 (ICs), CDC 7600, CDC STAR-100 (vectors), ILLIAC IV, CRAY-1, Cyber 205, X-MP2 (parallel vectors), S-810/20, SX-2, CRAY-2, X-MP4, Y-MP8, i860 (MPPs), Delta, CM-5, Paragon, NWT, SX-3/44, VP2600/10, T3D, CP-PACS, T3E, ASCI Red, ASCI Red Option, SX-4, SX-5, Blue Pacific, ASCI White, Earth Simulator, ASCI Purple, Red Storm, Blue Gene/L, Blue Gene/P, and Blue Gene/Q; the Past, Near Term, and Long Term phases are marked, with annotations 1 PF in 2008 and 10 PF in 2011.]