
  • NNSA Advanced Simulation and Computing: Past, Present, Future

    Steve Louis
    Integrated Computing and Communications Department

    Lawrence Livermore National Laboratory
    7000 East Avenue, Livermore, CA 94550-9234

    TEL: 1-925-422-1550  FAX: 1-925-423-8715  E-mail: [email protected]

    Presented at the March 9-10, 2004 THIC Meeting
    Sony Auditorium, 3300 Zanker Rd, San Jose CA 95134-1940

    UCRL-PRES-202571

    This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.

  • THIC Meeting, San Jose CA, 10-Mar-2004 2

    Outline

    • The NNSA Stockpile Stewardship Program

    • Where we are now: ASCI Red, Blue, White, Q

    • Challenges for today: Linux, Red Storm, Purple

    • The future challenges: BlueGene/L and beyond

    • A short digression: Storage and file systems

  • THIC Meeting, San Jose CA, 10-Mar-2004 3

    The NNSA's Stockpile Stewardship Program (SSP)

    • In October 1992, underground nuclear testing ceased by Presidential decree

    • In August 1995, the President announced U.S. intention to pursue stockpile stewardship in the absence of nuclear testing

    • The NNSA SSP is tasked with ensuring the safety, security, and reliability of the nation’s stockpile

    • Leading-edge physics simulation capabilities are key to assessment and certification requirements

  • THIC Meeting, San Jose CA, 10-Mar-2004 4

    Simulation plays central role to maintain stockpile confidence

    But its value is critically dependent on the other elements of the integrated program

  • THIC Meeting, San Jose CA, 10-Mar-2004 5

    Role of Advanced Simulation and Computing Program (ASC)

    • ASC Mission: Provide the computational means to assess and certify the safety, performance, and reliability of the nuclear stockpile and its components

    • ASC Goals: Deliver predictive codes based on multi-scale modeling, code verification and validation, small-scale experimental data, test data, engineering analysis, and expert judgment

    • ASC started in 1996 (as ASCI): approximately 1/8 of the total Stockpile Stewardship Program budget

  • THIC Meeting, San Jose CA, 10-Mar-2004 6

    Confluence of events leads to creation of ASCI in 1996

    • Inexpensive commodity “killer micros” emerged with scalar speeds equal to or exceeding those of custom processors

    • Parallel computing technology reached the point where it could begin to support large multi-physics simulations

    • The decision to halt underground nuclear testing generated a requirement for much higher fidelity simulations

    • Reducing numerical errors so that they no longer mask inadequacies in the physical models requires ~100 teraFLOPS (the initial 2004 target for ASCI)

  • THIC Meeting, San Jose CA, 10-Mar-2004 7

    ASC Program is more than platforms and physics codes

    [Program-element diagram: Advanced Applications; Materials and Physics Modeling; Integration; Simulation Support; Physical Infrastructure and Platforms; Problem Solving Environment; University Partnerships; Advanced Architectures; Computational Systems; Verification and Validation; VIEWS; PathForward; DISCOM]

  • THIC Meeting, San Jose CA, 10-Mar-2004 8

    Outline

    • The NNSA Stockpile Stewardship Program

    • Where we are now: ASCI Red, Blue, White, Q

    • Challenges for today: Linux, Red Storm, Purple

    • The future challenges: BlueGene/L and beyond

    • A short digression: Storage and file systems

  • THIC Meeting, San Jose CA, 10-Mar-2004 9

    ASCI Red, Blue, White, Q

    System             | Host | Install | Vendor    | # of Processors | Peak (TFLOPS) | Processor Type     | Footprint (sq ft) | Power (MW) | Construction Issues
    Red (upgraded '99) | SNL  | 1997    | Intel     | 9,460           | 3.15          | Pentium II 333 MHz | 2,500             | 0.85       | Use existing
    Blue Pacific       | LLNL | 1998    | IBM       | 5,808           | 3.89          | PowerPC 604e       | 5,100             | 0.6        | Use existing
    Blue Mountain      | LANL | 1998    | SGI       | 6,144           | 3.072         | MIPS R10000        | 12,000            | 1.6        | Use existing
    White              | LLNL | 2000    | IBM       | 8,192           | 12.3          | Power3-II          | 10,000            | 1.0        | Expand existing
    Q                  | LANL | 2002    | HP Compaq | 8,192           | 20.5          | Alpha EV-68        | 14,000            | 1.9        | New building

    More information at http://www.llnl.gov/asci/platforms/platforms.html

  • THIC Meeting, San Jose CA, 10-Mar-2004 10

    ASCI Red at SNL

    • 4,576 nodes and 9,472 Pentium II processors, 400 MB/sec interconnect, 3.21 TF (after upgrades)

    • 3D mesh (38x32x2) interconnect with separate partitions for service, I/O, and compute nodes; red/black switching

    • OSF1 runs on service and I/O nodes, Puma/Cougar light-weight kernel runs on compute nodes

    [Partition diagram: users reach the service partition; compute, file I/O (/home), and network I/O partitions sit behind it]

  • THIC Meeting, San Jose CA, 10-Mar-2004 11

    ASCI Blue Pacific at LLNL

    [System diagram: three SP sectors (Y, S, K), each comprising 488 Silver nodes and 24 HPGN links, connected by HPGN, FDDI, and Gb-Ethernet]

    System parameters: 3.89 TFLOP/s peak, 2.6 TB memory, 62.5 TB global disk

    Sector Y: 1.5 GB/node memory, 20.5 TB global disk, 4.4 TB local disk
    Sector S: 2.5 GB/node memory, 24.5 TB global disk, 8.3 TB local disk
    Sector K: 1.5 GB/node memory, 20.5 TB global disk, 4.4 TB local disk

  • THIC Meeting, San Jose CA, 10-Mar-2004 12

    ASCI White IBM Nighthawk-2 Compute Node Specification

    CPUs per node: 16
    CPU clock speed: 375 MHz
    Node peak performance: ~24 GigaOP/s
    Memory per node: 16 GB
    Local disk per node: 72 GB

    POWER3 processors are super-scalar pipelined 64-bit RISC chips with two floating-point units and three integer units. They are capable of executing up to eight instructions per clock cycle and up to four floating-point operations per cycle.
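    The quoted node peak follows directly from those figures; as a quick check (arithmetic only, not on the original slide):

```latex
\[
16~\text{CPUs} \times 375~\text{MHz} \times 4~\text{flops/cycle} = 24~\text{GFLOP/s per node}
\]
```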

  • THIC Meeting, San Jose CA, 10-Mar-2004 13

    ASCI Q at LANL is Alpha ES45 SMP with Quadrics interconnect

    • Alpha 21264 EV-68 processor
    • AlphaServer ES45 SMP
      – 4 processors/SMP, 8/16/32 GB memory/SMP
    • Quadrics (QSW) dual-rail switch interconnect
      – Fat-tree switch
      – High bandwidth (250 MB/s/rail), low latency (~5 µs)
    • Switch-based Fibre-attached storage arrays
      – RAID5 sets, 72 GB drives
    • AlphaServer SC and Tru64 Unix based

  • THIC Meeting, San Jose CA, 10-Mar-2004 14

    Platforms are shaken out with groundbreaking science runs

    The first million-atom simulation in biology: molecular mechanism of the genetic code. (Kevin Sanbonmatsu)

    This work (on ASCI Q at LANL) will define a new state-of-the-art in bio-molecular simulation, paving the way for other large biological studies.

    Conclusions:
    • This simulation is more than 5 times larger than the previous largest to date.
    • The core of the ribosome is more stable than the outer regions.
    • Identified a possible pivot point for ratchet motion during translocation.

  • THIC Meeting, San Jose CA, 10-Mar-2004 15

    Outline

    • The NNSA Stockpile Stewardship Program

    • Where we are now: ASCI Red, Blue, White, Q

    • Challenges for today: Linux, Red Storm, Purple

    • The future challenges: BlueGene/L and beyond

    • A short digression: Storage and file systems

  • THIC Meeting, San Jose CA, 10-Mar-2004 16

    LLNL strategy is to straddle multiple technology waves

    Three complementary curves…

    1. Delivers to today's demanding stockpile needs
       – Production environment
       – For "must have" deliverables

    2. Delivers transitions for next generation
       – "Near production," but riskier environment
       – Capacity/capability systems in a strategic mix

    3. Delivers affordable path to petaFLOP/s computing
       – Research environment, leading transition to petaflop systems

    [Chart: performance vs. time for successive technology curves — mainframes (RIP); vendor-integrated SMP clusters (IBM SP, HP SC); IA32/IA64/AMD + Linux; cell-based (IBM BG/L) — annotated with approximate costs per peak teraFLOP of $10M/TF (White), $7M/TF (Q), $2M/TF (Purple C), $1.2M/TF (MCR), $500K/TF, and $170K/TF, on a timeline running from today to FY05]

    Any given technology curve is ultimately limited by Moore's Law

  • THIC Meeting, San Jose CA, 10-Mar-2004 17

    Ramifications of LLNL strategy

    • Benefits
      — Maximizes cost performance and adapts quickly to change
      — Offers options to customers that match their requirements

    • Costs
      — Requires expertise in multiple technologies
        – Simultaneously field systems on multiple technology curves
      — Requires constant attention to new technology
        – Must correctly assess the longevity, maturity (risk), and usability of technology

    • Issues
      — Programming model/environment must be made as consistent as possible

  • THIC Meeting, San Jose CA, 10-Mar-2004 18

    Maintain similar programming model across LLNL platforms

    [Diagram: MPI communications across nodes, OpenMP threading within each node, and node-local disk alongside shared serial (NFS) I/O evolving toward globally shared scalable I/O]

    Idea: Provide a consistent programming model for multiple platform generations and across multiple vendors!

    Idea: Incrementally increase functionality over time!
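    To make the shared programming model concrete, here is a minimal hybrid sketch (not from the original slides; it assumes only a standard MPI library and an OpenMP-capable C compiler): MPI carries communication between nodes, OpenMP provides threading within a node, exactly the combination the diagram above describes.

```c
/* Minimal hybrid MPI + OpenMP sketch (illustrative only, not an ASC code).
 * Each MPI rank owns a slice of a global array; OpenMP threads work on
 * the slice in parallel, and MPI_Allreduce combines the partial sums. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N_LOCAL 1000000   /* elements per MPI rank (assumed size) */

static double x[N_LOCAL];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local_sum = 0.0;

    /* OpenMP handles on-node parallelism, as on the SMP nodes above. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < N_LOCAL; i++) {
        x[i] = (double)(rank + 1);   /* stand-in for real physics work */
        local_sum += x[i];
    }

    /* MPI handles communication across nodes. */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g (from %d ranks)\n", global_sum, nprocs);

    MPI_Finalize();
    return 0;
}
```

    Built with something like `mpicc -fopenmp`, the same source runs unchanged across the IBM, HP, and Linux platforms above, which is the point of keeping the model consistent.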

  • THIC Meeting, San Jose CA, 10-Mar-2004 19

    A supercomputing caveat: a new facility may be required

    [Photos: Nicholas Metropolis Center for Modeling and Simulation at LANL; future Terascale Simulation Facility at LLNL; future building for Red Storm at SNL]

  • THIC Meeting, San Jose CA, 10-Mar-2004 20

    The MCR Linux Cluster made 10+ teraFLOPS computing affordable

    [Cluster diagram: 1,116 dual-P4 compute nodes, 2 login nodes with 4 Gb-Enet, 2 service nodes, and 2 metadata (fail-over) servers on a 1,152-port (12x96D32U + 4x96D32U) QsNet Elan3 interconnect with 100BaseT control and management networks; 32 gateway nodes at 140 MB/s delivered Lustre I/O each over 2x1GbE into a federated Gb-Ethernet switch; 64 object storage targets (OSTs) at 70 MB/s delivered each, for a Lustre total of 4.48 GB/s]

    System parameters:
    • Dual 2.4 GHz Pentium 4 Prestonia nodes with 4.0 GB PC2100 DDR SDRAM
    • Aggregate 11.1 TF/s peak, 4.608 TB memory
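    The delivered-I/O figures on this slide are consistent across the two tiers; a quick check using only the numbers quoted above:

```latex
\[
\begin{aligned}
\text{OST tier:}     &\quad 64 \times 70~\text{MB/s}  = 4480~\text{MB/s} \approx 4.48~\text{GB/s} \\
\text{Gateway tier:} &\quad 32 \times 140~\text{MB/s} = 4480~\text{MB/s} \approx 4.48~\text{GB/s}
\end{aligned}
\]
```

    The gateway and OST tiers are sized to the same aggregate bandwidth, so neither side of the Lustre path is the obvious bottleneck.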

  • THIC Meeting, San Jose CA, 10-Mar-2004 21

    LANL announced Lightning Linux Cluster in August 2003

    • Theoretical peak speed of 11.26 teraFLOP/s
    • Will be built and integrated by Linux NetworX
    • Will be used for the smaller SSP calculations
    • Initial delivery with 1,280 dual-processor nodes
    • Option to extend machine to 1,408 nodes
    • Uses AMD Opteron 64-bit processors, Linux
    • Uses Myrinet 2000 Lanai XP interconnect

    Yet another example of capacity computing at around $1 million or so per peak teraFLOP

    More information on Lightning at http://www.lnxi.com/news/lightning_info.php

  • THIC Meeting, San Jose CA, 10-Mar-2004 22

    Thunder is newest LLNL cluster: a 1000+ node quad Itanium2

    Thunder at LLNL will be the world's largest Linux cluster (www.llnl.gov/linux/thunder/)

    • 23 TF (peak) procurement on condensed schedule to meet crushing demand
    • Quad 1.4 GHz Itanium2 Tiger4 nodes
    • 8.0 GB DDR266 SDRAM per node
    • 400 MB/s transfers to archive over quad Jumbo Frame Gb-Enet and QSW links
    • 75 TB in local, 192 TB global parallel disk
    • Lustre file system with 6.4 GB/s delivered parallel I/O performance
    • Expected to be operational and in use by early 2004
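    The 23 TF peak is consistent with the node description; as a rough check (not on the slide, and assuming Itanium2's four floating-point operations per clock and roughly 1,024 quad-CPU nodes):

```latex
\[
4~\text{CPUs} \times 1.4~\text{GHz} \times 4~\text{flops/cycle} = 22.4~\text{GF/s per node},
\qquad 1024 \times 22.4~\text{GF/s} \approx 23~\text{TF/s}
\]
```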

  • THIC Meeting, San Jose CA, 10-Mar-2004 23

    Key Open Source Software for MCR and Thunder clusters

    • Chaos
      — RedHat Linux distribution (www.redhat.com)
      — LinuxBIOS (www.linuxbios.org)
      — CHAOS (www.llnl.gov/linux/chaos)
    • Lustre cluster-wide file system
      — www.lustre.org
    • Quadrics QsNet drivers
      — www.quadrics.com
    • SLURM/DPCS/LCRM
      — http://www.llnl.gov/icc/lc/dpcs/dpcs_overview.html
      — http://www.llnl.gov/linux/slurm/slurm.html


  • THIC Meeting, San Jose CA, 10-Mar-2004 24

    Linux cluster and open source observations

    • 10 TF/s scale computing is now becoming affordable

    —LTO options for $10-15M over 2-4 yrs put it into the budget range of departmental supercomputing

    • These technologies are somewhat disruptive

    — Important to factor Linux and commodity hardware into an overall computing strategy

    • Significant opportunities for broad collaborations

    —Key open source cluster technologies under active development

  • THIC Meeting, San Jose CA, 10-Mar-2004 25

    Open Computing Facility (OCF) Clusters, Networks, Storage

    [Site diagram (BB/MKS Version 6, Dec 23, 2003): OCF clusters tied together over a federated Gb-Ethernet backbone to the OCF SGS File System Cluster (OFC), the 400-600 terabyte HPSS archive (via PFTP), and the LLNL external backbone; links are copper, multi-mode fiber, and single-mode fiber 1 GigE]

    • Thunder (B451): 1,004 quad Itanium2 compute nodes on a 1,024-port QsNet Elan4; 4 login nodes with 6 Gb-Enet; 16 gateway nodes @ 350 MB/s delivered Lustre I/O over 4x1GbE; 2 metadata (fail-over) servers
    • MCR (B439): 1,114 dual P4 compute nodes on a 1,152-port QsNet Elan3; 4 login nodes with 4 Gb-Enet; 32 gateway nodes @ 190 MB/s delivered Lustre I/O over 2x1GbE; 2 metadata servers
    • ALC (B439): 924 dual P4 compute nodes on a 960-port QsNet Elan3; 2 login nodes with 4 Gb-Enet; 32 gateway nodes @ 190 MB/s delivered Lustre I/O over 2x1GbE; 2 metadata servers
    • BlueGene/L (TSF): 65,536 dual PowerPC 440 compute nodes and 1,024 PPC440 I/O nodes on the BG/L torus and global tree/barrier networks
    • PVC (B451): 52 dual P4 render nodes and 6 dual P4 display nodes on a 128-port Elan3, with gateway and login nodes
    • OFC storage: groups of Lustre object storage target (OST) heads (64, 196, and 208 shown); each OST is a dual P4 head with 2 Gb Fibre Channel RAID using 146, 73, and 36 GB drives

  • THIC Meeting, San Jose CA, 10-Mar-2004 26

    SNL's new MPP architecture machine is ASCI Red Storm

    • Scalability

    • Reliability

    • Usability

    • Simplicity

    • Cost effectiveness

  • THIC Meeting, San Jose CA, 10-Mar-2004 27

    System Layout (27 x 16 x 24 mesh)

    [Floor-plan diagram: normally unclassified cabinets and normally classified cabinets at either end, switchable nodes in the middle, separated by disconnect cabinets]

  • THIC Meeting, San Jose CA, 10-Mar-2004 28

    Red Storm Architecture

    • True MPP designed to be a single system
    • Distributed memory parallel supercomputer
    • Fully connected 3D mesh interconnect
    • 108 compute node cabinets
    • 10,368 processors (AMD Opteron @ 2.0 GHz)
    • ~10 TB of DDR memory @ 333 MHz
    • Red/Black switching (classified/unclassified)
    • 8 service and I/O cabinets on each end (256 processors for each color)
    • 240 TB of disk storage (120 TB per color)
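    The slide does not quote a peak figure, but one follows from the processor count if we assume the Opteron's two floating-point operations per clock (an assumption, not stated here):

```latex
\[
10{,}368~\text{processors} \times 2.0~\text{GHz} \times 2~\text{flops/cycle} \approx 41.5~\text{TF/s peak}
\]
```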

  • THIC Meeting, San Jose CA, 10-Mar-2004 29

    Purple is next 100 TF platform at LLNL, with 2 PB of disk storage

    [System diagram: parallel batch/interactive/visualization nodes plus NFS and login nodes (each with its own login network), tied together by system data and control networks, with I/O nodes feeding a Fibre Channel 2 I/O network]

    Specific Purple details still in contract negotiation

  • THIC Meeting, San Jose CA, 10-Mar-2004 30

    Outline

    • The NNSA Stockpile Stewardship Program

    • Where we are now: ASCI Red, Blue, White, Q

    • Challenges for today: Linux, Red Storm, Purple

    • The future challenges: BlueGene/L and beyond

    • A short digression: Storage and file systems

  • THIC Meeting, San Jose CA, 10-Mar-2004 31

    BlueGene/L is being built as part of IBM Purple contract

    Packaging hierarchy:

    • Compute chip: 2 processors, 2.8/5.6 GF/s, 4 MiB* eDRAM, ~11 mm die
    • Compute card (FRU, field replaceable unit): 2 nodes (4 CPUs), 2x1x1, 2.8/5.6 GF/s, 256/512 MiB* DDR, 15 W, 25 mm x 32 mm
    • Node card: 16 compute cards plus 0-2 I/O cards, 32 nodes (64 CPUs), 4x4x2, 90/180 GF/s, 8 GiB* DDR
    • Midplane (SU, scalable unit): 16 node boards, 512 nodes (1,024 CPUs), 8x8x8, 1.4/2.9 TF/s, 128 GiB* DDR, 7-10 kW
    • Cabinet: 2 midplanes, 1,024 nodes (2,048 CPUs), 8x8x16, 2.9/5.7 TF/s, 256 GiB* DDR, 15-20 kW
    • System: 64 cabinets, 65,536 nodes (131,072 CPUs), 32x32x64, 180/360 TF/s, 16 TiB*, 1.2 MW, 2,500 sq ft

    (compare this with a 1988 Cray YMP/8 at 2.7 GF/s)

    * http://physics.nist.gov/cuu/Units/binary.html
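    The system totals follow from the per-node figures above; a quick check (arithmetic only, taking 256 MiB per node from the node card's 8 GiB across 32 nodes):

```latex
\[
\begin{aligned}
65{,}536~\text{nodes} \times 5.6~\text{GF/s} &\approx 367~\text{TF/s} \approx 360~\text{TF/s} \\
65{,}536~\text{nodes} \times 256~\text{MiB}  &= 16~\text{TiB}
\end{aligned}
\]
```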

  • THIC Meeting, San Jose CA, 10-Mar-2004 32

    BlueGene/L will have 32,768 dual-node compute cards

    [Compute card photo: heatsinks designed for 16 W; card is 206 mm (8.125") wide by 54 mm (2.125"), 14 layers, with a 6x180-pin connector and 9 x 256/512 Mb DRAM chips]

  • THIC Meeting, San Jose CA, 10-Mar-2004 33

    BlueGene/L will contribute at all length and time scales

    [Multiscale modeling diagram spanning length scales from nm (10^-9 m) through µm (10^-6 m) to mm (10^-3 m) and time scales from ps to s:
    • Atomic scale — Molecular Dynamics: unit mechanisms of defect mobility and interaction
    • Microscale — Dislocation Dynamics: collective behavior of defects, single-crystal plasticity
    • Mesoscale — NIKE3D: aggregate grain response, poly-crystal plasticity
    • Continuum — ALE3D: plasticity of complex shapes
    Constitutive relation: $\sigma = \sigma(\varepsilon, \dot{\varepsilon}, T, P, \ldots)$]

    First-time overlap calculations will allow direct comparison of detailed and coarse-scale models of plasticity of crystals

  • THIC Meeting, San Jose CA, 10-Mar-2004 34

    BlueGene/L has a growing list of industry/academia collaborators

    [Collaboration diagram: hardware design and build, network design and build, OS and system software; applications, file systems, batch system, kernel evaluation, programming models, debugger and visualization; network simulator, MPI tracing, application scaling; application tracing and performance; parallel objects (CHARM++); STAPL (standard adaptive template library); optimized FFT; MPI (message passing interface); PAPI (performance monitoring); performance analysis (Vampir/GuideView); debugger]

  • THIC Meeting, San Jose CA, 10-Mar-2004 35

    If industry progress slows, how to get simulation to next step?

    PITAC 1999 Recommendations

    (President’s IT Advisory Committee)

    — $$ for R&D on innovative computing technologies

    — $$ for software research

    — $$ for Petaflops on some applications by 2010

    — $$ to fund the most powerful high-end systems

    — Can this be leveraged into a broad national program??

  • THIC Meeting, San Jose CA, 10-Mar-2004 36

    National Academies Report: The Future of Supercomputing

    The Future of Supercomputing: An Interim Report (2003) from the Computer Science and Telecommunications Board
    http://www7.nationalacademies.org/cstb/pub_supercomp_int.html

    • Sponsored by DOE Office of Science and DOE ASC
    • Goals of the study
      — Supercomputing R&D in support of U.S. needs
      — Context and background
      — Applications and implications for design
      — Market, national security, role of U.S. government
      — Options for progress/recommendations (final report, 2004)

  • THIC Meeting, San Jose CA, 10-Mar-2004 37

    National Academies Report: The Future of Supercomputing

    Some observations from the Interim Report
    • U.S. is in pretty good shape regarding manufacturing
    • Custom and commodity species have own niches
    • Need balance between customization and commodity
    • Need balance between evolution and innovation
    • Need continuity and sustained investment
    • Government role essential (market incentives insufficient)
    • Supercomputer software is not in good shape
      — Hard to program
      — Inadequate development tools
      — Legacy code porting problems

  • THIC Meeting, San Jose CA, 10-Mar-2004 38

    Outline

    • The NNSA Stockpile Stewardship Program

    • Where we are now: ASCI Red, Blue, White, Q

    • Challenges for today: Linux, Red Storm, Purple

    • The future challenges: BlueGene/L and beyond

    • A short digression: Storage and file systems

  • THIC Meeting, San Jose CA, 10-Mar-2004 39

    Some interesting things about 70's, 80's, early 90's (G. Grider)

    • Many (a dozen or so) supercomputers, all disk poor
    • Common serial file system and archive (integrated)
    • Archival file system was less than an order of magnitude slower than supercomputer local disk
    • Invented own networks, out of parallel technology
      — Invented our own protocols, much like IP today
      — 3-5 MB/sec in 80's when fast networks were 56 kb/s
      — HIPPI is an example, 100 MB/sec in 1989, when most networks were 10 Mb/s and fast networks were 50 Mb/s

    This slide and the following three slides courtesy of Gary Grider, LANL

  • THIC Meeting, San Jose CA, 10-Mar-2004 40

    Enter ASCI in the mid-90's

    • In the mid 90’s the ASCI program attempted to accelerate things through the use of massive parallelism.

    • We moved to a new model for balanced system, with new ratios for storage feeds and speeds.

    • Developed a forward-looking data transfer and storage technology roadmap to address barriers.

  • THIC Meeting, San Jose CA, 10-Mar-2004 41

    Some interesting things about the late 90's

    • Went from a dozen supercomputers to 1 or 2, and from disk-poor supercomputers to disk-rich

    • Supercomputer file system had to become parallel and use supercomputer interconnect to move data

    • Each supercomputer had to have its own parallel file system (not a common file system)

    • Went from integrated common file system with archive to separate parallel archive (HPSS)

    • Archive now over an order of magnitude slower than supercomputer parallel file system

  • THIC Meeting, San Jose CA, 10-Mar-2004 42

    Some interesting things about the current 21st century trend

    Due to lower costs for supercomputers:
    — Going back to lots of lower cost supercomputers that are disk-poor
    — Probably need to move towards a scalable common parallel file system
    — Probably need to integrate parallel archive and common parallel file system
    — Probably need to have a parallel multi-supercomputer secure scalable backbone

    We have not been sitting idle hoping for "magic happens here" solutions

  • THIC Meeting, San Jose CA, 10-Mar-2004 43

    HPSS archival storage slide from last time I was here…

    Accomplishments
    — A 20x performance increase in 15 months (faster nets and disks)
    — PSE Milepost demonstrated 170 MB/s aggregate throughput White-to-HPSS
    — Large single file transfer rates of up to 80 MB/s White-to-HPSS
    — Large single file transfer rates of up to 150 MB/s White-to-SGI

    Challenges
    — Yearly doubling of throughput is needed for next machine

    At 170 MB/s, 2 TB of data moves to storage in less than 4 hours. A year and a half ago it took two and a half days to move the same amount of data.
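    Those time figures check out against the quoted rates (arithmetic only, not on the original slide; 9 MB/s is the FY99 rate shown in the chart below):

```latex
\[
\begin{aligned}
\frac{2\times 10^{6}~\text{MB}}{170~\text{MB/s}} &\approx 1.2\times 10^{4}~\text{s} \approx 3.3~\text{hours} \\
\frac{2\times 10^{6}~\text{MB}}{9~\text{MB/s}}   &\approx 2.2\times 10^{5}~\text{s} \approx 2.6~\text{days}
\end{aligned}
\]
```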

    [Chart — Aggregate Throughput to Storage: FY96 1 MB/s, FY97 4 MB/s, FY98 6 MB/s, FY99 9 MB/s, FY00 120 MB/s, FY01 170 MB/s. Step changes: moved to HPSS; moved to SP nodes; moved to jumbo-frame GE and parallel striping; moved to faster disk on faster nodes with multi-node concurrency]

  • THIC Meeting, San Jose CA, 10-Mar-2004 44

    LLNL's yearly "I/O Blueprints" have helped to increase rates

    A 115x performance improvement in four years!

    [Chart — Aggregate Throughput to Storage: FY96 1 MB/s, FY97 4 MB/s, FY98 6 MB/s, FY99 9 MB/s, FY00 120 MB/s, FY01 170 MB/s, FY02 854 MB/s, FY03 1,037 MB/s (12/03 throughput). Step changes: moved to HPSS; moved to SP nodes; moved to jumbo-frame GE, parallel striping, and faster disk and nodes using multiple pftp sessions; moved to faster disk using multiple Htar sessions on multiple nodes]
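    The "115x in four years" headline is just the ratio of the FY99 and 12/03 points on this chart:

```latex
\[
\frac{1037~\text{MB/s}~(12/03)}{9~\text{MB/s}~(\text{FY99})} \approx 115
\]
```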

  • THIC Meeting, San Jose CA, 10-Mar-2004 45

    Current HPSS data movement with HSM disk/tape hierarchy

    [Data-flow diagram, steps 1-7: an application on the MCR or ALC platform writes to platform disk; a PFTP client and client mover then push the data to HPSS movers and HPSS disk, from which HPSS migrates it down the HSM hierarchy to tape]

    Key elements:
    1. User in the loop to force keep/delete decision
    2. Large HPSS disk cache and multiple copies on disk and tape
    3. Bandwidth limited by PFTP bandwidth off compute platform

  • THIC Meeting, San Jose CA, 10-Mar-2004 46

    Lustre Object Storage Target HPSS Data Movement Vision

    [Data-flow diagram, steps 1-3: an application on the MCR or ALC platform writes to a globally accessible Lustre OST; a location-independent client archive agent then uses open(), seek(), and read() against the Lustre disk and streams the data to an HPSS tape front-end]

    Key elements:
    1. User in the loop to force keep/delete decision
    2. Direct-to-tape transfers eliminate HPSS disk cache
    3. Bandwidth limited by SERIAL file system bandwidth access from HPSS
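    To show how little machinery the archive agent in this vision needs, here is a minimal sketch (not LLNL code; the file path and send_to_tape() are hypothetical stand-ins for the Lustre mount point and the tape front-end interface): the agent reads the file straight off the parallel file system with ordinary POSIX calls and streams it to tape, with no intermediate HPSS disk cache.

```c
/* Hypothetical archive-agent sketch: read a file straight from the
 * parallel file system with ordinary POSIX open()/read() and hand the
 * bytes to a tape front-end, skipping any intermediate HPSS disk cache.
 * send_to_tape() is a placeholder, not a real HPSS API. */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (8 * 1024 * 1024)   /* 8 MiB read size (assumed) */

static void send_to_tape(const char *buf, ssize_t len)
{
    /* Placeholder for the tape front-end transfer. */
    (void)buf; (void)len;
}

int main(void)
{
    const char *path = "/p/lustre/archive/dump.dat";  /* hypothetical file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    char *buf = malloc(CHUNK);
    if (!buf) { close(fd); return EXIT_FAILURE; }

    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0)
        send_to_tape(buf, n);          /* direct to tape, no disk cache */

    if (n < 0) perror("read");
    free(buf);
    close(fd);
    return 0;
}
```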

  • THIC Meeting, San Jose CA, 10-Mar-2004 47

    Proposed HPSS parallel local file movers for tape parallelism

    [Data-flow diagram, steps 1-3: an application on an MCR or ALC capacity platform writes to Lustre disk; a location-independent PFTP client coordinates multiple location-independent HPSS parallel local file movers (PLFMs), each using open(), seek(), and read() against the parallel file system to drive tape in parallel]

    Key elements:
    1. User in the loop to force keep/delete decision
    2. Direct-to-tape transfers eliminate HPSS disk cache
    3. Improved by PARALLEL file system bandwidth access from HPSS

  • THIC Meeting, San Jose CA, 10-Mar-2004 48

    Tri-lab historical timeline for scalable, parallel file systems

    [Timeline figure, 1999-2004, events in rough order: SGSFS workshop ("You're Crazy"); propose initial architecture; build initial requirements document; Tri-Lab joint requirements document complete; proposed PathForward activity for SGSFS; PathForward team formed to pursue an RFI/RFQ approach, RFI issued, recommend RFQ process; RFQ, analysis, recommendation to fund open-source OBSD development and NFSv4 efforts; begin partnering talks and negotiations for OBSD and NFSv4 PathForwards; PathForward proposal with OBSD vendor, Panasas born; Lustre PathForward effort is born; Alliance contracts placed with universities on OBSD, overlapped I/O, and NFSv4; U Minn Object Archive begins; HECRTF workshop: "Re-invent POSIX I/O?" ("Are We Still Crazy?")]

  • THIC Meeting, San Jose CA, 10-Mar-2004 49

    From the June 2003 HECRTF workshop report (available)

    • For info: http://www.nitrd.gov/hecrtf-outreach/index.html
    • NNSA Tri-labs (Lee Ward of SNL, Tyce McClarty of LLNL, and Gary Grider of LANL) were lone I/O representatives at this workshop
    • Overwhelming consensus that POSIX I/O is inadequate

    5.5 Data Management and File Systems:
    We believe legacy, POSIX I/O interfaces are incompatible with the full range of hardware architecture choices contemplated …
    The interface does not fully support the needs for parallel support along the I/O path …
    An alternative, appropriate operating system API should be developed for high-end computing systems …
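    As one concrete illustration of what a parallel-aware alternative to per-process POSIX I/O looks like (this sketch is mine, not from the report; it uses MPI-IO, which already existed at the time, and a hypothetical output file name): each rank writes its own block of a single shared file through one collective call, so the I/O layer sees the whole access pattern instead of thousands of uncoordinated open()/lseek()/write() streams.

```c
/* Minimal MPI-IO sketch: every rank writes its own block of one shared
 * file with a single collective call, instead of each rank doing an
 * independent POSIX open()/lseek()/write() with no coordination. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK 1048576   /* doubles per rank (assumed size) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(BLOCK * sizeof(double));
    if (!buf) MPI_Abort(MPI_COMM_WORLD, 1);
    for (int i = 0; i < BLOCK; i++)
        buf[i] = rank;              /* stand-in for simulation output */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: the I/O library sees the whole access pattern
     * and can coordinate with the parallel file system. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```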

  • THIC Meeting, San Jose CA, 10-Mar-2004 50

    DISCLAIMER

    This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial products, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes.

    This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.
