
  • NNSA Advanced Simulation and Computing: Past, Present, Future

    Steve Louis
    Integrated Computing and Communications Department

    Lawrence Livermore National Laboratory
    7000 East Avenue, Livermore, CA 94550-9234

    TEL: 1-925-422-1550  FAX: 1-925-423-8715  E-mail: [email protected]

    Presented at the March 9-10, 2004 THIC Meeting
    Sony Auditorium, 3300 Zanker Rd, San Jose CA 95134-1940

    UCRL-PRES-202571

    This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.

  • THIC Meeting, San Jose CA, 10-Mar-2004 2

    Outline

    • The NNSA Stockpile Stewardship Program

    • Where we are now: ASCI Red, Blue, White, Q

    • Challenges for today: Linux, Red Storm, Purple

    • The future challenges: BlueGene/L and beyond

    • A short digression: Storage and file systems

  • THIC Meeting, San Jose CA, 10-Mar-2004 3

    The NNSA's Stockpile Stewardship Program (SSP)

    • In October 1992, underground nuclear testing ceased by Presidential decree

    • In August 1995, the President announced U.S. intention to pursue stockpile stewardship in the absence of nuclear testing

    • The NNSA SSP is tasked with ensuring the safety, security, and reliability of the nation’s stockpile

    • Leading-edge physics simulation capabilities are key to assessment and certification requirements

  • THIC Meeting, San Jose CA, 10-Mar-2004 4

    Simulation plays central role to maintain stockpile confidence

    But its value is critically dependent on the other elements of the integrated program

  • THIC Meeting, San Jose CA, 10-Mar-2004 5

    Role of Advanced Simulation and Computing Program (ASC)

    • ASC Mission: Provide the computational means to assess and certify the safety, performance, and reliability of the nuclear stockpile and its components

    • ASC Goals: Deliver predictive codes based on multi-scale modeling, code verification and validation, small-scale experimental data, test data, engineering analysis, and expert judgment

    • ASC started in 1996 (as ASCI): approximately 1/8 of the total Stockpile Stewardship Program budget

  • THIC Meeting, San Jose CA, 10-Mar-2004 6

    Confluence of events leads to creation of ASCI in 1996

    • Inexpensive commodity “killer micros” emerged with scalar speeds equal to or exceeding those of custom processors

    • Parallel computing technology reached the point where it could begin to support large multi-physics simulations

    • The decision to halt underground nuclear testing generated a requirement for much higher fidelity simulations

    • Reducing numerical errors so that they no longer mask inadequacies in the physical models requires ~100 teraFLOPS (the initial 2004 target for ASCI)

  • THIC Meeting, San Jose CA, 10-Mar-2004 7

    ASC Program is more than platforms and physics codes

    [Program-element diagram: Advanced Applications; Materials and Physics Modeling; Integration; Simulation Support; Physical Infrastructure and Platforms; Problem Solving Environment; University Partnerships; Advanced Architectures; Computational Systems; Verification and Validation; VIEWS; PathForward; DISCOM]

  • THIC Meeting, San Jose CA, 10-Mar-2004 8

    Outline

    • The NNSA Stockpile Stewardship Program

    • Where we are now: ASCI Red, Blue, White, Q

    • Challenges for today: Linux, Red Storm, Purple

    • The future challenges: BlueGene/L and beyond

    • A short digression: Storage and file systems

  • THIC Meeting, San Jose CA, 10-Mar-2004 9

    ASCI Red, Blue, White, Q

    System             | Host | Install | Vendor    | # of Processors | Peak (TFLOPS) | Processor Type     | Footprint (sq ft) | Power (MW) | Construction Issues
    Red (upgraded '99) | SNL  | 1997    | Intel     | 9,460           | 3.15          | Pentium II 333 MHz | 2,500             | 0.85       | Use existing
    Blue Pacific       | LLNL | 1998    | IBM       | 5,808           | 3.89          | PowerPC 604e       | 5,100             | 0.6        | Use existing
    Blue Mountain      | LANL | 1998    | SGI       | 6,144           | 3.072         | MIPS R10000        | 12,000            | 1.6        | Use existing
    White              | LLNL | 2000    | IBM       | 8,192           | 12.3          | Power3-II          | 10,000            | 1.0        | Expand existing
    Q                  | LANL | 2002    | HP Compaq | 8,192           | 20.5          | Alpha EV-68        | 14,000            | 1.9        | New building

    More information at http://www.llnl.gov/asci/platforms/platforms.html

  • THIC Meeting, San Jose CA, 10-Mar-2004 10

    ASCI Red at SNL

    • 4,576 nodes and 9,472 Pentium II processors, 400 MB/sec interconnect, 3.21 TF (after upgrades)

    • 3D mesh (38x32x2) interconnect with separate partitions for service, I/O, and compute nodes; red/black switching

    • OSF1 runs on service and I/O nodes, Puma/Cougar light-weight kernel runs on compute nodes

    [Partition diagram: users reach the service partition; compute, file I/O (/home), and network I/O partitions sit behind it]

  • THIC Meeting, San Jose CA, 10-Mar-2004 11

    ASCI Blue Pacific at LLNL

    [System diagram: three SP sectors (Y, S, K), each comprising 488 Silver nodes and 24 HPGN links, connected by HPGN, FDDI, and Gb-Ethernet]

    System parameters: 3.89 TFLOP/s peak, 2.6 TB memory, 62.5 TB global disk

    Sector Y: 1.5 GB/node memory, 20.5 TB global disk, 4.4 TB local disk
    Sector S: 2.5 GB/node memory, 24.5 TB global disk, 8.3 TB local disk
    Sector K: 1.5 GB/node memory, 20.5 TB global disk, 4.4 TB local disk

  • THIC Meeting, San Jose CA, 10-Mar-2004 12

    ASCI White IBM Nighthawk-2 Compute Node Specification

    CPUs per node: 16
    CPU clock speed: 375 MHz
    Node peak performance: ~24 GigaOP/s
    Memory per node: 16 GB
    Local disk per node: 72 GB

    POWER3 processors are super-scalar pipelined 64-bit RISC chips with two floating-point units and three integer units. They are capable of executing up to eight instructions per clock cycle and up to four floating-point operations per cycle.
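    The quoted node peak follows directly from those figures; as a quick check (arithmetic only, not on the original slide):

```latex
\[
16~\text{CPUs} \times 375~\text{MHz} \times 4~\text{flops/cycle} = 24~\text{GFLOP/s per node}
\]
```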

  • THIC Meeting, San Jose CA, 10-Mar-2004 13

    ASCI Q at LANL is Alpha ES45 SMP with Quadrics interconnect

    • Alpha 21264 EV-68 processor
    • AlphaServer ES45 SMP
      – 4 processors/SMP, 8/16/32 GB memory/SMP
    • Quadrics (QSW) dual-rail switch interconnect
      – Fat-tree switch
      – High bandwidth (250 MB/s/rail), low latency (~5 µs)
    • Switch-based Fibre-attached storage arrays
      – RAID5 sets, 72 GB drives
    • AlphaServer SC and Tru64 Unix based

  • THIC Meeting, San Jose CA, 10-Mar-2004 14

    Platforms are shaken out with groundbreaking science runs

    The first million-atom simulation in biology: molecular mechanism of the genetic code. (Kevin Sanbonmatsu)

    This work (on ASCI Q at LANL) will define a new state-of-the-art in bio-molecular simulation, paving the way for other large biological studies.

    Conclusions:
    • This simulation is more than 5 times larger than the previous largest to date.
    • The core of the ribosome is more stable than the outer regions.
    • Identified a possible pivot point for ratchet motion during translocation.

  • THIC Meeting, San Jose CA, 10-Mar-2004 15

    Outline

    • The NNSA Stockpile Stewardship Program

    • Where we are now: ASCI Red, Blue, White, Q

    • Challenges for today: Linux, Red Storm, Purple

    • The future challenges: BlueGene/L and beyond

    • A short digression: Storage and file systems

  • THIC Meeting, San Jose CA, 10-Mar-2004 16

    LLNL strategy is to straddle multiple technology waves

    Three complementary curves…

    1. Delivers to today's demanding stockpile needs
       – Production environment
       – For "must have" deliverables

    2. Delivers transitions for next generation
       – "Near production," but riskier environment
       – Capacity/capability systems in a strategic mix

    3. Delivers affordable path to petaFLOP/s computing
       – Research environment, leading transition to petaflop systems

    [Chart: performance vs. time for successive technology curves — mainframes (RIP); vendor-integrated SMP clusters (IBM SP, HP SC); IA32/IA64/AMD + Linux; cell-based (IBM BG/L) — annotated with approximate costs per peak teraFLOP of $10M/TF (White), $7M/TF (Q), $2M/TF (Purple C), $1.2M/TF (MCR), $500K/TF, and $170K/TF, on a timeline running from today to FY05]

    Any given technology curve is ultimately limited by Moore's Law

  • THIC Meeting, San Jose CA, 10-Mar-2004 17

    Ramifications of LLNL strategy

    • Benefits
      — Maximizes cost performance and adapts quickly to change
      — Offers options to customers that match their requirements

    • Costs
      — Requires expertise in multiple technologies
        – Simultaneously field systems on multiple technology curves
      — Requires constant attention to new technology
        – Must correctly assess the longevity, maturity (risk), and usability of technology

    • Issues
      — Programming model/environment must be made as consistent as possible

  • THIC Meeting, San Jose CA, 10-Mar-2004 18

    Maintain similar programming model across LLNL platforms

    [Diagram: MPI communications across nodes, OpenMP threading within each node, and node-local disk alongside shared serial (NFS) I/O evolving toward globally shared scalable I/O]

    Idea: Provide a consistent programming model for multiple platform generations and across multiple vendors!

    Idea: Incrementally increase functionality over time!
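    To make the shared programming model concrete, here is a minimal hybrid sketch (not from the original slides; it assumes only a standard MPI library and an OpenMP-capable C compiler): MPI carries communication between nodes, OpenMP provides threading within a node, exactly the combination the diagram above describes.

```c
/* Minimal hybrid MPI + OpenMP sketch (illustrative only, not an ASC code).
 * Each MPI rank owns a slice of a global array; OpenMP threads work on
 * the slice in parallel, and MPI_Allreduce combines the partial sums. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N_LOCAL 1000000   /* elements per MPI rank (assumed size) */

static double x[N_LOCAL];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local_sum = 0.0;

    /* OpenMP handles on-node parallelism, as on the SMP nodes above. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < N_LOCAL; i++) {
        x[i] = (double)(rank + 1);   /* stand-in for real physics work */
        local_sum += x[i];
    }

    /* MPI handles communication across nodes. */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g (from %d ranks)\n", global_sum, nprocs);

    MPI_Finalize();
    return 0;
}
```

    Built with something like `mpicc -fopenmp`, the same source runs unchanged across the IBM, HP, and Linux platforms above, which is the point of keeping the model consistent.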

  • THIC Meeting, San Jose CA, 10-Mar-2004 19

    A supercomputing caveat: a new facility may be required

    [Photos: Nicholas Metropolis Center for Modeling and Simulation at LANL; future Terascale Simulation Facility at LLNL; future building for Red Storm at SNL]

  • THIC Meeting, San Jose CA, 10-Mar-2004 20

    The MCR Linux Cluster made 10+ teraFLOPS computing affordable

    [Cluster diagram: 1,116 dual-P4 compute nodes, 2 login nodes with 4 Gb-Enet, 2 service nodes, and 2 metadata (fail-over) servers on a 1,152-port (12x96D32U + 4x96D32U) QsNet Elan3 interconnect with 100BaseT control and management networks; 32 gateway nodes at 140 MB/s delivered Lustre I/O each over 2x1GbE into a federated Gb-Ethernet switch; 64 object storage targets (OSTs) at 70 MB/s delivered each, for a Lustre total of 4.48 GB/s]

    System parameters:
    • Dual 2.4 GHz Pentium 4 Prestonia nodes with 4.0 GB PC2100 DDR SDRAM
    • Aggregate 11.1 TF/s peak, 4.608 TB memory
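    The delivered-I/O figures on this slide are consistent across the two tiers; a quick check using only the numbers quoted above:

```latex
\[
\begin{aligned}
\text{OST tier:}     &\quad 64 \times 70~\text{MB/s}  = 4480~\text{MB/s} \approx 4.48~\text{GB/s} \\
\text{Gateway tier:} &\quad 32 \times 140~\text{MB/s} = 4480~\text{MB/s} \approx 4.48~\text{GB/s}
\end{aligned}
\]
```

    The gateway and OST tiers are sized to the same aggregate bandwidth, so neither side of the Lustre path is the obvious bottleneck.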

  • THIC Meeting, San Jose CA, 10-Mar-2004 21

    LANL announced Lightning Linux Cluster in August 2003

    • Theoretical peak speed of 11.26 teraFLOP/s
    • Will be built and integrated by Linux NetworX
    • Will be used for the smaller SSP calculations
    • Initial delivery with 1,280 dual-processor nodes
    • Option to extend machine to 1,408 nodes
    • Uses AMD Opteron 64-bit processors, Linux
    • Uses Myrinet 2000 Lanai XP interconnect

    Yet another example of capacity computing at around $1 million or so per peak teraFLOP

    More information on Lightning at http://www.lnxi.com/news/lightning_info.php

  • THIC Meeting, San Jose CA, 10-Mar-2004 22

    Thunder is newest LLNL cluster: a 1000+ node quad Itanium2

    Thunder at LLNL will be the world's largest Linux cluster (www.llnl.gov/linux/thunder/)

    • 23 TF (peak) procurement on condensed schedule to meet crushing demand
    • Quad 1.4 GHz Itanium2 Tiger4 nodes
    • 8.0 GB DDR266 SDRAM per node
    • 400 MB/s transfers to archive over quad Jumbo Frame Gb-Enet and QSW links
    • 75 TB in local, 192 TB global parallel disk
    • Lustre file system with 6.4 GB/s delivered parallel I/O performance
    • Expected to be operational and in use by early 2004
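    The 23 TF peak is consistent with the node description; as a rough check (not on the slide, and assuming Itanium2's four floating-point operations per clock and roughly 1,024 quad-CPU nodes):

```latex
\[
4~\text{CPUs} \times 1.4~\text{GHz} \times 4~\text{flops/cycle} = 22.4~\text{GF/s per node},
\qquad 1024 \times 22.4~\text{GF/s} \approx 23~\text{TF/s}
\]
```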

  • THIC Meeting, San Jose CA, 10-Mar-2004 23

    Key Open Source Software for MCR and Thunder clusters

    • Chaos
      — RedHat Linux distribution (www.redhat.com)
      — LinuxBIOS (www.linuxbios.org)
      — CHAOS (www.llnl.gov/linux/chaos)
    • Lustre cluster-wide file system
      — www.lustre.org
    • Quadrics QsNet drivers
      — www.quadrics.com
    • SLURM/DPCS/LCRM
      — http://www.llnl.gov/icc/lc/dpcs/dpcs_overview.html
      — http://www.llnl.gov/linux/slurm/slurm.html


  • THIC Meeting, San Jose CA, 10-Mar-2004 24

    Linux cluster and open source observations

    • 10 TF/s scale computing is now becoming affordable

    —LTO options for $10-15M over 2-4 yrs put it into the budget range of departmental supercomputing

    • These technologies are somewhat disruptive

    — Important to factor Linux and commodity hardware into an overall computing strategy

    • Significant opportunities for broad collaborations

    —Key open source cluster technologies under active development

  • THIC Meeting, San Jose CA, 10-Mar-2004 25

    Open Computing Facility (OCF) Clusters, Networks, Storage

    [Site diagram (BB/MKS Version 6, Dec 23, 2003): OCF clusters tied together over a federated Gb-Ethernet backbone to the OCF SGS File System Cluster (OFC), the 400-600 terabyte HPSS archive (via PFTP), and the LLNL external backbone; links are copper, multi-mode fiber, and single-mode fiber 1 GigE]

    • Thunder (B451): 1,004 quad Itanium2 compute nodes on a 1,024-port QsNet Elan4; 4 login nodes with 6 Gb-Enet; 16 gateway nodes @ 350 MB/s delivered Lustre I/O over 4x1GbE; 2 metadata (fail-over) servers
    • MCR (B439): 1,114 dual P4 compute nodes on a 1,152-port QsNet Elan3; 4 login nodes with 4 Gb-Enet; 32 gateway nodes @ 190 MB/s delivered Lustre I/O over 2x1GbE; 2 metadata servers
    • ALC (B439): 924 dual P4 compute nodes on a 960-port QsNet Elan3; 2 login nodes with 4 Gb-Enet; 32 gateway nodes @ 190 MB/s delivered Lustre I/O over 2x1GbE; 2 metadata servers
    • BlueGene/L (TSF): 65,536 dual PowerPC 440 compute nodes and 1,024 PPC440 I/O nodes on the BG/L torus and global tree/barrier networks
    • PVC (B451): 52 dual P4 render nodes and 6 dual P4 display nodes on a 128-port Elan3, with gateway and login nodes
    • OFC storage: groups of Lustre object storage target (OST) heads (64, 196, and 208 shown); each OST is a dual P4 head with 2 Gb Fibre Channel RAID using 146, 73, and 36 GB drives

  • THIC Meeting, San Jose CA, 10-Mar-2004 26

    SNL's new MPP architecture machine is ASCI Red Storm

    • Scalability

    • Reliability

    • Usability

    • Simplicity

    • Cost effectiveness

  • THIC Meeting, San Jose CA, 10-Mar-2004 27

    System Layout (27 x 16 x 24 mesh)

    [Floor-plan diagram: normally unclassified cabinets and normally classified cabinets at either end, switchable nodes in the middle, separated by disconnect cabinets]

  • THIC Meeting, San Jose CA, 10-Mar-2004 28

    Red Storm Architecture

    • True MPP designed to be a single system
    • Distributed memory parallel supercomputer
    • Fully connected 3D mesh interconnect
    • 108 compute node cabinets
    • 10,368 processors (AMD Opteron @ 2.0 GHz)
    • ~10 TB of DDR memory @ 333 MHz
    • Red/Black switching (classified/unclassified)
    • 8 service and I/O cabinets on each end (256 processors for each color)
    • 240 TB of disk storage (120 TB per color)
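    The slide does not quote a peak figure, but one follows from the processor count if we assume the Opteron's two floating-point operations per clock (an assumption, not stated here):

```latex
\[
10{,}368~\text{processors} \times 2.0~\text{GHz} \times 2~\text{flops/cycle} \approx 41.5~\text{TF/s peak}
\]
```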

  • THIC Meeting, San Jose CA, 10-Mar-2004 29

    Purple is next 100 TF platform at LLNL, with 2 PB of disk storage

    [System diagram: parallel batch/interactive/visualization nodes plus NFS and login nodes (each with its own login network), tied together by system data and control networks, with I/O nodes feeding a Fibre Channel 2 I/O network]

    Specific Purple details still in contract negotiation

  • THIC Meeting, San Jose CA, 10-Mar-2004 30

    Outline

    • The NNSA Stockpile Stewardship Program

    • Where we are now: ASCI Red, Blue, White, Q

    • Challenges for today: Linux, Red Storm, Purple

    • The future challenges: BlueGene/L and beyond

    • A short digression: Storage and file systems

  • THIC Meeting, San Jose CA, 10-Mar-2004 31

    BlueGene/L is being built as part of IBM Purple contract

    Packaging hierarchy:

    • Compute chip: 2 processors, 2.8/5.6 GF/s, 4 MiB* eDRAM, ~11 mm die
    • Compute card (FRU, field replaceable unit): 2 nodes (4 CPUs), 2x1x1, 2.8/5.6 GF/s, 256/512 MiB* DDR, 15 W, 25 mm x 32 mm
    • Node card: 16 compute cards plus 0-2 I/O cards, 32 nodes (64 CPUs), 4x4x2, 90/180 GF/s, 8 GiB* DDR
    • Midplane (SU, scalable unit): 16 node boards, 512 nodes (1,024 CPUs), 8x8x8, 1.4/2.9 TF/s, 128 GiB* DDR, 7-10 kW
    • Cabinet: 2 midplanes, 1,024 nodes (2,048 CPUs), 8x8x16, 2.9/5.7 TF/s, 256 GiB* DDR, 15-20 kW
    • System: 64 cabinets, 65,536 nodes (131,072 CPUs), 32x32x64, 180/360 TF/s, 16 TiB*, 1.2 MW, 2,500 sq ft

    (compare this with a 1988 Cray YMP/8 at 2.7 GF/s)

    * http://physics.nist.gov/cuu/Units/binary.html
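    The system totals follow from the per-node figures above; a quick check (arithmetic only, taking 256 MiB per node from the node card's 8 GiB across 32 nodes):

```latex
\[
\begin{aligned}
65{,}536~\text{nodes} \times 5.6~\text{GF/s} &\approx 367~\text{TF/s} \approx 360~\text{TF/s} \\
65{,}536~\text{nodes} \times 256~\text{MiB}  &= 16~\text{TiB}
\end{aligned}
\]
```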

  • THIC Meeting, San Jose CA, 10-Mar-2004 32

    BlueGene/L will have 32,768 dual-node compute cards

    [Compute card photo: heatsinks designed for 16 W; card is 206 mm (8.125") wide by 54 mm (2.125"), 14 layers, with a 6x180-pin connector and 9 x 256/512 Mb DRAM chips]

  • THIC Meeting, San Jose CA, 10-Mar-2004 33

    BlueGene/L will contribute at all length and time scales

    [Multiscale modeling diagram spanning length scales from nm (10^-9 m) through µm (10^-6 m) to mm (10^-3 m) and time scales from ps to s:
    • Atomic scale — Molecular Dynamics: unit mechanisms of defect mobility and interaction
    • Microscale — Dislocation Dynamics: collective behavior of defects, single-crystal plasticity
    • Mesoscale — NIKE3D: aggregate grain response, poly-crystal plasticity
    • Continuum — ALE3D: plasticity of complex shapes
    Constitutive relation: $\sigma = \sigma(\varepsilon, \dot{\varepsilon}, T, P, \ldots)$]

    First-time overlap calculations will allow direct comparison of detailed and coarse-scale models of plasticity of crystals

  • THIC Meeting, San Jose CA, 10-Mar-2004 34

    BlueGene/L has a growing list of industry/academia collaborators

    [Collaboration diagram: hardware design and build, network design and build, OS and system software; applications, file systems, batch system, kernel evaluation, programming models, debugger and visualization; network simulator, MPI tracing, application scaling; application tracing and performance; parallel objects (CHARM++); STAPL (standard adaptive template library); optimized FFT; MPI (message passing interface); PAPI (performance monitoring); performance analysis (Vampir/GuideView); debugger]

  • THIC Meeting, San Jose CA, 10-Mar-2004 35

    If industry progress slows, how to get simulation to next step?

    PITAC 1999 Recommendations

    (President’s IT Advisory Committee)

    — $$ for R&D on innovative computing technologies

    — $$ for software research

    — $$ for Petaflops on some applications by 2010

    — $$ to fund the most powerful high-end systems

    — Can this be leveraged into a broad national program??

  • THIC Meeting, San Jose CA, 10-Mar-2004 36

    National Academies Report: The Future of Supercomputing

    The Future of Supercomputing: An Interim Report (2003) from the Computer Science and Telecommunications Board
    http://www7.nationalacademies.org/cstb/pub_supercomp_int.html

    • Sponsored by DOE Office of Science and DOE ASC
    • Goals of the study
      — Supercomputing R&D in support of U.S. needs
      — Context and background
      — Applications and implications for design
      — Market, national security, role of U.S. government
      — Options for progress/recommendations (final report, 2004)

  • THIC Meeting, San Jose CA, 10-Mar-2004 37

    National Academies Report: The Future of Supercomputing

    Some observations from the Interim Report
    • U.S. is in pretty good shape regarding manufacturing
    • Custom and commodity species have own niches
    • Need balance between customization and commodity
    • Need balance between evolution and innovation
    • Need continuity and sustained investment
    • Government role essential (market incentives insufficient)
    • Supercomputer software is not in good shape
      — Hard to program
      — Inadequate development tools
      — Legacy code porting problems

  • THIC Meeting, San Jose CA, 10-Mar-2004 38

    Outline

    • The NNSA Stockpile Stewardship Program

    • Where we are now: ASCI Red, Blue, White, Q

    • Challenges for today: Linux, Red Storm, Purple

    • The future challenges: BlueGene/L and beyond

    • A short digression: Storage and file systems

  • THIC Meeting, San Jose CA, 10-Mar-2004 39

    Some interesting things about 70's, 80's, early 90's (G. Grider)

    • Many (a dozen or so) supercomputers, all disk poor
    • Common serial file system and archive (integrated)
    • Archival file system was less than an order of magnitude slower than supercomputer local disk
    • Invented own networks, out of parallel technology
      — Invented our own protocols, much like IP today
      — 3-5 MB/sec in 80's when fast networks were 56 kb/s
      — HIPPI is an example, 100 MB/sec in 1989, when most networks were 10 Mb/s and fast networks were 50 Mb/s

    This slide and the following three slides courtesy of Gary Grider, LANL

  • THIC Meeting, San Jose CA, 10-Mar-2004 40

    Enter ASCI in the mid-90's

    • In the mid 90’s the ASCI program attempted to accelerate things through the use of massive parallelism.

    • We moved to a new model for balanced system, with new ratios for storage feeds and speeds.

    • Developed a forward-looking data transfer and storage technology roadmap to address barriers.

  • THIC Meeting, San Jose CA, 10-Mar-2004 41

    Some interesting things about the late 90's

    • Went from a dozen supercomputers to 1 or 2, and from disk-poor supercomputers to disk-rich

    • Supercomputer file system had to become parallel and use supercomputer interconnect to move data

    • Each supercomputer had to have its own parallel file system (not a common file system)

    • Went from integrated common file system with archive to separate parallel archive (HPSS)

    • Archive now over an order of magnitude slower than supercomputer parallel file system

  • THIC Meeting, San Jose CA, 10-Mar-2004 42

    Some interesting things about the current 21st century trend

    Due to lower costs for supercomputers:
    — Going back to lots of lower cost supercomputers that are disk-poor
    — Probably need to move towards a scalable common parallel file system
    — Probably need to integrate parallel archive and common parallel file system
    — Probably need to have a parallel multi-supercomputer secure scalable backbone

    We have not been sitting idle hoping for "magic happens here" solutions

  • THIC Meeting, San Jose CA, 10-Mar-2004 43

    HPSS archival storage slide from last time I was here…

    Accomplishments
    — A 20x performance increase in 15 months (faster nets and disks)
    — PSE Milepost demonstrated 170 MB/s aggregate throughput White-to-HPSS
    — Large single file transfer rates of up to 80 MB/s White-to-HPSS
    — Large single file transfer rates of up to 150 MB/s White-to-SGI

    Challenges
    — Yearly doubling of throughput is needed for next machine

    At 170 MB/s, 2 TB of data moves to storage in less than 4 hours. A year and a half ago it took two and a half days to move the same amount of data.
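    Those time figures check out against the quoted rates (arithmetic only, not on the original slide; 9 MB/s is the FY99 rate shown in the chart below):

```latex
\[
\begin{aligned}
\frac{2\times 10^{6}~\text{MB}}{170~\text{MB/s}} &\approx 1.2\times 10^{4}~\text{s} \approx 3.3~\text{hours} \\
\frac{2\times 10^{6}~\text{MB}}{9~\text{MB/s}}   &\approx 2.2\times 10^{5}~\text{s} \approx 2.6~\text{days}
\end{aligned}
\]
```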

    [Chart — Aggregate Throughput to Storage: FY96 1 MB/s, FY97 4 MB/s, FY98 6 MB/s, FY99 9 MB/s, FY00 120 MB/s, FY01 170 MB/s. Step changes: moved to HPSS; moved to SP nodes; moved to jumbo-frame GE and parallel striping; moved to faster disk on faster nodes with multi-node concurrency]

  • THIC Meeting, San Jose CA, 10-Mar-2004 44

    LLNL's yearly "I/O Blueprints" have helped to increase rates

    A 115x performance improvement in four years!

    [Chart — Aggregate Throughput to Storage: FY96 1 MB/s, FY97 4 MB/s, FY98 6 MB/s, FY99 9 MB/s, FY00 120 MB/s, FY01 170 MB/s, FY02 854 MB/s, FY03 1,037 MB/s (12/03 throughput). Step changes: moved to HPSS; moved to SP nodes; moved to jumbo-frame GE, parallel striping, and faster disk and nodes using multiple pftp sessions; moved to faster disk using multiple Htar sessions on multiple nodes]
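    The "115x in four years" headline is just the ratio of the FY99 and 12/03 points on this chart:

```latex
\[
\frac{1037~\text{MB/s}~(12/03)}{9~\text{MB/s}~(\text{FY99})} \approx 115
\]
```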

  • THIC Meeting, San Jose CA, 10-Mar-2004 45

    Current HPSS data movement with HSM disk/tape hierarchy

    [Data-flow diagram, steps 1-7: an application on the MCR or ALC platform writes to platform disk; a PFTP client and client mover then push the data to HPSS movers and HPSS disk, from which HPSS migrates it down the HSM hierarchy to tape]

    Key elements:
    1. User in the loop to force keep/delete decision
    2. Large HPSS disk cache and multiple copies on disk and tape
    3. Bandwidth limited by PFTP bandwidth off compute platform

  • THIC Meeting, San Jose CA, 10-Mar-2004 46

    Lustre Object Storage Target HPSS Data Movement Vision

    [Data-flow diagram, steps 1-3: an application on the MCR or ALC platform writes to a globally accessible Lustre OST; a location-independent client archive agent then uses open(), seek(), and read() against the Lustre disk and streams the data to an HPSS tape front-end]

    Key elements:
    1. User in the loop to force keep/delete decision
    2. Direct-to-tape transfers eliminate HPSS disk cache
    3. Bandwidth limited by SERIAL file system bandwidth access from HPSS
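    To show how little machinery the archive agent in this vision needs, here is a minimal sketch (not LLNL code; the file path and send_to_tape() are hypothetical stand-ins for the Lustre mount point and the tape front-end interface): the agent reads the file straight off the parallel file system with ordinary POSIX calls and streams it to tape, with no intermediate HPSS disk cache.

```c
/* Hypothetical archive-agent sketch: read a file straight from the
 * parallel file system with ordinary POSIX open()/read() and hand the
 * bytes to a tape front-end, skipping any intermediate HPSS disk cache.
 * send_to_tape() is a placeholder, not a real HPSS API. */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (8 * 1024 * 1024)   /* 8 MiB read size (assumed) */

static void send_to_tape(const char *buf, ssize_t len)
{
    /* Placeholder for the tape front-end transfer. */
    (void)buf; (void)len;
}

int main(void)
{
    const char *path = "/p/lustre/archive/dump.dat";  /* hypothetical file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    char *buf = malloc(CHUNK);
    if (!buf) { close(fd); return EXIT_FAILURE; }

    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0)
        send_to_tape(buf, n);          /* direct to tape, no disk cache */

    if (n < 0) perror("read");
    free(buf);
    close(fd);
    return 0;
}
```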

  • THIC Meeting, San Jose CA, 10-Mar-2004 47

    Proposed HPSS parallel local file movers for tape parallelism

    [Data-flow diagram, steps 1-3: an application on an MCR or ALC capacity platform writes to Lustre disk; a location-independent PFTP client coordinates multiple location-independent HPSS parallel local file movers (PLFMs), each using open(), seek(), and read() against the parallel file system to drive tape in parallel]

    Key elements:
    1. User in the loop to force keep/delete decision
    2. Direct-to-tape transfers eliminate HPSS disk cache
    3. Improved by PARALLEL file system bandwidth access from HPSS

  • THIC Meeting, San Jose CA, 10-Mar-2004 48

    Tri-lab historical timeline for scalable, parallel file systems

    [Timeline figure, 1999-2004, events in rough order: SGSFS workshop ("You're Crazy"); propose initial architecture; build initial requirements document; Tri-Lab joint requirements document complete; proposed PathForward activity for SGSFS; PathForward team formed to pursue an RFI/RFQ approach, RFI issued, recommend RFQ process; RFQ, analysis, recommendation to fund open-source OBSD development and NFSv4 efforts; begin partnering talks and negotiations for OBSD and NFSv4 PathForwards; PathForward proposal with OBSD vendor, Panasas born; Lustre PathForward effort is born; Alliance contracts placed with universities on OBSD, overlapped I/O, and NFSv4; U Minn Object Archive begins; HECRTF workshop: "Re-invent POSIX I/O?" ("Are We Still Crazy?")]

  • THIC Meeting, San Jose CA, 10-Mar-2004 49

    From the June 2003 HECRTF workshop report (available)

    • For info: http://www.nitrd.gov/hecrtf-outreach/index.html
    • NNSA Tri-labs (Lee Ward of SNL, Tyce McClarty of LLNL, and Gary Grider of LANL) were lone I/O representatives at this workshop
    • Overwhelming consensus that POSIX I/O is inadequate

    5.5 Data Management and File Systems:
    We believe legacy, POSIX I/O interfaces are incompatible with the full range of hardware architecture choices contemplated …
    The interface does not fully support the needs for parallel support along the I/O path …
    An alternative, appropriate operating system API should be developed for high-end computing systems …
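    As one concrete illustration of what a parallel-aware alternative to per-process POSIX I/O looks like (this sketch is mine, not from the report; it uses MPI-IO, which already existed at the time, and a hypothetical output file name): each rank writes its own block of a single shared file through one collective call, so the I/O layer sees the whole access pattern instead of thousands of uncoordinated open()/lseek()/write() streams.

```c
/* Minimal MPI-IO sketch: every rank writes its own block of one shared
 * file with a single collective call, instead of each rank doing an
 * independent POSIX open()/lseek()/write() with no coordination. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK 1048576   /* doubles per rank (assumed size) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(BLOCK * sizeof(double));
    if (!buf) MPI_Abort(MPI_COMM_WORLD, 1);
    for (int i = 0; i < BLOCK; i++)
        buf[i] = rank;              /* stand-in for simulation output */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: the I/O library sees the whole access pattern
     * and can coordinate with the parallel file system. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```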

  • THIC Meeting, San Jose CA, 10-Mar-2004 50

    DISCLAIMER

    This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial products, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes.

    This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.
