Bill Camp, Jim Tomkins & Rob Leland


Transcript of Bill Camp, Jim Tomkins & Rob Leland

  • Bill Camp, Jim Tomkins & Rob Leland

  • How Much Commodity is Enough? The Red Storm Architecture

    William J. Camp, James L. Tomkins & Rob Leland, CCIM, Sandia National Laboratories, Albuquerque, NM, [email protected]

  • Sandia MPPs (since 1987)
    1987: 1024-processor nCUBE10 [512 Mflops]
    1990--1992: two 1024-processor nCUBE-2 machines [2 @ 2 Gflops]
    1988--1990: 16384-processor CM-200
    1991: 64-processor Intel IPSC-860
    1993--1996: ~3700-processor Intel Paragon [180 Gflops]
    1996--present: 9400-processor Intel TFLOPS (ASCI Red) [3.2 Tflops]
    1997--present: 400 --> 2800 processors in Cplant Linux Cluster [~3 Tflops]
    2003: 1280-processor IA32 Linux cluster [~7 Tflops]
    2004: Red Storm: ~11600-processor Opteron-based MPP [>40 Tflops]

  • Our rubric (since 1987)
    Complex, mission-critical engineering & science applications
    Large systems (1000s of PEs) with a few processors per node
    Message-passing paradigm
    Balanced architecture
    Use commodity wherever possible
    Efficient systems software
    Emphasis on scalability & reliability in all aspects
    Critical advances in parallel algorithms
    Vertical integration of technologies

  • A partitioned, scalable computing architecture

  • Computing domains at Sandia
    Red Storm is targeting the highest-end market but has real advantages for the mid-range market (from 1 cabinet on up).

    Domain                      # Procs:  1     10    10^2   10^3   10^4
    Red Storm                                          X      X      X
    Cplant Linux Supercluster         X     X     X
    Beowulf clusters                  X     X     X
    Desktop                           X

  • Red Storm Architecture
    True MPP, designed to be a single system -- not a cluster
    Distributed memory MIMD parallel supercomputer
    Fully connected 3D mesh interconnect; each compute node processor has a bi-directional connection to the primary communication network
    108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0--2.4 GHz)
    ~10 or 20 TB of DDR memory @ 333 MHz
    Red/Black switching: ~1/4, ~1/2, ~1/4 (for data security)
    12 Service, Visualization, and I/O cabinets on each end (640 S, V & I processors for each color)
    240 TB of disk storage (120 TB per color) initially

  • Red Storm Architecture
    Functional hardware partitioning: service and I/O nodes, compute nodes, visualization nodes, and RAS nodes
    Partitioned operating system (OS): Linux on service, visualization, and I/O nodes; LWK (Catamount) on compute nodes; Linux on RAS nodes
    Separate RAS and system management network (Ethernet)
    Router table-based routing in the interconnect
    Less than 2 MW total power and cooling
    Less than 3,000 ft² of floor space

  • Usage Model
    Unix (Linux) login node with Unix environment
    Batch processing or compute resource
    I/O
    User sees a coherent, single system

  • Thor's Hammer Topology

    3D-mesh compute node topology: 27 x 16 x 24 (x, y, z)
    Red/Black split: 2,688 | 4,992 | 2,688
    Service, Visualization, and I/O partitions: 3 cabinets on each end of each row
    384 full-bandwidth links to the compute node mesh
    Not all nodes have a processor -- all have routers
    256 PEs in each Visualization partition -- 2 per board
    256 PEs in each I/O partition -- 2 per board
    128 PEs in each Service partition -- 4 per board

  • 3-D mesh topology (Z direction is a torus)
    10,368-node compute mesh: X = 27, Y = 16, Z = 24, with a torus interconnect in Z
    640 Visualization, Service & I/O nodes on each end
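
    As a concrete illustration of the topology above (a sketch added here, not taken from the presentation), the snippet below lists a node's network links in a 27 x 16 x 24 grid that is an open mesh in X and Y and a torus in Z; the function and coordinates are illustrative only.

# Illustrative sketch: six-neighbor lookup in a 27 x 16 x 24 grid that is an
# open mesh in X and Y and a torus (wraparound) in Z, as described above.
DIMS = (27, 16, 24)  # (X, Y, Z)

def neighbors(x, y, z):
    """Yield coordinates of the up-to-six nodes this node is wired to."""
    nx, ny, nz = DIMS
    for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)):
        xx, yy, zz = x + dx, y + dy, z + dz
        if 0 <= xx < nx and 0 <= yy < ny:   # X and Y edges are not wired around
            yield (xx, yy, zz % nz)         # Z wraps, giving the torus direction

print(list(neighbors(0, 0, 0)))         # corner node: only 4 links (no X/Y wraparound)
print(len(list(neighbors(13, 8, 12))))  # interior node: all 6 links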

  • Thor's Hammer Network Chips

    The 3D mesh is created by the SEASTAR ASIC: a HyperTransport interface and 6 network router ports on each chip
    In compute partitions, each processor has its own SEASTAR
    In the service partition, some boards are configured like the compute partition (4 PEs per board); others have only 2 PEs per board but still have 4 SEASTARs, so the network topology is uniform
    SEASTAR was designed by Cray to our specs and fabricated by IBM
    The only truly custom part in Red Storm -- it complies with the HyperTransport open standard

  • Node architecture
    CPU: AMD Opteron
    DRAM: 1 (or 2) GByte or more
    Six links to other nodes in X, Y, and Z
    ASIC = Application-Specific Integrated Circuit, i.e., a custom chip

  • System Layout (27 x 16 x 24 mesh)
    Normally Unclassified | Switchable Nodes | Normally Classified, with disconnect cabinets

  • Thor's Hammer Cabinet Layout
    Compute node partition: 3 card cages per cabinet, 8 boards per card cage, 4 processors per board, 4 NIC/router chips per board, N + 1 power supplies, passive backplane (96 PEs per cabinet)
    Service, Viz, and I/O node partition: 2 (or 3) card cages per cabinet, 8 boards per card cage, 2 (or 4) processors per board, 4 NIC/router chips per board, 2-PE I/O boards have 4 PCI-X busses, N + 1 power supplies, passive backplane

  • Performance
    Peak of 41.4 (46.6) TF based on 2 floating-point instruction issues per clock at 2.0 GHz
    We required a 7-fold speedup versus ASCI Red, but based on our benchmarks we expect performance 8--10 times faster than ASCI Red
    Expected MP-Linpack performance: ~30--35 TF
    Aggregate system memory bandwidth: ~55 TB/s
    Interconnect performance: latency
  • Performance
    I/O system performance: sustained file system bandwidth of 50 GB/s for each color; sustained external network bandwidth of 25 GB/s for each color
    Node memory system: page-miss latency to local memory is ~80 ns; peak bandwidth of ~5.4 GB/s for each processor
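
    As a quick consistency check of the headline figures in the two Performance bullets (my arithmetic, not from the slides): the 41.4 TF peak follows from the 10,368 compute processors at 2 flops/clock and 2.0 GHz, the 46.6 TF figure is consistent with also counting the 1,280 service, visualization, and I/O PEs (640 per color), and the ~55 TB/s aggregate memory bandwidth follows from ~5.4 GB/s per processor.

\[
\begin{aligned}
10{,}368 \times 2~\text{flops/clock} \times 2.0~\text{GHz} &\approx 41.5~\text{TF} \\
(10{,}368 + 1{,}280) \times 2 \times 2.0~\text{GHz} &\approx 46.6~\text{TF} \\
10{,}368 \times 5.4~\text{GB/s} &\approx 56~\text{TB/s}
\end{aligned}
\]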

  • Red Storm System Software
    Operating systems: Linux on service and I/O nodes; Sandia's LWK (Catamount) on compute nodes; Linux on RAS nodes
    Run-time system: logarithmic loader; fast, efficient node allocator; batch system (PBS)
    Libraries: MPI, I/O, math
    File systems being considered include: PVFS (interim file system), Lustre (design intent), Panasas (possible alternative)

  • Red Storm System Software
    Tools: all IA32 compilers, all AMD 64-bit compilers (Fortran, C, C++)
    Debugger: TotalView (also examining alternatives)
    Performance tools (was going to be Vampir until Intel bought Pallas -- now?)
    System management and administration
    Accounting
    RAS GUI interface

  • Comparison of ASCI Red and Red Storm

  • Comparison of ASCI Red and Red Storm

  • Red Storm Project
    23 months, design to First Product Shipment!
    System software is a joint project between Cray and Sandia: Sandia is supplying the Catamount LWK and the service node run-time system; Cray is responsible for Linux, the NIC software interface, RAS software, file system software, and the TotalView port
    Initial software development was done on a cluster of workstations with a commodity interconnect; the second stage involves an FPGA implementation of the SEASTAR NIC/router (Starfish); final checkout is on the real SEASTAR-based system
    System design is going on now: cabinets exist; the SEASTAR NIC/router was released to fabrication at IBM earlier this month
    Full system to be installed and turned over to Sandia in stages, culminating in August--September 2004

  • New Building for Thor's Hammer

  • Designing for scalable supercomputing
    Challenges in: design, integration, management, and use

  • SUREty for Very Large Parallel Computer Systems

    Scalability - Full System Hardware and System Software

    Usability - Required Functionality Only

    Reliability - Hardware and System Software

    Expense minimization - use commodity, high-volume parts

    SURE poses Computer System Requirements:

  • SURE Architectural tradeoffs:
    Processor and memory sub-system balance
    Compute vs. interconnect balance
    Topology choices
    Software choices
    RAS
    Commodity vs. custom technology
    Geometry and mechanical design

  • Sandia Strategies:
    - Build on commodity
    - Leverage open source (e.g., Linux)
    - Add to commodity selectively (in Red Storm there is basically one truly custom part!)
    - Leverage experience with previous scalable supercomputers

  • System Scalability Driven Requirements
    Overall system scalability - complex scientific applications such as molecular dynamics, hydrodynamics & radiation transport should achieve scaled parallel efficiencies greater than 50% on the full system (~20,000 processors).


  • Scalability: System Software
    System software performance scales nearly perfectly with the number of processors to the full size of the computer (~30,000 processors). This means that system software time (overhead) remains nearly constant with the size of the system, or scales at most logarithmically with the system size.

    - Full re-boot time scales logarithmically with the system size.
    - Job loading is logarithmic with the number of processors (see the sketch after this list).
    - Parallel I/O performance is not sensitive to the number of PEs doing I/O.
    - Communication network software must be scalable.
    - No connection-based protocols among compute nodes.
    - Message buffer space is independent of the number of processors.
    - The compute node OS gets out of the way of the application.
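
    To make the "logarithmic" requirement concrete, here is a minimal sketch (illustrative only, not Sandia's or Cray's actual loader; the function name is made up) of distributing a job image by recursive-doubling fan-out, so the number of forwarding rounds grows as log2 of the node count rather than linearly:

# Illustrative sketch of a logarithmic (tree fan-out) job load. Node 0 holds
# the executable; in round k every node that already has it forwards to the
# node 2**k away, so ceil(log2(N)) rounds reach all N nodes.
import math

def fanout_rounds(num_nodes):
    """Return, per round, the (sender, receiver) pairs of the broadcast."""
    rounds, have, step = [], {0}, 1
    while len(have) < num_nodes:
        sends = [(src, src + step) for src in sorted(have) if src + step < num_nodes]
        have.update(dst for _, dst in sends)
        rounds.append(sends)
        step *= 2
    return rounds

n = 10368  # Red Storm compute-node count
print(len(fanout_rounds(n)), "rounds;", "ceil(log2(n)) =", math.ceil(math.log2(n)))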

  • Hardware scalability
    Balance in the node hardware:
    - Memory BW must match CPU speed -- ideally 24 Bytes/flop (never yet done)
    - Communications speed must match CPU speed
    - I/O must match CPU speeds
    Scalable system SW (OS and libraries)
    Scalable applications
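
    One way to read the 24 Bytes/flop ideal (my interpretation; the slide does not spell it out): a simple vector add performs one flop per element but moves three 8-byte operands,

\[
a_i = b_i + c_i:\qquad 2 \times 8~\text{B (loads)} + 8~\text{B (store)} = 24~\text{Bytes per flop}.
\]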

  • SW Scalability: Kernel tradeoffs
    Unix/Linux has more functionality, but impairs efficiency on the compute partition at scale.
    Say breaks are uncorrelated, last 50 μs, and occur once per second on each CPU:
    - On 100 CPUs, wasted time is 5 ms every second -- a negligible 0.5% impact
    - On 1,000 CPUs, wasted time is 50 ms every second -- a moderate 5% impact
    - On 10,000 CPUs, wasted time is 500 ms every second -- a significant 50% impact
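
    The arithmetic behind those impact figures, as a minimal sketch (assuming, as the slide does, that each OS interruption effectively serializes across a tightly synchronized job; the function name is illustrative):

# Wasted fraction of each second if every CPU takes one uncorrelated
# 50-microsecond OS interruption per second and the application must wait
# for all of them (the slide's noise model).
def wasted_fraction(num_cpus, break_s=50e-6, rate_hz=1.0):
    return min(1.0, num_cpus * break_s * rate_hz)

for n in (100, 1_000, 10_000):
    print(f"{n:>6} CPUs: {wasted_fraction(n):.1%} of each second lost")
# -> 0.5%, 5.0%, 50.0%, i.e. the 5 ms / 50 ms / 500 ms per second quoted above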

  • Usability
    Application code support: software that supports scalability of the computer system
    - Math libraries
    - MPI support for full system size
    - Parallel I/O library
    - Compilers
    Tools that scale to the full size of the computer system
    - Debuggers
    - Performance monitors
    Full-featured Linux OS support at the user interface

  • Reliability
    Light Weight Kernel (LWK) OS on the compute partition -- much less code fails much less often
    Monitoring of correctable errors -- fix soft errors before they become hard
    Hot swapping of components -- the overall system keeps running during maintenance
    Redundant power supplies & memories
    A completely independent RAS system monitors virtually every component in the system

  • Economy

    Use high-volume parts where possible
    Minimize power requirements -- cuts operating costs and reduces the need for new capital investment
    Minimize system volume -- reduces the need for large new capital facilities
    Use standard manufacturing processes where possible -- minimize customization
    Maximize reliability and availability per dollar
    Maximize scalability per dollar
    Design for integrability

  • Economy
    Red Storm leverages economies of scale:
    - AMD Opteron microprocessor & standard memory
    - Air cooled
    - Electrical interconnect based on InfiniBand physical devices
    - Linux operating system
    Selected use of custom components:
    - System chip ASIC -- critical for communication-intensive applications
    - Light Weight Kernel -- truly custom, but we already have it (4th generation)

  • Cplant on a slide
    Goal: MPP look and feel
    Start ~1997, upgrades ~1999--2001
    Alpha & Myrinet, mesh topology
    ~3000 procs (3 Tf) in 7 systems
    Configurable to ~1700 procs
    Red/Black switching
    Linux w/ custom runtime & mgmt.
    Production operation for several yrs.

  • IA-32 Cplant on a slide
    Goal: mid-range capacity
    Started 2003, upgraded annually
    Pentium-4 & Myrinet, Clos network
    1280 procs (~7 Tf) in 3 systems
    Currently configurable to 512 procs
    Linux w/ custom runtime & mgmt.
    Production operation for several yrs.

  • Observation: for most large scientific and engineering applications, performance is determined more by parallel scalability and less by the speed of individual CPUs.

    There must be balance between processor, interconnect, and I/O performance to achieve overall performance.

    To date, only a few tightly-coupled, parallel computer systems have been able to demonstrate a high level of scalability on a broad set of scientific and engineering applications.

  • Let's Compare Balance in Parallel Systems

  • Comparing Red Storm and BGL

                        Blue Gene Light**    Red Storm*
    Node speed          5.6 GF               5.6 GF        (1x)
    Node memory         0.25--0.5 GB         2 (1--8) GB   (4x nom.)
    Network latency     7 μsec               2 μsec        (2/7x)
    Network BW          0.28 GB/s            6.0 GB/s      (22x)
    BW Bytes/Flop       0.05                 1.1           (22x)
    Bi-section B/F      0.0016               0.038         (24x)
    # nodes/problem     40,000               10,000        (1/4x)

    * 100 TF version of Red Storm    ** 360 TF version of BGL
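
    The derived rows follow from the raw ones; for instance, the bandwidth-per-flop entries are just link bandwidth divided by node speed (my arithmetic, not from the slide):

\[
\text{BGL: } \frac{0.28~\text{GB/s}}{5.6~\text{GF/s}} = 0.05~\text{B/flop},
\qquad
\text{Red Storm: } \frac{6.0~\text{GB/s}}{5.6~\text{GF/s}} \approx 1.1~\text{B/flop} \quad (\approx 22\times).
\]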

  • Fixed problem performance: molecular dynamics problem (LJ liquid)

  • Parallel Sn Neutronics (provided by LANL)

  • Scalable computing works

    [Embedded chart data: "Janus: ASCI Red scaled parallel efficiencies for major codes" -- scaled parallel efficiency (%) vs. number of processors for QS-Particles, QS-Fields-Only, QS-1B Cells, Rad x-port (17M, 80M, 168M, 532M, and 1B cells), Finite Element (Crash), Zapotec, Reactive Fluid Flow, Salinas, and CTH. The underlying workbook also carried the raw Quicksilver (scaled-size particle, fields-only, and load-balance runs), billion-atom LAMMPS-style molecular dynamics, CTH grind-time, Zapotec tungsten-rod impact, Salinas, and radiation-transport timing tables on Tflop (Janus), Alaska, Siberia, and Cplant behind the plots.]

  • Balance is critical to scalability (Peak vs. Linpack)

    [Embedded chart data: "Basic Parallel Efficiency Model" -- parallel efficiency vs. communication/computation load f = comm_time/calc_time for machines of differing balance: Red Storm (B=1.5), ASCI Red (B=1.2), reference machine (B=1.0), Earth Simulator (B=0.4), Cplant (B=0.25), Blue Gene Light (B=0.05), and a standard Linux cluster (B=0.04). The source workbook also tabulated node speed, link bandwidth, and balance (Bytes/flop) for the Cray T3E, ASCI Red, ASCI Blue Mountain/Pacific, ASCI Q, ASCI White, the Earth Simulator, Red Storm, Cplant, Vplant, and several hypothetical Opteron/Xeon clusters, plus per-code efficiency estimates (Presto, Salinas, LAMMPS, CTH, DSMC, Ceptre, ITS) used to weight the model.]
  • Relating scalability and cost

    Efficiency ratio = (ASCI Red efficiency) / (Cplant efficiency); cost ratio = 1.8. Above the cost ratio the MPP is more cost effective; below it the cluster is. The plotted curve is the average efficiency ratio over the five codes that consume >80% of Sandia's cycles.

    Chart1

    11

    1.04173162422

    1.03704275714

    1.12633268268

    1.122583230916

    1.169569253432

    1.296943655764

    1.4160002988128

    1.7174488073256

    2.2598735787512

    3.05108011563.0510801156

    20483.9

    40965

    Eff. Ratio

    Extrapolation

    Processors

    Efficiency ratio (Red/Cplant)

    combined

    Sum

    Flat1.01.01.01.01.01.06.00

    Current0.340.160.130.100.070.0010.81

    Future0.150.100.100.100.100.150.70

    Future combined ratio

    0.150.100.100.100.100.150.70

    ProcsACMESalinasLAMMPSDSMCCTHITSEff. RatioEff. Ratio

    10.150.100.100.100.100.151.00

    20.180.100.110.100.100.151.04

    40.160.100.110.100.110.151.04

    80.190.100.120.100.110.171.13

    160.200.100.120.100.120.141.12

    320.220.110.120.100.120.151.17

    640.260.130.120.110.130.161.30

    1280.290.160.120.120.130.161.42

    2560.360.250.120.130.150.191.72

    5120.470.450.150.130.140.242.26

    10240.530.910.130.130.150.293.053.05

    20483.90

    40965.00

    Current combined ratio

    0.340.160.130.100.070.0010.81

    ProcsACMESalinasLAMMPSDSMCCTHITSEff. RatioEff. Ratio

    10.340.160.130.100.070.001.00

    20.400.160.140.100.080.001.08

    40.380.160.140.100.080.001.06

    80.430.160.150.100.080.001.14

    160.460.160.150.100.090.001.19

    320.510.170.150.100.090.001.27

    640.590.200.160.110.090.001.43

    1280.660.260.160.120.100.001.61

    2560.820.400.160.140.110.002.00

    5121.070.710.190.140.100.002.74

    10241.211.430.170.140.110.003.783.78

    20484.90

    40966.20

    Efficiency ratiosProcsACMESalinasLAMMPSDSMCCTHITSAverageAverage

    11.001.001.001.001.001.001.00

    21.171.001.071.001.020.971.04

    41.091.011.091.001.050.981.04

    81.241.011.161.001.101.171.11

    161.331.021.181.001.220.971.12

    321.481.061.171.001.241.001.16

    641.711.291.241.101.261.091.28

    1281.931.641.231.201.321.091.40

    2562.372.511.231.301.471.301.70

    5123.124.501.511.301.401.622.24

    10243.529.061.331.301.471.943.103.1

    20484.10

    40965.30

    1.812.281.201.111.231.19

    combined

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    Eff. Ratio

    Extrapolation

    Processors

    Efficiency ratio (Red/Cplant)

    Future portfolio

    ACME

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    Eff. Ratio

    Extrapolation

    Processors

    Efficiency ratio (Red/Cplant)

    Current portfolio

    Salinas

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    Eff. Ratio

    Extrapolation

    Processors

    Efficiency ratio (Red/Cplant)

    Average portfolio

    LAMMPS

    ProblemProcsASCI RedCplant

    Scaled problem11.1320.56

    CPU time (sec.)21.5760.911

    41.9741.068

    82.2281.364

    162.431.594

    322.4821.822

    642.6032.200

    1282.762.629

    2562.8263.320

    5122.9934.614

    10243.1395.464

    ProblemProcsASCI RedCplantVplantRatio

    Scaled problem11.001.001.0001.00

    Efficiency20.720.610.5851.17

    40.570.520.4371.09

    80.510.410.3721.24

    160.470.350.3281.33

    320.460.310.3021.48

    640.430.250.2691.71

    1280.410.210.2651.93

    2560.400.170.2432.37

    5120.380.123.12

    10240.360.103.52

    leland:From very limited Bartel data, so made very conservative assumption by "hand"

    leland:Courtenay's driver routine around ACME

    leland:Ross

    leland:Ross

    LAMMPS

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled parallel efficiency

    ACME brick contact problem

    DSMC

    ASCI RedSecsEff.

    1511.00

    8670.76

    27780.65

    64810.63

    125890.57

    216920.55

    3431020.50

    5121110.46

    7291200.43

    10001260.40

    13311620.31

    17281810.28

    21972200.23

    27442650.19

    CplantSecsEff.

    1241.00

    8320.75

    27380.63

    64490.49

    125680.35

    2161000.24

    3431530.16

    5122350.10

    7193490.07

    10005220.05

    VplantSecsEff.

    1211.00

    8340.62

    27370.57

    64400.53

    125410.51

    216420.50

    343420.50

    512480.44

    719-

    1024-

    ProcsRed Eff.Cplant Eff.

    11.001.00

    20.970.96

    40.900.89

    80.760.75

    160.720.70

    270.650.63

    320.650.61

    640.630.49

    1250.570.35

    1280.570.35

    2160.550.24

    2560.540.21

    3430.500.16

    5120.460.10

    7290.430.07

    10000.400.05

    10240.400.04

    13310.31

    ProcsRed Eff.Cplant Eff.Ratio

    11.001.001.00

    20.970.961.00

    40.900.891.01

    80.760.751.01

    160.720.701.02

    320.650.611.06

    640.630.491.29

    1280.570.351.64

    2560.540.212.51

    5120.460.104.50

    10240.400.0449.06

    leland:Courtenay's driver routine around ACME

    ASCI Red

    Cplant

    Vplant

    Processors

    Scaled parallel efficiency

    ACME brick contact problem

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    DSMC

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled parallel efficiency

    Salinas modal problem

    CTH

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled parallel efficiency

    Salinas modal problem

    ITS

    MembraneProcsASCI RedCplantRatio

    Scaled problem11.0001.0001.00

    Parallel efficiency21.0280.9641.07

    41.0050.9241.09

    80.9920.8531.16

    160.9950.8451.18

    320.9960.8531.17

    640.9940.8031.24

    1280.9920.8051.23

    2560.9890.8011.23

    5120.9850.6511.51

    10240.9620.7211.33

    ITS

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled parallel efficiency

    LAMMPS membrane problem (FFT rich)

    Sheet1

    ProcsASCI Red (model)Cplant (model)f Redf CplantRatio

    11.001.000.000.001.00

    20.740.740.360.361.00

    40.670.670.480.481.00

    80.640.640.560.561.00

    160.590.590.690.691.00

    320.590.590.690.691.00

    640.600.600.680.681.00

    1280.570.570.750.751.00

    2560.510.510.970.971.00

    5120.480.481.071.071.00

    10240.520.520.930.931.00

    Red bal.1.2

    Cplant bal.1.2

    Ratio1.0

    ASCI RedTimeEff

    0

    20

    40

    80

    2000.680.432

    4000.720.408

    640

    8000.740.397

    1000

    16000.690.426

    CplantTimeEff

    0

    204.680.326

    404.720.323

    804.870.313

    200

    400

    6404.960.307

    800

    1000

    ASCI RedTimeEffRatio

    01.3

    80

    2000.680.432

    4000.720.408

    640

    8000.740.397

    1000

    CplantTimeEff

    0

    804.870.313

    200

    400

    6404.960.307

    800

    1000

    leland:Bio problem with FFT's

    leland:Ross

    Sheet1

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red (model)

    Cplant (model)

    Processors

    Scaled parallel efficiency

    DSMC problem

    Presto_old

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled parallel efficiency

    DSMC plasma problem

    Salinas_old

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled parallel efficiency

    DSMC plasma problem

    DSMC_old

    ProcsASCI RedASCI Red (interp.)ASCI RedRatio

    11.001.001.001.00

    20.950.951.02

    40.840.840.841.05

    80.820.821.10

    160.780.780.781.22

    320.780.781.24

    640.770.770.771.26

    1280.740.741.32

    2560.690.690.691.47

    5120.620.621.40

    10240.520.521.47

    ProcsCplantCplant (interp.)Cplant

    11.001.001.00

    20.930.93

    40.800.800.80

    80.740.74

    160.640.640.64

    320.630.63

    640.610.610.61

    1280.560.56

    2560.470.470.47

    5120.440.44

    10240.350.35

    DSMC_old

    0000

    0000

    0000

    0000

    0000

    0000

    0000

    0000

    0000

    00

    00

    ASCI Red

    ASCI Red (interp.)

    Cplant

    Cplant (interp.)

    Processors

    Scaled parallel efficiency

    CTH - NEST shell problem

    CTH_old

    CPU timeScaled efficiency

    STARSATProcsASCI RedCplantASCI RedCplantRatio

    Scaled problem11701.4340.61.001.001.00

    21730.44336.180.981.010.97

    41687.4331.221.011.030.98

    81692.29394.681.010.861.17

    161693.52327.211.001.040.97

    321692.29337.861.011.011.00

    641695.06368.221.000.921.09

    1281695.43369.111.000.921.09

    2561704.42443.001.000.771.30

    5121735.67564.020.980.601.62

    10241802.00701.060.940.491.94

    12481632641282565121024

    10.98256120030.98684732320.80861088550.97972440940.96441028290.84070945950.8435593220.73255813950.5530.4090684932

    Vplant

    11.00

    20.98

    40.99

    80.81

    160.98

    320.96

    640.84

    1280.84

    2560.73

    5120.55

    10240.41

    leland:From Mahesh Rajan

    leland:Ross, mixture of head and center nodes

    leland:Ross, mixture of head and center nodes

    CTH_old

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled Parallel Efficiency

    ITS Starsat problem

    000

    000

    000

    000

    000

    000

    000

    000

    000

    00

    00

    ASCI Red

    Cplant

    Vplant

    Processors

    Scaled Parallel Efficiency

    ITS Starsat problem

    CPU time (sec)Parallel efficiency

    GellerudProcsASCI RedLinuxASCI RedStd. Linux Cluster

    4k elements/proc15.50E+021.60E+021.001.00

    Presto46.04E+021.79E+020.910.89

    158.77E+022.60E+020.630.62

    601.06E+033.76E+020.520.43

    2401.36E+030.41

    CPU time (sec)Fraction of whole

    GellerudProcsASCI RedLinuxASCI RedStd. Linux Cluster

    4k elements/proc1

    ACME time45.66E+021.73E+020.940.96

    158.37E+022.53E+020.950.97

    601.02E+033.69E+020.960.98

    2401.31E+030.97

    leland:Arne's test problem (a drop problem?)

    leland:9100 Linux Cluster

    leland:Arne's test problem (a drop problem?)

    leland:9100 Linux Cluster

    00

    00

    00

    00

    00

    ASCI Red

    Std. Linux Cluster

    Processors

    Scaled parallel efficiency

    Presto drop problem

    Salinas "GB" scaled problem (Cplant runs on Ross). Worksheet notes flag the low-processor-count Cplant points as extrapolated ("Made up to achieve an efficiency that is reasonable on 4 procs (where the actual data starts)") and some intermediate points as interpolated for plotting.

    CPU time (s)
    Procs    ASCI Red    Cplant
    1        46.8        64.5
    2        47.0        --
    4        47.3        --
    8        47.8        --
    16       48.7        --
    32       50.7        --
    64       54.4        75
    128      55.8        --
    216      56.9        78
    512      59.1        80
    1000     63.6        85

    Parallel efficiency
    Procs    ASCI Red    Cplant    Ratio (Red/Cplant)
    1        1.00        1.00      1.00
    2        0.99        0.99      1.00
    4        0.99        0.99      1.00
    8        0.98        0.98      1.00
    16       0.96        0.96      1.00
    32       0.92        0.92      1.00
    64       0.86        0.86      1.00
    128      0.84        0.84      1.00
    216      0.82        0.83      0.99
    512      0.79        0.81      0.98
    1000     0.74        0.76      0.97

    [Chart: "Salinas GB problem": scaled parallel efficiency vs. processors; series ASCI Red, Cplant, plus extrapolated segments at small processor counts.]

    DSMC problem: modeled scaled parallel efficiencies, with balance parameters Red bal. = 0.80, Cplant bal. = 0.25, ratio = 3.2. Worksheet notes: "In rollup count this as 256 procs"; "In rollup, count this as 1024 procs".

    Procs    ASCI Red (model)   Cplant (model)   f Red   f Cplant   Ratio
    1        1.00               1.00             0.00    0.00       1.00
    2        0.74               0.47             0.36    1.15       1.58
    4        0.67               0.39             0.48    1.55       1.72
    8        0.64               0.36             0.56    1.80       1.79
    16       0.59               0.31             0.69    2.21       1.90
    32       0.59               0.31             0.69    2.22       1.90
    64       0.60               0.32             0.68    2.17       1.89
    128      0.57               0.29             0.75    2.41       1.95
    256      0.51               0.24             0.97    3.11       2.08
    512      0.48               0.23             1.07    3.43       2.14
    1024     0.52               0.25             0.93    2.96       2.06

    [Chart: "DSMC problem": scaled parallel efficiency vs. processors; series ASCI Red (model), Cplant (model).]

    CTH 2-gas problem: ASCI Red efficiency with modeled Cplant efficiency. The tabulated values, the f columns, and the balance parameters (Red bal. 0.80, Cplant bal. 0.25, ratio 3.2) are identical to those in the DSMC table above.

    [Chart: "CTH 2-gas problem": scaled parallel efficiency vs. processors; series ASCI Red, Cplant (model).]
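    The slides do not state the formula behind the modeled columns above; the sketch below is an inference of ours that reproduces them: treat the measured ASCI Red efficiency as implying an overhead fraction f = 1/eff - 1, scale that overhead by the balance ratio Red bal./Cplant bal. = 0.80/0.25 = 3.2, and convert back to efficiency with eff = 1/(1 + f).

    # Hedged reconstruction of the apparent balance model (not slide text).
    BALANCE_RATIO = 0.80 / 0.25          # 3.2, from the worksheet's "bal." rows

    def cplant_model(red_eff, balance_ratio=BALANCE_RATIO):
        """Predicted Cplant scaled efficiency, assuming overhead scales with balance."""
        f_red = 1.0 / red_eff - 1.0                   # overhead implied by Red's efficiency
        return 1.0 / (1.0 + balance_ratio * f_red)    # same overhead, scaled, back to efficiency

    red_model = [1.00, 0.74, 0.67, 0.64, 0.59, 0.59, 0.60, 0.57, 0.51, 0.48, 0.52]
    print([round(cplant_model(e), 2) for e in red_model])
    # [1.0, 0.47, 0.39, 0.36, 0.31, 0.31, 0.32, 0.29, 0.25, 0.22, 0.25]
    # matches the Cplant (model) column to within a rounding step.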

  • Scalability determines cost effectiveness. Measured against Sandia's top-priority computing workload, the MPP/cluster crossover falls near 256 processors: the MPP is more cost effective for the roughly 380M node-hrs of that workload above the crossover, the cluster for the roughly 55M node-hrs below it.

  • Scalability also limits capability: matching the MPP's delivered performance with the less scalable cluster takes roughly 3x as many processors.
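    One way to read the cost-effectiveness claim, sketched below with an entirely hypothetical price premium (the 1.8x figure is ours, not from the slides) and the current-portfolio Red/Cplant efficiency-ratio curve tabulated later in this section: delivered work per unit cost scales as efficiency divided by price, so the MPP becomes the better buy wherever its efficiency advantage exceeds its price premium.

    # Illustrative only: PRICE_PREMIUM is a made-up placeholder, not a Sandia figure.
    PRICE_PREMIUM = 1.8   # assumed cost of an MPP node-hour relative to a cluster node-hour

    portfolio_ratio = {   # Red/Cplant efficiency ratio, current workload mix (see table below)
        1: 1.00, 2: 1.08, 4: 1.06, 8: 1.14, 16: 1.19, 32: 1.27,
        64: 1.43, 128: 1.61, 256: 2.00, 512: 2.74, 1024: 3.78,
    }

    for procs, ratio in portfolio_ratio.items():
        better = "MPP" if ratio > PRICE_PREMIUM else "cluster"
        print(f"{procs:5d} procs: efficiency ratio {ratio:.2f} -> {better} more cost effective")
    # With this placeholder premium the crossover lands between 128 and 256 processors,
    # qualitatively consistent with the ~256-processor crossover on the slide.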

    Chart1: "ITS Speedup curves". Speedup here is processors times the scaled parallel efficiency from the STARSAT table above.

    Procs    Red speedup    Cplant speedup
    64       64.2           59.4
    128      128.5          118.5
    256      255.5          196.5
    512      501.9          303.1
    1024     966.8          469.7

    [Chart: "ITS Speedup curves": speedup vs. processors; series Red Speedup, Cplant Speedup.]

    Worksheet "combined": workload-weighted Red/Cplant efficiency ratios.

    Workload weights by code
    Mix       ACME   Salinas   LAMMPS   DSMC   CTH    ITS     Sum
    Flat      1.0    1.0       1.0      1.0    1.0    1.0     6.00
    Current   0.34   0.16      0.13     0.10   0.07   0.001   0.81
    Future    0.15   0.10      0.10     0.10   0.10   0.15    0.70

    Future workload mix: weighted contributions (weight x per-code ratio) and combined ratio
    Procs    ACME   Salinas   LAMMPS   DSMC   CTH    ITS    Eff. Ratio
    1        0.15   0.10      0.10     0.10   0.10   0.15   1.00
    2        0.18   0.10      0.11     0.10   0.10   0.15   1.04
    4        0.16   0.10      0.11     0.10   0.11   0.15   1.04
    8        0.19   0.10      0.12     0.10   0.11   0.17   1.13
    16       0.20   0.10      0.12     0.10   0.12   0.14   1.12
    32       0.22   0.11      0.12     0.10   0.12   0.15   1.17
    64       0.26   0.13      0.12     0.11   0.13   0.16   1.30
    128      0.29   0.16      0.12     0.12   0.13   0.16   1.42
    256      0.36   0.25      0.12     0.13   0.15   0.19   1.72
    512      0.47   0.45      0.15     0.13   0.14   0.24   2.26
    1024     0.53   0.91      0.13     0.13   0.15   0.29   3.05
    Extrapolated combined ratio: 3.90 at 2048 processors, 5.00 at 4096.

    Current workload mix: weighted contributions and combined ratio
    Procs    ACME   Salinas   LAMMPS   DSMC   CTH    ITS    Eff. Ratio
    1        0.34   0.16      0.13     0.10   0.07   0.00   1.00
    2        0.40   0.16      0.14     0.10   0.08   0.00   1.08
    4        0.38   0.16      0.14     0.10   0.08   0.00   1.06
    8        0.43   0.16      0.15     0.10   0.08   0.00   1.14
    16       0.46   0.16      0.15     0.10   0.09   0.00   1.19
    32       0.51   0.17      0.15     0.10   0.09   0.00   1.27
    64       0.59   0.20      0.16     0.11   0.09   0.00   1.43
    128      0.66   0.26      0.16     0.12   0.10   0.00   1.61
    256      0.82   0.40      0.16     0.14   0.11   0.00   2.00
    512      1.07   0.71      0.19     0.14   0.10   0.00   2.74
    1024     1.21   1.43      0.17     0.14   0.11   0.00   3.78
    Extrapolated combined ratio: 4.90 at 2048 processors, 6.20 at 4096.

    Per-code efficiency ratios (ASCI Red efficiency / Cplant efficiency) and their row average
    Procs    ACME   Salinas   LAMMPS   DSMC   CTH    ITS    Average
    1        1.00   1.00      1.00     1.00   1.00   1.00   1.00
    2        1.17   1.00      1.07     1.00   1.02   0.97   1.04
    4        1.09   1.01      1.09     1.00   1.05   0.98   1.04
    8        1.24   1.01      1.16     1.00   1.10   1.17   1.11
    16       1.33   1.02      1.18     1.00   1.22   0.97   1.12
    32       1.48   1.06      1.17     1.00   1.24   1.00   1.16
    64       1.71   1.29      1.24     1.10   1.26   1.09   1.28
    128      1.93   1.64      1.23     1.20   1.32   1.09   1.40
    256      2.37   2.51      1.23     1.30   1.47   1.30   1.70
    512      3.12   4.50      1.51     1.30   1.40   1.62   2.24
    1024     3.52   9.06      1.33     1.30   1.47   1.94   3.10
    Extrapolated average: 4.10 at 2048 processors, 5.30 at 4096.
    Per-code mean ratio over the tabulated processor counts: ACME 1.81, Salinas 2.28, LAMMPS 1.20, DSMC 1.11, CTH 1.23, ITS 1.19.
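    A short sketch (ours, not slide content) of how the combined "Eff. Ratio" column follows from the per-code ratios and the workload weights: it is the weighted arithmetic mean of the per-code Red/Cplant ratios, normalized by the sum of the weights.

    # Combined efficiency ratio at 1024 processors, from the tables above.
    ratio_1024 = {"ACME": 3.52, "Salinas": 9.06, "LAMMPS": 1.33,
                  "DSMC": 1.30, "CTH": 1.47, "ITS": 1.94}
    current_mix = {"ACME": 0.34, "Salinas": 0.16, "LAMMPS": 0.13,
                   "DSMC": 0.10, "CTH": 0.07, "ITS": 0.001}
    future_mix = {"ACME": 0.15, "Salinas": 0.10, "LAMMPS": 0.10,
                  "DSMC": 0.10, "CTH": 0.10, "ITS": 0.15}

    def combined_ratio(weights, ratios):
        """Workload-weighted mean of the per-code efficiency ratios."""
        return sum(weights[c] * ratios[c] for c in weights) / sum(weights.values())

    print(round(combined_ratio(current_mix, ratio_1024), 2))  # ~3.81 here; the worksheet,
                                                              # carrying more digits, shows 3.78
    print(round(combined_ratio(future_mix, ratio_1024), 2))   # 3.05, matching the worksheet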

    [Charts: "Current portfolio", "Future portfolio", and "Average portfolio": efficiency ratio (Red/Cplant) vs. processors, each with an extrapolation series beyond 1024 processors, plotted from the tables above.]

    ACME brick contact problem, scaled. Worksheet notes: "From very limited Bartel data, so made very conservative assumption by 'hand'"; "Courtenay's driver routine around ACME"; "Ross".

    Procs    Red CPU (s)   Cplant CPU (s)   Red eff.   Cplant eff.   Vplant eff.   Ratio (Red/Cplant)
    1        1.132         0.56             1.00       1.00          1.000         1.00
    2        1.576         0.911            0.72       0.61          0.585         1.17
    4        1.974         1.068            0.57       0.52          0.437         1.09
    8        2.228         1.364            0.51       0.41          0.372         1.24
    16       2.43          1.594            0.47       0.35          0.328         1.33
    32       2.482         1.822            0.46       0.31          0.302         1.48
    64       2.603         2.200            0.43       0.25          0.269         1.71
    128      2.76          2.629            0.41       0.21          0.265         1.93
    256      2.826         3.320            0.40       0.17          0.243         2.37
    512      2.993         4.614            0.38       0.12          --            3.12
    1024     3.139         5.464            0.36       0.10          --            3.52

    [Chart: "ACME brick contact problem": scaled parallel efficiency vs. processors; series ASCI Red, Cplant.]

    Raw timings (seconds) with scaled efficiency, by processor count (worksheet note: "Courtenay's driver routine around ACME"). The attached chart is titled "ACME brick contact problem"; these efficiencies also reappear as the Salinas column in the code-comparison summary later in this section.

    ASCI Red:  1: 51 (1.00), 8: 67 (0.76), 27: 78 (0.65), 64: 81 (0.63), 125: 89 (0.57), 216: 92 (0.55), 343: 102 (0.50), 512: 111 (0.46), 729: 120 (0.43), 1000: 126 (0.40), 1331: 162 (0.31), 1728: 181 (0.28), 2197: 220 (0.23), 2744: 265 (0.19)
    Cplant:    1: 24 (1.00), 8: 32 (0.75), 27: 38 (0.63), 64: 49 (0.49), 125: 68 (0.35), 216: 100 (0.24), 343: 153 (0.16), 512: 235 (0.10), 719: 349 (0.07), 1000: 522 (0.05)
    Vplant:    1: 21 (1.00), 8: 34 (0.62), 27: 37 (0.57), 64: 40 (0.53), 125: 41 (0.51), 216: 42 (0.50), 343: 42 (0.50), 512: 48 (0.44)

    Efficiencies at powers of two (interpolated from the raw points where needed), with the Red/Cplant ratio:
    Procs    Red eff.   Cplant eff.   Ratio
    1        1.00       1.00          1.00
    2        0.97       0.96          1.00
    4        0.90       0.89          1.01
    8        0.76       0.75          1.01
    16       0.72       0.70          1.02
    32       0.65       0.61          1.06
    64       0.63       0.49          1.29
    128      0.57       0.35          1.64
    256      0.54       0.21          2.51
    512      0.46       0.10          4.50
    1024     0.40       0.04          9.06

    [Chart: "ACME brick contact problem": scaled parallel efficiency vs. processors; series ASCI Red, Cplant, Vplant.]


    [Charts: two "Salinas modal problem" charts: scaled parallel efficiency vs. processors; series ASCI Red, Cplant.]

    LAMMPS membrane problem (FFT rich), scaled parallel efficiency:

    Procs    ASCI Red   Cplant   Ratio (Red/Cplant)
    1        1.000      1.000    1.00
    2        1.028      0.964    1.07
    4        1.005      0.924    1.09
    8        0.992      0.853    1.16
    16       0.995      0.845    1.18
    32       0.996      0.853    1.17
    64       0.994      0.803    1.24
    128      0.992      0.805    1.23
    256      0.989      0.801    1.23
    512      0.985      0.651    1.51
    1024     0.962      0.721    1.33

    [Chart: "LAMMPS membrane problem (FFT rich)": scaled parallel efficiency vs. processors; series ASCI Red, Cplant.]

    Worksheet Sheet1: a second copy of the DSMC efficiency model, here with equal balance factors (Red bal. = Cplant bal. = 1.2, ratio 1.0), so the modeled ASCI Red and Cplant columns coincide (same f values as the DSMC table above, ratio 1.00 at every processor count). The sheet also carries sparse measured timings (worksheet notes: "Bio problem with FFT's"; "Ross"):

    ASCI Red:  200 procs: time 0.68 (eff. 0.432), 400: 0.72 (0.408), 800: 0.74 (0.397), 1600: 0.69 (0.426)
    Cplant:    20 procs: time 4.68 (eff. 0.326), 40: 4.72 (0.323), 80: 4.87 (0.313), 640: 4.96 (0.307)
    (a companion block repeats subsets of these rows with an overall ratio figure of 1.3)

    [Chart: "DSMC problem": scaled parallel efficiency vs. processors; series ASCI Red (model), Cplant (model).]










  • Commodity nearly everywhere -- customization drives cost
    The Earth Simulator and the Cray X-1 are fully custom vector systems with good balance; that is what drives their high cost (and their high performance).
    Clusters are built almost entirely from high-volume parts, with nothing truly custom, which drives their low cost (and their low scalability).
    Red Storm uses custom parts only where they are critical to performance and reliability: high scalability at minimal cost/performance.

  • Scaling data for some key engineering codes:
    Random variation at small processor counts.
    Large differential in efficiency at large processor counts.

    [Chart: "Performance on Engineering Codes": scaled parallel efficiency vs. processors for ITS and ACME on ASCI Red and Cplant, plotted from the STARSAT and ACME brick-contact tables above.]


    Consolidated worksheet: scaled parallel efficiency by code, the data behind the "Performance on Engineering Codes" and "Efficiency ratios" charts. Worksheet note: "From very limited Bartel data, so made very conservative assumption by 'hand'".

    ASCI Red efficiency
    Procs    CTH    Salinas   ACME   ITS    LAMMPS
    1        1.00   1.00      1.00   1.00   1.00
    2        0.95   0.97      0.72   0.98   1.03
    4        0.84   0.90      0.57   1.01   1.01
    8        0.82   0.76      0.51   1.01   0.99
    16       0.78   0.72      0.47   1.00   1.00
    32       0.78   0.65      0.46   1.01   1.00
    64       0.77   0.63      0.43   1.00   0.99
    128      0.74   0.57      0.41   1.00   0.99
    256      0.69   0.54      0.40   1.00   0.99
    512      0.62   0.46      0.38   0.98   0.99
    1024     0.52   0.40      0.36   0.94   0.96

    Cplant efficiency
    Procs    CTH    Salinas   ACME   ITS    LAMMPS
    1        1.00   1.00      1.00   1.00   1.00
    2        0.93   0.96      0.61   1.01   0.96
    4        0.80   0.89      0.52   1.03   0.92
    8        0.74   0.75      0.41   0.86   0.85
    16       0.64   0.70      0.35   1.04   0.85
    32       0.63   0.61      0.31   1.01   0.85
    64       0.61   0.49      0.25   0.92   0.80
    128      0.56   0.35      0.21   0.92   0.81
    256      0.47   0.21      0.17   0.77   0.80
    512      0.44   0.10      0.12   0.60   0.65
    1024     0.35   0.04      0.10   0.49   0.72

    Efficiency ratio (Red/Cplant)
    Procs    Salinas   ACME   ITS    CTH    LAMMPS
    1        1.00      1.00   1.00   1.00   1.00
    2        1.00      1.17   0.97   1.02   1.07
    4        1.01      1.09   0.98   1.05   1.09
    8        1.01      1.24   1.17   1.10   1.16
    16       1.02      1.33   0.97   1.22   1.18
    32       1.06      1.48   1.00   1.24   1.17
    64       1.29      1.71   1.09   1.26   1.24
    128      1.64      1.93   1.09   1.32   1.23
    256      2.51      2.37   1.30   1.47   1.23
    512      4.50      3.12   1.62   1.40   1.51
    1024     9.06      3.52   1.94   1.47   1.33

    [Charts: "Efficiency ratios" (Red efficiency / Cplant efficiency vs. processors, one series per code) and "Performance on Engineering Codes" (scaled parallel efficiency vs. processors for CTH, Salinas, and ACME on both machines).]

    [Charts: two further "Performance on Engineering Codes" variants, pairing CTH/ACME and ITS/ACME efficiencies on ASCI Red and Cplant.]





