Bill Camp, Jim Tomkins & Rob Leland


Transcript of Bill Camp, Jim Tomkins & Rob Leland

  • Bill Camp, Jim Tomkins & Rob Leland

  • How Much Commodity is Enough? The Red Storm Architecture

    William J. Camp, James L. Tomkins & Rob Leland, CCIM, Sandia National Laboratories, Albuquerque, NM, [email protected]

  • Sandia MPPs (since 1987)
    1987: 1024-processor nCUBE10 [512 Mflops]
    1990--1992: two 1024-processor nCUBE-2 machines [2 @ 2 Gflops]
    1988--1990: 16384-processor CM-200
    1991: 64-processor Intel IPSC-860
    1993--1996: ~3700-processor Intel Paragon [180 Gflops]
    1996--present: 9400-processor Intel TFLOPS (ASCI Red) [3.2 Tflops]
    1997--present: 400 --> 2800 processors in Cplant Linux Cluster [~3 Tflops]
    2003: 1280-processor IA32 Linux cluster [~7 Tflops]
    2004: Red Storm: ~11600-processor Opteron-based MPP [>40 Tflops]

  • Our rubric (since 1987)
    Complex, mission-critical engineering & science applications
    Large systems (1000s of PEs) with a few processors per node
    Message-passing paradigm
    Balanced architecture
    Use commodity wherever possible
    Efficient systems software
    Emphasis on scalability & reliability in all aspects
    Critical advances in parallel algorithms
    Vertical integration of technologies

  • A partitioned, scalable computing architecture

  • Computing domains at Sandia
    Red Storm is targeting the highest-end market but has real advantages for the mid-range market (from 1 cabinet on up).

    Domain                      # Procs:  1     10    10^2   10^3   10^4
    Red Storm                                          X      X      X
    Cplant Linux Supercluster         X     X     X
    Beowulf clusters                  X     X     X
    Desktop                           X

  • Red Storm Architecture
    True MPP, designed to be a single system -- not a cluster
    Distributed memory MIMD parallel supercomputer
    Fully connected 3D mesh interconnect; each compute node processor has a bi-directional connection to the primary communication network
    108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0--2.4 GHz)
    ~10 or 20 TB of DDR memory @ 333 MHz
    Red/Black switching: ~1/4, ~1/2, ~1/4 (for data security)
    12 Service, Visualization, and I/O cabinets on each end (640 S, V & I processors for each color)
    240 TB of disk storage (120 TB per color) initially

  • Red Storm Architecture
    Functional hardware partitioning: service and I/O nodes, compute nodes, visualization nodes, and RAS nodes
    Partitioned operating system (OS): Linux on service, visualization, and I/O nodes; LWK (Catamount) on compute nodes; Linux on RAS nodes
    Separate RAS and system management network (Ethernet)
    Router table-based routing in the interconnect
    Less than 2 MW total power and cooling
    Less than 3,000 ft² of floor space

  • Usage Model
    Unix (Linux) login node with Unix environment
    Batch processing or compute resource
    I/O
    User sees a coherent, single system

  • Thor's Hammer Topology

    3D-mesh compute node topology: 27 x 16 x 24 (x, y, z)
    Red/Black split: 2,688 | 4,992 | 2,688
    Service, Visualization, and I/O partitions: 3 cabinets on each end of each row
    384 full-bandwidth links to the compute node mesh
    Not all nodes have a processor -- all have routers
    256 PEs in each Visualization partition -- 2 per board
    256 PEs in each I/O partition -- 2 per board
    128 PEs in each Service partition -- 4 per board

  • 3-D mesh topology (Z direction is a torus)
    10,368-node compute mesh: X = 27, Y = 16, Z = 24, with a torus interconnect in Z
    640 Visualization, Service & I/O nodes on each end
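
    As a concrete illustration of the topology above (a sketch added here, not taken from the presentation), the snippet below lists a node's network links in a 27 x 16 x 24 grid that is an open mesh in X and Y and a torus in Z; the function and coordinates are illustrative only.

# Illustrative sketch: six-neighbor lookup in a 27 x 16 x 24 grid that is an
# open mesh in X and Y and a torus (wraparound) in Z, as described above.
DIMS = (27, 16, 24)  # (X, Y, Z)

def neighbors(x, y, z):
    """Yield coordinates of the up-to-six nodes this node is wired to."""
    nx, ny, nz = DIMS
    for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)):
        xx, yy, zz = x + dx, y + dy, z + dz
        if 0 <= xx < nx and 0 <= yy < ny:   # X and Y edges are not wired around
            yield (xx, yy, zz % nz)         # Z wraps, giving the torus direction

print(list(neighbors(0, 0, 0)))         # corner node: only 4 links (no X/Y wraparound)
print(len(list(neighbors(13, 8, 12))))  # interior node: all 6 links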

  • Thor's Hammer Network Chips

    The 3D mesh is created by the SEASTAR ASIC: a HyperTransport interface and 6 network router ports on each chip
    In compute partitions, each processor has its own SEASTAR
    In the service partition, some boards are configured like the compute partition (4 PEs per board); others have only 2 PEs per board but still have 4 SEASTARs, so the network topology is uniform
    SEASTAR was designed by Cray to our specs and fabricated by IBM
    The only truly custom part in Red Storm -- it complies with the HyperTransport open standard

  • Node architecture
    CPU: AMD Opteron
    DRAM: 1 (or 2) GByte or more
    Six links to other nodes in X, Y, and Z
    ASIC = Application-Specific Integrated Circuit, i.e., a custom chip

  • System Layout (27 x 16 x 24 mesh)
    Normally Unclassified | Switchable Nodes | Normally Classified, with disconnect cabinets

  • Thor's Hammer Cabinet Layout
    Compute node partition: 3 card cages per cabinet, 8 boards per card cage, 4 processors per board, 4 NIC/router chips per board, N + 1 power supplies, passive backplane (96 PEs per cabinet)
    Service, Viz, and I/O node partition: 2 (or 3) card cages per cabinet, 8 boards per card cage, 2 (or 4) processors per board, 4 NIC/router chips per board, 2-PE I/O boards have 4 PCI-X busses, N + 1 power supplies, passive backplane

  • Performance
    Peak of 41.4 (46.6) TF based on 2 floating-point instruction issues per clock at 2.0 GHz
    We required a 7-fold speedup versus ASCI Red, but based on our benchmarks we expect performance 8--10 times faster than ASCI Red
    Expected MP-Linpack performance: ~30--35 TF
    Aggregate system memory bandwidth: ~55 TB/s
    Interconnect performance: latency
  • Performance
    I/O system performance: sustained file system bandwidth of 50 GB/s for each color; sustained external network bandwidth of 25 GB/s for each color
    Node memory system: page-miss latency to local memory is ~80 ns; peak bandwidth of ~5.4 GB/s for each processor
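
    As a quick consistency check of the headline figures in the two Performance bullets (my arithmetic, not from the slides): the 41.4 TF peak follows from the 10,368 compute processors at 2 flops/clock and 2.0 GHz, the 46.6 TF figure is consistent with also counting the 1,280 service, visualization, and I/O PEs (640 per color), and the ~55 TB/s aggregate memory bandwidth follows from ~5.4 GB/s per processor.

\[
\begin{aligned}
10{,}368 \times 2~\text{flops/clock} \times 2.0~\text{GHz} &\approx 41.5~\text{TF} \\
(10{,}368 + 1{,}280) \times 2 \times 2.0~\text{GHz} &\approx 46.6~\text{TF} \\
10{,}368 \times 5.4~\text{GB/s} &\approx 56~\text{TB/s}
\end{aligned}
\]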

  • Red Storm System Software
    Operating systems: Linux on service and I/O nodes; Sandia's LWK (Catamount) on compute nodes; Linux on RAS nodes
    Run-time system: logarithmic loader; fast, efficient node allocator; batch system (PBS)
    Libraries: MPI, I/O, math
    File systems being considered include: PVFS (interim file system), Lustre (design intent), Panasas (possible alternative)

  • Red Storm System Software
    Tools: all IA32 compilers, all AMD 64-bit compilers (Fortran, C, C++)
    Debugger: TotalView (also examining alternatives)
    Performance tools (was going to be Vampir until Intel bought Pallas -- now?)
    System management and administration
    Accounting
    RAS GUI interface

  • Comparison of ASCI Red and Red Storm

  • Comparison of ASCI Red and Red Storm

  • Red Storm Project
    23 months, design to First Product Shipment!
    System software is a joint project between Cray and Sandia: Sandia is supplying the Catamount LWK and the service node run-time system; Cray is responsible for Linux, the NIC software interface, RAS software, file system software, and the TotalView port
    Initial software development was done on a cluster of workstations with a commodity interconnect; the second stage involves an FPGA implementation of the SEASTAR NIC/router (Starfish); final checkout is on the real SEASTAR-based system
    System design is going on now: cabinets exist; the SEASTAR NIC/router was released to fabrication at IBM earlier this month
    Full system to be installed and turned over to Sandia in stages, culminating in August--September 2004

  • New Building for Thor's Hammer

  • Designing for scalable supercomputing
    Challenges in: design, integration, management, and use

  • SUREty for Very Large Parallel Computer Systems

    Scalability - Full System Hardware and System Software

    Usability - Required Functionality Only

    Reliability - Hardware and System Software

    Expense minimization - use commodity, high-volume parts

    SURE poses Computer System Requirements:

  • SURE Architectural tradeoffs:
    Processor and memory sub-system balance
    Compute vs. interconnect balance
    Topology choices
    Software choices
    RAS
    Commodity vs. custom technology
    Geometry and mechanical design

  • Sandia Strategies:
    - Build on commodity
    - Leverage open source (e.g., Linux)
    - Add to commodity selectively (in Red Storm there is basically one truly custom part!)
    - Leverage experience with previous scalable supercomputers

  • System Scalability Driven Requirements
    Overall system scalability - complex scientific applications such as molecular dynamics, hydrodynamics & radiation transport should achieve scaled parallel efficiencies greater than 50% on the full system (~20,000 processors).


  • Scalability: System Software
    System software performance scales nearly perfectly with the number of processors to the full size of the computer (~30,000 processors). This means that system software time (overhead) remains nearly constant with the size of the system, or scales at most logarithmically with the system size.

    - Full re-boot time scales logarithmically with the system size.
    - Job loading is logarithmic with the number of processors (see the sketch after this list).
    - Parallel I/O performance is not sensitive to the number of PEs doing I/O.
    - Communication network software must be scalable.
    - No connection-based protocols among compute nodes.
    - Message buffer space is independent of the number of processors.
    - The compute node OS gets out of the way of the application.
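
    To make the "logarithmic" requirement concrete, here is a minimal sketch (illustrative only, not Sandia's or Cray's actual loader; the function name is made up) of distributing a job image by recursive-doubling fan-out, so the number of forwarding rounds grows as log2 of the node count rather than linearly:

# Illustrative sketch of a logarithmic (tree fan-out) job load. Node 0 holds
# the executable; in round k every node that already has it forwards to the
# node 2**k away, so ceil(log2(N)) rounds reach all N nodes.
import math

def fanout_rounds(num_nodes):
    """Return, per round, the (sender, receiver) pairs of the broadcast."""
    rounds, have, step = [], {0}, 1
    while len(have) < num_nodes:
        sends = [(src, src + step) for src in sorted(have) if src + step < num_nodes]
        have.update(dst for _, dst in sends)
        rounds.append(sends)
        step *= 2
    return rounds

n = 10368  # Red Storm compute-node count
print(len(fanout_rounds(n)), "rounds;", "ceil(log2(n)) =", math.ceil(math.log2(n)))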

  • Hardware scalability
    Balance in the node hardware:
    - Memory BW must match CPU speed -- ideally 24 Bytes/flop (never yet done)
    - Communications speed must match CPU speed
    - I/O must match CPU speeds
    Scalable system SW (OS and libraries)
    Scalable applications
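
    One way to read the 24 Bytes/flop ideal (my interpretation; the slide does not spell it out): a simple vector add performs one flop per element but moves three 8-byte operands,

\[
a_i = b_i + c_i:\qquad 2 \times 8~\text{B (loads)} + 8~\text{B (store)} = 24~\text{Bytes per flop}.
\]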

  • SW Scalability: Kernel tradeoffs
    Unix/Linux has more functionality, but impairs efficiency on the compute partition at scale.
    Say breaks are uncorrelated, last 50 μs, and occur once per second on each CPU:
    - On 100 CPUs, wasted time is 5 ms every second -- a negligible 0.5% impact
    - On 1,000 CPUs, wasted time is 50 ms every second -- a moderate 5% impact
    - On 10,000 CPUs, wasted time is 500 ms every second -- a significant 50% impact
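
    The arithmetic behind those impact figures, as a minimal sketch (assuming, as the slide does, that each OS interruption effectively serializes across a tightly synchronized job; the function name is illustrative):

# Wasted fraction of each second if every CPU takes one uncorrelated
# 50-microsecond OS interruption per second and the application must wait
# for all of them (the slide's noise model).
def wasted_fraction(num_cpus, break_s=50e-6, rate_hz=1.0):
    return min(1.0, num_cpus * break_s * rate_hz)

for n in (100, 1_000, 10_000):
    print(f"{n:>6} CPUs: {wasted_fraction(n):.1%} of each second lost")
# -> 0.5%, 5.0%, 50.0%, i.e. the 5 ms / 50 ms / 500 ms per second quoted above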

  • Usability
    Application code support: software that supports scalability of the computer system
    - Math libraries
    - MPI support for full system size
    - Parallel I/O library
    - Compilers
    Tools that scale to the full size of the computer system
    - Debuggers
    - Performance monitors
    Full-featured Linux OS support at the user interface

  • Reliability
    Light Weight Kernel (LWK) OS on the compute partition -- much less code fails much less often
    Monitoring of correctable errors -- fix soft errors before they become hard
    Hot swapping of components -- the overall system keeps running during maintenance
    Redundant power supplies & memories
    A completely independent RAS system monitors virtually every component in the system

  • Economy

    Use high-volume parts where possible
    Minimize power requirements -- cuts operating costs and reduces the need for new capital investment
    Minimize system volume -- reduces the need for large new capital facilities
    Use standard manufacturing processes where possible -- minimize customization
    Maximize reliability and availability per dollar
    Maximize scalability per dollar
    Design for integrability

  • Economy
    Red Storm leverages economies of scale:
    - AMD Opteron microprocessor & standard memory
    - Air cooled
    - Electrical interconnect based on InfiniBand physical devices
    - Linux operating system
    Selected use of custom components:
    - System chip ASIC -- critical for communication-intensive applications
    - Light Weight Kernel -- truly custom, but we already have it (4th generation)

  • Cplant on a slide
    Goal: MPP look and feel
    Start ~1997, upgrades ~1999--2001
    Alpha & Myrinet, mesh topology
    ~3000 procs (3 Tf) in 7 systems
    Configurable to ~1700 procs
    Red/Black switching
    Linux w/ custom runtime & mgmt.
    Production operation for several yrs.

  • IA-32 Cplant on a slide
    Goal: mid-range capacity
    Started 2003, upgraded annually
    Pentium-4 & Myrinet, Clos network
    1280 procs (~7 Tf) in 3 systems
    Currently configurable to 512 procs
    Linux w/ custom runtime & mgmt.
    Production operation for several yrs.

  • Observation: for most large scientific and engineering applications, performance is determined more by parallel scalability and less by the speed of individual CPUs.

    There must be balance between processor, interconnect, and I/O performance to achieve overall performance.

    To date, only a few tightly-coupled, parallel computer systems have been able to demonstrate a high level of scalability on a broad set of scientific and engineering applications.

  • Let's Compare Balance in Parallel Systems

  • Comparing Red Storm and BGL

                        Blue Gene Light**    Red Storm*
    Node speed          5.6 GF               5.6 GF        (1x)
    Node memory         0.25--0.5 GB         2 (1--8) GB   (4x nom.)
    Network latency     7 μsec               2 μsec        (2/7x)
    Network BW          0.28 GB/s            6.0 GB/s      (22x)
    BW Bytes/Flop       0.05                 1.1           (22x)
    Bi-section B/F      0.0016               0.038         (24x)
    # nodes/problem     40,000               10,000        (1/4x)

    * 100 TF version of Red Storm    ** 360 TF version of BGL
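
    The derived rows follow from the raw ones; for instance, the bandwidth-per-flop entries are just link bandwidth divided by node speed (my arithmetic, not from the slide):

\[
\text{BGL: } \frac{0.28~\text{GB/s}}{5.6~\text{GF/s}} = 0.05~\text{B/flop},
\qquad
\text{Red Storm: } \frac{6.0~\text{GB/s}}{5.6~\text{GF/s}} \approx 1.1~\text{B/flop} \quad (\approx 22\times).
\]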

  • Fixed problem performance: molecular dynamics problem (LJ liquid)

  • Parallel Sn Neutronics (provided by LANL)

  • Scalable computing works

    [Embedded chart data: "Janus: ASCI Red scaled parallel efficiencies for major codes" -- scaled parallel efficiency (%) vs. number of processors for QS-Particles, QS-Fields-Only, QS-1B Cells, Rad x-port (17M, 80M, 168M, 532M, and 1B cells), Finite Element (Crash), Zapotec, Reactive Fluid Flow, Salinas, and CTH. The underlying workbook also carried the raw Quicksilver (scaled-size particle, fields-only, and load-balance runs), billion-atom LAMMPS-style molecular dynamics, CTH grind-time, Zapotec tungsten-rod impact, Salinas, and radiation-transport timing tables on Tflop (Janus), Alaska, Siberia, and Cplant behind the plots.]

  • Balance is critical to scalability (Peak vs. Linpack)

    [Embedded chart data: "Basic Parallel Efficiency Model" -- parallel efficiency vs. communication/computation load f = comm_time/calc_time for machines of differing balance: Red Storm (B=1.5), ASCI Red (B=1.2), reference machine (B=1.0), Earth Simulator (B=0.4), Cplant (B=0.25), Blue Gene Light (B=0.05), and a standard Linux cluster (B=0.04). The source workbook also tabulated node speed, link bandwidth, and balance (Bytes/flop) for the Cray T3E, ASCI Red, ASCI Blue Mountain/Pacific, ASCI Q, ASCI White, the Earth Simulator, Red Storm, Cplant, Vplant, and several hypothetical Opteron/Xeon clusters, plus per-code efficiency estimates (Presto, Salinas, LAMMPS, CTH, DSMC, Ceptre, ITS) used to weight the model.]
  • Relating scalability and cost

    Efficiency ratio = (ASCI Red efficiency) / (Cplant efficiency); cost ratio = 1.8. Above the cost ratio the MPP is more cost effective; below it the cluster is. The plotted curve is the average efficiency ratio over the five codes that consume >80% of Sandia's cycles.

    Chart1

    11

    1.04173162422

    1.03704275714

    1.12633268268

    1.122583230916

    1.169569253432

    1.296943655764

    1.4160002988128

    1.7174488073256

    2.2598735787512

    3.05108011563.0510801156

    20483.9

    40965

    Eff. Ratio

    Extrapolation

    Processors

    Efficiency ratio (Red/Cplant)

    combined

    Sum

    Flat1.01.01.01.01.01.06.00

    Current0.340.160.130.100.070.0010.81

    Future0.150.100.100.100.100.150.70

    Future combined ratio

    0.150.100.100.100.100.150.70

    ProcsACMESalinasLAMMPSDSMCCTHITSEff. RatioEff. Ratio

    10.150.100.100.100.100.151.00

    20.180.100.110.100.100.151.04

    40.160.100.110.100.110.151.04

    80.190.100.120.100.110.171.13

    160.200.100.120.100.120.141.12

    320.220.110.120.100.120.151.17

    640.260.130.120.110.130.161.30

    1280.290.160.120.120.130.161.42

    2560.360.250.120.130.150.191.72

    5120.470.450.150.130.140.242.26

    10240.530.910.130.130.150.293.053.05

    20483.90

    40965.00

    Current combined ratio

    0.340.160.130.100.070.0010.81

    ProcsACMESalinasLAMMPSDSMCCTHITSEff. RatioEff. Ratio

    10.340.160.130.100.070.001.00

    20.400.160.140.100.080.001.08

    40.380.160.140.100.080.001.06

    80.430.160.150.100.080.001.14

    160.460.160.150.100.090.001.19

    320.510.170.150.100.090.001.27

    640.590.200.160.110.090.001.43

    1280.660.260.160.120.100.001.61

    2560.820.400.160.140.110.002.00

    5121.070.710.190.140.100.002.74

    10241.211.430.170.140.110.003.783.78

    20484.90

    40966.20

    Efficiency ratiosProcsACMESalinasLAMMPSDSMCCTHITSAverageAverage

    11.001.001.001.001.001.001.00

    21.171.001.071.001.020.971.04

    41.091.011.091.001.050.981.04

    81.241.011.161.001.101.171.11

    161.331.021.181.001.220.971.12

    321.481.061.171.001.241.001.16

    641.711.291.241.101.261.091.28

    1281.931.641.231.201.321.091.40

    2562.372.511.231.301.471.301.70

    5123.124.501.511.301.401.622.24

    10243.529.061.331.301.471.943.103.1

    20484.10

    40965.30

    1.812.281.201.111.231.19

    combined

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    Eff. Ratio

    Extrapolation

    Processors

    Efficiency ratio (Red/Cplant)

    Future portfolio

    ACME

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    Eff. Ratio

    Extrapolation

    Processors

    Efficiency ratio (Red/Cplant)

    Current portfolio

    Salinas

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    Eff. Ratio

    Extrapolation

    Processors

    Efficiency ratio (Red/Cplant)

    Average portfolio

    LAMMPS

    ProblemProcsASCI RedCplant

    Scaled problem11.1320.56

    CPU time (sec.)21.5760.911

    41.9741.068

    82.2281.364

    162.431.594

    322.4821.822

    642.6032.200

    1282.762.629

    2562.8263.320

    5122.9934.614

    10243.1395.464

    ProblemProcsASCI RedCplantVplantRatio

    Scaled problem11.001.001.0001.00

    Efficiency20.720.610.5851.17

    40.570.520.4371.09

    80.510.410.3721.24

    160.470.350.3281.33

    320.460.310.3021.48

    640.430.250.2691.71

    1280.410.210.2651.93

    2560.400.170.2432.37

    5120.380.123.12

    10240.360.103.52

    leland:From very limited Bartel data, so made very conservative assumption by "hand"

    leland:Courtenay's driver routine around ACME

    leland:Ross

    leland:Ross

    LAMMPS

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled parallel efficiency

    ACME brick contact problem

    DSMC

    ASCI RedSecsEff.

    1511.00

    8670.76

    27780.65

    64810.63

    125890.57

    216920.55

    3431020.50

    5121110.46

    7291200.43

    10001260.40

    13311620.31

    17281810.28

    21972200.23

    27442650.19

    CplantSecsEff.

    1241.00

    8320.75

    27380.63

    64490.49

    125680.35

    2161000.24

    3431530.16

    5122350.10

    7193490.07

    10005220.05

    VplantSecsEff.

    1211.00

    8340.62

    27370.57

    64400.53

    125410.51

    216420.50

    343420.50

    512480.44

    719-

    1024-

    ProcsRed Eff.Cplant Eff.

    11.001.00

    20.970.96

    40.900.89

    80.760.75

    160.720.70

    270.650.63

    320.650.61

    640.630.49

    1250.570.35

    1280.570.35

    2160.550.24

    2560.540.21

    3430.500.16

    5120.460.10

    7290.430.07

    10000.400.05

    10240.400.04

    13310.31

    ProcsRed Eff.Cplant Eff.Ratio

    11.001.001.00

    20.970.961.00

    40.900.891.01

    80.760.751.01

    160.720.701.02

    320.650.611.06

    640.630.491.29

    1280.570.351.64

    2560.540.212.51

    5120.460.104.50

    10240.400.0449.06

    leland:Courtenay's driver routine around ACME

    ASCI Red

    Cplant

    Vplant

    Processors

    Scaled parallel efficiency

    ACME brick contact problem

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    DSMC

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled parallel efficiency

    Salinas modal problem

    CTH

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled parallel efficiency

    Salinas modal problem

    ITS

    MembraneProcsASCI RedCplantRatio

    Scaled problem11.0001.0001.00

    Parallel efficiency21.0280.9641.07

    41.0050.9241.09

    80.9920.8531.16

    160.9950.8451.18

    320.9960.8531.17

    640.9940.8031.24

    1280.9920.8051.23

    2560.9890.8011.23

    5120.9850.6511.51

    10240.9620.7211.33

    ITS

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled parallel efficiency

    LAMMPS membrane problem (FFT rich)

    Sheet1

    ProcsASCI Red (model)Cplant (model)f Redf CplantRatio

    11.001.000.000.001.00

    20.740.740.360.361.00

    40.670.670.480.481.00

    80.640.640.560.561.00

    160.590.590.690.691.00

    320.590.590.690.691.00

    640.600.600.680.681.00

    1280.570.570.750.751.00

    2560.510.510.970.971.00

    5120.480.481.071.071.00

    10240.520.520.930.931.00

    Red bal.1.2

    Cplant bal.1.2

    Ratio1.0

    ASCI RedTimeEff

    0

    20

    40

    80

    2000.680.432

    4000.720.408

    640

    8000.740.397

    1000

    16000.690.426

    CplantTimeEff

    0

    204.680.326

    404.720.323

    804.870.313

    200

    400

    6404.960.307

    800

    1000

    ASCI RedTimeEffRatio

    01.3

    80

    2000.680.432

    4000.720.408

    640

    8000.740.397

    1000

    CplantTimeEff

    0

    804.870.313

    200

    400

    6404.960.307

    800

    1000

    leland:Bio problem with FFT's

    leland:Ross

    Sheet1

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red (model)

    Cplant (model)

    Processors

    Scaled parallel efficiency

    DSMC problem

    Presto_old

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled parallel efficiency

    DSMC plasma problem

    Salinas_old

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled parallel efficiency

    DSMC plasma problem

    DSMC_old

    ProcsASCI RedASCI Red (interp.)ASCI RedRatio

    11.001.001.001.00

    20.950.951.02

    40.840.840.841.05

    80.820.821.10

    160.780.780.781.22

    320.780.781.24

    640.770.770.771.26

    1280.740.741.32

    2560.690.690.691.47

    5120.620.621.40

    10240.520.521.47

    ProcsCplantCplant (interp.)Cplant

    11.001.001.00

    20.930.93

    40.800.800.80

    80.740.74

    160.640.640.64

    320.630.63

    640.610.610.61

    1280.560.56

    2560.470.470.47

    5120.440.44

    10240.350.35

    DSMC_old

    0000

    0000

    0000

    0000

    0000

    0000

    0000

    0000

    0000

    00

    00

    ASCI Red

    ASCI Red (interp.)

    Cplant

    Cplant (interp.)

    Processors

    Scaled parallel efficiency

    CTH - NEST shell problem

    CTH_old

    CPU timeScaled efficiency

    STARSATProcsASCI RedCplantASCI RedCplantRatio

    Scaled problem11701.4340.61.001.001.00

    21730.44336.180.981.010.97

    41687.4331.221.011.030.98

    81692.29394.681.010.861.17

    161693.52327.211.001.040.97

    321692.29337.861.011.011.00

    641695.06368.221.000.921.09

    1281695.43369.111.000.921.09

    2561704.42443.001.000.771.30

    5121735.67564.020.980.601.62

    10241802.00701.060.940.491.94

    12481632641282565121024

    10.98256120030.98684732320.80861088550.97972440940.96441028290.84070945950.8435593220.73255813950.5530.4090684932

    Vplant

    11.00

    20.98

    40.99

    80.81

    160.98

    320.96

    640.84

    1280.84

    2560.73

    5120.55

    10240.41

    leland:From Mahesh Rajan

    leland:Ross, mixture of head and center nodes

    leland:Ross, mixture of head and center nodes

    CTH_old

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    ASCI Red

    Cplant

    Processors

    Scaled Parallel Efficiency

    ITS Starsat problem

    000

    000

    000

    000

    000

    000

    000

    000

    000

    00

    00

    ASCI Red

    Cplant

    Vplant

    Processors

    Scaled Parallel Efficiency

    ITS Starsat problem

    CPU time (sec)Parallel efficiency

    GellerudProcsASCI RedLinuxASCI RedStd. Linux Cluster

    4k elements/proc15.50E+021.60E+021.001.00

    Presto46.04E+021.79E+020.910.89

    158.77E+022.60E+020.630.62

    601.06E+033.76E+020.520.43

    2401.36E+030.41

    CPU time (sec)Fraction of whole

    GellerudProcsASCI RedLinuxASCI RedStd. Linux Cluster

    4k elements/proc1

    ACME time45.66E+021.73E+020.940.96

    158.37E+022.53E+020.950.97

    601.02E+033.69E+020.960.98

    2401.31E+030.97

    leland:Arne's test problem (a drop problem?)

    leland:9100 Linux Cluster

    leland:Arne's test problem (a drop problem?)

    leland:9100 Linux Cluster

    00

    00

    00

    00

    00

    ASCI Red

    Std. Linux Cluster

    Processors

    Scaled parallel efficiency

    Presto drop problem

    Salinas "GB" scaled problem (Cplant runs on Ross). Worksheet notes flag the low-processor-count Cplant points as extrapolated ("Made up to achieve an efficiency that is reasonable on 4 procs (where the actual data starts)") and some intermediate points as interpolated for plotting.

    CPU time (s)
    Procs    ASCI Red    Cplant
    1        46.8        64.5
    2        47.0        --
    4        47.3        --
    8        47.8        --
    16       48.7        --
    32       50.7        --
    64       54.4        75
    128      55.8        --
    216      56.9        78
    512      59.1        80
    1000     63.6        85

    Parallel efficiency
    Procs    ASCI Red    Cplant    Ratio (Red/Cplant)
    1        1.00        1.00      1.00
    2        0.99        0.99      1.00
    4        0.99        0.99      1.00
    8        0.98        0.98      1.00
    16       0.96        0.96      1.00
    32       0.92        0.92      1.00
    64       0.86        0.86      1.00
    128      0.84        0.84      1.00
    216      0.82        0.83      0.99
    512      0.79        0.81      0.98
    1000     0.74        0.76      0.97

    [Chart: "Salinas GB problem": scaled parallel efficiency vs. processors; series ASCI Red, Cplant, plus extrapolated segments at small processor counts.]

    DSMC problem: modeled scaled parallel efficiencies, with balance parameters Red bal. = 0.80, Cplant bal. = 0.25, ratio = 3.2. Worksheet notes: "In rollup count this as 256 procs"; "In rollup, count this as 1024 procs".

    Procs    ASCI Red (model)   Cplant (model)   f Red   f Cplant   Ratio
    1        1.00               1.00             0.00    0.00       1.00
    2        0.74               0.47             0.36    1.15       1.58
    4        0.67               0.39             0.48    1.55       1.72
    8        0.64               0.36             0.56    1.80       1.79
    16       0.59               0.31             0.69    2.21       1.90
    32       0.59               0.31             0.69    2.22       1.90
    64       0.60               0.32             0.68    2.17       1.89
    128      0.57               0.29             0.75    2.41       1.95
    256      0.51               0.24             0.97    3.11       2.08
    512      0.48               0.23             1.07    3.43       2.14
    1024     0.52               0.25             0.93    2.96       2.06

    [Chart: "DSMC problem": scaled parallel efficiency vs. processors; series ASCI Red (model), Cplant (model).]

    CTH 2-gas problem: ASCI Red efficiency with modeled Cplant efficiency. The tabulated values, the f columns, and the balance parameters (Red bal. 0.80, Cplant bal. 0.25, ratio 3.2) are identical to those in the DSMC table above.

    [Chart: "CTH 2-gas problem": scaled parallel efficiency vs. processors; series ASCI Red, Cplant (model).]
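    The slides do not state the formula behind the modeled columns above; the sketch below is an inference of ours that reproduces them: treat the measured ASCI Red efficiency as implying an overhead fraction f = 1/eff - 1, scale that overhead by the balance ratio Red bal./Cplant bal. = 0.80/0.25 = 3.2, and convert back to efficiency with eff = 1/(1 + f).

    # Hedged reconstruction of the apparent balance model (not slide text).
    BALANCE_RATIO = 0.80 / 0.25          # 3.2, from the worksheet's "bal." rows

    def cplant_model(red_eff, balance_ratio=BALANCE_RATIO):
        """Predicted Cplant scaled efficiency, assuming overhead scales with balance."""
        f_red = 1.0 / red_eff - 1.0                   # overhead implied by Red's efficiency
        return 1.0 / (1.0 + balance_ratio * f_red)    # same overhead, scaled, back to efficiency

    red_model = [1.00, 0.74, 0.67, 0.64, 0.59, 0.59, 0.60, 0.57, 0.51, 0.48, 0.52]
    print([round(cplant_model(e), 2) for e in red_model])
    # [1.0, 0.47, 0.39, 0.36, 0.31, 0.31, 0.32, 0.29, 0.25, 0.22, 0.25]
    # matches the Cplant (model) column to within a rounding step.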

  • Scalability determines cost effectiveness. Measured against Sandia's top-priority computing workload, the MPP/cluster crossover falls near 256 processors: the MPP is more cost effective for the roughly 380M node-hrs of that workload above the crossover, the cluster for the roughly 55M node-hrs below it.

  • Scalability also limits capability: matching the MPP's delivered performance with the less scalable cluster takes roughly 3x as many processors.
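    One way to read the cost-effectiveness claim, sketched below with an entirely hypothetical price premium (the 1.8x figure is ours, not from the slides) and the current-portfolio Red/Cplant efficiency-ratio curve tabulated later in this section: delivered work per unit cost scales as efficiency divided by price, so the MPP becomes the better buy wherever its efficiency advantage exceeds its price premium.

    # Illustrative only: PRICE_PREMIUM is a made-up placeholder, not a Sandia figure.
    PRICE_PREMIUM = 1.8   # assumed cost of an MPP node-hour relative to a cluster node-hour

    portfolio_ratio = {   # Red/Cplant efficiency ratio, current workload mix (see table below)
        1: 1.00, 2: 1.08, 4: 1.06, 8: 1.14, 16: 1.19, 32: 1.27,
        64: 1.43, 128: 1.61, 256: 2.00, 512: 2.74, 1024: 3.78,
    }

    for procs, ratio in portfolio_ratio.items():
        better = "MPP" if ratio > PRICE_PREMIUM else "cluster"
        print(f"{procs:5d} procs: efficiency ratio {ratio:.2f} -> {better} more cost effective")
    # With this placeholder premium the crossover lands between 128 and 256 processors,
    # qualitatively consistent with the ~256-processor crossover on the slide.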

    Chart1: "ITS Speedup curves". Speedup here is processors times the scaled parallel efficiency from the STARSAT table above.

    Procs    Red speedup    Cplant speedup
    64       64.2           59.4
    128      128.5          118.5
    256      255.5          196.5
    512      501.9          303.1
    1024     966.8          469.7

    [Chart: "ITS Speedup curves": speedup vs. processors; series Red Speedup, Cplant Speedup.]

    Worksheet "combined": workload-weighted Red/Cplant efficiency ratios.

    Workload weights by code
    Mix       ACME   Salinas   LAMMPS   DSMC   CTH    ITS     Sum
    Flat      1.0    1.0       1.0      1.0    1.0    1.0     6.00
    Current   0.34   0.16      0.13     0.10   0.07   0.001   0.81
    Future    0.15   0.10      0.10     0.10   0.10   0.15    0.70

    Future workload mix: weighted contributions (weight x per-code ratio) and combined ratio
    Procs    ACME   Salinas   LAMMPS   DSMC   CTH    ITS    Eff. Ratio
    1        0.15   0.10      0.10     0.10   0.10   0.15   1.00
    2        0.18   0.10      0.11     0.10   0.10   0.15   1.04
    4        0.16   0.10      0.11     0.10   0.11   0.15   1.04
    8        0.19   0.10      0.12     0.10   0.11   0.17   1.13
    16       0.20   0.10      0.12     0.10   0.12   0.14   1.12
    32       0.22   0.11      0.12     0.10   0.12   0.15   1.17
    64       0.26   0.13      0.12     0.11   0.13   0.16   1.30
    128      0.29   0.16      0.12     0.12   0.13   0.16   1.42
    256      0.36   0.25      0.12     0.13   0.15   0.19   1.72
    512      0.47   0.45      0.15     0.13   0.14   0.24   2.26
    1024     0.53   0.91      0.13     0.13   0.15   0.29   3.05
    Extrapolated combined ratio: 3.90 at 2048 processors, 5.00 at 4096.

    Current workload mix: weighted contributions and combined ratio
    Procs    ACME   Salinas   LAMMPS   DSMC   CTH    ITS    Eff. Ratio
    1        0.34   0.16      0.13     0.10   0.07   0.00   1.00
    2        0.40   0.16      0.14     0.10   0.08   0.00   1.08
    4        0.38   0.16      0.14     0.10   0.08   0.00   1.06
    8        0.43   0.16      0.15     0.10   0.08   0.00   1.14
    16       0.46   0.16      0.15     0.10   0.09   0.00   1.19
    32       0.51   0.17      0.15     0.10   0.09   0.00   1.27
    64       0.59   0.20      0.16     0.11   0.09   0.00   1.43
    128      0.66   0.26      0.16     0.12   0.10   0.00   1.61
    256      0.82   0.40      0.16     0.14   0.11   0.00   2.00
    512      1.07   0.71      0.19     0.14   0.10   0.00   2.74
    1024     1.21   1.43      0.17     0.14   0.11   0.00   3.78
    Extrapolated combined ratio: 4.90 at 2048 processors, 6.20 at 4096.

    Per-code efficiency ratios (ASCI Red efficiency / Cplant efficiency) and their row average
    Procs    ACME   Salinas   LAMMPS   DSMC   CTH    ITS    Average
    1        1.00   1.00      1.00     1.00   1.00   1.00   1.00
    2        1.17   1.00      1.07     1.00   1.02   0.97   1.04
    4        1.09   1.01      1.09     1.00   1.05   0.98   1.04
    8        1.24   1.01      1.16     1.00   1.10   1.17   1.11
    16       1.33   1.02      1.18     1.00   1.22   0.97   1.12
    32       1.48   1.06      1.17     1.00   1.24   1.00   1.16
    64       1.71   1.29      1.24     1.10   1.26   1.09   1.28
    128      1.93   1.64      1.23     1.20   1.32   1.09   1.40
    256      2.37   2.51      1.23     1.30   1.47   1.30   1.70
    512      3.12   4.50      1.51     1.30   1.40   1.62   2.24
    1024     3.52   9.06      1.33     1.30   1.47   1.94   3.10
    Extrapolated average: 4.10 at 2048 processors, 5.30 at 4096.
    Per-code mean ratio over the tabulated processor counts: ACME 1.81, Salinas 2.28, LAMMPS 1.20, DSMC 1.11, CTH 1.23, ITS 1.19.
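    A short sketch (ours, not slide content) of how the combined "Eff. Ratio" column follows from the per-code ratios and the workload weights: it is the weighted arithmetic mean of the per-code Red/Cplant ratios, normalized by the sum of the weights.

    # Combined efficiency ratio at 1024 processors, from the tables above.
    ratio_1024 = {"ACME": 3.52, "Salinas": 9.06, "LAMMPS": 1.33,
                  "DSMC": 1.30, "CTH": 1.47, "ITS": 1.94}
    current_mix = {"ACME": 0.34, "Salinas": 0.16, "LAMMPS": 0.13,
                   "DSMC": 0.10, "CTH": 0.07, "ITS": 0.001}
    future_mix = {"ACME": 0.15, "Salinas": 0.10, "LAMMPS": 0.10,
                  "DSMC": 0.10, "CTH": 0.10, "ITS": 0.15}

    def combined_ratio(weights, ratios):
        """Workload-weighted mean of the per-code efficiency ratios."""
        return sum(weights[c] * ratios[c] for c in weights) / sum(weights.values())

    print(round(combined_ratio(current_mix, ratio_1024), 2))  # ~3.81 here; the worksheet,
                                                              # carrying more digits, shows 3.78
    print(round(combined_ratio(future_mix, ratio_1024), 2))   # 3.05, matching the worksheet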

    [Charts: "Current portfolio", "Future portfolio", and "Average portfolio": efficiency ratio (Red/Cplant) vs. processors, each with an extrapolation series beyond 1024 processors, plotted from the tables above.]

    ACME brick contact problem, scaled. Worksheet notes: "From very limited Bartel data, so made very conservative assumption by 'hand'"; "Courtenay's driver routine around ACME"; "Ross".

    Procs    Red CPU (s)   Cplant CPU (s)   Red eff.   Cplant eff.   Vplant eff.   Ratio (Red/Cplant)
    1        1.132         0.56             1.00       1.00          1.000         1.00
    2        1.576         0.911            0.72       0.61          0.585         1.17
    4        1.974         1.068            0.57       0.52          0.437         1.09
    8        2.228         1.364            0.51       0.41          0.372         1.24
    16       2.43          1.594            0.47       0.35          0.328         1.33
    32       2.482         1.822            0.46       0.31          0.302         1.48
    64       2.603         2.200            0.43       0.25          0.269         1.71
    128      2.76          2.629            0.41       0.21          0.265         1.93
    256      2.826         3.320            0.40       0.17          0.243         2.37
    512      2.993         4.614            0.38       0.12          --            3.12
    1024     3.139         5.464            0.36       0.10          --            3.52

    [Chart: "ACME brick contact problem": scaled parallel efficiency vs. processors; series ASCI Red, Cplant.]

    Raw timings (seconds) with scaled efficiency, by processor count (worksheet note: "Courtenay's driver routine around ACME"). The attached chart is titled "ACME brick contact problem"; these efficiencies also reappear as the Salinas column in the code-comparison summary later in this section.

    ASCI Red:  1: 51 (1.00), 8: 67 (0.76), 27: 78 (0.65), 64: 81 (0.63), 125: 89 (0.57), 216: 92 (0.55), 343: 102 (0.50), 512: 111 (0.46), 729: 120 (0.43), 1000: 126 (0.40), 1331: 162 (0.31), 1728: 181 (0.28), 2197: 220 (0.23), 2744: 265 (0.19)
    Cplant:    1: 24 (1.00), 8: 32 (0.75), 27: 38 (0.63), 64: 49 (0.49), 125: 68 (0.35), 216: 100 (0.24), 343: 153 (0.16), 512: 235 (0.10), 719: 349 (0.07), 1000: 522 (0.05)
    Vplant:    1: 21 (1.00), 8: 34 (0.62), 27: 37 (0.57), 64: 40 (0.53), 125: 41 (0.51), 216: 42 (0.50), 343: 42 (0.50), 512: 48 (0.44)

    Efficiencies at powers of two (interpolated from the raw points where needed), with the Red/Cplant ratio:
    Procs    Red eff.   Cplant eff.   Ratio
    1        1.00       1.00          1.00
    2        0.97       0.96          1.00
    4        0.90       0.89          1.01
    8        0.76       0.75          1.01
    16       0.72       0.70          1.02
    32       0.65       0.61          1.06
    64       0.63       0.49          1.29
    128      0.57       0.35          1.64
    256      0.54       0.21          2.51
    512      0.46       0.10          4.50
    1024     0.40       0.04          9.06

    [Chart: "ACME brick contact problem": scaled parallel efficiency vs. processors; series ASCI Red, Cplant, Vplant.]


    [Charts: two "Salinas modal problem" charts: scaled parallel efficiency vs. processors; series ASCI Red, Cplant.]

    LAMMPS membrane problem (FFT rich), scaled parallel efficiency:

    Procs    ASCI Red   Cplant   Ratio (Red/Cplant)
    1        1.000      1.000    1.00
    2        1.028      0.964    1.07
    4        1.005      0.924    1.09
    8        0.992      0.853    1.16
    16       0.995      0.845    1.18
    32       0.996      0.853    1.17
    64       0.994      0.803    1.24
    128      0.992      0.805    1.23
    256      0.989      0.801    1.23
    512      0.985      0.651    1.51
    1024     0.962      0.721    1.33

    [Chart: "LAMMPS membrane problem (FFT rich)": scaled parallel efficiency vs. processors; series ASCI Red, Cplant.]

    Worksheet Sheet1: a second copy of the DSMC efficiency model, here with equal balance factors (Red bal. = Cplant bal. = 1.2, ratio 1.0), so the modeled ASCI Red and Cplant columns coincide (same f values as the DSMC table above, ratio 1.00 at every processor count). The sheet also carries sparse measured timings (worksheet notes: "Bio problem with FFT's"; "Ross"):

    ASCI Red:  200 procs: time 0.68 (eff. 0.432), 400: 0.72 (0.408), 800: 0.74 (0.397), 1600: 0.69 (0.426)
    Cplant:    20 procs: time 4.68 (eff. 0.326), 40: 4.72 (0.323), 80: 4.87 (0.313), 640: 4.96 (0.307)
    (a companion block repeats subsets of these rows with an overall ratio figure of 1.3)

    [Chart: "DSMC problem": scaled parallel efficiency vs. processors; series ASCI Red (model), Cplant (model).]










  • Commodity nearly everywhere -- customization drives cost
    The Earth Simulator and the Cray X-1 are fully custom vector systems with good balance; that is what drives their high cost (and their high performance).
    Clusters are built almost entirely from high-volume parts, with nothing truly custom, which drives their low cost (and their low scalability).
    Red Storm uses custom parts only where they are critical to performance and reliability: high scalability at minimal cost/performance.

  • Scaling data for some key engineering codes:
    Random variation at small processor counts.
    Large differential in efficiency at large processor counts.

    [Chart: "Performance on Engineering Codes": scaled parallel efficiency vs. processors for ITS and ACME on ASCI Red and Cplant, plotted from the STARSAT and ACME brick-contact tables above.]


    Consolidated worksheet: scaled parallel efficiency by code, the data behind the "Performance on Engineering Codes" and "Efficiency ratios" charts. Worksheet note: "From very limited Bartel data, so made very conservative assumption by 'hand'".

    ASCI Red efficiency
    Procs    CTH    Salinas   ACME   ITS    LAMMPS
    1        1.00   1.00      1.00   1.00   1.00
    2        0.95   0.97      0.72   0.98   1.03
    4        0.84   0.90      0.57   1.01   1.01
    8        0.82   0.76      0.51   1.01   0.99
    16       0.78   0.72      0.47   1.00   1.00
    32       0.78   0.65      0.46   1.01   1.00
    64       0.77   0.63      0.43   1.00   0.99
    128      0.74   0.57      0.41   1.00   0.99
    256      0.69   0.54      0.40   1.00   0.99
    512      0.62   0.46      0.38   0.98   0.99
    1024     0.52   0.40      0.36   0.94   0.96

    Cplant efficiency
    Procs    CTH    Salinas   ACME   ITS    LAMMPS
    1        1.00   1.00      1.00   1.00   1.00
    2        0.93   0.96      0.61   1.01   0.96
    4        0.80   0.89      0.52   1.03   0.92
    8        0.74   0.75      0.41   0.86   0.85
    16       0.64   0.70      0.35   1.04   0.85
    32       0.63   0.61      0.31   1.01   0.85
    64       0.61   0.49      0.25   0.92   0.80
    128      0.56   0.35      0.21   0.92   0.81
    256      0.47   0.21      0.17   0.77   0.80
    512      0.44   0.10      0.12   0.60   0.65
    1024     0.35   0.04      0.10   0.49   0.72

    Efficiency ratio (Red/Cplant)
    Procs    Salinas   ACME   ITS    CTH    LAMMPS
    1        1.00      1.00   1.00   1.00   1.00
    2        1.00      1.17   0.97   1.02   1.07
    4        1.01      1.09   0.98   1.05   1.09
    8        1.01      1.24   1.17   1.10   1.16
    16       1.02      1.33   0.97   1.22   1.18
    32       1.06      1.48   1.00   1.24   1.17
    64       1.29      1.71   1.09   1.26   1.24
    128      1.64      1.93   1.09   1.32   1.23
    256      2.51      2.37   1.30   1.47   1.23
    512      4.50      3.12   1.62   1.40   1.51
    1024     9.06      3.52   1.94   1.47   1.33

    [Charts: "Efficiency ratios" (Red efficiency / Cplant efficiency vs. processors, one series per code) and "Performance on Engineering Codes" (scaled parallel efficiency vs. processors for CTH, Salinas, and ACME on both machines).]

    [Charts: two further "Performance on Engineering Codes" variants, pairing CTH/ACME and ITS/ACME efficiencies on ASCI Red and Cplant.]





