Bill Camp, Jim Tomkins & Rob Leland
Transcript of Bill Camp, Jim Tomkins & Rob Leland
-
Bill Camp, Jim Tomkins & Rob Leland
-
How Much Commodity is Enough? The Red Storm Architecture
William J. Camp, James L. Tomkins & Rob Leland, CCIM, Sandia National Laboratories, Albuquerque ([email protected])
-
Sandia MPPs (since 1987)
1987: 1024-processor nCUBE 10 [512 Mflops]
1988--1990: 16384-processor CM-200
1990--1992: two 1024-processor nCUBE-2 machines [2 @ 2 Gflops]
1991: 64-processor Intel iPSC-860
1993--1996: ~3700-processor Intel Paragon [180 Gflops]
1996--present: 9400-processor Intel TFLOPS (ASCI Red) [3.2 Tflops]
1997--present: 400 --> 2800 processors in the Cplant Linux Cluster [~3 Tflops]
2003: 1280-processor IA32 Linux cluster [~7 Tflops]
2004: Red Storm: ~11600-processor Opteron-based MPP [>40 Tflops]
-
Our rubric (since 1987):
Complex, mission-critical engineering & science applications
Large systems (1000s of PEs) with a few processors per node
Message-passing paradigm
Balanced architecture
Use commodity wherever possible
Efficient systems software
Emphasis on scalability & reliability in all aspects
Critical advances in parallel algorithms
Vertical integration of technologies
-
A partitioned, scalable computing architecture
-
Computing domains at Sandia
Red Storm is targeting the highest-end market but has real advantages for the mid-range market (from 1 cabinet on up).
Domains by processor count: Desktop: 1; Beowulf clusters: ~1--10^2; Cplant Linux Supercluster: ~10--10^3; Red Storm: ~10^2--10^4.
-
Red Storm Architecture
True MPP, designed to be a single system -- not a cluster
Distributed memory MIMD parallel supercomputer
Fully connected 3D mesh interconnect; each compute node processor has a bi-directional connection to the primary communication network
108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0--2.4 GHz)
~10 or 20 TB of DDR memory @ 333 MHz
Red/Black switching: ~1/4, ~1/2, ~1/4 (for data security)
12 Service, Visualization, and I/O cabinets on each end (640 S, V & I processors for each color)
240 TB of disk storage initially (120 TB per color)
-
Red Storm Architecture
Functional hardware partitioning: service and I/O nodes, compute nodes, visualization nodes, and RAS nodes
Partitioned Operating System (OS): LINUX on service, visualization, and I/O nodes; LWK (Catamount) on compute nodes; LINUX on RAS nodes
Separate RAS and system-management network (Ethernet)
Router-table-based routing in the interconnect
Less than 2 MW total power and cooling
Less than 3,000 ft^2 of floor space
-
Usage Model
Login node with a Unix (Linux) environment, feeding batch processing or the compute resource, with I/O services alongside. The user sees a coherent, single system.
-
Thor's Hammer Topology
3D-mesh compute node topology: 27 x 16 x 24 (x, y, z)
Red/Black split: 2,688 / 4,992 / 2,688
Service, Visualization, and I/O partitions: 3 cabinets on each end of each row
384 full-bandwidth links to the compute node mesh
Not all nodes have a processor, but all have routers
256 PEs in each Visualization partition (2 per board)
256 PEs in each I/O partition (2 per board)
128 PEs in each Service partition (4 per board)
-
3-D mesh topology (Z direction is a torus): 10,368-node compute mesh (X=27, Y=16, Z=24) with torus interconnect in Z, and 640 Visualization, Service & I/O nodes on each end.
-
Thor's Hammer Network Chips
The 3D mesh is created by the SEASTAR ASIC:
HyperTransport interface and 6 network router ports on each chip
In the compute partition, each processor has its own SEASTAR
In the service partition, some boards are configured like the compute partition (4 PEs per board); others have only 2 PEs per board but still have 4 SEASTARs, so the network topology is uniform
SEASTAR was designed by Cray to our specs and fabricated by IBM
It is the only truly custom part in Red Storm, and it complies with the HyperTransport open standard
-
Node architecture
CPU: AMD Opteron
DRAM: 1 (or 2) GB or more
Six links to other nodes in X, Y, and Z
(ASIC = Application-Specific Integrated Circuit, i.e., a custom chip)
-
System layout (27 x 16 x 24 mesh): normally-unclassified and normally-classified sections, with switchable nodes between them and disconnect cabinets separating the sections.
-
Thor's Hammer Cabinet Layout
Compute node partition (3 card cages x 8 boards x 4 processors = 96 PEs per cabinet):
3 card cages per cabinet
8 boards per card cage
4 processors per board
4 NIC/router chips per board
N+1 power supplies
Passive backplane
Service, Viz, and I/O node partition:
2 (or 3) card cages per cabinet
8 boards per card cage
2 (or 4) processors per board
4 NIC/router chips per board
2-PE I/O boards have 4 PCI-X buses
N+1 power supplies
Passive backplane
-
Performance
Peak of 41.4 (46.6) TF, based on 2 floating-point instruction issues per clock at 2.0 GHz. We required a 7-fold speedup versus ASCI Red, but based on our benchmarks we expect performance 8--10 times faster than ASCI Red.
Expected MP-Linpack performance: ~30--35 TF
Aggregate system memory bandwidth: ~55 TB/s
Interconnect performance: latency
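As a quick check, the peak figures above follow directly from the processor count, issue width, and clock. A minimal sketch, assuming (as the totals suggest) that the 46.6 TF figure also counts the 2 x 640 service/visualization/I/O processors:

    # Peak flops = processors x FP issues per clock x clock rate.
    GHZ = 2.0                    # clock rate from the slide
    ISSUES = 2                   # floating-point instruction issues per clock
    compute = 10368              # compute node processors
    total = compute + 640 + 640  # plus service/viz/I/O processors on both ends
    print(compute * ISSUES * GHZ / 1000)  # ~41.5 Tflops (slide quotes 41.4)
    print(total * ISSUES * GHZ / 1000)    # ~46.6 Tflops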
-
Performance
I/O system performance:
Sustained file system bandwidth of 50 GB/s for each color
Sustained external network bandwidth of 25 GB/s for each color
Node memory system:
Page-miss latency to local memory is ~80 ns
Peak bandwidth of ~5.4 GB/s for each processor
-
Red Storm System Software
Operating systems: LINUX on service and I/O nodes; Sandia's LWK (Catamount) on compute nodes; LINUX on RAS nodes
Run-time system: logarithmic loader; fast, efficient node allocator; batch system (PBS)
Libraries: MPI, I/O, math
File systems being considered include: PVFS (interim file system); Lustre (design intent); Panasas (possible alternative)
-
Red Storm System Software
Tools: all IA32 compilers, all AMD 64-bit compilers (Fortran, C, C++)
Debugger: TotalView (also examining alternatives)
Performance tools: was going to be Vampir until Intel bought Pallas -- now an open question
System management and administration
Accounting
RAS GUI interface
-
Comparison of ASCI Red and Red Storm
-
Comparison of ASCI Red and Red Storm
-
Red Storm Project
23 months, design to First Product Shipment!
System software is a joint project between Cray and Sandia:
Sandia is supplying the Catamount LWK and the service node run-time system
Cray is responsible for Linux, the NIC software interface, RAS software, file system software, and the TotalView port
Initial software development was done on a cluster of workstations with a commodity interconnect; the second stage involves an FPGA implementation of the SEASTAR NIC/router (Starfish); final checkout is on the real SEASTAR-based system
System design is going on now:
Cabinets exist
The SEASTAR NIC/router was released to fabrication at IBM earlier this month
The full system is to be installed and turned over to Sandia in stages culminating in August--September 2004
-
New Building for Thor's Hammer
-
Designing for scalable supercomputing
Challenges in: design, integration, management, and use
-
SUREty for Very Large Parallel Computer Systems
SURE poses computer system requirements:
Scalability -- full-system hardware and system software
Usability -- required functionality only
Reliability -- hardware and system software
Expense minimization -- use commodity, high-volume parts
-
SURE architectural tradeoffs:
Processor and memory subsystem balance
Compute vs. interconnect balance
Topology choices
Software choices
RAS
Commodity vs. custom technology
Geometry and mechanical design
-
Sandia strategies:
Build on commodity
Leverage Open Source (e.g., Linux)
Add to commodity selectively (in Red Storm there is basically one truly custom part!)
Leverage experience with previous scalable supercomputers
-
System-Scalability-Driven Requirements
Overall system scalability: complex scientific applications such as molecular dynamics, hydrodynamics & radiation transport should achieve scaled parallel efficiencies greater than 50% on the full system (~20,000 processors). (Scaled parallel efficiency is defined in the sketch below.)
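For reference, "scaled parallel efficiency" here is the usual scaled-speedup measure: the problem grows with the processor count, so ideally the run time stays flat. A minimal sketch, using a pair of times from the Quicksilver raw data behind the ASCI Red efficiency chart later in this deck:

    # Scaled parallel efficiency: problem size per processor is fixed, so the
    # 1-proc time divided by the N-proc time should ideally stay at 100%.
    def scaled_efficiency(t1_sec, tN_sec):
        return 100.0 * t1_sec / tN_sec

    # Quicksilver particles: 604.6 s on 1 proc, 608.8 s on 2 procs (scaled size)
    print(scaled_efficiency(604.6, 608.8))  # ~99.3%, as reported in the chart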
-
Scalability: System Software
System software performance scales nearly perfectly with the number of processors, up to the full size of the computer (~30,000 processors). This means that system software time (overhead) remains nearly constant with the size of the system, or scales at most logarithmically with system size.
Full re-boot time scales logarithmically with the system size.
Job loading is logarithmic in the number of processors (see the fan-out sketch after this list).
Parallel I/O performance is not sensitive to the number of PEs doing I/O.
Communication network software must be scalable:
No connection-based protocols among compute nodes.
Message buffer space independent of the number of processors.
The compute node OS gets out of the way of the application.
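Why job loading can be logarithmic: if every node that already holds the executable forwards it to one more node each round, coverage doubles per round. A minimal runnable sketch of that fan-out (an assumption about how the "logarithmic loader" named earlier works, not Cray/Sandia's actual implementation):

    # Binomial-tree fan-out: ranks that have the image each forward it once
    # per round, so the covered set doubles and rounds ~ log2(nprocs).
    def load_rounds(nprocs):
        have = {0}                # rank 0 starts with the executable image
        rounds = 0
        while len(have) < nprocs:
            span = len(have)
            have |= {r + span for r in have if r + span < nprocs}
            rounds += 1
        return rounds

    print(load_rounds(1024))   # 10 rounds
    print(load_rounds(10368))  # 14 rounds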
-
Hardware scalability
Balance in the node hardware:
Memory BW must match CPU speed; ideally 24 Bytes/flop (never yet done)
Communications speed must match CPU speed
I/O must match CPU speeds
Scalable system SW (OS and libraries)
Scalable applications
-
SW Scalability: Kernel tradeoffs
Unix/Linux has more functionality, but impairs efficiency on the compute partition at scale.
Say OS breaks are uncorrelated, last 50 μs, and occur once per second on each CPU (the arithmetic is sketched below):
On 100 CPUs, wasted time is 5 ms every second: a negligible 0.5% impact
On 1,000 CPUs, wasted time is 50 ms every second: a moderate 5% impact
On 10,000 CPUs, wasted time is 500 ms every second: a significant 50% impact
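A minimal sketch of the arithmetic behind those bullets (the model: every CPU suffers an uncorrelated 50 μs interruption once per second, and a tightly synchronized application effectively pays for all of them):

    # Wasted time per second of run = ncpus x interruption length x rate.
    NOISE_SEC = 50e-6   # 50 microseconds per interruption
    RATE_HZ = 1.0       # one interruption per CPU per second
    for ncpus in (100, 1000, 10000):
        wasted = ncpus * NOISE_SEC * RATE_HZ
        print(f"{ncpus} CPUs: {wasted*1e3:.0f} ms/s wasted ({wasted:.1%} impact)")
    # 100 CPUs: 5 ms (0.5%); 1,000: 50 ms (5%); 10,000: 500 ms (50%)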
-
Usability
Application code support -- software that supports scalability of the computer system:
Math libraries
MPI support for the full system size
Parallel I/O library
Compilers
Tools that scale to the full size of the computer system:
Debuggers
Performance monitors
Full-featured LINUX OS support at the user interface
-
Reliability
Light Weight Kernel (LWK) OS on the compute partition: much less code fails much less often
Monitoring of correctable errors: fix soft errors before they become hard
Hot swapping of components: the overall system keeps running during maintenance
Redundant power supplies & memories
A completely independent RAS system monitors virtually every component in the system
-
Economy
Use high-volume parts where possible
Minimize power requirements: cuts operating costs and reduces the need for new capital investment
Minimize system volume: reduces the need for large new capital facilities
Use standard manufacturing processes where possible; minimize customization
Maximize reliability and availability per dollar
Maximize scalability per dollar
Design for integrability
-
Economy
Red Storm leverages economies of scale:
AMD Opteron microprocessor & standard memory
Air cooled
Electrical interconnect based on InfiniBand physical devices
Linux operating system
Selected use of custom components:
System chip ASIC -- critical for communication-intensive applications
Light Weight Kernel -- truly custom, but we already have it (4th generation)
-
Cplant on a slide
Goal: MPP look and feel
Start ~1997, upgrades ~1999--2001
Alpha & Myrinet, mesh topology
~3000 procs (3 Tf) in 7 systems
Configurable to ~1700 procs
Red/Black switching
Linux w/ custom runtime & management
Production operation for several years
-
IA-32 Cplant on a slide
Goal: mid-range capacity
Started 2003, upgraded annually
Pentium-4 & Myrinet, Clos network
1280 procs (~7 Tf) in 3 systems
Currently configurable to 512 procs
Linux w/ custom runtime & management
Production operation for several years
-
Observation: for most large scientific and engineering applications, performance is determined more by parallel scalability than by the speed of individual CPUs.
There must be balance between processor, interconnect, and I/O performance to achieve overall performance.
To date, only a few tightly-coupled, parallel computer systems have been able to demonstrate a high level of scalability on a broad set of scientific and engineering applications.
-
Let's Compare Balance in Parallel Systems
-
Comparing Red Storm and BGL

                    Blue Gene Light**   Red Storm*
Node speed          5.6 GF              5.6 GF (1x)
Node memory         0.25--0.5 GB        2 (1--8) GB (4x nom.)
Network latency     7 μsecs             2 μsecs (2/7x)
Network BW          0.28 GB/s           6.0 GB/s (22x)
BW Bytes/Flop       0.05                1.1 (22x)
Bi-section B/F      0.0016              0.038 (24x)
# nodes/problem     40,000              10,000 (1/4x)

* 100 TF version of Red Storm; ** 360 TF version of BGL
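The bytes/flop row is just the ratio of the two rows above it; a minimal check:

    # Balance (bytes per flop) = network bandwidth / node floating-point speed.
    print(6.0 / 5.6)    # Red Storm: ~1.07 B/F (quoted as 1.1)
    print(0.28 / 5.6)   # Blue Gene Light: 0.05 B/F
    print(6.0 / 0.28)   # bandwidth ratio: ~21.4x (quoted as 22x)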
-
Fixed-problem performance: molecular dynamics problem (LJ liquid)
-
Parallel Sn Neutronics (provided by LANL)
-
Scalable computing works
[Chart and worksheet residue: "Janus (ASCI Red): scaled parallel efficiency (%) vs. processors for major codes" -- QS-Particles, QS-Fields-Only, QS-1B Cells, Rad x-port (17M, 80M, 168M, 532M, and 1B cells), Finite Element (Crash), Zapotec, Reactive Fluid Flow, Salinas, and CTH. Highlights recoverable from the underlying data:
Quicksilver, scaled-size particle problem (27,000 cells and 324,000 particles per processor): 99.3% on 2 procs, still 91.2% on 3,200 procs; fields-only problem: 98--99% through 8 procs, 67.5% on 3,200 procs; static and dynamic load-balancing runs recover much of the efficiency lost on 1/10-filled grids.
A billion-atom molecular dynamics benchmark (Dec '99, a streamlined LAMMPS-like code, short-range forces only) ran on 3,200 Tflops procs at 88.2% parallel efficiency.
CTH (Cu/HMX problem): 100% on 1 proc, 47% on 6,420 procs.
Zapotec (a coupled Eulerian-Lagrangian code tightly coupling CTH and PRONTO; tungsten rod impacting water, 1,440 PRONTO elements and 101,250 CTH cells per processor): 100% on 1 proc, 68% on 8, 56% on 3,179.
Finite Element (Crash): 99% on 8 procs, 92% on 3,600.
Reactive fluid flow: 99.6% on 4,096 procs, 94% on 6,420.
Salinas: 100% (normalized) on 32 procs, 81% on 3,200.
Radiation transport (data provided by LANL; 17M--532M cell problems): each problem size has an "ideal" efficiency set by the nature of the algorithm and a lower measured run-time efficiency; the difference is communication/latency on the particular machine. For N=532M: 82.5% ideal vs. 74.4% measured on 512 procs, 73.8% vs. 65.1% on 2,048.]
-
Balance is critical to scalability (chart comparing Peak and Linpack)
[Chart and worksheet residue: "Basic Parallel Efficiency Model" -- parallel efficiency vs. f = communication_time/computation_time for machines of differing balance factor B (bytes/flop): Red Storm (B=1.5), ASCI Red (B=1.2), reference machine (B=1.0), Earth Simulator (B=0.4), Cplant (B=0.25), Blue Gene Light (B=0.05), standard Linux cluster (B=0.04). Per the worksheet notes, parallel efficiency on the reference machine is 1/(1+f); a balance factor B scales the communication term to f/B. Companion worksheets:
Model efficiency vs. grain size (memory used per processor): at a 0.5 GB grain the model gives ~0.90 for Red Storm, ~0.88 for ASCI Red, ~0.61 for Cplant, and ~0.20 for a standard Linux cluster.
Measured/estimated balance factors for many systems, e.g. Cray T3E 1.0; ASCI Red 0.6--4.0 depending on mode and assumptions; ASCI White 0.08; Earth Simulator 0.5; Red Storm 0.9 sustained (1.5 theoretically achievable); Cplant 0.06--0.28; hypothetical '04--'05 AMD/Quadrics and AMD/Myrinet Opteron clusters 0.09--0.25.
Per-code parallel efficiencies on ASCI Red supplied by the code teams (Presto ~0.50, Salinas ~0.90, LAMMPS ~0.95, CTH ~0.80, DSMC ~0.55, ITS ~0.90), weighted by Sandia's cycle mix, give average efficiencies of ~0.70 on ASCI Red, ~0.39 on Cplant, ~0.73 on Red Storm, and ~0.14 on a standard Linux cluster, i.e. a Red Storm : commodity-cluster performance ratio of roughly 9--11x per processor.]
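The worksheet notes pin the model down exactly: f is the communication-to-computation time ratio on a balance-1.0 reference machine, efficiency there is 1/(1+f), and a machine with balance factor B shrinks the communication term to f/B. A minimal sketch reproducing the chart's curves:

    # Basic parallel efficiency model: eff(f, B) = 1 / (1 + f/B), where
    # f = comm_time/calc_time on a reference machine, B = bytes/flop balance.
    def efficiency(f, balance):
        return 1.0 / (1.0 + f / balance)

    machines = {"Red Storm": 1.5, "ASCI Red": 1.2, "Earth Sim.": 0.4,
                "Cplant": 0.25, "Blue Gene Light": 0.05,
                "Std. Linux Cluster": 0.04}
    for name, b in machines.items():
        print(f"{name}: {efficiency(0.1, b):.2f} at f=0.1, "
              f"{efficiency(1.0, b):.2f} at f=1.0")
    # Red Storm: 0.94/0.60; Std. Linux Cluster: 0.29/0.04 -- matching the chart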
-
Relating scalability and cost
Efficiency ratio = average efficiency ratio (Red/Cplant) over the five codes that consume >80% of Sandia's cycles. Cost ratio = 1.8. Where the efficiency ratio exceeds the cost ratio the MPP is more cost effective; below it, the cluster is more cost effective.
[Chart and worksheet residue: "Efficiency ratio (Red/Cplant) vs. processors" for current, future, and average code portfolios. The measured ratio grows from 1.0 on 1 processor to ~3.1--3.8 on 1,024 processors (extrapolated to ~4--6 on 2,048--4,096), crossing the 1.8 cost ratio between roughly 256 and 512 processors. Underlying per-code panels (scaled parallel efficiency, ASCI Red vs. Cplant): ACME brick contact problem (0.36 vs. 0.10 at 1,024 procs, ratio 3.52); Salinas modal and GB problems; LAMMPS membrane problem, FFT-rich (0.96 vs. 0.72 at 1,024 procs); DSMC plasma problem (0.40 vs. 0.05 at 1,000 procs); CTH NEST shell and 2-gas problems (0.52 vs. 0.35 at 1,024 procs on the NEST shell problem); ITS Starsat problem (0.94 vs. 0.49 at 1,024 procs); Presto drop problem vs. a standard Linux cluster (0.52 vs. 0.43 at 60 procs).]
-
Scalability determines cost effectiveness
Sandia's top-priority computing workload: [chart residue -- 380M node-hrs vs. 55M node-hrs, with the MPP more cost effective above a ~256-processor crossover and the cluster more cost effective below it]
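Putting the last two slides together: an MPP node-hour buys eff_ratio cluster node-hours of useful work, so the MPP wins once the efficiency ratio exceeds the ~1.8 cost ratio. A minimal sketch using the future-portfolio ratios from the chart data above:

    # MPP is more cost effective when (MPP/cluster efficiency) > (cost ratio).
    COST_RATIO = 1.8
    ratios = {64: 1.30, 256: 1.72, 512: 2.26, 1024: 3.05}  # Red/Cplant mix
    for procs, r in ratios.items():
        winner = "MPP" if r > COST_RATIO else "cluster"
        print(f"{procs} procs: ratio {r:.2f} -> {winner} more cost effective")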
-
Scalability also limits capability: matching the MPP's delivered speedup takes ~3x the processors (ITS speedup curves below).
Chart1
00
64.239377957159.420814003
128.4507175171118.5292037233
192192
255.5464028819196.476534586
320320
384384
448448
501.8907972138303.0814608188
576576
640640
704704
768768
832832
896896
960960
966.8332963374469.7221646257
10881088
Red Speedup
Cplant Speedup
Processors
Speedup
ITS Speedup curves
combined
Sum
Flat1.01.01.01.01.01.06.00
Current0.340.160.130.100.070.0010.81
Future0.150.100.100.100.100.150.70
Future combined ratio
0.150.100.100.100.100.150.70
ProcsACMESalinasLAMMPSDSMCCTHITSEff. RatioEff. Ratio
10.150.100.100.100.100.151.00
20.180.100.110.100.100.151.04
40.160.100.110.100.110.151.04
80.190.100.120.100.110.171.13
160.200.100.120.100.120.141.12
320.220.110.120.100.120.151.17
640.260.130.120.110.130.161.30
1280.290.160.120.120.130.161.42
2560.360.250.120.130.150.191.72
5120.470.450.150.130.140.242.26
10240.530.910.130.130.150.293.053.05
20483.90
40965.00
Current combined ratio
0.340.160.130.100.070.0010.81
ProcsACMESalinasLAMMPSDSMCCTHITSEff. RatioEff. Ratio
10.340.160.130.100.070.001.00
20.400.160.140.100.080.001.08
40.380.160.140.100.080.001.06
80.430.160.150.100.080.001.14
160.460.160.150.100.090.001.19
320.510.170.150.100.090.001.27
640.590.200.160.110.090.001.43
1280.660.260.160.120.100.001.61
2560.820.400.160.140.110.002.00
5121.070.710.190.140.100.002.74
10241.211.430.170.140.110.003.783.78
20484.90
40966.20
Efficiency ratiosProcsACMESalinasLAMMPSDSMCCTHITSAverageAverage
11.001.001.001.001.001.001.00
21.171.001.071.001.020.971.04
41.091.011.091.001.050.981.04
81.241.011.161.001.101.171.11
161.331.021.181.001.220.971.12
321.481.061.171.001.241.001.16
641.711.291.241.101.261.091.28
1281.931.641.231.201.321.091.40
2562.372.511.231.301.471.301.70
5123.124.501.511.301.401.622.24
10243.529.061.331.301.471.943.103.1
20484.10
40965.30
1.812.281.201.111.231.19
combined
1
1.0417316242
1.0370427571
1.1263326826
1.1225832309
1.1695692534
1.2969436557
1.4160002988
1.7174488073
2.2598735787
3.05108011563.0510801156
3.9
5
Eff. Ratio
Extrapolation
Processors
Efficiency ratio (Red/Cplant)
Future portfolio
ACME
1
1.0838924008
1.0597698842
1.1392859529
1.1913721552
1.2660360918
1.4319161332
1.6104267079
1.998786187
2.7403839139
3.78114171643.7811417164
4.9
6.2
Eff. Ratio
Extrapolation
Processors
Efficiency ratio (Red/Cplant)
Current portfolio
Salinas
00
00
00
00
00
00
00
00
00
00
00
00
00
Eff. Ratio
Extrapolation
Processors
Efficiency ratio (Red/Cplant)
Average portfolio
LAMMPS
ProblemProcsASCI RedCplant
Scaled problem11.1320.56
CPU time (sec.)21.5760.911
41.9741.068
82.2281.364
162.431.594
322.4821.822
642.6032.200
1282.762.629
2562.8263.320
5122.9934.614
10243.1395.464
ProblemProcsASCI RedCplantVplantRatio
Scaled problem11.001.001.0001.00
Efficiency20.720.610.5851.17
40.570.520.4371.09
80.510.410.3721.24
160.470.350.3281.33
320.460.310.3021.48
640.430.250.2691.71
1280.410.210.2651.93
2560.400.170.2432.37
5120.380.123.12
10240.360.103.52
leland:From very limited Bartel data, so made very conservative assumption by "hand"
leland:Courtenay's driver routine around ACME
leland:Ross
leland:Ross
LAMMPS
00
00
00
00
00
00
00
00
00
00
00
ASCI Red
Cplant
Processors
Scaled parallel efficiency
ACME brick contact problem
DSMC
ASCI RedSecsEff.
1511.00
8670.76
27780.65
64810.63
125890.57
216920.55
3431020.50
5121110.46
7291200.43
10001260.40
13311620.31
17281810.28
21972200.23
27442650.19
CplantSecsEff.
1241.00
8320.75
27380.63
64490.49
125680.35
2161000.24
3431530.16
5122350.10
7193490.07
10005220.05
VplantSecsEff.
1211.00
8340.62
27370.57
64400.53
125410.51
216420.50
343420.50
512480.44
719-
1024-
ProcsRed Eff.Cplant Eff.
11.001.00
20.970.96
40.900.89
80.760.75
160.720.70
270.650.63
320.650.61
640.630.49
1250.570.35
1280.570.35
2160.550.24
2560.540.21
3430.500.16
5120.460.10
7290.430.07
10000.400.05
10240.400.04
13310.31
ProcsRed Eff.Cplant Eff.Ratio
11.001.001.00
20.970.961.00
40.900.891.01
80.760.751.01
160.720.701.02
320.650.611.06
640.630.491.29
1280.570.351.64
2560.540.212.51
5120.460.104.50
10240.400.0449.06
leland:Courtenay's driver routine around ACME
ASCI Red
Cplant
Vplant
Processors
Scaled parallel efficiency
ACME brick contact problem
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
DSMC
00
00
00
00
00
00
00
00
00
00
00
00
00
00
ASCI Red
Cplant
Processors
Scaled parallel efficiency
Salinas modal problem
CTH
00
00
00
00
00
00
00
00
00
00
ASCI Red
Cplant
Processors
Scaled parallel efficiency
Salinas modal problem
ITS
MembraneProcsASCI RedCplantRatio
Scaled problem11.0001.0001.00
Parallel efficiency21.0280.9641.07
41.0050.9241.09
80.9920.8531.16
160.9950.8451.18
320.9960.8531.17
640.9940.8031.24
1280.9920.8051.23
2560.9890.8011.23
5120.9850.6511.51
10240.9620.7211.33
ITS
00
00
00
00
00
00
00
00
00
00
00
ASCI Red
Cplant
Processors
Scaled parallel efficiency
LAMMPS membrane problem (FFT rich)
Sheet1
ProcsASCI Red (model)Cplant (model)f Redf CplantRatio
11.001.000.000.001.00
20.740.740.360.361.00
40.670.670.480.481.00
80.640.640.560.561.00
160.590.590.690.691.00
320.590.590.690.691.00
640.600.600.680.681.00
1280.570.570.750.751.00
2560.510.510.970.971.00
5120.480.481.071.071.00
10240.520.520.930.931.00
Red bal.1.2
Cplant bal.1.2
Ratio1.0
ASCI RedTimeEff
0
20
40
80
2000.680.432
4000.720.408
640
8000.740.397
1000
16000.690.426
CplantTimeEff
0
204.680.326
404.720.323
804.870.313
200
400
6404.960.307
800
1000
ASCI RedTimeEffRatio
01.3
80
2000.680.432
4000.720.408
640
8000.740.397
1000
CplantTimeEff
0
804.870.313
200
400
6404.960.307
800
1000
leland:Bio problem with FFT's
leland:Ross
[Chart: DSMC problem -- modeled scaled parallel efficiency vs. processors, ASCI Red (model) vs. Cplant (model).]
[Chart: DSMC plasma problem -- scaled parallel efficiency vs. processors, ASCI Red vs. Cplant.]
CTH - NEST shell problem -- scaled parallel efficiency (interpolated to intermediate processor counts where not measured directly):

Procs  ASCI Red  Cplant  Ratio (Red/Cplant)
1      1.00      1.00    1.00
2      0.95      0.93    1.02
4      0.84      0.80    1.05
8      0.82      0.74    1.10
16     0.78      0.64    1.22
32     0.78      0.63    1.24
64     0.77      0.61    1.26
128    0.74      0.56    1.32
256    0.69      0.47    1.47
512    0.62      0.44    1.40
1024   0.52      0.35    1.47
[Chart: CTH - NEST shell problem -- scaled parallel efficiency vs. processors, ASCI Red and Cplant, measured and interpolated.]
ITS STARSAT problem (scaled problem) -- CPU time (sec.) and scaled efficiency:

Procs  Red time  Cplant time  Red eff.  Cplant eff.  Ratio (Red/Cplant)
1      1701.43   340.60       1.00      1.00         1.00
2      1730.44   336.18       0.98      1.01         0.97
4      1687.40   331.22       1.01      1.03         0.98
8      1692.29   394.68       1.01      0.86         1.17
16     1693.52   327.21       1.00      1.04         0.97
32     1692.29   337.86       1.01      1.01         1.00
64     1695.06   368.22       1.00      0.92         1.09
128    1695.43   369.11       1.00      0.92         1.09
256    1704.42   443.00       1.00      0.77         1.30
512    1735.67   564.02       0.98      0.60         1.62
1024   1802.00   701.06       0.94      0.49         1.94

Vplant scaled efficiency:
Procs  1     2     4     8     16    32    64    128   256   512   1024
Eff.   1.00  0.98  0.99  0.81  0.98  0.96  0.84  0.84  0.73  0.55  0.41

Efficiency and speedup:
Procs  Red eff.  Red speedup  Cplant eff.  Cplant speedup
1      1.00      1.00         1.00         1.00
2      0.98      1.97         1.01         1.99
4      1.01      4.03         1.03         4.1
8      1.01      8.04         0.86         6.9
16     1.00      16.1         1.04         16.7
32     1.01      32.2         1.01         32.4
64     1.00      64.2         0.92         59.4
128    1.00      128.5        0.92         118.5
256    1.00      255.5        0.77         196.5
512    0.98      501.9        0.60         303.1
1024   0.94      966.8        0.49         469.7

[Cell notes (leland): data from Mahesh Rajan; Cplant data from Ross, a mixture of head and center nodes.]
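The speedup columns are the efficiencies rescaled by processor count: for a scaled problem, E(P) = T(1)/T(P) as before, and S(P) = P x E(P). A quick check against the Red times above (a Python sketch I added, not from the source):

    # Speedup from scaled efficiency: S(P) = P * E(P), with E(P) = T(1)/T(P).
    procs = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
    t_red = [1701.43, 1730.44, 1687.40, 1692.29, 1693.52, 1692.29,
             1695.06, 1695.43, 1704.42, 1735.67, 1802.00]   # Red CPU time (sec.)
    for p, t in zip(procs, t_red):
        eff = t_red[0] / t
        print(f"{p:5d}  eff {eff:.2f}  speedup {p * eff:.1f}")
    # At 1024 processors: eff 0.94, speedup ~967 (966.8 in the table).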
[Charts: ITS Starsat problem -- scaled parallel efficiency vs. processors (ASCI Red and Cplant, with a second panel adding Vplant), and ITS speedup curves (Red and Cplant speedup vs. processors).]
Presto drop problem (Gellerud test, 4k elements per processor; ASCI Red vs. a standard Linux cluster) -- CPU time (sec) and parallel efficiency:

Procs  ASCI Red  Linux     Red eff.  Linux eff.
1      5.50E+02  1.60E+02  1.00      1.00
4      6.04E+02  1.79E+02  0.91      0.89
15     8.77E+02  2.60E+02  0.63      0.62
60     1.06E+03  3.76E+02  0.52      0.43
240    1.36E+03  --        0.41      --

ACME contact time -- CPU time (sec) and fraction of the whole:

Procs  ASCI Red  Linux     Red fraction  Linux fraction
4      5.66E+02  1.73E+02  0.94          0.96
15     8.37E+02  2.53E+02  0.95          0.97
60     1.02E+03  3.69E+02  0.96          0.98
240    1.31E+03  --        0.97          --

[Cell notes (leland): Arne's test problem (a drop problem?); Linux data from the 9100 Linux cluster.]
[Chart: Presto drop problem -- scaled parallel efficiency vs. processors, ASCI Red vs. standard Linux cluster.]
Salinas GB problem (scaled problem) -- CPU time (sec.) and parallel efficiency:

Procs  Red time  Cplant time  Red eff.  Cplant eff.  Ratio (Red/Cplant)
1      46.8      64.5         1.00      1.00         1.00
2      47.0      --           0.99      0.99         1.00
4      47.3      --           0.99      0.99         1.00
8      47.8      --           0.98      0.98         1.00
16     48.7      --           0.96      0.96         1.00
32     50.7      --           0.92      0.92         1.00
64     54.4      75           0.86      0.86         1.00
128    55.8      --           0.84      0.84         1.00
216    56.9      78           0.82      0.83         0.99
512    59.1      80           0.79      0.81         0.98
1000   63.6      85           0.74      0.76         0.97

[Cell notes (leland): Cplant data from Ross; the small-processor-count Cplant values were constructed by hand to give a reasonable efficiency where the measured data starts, and a few points are linear interpolations added only to make the plot work.]
[Chart: Salinas GB problem -- scaled parallel efficiency vs. processors, ASCI Red and Cplant, measured and extrapolated.]
Modeled scaled parallel efficiency with differing machine balance:

Procs  ASCI Red (model)  Cplant (model)  f Red  f Cplant  Ratio
1      1.00              1.00            0.00   0.00      1.00
2      0.74              0.47            0.36   1.15      1.58
4      0.67              0.39            0.48   1.55      1.72
8      0.64              0.36            0.56   1.80      1.79
16     0.59              0.31            0.69   2.21      1.90
32     0.59              0.31            0.69   2.22      1.90
64     0.60              0.32            0.68   2.17      1.89
128    0.57              0.29            0.75   2.41      1.95
256    0.51              0.24            0.97   3.11      2.08
512    0.48              0.23            1.07   3.43      2.14
1024   0.52              0.25            0.93   2.96      2.06

Red bal. 0.80; Cplant bal. 0.25; balance ratio 3.2.

[Cell notes (leland): in the rollup, count these runs as 256 and 1024 processors.]
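The model columns appear to follow a simple balance argument: take f as the code's communication demand relative to machine balance, model scaled efficiency as E = 1/(1 + f), and inflate f on the less well-balanced machine by the ratio of balance figures. A minimal sketch under that reading (my reconstruction, not code from the source):

    # Balance model (reconstructed assumption): E = 1/(1 + f), where f is
    # communication demand scaled by machine balance; a machine with 1/3.2
    # the balance sees f grow by 3.2x.
    bal_red, bal_cplant = 0.80, 0.25               # balance figures from the table
    f_red = [0.00, 0.36, 0.48, 0.56, 0.69, 0.69,
             0.68, 0.75, 0.97, 1.07, 0.93]         # f on ASCI Red, by proc count

    for f in f_red:
        e_red = 1.0 / (1.0 + f)
        f_cpl = f * (bal_red / bal_cplant)         # f inflated by the balance ratio
        e_cpl = 1.0 / (1.0 + f_cpl)
        print(f"Red {e_red:.2f}   Cplant {e_cpl:.2f}   ratio {e_red / e_cpl:.2f}")
    # Second row: Red 0.74, Cplant ~0.47, ratio ~1.58 -- matching the table
    # to within rounding.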
[Chart: CTH 2-gas problem -- scaled parallel efficiency vs. processors, ASCI Red vs. Cplant (model).]
-
Commodity nearly everywhere -- customization drives cost. The Earth Simulator and the Cray X-1 are fully custom vector systems with good balance, which drives their high cost (and their high performance). Clusters are almost entirely high-volume parts with nothing truly custom, which drives their low cost (and their low scalability). Red Storm uses custom parts only where they are critical to performance and reliability, yielding high scalability at minimal cost per unit of performance.
-
Scaling data for some key engineering codes: random variation at small processor counts, but a large differential in efficiency at large processor counts.
[Chart: Performance on Engineering Codes -- scaled parallel efficiency vs. processors for ITS and ACME on ASCI Red and Cplant.]
Portfolio weights (fraction of the workload carried by each code):

Portfolio  ACME  Salinas  LAMMPS  DSMC  CTH   ITS   Sum
Flat       1.0   1.0      1.0     1.0   1.0   1.0   6.00
Current    0.34  0.16     0.13    0.10  0.07  0.00  0.81
Future     0.15  0.10     0.10    0.10  0.10  0.15  0.70

Future portfolio -- weighted per-code contributions (weight x ratio) and the combined efficiency ratio (sum normalized by 0.70):

Procs  ACME  Salinas  LAMMPS  DSMC  CTH   ITS   Eff. ratio
1      0.15  0.10     0.10    0.10  0.10  0.15  1.00
2      0.18  0.10     0.11    0.10  0.10  0.15  1.04
4      0.16  0.10     0.11    0.10  0.11  0.15  1.04
8      0.19  0.10     0.12    0.10  0.11  0.17  1.13
16     0.20  0.10     0.12    0.10  0.12  0.14  1.12
32     0.22  0.11     0.12    0.10  0.12  0.15  1.17
64     0.26  0.13     0.12    0.11  0.13  0.16  1.30
128    0.29  0.16     0.12    0.12  0.13  0.16  1.42
256    0.36  0.25     0.12    0.13  0.15  0.19  1.72
512    0.47  0.45     0.15    0.13  0.14  0.24  2.26
1024   0.53  0.91     0.13    0.13  0.15  0.29  3.05
2048   (extrapolated)                           3.90
4096   (extrapolated)                           5.00

Current portfolio -- weighted per-code contributions and the combined efficiency ratio (sum normalized by 0.81):

Procs  ACME  Salinas  LAMMPS  DSMC  CTH   ITS   Eff. ratio
1      0.34  0.16     0.13    0.10  0.07  0.00  1.00
2      0.40  0.16     0.14    0.10  0.08  0.00  1.08
4      0.38  0.16     0.14    0.10  0.08  0.00  1.06
8      0.43  0.16     0.15    0.10  0.08  0.00  1.14
16     0.46  0.16     0.15    0.10  0.09  0.00  1.19
32     0.51  0.17     0.15    0.10  0.09  0.00  1.27
64     0.59  0.20     0.16    0.11  0.09  0.00  1.43
128    0.66  0.26     0.16    0.12  0.10  0.00  1.61
256    0.82  0.40     0.16    0.14  0.11  0.00  2.00
512    1.07  0.71     0.19    0.14  0.10  0.00  2.74
1024   1.21  1.43     0.17    0.14  0.11  0.00  3.78
2048   (extrapolated)                           4.90
4096   (extrapolated)                           6.20

Per-code efficiency ratios (Red/Cplant) and the unweighted average:

Procs  ACME  Salinas  LAMMPS  DSMC  CTH   ITS   Average
1      1.00  1.00     1.00    1.00  1.00  1.00  1.00
2      1.17  1.00     1.07    1.00  1.02  0.97  1.04
4      1.09  1.01     1.09    1.00  1.05  0.98  1.04
8      1.24  1.01     1.16    1.00  1.10  1.17  1.11
16     1.33  1.02     1.18    1.00  1.22  0.97  1.12
32     1.48  1.06     1.17    1.00  1.24  1.00  1.16
64     1.71  1.29     1.24    1.10  1.26  1.09  1.28
128    1.93  1.64     1.23    1.20  1.32  1.09  1.40
256    2.37  2.51     1.23    1.30  1.47  1.30  1.70
512    3.12  4.50     1.51    1.30  1.40  1.62  2.24
1024   3.52  9.06     1.33    1.30  1.47  1.94  3.10
2048   (extrapolated)                           4.10
4096   (extrapolated)                           5.30
Mean   1.81  2.28     1.20    1.11  1.23  1.19
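The portfolio rows combine the per-code ratios as a weight-normalized average: multiply each code's Red/Cplant ratio by its share of the workload, sum, and divide by the sum of the weights. A minimal sketch (Python; variable names are mine) using the 1024-processor ratios from the tables above:

    # Portfolio efficiency ratio = sum(w_i * r_i) / sum(w_i): a weighted
    # average of the per-code Red/Cplant efficiency ratios.
    # Code order: ACME, Salinas, LAMMPS, DSMC, CTH, ITS.
    current = [0.34, 0.16, 0.13, 0.10, 0.07, 0.00]  # Current-portfolio weights
    future  = [0.15, 0.10, 0.10, 0.10, 0.10, 0.15]  # Future-portfolio weights
    r_1024  = [3.52, 9.06, 1.33, 1.30, 1.47, 1.94]  # per-code ratios, 1024 procs

    def portfolio_ratio(weights, ratios):
        return sum(w * r for w, r in zip(weights, ratios)) / sum(weights)

    print(f"Current: {portfolio_ratio(current, r_1024):.2f}")  # ~3.8
    print(f"Future:  {portfolio_ratio(future,  r_1024):.2f}")  # ~3.05
    # The source tables carry an extra digit in the weights, so the last
    # decimal differs slightly (3.78 and 3.05 there).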
[Charts: Efficiency ratio (Red/Cplant) vs. processors, with extrapolation -- Future, Current, and Average portfolios.]
Scaled parallel efficiency by code -- ASCI Red:

Procs  CTH   Salinas  ACME  ITS   LAMMPS
1      1.00  1.00     1.00  1.00  1.00
2      0.95  0.97     0.72  0.98  1.03
4      0.84  0.90     0.57  1.01  1.01
8      0.82  0.76     0.51  1.01  0.99
16     0.78  0.72     0.47  1.00  1.00
32     0.78  0.65     0.46  1.01  1.00
64     0.77  0.63     0.43  1.00  0.99
128    0.74  0.57     0.41  1.00  0.99
256    0.69  0.54     0.40  1.00  0.99
512    0.62  0.46     0.38  0.98  0.99
1024   0.52  0.40     0.36  0.94  0.96

Scaled parallel efficiency by code -- Cplant:

Procs  CTH   Salinas  ACME  ITS   LAMMPS
1      1.00  1.00     1.00  1.00  1.00
2      0.93  0.96     0.61  1.01  0.96
4      0.80  0.89     0.52  1.03  0.92
8      0.74  0.75     0.41  0.86  0.85
16     0.64  0.70     0.35  1.04  0.85
32     0.63  0.61     0.31  1.01  0.85
64     0.61  0.49     0.25  0.92  0.80
128    0.56  0.35     0.21  0.92  0.81
256    0.47  0.21     0.17  0.77  0.80
512    0.44  0.10     0.12  0.60  0.65
1024   0.35  0.04     0.10  0.49  0.72

Efficiency ratio (Red/Cplant) by code:

Procs  Salinas  ACME  ITS   CTH   LAMMPS
1      1.00     1.00  1.00  1.00  1.00
2      1.00     1.17  0.97  1.02  1.07
4      1.01     1.09  0.98  1.05  1.09
8      1.01     1.24  1.17  1.10  1.16
16     1.02     1.33  0.97  1.22  1.18
32     1.06     1.48  1.00  1.24  1.17
64     1.29     1.71  1.09  1.26  1.24
128    1.64     1.93  1.09  1.32  1.23
256    2.51     2.37  1.30  1.47  1.23
512    4.50     3.12  1.62  1.40  1.51
1024   9.06     3.52  1.94  1.47  1.33

[Cell note (leland): Salinas from very limited Bartel data, so a very conservative assumption was made by hand.]

[Chart: Efficiency ratios -- Red efficiency / Cplant efficiency vs. processors, by code.]
[Charts: Performance on Engineering Codes -- scaled parallel efficiency vs. processors for selected code pairs (CTH, Salinas, and ACME; CTH and ACME; ITS and ACME) on ASCI Red and Cplant.]