RAMSES: Robust Analytical Models for Science at Extreme Scale
Presenter: Raj Kettimuthu (Argonne)
PI: Ian Foster (Argonne)
Co-PIs: Gagan Agrawal (Ohio State), Nagi Rao (ORNL), Brad Settlemyer (LANL), Brian Tierney (LBL), and Don Towsley (UMass)
Project Overview
§ Tools: Develop easy-to-use tools to provide end users with actionable advice
§ Estimation: Develop and apply data-driven estimation methods (differential regression, surrogate models, etc.)
§ Modeling: Develop, evaluate, and refine component and end-to-end models
§ Experiments: Conduct extensive, automated experiments to test models and build database
§ Components shown in the diagram: Database, Advisor, Estimators, Evaluators, Tester
Exemplar Science Workflows
§ Five science workflows
§ Span a broad range of DOE science domains and modeling problems
§ File Transfer
§ Light Source Workflows
– Tomographic Reconstruction, Diffuse Scattering
§ Distributed MapReduce
§ In-situ Analysis
§ Exascale Simulations
TCP Throughput Profiles
[Figure: throughput (Gbps) versus RTT (ms), with a concave region and a convex region marked]
§ The most common TCP throughput profile is a convex function of RTT
§ Observed dual-mode profiles on emulated connections with 0-366 ms RTT
– CUBIC, STCP: smaller RTT gives a concave region, larger RTT gives a convex region
§ Concave regions are very desirable
– Throughput does not decay as fast, rate of decrease slows down as rtt
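One way to locate the concave and convex regions in a measured profile is to look at how the slope changes between neighboring samples. The sketch below is only an illustration of that idea: the function name `classify_regions` and the sample numbers in the usage note are hypothetical and are not taken from the RAMSES measurements.

```python
def classify_regions(rtts, throughputs, tol=1e-9):
    """Label each interior sample of a throughput-vs-RTT profile.

    A sample is 'concave' when the slope decreases across it,
    'convex' when the slope increases, and 'linear' otherwise.
    """
    labels = []
    for i in range(1, len(rtts) - 1):
        # Slope of the segment ending at sample i.
        left = (throughputs[i] - throughputs[i - 1]) / (rtts[i] - rtts[i - 1])
        # Slope of the segment starting at sample i.
        right = (throughputs[i + 1] - throughputs[i]) / (rtts[i + 1] - rtts[i])
        change = right - left
        if change > tol:
            labels.append("convex")
        elif change < -tol:
            labels.append("concave")
        else:
            labels.append("linear")
    return labels
```

For a made-up profile such as RTTs `[0, 50, 100, 200, 300]` ms with throughputs `[10.0, 9.5, 8.0, 4.0, 2.0]` Gbps, the first two interior samples come out concave and the last one convex, which mirrors the dual-mode shape described for CUBIC and STCP.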
Models of single TCP connections
[Figure panels: STCP, CUBIC]
• Models account for increase/decrease rules, rtt, link capacity, max receive window size
• Validation against measurements
• Useful for selecting best version and troubleshooting
• Future directions: other versions (e.g., UDT), multiple connections, accounting for I/O interactions
UDP-Based Transport: UDT
§ For dedicated 10G links, UDT provides higher throughput than CUBIC (the Linux default)
§ The throughput transition point between TCP and UDT depends on connection parameters
– RTT, loss rate, host
– NIC parameters, IP and UDP parameters
§ Disk-to-disk transfers (xdd) have a lower transfer rate
[Figure legend: xdd-read, xdd-write, CUBIC single stream, UDT]
Data Driven Models for File Transfer
§ Combines historical data with a correction term for current external load
§ Takes three pieces of input
§ Signature for a given transfer
– Concurrency level
– Total known concurrency at source (“known load at source”)
– Total known concurrency at destination (“known load at destination”)
– File size
§ Historical data
– Transfer concurrency, known loads, and observed throughput for the source-destination pair
§ Signatures and observed throughputs from the most recent transfers for the source-destination pair
§ It produces an estimated throughput as an output.
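The three inputs listed above can be combined in many ways; the slides do not give the formula. The sketch below is one possible reading, not the project's estimator: it assumes a nearest-neighbor lookup over historical signatures and a multiplicative correction computed from the most recent transfers. All names (`Signature`, `Record`, `estimate_throughput`, the distance function, and `k`) are hypothetical.

```python
from statistics import mean
from typing import NamedTuple


class Signature(NamedTuple):
    concurrency: int      # concurrency level of this transfer
    load_src: int         # total known concurrency at source
    load_dst: int         # total known concurrency at destination
    file_size_gb: float


class Record(NamedTuple):
    signature: Signature
    throughput: float     # observed throughput, any consistent unit


def _distance(a, b):
    # Assumed similarity measure: plain sum of absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def historical_estimate(sig, history, k=3):
    # Average the throughput of the k most similar historical transfers.
    nearest = sorted(history, key=lambda r: _distance(sig, r.signature))[:k]
    return mean(r.throughput for r in nearest)


def estimate_throughput(sig, history, recent, k=3):
    base = historical_estimate(sig, history, k)
    if not recent:
        return base
    # Assumed correction term: how far recent transfers deviated from what
    # history predicted for them, taken as a proxy for unknown external load.
    ratios = [r.throughput / historical_estimate(r.signature, history, k)
              for r in recent]
    return base * mean(ratios)
```

If recent transfers ran at half of what history predicted, the estimate for the new transfer is halved as well. The sketch requires a non-empty history with non-zero historical throughput.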
Data Driven Models for File Transfer
§ Transfer Scheduling Algorithms
– SEAL: Schedule transfers to minimize average transfer slowdown
– STEAL: Minimize slowdown for best-effort transfers and maximize bandwidth utilization for batch transfers
Destination    ≤1GB    >1GB, ≤10GB    >10GB    Overall
Gordon         4.26    9.3            6.67     8.31
Mason          3.55    9.4            8.22     8.76
Yellowstone    2.78    8.0            8.1      6.84
Blacklight     5.96    4.41           5.27     4.93
Darter         7.70    4.03           2.63     4.73
SEAL Evaluation – Turnaround Time, 60% Load
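The average-slowdown objective named for SEAL can be made concrete with a toy calculation. The sketch below assumes a common definition of slowdown (turnaround time divided by the time the transfer would take with no waiting) and runs transfers strictly one after another. It is not the SEAL algorithm, and the function name `average_slowdown` is hypothetical.

```python
def average_slowdown(ideal_times):
    """Average slowdown when transfers run one at a time in the given order.

    ideal_times: transfer durations (seconds) with no contention.
    Slowdown of a transfer = (wait time + run time) / run time.
    """
    clock = 0.0   # time at which the next transfer can start
    total = 0.0
    for ideal in ideal_times:
        turnaround = clock + ideal       # waited `clock`, then ran `ideal`
        total += turnaround / ideal
        clock += ideal
    return total / len(ideal_times)
```

With one 100-second and one 10-second transfer, running the long one first gives an average slowdown of 6.0, while running the short one first gives about 1.05. This is why ordering decisions matter for this metric.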
Modeling In-situ Analysis
§ How often should we perform the analyses?
§ How often should the analysis output be written?
§ Analysis parameters
– Time (Initialization, Auxiliary, Output)
– Memory (Fixed, Auxiliary, Output)
– Minimum interval between consecutive steps
– Importance
– Threshold time for analyses
§ System parameters
– I/O bandwidth
– Rate of computation
– Available memory
[Figure: axes labeled Problem Size and Network bandwidth/Process count]
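To show how the analysis parameters above could feed a scheduling decision, here is a minimal sketch. It assumes a simple greedy rule (most importance per second first) and ignores memory and I/O bandwidth entirely; the slides do not state the actual formulation, so the class `Analysis`, the function `plan_frequencies`, and the greedy rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    name: str
    init_time: float    # one-off initialization cost (s)
    run_time: float     # cost per invocation (s)
    importance: float   # relative weight given by the user
    min_interval: int   # minimum simulation steps between invocations


def plan_frequencies(analyses, total_steps, threshold_s):
    """Greedily pick how many times each analysis runs within threshold_s."""
    remaining = threshold_s
    plan = {}
    # Assumed priority rule: importance gained per second of analysis time.
    ranked = sorted(analyses, key=lambda a: a.importance / a.run_time,
                    reverse=True)
    for a in ranked:
        max_runs = total_steps // a.min_interval   # cap from minimum interval
        if remaining > a.init_time:
            affordable = int((remaining - a.init_time) // a.run_time)
        else:
            affordable = 0
        runs = min(max_runs, affordable)
        plan[a.name] = runs
        if runs:
            remaining -= a.init_time + runs * a.run_time
    return plan
```

A greedy pass like this is only a heuristic. A full treatment would also have to respect the memory parameters and the output-writing cost listed above.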
Results: Scheduling Analyses within Threshold
Total threshold (sec)    R1    R2    R3    % within threshold
200                      10    4     7     94.59
100                      10    2     3     85.99
60                       10    1     2     86.01
20                       10    1     0     86.11
10                       10    0     0     0.3
R1: Radius of gyration
R2: Membrane density profile 2D histogram
R3: Protein density profile 2D histogram
Table: Analysis frequencies, analysis times, and corresponding thresholds for a 1-billion-atom rhodopsin simulation (1000 steps) in LAMMPS on 32768 cores (2048 nodes) of Mira.
Simulation: Rhodopsin protein benchmark, which consists of a protein embedded in a membrane and solvated with water and ions using LAMMPS.
Observation: More than 80% of the allowed threshold is used for analyses, when threshold > 20 s.
[Figure: Time and Memory for R1, R2, and R3]
Two In-Situ Modes
Time Sharing Mode: Minimizes memory consumption
Space Sharing Mode: Enhances resource utilization when simulation reaches its scalability bottleneck
§ Model computational part (MapReduce-like processing)
§ Model memory
– Data locality between simulation and analytics (initial work only)
Performance Modeling with Disk Model for K-means with MATE (File Size = 1 GB, K = 50, Number of Iterations = 1)
Modeling Computational Component in MATE/Smart
Modeling Computation Time for Parallel Tomographic Reconstruction
§ Computation
– Number of intersected rays, and horizontal and vertical lines
– t x col^2 x (|sin(θ)| + |cos(θ)|)
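The computation term above can be evaluated directly to see how the estimated work varies with projection angle. In this sketch `t` is treated as a plain scale factor, which is an assumption because the slide does not define it. The function name `intersection_cost` is hypothetical, and the sine and cosine terms are returned separately without claiming which one corresponds to horizontal or vertical lines.

```python
import math


def intersection_cost(cols, theta_deg, t=1.0):
    """Evaluate t * cols^2 * (|sin(theta)| + |cos(theta)|).

    Returns (sin_term, cos_term, total) for an angle given in degrees.
    """
    theta = math.radians(theta_deg)
    sin_term = t * cols ** 2 * abs(math.sin(theta))
    cos_term = t * cols ** 2 * abs(math.cos(theta))
    return sin_term, cos_term, sin_term + cos_term


# Usage: estimated cost for a sweep of angles, as in a projection set.
costs = [intersection_cost(256, angle)[2] for angle in range(0, 180, 10)]
```

Under this expression the total is smallest at 0 and 90 degrees (cols^2 times t) and largest at 45 and 135 degrees (about 1.41 times that), so per-angle work is not uniform across a projection sweep.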
[Figure: series Horizontal, Vertical, and Total; x-axis 0 to 170, y-axis 0 to 100000]
[Figure: series Estimated, Real, and Error Ratio; x-axis 1 to 171, y-axes 65000 to 95000 and 0 to 0.025]
[Diagram: P0, P1, P2, … Pn and T0, T1, T2, … Tn]
Estimated Execution time vs. Real Reconstruction Time
RAMSES Meeting
[Figure: series Real Time, Estimated Exec Time (wrt 2K), and Estimated Computation; x-axis 128 to 2048, y-axes 0 to 4.5E+09 and 0 to 800]
Questions?
Additive increase and additive decrease (AIAD) for optimal stream
For every step c (fixed number of epochs), do the following:
Tubes (ANL) to DMZ (UChicago)