Recent Developments for Parallel CMAQ Jeff Young AMDB/ASMD/ARL/NOAA David Wong SAIC – NESCC/EPA.
Recent Developments for Parallel CMAQ
• Jeff Young AMDB/ASMD/ARL/NOAA
• David Wong SAIC – NESCC/EPA
AQF-CMAQ
• Running in “quasi-operational” mode at NCEP
• 26+ minutes for a 48-hour forecast (on 33 processors)
• Some background: MPI data communication – ghost (halo) regions
• Code modifications to improve data locality
• Some vdiff optimization
• Jerry Gipson’s MEBI/EBI chem solver
• Tried CGRID( SPC, LAYER, COLUMN, ROW ) – tests indicated this ordering is not good for MPI communication of data
DO J = 1, N
   DO I = 1, M
      ! A( I+2,J ) and A( I,J-1 ) reach outside the local subdomain near its
      ! edges, so ghost cells 2 deep in +I and 1 deep in -J must be filled first
      DATA( I,J ) = A( I+2,J ) * A( I,J-1 )
   END DO
END DO
Ghost (Halo) Regions
[Figure: the domain decomposed among processors 0–5, showing the exterior boundary and one subdomain’s ghost region]
Stencil Exchange Data Communication Function
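A minimal sketch of the kind of ghost-region exchange the stencil above implies, assuming a 2-deep east ghost region and MPI_SENDRECV (the subroutine name, array layout and tag are invented for illustration; the AQF-CMAQ performs this through its stencil exchange data communication function, not this code):

SUBROUTINE GHOST_EXCHANGE_EW ( A, M, N, WEST, EAST, COMM )
! Illustrative only: fill the east ghost region (2 cells deep) needed by
! A( I+2,J ). Each worker sends its 2 westernmost interior columns to its
! west neighbor and receives 2 ghost columns from its east neighbor.
! WEST/EAST are MPI_PROC_NULL on the exterior boundary.
   USE MPI
   IMPLICIT NONE
   INTEGER, INTENT( IN )    :: M, N          ! local interior columns, rows
   INTEGER, INTENT( IN )    :: WEST, EAST    ! neighbor processor ranks
   INTEGER, INTENT( IN )    :: COMM
   REAL,    INTENT( INOUT ) :: A( -1:M+2, 0:N+1 )   ! ghost depth 2 in I, 1 in J
   REAL    :: SENDBUF( 2, N ), RECVBUF( 2, N )
   INTEGER :: STATUS( MPI_STATUS_SIZE ), IERR

   SENDBUF = A( 1:2, 1:N )                   ! pack the 2 westernmost interior columns
   CALL MPI_SENDRECV ( SENDBUF, 2*N, MPI_REAL, WEST, 0,   &
                       RECVBUF, 2*N, MPI_REAL, EAST, 0,   &
                       COMM, STATUS, IERR )
   IF ( EAST .NE. MPI_PROC_NULL ) A( M+1:M+2, 1:N ) = RECVBUF   ! fill east ghost columns
   ! the 1-deep south ghost row needed by A( I,J-1 ) is exchanged the same
   ! way along the row direction (not shown)
END SUBROUTINE GHOST_EXCHANGE_EW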
Horizontal Advection and Diffusion Data Requirements
Architectural Changes for Parallel I/O
Tests confirmed latest releases (May, Sep 2003) not very scalable
Background:
• Original Version (2002 Release)
• Latest Release (2003 Release)
• AQF Version – Parallel I/O
[Diagram contrasting the versions: in the 2002 and 2003 releases the same processors handle read and computation, write and computation (read, write and computation together); in the AQF version the worker processors do computation only and a dedicated processor does write only, asynchronously, with data transfer by message passing]
Modifications to the I/O API for Parallel I/O
For 3D data, e.g. …

INTERP3 ( FileName, VarName, ProgName, Date, Time,
          Ncols*Nrows*Nlays, Data_Buffer )                – time interpolation
XTRACT3 ( FileName, VarName, StartLay, EndLay,
          StartRow, EndRow, StartCol, EndCol,
          Date, Time, Data_Buffer )                       – spatial subset
INTERPX ( FileName, VarName, ProgName,
          StartCol, EndCol, StartRow, EndRow, StartLay, EndLay,
          Date, Time, Data_Buffer )                       – time interpolation on a spatial subset
WRPATCH ( FileID, VarID, TimeStamp, Record_No )           – called from PWRITE3 (pario)
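As an illustration of how a worker would use the windowed call, a sketch assuming placeholder window bounds and the MET_CRO_3D file / TA variable (not the actual AQF code):

! Sketch: each worker reads and time-interpolates only its own
! column/row window of a gridded met file
LOGICAL, EXTERNAL :: INTERPX
INTEGER :: STARTCOL, ENDCOL, STARTROW, ENDROW, NLAYS, JDATE, JTIME
REAL, ALLOCATABLE :: TA( :, :, : )     ! allocated to this worker's window size

! ... after the window bounds are set and TA is allocated ...
IF ( .NOT. INTERPX ( 'MET_CRO_3D', 'TA', 'SKETCH',                 &
                     STARTCOL, ENDCOL, STARTROW, ENDROW, 1, NLAYS, &
                     JDATE, JTIME, TA ) ) THEN
   CALL M3EXIT ( 'SKETCH', JDATE, JTIME, 'Could not interpolate TA', 1 )
END IF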
Standard (May 2003 Release)
DRIVER
   read IC’s into CGRID
   begin output timestep loop
      advstep (determine sync timestep)
      couple
      begin sync timestep loop
         SCIPROC
            X-Y-Z advect
            adjadv
            hdiff
            decouple
            vdiff (DRYDEP)
            cloud (WETDEP)
            gas chem (aero, VIS)
            couple
      end sync timestep loop
      decouple
      write conc and avg conc (CONC, ACONC)
   end output timestep loop

AQF CMAQ
DRIVER
   set WORKERs, WRITER
   if WORKER, read IC’s into CGRID
   begin output timestep loop
      if WORKER
         advstep (determine sync timestep)
         couple
         begin sync timestep loop
            SCIPROC
               X-Y-Z advect
               hdiff
               decouple
               vdiff
               cloud
               gas chem (aero)
               couple
         end sync timestep loop
         decouple
         MPI send conc, aconc, drydep, wetdep, (vis)
      if WRITER
         completion-wait: for conc, write conc (CONC); for aconc, write aconc, etc.
   end output timestep loop
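Schematically, the worker-to-writer hand-off might look like the sketch below; the subroutine name, message tag and rank layout are invented for illustration, and in the AQF code the writes themselves go through the pario layer (PWRITE3 / WRPATCH):

SUBROUTINE SHIP_OR_WRITE_CONC ( CONC_PATCH, PATCH_SIZE, MYPE, WRITER_PE, NWORKERS )
! Hypothetical sketch of the asynchronous output pattern: workers hand
! their local CONC patch to the dedicated writer and return to computation;
! the writer completion-waits on each worker's patch and writes it.
! Assumes equal-sized patches, worker ranks 0..NWORKERS-1, writer elsewhere.
   USE MPI
   IMPLICIT NONE
   INTEGER, INTENT( IN )    :: PATCH_SIZE, MYPE, WRITER_PE, NWORKERS
   REAL,    INTENT( INOUT ) :: CONC_PATCH( PATCH_SIZE )
   INTEGER, PARAMETER :: CONC_TAG = 100      ! invented message tag
   INTEGER :: PE, STATUS( MPI_STATUS_SIZE ), IERR

   IF ( MYPE .NE. WRITER_PE ) THEN           ! WORKER
      CALL MPI_SEND ( CONC_PATCH, PATCH_SIZE, MPI_REAL, WRITER_PE,   &
                      CONC_TAG, MPI_COMM_WORLD, IERR )
   ELSE                                      ! WRITER
      DO PE = 0, NWORKERS - 1
         CALL MPI_RECV ( CONC_PATCH, PATCH_SIZE, MPI_REAL, PE,       &
                         CONC_TAG, MPI_COMM_WORLD, STATUS, IERR )
         ! ... write PE's patch into its window of the CONC file ...
      END DO
   END IF
END SUBROUTINE SHIP_OR_WRITE_CONC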
Power3 Cluster (NESCC’s LPAR)
[Diagram: SP-Nighthawk nodes, each with 16 CPUs (0–15) sharing memory, connected by a switch]
• The platform (cypress00, cypress01, cypress02) consists of 3 SP-Nighthawk nodes
• All CPUs share user applications with file servers, interactive use, etc.
• 2X slower than NCEP’s (Power4) nodes
Power4 p690 Servers (NCEP’s LPAR)
[Diagram: nodes of 4 CPUs sharing memory, connected by a switch]
• Each platform (snow, frost) is composed of 22 p690 (Regatta) servers
• Each server has 32 CPUs, LPAR-ed into 8 nodes per server (4 CPUs per node)
• Some nodes are dedicated to file servers, interactive use, etc.
• There are effectively 20 servers for general use (160 nodes, 640 CPUs)
• 2 “Colony” switch connectors per node
• ~2X the performance of the cypress nodes
“ice” Beowulf Cluster
[Diagram: dual-CPU nodes (Pentium 3 – 1.4 GHz) on an internal network]
• Isolated from outside network traffic
“global” MPICH Cluster
[Diagram: dual-CPU nodes (Pentium 4 XEON – 2.4 GHz) on a network: global, global1, global2, global3, global4, global5]
RESULTS
5 hour, 12Z – 17Z and 24 hour, 12Z – 12Z runs
20 Sept 2002 test data set used for developing the AQF-CMAQ
Input Met from ETA, processed through PRDGEN and PREMAQ
Domain seen on following slides
CB4 mechanism, no aerosols, Pleim’s Yamartino advection for AQF-CMAQ
166 columns X 142 rows X 22 layers at 12 km resolution
PPM advection for May 2003 Release
The Matrix (X = comparison made; columns: AQF 24-hr run/wall, AQF 5-hr run/wall, 2003 Release 24-hr run/wall, 2003 Release 5-hr run/wall)

cypress    4:  X X
           8:  X X X X X X
          16:  X X
          32:  X X X
snow       4:  X X
           8:  X X X X
          16:  X X
          32:  X X X X
          64:  X X
ice        8:  X X X
global     8:  X X X
run  = comparison of run times
wall = comparison of relative wall times for main science processes
Worker processor decompositions: 4 = 2 X 2, 8 = 4 X 2, 16 = 4 X 4, 32 = 8 X 4, 64 = 8 X 8 or 16 X 4
24–Hour AQF vs. 2003 Release
[Concentration maps: AQF-CMAQ vs. 2003 Release, data shown at peak hour]
Less than 0.2 ppb difference between AQF on cypress and snow; almost 9 ppb max difference between Yamartino and PPM
AQF vs. 2003 Release
Absolute Run Times – 24 Hours, 8 Worker Processors
[Bar chart: run times in seconds for AQF and REL on cypress (runs 1–4), snow, ice, global]
AQF vs. 2003 Release on cypress and AQF on snow
Absolute Run Times – 24 Hours, 32 Worker Processors
[Bar chart: run times in seconds for AQF and REL on cypress (runs 1–5) and snow (runs 1–3)]
AQF CMAQ – Various Platforms Relative Run Times
5 Hours, 8 Worker Processors
[Bar chart: % of slowest for cypress (runs 1–4), snow (runs 1–3), ice (runs 1–2), global (runs 1–2)]
AQF-CMAQ cypress vs. snow
Relative Run Times – 5 Hours
[Bar chart: % of slowest on cypress and snow for the 4, 8, 16, 32 and 64 processor decompositions (primes mark repeated runs; 64b is the alternate 64-processor decomposition)]
AQF-CMAQ on Various Platforms
Relative Run Times – 5 Hours
[Chart: % of slowest vs. number of worker processors (4, 8, 16, 32, 64) for cypress (runs 1–4), snow (runs 1–4), ice (runs 1–2), global (runs 1–2)]
AQF-CMAQ on cypress
Relative Wall Times – 5 Hours
[Bar chart: % of wall time by science process (xadv, yadv, zadv, hdiff, vdiff, cloud, chem) for the 4, 8, 16 and 32 processor runs]
AQF-CMAQ on snow
Relative Wall Times – 5 Hours
[Bar chart: % of wall time by science process (xadv, yadv, zadv, hdiff, vdiff, cloud, chem) for the 4, 8, 16, 32 and 64 processor runs (64b = alternate 64-processor decomposition)]
AQF vs. 2003 Release on cypress
Relative Wall Times – 24 hr, 8 Worker Processors
[Bar chart: % of wall time by science process (xadv, yadv, zadv, hdiff, vdiff, cloud, chem) for Release (rel1–4, PPM advection) and AQF (aqf1–4, Yamo advection)]
AQF vs. 2003 Release on cypress
Relative Wall Times – 24 hr, 8 Worker Processors
Add snow for 8 and 32 processors
[Bar chart: % of wall time by science process (xadv, yadv, zadv, hdiff, vdiff, cloud, chem) for rel1–4, aqf1–4, snow (8) and snow (32)]
AQF Horizontal Advection
Relative Wall Times – 24 hr, 8 Worker Processors
[Bar chart: % of wall time in x-r-l (x row loop), x-hppm, y-c-l (y column loop), y-hppm – the hppm kernel solver called in each loop – for cypress (runs 1–4) and snow]
AQF Horizontal Advection
Relative Wall Times – 24 hr, 32 Worker Processors
[Bar chart: % of wall time in x-r-l, x-hppm, y-c-l, y-hppm for cypress (runs 1–3) and snow (runs 1–3)]
AQF & Release Horizontal Advection
Relative Wall Times – 24 hr, 32 Worker Processors
[Bar chart: % of wall time in x-r-l, x-hppm, y-c-l, y-hppm for cypress (runs 1–3), snow (runs 1–3) and Release on cypress (r-cypress1–3); r- = release]
Future Work
Add aerosols back in
TKE vdiff
Improve horizontal advection/diffusion scalability
Some I/O improvements
Layer-variable horizontal advection time steps
Aspect Ratios
[Figures: subdomain dimensions (columns X rows) of the 166 X 142 domain for the worker decompositions – e.g. 83 X 71 for 2 X 2; 42 or 41 columns X 36 or 35 rows for 4 X 4; 21 or 20 X 18 or 17 for 8 X 8; 11 or 10 X 36 or 35 for 16 X 4]
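The subdomain sizes behind these aspect ratios follow from dividing the 166 X 142 grid among the worker processors. A small sketch, assuming the common convention that the first MOD( NCOLS, NPCOL ) column strips (and likewise for rows) get one extra cell:

PROGRAM SUBDOMAIN_SIZES
! Print the local columns X rows for each worker in an NPCOL X NPROW
! decomposition of the 166 X 142 AQF domain (extra-cell placement is an
! assumption for this sketch, not taken from the AQF code).
   IMPLICIT NONE
   INTEGER, PARAMETER :: NCOLS = 166, NROWS = 142
   INTEGER, PARAMETER :: NPCOL = 8, NPROW = 8
   INTEGER :: I, J, NC, NR

   DO J = 0, NPROW - 1
      NR = NROWS / NPROW
      IF ( J .LT. MOD( NROWS, NPROW ) ) NR = NR + 1
      DO I = 0, NPCOL - 1
         NC = NCOLS / NPCOL
         IF ( I .LT. MOD( NCOLS, NPCOL ) ) NC = NC + 1
         WRITE( *, '( A, I2, A, I2, A, I3, A, I3 )' )   &
            'PE (', I, ',', J, '): ', NC, ' X', NR
      END DO
   END DO
END PROGRAM SUBDOMAIN_SIZES

For the 8 X 8 decomposition this gives 21 or 20 columns by 18 or 17 rows per worker, consistent with the figure above; wider, shorter patches (e.g. 16 X 4) change the ghost-region perimeter and hence the communication cost.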