Manno, 4.5..2011, © by Supercomputing Systems 1 1 COSMO - Dynamical Core Rewrite Approach, Rewrite...
-
Upload
elfrieda-potter -
Category
Documents
-
view
215 -
download
0
Transcript of Manno, 4.5..2011, © by Supercomputing Systems 1 1 COSMO - Dynamical Core Rewrite Approach, Rewrite...
Manno, 4.5..2011, © by Supercomputing Systems1 1
COSMO - Dynamical Core Rewrite Approach, Rewrite and Status
Tobias Gysi
POMPA Workshop, Manno, 3.5.2011
Supercomputing Systems AG Fon +41 43 456 16 00
Technopark 1 Fax +41 43 456 16 10
8005 Zürich www.scs.ch
Manno, 4.5..2011, © by Supercomputing Systems3 3
COSMO Dynamical Core Rewrite
Challenge
•Assuming that the COSMO code will continue run on commodity processors
in the next couple of years, what is the performance improvement we can
achieve by rewriting the dynamical core?
Boundary Conditions
•Do not touch the underlying physical model
(i.e. equations that are being solved)
– Formulas must remain as they are
– Arbitrary ordering of computations, etc. may change
– Results must remain ‘identical’ to ‘a very high level of accuracy’
•Part of an initiative looking at all parts of the COSMO code.
Support
•Support from & direct interaction with MeteoSwiss, DWD, CSCS, C2SM
Manno, 4.5..2011, © by Supercomputing Systems4 4
Approach
Feasibility Study Lib. Design
Rewrite
Test Tune
Feasibility
Library
Test & Tune
~2 Years
CPU
GPUt
Yo
u A
re H
ere
Manno, 4.5..2011, © by Supercomputing Systems6 6
Feasibility Study - Overview
• Get to know the code
• Understand performance characteristics
• Find computational motives
– Stencil
– Tri-Diagonal Solver
• Implement a prototype code
– Relevant part of the dynamical core
(Fast Wave Solver, ~30% of total runtime)
– Try to optimize for x86
– No MPI parallelization
Manno, 4.5..2011, © by Supercomputing Systems8 8
Feasibility Study - Prototype
• Implemented in C++
• Optimize for memory-bandwidth
utilization
– Avoid pre-computation, do
computation on the fly
– Merge loops accessing the
common variables
– Use iterators rather than full index
calculation on 3D grid
– Store data contiguous in
‘k-direction’ (vertical columns)
Manno, 4.5..2011, © by Supercomputing Systems9 9
Fast Wave Solver - SpeedupThe performance
difference is NOT due to programming
language but due to code optimizations!
Manno, 4.5..2011, © by Supercomputing Systems10 10
Feasibility Study - Conclusion
• A performance increase of 2x has been achieved on a representative part
of the code
• Main optimizations identified (for scalar processors)
– Avoid pre-calculation whenever possible
– Merge loops
– Change the storage order to k-first
• Performance is all about memory bandwidth
Manno, 4.5..2011, © by Supercomputing Systems12 12
Design Targets
Write a code that
•Delivers the right results
– Dedicated unit-tests & verification framework
•Apply the performance optimization strategies used in the prototype
•Can be developed within a year to run on x86 and GPU platforms
– Mandatory: support three-level parallelism in a very flexible way
• Vector processing units (e.g. SSE)
• Multi-core node (sub-domain)
• Multiple nodes (domain) - not part of the SCS project
– Optional: write one single code that can be compiled to both platforms
Manno, 4.5..2011, © by Supercomputing Systems13 13
Design Targets
Write a code that
•Facilitates future improvements in terms of
– New models / algorithms
– Portability to new computer architectures
•Can and will be integrated by the COSMO consortium into the main
branch
Manno, 4.5..2011, © by Supercomputing Systems14 14
Stencil Library - Ideas
• It is challenging to develop a stencil library
– There is no big chunk of work that can be hidden behind a API call
(e.g. matrix multiplication)
– The actual update function of the stencil is heavily application specific
and performance critical
• We use a DSEL like approach (Domain Specific Embedded
Language)
– “Stencil language” embedded in C++
– Separate description of loop logic and update function
– During compile time generate optimized C++ code
(possible due to C++ meta programming capabilities)
Manno, 4.5..2011, © by Supercomputing Systems15 15
Stencil Library - Parallelization
• Parallelization on the node level is done by
– Splitting the calculation domain into blocks (IJ-Plane)
– Parallelize the work over the blocks
– Double buffering avoids concurrency issues
Manno, 4.5..2011, © by Supercomputing Systems16 16
Stencil Library – Loop Merging
• The library allows the definition of multiple stages per stencil
– Stages are update functions applied consecutively to one block
– As a block is typically much smaller than the complete domain
we can leverage the caches of the CPU
Manno, 4.5..2011, © by Supercomputing Systems17 17
Stencil Library – Calculation On The Fly
• Calculation on the fly is
supported using a
combination of stages and
column buffers
– Column buffers are
fields with the size of
one block local to every
CPU core
– A first stage writes to a
buffer while a second
stage consumes the
pre-calculated values
Manno, 4.5..2011, © by Supercomputing Systems18 18
Stencil Code – My Toy Example
1. Naive
for k
a(k) := b(k) + c(k)
end
...
for k
d(k) := a(k-1)*e(-1) +
a(k)*e(0) +
a(k+1)*e(+1)
end
...
for k
f(k) := a(k)*g(k) + d(k)
end
Manno, 4.5..2011, © by Supercomputing Systems19 19
Stencil Code – My Toy Example
2. No pre-calculation
for k
d(k) := (b(k-1)+c(k-1))*e(-1) +
(b(k)+c(k))*e(0) +
(b(k+1)+c(k+1))*e(+1)
f(k) := (b(k)+c(k))*g(k) + d(k)
end
1. Naive
for k
a(k) := b(k) + c(k)
end
...
for k
d(k) := a(k-1)*e(-1) +
a(k)*e(0) +
a(k+1)*e(+1)
end
...
for k
f(k) := a(k)*g(k) + d(k)
end
Manno, 4.5..2011, © by Supercomputing Systems20 20
Stencil Code – My Toy Example
3. Pre-calculation with
temporary variables
for k
z := b(k+1) + c(k+1)
d(k) := x*e(-1) +
y*e(0) +
z*e(+1)
f(k) := y*g(k) + d(k)
x:=y
y:=z
end
1. Naive
for k
a(k) := b(k) + c(k)
end
...
for k
d(k) := a(k-1)*e(-1) +
a(k)*e(0) +
a(k+1)*e(+1)
end
...
for k
f(k) := a(k)*g(k) + d(k)
end
Manno, 4.5..2011, © by Supercomputing Systems21 21
Stencil Code – My Toy Example
4. Pre-calculation with
column buffer
for k
a(k) := b(k) + c(k)
end
for k
d(k) := a(k-1)*e(-1) +
a(k)*e(0) +
a(k+1)*e(+1)
f(k) := a(k)*g(k) + d(k)
end
1. Naive
for k
a(k) := b(k) + c(k)
end
...
for k
d(k) := a(k-1)*e(-1) +
a(k)*e(0) +
a(k+1)*e(+1)
end
...
for k
f(k) := a(k)*g(k) + d(k)
end
Manno, 4.5..2011, © by Supercomputing Systems22 22
Stencil Code – My Toy Example
5. Pre-calculation with
stages & column Buffer
Stencil
Stage 1
a := b + c
Stage 2
d := a*e (k:-1,0,1)
Stage 3
f := a*g + d
Apply Stencil
1. Naive
for k
a(k) := b(k) + c(k)
end
...
for k
d(k) := a(k-1)*e(-1) +
a(k)*e(0) +
a(k+1)*e(+1)
end
...
for k
f(k) := a(k)*g(k) + d(k)
end
Manno, 4.5..2011, © by Supercomputing Systems24 24
Status
• So far the following stencils have been implemented:
– Fast wave solver (w bottom boundary initialization missing)
– Advection
• 5th order advection
• Bott 2 advection (cri implementation missing)
– Complete tendencies
– Horizontal Diffusion
– Coriolis
• The next steps are:
– Implicit vertical diffusion
– Put it all together
– Performance optimization