FASTMath Team
Lori Diachin, Institute Director
FASTMath: Summary of Portable Performance Strategies
FASTMath SciDAC Institute
LLNL-PRES-501654
FASTMath defines portable performance from two perspectives:
• End user of FASTMath software: the same piece of code runs on different architectures with 'good' performance
• Developer of FASTMath software: a relatively small amount of effort is needed to make a change that achieves good performance, within advertised (algorithmic or performance) tolerances, across both current and future architectures

We are particularly targeting portability for two classes of architectures:
• Hybrid multicore (CPU/GPU) systems
• Manycore systems
FASTMath libraries are already working toward performance portability

We surveyed 12 key FASTMath libraries to determine:
• Challenges/strategies to support performance portability within the library
• Challenges/strategies associated with portable performance when using multiple packages together
• Key areas of future investigation

Libraries: PETSc, hypre, MueLu, SUNDIALS, Eigensolvers, SuperLU, Chombo, BoxLib, PUMI/ParMA/PCU, MOAB, Zoltan/graph algorithms, VisIt (guest library)
Summary of on-node performance challenges within a single library

Data movement
• NUMA – multilevel memory management techniques (tiling and smart task/data placement)
• Data motion distance
• Cache coherence

Thread management
• Placement
• Interacting thread pools
• Oversubscription
• Thread collectives/synchronization techniques
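The tiling technique listed above can be made concrete with a small sketch. This is an illustrative example only (not FASTMath code), using a dense matrix transpose as a stand-in kernel: the tiled version works on blocks small enough that both source and destination stay cache-resident.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Naive transpose: the strided writes to b touch a new cache line on
// almost every iteration, so the loop is dominated by data movement.
void transpose_naive(const std::vector<double>& a, std::vector<double>& b,
                     std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            b[j * n + i] = a[i * n + j];
}

// Tiled transpose: process TILE x TILE blocks so the working set of each
// block fits in cache; the answer is identical, only the traversal differs.
void transpose_tiled(const std::vector<double>& a, std::vector<double>& b,
                     std::size_t n) {
    const std::size_t TILE = 32;  // tile edge; tuned per architecture in practice
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t jj = 0; jj < n; jj += TILE)
            for (std::size_t i = ii; i < std::min(ii + TILE, n); ++i)
                for (std::size_t j = jj; j < std::min(jj + TILE, n); ++j)
                    b[j * n + i] = a[i * n + j];
}
```

Tools like TiDA (mentioned later in these notes) aim to generate this kind of tiled traversal automatically rather than by hand.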
Early thoughts on addressing on-node performance challenges within a single library

Execution models
• MPI + X + Y
• Lightweight MPI and lightweight communication libraries
• EAVL runtime

Algorithmic changes
• Changes in algorithms to reduce communication / increase arithmetic intensity
• Fusion and pipelining – both with and without communication
• Compute on the fly, using fast small memory to reduce storage costs? Compression

Multiple kernel support
• Hand-coded versus code-generated kernels
• Code generation

Data and execution abstractions (e.g., templated C++ approaches such as Kokkos and RAJA, or other libraries such as TiDA)

Compiler-based approaches (e.g., ROSE, CHiLL)

Just-in-time compilation

Interfaces/tools
• API for communicating data layout and location (pinning)
• Interfaces for thread pool/pinning information
• Zoltan load balancing/partitioning/coloring

Embedded performance models

Standards 'influence' – OpenMP, C++, etc.
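The fusion/pipelining idea above can be sketched in a few lines. This is an illustrative example (function names are hypothetical, not from any FASTMath package): fusing an axpy with a following dot product halves the sweeps over memory, raising arithmetic intensity without changing the result.

```cpp
#include <cstddef>
#include <vector>

// Unfused: two separate sweeps over the vectors, each limited by
// memory bandwidth; y is read from main memory twice.
double axpy_then_dot(double alpha, const std::vector<double>& x,
                     std::vector<double>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)   // sweep 1: y = alpha*x + y
        y[i] = alpha * x[i] + y[i];
    double d = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i)   // sweep 2: d = y . y
        d += y[i] * y[i];
    return d;
}

// Fused: one sweep does both updates, so each element of y is loaded
// and stored once -- more flops per byte moved.
double fused_axpy_dot(double alpha, const std::vector<double>& x,
                      std::vector<double>& y) {
    double d = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = alpha * x[i] + y[i];
        d += y[i] * y[i];
    }
    return d;
}
```

Both functions return the same value; only the memory traffic differs. This is the same pattern the SUNDIALS and PETSc notes below refer to as fusing/pipelining operations.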
Summary of challenges when using multiple libraries/software packages

Thread management
• Oversubscription of threads
• Programming model consistency

Data management
• Data affinity / using data as laid out by others – this depends on the type of data
• The need to track the original layout so that the information can be used as needed during 're-layout'
• Understanding the costs of moving data versus using layouts as given
• Transitioning data among software components
• How do we do more work on data while it is in cache (L1 or L2)? This can cross library boundaries (e.g., mesh/solver interactions)

Algorithmic improvements that result from closer coupling of mesh and solver
• Using mesh information to develop improved matrix-free solution algorithms, though this introduces software issues

Performance diagnostics
• User decisions can significantly impact data remap costs – we can make this information transparent so that the user knows the implications of the choices they are making
• Communicating performance to the user – what should they expect?

Resource management
• Making libraries small enough not to swamp memory – heavily templated libraries are a particular concern
• If there is a lot of per-process data that cannot be shared – with, say, 50 processes – how much is private and must be replicated, and how much of the data can be shared to minimize memory costs?
• Libraries sharing/coordinating use of fast memory and threads
Early thoughts on strategies for addressing the challenges associated with multiple libraries/packages

Data management across libraries
• Do we need a 'data mediator/coupler'? It would not 'manage' all the data (we have concerns about a full data manager). It is unclear whether this is needed overall or only pairwise – accepting data in an 'intermediate form' that can be quickly optimized for each component's use
• Fusion between libraries at different scales of granularity (e.g., smaller chunks of data to allow data reuse while in cache)
• Optimize data layouts across multiple components a priori? We are not even doing the simple things well right now
• Have the producer and consumer of the data preserve some locality across the interface

Thread management across libraries
• Thread communicator API – allows multiple programming models to communicate information about threads and thread pools
• Need to associate thread information with data layout

Testing solutions
• Look at sub-components of the problem to simplify the process to some extent – e.g., look only at solvers rather than the full PDE solver
• Develop mini-apps that combine software tools to explore these ideas and solution strategies

Resource management
• Phased communication and resource splitting/allocation for different parts of the solution process
• Need good shared-library practices
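The thread communicator API discussed above was still an open design question at the time of these notes, but the core idea can be sketched. Everything below is hypothetical (the type and function names are illustrative, not a real FASTMath interface): one library publishes a small descriptor of the shared thread pool and its pinning, and a consuming library partitions its work against that descriptor instead of spawning its own threads and oversubscribing the node.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical "thread communicator": a descriptor that one library fills
// in and another queries, so both agree on how many threads exist and
// which cores they are pinned to.
struct ThreadComm {
    std::size_t nthreads;             // size of the shared thread pool
    std::vector<int> core_of_thread;  // pinning: thread t runs on core_of_thread[t]
};

// A consuming library splits a loop of length n into one contiguous
// [lo, hi) range per thread in the shared pool, rather than creating
// its own competing pool.
std::vector<std::pair<std::size_t, std::size_t>>
partition_range(const ThreadComm& tc, std::size_t n) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t chunk = (n + tc.nthreads - 1) / tc.nthreads;  // ceiling division
    for (std::size_t t = 0; t < tc.nthreads; ++t) {
        std::size_t lo = std::min(n, t * chunk);
        std::size_t hi = std::min(n, lo + chunk);
        ranges.emplace_back(lo, hi);
    }
    return ranges;
}
```

Associating the pinning information (`core_of_thread`) with a data layout is exactly the second bullet above: a library can then place the data a thread will touch on the NUMA domain of that thread's core.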
Library Summaries
PETSc (POC: Barry Smith)

Current Strategies:
• Thread communicator concept to allow passing of thread and data information among packages
• Algorithmic changes to fuse/pipeline operations
• MPI+OpenMP; ViennaCL for GPU support

Future Strategies:
• (not captured)
Hypre (POC: Rob Falgout)

Current Strategies:
• Reducing communication
• Changing math algorithms
• Focus on portability (minimize external dependencies)
• Using OpenMP; will investigate OpenACC

Future Strategies:
• Compiler-based approaches
• Reorder operations to exploit better convergence properties of Gauss-Seidel within a node
• Code generation approaches for reordering operations to explore communication/computation tradeoffs
Zoltan2 (POC: Karen Devine)

Current Strategies:
• Task placement to get locality and reduce communication costs
• Partitioning to reduce the number of neighbors

Future Strategies:
• Whether/how Kokkos plays a role
• Partitioning to divide work between CPU and GPU
• Partitioning for fixed-size memories

Note:
• Zoltan could provide a suite of tools that many others in FASTMath can use – perhaps Kokkos can benefit from using Zoltan. It might be able to provide a bridge between tiling tools and higher-level libraries, mapping application topology onto machine topology using task placement strategies
MueLu (POC: Jonathan Hu)

Current Strategies:
• Reducing communication
• Changing math algorithms

Future Strategies:
• Higher-level kernels built on Kokkos
• Fused kernels (prolongator smoothing)
• Task-based parallelism for more expensive aggregation schemes
• Data layout for different devices
• Could use tools like hwloc (gives the layout of the node and allows you to pin threads) for large-scale systems, portable across platforms
Chombo (POC: Brian Van Straalen)

Current Strategies:
• MPI+OpenMP
• Coarse-grained (released) and fine-grained parallelism; experiments with smaller kernel codes with fine locking where it does perform better for geometric multigrid – a co-scheduled moving wavefront – need to be done by hand or with the CHiLL compiler to be successful
• Using different wavefront algorithms for different architectures
• Several DAG-based parallel strategies (HPX, Charm++, etc.) with limited success so far
• Tried tiling but have not had success yet with our version (now trying TiDA)

Future Strategies:
• Need a compiler to help with fusion (tiling may mitigate this by providing data locality)
• DSL for AMR (coded up in C++ and Fortran; it does reduce the amount of code by quite a bit) – separates the dimensionality and the data layout from the algorithm, and can feed the tiler and the communication code generation
SuperLU (POC: Sherry Li)

Current Strategies:
• (not captured)

Future Strategies:
• DAG scheduling tool rather than manual scheduling (static is fine; dynamic would mess up the data structures – could be a coloring problem) – could use Zoltan
• Interested in a new programming model or library to use heterogeneous cores. Don't need help on the CPU side – need help with CPU/GPU transfer and making the API more uniform across architectures; currently have cudaMemcpy, would like a generic GPUMemCopy
• Would like better performance models – these need to be broader than on-node; also need across-node models
PUMI/ParMA/PCU (POC: Mark Shephard)

Current Strategies:
• Lightweight API for MPI and threads – we have an inexpensive abstraction for this
• API for the data structures to allow switching among different data structures for different architectures – link against different hand-coded implementations

Future Strategies:
• Optimal use cases all involve using these tools with solvers, etc. – will need to be consistent across packages
• Could use portable hwloc to pin threads/memory
BoxLib (POC: Ann Almgren)

Current Strategies:
• Using tiling/TiDA
• Load balancing – accurate assessment of the workload, the cost of moving the data, and the communication patterns

Future Strategies:
• Zoltan might be able to help with some of this
• Establish detailed profiling to understand current bottlenecks/costs
• Wish for a fast malloc for threads – would be private to a thread; haven't tried tcmalloc yet (does it work for Fortran?)
PARPACK/Eigensolvers (POC: Chao Yang)

Current Strategies:
• Using hybrid MPI/OpenMP
• New algorithms to increase concurrency

Future Strategies:
• Partners are interested in GPUs; calculations are very memory intensive, which is problematic – may need mixed CPU/GPU. They don't want to do this themselves and explicitly copy things back and forth – worried about maintenance; want something lightweight
• Want to explore matrix-free methods to reduce memory use – compute-on-the-fly algorithms
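The matrix-free idea mentioned above is worth a small sketch (illustrative only, not PARPACK code). Krylov eigensolvers such as ARPACK only ever need the action y = A*x, so for a stencil-like operator the matrix entries can be computed on the fly and never stored, shrinking the memory footprint to just the vectors.

```cpp
#include <cstddef>
#include <vector>

// Matrix-free application of the 1D discrete Laplacian, y = A*x with
// A = tridiag(-1, 2, -1) and zero Dirichlet boundaries. The stencil
// coefficients are generated on the fly; no matrix is ever assembled.
std::vector<double> apply_laplacian(const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> y(n);
    for (std::size_t i = 0; i < n; ++i) {
        double left  = (i > 0)     ? x[i - 1] : 0.0;
        double right = (i + 1 < n) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;
    }
    return y;
}
```

A reverse-communication or callback-based eigensolver would invoke this function wherever it needs a matrix-vector product, which is why matrix-free operators slot naturally into ARPACK-style interfaces.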
Segmental Refinement Multigrid (POC: Mark Adams)

Current Strategies:
• (not captured)

Future Strategies:
• Thread communicator in Chombo; using threaded PETSc in Chombo
• Restructure the algorithm to better utilize the buffering strategy in segmental refinement – never store the entire problem; compute fine-grid data on the fly
MOAB (POC: Vijay Mahadevan)

Current Strategies:
• Using low-level APIs – don't know how the data structures are laid out or how they will be used
• OpenARC
• Writing a mini-app

Future Strategies:
• Need to expose a higher-level API for data structures
SUNDIALS (POC: Carol Woodward)

Current Strategies:
• Using a bit of threading for the vector work we supply to users
• Using CVODE in a threadsafe manner
• SUNDIALS doesn't hold much data itself – it uses callbacks, so it can rely heavily on user/other code for threading. This works well for cases where you don't need much fusing for performance. Can the callbacks be changed to return fused operations rather than BLAS-1 operations?

Future Strategies:
• Reducing communication through vector kernel fusion
• GPU interaction with user-supplied vectors – need to play well with what the user supplies and with the libraries underneath
• Tools: code generation tools for architecture-aware/specific vector kernel code. Which fused kernels are useful for linear algebra? Are there too many? Is there a reasonable number that can still be exposed as callbacks?
• Fusion that includes communication
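The fused-callback question above can be illustrated with one hypothetical kernel (the name and signature are illustrative, not SUNDIALS API): instead of the integrator issuing k separate BLAS-1 axpy callbacks, each sweeping over the result vector, the user supplies a single linear-combination callback that touches the result once.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical fused vector callback: z = sum_k c[k] * x[k] in one pass.
// With k separate axpy calls, z is read and written k times; here each
// element of z is written exactly once, cutting memory traffic.
void fused_linear_combination(const std::vector<double>& c,
                              const std::vector<std::vector<double>>& x,
                              std::vector<double>& z) {
    for (std::size_t i = 0; i < z.size(); ++i) {
        double acc = 0.0;
        for (std::size_t k = 0; k < c.size(); ++k)
            acc += c[k] * x[k][i];
        z[i] = acc;  // single write per element instead of one per term
    }
}
```

Linear combinations of this shape appear throughout multistep and Runge-Kutta integrators (stage and solution updates), which is why they are a natural first candidate for a fused callback set.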
VisIt (Guest Package) (POC: Mark Miller)

Current Strategies:
• Prototyping GPU support with multilevel EAVL (a lightweight C++ tool from ORNL) – like Kokkos but operating at two levels of abstraction (structured and unstructured mesh)
• Implement a few key VTK objects in EAVL; these will then run on CPU, GPU, etc. transparently, with the expression templates handled on GPUs

Future Strategies:
• (not captured)
Next Steps

Capture the notes here in a text document – assignments for folks to flesh them out

Next meeting:
• Piggyback on the IDEAS meeting, January 27–29, in the Bay Area
• Piggyback on CSE15; ask Chris Johnson for space at Utah

Thread communicator – Barry, Jed, others?
• Develop and present a new C API – Barry/Jed/others?
• Create a working group to provide feedback, early use cases, etc.