Transcript of: Bert de Jong, High Performance Software Development, Molecular Science Computing Facility

Page 1

Workshop on Parallelization of Coupled-Cluster Methods

Panel 1: Parallel efficiency

An incomplete list of thoughts

Bert de Jong
High Performance Software Development

Molecular Science Computing Facility

Page 2


Overall hardware issues

- Computer power per node has increased
  - The increase in single-CPU speed has flattened out (but you never know!)
  - Multiple cores together tax the other hardware resources in a node
- Bandwidth and latency for the other major hardware resources are far behind, affecting the flops we actually use
  - Memory
    - Very difficult to feed the CPU
    - Multiple cores further reduce the bandwidth each core sees (see the sketch below)
  - Network
    - Data access is considerably slower than memory
    - Speed of light is our enemy
  - Disk input/output
    - Slowest of them all; disks spin only so fast
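The "feeding the CPU" point can be made concrete with a back-of-the-envelope machine-balance estimate. The following C sketch uses purely hypothetical numbers (per-core peak rate, node memory bandwidth, core count; none of them taken from these slides) and works out how many flops must be done per byte fetched before a core stops being bandwidth-bound.

/*
 * Machine-balance sketch with assumed, illustrative numbers -- not
 * measurements of any real node.
 */
#include <stdio.h>

int main(void) {
    double peak_gflops_per_core = 10.0;  /* assumed peak FP rate of one core */
    double mem_bw_gbytes_per_s  = 20.0;  /* assumed node memory bandwidth    */
    int    cores_per_node       = 8;     /* assumed cores sharing that bus   */

    /* Bandwidth each core actually sees when all cores stream data. */
    double bw_per_core = mem_bw_gbytes_per_s / cores_per_node;

    /* Flops needed per byte fetched to stay compute-bound rather than
     * memory-bound. */
    double balance = peak_gflops_per_core / bw_per_core;

    printf("per-core bandwidth: %.2f GB/s\n", bw_per_core);
    printf("required arithmetic intensity: %.1f flops/byte\n", balance);
    return 0;
}

With these made-up figures each core sees 2.5 GB/s and needs roughly 4 flops per byte to stay busy; a dot product delivers about 0.125 flops/byte and is hopelessly memory-bound, while a large DGEMM reuses each matrix element many times and can clear the bar, which is where the coupled-cluster formulation on the next slide comes in.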

Page 3


Dealing with memory

- Amounts of data needed in coupled cluster can be huge
  - Amplitudes
    - Too large to store on a single node (except for T1)
    - Shared memory would be good, but will shared memory of 100s of terabytes be feasible and accessible?
  - Integrals
    - Recompute vs. store (on disk or in memory)
    - Can we avoid access to memory when recomputing?
- Coupled cluster has one advantage: it can easily be formulated as matrix multiplication (see the sketch below)
  - Can be very efficient: DGEMM on EMSL’s 1.5 GHz Itanium-2 system reached over 95% of peak efficiency
  - As long as we can get all the needed data in memory!
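To illustrate the matrix-multiplication formulation, here is a minimal sketch of one representative contraction, R(a,b,i,j) += sum over c,d of V(a,b,c,d) * T2(c,d,i,j), written as a single DGEMM by flattening the index pairs into composite row and column indices. The dimensions, array names, and the choice of contraction are illustrative assumptions, not a description of any particular code; it links against any CBLAS implementation.

/*
 * Sketch: casting a coupled-cluster tensor contraction as one DGEMM call.
 * Hypothetical contraction:  R(a,b,i,j) += sum_{c,d} V(a,b,c,d) * T2(c,d,i,j)
 * Build with a CBLAS library, e.g.:  cc dgemm_cc.c -lcblas
 */
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    int nv = 40, no = 10;                 /* virtual/occupied sizes (made up) */
    int ab = nv * nv, cd = nv * nv, ij = no * no;

    /* Tensors stored contiguously so the composite indices are row-major. */
    double *V  = calloc((size_t)ab * cd, sizeof(double)); /* V[(a,b),(c,d)]  */
    double *T2 = calloc((size_t)cd * ij, sizeof(double)); /* T2[(c,d),(i,j)] */
    double *R  = calloc((size_t)ab * ij, sizeof(double)); /* R[(a,b),(i,j)]  */

    /* ... fill V with integrals and T2 with amplitudes ... */

    /* R += V * T2 : the whole contraction is a single matrix multiply. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                ab, ij, cd,
                1.0, V, cd,        /* A and its leading dimension */
                     T2, ij,       /* B and its leading dimension */
                1.0, R, ij);       /* C and its leading dimension */

    free(V); free(T2); free(R);
    return 0;
}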

Page 4


Dealing with networks

- With 10s of terabytes of data on distributed-memory systems, getting data from remote nodes is inevitable
  - Can be no problem, as long as you can hide the communication behind computation
  - Fetch data while computing = one-sided communication (see the sketch below)
  - NWChem uses Global Arrays to accomplish this
- Issues are:
  - Low bandwidth and high latency relative to increasing node speed
  - Non-uniform networks
    - Cabling a full fat tree can be cost prohibitive
    - Effect of network topology
    - Fault resiliency of the network
  - Multiple cores have to compete for a limited number of busses
  - Data contention increases with increasing node count
- Data locality, data locality, data locality
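The slides note that NWChem hides remote access behind computation through one-sided Global Arrays operations. The sketch below shows the same pattern, start a one-sided get for the next block, compute on the current block while it is in flight, then wait and swap buffers, but uses plain MPI-3 RMA so the example stays self-contained; the block size, the ring-style access pattern, and compute_on_block are assumptions for illustration.

/* One-sided get overlapped with computation (MPI-3 RMA double buffering). */
#include <mpi.h>
#include <stdlib.h>

#define BLK 1024                        /* doubles per remote block (made up) */

static void compute_on_block(const double *blk, int n) {
    volatile double s = 0.0;            /* stand-in for the real tile of work */
    for (int i = 0; i < n; ++i) s += blk[i] * blk[i];
    (void)s;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Each rank exposes one block of "amplitudes" in an RMA window. */
    double *local;
    MPI_Win win;
    MPI_Win_allocate(BLK * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &local, &win);
    for (int i = 0; i < BLK; ++i) local[i] = rank + 0.001 * i;

    MPI_Win_lock_all(0, win);

    double *cur = malloc(BLK * sizeof(double));
    double *next = malloc(BLK * sizeof(double));
    MPI_Request req;

    /* Prime the pipeline: fetch the first remote block. */
    MPI_Rget(cur, BLK, MPI_DOUBLE, (rank + 1) % nproc, 0, BLK, MPI_DOUBLE,
             win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    for (int step = 2; step <= nproc; ++step) {
        /* Start fetching the next block ... */
        MPI_Rget(next, BLK, MPI_DOUBLE, (rank + step) % nproc, 0,
                 BLK, MPI_DOUBLE, win, &req);
        /* ... and compute on the current one while it is in flight. */
        compute_on_block(cur, BLK);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        double *tmp = cur; cur = next; next = tmp;   /* swap buffers */
    }
    compute_on_block(cur, BLK);

    MPI_Win_unlock_all(win);
    free(cur); free(next);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}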

Page 5


Dealing with spinning disks

- Using local disk
  - Will only contain data needed by its own node
  - Can be fast enough if you put a large number of spindles behind it
  - And, again, if you can hide it behind computation (pre-fetch; see the sketch below)
  - With 100,000s of disks, the chance of failure becomes significant
    - Fault tolerance of the computation becomes an issue
- Using globally shared disk
  - Crucial when going to very large systems
  - Allows for large files shared by large numbers of nodes
  - Lustre file systems of petabytes are possible
  - Speed is limited by the number of access points (hosts)
    - Large numbers of reads and writes need to be handled by a small number of hosts, creating lock and access contention
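The pre-fetch idea for local disk can be sketched with double buffering: start an asynchronous read of the next block while computing on the current one. The example below uses POSIX AIO; the file name integrals.bin, the block size, and compute_on_block are placeholders, and on Linux the program typically links with -lrt.

/* Double-buffered disk pre-fetch hidden behind computation (POSIX AIO). */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK (1 << 20)                   /* 1 MiB per read (made up) */

static void compute_on_block(const char *buf, ssize_t n) {
    (void)buf; (void)n;                 /* stand-in for the real work */
}

int main(void) {
    int fd = open("integrals.bin", O_RDONLY);   /* hypothetical scratch file */
    if (fd < 0) { perror("open"); return 1; }

    char *cur = malloc(BLK), *next = malloc(BLK);
    off_t offset = 0;

    /* Synchronous read of the first block primes the pipeline. */
    ssize_t got = pread(fd, cur, BLK, offset);
    offset += (got > 0 ? got : 0);

    while (got > 0) {
        /* Kick off the asynchronous read of the next block ... */
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = next;
        cb.aio_nbytes = BLK;
        cb.aio_offset = offset;
        if (aio_read(&cb) != 0) { perror("aio_read"); break; }

        /* ... and overlap it with computation on the current block. */
        compute_on_block(cur, got);

        /* Wait for the pre-fetch to finish, then swap buffers. */
        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);
        if (aio_error(&cb) != 0) { fprintf(stderr, "aio read failed\n"); break; }
        got = aio_return(&cb);
        offset += (got > 0 ? got : 0);
        char *tmp = cur; cur = next; next = tmp;
    }

    free(cur); free(next);
    close(fd);
    return 0;
}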

Page 6


What about beyond 1 petaflop?

- Possibly 100,000s of multicore nodes
  - How does one create a fat enough network between that many nodes?
- Possibly 32, 64, 128 or more cores per node
  - All cores simply cannot do the same thing anymore
    - Not enough memory bandwidth
    - Not enough network bandwidth
  - Heterogeneous computing within a node (CPU+GPU)
  - Designate nodes for certain tasks (see the sketch below)
    - Communication
    - Memory access, put and get
    - Recomputing integrals, hopefully using cache only
    - DGEMM operations
  - Task scheduling will become an issue
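One simple way to designate ranks for certain tasks is to split them into roles with separate communicators. The sketch below makes every eighth MPI rank a data-server/communication rank and the rest compute ranks; the ratio, the role names, and the printed messages are assumptions for illustration, not how any existing coupled-cluster code lays out its work.

/* Splitting MPI ranks into server and compute roles with MPI_Comm_split. */
#include <mpi.h>
#include <stdio.h>

#define RANKS_PER_SERVER 8              /* 1 server rank per 8 ranks (made up) */

enum role { ROLE_SERVER = 0, ROLE_COMPUTE = 1 };

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Assign a role by position, then give each role its own communicator
     * so servers and workers can be organized and scheduled separately. */
    int role = (rank % RANKS_PER_SERVER == 0) ? ROLE_SERVER : ROLE_COMPUTE;
    MPI_Comm role_comm;
    MPI_Comm_split(MPI_COMM_WORLD, role, rank, &role_comm);

    int role_rank, role_size;
    MPI_Comm_rank(role_comm, &role_rank);
    MPI_Comm_size(role_comm, &role_size);

    if (role == ROLE_SERVER) {
        /* would sit in a progress loop serving put/get requests */
        printf("rank %d: data server %d of %d\n", rank, role_rank, role_size);
    } else {
        /* would recompute integrals and run DGEMM tiles handed out by a
         * dynamic task scheduler */
        printf("rank %d: compute worker %d of %d\n", rank, role_rank, role_size);
    }

    MPI_Comm_free(&role_comm);
    MPI_Finalize();
    return 0;
}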

Page 7


WR Wiley Environmental Molecular Sciences Laboratory

A national scientific user facility integrating experimental and computational resources for discovery and technological innovation.