Parallel Molecular Dynamics Application Oriented Computer Science Research Laxmikant Kale .


Parallel Molecular Dynamics

Application Oriented

Computer Science Research

Laxmikant Kale

http://charm.cs.uiuc.edu


Outline

• What is needed for HPC to succeed?

• Parallelization of Molecular Dynamics
– Aggressive parallel decomposition
– Load balancing and performance
– Multiparadigm programming

• Collaborative Interdisciplinary Research
– Comments and lessons


Contributors

• PIs:
– Laxmikant Kale, Klaus Schulten, Robert Skeel

• NAMD 1:
– Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson

• NAMD 2:
– M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan


Parallel Computing Research

• Trends:
– Application-centered CS research
– Isolated CS research

• Drawbacks of both

• Needed:
– Computer Science centered, yet application-oriented research


Middle layers

Applications

Parallel Machines

“Middle Layers”: Languages, Tools, Libraries


Molecular Dynamics

• Collection of [charged] atoms, with bonds

• Newtonian mechanics

• At each time-step:
– Calculate forces on each atom
  • bonded
  • non-bonded: electrostatic and van der Waals
– Calculate velocities and advance positions

• 1 femtosecond time-step, millions needed!

• Thousands of atoms (1,000 - 100,000)
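The per-time-step loop above can be sketched in a few lines. This is only an illustration: it assumes a velocity-Verlet integrator (the slides do not name one), one-dimensional coordinates, and a toy harmonic force standing in for the real bonded and non-bonded terms. The names `harmonic_force`, `step`, and `total_energy` are hypothetical.

```python
def harmonic_force(positions, k=1.0):
    """Toy force (assumption): each coordinate is pulled toward the origin."""
    return [-k * x for x in positions]


def step(positions, velocities, masses, dt, force_fn):
    """Advance one time-step with velocity Verlet (an assumed integrator)."""
    forces = force_fn(positions)
    # Half-update the velocities, then advance the positions.
    half_v = [v + 0.5 * dt * f / m for v, f, m in zip(velocities, forces, masses)]
    new_pos = [x + dt * v for x, v in zip(positions, half_v)]
    # Recompute forces at the new positions and finish the velocity update.
    new_forces = force_fn(new_pos)
    new_vel = [v + 0.5 * dt * f / m for v, f, m in zip(half_v, new_forces, masses)]
    return new_pos, new_vel


def total_energy(positions, velocities, masses, k=1.0):
    """Kinetic plus harmonic potential energy, used here as a sanity check."""
    kinetic = sum(0.5 * m * v * v for m, v in zip(masses, velocities))
    potential = sum(0.5 * k * x * x for x in positions)
    return kinetic + potential
```

Calling `step` repeatedly with a small `dt` should keep `total_energy` close to its initial value, which is a quick check that the force and integration halves agree.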


Further MD

• Use of cut-off radius to reduce work
– 8 - 14 Å
– Faraway charges ignored!

• 80-95% of the work is non-bonded force computation

• Some simulations need faraway contributions
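As a rough illustration of what the cut-off does, the sketch below keeps only the atom pairs whose separation is within the cut-off radius. It is a brute-force scan over all pairs, written for clarity rather than speed; the function name `pairs_within_cutoff` and the sample coordinates are hypothetical.

```python
import math
from itertools import combinations


def pairs_within_cutoff(coords, cutoff):
    """Return index pairs (i, j), i < j, whose distance is at most `cutoff`."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if math.dist(a, b) <= cutoff
    ]


# Four atoms on a line (coordinates in Å, made up for the example).
atoms = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0), (24.0, 0.0, 0.0)]
close_pairs = pairs_within_cutoff(atoms, cutoff=12.0)
```

With a 12 Å cut-off only 2 of the 6 possible pairs survive, which is where the reduction in non-bonded work comes from.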


Scalability

• The program should scale up to use a large number of processors.
– But what does that mean?

• An individual simulation isn’t truly scalable

• Better definition of scalability:
– If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size


Isoefficiency

• Quantify scalability

• How much increase in problem size is needed to retain the same efficiency on a larger machine?

• Efficiency: Seq. Time / (P · Parallel Time)
– Parallel time = computation + communication + idle
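The efficiency formula can be written as two one-line helpers. The function names are hypothetical; the sample timings (1.14 on 1 processor, 0.086 on 16) are the bR figures from the performance table later in the talk.

```python
def speedup(seq_time, par_time):
    """Speedup = sequential time / parallel time."""
    return seq_time / par_time


def efficiency(seq_time, par_time, procs):
    """Efficiency = sequential time / (P * parallel time)."""
    return seq_time / (procs * par_time)


# bR on 16 processors, using the timings reported in the talk.
br_speedup = speedup(1.14, 0.086)
br_efficiency = efficiency(1.14, 0.086, 16)
```

This gives a speedup of about 13.3 and an efficiency of about 0.83, so roughly 17% of the 16-processor run is communication and idle time.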


Traditional Approaches

• Replicated Data:
– All atom coordinates stored on each processor
– Non-bonded forces distributed evenly
– Analysis: assume N atoms, P processors
  • Computation: O(N/P)
  • Communication: O(N log P)
  • Communication/Computation ratio: O(P log P)

• Fraction of communication increases with number of processors, independent of problem size!


Atom decomposition

• Partition the Atoms array across processors
– Nearby atoms may not be on the same processor
– Communication: O(N) per processor
– Communication/Computation: O(P)


Force Decomposition

• Distribute force matrix to processors
– Matrix is sparse, non-uniform
– Each processor has one block
– Communication: O(N/sqrt(P))
– Ratio: O(sqrt(P))

• Better scalability (can use 100+ processors)
– Hwang, Saltz, et al.:
  • 6% on 32 PEs, 36% on 128 processors
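To compare the three schemes so far (replicated data, atom decomposition, force decomposition), the sketch below evaluates their communication-to-computation ratios for a given processor count. It drops all constant factors and assumes a base-2 logarithm, so only the growth trend is meaningful; the function name is hypothetical.

```python
import math


def comm_to_comp_ratio(scheme, procs):
    """Asymptotic communication/computation ratio, constants dropped."""
    if scheme == "replicated":
        return procs * math.log2(procs)  # O(P log P)
    if scheme == "atom":
        return float(procs)              # O(P)
    if scheme == "force":
        return math.sqrt(procs)          # O(sqrt(P))
    raise ValueError(f"unknown scheme: {scheme}")


ratios_at_16 = {s: comm_to_comp_ratio(s, 16) for s in ("replicated", "atom", "force")}
```

None of the three ratios depends on N, so increasing the problem size cannot restore efficiency as P grows; force decomposition only slows the growth.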


Spatial Decomposition

• Allocate close-by atoms to the same processor

• Three variations possible:
– Partitioning into P boxes, 1 per processor
  • Good scalability, but hard to implement
– Partitioning into fixed-size boxes, each a little larger than the cutoff distance
– Partitioning into smaller boxes

• Communication: O(N/P)
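The fixed-size-box variation can be sketched as a simple binning step. This is a minimal illustration, assuming cubic boxes and a caller-supplied box side; the function name `assign_to_patches`, the `origin` parameter, and the sample coordinates are hypothetical.

```python
import math
from collections import defaultdict


def assign_to_patches(coords, patch_side, origin=(0.0, 0.0, 0.0)):
    """Map each cubic box (ix, iy, iz) to the indices of the atoms inside it.

    If `patch_side` is at least the cutoff, two atoms within the cutoff
    always fall in the same box or in adjacent boxes.
    """
    patches = defaultdict(list)
    for idx, pos in enumerate(coords):
        key = tuple(math.floor((p - o) / patch_side) for p, o in zip(pos, origin))
        patches[key].append(idx)
    return dict(patches)


atoms = [(1.0, 1.0, 1.0), (5.0, 2.0, 3.0), (14.0, 1.0, 1.0), (30.0, 30.0, 30.0)]
patches = assign_to_patches(atoms, patch_side=13.0)
```

Because each box only needs data from its immediate neighbors, the communication per processor is proportional to the atoms it and its neighbors hold, which is the O(N/P) figure above.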


Spatial Decomposition in NAMD

• NAMD 1 used spatial decomposition

• Good theoretical isoefficiency, but for a fixed size system, load balancing problems

• For midsize systems, got good speedups up to 16 processors...

• Use the symmetry of Newton’s 3rd law to facilitate load balancing

Spatial Decomposition

[Figures: spatial decomposition diagrams]


FD + SD

• Now, we have many more objects to load balance:
– Each diamond can be assigned to any processor
– Number of diamonds (3D): 14 · Number of Patches
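The factor of 14 follows from counting: in 3D a patch has 26 neighbors, Newton's third law lets each neighbor pair be handled once, giving 13 pair objects per patch, plus 1 object for interactions within the patch itself. The sketch below enumerates these objects. It assumes a periodic grid with at least 3 patches per dimension so that all neighbors are distinct (the slides do not state the boundary conditions); the names are hypothetical.

```python
from itertools import product

# One representative of each +/- neighbor offset pair: 13 of the 26 offsets.
HALF_OFFSETS = [d for d in product((-1, 0, 1), repeat=3) if d > (0, 0, 0)]


def compute_objects(dims):
    """List the (patch, patch) pairs that each need one force object."""
    nx, ny, nz = dims
    objects = []
    for p in product(range(nx), range(ny), range(nz)):
        objects.append((p, p))  # interactions within the patch itself
        for dx, dy, dz in HALF_OFFSETS:
            # Periodic wrap-around (an assumption of this sketch).
            q = ((p[0] + dx) % nx, (p[1] + dy) % ny, (p[2] + dz) % nz)
            objects.append((p, q))
    return objects


objects = compute_objects((3, 3, 3))
```

For a 3 x 3 x 3 grid this yields 14 · 27 = 378 objects, each of which can be placed on any processor.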


Bond Forces

• Multiple types of forces:
– Bonds (2), Angles (3), Dihedrals (4), ...
– Luckily, each involves atoms in neighboring patches only

• Straightforward implementation:
– Send message to all neighbors,
– receive forces from them
– 26*2 messages per patch!


Bonded Forces:

• Assume one patch per processor

[Figure: patches labeled A, B, and C]


Implementation

• Multiple objects per processor
– Different types: patches, pairwise forces, bonded forces
– Each may have its data ready at different times
– Need ability to map and remap them
– Need prioritized scheduling

• Charm++ supports all of these


Charm++

• Data Driven Objects

• Object Groups:
– global object with a “representative” on each PE

• Asynchronous method invocation

• Prioritized scheduling

• Mature, robust, portable

• http://charm.cs.uiuc.edu


Data driven execution

[Figure: two processors, each with a scheduler and a message queue]
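A toy model of data-driven execution with prioritized scheduling is sketched below. It is not the Charm++ API: the `Scheduler` and `Patch` classes, the method names, and the convention that a lower number means higher priority are all assumptions made for illustration.

```python
import heapq
from itertools import count


class Scheduler:
    """Toy per-processor scheduler driven by a prioritized message queue."""

    def __init__(self):
        self._queue = []
        self._seq = count()  # tie-breaker: FIFO order within one priority

    def send(self, priority, obj, method, *args):
        # Asynchronous invocation: enqueue the message and return at once.
        heapq.heappush(self._queue, (priority, next(self._seq), obj, method, args))

    def run(self):
        # Pick the best-priority message and invoke the target object's method.
        while self._queue:
            _, _, obj, method, args = heapq.heappop(self._queue)
            getattr(obj, method)(*args)


class Patch:
    """Hypothetical object whose methods run only when a message arrives."""

    def __init__(self, name, log):
        self.name, self.log = name, log

    def receive_forces(self, source):
        self.log.append(f"{self.name} got forces from {source}")
```

No object ever blocks waiting for a particular message: whichever object has data ready is run next, which is what lets communication for one object overlap with computation for another.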


Load Balancing

• Is a major challenge for this application
– especially for a large number of processors

• Unpredictable workloads
– Each diamond (force object) and patch encapsulates a variable amount of work
– Static estimates are inaccurate

• Measurement-based load balancing
– Very slow variations across timesteps


Bipartite graph balancing

• Background load:
– Patches and angle forces

• Migratable load:
– Non-bonded forces

• Bipartite communication graph
– between migratable and non-migratable objects

• Challenge:
– Balance load while minimizing communication


Load balancing

• Collect timing data for several cycles

• Run heuristic load balancer
– Several alternative ones

• Re-map and migrate objects accordingly
– Registration mechanisms facilitate migration

• Needs a separate talk!
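One simple heuristic of the kind mentioned above is a greedy assignment: take the measured loads, sort the migratable objects heaviest first, and give each to the currently least-loaded processor. The sketch below is a generic textbook heuristic, not the balancer used in NAMD, and it ignores communication cost, which the talk names as part of the challenge. All names and numbers are hypothetical.

```python
import heapq


def greedy_balance(background, migratable):
    """Assign migratable objects to processors, heaviest object first.

    background: measured non-migratable load per processor.
    migratable: dict of object name -> measured load.
    Returns (assignment of name -> processor, final load per processor).
    """
    heap = [(load, pe) for pe, load in enumerate(background)]
    heapq.heapify(heap)
    assignment = {}
    for name, load in sorted(migratable.items(), key=lambda kv: (-kv[1], kv[0])):
        pe_load, pe = heapq.heappop(heap)  # least-loaded processor so far
        assignment[name] = pe
        heapq.heappush(heap, (pe_load + load, pe))
    loads = [0.0] * len(background)
    for load, pe in heap:
        loads[pe] = load
    return assignment, loads


assignment, loads = greedy_balance([3.0, 1.0], {"c1": 3.0, "c2": 2.0, "c3": 1.0})
```

In this example the two processors start at 3.0 and 1.0 units of background load and both end at 5.0.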

Before and After

[Figures: time per processor, split into migratable work and non-migratable work, before and after load balancing]


Performance: size of system

System (# of atoms)    Procs         1       2       4       8      16      32      64     128     160
bR (3,762)             Time       1.14    0.58   0.315   0.158   0.086   0.048       -       -       -
                       Speedup     1.0    1.97    3.61    7.20    13.2    23.7       -       -       -
ER-ERE (36,573)        Time          -   6.115   3.099   1.598   0.810   0.397   0.212   0.123   0.098
                       Speedup       -  (1.97)    3.89    7.54    14.9    30.3    56.8    97.9     123
ApoA-I (92,224)        Time          -       -   10.76    5.46    2.85    1.47   0.729   0.382   0.321
                       Speedup       -       -  (3.88)    7.64    14.7    28.4    57.3     109     130


Performance: various machines

Machine                Procs         1       2       4       8      16      32      64     128     160     192
T3E                    Time          -    6.12    3.10    1.60   0.810   0.397   0.212   0.123   0.098       -
                       Speedup       -  (1.97)    3.89    7.54    14.9    30.3    56.8    97.9     123       -
Origin 2000            Time       8.28    4.20    2.17    1.07   0.542   0.271   0.152       -       -       -
                       Speedup     1.0    1.96    3.80    7.74    15.3    30.5    54.3       -       -       -
ASCI-Red               Time       28.0    13.9    7.24    3.76    1.91    1.01   0.500   0.279   0.227   0.196
                       Speedup     1.0    2.01    3.87    7.45    14.7    27.9    56.0     100     123     143
NOWs (HP735/125)       Time       24.1    12.4    6.39    3.69       -       -       -       -       -       -
                       Speedup     1.0    1.94    3.77    6.54       -       -       -       -       -       -

Speedup

[Figure: measured speedup versus number of processors (0 to 240), plotted against perfect speedup]


Multi-paradigm programming

• Long-range electrostatic interactions
– Some simulations require this
– Contributions of faraway atoms can be calculated infrequently
– PVM-based library, DPMTA
  • developed at Duke by John Board et al.

• Patch life cycle
– Better expressed as a thread
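The appeal of a thread for the patch life cycle is that the per-step sequence reads top to bottom instead of being split across separate message handlers. The sketch below uses Python generators as stand-ins for threads. The three stages shown (send coordinates, wait for forces, integrate) are an assumed life cycle, since the slides do not list the stages, and all names are hypothetical.

```python
def patch_life_cycle(name, steps):
    """Generator standing in for a thread; each yield is a suspension point."""
    for step in range(steps):
        yield f"{name}: step {step}: send coordinates"
        yield f"{name}: step {step}: wait for forces"
        yield f"{name}: step {step}: integrate"


def run_round_robin(threads):
    """Interleave the generators one stage at a time until all finish."""
    trace = []
    active = list(threads)
    while active:
        for t in list(active):
            try:
                trace.append(next(t))
            except StopIteration:
                active.remove(t)
    return trace


trace = run_round_robin([patch_life_cycle("P0", 1), patch_life_cycle("P1", 1)])
```

Each patch keeps its place in the sequence across suspensions, so no explicit state machine is needed to remember which stage comes next.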


Converse

• Supports multi-paradigm programming

• Provides portability

• Makes it easy to implement RTS for new paradigms

• Several languages/libraries:
– Charm++, threaded MPI, PVM, Java, md-perl, pc++, Nexus, Path, Cid, CC++, DP, Agents, ...


Namd2 with Converse


NAMD2

• In production use
– Internally for about a year
– Several simulations completed/published

• Fastest MD program? We think so

• Modifiable/extensible
– Steered MD
– Free energy calculations


Lessons for CSE

• Technical lessons
– Multiple-domain (patch) decomposition provides necessary flexibility
– Data-driven objects and threads are a great combo
– Measurement-based load balancing is better
– Multi-paradigm parallel programming works!
  • Integrate independently developed libraries
  • Use appropriate paradigm for each component


Real Application?

• Drawbacks
– Need to spend effort on mundane details not germane to CS research
– Production program: complicates structure


Real Application for CS research?

• Benefits
– Subtle and complex research problems uncovered only with real application
– Satisfaction of “real” concrete contribution
– With careful planning, you can truly enrich the “middle layers”
– Bring back a rich variety of relevant CS problems
– Apply to other domains: Rockets? Casting?


Collaboration lessons

• Use conservative methods...
– C++: fashionable vs. conservative
– Aggressive methods where they matter

• Account for differing priorities and objectives