Scalability and interoperable libraries in NAMD Laxmikant (Sanjay) Kale Theoretical Biophysics group and Department of Computer Science University of Illinois at Urbana-Champaign


Scalability and interoperable libraries in NAMD

Laxmikant (Sanjay) Kale

Theoretical Biophysics group

and

Department of Computer Science

University of Illinois at Urbana-Champaign


Contributors

• PIs:
  – Laxmikant Kale, Klaus Schulten, Robert Skeel
• NAMD 1:
  – Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
• NAMD 2:
  – M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ...


Middle layers

Applications

Parallel Machines

“Middle Layers”: Languages, Tools, Libraries


Molecular Dynamics

• Collection of [charged] atoms, with bonds
• Newtonian mechanics
• At each time-step:
  – Calculate forces on each atom
    • bonds
    • non-bonded: electrostatic and van der Waals
  – Calculate velocities and advance positions
• 1 femtosecond time-step, millions needed!
• Thousands of atoms (1,000 - 100,000)
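The per-time-step loop above (forces, then velocities, then positions) can be sketched as follows. This is a minimal illustration, not NAMD's integrator: the slides do not name an integration scheme, so velocity Verlet is assumed here, and the two-atom, one-dimensional harmonic bond, the function names, and the arbitrary units are all hypothetical.

```python
def harmonic_bond_forces(pos, k=1.0, rest_length=1.0):
    """Forces on two 1-D atoms joined by a harmonic bond (toy force field)."""
    stretch = (pos[1] - pos[0]) - rest_length
    # Equal and opposite forces pull the atoms back toward the rest length.
    return [k * stretch, -k * stretch]


def step(pos, vel, forces, masses, dt, compute_forces):
    """One velocity-Verlet step; returns the new (pos, vel, forces)."""
    # Half-kick the velocities with the current forces, then advance positions.
    half_vel = [v + 0.5 * dt * f / m for v, f, m in zip(vel, forces, masses)]
    new_pos = [x + dt * v for x, v in zip(pos, half_vel)]
    # Recompute forces at the new positions and finish the velocity update.
    new_forces = compute_forces(new_pos)
    new_vel = [v + 0.5 * dt * f / m
               for v, f, m in zip(half_vel, new_forces, masses)]
    return new_pos, new_vel, new_forces


def simulate(num_steps, dt=0.01):
    """Run the toy two-atom system for num_steps time-steps."""
    pos, vel, masses = [0.0, 1.5], [0.0, 0.0], [1.0, 1.0]
    forces = harmonic_bond_forces(pos)
    for _ in range(num_steps):
        pos, vel, forces = step(pos, vel, forces, masses, dt,
                                harmonic_bond_forces)
    return pos, vel
```

In a real simulation the force routine dominates the cost of each step, which is why the rest of the talk concentrates on how to distribute it.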


Cut-off radius

• Use of cut-off radius to reduce work
  – 8 - 14 Å
  – Faraway charges ignored!
• 80-95% of the work is non-bonded force computations
• Some simulations need faraway contributions
  – Periodic systems: Ewald, Particle-Mesh Ewald
  – Aperiodic systems: FMA
• Even so, cut-off based computations are important:
  – near-atom calculations are part of the above
  – multiple time-stepping is used: k cut-off steps, 1 PME/FMA
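A small sketch of the two ideas on this slide: selecting only the atom pairs inside the cut-off radius, and scheduling one long-range (PME/FMA) evaluation per k cut-off steps. The function names and the brute-force pair search are illustrative assumptions; a production code would avoid testing all pairs.

```python
import math


def pairs_within_cutoff(positions, cutoff):
    """Return index pairs (i, j), i < j, whose distance is at most cutoff."""
    pairs = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


def long_range_steps(num_steps, k):
    """Steps on which the long-range contribution is recomputed.

    Every step does the cut-off computation; only every k-th step
    (starting at step 0, by assumption) also does the PME/FMA part.
    """
    return [s for s in range(num_steps) if s % k == 0]
```

With a 12 Å cut-off, atoms at x = 0, 5, and 20 Å on a line yield only the pair (0, 1); the other two pairs are 15 Å and 20 Å apart and are ignored.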


Scalability

• The program should scale up to use a large number of processors.
  – But what does that mean?
• An individual simulation isn’t truly scalable
• Better definition of scalability:
  – If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size


Isoefficiency

• Quantify scalability
  – (Work of Vipin Kumar, U. Minnesota)
• How much increase in problem size is needed to retain the same efficiency on a larger machine?
• Efficiency: Seq. Time / (P · Parallel Time)
  – parallel time = computation + communication + idle
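The efficiency definition on this slide translates directly into a few lines of code. The function name and the sample timings are made up for illustration.

```python
def parallel_efficiency(seq_time, procs, computation, communication, idle):
    """Efficiency = sequential time / (P * parallel time).

    Parallel time is the per-processor sum of computation,
    communication, and idle time, as defined on the slide.
    """
    parallel_time = computation + communication + idle
    return seq_time / (procs * parallel_time)
```

For example, a run that takes 100 time units sequentially and, on 10 processors, spends 10 units computing, 2 communicating, and 0.5 idle per processor has efficiency 100 / (10 · 12.5) = 0.8. Isoefficiency asks how fast the problem size must grow so that this number stays fixed as P grows.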


Traditional Approaches

• Replicated Data:
  – All atom coordinates stored on each processor
  – Non-bonded forces distributed evenly
  – Analysis: Assume N atoms, P processors
    • Computation: O(N/P)
    • Communication: O(N log P)
    • Communication/Computation ratio: P log P
    • Fraction of communication increases with number of processors, independent of problem size!
  – So, not scalable by this definition


Atom decomposition

• Partition the atoms array across processors
  – Nearby atoms may not be on the same processor
  – Communication: O(N) per processor
  – Communication/Computation: O(N)/(N/P) = O(P)
  – Again, not scalable by our definition


Force Decomposition

• Distribute force matrix to processors
  – Matrix is sparse, non uniform
  – Each processor has one block
  – Communication: O(N/√P)
  – Ratio: O(√P)
• Better scalability in practice
  – (can use 100+ processors)
  – Plimpton:
  – Hwang, Saltz, et al.:
    • 6% on 32 PEs, 36% on 128 processors
  – Yet not scalable in the sense defined here!
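The communication/computation ratios of the three schemes discussed so far can be compared numerically. The sketch below drops all constant factors, takes computation as N/P throughout, and assumes a per-processor communication cost of N/√P for force decomposition (the figure usually quoted for this scheme). The scheme labels and function name are hypothetical.

```python
import math


def comm_to_comp_ratio(scheme, n_atoms, procs):
    """Communication/computation ratio, ignoring constant factors."""
    computation = n_atoms / procs
    communication = {
        "replicated": n_atoms * math.log2(procs),   # O(N log P)
        "atom": n_atoms,                            # O(N)
        "force": n_atoms / math.sqrt(procs),        # O(N / sqrt(P)), assumed
    }[scheme]
    return communication / computation
```

On 16 processors the ratios come out as 64, 16, and 4 respectively, and none of them changes when N changes. That independence from N is exactly why none of the three is scalable in the isoefficiency sense: a larger problem cannot buy back the lost efficiency.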


Spatial Decomposition

• Allocate close-by atoms to the same processor
• Three variations possible:
  – Partitioning into P boxes, 1 per processor
    • Good scalability, but hard to implement
  – Partitioning into fixed size boxes, each a little larger than the cutoff distance
  – Partitioning into smaller boxes
• Communication: O(N/P)
  – so, scalable in principle
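The second variation (fixed-size boxes a little larger than the cut-off) amounts to binning atoms on a regular grid. The sketch below assumes cubic boxes and a hypothetical `margin` parameter for the "little larger" part; neither value comes from the slides.

```python
import math
from collections import defaultdict


def assign_to_patches(positions, cutoff, margin=0.5):
    """Group atom indices by the grid box (patch) that contains them.

    Each box has side cutoff + margin, so every atom within the cut-off
    of a given atom lies in the same box or one of its 26 neighbors.
    """
    patch_size = cutoff + margin
    patches = defaultdict(list)
    for index, (x, y, z) in enumerate(positions):
        key = (math.floor(x / patch_size),
               math.floor(y / patch_size),
               math.floor(z / patch_size))
        patches[key].append(index)
    return dict(patches)
```

Because interactions never reach beyond neighboring boxes, each box exchanges data with a bounded number of neighbors, which is where the O(N/P) communication estimate comes from.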


Spatial Decomposition in NAMD

• NAMD 1 used spatial decomposition
• Good theoretical isoefficiency, but for a fixed size system, load balancing problems
• For midsize systems, got good speedups up to 16 processors...
• Use the symmetry of Newton’s 3rd law to facilitate load balancing


Spatial Decomposition

But the load balancing problems are still severe:


FD + SD

• Now, we have many more objects to load balance:
  – Each diamond can be assigned to any processor
  – Number of diamonds (3D): 14 · Number of Patches
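One plausible reading of the factor 14 is: each patch has 26 neighbors in 3D, Newton's third law lets each neighbor pair be handled once rather than twice (13 pair objects per patch), plus one object for interactions within the patch itself. The sketch below enumerates such a set of offsets; the lexicographic tie-break rule is an assumption made for illustration, not something stated in the slides.

```python
from itertools import product


def owned_offsets():
    """Neighbor offsets whose pairwise force object a patch 'owns'.

    (0, 0, 0) is the self-interaction. Of each opposite pair of offsets
    d and -d, only the lexicographically positive one is kept, so every
    pair of adjacent patches gets exactly one force object.
    """
    offsets = [(0, 0, 0)]
    for d in product((-1, 0, 1), repeat=3):
        if d > (0, 0, 0):
            offsets.append(d)
    return offsets
```

`len(owned_offsets())` is 14, matching the count on the slide.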


Bond Forces

• Multiple types of forces:
  – Bonds (2), Angles (3), Dihedrals (4), ...
  – Luckily, each involves atoms in neighboring patches only
• Straightforward implementation:
  – Send message to all neighbors,
  – receive forces from them
  – 26*2 messages per patch!


Bonded Forces

• Assume one patch per processor:
  – an angle force involving atoms in patches (x1,y1,z1), (x2,y2,z2), (x3,y3,z3)
  – is calculated in patch (max{xi}, max{yi}, max{zi})

[Diagram: patches labeled A, B, C]
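The ownership rule for bonded terms is a coordinate-wise maximum over the patch indices of the atoms involved. A minimal sketch, with a hypothetical function name:

```python
def home_patch(patch_coords):
    """Patch that computes a bonded term, given its atoms' patch indices.

    patch_coords holds one (x, y, z) patch index per atom in the term
    (2 for a bond, 3 for an angle, 4 for a dihedral).
    """
    xs, ys, zs = zip(*patch_coords)
    return (max(xs), max(ys), max(zs))
```

For an angle whose atoms sit in patches (1, 2, 3), (2, 2, 2), and (1, 3, 3), the term is computed in patch (2, 3, 3). Note that the chosen patch need not contain any of the atoms itself. Since every patch applies the same deterministic rule, each term has exactly one owner and no negotiation between patches is needed.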


Implementation

• Multiple objects per processor
  – Different types: patches, pairwise forces, bonded forces
  – Each may have its data ready at different times
  – Need ability to map and remap them
  – Need prioritized scheduling
• Charm++ supports all of these


Charm++

• Parallel C++ with Data Driven Objects
• Object Groups:
  – global object with a “representative” on each PE
• Asynchronous method invocation
• Prioritized scheduling
• Mature, robust, portable
• http://charm.cs.uiuc.edu


Data driven execution

[Diagram: two Scheduler boxes, each with its own Message Q]
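The scheduler-plus-message-queue picture can be illustrated with a toy, single-process model: objects never call each other directly; they enqueue prioritized messages, and a scheduler loop picks the next message and invokes the named method on the target object. This is only a conceptual sketch in Python, not the Charm++ API; the `Scheduler` and `Patch` classes and the "smaller number means higher priority" convention are assumptions.

```python
import heapq
import itertools


class Scheduler:
    """Toy per-processor scheduler with a prioritized message queue."""

    def __init__(self):
        self._queue = []
        self._sequence = itertools.count()  # tie-breaker keeps FIFO order

    def send(self, priority, target, method, *args):
        """Asynchronous method invocation: enqueue and return immediately."""
        entry = (priority, next(self._sequence), target, method, args)
        heapq.heappush(self._queue, entry)

    def run(self):
        """Deliver messages in priority order until the queue is empty."""
        while self._queue:
            _, _, target, method, args = heapq.heappop(self._queue)
            getattr(target, method)(*args)


class Patch:
    """Hypothetical data-driven object that records what it receives."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def receive_forces(self, source):
        self.log.append((self.name, source))
```

Usage: after `s.send(5, p, "receive_forces", "bonded")` followed by `s.send(1, p, "receive_forces", "nonbonded")`, calling `s.run()` delivers the non-bonded message first because it carries the smaller priority value. Execution order is driven by which data is available and how urgent it is, not by program order.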


Load Balancing

• Is a major challenge for this application
  – especially for a large number of processors
• Unpredictable workloads
  – Each diamond (force object) and patch encapsulates a variable amount of work
  – Static estimates are inaccurate
• Measurement based Load Balancing Framework
  – Robert Brunner’s recent Ph.D. thesis
  – Very slow variations across timesteps


Bipartite graph balancing

• Background load:
  – Patches (integration, ..) and bond-related forces
• Migratable load:
  – Non-bonded forces
  – bond-related forces involving atoms of the same patch
• Bipartite communication graph
  – between migratable and non-migratable objects
• Challenge:
  – Balance load while minimizing communication


Load balancing

• Collect timing data for several cycles
• Run heuristic load balancer
  – Several alternative ones
• Re-map and migrate objects accordingly
  – Registration mechanisms facilitate migration
• Needs a separate talk!
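As a rough illustration of the "measure, then re-map" cycle, the sketch below implements one simple heuristic: place the most expensive migratable objects first, each on the currently least-loaded processor, starting from the measured background (non-migratable) load. The slides do not say which heuristics NAMD uses, so this greedy rule is a stand-in, and it deliberately ignores the communication-minimizing half of the challenge. All names are hypothetical.

```python
import heapq


def greedy_balance(background, migratable):
    """Assign migratable objects to processors by measured cost.

    background: list of fixed per-processor loads (non-migratable work).
    migratable: dict mapping object name -> measured time.
    Returns (assignment, loads): object name -> processor index, and the
    resulting per-processor total loads.
    """
    heap = [(load, proc) for proc, load in enumerate(background)]
    heapq.heapify(heap)
    assignment = {}
    # Largest objects first; ties broken by name for determinism.
    for name, cost in sorted(migratable.items(),
                             key=lambda item: (-item[1], item[0])):
        load, proc = heapq.heappop(heap)      # least-loaded processor
        assignment[name] = proc
        heapq.heappush(heap, (load + cost, proc))
    loads = [0.0] * len(background)
    for load, proc in heap:
        loads[proc] = load
    return assignment, loads
```

With background loads `[3.0, 1.0]` and migratable objects `{"a": 4.0, "b": 2.0, "c": 1.0}`, object a goes to processor 1 and b and c go to processor 0, giving final loads of 6.0 and 5.0. Because the slides note that loads vary slowly across timesteps, timings measured over a few cycles remain a usable predictor for the cycles that follow.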


[Figure: two charts of Time (axes up to 5,000,000 and 4,500,000; units not given) against Processors, each split into migratable work and non-migratable work]


Performance: size of system

System                 Procs         1      2      4      8     16     32     64    128    160
bR (3,762 atoms)       Time       1.14   0.58  0.315  0.158  0.086  0.048
                       Speedup     1.0   1.97   3.61   7.20   13.2   23.7
ER-ERE (36,573 atoms)  Time                6.115  3.099  1.598  0.810  0.397  0.212  0.123  0.098
                       Speedup         (1.97)   3.89   7.54   14.9   30.3   56.8   97.9    123
ApoA-I (92,224 atoms)  Time                      10.76   5.46   2.85   1.47  0.729  0.382  0.321
                       Speedup                (3.88)   7.64   14.7   28.4   57.3    109    130

Performance data on Cray T3E
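The table can be read against the scalability definition given earlier by converting speedup to parallel efficiency (speedup divided by processor count). The sketch below takes the last reported point for each system; the dictionary layout and names are illustrative, and the speedups for ER-ERE and ApoA-I are relative to the parenthesized baseline values in the table rather than to a measured one-processor run.

```python
# (processors, speedup) at the largest processor count reported per system.
T3E_LARGEST_RUNS = {
    "bR (3,762 atoms)": (32, 23.7),
    "ER-ERE (36,573 atoms)": (160, 123.0),
    "ApoA-I (92,224 atoms)": (160, 130.0),
}


def efficiency_from_speedup(speedup, procs):
    """Parallel efficiency implied by a speedup on procs processors."""
    return speedup / procs


def efficiencies(runs):
    """Map each system name to its efficiency at the listed point."""
    return {name: efficiency_from_speedup(speedup, procs)
            for name, (procs, speedup) in runs.items()}
```

The small bR system is at about 0.74 efficiency on 32 processors, while the larger ER-ERE and ApoA-I systems are at about 0.77 and 0.81 on 160 processors: larger problems sustain efficiency on more processors, which is the behavior the isoefficiency definition asks for.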


Performance: various machines

Machine          Procs         1      2      4      8     16     32     64    128    160    192
T3E              Time                 6.12   3.10   1.60  0.810  0.397  0.212  0.123  0.098
                 Speedup         (1.97)   3.89   7.54   14.9   30.3   56.8   97.9    123
Origin 2000      Time       8.28   4.20   2.17   1.07  0.542  0.271  0.152
                 Speedup     1.0   1.96   3.80   7.74   15.3   30.5   54.3
ASCI-Red         Time       28.0   13.9   7.24   3.76   1.91   1.01  0.500  0.279  0.227  0.196
                 Speedup     1.0   2.01   3.87   7.45   14.7   27.9   56.0    100    123    143
NOWs HP735/125   Time       24.1   12.4   6.39   3.69
                 Speedup     1.0   1.94   3.77   6.54


Speedup

[Figure: Speedup against Processors (both axes 0 to 240), showing measured Speedup and Perfect Speedup]


Recent Speedup Results: ASCI Red

[Figure: Speedup on ASCI Red: Apo-A1. Speedup (0 to 700) against Processors (0 to 1200)]


Recent Results on Linux Cluster

[Figure: Speedup on Linux Cluster. Speedup (0 to 80) against Processors (0 to 120)]


Recent Results on Origin 2000

[Figure: Performance on Origin 2000. Speedup (0 to 90) against Processors (0 to 120)]


Multi-paradigm programming

• Long-range electrostatic interactions
  – Some simulations require this
  – Contributions of faraway atoms can be calculated infrequently
  – PVM based library, DPMTA
    • developed at Duke by John Board et al.
• Patch life cycle
• Better expressed as a thread


Converse

• Supports multi-paradigm programming
• Provides portability
• Makes it easy to implement RTS for new paradigms
• Several languages/libraries:
  – Charm++, threaded MPI, PVM, Java, md-perl, pc++, Nexus, Path, Cid, CC++, DP, Agents, ...


Namd2 with Converse


NAMD2

• In production use
  – Internally for about a year
  – Several simulations completed/published
• Fastest MD program? We think so
• Modifiable/extensible
  – Steered MD
  – Free energy calculations


Real Application for CS research?

• Benefits
  – Subtle and complex research problems uncovered only with real application
  – Satisfaction of “real” concrete contribution
  – With careful planning, you can truly enrich the “middle layers”
  – Bring back a rich variety of relevant CS problems
  – Apply to other domains: Rockets? Casting?