
Page 1:

Disorder-Induced Length Scales and Faster Scalable Algorithms

Vincent Sacksteder IV

Asia Pacific Center for Theoretical Physics
Previous career at Microsoft
Ph.D. at La Sapienza in Rome, 2005
One year at the Ateneo de Manila University in Manila, Philippines
One year at the S.N. Bose National Centre in Kolkata, India

Page 2:

Outline of This Talk

• Scalability Challenge and its Answer from Locality Physics
• Matrix Functions
• Linear Scaling and Nearsightedness
• Numerical Evidence for Nearsightedness and Linear Scaling
• Scalable Algorithms vs. Scalable Code; a proposal for a linear scaling library
• Future Plans

Page 3:

The Challenge

This year a supercomputer is planned with about a million processors. Last year's record was about a quarter million.

Many desktops already have 2-4 processors.

How do we scale science and engineering codes on clusters and supercomputers? The key is dividing the problem into small chunks that each processor can handle “almost” independently, with communication between processors carefully managed and minimized.

Page 4:

Parallelization Via Locality

• Divide space into blocks.
• Physics within the blocks (local interactions) does not require communication between processors.
• Nearly local (next-neighbor, etc.) physics creates a need for boundary matching between processors. Usually still scalable, with some effort.
• Nonlocal physics = failure of the spatial parallelization strategy. A scaling law or mean field behavior may still work.

Locality Physics Is Key!

Page 5:

Locality and Math

• How to represent the idea of locality mathematically?
• We are talking about the relationship between two points x and y. Therefore we want a function of two coordinates: a matrix or operator A(x,y).
• Statements of locality or scaling translate to mathematical requirements on the dependence of A(x,y) on |x-y|.
• Matrices are the natural way of talking about locality (see the numerical sketch below).

Page 6:

Success Story: Linear Equations

• Used as a building block to solve partial differential equations.
• The differential operators (gradient, Laplacian, etc.) often have nearly local matrix representations.
• If the matrix A in Ax = b is nearly local, then well-developed scalable algorithms are available: Krylov-space algorithms, preconditioners, domain decomposition, etc. (a small sketch follows this list).
• If the interaction scales with distance (a la renormalization), then scalability may still be possible: multigrid, domain decomposition, etc.
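A minimal sketch of this success story, assuming a 1D Laplacian as the nearly local matrix: a Krylov method (conjugate gradients, via scipy) solves Ax = b using only sparse matrix-vector products.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

# Minimal sketch (assumed example): a discretized 1D Laplacian is a nearly
# local (tridiagonal) matrix, so a Krylov method such as conjugate
# gradients solves Ax = b touching only the nonzero entries.

n = 1000
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x, info = cg(A, b)                       # info == 0 means convergence
print(info, np.linalg.norm(A @ x - b))   # residual check
```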

Page 7:

Problem Story: Matrix Functions

Matrix function, noun: A mapping from a matrix argument H to another matrix f(H).

A large fraction of the world's computer time is spent on computing matrix functions.

Will give four examples.

Page 8:

Examples of Matrix Functions

• The matrix step function = the electronic density matrix.
• Lattice QCD calculations (a grand-challenge problem) calculate properties of neutrons, protons, etc. The quarks come in computationally as a matrix logarithm, matrix inverse, matrix sign function, etc. These matrix functions are the computational bottleneck nowadays.
• Green's functions describe responses to external forces, currents, etc. Mathematically they are just the inverse of a matrix, 1 / (E - H), where E has an imaginary part (see the toy example below).
• The matrix exponential: quantum mechanics, control theory, engineering.
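A toy illustration of the Green's function example, with all parameters assumed for demonstration:

```python
import numpy as np

# Minimal sketch (assumed example): a Green's function is the matrix
# inverse G = 1 / (E - H), with a small imaginary part added to E.

n = 100
H = -np.eye(n, k=1) - np.eye(n, k=-1)   # toy Hamiltonian
E = 0.5 + 0.05j                         # energy with an imaginary part
G = np.linalg.inv(E * np.eye(n) - H)

# The imaginary part of the diagonal gives the local density of states.
print(-G.diagonal().imag[:5] / np.pi)
```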

Page 9:

Matrix Functions Resist Parallelization

• Example: the diagonalization (Schur-Parlett) algorithm. Diagonalize the argument, evaluate the matrix function in the diagonalized basis, and transform back to the original basis (sketched below).
• Many other algorithms exist. All general-purpose algorithms run into similar difficulties.
• Computer time scales as O(N^3).
• The scaling problem encountered with density matrix calculations is the same as with every other matrix function.
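A minimal sketch of the diagonalization approach just described, with the step function (giving the density matrix) as an assumed example of f; the O(N^3) eigendecomposition is the bottleneck.

```python
import numpy as np

# Minimal sketch of the diagonalization algorithm for f(H): diagonalize,
# apply f to the eigenvalues, transform back. The Hamiltonian, step
# function, and chemical potential below are illustrative choices.

def matrix_function(H, f):
    w, U = np.linalg.eigh(H)          # O(N^3) diagonalization
    return (U * f(w)) @ U.conj().T    # f applied on the spectrum

H = np.random.default_rng(0).normal(size=(50, 50))
H = (H + H.T) / 2                     # Hermitian argument

mu = 0.0
density_matrix = matrix_function(H, lambda w: (w < mu).astype(float))
print(np.trace(density_matrix))       # number of occupied states
```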

Page 10:

Is this the Last Word?

• The general-purpose method for calculating something may be nonlocal; for instance, the mathematics of electronic structure.
• However, there may still be an emergent phenomenon of locality arising from the details of the problem at hand.
• For instance, it is well known that chemical properties are well described (at first order) by chemical bonds, i.e. nearest-neighbor interactions. So in most chemical problems locality emerges from a manifestly nonlocal problem.
• This depends on both the system and the observable.

Page 11:

What is Known about Emergent Locality = “Nearsightedness”?

• The experimental phenomenology of chemical bonding.
• A few proofs and many arguments for nearsightedness if you are outside the energy band, have an insulator, or have nonzero temperature. Limited knowledge of the resulting length scales, etc. Not general enough to account for the chemical bonding phenomenology.
• Disorder creates a scattering length: the disorder-averaged Green's function dies off exponentially (randomized phase). But the average of its square is another matter (i.e. amplitude vs. random phase), and this calculation is very hard to handle mathematically.
• Anderson localization at large disorder and in band tails.
• In general the math is extremely challenging, seemingly intractable. Simulations are the way forward.

Page 12:

Evidence for nearsightedness: the success of linear scaling algorithms

• Little attention to precise numerical tests; more attention to whether the correct chemical structure is reproduced.
• Satisfactory performance for many systems of interest, including insulators and finite temperatures.
• There are also reports of good accuracy for ordered metals.
• So far almost all research has focused on applications to molecular dynamics and condensed matter. Almost no calculations of matrix functions other than the density matrix (sometimes the Green's function as an intermediate step).
• No general-purpose library is available for linear scaling calculation of matrix functions.

Page 13:

Evidence for Nearsightedness: My Simulations of Disordered Systems

• A simple model of a disordered Hamiltonian without interactions (the Anderson model); see the sketch below.
• Well below the Anderson transition (at sigma = 6.15) and inside the band, so almost all the eigenfunctions are delocalized: the metallic phase.
• I regularized all non-analytic functions.
• A cube with linear length L = 16.
• Results released in my thesis in late 2004.
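A sketch of how such a Hamiltonian can be assembled. The Gaussian form of the disorder distribution and the smaller cube size are assumptions made here for illustration; the talk used L = 16.

```python
import numpy as np

# Minimal sketch (details assumed) of the 3D Anderson model: a
# tight-binding Hamiltonian on an L x L x L cube with random on-site
# energies of strength sigma and periodic boundary conditions.

rng = np.random.default_rng(0)
L = 8                                 # the talk used L = 16
N = L**3

def site(x, y, z):
    return (x * L + y) * L + z

H = np.diag(rng.normal(0.0, 1.15, size=N))   # diagonal disorder, sigma = 1.15
for x in range(L):
    for y in range(L):
        for z in range(L):
            i = site(x, y, z)
            for j in (site((x + 1) % L, y, z),    # hopping to the three
                      site(x, (y + 1) % L, z),    # "forward" neighbors,
                      site(x, y, (z + 1) % L)):   # periodic boundaries
                H[i, j] = H[j, i] = -1.0
```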

Page 14:

Matrix Function Nearsightedness: Small Disorder = 1.15

Page 15:

Matrix Function Nearsightedness: Intermediate Disorder = 4.65

Page 16:

Fitting the Matrix Function Decay

• Exponential decay works well for all functions except the exponential.

• The same value of the parameter alpha is shared by all functions except the exponential, meaning that they are all affected by the same coherence length.

• Other parameters vary; the matrix functions differ from each other at small disorders and close to the diagonal.

• The fits break down at very small disorders.

[Fit formula garbled in the transcript: an exponential fit in the separation r with the shared decay parameter alpha, additional fit parameters a and b, an erf factor, and the cube size L. An illustrative fit of this kind is sketched below.]
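Since the exact fit form is garbled above, here is an illustrative version of such a fit, assuming a plain exponential c * exp(-alpha * r) and a toy disordered matrix function:

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative sketch only: fit an exponential decay to the mean
# off-diagonal magnitude of a sample matrix function of a disordered H,
# in the spirit of the shared decay parameter alpha described above.

def decay(r, c, alpha):
    return c * np.exp(-alpha * r)

n = 400
rng = np.random.default_rng(1)
H = -np.eye(n, k=1) - np.eye(n, k=-1) + np.diag(rng.normal(0.0, 1.15, n))
F = np.linalg.inv((0.5 + 0.2j) * np.eye(n) - H)   # a sample matrix function

r = np.arange(1, 40)
m = np.array([np.abs(np.diagonal(F, offset=d)).mean() for d in r])
(c, alpha), _ = curve_fit(decay, r, m, p0=(1.0, 0.5))
print("fitted decay constant alpha =", alpha)
```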

Page 17:

Remarks about the Results

• The logarithm, density matrix, Cauchy distribution, and Gaussian distribution are nearsighted even in the metallic regime.
• The exponential is even more nearsighted.
• The real part of the inverse is only very weakly nearsighted. It should be nearsighted if the argument has an imaginary part; here I studied real arguments.
• I also studied the accuracy of linear scaling algorithms for these matrix functions; their accuracy is controlled by the exponential nearsightedness.
• Linear scaling can be applied to many problems!

Page 18:

Million Processor Challenge

• Can we really use a million processors?
• In actual parallelized codes one fights hard for every incremental increase in scaling: difficult debugging, small and large rewrites, and large amounts of time from very skilled programmers.
• SIESTA, CONQUEST, ONETEP, and OpenMX have reported results with between thirty and a few hundred processors.

Page 19:

What are the Challenges for Scaling?

• Find a scalable algorithm. Apply physical insight: preconditioning, multigrid, domain decomposition, operator splitting. Structured vs. unstructured grids.
• Decide how to partition the data across the processors.
• Distribute data according to the partitioning scheme.
• Index the data; often several indexing schemes will be used at the same time. Cache coherency, efficient packing, etc.

Page 20:

What are the Challenges for Scaling?

• Communications between nodes:
  - Figure out which data needs to be transferred, where from, and where to.
  - Efficient and scalable data transfer.
  - Merge the external data with the local copy; ghost points.
• Debug, profile, monitor, and control the parallel code.
• The challenges are the same for linear equations and matrix functions.

Bottom Line: It is not enough to have a good algorithm!!!
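To make the ghost-point item above concrete, here is a minimal mpi4py sketch (an assumed toy setup, not code from the talk) of exchanging ghost cells between neighboring ranks before a local stencil update:

```python
import numpy as np
from mpi4py import MPI

# Minimal sketch of ghost-point exchange: each rank owns a chunk of a 1D
# array plus one ghost cell on each side, filled from its neighbors
# (periodic ring of ranks) before a local stencil update.

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

n_local = 100
u = np.zeros(n_local + 2)    # interior cells plus two ghost cells
u[1:-1] = rank               # some local data

# Send my edge values to the neighbors; receive their edges into my ghosts.
comm.Sendrecv(sendbuf=u[1:2], dest=left, recvbuf=u[-1:], source=right)
comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[:1], source=left)

# A local stencil (here a 1D Laplacian) now needs no further communication.
lap = u[:-2] - 2 * u[1:-1] + u[2:]
```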

Page 21:

Libraries are key to scaling

• Separate the matrix function evaluation from the rest of the program, solve its technical challenges separately, and optimize it separately. A loosely coupled architecture.
• The CONQUEST authors attribute their scaling success to having isolated and optimized a small piece of code responsible for part of the matrix multiplications. They paid a lot of attention to “organization of atoms and grid points into small compact sets,” and to indexing schemes, loop precedence, sub-block structure, cache hits, and cross-processor communications. (It is actually a kernel, not a library.)
• The same library-based approach has proved necessary for linear equation solvers on supercomputers.

Page 22:

General Purpose Algorithms

• Many algorithms for matrix functions rely only on matrix arithmetic.
  - Example: penalty-functional optimization schemes like Li-Nunez-Vanderbilt.
  - Example: purification schemes, including McWeeny purification (sketched after this list).
  - Example: Chebyshev approximation.
• Requirements for a linear scaling matrix library: it is enough to have a good linear scaling implementation of matrix addition, subtraction, and multiplication (maybe division), a good optimizer, and a good linear solver supporting the optimizer.
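A minimal sketch of McWeeny purification, one of the matrix-arithmetic-only schemes named above. The initialization and parameters are illustrative assumptions, and dense numpy arithmetic stands in for the sparse, nearsighted operations such a library would supply.

```python
import numpy as np

# Minimal sketch of McWeeny purification: iterating rho -> 3 rho^2 - 2 rho^3
# drives the eigenvalues of a trial density matrix to 0 or 1, using only
# the matrix additions and multiplications a linear scaling library provides.

def mcweeny(H, mu, steps=30):
    # Map the spectrum into [0, 1] with mu -> 1/2 to initialize the
    # iteration. (A real linear scaling code would use cheap spectral
    # bound estimates, e.g. Gershgorin, instead of diagonalizing.)
    w = np.linalg.eigvalsh(H)
    alpha = 0.5 / max(w[-1] - mu, mu - w[0])
    rho = 0.5 * np.eye(H.shape[0]) - alpha * (H - mu * np.eye(H.shape[0]))
    for _ in range(steps):
        rho2 = rho @ rho
        rho = 3 * rho2 - 2 * rho2 @ rho
    return rho

H = np.diag(np.linspace(-2.0, 2.0, 8))   # toy Hamiltonian
rho = mcweeny(H, mu=0.0)
print(np.round(np.diagonal(rho)))        # occupations snap to 1 or 0
```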

Page 23:

Reusing libraries for linear equations

• The number of non-zero elements in a nearsighted matrix is proportional to the basis size, like a vector.
• Many linear scaling algorithms can be thought of as solving for a vector.
• Example: solving for the inverse matrix. You have to solve H G = 1. This is a set of linear equations, with the inverse G as the unknown. Thus all the Krylov-space algorithms which have been developed to solve Ax = b can be applied to calculating the inverse (see the sketch after this list).
• In general, you apply the standard algorithms for minimization, but you solve for a matrix rather than a vector. Example: conjugate gradients applied to Li-Nunez-Vanderbilt.
• All the standard algorithms for solving linear or nonlinear vector equations can be applied to calculating nearsighted matrices.
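A minimal sketch of the column-by-column idea, with a toy matrix assumed; the nearsighted truncation of each column is omitted here.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

# Minimal sketch: obtain the inverse G from H G = 1 column by column,
# each column being an ordinary Ax = b solve for a Krylov method.

n = 100
H = diags([-1.0, 3.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")

G = np.zeros((n, n))
for j in range(n):
    e_j = np.zeros(n)
    e_j[j] = 1.0                    # j-th column of the identity
    G[:, j], info = cg(H, e_j)      # solve H g_j = e_j
    assert info == 0

print(np.abs(H @ G - np.eye(n)).max())   # residual check
```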

Page 24:

What is unique to the matrix library, not in a vector library?

• Code that stores nearsighted matrices in a vector format (a toy sketch follows this list).
• Index schemes for matrix elements.
• Code for adding, subtracting, and multiplying nearsighted matrices (stored as vectors). It must take advantage of the block structure, caches, indexing, physical distance, etc.
• Managing nearsighted data transfer on both structured and unstructured grids.
• Implementations of various linear scaling algorithms.
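A toy sketch of such storage; the class name, the dict-of-blocks layout, and the block-banded cutoff scheme are all assumptions made for illustration.

```python
import numpy as np

# Toy sketch of storing a nearsighted matrix "in a vector format": only
# blocks within a cutoff distance of the diagonal are kept, so storage
# grows linearly with the number of blocks instead of quadratically.

class NearsightedMatrix:
    def __init__(self, n_blocks, block_size, cutoff):
        self.n, self.b, self.cutoff = n_blocks, block_size, cutoff
        self.blocks = {}   # (I, J) -> dense b x b block, |I - J| <= cutoff

    def set_block(self, I, J, block):
        if abs(I - J) <= self.cutoff:
            self.blocks[(I, J)] = block   # blocks beyond the cutoff dropped

    def matmul(self, other):
        # Block-by-block product; only near-diagonal blocks are visited.
        out = NearsightedMatrix(self.n, self.b, self.cutoff)
        for (I, K), A in self.blocks.items():
            for J in range(max(0, K - other.cutoff),
                           min(other.n, K + other.cutoff + 1)):
                B = other.blocks.get((K, J))
                if B is not None and abs(I - J) <= out.cutoff:
                    out.blocks[(I, J)] = out.blocks.get((I, J), 0) + A @ B
        return out
```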

Page 25:

My Proposal

Take a library which is already great at solving linear and nonlinear equations on the biggest computers, and extend it to matrix functions.

Page 26:

One Possibility: PETSc

• Developed by Argonne National Laboratory, which is run by the U.S. government.
• Available for free reuse, modification, and redistribution.
• Applications built on PETSc have successfully scaled to 6000+ processors and to systems of equations (linear and nonlinear) with 500 million unknowns, and have run at up to two teraflops.
• Highly portable, and designed to be extensible by third parties.
• Wraps MPI, and takes care of many of the details of scaling for the user (see the sketch below).
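A minimal sketch of the PETSc programming model, using the petsc4py Python bindings (an illustration, not code from the talk): PETSc distributes the matrix rows over MPI ranks, and a Krylov solver is configured and run through a uniform interface.

```python
from petsc4py import PETSc

# Minimal petsc4py sketch: assemble a distributed tridiagonal matrix and
# solve Ax = b with conjugate gradients. Run with e.g.
#   mpiexec -n 4 python petsc_demo.py   (filename hypothetical)

n = 100
A = PETSc.Mat().createAIJ([n, n])
A.setPreallocationNNZ(3)                 # at most 3 nonzeros per row
rstart, rend = A.getOwnershipRange()     # rows owned by this rank
for i in range(rstart, rend):
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()

x, b = A.createVecs()
b.set(1.0)

ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setType("cg")
ksp.getPC().setType("jacobi")
ksp.setFromOptions()          # runtime -ksp_* options still apply
ksp.solve(b, x)
```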

Page 27:

The Payoff for Matrix Functions

• Much easier to write scalable scientific codes.
• More scalability than we would get otherwise, since the effort is focused on one library instead of on many different codes.
• Portability and adaptability. Higher reliability.
• Million-processor DFT codes.
• Applications to other matrix functions, such as those in QCD.
• Sets the stage for developing the next generation of matrix algorithms: like today's linear scaling ones, but incorporating the ideas of multigrid and domain decomposition. Multi-scale, O(N ln N).

Page 28:

Future Plans

• Repeat the nearsightedness study:
  - with a scaling analysis to measure nearsightedness in infinite systems;
  - with a wider variety of matrix functions;
  - with improved control of errors.
  The coding is already half done.
• Look at applications of nearsightedness to electron correlations, mixed-mode calculations, lattice QCD, etc.
• Examine successful algorithms for solving linear equations in the light of physics and nearsightedness.

Page 29:

Thank You!

Vincent Sacksteder

[email protected], yahoo, and google IDs available

Page 30:

PETSc

“Developing parallel, nontrivial PDE solvers that deliver high performance is still difficult and requires months (or even years) of concentrated effort. PETSc is a tool that can ease these difficulties and reduce the development time, but it is not a black-box PDE solver, nor a silver bullet.”

Page 31:

How does PETSc help solve linear and nonlinear equations?

• Find a scalable algorithm – many already implemented.
• Preconditioning; multigrid; domain decomposition – many already implemented.
• Structured grids – already implemented.
• Unstructured grids – strong aids.
• Decide how to partition the data across the processors – already implemented.
• Distribute data according to the partitioning scheme – already implemented.

Page 32:

How does PETSc help solve linear and nonlinear equations?

• Index the data; often several indexing schemes will be used at the same time – strong aids.
• Communications between nodes:
  - Figure out which data needs to be transferred, where from, and where to – strong aids.
  - Efficient and scalable data transfer – already implemented.
  - Merge the external data with the local copy; ghost points – already implemented.
• Debug, profile, monitor, and control the parallel code – partly implemented.

Page 33:

Review of the Principal Linear Scaling Ab Initio (DFT) Codes

• Written only in the last few years.
• Research initially focused on the development of tight-binding codes.
• Four major linear scaling ab initio codes are well documented: CONQUEST and ONETEP from England, SIESTA from Spain, and OpenMX from Japan.
• These are fairly full-featured codes, comparable to existing plane-wave ab initio codes, so they contain many complications and much code not directly related to linear scaling. I will concentrate here on the scalability aspects.

Page 34:

Review of the Principal Linear Scaling Ab Initio (DFT) Codes

• All four codes call MPI directly, and it seems likely that the linear scaling algorithm and the network calls are mixed in with the rest of the code: tightly coupled implementations.
• The prefactors of linear scaling algorithms grow cubically with the number of basis elements kept close to the diagonal, so each code spends a lot of effort on minimizing the basis size. This is one of the biggest distinguishing factors between the four codes.
• The matrices have a special structure of dense sub-blocks corresponding to individual atoms, and efficient codes have to take advantage of it.
• OpenMX is based on the Schur-Parlett (diagonalization) algorithm. The other codes all use direct algorithms.

Page 35:

Scaling

• So far only relatively modest calculations have been reported.
• CONQUEST reports good scaling for up to 512 processors on well-designed clusters (not Ethernet).
• The other three codes seem to have problems scaling even up to 100 processors, and do not report results with much more than that.

Page 36:

Scaling

• ONETEP attributes its scaling difficulties to a last bit of O(N^3) code which they hope to rewrite soon. Are they correct? (They will probably find other bottlenecks.)
• CONQUEST partitions assuming a structured grid, while ONETEP assumes an unstructured grid. The other two codes use a very simple and non-optimal 1-D scheme for partitioning the atoms.