Sparse Matrices and Combinatorial Algorithms in Star-P: Status and Plans (April 8, 2005)
Matlab's sparse matrix design principles
• All operations should give the same results for sparse and full matrices (almost all)
• Sparse matrices are never created automatically, but once created they propagate
• Performance is important, but usability, simplicity, completeness, and robustness are more important
• Storage for a sparse matrix should be O(nonzeros)
• Time for a sparse operation should be O(flops), as nearly as possible
Star-P dsparse matrices: same principles, some different tradeoffs
Sparse matrix operations
• dsparse layout, same semantics as ddense
• For now, only row distribution
• Matrix operators: +, -, max, etc.
• Matrix indexing and concatenation: A(1:3, [4 5 2]) = [ B(:,7) C ];
• A \ b by direct methods
Star-P sparse data structure
• Full: a 2-dimensional array of real or complex numbers; (nrows*ncols) memory
• Sparse: compressed row storage; about (2*nzs + nrows) memory

Example (3-by-3 matrix and its compressed form):

    31  0 53
     0 59  0
    41 26  0

values [31 53 59 41 26], column indices [1 3 2 1 2]

Star-P distributed sparse data structure
Rows are distributed in blocks across processors P0 ... Pn. Each processor stores:
• the number of local nonzeros
• the range of local rows
• its nonzeros in CSR form
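The compressed row layout described above can be sketched in a few lines of Python. This is an illustrative analogue only (Star-P's actual structure is C/MPI, and Matlab uses 1-based indices while this sketch uses 0-based column indices):

```python
def dense_to_csr(M):
    """Convert a dense row-major matrix (list of lists) to CSR form.

    Returns (values, colidx, rowptr): the nonzero values in row order,
    their 0-based column indices, and rowptr of length nrows+1 such that
    row i's nonzeros occupy values[rowptr[i]:rowptr[i+1]].
    Memory is about 2*nnz + nrows, versus nrows*ncols for dense storage.
    """
    values, colidx, rowptr = [], [], [0]
    for row in M:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                colidx.append(j)
        rowptr.append(len(values))
    return values, colidx, rowptr

# The 3-by-3 example from the slide:
M = [[31, 0, 53],
     [0, 59, 0],
     [41, 26, 0]]
values, colidx, rowptr = dense_to_csr(M)
# values = [31, 53, 59, 41, 26], colidx = [0, 2, 1, 0, 1], rowptr = [0, 2, 3, 5]
```

In the distributed structure, each processor would hold one such (values, colidx, rowptr) triple for its own block of rows.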
Sorting in Star-P
[V, perm] = sort (V)
Useful primitive for some sparse array algorithms, e.g. the sparse constructor
Star-P uses a parallel sample sort
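The sample-sort idea can be sketched serially in Python (a hypothetical helper, not Star-P's implementation): sample the input, choose p-1 splitters, bucket each element as if routing it to one of p processors, sort each bucket locally, and concatenate. The perm output mirrors Matlab's [V, perm] = sort(V).

```python
import bisect
import random

def sample_sort(v, p=4, oversample=8, seed=0):
    """Sort list v, returning (sorted_v, perm) with sorted_v[k] == v[perm[k]].

    In parallel, each of the p buckets would live on its own processor;
    here the buckets are simply sorted locally and concatenated.
    """
    rng = random.Random(seed)
    n = len(v)
    if n == 0:
        return [], []
    # Sample the data and pick p-1 splitters from the sorted sample.
    sample = sorted(rng.choice(v) for _ in range(min(n, p * oversample)))
    splitters = [sample[(i * len(sample)) // p] for i in range(1, p)]
    # Route each (value, original index) pair to its bucket by splitter range.
    buckets = [[] for _ in range(p)]
    for idx, x in enumerate(v):
        buckets[bisect.bisect_left(splitters, x)].append((x, idx))
    # Sort each bucket; concatenating the buckets yields a globally sorted list.
    out = [pair for b in buckets for pair in sorted(b)]
    return [x for x, _ in out], [i for _, i in out]
```

The buckets partition the value range, so local sorting plus concatenation gives a global sort; oversampling makes the buckets approximately balanced.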
The sparse( ) constructor
A = sparse (I, J, V, nr, nc)
Input: ddense vectors I, J, V, and dimensions nr, nc
A(I(k), J(k)) = V(k) for each k
Sort triples (i, j, v) by (i, j)
Adjust for desired row distribution
Locally convert to compressed row indices
Sum values with duplicate indices
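The constructor's key semantic, that duplicate (i, j) entries are summed, matches Matlab's sparse and can be sketched serially in Python:

```python
def sparse_triples(I, J, V, nr, nc):
    """Build a {(i, j): value} map from triples, summing duplicates.

    Mirrors A = sparse(I, J, V, nr, nc): A(I(k), J(k)) accumulates V(k).
    A real implementation would sort the triples by (i, j), adjust for the
    desired row distribution, and convert locally to compressed row form.
    """
    A = {}
    for i, j, v in zip(I, J, V):
        assert 1 <= i <= nr and 1 <= j <= nc, "index out of range"
        A[(i, j)] = A.get((i, j), 0) + v
    return A

# Duplicate entries at (1, 2) are summed: A(1,2) == 1 + 4 == 5
A = sparse_triples([1, 1, 2], [2, 2, 3], [1, 4, 5], 3, 3)
```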
Sparse matrix times dense vector
y = A * x
The first call to matvec caches a communication schedule for matrix A. Later calls to multiply any vector by A use the cached schedule.
Communication and computation overlap.
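The local computation behind y = A * x is a compressed-row matrix-vector product; a minimal serial sketch follows (the cached communication schedule would gather the needed remote entries of x before this loop runs):

```python
def csr_matvec(values, colidx, rowptr, x):
    """Compute y = A*x for A stored in CSR form (0-based column indices)."""
    y = []
    for i in range(len(rowptr) - 1):
        s = 0
        # Accumulate row i's nonzeros against the matching entries of x.
        for k in range(rowptr[i], rowptr[i + 1]):
            s += values[k] * x[colidx[k]]
        y.append(s)
    return y
```

Using the 3-by-3 example matrix from the data-structure slide with x = [1, 1, 1] gives the row sums [84, 59, 67].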
Sparse linear systems
x = A \ b
Matrix division uses MPI-based direct solvers:
• SuperLU_dist: nonsymmetric, static pivoting
• MUMPS: nonsymmetric, multifrontal
• PSPASES: Cholesky
ppsetoption('SparseDirectSolver', 'SUPERLU')
Iterative solvers implemented in Star-P
Ongoing work on preconditioners
Sparse eigenvalue solvers
[V, D] = eigs(A); etc.
Implemented via MPI-based PARPACK [Lehoucq, Maschhoff, Sorensen, Yang]
Work in progress
Combinatorial Scientific Computing
Sparse matrix methods; machine learning; search and information retrieval; adaptive multilevel modeling; geometric meshing; computational biology; . . .
How will combinatorial methods be used by people who don’t understand them in complete detail?
Analogy: Matrix division in Matlab
x = A \ b;
Works for either full or sparse A
• Is A square? No => use QR to solve the least squares problem
• Is A triangular or permuted triangular? Yes => sparse triangular solve
• Is A symmetric with positive diagonal elements? Yes => attempt Cholesky after symmetric minimum degree ordering
• Otherwise => use LU on A(:, colamd(A))
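The decision tree can be written down as a dispatch function. This is an illustrative sketch of the polyalgorithm's shape, not Matlab's actual implementation (whose structural tests are more refined):

```python
def backslash_method(A):
    """Pick a solver for x = A \\ b from the structure of dense matrix A
    (a list of rows), following the polyalgorithm sketched above.
    Permuted-triangular detection is omitted for brevity.
    """
    m, n = len(A), len(A[0])
    if m != n:
        return "QR least squares"
    lower = all(A[i][j] == 0 for i in range(n) for j in range(i + 1, n))
    upper = all(A[i][j] == 0 for i in range(n) for j in range(i))
    if lower or upper:
        return "triangular solve"
    symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))
    if symmetric and all(A[i][i] > 0 for i in range(n)):
        return "Cholesky after minimum degree"
    return "LU with column ordering"
```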
Combinatorics in Star-P
A sparse matrix language is a good start on primitives for combinatorial computing.
Random-access indexing: A(i,j)
Neighbor sequencing: find (A(i,:))
Sparse table construction: sparse (I, J, V)
Other primitives we are building: sorting, searching, pointer-jumping, connectivity and paths, . . .
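These primitives map directly onto graph operations. A tiny Python sketch, using a dict-of-dicts sparse adjacency structure as a stand-in for a dsparse matrix (the names here are illustrative, not Star-P API):

```python
def neighbors(A, i):
    """Neighbor sequencing: the analogue of find(A(i,:)) for a graph
    stored as a dict-of-dicts sparse adjacency matrix A[i][j] = weight."""
    return sorted(A.get(i, {}))

# A directed triangle, as a sparse adjacency structure.
A = {1: {2: 1.0, 3: 1.0}, 2: {3: 1.0}, 3: {1: 1.0}}
# Random-access indexing, the analogue of A(i,j):
w = A.get(2, {}).get(3, 0)
```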
Matching and depth-first search in Matlab
dmperm: Dulmage-Mendelsohn decomposition
Square, full-rank A: [p, q, r] = dmperm(A);
• A(p,q) is block upper triangular with nonzero diagonal
• also gives the strongly connected components of a directed graph
• and the connected components of an undirected graph
Arbitrary A: [p, q, r, s] = dmperm(A);
• maximum-size matching in a bipartite graph
• minimum-size vertex cover in a bipartite graph
• decomposition into strong Hall blocks
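The maximum-size bipartite matching underlying dmperm can be sketched with the standard augmenting-path algorithm (an illustrative sketch, not Matlab's implementation):

```python
def max_bipartite_matching(adj, n_right):
    """Maximum matching in a bipartite graph.

    adj[u] lists the right-vertices adjacent to left-vertex u.
    Returns match_right: for each right vertex, its matched left vertex
    or None. Repeatedly tries to augment along alternating paths.
    """
    match_right = [None] * n_right

    def try_augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current partner can move on.
            if match_right[v] is None or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    for u in range(len(adj)):
        try_augment(u, set())
    return match_right
```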
Connected components (work in progress)
Sequential Matlab uses depth-first search (dmperm), which doesn’t parallelize well
Shiloach-Vishkin algorithm:
repeat
    Link every (super)vertex to a random neighbor
    Shrink each tree to a supervertex by pointer jumping
until no further change
Hybrid SV / search method under construction
Other potential graph kernels:
• Breadth-first search (after Husbands, LBNL)
• Bipartite matching (after Riedy, UCB)
• Strongly connected components (after Pinar, LBNL)
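The hook-and-jump loop can be sketched serially in Python. Here minimum-label hooking stands in for the random-neighbor hooking described above; this is illustrative only, not Star-P's implementation:

```python
def sv_components(n, edges):
    """Connected components by hooking and pointer jumping.

    parent[v] is the (super)vertex that v currently points to. Each round
    hooks every edge's roots toward the smaller root, then shrinks each
    tree to a single supervertex by pointer jumping, until nothing
    changes. Returns a component label for every vertex.
    """
    parent = list(range(n))
    while True:
        old = parent[:]
        # Hook: link roots toward the minimum neighboring root.
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru < rv:
                parent[rv] = ru
            elif rv < ru:
                parent[ru] = rv
        # Pointer jumping: parent = parent[parent] until trees are flat.
        while any(parent[parent[v]] != parent[v] for v in range(n)):
            parent = [parent[parent[v]] for v in range(n)]
        if parent == old:
            return parent
```

Labels only decrease, so the loop terminates; each component ends up labeled by its smallest vertex.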
MIT Lincoln Laboratory
SSCA #2 -- Overview
SSCA #2: Directed weighted multigraph with no self-loops
• Scalable data generator
• Kernel 1 — Graph Construction
• Kernel 2 — Classify Large Sets (Search)
• Kernel 3 — Subgraph Extraction
• Kernel 4 — Graph Clustering / Partitioning
Generated graph:
• Randomly sized "cliques"
• Most intra-clique edges present
• Random inter-clique edges, gradually thinning with distance
• Integer and character-string edge labels
• Randomized vertex numbers
HPCS Benchmark Spectrum: SSCA #2
[Figure: the SSCA #2 pipeline, a scalable data generator feeding Kernel 1 (Graph Construction), Kernel 2 (Classify Large Sets), Kernel 3 (Subgraph Extraction), and Kernel 4 (Graph Clustering / Partitioning).]
[Figure: the HPCS benchmark spectrum, spanning micro and kernel benchmarks through mission-partner application benchmarks, over existing, emerging, and future applications (simulation, intelligence, reconnaissance). HPCchallenge benchmarks give execution performance bounds: local DGEMM, STREAM, RandomAccess, 1D FFT; global Linpack, PTRANS, RandomAccess, 1D FFT. Scalable Synthetic Compact Applications form an HPCS spanning set of kernels (discrete math, graph analysis, linear solvers, signal processing, simulation, I/O): 1. Optimal Pattern Matching; 2. Graph Analysis (data generator, graph construction, sort large sets, graph extraction, graph clustering); 3. Simulation: NWCHEM; 4. Simulation: NAS PB AU; 5. Simulation: Multi-Physics; 6. Signal Processing / Knowledge Formation. Application benchmarks give system bounds: current UM2000, GAMESS, OVERFLOW, LBMHD, RFCTH, HYCOM; near-future NWChem, ALEGRA, CCSM.]
Scalable Data Generator
• Produces edge tuples containing the start vertex, end vertex, and weight for each directed edge
• Hierarchical nature: random-sized cliques connected by edges between some of the cliques
• Interclique edges are assigned using a random distribution that represents hierarchical thinning as vertices grow further apart
• Weights are either random integer values or randomly selected character strings
• Random permutations avoid induced locality (vertex numbers: local processor memory locality; parallel: global memory locality)
• Data sets must be developed in their entirety before graph construction
• May be parallelized; data-tuple vertex naming schemes must be globally consistent
• The data generator will not be timed; statistics will be saved for use in later verification stages
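A toy serial sketch of the generator's structure in Python. All parameters here are hypothetical; the real SSCA #2 generator is fixed by the benchmark specification:

```python
import random

def generate_edge_triples(n_cliques=4, max_clique=5, p_inter=0.05, seed=1):
    """Produce (start, end, weight) triples for a directed multigraph:
    random-sized cliques with most intra-clique edges present, sparse
    inter-clique edges thinning with clique distance, integer weights,
    and randomly permuted vertex numbers to avoid induced locality.
    No self-loops are generated.
    """
    rng = random.Random(seed)
    sizes = [rng.randint(2, max_clique) for _ in range(n_cliques)]
    starts = [0]
    for s in sizes:
        starts.append(starts[-1] + s)
    n = starts[-1]
    perm = list(range(n))
    rng.shuffle(perm)                     # randomized vertex numbering
    triples = []
    for c, size in enumerate(sizes):      # most intra-clique edges present
        verts = range(starts[c], starts[c] + size)
        for u in verts:
            for v in verts:
                if u != v and rng.random() < 0.9:
                    triples.append((perm[u], perm[v], rng.randint(1, 100)))
    for c1 in range(n_cliques):           # inter-clique edges, thinning with distance
        for c2 in range(n_cliques):
            if c1 != c2 and rng.random() < p_inter / abs(c1 - c2):
                u = rng.randrange(starts[c1], starts[c1] + sizes[c1])
                v = rng.randrange(starts[c2], starts[c2] + sizes[c2])
                triples.append((perm[u], perm[v], rng.randint(1, 100)))
    return triples, n
```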
Kernel 1 — Graph Construction
Construct the multigraph in a format usable by all subsequent kernels
No subsequent modifications will be permitted to benefit specific kernels
Graph construction will be timed
Kernel 2 — Sort on Selected Edge Weights
Sort on edge weights to determine the vertex pairs with the largest integer weights and those with a desired string weight
The two vertex-pair lists will be saved for use in the subsequent kernel:
• start set SI for integer weights
• start set SC for character-string weights
Sorting weights will be timed
Kernel 3 — Extract Subgraphs
Extract a series of subgraphs formed by following paths of length k from a start set of initial vertices
The set of initial vertices will be determined from the previous kernel
Produce a series of subgraphs consisting of vertices and edges on paths of length k from the vertex pairs start sets SI and SC
A possible computational kernel for graph extraction is breadth-first search
Extracting graphs will be timed
Kernel 4 — Partition Graph Using a Clustering Algorithm
Apply a graph clustering algorithm to partition the vertices of the graph into subgraphs no larger than a maximum size, so as to minimize the number of edges that must be "cut"; e.g.:
• Kernighan and Lin
• Sangiovanni-Vincentelli, Chen, and Chua
The output of this kernel will be a description of the clusters and the edges that form the interconnecting “network”
This step should identify the structure in the graph specified in the data generation phase
It is important that the implementation of this kernel does not utilize a priori knowledge of the details in the data generator or the statistics collected in the graph generation process
Identifying the clusters and the interconnecting “network” will be timed
Concise SSCA#2
A serial Matlab implementation of SSCA#2
Coding style is simple and “data parallel”
Our starting point for Star-P implementation
Edge labels are integers only for now; string labels are not yet supported
Data generation
Using Lincoln Labs serial Matlab generator at present
Data-parallel Star-P generator under construction
Generate “edge triples” as distributed dense vectors
Kernel 1: Graph construction
cSSCA#2 K1: ~30 lines of executable serial Matlab
Multigraph is a cell array of simple directed graphs
Graphs are dsparse matrices, created by sparse( )
Kernel 2: Search in large sets
cSSCA#2 K2: ~12 lines of executable serial Matlab
Uses max( ) and find( ) to locate edges of max weight
    ng = length(G.edgeWeights);   % number of graphs
    maxwt = 0;
    for g = 1:ng
        maxwt = max(maxwt, max(max(G.edgeWeights{g})));
    end
    intedges = zeros(0,2);
    for g = 1:ng
        [intstart, intend] = find(G.edgeWeights{g} == maxwt);
        intedges = [intedges ; intstart intend];
    end
Kernel 3: Subgraph extraction
cSSCA#2 K3: ~25 lines of executable serial Matlab
Returns subgraphs consisting of vertices and edges within specified distance of specified starting vertices
Sparse matrix-vector product for breadth-first search
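Breadth-first search by repeated sparse matrix-vector product can be sketched in Python, with a dict-of-sets adjacency structure standing in for a dsparse matrix (illustrative, not the cSSCA#2 code):

```python
def bfs_levels(adj, start, k):
    """Vertices within k hops of the start set, via frontier expansion.

    Each iteration is the Boolean analogue of one sparse matrix-vector
    product (frontier = A' * frontier) with visited vertices masked off.
    adj maps a vertex to the set of its out-neighbors.
    """
    reached = set(start)
    frontier = set(start)
    for _ in range(k):
        nxt = set()
        for u in frontier:
            nxt |= adj.get(u, set())      # "multiply": union of neighbor sets
        frontier = nxt - reached          # mask off already-visited vertices
        reached |= frontier
    return reached
```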
Kernel 4: Partitioning / clustering
cSSCA#2 K4: serial prototype exists, not yet parallel
Different algorithm from Lincoln Labs executable spec, which uses seed-growing breadth-first search
cSSCA#2 uses spectral methods for clustering, based on eigenvectors (Fiedler vectors) of the graph
One version seeks small balanced edge cuts (graph partitioning); another seeks clusters with low isoperimetric number (clustering)
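The Fiedler-vector approach can be sketched with NumPy: partition the vertices by the sign of the eigenvector for the second-smallest eigenvalue of the graph Laplacian. This is an illustrative analogue of the cSSCA#2 spectral method, not its actual code:

```python
import numpy as np

def fiedler_partition(adj):
    """Split an undirected graph into two parts by the Fiedler vector.

    adj is a symmetric 0/1 adjacency matrix (numpy array). The eigenvector
    for the second-smallest eigenvalue of the Laplacian L = D - A tends to
    give a small balanced edge cut when split at zero.
    """
    A = np.asarray(adj, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    _, V = np.linalg.eigh(L)              # eigenvalues in ascending order
    fiedler = V[:, 1]                     # second-smallest eigenvector
    return fiedler < 0                    # boolean partition of the vertices

# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2,3):
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
part = fiedler_partition(A)
# The cut separates the two triangles.
```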