Computing the Pseudoinverse of a Graph’s Laplacian using GPUs
Large Scale Parallel Processing 2015 1/54
Nishant Saurabh (Vrije Universiteit, Amsterdam), Dr. Ana Lucia Varbanescu (University of Amsterdam, Amsterdam), Dr. Gyan Ranjan (Symantec, CA, USA)
Workshop on Large-Scale Parallel Processing IEEE International Parallel and Distributed Processing Symposium
May 25th - 29th, 2015
Graphs are Everywhere !
Large Scale
Local World
Degree Distribution
Sparse
Small World Characteristics
Complex Networks : Define Real World Graphs
Topological Characteristics
Behavioral Predictability
Why ?
Graph Formalization & Background
[Figure: example graph G with five vertices, labeled 1 to 5]
A Simple, Connected, Undirected, Unweighted Graph G(V, E) .
n = |V| is the order of the graph, i.e., the number of vertices.
m = |E| is the size of the graph, i.e., the number of edges.
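$$A = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 2 & 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 & 0 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$
A is the Adjacency Matrix, D is the Degree Matrix.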
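Degree Matrix (D) - Adjacency Matrix (A) = Laplacian Matrix (L)
$$L = D - A = \begin{pmatrix} 2 & -1 & -1 & 0 & 0 \\ -1 & 2 & 0 & -1 & 0 \\ -1 & 0 & 3 & -1 & -1 \\ 0 & -1 & -1 & 2 & 0 \\ 0 & 0 & -1 & 0 & 1 \end{pmatrix}$$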
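The construction L = D - A can be sketched in a few lines of plain Python. This is only an illustration: the edge list is read off the adjacency matrix of the five-vertex example, and the function name `laplacian` is a hypothetical helper, not something taken from the talk.

```python
# Edge list read off the adjacency matrix A of the example (vertices 1..5).
EDGES = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)]


def laplacian(n, edges):
    """Return (A, D, L) as nested lists for an undirected, unweighted graph."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        # Vertices are 1-based on the slides, list indices are 0-based.
        A[u - 1][v - 1] = 1
        A[v - 1][u - 1] = 1
    degrees = [sum(row) for row in A]
    D = [[degrees[i] if i == j else 0 for j in range(n)] for i in range(n)]
    L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]
    return A, D, L


A, D, L = laplacian(5, EDGES)
for row in L:
    print(row)
```

Every row of L sums to zero, because each diagonal entry (the degree) is cancelled by the -1 entries of that vertex's neighbours.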
Inverse of Laplacian matrix and Eigenspace
The inverse of the Laplacian matrix L is $L^{-1}$ such that:
$$L L^{-1} = I$$
where I is the identity matrix and L is a square matrix.
Not every matrix is invertible.
The eigenspace can be formulated as:
$$L v = \lambda v$$
where L is the Laplacian matrix, v is an eigenvector and λ is the corresponding eigenvalue.
A matrix is not invertible if any of its eigenvalues λ is 0.
Eigenvalues & Eigenvector
The Laplacian L = D - A of the five-vertex example graph is non-invertible. Its eigenvalues are
λ = 0, 0.82, 2, 2.68, 4.4812
and one of them is 0.
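A quick way to see where the zero eigenvalue comes from, sketched in plain Python: every row of a Laplacian sums to zero, so the all-ones vector is an eigenvector with eigenvalue 0. The trace comparison at the end is only a sanity check on the rounded eigenvalues quoted on the slide.

```python
# Laplacian of the five-vertex example graph.
L = [
    [2, -1, -1, 0, 0],
    [-1, 2, 0, -1, 0],
    [-1, 0, 3, -1, -1],
    [0, -1, -1, 2, 0],
    [0, 0, -1, 0, 1],
]

ones = [1] * len(L)
# L times the all-ones vector: each entry is a row sum of L.
L_times_ones = [sum(a * b for a, b in zip(row, ones)) for row in L]
print(L_times_ones)

# The eigenvalues of a matrix sum to its trace; the slide values are rounded,
# so the two numbers should be close but not identical.
trace = sum(L[i][i] for i in range(len(L)))
rounded_eigenvalues = [0, 0.82, 2, 2.68, 4.4812]
print(trace, sum(rounded_eigenvalues))
```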
Moore Penrose pseudo-inverse
To calculate the inverse for a rank deficient matrix (L = Laplacian matrix, of order n), use the Moore-Penrose pseudo-inverse L+:
L+ = pinv(L) in Matlab
L+ = numpy.linalg.pinv(L) in Python
$$L^{+} = \left(L + \tfrac{1}{n} J\right)^{-1} - \tfrac{1}{n} J$$
where J is the n × n all-ones matrix, i.e. 1/n is added to every entry of L before the inversion and subtracted from every entry afterwards. This holds for the Laplacian of a connected graph.
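The shift-invert-shift formula can be checked on the five-vertex example. The sketch below is an assumption-laden illustration, not the talk's implementation: it uses exact rational arithmetic from the standard library instead of a BLAS or GPU routine, and `invert`, `laplacian_pinv` and `matmul` are hypothetical helper names.

```python
from fractions import Fraction


def invert(M):
    """Gauss-Jordan inverse of a non-singular square matrix (hypothetical helper)."""
    n = len(M)
    # Augment M with the identity matrix: [M | I].
    aug = [[Fraction(x) for x in row] + [Fraction(int(r == c)) for c in range(n)]
           for r, row in enumerate(M)]
    for col in range(n):
        # Pick any row at or below the diagonal with a non-zero pivot.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def laplacian_pinv(L):
    """L+ = (L + J/n)^-1 - J/n for the Laplacian of a connected graph."""
    n = len(L)
    shift = Fraction(1, n)
    shifted = [[x + shift for x in row] for row in L]
    return [[x - shift for x in row] for row in invert(shifted)]


def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


L = [
    [2, -1, -1, 0, 0],
    [-1, 2, 0, -1, 0],
    [-1, 0, 3, -1, -1],
    [0, -1, -1, 2, 0],
    [0, 0, -1, 0, 1],
]
L_plus = laplacian_pinv(L)
# One of the Moore-Penrose conditions: L L+ L must reproduce L.
print(matmul(matmul(L, L_plus), L) == L)
```

Exact fractions make the Moore-Penrose check an equality test; with floating point the comparison would need a tolerance.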
Applications to Computation of L+
Epidemiology
Infrastructure Planning
Online Social Networks
Collaborative Recommendation Systems
Probability and Mathematical Chemistry
The Goal
Algorithm + Speedup ?
Divide and Conquer Approach to compute L+
Three Steps
Partition
Divide : Partition
A simple, connected, unweighted, undirected graph G is partitioned into two connected subgraphs G1 and G2.
Compute $L^{+}_{G_1}$ and $L^{+}_{G_2}$.
Divide and Conquer Approach to compute L+
Three Steps
Partition
First Join
Conquer : First Join
Dotted lines represent the cutoff edges, whose number is minimized during Partition.
Join a random cutoff edge. We get a new connected graph G3.
Using $L^{+}_{G_1}$ and $L^{+}_{G_2}$, compute $L^{+}_{G_3}$.
Divide and Conquer Approach to compute L+
Three Steps
Partition
First Join
Edge Firing
Conquer : Edge Firing
Fire the first cutoff edge: we get G4. Compute $L^{+}_{G_4}$.
Fire the second cutoff edge: we get G5. Compute $L^{+}_{G_5}$.
Methodology
Abstract a simple Partition Approach.
Use GPU to compute L+ of the subgraphs.
Apply element-wise computation on First Join and Edge Firing using GPU.
Minimize Data Transfer.
Implementation Approach - Representation
[Figure: a simple, connected, unweighted, undirected graph G with seven vertices, labeled 1 to 7]
Sparse Representation (one triplet per nonzero entry, each edge listed in both directions):
1, 2, 1    2, 1, 1
1, 5, 1    5, 1, 1
1, 7, 1    7, 1, 1
2, 3, 1    3, 2, 1
3, 6, 1    6, 3, 1
3, 7, 1    7, 3, 1
4, 7, 1    7, 4, 1
4, 5, 1    5, 4, 1
Implementation Approach - Partition
Nodes of G sorted by degree:
Node   Degree
1      3
3      3
7      3
2      2
4      2
5      2
6      1
Edges are removed node by node, in decreasing order of degree, until the largest component has order at most n / 2.
Remove edges of node 1: the largest component has order 6.
Remove edges of node 3: the largest component has order 3, which is <= n / 2 (n = 7). But we wanted two connected components, and there are some isolated nodes.
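The edge-removal step on the seven-vertex example can be reproduced with a short plain-Python sketch. It is only an illustration under stated assumptions: the edge list is read from the sparse representation of G, ties between nodes of equal degree are broken by the smaller label, and `components` and `partition` are hypothetical names. The recombination of isolated nodes is not covered.

```python
from collections import deque

# Undirected edges of the seven-vertex example graph G.
EDGES = [(1, 2), (1, 5), (1, 7), (2, 3), (3, 6), (3, 7), (4, 7), (4, 5)]
N = 7


def components(n, edges):
    """Connected components (as sorted lists) of a graph on vertices 1..n, via BFS."""
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in range(1, n + 1):
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            u = queue.popleft()
            comp.append(u)
            for w in adj[u] - seen:
                seen.add(w)
                queue.append(w)
        comps.append(sorted(comp))
    return comps


def partition(n, edges):
    """Strip the edges of high-degree nodes until no component exceeds n / 2."""
    degree = {v: 0 for v in range(1, n + 1)}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Highest degree first; ties broken by smaller label (an assumption).
    order = sorted(degree, key=lambda v: (-degree[v], v))
    remaining, history = list(edges), []
    for node in order:
        if max(len(c) for c in components(n, remaining)) <= n / 2:
            break
        remaining = [e for e in remaining if node not in e]
        largest = max(len(c) for c in components(n, remaining))
        history.append((node, largest))
    return remaining, history


remaining, history = partition(N, EDGES)
print(history)                   # (node whose edges were removed, largest component order)
print(components(N, remaining))  # components left after the removals
```

The removed edges are the candidates for the cutoff edges that First Join and Edge Firing add back later.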
Implementation Approach - Recombine
We get a bipartition, with two simple, connected components.
Parallel Implementation - First Join & Edge Firing
1. CPU Based Parallel Implementation :
   Parallelisation using pthreads
   Inverse, (L + 1/n)^-1 - 1/n : dgetri.f LAPACK routine
   First Join and Edge Firing : 4 threads generated
2. Matlab based GPU Implementation :
   Parallel Computing Toolbox
   Inverse : inv( ), GPU enabled function
   First Join and Edge Firing : bsxfun( )
3. CUDA based GPU Implementation :
   Thrust Library
   Inverse : cuBLAS library routine
   First Join - Three Device Kernels
   Edge Firing - Single Kernel
   256 threads generated per block
4. cuBLAS Implementation - Baselining the performance :
   Inverse, (L + 1/n)^-1 - 1/n : cublas<t>getrfBatched routine (LU factorization), then cublas<t>getriBatched routine (inversion)
Implementation - First Join Using GPU
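HOST: Transfers the sub-graphs as full matrices of order n1 and n2. Here n1 = 2 and n2 = 3.
HOST: Transfers the first cutoff edge as (source, destination).
GPU: Computes L+ of the sub-graphs:
$$L^{+}_{G_1} = \begin{pmatrix} l^{+}_{11} & l^{+}_{12} \\ l^{+}_{21} & l^{+}_{22} \end{pmatrix}, \qquad L^{+}_{G_2} = \begin{pmatrix} l^{+}_{11} & l^{+}_{12} & l^{+}_{13} \\ l^{+}_{21} & l^{+}_{22} & l^{+}_{23} \\ l^{+}_{31} & l^{+}_{32} & l^{+}_{33} \end{pmatrix}$$
GPU: Using L+ of the sub-graphs, $L^{+}_{G_3}$ is obtained:
$$L^{+}_{G_3} = \begin{pmatrix} l^{+}_{11} & l^{+}_{12} & l^{+}_{13} & l^{+}_{14} & l^{+}_{15} \\ l^{+}_{21} & l^{+}_{22} & l^{+}_{23} & l^{+}_{24} & l^{+}_{25} \\ l^{+}_{31} & l^{+}_{32} & l^{+}_{33} & l^{+}_{34} & l^{+}_{35} \\ l^{+}_{41} & l^{+}_{42} & l^{+}_{43} & l^{+}_{44} & l^{+}_{45} \\ l^{+}_{51} & l^{+}_{52} & l^{+}_{53} & l^{+}_{54} & l^{+}_{55} \end{pmatrix}$$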
Implementation - Edge Firing Using GPU
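HOST: The remaining cutoff edges are transferred as (source, destination), N times, where N is the number of edges fired after the First Join.
GPU: Computes L+ of order n1 + n2, N times:
$$L^{+}_{G} = \begin{pmatrix} l^{+}_{11} & l^{+}_{12} & l^{+}_{13} & l^{+}_{14} & l^{+}_{15} \\ l^{+}_{21} & l^{+}_{22} & l^{+}_{23} & l^{+}_{24} & l^{+}_{25} \\ l^{+}_{31} & l^{+}_{32} & l^{+}_{33} & l^{+}_{34} & l^{+}_{35} \\ l^{+}_{41} & l^{+}_{42} & l^{+}_{43} & l^{+}_{44} & l^{+}_{45} \\ l^{+}_{51} & l^{+}_{52} & l^{+}_{53} & l^{+}_{54} & l^{+}_{55} \end{pmatrix}$$
Final result obtained on the host with gather( L+G ).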
Experiments & Results
Hardware Platform :
NVIDIA Tesla K20m GPU from DAS4 , a six-cluster wide-area distributed system designed and used by 5 research institutions in The Netherlands.
5GB of GPU global memory, Memory bandwidth of 208 GB/sec, and a peak performance of 3520 GFlops (single precision).
CUDA 5.5 , Matlab R2014a
CPU experiments (sequential and parallel) - performed on DAS4 computing node, using dual-quad-core 2.4 GHz CPU configuration and 24GB memory.
Divide and Conquer approach versus (L + 1/n)^-1 - 1/n
Finding :
Different behavior for different graphs.
cuBLAS and CUDA better for smaller graphs.
Matlab suitable for graphs of large order.
Finding :
Speedup of up to 300 times achieved.
Matlab speedup is better for graphs of large order.
No Correlation Found !
Future Work : Investigate more parameters.
Contributions
Designed a parallel version of the Divide-and-Conquer computation of the Moore-Penrose pseudo-inverse of the Laplacian.
Designed a GPU-enabled version of this parallel solution, and implemented it in Matlab, with significant speedup.
Implemented three other parallel versions: one using CUDA, one using cuBLAS, and one using pthreads.
Empirical evidence that the performance of the three GPU-enabled versions is heavily dependent on the input graph properties.
Conclusion & Future Work
Conclusion :
cuBlas and CUDA - small order graphs.
Matlab implementation outperforms cuBlas and CUDA for large order graphs.
Divide and Conquer Approach - Large Graphs - Significant Performance.
Matlab GPU Computing - Productivity and Performance.
Performance Variation - Input Graph
Future Work : Multiple GPUs, Recursive Partitioning.
Spanning Trees to Compute L+.
Investigate parameters of the graph affecting performance.
Thank you !
Any Questions ?
Nishant Saurabh
Back-up Slides
Computational Formula
Operation: First Join (cutoff edge (i, j) with i ∈ G1 and j ∈ G2; n3 = n1 + n2)

Ω:
$$x, y \in G_1: \quad \Omega^{G_3}_{xy} = \Omega^{G_1}_{xy}$$
$$x, y \in G_2: \quad \Omega^{G_3}_{xy} = \Omega^{G_2}_{xy}$$
$$x \in G_1,\ y \in G_2: \quad \Omega^{G_3}_{xy} = \Omega^{G_1}_{xi} + \omega_{ij} + \Omega^{G_2}_{jy}$$

L+:
$$x, y \in G_1: \quad l^{+(3)}_{xy} = l^{+(1)}_{xy} - \frac{n_2 n_3 \left( l^{+(1)}_{xi} + l^{+(1)}_{iy} \right) - n_2^2 \left( l^{+(1)}_{ii} + l^{+(2)}_{jj} + \omega_{ij} \right)}{n_3^2}$$
$$x, y \in G_2: \quad l^{+(3)}_{xy} = l^{+(2)}_{xy} - \frac{n_1 n_3 \left( l^{+(2)}_{xj} + l^{+(2)}_{jy} \right) - n_1^2 \left( l^{+(1)}_{ii} + l^{+(2)}_{jj} + \omega_{ij} \right)}{n_3^2}$$
$$x \in G_1,\ y \in G_2: \quad l^{+(3)}_{xy} = \frac{n_3 \left( n_1 l^{+(1)}_{xi} + n_2 l^{+(2)}_{jy} \right) - n_1 n_2 \left( l^{+(1)}_{ii} + l^{+(2)}_{jj} + \omega_{ij} \right)}{n_3^2}$$
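The three First Join cases for L+ can be sketched as one element-wise update. This is a minimal illustration, not the talk's GPU kernels: it runs sequentially with exact fractions, `first_join` is a hypothetical name, indices are 0-based, and `omega` is taken to be 1 for an unweighted edge (an assumption). The example joins a single edge a-b with an isolated vertex c through the edge b-c, which gives the path a-b-c.

```python
from fractions import Fraction


def first_join(P1, P2, i, j, omega=1):
    """L+ of G3 = G1 + G2 + edge (i, j), from P1 = L+ of G1 and P2 = L+ of G2.

    i is a vertex index in G1, j a vertex index in G2 (both 0-based).
    """
    n1, n2 = len(P1), len(P2)
    n3 = n1 + n2
    n3_sq = Fraction(n3 * n3)
    s = P1[i][i] + P2[j][j] + omega  # shared term l+_ii + l+_jj + omega_ij
    P3 = [[Fraction(0)] * n3 for _ in range(n3)]
    # Case x, y in G1.
    for x in range(n1):
        for y in range(n1):
            P3[x][y] = P1[x][y] - (n2 * n3 * (P1[x][i] + P1[i][y]) - n2 * n2 * s) / n3_sq
    # Case x, y in G2 (stored after the n1 vertices of G1).
    for x in range(n2):
        for y in range(n2):
            P3[n1 + x][n1 + y] = (
                P2[x][y] - (n1 * n3 * (P2[x][j] + P2[j][y]) - n1 * n1 * s) / n3_sq
            )
    # Case x in G1, y in G2; the result is symmetric.
    for x in range(n1):
        for y in range(n2):
            value = (n3 * (n1 * P1[x][i] + n2 * P2[j][y]) - n1 * n2 * s) / n3_sq
            P3[x][n1 + y] = value
            P3[n1 + y][x] = value
    return P3


q = Fraction(1, 4)
P_edge = [[q, -q], [-q, q]]   # L+ of a single edge a-b
P_vertex = [[Fraction(0)]]    # L+ of an isolated vertex c
P_path = first_join(P_edge, P_vertex, i=1, j=0)
for row in P_path:
    print([str(v) for v in row])
```

Every entry of the result depends only on entries of the two input matrices, which is what makes the update a natural fit for element-wise parallel execution.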
Operation: Edge Firing (adding edge (i, j) to G1 gives G2)

Ω:
$$\Omega^{G_2}_{xy} = \Omega^{G_1}_{xy} - \frac{\left[ \left( \Omega^{G_1}_{xj} - \Omega^{G_1}_{xi} \right) - \left( \Omega^{G_1}_{jy} - \Omega^{G_1}_{iy} \right) \right]^2}{4 \left( \omega_{ij} + \Omega^{G_1}_{ij} \right)}$$

L+:
$$l^{+(2)}_{xy} = l^{+(1)}_{xy} - \frac{\left( l^{+(1)}_{xi} - l^{+(1)}_{xj} \right) \left( l^{+(1)}_{iy} - l^{+(1)}_{jy} \right)}{\omega_{ij} + \Omega^{G_1}_{ij}}$$
Case Study - Topological Centrality
[Figure: example graph G with five vertices, labeled 1 to 5]
$$L^{+} = \begin{pmatrix} l^{+}_{11} & l^{+}_{12} & l^{+}_{13} & l^{+}_{14} & l^{+}_{15} \\ l^{+}_{21} & l^{+}_{22} & l^{+}_{23} & l^{+}_{24} & l^{+}_{25} \\ l^{+}_{31} & l^{+}_{32} & l^{+}_{33} & l^{+}_{34} & l^{+}_{35} \\ l^{+}_{41} & l^{+}_{42} & l^{+}_{43} & l^{+}_{44} & l^{+}_{45} \\ l^{+}_{51} & l^{+}_{52} & l^{+}_{53} & l^{+}_{54} & l^{+}_{55} \end{pmatrix}$$
Topological Centrality: $C^{*}_i = 1 / l^{+}_{ii}$, where $l^{+}_{ii}$ is the diagonal entry of $L^{+}$ for node i.
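As an illustration, the centrality of each node of the five-vertex example can be computed directly from the diagonal of L+. The sketch assumes exact rational arithmetic and the identity L+ = (L + J/n)^-1 - J/n for a connected graph; `invert` and `laplacian_pinv` are hypothetical helpers, not routines from the talk.

```python
from fractions import Fraction


def invert(M):
    """Gauss-Jordan inverse of a non-singular square matrix (hypothetical helper)."""
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(int(r == c)) for c in range(n)]
           for r, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def laplacian_pinv(L):
    """L+ = (L + J/n)^-1 - J/n for the Laplacian of a connected graph."""
    n = len(L)
    shift = Fraction(1, n)
    return [[x - shift for x in row]
            for row in invert([[x + shift for x in row] for row in L])]


L = [
    [2, -1, -1, 0, 0],
    [-1, 2, 0, -1, 0],
    [-1, 0, 3, -1, -1],
    [0, -1, -1, 2, 0],
    [0, 0, -1, 0, 1],
]
L_plus = laplacian_pinv(L)
# Topological centrality of node i is the reciprocal of the i-th diagonal entry.
centrality = [1 / L_plus[i][i] for i in range(len(L))]
for node, c in enumerate(centrality, start=1):
    print(node, float(c))
most_central = max(range(len(L)), key=lambda i: centrality[i]) + 1
```

In this example node 3, the vertex with the highest degree, comes out as the most central, and the pendant node 5 as the least central.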