Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 ·...

51
Design Considerations for a GraphBLAS Compliant Graph Library on Clusters of GPUs July 11, 2016 Carl Yang, University of California, Davis Aydın Buluç, Lawrence Berkeley National Laboratory John D. Owens, University of California, Davis

Transcript of Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 ·...

Page 1: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Design  Considerations  for  a  GraphBLASCompliant  Graph  Library  on  Clusters  of  GPUs

July 11,  2016

Carl  Yang,  University  of California,  DavisAydın  Buluç,  Lawrence  Berkeley  National  LaboratoryJohn  D.  Owens,  University  of California,  Davis

Page 2: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

GraphBLASInstead  of  writing  a  graph  algorithm  from  scratch,  users  can  express  algorithm  using  a  small  number  of  GraphBLAS kernels

Hardware-­agnostic:Users  do  not  have  to  relearn  many  different  APIs  in  order  to  run  graph  algorithms  on  different  hardware

Efficiency:Kernels  are  implemented  with  performance  guarantees

Page 3: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Why  use  GPUs  for  graphs?Cost-­efficientK20X  costs  ~$2000

Improving  faster  that  traditional  CPUsP100  (2016):    ~8  TFLOPS,  ~1TB/s,  ~3800  ALUs,  32GB  mem.CPU-­to-­GPU  bandwidth:  ~80GB/s-­200GB/s  (NVLink)Xeon  Phi  7250  (2016):  ~6  TFLOPS,  ~400GB/s,  ~288  threads,  16GB  mem.CPU-­to-­coprocessor  bandwidth:  ~100GB/s  (Omni-­Path)

Page 4: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Challenges  with  using  GPUs  for  graph  analysis

Hard  to  programSmall  memoryCPU  RAM  available  ~100x  larger

Solved  by  GraphBLASAPISolved  by  using  multi-­GPU

Page 5: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

GraphBLAS kernel  (mXv)Goal:  Implement  mXv on  multiple  GPUs  following  GraphBLASstandardSame  as  SpMVSame  as  vXm,  which  computes  yT =  xTA

Input:  Given  adjacency  matrix  A,  vector  xWant:  Compute  y  =  ATxCurrent  GraphBLAS API:

GrB_info GrB_mXv(  GrB_Vector *y,  const GrB_Semiring s,  const GrB_Matrix A,                    const GrB_vector x  )  

Def:  Think  of  a  semiring as  an  object  with  a  “x”  operator,  “+”  operator,  and  identity  for  each  operator.

Page 6: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Applications  of  SpMVSSSPPageRankMISAnd  many  more!

Page 7: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0

AT x                y              BFS

x                =      ?

Each iteration,  1)  compute  y  =  ATx2)  update BFS  result3)  use  BFS  result to  filter y4)  stop  if BFS  result is unchanged

BFS  Using SpMV (Iteration  1)

10000

0-­1-­1-­1-­1

0

1

2

3

4

Page 8: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0

AT x                y              BFS

x                =

Each iteration,  1)  compute  y  =  ATx2)  update BFS  result3)  use  BFS  result to  filter y4)  stop  if BFS  result is unchanged

BFS  Using SpMV (Iteration  1)

10000

01000

0-­1-­1-­1-­1

0

1

2

3

4

Page 9: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0

AT x                y              BFS

x                =

Each iteration,  1)  compute y =  ATx new:  -­22)  update  BFS  result3)  use  BFS  result to  filter y4)  stop  if  BFS  result  is  unchanged

BFS  Using SpMV (Iteration  1)

10000

01000

01-­1-­1-­1

0

1

2

3

4

Page 10: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0

AT x                y              BFS

x                =

Each iteration,  1)  compute y =  ATx2)  update  BFS  result old:  -­23)  use  BFS  result to  filter y4)  stop  if  BFS  result  is  unchanged

BFS  Using SpMV (Iteration  2)

01000

00110

01-­1-­1-­1

0

1

2

3

4

Page 11: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0

AT x                y              BFS

x                =

Each iteration,  1)  compute y =  ATx new:  42)  update  BFS  result old:  -­23)  use  BFS  result to  filter y4)  stop  if  BFS  result  is  unchanged

BFS  Using SpMV (Iteration  2)

01000

00110

0122-­1

0

1

2

3

4

Page 12: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0

AT x                y              BFS

x                =

Each iteration,  1)  compute y =  ATx2)  update  BFS  result old:  43)  use  BFS  result to  filter y4)  stop  if  BFS  result  is  unchanged

BFS  Using SpMV (Iteration  3)

00110

00012

0122-­1

0

1

2

3

4

Page 13: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0

AT x                y              BFS

x                =

Each iteration,  1)  compute y =  ATx new:  82)  update  BFS  result old:  43)  use  BFS  result to  filter y4)  stop  if  BFS  result  is  unchanged

BFS  Using SpMV (Iteration  3)

00110

00012

01223

0

1

2

3

4

Page 14: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0

AT x                y              BFS

x                =

Each iteration,  1)  compute y =  ATx new:  82)  update  BFS  result old:  83)  use  BFS  result to  filter y4)  stop  if  BFS  result  is  unchanged

BFS  Using SpMV (Iteration  4)

00001

00000

01223

0

1

2

3

4

Page 15: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Design  Considerations2.How  to  solve  multiway-­merge3.Partitioning  scheme

Page 16: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

AT x                y

x                =

• 1)  Elementwise multiply• 2)  Reduce• Work:  O(nnz)• Independent of  vector  sparsity

0 0 1 0 04 0 0 0 00 0 1 2 11 4 3 0 00 5 2 0 0

1.  Row-­based SpMV

00214

20864

Page 17: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Why  not  Row-­based?cuSPARSE,  Modern  GPU,  CUB  all  provide  row-­based  SpMV GPU  implementationsWhy  implement  our  own?

Page 18: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

x              =  A  Motivating  Example

0 0 1 0 0 1 0 0 2 24 0 0 0 0 1 3 0 0 20 0 1 2 1 0 0 0 7 91 4 3 0 0 2 4 3 3 90 5 2 0 0 2 0 0 0 01 2 0 0 0 0 2 2 0 03 3 2 0 0 2 0 0 0 00 0 2 0 0 0 0 2 3 00 0 0 0 0 0 3 0 0 03 0 2 2 9 0 0 0 0 0

0000002000

0608040060

• Take  scalar  from  vector  x,  and  multiply  it  by  the  corresponding  column  from  A

AT x            y

Page 19: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

AT                                                  x      =    y3 +    y4 +    y5 =    y

x                =              +              +              =          

• 1)  Gather  and  elementwise multiply• 2)  Multiway-­merge  (many  elementwise adds)• Work:  𝑂 ∑ 𝑑)*[)]-. + kn• Suitable for sparse vectors,  due to  dependence on vector  

sparsity

0 0 1 0 04 0 0 0 00 0 1 2 11 4 3 0 00 5 2 0 0

Column-­based SpMV

00214

20864

20264

00200

00400

Page 20: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

For  GraphBLAS folksboth  a  row-­based  SpMV as  well  as  a  column-­based  SpMV and  automatically  decide  which  one  to  use?This  is  the  idea  behind  direction-­optimizing  BFS

Page 21: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Design  Considerations1.Column-­based  SpMV2.How  to  solve  multiway-­merge3.Partitioning  scheme

Page 22: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

2.  Intuition  of  multiway-­mergeleading  to  duplicates

Input:  sorted  segments,  with  possible  duplicatesE.g.  [4  5  6  4  5  6  3  4  5]

Want:  uniqueE.g.  [4  5  6  3]

0 1 2

3 4 5 6

Page 23: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Block  Diagram  (Single  GPU)

(+):  Addition  operator  of  semiring(x):  Multiplication  operator  of  semiring

Gather  columns  from  AT

(x)  Multiway-­merge

Input:  Sparse  Vector  xSparse  Matrix  AT

(+)  ewiseMult

Output:  Sparse  Vector  y

Page 24: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

(+):  Addition  operator  of  semiring(x):  Multiplication  operator  of  semiring

(x)  Multiway-­merge

Multiway-­merge

Gather  columns  from  AT

(x)  Multiway-­merge

Input:  Sparse  Vector  xSparse  Matrix  AT

(+)  ewiseMult

Radix  Sort

(x)  Segmented  ReduceOutput:  Sparse  Vector  y

Page 25: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Radix  sort  works  best

Page 26: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Importance  of  multiway-­merge

Multiway-­merge  takes  ~85.6%  of  application  runtimeGather  and  ewiseMult take  11ms

Page 27: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

For  GraphBLAS folksautomatically  decide  which  one  to  use?This  is  the  idea  behind  direction-­optimizing  BFS

Should  sparse  vector  be  allowed  to  contain  duplicate  entries?This  is  the  idea  behind  Gunrock’s idempotent  discovery  that  lets  them  get  away  without  solving  multiway-­merge  completelySome  duplicates  are  allowed

Page 28: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Design  Considerations1.Column-­based  SpMV2.How  to  solve  multiway-­merge3.Partitioning  scheme

Page 29: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

A1 A2 A3 …  Ap y(1)+y(2)+y(3)+…+y(p) y  x1x2

x x3 =        +      +      +...+      =  ...xp

Each iteration:1) GPU  i  computes y(i) =  Aixi2) Computes  histogram  of  send  counts3) Send  to  other  GPUs  using  MPI_Alltoall with  send  counts4) Send  to  other  GPUs  using  MPI_Alltoallv5) Each  GPU  multiway-­merges  piece  they  are  responsible   for

3.  Partitioning scheme

Page 30: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

3.  Partitioning scheme

c

A1 A2 A3 …  Ap y(1)+y(2)+y(3)+…+y(p) y  x1                                                                                                        y1x2

x x3 =        +      +      +...+      =  ...xp

GPU1Each iteration:1) GPU  i  computes y(i) =  Aixi2) Computes  histogram  of  send  counts3) Send  to  other  GPUs  using  MPI_Alltoall with  send  counts4) Send  to  other  GPUs  using  MPI_Alltoallv5) Each  GPU  multiway-­merges  piece  they  are  responsible   for

Page 31: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

A1 A2 A3 …  Ap y(1)+y(2)+y(3)+…+y(p) y  x1                                                                                                        x2                                                                                                          y2

x x3 =        +      +      +...+      =  ...xp

GPU2Each iteration:1) GPU  i  computes y(i) =  Aixi2) Computes  histogram  of  send  counts3) Send  to  other  GPUs  using  MPI_Alltoall with  send  counts4) Send  to  other  GPUs  using  MPI_Alltoallv5) Each  GPU  multiway-­merges  piece  they  are  responsible   for

3.  Partitioning scheme

c

Page 32: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

3.  Partitioning scheme

A1 A2 A3 …  Ap y(1)+y(2)+y(3)+…+y(p) y  x1 y1x2

x x3 =        +      +      +...+      =  ...xp

Each iteration:1) GPU  i  computes y(i) =  Aixi2) Computes  histogram  of  send  counts3) Send  to  other  GPUs  using  MPI_Alltoall with  send  counts4) Send  to  other  GPUs  using  MPI_Alltoallv5) Each  GPU  multiway-­merges  piece  they  are  responsible   for

Page 33: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

3.  Partitioning scheme

c

A1 A2 A3 …  Ap y(1)+y(2)+y(3)+…+y(p) y  x1                                                                                                        x2                                                                                                          y2

x x3 =        +      +      +...+      =  ...xp

GPU2Each iteration:1) GPU  i  computes y(i) =  Aixi2) Computes  histogram  of  send  counts3) Send  to  other  GPUs  using  MPI_Alltoall with  send  counts4) Send  to  other  GPUs  using  MPI_Alltoallv5) Each  GPU  multiway-­merges  piece  they  are  responsible   for

Page 34: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Block  Matrix  (Multi-­GPU)On  GPU  1,  2,  …,  p:

mXv (Single  GPU)

Gather  columns  from  Ai

Multiway-­merge

Input:  Sparse  Vector  xiSparse  Matrix  Ai

ewiseMult

Output:  Sparse  Vector  yi

mXv (Single  GPU)

MPI_Alltoallv

MPI_Alltoall

Multiway-­merge

Histogram

c

c

Why  2  multiway-­merges?

Page 35: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Experimental  SetuplCPUl64x  16-­core  AMD  Opteron 6274  CPUl32  GB  of main memory

lGPUl64x  NVIDIA  K20X  GPUs  l14  SMs,  192  ALUs  per  SM,  732  MHz,  6  GB  on-­board  memory

lInterconnectlGemini  interconnect between nodesl10.4  GB/s  (HyperTransport 3.0)

Page 36: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Strong  Scaling

0.5

1

2

4

8

16

1 2 4 8 16 32 64

Billions  of  Edges  Traversed  (GTEPS)

Number  of  Processors

Ideal  ScalingActual  Scaling

Page 37: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Weak  Scaling  (Vertex)

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

1 2 4 8 16 32

Billions  of  Edges  Traversed  

per  Processor  (GTEPS)

Number  of  Processors

Ideal  ScalingActual  Scaling

Page 38: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

For  4  GPUs,  everything  looks  fine

Majority  of  runtime  is  compute• Two  biggest  iterations  brought  down  from  121ms  to  40ms

Page 39: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Compute  time  >  Communication

MPI  calls  comprise  a  sliver  of  runtime

Page 40: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

At  16  GPUs,  picture  becomes  bleak

MPI  calls  take  up  ~50%  runtime  now

Page 41: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Not  overlapping  compute  with  communication

GPUs  sit  idle  to  wait  for  other  GPUs  to  finish  compute

Page 42: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Communication  vs.  Compute

0

1

2

3

4

5

6

7

1 2 4 8 16

Relative  Cost  per  Processor

Number  of  Processors

CommunicationCompute

How  to  bridge  this  gap?

Page 43: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Work-­in-­progress1.Overlap  compute  with  communication

Use  multiple  cudaStreams to  make  communication  asynchronous

2.Try  other  partitioning  schemesVariants  of  1D  column-­based  partitioning2D  partitioning

3.Partition  graph  betterReverse  Cuthill–McKee

Page 44: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

ConclusionRow-­based  vs.  Column-­based  implementation  for  mXvBoth  have  their  uses

Should  sparse  vector  tolerate  duplicate  entries  if  user  does  not  care  about  intermediate  sparse  vectors?

Page 45: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Acknowledgements

supportThis  work  is  funded  by:DARPA  XDATA  program  US  Army  award  W911QX-­12-­C-­0059

DARPA  STTR  award  D15PC00010Department  of  Energy,  Office  of  Science,  ASCR  Contract  No.  DE-­AC02-­05CH11231

Page 46: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Questions?Aydın  Buluç,  Lawrence  Berkeley  National  [email protected]

John  D.  Owens,  University  of  Davis,  [email protected]

Page 47: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Backup  Slides

Page 48: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Applications of Interest  

• Bioinformatics• Social  network  analysis• Computer  vision• Fraud  detection• Urban  planning

Page 49: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Challenges  with  Graph  AnalysisToo  big  to  visualize

Software  complexityStructural  properties  varySmall-­world:  Difficult  to  partitionImpacts  scalabilityMesh:Limits  scalability  of  synchronous  graph  algorithms

Page 50: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

AT x                y

x                =

• GPU  threads compute  each row simultaneously• Work:  O(nnz)• Suitable for multiplication by dense  vector,  because work is

independent of  vector  sparsity• Direction-­optimizing or bottom-­up  BFS  is a  special case  of  this

0 0 1 0 04 0 0 0 00 0 1 2 11 4 3 0 00 5 2 0 0

1.  Row-­based mXv (SpMV)

00214

20864

Page 51: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs

Dominance  of  Big  Iterations  in  BFS  of  Scale-­free  Graphs

Two  biggest  iterations  take  91%  of  runtime• Other  six  iterations  take  6ms  combined