
Multi-GPU Graph Analytics
Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D. Owens, University of California, Davis
{ychpan, yzhwang, yudwu, ctcyang, jowens}@ucdavis.edu

Introduction - about Gunrock

Programming Model

Multi-GPU Framework
Results

Future Work

Samples

Acknowledgement
The GPU hardware and cluster access was provided by NVIDIA. This work was funded by the DARPA XDATA program under AFRL Contract FA8750-13-C-0002 and by NSF awards CCF-1017399 and OCI-1032859.

References
[1] Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D. Owens. Multi-GPU Graph Analytics. CoRR, abs/1504.04804, Apr. 2015.
[2] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. Gunrock: A High-Performance Graph Processing Library on the GPU. CoRR, abs/1501.05387v2, Mar. 2015.
[3] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 117-128, Feb. 2012.
[4] J. Zhong and B. He. Medusa: Simplified graph processing on GPUs. IEEE Transactions on Parallel and Distributed Systems, 25(6):1543-1552, June 2014.
[5] Z. Fu, H. K. Dasari, B. Bebee, M. Berzins, and B. Thompson. Parallel breadth first search on GPU clusters. In IEEE International Conference on Big Data, pages 110-118, Oct. 2014.

Gunrock is a multi-GPU graph processing library that targets:
• High-performance analytics of large graphs
• Low programming complexity in implementing parallel graph algorithms on GPUs

Homepage: http://gunrock.github.io
The copyright of Gunrock is owned by The Regents of the University of California, 2015. All source code is released under Apache 2.0.

dataset / metric   reference            ref. hardware     ref. performance   our hardware    our performance
rmat_n20_128       Merrill et al. [3]   4x Tesla C2050    8.3 GTEPS          4x Tesla K40    11.2 GTEPS
rmat_n20_16        Zhong et al. [4]     4x Tesla C2050    15.4 ms            4x Tesla K40    9.29 ms
peak GTEPS         Fu et al. [5]        16x Tesla K20     15 GTEPS           6x Tesla K40    22.3 GTEPS
peak GTEPS         Fu et al. [5]        64x Tesla K20     29.1 GTEPS         6x Tesla K40    22.3 GTEPS
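A note on units (standard definitions, not spelled out on the poster): TEPS stands for traversed edges per second, TEPS = (edges traversed) / (traversal time in seconds), and 1 GTEPS = 10^9 TEPS; the rmat_n20_16 row compares raw BFS runtime in milliseconds instead.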

• performance analysis and optimization
• extending Gunrock onto multiple nodes
• asynchronous graph algorithms

[Figure: multi-GPU data flow on two GPUs (GPU0 and GPU1). A Partitioner turns the input graph into a partition table, and a Sub-graph builder produces per-GPU sub-graphs. In each iteration on each GPU: sub-queue kernels process the local input frontier and the remote input frontier (unpackaged from the data package received from the peer); the output sub-frontiers are merged and passed to full-queue kernels; the resulting output frontier is separated into the next local input frontier and a remote output frontier, which is packaged and pushed to the peer GPU. Iterations repeat until both GPUs have converged. Legend marks the parameters required from the user and the user-provided operations.]
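To make the flow above easier to follow, here is a minimal CPU-side sketch of one iteration of this data flow for one GPU, in plain C++ with invented names and placeholder kernels; it illustrates only the control flow, not Gunrock's actual multi-GPU implementation.

```cpp
// CPU-only sketch of one iteration of the multi-GPU data flow described above,
// for one GPU. All names, types, and helper kernels are invented for this
// illustration; this is not Gunrock's actual multi-GPU implementation.
#include <cstdint>
#include <vector>

using Frontier = std::vector<uint32_t>;  // compact queue of vertex ids

struct GpuState {
    Frontier local_input;    // frontier of vertices this GPU owns
    Frontier remote_input;   // unpackaged data received from the peer GPU
    Frontier remote_output;  // vertices owned by the peer, produced locally
};

// Placeholders for the algorithm's advance/filter/compute work: sub-queue
// kernels run on one sub-frontier, full-queue kernels need the merged frontier.
Frontier sub_queue_kernels(const Frontier& input) { return input; }
Frontier full_queue_kernels(const Frontier& merged) { return merged; }

// One iteration on one GPU. `partition_is_local[v]` plays the role of the
// partition table. Returns true while this GPU still has work to do.
bool iterate(GpuState& gpu, Frontier& package_to_peer,
             const std::vector<bool>& partition_is_local) {
    // 1. Sub-queue kernels on the local and remote input sub-frontiers.
    Frontier out_local = sub_queue_kernels(gpu.local_input);
    Frontier out_remote = sub_queue_kernels(gpu.remote_input);

    // 2. Merge the output sub-frontiers, then run full-queue kernels.
    Frontier merged = out_local;
    merged.insert(merged.end(), out_remote.begin(), out_remote.end());
    Frontier output = full_queue_kernels(merged);

    // 3. Separate the output frontier into the next local input frontier and
    //    a remote output frontier.
    gpu.local_input.clear();
    gpu.remote_output.clear();
    for (uint32_t v : output)
        (partition_is_local[v] ? gpu.local_input : gpu.remote_output).push_back(v);

    // 4. Package the remote output frontier and push it to the peer GPU; the
    //    peer unpackages it into its remote input frontier next iteration.
    package_to_peer = gpu.remote_output;
    gpu.remote_input.clear();

    return !gpu.local_input.empty() || !gpu.remote_output.empty();
}
```

A driver would call iterate() once per GPU each round, deliver each package to the peer's remote input frontier, and finish once every GPU reports convergence.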

[Figure: graph traversal sample on a small example graph. Starting from source vertex 0 (frontier 0), the traversal expands level by level through frontiers 1 and 2; each vertex's label (V-Id / Label) records the frontier number at which it is first reached.]

[Figure: graph partition sample across GPU0 and GPU1. The original vertices are split between the two GPUs; each GPU stores its local vertices (local V-ids) plus remote vertices kept as local replicas (remote V-ids).]
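As a deliberately simplified illustration of the partitioning sample above, the sketch below builds per-GPU vertex numberings from a 1D partition table; the data layout and names are invented for this sketch and are not Gunrock's actual partitioner.

```cpp
// Simplified sketch of building per-GPU vertex numberings from a 1D partition
// table, matching the idea in the figure: each GPU stores its local vertices
// first, followed by local replicas of remote vertices its edges touch.
// Data layout and names are invented here; this is not Gunrock's partitioner.
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct SubGraph {
    std::vector<uint32_t> original_id;                // slot -> original vertex id
    std::unordered_map<uint32_t, uint32_t> local_id;  // original -> local V-id
    uint32_t num_local = 0;                           // replica ids start here
};

// partition[v] = GPU owning original vertex v (the partition table).
std::vector<SubGraph> build_subgraphs(
        const std::vector<std::pair<uint32_t, uint32_t>>& edges,
        const std::vector<uint32_t>& partition, uint32_t num_gpus) {
    std::vector<SubGraph> sub(num_gpus);
    // Give every owned vertex a local V-id on its home GPU.
    for (uint32_t v = 0; v < partition.size(); ++v) {
        SubGraph& g = sub[partition[v]];
        g.local_id[v] = g.num_local++;
        g.original_id.push_back(v);
    }
    // Edges that cross GPUs add a replica slot (a remote V-id) on the source
    // vertex's GPU, so the edge can be stored entirely locally.
    for (const auto& e : edges) {
        uint32_t u = e.first, v = e.second;
        SubGraph& g = sub[partition[u]];
        if (partition[v] != partition[u] && !g.local_id.count(v)) {
            g.local_id[v] = static_cast<uint32_t>(g.original_id.size());
            g.original_id.push_back(v);
        }
    }
    return sub;
}
```

Edges whose endpoints live on different GPUs are the ones that later require the package/push/unpackage exchange shown in the multi-GPU data flow figure.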

• 2D partitioning
• Fixed partitioning
• more algorithms

Comparison with previous work on GPU BFS (see the table above)

[Chart: Traversed edges per second (TEPS) for BFS, in GTEPS, vs. number of GPUs, on rmat_n20_1023 and rmat_n23_48.]
[Chart: Strong scaling on rmat_n22_48: speedup vs. 1 GPU for BFS, SSSP, BC, CC, and PR as the number of GPUs increases.]
[Chart: Weak scaling on R-MAT graphs (scale 48, each GPU hosting ~180M edges): T1*n/Tn for BFS, SSSP, BC, CC, and PR vs. number of GPUs.]
[Chart: Speedup of BFS for different graph types (error bars showing minima and maxima from graphs of the same types).]
[Chart: Speedup of PR for different graph types (error bars showing minima and maxima from graphs of the same types).]
[Chart: Speedup of different graph algorithms (error bars showing minima and maxima from different graphs).]
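For reference, using T_n for the runtime on n GPUs: the strong-scaling plot shows speedup over a single GPU, T_1 / T_n, while the weak-scaling plot shows T_1 * n / T_n, which would equal n under ideal weak scaling (runtime unchanged as the per-GPU workload stays fixed).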

Gunrock's multi-GPU framework aims at:

• Programmability: easy to develop graph primitives that run on multiple GPUs -> the framework hides most implementation details and requires only a few inputs from the primitive (what data to exchange, how to combine data, when to stop); see the sketch after this list

• Algorithm generality: supports a wide range of graph algorithms -> the framework is isolated from the actual algorithm implementations

• Hardware compatibility: usable on most single-node GPU systems -> works on any number of GPUs, with or without peer GPU memory access

• Performance: low runtime that leverages the underlying hardware well -> uses multiple CPU control threads and GPU streams to overlap computation on different portions of the frontier with communication

• Scalability: scalable in terms of both performance and memory usage -> performs just enough GPU memory (re)allocation to keep usage small
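As a rough sketch of how little a primitive needs to supply, the snippet below shows what the three inputs named above (what data to exchange, how to combine data, when to stop) might look like for BFS; the struct and member names are hypothetical, not Gunrock's actual interface.

```cpp
// Hypothetical illustration of the three per-primitive inputs named above for
// BFS: what data to exchange, how to combine it, and when to stop. The struct
// and member names are invented for this sketch, not Gunrock's interface.
#include <cstdint>
#include <vector>

struct BfsMultiGpuHooks {
    std::vector<uint32_t> labels;  // per-vertex BFS depth on this GPU

    // What data to exchange: the value a peer needs for a remote vertex.
    uint32_t pack(uint32_t local_vertex) const {
        return labels[local_vertex];
    }

    // How to combine received data with the local copy: keep the smaller
    // depth; return true if the vertex should re-enter the local frontier.
    bool combine(uint32_t local_vertex, uint32_t received_label) {
        if (received_label < labels[local_vertex]) {
            labels[local_vertex] = received_label;
            return true;
        }
        return false;
    }

    // When to stop: this GPU has converged once its frontier is empty.
    bool converged(const std::vector<uint32_t>& frontier) const {
        return frontier.empty();
    }
};
```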

Graph algorithm as a data-centric process
Frontier: compact queue of nodes or edges

Frontier generation and operation:
• Advance: visit neighbor lists
• Filter: select and reorganize
• Compute: per-element computation in parallel; compute kernels can be combined with advance or filter (a BFS sketch using these operators follows below)
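For concreteness, here is a minimal single-threaded C++ sketch of BFS expressed in this data-centric style, with an explicit advance step over neighbor lists followed by a filter step; the CSR arrays and loop structure are illustrative only and do not show Gunrock's GPU kernels.

```cpp
// Minimal CPU sketch of BFS in the data-centric model: each iteration advances
// the frontier over neighbor lists, then filters out already-labeled vertices.
// Illustrative only; not Gunrock's actual API or kernels.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Small CSR graph (row_offsets / column indices); source vertex 0.
    std::vector<uint32_t> row_offsets = {0, 2, 4, 5, 6, 6, 6, 6};
    std::vector<uint32_t> col_indices = {1, 2, 3, 4, 5, 6};
    const uint32_t n = 7, src = 0, UNVISITED = UINT32_MAX;

    std::vector<uint32_t> label(n, UNVISITED);
    std::vector<uint32_t> frontier = {src};  // compact queue of vertices
    label[src] = 0;

    for (uint32_t depth = 1; !frontier.empty(); ++depth) {
        // Advance: visit the neighbor list of every frontier vertex.
        std::vector<uint32_t> expanded;
        for (uint32_t v : frontier)
            for (uint32_t e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
                expanded.push_back(col_indices[e]);

        // Filter + compute: keep unvisited vertices and assign their depth.
        frontier.clear();
        for (uint32_t u : expanded)
            if (label[u] == UNVISITED) {
                label[u] = depth;
                frontier.push_back(u);
            }
    }
    for (uint32_t v = 0; v < n; ++v)
        std::printf("vertex %u: depth %u\n", v, label[v]);
    return 0;
}
```

On the GPU, the advance and filter loops become data-parallel kernels over the frontier, and the per-element label assignment is the compute step fused into them.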

[Figures: single-GPU data flow and multi-GPU data flow, showing how sub-queue and full-queue kernels are organized (see the multi-GPU data flow description above).]