Post on 13-May-2015
ACCELERATING AND EVALUATING OPENCL GRAPH APPLICATIONS
SHUAI CHE, BRAD BECKMANN, STEVE REINHARDT AND KEVIN SKADRON
| Accelerating and Evaluating OpenCL Graph Applications | November 20, 2013 | CONFIDENTIAL
AGENDA
Background and Graph Applications
Pannotia OpenCL™ Graph Applications
Performance Evaluation and Discussion
GRAPH APPLICATIONS
• Intelligence
  ‒ Business analytics, security and scientific discovery
• Social networks
  ‒ Facebook, Twitter, LinkedIn, Weibo, etc.
• Life science and healthcare
  ‒ Disease and drug research, life system research
• Infrastructure
  ‒ Transportation, power grid, energy and water supply
• Scientific and engineering simulations
GRAPH APPLICATIONS
• Low arithmetic intensity and data reuse
• Not floating-point intensive
• Branch divergence
  ‒ Only some of the threads in a wavefront are active
• Memory divergence
  ‒ Data distributed in different regions of memory
  ‒ A challenge to optimize data layouts and memory accesses
• Load imbalance
  ‒ Uneven work distribution across different threads
  ‒ Short-running threads wait for long-running threads
• Parallelism
  ‒ Changing degree of parallelism across iterations
  ‒ Underutilization of compute units for certain phases
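
The branch-divergence and load-imbalance points can be made concrete with a small model. The sketch below is illustrative only and is not taken from Pannotia: it assumes one vertex per thread, a wavefront of 64 threads, and a per-thread loop over the vertex's neighbors, so each wavefront runs as long as its highest-degree vertex. The function name `simd_utilization` and the degree lists are hypothetical.

```python
def simd_utilization(degrees, wavefront_size=64):
    """Fraction of SIMD lane-iterations that do useful work.

    Assumption: one vertex per thread, one neighbor visited per loop
    iteration, so a wavefront runs for max(degree) iterations.
    """
    useful = 0
    issued = 0
    for start in range(0, len(degrees), wavefront_size):
        group = degrees[start:start + wavefront_size]
        useful += sum(group)
        # All lanes stay occupied until the longest-running thread finishes.
        issued += max(group) * wavefront_size
    return useful / issued if issued else 1.0


uniform = simd_utilization([4] * 64)        # every thread does equal work
skewed = simd_utilization([1] * 63 + [64])  # one high-degree vertex dominates
```

Under these assumptions a uniform degree distribution keeps every lane busy, while a single high-degree vertex leaves most lanes idle for most of the wavefront's lifetime.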
PANNOTIA
• A graph application suite for GPGPU
• Eight diverse graph algorithms, e.g., shortest path, graph partitioning, web analysis and social network
• Implemented in C + OpenCL™
• Some are OpenCL implementations based on algorithms of prior work
• Initial implementation is for a single GPU node
• Further algorithmic and hardware-specific optimizations are in progress
• Details of Pannotia can be found in our paper published in the 2013 IEEE International Symposium on Workload Characterization
PANNOTIA
Application                     Domain
Single-Source Shortest Path     Shortest Path
Connected Component Labeling    Graph Clustering
Graph Coloring                  Graph Partitioning
Floyd-Warshall                  Shortest Path
Maximal Independent Set         Graph Partitioning
Betweenness Centrality          Social Network
Friend Recommendation           Social Network
PageRank                        Web Analysis
GRAPH INPUT AND DATA STRUCTURE
• Real-world graphs
  ‒ The University of Florida Sparse Matrix Collection
  ‒ The 9th DIMACS Implementation Challenge
  ‒ The 10th DIMACS Implementation Challenge
• Synthetic graphs
  ‒ Random-graph generator from Georgia Tech
• Graph input formats
  ‒ Coordinate Format
  ‒ METIS
  ‒ Matrix Market
• Data structure representation
  ‒ CSR, COO, ELL …
  ‒ 2D adjacency matrix
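
To illustrate how two of the listed representations relate, here is a minimal sketch that converts a COO edge list into CSR arrays. It is not Pannotia's loader: the function name `coo_to_csr`, the `(src, dst, weight)` tuple layout and the sample edges are assumptions made for the example.

```python
def coo_to_csr(num_vertices, edges):
    """Convert COO triples (src, dst, weight) into CSR arrays.

    Returns (row_ptr, col_idx, values); the neighbors of vertex v are
    col_idx[row_ptr[v]:row_ptr[v + 1]].
    """
    # Count outgoing edges per vertex, shifted by one for the prefix sum.
    row_ptr = [0] * (num_vertices + 1)
    for src, _, _ in edges:
        row_ptr[src + 1] += 1
    # Prefix sum turns counts into start offsets.
    for v in range(num_vertices):
        row_ptr[v + 1] += row_ptr[v]

    col_idx = [0] * len(edges)
    values = [0] * len(edges)
    next_slot = row_ptr[:-1]  # slicing copies, so row_ptr is not modified
    for src, dst, weight in edges:
        slot = next_slot[src]
        col_idx[slot] = dst
        values[slot] = weight
        next_slot[src] += 1
    return row_ptr, col_idx, values


# Hypothetical 3-vertex graph.
row_ptr, col_idx, values = coo_to_csr(3, [(0, 1, 5), (2, 0, 7), (0, 2, 3), (1, 2, 1)])
```

CSR keeps each vertex's neighbors contiguous, which is one reason it is a common choice when a thread walks the adjacency list of a single vertex.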
SINGLE SOURCE SHORTEST PATH
• Finds the shortest path from the source node to every other node in the graph
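[Figure: example weighted graph with vertices 0 to 6; the edge layout is not recoverable from the extracted text. Resulting distances from source vertex 0:]
Vid  Dist
0    0
1    3
2    1
3    8
4    16
5    19
6    16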
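
As a rough illustration of single-source shortest path, the sketch below relaxes edges from a set of active vertices until no distance changes (a frontier-based Bellman-Ford variant). It is a sequential sketch, not the Pannotia OpenCL kernel; the function name `sssp` and the small graph are hypothetical and do not reproduce the figure above. Non-negative edge weights are assumed.

```python
import math


def sssp(num_vertices, edges, source):
    """Frontier-based Bellman-Ford on directed (src, dst, weight) edges.

    Assumes non-negative weights. Unreachable vertices keep math.inf.
    """
    adj = [[] for _ in range(num_vertices)]
    for u, v, w in edges:
        adj[u].append((v, w))

    dist = [math.inf] * num_vertices
    dist[source] = 0
    active = {source}
    while active:
        next_active = set()
        for u in active:
            for v, w in adj[u]:
                # Relax edge (u, v); a vertex becomes active when it improves.
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    next_active.add(v)
        active = next_active
    return dist


# Hypothetical graph, not the one in the figure.
distances = sssp(4, [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5), (2, 3, 8)], 0)
```

The size of `active` changes from one iteration to the next, which is the kind of varying parallelism the later slides discuss.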
CONNECTED COMPONENT LABELING
• Detect connected regions in graphs and images
• Connected components are the nodes in a graph that point to the same root
[Figure: example with nodes p, q, r and s]
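
One simple way to realize the "same root" idea is minimum-label propagation: every vertex starts as its own root, and labels are lowered along edges until nothing changes. The sketch below is a sequential illustration under that assumption, not necessarily the algorithm used in Pannotia; `connected_components` and the sample edges are hypothetical.

```python
def connected_components(num_vertices, edges):
    """Label each vertex with the smallest vertex id in its component.

    edges are undirected (u, v) pairs.
    """
    label = list(range(num_vertices))  # every vertex starts as its own root
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            if label[u] != low:
                label[u] = low
                changed = True
            if label[v] != low:
                label[v] = low
                changed = True
    return label


# Hypothetical graph: {0, 1, 2} and {3, 4} are connected, 5 is isolated.
labels = connected_components(6, [(4, 3), (2, 1), (1, 0)])
```

Two vertices are in the same component exactly when they end up with the same label.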
GRAPH COLORING
• Assign colors (integers) to vertices so that no two adjacent vertices share the same color
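
The constraint can be sketched with a sequential greedy coloring, which gives each vertex the smallest integer not used by an already-colored neighbor. This is a stand-in chosen for brevity, not the parallel algorithm evaluated in the talk; `greedy_coloring`, `is_valid_coloring` and the sample graph are hypothetical, and the graph is assumed to have no self-loops.

```python
def greedy_coloring(num_vertices, edges):
    """Color vertices 0..n-1 in order with the smallest free integer."""
    adj = [set() for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    color = [-1] * num_vertices  # -1 means "not colored yet"
    for v in range(num_vertices):
        used = {color[n] for n in adj[v] if color[n] != -1}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def is_valid_coloring(color, edges):
    """True when no edge joins two vertices of the same color."""
    return all(color[u] != color[v] for u, v in edges)


# Hypothetical graph: a triangle 0-1-2 with a pendant vertex 3.
sample_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
colors = greedy_coloring(4, sample_edges)
```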
FLOYD-WARSHALL
• Solves the all-pairs shortest path (APSP) problem: finding the shortest path from every possible source to every possible destination
• A dynamic programming approach: shortestPath(i, j, k) returns the shortest path from i to j using only intermediate vertices from {1, 2, ..., k}
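
The dynamic program follows the standard recurrence shortestPath(i, j, k) = min(shortestPath(i, j, k-1), shortestPath(i, k, k-1) + shortestPath(k, j, k-1)). A minimal sequential sketch is shown below; vertices are numbered from 0 rather than 1, and the function name `floyd_warshall` and the sample matrix are assumptions for illustration.

```python
import math


def floyd_warshall(weights):
    """All-pairs shortest path lengths from an adjacency matrix.

    weights[i][j] is the edge weight from i to j, math.inf when there is
    no edge, and 0 on the diagonal. The input matrix is not modified.
    """
    n = len(weights)
    d = [row[:] for row in weights]
    for k in range(n):          # allow vertex k as an intermediate stop
        for i in range(n):
            for j in range(n):
                through_k = d[i][k] + d[k][j]
                if through_k < d[i][j]:
                    d[i][j] = through_k
    return d


INF = math.inf
# Hypothetical directed 3-vertex graph.
apsp = floyd_warshall([[0, 3, INF], [INF, 0, 1], [2, INF, 0]])
```

For a fixed k, the updates over i and j are independent of each other, which is what makes the two inner loops a natural fit for a data-parallel kernel.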
MAXIMAL INDEPENDENT SET
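• Independent set: no two vertices in the set are neighbors
• Maximal independent set: an independent set to which no further vertex can be added while keeping it independent
[Figure: example graph with vertices 0 to 7; the edge layout is not recoverable from the extracted text]
S = {0, 4, 6} is a maximal independent set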
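
Both definitions can be checked mechanically. The sketch below builds a maximal independent set with a simple sequential greedy pass and verifies the two properties; it is a stand-in for illustration rather than the parallel algorithm used on the GPU. The names `greedy_mis` and `is_maximal_independent` and the path graph are hypothetical, since the edges of the figure's graph are not available.

```python
def build_adjacency(num_vertices, edges):
    """Undirected adjacency sets from (u, v) pairs."""
    adj = [set() for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def greedy_mis(num_vertices, edges):
    """Pick vertices in index order, skipping neighbors of picked ones."""
    adj = build_adjacency(num_vertices, edges)
    blocked = [False] * num_vertices
    chosen = []
    for v in range(num_vertices):
        if not blocked[v]:
            chosen.append(v)
            blocked[v] = True
            for n in adj[v]:
                blocked[n] = True
    return chosen


def is_maximal_independent(num_vertices, edges, candidate):
    """Check independence and maximality of a candidate vertex set."""
    adj = build_adjacency(num_vertices, edges)
    chosen = set(candidate)
    independent = all(not (u in chosen and v in chosen) for u, v in edges)
    # Maximal: every vertex outside the set has a neighbor inside it.
    maximal = all(v in chosen or bool(adj[v] & chosen) for v in range(num_vertices))
    return independent and maximal


# Hypothetical path graph 0-1-2-3-4.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
mis = greedy_mis(5, path_edges)
```

Note that maximal does not mean maximum: on this path both {0, 2, 4} and {0, 3} are maximal, although they differ in size.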
BETWEENNESS CENTRALITY
• Centrality determines the relative importance of a vertex within the graph (e.g. degree, betweenness, closeness …)
• Betweenness Centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes.
$$BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the number of shortest paths between nodes s and t, and $\sigma_{st}(v)$ is the number of shortest paths between nodes s and t passing through v.
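
The definition can be evaluated directly on a small unweighted graph. The sketch below is a brute-force illustration, not an efficient or GPU-oriented algorithm: it counts shortest paths with one BFS per vertex and uses the fact that v lies on a shortest s-t path exactly when d(s, v) + d(v, t) = d(s, t), in which case the number of such paths is the product of the path counts s to v and v to t. The names `bfs_counts` and `betweenness` and the sample graph are hypothetical.

```python
from collections import deque


def bfs_counts(adj, source):
    """Return (dist, sigma): hop distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """Sum sigma_st(v) / sigma_st over ordered pairs (s, t), s != v != t.

    adj maps each vertex to a list of neighbors. On an undirected graph
    every unordered pair is counted twice (once per direction).
    """
    info = {s: bfs_counts(adj, s) for s in adj}
    bc = {v: 0.0 for v in adj}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in adj:
            if t == s or t not in dist_s:
                continue
            for v in adj:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    bc[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return bc


# Hypothetical 4-cycle 0-1-2-3-0.
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
scores = betweenness(square)
```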
FRIEND RECOMMENDATION
• Recommend friend connections, a common feature in social websites
• A simple Map-Reduce like algorithm
“Andy” = [ “Brad”, “Derek”, “Shuai”, …]
Andy → <“Brad”, “Derek”, “Andy”>
       <“Brad”, “Shuai”, “Andy”>
       <“Derek”, “Brad”, “Andy”>
       <“Derek”, “Shuai”, “Andy”>
       <“Shuai”, “Derek”, “Andy”>
       <“Shuai”, “Brad”, “Andy”>
Andy recommends Brad to Shuai
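
The map step in this example can be sketched as emitting one triple for every ordered pair of a person's friends, read as <target, candidate, recommender>. The reduce step is not spelled out on the slide, so the version below (count recommenders per candidate and skip people who are already friends) is an assumption, as are the function names and the extra person "Carl".

```python
from collections import Counter
from itertools import permutations


def map_friends(person, friends):
    """Map step: emit (target, candidate, recommender) triples."""
    return [(a, b, person) for a, b in permutations(friends, 2)]


def recommend(friend_lists):
    """Assumed reduce step: per target, count recommenders of each candidate."""
    counts = {}
    for person, friends in friend_lists.items():
        for target, candidate, _ in map_friends(person, friends):
            if candidate in friend_lists.get(target, []):
                continue  # already friends, nothing to recommend
            counts.setdefault(target, Counter())[candidate] += 1
    return counts


triples = map_friends("Andy", ["Brad", "Derek", "Shuai"])

# Hypothetical friend lists; "Carl" is invented for the example.
suggestions = recommend({
    "Andy": ["Brad", "Derek", "Shuai"],
    "Carl": ["Brad", "Shuai"],
})
```

In this hypothetical input both Andy and Carl know Brad and Shuai, so Brad is suggested to Shuai with a count of two.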
PAGERANK
PERFORMANCE BENEFITS
• Speedups are up to 11x (an AMD “Tahiti” discrete GPU vs. 4 CPU cores on A8)
• PCI-E overhead is included
• Performance benefits depend on graph input datasets
[Figure: parallel speedup chart; y-axis: Parallel Speedup, 0 to 15]
EXECUTION TIME BREAKDOWN (D-GPU)
• The portion of GPU execution: 8% to 99%
• Some further GPU offload can be done (e.g. FRD and MIS)
PARALLELISM (ACTIVE VERTICES OVER TIME)
[Figure: active vertices over time, two panels. Single-Source Shortest Path (Road Network - NY), y-axis 0 to 120000; Graph Coloring (G3 Circuit), y-axis 0 to 400000; x-axis: Time]
LOAD IMBALANCE (DEGREE DISTRIBUTION)
[Figure: degree distribution over time, two panels, y-axis 0% to 100%, x-axis: Time. Single-Source Shortest Path (Road Network), legend 1 to 8; Graph Coloring (G3 Circuit), legend 1 to 5]
HIERARCHICAL CLUSTERING
• Different program-input pairs may have vastly different characteristics!
[Figure: dendrogram of program-input pairs, scale 0.0 to 4.6. Leaves: CLR-G3-circuit, CLR-ecology, DJK-US-NW, DJK-US-CA, BC-2k, BC-1k, CCL-lena, CCL-deposit, FW-512-64k, FW-256-16k, MIS-US-NW, MIS-shell, CLR-shell, MIS-ecology, PRK-flicker, FRD-coAuthor, PRK-2k]
L2 HIT RATE OVER TIME (SSSP)
• The cache hit rate first improves, then degrades, improves again and finally degrades, with some fluctuations in the middle
[Figure: L2 hit rate over time; y-axis: Hit Rate, 0 to 60; x-axis: Time]
ARCHITECTURAL IMPLICATIONS (SCALAR UNIT)
[Figure: timeline diagram of graph traversal on scalar and SIMD units; labels: Scalar, SIMD, Graph Traversal, A, B, Time]
ARCHITECTURAL IMPLICATIONS
• Possibly include narrower SIMD units or heterogeneous SIMD units
• Resource management and scheduling
  ‒ Switch the task between the CPU and the GPU based on parallelism
  ‒ Use only “enough” SIMD engines and save power
[Figure: diagram labeled Scalar, Narrow SIMD, Wide SIMD; plot over time (y-axis 0 to 120000) with regions labeled CPU and GPU and points A and B]
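
The idea of switching between the CPU and the GPU based on parallelism could look, in its simplest form, like a per-iteration threshold on the number of active vertices. The sketch below is purely hypothetical: the slides do not define a policy, and the function names, the threshold value and the sample counts are all assumptions. A real policy would also have to account for data-transfer cost when switching devices.

```python
def choose_device(active_vertices, gpu_threshold=4096):
    """Pick a device for one iteration from its available parallelism.

    gpu_threshold is an assumed tuning knob, not a measured value.
    """
    return "GPU" if active_vertices >= gpu_threshold else "CPU"


def schedule(active_per_iteration, gpu_threshold=4096):
    """Device choice for each iteration of a traversal."""
    return [choose_device(n, gpu_threshold) for n in active_per_iteration]


# Hypothetical active-vertex counts: ramp up, peak, then tail off.
plan = schedule([10, 5000, 120000, 300])
```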
CONCLUSION AND FUTURE WORK
• Graph applications are an emerging workload domain
• Pannotia is a first-step attempt to evaluate diverse graph building blocks on GPUs
Next-Step Goals:
• Add more applications (e.g. matching, spanning tree, flow)
• Optimize Pannotia applications
• Extend to multiple GPU nodes and across CPU and GPU
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a registered trademark of Apple Inc. Other names are for informational purposes only and may be trademarks of their respective owners.