LLVM-based Communication Optimizations for PGAS Programs
Akihiro Hayashi (Rice University), Jisheng Zhao (Rice University), Michael Ferguson (Cray Inc.), Vivek Sarkar (Rice University)
2nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15
A Big Picture
[Diagram: PGAS languages (Chapel, UPC, X10, Habanero-UPC++, ...) funneled through a shared LLVM-based communication optimization layer]
Photo credits: http://chapel.cray.com/logo.html, http://llvm.org/Logo.html, http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/ (© Argonne National Lab., © RIKEN AICS, © Berkeley Lab.)
PGAS Languages
High-productivity features: global view, task parallelism, data distribution, synchronization
Examples: Chapel, UPC, X10, CAF, Habanero-UPC++
Communication Is Implicit in Some PGAS Programming Models
In a global address space, the compiler and runtime are responsible for performing communication across nodes.
Remote data access in Chapel:
  var x = 1;          // on Node 0
  on Locales[1] {     // on Node 1
    ... = x;          // DATA ACCESS (possibly remote)
  }
Communication Is Implicit in Some PGAS Programming Models (Cont'd)
  var x = 1;          // on Node 0
  on Locales[1] {     // on Node 1
    ... = x;          // DATA ACCESS (possibly remote)
  }
The possibly-remote access is handled either by a runtime affinity check:
  if (x.locale == MYLOCALE) { *(x.addr) = 1; } else { gasnet_get(...); }
OR by compiler optimization, which removes the remote access entirely:
  var x = 1;
  on Locales[1] {
    ... = 1;
  }
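The runtime affinity check above can be sketched in plain C; `wide_ptr_t`, `my_locale`, and `mock_gasnet_get` are illustrative stand-ins for the Chapel runtime's wide pointers and GASNet calls, not the real API:

```c
#include <assert.h>

/* Hypothetical wide pointer: a locale ID plus a local address. */
typedef struct { int locale; long *addr; } wide_ptr_t;

int my_locale = 0;      /* this node's locale ID (assumption) */
int gets_issued = 0;    /* counts simulated network GETs */

/* Stand-in for gasnet_get(): here it just copies and counts. */
void mock_gasnet_get(long *dst, const wide_ptr_t *src) {
    gets_issued++;
    *dst = *src->addr;  /* real code would go over the network */
}

/* The runtime affinity check from the slide: a fast path for
 * local data, a communication call otherwise. */
long pgas_read(const wide_ptr_t *x) {
    long v;
    if (x->locale == my_locale)
        v = *x->addr;            /* definitely local: plain load */
    else
        mock_gasnet_get(&v, x);  /* possibly remote: GET */
    return v;
}
```

If the compiler can prove at compile time which branch is taken, the check disappears entirely; that is the point of the locality optimization described later in the deck.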
Communication Optimization Is Important
A synthetic Chapel program on an Intel Xeon CPU X5660 cluster with QDR InfiniBand.
[Chart: latency in ms (log scale, lower is better) vs. transferred bytes, for Optimized (bulk transfer) vs. Unoptimized; the optimized version is between 59x and 1,500x faster depending on transfer size]
PGAS Optimizations Are Language-Specific
[Diagram: each language has its own compiler (Chapel compiler, UPC compiler, X10 compiler, Habanero-C compiler, ...), so communication optimizations are re-implemented per language]
Our Goal
[Diagram: the same PGAS languages (Chapel, UPC, X10, Habanero-UPC++, ...) sharing one LLVM-based communication optimization layer instead of per-language optimizers]
Why LLVM?
LLVM is a widely used, language-agnostic compiler infrastructure built around a common intermediate representation (LLVM IR).
[Diagram: frontends (Clang for C/C++; dragonegg for C/C++, Fortran, Ada, Objective-C; a Chapel frontend; a UPC++ frontend) emit LLVM IR, which passes through shared analyses and optimizations into backends (x86, PowerPC, ARM, PTX) that produce x86, PPC, ARM, and GPU binaries]
Summary & Contributions
Our observations:
- Many PGAS languages share semantically similar constructs
- PGAS optimizations are language-specific
Contributions: we built a compilation framework that can uniformly optimize PGAS programs (initial focus: communication) by
- enabling existing LLVM passes for communication optimizations
- adding PGAS-aware communication optimizations
Overview of Our Framework
Chapel, UPC++, X10, and CAF programs enter through language-specific frontends (Chapel-LLVM, UPC++-LLVM, X10-LLVM, CAF-LLVM), which need to be implemented when supporting a new language/runtime. Each frontend emits (1) vanilla LLVM IR that (2) uses LLVM's address space feature to express communication. The LLVM-based communication optimization passes and the lowering pass that follow are generally language-agnostic.
How Optimizations Work
Chapel:
  // x is possibly remote
  x = 1;
UPC++:
  shared_var<int> x;
  x = 1;
Both frontends emit the same IR, with a distinguished address space marking possibly-remote data:
  store i64 1, i64 addrspace(100)* %x, ...
Two kinds of optimization run on this IR:
1. Existing LLVM optimizations, which treat a remote access as if it were a local access
2. PGAS-aware (address-space-aware) optimizations
Runtime-specific lowering then turns the remaining possibly-remote accesses into communication API calls.
LLVM-based Communication Optimizations for Chapel
1. Enabling existing LLVM passes: loop-invariant code motion (LICM), scalar replacement, ...
2. Aggregation: combine sequences of loads/stores on adjacent memory locations into a single memcpy
These are already implemented in the standard Chapel compiler.
An Optimization Example: LICM for Communication Optimizations
LICM = loop-invariant code motion. Before:
  for i in 1..100 {
    %x = load i64 addrspace(100)* %xptr   // loop-invariant, possibly remote
    A(i) = %x;
  }
LLVM's LICM hoists the invariant load, so the possibly-remote access executes once instead of once per iteration:
  %x = load i64 addrspace(100)* %xptr
  for i in 1..100 {
    A(i) = %x;
  }
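The effect of hoisting can be sketched in C, with a counter standing in for the number of GETs the runtime would issue (`remote_get` and `gets_issued` are illustrative stubs, not runtime API):

```c
#include <assert.h>

int gets_issued = 0;

/* Stand-in for a remote GET of a single value. */
long remote_get(const long *addr) { gets_issued++; return *addr; }

/* Before LICM: the invariant remote load sits inside the loop,
 * so n iterations issue n GETs. */
void fill_unopt(long *A, int n, const long *xptr) {
    for (int i = 0; i < n; i++) {
        long x = remote_get(xptr);  /* loop-invariant load */
        A[i] = x;
    }
}

/* After LICM: the load is hoisted, so one GET serves all iterations. */
void fill_licm(long *A, int n, const long *xptr) {
    long x = remote_get(xptr);      /* hoisted out of the loop */
    for (int i = 0; i < n; i++)
        A[i] = x;
}
```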
An Optimization Example: Aggregation
  // p is possibly remote
  sum = p.x + p.y;
Without aggregation, the two field reads become two separate GETs:
  load i64 addrspace(100)* %pptr+0   // GET x
  load i64 addrspace(100)* %pptr+4   // GET y
Aggregation combines the adjacent loads into a single transfer:
  llvm.memcpy(...);                  // one GET covering both x and y
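A C sketch of the same idea, with a counting stub in place of the real GET (the names `remote_get` and `point_t` are assumptions for illustration):

```c
#include <assert.h>
#include <string.h>

int gets_issued = 0;

/* Stand-in for a remote GET of len bytes. */
void remote_get(void *dst, const void *src, size_t len) {
    gets_issued++;
    memcpy(dst, src, len);
}

typedef struct { long x, y; } point_t;

/* Unoptimized: one GET per field access. */
long sum_unopt(const point_t *p) {
    long x, y;
    remote_get(&x, &p->x, sizeof x);
    remote_get(&y, &p->y, sizeof y);
    return x + y;
}

/* Aggregated: the two adjacent loads become a single bulk transfer,
 * mirroring the llvm.memcpy the pass emits. */
long sum_aggregated(const point_t *p) {
    point_t local;
    remote_get(&local, p, sizeof local);  /* one GET for both fields */
    return local.x + local.y;
}
```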
LLVM-based Communication Optimizations for Chapel (cont'd)
3. Locality optimization: infer the locality of data and convert possibly-remote accesses into definitely-local accesses at compile time, where possible
4. Coalescing: remote array access vectorization
These are implemented, but not in the standard Chapel compiler.
An Optimization Example: Locality Optimization
  proc habanero(ref x, ref y, ref z) {
    var p: int = 0;
    var A: [1..N] int;   // 1. A is definitely local
    local { p = z; }     // 2. p and z are definitely local
    z = A(0) + z;        // 3. definitely-local access (avoids runtime affinity checking)
  }
An Optimization Example: Coalescing
Before:
  for i in 1..N {
    ... = A(i);          // possibly-remote access each iteration
  }
After:
  localA = A;            // bulk transfer
  for i in 1..N {
    ... = localA(i);     // converted to definitely-local access
  }
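Coalescing can be sketched the same way in C: the per-element remote reads collapse into one bulk transfer followed by purely local reads (`remote_get` is an illustrative stub, not runtime API):

```c
#include <assert.h>
#include <string.h>

int gets_issued = 0;

/* Stand-in for a remote GET of len bytes. */
void remote_get(void *dst, const void *src, size_t len) {
    gets_issued++;
    memcpy(dst, src, len);
}

/* Before coalescing: one remote GET per array element. */
long sum_before(const long *A, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
        long v;
        remote_get(&v, &A[i], sizeof v);  /* possibly-remote access */
        s += v;
    }
    return s;
}

/* After coalescing: one bulk transfer into localA, then
 * definitely-local accesses inside the loop. */
long sum_after(const long *A, long *localA, int n) {
    remote_get(localA, A, n * sizeof *localA);  /* bulk transfer */
    long s = 0;
    for (int i = 0; i < n; i++)
        s += localA[i];                         /* local access */
    return s;
}
```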
Performance Evaluations: Benchmarks
  Application        Size
  Smith-Waterman     185,600 x 192,000
  Cholesky Decomp    10,000 x 10,000
  NPB EP             CLASS = D
  Sobel              48,000 x 48,000
  SSCA2 Kernel 4     SCALE = 16
  Stream EP          2^30
Performance Evaluations: Platforms
Cray XC30 supercomputer @ NERSC
- Node: Intel Xeon E5-2695 @ 2.40 GHz x 24 cores, 64 GB of RAM
- Interconnect: Cray Aries interconnect with Dragonfly topology
Westmere cluster @ Rice
- Node: Intel Xeon CPU X5660 @ 2.80 GHz x 12 cores, 48 GB of RAM
- Interconnect: quad data rate (QDR) InfiniBand
Performance Evaluations: Details of Compiler & Runtime
Compiler:
- Chapel compiler version 1.9.0
- LLVM 3.3
Runtime:
- GASNet 1.22.0 (Cray XC: aries conduit; Westmere cluster: ibv conduit)
- Qthreads 1.10 (Cray XC: 2 shepherds, 24 workers/shepherd; Westmere cluster: 2 shepherds, 6 workers/shepherd)
Performance Evaluation: Brief Summary
Results on the Cray XC (LLVM-unopt vs. LLVM-allopt)
[Chart: performance improvement over LLVM-unopt with Aggregation, Locality Opt, and Coalescing applied; higher is better. Per-benchmark speedups: SW 2.1x, Cholesky 19.5x, Sobel 1.1x, Stream EP 2.4x, EP 1.4x, SSCA2 1.3x]
4.6x performance improvement relative to LLVM-unopt on the same number of locales, on average (1, 2, 4, 8, 16, 32, 64 locales).
Results on the Westmere Cluster (LLVM-unopt vs. LLVM-allopt)
[Chart: performance improvement over LLVM-unopt with Aggregation, Locality Opt, and Coalescing applied; higher is better. Per-benchmark speedups: SW 2.3x, Cholesky 16.9x, Sobel 1.1x, Stream EP 2.5x, EP 1.3x, SSCA2 2.3x]
4.4x performance improvement relative to LLVM-unopt on the same number of locales, on average (1, 2, 4, 8, 16, 32, 64 locales).
Performance Evaluation: Detailed Results & Analysis of Cholesky Decomposition
Cholesky Decomposition: Dependencies
[Diagram: tiled Cholesky decomposition; tiles are distributed across Node 0 through Node 3, with dependence arrows between tiles]
Metrics
1. Performance & scalability: baseline (LLVM-unopt) vs. LLVM-based optimizations (LLVM-allopt)
2. The dynamic number of communication API calls
3. Analysis of optimized code
4. Performance comparison: conventional C backend vs. LLVM backend
Performance Improvement by LLVM (Cholesky on the Cray XC)
[Chart: speedup over LLVM-unopt on 1 locale, for 1 to 32 locales. LLVM-unopt: 1, 0.1, 0.1, 0.2, 0.2, 0.3; LLVM-allopt: 2.6, 2.7, 3.7, 4.1, 4.3, 4.5]
The LLVM-based communication optimizations show scalability.
Communication API Call Elimination by LLVM (Cholesky on the Cray XC)
[Chart: dynamic number of communication API calls, normalized to LLVM-unopt (100% in every category). LLVM-allopt: LOCAL_GET 12.1% (8.3x improvement), REMOTE_GET 0.2% (500x improvement), LOCAL_PUT 89.2% (1.1x improvement), REMOTE_PUT 100.0% (unchanged)]
Analysis of Optimized Code
LLVM-unopt:
  for jB in zero..tileSize-1 {
    for kB in zero..tileSize-1 {    // 4 GETs
      for iB in zero..tileSize-1 {  // 9 GETs + 1 PUT
      }
    }
  }
LLVM-allopt (1. allocate a local buffer, 2. perform a bulk transfer):
  for jB in zero..tileSize-1 {
    for kB in zero..tileSize-1 {    // 1 GET
      for iB in zero..tileSize-1 {  // 1 GET + 1 PUT
      }
    }
  }
Performance Comparison with C Backend
[Chart: speedup over LLVM-unopt on 1 locale, for 1 to 64 locales. C backend: 2.8, 0.1, 0.1, 0.1, 0.3, 0.7, 0.4; LLVM-unopt: 1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.4; LLVM-allopt: 2.6, 2.7, 3.7, 4.1, 4.3, 4.5, 4.5]
The C backend is faster than the unoptimized LLVM backend.
Current Limitation
In LLVM 3.3, many optimizations assume that the pointer size is the same across all address spaces. Chapel's wide pointer holds a locale ID (16 bits) and an address (48 bits):
- For C code generation: a 128-bit struct pointer, so ptr.locale and ptr.addr are plain field accesses
- For LLVM code generation: a 64-bit packed pointer, so field accesses become shift/mask operations:
    ptr.locale  ->  ptr >> 48
    ptr.addr    ->  ptr & 48BITS_MASK
The packed representation (1) needs more instructions and (2) loses opportunities for alias analysis.
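A minimal C sketch of the 64-bit packed representation (field layout per the slide: locale in the top 16 bits, address in the low 48; the helper names are ours, and extracting the address is an AND with the mask):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 64-bit packed wide pointer: 16-bit locale ID in the
 * high bits, 48-bit local address in the low bits. */
#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

static inline uint64_t pack(uint16_t locale, uint64_t addr) {
    return ((uint64_t)locale << ADDR_BITS) | (addr & ADDR_MASK);
}

static inline uint16_t locale_of(uint64_t ptr) {
    return (uint16_t)(ptr >> ADDR_BITS);   /* ptr.locale */
}

static inline uint64_t addr_of(uint64_t ptr) {
    return ptr & ADDR_MASK;                /* ptr.addr */
}
```

Each field access costs a shift or mask instruction, and because the result is an integer rather than a typed pointer, LLVM's alias analysis can no longer reason about what it points to.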
Conclusions
LLVM-based communication optimizations for PGAS programs are a promising way to optimize PGAS programs in a language-agnostic manner. Preliminary evaluation with 6 Chapel applications:
- Cray XC30 supercomputer: 4.6x average performance improvement
- Westmere cluster: 4.4x average performance improvement
Future Work
Extend LLVM IR to support parallel programs with PGAS and explicit task parallelism, using a higher-level parallel IR (PIR):
  Parallel programs (Chapel, X10, CAF, HC, ...)
  -> runtime-independent optimizations (RI-PIR generation, analysis, transformation; e.g., task-parallel constructs)
  -> runtime-specific optimizations (RS-PIR generation, analysis, transformation; e.g., GASNet API)
  -> LLVM
  -> binary
Acknowledgements
Special thanks to Brad Chamberlain (Cray), Rafael Larrosa Jimenez (UMA), Rafael Asenjo Plaza (UMA), and the Habanero Group at Rice.
Backup Slides
Compilation Flow
Chapel programs go through AST generation and optimizations, then take one of two paths:
- C code generation -> C programs -> backend compiler's optimizations (e.g., gcc -O3) -> binary
- LLVM IR generation -> LLVM IR -> LLVM optimizations -> binary