Approved for public release
What is LLVM? And a Status Update.
Hal Finkel
Leadership Computing Facility
Argonne National Laboratory
2
Clang, LLVM, etc.
LLVM/Clang is both a research platform and a production-quality compiler.
✔ LLVM is a liberally-licensed(*) infrastructure for creating compilers, other toolchain components, and JIT compilation engines.
✔ Clang is a modern C++ frontend for LLVM
✔ LLVM and Clang will play significant roles in exascale computing systems!
(*) Now under the Apache 2 license with the LLVM Exception
3
A role in exascale? Current/Future HPC vendors are already involved (plus many others)...
LLVM
Apple + Google (many millions invested annually) + many others (Qualcomm, Sony, Microsoft, Facebook, Ericsson, etc.)
Intel
Cray
ARM
IBM
NVIDIA(and PGI)
AMD
Academia, Labs, etc.
4
What is LLVM:
LLVM is not a “low-level virtual machine”!
LLVM is a multi-architecture infrastructure for constructing compilers and other toolchain components.
LLVM IR → architecture-independent simplification → architecture-aware optimization (e.g., vectorization) → backends (type legalization, instruction selection, register allocation, etc.) → assembly printing, binary generation, or JIT execution
5
What is Clang:
Clang is a C++ frontend for LLVM...
C++ source (C++14, C11, etc.) → parsing and semantic analysis → code generation → LLVM IR
(Static analysis also builds on the parsed and semantically analyzed source.)
● For basic compilation, Clang works just like gcc – using clang instead of gcc, or clang++ instead of g++, in your makefile will likely “just work.”
● Clang has a scalable LTO mode (ThinLTO); check out: https://clang.llvm.org/docs/ThinLTO.html
6
The core LLVM compiler-infrastructure components are one of the subprojects in the LLVM project. These components are also referred to as “LLVM.”
7
What About Flang?
● Started as a collaboration between DOE and NVIDIA/PGI. Now also involves ARM and other vendors.
● Public discussions on making Flang (f18 + runtimes) part of LLVM are going well.
● Two development paths:
 – Flang, based on PGI’s existing frontend (in C): production ready, including OpenMP support.
 – f18, a new frontend written in modern C++: parsing, semantic analysis, etc. under active development.
 – Plus a Fortran runtime library and a vectorized math-function library.
LLVM Project
8
What About MLIR?
● MLIR is now part of the LLVM project.
● MLIR is a "multi-level IR", originally developed as part of the TensorFlow project.
● It is a kind of framework for producing particular IRs (along with a nice way to specify peephole optimizations for them, translations between them, etc.).
● MLIR is being used for Flang.
9
EuroLLVM
Upcoming LLVM Status...
● Relicensing well underway.
● Moved to GitHub (might also move issue tracking, etc. soon).
● Version 9.0.1 has been released.
10
What’s Working Well
● Involvement:
● Features: LLVM has become well known for an important set of features:
● A well-defined IR allows use by a lot of different languages: C, C++, Fortran, Julia, Rust, Python (e.g., via Numba), Swift, ML frameworks (e.g., TensorFlow/XLA, PyTorch/Glow), and many others.
● A backend infrastructure allowing the efficient creation of backends for new hardware.
● A state-of-the-art C++ frontend, CUDA support, scalable LTO, sanitizers and other debugging capabilities, and more.
● High code-quality standards.
11
MPI-specific warning messages
These are not really MPI-specific, but use the “type safety” attributes inspired by this use case:
int MPI_Send(void *buf, int count, MPI_Datatype datatype)
    __attribute__(( pointer_with_type_tag(mpi,1,3) ));
…
#define MPI_DATATYPE_NULL ((MPI_Datatype) 0xa0000000)
#define MPI_FLOAT         ((MPI_Datatype) 0xa0000001)
…
static const MPI_Datatype mpich_mpi_datatype_null
    __attribute__(( type_tag_for_datatype(mpi,void,must_be_null) )) = 0xa0000000;
static const MPI_Datatype mpich_mpi_float
    __attribute__(( type_tag_for_datatype(mpi,float) )) = 0xa0000001;
See Clang's test/Sema/warn-type-safety-mpi-hdf5.c, test/Sema/warn-type-safety.c and
test/Sema/warn-type-safety.cpp for more examples,
and: http://clang.llvm.org/docs/AttributeReference.html#type-safety-checking
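As a self-contained sketch of the same mechanism, the snippet below attaches the two attributes to a made-up library: `mylib_send`, the `mylib` tag kind, and the tag constants are all hypothetical, modeled on the MPI declarations above. The attribute macros expand to nothing on non-Clang compilers.

```c
#include <assert.h>

/* Hypothetical library sketch; only Clang understands these attributes,
   so they compile away elsewhere. */
#if defined(__clang__)
#  define PTR_WITH_TAG(kind, pi, ti) \
     __attribute__((pointer_with_type_tag(kind, pi, ti)))
#  define TAG_FOR(kind, type) \
     __attribute__((type_tag_for_datatype(kind, type)))
#else
#  define PTR_WITH_TAG(kind, pi, ti)
#  define TAG_FOR(kind, type)
#endif

typedef int my_datatype;

static const my_datatype tag_float TAG_FOR(mylib, float) = 1;
static const my_datatype tag_int   TAG_FOR(mylib, int)   = 2;

/* Argument 1 (buf) must point to the type named by argument 3 (dt). */
int mylib_send(void *buf, int count, my_datatype dt) PTR_WITH_TAG(mylib, 1, 3);

int mylib_send(void *buf, int count, my_datatype dt) {
  (void)buf; (void)count;
  return dt; /* stub: a real library would transmit the buffer */
}

int demo(void) {
  float f = 1.0f;
  /* OK: tag matches the pointee type. Passing tag_int here instead
     would trigger -Wtype-safety under Clang. */
  return mylib_send(&f, 1, tag_float);
}
```

Compiling the mismatch variant with Clang produces a type-safety warning at the call site, which is exactly how the MPI_Datatype checks above work.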
12
Sanitizers
The sanitizers (some now also supported by GCC) – instrumentation-based debugging
● Checks get compiled in (and optimized along with the rest of the code) – execution speed an order of magnitude or more faster than Valgrind
● You need to choose which checks to run at compile time:
● Address sanitizer: -fsanitize=address – Checks for out-of-bounds memory access, use after free, etc.: http://clang.llvm.org/docs/AddressSanitizer.html
● Leak sanitizer: Checks for memory leaks; really part of the address sanitizer, but can be enabled in a mode just to detect leaks with -fsanitize=leak: http://clang.llvm.org/docs/LeakSanitizer.html
● Memory sanitizer: -fsanitize=memory – Checks for use of uninitialized memory: http://clang.llvm.org/docs/MemorySanitizer.html
● Thread sanitizer: -fsanitize=thread – Checks for race conditions: http://clang.llvm.org/docs/ThreadSanitizer.html
● Undefined-behavior sanitizer: -fsanitize=undefined – Checks for the execution of undefined behavior: http://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
● Efficiency sanitizer [Recent development]: -fsanitize=efficiency-cache-frag, -fsanitize=efficiency-working-set (-fsanitize=efficiency-all to get both)
And there's more, check out http://clang.llvm.org/docs/ and Clang's include/clang/Basic/Sanitizers.def for more information.
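As a small, hedged illustration of what the leak checks catch, the function below (a made-up example, not from the slides) allocates and never frees; built with clang -fsanitize=address (or -fsanitize=leak), LeakSanitizer prints a leak report at exit, while without the flag it runs normally.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 42, but leaks the allocation: under -fsanitize=address or
   -fsanitize=leak this is reported as "detected memory leaks" at exit.
   The function itself is well-defined either way. */
int leaky_answer(void) {
  int *p = malloc(4 * sizeof *p);
  if (!p) return -1;
  p[0] = 42;
  return p[0]; /* bug: p is never freed */
}
```

Changing `a[8]`-style out-of-bounds indexing into such a program is the analogous way to exercise the address sanitizer's heap-buffer-overflow check.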
13
Clang Can Compile CUDA!
$ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch>
For example:--cuda-gpu-arch=sm_35
When compiling, you may also need to pass --cuda-path=/path/to/cuda if you didn’t install the CUDA SDK into /usr/local/cuda (or a few other “standard” locations).
For more information, see: http://llvm.org/docs/CompileCudaWithLLVM.html
● CUDA is the language used to write code for NVIDIA GPUs.
● Support now also developed by AMD as part of their HIP project.
● Clang's CUDA support aims to provide better support for modern C++ than NVIDIA's nvcc.
14
Many Derived Vendor Products
Upstream LLVM Code Base
→ Vendor A, Vendor B, Vendor C: each adds optimizations and other features, then packages and ships its own product.
15
On Interacting with Open-Source LLVM
When compilers are open source:
● DOE staff, collaborators, contractors, etc. can directly contribute features and bug fixes.
● And we can review code to keep quality high (and keep the open-source community healthy).
● We can enhance user productivity by delivering new features and bug fixes to our users quickly…
● This critically depends on testing!
Vendor(s)
Staff, collaborators, contractors, etc.
open-source LLVM
CI/testing system with a large set of DOE applications and libraries with correctness tests!
DOE/ECP’s CI effort is important here!
Facilities+
Happy users!
Bug fixes (and simple features) deployed to users in days (not weeks or months)!
Flang
Alexis Perry-Holby (Los Alamos National Laboratory), Patrick McCormick (Los Alamos National Laboratory), Douglas Miles (NVIDIA), Stephen Scalpone (NVIDIA), Hal Finkel (Argonne National Laboratory), David Bernholdt (Oak Ridge National Laboratory), Brian Friesen (Lawrence Berkeley National Laboratory)
Feb. 6th, 2020
2
Flang – Current Status
• Working on upstreaming the Flang (F18) codebase to LLVM’s main monorepo
 – Crucial to the long-term success of the project
 – Identified items are worked on collaboratively by the community
• Technical work scope details:
 – Working on resolution of type-bound generics and operators
 – Completed support of forward references to derived types
 – Implemented logical expression lowering
 – Began character expression lowering
 – Created an expression lowering test framework
 – Continued work on DO loop semantic checks, especially where zeroes are not allowed
 – Continued work on FIR definition
3
Flang – Path to Merging into the LLVM Monorepo
• CMake changes to support in-tree building
 – PR coming soon
• Style changes that take us closer to LLVM
 – Clang-format changes to bring Flang more in line with LLVM (PR #945)
 – Rationalization of public/private headers (merged PR #943)
 – Renaming of files from .cc to .cpp (merged PR #958)
• Making more general use of LLVM APIs and data structures– Discussion of detailed list of changes is ongoing on flang-dev mailing list
• Port testing to use LLVM tools (lit and FileCheck)– PR #941 ports the test suite to lit– Separate discussions of various custom scripts are ongoing
• Build compiler support
 – Need a plan for improving public buildbot coverage and moving in a direction that allows ECP-centric coverage to be addressed
The LLVM Compiler Infrastructure
4
Flang – Growing Community
The LLVM Compiler Infrastructure
5
Flang – Project Timeline
• Schedule is currently driven by phases of the compiler (parsing, semantic analysis, lowering, etc.)
• Schedule priority is on getting the infrastructure complete for a functionally correct, sequential compiler – it does not necessarily reflect the steps required for full LLVM adoption.
 – Additional efforts are working on foundational pieces (shared with Clang) for OpenMP support (i.e., the OpenMP IR Builder)
• Fortran-centric optimizations, including the runtime library, will follow in a series of releases timed with an early release within ECP and then a drop into LLVM’s 6-month release cycle
[Timeline diagram, 11/2019–4/2021. Front end: parse, semantic analysis, lower to FIR (Fortran 2018, OpenMP 5.x). Mid-level: FIR analysis & optimization. Back end: lower to LLVM IR, LLVM code generation. Milestones: prototype OpenMP IR Builder (shared functionality with Clang); ECP release: Fortran 2018, LLVM 11 upstream target (7/2020); ECP release: Fortran 2018 and OpenMP, LLVM 12 upstream target; ECP release: Fortran 2018 and OpenMP, LLVM 13 upstream target; prototype releases: Fortran 2018 and OpenMP (OpenMP IR with AMD and Intel targets; extended features, performance). Notes: current community design discussions are considering two MLIR dialects here, one for Fortran-centric constructs and a second for OpenMP; an MLIR dialect for OpenMP has been proposed (Arm).]
Kokkos Clang Interactions
David Poliakoff, Sandia National Laboratories, [email protected]
Clang-Tidy for Kokkos
• Kokkos exposes semantics that allow you to say things the developers know to be wrong
• clang-tidy checks can flag these patterns
Sanitizer Integration
• Kokkos call stacks are terrible, but Kokkos supports runtime tooling
• Make sanitizers aware of Kokkos tooling (Kokkos field name)
JIT in Kokkos
• In the past we’ve looked at using ClangJIT to speed up RAJA kernels, fixing loop bounds to speed up finite element codes
• There are also finite element codes in Kokkos
• ClangJIT in Kokkos
• Pictured: a place to consider implementing JIT
Programming GPUs without GPUs
• GPU development can be tricky. While the tools that exist are minor miracles, they’re still rough in places.
• Simulate the semantics of GPU memory using Kokkos and ASAN
Kokkos Autotuning
• Kokkos Tools (likely) adding a Tuning callback interface
• Obvious: tune Kokkos parameters (CUDA block size, different execution strategies)
• Fun: tune compiler optimization strategies
• Problem: you’d need a group of compiler experts to help make the fun part a reality
• Contributions welcome!
User-Directed Loop Transformations
What are User-Directed Loop Transformations?

#pragma clang transform tile sizes(4,4)
for (int i = 0; i < m; i += 1)
  for (int j = 0; j < n; j += 1)
    Body(i,j);

for (int i1 = 0; i1 < m; i1 += 4)
  for (int j1 = 0; j1 < n; j1 += 4)
    for (int i2 = i1; i2 < min(i1+4,m); i2 += 1)
      for (int j2 = j1; j2 < min(j1+4,n); j2 += 1)
        Body(i2,j2);
Optimization: lower the bar to try out what’s executing faster
Autotuning: let machine learning do it
Performance portability: different transformations for different platforms
Maintainability: organize your code to be more understandable and avoid code duplication
User-Directed Loop Transformations
Loop Transformation Zoo
Tiling

#pragma clang transform tile sizes(4,4)
for (int i = 0; i < m; i += 1)
  for (int j = 0; j < n; j += 1)
    Body(i,j);

for (int i1 = 0; i1 < m; i1 += 4)
  for (int j1 = 0; j1 < n; j1 += 4)
    for (int i2 = i1; i2 < min(i1+4,m); i2 += 1)
      for (int j2 = j1; j2 < min(j1+4,n); j2 += 1)
        Body(i2,j2);

Unrolling

#pragma clang transform unroll partial(4)
for (int i = 0; i < n; i += 1)
  Body(i);

for (int i = 0; i < n; i += 4) {
  Body(i);
  Body(i+1);
  Body(i+2);
  Body(i+3);
}

Fusion

#pragma clang transform fuse
for (int i = 0; i < n; i += 1)
  BodyA(i);
for (int i = 0; i < n; i += 1)
  BodyB(i);

for (int i = 0; i < n; i += 1) {
  BodyA(i);
  BodyB(i);
}

Space-Filling Curves

#pragma clang transform spacefill curve(hilbert)
for (int i = 0; i < m; i += 1)
  for (int j = 0; j < n; j += 1)
    Body(i,j);

for (int idx = 0; idx < m*n; idx += 1) {
  tie(i,j) = hilbert2d_from_index(idx,m,n);
  Body(i,j);
}

Interchange

#pragma clang transform interchange
for (int i = 0; i < m; i += 1)
  for (int j = 0; j < n; j += 1)
    Body(i,j);

for (int j = 0; j < n; j += 1)
  for (int i = 0; i < m; i += 1)
    Body(i,j);

Reversal

#pragma clang transform reverse
for (int i = 0; i < n; i += 1)
  Body(i);

for (int i = n-1; i >= 0; i -= 1)
  Body(i);
User-Directed Loop Transformations
Maintainable Performance Improvements

[Bar chart: BLAS dgemm – double-precision FP operations per time unit as a percentage of peak for -O3 -march=native, Netlib CBLAS, #pragma SLPVectorizer, #pragma LoopVectorizer, Polly MatMul, ATLAS, OpenBLAS, and Intel MKL 2018.3, against theoretical peak.]

[Bar charts: execution times for Polybench syr2k, SPEC CPU 2006 456.hmmer, Polybench heat-3d, and Polybench covariance under various transformation combinations (fission, tile, interchange, parallel, OpenMP, OpenMP target, Polly, unroll-and-jam).]
User-Directed Loop Transformations
OpenMP Standardization
Why Standardize?
Composable with OpenMP directives
Works with multiple compilers
Encourages compiler vendors to implement loop transformations
Specification Status
New tile directive in TR8 (preview of OpenMP 5.1)
Working on also adding an unroll directive to OpenMP 5.1 (target Nov 2020)
More planned for OpenMP 6.0 (target Nov 2023):
 Follow-up transformations
 Loop identifiers
 Transformation options
 More transformations…
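For reference, the TR8 tile directive applies to a following loop nest roughly as sketched below (my example); a compiler without OpenMP 5.1 support simply ignores the pragma and runs the loops as written, so the result is the same either way.

```c
#include <assert.h>

/* Sum f(i,j) = i*n + j over an m-by-n iteration space. The pragma asks an
   OpenMP-5.1-aware compiler to tile the nest 4x4; others ignore it. */
long tiled_sum(int m, int n) {
  long s = 0;
  #pragma omp tile sizes(4, 4)
  for (int i = 0; i < m; i += 1)
    for (int j = 0; j < n; j += 1)
      s += (long)i * n + j;
  return s;
}
```

Because tiling only reorders iterations, the directive is valid exactly when the loop body is independent of iteration order, as here.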
User-Directed Loop Transformations
Implementation Status
Front-End Parsing
Under review for upstreaming: https://reviews.llvm.org/D69088
#pragma clang transform syntax
Only transformations that LLVM has passes for: distribution, unrolling, unroll-and-jam, vectorization, interleaving
Applying Transformations
Prototype based on Polly: github.com/SOLLVE/llvm-project/tree/pragma-polly
More and composable transformations: tiling, interchange, reversal, unroll(-and-jam), array packing, thread parallelization
Usual Polly restrictions apply
Long term, looking into a more encompassing loop optimizer
Optimizing (Parallel) Programs
with interprocedural optimizations and (parallelism/)runtime awareness
SC’19 OpenMP Booth talk on OpenMP in LLVM: https://youtu.be/6yOa-hRi63M
LLVM Developers Conference 2018: Talk: https://youtu.be/zfiHaPaoQPc
LLVM Developers Conference 2019: Talk: https://youtu.be/CzWkc_JcfS0 Tutorial: https://youtu.be/HVvvCSSLiTw
LLVM Developers Conference 2019: https://youtu.be/elmio6AoyK0
ISC’19 Talk: https://doi.org/10.1007/978-3-030-20656-7_13
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Clacc: OpenACC Support for Clang and LLVM
Joel E. Denny, Seyong Lee, Jeffrey S. Vetter
Future Technologies Group, ORNL
https://ft.ornl.gov/ [email protected]
Feb 6, 2020 ECP Annual Meeting: LLVM Session
ECP 2.3.2.10 STDT PROTEAS-TUNE
2
Clacc Background
OpenACC
• Launched 2010 as a portable directive-based programming model in C, C++, and Fortran for heterogeneous accelerators
• Best known for NVIDIA GPUs; implementations have targeted AMD GCN, multicore CPU, Intel Xeon Phi, FPGA
• Compared to OpenMP:
 – Descriptive vs. prescriptive
 – Many features ported to OpenMP
 – Specification less complex
• OpenACC 3.0 released in Nov 2019
Clacc
• US Exascale Computing Project (ECP)
• Goal: open-source, production-quality, standard-conforming OpenACC compiler support for Clang and LLVM
• Why?
 – Needed for HPC app development and OpenACC adoption and evolution
 – GCC is the only open-source, production-quality compiler supporting OpenACC
• Design: translate OpenACC to OpenMP to build on OpenMP support in Clang
3
Clacc Design
• AST transformation
 – OpenACC AST for source-level tools: pretty printers, analyzers, lint tools, debugger and editor extensions, etc.
 – OpenMP AST for source-to-source: reuse OpenMP implementation and tools, automatically port apps, etc.
 – Clang AST is immutable by design
 – Using Clang’s TreeTransform facility
• Two compilation modes
 – Traditional compilation: OpenACC source to executable
 – Source-to-source: OpenACC source to OpenMP source
• Future: MLIR OpenACC dialect?
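To make the design concrete, here is a hedged sketch of the kind of mapping involved; the exact OpenMP directives Clacc emits may differ from this hand-written translation.

```c
#include <assert.h>

/* OpenACC input (the pragma is ignored by compilers without OpenACC
   support, so the function also runs correctly as serial C): */
void saxpy_acc(int n, float a, const float *x, float *y) {
  #pragma acc parallel loop
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

/* One plausible OpenMP rendering of the same loop: */
void saxpy_omp(int n, float a, const float *x, float *y) {
  #pragma omp target teams distribute parallel for
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}
```

Source-to-source mode would emit text like `saxpy_omp` from `saxpy_acc`, which is what makes the output reusable with other OpenMP compilers.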
4
2019 Progress on OpenACC Support
• Support for device offloading (e.g., NVIDIA GPUs) in addition to multicore
– Clacc poster presents preliminary SPEC ACCEL benchmark results for NVIDIA GPU
• Source-to-source mode now uses Clang’s Rewrite facility
– Avoids most preprocessor expansions in generated OpenMP source
– More usable for targeting different OpenMP compilers and for app migration
• Support for OpenACC Profiling Interface
 – Layer over OMPT plus Clacc extensions
 – Prototype of OMPT offloading support, which isn’t included in LLVM OpenMP
 – Most events supported, not all features
• Support for implicit gang clauses, an unspecified behavior implemented by all major OpenACC compilers
• Support for data construct, update directive, subarrays, and various clauses
• Regular maintenance– Test suite, bug fixes, docs, user feedback– Updates for OpenACC spec revisions– Continuous integration of upstream Clang
and LLVM
• Investigation of Gitlab CI and SLURM for accelerated building and testing of Clacc on ORNL’s ExCL cluster
5
2019 Upstream Clang and LLVM Contributions
Clang, LLVM, and OpenMP Improvements
• Fixes and OpenMP 5.0 extension for Clang Parse and Sema for OpenMP
• Fixes for Debian/Ubuntu nvidia-cuda-toolkit support (affects OpenMP/OpenACC offloading)
• Fixes for Clang pragma location tracking issue to enable Clacc’s use of Rewrite
• Fixes for signedness tests in LLVM’s arbitrary-precision integer type (APSInt)
• Fix for Clang -ast-print (affects Claccsource-to-source mode)
Testing Infrastructure Improvements
• LIT
 – LLVM Integrated Tester: LLVM’s testing framework
 – Fixes for a series of issues related to LIT internal shell commands
 – Added LIT_OPTS env var to pass command-line options through ninja/make
• FileCheck
 – Tool used pervasively in Clacc and LLVM test suites for verifying test-case output
 – Made various improvements related to previous contributions, primarily debugging facilities for FileCheck
6
2019 OpenACC Specification Contributions
• Clarifications for the semantics of combinations of seq, independent, auto, gang, worker, and vector clauses
• Clarification about implicit independent clauses on orphaned loops
• Clarifications about the resolution of conflicts between device-specific and default clauses
• Clarifications about when loop reductions update reduction variables
• Clarifications for compute constructs, their restrictions, and their implicitly determined data attributes
• Corrections for the OpenACC Profiling Interface
• Series of examples to clarify subtle issues for device_type clauses and reductions
• Launch of an OpenACC rationale document with initial content related to implicit clauses
7
Path Forward
Development Strategy
• Focus on C and then C++
• Focus on behavioral correctness
 – Prescriptive OpenACC interpretation
 – Many-to-one mapping to OpenMP
• Then performance
 – Descriptive OpenACC interpretation
 – Analyses for best mapping to OpenMP
 – Investigate advanced LLVM analyses
Clacc Access
• For now, email us: [email protected]
• Might be hosted publicly with ECP LLVM integration repo… if appropriate
• Otherwise, likely hosted publicly on ORNL Gitlab server
• Eventually upstream to Clang and LLVM
8
Clacc Takeaways
• Overview
 – Objective: production-quality OpenACC compiler support for Clang and LLVM
 – Design: translate OpenACC to OpenMP to build on existing OpenMP support in Clang
• Join Us
 – Future Technologies Group, Oak Ridge National Laboratory
 – Hiring interns, postdocs, research and technical staff
 – External collaborators welcome
• Clacc poster at ST Poster Session
https://ft.ornl.gov/ | [email protected]
Clacc: Translating OpenACC to OpenMP in Clang, Joel E. Denny, Seyong Lee, and Jeffrey S. Vetter, 2018 IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), Dallas, TX, USA, 2018.
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Exploring MLIR for OpenACC
Valentin Clement, Jeffrey S. Vetter
ECP 2.3.2.10 PROTEAS-TUNE, ECP Annual Meeting, February 6, 2020, Houston, Texas
http://ft.ornl.gov
2
Exploring MLIR for OpenACC - Goals
• Definition of an MLIR OpenACC dialect compatible with the core dialects
• Use the MLIR dialect for open-source OpenACC compiler support for Flang (f18) and LLVM
[Pie chart: accelerator programming model usage, runtime-weighted (INCITE, up to June 2019): OpenMP offload 4.2%, OpenACC 24.5%, CUDA Fortran 4.4%, CUDA 66.9%. Source: XALT/Reuben Budiardja, NCCS]
3
MLIR – Multi-Level Intermediate Representation
func @testFunction(%arg0: i32) {
%x = mydialect.op(%arg0) : (i32) -> i32
br ^bb1
^bb1:
%y = addi %x, %x : i32
return %y : i32
}
- SSA-based design
- Module/function/block/instruction structure
- Round-trippable textual form
- Syntactically similar to LLVM IR
- Progressive lowering -> general lowering passes = more code reuse
- Great location tracking
Operations, not instructions
• No predefined set of instructions; a collection of dialects
• Operations are like “opaque functions” to MLIR
(Callouts in the example above: dialect prefix, operation id, arguments, argument type, return type, result name.)
4
MLIR Core dialects
OpenMP, OpenACC
5
OpenACC dialect – !$acc parallel loop

func @saxpy(%x: memref<1024xf32>, %y: memref<1024xf32>,
%n: index, %a: f32) -> memref<1024xf32> {
%c0 = constant 0 : index
%c1 = constant 1 : index
// y[i] = a*x[i] + y[i];
acc.parallel {
acc.loop {
loop.for %arg0 = %c0 to %n step %c1 {
%xi = load %x[%arg0] : memref<1024xf32>
%yi = load %y[%arg0] : memref<1024xf32>
%ax = mulf %a, %xi : f32
%yy = addf %ax, %yi : f32
store %yy, %y[%arg0] : memref<1024xf32>
}
} attributes { independent }
} attributes { num_gangs = 8, num_workers = 128 }
return %y : memref<1024xf32>
}

Attributes are attached to the operations; each operation’s region is the code it impacts. acc.loop should support different loop operations: loop.for, affine.for, fir.do.
6
OpenACC dialect - lowering
7
OpenACC lowered to GPU dialect
func @saxpy(%arg0: memref<1024xf32>, %arg1: memref<1024xf32>, %arg2: index, %arg3: f32) -> memref<1024xf32> {
%c0 = constant 0 : index
%c1 = constant 1 : index
%c8 = constant 8 : index
%c128 = constant 128 : index
%c1_0 = constant 1 : index
gpu.launch blocks(%arg4, %arg5, %arg6) in (%arg10 = %c8, %arg11 = %c1_0, %arg12 = %c1_0)
threads(%arg7, %arg8, %arg9) in (%arg13 = %c128, %arg14 = %c1_0, %arg15 = %c1_0)
args(%arg16 = %arg0, %arg17 = %arg1, %arg18 = %arg3, %arg19 = %c0, %arg20 = %arg2, %arg21 = %c1)
: memref<1024xf32>, memref<1024xf32>, f32, index, index, index {
%0 = muli %arg21, %arg4 : index
%1 = addi %arg19, %0 : index
%2 = muli %arg21, %arg10 : index
loop.for %arg22 = %1 to %arg20 step %2 {
%3 = load %arg16[%arg22] : memref<1024xf32>
%4 = load %arg17[%arg22] : memref<1024xf32>
%5 = mulf %arg18, %3 : f32
%6 = addf %5, %4 : f32
store %6, %arg17[%arg22] : memref<1024xf32>
}
gpu.terminator
}
return %arg1 : memref<1024xf32>
}
$ mlir-opt --convert-openacc-to-gpu saxpy.mlir
> This code is then lowered down to NVVM/LLVM IR and passed to LLVM
8
Work to be done
f18
• Parsing
• Semantics
• AST lowering
MLIR
• Dialect design
• Optimization
• Progressive lowering
Runtime
• Plug into a compatible runtime
Concurrency in LLVM
Pat McCormick, George Stelle, Alexis Perry-Holby, EJ Park,Nirmal Prajapati, Daniel Shevitz *
TB Schardl, William Moses, Charles Leiserson +
* Los Alamos National Laboratory+ MIT
February 2020
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Overview
Slide 2
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Recent Developments
• Reduction improvement
• Improved FleCSI and Kokkos support
• Concurrent region analysis
• Refactored backends
• Race condition prevention
• Realm backend runtime wrapper
• LLVM 9 rebase
• Concurrent SSA theory
Slide 3
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Static Single Assignment (SSA)
define f(x) {
entry:
  cond = and x, 1
  br cond, a, b
a:
  y = mul 4, x
  call g()
  br cond, b, c
b:
  z = add x, x
  call h()
  br cond, c, a
c:
  r = add y, z
  ret r
}
Slide 4
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Control Flow Graph (CFG)
define f(x) {
entry:
  cond = and x, 1
  br cond, a, b
a:
  y = mul 4, x
  call g()
  br cond, b, c
b:
  z = add x, x
  call h()
  br cond, c, a
c:
  r = add y, z
  ret r
}
entry
a b
c
Slide 5
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Dominator Trees
entry
a b
c
entry
a b c
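The flat tree shown (entry is the immediate dominator of a, b, and c, since a and b can each be reached without passing through the other) can be reproduced with the textbook iterative dataflow algorithm; this is a minimal sketch where the node encoding and function name are mine.

```c
#include <assert.h>

/* The slide's CFG: entry -> {a, b}, a -> {b, c}, b -> {c, a}, c exits. */
enum { ENTRY, A, B, C, N };

static const int succ[N][2] = {
  [ENTRY] = {A, B},
  [A]     = {B, C},
  [B]     = {C, A},
  [C]     = {-1, -1},
};

/* Iterative dataflow: dom(v) = {v} union (intersection of dom(p) over
   all predecessors p of v), iterated to a fixed point. Sets are bitsets. */
void dominators(unsigned dom[N]) {
  const unsigned full = (1u << N) - 1;
  for (int v = 0; v < N; ++v) dom[v] = full;
  dom[ENTRY] = 1u << ENTRY;
  int changed = 1;
  while (changed) {
    changed = 0;
    for (int v = 0; v < N; ++v) {
      if (v == ENTRY) continue;
      unsigned meet = full;
      for (int p = 0; p < N; ++p)   /* scan edges for predecessors of v */
        for (int k = 0; k < 2; ++k)
          if (succ[p][k] == v) meet &= dom[p];
      unsigned next = meet | (1u << v);
      if (next != dom[v]) { dom[v] = next; changed = 1; }
    }
  }
}
```

Running it gives dom(a) = {entry, a}, dom(b) = {entry, b}, dom(c) = {entry, c}: each of a, b, c is dominated only by entry and itself, which is exactly the flat dominator tree on the slide.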
Slide 6
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Limitation of Dominator Trees
define f(x) {
entry:
  cond = and x, 1
  br cond, a, b
a:
  y = mul 4, x
  call g()
  br cond, b, c
b:
  z = add x, x
  call h()
  br cond, c, a
c:
  r = add y, z
  ret r
}
CFG
entry
a b
c
DomTree
entry
a b c
Slide 7
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Dominator DAG
define f(x) {
entry:
  cond = and x, 1
  br cond, a, b
a:
  y = mul 4, x
  call g()
  br cond, b, c
b:
  z = add x, x
  call h()
  br cond, c, a
c:
  r = add y, z
  ret r
}
CFG
entry
a b
c
DomDAG
entry
a b
c
Slide 8
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Correctness
Valid Paths
LLVM ⊆ Conditional CFG ⊆ CFG
Dominator Relation
CFG ⊆ Conditional CFG ⊆ LLVM
Slide 9
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Concurrency
fork
a b
join
Slide 10
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Tapir
fork
a b
join
fork
a b
join
Slide 11
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Concurrency
fork
a b
join
fork
a b
join
Slide 12
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Questions?
Slide 13
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Business Sensitive Information
Working toward an ECP fork of LLVM
Jeffrey Vetter
ECP Annual Meeting6 Feb 2020
2
Deep Dive: Improving the LLVM Compiler Ecosystem
LLVM
• Very popular open source compiler infrastructure
• Easily extensible
• Widely used and contributed to in industry
• Permissive license
• Used for heterogeneous computing
+SOLLVE
• Enhancing the implementation of OpenMP in LLVM
• Unified memory
• OMP Optimizations
• Prototype OMP features for LLVM
• OMP test suite
+PROTEAS
• Core optimization improvements to LLVM
• OpenACC capability for LLVM
• Autotuning for OpenACC and OpenMP in LLVM
• Integration with Tau performance tools
+FLANG
• Developing an open-source, production Fortran frontend
• Upstream to LLVM public release
• Support for OpenMP and OpenACC
• Approved by LLVM
+ATDM
• Enhancing LLVM to optimize template expansion for FleCSI, Kokkos, RAJA, etc.
• Flang testing and evaluation
Vendors
• Increasing dependence on LLVM
• Collaborations with many vendors using LLVM
• AMD
• ARM
• Cray
• HPE
• IBM
• Intel
• NVIDIA
Active involvement with broad LLVM community: LLVM Dev, EuroLLVM
3
ECP LLVM Integration and Deployment
• Develop an integrated ECP LLVM distribution
– Integrating different ECP projects using LLVM
– CI on target architectures
– Shared vehicle for improvements in LLVM
– Increased collaboration within ECP
– If vendor or LLVM compiler fails, we have a functioning risk mitigation solution
• Operations
– The ECP LLVM distro will be a closely maintained fork of the LLVM monorepo
– Individual ECP projects will exist as git branches
– Branches will be integrated into ECP LLVM as they mature
• Periodic upstreaming and patching of LLVM monorepo
4
Next Steps
• Create new repo - easy
• Operations
– Select projects
– Merge existing projects into ecp.llvm as branches
– Leverage CI infrastructure for our platforms of interest
• Request contingency funding for existing projects to merge, maintain, down/upstream changes
– See Mike’s presentation from yesterday
• More info: [email protected]