Approved for public release
What is LLVM? And a Status Update.
Hal Finkel
Leadership Computing Facility
Argonne National Laboratory
2
Clang, LLVM, etc.
LLVM/Clang is both a research platform and a production-quality compiler.
✔ LLVM is a liberally-licensed(*) infrastructure for creating compilers, other toolchain components, and JIT compilation engines.
✔ Clang is a modern C++ frontend for LLVM
✔ LLVM and Clang will play significant roles in exascale computing systems!
(*) Now under the Apache 2 license with the LLVM Exception
3
A role in exascale? Current/Future HPC vendors are already involved (plus many others)...
LLVM
Apple + Google (many millions invested annually) + many others (Qualcomm, Sony, Microsoft, Facebook, Ericsson, etc.)
Intel
Cray
ARM
IBM
NVIDIA(and PGI)
AMD
Academia, Labs, etc.
4
What is LLVM:
LLVM is not a “low-level virtual machine”!
LLVM is a multi-architecture infrastructure for constructing compilers and other toolchain components.
LLVM IR → architecture-independent simplification → architecture-aware optimization (e.g., vectorization) → backends (type legalization, instruction selection, register allocation, etc.) → assembly printing, binary generation, or JIT execution
5
What is Clang:
Clang is a C++ frontend for LLVM...
C++ source (C++14, C11, etc.) → parsing and semantic analysis → code generation → LLVM IR
(Static analysis also builds on the parsed and semantically analyzed source.)
● For basic compilation, Clang works just like gcc – using clang instead of gcc, or clang++ instead of g++, in your makefile will likely “just work.”
● Clang has a scalable LTO mode (ThinLTO); check out: https://clang.llvm.org/docs/ThinLTO.html
6
The core LLVM compiler-infrastructure components are one of the subprojects in the LLVM project. These components are also referred to as “LLVM.”
7
What About Flang?
● Started as a collaboration between DOE and NVIDIA/PGI. Now also involves ARM and other vendors.
● Public discussions on making Flang (f18 + runtimes) part of LLVM are going well.
● Two development paths:
 – Flang, based on PGI’s existing frontend (in C): production ready, including OpenMP support.
 – f18, a new frontend written in modern C++: parsing, semantic analysis, etc. under active development.
 – Plus a Fortran runtime library and a vectorized math-function library.
LLVM Project
8
What About MLIR?
● MLIR is now part of the LLVM project.
● MLIR is a "multi-level IR", originally developed as part of the TensorFlow project.
● It is a kind of framework for producing particular IRs (along with a nice way to specify peephole optimizations for them, translations between them, etc.).
● MLIR is being used for Flang.
9
EuroLLVM
Upcoming LLVM Status...
● Relicensing well underway.
● Moved to GitHub (might also move issue tracking, etc. soon).
● Version 9.0.1 has been released.
10
What’s Working Well
● Involvement:
● Features: LLVM has become well known for an important set of features:
● A well-defined IR allows use by a lot of different languages: C, C++, Fortran, Julia, Rust, Python (e.g., via Numba), Swift, ML frameworks (e.g., TensorFlow/XLA, PyTorch/Glow), and many others.
● A backend infrastructure allowing the efficient creation of backends for new hardware.
● A state-of-the-art C++ frontend, CUDA support, scalable LTO, sanitizers and other debugging capabilities, and more.
● High code-quality standards.
11
MPI-specific warning messages
These are not really MPI-specific, but use the “type safety” attributes inspired by this use case:
int MPI_Send(void *buf, int count, MPI_Datatype datatype)
    __attribute__(( pointer_with_type_tag(mpi,1,3) ));
…
#define MPI_DATATYPE_NULL ((MPI_Datatype) 0xa0000000)
#define MPI_FLOAT         ((MPI_Datatype) 0xa0000001)
…
static const MPI_Datatype mpich_mpi_datatype_null
    __attribute__(( type_tag_for_datatype(mpi,void,must_be_null) )) = 0xa0000000;
static const MPI_Datatype mpich_mpi_float
    __attribute__(( type_tag_for_datatype(mpi,float) )) = 0xa0000001;
See Clang's test/Sema/warn-type-safety-mpi-hdf5.c, test/Sema/warn-type-safety.c and
test/Sema/warn-type-safety.cpp for more examples,
and: http://clang.llvm.org/docs/AttributeReference.html#type-safety-checking
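As a self-contained sketch of the same mechanism, the snippet below attaches the two attributes to a made-up library: `mylib_send`, the `mylib` tag kind, and the tag constants are all hypothetical, modeled on the MPI declarations above. The attribute macros expand to nothing on non-Clang compilers.

```c
#include <assert.h>

/* Hypothetical library sketch; only Clang understands these attributes,
   so they compile away elsewhere. */
#if defined(__clang__)
#  define PTR_WITH_TAG(kind, pi, ti) \
     __attribute__((pointer_with_type_tag(kind, pi, ti)))
#  define TAG_FOR(kind, type) \
     __attribute__((type_tag_for_datatype(kind, type)))
#else
#  define PTR_WITH_TAG(kind, pi, ti)
#  define TAG_FOR(kind, type)
#endif

typedef int my_datatype;

static const my_datatype tag_float TAG_FOR(mylib, float) = 1;
static const my_datatype tag_int   TAG_FOR(mylib, int)   = 2;

/* Argument 1 (buf) must point to the type named by argument 3 (dt). */
int mylib_send(void *buf, int count, my_datatype dt) PTR_WITH_TAG(mylib, 1, 3);

int mylib_send(void *buf, int count, my_datatype dt) {
  (void)buf; (void)count;
  return dt; /* stub: a real library would transmit the buffer */
}

int demo(void) {
  float f = 1.0f;
  /* OK: tag matches the pointee type. Passing tag_int here instead
     would trigger -Wtype-safety under Clang. */
  return mylib_send(&f, 1, tag_float);
}
```

Compiling the mismatch variant with Clang produces a type-safety warning at the call site, which is exactly how the MPI_Datatype checks above work.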
12
Sanitizers
The sanitizers (some now also supported by GCC) – instrumentation-based debugging
● Checks get compiled in (and optimized along with the rest of the code) – execution speed an order of magnitude or more faster than Valgrind
● You need to choose which checks to run at compile time:
● Address sanitizer: -fsanitize=address – Checks for out-of-bounds memory access, use after free, etc.: http://clang.llvm.org/docs/AddressSanitizer.html
● Leak sanitizer: Checks for memory leaks; really part of the address sanitizer, but can be enabled in a mode just to detect leaks with -fsanitize=leak: http://clang.llvm.org/docs/LeakSanitizer.html
● Memory sanitizer: -fsanitize=memory – Checks for use of uninitialized memory: http://clang.llvm.org/docs/MemorySanitizer.html
● Thread sanitizer: -fsanitize=thread – Checks for race conditions: http://clang.llvm.org/docs/ThreadSanitizer.html
● Undefined-behavior sanitizer: -fsanitize=undefined – Checks for the execution of undefined behavior: http://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
● Efficiency sanitizer [Recent development]: -fsanitize=efficiency-cache-frag, -fsanitize=efficiency-working-set (-fsanitize=efficiency-all to get both)
And there's more, check out http://clang.llvm.org/docs/ and Clang's include/clang/Basic/Sanitizers.def for more information.
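As a small, hedged illustration of what the leak checks catch, the function below (a made-up example, not from the slides) allocates and never frees; built with clang -fsanitize=address (or -fsanitize=leak), LeakSanitizer prints a leak report at exit, while without the flag it runs normally.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 42, but leaks the allocation: under -fsanitize=address or
   -fsanitize=leak this is reported as "detected memory leaks" at exit.
   The function itself is well-defined either way. */
int leaky_answer(void) {
  int *p = malloc(4 * sizeof *p);
  if (!p) return -1;
  p[0] = 42;
  return p[0]; /* bug: p is never freed */
}
```

Changing `a[8]`-style out-of-bounds indexing into such a program is the analogous way to exercise the address sanitizer's heap-buffer-overflow check.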
13
Clang Can Compile CUDA!
$ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch>
For example:--cuda-gpu-arch=sm_35
When compiling, you may also need to pass --cuda-path=/path/to/cuda if you didn’t install the CUDA SDK into /usr/local/cuda (or a few other “standard” locations).
For more information, see: http://llvm.org/docs/CompileCudaWithLLVM.html
● CUDA is the language used to write code for NVIDIA GPUs.
● Support now also developed by AMD as part of their HIP project.
● Clang's CUDA support aims to provide better support for modern C++ than NVIDIA's nvcc.
14
Many Derived Vendor Products
Upstream LLVM Code Base
→ Vendor A, Vendor B, Vendor C: each adds optimizations and other features, then packages and ships its own product.
15
On Interacting with Open-Source LLVM
When compilers are open source:
● DOE staff, collaborators, contractors, etc. can directly contribute features and bug fixes.
● And we can review code to keep quality high (and keep the open-source community healthy).
● We can enhance user productivity by delivering new features and bug fixes to our users quickly…
● This critically depends on testing!
Vendor(s)
Staff, collaborators, contractors, etc.
open-source LLVM
CI/testing system with a large set of DOE applications and libraries with correctness tests!
DOE/ECP’s CI effort is important here!
Facilities+
Happy users!
Bug fixes (and simple features) deployed to users in days (not weeks or months)!
Flang
Alexis Perry-Holby (Los Alamos National Laboratory), Patrick McCormick (Los Alamos National Laboratory), Douglas Miles (NVIDIA), Stephen Scalpone (NVIDIA), Hal Finkel (Argonne National Laboratory), David Bernholdt (Oak Ridge National Laboratory), Brian Friesen (Lawrence Berkeley National Laboratory)
Feb. 6th, 2020
2
Flang – Current Status
• Working on upstreaming the Flang (F18) codebase to LLVM’s main monorepo
 – Crucial to the long-term success of the project
 – Identified items are worked on collaboratively by the community
• Technical work scope details:
 – Working on resolution of type-bound generics and operators
 – Completed support of forward references to derived types
 – Implemented logical expression lowering
 – Began character expression lowering
 – Created an expression lowering test framework
 – Continued work on DO loop semantic checks, especially where zeroes are not allowed
 – Continued work on FIR definition
3
Flang – Path to Merging into the LLVM Monorepo
• CMake changes to support in-tree building
 – PR coming soon
• Style changes that take us closer to LLVM
 – Clang-format changes to bring Flang more in line with LLVM (PR #945)
 – Rationalization of public/private headers (merged PR #943)
 – Renaming of files from .cc to .cpp (merged PR #958)
• Making more general use of LLVM APIs and data structures– Discussion of detailed list of changes is ongoing on flang-dev mailing list
• Port testing to use LLVM tools (lit and FileCheck)– PR #941 ports the test suite to lit– Separate discussions of various custom scripts are ongoing
• Build compiler support
 – Need a plan for improving public buildbot coverage and moving in a direction that allows ECP-centric coverage to be addressed
The LLVM Compiler Infrastructure
4
Flang – Growing Community
The LLVM Compiler Infrastructure
5
Flang – Project Timeline
• Schedule is currently driven by phases of the compiler (parsing, semantic analysis, lowering, etc.)
• Schedule priority is on getting the infrastructure complete for a functionally correct, sequential compiler – it does not necessarily reflect the steps required for full LLVM adoption.
 – Additional efforts are working on foundational pieces (shared with Clang) for OpenMP support (i.e., the OpenMP IR Builder)
• Fortran-centric optimizations, including the runtime library, will follow in a series of releases timed with an early release within ECP and then a drop into LLVM’s 6-month release cycle
[Timeline diagram, 11/2019–4/2021. Front end: parse, semantic analysis, lower to FIR (Fortran 2018, OpenMP 5.x). Mid-level: FIR analysis & optimization. Back end: lower to LLVM IR, LLVM code generation. Milestones: prototype OpenMP IR Builder (shared functionality with Clang); ECP release: Fortran 2018, LLVM 11 upstream target (7/2020); ECP release: Fortran 2018 and OpenMP, LLVM 12 upstream target; ECP release: Fortran 2018 and OpenMP, LLVM 13 upstream target; prototype releases: Fortran 2018 and OpenMP (OpenMP IR with AMD and Intel targets; extended features, performance). Notes: current community design discussions are considering two MLIR dialects here, one for Fortran-centric constructs and a second for OpenMP; an MLIR dialect for OpenMP has been proposed (Arm).]
Kokkos Clang Interactions
David Poliakoff, Sandia National Laboratories, [email protected]
Clang-Tidy for Kokkos
• Kokkos exposes semantics that allow you to say things the developers know to be wrong
• clang-tidy checks can flag these patterns
Sanitizer Integration
• Kokkos call stacks are terrible, but Kokkos supports runtime tooling
• Make sanitizers aware of Kokkos tooling (Kokkos field name)
JIT in Kokkos
• In the past we’ve looked at using ClangJIT to speed up RAJA kernels, fixing loop bounds to speed up finite element codes
• There are also finite element codes in Kokkos
• ClangJIT in Kokkos
• Pictured: a place to consider implementing JIT
Programming GPUs without GPUs
• GPU development can be tricky. While the tools that exist are minor miracles, they’re still rough in places.
• Simulate the semantics of GPU memory using Kokkos and ASAN
Kokkos Autotuning
• Kokkos Tools (likely) adding a Tuning callback interface
• Obvious: tune Kokkos parameters (CUDA block size, different execution strategies)
• Fun: tune compiler optimization strategies
• Problem: you’d need a group of compiler experts to help make the fun part a reality
• Contributions welcome!
User-Directed Loop Transformations
What are User-Directed Loop Transformations?

#pragma clang transform tile sizes(4,4)
for (int i = 0; i < m; i += 1)
  for (int j = 0; j < n; j += 1)
    Body(i,j);

for (int i1 = 0; i1 < m; i1 += 4)
  for (int j1 = 0; j1 < n; j1 += 4)
    for (int i2 = i1; i2 < min(i1+4,m); i2 += 1)
      for (int j2 = j1; j2 < min(j1+4,n); j2 += 1)
        Body(i2,j2);
Optimization: lower the bar to try out what’s executing faster
Autotuning: let machine learning do it
Performance portability: different transformations for different platforms
Maintainability: organize your code to be more understandable and avoid code duplication
User-Directed Loop Transformations
Loop Transformation Zoo
Tiling

#pragma clang transform tile sizes(4,4)
for (int i = 0; i < m; i += 1)
  for (int j = 0; j < n; j += 1)
    Body(i,j);

for (int i1 = 0; i1 < m; i1 += 4)
  for (int j1 = 0; j1 < n; j1 += 4)
    for (int i2 = i1; i2 < min(i1+4,m); i2 += 1)
      for (int j2 = j1; j2 < min(j1+4,n); j2 += 1)
        Body(i2,j2);

Unrolling

#pragma clang transform unroll partial(4)
for (int i = 0; i < n; i += 1)
  Body(i);

for (int i = 0; i < n; i += 4) {
  Body(i);
  Body(i+1);
  Body(i+2);
  Body(i+3);
}

Fusion

#pragma clang transform fuse
for (int i = 0; i < n; i += 1)
  BodyA(i);
for (int i = 0; i < n; i += 1)
  BodyB(i);

for (int i = 0; i < n; i += 1) {
  BodyA(i);
  BodyB(i);
}

Space-Filling Curves

#pragma clang transform spacefill curve(hilbert)
for (int i = 0; i < m; i += 1)
  for (int j = 0; j < n; j += 1)
    Body(i,j);

for (int idx = 0; idx < m*n; idx += 1) {
  tie(i,j) = hilbert2d_from_index(idx,m,n);
  Body(i,j);
}

Interchange

#pragma clang transform interchange
for (int i = 0; i < m; i += 1)
  for (int j = 0; j < n; j += 1)
    Body(i,j);

for (int j = 0; j < n; j += 1)
  for (int i = 0; i < m; i += 1)
    Body(i,j);

Reversal

#pragma clang transform reverse
for (int i = 0; i < n; i += 1)
  Body(i);

for (int i = n-1; i >= 0; i -= 1)
  Body(i);
User-Directed Loop Transformations
Maintainable Performance Improvements

[Bar chart: BLAS dgemm – double-precision FP operations per time unit as a percentage of peak for -O3 -march=native, Netlib CBLAS, #pragma SLPVectorizer, #pragma LoopVectorizer, Polly MatMul, ATLAS, OpenBLAS, and Intel MKL 2018.3, against theoretical peak.]

[Bar charts: execution times for Polybench syr2k, SPEC CPU 2006 456.hmmer, Polybench heat-3d, and Polybench covariance under various transformation combinations (fission, tile, interchange, parallel, OpenMP, OpenMP target, Polly, unroll-and-jam).]
User-Directed Loop Transformations
OpenMP Standardization
Why Standardize?
Composable with OpenMP directives
Works with multiple compilers
Encourages compiler vendors to implement loop transformations
Specification Status
New tile directive in TR8 (preview of OpenMP 5.1)
Working on also adding an unroll directive to OpenMP 5.1 (target Nov 2020)
More planned for OpenMP 6.0 (target Nov 2023):
 Follow-up transformations
 Loop identifiers
 Transformation options
 More transformations…
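For reference, the TR8 tile directive applies to a following loop nest roughly as sketched below (my example); a compiler without OpenMP 5.1 support simply ignores the pragma and runs the loops as written, so the result is the same either way.

```c
#include <assert.h>

/* Sum f(i,j) = i*n + j over an m-by-n iteration space. The pragma asks an
   OpenMP-5.1-aware compiler to tile the nest 4x4; others ignore it. */
long tiled_sum(int m, int n) {
  long s = 0;
  #pragma omp tile sizes(4, 4)
  for (int i = 0; i < m; i += 1)
    for (int j = 0; j < n; j += 1)
      s += (long)i * n + j;
  return s;
}
```

Because tiling only reorders iterations, the directive is valid exactly when the loop body is independent of iteration order, as here.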
User-Directed Loop Transformations
Implementation Status
Front-End Parsing
Under review for upstreaming: https://reviews.llvm.org/D69088
#pragma clang transform syntax
Only transformations that LLVM has passes for: distribution, unrolling, unroll-and-jam, vectorization, interleaving
Applying Transformations
Prototype based on Polly: github.com/SOLLVE/llvm-project/tree/pragma-polly
More and composable transformations: tiling, interchange, reversal, unroll(-and-jam), array packing, thread parallelization
Usual Polly restrictions apply
Long term, looking into a more encompassing loop optimizer
Optimizing (Parallel) Programs
with interprocedural optimizations and (parallelism/)runtime awareness
SC’19 OpenMP Booth talk on OpenMP in LLVM: https://youtu.be/6yOa-hRi63M
LLVM Developers Conference 2018: Talk: https://youtu.be/zfiHaPaoQPc
LLVM Developers Conference 2019: Talk: https://youtu.be/CzWkc_JcfS0 Tutorial: https://youtu.be/HVvvCSSLiTw
LLVM Developers Conference 2019: https://youtu.be/elmio6AoyK0
ISC’19 Talk: https://doi.org/10.1007/978-3-030-20656-7_13
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Clacc: OpenACC Support for Clang and LLVM
Joel E. Denny, Seyong Lee, Jeffrey S. Vetter
Future Technologies Group, ORNL
https://ft.ornl.gov/ [email protected]
Feb 6, 2020 ECP Annual Meeting: LLVM Session
ECP 2.3.2.10 STDT PROTEAS-TUNE
2
Clacc Background
OpenACC
• Launched 2010 as a portable directive-based programming model in C, C++, and Fortran for heterogeneous accelerators
• Best known for NVIDIA GPUs; implementations have targeted AMD GCN, multicore CPU, Intel Xeon Phi, FPGA
• Compared to OpenMP:
 – Descriptive vs. prescriptive
 – Many features ported to OpenMP
 – Specification less complex
• OpenACC 3.0 released in Nov 2019
Clacc
• US Exascale Computing Project (ECP)
• Goal: open-source, production-quality, standard-conforming OpenACC compiler support for Clang and LLVM
• Why?
 – Needed for HPC app development and OpenACC adoption and evolution
 – GCC is the only open-source, production-quality compiler supporting OpenACC
• Design: translate OpenACC to OpenMP to build on OpenMP support in Clang
3
Clacc Design
• AST transformation
 – OpenACC AST for source-level tools: pretty printers, analyzers, lint tools, debugger and editor extensions, etc.
 – OpenMP AST for source-to-source: reuse OpenMP implementation and tools, automatically port apps, etc.
 – Clang AST is immutable by design
 – Using Clang’s TreeTransform facility
• Two compilation modes
 – Traditional compilation: OpenACC source to executable
 – Source-to-source: OpenACC source to OpenMP source
• Future: MLIR OpenACC dialect?
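To make the design concrete, here is a hedged sketch of the kind of mapping involved; the exact OpenMP directives Clacc emits may differ from this hand-written translation.

```c
#include <assert.h>

/* OpenACC input (the pragma is ignored by compilers without OpenACC
   support, so the function also runs correctly as serial C): */
void saxpy_acc(int n, float a, const float *x, float *y) {
  #pragma acc parallel loop
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

/* One plausible OpenMP rendering of the same loop: */
void saxpy_omp(int n, float a, const float *x, float *y) {
  #pragma omp target teams distribute parallel for
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}
```

Source-to-source mode would emit text like `saxpy_omp` from `saxpy_acc`, which is what makes the output reusable with other OpenMP compilers.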
4
2019 Progress on OpenACC Support
• Support for device offloading (e.g., NVIDIA GPUs) in addition to multicore
– Clacc poster presents preliminary SPEC ACCEL benchmark results for NVIDIA GPU
• Source-to-source mode now uses Clang’s Rewrite facility
– Avoids most preprocessor expansions in generated OpenMP source
– More usable for targeting different OpenMP compilers and for app migration
• Support for OpenACC Profiling Interface
 – Layer over OMPT plus Clacc extensions
 – Prototype of OMPT offloading support, which isn’t included in LLVM OpenMP
 – Most events supported, not all features
• Support for implicit gang clauses, an unspecified behavior implemented by all major OpenACC compilers
• Support for data construct, update directive, subarrays, and various clauses
• Regular maintenance– Test suite, bug fixes, docs, user feedback– Updates for OpenACC spec revisions– Continuous integration of upstream Clang
and LLVM
• Investigation of Gitlab CI and SLURM for accelerated building and testing of Clacc on ORNL’s ExCL cluster
5
2019 Upstream Clang and LLVM Contributions
Clang, LLVM, and OpenMP Improvements
• Fixes and OpenMP 5.0 extension for Clang Parse and Sema for OpenMP
• Fixes for Debian/Ubuntu nvidia-cuda-toolkit support (affects OpenMP/OpenACC offloading)
• Fixes for Clang pragma location tracking issue to enable Clacc’s use of Rewrite
• Fixes for signedness tests in LLVM’s arbitrary-precision integer type (APSInt)
• Fix for Clang -ast-print (affects Claccsource-to-source mode)
Testing Infrastructure Improvements
• LIT
 – LLVM Integrated Tester: LLVM’s testing framework
 – Fixes for a series of issues related to LIT internal shell commands
 – Added LIT_OPTS env var to pass command-line options through ninja/make
• FileCheck
 – Tool used pervasively in Clacc and LLVM test suites for verifying test-case output
 – Made various improvements related to previous contributions, primarily debugging facilities for FileCheck
6
2019 OpenACC Specification Contributions
• Clarifications for the semantics of combinations of seq, independent, auto, gang, worker, and vector clauses
• Clarification about implicit independent clauses on orphaned loops
• Clarifications about the resolution of conflicts between device-specific and default clauses
• Clarifications about when loop reductions update reduction variables
• Clarifications for compute constructs, their restrictions, and their implicitly determined data attributes
• Corrections for the OpenACC Profiling Interface
• Series of examples to clarify subtle issues for device_type clauses and reductions
• Launch of an OpenACC rationale document with initial content related to implicit clauses
7
Path Forward
Development Strategy
• Focus on C and then C++
• Focus on behavioral correctness
 – Prescriptive OpenACC interpretation
 – Many-to-one mapping to OpenMP
• Then performance
 – Descriptive OpenACC interpretation
 – Analyses for best mapping to OpenMP
 – Investigate advanced LLVM analyses
Clacc Access
• For now, email us: [email protected]
• Might be hosted publicly with ECP LLVM integration repo… if appropriate
• Otherwise, likely hosted publicly on ORNL Gitlab server
• Eventually upstream to Clang and LLVM
8
Clacc Takeaways
• Overview
 – Objective: production-quality OpenACC compiler support for Clang and LLVM
 – Design: translate OpenACC to OpenMP to build on existing OpenMP support in Clang
• Join Us
 – Future Technologies Group, Oak Ridge National Laboratory
 – Hiring interns, postdocs, research and technical staff
 – External collaborators welcome
• Clacc poster at ST Poster Session
https://ft.ornl.gov/ | [email protected]
Clacc: Translating OpenACC to OpenMP in Clang, Joel E. Denny, Seyong Lee, and Jeffrey S. Vetter, 2018 IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), Dallas, TX, USA, 2018.
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Exploring MLIR for OpenACC
Valentin Clement, Jeffrey S. Vetter
ECP 2.3.2.10 PROTEAS-TUNE, ECP Annual Meeting, February 6, 2020, Houston, Texas
http://ft.ornl.gov
2
Exploring MLIR for OpenACC - Goals
• Definition of an MLIR OpenACC dialect compatible with the core dialects
• Use the MLIR dialect for open-source OpenACC compiler support for Flang (f18) and LLVM
[Pie chart: accelerator programming model usage, runtime-weighted (INCITE, up to June 2019): OpenMP offload 4.2%, OpenACC 24.5%, CUDA Fortran 4.4%, CUDA 66.9%. Source: XALT/Reuben Budiardja, NCCS]
3
MLIR – Multi-Level Intermediate Representation
func @testFunction(%arg0: i32) {
%x = mydialect.op(%arg0) : (i32) -> i32
br ^bb1
^bb1:
%y = addi %x, %x : i32
return %y : i32
}
- SSA-based design
- Module/function/block/instruction structure
- Round-trippable textual form
- Syntactically similar to LLVM IR
- Progressive lowering -> general lowering passes = more code reuse
- Great location tracking
Operations, not instructions
• No predefined set of instructions; a collection of dialects
• Operations are like “opaque functions” to MLIR
(Callouts in the example above: dialect prefix, operation id, arguments, argument type, return type, result name.)
4
MLIR Core dialects
OpenMP, OpenACC
5
OpenACC dialect – !$acc parallel loop

func @saxpy(%x: memref<1024xf32>, %y: memref<1024xf32>,
%n: index, %a: f32) -> memref<1024xf32> {
%c0 = constant 0 : index
%c1 = constant 1 : index
// y[i] = a*x[i] + y[i];
acc.parallel {
acc.loop {
loop.for %arg0 = %c0 to %n step %c1 {
%xi = load %x[%arg0] : memref<1024xf32>
%yi = load %y[%arg0] : memref<1024xf32>
%ax = mulf %a, %xi : f32
%yy = addf %ax, %yi : f32
store %yy, %y[%arg0] : memref<1024xf32>
}
} attributes { independent }
} attributes { num_gangs = 8, num_workers = 128 }
return %y : memref<1024xf32>
}

Attributes are attached to the operations; each operation’s region is the code it impacts. acc.loop should support different loop operations: loop.for, affine.for, fir.do.
6
OpenACC dialect - lowering
7
OpenACC lowered to GPU dialect
func @saxpy(%arg0: memref<1024xf32>, %arg1: memref<1024xf32>, %arg2: index, %arg3: f32) -> memref<1024xf32> {
%c0 = constant 0 : index
%c1 = constant 1 : index
%c8 = constant 8 : index
%c128 = constant 128 : index
%c1_0 = constant 1 : index
gpu.launch blocks(%arg4, %arg5, %arg6) in (%arg10 = %c8, %arg11 = %c1_0, %arg12 = %c1_0)
threads(%arg7, %arg8, %arg9) in (%arg13 = %c128, %arg14 = %c1_0, %arg15 = %c1_0)
args(%arg16 = %arg0, %arg17 = %arg1, %arg18 = %arg3, %arg19 = %c0, %arg20 = %arg2, %arg21 = %c1)
: memref<1024xf32>, memref<1024xf32>, f32, index, index, index {
%0 = muli %arg21, %arg4 : index
%1 = addi %arg19, %0 : index
%2 = muli %arg21, %arg10 : index
loop.for %arg22 = %1 to %arg20 step %2 {
%3 = load %arg16[%arg22] : memref<1024xf32>
%4 = load %arg17[%arg22] : memref<1024xf32>
%5 = mulf %arg18, %3 : f32
%6 = addf %5, %4 : f32
store %6, %arg17[%arg22] : memref<1024xf32>
}
gpu.terminator
}
return %arg1 : memref<1024xf32>
}
$ mlir-opt --convert-openacc-to-gpu saxpy.mlir
> This code is then lowered down to NVVM/LLVM IR and passed to LLVM
8
Work to be done
f18
• Parsing
• Semantics
• AST lowering
MLIR
• Dialect design
• Optimization
• Progressive lowering
Runtime
• Plug into a compatible runtime
Concurrency in LLVM
Pat McCormick, George Stelle, Alexis Perry-Holby, EJ Park,Nirmal Prajapati, Daniel Shevitz *
TB Schardl, William Moses, Charles Leiserson +
* Los Alamos National Laboratory+ MIT
February 2020
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Overview
Slide 2
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Recent Developments
• Reduction improvement
• Improved FleCSI and Kokkos support
• Concurrent region analysis
• Refactored backends
• Race condition prevention
• Realm backend runtime wrapper
• LLVM 9 rebase
• Concurrent SSA theory
Slide 3
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Static Single Assignment (SSA)
define f(x) {
entry:
  cond = and x, 1
  br cond, a, b
a:
  y = mul 4, x
  call g()
  br cond, b, c
b:
  z = add x, x
  call h()
  br cond, c, a
c:
  r = add y, z
  ret r
}
Slide 4
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Control Flow Graph (CFG)
define f(x) {
entry:
  cond = and x, 1
  br cond, a, b
a:
  y = mul 4, x
  call g()
  br cond, b, c
b:
  z = add x, x
  call h()
  br cond, c, a
c:
  r = add y, z
  ret r
}
entry
a b
c
Slide 5
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Dominator Trees
entry
a b
c
entry
a b c
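The flat tree shown (entry is the immediate dominator of a, b, and c, since a and b can each be reached without passing through the other) can be reproduced with the textbook iterative dataflow algorithm; this is a minimal sketch where the node encoding and function name are mine.

```c
#include <assert.h>

/* The slide's CFG: entry -> {a, b}, a -> {b, c}, b -> {c, a}, c exits. */
enum { ENTRY, A, B, C, N };

static const int succ[N][2] = {
  [ENTRY] = {A, B},
  [A]     = {B, C},
  [B]     = {C, A},
  [C]     = {-1, -1},
};

/* Iterative dataflow: dom(v) = {v} union (intersection of dom(p) over
   all predecessors p of v), iterated to a fixed point. Sets are bitsets. */
void dominators(unsigned dom[N]) {
  const unsigned full = (1u << N) - 1;
  for (int v = 0; v < N; ++v) dom[v] = full;
  dom[ENTRY] = 1u << ENTRY;
  int changed = 1;
  while (changed) {
    changed = 0;
    for (int v = 0; v < N; ++v) {
      if (v == ENTRY) continue;
      unsigned meet = full;
      for (int p = 0; p < N; ++p)   /* scan edges for predecessors of v */
        for (int k = 0; k < 2; ++k)
          if (succ[p][k] == v) meet &= dom[p];
      unsigned next = meet | (1u << v);
      if (next != dom[v]) { dom[v] = next; changed = 1; }
    }
  }
}
```

Running it gives dom(a) = {entry, a}, dom(b) = {entry, b}, dom(c) = {entry, c}: each of a, b, c is dominated only by entry and itself, which is exactly the flat dominator tree on the slide.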
Slide 6
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Limitation of Dominator Trees
define f(x) {
entry:
  cond = and x, 1
  br cond, a, b
a:
  y = mul 4, x
  call g()
  br cond, b, c
b:
  z = add x, x
  call h()
  br cond, c, a
c:
  r = add y, z
  ret r
}
CFG
entry
a b
c
DomTree
entry
a b c
Slide 7
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Dominator DAG
define f(x) {
entry:
  cond = and x, 1
  br cond, a, b
a:
  y = mul 4, x
  call g()
  br cond, b, c
b:
  z = add x, x
  call h()
  br cond, c, a
c:
  r = add y, z
  ret r
}
CFG
entry
a b
c
DomDAG
entry
a b
c
Slide 8
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Correctness
Valid Paths
LLVM ⊆ Conditional CFG ⊆ CFG
Dominator Relation
CFG ⊆ Conditional CFG ⊆ LLVM
Slide 9
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Concurrency
fork
a b
join
Slide 10
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Tapir
fork
a b
join
fork
a b
join
Slide 11
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Concurrency
fork
a b
join
fork
a b
join
Slide 12
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Questions?
Slide 13
Operated by Triad National Security, LLC for the U.S. Department of Energy’s NNSA
Business Sensitive Information
Working toward an ECP fork of LLVM
Jeffrey Vetter
ECP Annual Meeting6 Feb 2020
2
Deep Dive: Improving the LLVM Compiler Ecosystem
LLVM
• Very popular open source compiler infrastructure
• Easily extensible
• Widely used and contributed to in industry
• Permissive license
• Used for heterogeneous computing
+SOLLVE
• Enhancing the implementation of OpenMP in LLVM
• Unified memory
• OMP Optimizations
• Prototype OMP features for LLVM
• OMP test suite
+PROTEAS
• Core optimization improvements to LLVM
• OpenACC capability for LLVM
• Autotuning for OpenACC and OpenMP in LLVM
• Integration with Tau performance tools
+FLANG
• Developing an open-source, production Fortran frontend
• Upstream to LLVM public release
• Support for OpenMP and OpenACC
• Approved by LLVM
+ATDM
• Enhancing LLVM to optimize template expansion for FleCSI, Kokkos, RAJA, etc.
• Flang testing and evaluation
Vendors
• Increasing dependence on LLVM
• Collaborations with many vendors using LLVM
• AMD
• ARM
• Cray
• HPE
• IBM
• Intel
• NVIDIA
Active involvement with broad LLVM community: LLVM Dev, EuroLLVM
3
ECP LLVM Integration and Deployment
• Develop an integrated ECP LLVM distribution
– Integrating different ECP projects using LLVM
– CI on target architectures
– Shared vehicle for improvements in LLVM
– Increased collaboration within ECP
– If vendor or LLVM compiler fails, we have a functioning risk mitigation solution
• Operations
– The ECP LLVM distro will be a closely maintained fork of the LLVM monorepo
– Individual ECP projects will exist as git branches
– Branches will be integrated into ECP LLVM as they mature
• Periodic upstreaming and patching of LLVM monorepo
4
Next Steps
• Create new repo - easy
• Operations
– Select projects
– Merge existing projects into ecp.llvm as branches
– Leverage CI infrastructure for our platforms of interest
• Request contingency funding for existing projects to merge, maintain, down/upstream changes
– See Mike’s presentation from yesterday
• More info: [email protected]