Implement Runtime Environments for HSA using LLVM
-
Upload
jim-huang -
Category
Technology
-
view
2.919 -
download
0
description
Transcript of Implement Runtime Environments for HSA using LLVM
![Page 1: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/1.jpg)
Implement Runtime Environments for HSA using LLVM
Jim Huang ( 黃敬群 ) <[email protected]>Oct 17, 2013 / ITRI, Taiwan
![Page 2: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/2.jpg)
About this presentation
• Transition to heterogeneous– mobile phone to data centre
• LLVM and HSA• HSA driven computing environment
![Page 3: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/3.jpg)
Transition to Heterogeneous Environments
![Page 4: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/4.jpg)
![Page 5: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/5.jpg)
Herb Sutter’s new outlook
http://herbsutter.com/welcome-to-the-jungle /
“In the twilight of Moore’s Law, the transitions to multicore processors, GPU computing, and HaaS cloud computing are
not separate trends, but aspects of a single trend – mainstream computers from desktops to ‘smartphones’ are
being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single compute-
intensive application will need to harness different kinds of cores, in immense numbers, to get its job done.”
“The free lunch is over.Now welcome to the hardware jungle.”
![Page 6: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/6.jpg)
Four causes of heterogeneity
• Multiple types of programmable core• CPU (lightweight, heavyweight)• GPU• Accelerators and application-specific
• Interconnect asymmetry• Memory hierarchies• Service oriented software demanding
![Page 7: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/7.jpg)
but getting closer day by day...
Embedded HPC
Type of processors
Heterogeneous Homogeneous
Size Small Massive
Memory Shared Distributed
High Performance vs. Embedded
![Page 8: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/8.jpg)
Heterogeneity is mainstream
Quad-core ARM Cortex A9 CPUQuad-core SGX543MP4+ Imagination GPU
Most tablets and smartphones are already powered by heterogeneous processors.
Dual-core ARM 1.4GHz, ARMv7s CPUTriple-core SGX554MP4 Imagination GPU
![Page 9: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/9.jpg)
Source: Heterogeneous Computing in ARM
![Page 10: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/10.jpg)
![Page 11: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/11.jpg)
Current limitations
• Disjoint view of memory spaces between CPUs and GPUs
• Hard partition between “host” and “devices” in programming models
• Dynamically varying nested parallelism almost impossible to support
• Large overheads in scheduling heterogeneous, parallel tasks
![Page 12: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/12.jpg)
Background Facilities
![Page 13: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/13.jpg)
OpenCL Framework overview
MPU GPU
__kernel void dot(__global const float4 *a __global const float4 *b __global float4 *c){ int tid = get_global_id(0); c[tid] = a[tid] * b[tid];}
OpenCL KernelsHost ProgramApplication
OpenCL Framework
Platform
int main(int argc, char **argv){ ... clBuildProgram(program, ...); clCreateKernel(program, “dot”...); ...}
Runtime APIs
Platform APIs
OpenCL CompilerOpenCL Runtime
Front-endFront-end
Back-endBack-end
Separate programs into host-side and kernel-side code fragment
Compiler• compile OpenCL C
language just-in-timeRuntime
• allow host program to manipulate context
MPU : host, kernel programGPU : kernel program
Source: OpenCL_for_Halifux.pdf, OpenCL overview, Intel Visual Adrenaline
![Page 14: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/14.jpg)
OpenCL Execution Scenario
kernel{ code fragment 1 code fragment 2 code fragment 3}
kernel A{ code 1 code 2 code 3}
X O
11 22 33
Task-level parallelism
A:1,2,3data(0)A:1,2,3data(0)
A:1,2,3data(1)A:1,2,3data(1)
A:1,2,3data(2)A:1,2,3data(2)
A:1,2,3data(3)A:1,2,3data(3)
A:1,2,3data(4)A:1,2,3data(4)
A:1,2,3data(5)A:1,2,3data(5)
kernel A{ code 1}
kernel B{ code 2}
kernel C{ code 3}
A:1A:1 B:2B:2 C:3C:3
A:1A:1 B:2B:2 C:3C:3
OpenCL RuntimeOpenCL Runtime
Data-level parallelism
14
![Page 15: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/15.jpg)
Supporting OpenCL
• Syntax parsing by compiler– qualif ier– vector– built-in function– Optimizations on
single core
• Runtime implementation– handle multi-core issues– Co-work with device vendor
__kernel void add( __global float4 *a, __global float4 *b, __global float4 *c){ int gid = get_global_id(0); float4 data = (float4) (1.0, 2.0, 3.0, 4.0); c[gid] = a[gid] + b[gid] + data;}
Runtime APIsPlatform APIs
MPU GPU
![Page 16: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/16.jpg)
LLVM-based compilation toolchain
Module 1Module 1
Module NModule N
.
.
.
EstimationEstimation
LinkedmoduleLinkedmodule
Front endsFront ends
opt &
link
opt &
link
Profiling?
Profiling? llilli
Profileinfo
Profileinfo
Partitioning & mappingPartitioning & mapping
Module 1Module 1
Module M
Module M
.
.
.
Back end 1Back end 1
Back end M
Back end M
Exe1
Exe1
ExeM
ExeM
.
.
.
.
.
.
yes
no
Data inData in
![Page 17: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/17.jpg)
HSA: Heterogeneous System Architecture
![Page 18: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/18.jpg)
HSA overview
• open architecture specification• HSAIL virtual (parallel) instruction set• HSA memory model• HSA dispatcher and run-time
• Provides an optimized platform architecture for heterogeneous programming models such as OpenCL, C++AMP, et al
18
![Page 19: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/19.jpg)
HSA Runtime Stack
![Page 20: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/20.jpg)
Heterogeneous Programming
• Unified virtual address space for all cores• CPU and GPU• distributed arrays
• Hardware queues per code with lightweight user mode task dispatch• Enables GPU context switching, preemption,
efficient heterogeneous scheduling
• First class barrier objects• Aids parallel program composability
![Page 21: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/21.jpg)
HSA Intermediate Layer (HSAIL)• Virtual ISA for parallel programs• Similar to LLVM IR and OpenCL SPIR• Finalised to specific ISA by a JIT compiler• Make late decisions on which core should run a
task• Features:
• Explicitly parallel• Support for exceptions, virtual functions and other high-level
features• syscall methods (I/O, printf etc.)• Debugging support
![Page 22: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/22.jpg)
HSA stack for Android (conceptual)
![Page 23: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/23.jpg)
HSA memory model
• Compatible with C++11, OpenCL, Java and .NET
memory models
• Relaxed consistency
• Designed to support both managed language (such
as Java) and unmanaged languages (such as C)
• Will make it much easier to develop 3rd party
compilers for a wide range of heterogeneous
products• E.g. Fortran, C++, C++AMP, Java et al
![Page 24: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/24.jpg)
HSA dispatch
• HSA designed to enable heterogeneous task queuing• A work queue per core (CPU, GPU, …)• Distribution of work into queues• Load balancing by work stealing
• Any core can schedule work for any other, including itself
• Significant reduction in overhead of scheduling work for a core
![Page 25: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/25.jpg)
Command Flow Data Flow
Soft Queue
Soft Queue
Kernel Mode Driver
Kernel Mode Driver
Application A
Application A
Command Buffer
User Mode Driver
User Mode Driver
Direct3DDirect3D
DMA Buffer
Hardware Queue
A GPU HARDWARE
GPU HARDWARE
Today’s Command and Dispatch Flow
![Page 26: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/26.jpg)
Today’s Command and Dispatch FlowCommand Flow Data Flow
Soft Queue
Soft Queue
Kernel Mode Driver
Kernel Mode Driver
Application A
Application A
Command Buffer
User Mode Driver
User Mode Driver
Direct3DDirect3D
DMA Buffer
Command Flow Data Flow
Soft Queue
Soft Queue
Kernel Mode Driver
Kernel Mode Driver
Application C
Application C
Command Buffer
User Mode Driver
User Mode Driver
Direct3DDirect3D
DMA Buffer
Command Flow Data Flow
Soft Queue
Soft Queue
Kernel Mode Driver
Kernel Mode Driver
Application B
Application B
Command Buffer
User Mode Driver
User Mode Driver
Direct3DDirect3D
DMA Buffer
Hardware Queue
A GPU HARDWARE
GPU HARDWARE
C
BA B
![Page 27: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/27.jpg)
HSA Dispatch Flow Software View✔ User-mode dispatch to hardware✔ No kernel mode driver overhead✔ Low dispatch latency
Software View✔ User-mode dispatch to hardware✔ No kernel mode driver overhead✔ Low dispatch latency
Hardware View✔ HW / microcode controlled✔ HW scheduling ✔ HW-managed protection
Hardware View✔ HW / microcode controlled✔ HW scheduling ✔ HW-managed protection
![Page 28: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/28.jpg)
HSA enabled dispatch
Application / RuntimeApplication / Runtime
CPU2CPU2CPU1CPU1 GPUGPU
![Page 29: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/29.jpg)
HSA Memory Model
• compatible with C++11/Java/.NET memory models• Relaxed consistency memory model• Loads and stores can be re-ordered by the finalizer
![Page 30: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/30.jpg)
Data Flow of HSA
![Page 31: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/31.jpg)
HSAIL
• intermediate language for parallel compute in HSA
– Generated by a high level compiler (LLVM, gcc, Java VM, etc)
– Compiled down to GPU ISA or parallel ISAFinalizer
– Finalizer may execute at run time, install time or build time
• low level instruction set designed for parallel compute in a shared virtual memory environment.
• designed for fast compile time, moving most optimizations to HL compiler
– Limited register set avoids full register allocation in finalizer
![Page 32: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/32.jpg)
GPU based Languages
• Dynamic Language for exploration of heterogeneous parallel runtimes
• LLVM-based compilation– Java, Scala, JavaScript, OpenMP
• Project Sumatra:GPU Acceleration for Java in OpenJDK
• ThorScript
![Page 33: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/33.jpg)
Accelerating Java
Source: HSA, Phil Rogers
![Page 34: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/34.jpg)
Open Source software stack for HSA
Component Name Purpose
HSA Bolt Library Enable understanding and debug
OpenCL HSAIL Code Generator Enable research
LLVM Contributions Industry and academic collaboration
HSA Assembler Enable understanding and debug
HSA Runtime Standardize on a single runtime
HSA Finalizer Enable research and debug
HSA Kernel Driver For inclusion in Linux distros
A Linux execution and compilation stack is open-sourced by AMD
• Jump start the ecosystem• Allow a single shared implementation where appropriate• Enable university research in all areas
![Page 35: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/35.jpg)
Open Source Tools
• Hosted at GitHub: https://github.com/HSAFoundation
• HSAIL-Tools: Assembler/Disassembler• Instruction Set Simulator• HSA ISS Loader Library for Java and C++ for
creation and dispatch HSAIL kernel
![Page 36: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/36.jpg)
HSAIL example ld_kernarg_u64 $d0, [%_out]; ld_kernarg_u64 $d1, [%_in];
@block0: workitemabsid_u32 $s2, 0; cvt_s64_s32 $d2, $s2; mad_u64 $d3, $d2, 8, $d1; ld_global_u64 $d3, [$d3]; //pointer ld_global_f32 $s0, [$d3+0]; // x ld_global_f32 $s1, [$d3+4]; // y ld_global_f32 $s2, [$d3+8]; // z mul_f32 $s0, $s0, $s0; // x*x mul_f32 $s1, $s1, $s1; // y*y add_f32 $s0, $s0, $s1; // x*x + y*y mul_f32 $s2, $s2, $s2; // z*z add_f32 $s0, $s0, $s2;// x*x + y*y + z*z sqrt_f32 $s0, $s0; mad_u64 $d4, $d2, 4, $d0; st_global_f32 $s0, [$d4]; ret;
sqrt((float (i*i + (i+1)*(i+1) + (i+2)*(i+2)))
sqrt((float (i*i + (i+1)*(i+1) + (i+2)*(i+2)))
![Page 37: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/37.jpg)
Conclusions
• Heterogeneity is an increasingly important trend• The market is finally starting to create and adopt
the necessary open standards• HSA should enable much more dynamically
heterogeneous nested parallel programs and programming models
![Page 38: Implement Runtime Environments for HSA using LLVM](https://reader034.fdocuments.net/reader034/viewer/2022052315/554fb19ab4c90586258b519e/html5/thumbnails/38.jpg)
Reference
• Heterogeneous Computing in ARM (2013)• Project Sumatra: GPU Acceleration for Java in
OpenJDK– http://openjdk.java.net/projects/sumatra/
• HETEROGENEOUS SYSTEM ARCHITECTURE, Phil Rogers