Transcript of 18-643 Lecture 11: Intel OpenCL for FPGA (30 pages)
(earlier F17 handout: "OpenCL and Altera OpenCL", users.ece.cmu.edu/~jhoe/course/ece643/F17handouts/L11.pdf)

Page 1:

18-643-F20-L11-S1, James C. Hoe, CMU/ECE/CALCM, ©2020

18-643 Lecture 11: Intel OpenCL for FPGA

Shashank Obla
Department of ECE
Carnegie Mellon University

Page 2:

Housekeeping

• Your goal today: understand Intel’s interpretation of OpenCL for FPGAs

• Notices
  – Handout #6: lab 2, due Monday noon, 10/12
  – Handout #7: lab 3 (look for it on Friday)
  – Project status report due each Friday
    3 weeks to project proposal!!

• Readings
  – skim Ch 10, Reconfigurable Computing
  – for concrete reference: Intel SDK for OpenCL: Programming Guide and Best Practices Guide

Page 3:

Khronos’ OpenCL

Page 4:

Two Parts to OpenCL

1. Platform model
   – host (processor & memory)
   – 1 or more accelerator devices
     + device-side mem hierarchy: global/local/private
   – APIs for host-thread to interact with devices
     • launch compute kernels to devices
     • prepare (load/unload) device memory

2. Kernel programming language
   – perfect triply-nested loops
   – no loop-carried dependence

OpenCL terms introduced in bold

Page 5:

OpenCL Platform Model

[Figure: a host CPU with host memory, connected to multiple compute
devices, each with its own global memory]

What are these “compute devices”???

Page 6:

Basic Host Program Example

main ( ) {
   . . . get device handle and queue . . .
   . . . allocate memory buf objects . . .
   . . . get kernel object . . .
   while ( ) {
      . . . initialize memory buf data . . .
      . . . bind buf objects to kernel arguments . . .
      . . . add kernel and buf objects to device queue . . .
      . . . wait for kernel to finish . . .
      . . . retrieve memory buf object for result . . .
   }
}

What are these “kernels”???

Page 7:

What are these kernels?

Specifically talking about OpenCL C

Page 8:

Conceptually . . .

for (int i=0; i < R0; i++)
   for (int j=0; j < R1; j++)
      for (int k=0; k < R2; k++) {
         << local variable declarations >>
         << arbitrary C-code with access to global memory >>
      }

• Loop body must be data-parallel
  – local variables limited to scope
  – disallow loop-carried dependencies through global memory
  ==> statements from different iterations can interleave in any order
      (using disambiguated local variables)

Page 9:

Concretely . . .

• Only specify loop body as a kernel function

__kernel void foo(<< pointers to global mem buf >>) {
   int i=get_global_id(2), j=get…(1), k=get…(0);
   << local variable declarations >>
   << arbitrary C-code with access to global memory >>
}

• Triply-nested loops hardwired as NDRange
  – specified as 3 integer constants, i.e., the loop bounds (R0, R1, R2)
  – 1 execution of the kernel function is a work-item
    work-item has private memory for local var’s
  – 1 kernel execution is R0×R1×R2 work-items

Page 10:

Example: N-by-N MMM

__kernel void mmm(__global float *A, … *B, … *C) {
   int i=get_global_id(1);
   int j=get_global_id(0);
   for(int k=0; k<N; k++)
      C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j];
}

• NDRange=(N, N, 1)
  – kernel function executed by N×N×1 work-items
  – each work-item sees a different combination of dimension-0 and
    dimension-1 global id’s
  – no assumption about work-items’ relative progress

Page 11:

(For Your Reference: N-by-N MMM in C)

float A[N][N], B[N][N], C[N][N];

for(int i=0; i<N; i++)
   for(int j=0; j<N; j++) {
      for(int k=0; k<N; k++)
         C[i][j]=C[i][j]+A[i][k]*B[k][j];
   }

• Note:
  – Loop body of the inner-most loop is not data-parallel---dependency
    through C[i][j]
  – Loop body of the second inner-most loop is data-parallel

Page 12:

Work-Group

• Partition NDRange of R0×R1×R2 work-items into 3D work-groups of
  G0×G1×G2 work-items
  – G0/1/2 must divide R0/1/2 evenly
  – get_local_id(dim): id within group
  – get_group_id(dim): id of group

• Work-group signifies “locality” b/w work-items
  – execute together by a processing element
  – can share per-group local memory
  – can synchronize by barrier()

Why do we need this?

[Figure: NDRange=(9, 3, 6), Group Size=(3, 3, 3), Group Range=(3, 1, 2)]

Page 13:

OpenCL Kernels on GPGPUs

• A work-item is a CUDA thread
• A work-group executes as a thread block---broken down into
  32-work-item SIMD Warps
• Work-groups from same and different kernels are interleaved on a
  Streaming Processor
  128 SIMD lanes, 1 INT + 1 FMA per lane, 1.73GHz
• 1 kernel can execute on up to 20 StrmProc’s (as 1 compute device),
  peak 8,873 GFLOPS
• Global=GDDR; local=shared memory, 96KB SRAM; private=register file,
  256KB SRAM

Nvidia terms in italic-bold; numbers for GTX 1080

Page 14:

Aside: Barrel Processor [HEP, Smith]

[Figure: pipeline diagram showing instructions InstT1 through InstT8,
one per thread, interleaved through the pipeline cycle by cycle]

• Each cycle, select a “ready” thread from scheduling pool
  – only one instruction per thread in flight at once
  – on a long latency stall, remove the thread from scheduling
• Simpler and faster pipeline implementation since
  – no data dependence, hence, no stall or forwarding
  – no penalty in making pipeline deeper

Page 15:

To get 8,873 GFLOPS . . .

• # work-items ≥ 128 x StrmProc pipeline depth x 20
• Computation entirely of Fused-Multiply-Add
  Interleaved warps so no worries about RAW stalls
• No if-then-else (branch divergence)

BTW:
• ld’s and st’s take up inst. issue BW off-the-top
• 320 GB/sec DRAM BW ==> AI > 108 SP FP ops per float
  – only certain access patterns can sustain peak BW
  – SIMD ld’s and st’s in a warp must go to the same memory line
    (memory coalescing)

Page 16:

Intel OpenCL for FPGA

Page 17:

OpenCL FPGA Platform Model

[Figure: a host CPU with host memory connected to an FPGA; the FPGA
holds multiple custom kernel devices, each with its own global
memory (DRAM)]

Compute devices synthesized from kernel functions

Page 18:

Example: N-by-N MM“Add”

__kernel void mma(__global float *A, … *B, … *C) {
   int i=get_global_id(1);
   int j=get_global_id(0);
   C[i*N+j]=A[i*N+j]+B[i*N+j];
}

• NDRange=(N, N, 1)
• Note in this example:
  – data-parallel kernel function
  – no loop in kernel function

Page 19:

Fully-pipelined Kernel Datapath

[Figure: datapath in which gid0 and gid1 feed address calculations for
&A[i*N+j] and &B[i*N+j]; two load units fetch A[i*N+j] and B[i*N+j];
an adder produces C; a third address calculation for &C[i*N+j] feeds a
store unit. NDRange = (N,N,1); (gid0,gid1) stream =
(0,1),…(0,N-1),(1,0)…(1,N-1)……(N-1,0)…(N-1,N-1)]

fully unrolled loop in kernel fxn also okay

Page 20:

What about MMM?

__kernel void mmm(__global float *A, … *B, … *C) {
   int i=get_global_id(1);
   int j=get_global_id(0);
   for(int k=0; k<N; k++)
      C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j];
}

• NDRange=(N, N, 1)
• Can’t easily pipeline work-items like before
• PE pipelines the k iterations
  – dependency on C[i*N+j]
  – kernel function scope limits the tricks we can play

Page 21:

Use of Single Work-Item Kernel

__kernel void mmm(__global float *A, … *B, … *C) {
   for(int i=0; i<N; i++)
      for(int j=0; j<N; j++)
         for(int k=0; k<N; k++)
            C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j];
}

• NDRange=(1, 1, 1)   never do this on GPU!!
• Arbitrary control flow (loops, if’s) and dependencies
• Becomes just “regular” C-to-HW synthesis
  – pipeline and parallelize loops
  – schedule for initiation-interval, resource, etc.

Only want OpenCL’s platform model and API; “work-group” & “work-item”
not too meaningful

Page 22:

Use of Kernel-Kernel Channels

• GPU multi-kernel OpenCL program
  – kernel computes from global-mem to global-mem
  – next kernel continues from last kernel’s output buf
  – producer and consumer kernels fully serialized

• For streaming processing on FPGA, connect kernels with streaming
  channels to bypass DRAM
  – concurrent producer and consumer kernels
  – reduce DRAM and PCIe bandwidth requirements

[Figure: host --PCIe--> kernelA, with load/store to global mem (DRAM);
kernelA --chnl--> kernelB --chnl--> krnlC --PCIe--> host]

Different programming mindset from GPU

Page 23:

Program-to-HW, not Function-to-IP

[Figure: host CPU and host mem connect over PCIe to the FPGA. The
FPGA’s Shell holds the PCIe controller, the memory controller to
device DRAM, and the global interconnect; the User Space holds HLS
kernels with local memories joined by a local interconnect]

Object of design/optimization is the entire FPGA system

Page 24:

Inside an example: Tiled MMM

for (int i = 0; i < HA; i += BHA) {
  for (int j = 0; j < WB; j += BWB) {
    for (int k = 0; k < WA; k += BWA) {
      for (int ii = 0; ii < BHA; ++ii) {
        for (int jj = 0; jj < BWB; ++jj) {
          #pragma unroll
          for (int kk = 0; kk < BWA; ++kk) {
            C[(i + ii) * WC + j + jj] +=
                A[(i + ii) * WA + k + kk] *
                B[(k + kk) * WB + j + jj];
} } } } } }

Page 25:

Performance-Geared Synthesis

• Automatic pipelining and unrolling
  – no pipeline directive, but disable_loop_pipelining so you still
    have control
  – unroll factor controls extent of unrolling

• Local memory optimizations
  – if you don’t say anything, the tool does its best
  – banks, replicates, private copies and interconnect

[Figure: a local memory organized as banks (bank 0, bank 1), each with
replicates (replicate 0, replicate 1); bytes/cycle = width × ports]

Page 26:

Global Memory Interconnect

• Load-Store Units
  – automatically chosen based on access pattern

                     Sequential        Random Access
    Area Efficient   Prefetching       Pipelined
    Area Expensive   Burst-Coalesced   Burst-Coalesced Cached (loads only)

• Constant cache memory
• DDR banking automatically handled

Page 27:

Revisiting the example: Tiled MMM

for (int i = 0; i < HA; i += BHA) {
  for (int j = 0; j < WB; j += BWB) {
    for (int k = 0; k < WA; k += BWA) {
      for (int ii = 0; ii < BHA; ++ii) {
        for (int jj = 0; jj < BWB; ++jj) {
          #pragma unroll
          for (int kk = 0; kk < BWA; ++kk) {
            C[(i + ii) * WC + j + jj] +=
                A[(i + ii) * WA + k + kk] *
                B[(k + kk) * WB + j + jj];
} } } } } }

Page 28:

Using Local Memory

for (int ii = 0; ii < BHA; ++ii) {
  #pragma unroll 1
  for (int jj = 0; jj < BWB; ++jj) {
    #pragma unroll
    for (int kk = 0; kk < BWA; ++kk) {
      C_local[ii][jj] += A_local[ii][kk] * B_local[jj][kk];
} } }

for (int ii = 0; ii < BHA; ++ii) {
  #pragma unroll
  for (int jj = 0; jj < BWB; ++jj) {
    #pragma unroll
    for (int kk = 0; kk < BWA; ++kk) {
      C_local[ii][jj] += A_local[ii][kk] * B_local[jj][kk];
} } }

Page 29:

Also, Systolic Array Style Coding

• Express structure
  – describe models such as systolic arrays
  – instantiate explicit registers – timing
  – vectorize operations
• Not so readable as “program”

[Figure: PE array fed by FedA and FedB]

float PE(struct vec_float_t_bool valA, vec_float_t valB, float *accum) {
  float oldAcc = __fpga_reg(accum[0]);
  float sum = (valA.c ? 0.0f : oldAcc);
  #pragma unroll
  for (int i = 0; i < VEC; i++) {
    sum += valA.data[i] * valB[i];
  }
  #pragma unroll
  for (int i = 0; i < ACCUM_SIZE - 1; i++) {
    accum[i] = accum[i + 1];
  }
  accum[ACCUM_SIZE - 1] = sum;
  return oldAcc;
}

while(1) {
  ...
  #pragma unroll
  for (int row = 0; row < ROWS; row++) {
    #pragma unroll
    for (int col = 0; col < COLS; col++) {
      // compute and store outputs in shift register
      float result = PE(fedA[row], fedB[col], accum[row][col]);
      if (fedA[row].c) {
        drain[col][row * ACCUM_SIZE] = result;
      }
      fedA[row].data = __fpga_reg(__fpga_reg(fedA[row].data));
      fedA[row].c = __fpga_reg(__fpga_reg(fedA[row].c));
      fedB[col] = __fpga_reg(__fpga_reg(fedB[col]));
    }
  }
}

An excerpt from a 500-line device code

Page 30:

Parting Thoughts

• OpenCL = platform model/API + SIMD language
  – kernel language forces regular and explicit parallelism
  – SIMD parallelism different on GPUs vs FPGAs

• FPGA OpenCL = platform model/API + memory system + “regular” HLS
  – single-work-item kernel unlocks parallelism style
  – kernel-kernel channels alleviate DRAM and PCIe bottleneck for
    streaming use cases
  – develop/debug/analysis tools integral to appeal

GPUs great at SIMD; FPGAs good for more than SIMD