Transcript of 18-643 Lecture 11: Intel OpenCL for FPGA (30 pages)
(earlier F17 handout: "OpenCL and Altera OpenCL", users.ece.cmu.edu/~jhoe/course/ece643/F17handouts/L11.pdf)

Page 1:

18-643-F20-L11-S1, James C. Hoe, CMU/ECE/CALCM, ©2020

18-643 Lecture 11: Intel OpenCL for FPGA

Shashank Obla
Department of ECE
Carnegie Mellon University

Page 2:

Housekeeping

• Your goal today: understand Intel’s interpretation of OpenCL for FPGAs

• Notices
  – Handout #6: lab 2, due Monday noon, 10/12
  – Handout #7: lab 3 (look for it on Friday)
  – Project status report due each Friday
    3 weeks to project proposal!!

• Readings
  – skim Ch 10, Reconfigurable Computing
  – for concrete reference: Intel SDK for OpenCL: Programming Guide and Best Practices Guide

Page 3:

Khronos’ OpenCL

Page 4:

Two Parts to OpenCL

1. Platform model
   – host (processor & memory)
   – 1 or more accelerator devices
     + device-side mem hierarchy: global/local/private
   – APIs for host-thread to interact with devices
     • launch compute kernels to devices
     • prepare (load/unload) device memory

2. Kernel programming language
   – perfect triply-nested loops
   – no loop-carried dependence

OpenCL terms introduced in bold

Page 5:

OpenCL Platform Model

[Figure: a host CPU with host memory, connected to multiple compute
devices, each with its own global memory]

What are these “compute devices”???

Page 6:

Basic Host Program Example

main ( ) {
   . . . get device handle and queue . . .
   . . . allocate memory buf objects . . .
   . . . get kernel object . . .
   while ( ) {
      . . . initialize memory buf data . . .
      . . . bind buf objects to kernel arguments . . .
      . . . add kernel and buf objects to device queue . . .
      . . . wait for kernel to finish . . .
      . . . retrieve memory buf object for result . . .
   }
}

What are these “kernels”???

Page 7:

What are these kernels?

Specifically talking about OpenCL C

Page 8:

Conceptually . . .

for (int i=0; i < R0; i++)
   for (int j=0; j < R1; j++)
      for (int k=0; k < R2; k++) {
         << local variable declarations >>
         << arbitrary C-code with access to global memory >>
      }

• Loop body must be data-parallel
  – local variables limited to scope
  – disallow loop-carried dependencies through global memory
  ==> statements from different iterations can interleave in any order
      (using disambiguated local variables)

Page 9:

Concretely . . .

• Only specify loop body as a kernel function

__kernel void foo(<< pointers to global mem buf >>) {
   int i=get_global_id(2), j=get…(1), k=get…(0);
   << local variable declarations >>
   << arbitrary C-code with access to global memory >>
}

• Triply-nested loops hardwired as NDRange
  – specified as 3 integer constants, i.e., the loop bounds (R0, R1, R2)
  – 1 execution of the kernel function is a work-item
    work-item has private memory for local var’s
  – 1 kernel execution is R0×R1×R2 work-items

Page 10:

Example: N-by-N MMM

__kernel void mmm(__global float *A, … *B, … *C) {
   int i=get_global_id(1);
   int j=get_global_id(0);
   for(int k=0; k<N; k++)
      C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j];
}

• NDRange=(N, N, 1)
  – kernel function executed by N×N×1 work-items
  – each work-item sees a different combination of dimension-0 and
    dimension-1 global id’s
  – no assumption about work-items’ relative progress

Page 11:

(For Your Reference: N-by-N MMM in C)

float A[N][N], B[N][N], C[N][N];

for(int i=0; i<N; i++)
   for(int j=0; j<N; j++) {
      for(int k=0; k<N; k++)
         C[i][j]=C[i][j]+A[i][k]*B[k][j];
   }

• Note:
  – Loop body of the inner-most loop is not data-parallel---dependency
    through C[i][j]
  – Loop body of the second inner-most loop is data-parallel

Page 12:

Work-Group

• Partition NDRange of R0×R1×R2 work-items into 3D work-groups of
  G0×G1×G2 work-items
  – G0/1/2 must divide R0/1/2 evenly
  – get_local_id(dim): id within group
  – get_group_id(dim): id of group

• Work-group signifies “locality” b/w work-items
  – execute together by a processing element
  – can share per-group local memory
  – can synchronize by barrier()

Why do we need this?

[Figure: NDRange=(9, 3, 6), Group Size=(3, 3, 3), Group Range=(3, 1, 2)]

Page 13:

OpenCL Kernels on GPGPUs

• A work-item is a CUDA thread
• A work-group executes as a thread block---broken down into
  32-work-item SIMD Warps
• Work-groups from same and different kernels are interleaved on a
  Streaming Processor
  128 SIMD lanes, 1 INT + 1 FMA per lane, 1.73GHz
• 1 kernel can execute on up to 20 StrmProc’s (as 1 compute device),
  peak 8,873 GFLOPS
• Global=GDDR; local=shared memory, 96KB SRAM; private=register file,
  256KB SRAM

Nvidia terms in italic-bold; numbers for GTX 1080

Page 14:

Aside: Barrel Processor [HEP, Smith]

[Figure: pipeline diagram showing instructions InstT1 through InstT8,
one per thread, interleaved through the pipeline cycle by cycle]

• Each cycle, select a “ready” thread from scheduling pool
  – only one instruction per thread in flight at once
  – on a long latency stall, remove the thread from scheduling
• Simpler and faster pipeline implementation since
  – no data dependence, hence, no stall or forwarding
  – no penalty in making pipeline deeper

Page 15:

To get 8,873 GFLOPS . . .

• # work-items ≥ 128 x StrmProc pipeline depth x 20
• Computation entirely of Fused-Multiply-Add
  Interleaved warps so no worries about RAW stalls
• No if-then-else (branch divergence)

BTW:
• ld’s and st’s take up inst. issue BW off-the-top
• 320 GB/sec DRAM BW ==> AI > 108 SP FP ops per float
  – only certain access patterns can sustain peak BW
  – SIMD ld’s and st’s in a warp must go to the same memory line
    (memory coalescing)

Page 16:

Intel OpenCL for FPGA

Page 17:

OpenCL FPGA Platform Model

[Figure: a host CPU with host memory connected to an FPGA; the FPGA
holds multiple custom kernel devices, each with its own global
memory (DRAM)]

Compute devices synthesized from kernel functions

Page 18:

Example: N-by-N MM“Add”

__kernel void mma(__global float *A, … *B, … *C) {
   int i=get_global_id(1);
   int j=get_global_id(0);
   C[i*N+j]=A[i*N+j]+B[i*N+j];
}

• NDRange=(N, N, 1)
• Note in this example:
  – data-parallel kernel function
  – no loop in kernel function

Page 19:

Fully-pipelined Kernel Datapath

[Figure: datapath in which gid0 and gid1 feed address calculations for
&A[i*N+j] and &B[i*N+j]; two load units fetch A[i*N+j] and B[i*N+j];
an adder produces C; a third address calculation for &C[i*N+j] feeds a
store unit. NDRange = (N,N,1); (gid0,gid1) stream =
(0,1),…(0,N-1),(1,0)…(1,N-1)……(N-1,0)…(N-1,N-1)]

fully unrolled loop in kernel fxn also okay

Page 20:

What about MMM?

__kernel void mmm(__global float *A, … *B, … *C) {
   int i=get_global_id(1);
   int j=get_global_id(0);
   for(int k=0; k<N; k++)
      C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j];
}

• NDRange=(N, N, 1)
• Can’t easily pipeline work-items like before
• PE pipelines the k iterations
  – dependency on C[i*N+j]
  – kernel function scope limits the tricks we can play

Page 21:

Use of Single Work-Item Kernel

__kernel void mmm(__global float *A, … *B, … *C) {
   for(int i=0; i<N; i++)
      for(int j=0; j<N; j++)
         for(int k=0; k<N; k++)
            C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j];
}

• NDRange=(1, 1, 1)   never do this on GPU!!
• Arbitrary control flow (loops, if’s) and dependencies
• Becomes just “regular” C-to-HW synthesis
  – pipeline and parallelize loops
  – schedule for initiation-interval, resource, etc.

Only want OpenCL’s platform model and API; “work-group” & “work-item”
not too meaningful

Page 22:

Use of Kernel-Kernel Channels

• GPU multi-kernel OpenCL program
  – kernel computes from global-mem to global-mem
  – next kernel continues from last kernel’s output buf
  – producer and consumer kernels fully serialized

• For streaming processing on FPGA, connect kernels with streaming
  channels to bypass DRAM
  – concurrent producer and consumer kernels
  – reduce DRAM and PCIe bandwidth requirements

[Figure: host --PCIe--> kernelA, with load/store to global mem (DRAM);
kernelA --chnl--> kernelB --chnl--> krnlC --PCIe--> host]

Different programming mindset from GPU

Page 23:

Program-to-HW, not Function-to-IP

[Figure: host CPU and host mem connect over PCIe to the FPGA. The
FPGA’s Shell holds the PCIe controller, the memory controller to
device DRAM, and the global interconnect; the User Space holds HLS
kernels with local memories joined by a local interconnect]

Object of design/optimization is the entire FPGA system

Page 24:

Inside an example: Tiled MMM

for (int i = 0; i < HA; i += BHA) {
  for (int j = 0; j < WB; j += BWB) {
    for (int k = 0; k < WA; k += BWA) {
      for (int ii = 0; ii < BHA; ++ii) {
        for (int jj = 0; jj < BWB; ++jj) {
          #pragma unroll
          for (int kk = 0; kk < BWA; ++kk) {
            C[(i + ii) * WC + j + jj] +=
                A[(i + ii) * WA + k + kk] *
                B[(k + kk) * WB + j + jj];
} } } } } }

Page 25:

Performance-Geared Synthesis

• Automatic pipelining and unrolling
  – no pipeline directive, but disable_loop_pipelining so you still
    have control
  – unroll factor controls extent of unrolling

• Local memory optimizations
  – if you don’t say anything, the tool does its best
  – banks, replicates, private copies and interconnect

[Figure: a local memory organized as banks (bank 0, bank 1), each with
replicates (replicate 0, replicate 1); bytes/cycle = width × ports]

Page 26:

Global Memory Interconnect

• Load-Store Units
  – automatically chosen based on access pattern

                     Sequential        Random Access
    Area Efficient   Prefetching       Pipelined
    Area Expensive   Burst-Coalesced   Burst-Coalesced Cached (loads only)

• Constant cache memory
• DDR banking automatically handled

Page 27:

Revisiting the example: Tiled MMM

for (int i = 0; i < HA; i += BHA) {
  for (int j = 0; j < WB; j += BWB) {
    for (int k = 0; k < WA; k += BWA) {
      for (int ii = 0; ii < BHA; ++ii) {
        for (int jj = 0; jj < BWB; ++jj) {
          #pragma unroll
          for (int kk = 0; kk < BWA; ++kk) {
            C[(i + ii) * WC + j + jj] +=
                A[(i + ii) * WA + k + kk] *
                B[(k + kk) * WB + j + jj];
} } } } } }

Page 28:

Using Local Memory

for (int ii = 0; ii < BHA; ++ii) {
  #pragma unroll 1
  for (int jj = 0; jj < BWB; ++jj) {
    #pragma unroll
    for (int kk = 0; kk < BWA; ++kk) {
      C_local[ii][jj] += A_local[ii][kk] * B_local[jj][kk];
} } }

for (int ii = 0; ii < BHA; ++ii) {
  #pragma unroll
  for (int jj = 0; jj < BWB; ++jj) {
    #pragma unroll
    for (int kk = 0; kk < BWA; ++kk) {
      C_local[ii][jj] += A_local[ii][kk] * B_local[jj][kk];
} } }

Page 29:

Also, Systolic Array Style Coding

• Express structure
  – describe models such as systolic arrays
  – instantiate explicit registers – timing
  – vectorize operations
• Not so readable as “program”

[Figure: PE array fed by FedA and FedB]

float PE(struct vec_float_t_bool valA, vec_float_t valB, float *accum) {
  float oldAcc = __fpga_reg(accum[0]);
  float sum = (valA.c ? 0.0f : oldAcc);
  #pragma unroll
  for (int i = 0; i < VEC; i++) {
    sum += valA.data[i] * valB[i];
  }
  #pragma unroll
  for (int i = 0; i < ACCUM_SIZE - 1; i++) {
    accum[i] = accum[i + 1];
  }
  accum[ACCUM_SIZE - 1] = sum;
  return oldAcc;
}

while(1) {
  ...
  #pragma unroll
  for (int row = 0; row < ROWS; row++) {
    #pragma unroll
    for (int col = 0; col < COLS; col++) {
      // compute and store outputs in shift register
      float result = PE(fedA[row], fedB[col], accum[row][col]);
      if (fedA[row].c) {
        drain[col][row * ACCUM_SIZE] = result;
      }
      fedA[row].data = __fpga_reg(__fpga_reg(fedA[row].data));
      fedA[row].c = __fpga_reg(__fpga_reg(fedA[row].c));
      fedB[col] = __fpga_reg(__fpga_reg(fedB[col]));
    }
  }
}

An excerpt from a 500-line device code

Page 30:

Parting Thoughts

• OpenCL = platform model/API + SIMD language
  – kernel language forces regular and explicit parallelism
  – SIMD parallelism different on GPUs vs FPGAs

• FPGA OpenCL = platform model/API + memory system + “regular” HLS
  – single-work-item kernel unlocks parallelism style
  – kernel-kernel channels alleviate DRAM and PCIe bottleneck for
    streaming use cases
  – develop/debug/analysis tools integral to appeal

GPUs great at SIMD; FPGAs good for more than SIMD