Transcript of "Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs," a Presentation From Altera


Page 1: "Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs," a Presentation From Altera

Copyright © 2015 Altera 1

Dr. Deshanand Singh, Director of Software Engineering

12 May 2015

Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs

Page 2

• Convolutional Neural Network

• Feed-forward network inspired by biological processes

• Composed of different layers

• More layers generally increase accuracy

• Convolutional Layer

• Extracts different features from the input

• Low-level features, e.g. edges, lines, corners

• Pooling Layer

• Reduces variance; invariant to small translations

• Takes the max or average value of a feature over a region of the image

• Applications

• Classification & Detection, Image recognition/tagging

Convolutional Neural Network (CNN)

Prof. Hinton’s CNN Algorithm

Page 3

Basic Building Block of the CNN

[Figure: Image → Convolution → Pooling]

Source: Li Deng, Deep Learning Technology Center, Microsoft Research, Redmond, WA, USA. A Tutorial at International Workshop on Mathematical Issues in Information Sciences.

Page 4

• Dataflow through the CNN can proceed in pipelined fashion

• No need to wait until the entire execution is complete

• Can start a new set of data into stage 1 as soon as that stage completes its execution

Key Observation: Pipelining

[Figure: pipeline of Layer 1 Convolving Kernel → Layer 1 Pooling Kernel → Layer 2 Convolving Kernel]

Page 5

FPGAs: Compute fabrics that support pipelining

• Basic Element: a 1-bit configurable operation paired with a 1-bit register to store the result

• Can be configured to perform any 1-bit operation: AND, OR, NOT, ADD, SUB

• Basic Elements are surrounded with a flexible interconnect, and can be grouped to build wider functions: a 16-bit add, a 32-bit sqrt, or your custom 64-bit bit-shuffle and encode

• Memory Blocks (20 Kb each, with addr, data_in, and data_out ports) can be configured and grouped using the interconnect to create various cache architectures

• Dedicated floating-point multiply and add blocks

• Blocks are connected into a custom data-path that matches your application.

Page 6

Programming FPGAs: SDK for OpenCL

[Figure: tool flow. Users write software: an OpenCL host program and OpenCL kernels]

• The host program is compiled with a standard C compiler and linked against the Altera OpenCL host library to produce the host binary

• The kernels are compiled with the Altera Kernel Compiler, an FPGA-specific compilation target, to produce the device binary

Page 7

OpenCL Programming Model

• Host + Accelerator Programming Model

• Sequential host program on a microprocessor

• Function offload onto a highly parallel accelerator device

[Figure: host processor connected to an accelerator device; the device contains several accelerators, each with local memory, sharing a global memory]

__kernel void sum(__global float *a,
                  __global float *b,
                  __global float *y)
{
    int gid = get_global_id(0);
    y[gid] = a[gid] + b[gid];
}

main() {
    read_data( … );
    manipulate( … );
    clEnqueueWriteBuffer( … );
    clEnqueueNDRange( …, sum, … );
    clEnqueueReadBuffer( … );
    display_result( … );
}

Page 8

• Data-parallel function

• Defines many parallel threads of execution

• Each thread has an identifier specified by “get_global_id”

• Contains keyword extensions to specify parallelism and memory hierarchy

• Executed by a compute device:

• CPU

• GPU

• FPGA

• DSP

• Other Accelerators

OpenCL Kernels

__kernel void sum(__global const float *a,
                  __global const float *b,
                  __global float *answer)
{
    int xid = get_global_id(0);
    answer[xid] = a[xid] + b[xid];
}

float *a      = {0, 1, 2, 3, 4, 5, 6, 7}
float *b      = {7, 6, 5, 4, 3, 2, 1, 0}

__kernel void sum( … );

float *answer = {7, 7, 7, 7, 7, 7, 7, 7}

Page 9

• On each cycle the portions of the pipeline are processing different threads

• While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Dataflow / Pipeline Architecture from OpenCL

[Figure: five snapshots of the Load / Load → + → Store pipeline working through the 8 thread IDs (0–7) of the vector-add example; each cycle one new thread enters the Load stage while the threads ahead of it advance to the add and Store stages]

Page 10

Convolutions: Our Basic Building Block

Inew(x, y) = Σ(y′ = −1 to 1) Σ(x′ = −1 to 1) Iold(x + x′, y + y′) × F(x′, y′)

Page 11

Processor (CPU/GPU) Implementation

[Figure: CPU with a cache in front of main memory]

• A cache can hide poor memory access patterns

for (int y = 1; y < height-1; ++y) {
    for (int x = 1; x < width-1; ++x) {
        for (int y2 = -1; y2 <= 1; ++y2) {
            for (int x2 = -1; x2 <= 1; ++x2) {
                i2[y][x] += i[y+y2][x+x2] * filter[y2][x2];
            }
        }
    }
}

Page 12

• Example Performance Point: 1 pixel per cycle

• Cache requirements: 9 reads + 1 write per cycle

• Expensive hardware!

• Power overhead

• Cost overhead: more built-in addressing flexibility than we need

• Why not customize the cache for the application?

FPGA Implementation?

[Figure: a custom data-path fed by a “cache” with 9 read ports, backed by memory]

Page 13

Optimizing the “Cache”

• Start out with the initial picture that is W pixels wide

• Remove all the lines that aren’t in the neighborhood of the window

• Take all of the lines and arrange them as a 1D array of pixels

• Discard the pixels at the edges that we don’t need for the computation

• What happens when we move the window one pixel to the right?

Page 14

Optimizing the “Cache”

[Figure: after the window moves one pixel to the right, the same 1D array of pixels applies; one new pixel enters and the oldest pixel is discarded]

Page 15

Shift Registers in Software

• Managing data movement to match the FPGA’s architectural strengths is key to obtaining high performance.

pixel_t sr[2*W+3];
pixel_t data_out[9];
while (keep_going) {
    // Shift data in (high-to-low so sequential software
    // semantics match the hardware register chain)
    #pragma unroll
    for (int i = 2*W+2; i > 0; --i)
        sr[i] = sr[i-1];
    sr[0] = data_in;

    // Tap output data: the 3x3 window
    data_out = {sr[0],   sr[1],     sr[2],
                sr[W],   sr[W+1],   sr[W+2],
                sr[2*W], sr[2*W+1], sr[2*W+2]};
    // ...
}

Page 16

• CIFAR-10 Dataset

• 60000 32x32 color images

• 10 classes

• 6000 images per class

• 50000 training images

• 10000 test images

• Many CNN implementations are available for CIFAR-10

• Cuda-convnet provides a baseline implementation that many works build upon

Building a CNN for CIFAR-10 on an FPGA

The 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.

[Figure: the network implemented on the FPGA: 2-D convolution (5x5), max-pooling, and fully connected layers]

Page 17

• High-latency: Requires access to global memory

• High memory-bandwidth requirements

• Requires host coordination to pass buffers from one kernel to another

CIFAR-10 CNN: Traditional OpenCL Implementation

[Figure: Conv 1, Pool 1, and Conv 2 kernels, each reading from and writing to buffers in global memory (DDR)]

Stratix V Implementation:

• Processes 183 images per second for CIFAR-10

Page 18

• Low-latency communication between kernels

• Significantly less memory bandwidth requirements

• Host is not involved in coordinating communication between kernels

CIFAR-10 CNN: Kernel-to-Kernel Channels

[Figure: Conv 1 → Pool 1 → … → Conv N kernels connected directly by channels; only the first and last kernels use buffers in global memory (DDR)]

Stratix V Implementation:

• Processes 400 images per second for CIFAR-10

• Channel declaration (create a queue):

  channel int my_channel;

• Channel write (push data into the queue):

  void write_channel_altera(channel &ch, value_type data);
  write_channel_altera(my_channel, x);

• Channel read (pop the first element from the queue):

  value_type read_channel_altera(channel &ch);
  int y = read_channel_altera(my_channel);

Page 19

• The entire algorithm can be expressed in ~500 lines of OpenCL for the FPGA

• Kernels are written as standard building blocks that are connected together through channels

• The shift-register convolution building block is the portion of this code that is used most heavily

• The concept of having multiple concurrent kernels executing simultaneously and communicating directly on a device is currently unique to FPGAs

• Will be portable in OpenCL 2.0 through the concept of “OpenCL Pipes”

CIFAR-10: FPGA Code

#pragma OPENCL EXTENSION cl_altera_channel : enable

// Declaration of Channel API data types
channel float prod_conv1_channel;
channel float conv1_pool1_channel;
channel float pool1_conv2_channel;
channel float conv2_pool2_channel;
channel float pool2_conv3_channel;
channel float conv3_pool3_channel;
channel float pool3_cons_channel;

__kernel void convolutional_neural_network_prod(
    int batch_id_begin, int batch_id_end,
    __global const volatile float * restrict input_global)
{
    for (...) {
        write_channel_altera(prod_conv1_channel, input_global[...]);
        write_channel_altera(prod_pool3_channel, input_global[...]);
    }
}

Page 20

• Altera’s Arria-10 family of FPGAs introduces DSP blocks with a dedicated floating-point mode.

• Each DSP includes an IEEE 754 single-precision floating-point multiplier and adder

• FPGA logic blocks (lookup tables and registers) are no longer needed to implement floating point functions

• Massive resource savings allows many more processing pipelines to run simultaneously on the FPGA

Effect of Floating Point FPGAs

Arria-10 Implementation:

• Processes 6800 images per second for CIFAR-10

Page 21

• CNN implementations are well suited to pipelined implementations

• Exploiting pipelining on the FPGA requires some attention to coding style to overcome the inherent assumptions of writing “software”

• FPGAs do not have caches

• Need to exploit data reuse in a more explicit way

• The concept of dataflow pipelining will not realize its full potential if we write intermediate results to memory

• Bandwidth limitations begin to dominate compute

• Use direct kernel-to-kernel communication called channels

• Native support for floating point on the FPGA allows an order-of-magnitude performance increase

Lessons Learned

Page 22

• Altera’s SDK for OpenCL

• http://www.altera.com/products/software/opencl/opencl-index.html

• FPGA Optimized OpenCL examples (filters, convolutions, …)

• http://www.altera.com/support/examples/opencl/opencl.html

• CIFAR-10 dataset

• http://www.cs.toronto.edu/~kriz/cifar.html

• CUDA-Convnet

• https://code.google.com/p/cuda-convnet/

Resources