OpenCL framework for a CPU, GPU, and FPGA Platform
by
Taneem Ahmed
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
Copyright © 2011 by Taneem Ahmed
Abstract
OpenCL framework for a CPU, GPU, and FPGA Platform
Taneem Ahmed
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2011
With the availability of multi-core processors, high capacity FPGAs, and GPUs, a hetero-
geneous platform with tremendous raw computing capacity can be constructed consisting
of any number of these computing elements. However, one of the major challenges for
constructing such a platform is the lack of a standardized framework under which an ap-
plication’s computational task and data can be easily and effectively managed amongst
the computing elements. In this thesis work such a framework is developed based on
OpenCL (Open Computing Language). An OpenCL API and run time framework, called
O4F, was implemented to incorporate FPGAs in a platform with CPUs and GPUs un-
der the OpenCL framework. O4F helps explore the possibility of using OpenCL as the
framework to incorporate FPGAs with CPUs and GPUs. This thesis details the findings
of this first-generation implementation and provides recommendations for future work.
Dedication
To Mohsin - for all the inspiration
Acknowledgements
I would like to acknowledge all the support and guidance provided by my supervisor Prof.
Paul Chow. His direction and feedback on this thesis has been invaluable. I also thank
all the students in the program for their help, feedback, and friendship. Special thanks to
my wife, my mother, and the rest of the family for all their support and patience. I greatly
appreciate all the encouragement and guidance from Dr. Jason Anderson and Dr. Qiang
Wang - the two great ‘friend, philosopher and guide’s I have been blessed with.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 OpenCL Overview 5
2.1 OpenCL Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Platform Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.3 Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.4 Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 OpenCL Application Flow . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Platform Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Runtime Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Related Work 15
4 Heterogeneous Platforms Under the OpenCL Framework 18
4.1 ICD Loader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Application flow under ICD Loader . . . . . . . . . . . . . . . . . . . . . 19
4.3 Challenges of using OpenCL for Heterogeneous Platforms . . . . . . . . . 20
4.3.1 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3.2 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3.3 Cluster Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 OpenCL For FPGA 22
5.1 Application Flow using FPGAs . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.1 OpenCL Code Compilation . . . . . . . . . . . . . . . . . . . . . 23
5.2 Flow Used in this Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.4 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.4.1 OpenCL API Library . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.4.2 Device Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.5 Architecture for FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.5.1 Static Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.5.2 Kernel Organization . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.5.3 Kernel Information . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.6 Benefits of FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.6.1 Task Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.6.2 Data lifetime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.6.3 Resource Utilization . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.7 Challenges of FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.7.1 FPGA Configuration . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.7.2 FPGA Resource Estimation . . . . . . . . . . . . . . . . . . . . . 36
6 Example Application 37
6.1 Potential Application Types . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.1.1 Iterative Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.1.2 Task Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.3 Other Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.2 Example: Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . . 41
6.2.1 Reason for using Monte Carlo simulation . . . . . . . . . . . . . . 41
6.2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2.3 Application Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.3.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.3.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7 Summary 50
7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A Implemented OpenCL API List 52
B Monte Carlo Kernel Execution 58
C Sobol Sequence Implementation 71
Bibliography 73
List of Tables
2.1 OpenCL Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5.1 BAR1 Offsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
List of Figures
1.1 OpenCL Framework Implementation . . . . . . . . . . . . . . . . . . . . 2
2.1 OpenCL Platform Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 An example 3D indexed kernel space . . . . . . . . . . . . . . . . . . . . 8
2.3 OpenCL Application Flow . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.1 Multiple OpenCL Implementations Under ICD Loader . . . . . . . . . . 19
4.2 Possible OpenCL Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.1 FPGA Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2 Actual Flow used in this work . . . . . . . . . . . . . . . . . . . . . . . . 25
5.3 UML Class Diagram of the API Library . . . . . . . . . . . . . . . . . . 26
5.4 Kernel Command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.5 One Kernel Group with two Kernels . . . . . . . . . . . . . . . . . . . . . 32
5.6 Two Kernel Groups with one Kernel each . . . . . . . . . . . . . . . . . . 33
6.1 Monte Carlo Simulation Flowchart . . . . . . . . . . . . . . . . . . . . . 38
6.2 Components in Community Climate System Model . . . . . . . . . . . . 39
6.3 Possible application of the platform . . . . . . . . . . . . . . . . . . . . . 40
6.4 Distribution of the Monte Carlo simulation tasks across three different
architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Chapter 1
Introduction
The availability of multi-core CPUs, high capacity FPGAs and GPUs makes possible
a heterogeneous platform with enormous computational capacity. Previous research
[4, 5, 8] has shown that each type of processor technology is ideally suited to imple-
ment specific types of functions. Thus an application with multiple compute intensive
segments would benefit from a heterogeneous platform consisting of different processor
technologies. However, mass adoption of such platforms remains elusive due to the
challenging task of programming for such heterogeneous platforms.
In the remainder of this Chapter, Section 1.1 details the motivation of this research,
Section 1.2 summarizes the contributions, and Section 1.3 provides the outline of this
thesis.
1.1 Motivation
CPUs, GPUs, and FPGAs all have their own programming models that are very differ-
ent from each other. Moving to a heterogeneous platform makes it even more difficult to
present a unified programming model that works for all architectures. All of the existing
heterogeneous platforms define their own programming paradigm and application devel-
opment process. There is always a learning curve for the application developers to even
evaluate such a platform. The lack of a standardized framework for application developers is a major barrier for mass adoption of such platforms. So far OpenCL (Open Computing Language), a trademark of Apple Inc., seems like a promising framework to address this issue. The fact that OpenCL is the only common framework supported by all GPU vendors currently makes it the sole candidate to provide a unified programming model for heterogeneous platforms containing GPUs.

Figure 1.1: OpenCL Framework Implementation (components: the OpenCL application, the OpenCL language with its compiler, the OpenCL API provided by an OpenCL library containing the platform layer and the runtime, the device driver, and the hardware)
OpenCL is a complete framework consisting of a programming language, a set of
APIs, and hardware that supports OpenCL constructs. Figure 1.1 shows all the com-
ponents necessary to realize an OpenCL framework implementation, hereafter referred
to simply as an OpenCL implementation. An OpenCL implementation encapsulates
a library that implements the OpenCL API, a toolchain to compile the OpenCL lan-
guage for the target architecture, computational devices that support OpenCL concepts,
and device drivers to communicate with the devices if necessary. It is possible for one
OpenCL implementation to support different types of devices, e.g. the AMD OpenCL
implementation supports GPUs and CPUs from AMD.
Due to active support for OpenCL from CPU and GPU vendors, existing worksta-
tions with supported GPUs have become heterogeneous platforms for general purpose
computing. OpenCL is increasingly becoming the standard framework for CPU+GPU
platforms. However, there are no existing OpenCL implementations that integrate FP-
GAs under the OpenCL framework. The motivation of this research work is to explore
the feasibility of OpenCL as the standard framework for developing applications for het-
erogeneous platforms with CPUs, GPUs, and FPGAs.
1.2 Research Contributions
The work required to integrate FPGAs under the OpenCL framework for heterogeneous
platforms can be divided into three segments. Firstly, a target architecture supporting
OpenCL concepts needs to be defined for FPGA implementations. The architecture can
be based on an array-of-soft-processors or an array-of-custom-cores. Secondly, a tool
is required to convert computation described in the OpenCL language to the targeted
architecture. This tool can be a compiler or high-level synthesis tool depending on
the architecture. Thirdly, a middleware is necessary to integrate FPGAs with existing
OpenCL implementations for CPUs and GPUs. The middleware consists of a library that
implements the OpenCL API standard and a device driver to facilitate communication
between this library and the FPGA.
There is existing and ongoing research work on the first two segments described above.
The major contribution of this thesis is to provide the first known implementation of the
middleware required for FPGA integration. This middleware allows FPGAs to be used
under the OpenCL framework along with commodity CPUs and GPUs. During the
development of the middleware, some challenges were exposed that are described along
with possible solutions.
As a target architecture is necessary to test the middleware, an architecture based
on an array-of-custom-cores is also described in this thesis. Different aspects of the
architecture are examined and some recommendations are made for future research work.
1.3 Thesis Outline
The remainder of this thesis is organized as follows. Chapter 2 provides an overview
of the OpenCL framework and OpenCL application execution flow. This overview in-
troduces the OpenCL terminology before introducing the related research work based on
OpenCL and FPGAs in Chapter 3. The overview in Chapter 2 also helps with under-
standing how a heterogeneous platform is created under the OpenCL framework, which
is described in Chapter 4. Chapter 5 describes the details of the OpenCL implementation
for FPGA (O4F) developed for this thesis. This chapter also discusses the challenges and
future work for any OpenCL implementation for FPGAs. A Monte Carlo simulation for
Asian Options was run to test out the platform including CPU, GPU, and FPGA. Chap-
ter 6 describes the reason for deciding on this test application and how the computational
work was partitioned. Chapter 7 has the conclusions and future work.
Chapter 2
OpenCL Overview
OpenCL is an open standard targeted for general-purpose parallel programming on differ-
ent types of processors. The goal of OpenCL is to provide software developers a standard
framework for easy access to heterogeneous processing platforms. The OpenCL standard
specifies a set of APIs and a programming language based on C. For the purpose of this
thesis it is only necessary to describe the OpenCL concepts rather than the technical
details. The technical details of the OpenCL framework can be found in the OpenCL
specification [11].
The OpenCL framework can be best understood by the four models explained in
Section 2.1. Section 2.2 describes the execution flow of an OpenCL application.
2.1 OpenCL Models
The following four models describe the core ideas behind the OpenCL framework.
• Platform Model
• Memory Model
• Execution Model
• Programming Model

Figure 2.1: OpenCL Platform Model (a host connected to one or more OpenCL devices; each device contains compute units, which in turn contain processing elements)
2.1.1 Platform Model
Figure 2.1 depicts all the components of the OpenCL platform model. An OpenCL
application is executed on the host and most of the runtime control of the application
resides on the host. There can be one or more computing devices connected to the
host. The OpenCL standard does not specify the type of connectivity, i.e. whether the
connection is by a bus, e.g. PCI, PCI-express, etc., or over an Ethernet network. The
OpenCL implementation specific to each device is responsible for the communication and
it is hidden from the application developer.
Each OpenCL device has one or more Compute Units (CU), and each CU has one or
more Processing Elements (PE). The actual computation is done on the PEs. Consider
the case of a GPU. The card containing the GPU is the OpenCL device. This card
contains the GPU which is the compute unit, and each GPU contains processing cores
which are the processing elements.
2.1.2 Execution Model
The execution of an OpenCL application has two components. One part, called the
kernel, executes on the devices, and the other part executes on the host. The host
part manages the kernels and the memory objects under a context through command
queues.
Context
The context contains all the pieces necessary to use a device for computation. Using
the OpenCL API, the host part of the application creates a context object and the
other objects under it, i.e. kernel object, program object, memory objects, and command
queue objects.
Kernel
The kernel represents the computation that is executed on the processing elements. The
following simple example is used to clarify the kernel concept. Assume there is an integer array of length 10 and the goal is to multiply each integer by a constant. The
kernel for this problem would only represent multiplication of one integer by the constant,
and the kernel would be instantiated 10 times to solve the complete problem. However,
out of consideration for processor utilization and memory access, it is possible to multiply
two integers in the same kernel. In that case the kernel would be instantiated five times
to solve the complete problem.
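As an illustrative sketch (not code from the thesis), such a kernel could be written in OpenCL C as follows, with one work-item per array element:

    __kernel void scale(__global int *data, const int factor)
    {
        int i = get_global_id(0);   /* unique global ID of this work-item */
        data[i] = data[i] * factor;
    }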
Work Items and Work Groups
A virtual N-dimensional indexed space is defined for the execution of the kernel, and one
kernel instance is executed for each point in this indexed space. The value of N can
be one, two, or three. Each kernel instance is called a work-item. All the work-items
execute the same code, however, they usually work on different data and their execution
path through the code can diverge. Each work-item is assigned a global ID that is unique across the indexed space.

Figure 2.2: An example 3D indexed kernel space (a global index space of dimensions Gx, Gy, Gz divided into work-groups of dimensions Wx, Wy, Wz, each containing individual work-items)
Equal numbers of work-items are grouped together to form work-groups, with all the
work-groups having the same dimensions. Each work-item within a work-group has a local
ID that is unique within the work-group, and also has access to shared local memory as
described in Section 2.1.3.
It is important to note that with proper device support, the total number of work-
items can be much greater than the number of processing elements present in a device.
Through API calls an application can find out the maximum number of work-items a
device supports.
Program and Memory Object
The program object consists of the source code and the binary implementation of the
kernels. During application execution, the binary implementation can be generated from
the source code, or a pre-compiled binary can be loaded to create the program object.
A program object can be considered as a library for kernels because one program object
can contain multiple kernels. The application decides which kernel to execute during
runtime.
The memory objects are visible to both the host and the kernels, and are used to transfer
data between the host and the device. The host creates memory objects, and through
the OpenCL API allocates memory on the device for the memory objects. The details
of the memory model are described in Section 2.1.3.
Command Queue
Each device in the context has an associated command queue, and kernel execution and
memory transfer are coordinated using the command queue. There are three types of
commands that can be issued. Memory commands are mainly used to transfer memory
between the host and the device. Kernel commands are issued to start the execution of
kernels on the device. Synchronization commands can be used to control the execution
order of the commands.
Once the commands are scheduled on the queue, there are two possible execution
modes. The commands can be executed in-order, meaning the previous command on the
queue must finish execution for a command to start execution. The other option is for the
commands to execute out-of-order, where commands do not wait for previously queued
commands to finish. However, explicit ordering can be enforced in an out-of-order queue
by synchronization commands.
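For example, a hedged sketch (the variable names are illustrative) of enforcing an order in an out-of-order queue by making a kernel launch wait on the event of a preceding non-blocking write:

    cl_event write_done;
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, nbytes, src,
                         0, NULL, &write_done);
    /* the kernel does not start until the write above has completed */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           1, &write_done, NULL);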
2.1.3 Memory Model
The memory model in OpenCL is divided into four types based on the memory access capabilities of the work-items. Table 2.1, based on Table 3.1 in [11], summarizes the memory types. Dynamic allocation means that the memory is allocated at run time, and static allocation means that it is allocated at compile time. The address space qualifiers corresponding to these regions are illustrated in a short kernel sketch after the list below.
Table 2.1: OpenCL Memory Model

                 Global              Constant            Local               Private
  Host           Dynamic Allocation  Dynamic Allocation  Dynamic Allocation  No Allocation
                 Read/Write Access   Read/Write Access   No Access           No Access
  Kernel         Static Allocation   Static Allocation   Static Allocation   Static Allocation
                 Read/Write Access   Read Access         Read/Write Access   Read/Write Access

• Global Memory: All work-items have read-write access to this memory region.
Usually the input data for the work-items are written to this region by the host,
and the computed output data is written there by the work-items.
• Constant Memory: This is a Read-Only global memory accessible to all work items.
The host part of the application allocates and initializes this memory region.
• Local Memory: This memory region is the local memory for a work-group. All
the work-items in a work-group share this memory region. This memory allows
work-items to communicate with each other within a work-group.
• Private Memory: This memory region represents the local variables of the kernel
instance. Each work-item has its own copy of the local variables and they are only
visible to the work-item.
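As a minimal illustration (an assumed example, not taken from the thesis), the four regions correspond to address space qualifiers in the OpenCL language:

    __kernel void scale_and_share(__global float *data,       /* global memory   */
                                  __constant float *factor,   /* constant memory */
                                  __local float *scratch)     /* local memory    */
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        float v = data[gid] * factor[0];   /* v lives in private memory */
        scratch[lid] = v;                  /* shared within the work-group */
        barrier(CLK_LOCAL_MEM_FENCE);
        data[gid] = scratch[lid];
    }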
2.1.4 Programming Model
Under the OpenCL programming model, computation can be done in data parallel, task
parallel, or a hybrid of these two models. The main focus of the OpenCL programming
model is the data parallel model, where each work-item works on a data item - effectively
implementing SIMD.
The task parallel model can be realized by enqueueing the execution of multiple kernels, where only one work-item for each kernel is created. Even though some GPUs support this model, it is a highly inefficient model for GPUs.
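A hedged host-side sketch of this model (queue, kernel_a, and kernel_b are assumed to have been created already):

    /* Each clEnqueueTask call launches a kernel as a single work-item,
       so two different kernels can be queued for execution. */
    clEnqueueTask(queue, kernel_a, 0, NULL, NULL);
    clEnqueueTask(queue, kernel_b, 0, NULL, NULL);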
It is possible to have a hybrid model where multiple kernels each with multiple work-
items are enqueued for execution at the same time.
2.2 OpenCL Application Flow
The OpenCL application flow is depicted in Figure 2.3, with the steps numbered for
reference in the following discussion. The flow is split into two sections. The platform
layer creates a context based on available platforms, and the runtime layer creates all
other necessary objects to execute the kernel.
2.2.1 Platform Layer
An OpenCL application initially queries for the available OpenCL platforms (step 1).
Once the available platform list is gathered, the application selects the one with the
desired device type (step 2) and creates a context. Possible device types allowed in
the OpenCL specification are CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU, and CL_DEVICE_TYPE_ACCELERATOR. The context then adds the desired number of
devices from the available devices (step 3). Once added to a context, the devices are
made exclusive to the context until they are explicitly released from the context.
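A minimal host-side sketch of the platform layer (steps 1 to 3), assuming a single platform with a GPU device is wanted and omitting error checking:

    #include <CL/cl.h>

    /* Platform layer: pick a platform (step 1), a device of the desired
       type (step 2), and create a context containing it (step 3). */
    static cl_context create_gpu_context(cl_device_id *device)
    {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, device, NULL);
        return clCreateContext(NULL, 1, device, NULL, NULL, NULL);
    }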
2.2.2 Runtime Layer
The tasks described below are considered to be part of the runtime layer. Note that it is
not necessary to execute the tasks in the same order as explained below.
Figure 2.3: OpenCL Application Flow. The platform layer covers steps 1-3 and the runtime layer covers steps 4-11, with the corresponding OpenCL API calls:

1. Get platform list: clGetPlatformIDs
2. Pick platform with device type T
3. Create context of type T: clGetDeviceIDs, clCreateContextFromType
4. Create command queue for device: clCreateCommandQueue
5. Create memory objects: clCreateBuffer
6. Create program object from source/binary: clCreateProgramWithSource, clCreateProgramWithBinary
7. Copy host memory to device memory: clEnqueueWriteBuffer
8. Set up kernel and arguments: clCreateKernel, clSetKernelArg
9. Run kernel: clEnqueueTask, clEnqueueNDRangeKernel
10. Copy device memory to host memory: clEnqueueReadBuffer
11. Clean up
The communication between the host and the devices is done using the commands
explained in Section 2.1.2. To issue these commands to the devices, a command queue
is created for each device selected under the context (step 4). Whenever a command is
issued, an optional OpenCL event object can be created. These event objects allow the
application to check for the completion of the command, and can be used for explicit
synchronization.
The memory objects are created to allocate memory on the devices (step 5). The
permission to read and/or write to these memory objects from the host is set by the
application when they are created.
The program objects are created by either loading the source code or the binary
implementation of one or more kernels (step 6). The binary implementation can either be
the device-specific executable or the intermediate representation (IR) used by the current
OpenCL implementation. Once created, the program objects are then built to generate
the device-specific executable. The OpenCL implementation decides what action to take
in the build stage depending on whether source code, IR, or an executable was used to
create the program object. The OpenCL API allows writing of the binary implementation
to a file that can be used in the later runs of the application. The format of the output
file is not part of the OpenCL specification, and the OpenCL implementation decides a
convenient format. Once the executable is built in the program object, the kernel object
is created from it. The kernel object represents one of the functions implemented in the
program object.
Before executing the kernel, the input data is transferred to the device memory by
issuing memory copy commands against the associated memory objects (step 7). The
memory transfer can be blocking where control is returned to the application once the
memory transfer is complete, or non-blocking where control is returned after the memory
transfer is scheduled. For non-blocking transfer, events are used for synchronization.
Once the input data is transferred, the values of the kernel arguments are set (step 8)
and the kernel is scheduled for execution through the command queue (step 9). Once
the kernel execution is complete, the output memory is transferred to the host from
the device (step 10). It is possible to have an iterative process where the same kernel
is scheduled to run again. New input data can be transferred to the device, and new
output data transferred back to the host after the kernel execution.
As a final step all the OpenCL objects are released (step 11) once all the computation
is done.
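Putting the runtime-layer steps together, a hedged host-side sketch follows; ctx and device are assumed to come from the platform layer, src holds the kernel source of Section 2.1.2, and nbytes, factor, host_in, and host_out are assumed to be set up by the application:

    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);          /* step 4  */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, NULL);  /* step 5  */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);    /* step 6  */
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, nbytes, host_in, 0, NULL, NULL); /* step 7  */
    cl_kernel k = clCreateKernel(prog, "scale", NULL);                        /* step 8  */
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(k, 1, sizeof(int), &factor);
    size_t gsize = 10;                                    /* one work-item per element */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL, 0, NULL, NULL);       /* step 9  */
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, nbytes, host_out, 0, NULL, NULL); /* step 10 */
    clReleaseKernel(k); clReleaseProgram(prog); clReleaseMemObject(buf);      /* step 11 */
    clReleaseCommandQueue(q); clReleaseContext(ctx);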
Chapter 4 describes how multiple OpenCL implementations can be used in the same
OpenCL application.
Chapter 3
Related Work
As mentioned in Section 1.2, there is no known work towards integration of FPGAs in a
heterogeneous platform with CPUs and GPUs using OpenCL. In this Chapter, research work on architectures that support the OpenCL framework on FPGAs and on tools that convert the OpenCL language for FPGAs is discussed. (OpenCL implementations for GPUs are provided by the commercial vendors; the details of OpenCL support for AMD and Nvidia GPUs can be found at [1] and [14] respectively.)
Lin et al. [13] presents the Open Reconfigurable Computing Language (OpenRCL)
framework that is based on the OpenCL framework but only targets FPGAs. The archi-
tecture for OpenRCL is based on an array of MIPS processors. A crossbar switch with
a scheduler is used to connect the processors to the memory regions. OpenRCL also
provides a LLVM-based compiler [12] to convert kernels written in the OpenCL language
to target their architecture. With comparable performance versus the Nvidia GeForce
9400m GPU, OpenRCL shows a 5-fold power benefit for their test application.
SOpenCL (Silicon OpenCL) [15] is an OpenCL-based FPGA architecture synthesis
tool. It converts the OpenCL kernels into accelerators and targets a template-based
architecture. It has a predefined datapath and memory access module. A LLVM-based
high-level synthesis tool converts the kernel into an accelerator and it is inserted in the
template architecture. During high-level synthesis it combines all the work-items in a
work-group into one accelerator to reduce the number of accelerator instances. Under
SOpenCL the host part of the OpenCL application runs on the PowerPC located in the
FPGA.
FSM SYS Builder [3] parses an OpenCL application to generate an array-of-processors
with MicroBlaze soft processors, and compiles the kernel source code to be executed on
MicroBlaze. Their approach is to use OpenCL as a high-level programming model to
generate a hardware/software co-designed multiprocessor system on programmable chip.
However, this work cannot be integrated with CPUs and GPUs due to the lack of the
middleware.
Instead of OpenCL, FCUDA [16] generates FPGA platforms based on applications
written in CUDA. In 2007 Nvidia introduced CUDA (Compute Unified Device Architec-
ture) to allow programmers access to their GPUs for general purpose computation. In
FCUDA, the developer annotates the CUDA kernels with FCUDA pragmas that guide
the conversion of CUDA code to AutoPilot [7] C code.
Even though not based on the OpenCL framework, QP (Quadro Plex) [17] is a het-
erogeneous cluster consisting of CPUs, GPUs, and FPGAs. Each node of the QP cluster
has two dual-core CPUs, four GPUs, and one FPGA. The programming for the CPU
is done using common compilers, and CUDA is used for programming the GPUs.
The FPGA is programmed using DIME-C code, and Nallatech’s DIME-C C to VHDL
Function Generator is used to translate DIME-C C code to VHDL.
Similar to QP, Axel [20] is another heterogeneous cluster consisting of CPUs, GPUs,
and FPGAs. Axel also does not provide a unified programming model. GPUs and
FPGAs are programmed separately using CUDA and Xilinx ISE tools respectively. The
CPU part of the application is compiled using GCC.
None of the known OpenCL works described above address the issue of the mid-
dleware layer that would enable an OpenCL system to include FPGAs that can interact
with CPUs and GPUs. The next Chapter describes how multiple implementations of
the OpenCL framework can interact with each other.
Chapter 4
Heterogeneous Platforms Under the
OpenCL Framework
The OpenCL API has all the necessary function calls to construct a heterogeneous plat-
form under the OpenCL framework. However, each device vendor provides proprietary
OpenCL implementations and there are no API calls to integrate various different imple-
mentations. In this research work the OpenCL extension installable client driver (ICD)
loader [10] is used to achieve this goal.
4.1 ICD Loader
The ICD loader is an OpenCL extension that allows multiple OpenCL implementations
to co-exist on a host system. When an application is written against the ICD loader, in-
stead of a specific implementation, the application has access to all the available platforms
provided by all the existing implementations on the host. The ICD loader decouples an
OpenCL application binary from a specific implementation, and allows the application
to select an implementation at runtime. Figure 4.1 illustrates a scenario where implementations A and B are available to the application through the ICD loader.
On a Linux host system, an ICD compliant OpenCL implementation registers itself
with the ICD loader by adding a file in the /etc/OpenCL/vendors/ directory. The file contains the name of the dynamic library that has the OpenCL implementation. The ICD loader scans this directory to enumerate available implementations, and presents them to the application. For a Windows host system the Windows registry is used to register OpenCL implementations.

Figure 4.1: Multiple OpenCL Implementations Under ICD Loader (the OpenCL application sits on top of the ICD loader, which dispatches to OpenCL implementations A and B; each implementation provides its own OpenCL API, platform layer, runtime, and compiler, and uses its own driver and hardware)
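For example, assuming the O4F library is installed as libO4F.so (a hypothetical name), the registration could be a one-line vendor file:

    # contents of /etc/OpenCL/vendors/o4f.icd (hypothetical file name)
    libO4F.so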
4.2 Application flow under ICD Loader
When an application intends to utilize multiple OpenCL implementations, the flow is similar to the one explained in Section 2.2. The following description uses the scenario
depicted in Figure 4.1 to explain. While gathering platform information, the applica-
tion will be presented with the platforms provided by both A and B implementations.
However, instead of creating one context, the application needs to create two separate
contexts to use both devices A and B.
After creating the two contexts, the application needs to create separate copies of
all the other objects, e.g. memory objects, kernel objects, command queue objects, etc.
This is necessary because the ICD layer does a 1-to-1 mapping of all the OpenCL API
calls to an implementation based on the objects used in the API.
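A hedged host-side sketch of this step, enumerating the platforms exposed by the ICD loader and creating one context per implementation (error checking omitted):

    cl_uint n = 0;
    clGetPlatformIDs(0, NULL, &n);              /* number of registered platforms */
    cl_platform_id plats[8];
    clGetPlatformIDs(n < 8 ? n : 8, plats, NULL);
    cl_context_properties props_a[] =
        { CL_CONTEXT_PLATFORM, (cl_context_properties)plats[0], 0 };
    cl_context_properties props_b[] =
        { CL_CONTEXT_PLATFORM, (cl_context_properties)plats[1], 0 };
    cl_context ctx_a = clCreateContextFromType(props_a, CL_DEVICE_TYPE_ALL,
                                               NULL, NULL, NULL);
    cl_context ctx_b = clCreateContextFromType(props_b, CL_DEVICE_TYPE_ALL,
                                               NULL, NULL, NULL);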
The OpenCL framework does not provide any high-level modeling to decompose tasks
to be executed in parallel. The application developer needs to explicitly define the tasks
to be executed on each device, and also manually partition the associated data. Any
synchronization for data or kernel execution between the two devices needs to be explicitly
managed by the application.
4.3 Challenges of using OpenCL for Heterogeneous
Platforms
Devices with different architectures can be integrated under the OpenCL framework
using the ICD loader extension. During the course of this research, however, some of
the challenges to implement an efficient heterogeneous platform under OpenCL became
evident. The following challenges apply when multiple devices are considered.
4.3.1 Synchronization
Under the OpenCL framework, the host is the central point for all application control
logic. A device is completely unaware of any other device being used by the application.
This lack of visibility restricts any direct communication between the devices and requires
the host for all coordination. The host needs to manage any synchronization necessary
among tasks running on different devices.
4.3.2 Data Transfer
An important aspect for overall efficiency of a heterogeneous platform is the capability
to transfer data efficiently. OpenCL implements a distributed memory model but lacks
the support for point-to-point data transfer. Under the current framework, the host is
involved in all data transfer between devices. Data must be first transferred to the host
to move it to another device, doubling the time required for the transfer.

Figure 4.2: Possible OpenCL Cluster (the application, through the ICD loader, uses a virtual OpenCL implementation that communicates over a network with clients; each client has its own ICD loader and local OpenCL device)
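In terms of API calls, moving a buffer from device A to device B under the current framework requires two transfers staged through host memory; a hedged sketch (queue_a, queue_b, buf_a, buf_b, host_tmp, and nbytes are assumed to exist):

    clEnqueueReadBuffer(queue_a, buf_a, CL_TRUE, 0, nbytes, host_tmp,
                        0, NULL, NULL);             /* device A -> host   */
    clEnqueueWriteBuffer(queue_b, buf_b, CL_TRUE, 0, nbytes, host_tmp,
                         0, NULL, NULL);            /* host   -> device B */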
4.3.3 Cluster Support
As mentioned earlier, OpenCL does not specify the connectivity type between the host
and the devices, and in theory this allows the creation of a networked cluster of devices
to run an application. However, due to the lack of explicit clustering support in the
framework, all the existing OpenCL implementations assume the devices to be on the host
itself. It will be a challenging task to construct a cluster under the current framework with
available OpenCL implementations. A virtual OpenCL implementation with a server-
client architecture needs to be developed for such a cluster. The server would run on the
host and through the ICD loader provide a unified view of the cluster to the OpenCL
application. The clients running on the nodes would in fact be OpenCL applications with
an extra layer to communicate with the server.
Figure 4.2 shows a possible OpenCL cluster. However, the complexity can be largely
reduced if the OpenCL API is extended to support explicit clustering, or an OpenCL
extension similar to the ICD loader is introduced.
Chapter 5
OpenCL For FPGA
The motivation of this thesis is to explore the OpenCL framework as a unified program-
ming model for a platform consisting of CPUs, GPUs, and FPGAs. This work focuses on
an OpenCL implementation for FPGAs as there are existing vendor provided OpenCL
implementations for CPUs and GPUs. An OpenCL implementation for FPGAs can be
divided into three segments: 1) a target architecture supporting OpenCL concepts needs
to be defined; 2) a compiler or high-level synthesis tool to compile OpenCL C code
for the architecture; and 3) a middleware to integrate FPGAs with existing OpenCL
implementations.
There is existing and ongoing research work on the first two parts [13], [15], [3], [16],
however, the lack of a middleware prevents any of these works from interacting with other OpenCL implementations. In this work, a middleware was developed along with a
lightweight architecture framework for FPGA implementations. The middleware allows
interaction with vendor provided OpenCL implementations for CPUs and GPUs using
the ICD loader (as explained in Section 4.2). The architecture uses custom cores as
processing elements instead of soft processors. The details of the middleware and the
architecture are described in this Chapter.
Section 5.1 describes the FPGA specific step necessary in an OpenCL application.
The actual flow used in this work is described in Section 5.2. A short description of
the hardware and software used in this work is provided in Section 5.3. Section 5.4
describes in detail the software part of the middleware developed for this work. Details
of the architecture used for this work are explained in Section 5.5. FPGA usage has its
own benefits and challenges, but during this work some OpenCL-specific benefits and challenges were noticed that are described in Sections 5.6 and 5.7 respectively.
Note that all the discussions in these sections are focused on a custom-core based
architecture. An array-of-soft-processors based FPGA design has a process similar to
GPUs, and it is noted where relevant.
5.1 Application Flow using FPGAs
An OpenCL application flow using FPGAs as OpenCL devices is almost exactly the same as the one explained in Section 2.2. However, due to the configurable nature of the FPGA, an
extra step is required as explained below.
5.1.1 OpenCL Code Compilation
This step involves building the program object after it has been created from the source
code. For a custom-core-based architecture, a predefined static framework is necessary
that contains the interface to the host, the memory, logic to implement OpenCL related
concepts, and an interface for the custom cores to communicate with this framework.
Figure 5.1 provides an overview of this design. In the figure the kernels represent the
custom cores.
From the source code, a high-level synthesis tool implements the kernels in HDL
with the interface to interact with the static framework. Once the core is generated,
the required number of the cores are instantiated and glued to the static framework to
create the complete FPGA design. FPGA vendor provided CAD tools then generate
the configuration bitstream for this design. The FPGA is then configured using this bitstream. The timing of the FPGA configuration presents a few challenges and they are discussed in Section 5.7.1.

Figure 5.1: FPGA Design Overview (application-specific kernel cores attached to a static framework consisting of the interface to the host, the kernel controller, the memory controller, and on-chip global memory, connected by data and command/control paths)
Note that if the architecture is an array-of-soft-processors, then the source code
simply needs to be compiled to produce the binary for the target processor. This binary
would be downloaded to the processors when the kernel is being executed.
5.2 Flow Used in this Work
The lack of a high-level synthesis tool to convert OpenCL C code to HDL and FPGA
configuration challenges (see Section 5.7.1) forced a flow with some manual steps for this
work as shown in Figure 5.2.
Once the computation of the application is partitioned, the part assigned for the
FPGA is coded manually to create the kernels in HDL. Instances of this core are then
manually integrated with the static framework and CAD tools are run to generate the
configuration bitstream. The FPGA is configured before running the OpenCL applica-
tion.

Figure 5.2: Actual Flow used in this work (manual steps: a custom core is developed and integrated with the static framework, and the design bitstream is generated and the FPGA configured; the OpenCL C kernels for the CPU/GPU are developed as usual; the OpenCL application is then run)
The kernel code for the CPU/GPU part is developed as usual with the rest of the OpenCL application, and it is executed following the flow shown in Figure 2.3.
5.3 Experimental Setup
The heterogeneous platform used in this work included an AMD Athlon 7750 Dual-Core
Processor running at 1.4GHz, a graphics card with an ATI Radeon HD 5450 GPU, and
the Xilinx XUPV5 board with a Virtex5-LX110T FPGA. The graphics card is connected
to the motherboard using a 16-lane PCI-express interface, and the 1-lane PCI-express interface of the XUPV5 board is used to connect the FPGA board.
The OpenCL implementation packaged with the AMD-APP-SDK-v2.4-lnx64 is used
for the CPU and the GPU, and the implementation developed for this work is used for
the FPGA. The OS running on the host is CentOS 5.6. The Xilinx ISE 12.3 tool is used
to compile the FPGA design.
Figure 5.3: UML Class Diagram of the API Library (the classes o4f_platform, o4f_device, o4f_context, o4f_command_queue, o4f_program, o4f_kernel, o4f_mem, and o4f_event, with o4f_context associated with all of the other classes)
5.4 Software
The software part of the middleware is divided into two parts - the library that implements
the OpenCL API, and the device driver that allows communication between this library
and the FPGA design. The source code for this work will be made publicly available and
the details of the implementation can be found in the source code. A brief overview is
provided here.
5.4.1 OpenCL API Library
A multi-threaded dynamic library is designed and developed to implement the OpenCL
API specification 1.1 [11]. Only a subset of the OpenCL API deemed necessary to integrate the FPGA as an OpenCL device has been implemented. Appendix A lists this subset of the API.
Figure 5.3 shows the UML class diagram for the major classes used in the API library.
The class o4f_context, representing an OpenCL context, has a relationship to all the other classes because an OpenCL context contains all the other objects in an application. Classes o4f_program and o4f_kernel represent an OpenCL program and an OpenCL kernel respectively. In this work the kernel on the FPGA is pre-configured, and these two classes are placeholders for future work when kernels can be created at runtime.
Class o4f_command_queue represents the command queue. When a command is issued, a new thread and an instance of the class o4f_event, representing an event, are created. The o4f_event tracks the new thread. A command can be instructed to wait for the completion of previously generated events before being executed. To accommodate this explicit synchronization, an o4f_event can hold a collection of o4f_event objects.
The relationship shown in the diagram supports one FPGA board with one chip, but
it can easily be extended to support multiple boards and multiple chips on each board.
However, supporting multiple boards or FPGA chips would require significant redesign of
the device driver and a few modifications to the static framework of the FPGA design.
ICD Compatibility
Initially the library was developed to implement the OpenCL API to allow an OpenCL
application to interact solely with an FPGA device. It was later modified to support the
ICD loader extension to interact with commercial OpenCL implementations. Access to
the ICD implementation source code from Khronos Group was necessary to make the
modifications because specific additions to the data structures are required.
The ICD loader initially queries a registered library (see Section 4.1) through the
function call clGetExtensionFunctionAddress to get the address of the functions
clIcdGetPlatformIDsKHR and clGetPlatformInfo. The detailed process of how
the ICD loader works is described in [10].
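A hedged sketch of how that first query might look from the loader's side; the typedef and variable names are illustrative:

    typedef cl_int (*pfn_clIcdGetPlatformIDsKHR)(cl_uint num_entries,
                                                 cl_platform_id *platforms,
                                                 cl_uint *num_platforms);

    pfn_clIcdGetPlatformIDsKHR get_platform_ids =
        (pfn_clIcdGetPlatformIDsKHR)
            clGetExtensionFunctionAddress("clIcdGetPlatformIDsKHR");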
Events
Event objects are created when memory transfer or task execution commands are en-
queued. A thread is spawned for each event to execute the command. This allows
returning control to the main application without blocking it, and makes it con-
venient to check the status of the event. During explicit synchronization on an event
completion, the main process sleeps until the thread finishes instead of using any polling
mechanism.
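A minimal sketch of this scheme using POSIX threads; the type and function names are illustrative and do not mirror the actual O4F source:

    #include <pthread.h>

    typedef struct o4f_event {
        pthread_t worker;     /* thread executing the enqueued command */
        int       complete;   /* set once the command has finished     */
    } o4f_event;

    /* Explicit synchronization: sleep until the worker thread is done
       instead of polling the event status. */
    static void o4f_wait_for_event(o4f_event *ev)
    {
        pthread_join(ev->worker, NULL);
        ev->complete = 1;
    }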
Command Queue
The current implementation of the command queue only supports the out-of-order model.
However, the in-order model can still be realized by explicit synchronization of the events
generated during queuing commands.
5.4.2 Device Driver
The FPGA board communicates with the host through a 1-lane PCI-express interface.
The device driver facilitates the communication between the API library and the board.
The Linux Kernel presents the device as a PCI device, and the PCI related API is used
by the driver to control the device.
DMA
The PCI-express specification does not provide native DMA support. Instead, the devices are responsible for implementing DMA support. For this work it is incorporated in the
FPGA design as part of the host interface. When data transfer is required, the device
driver sets up the transfer size, the source and the destination memory address in the
appropriate registers in the FPGA using the host interface (see Section 5.5.1). Then the
driver instructs the FPGA to commence the transfer, and the FPGA does the transfer
without the CPU being involved.
Interrupt
In two cases the FPGA needs to initiate communication with the device driver. By
raising the interrupt the FPGA either indicates the completion of the data transfer or
completion of the kernel execution requested by the driver.
Currently the legacy interrupt is used, which means the same interrupt is used to no-
tify completion of both data transfer and kernel execution. This is a drawback because
in the current implementation the driver only permits either a data transfer or a kernel execution to be outstanding at a time. PCI-express supports Message Signaled Interrupts (MSI), which allows devices
to generate multiple unique interrupts. This would allow the FPGA to generate unique
interrupts to signal data transfer and kernel execution completion independently. Cur-
rently the FPGA design supports MSI, however, the Linux Kernel used by the host OS
does not support it.
5.5 Architecture for FPGAs
A simple custom core based architecture is designed for the FPGA to complete the
middleware. Figure 5.1 shows the overview of this design. The three major components
in the static framework are the Kernel controller, the host interface, and the memory
controller.
5.5.1 Static Framework
Kernel Controller
The API library communicates with the custom core, i.e. the kernel, through this block
using an 8-bit command word (Figure 5.4). The most significant two bits indicate the
type of command. When an argument is being set for the kernel, the least significant four
bits indicate the index of the argument. This allows a kernel to have 16 arguments. When
the ‘set argument’ command arrives, this block broadcasts all this information along with the value of the argument (set by the library right before issuing the command) to all the kernels. The same happens when the ‘start’ command arrives.

Figure 5.4: Kernel Command (an 8-bit command word; the most significant two bits hold the command type, and the remaining bits carry the command arguments and the Kernel ID)

Command        Value   Command Arguments
Set Argument   01      Command argument part holds the Kernel argument index
Start Kernel   10      N/A
Note that there are four bits allocated for Kernel ID. This ID is the same for all
instances of the same kernel. This allows up to 16 unique kernels to be added to this
framework. The motivation for this is to provide a true task parallel model in FPGAs
(see Section 5.6.1).
Host Interface
PCI-express is used as the host interface, and one of the six base address registers (BAR)
available in PCI-express is utilized for DMA data transfers and passing information to
the kernel controller. BAR is part of the PCI configuration space specification that is
also used in PCI-express. For the system OS to address a device, part of the device
needs to be mapped into either the memory or the IO port address space. For PCI/PCI-
express devices BARs are mapped into the system OS address space. Table 5.1 shows
the registers used in BAR1. The read and write operations are from the FPGA’s point of view.

Table 5.1: BAR1 Offsets

Offset  Name               Meaning
0x00    WRITE FPGA ADDR    FPGA memory address for DMA write
0x04    WRITE HOST ADDR    Host memory address for DMA write
0x08    WRITE SIZE         Number of bytes to write
0x18    WRITE START        Initiate DMA write
0x0C    READ FPGA ADDR     FPGA memory address for DMA read
0x10    READ HOST ADDR     Host memory address for DMA read
0x14    READ SIZE          Number of bytes to read
0x1C    READ START         Initiate DMA read
0x40    KERNEL CMD DATA    Command data for kernel
0x44    KERNEL CMD         Broadcast the current command data to kernel
The Host interface also sends the interrupt signals to indicate a DMA transfer com-
pletion, or when requested by the kernel controller.
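Tying the DMA flow of Section 5.4.2 to these registers, a hedged driver-side sketch of initiating a DMA write is shown below; bar1 is assumed to be the ioremapped BAR1 base, and the offsets are those of Table 5.1:

    #include <linux/io.h>
    #include <linux/types.h>

    static void o4f_start_dma_write(void __iomem *bar1, u32 fpga_addr,
                                    u32 host_addr, u32 nbytes)
    {
        iowrite32(fpga_addr, bar1 + 0x00);   /* WRITE FPGA ADDR */
        iowrite32(host_addr, bar1 + 0x04);   /* WRITE HOST ADDR */
        iowrite32(nbytes,    bar1 + 0x08);   /* WRITE SIZE      */
        iowrite32(1,         bar1 + 0x18);   /* WRITE START     */
        /* completion is signalled by the FPGA through its interrupt */
    }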
Memory Controller
The memory controller currently only supports pre-fixed point-to-point connections with priority-based access to memory. Only two kernel instances can be connected to the memory controller. However, this is enough to explore quite a few concepts explained later in this Chapter. Optimized memory access is a major research topic, but not a goal for this work. This simple controller can easily be replaced by a more sophisticated controller in the future without much modification to the framework.
Figure 5.5: One Kernel Group with two Kernels (the kernel controller passes kernel settings and control to a kernel group control block, which drives two kernels, local ids 0 and 1, connected to global memory; a done-aggregate block collects the kernel-done signals)
5.5.2 Kernel Organization
The framework has been designed to support the OpenCL concepts of work-item and work-group. However, as there is no support for ‘context’, the number of work-items represents the number of kernel instances. Going forward, work-item and work-group are simply referred to as kernel and kernel-group respectively. If shared memory is required by the
kernel, then kernels within the same kernel-group have access to the shared memory.
The two configurations shown in Figures 5.5 and 5.6 were tried with the test application explained in Chapter 6. The shared memory is not shown in these figures.

Initially it may seem that there is no difference between the two configurations. However, kernel-groups can benefit from the presence of shared memory, depending on the application-specific requirements for using the shared memory.
Figure 5.6: Two Kernel Groups with one Kernel each (the kernel controller drives two kernel group control blocks; each group contains a single kernel with local id 0, global ids 0 and 1 respectively, connected to global memory, with done-aggregate blocks collecting the kernel-done signals)
5.5.3 Kernel Information
Kernel instances require some OpenCL-based information to distinguish themselves from
each other. For example, the global ID of a kernel uniquely identifies itself among all the
kernel instances. The local ID identifies a kernel within a kernel-group. The number of
kernel-groups, the number of kernels in a group, etc. are all important information used
to decide which part of the data a kernel must process. In our work, this information
is passed as parameters to the HDL modules. This method would work even when a
high-level synthesis tool is used to generate the HDL for the kernels.
5.6 Benefits of FPGAs
This section discusses OpenCL specific benefits of using FPGAs, and not the general
benefits. The benefits are weighed mostly against GPUs.
5.6.1 Task Parallelism
In GPUs, all the processing elements execute the same instruction, processing data using
a SIMD model. This does not allow task parallelism in a GPU. For FPGAs, true task
parallelism can be implemented very easily. Two different kernels can run in parallel seamlessly.
A setup with two different kernels was implemented to show this benefit. Unfortu-
nately the lack of MSI support (see Section 5.4.2) prevented running of both tasks in
parallel. In the legacy interrupt mode, the same interrupt is raised when either kernel
completes its execution and the device driver is unable to decide which kernel finished
execution.
5.6.2 Data lifetime
The data stored in the shared or private memory region of the GPUs is only valid during
the execution of the kernel. A kernel instance in a GPU is a software thread and cannot
retain any state information once the execution is complete. For an application where the
kernel is executed iteratively on a GPU, it is not guaranteed that a kernel instance will
be assigned to the same processing element and use the same shared or private memory
region. This will cause performance degradation when the same data needs to be loaded
in the shared or private memory in consecutive kernel execution. For FPGAs with an
array-of-custom-cores architecture, each custom core represents a kernel instance. Data is
persistent between kernel executions and previously loaded data in the shared or private
memory can be reused.
The test application in Chapter 6 utilizes this idea. The kernel has an extra argu-
ment to indicate whether to load data to the shared memory from the global memory
before starting the actual computation. The argument is set to true from the OpenCL
application the first time, and false for consecutive kernel executions.
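Expressed in OpenCL C form purely for illustration (the actual FPGA kernel in this work is written in HDL), the idea of the extra argument looks roughly as follows:

    __kernel void iterate(__global const float *table,
                          __local float *cache,
                          __global float *out,
                          int preload)
    {
        int lid = get_local_id(0);
        if (preload)                      /* true only on the first launch */
            cache[lid] = table[lid];
        barrier(CLK_LOCAL_MEM_FENCE);
        /* on an FPGA core the contents of cache persist across launches */
        out[get_global_id(0)] = cache[lid];
    }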
5.6.3 Resource Utilization
The configurable nature of FPGAs would allow better resource utilization for OpenCL
applications. For example, a GPU has a fixed amount of available memory of all types.
This is not a restriction for FPGAs. The total amount of memory available, including
off-chip memory, can be partitioned as required to optimize performance.
5.7 Challenges of FPGAs
The OpenCL-specific challenges of using FPGAs are described in this section.
5.7.1 FPGA Configuration
The timing of the FPGA configuration becomes a critical issue for OpenCL applications.
Ideally the FPGA should be configured once the application enqueues a specific kernel
for execution. However, that is not an option when an FPGA board is used where the
FPGA is responsible for the communication with the host. For the setup of this work, the
FPGA implements the PCI-express link to communicate with the host and it should be
configured even before the host is booted. Trying to configure or reconfigure the FPGA
after the host boots up makes the host unstable, and most likely will crash the system.
Also, the input data for the kernel needs to be transferred before enqueuing the kernel
execution. To facilitate the memory transfer, the static part of the FPGA design must be
present as well. For these two reasons the FPGA is configured beforehand in this work.
One possible solution is to use an FPGA board where another device is responsible for
the communication with the host. However, this will add overhead for the communication
and may cause overall performance degradation.
Another solution is to use a technology like partial reconfiguration. This method
would eliminate the need for an extra device on the board, and should not impact per-
formance.
5.7.2 FPGA Resource Estimation
In the OpenCL framework, the application expects to know the computational capacity
in terms of processing elements of a device at initialization. This allows the application
to partition a problem accordingly. In the case of FPGAs, computational capacity only
in terms of FPGA resources can be known beforehand. How much FPGA resource a
kernel requires is known only after the CAD tool implementation. Even this information
does not provide the knowledge of how many instances of the kernel can be put together
on the device. FPGA CAD tools usually struggle with higher resource utilization, and
FPGA devices can rarely be used fully.
Chapter 6
Example Application
There is continuing research on optimizing compute intensive applications on various pro-
cessing architectures. Focus has been on comparing algorithm implementations among
architectures to understand the best match [4, 5, 8, 21], or improving application per-
formance on a specific architecture. The absence of research work targeting platforms
with CPU, GPU, and FPGA is noticeable, and it is reasonable to assume that the lack
of availability of such a platform has been a barrier.
In this work an example application, a Monte Carlo simulation for Asian options,
is developed to demonstrate that all three architectures can work together under the
OpenCL framework using commercial vendors' OpenCL implementations together with
O4F. Some of the design choices for the application are made to demonstrate the benefits
of using FPGAs in OpenCL. Section 6.2 describes the details of the example application.
However, before describing the example application, Section 6.1 discusses the application
types that have the potential to benefit from a heterogeneous OpenCL platform.
6.1 Potential Application Types
A heterogeneous platform consisting of CPUs, GPUs, and FPGAs provides an attractive
option to improve overall application performance. However, the lack of peer-to-peer
communication, especially peer-to-peer data transfer, in the OpenCL framework can
restrict the types of applications suitable for this platform. This section describes some
of the potential application types for this platform.
Figure 6.1: Monte Carlo Simulation Flowchart
6.1.1 Iterative Process
An application with an iterative process, where all computation is encapsulated in a
loop, can be a candidate. The computation within the loop needs to be segmented,
and these segments can be assigned to different processing architectures. An example
application is the Monte Carlo simulation, which has three major segments, shown in
the flowchart in Figure 6.1. The first segment generates the random numbers, the second
segment uses the random numbers to compute multiple data points, and the last segment
is the reduction step that produces the result based on the calculated data points.
The iterative nature of the application is necessary to hide some of the latency in-
troduced by transferring data through the host. The example application described in
Section 6.2 will illustrate this more clearly.
Figure 6.2: Components in the Community Climate System Model
6.1.2 Task Parallel
An application containing multiple compute-intensive segments that are independent of
each other is an obvious choice. Completely independent compute segments are unlikely
in practice, but it may be possible to gain performance even with some communication
done through the host. An example application is climate modeling where multiple
components are simulated simultaneously. Figure 6.2 shows a simplified view of the
software design used in the Community Climate System Model [6]. The four models
are simulated independently, and they intermittently exchange data through the coupler.
Depending on the actual computation involved within each model, a different processing
architecture may be suitable for each individual model.
Figure 6.3: Possible application of the platform
6.1.3 Other Considerations
This platform can be useful when considering other aspects besides just runtime perfor-
mance. Power usage has become a serious consideration for many applications, and this
platform can be used to balance between power usage and runtime performance.
FPGAs are ideal for interfacing with external IO devices. An application that interacts
with external IO devices can utilize this platform because the OpenCL framework does not
restrict how data is sent or received by an application. For example, a video confer-
encing application with encrypted data communication can use the FPGA to receive
encrypted data and decrypt it before passing it on to the host. The host can use
the GPU for image processing before displaying the video. Data sent from the host can
be encrypted by the FPGA before sending it out. Figure 6.3 depicts the block diagram
of one such possible application.
6.2 Example: Monte Carlo Simulation
A Monte Carlo simulation for Asian options is used as the example application for this
work. For Asian options the payoff is decided by the average price of the underlying
financial instrument, e.g. stock, over a pre-set period of time. The average price is based
on the price of the instrument at pre-set intervals over this period of time. The Monte
Carlo method of pricing Asian options generates a large number of trajectories that the
price can follow to reach an interval, and averages over all the trajectories to produce
the estimated price for that interval. Random numbers are used to generate the price
trajectories.
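As a rough sketch of the quantity being estimated, each simulated trajectory contributes the discounted payoff of the arithmetic average; the function below is illustrative C only and is not part of the AMD sample or the O4F code.

#include <math.h>

/* payoff of one simulated trajectory of an arithmetic-average Asian call option */
float asian_call_payoff(const float *prices, int num_intervals,
                        float strike, float interest, float maturity)
{
    float sum = 0.0f;
    for (int i = 0; i < num_intervals; i++)
        sum += prices[i];                       /* prices at the pre-set intervals */
    float avg = sum / num_intervals;
    float payoff = (avg > strike) ? (avg - strike) : 0.0f;
    return expf(-interest * maturity) * payoff; /* discount back to the present */
}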
The computation involved in this Monte Carlo simulation has the same three segments
as depicted in Figure 6.1. One iteration of the Monte Carlo simulation evaluates the
instrument price at one pre-set interval.
6.2.1 Reason for using Monte Carlo simulation
There are two main reasons to use a Monte Carlo simulation for this work: the iter-
ative process involved, and the use of random numbers. Thomas et al. [18] show the
performance and power advantage of using FPGAs for generating random numbers. The
authors also mention that the random number generation is only one part of the Monte
Carlo simulation, and other architectures may provide better performance for the overall
application.
Quasi-Monte Carlo simulation
A special type of Monte Carlo technique, called Quasi-Monte Carlo, is used for
the example application. It is similar to the traditional Monte Carlo technique, except
that quasi-random sequences are used instead of pseudo-random ones. A quasi-random
sequence attempts to avoid clustering of numbers by generating each number as far away
as possible from the previously generated numbers. The Sobol sequence [9] is one such
quasi-random sequence, and it is used in the example application. The Quasi-Monte
Carlo technique was chosen because the Sobol sequence helps demonstrate several
benefits of using FPGAs in the OpenCL framework.
Figure 6.4: Distribution of the Monte Carlo simulation tasks across three different architectures
6.2.2 Implementation
A sample application of the Monte Carlo simulation for Asian options is provided by
the AMD-APP-SDK-v2.4-lnx64 from AMD. Two major modifications are made to this
sample application to integrate an FPGA as an OpenCL compute element. Firstly,
an extra OpenCL context, besides the context for the GPU, is created for the FPGA
platform provided by O4F. The detailed process of creating multiple OpenCL contexts
has been explained in Section 4.2. The second modification is related to random number
generation and usage. In the sample application, the GPU kernel generates the random
numbers to calculate the price on various points of the trajectory. In this work the FPGA
generates the random numbers, and these numbers are transferred to the GPU. The GPU
kernel is modified to use these random numbers directly.
Figure 6.4 shows how the tasks are distributed across the three processing architec-
tures of the platform. The FPGA kernel is launched first to generate a block of random
numbers. These are transferred to the GPU by first moving them to the host CPU and
then to the GPU. The FPGA can then begin generation of the next block of random
numbers while the GPU computes price trajectories using the random numbers. Once
this is complete, the results are transferred back to the CPU where the average price is
computed. Note that in the example application, the CPU part does not use an OpenCL
kernel. Instead regular C code is used to perform the reduction step.
FPGA Kernel: Sobol Sequence Generation
A detailed description of Sobol sequence generation can be found in [19]. The description
here focuses on the part that helps illustrate the usefulness of FPGAs in OpenCL. To
construct a Sobol sequence, an initial vector of numbers, called the directional vector,
needs to be generated. To generate Sobol numbers with a w-bit wordlength, a directional
vector of size w is necessary. Multi-dimensional Sobol sequences can be generated (and are
almost always required for financial Monte Carlo simulations), and each dimension needs
its own directional vector. Note that the directional vectors are generated only once and
remain constant afterwards.
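For reference, a minimal software sketch of the standard Gray-code (Antonov-Saleev) recurrence for one dimension is shown below, assuming the w directional numbers v[0..w-1] have already been computed; this only illustrates the arithmetic and is not the Verilog datapath used in this work.

/* Generate the (n+1)-th w-bit Sobol integer from the n-th one.
   Start with *x = 0 and call with n = 0, 1, 2, ...; dividing the
   result by 2^w maps it into [0, 1). */
unsigned int sobol_next(unsigned int *x, const unsigned int v[], unsigned int n)
{
    unsigned int c = 0;
    while (n & 1u) {    /* find the index of the lowest zero bit of n */
        n >>= 1;
        c++;
    }
    *x ^= v[c];         /* each new point differs by one directional number */
    return *x;
}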
As the directional vectors remain constant after being created, there is no need to
use FPGA resources to generate these numbers. In [19] these vectors are also generated
offline and loaded in the FPGA during runtime. The OpenCL framework provides a
convenient way to generate these numbers using the CPU and load them in the FPGA
as a memory object.
The kernel generating the Sobol sequence is designed to load these directional vectors
from the global memory to the local memory region. When the kernel is executed for
the first time, the directional vector is loaded based on the true value of one of the
kernel arguments. As Monte Carlo is an iterative process, the argument is set to false in
consecutive iterations, and the kernel uses the previously loaded directional vector. This
demonstrates the benefit of prolonged data lifetime as explained in Section 5.6.2.
The goal of the example application is to demonstrate a working heterogeneous plat-
form. As such, the performance for the kernel is not considered. Appendix C has sug-
gestions to improve the performance of this kernel.
6.2.3 Application Flow
Section 5.2 explained the overall flow necessary for using O4F and the same flow is used
for the example application. First the Sobol sequence generator for the FPGA is coded
manually as a Verilog module. A predefined interface, required to communicate with the
rest of the design, is used for the module as shown in Figure 5.1. An existing template is
used to group two of the module instances to create a kernel group module. The kernel
group module is inserted inside another template module to create a top level module for
the kernel group. A different set of templates is used if multiple kernel groups need to be
instantiated, but the top level kernel module has the same interface. The FPGA design
for the static framework of O4F has a placeholder for the top level kernel group module.
An ISE project is created with the files for the static framework and the kernel related
modules, and FPGA CAD tools are run to generate the configuration bitstream. Note
that as explained in Section 5.7.1, the FPGA is configured before the host is powered on
because the PCI-express link of the FPGA board needs to be available when the host is
booted.
Once the host is booted with the configured FPGA board, the O4F device driver is
loaded to provide the O4F API library access to the FPGA. This allows the example
application to run like a regular OpenCL application and access the FPGA as an OpenCL
device.
Executing the Kernels
The actual application execution flow is the same as the one described in Section 4.2.
This section describes how the kernels are executed on multiple devices.
The code listing in Appendix B shows the function that executes the kernels in the
example application. The control flow of this function is described using the pseudo code
shown in Pseudo Code 1. The actual API calls are shown in the pseudo code without the
actual arguments, but with the targeted device name for readability. The pseudo code
is also annotated with the line numbers from the appendix. The full application along
with all work related to the middleware will be made public online.
Before entering the main iterative loop of the Monte Carlo simulation, the random
number generation task in the FPGA is executed. Line 133 enqueues the task and line
167 reads the random numbers from the FPGA. These two steps are done outside the
main loop to have the random numbers ready to be used in the first iteration. This
allows overlapping kernel execution inside the loop. Note that after the first execution,
the kernel argument to load the directional vectors is set to false at line 159.
The main for loop starts with a synchronization point at line 192. This is to ensure
that the data transfer from the FPGA has completed. In the first iteration, the data transfer
is enqueued before the for loop at line 167. In the following iterations, data transfer is
enqueued during the previous iteration at line 294.
Pseudo Code 1 Executing Kernels on the GPU and the FPGA
clEnqueueTask(FPGA)  Line 133
clSetKernelArg(FPGA)  Line 159
clEnqueueReadBuffer(FPGA)  Line 167
for k = 0 → (steps − 1) do
  clWaitForEvents(FPGA)  Line 192
  clEnqueueNDRangeKernel(GPU)  Line 241
  clEnqueueTask(FPGA)  Line 259
  clWaitForEvents(FPGA)  Line 272
  clWaitForEvents(GPU)  Line 283
  clEnqueueReadBuffer(FPGA)  Line 294
  for all prices do
    calculate average price  Line 362
  end for
end for
Inside the main loop, the kernel that calculates the prices and the kernel that generates
the next set of random numbers are enqueued at lines 241 and 259, respectively. Note that for the GPU, function
clEnqueueNDRangeKernel is used to create a virtual 2-dimensional index space. For
the FPGA, function clEnqueueTask is used because there is no virtual index space for
the FPGA. The FPGA board is pre-configured with two instances of the Sobol sequence
generation kernel.
Lines 272 and 283 are synchronization points for the FPGA and the GPU to finish
execution. Once the execution is done, a read buffer command is queued for the FPGA
at line 294. As mentioned earlier, the synchronization point for this command is at the
starting point of the main loop.
Once the result buffers are read from the GPU, the CPU is used to perform the
reduction step at line 362.
6.3 Analysis
The main goal for developing the Monte Carlo simulation for Asian options is to demon-
strate a working platform consisting of CPUs, GPUs, and FPGAs. The lack of high-level
synthesis support and the inability to configure the FPGA at runtime introduce some
manual steps; however, the example shows how OpenCL makes it possible to easily
utilize heterogeneous computing elements as long as the supporting middleware infras-
tructure exists. The manual steps can be removed by adding a high-level synthesis tool,
and by using partial reconfiguration methods or special boards (see Section 5.7.1).
6.3.1 Observations
The current OpenCL framework does allow the addition of new processor architectures;
however, it appears the framework is more suitable for GPU-like devices with an array of
processors. The concept of a virtual index space, a core idea of OpenCL, implies that
the underlying device needs to handle multiple threads. This provides flexibility and
portability for an OpenCL application: the size of the virtual index space can change
based on the input data size, and the application is not tied to a specific device. An
accelerator device, in contrast, is unlikely to support such a threading model.
The OpenCL API has function calls to transfer memory to and from the device;
however, implicit memory transfers can also occur. In the example application, which uses
AMD's OpenCL implementation for the GPU, an implicit memory transfer is done for the
memory objects specified as kernel arguments when a kernel is enqueued for execution
on the GPU. Notice that in the code listed in Appendix B, there is no call to transfer the
random numbers to the GPU. This behaviour is not clearly specified in the specification.
A clarification is necessary to ensure that all OpenCL implementations behave similarly
and that an OpenCL application does not require modifications based on the implementation
being used.
It also appears that the current API specification does not consider multiple devices very
carefully. For example, according to the OpenCL 1.1 specification, the API call clCre-
ateBuffer should return CL_OUT_OF_RESOURCES if the OpenCL implementation
fails to allocate the required resources on the device. However, the API takes an
OpenCL context object as an argument, not an OpenCL device object. As a context can
have multiple devices, it is not possible for the implementation to decide which device is
the target.
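The fragment below illustrates the issue using the standard OpenCL 1.1 calls; the device names and buffer size are illustrative, and error handling is omitted.

cl_device_id devices[2] = { gpu_device, fpga_device };  /* two devices in one context */
cl_int err;
size_t buffer_size = 64 * 1024 * 1024;                  /* arbitrary size */
cl_context ctx = clCreateContext(NULL, 2, devices, NULL, NULL, &err);
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buffer_size, NULL, &err);
/* err may be CL_OUT_OF_RESOURCES, but the context holds two devices,
   so it is unclear which device's resources were insufficient. */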
6.3.2 Performance
No performance analysis is done for the Monte Carlo simulation as performance im-
provement was not a goal. However, measurements show that the sample application,
which generates the random numbers and calculates the estimated price on the GPU,
spends half of the kernel execution time generating the random numbers. As it has
been shown [18] that FPGAs can generate random numbers three times faster than
GPUs, it is conceivable to achieve an overall performance gain. However, the issues
mentioned in Appendix C must be addressed before any performance analysis.
The result from the FPGA is validated by a test application running on the host to
ensure the middleware is functioning properly. A software version of the Sobol sequence
generator is implemented in the test application to create a baseline, and the result
generated by the FPGA Sobol sequence generator is matched against this baseline.
Chapter 7
Summary
The motivation of this thesis is to provide a standardized unified programming model for
platforms with CPUs, GPUs, and FPGAs. The OpenCL framework is used to achieve
this goal by developing the middleware necessary to integrate FPGAs with CPUs and
GPUs. This work is the first known platform to integrate CPUs, GPUs, and FPGAs
under the OpenCL framework. The challenges and benefits of using FPGAs on such a
platform are discussed in Chapter 5.
Previous research has shown that different architectures provide a performance ad-
vantage over other architectures for various types of computation. This platform will
allow researchers to improve overall performance of an application with multiple com-
pute intensive segments by utilizing a suitable architecture for each segment. Potential
application types are discussed in Chapter 6. One such application, a Monte Carlo sim-
ulation for Asian Options, is developed to show a working platform under the OpenCL
framework.
7.1 Future Work
This work provides a first-generation FPGA OpenCL implementation that allows inte-
gration of FPGAs with CPUs and GPUs under the OpenCL framework. However, users
still need to code the kernels for FPGAs in HDL. A major improvement would be to
integrate a high-level synthesis tool into the implementation. A high-level synthesis tool
like LegUp [2] would be ideal as it targets FPGA architectures and its source code is
publicly available. The existing tools in [15, 13] can be integrated as well.
O4F does not support all the API function calls in the OpenCL specification [11].
All the API calls need to be implemented to become fully compliant and allow true
portability of OpenCL applications. As mentioned earlier in Section 5.5.1, currently
a simple memory controller is being used. A more sophisticated memory controller is
necessary, but the design of the memory controller will be dependent on the overall
architecture being used.
The current FPGA design has a static framework to support OpenCL concepts and an
array of custom cores. As the OpenCL API implementation library adds more function
calls, minor modifications to the static framework may be necessary to support the newer
function calls. However, extensive research into the custom core architecture is necessary.
A template-based architecture with a predefined datapath, similar to the one described
in [15], is one option. Another option is to generate an application-specific custom archi-
tecture using high-level synthesis. A mix of custom accelerators with microprocessors, as
generated by LegUp [2], can also be an option.
Appendix A
Implemented OpenCL API List
extern CL_API_ENTRY cl_int CL_API_CALL
clGetPlatformIDs(cl_uint          p_num_entries,
                 cl_platform_id  *p_platforms,
                 cl_uint         *p_num_platforms);

extern CL_API_ENTRY cl_int CL_API_CALL
clGetPlatformInfo(cl_platform_id    p_platform,
                  cl_platform_info  p_param_name,
                  size_t            p_param_value_size,
                  void             *p_param_value,
                  size_t           *p_param_value_size_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clGetDeviceIDs(cl_platform_id   p_platform,
               cl_device_type   p_device_type,
               cl_uint          p_num_entries,
               cl_device_id    *p_devices,
               cl_uint         *p_num_devices);

extern CL_API_ENTRY cl_int CL_API_CALL
clGetDeviceInfo(cl_device_id     p_device,
                cl_device_info   p_param_name,
                size_t           p_param_value_size,
                void            *p_param_value,
                size_t          *p_param_value_size_ret);

extern CL_API_ENTRY cl_context CL_API_CALL
clCreateContextFromType(const cl_context_properties *p_properties,
                        cl_device_type               p_device_type,
                        void (CL_CALLBACK *p_pfn_notify)
                            (const char *, const void *, size_t, void *),
                        void                        *p_user_data,
                        cl_int                      *p_errcode_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clGetContextInfo(cl_context        p_context,
                 cl_context_info   p_param_name,
                 size_t            p_param_value_size,
                 void             *p_param_value,
                 size_t           *p_param_value_size_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clRetainContext(cl_context p_context);

extern CL_API_ENTRY cl_int CL_API_CALL
clReleaseContext(cl_context p_context);

extern CL_API_ENTRY cl_int CL_API_CALL
clGetCommandQueueInfo(cl_command_queue        p_command_queue,
                      cl_command_queue_info   p_param_name,
                      size_t                  p_param_value_size,
                      void                   *p_param_value,
                      size_t                 *p_param_value_size_ret);

extern CL_API_ENTRY cl_command_queue CL_API_CALL
clCreateCommandQueue(cl_context                    p_context,
                     cl_device_id                  p_device,
                     cl_command_queue_properties   p_properties,
                     cl_int                       *p_errcode_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clRetainCommandQueue(cl_command_queue p_command_queue);

extern CL_API_ENTRY cl_int CL_API_CALL
clReleaseCommandQueue(cl_command_queue p_command_queue);

extern CL_API_ENTRY cl_mem CL_API_CALL
clCreateBuffer(cl_context     p_context,
               cl_mem_flags   p_flags,
               size_t         p_size,
               void          *p_host_ptr,
               cl_int        *p_errcode_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clRetainMemObject(cl_mem p_mem);

extern CL_API_ENTRY cl_int CL_API_CALL
clReleaseMemObject(cl_mem p_mem);

extern CL_API_ENTRY cl_int CL_API_CALL
clSetMemObjectDestructorCallback(cl_mem  p_memobj,
                                 void (CL_CALLBACK *p_pfn_notify)(cl_mem, void *),
                                 void   *p_user_data);

extern CL_API_ENTRY cl_program CL_API_CALL
clCreateProgramWithBinary(cl_context             p_context,
                          cl_uint                p_num_devices,
                          const cl_device_id    *p_device_list,
                          const size_t          *p_lengths,
                          const unsigned char  **p_binaries,
                          cl_int                *p_binary_status,
                          cl_int                *p_errcode_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clRetainProgram(cl_program p_program);

extern CL_API_ENTRY cl_int CL_API_CALL
clReleaseProgram(cl_program p_program);

extern CL_API_ENTRY cl_kernel CL_API_CALL
clCreateKernel(cl_program   p_program,
               const char  *p_kernel_name,
               cl_int      *p_errcode_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clSetKernelArg(cl_kernel    p_kernel,
               cl_uint      p_arg_index,
               size_t       p_arg_size,
               const void  *p_arg_value);

extern CL_API_ENTRY cl_int CL_API_CALL
clRetainKernel(cl_kernel p_kernel);

extern CL_API_ENTRY cl_int CL_API_CALL
clReleaseKernel(cl_kernel p_kernel);

extern CL_API_ENTRY cl_int CL_API_CALL
clReleaseEvent(cl_event p_event);

extern CL_API_ENTRY cl_int CL_API_CALL
clRetainEvent(cl_event p_event);

extern CL_API_ENTRY cl_int CL_API_CALL
clSetUserEventStatus(cl_event  p_event,
                     cl_int    p_execution_status);

extern CL_API_ENTRY cl_event CL_API_CALL
clCreateUserEvent(cl_context   p_context,
                  cl_int      *p_errcode_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clWaitForEvents(cl_uint          p_num_events,
                const cl_event  *p_event_list);

extern CL_API_ENTRY cl_int CL_API_CALL
clEnqueueWriteBuffer(cl_command_queue   p_command_queue,
                     cl_mem             p_buffer,
                     cl_bool            p_blocking_read,
                     size_t             p_offset,
                     size_t             p_cb,
                     const void        *p_ptr,
                     cl_uint            p_num_events_in_wait_list,
                     const cl_event    *p_event_wait_list,
                     cl_event          *p_event);

extern CL_API_ENTRY cl_int CL_API_CALL
clEnqueueReadBuffer(cl_command_queue   p_command_queue,
                    cl_mem             p_buffer,
                    cl_bool            p_blocking_read,
                    size_t             p_offset,
                    size_t             p_cb,
                    void              *p_ptr,
                    cl_uint            p_num_events_in_wait_list,
                    const cl_event    *p_event_wait_list,
                    cl_event          *p_event);

extern CL_API_ENTRY cl_int CL_API_CALL
clEnqueueTask(cl_command_queue   p_command_queue,
              cl_kernel          p_kernel,
              cl_uint            p_num_events_in_wait_list,
              const cl_event    *p_event_wait_list,
              cl_event          *p_event);
Appendix B
Monte Carlo Kernel Execution
1   int
2   MonteCarloAsian::runCLKernels(void)
3   {
4       cl_int status;
5       cl_event events[1];
6       cl_event fevents[3];
7
8       size_t globalThreads[2] = {width, height};
9       size_t localThreads[2] = {blockSizeX, blockSizeY};
10
11      /*
12       * Declare attribute structure
13       */
14      MonteCarloAttrib attributes;
15
16      if (localThreads[0] > maxWorkItemSizes[0] ||
17          localThreads[1] > maxWorkItemSizes[1] ||
18          (size_t)blockSizeX * blockSizeY > maxWorkGroupSize)
19      {
20          std::cout << "Unsupported: Device does not support requested"
21                       " : number of work items.";
22          return SDK_FAILURE;
23      }
24
25      /* width - i.e. number of elements in the array */
26      status = clSetKernelArg(kernel, 2, sizeof(cl_uint), (void*)&width);
27      if (!sampleCommon->checkVal(status,
28                                  CL_SUCCESS,
29                                  "clSetKernelArg failed. (width)"))
30      {
31          return SDK_FAILURE;
32      }
33
34      /* whether sort is to be in increasing order.
35         CL_TRUE implies increasing */
36      status = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void*)&randBuf);
37      if (!sampleCommon->checkVal(status,
38                                  CL_SUCCESS,
39                                  "clSetKernelArg failed. (randBuf)"))
40      {
41          return SDK_FAILURE;
42      }
43
44      status = clSetKernelArg(kernel, 4, sizeof(cl_mem), (void*)&priceBuf);
45      if (!sampleCommon->checkVal(status,
46                                  CL_SUCCESS,
47                                  "clSetKernelArg failed. (priceBuf)"))
48      {
49          return SDK_FAILURE;
50      }
51
52      status = clSetKernelArg(kernel, 5, sizeof(cl_mem),
53                              (void*)&priceDerivBuf);
54      if (!sampleCommon->checkVal(status,
55                                  CL_SUCCESS,
56                                  "clSetKernelArg failed. (priceDerivBuf)"))
57      {
58          return SDK_FAILURE;
59      }
60
61      status = clSetKernelArg(kernel, 1, sizeof(cl_int), (void*)&noOfSum);
62      if (!sampleCommon->checkVal(status,
63                                  CL_SUCCESS,
64                                  "clSetKernelArg failed. (noOfSum)"))
65      {
66          return SDK_FAILURE;
67      }
68
69      struct o4f_kernel_arg karg;
70
71      karg.mem_arg = directionBufF;
72      karg.type = O4F_KERNEL_ARG_CL_MEM;
73      karg.idx = 0;
74      status = clSetKernelArg(kernelF, 0,
75                              sizeof(struct o4f_kernel_arg), (void*)&karg);
76      if (!sampleCommon->checkVal(status,
77                                  CL_SUCCESS,
78                                  "clSetKernelArg failed. (directionBufF)"))
79      {
80          return SDK_FAILURE;
81      }
82
83      karg.mem_arg = randBufF;
84      karg.type = O4F_KERNEL_ARG_CL_MEM;
85      karg.idx = 1;
86      status = clSetKernelArg(kernelF, 1,
87                              sizeof(struct o4f_kernel_arg), (void*)&karg);
88      if (!sampleCommon->checkVal(status,
89                                  CL_SUCCESS,
90                                  "clSetKernelArg failed. (randBufF)"))
91      {
92          return SDK_FAILURE;
93      }
94
95      // there are two kernels running on FPGA
96      karg.int_arg = (noOfTraj * noOfTraj * noOfSum) / 2;
97      karg.type = O4F_KERNEL_ARG_CL_INT;
98      karg.idx = 2;
99      status = clSetKernelArg(kernelF, 2,
100                             sizeof(struct o4f_kernel_arg), (void*)&karg);
101     if (!sampleCommon->checkVal(status,
102                                 CL_SUCCESS,
103                                 "clSetKernelArg failed. (count)"))
104     {
105         return SDK_FAILURE;
106     }
107
108     karg.int_arg = 1;
109     karg.type = O4F_KERNEL_ARG_CL_INT;
110     karg.idx = 3;
111     status = clSetKernelArg(kernelF, 3,
112                             sizeof(struct o4f_kernel_arg), (void*)&karg);
113     if (!sampleCommon->checkVal(status,
114                                 CL_SUCCESS,
115                                 "clSetKernelArg failed. (loadDirection)"))
116     {
117         return SDK_FAILURE;
118     }
119
120     // load the direction numbers.
121     status = clEnqueueWriteBuffer(this->commandQueueF,
122                                   this->directionBufF,
123                                   CL_FALSE,
124                                   0,
125                                   this->sobolBitWidth * this->dimentionCount,
126                                   this->directionNumF,
127                                   0,
128                                   NULL, // event_list,
129                                   &fevents[0]);
130     status = clWaitForEvents(1, &fevents[0]);
131     if (!sampleCommon->checkVal(status,
132                                 CL_SUCCESS,
133                                 "clWaitForEvents failed."))
134     {
135         return SDK_FAILURE;
136     }
137     clReleaseEvent(fevents[0]);
138
139     // Run the rnd generator for the first time.
140     status = clEnqueueTask(this->commandQueueF,
141                            this->kernelF,
142                            0,
143                            NULL,
144                            &fevents[0]);
145     if (!sampleCommon->checkVal(status,
146                                 CL_SUCCESS,
147                                 "clEnqueueTask failed."))
148     {
149         return SDK_FAILURE;
150     }
151
152     /* wait for the kernel call to finish execution */
153     status = clWaitForEvents(1, &fevents[0]);
154     if (!sampleCommon->checkVal(status,
155                                 CL_SUCCESS,
156                                 "clWaitForEvents failed."))
157     {
158         return SDK_FAILURE;
159     }
160     clReleaseEvent(fevents[0]);
161
162     // next time the kernel does not need to load the directional vector
163     karg.int_arg = 0;
164     karg.type = O4F_KERNEL_ARG_CL_INT;
165     karg.idx = 3;
166     status = clSetKernelArg(kernelF, 3,
167                             sizeof(struct o4f_kernel_arg), (void*)&karg);
168     if (!sampleCommon->checkVal(status,
169                                 CL_SUCCESS,
170                                 "clSetKernelArg failed. (loadDirection)"))
171     {
172         return SDK_FAILURE;
173     }
174
175     status = clEnqueueReadBuffer(this->commandQueueF,
176                                  this->randBufF,
177                                  CL_TRUE, // CL_FALSE,
178                                  0,
179                                  this->noOfSum *
180                                  this->noOfTraj *
181                                  this->noOfTraj,
182                                  this->randNum,
183                                  0,
184                                  NULL, // event_list,
185                                  &fevents[0]);
186
187     float timeStep = maturity / (noOfSum - 1);
188
189     // Initialize random number generator
190     // srand(1);
191
192     for (int k = 0; k < steps; k++)
193     {
194         // for (int j = 0; j < (width * height * 4); j++)
195         //{
196         //    randNum[j] = (cl_uint)rand();
197         //}
198         // For k = 0, the random numbers are generated before getting
199         // into the loop. We just wait here to ensure memory transfer
200         // is finished.
201         /* wait for the random numbers to be transferred to host */
202         status = clWaitForEvents(1, &fevents[0]);
203         if (!sampleCommon->checkVal(status,
204                                     CL_SUCCESS,
205                                     "clWaitForEvents failed."))
206         {
207             return SDK_FAILURE;
208         }
209         clReleaseEvent(fevents[0]);
210
211         float c1 = (interest - 0.5f * sigma[k] * sigma[k]) * timeStep;
212         float c2 = sigma[k] * sqrt(timeStep);
213         float c3 = (interest + 0.5f * sigma[k] * sigma[k]);
214
215         const cl_float4 c1F4 = {c1, c1, c1, c1};
216         attributes.c1 = c1F4;
217
218         const cl_float4 c2F4 = {c2, c2, c2, c2};
219         attributes.c2 = c2F4;
220
221         const cl_float4 c3F4 = {c3, c3, c3, c3};
222         attributes.c3 = c3F4;
223
224         const cl_float4 initPriceF4 =
225             {initPrice, initPrice, initPrice, initPrice};
226         attributes.initPrice = initPriceF4;
227
228         const cl_float4 strikePriceF4 =
229             {strikePrice, strikePrice, strikePrice, strikePrice};
230         attributes.strikePrice = strikePriceF4;
231
232         const cl_float4 sigmaF4 =
233             {sigma[k], sigma[k], sigma[k], sigma[k]};
234         attributes.sigma = sigmaF4;
235
236         const cl_float4 timeStepF4 =
237             {timeStep, timeStep, timeStep, timeStep};
238         attributes.timeStep = timeStepF4;
239
240
241         /* Set appropriate arguments to the kernel */
242
243         /* the input array - also acts as output for
244            this pass (input for next) */
245         status = clSetKernelArg(kernel, 0,
246                                 sizeof(attributes), (void*)&attributes);
247         if (!sampleCommon->checkVal(status,
248                                     CL_SUCCESS,
249                                     "clSetKernelArg failed. (attributes)"))
250         {
251             return SDK_FAILURE;
252         }
253
254         /*
255          * Enqueue a kernel run call.
256          */
257         status = clEnqueueNDRangeKernel(commandQueue,
258                                         kernel,
259                                         2,
260                                         NULL,
261                                         globalThreads,
262                                         localThreads,
263                                         0,
264                                         NULL,
265                                         &events[0]);
266
267         if (!sampleCommon->checkVal(status,
268                                     CL_SUCCESS,
269                                     "clEnqueueNDRangeKernel failed."))
270         {
271             return SDK_FAILURE;
272         }
273
274         // Enqueue the rnd generator to generate next set of numbers
275         status = clEnqueueTask(this->commandQueueF,
276                                this->kernelF,
277                                0,
278                                NULL,
279                                &fevents[0]);
280         if (!sampleCommon->checkVal(status,
281                                     CL_SUCCESS,
282                                     "clEnqueueTask failed."))
283         {
284             return SDK_FAILURE;
285         }
286
287         /* wait for the rnd number generator to finish execution */
288         status = clWaitForEvents(1, &fevents[0]);
289         if (!sampleCommon->checkVal(status,
290                                     CL_SUCCESS,
291                                     "clWaitForEvents failed."))
292         {
293             return SDK_FAILURE;
294         }
295         clReleaseEvent(fevents[0]);
296
297
298         /* wait for the kernel call to finish execution */
299         status = clWaitForEvents(1, &events[0]);
300         if (!sampleCommon->checkVal(status,
301                                     CL_SUCCESS,
302                                     "clWaitForEvents failed."))
303         {
304             return SDK_FAILURE;
305         }
306
307         clReleaseEvent(events[0]);
308
309         /* Enqueue reading in the rnd numbers */
310         status = clEnqueueReadBuffer(this->commandQueueF,
311                                      this->randBufF,
312                                      CL_TRUE, // CL_FALSE,
313                                      0,
314                                      this->noOfSum *
315                                      this->noOfTraj *
316                                      this->noOfTraj,
317                                      this->randNum,
318                                      0,
319                                      NULL, // event_list,
320                                      &fevents[0]);
321
322
323         /* Enqueue the results to application pointer */
324         status = clEnqueueReadBuffer(commandQueue,
325                                      priceBuf,
326                                      CL_TRUE,
327                                      0,
328                                      width * height * 2 * sizeof(cl_float4),
329                                      priceVals,
330                                      0,
331                                      NULL,
332                                      &events[0]);
333         if (!sampleCommon->checkVal(status,
334                                     CL_SUCCESS,
335                                     "clEnqueueReadBuffer failed."))
336         {
337             return SDK_FAILURE;
338         }
339
340         /* wait for the read buffer to finish execution */
341         status = clWaitForEvents(1, &events[0]);
342         if (!sampleCommon->checkVal(status,
343                                     CL_SUCCESS,
344                                     "clWaitForEvents failed."))
345         {
346             return SDK_FAILURE;
347         }
348
349         clReleaseEvent(events[0]);
350
351         /* Enqueue the results to application pointer */
352         status = clEnqueueReadBuffer(commandQueue,
353                                      priceDerivBuf,
354                                      CL_TRUE,
355                                      0,
356                                      width * height * 2 * sizeof(cl_float4),
357                                      priceDeriv,
358                                      0,
359                                      NULL,
360                                      &events[0]);
361         if (!sampleCommon->checkVal(status,
362                                     CL_SUCCESS,
363                                     "clEnqueueReadBuffer failed."))
364         {
365             return SDK_FAILURE;
366         }
367
368         /* wait for the read buffer to finish execution */
369         status = clWaitForEvents(1, &events[0]);
370         if (!sampleCommon->checkVal(status,
371                                     CL_SUCCESS,
372                                     "clWaitForEvents failed."))
373         {
374             return SDK_FAILURE;
375         }
376
377         clReleaseEvent(events[0]);
378
379         /* Replace following "for" loop with reduction kernel */
380         for (int i = 0; i < noOfTraj * noOfTraj; i++)
381         {
382             price[k] += priceVals[i];
383             vega[k] += priceDeriv[i];
384         }
385
386         price[k] /= (noOfTraj * noOfTraj);
387         vega[k] /= (noOfTraj * noOfTraj);
388
389         price[k] = exp(-interest * maturity) * price[k];
390         vega[k] = exp(-interest * maturity) * vega[k];
391     }
392
393     // we do an extra set of random numbers, and ask to read it
394     // this set won't be used, but just cleaning up.
395     status = clWaitForEvents(1, &fevents[0]);
396     if (!sampleCommon->checkVal(status,
397                                 CL_SUCCESS,
398                                 "clWaitForEvents failed."))
399     {
400         return SDK_FAILURE;
401     }
402     clReleaseEvent(fevents[0]);
403
404     return SDK_SUCCESS;
405 }
Appendix C
Sobol Sequence Implementation
Performance was not a goal for the Sobol sequence generator. The following two issues
need to be addressed first to improve its performance.
Clock Frequency
The 62.5 MHz clock provided by the PCI-express core is used by the Sobol sequence
module. The floating-point operation core used in the module allows a variable-length
pipeline to tune performance. This pipeline is adjusted simply to meet the 62.5 MHz
frequency. In contrast, Tian [19] reports a random number generator running at 180 MHz
on an older FPGA device. It is evident that there is room to improve the clock
frequency of the Sobol sequence module.
Random Number Generation Throughput
Currently, with two instances of the Sobol sequence module instantiated, the complete
design utilizes 9% of the LUTs and 35% of the block RAMs on the Xilinx Virtex5-
LX110T FPGA device. The two instances themselves utilize only 1% of the LUTs and
1% of the block RAMs. The random number generation throughput of the FPGA can
be greatly increased by instantiating more copies of the module. However, the simple
memory controller used in the static framework restricts the number of instances to two.
Bibliography
[1] Advanced Micro Devices. AMD OpenCL Zone. developer.amd.com/zones/OpenCLZone/Pages/default.aspx.
[2] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona,
Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. Legup: high-level
synthesis for fpga-based processor/accelerator systems. In Proceedings of the 19th
ACM/SIGDA international symposium on Field programmable gate arrays, FPGA
’11, pages 33–36, New York, NY, USA, 2011. ACM.
[3] Eugene Cartwright, Sen Ma, David Andrews, and Miaoqing Huang. Creating
HW/SW co-designed MPSoPC’s from high level programming models. In High Per-
formance Computing and Simulation (HPCS), 2011 International Conference on,
pages 554 –560, july 2011.
[4] Shuai Che, Jie Li, J.W. Sheaffer, K. Skadron, and J. Lach. Accelerating Compute-
Intensive Applications with GPUs and FPGAs. In Application Specific Processors,
2008. SASP 2008. Symposium on, pages 101 –107, june 2008.
[5] B. Cope, P.Y.K. Cheung, W. Luk, and S. Witt. Have GPUs made FPGAs redun-
dant in the field of video processing? In Field-Programmable Technology, 2005.
Proceedings. 2005 IEEE International Conference on, pages 111 –118, dec. 2005.
[6] John B. Drake, Philip W. Jones, and George R. Carr. Overview of the software
design of the community climate system model. Int. J. High Perform. Comput.
Appl, 19:177–186, 2005.
[7] Z. Zhang et al. AutoPilot: A Platform-Based ESL Synthesis System. In High-Level
Synthesis, Springer Netherlands. www.autoesl.com, 2008.
[8] L.W. Howes, P. Price, O. Mencer, O. Beckmann, and O. Pell. Comparing FPGAs
to Graphics Accelerators and the Playstation 2 Using a Unified Source Descrip-
tion. In Field Programmable Logic and Applications, 2006. FPL ’06. International
Conference on, pages 1 –6, aug. 2006.
[9] Sobol Ilya. Uniformly distributed sequences with an additional uniform property.
In USSR Computational Mathematics and Mathematical Physics, Volume 16, pages
236–242, 1977.
[10] Khronos Group. Installable Client Drivers (ICD) Loader. http://www.khronos.org/registry/cl/extensions/khr/cl_khr_icd.txt.
[11] Khronos Group. OpenCL Specification 1.1. www.khronos.org/registry/cl/specs/opencl-1.1.pdf.
[12] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong
Program Analysis & Transformation. In Proceedings of the 2004 International Sym-
posium on Code Generation and Optimization (CGO’04), Palo Alto, California, Mar
2004.
[13] Mingjie Lin, I. Lebedev, and J. Wawrzynek. OpenRCL: Low-Power High-
Performance Computing with Reconfigurable Devices. In Field Programmable Logic
and Applications (FPL), 2010 International Conference on, pages 458–463, aug. 31 –
sept. 2 2010.
[14] Nvidia Corporation. Nvidia OpenCL Support. http://developer.nvidia.com/
opencl.
[15] M. Owaida, N. Bellas, K. Daloukas, and C.D. Antonopoulos. Synthesis of Platform
Architectures from OpenCL Programs. In Field-Programmable Custom Computing
Machines (FCCM), 2011 IEEE 19th Annual International Symposium on, pages 186
–193, may 2011.
[16] A. Papakonstantinou, K. Gururaj, J.A. Stratton, D. Chen, J. Cong, and W.-M.W.
Hwu. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In
Application Specific Processors, 2009. SASP ’09. IEEE 7th Symposium on, pages 35
–42, july 2009.
[17] Michael Showerman, Jeremy Enos, Avneesh Pant, Volodymyr Kindratenko, Craig
Steffen, Robert Pennington, and Wen mei Hwu. QP: A heterogeneous multi-
accelerator cluster. In Proceedings of the 10th LCI International Conference on
High-performance Clustered Computing, march 2009.
[18] David Barrie Thomas, Lee Howes, and Wayne Luk. A comparison of cpus, gpus,
fpgas, and massively parallel processor arrays for random number generation. In
Proceedings of the ACM/SIGDA international symposium on Field programmable
gate arrays, FPGA ’09, pages 63–72, New York, NY, USA, 2009. ACM.
[19] Xiang Tian and K. Benkrid. Massively parallelized quasi-monte carlo financial sim-
ulation on a fpga supercomputer. In High-Performance Reconfigurable Computing
Technology and Applications, 2008. HPRCTA 2008. Second International Workshop
on, pages 1 –8, nov. 2008.
[20] Kuen Hung Tsoi and Wayne Luk. Axel: a heterogeneous cluster with FPGAs and
GPUs. In Proceedings of the 18th annual ACM/SIGDA international symposium on
Field programmable gate arrays, FPGA ’10, pages 115–124, New York, NY, USA,
2010. ACM.
[21] R. Weber, A. Gothandaraman, R.J. Hinde, and G.D. Peterson. Comparing Hardware
Accelerators in Scientific Applications: A Case Study. Parallel and Distributed
Systems, IEEE Transactions on, 22(1):58 –68, jan. 2011.