OpenCL framework for a CPU, GPU, and FPGA Platform
by
Taneem Ahmed
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
Copyright © 2011 by Taneem Ahmed
Abstract
OpenCL framework for a CPU, GPU, and FPGA Platform
Taneem Ahmed
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2011
With the availability of multi-core processors, high capacity FPGAs, and GPUs, a hetero-
geneous platform with tremendous raw computing capacity can be constructed consisting
of any number of these computing elements. However, one of the major challenges for
constructing such a platform is the lack of a standardized framework under which an ap-
plication’s computational task and data can be easily and effectively managed amongst
the computing elements. In this thesis work such a framework is developed based on
OpenCL (Open Computing Language). An OpenCL API and run time framework, called
O4F, was implemented to incorporate FPGAs in a platform with CPUs and GPUs un-
der the OpenCL framework. O4F helps explore the possibility of using OpenCL as the
framework to incorporate FPGAs with CPUs and GPUs. This thesis details the findings
of this first-generation implementation and provides recommendations for future work.
Dedication
To Mohsin - for all the inspiration
Acknowledgements
I would like to acknowledge all the support and guidance provided by my supervisor Prof.
Paul Chow. His direction and feedback on this thesis has been invaluable. I also thank
all the students in the program for their help, feedback, and friendship. Special thanks to
my wife, my mother, and the rest of the family for all their support and patience. I greatly
appreciate all the encouragement and guidance from Dr. Jason Anderson and Dr. Qiang
Wang - the two great ‘friend, philosopher and guide’s I have been blessed with.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 OpenCL Overview 5
2.1 OpenCL Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Platform Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.3 Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.4 Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 OpenCL Application Flow . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Platform Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Runtime Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Related Work 15
4 Heterogeneous Platforms Under the OpenCL Framework 18
4.1 ICD Loader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Application flow under ICD Loader . . . . . . . . . . . . . . . . . . . . . 19
4.3 Challenges of using OpenCL for Heterogeneous Platforms . . . . . . . . . 20
4.3.1 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3.2 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3.3 Cluster Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 OpenCL For FPGA 22
5.1 Application Flow using FPGAs . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.1 OpenCL Code Compilation . . . . . . . . . . . . . . . . . . . . . 23
5.2 Flow Used in this Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.4 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.4.1 OpenCL API Library . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.4.2 Device Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.5 Architecture for FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.5.1 Static Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.5.2 Kernel Organization . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.5.3 Kernel Information . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.6 Benefits of FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.6.1 Task Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.6.2 Data lifetime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.6.3 Resource Utilization . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.7 Challenges of FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.7.1 FPGA Configuration . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.7.2 FPGA Resource Estimation . . . . . . . . . . . . . . . . . . . . . 36
6 Example Application 37
6.1 Potential Application Types . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.1.1 Iterative Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.1.2 Task Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.3 Other Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.2 Example: Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . . 41
6.2.1 Reason for using Monte Carlo simulation . . . . . . . . . . . . . . 41
6.2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2.3 Application Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.3.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.3.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7 Summary 50
7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A Implemented OpenCL API List 52
B Monte Carlo Kernel Execution 58
C Sobol Sequence Implementation 71
Bibliography 73
List of Tables
2.1 OpenCL Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5.1 BAR1 Offsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
List of Figures
1.1 OpenCL Framework Implementation . . . . . . . . . . . . . . . . . . . . 2
2.1 OpenCL Platform Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 An example 3D indexed kernel space . . . . . . . . . . . . . . . . . . . . 8
2.3 OpenCL Application Flow . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.1 Multiple OpenCL Implementations Under ICD Loader . . . . . . . . . . 19
4.2 Possible OpenCL Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.1 FPGA Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2 Actual Flow used in this work . . . . . . . . . . . . . . . . . . . . . . . . 25
5.3 UML Class Diagram of the API Library . . . . . . . . . . . . . . . . . . 26
5.4 Kernel Command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.5 One Kernel Group with two Kernels . . . . . . . . . . . . . . . . . . . . . 32
5.6 Two Kernel Groups with one Kernel each . . . . . . . . . . . . . . . . . . 33
6.1 Monte Carlo Simulation Flowchart . . . . . . . . . . . . . . . . . . . . . 38
6.2 Components in Community Climate System Model . . . . . . . . . . . . 39
6.3 Possible application of the platform . . . . . . . . . . . . . . . . . . . . . 40
6.4 Distribution of the Monte Carlo simulation tasks across three different
architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Chapter 1
Introduction
The availability of multi-core CPUs, high capacity FPGAs and GPUs makes possible
a heterogeneous platform with enormous computational capacity. Previous research
[4, 5, 8] has shown that each type of processor technology is ideally suited to imple-
ment specific types of functions. Thus an application with multiple compute intensive
segments would benefit from a heterogeneous platform consisting of different processor
technologies. However, mass adoption of such platforms remains elusive due to the
challenging task of programming for such heterogeneous platforms.
In the remainder of this Chapter, Section 1.1 details the motivation of this research,
Section 1.2 summarizes the contributions, and Section 1.3 provides the outline of this
thesis.
1.1 Motivation
CPUs, GPUs, and FPGAs all have their own programming models that are very differ-
ent from each other. Moving to a heterogeneous platform makes it even more difficult to
present a unified programming model that works for all architectures. All of the existing
heterogeneous platforms define their own programming paradigm and application devel-
opment process. There is always a learning curve for the application developers to even
evaluate such a platform. The lack of a standardized framework for application developers is a major barrier for mass adoption of such platforms. So far OpenCL (Open Computing Language), a trademark of Apple Inc., seems like a promising framework to address this issue. The fact that OpenCL is the only common framework supported by all GPU vendors currently makes it the sole candidate to provide a unified programming model for heterogeneous platforms containing GPUs.

Figure 1.1: OpenCL Framework Implementation (components: the OpenCL application, the OpenCL language with its compiler, the OpenCL API provided by an OpenCL library containing the platform layer and the runtime, the device driver, and the hardware)
OpenCL is a complete framework consisting of a programming language, a set of
APIs, and hardware that supports OpenCL constructs. Figure 1.1 shows all the com-
ponents necessary to realize an OpenCL framework implementation, hereafter referred
to simply as an OpenCL implementation. An OpenCL implementation encapsulates
a library that implements the OpenCL API, a toolchain to compile the OpenCL lan-
guage for the target architecture, computational devices that support OpenCL concepts,
and device drivers to communicate with the devices if necessary. It is possible for one
OpenCL implementation to support different types of devices, e.g. the AMD OpenCL
implementation supports GPUs and CPUs from AMD.
Due to active support for OpenCL from CPU and GPU vendors, existing worksta-
tions with supported GPUs have become heterogeneous platforms for general purpose
computing. OpenCL is increasingly becoming the standard framework for CPU+GPU
platforms. However, there are no existing OpenCL implementations that integrate FP-
GAs under the OpenCL framework. The motivation of this research work is to explore
the feasibility of OpenCL as the standard framework for developing applications for het-
erogeneous platforms with CPUs, GPUs, and FPGAs.
1.2 Research Contributions
The work required to integrate FPGAs under the OpenCL framework for heterogeneous
platforms can be divided into three segments. Firstly, a target architecture supporting
OpenCL concepts needs to be defined for FPGA implementations. The architecture can
be based on an array-of-soft-processors or an array-of-custom-cores. Secondly, a tool
is required to convert computation described in the OpenCL language to the targeted
architecture. This tool can be a compiler or high-level synthesis tool depending on
the architecture. Thirdly, a middleware is necessary to integrate FPGAs with existing
OpenCL implementations for CPUs and GPUs. The middleware consists of a library that
implements the OpenCL API standard and a device driver to facilitate communication
between this library and the FPGA.
There is existing and ongoing research work on the first two segments described above.
The major contribution of this thesis is to provide the first known implementation of the
middleware required for FPGA integration. This middleware allows FPGAs to be used
under the OpenCL framework along with commodity CPUs and GPUs. During the
development of the middleware, some challenges were exposed that are described along
with possible solutions.
As a target architecture is necessary to test the middleware, an architecture based
on an array-of-custom-cores is also described in this thesis. Different aspects of the
architecture are examined and some recommendations are made for future research work.
1.3 Thesis Outline
The remainder of this thesis is organized as follows. Chapter 2 provides an overview
of the OpenCL framework and OpenCL application execution flow. This overview in-
troduces the OpenCL terminology before introducing the related research work based on
OpenCL and FPGAs in Chapter 3. The overview in Chapter 2 also helps with under-
standing how a heterogeneous platform is created under the OpenCL framework, which
is described in Chapter 4. Chapter 5 describes the details of the OpenCL implementation
for FPGA (O4F) developed for this thesis. This chapter also discusses the challenges and
future work for any OpenCL implementation for FPGAs. A Monte Carlo simulation for
Asian Options was run to test out the platform including CPU, GPU, and FPGA. Chap-
ter 6 describes the reason for deciding on this test application and how the computational
work was partitioned. Chapter 7 has the conclusions and future work.
Chapter 2
OpenCL Overview
OpenCL is an open standard targeted for general-purpose parallel programming on differ-
ent types of processors. The goal of OpenCL is to provide software developers a standard
framework for easy access to heterogeneous processing platforms. The OpenCL standard
specifies a set of APIs and a programming language based on C. For the purpose of this
thesis it is only necessary to describe the OpenCL concepts rather than the technical
details. The technical details of the OpenCL framework can be found in the OpenCL
specification [11].
The OpenCL framework can be best understood by the four models explained in
Section 2.1. Section 2.2 describes the execution flow of an OpenCL application.
2.1 OpenCL Models
The following four models describe the core ideas behind the OpenCL framework.
• Platform Model
• Memory Model
• Execution Model
• Programming Model

Figure 2.1: OpenCL Platform Model (a host connected to one or more OpenCL devices; each device contains compute units, which in turn contain processing elements)
2.1.1 Platform Model
Figure 2.1 depicts all the components of the OpenCL platform model. An OpenCL
application is executed on the host and most of the runtime control of the application
resides on the host. There can be one or more computing devices connected to the
host. The OpenCL standard does not specify the type of connectivity, i.e. whether the
connection is by a bus, e.g. PCI, PCI-express, etc., or over an Ethernet network. The
OpenCL implementation specific to each device is responsible for the communication and
it is hidden from the application developer.
Each OpenCL device has one or more Compute Units (CU), and each CU has one or
more Processing Elements (PE). The actual computation is done on the PEs. Consider
the case of a GPU. The card containing the GPU is the OpenCL device. This card
contains the GPU which is the compute unit, and each GPU contains processing cores
which are the processing elements.
2.1.2 Execution Model
The execution of an OpenCL application has two components. One part, called the
kernel, executes on the devices, and the other part executes on the host. The host
part manages the kernels and the memory objects under a context through command
queues.
Context
The context contains all the pieces necessary to use a device for computation. Using
the OpenCL API, the host part of the application creates a context object and the
other objects under it, i.e. kernel object, program object, memory objects, and command
queue objects.
Kernel
The kernel represents the computation that is executed on the processing elements. The
following simple example is used to clarify the kernel concept. Assume there is an integer array of length 10 and the goal is to multiply each integer by a constant. The
kernel for this problem would only represent multiplication of one integer by the constant,
and the kernel would be instantiated 10 times to solve the complete problem. However,
out of consideration for processor utilization and memory access, it is possible to multiply
two integers in the same kernel. In that case the kernel would be instantiated five times
to solve the complete problem.
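As an illustrative sketch (not code from the thesis), such a kernel could be written in OpenCL C as follows, with one work-item per array element:

    __kernel void scale(__global int *data, const int factor)
    {
        int i = get_global_id(0);   /* unique global ID of this work-item */
        data[i] = data[i] * factor;
    }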
Work Items and Work Groups
A virtual N-dimensional indexed space is defined for the execution of the kernel, and one
kernel instance is executed for each point in this indexed space. The value of N can
be one, two, or three. Each kernel instance is called a work-item. All the work-items
execute the same code, however, they usually work on different data and their execution
path through the code can diverge. Each work-item is assigned a global ID that is unique across the indexed space.

Figure 2.2: An example 3D indexed kernel space (a global index space of dimensions Gx, Gy, Gz divided into work-groups of dimensions Wx, Wy, Wz, each containing individual work-items)
Equal numbers of work-items are grouped together to form work-groups, with all the
work-groups having the same dimensions. Each work-item within a work-group has a local
ID that is unique within the work-group, and also has access to shared local memory as
described in Section 2.1.3.
It is important to note that with proper device support, the total number of work-
items can be much greater than the number of processing elements present in a device.
Through API calls an application can find out the maximum number of work-items a
device supports.
Program and Memory Object
The program object consists of the source code and the binary implementation of the
kernels. During application execution, the binary implementation can be generated from
the source code, or a pre-compiled binary can be loaded to create the program object.
A program object can be considered as a library for kernels because one program object
can contain multiple kernels. The application decides which kernel to execute during
runtime.
The memory objects are visible to both the host and the kernels, and are used to transfer
data between the host and the device. The host creates memory objects, and through
the OpenCL API allocates memory on the device for the memory objects. The details
of the memory model are described in Section 2.1.3.
Command Queue
Each device in the context has an associated command queue, and kernel execution and
memory transfer are coordinated using the command queue. There are three types of
commands that can be issued. Memory commands are mainly used to transfer memory
between the host and the device. Kernel commands are issued to start the execution of
kernels on the device. Synchronization commands can be used to control the execution
order of the commands.
Once the commands are scheduled on the queue, there are two possible execution
modes. The commands can be executed in-order, meaning the previous command on the
queue must finish execution for a command to start execution. The other option is for the
commands to execute out-of-order, where commands do not wait for previously queued
commands to finish. However, explicit ordering can be enforced in an out-of-order queue
by synchronization commands.
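For example, a hedged sketch (the variable names are illustrative) of enforcing an order in an out-of-order queue by making a kernel launch wait on the event of a preceding non-blocking write:

    cl_event write_done;
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, nbytes, src,
                         0, NULL, &write_done);
    /* the kernel does not start until the write above has completed */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           1, &write_done, NULL);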
2.1.3 Memory Model
The memory model in OpenCL is divided into four types based on the memory access capabilities of the work-items. Table 2.1, based on Table 3.1 in [11], summarizes the memory types. Dynamic allocation means that the memory is allocated at run time, and static allocation means that it is allocated at compile time. The address space qualifiers corresponding to these regions are illustrated in a short kernel sketch after the list below.
Table 2.1: OpenCL Memory Model

                 Global              Constant            Local               Private
  Host           Dynamic Allocation  Dynamic Allocation  Dynamic Allocation  No Allocation
                 Read/Write Access   Read/Write Access   No Access           No Access
  Kernel         Static Allocation   Static Allocation   Static Allocation   Static Allocation
                 Read/Write Access   Read Access         Read/Write Access   Read/Write Access

• Global Memory: All work-items have read-write access to this memory region.
Usually the input data for the work-items are written to this region by the host,
and the computed output data is written there by the work-items.
• Constant Memory: This is a Read-Only global memory accessible to all work items.
The host part of the application allocates and initializes this memory region.
• Local Memory: This memory region is the local memory for a work-group. All
the work-items in a work-group share this memory region. This memory allows
work-items to communicate with each other within a work-group.
• Private Memory: This memory region represents the local variables of the kernel
instance. Each work-item has its own copy of the local variables and they are only
visible to the work-item.
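As a minimal illustration (an assumed example, not taken from the thesis), the four regions correspond to address space qualifiers in the OpenCL language:

    __kernel void scale_and_share(__global float *data,       /* global memory   */
                                  __constant float *factor,   /* constant memory */
                                  __local float *scratch)     /* local memory    */
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        float v = data[gid] * factor[0];   /* v lives in private memory */
        scratch[lid] = v;                  /* shared within the work-group */
        barrier(CLK_LOCAL_MEM_FENCE);
        data[gid] = scratch[lid];
    }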
2.1.4 Programming Model
Under the OpenCL programming model, computation can be done in data parallel, task
parallel, or a hybrid of these two models. The main focus of the OpenCL programming
model is the data parallel model, where each work-item works on a data item - effectively
implementing SIMD.
The task parallel model can be realized by enqueueing the execution of multiple kernels, where only one work-item for each kernel is created. Even though some GPUs support this model, it is a highly inefficient model for GPUs.
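A hedged host-side sketch of this model (queue, kernel_a, and kernel_b are assumed to have been created already):

    /* Each clEnqueueTask call launches a kernel as a single work-item,
       so two different kernels can be queued for execution. */
    clEnqueueTask(queue, kernel_a, 0, NULL, NULL);
    clEnqueueTask(queue, kernel_b, 0, NULL, NULL);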
It is possible to have a hybrid model where multiple kernels each with multiple work-
items are enqueued for execution at the same time.
2.2 OpenCL Application Flow
The OpenCL application flow is depicted in Figure 2.3, with the steps numbered for
reference in the following discussion. The flow is split into two sections. The platform
layer creates a context based on available platforms, and the runtime layer creates all
other necessary objects to execute the kernel.
2.2.1 Platform Layer
An OpenCL application initially queries for the available OpenCL platforms (step 1).
Once the available platform list is gathered, the application selects the one with the
desired device type (step 2) and creates a context. Possible device types allowed in
the OpenCL specification are CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU, and CL_DEVICE_TYPE_ACCELERATOR. The context then adds the desired number of
devices from the available devices (step 3). Once added to a context, the devices are
made exclusive to the context until they are explicitly released from the context.
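A minimal host-side sketch of the platform layer (steps 1 to 3), assuming a single platform with a GPU device is wanted and omitting error checking:

    #include <CL/cl.h>

    /* Platform layer: pick a platform (step 1), a device of the desired
       type (step 2), and create a context containing it (step 3). */
    static cl_context create_gpu_context(cl_device_id *device)
    {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, device, NULL);
        return clCreateContext(NULL, 1, device, NULL, NULL, NULL);
    }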
2.2.2 Runtime Layer
The tasks described below are considered to be part of the runtime layer. Note that it is
not necessary to execute the tasks in the same order as explained below.
Figure 2.3: OpenCL Application Flow. The platform layer covers steps 1-3 and the runtime layer covers steps 4-11, with the corresponding OpenCL API calls:

1. Get platform list: clGetPlatformIDs
2. Pick platform with device type T
3. Create context of type T: clGetDeviceIDs, clCreateContextFromType
4. Create command queue for device: clCreateCommandQueue
5. Create memory objects: clCreateBuffer
6. Create program object from source/binary: clCreateProgramWithSource, clCreateProgramWithBinary
7. Copy host memory to device memory: clEnqueueWriteBuffer
8. Set up kernel and arguments: clCreateKernel, clSetKernelArg
9. Run kernel: clEnqueueTask, clEnqueueNDRangeKernel
10. Copy device memory to host memory: clEnqueueReadBuffer
11. Clean up
The communication between the host and the devices is done using the commands
explained in Section 2.1.2. To issue these commands to the devices, a command queue
is created for each device selected under the context (step 4). Whenever a command is
issued, an optional OpenCL event object can be created. These event objects allow the
application to check for the completion of the command, and can be used for explicit
synchronization.
The memory objects are created to allocate memory on the devices (step 5). The
permission to read and/or write to these memory objects from the host is set by the
application when they are created.
The program objects are created by either loading the source code or the binary
implementation of one or more kernels (step 6). The binary implementation can either be
the device-specific executable or the intermediate representation (IR) used by the current
OpenCL implementation. Once created, the program objects are then built to generate
the device-specific executable. The OpenCL implementation decides what action to take
in the build stage depending on whether source code, IR, or an executable was used to
create the program object. The OpenCL API allows writing of the binary implementation
to a file that can be used in the later runs of the application. The format of the output
file is not part of the OpenCL specification, and the OpenCL implementation decides a
convenient format. Once the executable is built in the program object, the kernel object
is created from it. The kernel object represents one of the functions implemented in the
program object.
Before executing the kernel, the input data is transferred to the device memory by
issuing memory copy commands against the associated memory objects (step 7). The
memory transfer can be blocking where control is returned to the application once the
memory transfer is complete, or non-blocking where control is returned after the memory
transfer is scheduled. For non-blocking transfer, events are used for synchronization.
Once the input data is transferred, the values of the kernel arguments are set (step 8)
and the kernel is scheduled for execution through the command queue (step 9). Once
the kernel execution is complete, the output memory is transferred to the host from
the device (step 10). It is possible to have an iterative process where the same kernel
is scheduled to run again. New input data can be transferred to the device, and new
output data transferred back to the host after the kernel execution.
As a final step all the OpenCL objects are released (step 11) once all the computation
is done.
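Putting the runtime-layer steps together, a hedged host-side sketch follows; ctx and device are assumed to come from the platform layer, src holds the kernel source of Section 2.1.2, and nbytes, factor, host_in, and host_out are assumed to be set up by the application:

    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);          /* step 4  */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, NULL);  /* step 5  */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);    /* step 6  */
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, nbytes, host_in, 0, NULL, NULL); /* step 7  */
    cl_kernel k = clCreateKernel(prog, "scale", NULL);                        /* step 8  */
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(k, 1, sizeof(int), &factor);
    size_t gsize = 10;                                    /* one work-item per element */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL, 0, NULL, NULL);       /* step 9  */
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, nbytes, host_out, 0, NULL, NULL); /* step 10 */
    clReleaseKernel(k); clReleaseProgram(prog); clReleaseMemObject(buf);      /* step 11 */
    clReleaseCommandQueue(q); clReleaseContext(ctx);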
Chapter 4 describes how multiple OpenCL implementations can be used in the same
OpenCL application.
Chapter 3
Related Work
As mentioned in Section 1.2, there is no known work towards integration of FPGAs in a
heterogeneous platform with CPUs and GPUs using OpenCL. In this Chapter, research work on architectures that support the OpenCL framework on FPGAs and on tools that convert the OpenCL language for FPGAs is discussed. (OpenCL implementations for GPUs are provided by the commercial vendors; the details of OpenCL support for AMD and Nvidia GPUs can be found at [1] and [14] respectively.)
Lin et al. [13] presents the Open Reconfigurable Computing Language (OpenRCL)
framework that is based on the OpenCL framework but only targets FPGAs. The archi-
tecture for OpenRCL is based on an array of MIPS processors. A crossbar switch with
a scheduler is used to connect the processors to the memory regions. OpenRCL also
provides a LLVM-based compiler [12] to convert kernels written in the OpenCL language
to target their architecture. With comparable performance versus the Nvidia GeForce
9400m GPU, OpenRCL shows a 5-fold power benefit for their test application.
SOpenCL (Silicon OpenCL) [15] is an OpenCL-based FPGA architecture synthesis
tool. It converts the OpenCL kernels into accelerators and targets a template-based
architecture. It has a predefined datapath and memory access module. A LLVM-based
high-level synthesis tool converts the kernel into an accelerator and it is inserted in the
template architecture. During high-level synthesis it combines all the work-items in a
work-group into one accelerator to reduce the number of accelerator instances. Under
SOpenCL the host part of the OpenCL application runs on the PowerPC located in the
FPGA.
FSM SYS Builder [3] parses an OpenCL application to generate an array-of-processors
with MicroBlaze soft processors, and compiles the kernel source code to be executed on
MicroBlaze. Their approach is to use OpenCL as a high-level programming model to
generate a hardware/software co-designed multiprocessor system on programmable chip.
However, this work cannot be integrated with CPUs and GPUs due to the lack of the
middleware.
Instead of OpenCL, FCUDA [16] generates FPGA platforms based on applications
written in CUDA. In 2007 Nvidia introduced CUDA (Compute Unified Device Architec-
ture) to allow programmers access to their GPUs for general purpose computation. In
FCUDA, the developer annotates the CUDA kernels with FCUDA pragmas that guide
the conversion of CUDA code to AutoPilot [7] C code.
Even though not based on the OpenCL framework, QP (Quadro Plex) [17] is a het-
erogeneous cluster consisting of CPUs, GPUs, and FPGAs. Each node of the QP cluster
has two dual-core CPUs, four GPUs, and one FPGA. The programming for the CPU
is done using common compilers, and CUDA is used for programming the GPUs.
The FPGA is programmed using DIME-C code, and Nallatech’s DIME-C C to VHDL
Function Generator is used to translate DIME-C C code to VHDL.
Similar to QP, Axel [20] is another heterogeneous cluster consisting of CPUs, GPUs,
and FPGAs. Axel also does not provide a unified programming model. GPUs and
FPGAs are programmed separately using CUDA and Xilinx ISE tools respectively. The
CPU part of the application is compiled using GCC.
None of the known OpenCL works described above address the issue of the mid-
dleware layer that would enable an OpenCL system to include FPGAs that can interact
with CPUs and GPUs. The next Chapter describes how multiple implementations of
the OpenCL framework can interact with each other.
Chapter 4
Heterogeneous Platforms Under the
OpenCL Framework
The OpenCL API has all the necessary function calls to construct a heterogeneous plat-
form under the OpenCL framework. However, each device vendor provides proprietary
OpenCL implementations and there are no API calls to integrate various different imple-
mentations. In this research work the OpenCL extension installable client driver (ICD)
loader [10] is used to achieve this goal.
4.1 ICD Loader
The ICD loader is an OpenCL extension that allows multiple OpenCL implementations
to co-exist on a host system. When an application is written against the ICD loader, in-
stead of a specific implementation, the application has access to all the available platforms
provided by all the existing implementations on the host. The ICD loader decouples an
OpenCL application binary from a specific implementation, and allows the application
to select an implementation at runtime. Figure 4.1 illustrates a scenario where implementations A and B are available to the application through the ICD loader.
On a Linux host system, an ICD compliant OpenCL implementation registers itself
with the ICD loader by adding a file in the /etc/OpenCL/vendors/ directory. The file contains the name of the dynamic library that has the OpenCL implementation. The ICD loader scans this directory to enumerate available implementations, and presents them to the application. For a Windows host system the Windows registry is used to register OpenCL implementations.

Figure 4.1: Multiple OpenCL Implementations Under ICD Loader (the OpenCL application sits on top of the ICD loader, which dispatches to OpenCL implementations A and B; each implementation provides its own OpenCL API, platform layer, runtime, and compiler, and uses its own driver and hardware)
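For example, assuming the O4F library is installed as libO4F.so (a hypothetical name), the registration could be a one-line vendor file:

    # contents of /etc/OpenCL/vendors/o4f.icd (hypothetical file name)
    libO4F.so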
4.2 Application flow under ICD Loader
When an application intends to utilize multiple OpenCL implementations, the flow is similar to the one explained in Section 2.2. The following description uses the scenario
depicted in Figure 4.1 to explain. While gathering platform information, the applica-
tion will be presented with the platforms provided by both A and B implementations.
However, instead of creating one context, the application needs to create two separate
contexts to use both devices A and B.
After creating the two contexts, the application needs to create separate copies of
all the other objects, e.g. memory objects, kernel objects, command queue objects, etc.
This is necessary because the ICD layer does a 1-to-1 mapping of all the OpenCL API
calls to an implementation based on the objects used in the API.
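A hedged host-side sketch of this step, enumerating the platforms exposed by the ICD loader and creating one context per implementation (error checking omitted):

    cl_uint n = 0;
    clGetPlatformIDs(0, NULL, &n);              /* number of registered platforms */
    cl_platform_id plats[8];
    clGetPlatformIDs(n < 8 ? n : 8, plats, NULL);
    cl_context_properties props_a[] =
        { CL_CONTEXT_PLATFORM, (cl_context_properties)plats[0], 0 };
    cl_context_properties props_b[] =
        { CL_CONTEXT_PLATFORM, (cl_context_properties)plats[1], 0 };
    cl_context ctx_a = clCreateContextFromType(props_a, CL_DEVICE_TYPE_ALL,
                                               NULL, NULL, NULL);
    cl_context ctx_b = clCreateContextFromType(props_b, CL_DEVICE_TYPE_ALL,
                                               NULL, NULL, NULL);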
The OpenCL framework does not provide any high-level modeling to decompose tasks
to be executed in parallel. The application developer needs to explicitly define the tasks
to be executed on each device, and also manually partition the associated data. Any
synchronization for data or kernel execution between the two devices needs to be explicitly
managed by the application.
4.3 Challenges of using OpenCL for Heterogeneous
Platforms
Devices with different architectures can be integrated under the OpenCL framework
using the ICD loader extension. During the course of this research, however, some of
the challenges to implement an efficient heterogeneous platform under OpenCL became
evident. The following challenges apply when multiple devices are considered.
4.3.1 Synchronization
Under the OpenCL framework, the host is the central point for all application control
logic. A device is completely unaware of any other device being used by the application.
This lack of visibility restricts any direct communication between the devices and requires
the host for all coordination. The host needs to manage any synchronization necessary
among tasks running on different devices.
4.3.2 Data Transfer
An important aspect for overall efficiency of a heterogeneous platform is the capability
to transfer data efficiently. OpenCL implements a distributed memory model but lacks
the support for point-to-point data transfer. Under the current framework, the host is
involved in all data transfer between devices. Data must be first transferred to the host
to move it to another device, doubling the time required for the transfer.

Figure 4.2: Possible OpenCL Cluster (the application, through the ICD loader, uses a virtual OpenCL implementation that communicates over a network with clients; each client has its own ICD loader and local OpenCL device)
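In terms of API calls, moving a buffer from device A to device B under the current framework requires two transfers staged through host memory; a hedged sketch (queue_a, queue_b, buf_a, buf_b, host_tmp, and nbytes are assumed to exist):

    clEnqueueReadBuffer(queue_a, buf_a, CL_TRUE, 0, nbytes, host_tmp,
                        0, NULL, NULL);             /* device A -> host   */
    clEnqueueWriteBuffer(queue_b, buf_b, CL_TRUE, 0, nbytes, host_tmp,
                         0, NULL, NULL);            /* host   -> device B */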
4.3.3 Cluster Support
As mentioned earlier, OpenCL does not specify the connectivity type between the host
and the devices, and in theory this allows the creation of a networked cluster of devices
to run an application. However, due to the lack of explicit clustering support in the
framework, all the existing OpenCL implementations assume the devices to be on the host
itself. It will be a challenging task to construct a cluster under the current framework with
available OpenCL implementations. A virtual OpenCL implementation with a server-
client architecture needs to be developed for such a cluster. The server would run on the
host and through the ICD loader provide a unified view of the cluster to the OpenCL
application. The clients running on the nodes would in fact be OpenCL applications with
an extra layer to communicate with the server.
Figure 4.2 shows a possible OpenCL cluster. However, the complexity can be largely
reduced if the OpenCL API is extended to support explicit clustering, or an OpenCL
extension similar to the ICD loader is introduced.
Chapter 5
OpenCL For FPGA
The motivation of this thesis is to explore the OpenCL framework as a unified program-
ming model for a platform consisting of CPUs, GPUs, and FPGAs. This work focuses on
an OpenCL implementation for FPGAs as there are existing vendor provided OpenCL
implementations for CPUs and GPUs. An OpenCL implementation for FPGAs can be
divided into three segments: 1) a target architecture supporting OpenCL concepts needs
to be defined; 2) a compiler or high-level synthesis tool to compile OpenCL C code
for the architecture; and 3) a middleware to integrate FPGAs with existing OpenCL
implementations.
There is existing and ongoing research work on the first two parts [13], [15], [3], [16],
however, the lack of a middleware prevents any of these works from interacting with other OpenCL implementations. In this work, a middleware was developed along with a
lightweight architecture framework for FPGA implementations. The middleware allows
interaction with vendor provided OpenCL implementations for CPUs and GPUs using
the ICD loader (as explained in Section 4.2). The architecture uses custom cores as
processing elements instead of soft processors. The details of the middleware and the
architecture are described in this Chapter.
Section 5.1 describes the FPGA specific step necessary in an OpenCL application.
The actual flow used in this work is described in Section 5.2. A short description of
the hardware and software used in this work is provided in Section 5.3. Section 5.4
describes in detail the software part of the middleware developed for this work. Details
of the architecture used for this work are explained in Section 5.5. FPGA usage has its
own benefits and challenges, but during this work some OpenCL-specific benefits and challenges were noticed that are described in Sections 5.6 and 5.7 respectively.
Note that all the discussions in these sections are focused on a custom-core based
architecture. An array-of-soft-processors based FPGA design has a process similar to
GPUs, and it is noted where relevant.
5.1 Application Flow using FPGAs
An OpenCL application flow using FPGAs as OpenCL devices is almost exactly the same as the one explained in Section 2.2. However, due to the configurable nature of the FPGA, an
extra step is required as explained below.
5.1.1 OpenCL Code Compilation
This step involves building the program object after it has been created from the source
code. For a custom-core-based architecture, a predefined static framework is necessary
that contains the interface to the host, the memory, logic to implement OpenCL related
concepts, and an interface for the custom cores to communicate with this framework.
Figure 5.1 provides an overview of this design. In the figure the kernels represent the
custom cores.
From the source code, a high-level synthesis tool implements the kernels in HDL
with the interface to interact with the static framework. Once the core is generated,
the required number of the cores are instantiated and glued to the static framework to
create the complete FPGA design. FPGA vendor provided CAD tools then generate
the configuration bitstream for this design. The FPGA is then configured using this bitstream. The timing of the FPGA configuration presents a few challenges and they are discussed in Section 5.7.1.

Figure 5.1: FPGA Design Overview (application-specific kernel cores attached to a static framework consisting of the interface to the host, the kernel controller, the memory controller, and on-chip global memory, connected by data and command/control paths)
Note that if the architecture is an array-of-soft-processors, then the source code
simply needs to be compiled to produce the binary for the target processor. This binary
would be downloaded to the processors when the kernel is being executed.
5.2 Flow Used in this Work
The lack of a high-level synthesis tool to convert OpenCL C code to HDL and FPGA
configuration challenges (see Section 5.7.1) forced a flow with some manual steps for this
work as shown in Figure 5.2.
Once the computation of the application is partitioned, the part assigned for the
FPGA is coded manually to create the kernels in HDL. Instances of this core are then
manually integrated with the static framework and CAD tools are run to generate the
configuration bitstream. The FPGA is configured before running the OpenCL applica-
tion.

Figure 5.2: Actual Flow used in this work (manual steps: a custom core is developed and integrated with the static framework, and the design bitstream is generated and the FPGA configured; the OpenCL C kernels for the CPU/GPU are developed as usual; the OpenCL application is then run)
The kernel code for the CPU/GPU part is developed as usual with the rest of the OpenCL application, and it is executed following the flow shown in Figure 2.3.
5.3 Experimental Setup
The heterogeneous platform used in this work included an AMD Athlon 7750 Dual-Core
Processor running at 1.4GHz, a graphics card with an ATI Radeon HD 5450 GPU, and
the Xilinx XUPV5 board with a Virtex5-LX110T FPGA. The graphics card is connected
to the motherboard using a 16-lane PCI-express interface, and the 1-lane PCI-express interface of the XUPV5 board is used to connect the FPGA board.
The OpenCL implementation packaged with the AMD-APP-SDK-v2.4-lnx64 is used
for the CPU and the GPU, and the implementation developed for this work is used for
the FPGA. The OS running on the host is CentOS 5.6. The Xilinx ISE 12.3 tool is used
to compile the FPGA design.
Figure 5.3: UML Class Diagram of the API Library (the classes o4f_platform, o4f_device, o4f_context, o4f_command_queue, o4f_program, o4f_kernel, o4f_mem, and o4f_event, with o4f_context associated with all of the other classes)
5.4 Software
The software part of the middleware is divided into two parts - the library that implements
the OpenCL API, and the device driver that allows communication between this library
and the FPGA design. The source code for this work will be made publicly available and
the details of the implementation can be found in the source code. A brief overview is
provided here.
5.4.1 OpenCL API Library
A multi-threaded dynamic library is designed and developed to implement the OpenCL
API specification 1.1 [11]. Only a subset of the OpenCL API deemed necessary to integrate the FPGA as an OpenCL device has been implemented. Appendix A lists this subset of the API.
Figure 5.3 shows the UML class diagram for the major classes used in the API library.
The class o4f_context, representing an OpenCL context, has a relationship to all the other classes because an OpenCL context contains all the other objects in an application. Classes o4f_program and o4f_kernel represent an OpenCL program and an OpenCL kernel respectively. In this work the kernel on the FPGA is pre-configured, and these two classes are placeholders for future work when kernels can be created at runtime.
Class o4f_command_queue represents the command queue. When a command is issued, a new thread and an instance of the class o4f_event, representing an event, are created. The o4f_event tracks the new thread. A command can be instructed to wait for the completion of previously generated events before being executed. To accommodate this explicit synchronization, an o4f_event can hold a collection of o4f_event objects.
The relationship shown in the diagram supports one FPGA board with one chip, but
it can easily be extended to support multiple boards and multiple chips on each board.
However, supporting multiple boards or FPGA chips would require significant redesign of
the device driver and a few modifications to the static framework of the FPGA design.
ICD Compatibility
Initially the library was developed to implement the OpenCL API to allow an OpenCL
application to interact solely with an FPGA device. It was later modified to support the
ICD loader extension to interact with commercial OpenCL implementations. Access to
the ICD implementation source code from Khronos Group was necessary to make the
modifications because specific additions to the data structures are required.
The ICD loader initially queries a registered library (see Section 4.1) through the
function call clGetExtensionFunctionAddress to get the address of the functions
clIcdGetPlatformIDsKHR and clGetPlatformInfo. The detailed process of how
the ICD loader works is described in [10].
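A hedged sketch of how that first query might look from the loader's side; the typedef and variable names are illustrative:

    typedef cl_int (*pfn_clIcdGetPlatformIDsKHR)(cl_uint num_entries,
                                                 cl_platform_id *platforms,
                                                 cl_uint *num_platforms);

    pfn_clIcdGetPlatformIDsKHR get_platform_ids =
        (pfn_clIcdGetPlatformIDsKHR)
            clGetExtensionFunctionAddress("clIcdGetPlatformIDsKHR");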
Events
Event objects are created when memory transfer or task execution commands are en-
queued. A thread is spawned for each event to execute the command. This allows
returning control to the main application without blocking it, and makes it con-
venient to check the status of the event. During explicit synchronization on an event
completion, the main process sleeps until the thread finishes instead of using any polling
mechanism.
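A minimal sketch of this scheme using POSIX threads; the type and function names are illustrative and do not mirror the actual O4F source:

    #include <pthread.h>

    typedef struct o4f_event {
        pthread_t worker;     /* thread executing the enqueued command */
        int       complete;   /* set once the command has finished     */
    } o4f_event;

    /* Explicit synchronization: sleep until the worker thread is done
       instead of polling the event status. */
    static void o4f_wait_for_event(o4f_event *ev)
    {
        pthread_join(ev->worker, NULL);
        ev->complete = 1;
    }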
Command Queue
The current implementation of the command queue only supports the out-of-order model.
However, the in-order model can still be realized by explicit synchronization of the events
generated during queuing commands.
5.4.2 Device Driver
The FPGA board communicates with the host through a 1-lane PCI-express interface.
The device driver facilitates the communication between the API library and the board.
The Linux Kernel presents the device as a PCI device, and the PCI related API is used
by the driver to control the device.
DMA
The PCI-express specification does not provide native DMA support. Instead, the devices are responsible for implementing DMA support. For this work it is incorporated in the
FPGA design as part of the host interface. When data transfer is required, the device
driver sets up the transfer size, the source and the destination memory address in the
appropriate registers in the FPGA using the host interface (see Section 5.5.1). Then the
driver instructs the FPGA to commence the transfer, and the FPGA does the transfer
without the CPU being involved.
Interrupt
In two cases the FPGA needs to initiate communication with the device driver. By
raising the interrupt the FPGA either indicates the completion of the data transfer or
completion of the kernel execution requested by the driver.
Currently the legacy interrupt is used, which means the same interrupt is used to no-
tify completion of both data transfer and kernel execution. This is a drawback because
in the current implementation the driver only permits either a data transfer or a kernel execution to be outstanding at a time. PCI-express supports Message Signaled Interrupts (MSI), which allows devices
to generate multiple unique interrupts. This would allow the FPGA to generate unique
interrupts to signal data transfer and kernel execution completion independently. Cur-
rently the FPGA design supports MSI, however, the Linux Kernel used by the host OS
does not support it.
5.5 Architecture for FPGAs
A simple custom core based architecture is designed for the FPGA to complete the
middleware. Figure 5.1 shows the overview of this design. The three major components
in the static framework are the Kernel controller, the host interface, and the memory
controller.
5.5.1 Static Framework
Kernel Controller
The API library communicates with the custom core, i.e. the kernel, through this block
using an 8-bit command word (Figure 5.4). The most significant two bits indicate the
type of command. When an argument is being set for the kernel, the least significant four
bits indicate the index of the argument. This allows a kernel to have 16 arguments. When
the ‘set argument’ command arrives, this block broadcasts all this information along with the value of the argument (set by the library right before issuing the command) to all the kernels. The same happens when the ‘start’ command arrives.

Figure 5.4: Kernel Command (an 8-bit command word; the most significant two bits hold the command type, and the remaining bits carry the command arguments and the Kernel ID)

Command        Value   Command Arguments
Set Argument   01      Command argument part holds the Kernel argument index
Start Kernel   10      N/A
Note that there are four bits allocated for Kernel ID. This ID is the same for all
instances of the same kernel. This allows up to 16 unique kernels to be added to this
framework. The motivation for this is to provide a true task parallel model in FPGAs
(see Section 5.6.1).
Host Interface
PCI-express is used as the host interface, and one of the six base address registers (BAR)
available in PCI-express is utilized for DMA data transfers and passing information to
the kernel controller. BAR is part of the PCI configuration space specification that is
also used in PCI-express. For the system OS to address a device, part of the device
needs to be mapped into either the memory or the IO port address space. For PCI/PCI-
express devices BARs are mapped into the system OS address space. Table 5.1 shows
the registers used in BAR1. The read and write operations are from the FPGA’s point of view.

Table 5.1: BAR1 Offsets

Offset  Name               Meaning
0x00    WRITE FPGA ADDR    FPGA memory address for DMA write
0x04    WRITE HOST ADDR    Host memory address for DMA write
0x08    WRITE SIZE         Number of bytes to write
0x18    WRITE START        Initiate DMA write
0x0C    READ FPGA ADDR     FPGA memory address for DMA read
0x10    READ HOST ADDR     Host memory address for DMA read
0x14    READ SIZE          Number of bytes to read
0x1C    READ START         Initiate DMA read
0x40    KERNEL CMD DATA    Command data for kernel
0x44    KERNEL CMD         Broadcast the current command data to kernel
The Host interface also sends the interrupt signals to indicate a DMA transfer com-
pletion, or when requested by the kernel controller.
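Tying the DMA flow of Section 5.4.2 to these registers, a hedged driver-side sketch of initiating a DMA write is shown below; bar1 is assumed to be the ioremapped BAR1 base, and the offsets are those of Table 5.1:

    #include <linux/io.h>
    #include <linux/types.h>

    static void o4f_start_dma_write(void __iomem *bar1, u32 fpga_addr,
                                    u32 host_addr, u32 nbytes)
    {
        iowrite32(fpga_addr, bar1 + 0x00);   /* WRITE FPGA ADDR */
        iowrite32(host_addr, bar1 + 0x04);   /* WRITE HOST ADDR */
        iowrite32(nbytes,    bar1 + 0x08);   /* WRITE SIZE      */
        iowrite32(1,         bar1 + 0x18);   /* WRITE START     */
        /* completion is signalled by the FPGA through its interrupt */
    }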
Memory Controller
The memory controller currently only supports pre-fixed point-to-point connections with priority-based access to memory. Only two kernel instances can be connected to the memory controller. However, this is enough to explore quite a few concepts explained later in this Chapter. Optimized memory access is a major research topic, but not a goal for this work. This simple controller can easily be replaced by a more sophisticated controller in the future without much modification to the framework.
Figure 5.5: One Kernel Group with two Kernels (the kernel controller passes kernel settings and control to a kernel group control block, which drives two kernels, local ids 0 and 1, connected to global memory; a done-aggregate block collects the kernel-done signals)
5.5.2 Kernel Organization
The framework has been designed to support the OpenCL concepts of work-item and work-group. However, as there is no support for ‘context’, the number of work-items represents the number of kernel instances. Going forward, work-item and work-group are simply referred to as kernel and kernel-group respectively. If shared memory is required by the
kernel, then kernels within the same kernel-group have access to the shared memory.
The two configurations shown in Figures 5.5 and 5.6 were tried with the test application explained in Chapter 6. The shared memory is not shown in these figures.

Initially it may seem that there is no difference between the two configurations. However, kernel-groups can benefit from the presence of shared memory, depending on the application-specific requirements for using the shared memory.
Figure 5.6: Two Kernel Groups with one Kernel each (the kernel controller drives two kernel group control blocks; each group contains a single kernel with local id 0, global ids 0 and 1 respectively, connected to global memory, with done-aggregate blocks collecting the kernel-done signals)
5.5.3 Kernel Information
Kernel instances require some OpenCL-based information to distinguish themselves from
each other. For example, the global ID of a kernel uniquely identifies itself among all the
kernel instances. The local ID identifies a kernel within a kernel-group. The number of
kernel-groups, the number of kernels in a group, etc. are all important information used
to decide which part of the data a kernel must process. In our work, this information
is passed as parameters to the HDL modules. This method would work even when a
high-level synthesis tool is used to generate the HDL for the kernels.
5.6 Benefits of FPGAs
This section discusses OpenCL specific benefits of using FPGAs, and not the general
benefits. The benefits are weighed mostly against GPUs.
5.6.1 Task Parallelism
In GPUs, all the processing elements execute the same instruction, processing data using
a SIMD model. This does not allow task parallelism in a GPU. For FPGAs, true task
parallelism can be implemented very easily. Two different kernels can run in parallel seamlessly.
A setup with two different kernels was implemented to show this benefit. Unfortu-
nately the lack of MSI support (see Section 5.4.2) prevented running of both tasks in
parallel. In the legacy interrupt mode, the same interrupt is raised when either kernel
completes its execution and the device driver is unable to decide which kernel finished
execution.
5.6.2 Data lifetime
The data stored in the shared or private memory region of the GPUs is only valid during
the execution of the kernel. A kernel instance in a GPU is a software thread and cannot
retain any state information once the execution is complete. For an application where the
kernel is executed iteratively on a GPU, it is not guaranteed that a kernel instance will
be assigned to the same processing element and use the same shared or private memory
region. This will cause performance degradation when the same data needs to be loaded
in the shared or private memory in consecutive kernel execution. For FPGAs with an
array-of-custom-cores architecture, each custom core represents a kernel instance. Data is
persistent between kernel executions and previously loaded data in the shared or private
memory can be reused.
The test application in Chapter 6 utilizes this idea. The kernel has an extra argu-
ment to indicate whether to load data to the shared memory from the global memory
before starting the actual computation. The argument is set to true from the OpenCL
application the first time, and false for consecutive kernel executions.
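Expressed in OpenCL C form purely for illustration (the actual FPGA kernel in this work is written in HDL), the idea of the extra argument looks roughly as follows:

    __kernel void iterate(__global const float *table,
                          __local float *cache,
                          __global float *out,
                          int preload)
    {
        int lid = get_local_id(0);
        if (preload)                      /* true only on the first launch */
            cache[lid] = table[lid];
        barrier(CLK_LOCAL_MEM_FENCE);
        /* on an FPGA core the contents of cache persist across launches */
        out[get_global_id(0)] = cache[lid];
    }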
5.6.3 Resource Utilization
The configurable nature of FPGAs would allow better resource utilization for OpenCL
applications. For example, a GPU has a fixed amount of available memory of all types.
This is not a restriction for FPGAs. The total amount of memory available, including
off-chip memory, can be partitioned as required to optimize performance.
5.7 Challenges of FPGAs
The OpenCL-specific challenges of using FPGAs are described in this section.
5.7.1 FPGA Configuration
The timing of the FPGA configuration becomes a critical issue for OpenCL applications.
Ideally the FPGA should be configured once the application enqueues a specific kernel
for execution. However, that is not an option when an FPGA board is used where the
FPGA is responsible for the communication with the host. For the setup of this work, the
FPGA implements the PCI-express link to communicate with the host and it should be
configured even before the host is booted. Trying to configure or reconfigure the FPGA
after the host boots up makes the host unstable, and most likely will crash the system.
Also, the input data for the kernel needs to be transferred before enqueuing the kernel
execution. To facilitate the memory transfer, the static part of the FPGA design must be
present as well. For these two reasons the FPGA is configured beforehand in this work.
One possible solution is to use an FPGA board where another device is responsible for
the communication with the host. However, this will add overhead for the communication
and may cause overall performance degradation.
Another solution is to use a technology like partial reconfiguration. This method
would eliminate the need for an extra device on the board, and should not impact per-
formance.
5.7.2 FPGA Resource Estimation
In the OpenCL framework, the application expects to know the computational capacity
in terms of processing elements of a device at initialization. This allows the application
to partition a problem accordingly. In the case of FPGAs, computational capacity only
in terms of FPGA resources can be known beforehand. How much FPGA resource a
kernel requires is known only after the CAD tool implementation. Even this information
does not provide the knowledge of how many instances of the kernel can be put together
on the device. FPGA CAD tools usually struggle with higher resource utilization, and
FPGA devices can rarely be used fully.
Chapter 6
Example Application
There is continuing research on optimizing compute intensive applications on various pro-
cessing architectures. Focus has been on comparing algorithm implementations among
architectures to understand the best match [4, 5, 8, 21], or improving application per-
formance on a specific architecture. The absence of research work targeting platforms
with CPU, GPU, and FPGA is noticeable, and it is reasonable to assume that the lack
of availability of such a platform has been a barrier.
In this work an example application, a Monte Carlo simulation for Asian options,
is developed to demonstrate that all three architectures can work together under the
OpenCL framework using commercial vendors' OpenCL implementations together with
O4F. Some of the design choices for the application are made to demonstrate the benefits
of using FPGAs in OpenCL. Section 6.2 describes the details of the example application.
However, before describing the example application, Section 6.1 discusses the application
types that have the potential to benefit from a heterogeneous OpenCL platform.
6.1 Potential Application Types
A heterogeneous platform consisting of CPUs, GPUs, and FPGAs provides an attractive
option to improve overall application performance. However, the lack of peer-to-peer
communication, especially peer-to-peer data transfer, in the OpenCL framework can
restrict the types of applications suitable for this platform. This section describes some
of the potential application types for this platform.
Figure 6.1: Monte Carlo Simulation Flowchart
6.1.1 Iterative Process
An application with an iterative process, where all computation is encapsulated in a
loop, can be a candidate. The computation within the loop needs to be segmented,
and these segments can be assigned to different processing architectures. An example
application is the Monte Carlo simulation, which has three major segments, shown in
the flowchart in Figure 6.1. The first segment generates the random numbers, the second
segment uses the random numbers to compute multiple data points, and the last segment
is the reduction step that produces the result based on the calculated data points.
The iterative nature of the application is necessary to hide some of the latency in-
troduced by transferring data through the host. The example application described in
Section 6.2 will illustrate this more clearly.
Figure 6.2: Components in the Community Climate System Model
6.1.2 Task Parallel
An application containing multiple compute-intensive segments that are independent of
each other is an obvious choice. Completely independent compute segments are unlikely
in practice, but it may be possible to gain performance even with some communication
done through the host. An example application is climate modeling where multiple
components are simulated simultaneously. Figure 6.2 shows a simplified view of the
software design used in the Community Climate System Model [6]. The four models
are simulated independently, and they intermittently exchange data through the coupler.
Depending on the actual computation involved within each model, a different processing
architecture may be suitable for each individual model.
Figure 6.3: Possible application of the platform
6.1.3 Other Considerations
This platform can be useful when considering other aspects besides just runtime perfor-
mance. Power usage has become a serious consideration for many applications, and this
platform can be used to balance between power usage and runtime performance.
FPGAs are ideal for interfacing with external IO devices. An application that interacts
with external IO devices can utilize this platform because the OpenCL framework does not
restrict how data is sent or received by an application. For example, a video confer-
encing application with encrypted data communication can use the FPGA to receive
encrypted data and decrypt it before passing it on to the host. The host can use
the GPU for image processing before displaying the video. Data sent from the host can
be encrypted by the FPGA before sending it out. Figure 6.3 depicts the block diagram
of one such possible application.
6.2 Example: Monte Carlo Simulation
A Monte Carlo simulation for Asian options is used as the example application for this
work. For Asian options the payoff is decided by the average price of the underlying
financial instrument, e.g. stock, over a pre-set period of time. The average price is based
on the price of the instrument at pre-set intervals over this period of time. The Monte
Carlo method of pricing Asian options generates a large number of trajectories that the
price can follow to reach an interval, and averages over all the trajectories to produce
the estimated price for that interval. Random numbers are used to generate the price
trajectories.
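As a rough sketch of the quantity being estimated, each simulated trajectory contributes the discounted payoff of the arithmetic average; the function below is illustrative C only and is not part of the AMD sample or the O4F code.

#include <math.h>

/* payoff of one simulated trajectory of an arithmetic-average Asian call option */
float asian_call_payoff(const float *prices, int num_intervals,
                        float strike, float interest, float maturity)
{
    float sum = 0.0f;
    for (int i = 0; i < num_intervals; i++)
        sum += prices[i];                       /* prices at the pre-set intervals */
    float avg = sum / num_intervals;
    float payoff = (avg > strike) ? (avg - strike) : 0.0f;
    return expf(-interest * maturity) * payoff; /* discount back to the present */
}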
The computation involved in this Monte Carlo simulation has the same three segments
as depicted in Figure 6.1. One iteration of the Monte Carlo simulation evaluates the
instrument price at one pre-set interval.
6.2.1 Reason for using Monte Carlo simulation
There are two main reasons to use a Monte Carlo simulation for this work: the iter-
ative process involved, and the use of random numbers. Thomas et al. [18] show the
performance and power advantage of using FPGAs for generating random numbers. The
authors also mention that the random number generation is only one part of the Monte
Carlo simulation, and other architectures may provide better performance for the overall
application.
Quasi-Monte Carlo simulation
A special type of Monte Carlo technique, called Quasi-Monte Carlo, is used for
the example application. It is similar to the traditional Monte Carlo technique, except
that quasi-random sequences are used instead of pseudo-random ones. A quasi-random
sequence attempts to avoid clustering of numbers by generating each number as far away
as possible from the previously generated numbers. The Sobol sequence [9] is one such
quasi-random sequence, and it is used in the example application. The Quasi-Monte
Carlo technique was chosen because the Sobol sequence helps demonstrate several
benefits of using FPGAs in the OpenCL framework.
Figure 6.4: Distribution of the Monte Carlo simulation tasks across three different architectures
6.2.2 Implementation
A sample application of the Monte Carlo simulation for Asian options is provided by
the AMD-APP-SDK-v2.4-lnx64 from AMD. Two major modifications are made to this
sample application to integrate an FPGA as an OpenCL compute element. Firstly,
an extra OpenCL context, besides the context for the GPU, is created for the FPGA
platform provided by O4F. The detailed process of creating multiple OpenCL contexts
has been explained in Section 4.2. The second modification is related to random number
generation and usage. In the sample application, the GPU kernel generates the random
numbers to calculate the price on various points of the trajectory. In this work the FPGA
generates the random numbers, and these numbers are transferred to the GPU. The GPU
kernel is modified to use these random numbers directly.
Figure 6.4 shows how the tasks are distributed across the three processing architec-
tures of the platform. The FPGA kernel is launched first to generate a block of random
numbers. These are transferred to the GPU by first moving them to the host CPU and
then to the GPU. The FPGA can then begin generation of the next block of random
numbers while the GPU computes price trajectories using the random numbers. Once
this is complete, the results are transferred back to the CPU where the average price is
computed. Note that in the example application, the CPU part does not use an OpenCL
kernel. Instead regular C code is used to perform the reduction step.
FPGA Kernel: Sobol Sequence Generation
A detailed description of Sobol sequence generation can be found in [19]. The description
here focuses on the part that helps illustrate the usefulness of FPGAs in OpenCL. To
construct a Sobol sequence, an initial vector of numbers, called the directional vector,
needs to be generated. To generate Sobol numbers with a w-bit wordlength, a directional
vector of size w is necessary. Multi-dimensional Sobol sequences can be generated (and are
almost always required for financial Monte Carlo simulations), and each dimension needs
its own directional vector. Note that the directional vectors are generated only once and
remain constant afterwards.
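For reference, a minimal software sketch of the standard Gray-code (Antonov-Saleev) recurrence for one dimension is shown below, assuming the w directional numbers v[0..w-1] have already been computed; this only illustrates the arithmetic and is not the Verilog datapath used in this work.

/* Generate the (n+1)-th w-bit Sobol integer from the n-th one.
   Start with *x = 0 and call with n = 0, 1, 2, ...; dividing the
   result by 2^w maps it into [0, 1). */
unsigned int sobol_next(unsigned int *x, const unsigned int v[], unsigned int n)
{
    unsigned int c = 0;
    while (n & 1u) {    /* find the index of the lowest zero bit of n */
        n >>= 1;
        c++;
    }
    *x ^= v[c];         /* each new point differs by one directional number */
    return *x;
}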
As the directional vectors remain constant after being created, there is no need to
use FPGA resources to generate these numbers. In [19] these vectors are also generated
offline and loaded in the FPGA during runtime. The OpenCL framework provides a
convenient way to generate these numbers using the CPU and load them in the FPGA
as a memory object.
The kernel generating the Sobol sequence is designed to load these directional vectors
from the global memory to the local memory region. When the kernel is executed for
the first time, the directional vector is loaded based on the true value of one of the
kernel arguments. As Monte Carlo is an iterative process, the argument is set to false in
consecutive iterations, and the kernel uses the previously loaded directional vector. This
demonstrates the benefit of prolonged data lifetime as explained in Section 5.6.2.
The goal of the example application is to demonstrate a working heterogeneous plat-
form. As such, the performance for the kernel is not considered. Appendix C has sug-
gestions to improve the performance of this kernel.
6.2.3 Application Flow
Section 5.2 explained the overall flow necessary for using O4F and the same flow is used
for the example application. First the Sobol sequence generator for the FPGA is coded
manually as a Verilog module. A predefined interface, required to communicate with the
rest of the design, is used for the module as shown in Figure 5.1. An existing template is
used to group two of the module instances to create a kernel group module. The kernel
group module is inserted inside another template module to create a top level module for
the kernel group. A different set of templates is used if multiple kernel groups need to be
instantiated, but the top level kernel module has the same interface. The FPGA design
for the static framework of O4F has a placeholder for the top level kernel group module.
An ISE project is created with the files for the static framework and the kernel related
modules, and FPGA CAD tools are run to generate the configuration bitstream. Note
that as explained in Section 5.7.1, the FPGA is configured before the host is powered on
because the PCI-express link of the FPGA board needs to be available when the host is
booted.
Once the host is booted with the configured FPGA board, the O4F device driver is
loaded to provide the O4F API library access to the FPGA. This allows the example
application to run like a regular OpenCL application and access the FPGA as an OpenCL
device.
Executing the Kernels
The actual application execution flow is the same as the one described in Section 4.2.
This section describes how the kernels are executed on multiple devices.
The code listing in Appendix B shows the function that executes the kernels in the
example application. The control flow of this function is described using the pseudo code
shown in Pseudo Code 1. The actual API calls are shown in the pseudo code without the
actual arguments, but with the targeted device name for readability. The pseudo code
is also annotated with the line numbers from the appendix. The full application along
with all work related to the middleware will be made public online.
Before entering the main iterative loop of the Monte Carlo simulation, the random
number generation task in the FPGA is executed. Line 133 enqueues the task and line
167 reads the random numbers from the FPGA. These two steps are done outside the
main loop to have the random numbers ready to be used in the first iteration. This
allows overlapping kernel execution inside the loop. Note that after the first execution,
the kernel argument to load the directional vectors is set to false at line 159.
The main for loop starts with a synchronization point at line 192. This is to ensure
that the data transfer from the FPGA has completed. In the first iteration, the data transfer
is enqueued before the for loop at line 167. In the following iterations, data transfer is
enqueued during the previous iteration at line 294.
Pseudo Code 1 Executing Kernels on the GPU and the FPGA
clEnqueueTask(FPGA)  Line 133
clSetKernelArg(FPGA)  Line 159
clEnqueueReadBuffer(FPGA)  Line 167
for k = 0 → (steps − 1) do
  clWaitForEvents(FPGA)  Line 192
  clEnqueueNDRangeKernel(GPU)  Line 241
  clEnqueueTask(FPGA)  Line 259
  clWaitForEvents(FPGA)  Line 272
  clWaitForEvents(GPU)  Line 283
  clEnqueueReadBuffer(FPGA)  Line 294
  for all prices do
    calculate average price  Line 362
  end for
end for
Inside the main loop, the kernel that calculates the prices and the kernel that generates
the next set of random numbers are enqueued at lines 241 and 259, respectively. Note that for the GPU, function
clEnqueueNDRangeKernel is used to create a virtual 2-dimensional index space. For
the FPGA, function clEnqueueTask is used because there is no virtual index space for
the FPGA. The FPGA board is pre-configured with two instances of the Sobol sequence
generation kernel.
Lines 272 and 283 are synchronization points for the FPGA and the GPU to finish
execution. Once the execution is done, a read buffer command is queued for the FPGA
at line 294. As mentioned earlier, the synchronization point for this command is at the
starting point of the main loop.
Once the result buffers are read from the GPU, the CPU is used to perform the
reduction step at line 362.
6.3 Analysis
The main goal for developing the Monte Carlo simulation for Asian options is to demon-
strate a working platform consisting of CPUs, GPUs, and FPGAs. The lack of high-level
synthesis support and the inability to configure the FPGA at runtime introduce some
manual steps; however, the example shows how OpenCL makes it possible to easily
utilize heterogeneous computing elements as long as the supporting middleware infras-
tructure exists. The manual steps can be removed by adding a high-level synthesis tool,
and by using partial reconfiguration methods or special boards (see Section 5.7.1).
6.3.1 Observations
The current OpenCL framework does allow the addition of new processor architectures;
however, it appears the framework is more suitable for GPU-like devices with an array of
processors. The concept of a virtual index space, a core idea of OpenCL, implies that
the underlying device needs to handle multiple threads. This provides flexibility and
portability for an OpenCL application: the size of the virtual index space can change
based on the input data size, and the application is not tied to a specific device. An
accelerator device, in contrast, is unlikely to support such a threading model.
The OpenCL API has function calls to transfer memory to and from the device;
however, implicit memory transfers can also occur. In the example application, which uses
AMD's OpenCL implementation for the GPU, an implicit memory transfer is done for the
memory objects specified as kernel arguments when a kernel is enqueued for execution
on the GPU. Notice that in the code listed in Appendix B, there is no call to transfer the
random numbers to the GPU. This behaviour is not clearly specified in the specification.
A clarification is necessary to ensure that all OpenCL implementations behave similarly
and that an OpenCL application does not require modifications based on the implementation
being used.
It also appears that the current API specification does not consider multiple devices very
carefully. For example, according to the OpenCL 1.1 specification, the API call clCre-
ateBuffer should return CL_OUT_OF_RESOURCES if the OpenCL implementation
fails to allocate the required resources on the device. However, the API takes an
OpenCL context object as an argument, not an OpenCL device object. As a context can
have multiple devices, it is not possible for the implementation to decide which device is
the target.
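The fragment below illustrates the issue using the standard OpenCL 1.1 calls; the device names and buffer size are illustrative, and error handling is omitted.

cl_device_id devices[2] = { gpu_device, fpga_device };  /* two devices in one context */
cl_int err;
size_t buffer_size = 64 * 1024 * 1024;                  /* arbitrary size */
cl_context ctx = clCreateContext(NULL, 2, devices, NULL, NULL, &err);
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buffer_size, NULL, &err);
/* err may be CL_OUT_OF_RESOURCES, but the context holds two devices,
   so it is unclear which device's resources were insufficient. */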
6.3.2 Performance
No performance analysis is done for the Monte Carlo simulation as performance im-
provement was not a goal. However, measurements show that the sample application,
which generates the random numbers and calculates the estimated price on the GPU,
spends half of the kernel execution time generating the random numbers. As it has
been shown [18] that FPGAs can generate random numbers three times faster than
GPUs, it is conceivable to achieve an overall performance gain. However, the issues
mentioned in Appendix C must be addressed before any performance analysis.
The result from the FPGA is validated by a test application running on the host to
ensure the middleware is functioning properly. A software version of the Sobol sequence
generator is implemented in the test application to create a baseline, and the result
generated by the FPGA Sobol sequence generator is matched against this baseline.
Chapter 7
Summary
The motivation of this thesis is to provide a standardized unified programming model for
platforms with CPUs, GPUs, and FPGAs. The OpenCL framework is used to achieve
this goal by developing the middleware necessary to integrate FPGAs with CPUs and
GPUs. This work is the first known platform to integrate CPUs, GPUs, and FPGAs
under the OpenCL framework. The challenges and benefits of using FPGAs on such a
platform are discussed in Chapter 5.
Previous research has shown that different architectures provide a performance ad-
vantage over other architectures for various types of computation. This platform will
allow researchers to improve overall performance of an application with multiple com-
pute intensive segments by utilizing a suitable architecture for each segment. Potential
application types are discussed in Chapter 6. One such application, a Monte Carlo sim-
ulation for Asian Options, is developed to show a working platform under the OpenCL
framework.
7.1 Future Work
This work provides a first-generation FPGA OpenCL implementation that allows inte-
gration of FPGAs with CPUs and GPUs under the OpenCL framework. However, users
still need to code the kernels for FPGAs in HDL. A major improvement would be to
integrate a high-level synthesis tool into the implementation. A high-level synthesis tool
like LegUp [2] would be ideal as it targets FPGA architectures and its source code is
publicly available. The existing tools in [15, 13] can be integrated as well.
O4F does not support all the API function calls in the OpenCL specification [11].
All the API calls need to be implemented to become fully compliant and allow true
portability of OpenCL applications. As mentioned earlier in Section 5.5.1, currently
a simple memory controller is being used. A more sophisticated memory controller is
necessary, but the design of the memory controller will be dependent on the overall
architecture being used.
The current FPGA design has a static framework to support OpenCL concepts and an
array of custom cores. As the OpenCL API implementation library adds more function
calls, minor modifications to the static framework may be necessary to support the newer
function calls. However, extensive research into the custom core architecture is necessary.
A template-based architecture with a predefined datapath, similar to the one described
in [15], is one option. Another option is to generate an application-specific custom archi-
tecture using high-level synthesis. A mix of custom accelerators with microprocessors, as
generated by LegUp [2], can also be an option.
Appendix A
Implemented OpenCL API List
extern CL_API_ENTRY cl_int CL_API_CALL
clGetPlatformIDs(cl_uint          p_num_entries,
                 cl_platform_id  *p_platforms,
                 cl_uint         *p_num_platforms);

extern CL_API_ENTRY cl_int CL_API_CALL
clGetPlatformInfo(cl_platform_id    p_platform,
                  cl_platform_info  p_param_name,
                  size_t            p_param_value_size,
                  void             *p_param_value,
                  size_t           *p_param_value_size_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clGetDeviceIDs(cl_platform_id   p_platform,
               cl_device_type   p_device_type,
               cl_uint          p_num_entries,
               cl_device_id    *p_devices,
               cl_uint         *p_num_devices);

extern CL_API_ENTRY cl_int CL_API_CALL
clGetDeviceInfo(cl_device_id     p_device,
                cl_device_info   p_param_name,
                size_t           p_param_value_size,
                void            *p_param_value,
                size_t          *p_param_value_size_ret);

extern CL_API_ENTRY cl_context CL_API_CALL
clCreateContextFromType(const cl_context_properties *p_properties,
                        cl_device_type               p_device_type,
                        void (CL_CALLBACK *p_pfn_notify)
                            (const char *, const void *, size_t, void *),
                        void                        *p_user_data,
                        cl_int                      *p_errcode_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clGetContextInfo(cl_context        p_context,
                 cl_context_info   p_param_name,
                 size_t            p_param_value_size,
                 void             *p_param_value,
                 size_t           *p_param_value_size_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clRetainContext(cl_context p_context);

extern CL_API_ENTRY cl_int CL_API_CALL
clReleaseContext(cl_context p_context);

extern CL_API_ENTRY cl_int CL_API_CALL
clGetCommandQueueInfo(cl_command_queue        p_command_queue,
                      cl_command_queue_info   p_param_name,
                      size_t                  p_param_value_size,
                      void                   *p_param_value,
                      size_t                 *p_param_value_size_ret);

extern CL_API_ENTRY cl_command_queue CL_API_CALL
clCreateCommandQueue(cl_context                    p_context,
                     cl_device_id                  p_device,
                     cl_command_queue_properties   p_properties,
                     cl_int                       *p_errcode_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clRetainCommandQueue(cl_command_queue p_command_queue);

extern CL_API_ENTRY cl_int CL_API_CALL
clReleaseCommandQueue(cl_command_queue p_command_queue);

extern CL_API_ENTRY cl_mem CL_API_CALL
clCreateBuffer(cl_context     p_context,
               cl_mem_flags   p_flags,
               size_t         p_size,
               void          *p_host_ptr,
               cl_int        *p_errcode_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clRetainMemObject(cl_mem p_mem);

extern CL_API_ENTRY cl_int CL_API_CALL
clReleaseMemObject(cl_mem p_mem);

extern CL_API_ENTRY cl_int CL_API_CALL
clSetMemObjectDestructorCallback(cl_mem  p_memobj,
                                 void (CL_CALLBACK *p_pfn_notify)(cl_mem, void *),
                                 void   *p_user_data);

extern CL_API_ENTRY cl_program CL_API_CALL
clCreateProgramWithBinary(cl_context             p_context,
                          cl_uint                p_num_devices,
                          const cl_device_id    *p_device_list,
                          const size_t          *p_lengths,
                          const unsigned char  **p_binaries,
                          cl_int                *p_binary_status,
                          cl_int                *p_errcode_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clRetainProgram(cl_program p_program);

extern CL_API_ENTRY cl_int CL_API_CALL
clReleaseProgram(cl_program p_program);

extern CL_API_ENTRY cl_kernel CL_API_CALL
clCreateKernel(cl_program   p_program,
               const char  *p_kernel_name,
               cl_int      *p_errcode_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clSetKernelArg(cl_kernel    p_kernel,
               cl_uint      p_arg_index,
               size_t       p_arg_size,
               const void  *p_arg_value);

extern CL_API_ENTRY cl_int CL_API_CALL
clRetainKernel(cl_kernel p_kernel);

extern CL_API_ENTRY cl_int CL_API_CALL
clReleaseKernel(cl_kernel p_kernel);

extern CL_API_ENTRY cl_int CL_API_CALL
clReleaseEvent(cl_event p_event);

extern CL_API_ENTRY cl_int CL_API_CALL
clRetainEvent(cl_event p_event);

extern CL_API_ENTRY cl_int CL_API_CALL
clSetUserEventStatus(cl_event  p_event,
                     cl_int    p_execution_status);

extern CL_API_ENTRY cl_event CL_API_CALL
clCreateUserEvent(cl_context   p_context,
                  cl_int      *p_errcode_ret);

extern CL_API_ENTRY cl_int CL_API_CALL
clWaitForEvents(cl_uint          p_num_events,
                const cl_event  *p_event_list);

extern CL_API_ENTRY cl_int CL_API_CALL
clEnqueueWriteBuffer(cl_command_queue   p_command_queue,
                     cl_mem             p_buffer,
                     cl_bool            p_blocking_read,
                     size_t             p_offset,
                     size_t             p_cb,
                     const void        *p_ptr,
                     cl_uint            p_num_events_in_wait_list,
                     const cl_event    *p_event_wait_list,
                     cl_event          *p_event);

extern CL_API_ENTRY cl_int CL_API_CALL
clEnqueueReadBuffer(cl_command_queue   p_command_queue,
                    cl_mem             p_buffer,
                    cl_bool            p_blocking_read,
                    size_t             p_offset,
                    size_t             p_cb,
                    void              *p_ptr,
                    cl_uint            p_num_events_in_wait_list,
                    const cl_event    *p_event_wait_list,
                    cl_event          *p_event);

extern CL_API_ENTRY cl_int CL_API_CALL
clEnqueueTask(cl_command_queue   p_command_queue,
              cl_kernel          p_kernel,
              cl_uint            p_num_events_in_wait_list,
              const cl_event    *p_event_wait_list,
              cl_event          *p_event);
Appendix B
Monte Carlo Kernel Execution
1   int
2   MonteCarloAsian::runCLKernels(void)
3   {
4       cl_int status;
5       cl_event events[1];
6       cl_event fevents[3];
7
8       size_t globalThreads[2] = {width, height};
9       size_t localThreads[2] = {blockSizeX, blockSizeY};
10
11      /*
12       * Declare attribute structure
13       */
14      MonteCarloAttrib attributes;
15
16      if (localThreads[0] > maxWorkItemSizes[0] ||
17          localThreads[1] > maxWorkItemSizes[1] ||
18          (size_t)blockSizeX * blockSizeY > maxWorkGroupSize)
19      {
20          std::cout << "Unsupported: Device does not support requested"
21                       " : number of work items.";
22          return SDK_FAILURE;
23      }
24
25      /* width - i.e. number of elements in the array */
26      status = clSetKernelArg(kernel, 2, sizeof(cl_uint), (void*)&width);
27      if (!sampleCommon->checkVal(status,
28                                  CL_SUCCESS,
29                                  "clSetKernelArg failed. (width)"))
30      {
31          return SDK_FAILURE;
32      }
33
34      /* whether sort is to be in increasing order.
35         CL_TRUE implies increasing */
36      status = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void*)&randBuf);
37      if (!sampleCommon->checkVal(status,
38                                  CL_SUCCESS,
39                                  "clSetKernelArg failed. (randBuf)"))
40      {
41          return SDK_FAILURE;
42      }
43
44      status = clSetKernelArg(kernel, 4, sizeof(cl_mem), (void*)&priceBuf);
45      if (!sampleCommon->checkVal(status,
46                                  CL_SUCCESS,
47                                  "clSetKernelArg failed. (priceBuf)"))
48      {
49          return SDK_FAILURE;
50      }
51
52      status = clSetKernelArg(kernel, 5, sizeof(cl_mem),
53                              (void*)&priceDerivBuf);
54      if (!sampleCommon->checkVal(status,
55                                  CL_SUCCESS,
56                                  "clSetKernelArg failed. (priceDerivBuf)"))
57      {
58          return SDK_FAILURE;
59      }
60
61      status = clSetKernelArg(kernel, 1, sizeof(cl_int), (void*)&noOfSum);
62      if (!sampleCommon->checkVal(status,
63                                  CL_SUCCESS,
64                                  "clSetKernelArg failed. (noOfSum)"))
65      {
66          return SDK_FAILURE;
67      }
68
69      struct o4f_kernel_arg karg;
70
71      karg.mem_arg = directionBufF;
72      karg.type = O4F_KERNEL_ARG_CL_MEM;
73      karg.idx = 0;
74      status = clSetKernelArg(kernelF, 0,
75                              sizeof(struct o4f_kernel_arg), (void*)&karg);
76      if (!sampleCommon->checkVal(status,
77                                  CL_SUCCESS,
78                                  "clSetKernelArg failed. (directionBufF)"))
79      {
80          return SDK_FAILURE;
81      }
82
83      karg.mem_arg = randBufF;
84      karg.type = O4F_KERNEL_ARG_CL_MEM;
85      karg.idx = 1;
86      status = clSetKernelArg(kernelF, 1,
87                              sizeof(struct o4f_kernel_arg), (void*)&karg);
88      if (!sampleCommon->checkVal(status,
89                                  CL_SUCCESS,
90                                  "clSetKernelArg failed. (randBufF)"))
91      {
92          return SDK_FAILURE;
93      }
94
95      // there are two kernels running on FPGA
96      karg.int_arg = (noOfTraj * noOfTraj * noOfSum) / 2;
97      karg.type = O4F_KERNEL_ARG_CL_INT;
98      karg.idx = 2;
99      status = clSetKernelArg(kernelF, 2,
100                             sizeof(struct o4f_kernel_arg), (void*)&karg);
101     if (!sampleCommon->checkVal(status,
102                                 CL_SUCCESS,
103                                 "clSetKernelArg failed. (count)"))
104     {
105         return SDK_FAILURE;
106     }
107
108     karg.int_arg = 1;
109     karg.type = O4F_KERNEL_ARG_CL_INT;
110     karg.idx = 3;
111     status = clSetKernelArg(kernelF, 3,
112                             sizeof(struct o4f_kernel_arg), (void*)&karg);
113     if (!sampleCommon->checkVal(status,
114                                 CL_SUCCESS,
115                                 "clSetKernelArg failed. (loadDirection)"))
116     {
117         return SDK_FAILURE;
118     }
119
120     // load the direction numbers.
121     status = clEnqueueWriteBuffer(this->commandQueueF,
122                                   this->directionBufF,
123                                   CL_FALSE,
124                                   0,
125                                   this->sobolBitWidth * this->dimentionCount,
126                                   this->directionNumF,
127                                   0,
128                                   NULL, // event_list,
129                                   &fevents[0]);
130     status = clWaitForEvents(1, &fevents[0]);
131     if (!sampleCommon->checkVal(status,
132                                 CL_SUCCESS,
133                                 "clWaitForEvents failed."))
134     {
135         return SDK_FAILURE;
136     }
137     clReleaseEvent(fevents[0]);
138
139     // Run the rnd generator for the first time.
140     status = clEnqueueTask(this->commandQueueF,
141                            this->kernelF,
142                            0,
143                            NULL,
144                            &fevents[0]);
145     if (!sampleCommon->checkVal(status,
146                                 CL_SUCCESS,
147                                 "clEnqueueTask failed."))
148     {
149         return SDK_FAILURE;
150     }
151
152     /* wait for the kernel call to finish execution */
153     status = clWaitForEvents(1, &fevents[0]);
154     if (!sampleCommon->checkVal(status,
155                                 CL_SUCCESS,
156                                 "clWaitForEvents failed."))
157     {
158         return SDK_FAILURE;
159     }
160     clReleaseEvent(fevents[0]);
161
162     // next time the kernel does not need to load the directional vector
163     karg.int_arg = 0;
164     karg.type = O4F_KERNEL_ARG_CL_INT;
165     karg.idx = 3;
166     status = clSetKernelArg(kernelF, 3,
167                             sizeof(struct o4f_kernel_arg), (void*)&karg);
168     if (!sampleCommon->checkVal(status,
169                                 CL_SUCCESS,
170                                 "clSetKernelArg failed. (loadDirection)"))
171     {
172         return SDK_FAILURE;
173     }
174
175     status = clEnqueueReadBuffer(this->commandQueueF,
176                                  this->randBufF,
177                                  CL_TRUE, // CL_FALSE,
178                                  0,
179                                  this->noOfSum *
180                                  this->noOfTraj *
181                                  this->noOfTraj,
182                                  this->randNum,
183                                  0,
184                                  NULL, // event_list,
185                                  &fevents[0]);
186
187     float timeStep = maturity / (noOfSum - 1);
188
189     // Initialize random number generator
190     // srand(1);
191
192     for (int k = 0; k < steps; k++)
193     {
194         // for (int j = 0; j < (width * height * 4); j++)
195         //{
196         //    randNum[j] = (cl_uint)rand();
197         //}
198         // For k = 0, the random numbers are generated before getting
199         // into the loop. We just wait here to ensure memory transfer
200         // is finished.
201         /* wait for the random numbers to be transferred to host */
202         status = clWaitForEvents(1, &fevents[0]);
203         if (!sampleCommon->checkVal(status,
204                                     CL_SUCCESS,
205                                     "clWaitForEvents failed."))
206         {
207             return SDK_FAILURE;
208         }
209         clReleaseEvent(fevents[0]);
210
211         float c1 = (interest - 0.5f * sigma[k] * sigma[k]) * timeStep;
212         float c2 = sigma[k] * sqrt(timeStep);
213         float c3 = (interest + 0.5f * sigma[k] * sigma[k]);
214
215         const cl_float4 c1F4 = {c1, c1, c1, c1};
216         attributes.c1 = c1F4;
217
218         const cl_float4 c2F4 = {c2, c2, c2, c2};
219         attributes.c2 = c2F4;
220
221         const cl_float4 c3F4 = {c3, c3, c3, c3};
222         attributes.c3 = c3F4;
223
224         const cl_float4 initPriceF4 =
225             {initPrice, initPrice, initPrice, initPrice};
226         attributes.initPrice = initPriceF4;
227
228         const cl_float4 strikePriceF4 =
229             {strikePrice, strikePrice, strikePrice, strikePrice};
230         attributes.strikePrice = strikePriceF4;
231
232         const cl_float4 sigmaF4 =
233             {sigma[k], sigma[k], sigma[k], sigma[k]};
234         attributes.sigma = sigmaF4;
235
236         const cl_float4 timeStepF4 =
237             {timeStep, timeStep, timeStep, timeStep};
238         attributes.timeStep = timeStepF4;
239
240
241         /* Set appropriate arguments to the kernel */
242
243         /* the input array - also acts as output for
244            this pass (input for next) */
245         status = clSetKernelArg(kernel, 0,
246                                 sizeof(attributes), (void*)&attributes);
247         if (!sampleCommon->checkVal(status,
248                                     CL_SUCCESS,
249                                     "clSetKernelArg failed. (attributes)"))
250         {
251             return SDK_FAILURE;
252         }
253
254         /*
255          * Enqueue a kernel run call.
256          */
257         status = clEnqueueNDRangeKernel(commandQueue,
258                                         kernel,
259                                         2,
260                                         NULL,
261                                         globalThreads,
262                                         localThreads,
263                                         0,
264                                         NULL,
265                                         &events[0]);
266
267         if (!sampleCommon->checkVal(status,
268                                     CL_SUCCESS,
269                                     "clEnqueueNDRangeKernel failed."))
270         {
271             return SDK_FAILURE;
272         }
273
274         // Enqueue the rnd generator to generate next set of numbers
275         status = clEnqueueTask(this->commandQueueF,
276                                this->kernelF,
277                                0,
278                                NULL,
279                                &fevents[0]);
280         if (!sampleCommon->checkVal(status,
281                                     CL_SUCCESS,
282                                     "clEnqueueTask failed."))
283         {
284             return SDK_FAILURE;
285         }
286
287         /* wait for the rnd number generator to finish execution */
288         status = clWaitForEvents(1, &fevents[0]);
289         if (!sampleCommon->checkVal(status,
290                                     CL_SUCCESS,
291                                     "clWaitForEvents failed."))
292         {
293             return SDK_FAILURE;
294         }
295         clReleaseEvent(fevents[0]);
296
297
298         /* wait for the kernel call to finish execution */
299         status = clWaitForEvents(1, &events[0]);
300         if (!sampleCommon->checkVal(status,
301                                     CL_SUCCESS,
302                                     "clWaitForEvents failed."))
303         {
304             return SDK_FAILURE;
305         }
306
307         clReleaseEvent(events[0]);
308
309         /* Enqueue reading in the rnd numbers */
310         status = clEnqueueReadBuffer(this->commandQueueF,
311                                      this->randBufF,
312                                      CL_TRUE, // CL_FALSE,
313                                      0,
314                                      this->noOfSum *
315                                      this->noOfTraj *
316                                      this->noOfTraj,
317                                      this->randNum,
318                                      0,
319                                      NULL, // event_list,
320                                      &fevents[0]);
321
322
323         /* Enqueue the results to application pointer */
324         status = clEnqueueReadBuffer(commandQueue,
325                                      priceBuf,
326                                      CL_TRUE,
327                                      0,
328                                      width * height * 2 * sizeof(cl_float4),
329                                      priceVals,
330                                      0,
331                                      NULL,
332                                      &events[0]);
333         if (!sampleCommon->checkVal(status,
334                                     CL_SUCCESS,
335                                     "clEnqueueReadBuffer failed."))
336         {
337             return SDK_FAILURE;
338         }
339
340         /* wait for the read buffer to finish execution */
341         status = clWaitForEvents(1, &events[0]);
342         if (!sampleCommon->checkVal(status,
343                                     CL_SUCCESS,
344                                     "clWaitForEvents failed."))
345         {
346             return SDK_FAILURE;
347         }
348
349         clReleaseEvent(events[0]);
350
351         /* Enqueue the results to application pointer */
352         status = clEnqueueReadBuffer(commandQueue,
353                                      priceDerivBuf,
354                                      CL_TRUE,
355                                      0,
356                                      width * height * 2 * sizeof(cl_float4),
357                                      priceDeriv,
358                                      0,
359                                      NULL,
360                                      &events[0]);
361         if (!sampleCommon->checkVal(status,
362                                     CL_SUCCESS,
363                                     "clEnqueueReadBuffer failed."))
364         {
365             return SDK_FAILURE;
366         }
367
368         /* wait for the read buffer to finish execution */
369         status = clWaitForEvents(1, &events[0]);
370         if (!sampleCommon->checkVal(status,
371                                     CL_SUCCESS,
372                                     "clWaitForEvents failed."))
373         {
374             return SDK_FAILURE;
375         }
376
377         clReleaseEvent(events[0]);
378
379         /* Replace following "for" loop with reduction kernel */
380         for (int i = 0; i < noOfTraj * noOfTraj; i++)
381         {
382             price[k] += priceVals[i];
383             vega[k] += priceDeriv[i];
384         }
385
386         price[k] /= (noOfTraj * noOfTraj);
387         vega[k] /= (noOfTraj * noOfTraj);
388
389         price[k] = exp(-interest * maturity) * price[k];
390         vega[k] = exp(-interest * maturity) * vega[k];
391     }
392
393     // we do an extra set of random numbers, and ask to read it
394     // this set won't be used, but just cleaning up.
395     status = clWaitForEvents(1, &fevents[0]);
396     if (!sampleCommon->checkVal(status,
397                                 CL_SUCCESS,
398                                 "clWaitForEvents failed."))
399     {
400         return SDK_FAILURE;
401     }
402     clReleaseEvent(fevents[0]);
403
404     return SDK_SUCCESS;
405 }
Appendix C
Sobol Sequence Implementation
Performance was not a goal for the Sobol sequence generator. The following two issues
need to be addressed first to improve its performance.
Clock Frequency
The 62.5 MHz clock provided by the PCI-express core is used by the Sobol sequence
module. The floating-point operation core used in the module allows a variable-length
pipeline to tune performance. This pipeline is adjusted simply to meet the 62.5 MHz
frequency. In contrast, Tian [19] reports a random number generator running at 180 MHz
on an older FPGA device. It is evident that there is room to improve the clock
frequency of the Sobol sequence module.
Random Number Generation Throughput
Currently, with two instances of the Sobol sequence module instantiated, the complete
design utilizes 9% of the LUTs and 35% of the block RAMs on the Xilinx Virtex5-
LX110T FPGA device. The two instances themselves utilize only 1% of the LUTs and
1% of the block RAMs. The random number generation throughput of the FPGA can
be greatly increased by instantiating more copies of the module. However, the simple
memory controller used in the static framework restricts the number of instances to two.
Bibliography
[1] Advanced Micro Devices. AMD OpenCL Zone. developer.amd.com/zones/OpenCLZone/Pages/default.aspx.
[2] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona,
Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. Legup: high-level
synthesis for fpga-based processor/accelerator systems. In Proceedings of the 19th
ACM/SIGDA international symposium on Field programmable gate arrays, FPGA
’11, pages 33–36, New York, NY, USA, 2011. ACM.
[3] Eugene Cartwright, Sen Ma, David Andrews, and Miaoqing Huang. Creating
HW/SW co-designed MPSoPC’s from high level programming models. In High Per-
formance Computing and Simulation (HPCS), 2011 International Conference on,
pages 554 –560, july 2011.
[4] Shuai Che, Jie Li, J.W. Sheaffer, K. Skadron, and J. Lach. Accelerating Compute-
Intensive Applications with GPUs and FPGAs. In Application Specific Processors,
2008. SASP 2008. Symposium on, pages 101 –107, june 2008.
[5] B. Cope, P.Y.K. Cheung, W. Luk, and S. Witt. Have GPUs made FPGAs redun-
dant in the field of video processing? In Field-Programmable Technology, 2005.
Proceedings. 2005 IEEE International Conference on, pages 111 –118, dec. 2005.
[6] John B. Drake, Philip W. Jones, and George R. Carr. Overview of the software
design of the community climate system model. Int. J. High Perform. Comput.
Appl, 19:177–186, 2005.
[7] Z. Zhang et al. AutoPilot: A Platform-Based ESL Synthesis System. In High-Level
Synthesis, Springer Netherlands. www.autoesl.com, 2008.
[8] L.W. Howes, P. Price, O. Mencer, O. Beckmann, and O. Pell. Comparing FPGAs
to Graphics Accelerators and the Playstation 2 Using a Unified Source Descrip-
tion. In Field Programmable Logic and Applications, 2006. FPL ’06. International
Conference on, pages 1 –6, aug. 2006.
[9] Sobol Ilya. Uniformly distributed sequences with an additional uniform property.
In USSR Computational Mathematics and Mathematical Physics, Volume 16, pages
236–242, 1977.
[10] Khronos Group. Installable Client Drivers (ICD) Loader. http://www.khronos.org/registry/cl/extensions/khr/cl_khr_icd.txt.
[11] Khronos Group. OpenCL Specification 1.1. www.khronos.org/registry/cl/specs/opencl-1.1.pdf.
[12] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong
Program Analysis & Transformation. In Proceedings of the 2004 International Sym-
posium on Code Generation and Optimization (CGO’04), Palo Alto, California, Mar
2004.
[13] Mingjie Lin, I. Lebedev, and J. Wawrzynek. OpenRCL: Low-Power High-
Performance Computing with Reconfigurable Devices. In Field Programmable Logic
and Applications (FPL), 2010 International Conference on, pages 458–463, aug. 31 –
sept. 2 2010.
[14] Nvidia Corporation. Nvidia OpenCL Support. http://developer.nvidia.com/
opencl.
[15] M. Owaida, N. Bellas, K. Daloukas, and C.D. Antonopoulos. Synthesis of Platform
Architectures from OpenCL Programs. In Field-Programmable Custom Computing
Machines (FCCM), 2011 IEEE 19th Annual International Symposium on, pages 186
–193, may 2011.
[16] A. Papakonstantinou, K. Gururaj, J.A. Stratton, D. Chen, J. Cong, and W.-M.W.
Hwu. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In
Application Specific Processors, 2009. SASP ’09. IEEE 7th Symposium on, pages 35
–42, july 2009.
[17] Michael Showerman, Jeremy Enos, Avneesh Pant, Volodymyr Kindratenko, Craig
Steffen, Robert Pennington, and Wen mei Hwu. QP: A heterogeneous multi-
accelerator cluster. In Proceedings of the 10th LCI International Conference on
High-performance Clustered Computing, march 2009.
[18] David Barrie Thomas, Lee Howes, and Wayne Luk. A comparison of cpus, gpus,
fpgas, and massively parallel processor arrays for random number generation. In
Proceedings of the ACM/SIGDA international symposium on Field programmable
gate arrays, FPGA ’09, pages 63–72, New York, NY, USA, 2009. ACM.
[19] Xiang Tian and K. Benkrid. Massively parallelized quasi-monte carlo financial sim-
ulation on a fpga supercomputer. In High-Performance Reconfigurable Computing
Technology and Applications, 2008. HPRCTA 2008. Second International Workshop
on, pages 1 –8, nov. 2008.
[20] Kuen Hung Tsoi and Wayne Luk. Axel: a heterogeneous cluster with FPGAs and
GPUs. In Proceedings of the 18th annual ACM/SIGDA international symposium on
Field programmable gate arrays, FPGA ’10, pages 115–124, New York, NY, USA,
2010. ACM.
[21] R. Weber, A. Gothandaraman, R.J. Hinde, and G.D. Peterson. Comparing Hardware
Accelerators in Scientific Applications: A Case Study. Parallel and Distributed
Systems, IEEE Transactions on, 22(1):58 –68, jan. 2011.