Altera SDK for OpenCL Programming Guide - intel.cn · Installation and Licensing Manual for the...

30
June 2013 Altera Corporation OCL002-13.0 SP1.0 © 2013 Altera Corporation. All rights reserved. ALTERA, ARRIA, CYCLONE, ENPIRION, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS and STRATIX words and logos are trademarks of Altera Corporation and registered in the U.S. Patent and Trademark Office and in other countries. OpenCL™ and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos. * The Altera SDK for OpenCL is based on a published Khronos Specification, and is expected to pass the Khronos Conformance Testing Process. Current conformance status can be found at www.khronos.org/conformance. All other words and logos identified as trademarks or service marks are the property of their respective holders as described at www.altera.com/common/legal.html. Altera warrants performance of its semiconductor products to current specifications in accordance with Altera's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Altera assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Altera. Altera customers are advised to obtain the latest version 101 Innovation Drive San Jose, CA 95134 www.altera.com Feedback Subscribe ISO 9001:2008 Registered Altera SDK for OpenCL Programming Guide The Altera ® SDK for OpenCL Programming Guide describes the content and functionality of the Altera SDK for OpenCL version 13.0 SP1. The SDK is a C-based heterogeneous parallel programming environment for Altera field programmable gate arrays (FPGAs). This document assumes that you have read the Altera SDK for OpenCL Getting Started Guide and have performed the following tasks: Download and install the Quartus II software version 13.0 SP1. Install your Stratix V FPGA board. You must download and install all necessary device support software. Download and install the Altera SDK for OpenCL version 13.0 SP1. Install the board driver. f If you have not performed the tasks described above, refer to the Altera Software Installation and Licensing Manual for the Quartus II software installation instructions. Refer to the Altera SDK for OpenCL Getting Started Guide for the Altera SDK for OpenCL software installation instructions. This programming guide assumes that you are knowledgeable in OpenCL concepts and application programming interfaces (APIs), as described in the Khronos OpenCL Specification version 1.0. In addition, this document assumes that you are familiar with specific sections and clauses of the OpenCL specification. For more information about the OpenCL Specification version 1.0, refer to OpenCL Reference Pages on the Khronos Group website. Contents of the Altera SDK for OpenCL version 13.0 SP1 The Altera SDK for OpenCL version 13.0 SP1 provides the following logical components: Compiler—The compiler translates your OpenCL C device code into a hardware configuration file that can be loaded onto an Altera FPGA. Altera SDK for OpenCL (AOCL) Utility —The AOCL utility includes a set of commands that allows you to perform high-level tasks. For more information, refer to “Altera SDK for OpenCL Utility” on page 9 .

Transcript of Altera SDK for OpenCL Programming Guide - intel.cn · Installation and Licensing Manual for the...

June 2013 Altera Corporation

OCL002-13.0 SP1.0

NIOS, QUARTUS and STRATTrademark Office and in otherKhronos. * The Altera SDK forConformance Testing Process.logos identified as trademarkswww.altera.com/common/legaccordance with Altera's standwithout notice. Altera assumesservice described herein except

101 Innovation DriveSan Jose, CA 95134www.altera.com

Altera SDK for OpenCLProgramming Guide

© 2013 Altera Corporation. All rights reserved. ALTERA, ARRIA, CYCLONE, ENPIRION, HARDCOPY, MAX, MEGACORE,

The Altera® SDK for OpenCL™ Programming Guide describes the content and functionality of the Altera SDK for OpenCL version 13.0 SP1. The SDK is a C-based heterogeneous parallel programming environment for Altera field programmable gate arrays (FPGAs). This document assumes that you have read the Altera SDK for OpenCL Getting Started Guide and have performed the following tasks:

■ Download and install the Quartus II software version 13.0 SP1.

■ Install your Stratix V FPGA board. You must download and install all necessary device support software.

■ Download and install the Altera SDK for OpenCL version 13.0 SP1.

■ Install the board driver.

f If you have not performed the tasks described above, refer to the Altera Software Installation and Licensing Manual for the Quartus II software installation instructions. Refer to the Altera SDK for OpenCL Getting Started Guide for the Altera SDK for OpenCL software installation instructions.

This programming guide assumes that you are knowledgeable in OpenCL concepts and application programming interfaces (APIs), as described in the Khronos OpenCL Specification version 1.0. In addition, this document assumes that you are familiar with specific sections and clauses of the OpenCL specification. For more information about the OpenCL Specification version 1.0, refer to OpenCL Reference Pages on the Khronos Group™ website.

Contents of the Altera SDK for OpenCL version 13.0 SP1The Altera SDK for OpenCL version 13.0 SP1 provides the following logical components:

■ Compiler—The compiler translates your OpenCL C device code into a hardware configuration file that can be loaded onto an Altera FPGA.

■ Altera SDK for OpenCL (AOCL) Utility —The AOCL utility includes a set of commands that allows you to perform high-level tasks. For more information, refer to “Altera SDK for OpenCL Utility” on page 9.

IX words and logos are trademarks of Altera Corporation and registered in the U.S. Patent and countries. OpenCL™ and the OpenCL logo are trademarks of Apple Inc. used by permission of OpenCL is based on a published Khronos Specification, and is expected to pass the Khronos

Current conformance status can be found at www.khronos.org/conformance. All other words and or service marks are the property of their respective holders as described at al.html. Altera warrants performance of its semiconductor products to current specifications in ard warranty, but reserves the right to make changes to any products and services at any time no responsibility or liability arising out of the application or use of any information, product, or as expressly agreed to in writing by Altera. Altera customers are advised to obtain the latest version

Feedback Subscribe

ISO 9001:2008 Registered

Page 2 Contents of the Altera SDK for OpenCL version 13.0 SP1

■ Host Runtime—The host runtime provides the OpenCL host platform API and runtime API for your OpenCL host application. The host runtime consists of the following libraries:

■ Statically-linked libraries provide OpenCL host APIs, hardware abstractions and helper libraries

■ Dynamically-linked libraries (DLLs) provide hardware abstractions and helper libraries

On Windows and Linux machines, the installation process for the Altera SDK for OpenCL installs the SDK into a folder or directory referenced by the ALTERAOCLSDKROOT environment variable. The contents of the Altera SDK for OpenCL version 13.0 SP1 are summarized in Table 1.

Table 1. Contents of the Altera SDK for OpenCL for Windows and Linux

WindowsFolder

Linux Directory Description

\windows64\bin /linux64/binShared libraries, executables and Windows DLLs for your development environment. Include this folder or directory in your PATH environment variable.

\windows64\driver /linux64/driver

The board driver in this folder or directory allows your host computer to communicate with the BittWare FPGA board. Refer to the Altera SDK for OpenCL Getting Started Guide for installation instructions.

Note: The board driver for the Nallatech FPGA board resides in the following location:

■ For Windows system, the path is to the driver source is %ALTERAOCLSDKROOT%\board\pcie385n\windows64\ driver

■ For Linux system, the path is to the driver source is $ALTERAOCKSDKROOT/board/pcie385n/host/linux64/ driver

\board /board The Altera Preferred Board Partner Program for OpenCL for each FPGA board supported by the SDK.

\ip /ip IP cores used to compile device kernels.

\host /host Files used to compile your program.

\host\includeJune /host/includeOpenCL version 1.0 header files and Altera SDK for OpenCL interface files that are used to compile and link your host program. Add this path to the include file search path in your development environment.

\host\windows64\lib /host/linux64/libOpenCL host runtime libraries that comprise a generic layer and a hardware abstraction layer that interface the host CPU with the FPGA board via PCIe. Used to link your host program.

— /linux64/lib Shared libraries for your development environment. Include this directory in your LD_LIBRARY_PATH environment variable.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Example OpenCL Applications Page 3

Example OpenCL ApplicationsThe Altera SDK for OpenCL includes the following example OpenCL applications. To access these example files, navigate to the ALTERAOCLSDKROOT/designs/<design_name> folder or directory.

■ fft—An OpenCL application of a fast fourier transform algorithm.

■ hello_world—An OpenCL application that prints a message on the host system.

■ matrixMult—An OpenCL application of a single precision floating-point-blocked matrix multiplication algorithm.

■ moving_average—An OpenCL application of an averaging filter.

■ vectorAdd—An OpenCL application that adds two vectors.

Altera SDK for OpenCL Compilation FlowCompiling an OpenCL application with the SDK is a two-step process:

1. Compile your OpenCL kernels. Refer to “Compiling the OpenCL Kernel Source File” on page 5.

2. Compile and link your host program. Refer to “Compiling and Linking Your Host Program” on page 10.

Figure 1 depicts the compilation flow of the Altera SDK for OpenCL.

Figure 1. The Altera SDK for OpenCL Compilation Flow

PCIe

Host Binary

AOC OpenCLCompiler

StandardC Compiler

Kernel Code

Kernel Binary

Host Code

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 4 Using the Altera Offline Kernel Compiler

Using the Altera Offline Kernel CompilerFigure 2 illustrates the compilation flow of an OpenCL application that targets an Altera FPGA.

You create an FPGA image of your kernel file using the Altera Offline Compiler (AOC). This kernel compiler translates the kernel source file (.cl) into an Altera Offline Compiler Object file (.aoco). The AOC then uses the .aoco file to build a hardware configuration file, called the Altera Offline Compiler Executable file (.aocx), that targets the FPGA.

1 If you create multiple kernel files, you must consolidate them into a single .cl file prior to compilation.

Setting Up Your FPGA BoardYou can target any FPGA board provided in a board package supplied by an Altera Preferred Board Partner. To specify the board package that the AOC uses, set the AOCL_BOARD_PACKAGE_ROOT environment variable to the path of your board vendor's board package (that is, ALTERAOCLSDKROOT/board/<board_name>).

For example, if you run the SDK on Linux, set the board package environment variable as shown below:

■ For the Nallatech PCIe-385N A7 FPGA computing card, set AOCL_BOARD_PACKAGE_ROOT to $ALTERAOCLSDKROOT/board/pcie385n

■ For the BittWare S5-PCIe-HQ Altera Stratix V GSMD5 half-length PCIe board, set AOCL_BOARD_PACKAGE_ROOT to $ALTERAOCLSDKROOT/board/s5pciehq

Figure 2. Compilation Flow of an OpenCL Application Using the Altera SDK for OpenCL Offline Kernel Compiler

Kernelfile 1

Kernelfile 2

Kernelfile 3

Merged kernel

source file

Altera OfflineKernel Compiler

(AOC)

FPGAimage file

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Using the Altera Offline Kernel Compiler Page 5

Compiling the OpenCL Kernel Source FileThe AOC builds your OpenCL kernels and creates FPGA hardware configuration files from them. Table 2 lists the available aoc compilation options.

You can invoke the following command to direct the AOC to build your OpenCL kernel and create your hardware configuration file in a single step:

aoc <your_kernel_filename>.cl

c The process of creating an FPGA hardware configuration file takes hours to complete. Altera recommends that you perform this hardware configuration step on a computer that is equipped with at least four processor cores and 24 gigabytes (GB) of RAM.

You may also perform the compilation in stages. For illustration purposes, assume you are creating an FPGA hardware configuration file for a vector addition OpenCL application. The AOC builds the .aoco file from the kernel source file shown in Example 1.

Table 2. Available Compilation Options for the AOC

Command Description

Default compilation option:

aoc <your_kernel_filename>.clGenerates a Quartus II hardware design project and then creates the .aocx FPGA hardware configuration file in the project directory.

A two-step compilation option:

aoc -c <your_kernel_filename>.cl

Step 1:

Generates a Quartus II hardware design project in a subdirectory that has the same name as the kernel filename, with special characters converted to underscores.The subdirectory contains intermediate files that the SDK uses to build the bit stream used to program the FPGA.

This command generates the .aoco file.

aoc <your_kernel_filename>.aocoStep 2:

Creates the .aocx file from the .aoco file.

Example 1. vectorAdd.cl Kernel Source File

// AOCL kernel for adding two input vectors

__kernel void vectorAdd(__global const float * restrict x,__global const float * restrict y,__global float * restrict z,const int numElements)

{

// get index of the work-itemsize_t gid = get_global_id(0);

if (gid < numElements){

z[gid] = x[gid] + y[gid];}

}

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 6 Using the Altera Offline Kernel Compiler

1 You can access the kernel source file vectorAdd.cl by navigating to the ALTERAOCLSDKROOT/designs/vectorAdd/vectorAdd_<board_name> folder or directory.

To create the .aoco file, invoke the aoc -c vectorAdd.cl command. When the first compilation step completes and you return to the command prompt, type aoc vectorAdd.aoco to create the hardware .aocx file from the .aoco file.

1 Generating the .aoco file is a rapid process compared to creating the .aocx file. To increase efficiency, iterate on your kernel code using this compilation option before generating a hardware configuration file.

Targeting a Specific FPGA BoardAfter you set the AOCL_BOARD_PACKAGE_ROOT environment variable, you can view a list of the available boards within your current board package by typing the command aoc --list-boards. You can then target any of the listed boards within that board package using the --board <board_name> flag of the aoc command. For example, to target the Nallatech PCIe-385N A7 FPGA computing card, type the command:

aoc --board pcie385n_a7 <your_kernel_filename>.cl

Generating Compilation ReportsBy default, the command prompt appears to signify the completion of the compilation. If you want to review the kernel throughput analysis and the estimated resource usage summary reports for the compilation, add the --report flag to the aoc command. For example, to view the reports for the vectorAdd kernel compilation, shown in Example 2, invoke the aoc -c vectorAdd.cl --report command.

1 If the AOC fails to determine the throughput in work-items per second, the throughput due to control flow analysis value in the kernel throughput analysis report appears as .24M work items/second.

1 You can also view the compilation reports by accessing the <your_kernel_filename>.log file located in the <your_kernel_filename> folder or subdirectory.

Example 2. Output of the aoc -c vectorAdd.cl --report Command

aoc: Selected target board pcie385n_a7Kernel throughput analysis for : vectorAdd .. simd work items : 1 .. compute units : 1 .. throughput due to control flow analysis : 250.00 M work items/second .. kernel global memory bandwidth analysis : 3157.89 MB/second .. kernel number of local memory banks : none

+--------------------------------------------------------------------+; Estimated Resource Usage Summary ;+----------------------------------------+---------------------------+; Resource + Usage ;+----------------------------------------+---------------------------+; Logic utilization ; 17% ;; Dedicated logic registers ; 7% ;; Memory blocks ; 16% ;; DSP blocks ; 0% ;+----------------------------------------+---------------------------;

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Using the Altera Offline Kernel Compiler Page 7

Altera Offline Compiler OptionsYou can type the command aoc --help to obtain information about the command options for the AOC, as summarized in Table 3.

Table 3. AOC Command Options (Part 1 of 3)

Option Description

Help options:

--help or -h Displays a list of aoc command options.

--version Prints out version information and exits.

-v Reports the progress of compilation.

--report Generates reports for kernel throughput analysis and estimated resource usage summary. Reported resource utilization includes external interfaces such as PCIe, memory controller and direct memory access (DMA) engine.

Overall options:

-c Sets up a Quartus II hardware design project without performing a hardware build.

-o <file> Uses <file> as the name for the output.

Usage:To specify the filename of the output .aoco file, type the command:aoc -c -o <your_object_file>.aoco myKernel.cl

To specify the filename of the output .aocx file, type the command:aoc -o <your_executable_file>.aocx myKernel.cl

oraoc -o <your_executable_file>.aocx myKernel.aoco

-I <directory> Adds <directory> to header search path.

Note: The header files residing in <directory> usually refer to the OpenCL kernel files that are included in the main OpenCL kernel.

-D <macro_name> Defines a macro. You can define a macro called <macro_name>, or you can define a macro and assign a value to it, that is, <macro_name=value>.

For example, aoc myKernel.cl -D RADIUS=4 is equivalent to adding a #define RADIUS 4 line in your kernel source file.

-W Suppresses warnings.

-Werror Forces all warnings into errors.

Modifiers:

--board <board name> Runs compilation for the specified board.

For example, the command aoc --board pcie385n_a7 myKernel.cl compiles myKernel for the Nallatech PCIe-385N A7 FPGA computing card.

--list-boards Prints a list of available boards and exits.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 8 Using the Altera Offline Kernel Compiler

Optimization controls:

-fp-relaxed=true Directs the AOC to relax the order of arithmetic floating-point operators. You can use this flag to optimize floating-point operations using balanced trees.If you want to use this optimization option, your program must be able to tolerate small differences in floating-point results.

Usage: aoc -fp-relaxed=true myKernel.cl

For more information, refer to the Tree Balancing section of the Altera SDK for OpenCL Optimization Guide.

-fpc=true Directs the AOC to remove intermediary floating-point rounding operations and conversions whenever possible, and to carry additional bits to maintain precision. Changes rounding mode to round towards zero for multiplications and additions.If possible, this argument causes the AOC to round a floating-point operation once at the end of the balanced tree.

Usage: aoc -fpc=true myKernel.cl

For more information, refer to the Rounding Operations section of the Altera SDK for OpenCL Optimization Guide.

--sw-dimm-partition Configures the global memory as separate address spaces for each external memory or bank.After you configure the global memory, allocate each buffer in one bank or the other with the Altera-specific cl_mem_flags (For example, CL_MEM_BANK_2_ALTERA).

For more information, refer to the Optimizing Global Memory Accesses section of the Altera SDK for OpenCL Optimization Guide.

--const-cache-bytes <N> Configures the constant cache size in bytes, rounded to the next nearest power of 2.This argument has no effect if none of the kernels uses the __constant address space. The default constant cache size is 16 kilobytes (kB).

For example, to configure a 32 kB cache, type the command aoc --const-cache-bytes 32768 myKernel.cl

-O3 Applies resource-driven performance optimizations to improve performance without violating fitting requirements (estimated logic utilization is less than or equal to 85%).

For more information, refer to the Resource-Driven Optimization section of the Altera SDK for OpenCL Optimization Guide.

--util <N> Overrides the logic utilization threshold used by the -O3 option with <N>%.

For more information, refer to the Resource-Driven Optimization section of the Altera SDK for OpenCL Optimization Guide.

Table 3. AOC Command Options (Part 2 of 3)

Option Description

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Using the Altera Offline Kernel Compiler Page 9

Altera SDK for OpenCL UtilityThe Altera SDK for OpenCL (AOCL) utility provides additional functionality that allows you to perform high-level tasks. You can obtain information on your host program such as flags and makefile fragments. You can invoke the aocl help command to obtain information about the command options, as summarized in Table 4.

OpenCL Build Options:

-cl-single-precision-constant

Refer to section 5.4.3 of the OpenCL Specification version 1.0.

-cl-denorms-are-zero

-cl-opt-disable

-cl-strict-aliasing

-cl-mad-enable

-cl-no-signed-zeros

-cl-unsafe-math-optimizations

-cl-finite-math-only

-cl-fast-relaxed-math

Table 3. AOC Command Options (Part 3 of 3)

Option Description

Table 4. AOCL Command Options

Option Description

Subcommands for building your host program:

aocl example-makefile Displays makefile fragments for compiling and linking a host program.

aocl makefile Same as the example-makefile subcommand.

aocl compile-config Displays the flags for compiling your host program.

aocl link-config

Displays the flags for linking your host program with the runtime libraries provided by the Altera SDK for OpenCL.This command combines the function of the ldflags and ldlibs subcommands.

aocl linkflags Same as the link-config subcommand.

aocl ldflagsDisplays the linker flags used to link your host program to the host runtime libraries provided by the Altera SDK for OpenCL. This command does not list the libraries themselves.

aocl ldlibs Displays the list of host runtime libraries provided by the Altera SDK for OpenCL.

aocl installInstalls the driver and tools for your FPGA board into the current host system. The board-specific tools are referenced by the AOCL_BOARD_PACKAGE_ROOT environment variable.

aocl programConfigures a new FPGA image onto the board.

Usage: aocl program <your_executable_file>.aocx.

aocl diagnostic Runs the board vendor’s test program for your FPGA board.

aocl flash[If supported] Initializes the FPGA with a specified startup configuration.

Usage: aocl flash <your_executable_file>.aocx.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 10 Compiling and Linking Your Host Program

Compiling and Linking Your Host ProgramIn an OpenCL application, the host program uses standard OpenCL runtime APIs to manage device configuration, data buffers, kernel launches, and event synchronization. The host program also contains host-side functions such as file I/O or portions of the source code that do not run on an accelerator device. A host program includes C header files that describe the OpenCL APIs. These C header files must link against the runtime API libraries.

Host Program Compilation SettingsTo compile your host program, type aocl compile-config at the command line to obtain the following path you must add to your C preprocessor to include a file search path:

■ For Windows systems, add the path -I%ALTERAOCLSDKROOT%\host\include

■ For Linux systems, add the path -I$ALTERAOCLSDKROOT/host/include

The OpenCL API header files reside in this directory. You should include, in your host source, the OpenCL header file opencl.h, which is located in the ALTERAOCLSDKROOT/host/include/CL folder or directory.

1 On Windows, you must add the /MD flag to link the host runtime libraries against the multi-threaded DLL version of the Microsoft C Runtime library. You must also compile your host program with the /MD compilation flag, or use the /NODEFAULTLIB linker option to override the selection of runtime library.

Library Paths and LinksYour host program must link against the appropriate runtime libraries provided by the SDK. To obtain the link options, type the command aocl link-config.

General:

aocl version Displays the version information.

aocl help Displays a list of aocl options.

aocl help <subcommand> Displays the help content for a particular command.

Table 4. AOCL Command Options (Continued)

Option Description

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Programming the FPGA Hardware Page 11

Programming the FPGA HardwareAfter the AOC compiles the OpenCL kernels, the OpenCL runtime builds the device programs, which are then executed on an FPGA via PCIe CvP, as outlined in “Loading Kernels onto an FPGA” on page 17.

c You must ensure that the PCIe device inside your FPGA board is configured properly prior to loading the kernels into the OpenCL runtime using clCreateProgramWithBinary. To verify the successful instantiation of the PCIe device, type the command aocl diagnostic.

f For Windows systems, refer to the Programming FPGA Hardware on Windows section of the Altera SDK for OpenCL Getting Started Guide for verification instructions.

f For Linux systems, refer to the Programming FPGA Hardware on Linux section of the Altera SDK for OpenCL Getting Started Guide for verification instructions.

1 If the PCIe device on your FPGA board is not configured properly, refer to “Programming the Flash Memory of an FPGA” on page 19 for programming instructions.

Running Your OpenCL ApplicationAfter you load the FPGA hardware configuration file onto the FPGA, you may run the host application.

f For Windows systems, refer to the Running the Host Application on Windows section of the Altera SDK for OpenCL Getting Started Guide for instructions.

f For Linux systems, refer to the Running the Host Application on Linux section of the Altera SDK for OpenCL Getting Started Guide for instructions.

OpenCL Programming ConsiderationsWhen you create your kernel and host applications, keep in mind the following programming considerations:

■ For OpenCL kernels, refer to “Kernel Programming Considerations”.

■ For host code, refer to “Host Programming Considerations”.

Kernel Programming ConsiderationsConsider the following guidelines when you create a kernel or modify a kernel that was originally written to target a different architecture.

f For more information about optimizing your OpenCL kernel code to target an FPGA refer to the Altera SDK for OpenCL Optimization Guide.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 12 OpenCL Programming Considerations

Consolidating Your Kernel Source FilesCollect the source code from all your kernels into a single file. The name of the single file must have a .cl extension. For example, collect all the source code from all your kernel source files into a single file and name the file mykernels.cl.

Specifying the Work-Group SizeIf your kernel uses barriers and it must handle work-groups with more than 256 work-items each, specify the number of work-items you require with either the reqd_work_group_size or the max_work_group_size kernel attribute. The default maximum work-group size is 256 work-items. Refer to “Augmenting Your OpenCL Kernel by Specifying Kernel Attributes and Pragmas” on page 14 for more information.

__constant Address Space QualifiersConsider the following limitations and workarounds when you include __constant address space qualifiers in your kernel.

Function Scope __constant Variables

The AOC does not support function scope __constant variables. Replace function scope __constant variables with file scope constant data, as described in “File Scope __constant Variables”. You can also replace function scope __constant variables with __constant buffers that are passed from the host to the kernel, as described in “Pointers to __constant Parameters from the Host”.

File Scope __constant Variables

The AOC supports only scalar file scope constant data. It does not support vector type file scope __constant variables. To avoid long compilation times, file scope constant data must not exceed 32 kB in size. For example, you may set the __constant address space qualifier as follows:

__constant int my_array[8] = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7};

__kernel void my_kernel (__global int * my_buffer){

size_t gid = get_global_id(0);my_buffer[gid] += my_array[gid % 8];

}

In this case, values for my_array are set in a ROM because the file scope constant data does not change between kernel invocations.

c You must not set your file scope __constant variables in the following manner because the AOC does not support vector type __constant arrays declared at the file scope:

__constant int2 my_array[4] = {(0x0, 0x1), (0x2, 0x3), (0x4, 0x5), (0x6, 0x7)};

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

OpenCL Programming Considerations Page 13

Pointers to __constant Parameters from the Host

You can replace file scope constant data with a pointer to a __constant parameter in your kernel code. You must then modify your host program to create cl_mem objects associated with the pointers in global memory, load constant data into them with clEnqueueWriteBuffer prior to kernel execution, and pass the objects to the kernel as arguments with clSetKernelArg.

If a constant variable is a complex type, Altera recommends that, for simplicity, you use a typedef argument. For example:

1 Use the same type definition in both your host program and your kernel.

Passing File Scope Structures to OpenCL KernelsConvert each structure parameter to a pointer that points to a structure. For example:

1 The __global struct declaration creates a new buffer to store the structure. Use a restrict qualifier in the declaration of the pointer to the structure to prevent pointer aliasing.

f Refer to “Modifying Host Program for Structure Parameter Conversion” on page 18 for guidelines on how to modify your host program when you implement the conversion.

Example 3. Replacing File Scope __constant Variable with Pointer to __constant Parameter

If your source code is structured as follows: Rewrite your code to resemble the following syntax:

__constant int Payoff[2][2] = {{ 1, 3}, {5, 3}};

__kernel void original(__global int * A) {

*A = Payoff[1][2];// and so on

}

typedef int Payoff_type[2][2];

__kernel void modified(__global int * A,__constant Payoff_type * PayoffPtr )

{*A = (PayoffPtr)[1][2];// and so on

}

Example 4. Converting Structure Parameters to Pointers that Point to Structures

If your source code is structured as follows: Rewrite your code to resemble the following syntax:

struct Context {

float param1;float param2;int param3;uint param4;

};

__kernel void algorithm( __global float * A, struct Context c)

{if ( c.param3 ) {

// and so on}

}

struct Context {

float param1;float param2;int param3;uint param4;

};

__kernel void algorithm(__global float * A,__global struct Context * restrict c)

{if ( c->param3){

// Dereference through a pointer// and so on

}}

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 14 OpenCL Programming Considerations

Augmenting Your OpenCL Kernel by Specifying Kernel Attributes and PragmasYou can increase the data processing efficiency of your OpenCL kernel by specifying kernel attributes and pragmas. These kernel attributes and pragmas direct the AOC to process your work-items and work-groups in such a way that allows you to achieve higher throughput. The AOC conducts static performance estimations and automatically selects operating conditions that do not degrade performance.

f For more information on the usage of kernel attributes and pragmas, refer to the Altera SDK for OpenCL Optimization Guide.

The following kernel attributes are available in the Altera SDK for OpenCL version 13.0 SP1:

■ max_work_group_size—Specifies the upper limit of the number of work-items that the AOC can allocate to a work-group in a kernel.

■ reqd_work_group_size—Specifies the required work-group size so that the AOC can allocate the exact amount of hardware resources to manage the work-items in a work-group.

■ num_compute_units—Specifies the number of compute units instantiated to process the kernel. Work-groups are distributed across the specified number of compute units.

For example, the code fragment below directs the AOC to instantiate two compute units in a kernel using the num_compute_units attribute:

__attribute__((num_compute_units(2)))__kernel void test (__global const float * restrict a,

__global const float * restrict b,__global float * restrict answer)

{size_t gid = get_global_id(0);answer[gid] = a[gid] + b[gid];

}

The increased number of compute units achieves higher throughput at the expense of global memory bandwidth contention among compute units.

■ num_simd_work_items—Specifies the number of work-items within a work-group the AOC should execute in an SIMD manner. The AOC replicates the kernel datapath according to the value specified with this attribute whenever possible. You must use the num_simd_work_items attribute in conjunction with the reqd_work_group_size attribute. The num_simd_work_items attribute must be set so that it evenly divides the work-group size specified by the reqd_work_group_size attribute.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

OpenCL Programming Considerations Page 15

For example, the code fragment below assigns a fixed work-group size of 64 work-items to a kernel and then consolidates the work-items within each work-group into four SIMD vector lanes:

__attribute__((num_simd_work_items(4)))__attribute__((reqd_work_group_size(64,1,1)))__kernel void test (__global const float * restrict a,

__global const float * restrict b,__global float * restrict answer)

{size_t gid = get_global_id(0);answer[gid] = a[gid] + b[gid];

}

■ max_unroll_loops—Specifies the maximum number of times all loops in a given kernel can be unrolled automatically by the AOC. You can override this attribute by specifying unroll factors for individual loops using the #pragma unroll directive.

1 If a #pragma unroll value is not specified for a loop, the AOC might unroll that loop up to the amount specified by max_unroll_loops.

■ max_share_resources—Specifies the upper limit on the number of times a shared resource (operator) can be reused by the AOC. The AOC will attempt to reuse the shared resource without degrading performance.

■ num_share_resources—Specifies the number of times a shared resource (operator) should be reused. The AOC will attempt to reuse each resource by at least this amount.

1 Using this attribute might lead to an X-time performance drop, where X equals to the value specified for num_share_resources.

■ local_mem_size—Specifies a pointer size other than the default size of 16 kB to optimize local memory performance. For example, you can use the local_mem_size attribute in your pointer declaration in the following manner:

__kernel void myLocalMemoryPointer(__local float * A,__attribute__((local_mem_size(1024))) __local float * B,__attribute__((local_mem_size(32768))) __local float * C)

{. . .

}

In this myLocalMemoryPointer function, 16 kB of local memory (default) is allocated to pointer A, 1 kB is allocated to pointer B, and 32 kB is allocated to pointer C.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 16 OpenCL Programming Considerations

The following pragma is available in the Altera SDK for OpenCL:

■ unroll—Instructs the AOC to unroll a loop. The AOC might unroll simple loops even if they are not annotated by a pragma. When #pragma unroll <N> is given, the AOC attempts to unroll the loop at most <N> times. Consider the following code fragment:

#pragma unroll 2for(size_t k = 0; k < 4; k++){

mac += data_in[(gid * 4) + k] * coeff[k];}

By assigning a value of 2 as an argument to #pragma unroll, you direct the AOC to unroll the loop twice.

If you do not specify a value for <N>, the AOC attempts to unroll the loop fully, subject to understanding the trip count. The AOC issues a warning if the unrolling request is unfeasible. Altera recommends that you provide an unroll factor whenever possible.

Host Programming ConsiderationsConsider the following guidelines when you create a kernel or modify a kernel that was originally written to target a different architecture.

Host Binary RequirementYou must build 64-bit host binaries from your host code. The Altera OpenCL runtime does not support 32-bit binaries.

Host Machine Memory RequirementsThe machine that runs the host program must have enough host memory to support all of the following components simultaneously:

■ The host program and operating system.

■ The working set for the host program.

■ The maximum amount of OpenCL memory buffers that can be allocated at once. Every device-side cl_mem buffer is associated with a corresponding storage area in the host process. Therefore, the amount of host memory required can be as large as the amount of external memory supported by the FPGA.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

OpenCL Programming Considerations Page 17

Loading Kernels onto an FPGAThe AOC builds the .cl file for an OpenCL application. Kernels are compiled separately by the workstation offline and then loaded into the OpenCL runtime using clCreateProgramWithBinary, shown in Example 5.

After the .aocx file is built, you may use it to create the OpenCL device program. You can use either clCreateKernelsInProgram or clCreateKernel to create kernel objects. Instead of clCreateProgramWithSource, the clCreateProgramWithBinary function uses the .aocx file to create cl_program objects for the targeted FPGA. You can load multiple FPGA programs into memory, and the host runtime will reprogram the FPGA as required to execute the scheduled kernels via the clEnqueueNDRangeKernel and clEnqueueTask API calls.

Aligned Memory AllocationYou should use aligned memory allocations when assigning a host-side buffer to be written or read from the FPGA. This allows DMA transfers to occur to and from the FPGA, which is a more efficient form of memory movement. Altera recommends that you use at least a 64-byte alignment. To set up aligned memory allocations, add the source code in Example 6 to your host code.

To free up an aligned memory block, include the following functions:

■ On Windows, the function call is _aligned_free(ptr);

■ On Linux, the function call is free(ptr);

Example 5. Host Code Showing Usage of clCreateProgramWithBinary

size_t lengths[1];unsigned char* binaries[1] ={NULL};cl_int status[1];cl_int error;cl_program program;const char options[] = "";

FILE *fp = fopen("program.aocx","rb");fseek(fp,0,SEEK_END);lengths[0] =ftell(fp);binaries[0]= (unsigned char*)malloc(sizeof(unsigned char)*lengths[0]);rewind(fp);fread(binaries[0],lengths[0],1,fp);fclose(fp);

program = clCreateProgramWithBinary(context,1,device_list,lengths,(const unsigned char **)binaries,status,&error);

clBuildProgram(program,1,device_list,options,NULL,NULL);

Example 6. Source Code for Enabling Aligned Memory Allocations

For Windows:

#define AOCL_ALIGNMENT 64#include <malloc.h>void *ptr = _aligned_malloc (size, AOCL_ALIGNMENT);

For Linux:

#define AOCL_ALIGNMENT 64#include <stdlib.h>void *ptr = NULL;posix_memalign (&ptr, AOCL_ALIGNMENT, size);

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 18 OpenCL Programming Considerations

Optimizing Global Memory AccessesBy default, the AOC configures the two memory interfaces of the FPGA board in a burst-interleaved configuration. You can maximize global memory bandwidth by eliminating memory contention with better load balancing. You can invoke the --sw-dimm-partition flag of the aoc command to partition data into two memory banks manually so that both banks are accessed simultaneously, effectively doubling your memory bandwidth.

To manually partition global memory, perform the following steps:

1. Compile your OpenCL kernels and configure the memory banks as separate address spaces with the --sw-dimm-partition flag. At a command prompt, type:

aoc --sw-dimm-partition <your_kernel_filename>.cl

2. When you create an OpenCL buffer in your host program, allocate the buffer to one of the two banks with one of the following CL_MEM_BANK flags:

■ To allocate the buffer to the lowest available memory region, use CL_MEM_BANK_1_ALTERA

■ To allocate memory to the second bank (if available), useCL_MEM_BANK_2_ALTERA

For example, the following clCreateBuffer call allocates memory into the second bank if it is available.

clCreateBuffer(context, CL_MEM_BANK_2_ALTERA | CL_MEM_READ_WRITE, size, 0, 0);

If the second bank is not available at runtime, the memory is allocated to the first bank. If no global memory is available, the clCreateBuffer call fails with the error message CL_MEM_OBJECT_ALLOCATION_FAILURE.

Out-of-Order Command QueuesCommand queues do not support out-of-order execution.

Modifying Host Program for Structure Parameter ConversionIf you converted any structure parameters to pointers-to-constant structures, as outlined in “Passing File Scope Structures to OpenCL Kernels” on page 13, make the following changes to your host program:

1. Allocate a cl_mem buffer to store the structure contents. (You need a separate cl_mem buffer for every kernel that uses a different structure value.)

2. Set the structure kernel argument with a pointer to the structure buffer, not with a pointer to the structure contents.

3. Populate the structure buffer contents before enqueuing the kernel. Ensure that the structure buffer is populated before the kernel launches by either enqueuing the structure buffer on the same command queue as the kernel enqueuing, or by synchronizing separate kernel enqueuing and structure buffer enqueuing with an event.

4. When your program no longer needs to call a kernel that uses the structure buffer, release the cl_mem buffer.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Appendix A Page 19

Appendix A

Programming the Flash Memory of an FPGABy default, you configure an FPGA using the image (that is, the hardware configuration file) stored in the flash memory of the device. When there is no power, the FPGA retains the hardware configuration file stored in the flash memory. When you power up the system, the FPGA circuitry is configured based on the hardware configuration file stored in flash memory. Therefore, it is imperative that an OpenCL-compatible default image is loaded into the flash memory to ensure proper configuration of the FPGA.

The reason for programming the flash memory of your FPGA is that FPGA OpenCL kernels communicate with the host program via a PCIe device in the FPGA. An FPGA board supported by the SDK is preprogrammed with a default hardware configuration file that configures the PCIe device inside the FPGA. If the PCIe device in your FPGA board is not configured properly, you must program the flash memory of the FPGA with an OpenCL-compatible image. When you power down and then restart your host, this default image will configure the FPGA and instantiate the PCIe device.

c When you load the hardware configuration file into the flash memory of the FPGA, it is imperative to not prematurely power down the system, or attempt to launch any host code that calls OpenCL kernels or might otherwise communicate with the FPGA board.

1 Ensure that the USB-Blaster driver is installed successfully so that you can load your hardware configuration file into the flash memory.

For Windows systems, if your USB-Blaster driver is not installed properly, refer to the Installing the USB-Blaster Driver on Windows for Flash Programming section of the Altera SDK for OpenCL Getting Started Guide for installation instructions.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 20 Appendix A

Programming the Flash Memory of a Nallatech BoardTo load your hardware configuration file into the flash memory of the Nallatech board, perform the following steps:

1. Ensure that you have set the AOCL_BOARD_PACKAGE_ROOT environment variable to the board package that corresponds to your FPGA board. Refer to “Setting Up Your FPGA Board” on page 4 for instructions.

2. Generate a hardware configuration file by invoking the command aoc <your_kernel_filename>.cl. Refer to “Compiling the OpenCL Kernel Source File” on page 5 for instructions.

3. Type aocl flash <your_kernel_filename>.aocx to load the hardware configuration file into the flash memory.

Alternatively, you can program the flash memory using any of the example .aocx files, available in the ALTERAOCLSDKROOT/designs/<design_name>/ <design_name>_<board_name> folder or directory. For example, you can type the command aocl flash vectorAdd.aocx to program the flash memory with vectorAdd.aocx.

4. Power down your host machine and then power it up again.

w If you do not power down your host machine after you program the flash memory, you might not have a valid image loaded onto your FPGA.

Programming the Flash Memory of a BittWare BoardIf you have a BittWare board, you might receive the following message when you invoke the aocl flash command:

aocl flash--------------------------------------------------------------------No board flash routine supplied.Please consult your board manufacturer's documentation or supportteam for information on how to modify the default FPGA image.--------------------------------------------------------------------

If you receive this message, verify that you have installed your FPGA board and the board driver properly.

■ For more information, refer to the operating system-specific instructions in the Verifying the Functionality of the BittWare Board section of the Altera SDK for OpenCL Getting Started Guide.

If you continue to receive this warning message after your have reinstalled your board and board driver, contact BittWare for support to program the flash memory and instantiate the PCIe device in your BittWare board. Refer to the BittWare Developer Site of the BittWare website for more information.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Appendix B Page 21

Appendix BThe Altera SDK for OpenCL supports the OpenCL Specification version 1.0. The SDK host runtime conforms with the OpenCL platform layer and API, with clarifications and exceptions outlined below.

Allocation Limits of the Altera SDK for OpenCL

OpenCL Features Supported by the Altera SDK for OpenCLThe following sections outline the support statuses of the OpenCL features described in the OpenCL Specification version 1.0.

OpenCL Programming Language ImplementationSection 6 of the OpenCL Specification version 1.0 describes the OpenCL C programming language. OpenCL C is based on C99 with some limitations. The Altera SDK for OpenCL conforms with the OpenCL C programming language with clarifications and exceptions summarized in Table 6. The following symbols indicate the support statuses of the features:

■ “v” means that the feature is supported, and there is additional information available in the Notes column in Table 6.

■ “v*” means that the feature is supported with exceptions and clarifications.

■ “X” means that the feature is not supported.

■ “—” means that an explanation or clarification for the feature is available in the Notes column in Table 6.

1 OpenCL programming language implementations that are supported with no additional clarifications are not shown in Table 6.

Table 5. Allocation Limits of the Altera SDK for OpenCL

Item Limit

Maximum number of contexts 4

Maximum number of queues per context 28

Maximum number of program objects per context 20

Maximum number of concurrently running kernels The total number of queues

Maximum number of enqueued kernels More than 1000

Maximum number of kernels per FPGA device 32

Maximum number of arguments per kernel 128

Maximum total size of kernel arguments 256 bytes

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 22 Appendix B

Table 6. OpenCL Programming Language Implementation (Part 1 of 5)

Section Feature Supported Notes

6.1.1 Built-in Scalar Data Types

double precision float v**Preliminary support for all double precision float built-in scalar data types. This feature might not conform with the OpenCL Specification version 1.0.

half X

6.1.2 Built-in Vector Data Types v**Preliminary support for vectors with three elements. Three-element vectors might not conform with the OpenCL Specification version 1.0.

6.1.3 Built-in Data Types X

6.1.4 Reserved Data Types X

6.1.5 Alignment of Types —All scalar and vector types are aligned as required (vectors with three elements are aligned as if they had four elements).

6.2.1 Implicit Conversions vRefer to Section 6.2.6: Usual Arithmetic Conversions in the OpenCL Specification version 1.2 for an important clarification of implicit conversions between scalar and vector types.

6.2.2 Explicit Casts v The SDK allows scalar data casts to a vector with a different element type.

6.5 Address Space Qualifiers v* *Function scope __constant variables are not supported.

6.6 Image Access Qualifiers X

6.7 Function Qualifiers

6.7.2Optional AttributeQualifiers v

Refer to the Altera SDK for OpenCL Optimization Guide for tips on using reqd_work_group_size to improve kernel performance.

The SDK parses but ignores the vec_type_hint and work_group_size_hint attribute qualifiers.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Appendix B Page 23

6.8 Restrictions—The SDK implements the restrictions listed in the OpenCL Specification version 1.0 as follows:

pointer assignmentsbetween address spaces —

Arguments to __kernel functions declared in a program that are pointers must be declared with the __global, __constant, or __local qualifier.

The AOC enforces the OpenCL restriction against pointer assignments between address spaces.

pointers to functions X The AOC does not enforce this restriction—you must ensure that your code contains no pointers to functions.

structure-type kernelarguments

X Convert structure arguments to a pointer to a structure in global memory.

images X

bit fields X The AOC does not enforce this restriction—you must ensure that your code contains no bit fields.

variable length arrays andstructures

X

variadic macros andfunctions

X

C99 headers X C99 headers are not available.

extern, static,auto, and registerstorage-class specifiers

X

The AOC does not enforce this restriction—you must ensure that your code contains no extern or static specifiers; the auto and register specifiers should have no effect.

predefined identifiers v Use the -D option of the aoc command to provide preprocessor symbol definitions in your kernel code.

recursion X The AOC does not enforce this restriction—you must ensure that your code contains no recursive functions.

irreducible control flow X

Irreducible control flow is invalid. Code with irreducible control flows might result in a hardware implementation with unexpected behavior. Generally, irreducible control flow occurs in loops that have multiple entry points.

writes to memory ofbuilt-in types less than32 bits in size

v* *Performance of such writes is poor.

declaration of argumentsto __kernel functionsof type event_t

XThe AOC does not enforce this restriction—you must ensure that your code does not include __kernel function arguments of type event_t.

elements of a struct ora union belonging todifferent address spaces

X

The AOC does not enforce this restriction—you must ensure that all elements of a struct or a union belong to the same address space.

Warning: Assigning elements of a struct or a union to different address spaces might cause a fatal error.

Table 6. OpenCL Programming Language Implementation (Part 2 of 5)

Section Feature Supported Notes

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 24 Appendix B

6.9 Preprocessor Directives and Macros

#pragma directive:#pragma unroll

v

The AOC supports only #pragma unroll. You have the option to assign an integer argument to the unroll directive to control the extent of loop unrolling.

For example, #pragma unroll 4 unrolls a loop with four iterations.

By default, an unroll directive with no unroll factor causes the AOC to attempt to fully unroll the loop.

Refer to the Altera SDK for OpenCL Optimization Guide for tips on using #pragma unroll to improve kernel performance.

__ENDIAN_LITTLE__defined to be value 1

v The target FPGA is little-endian.

__IMAGE_SUPPORT__ X __IMAGE_SUPPORT__ is undefined; the SDK does not support images.

6.10 Attribute Qualifiers—The AOC parses attribute qualifiers as follows:

6.10.2Specifying Attributes of Functions—Structure-type kernel arguments

XConvert structure arguments to a pointer to a structure in global memory.

6.10.3 Specifying Attributes of Variables—endian

X

6.10.4Specifying Attributes ofBlocks and Control-Flow-Statements

X

6.10.5Extending AttributeQualifiers —

The AOC can parse attributes on various syntactic structures. It reserves some attribute names for its own internal use.

Refer to “Augmenting Your OpenCL Kernel by Specifying Kernel Attributes and Pragmas” on page 14 for more information about these attribute names.

Refer to the Altera SDK for OpenCL Optimization Guide for tips on how to optimize kernel performance using these kernel attributes.

6.11.2 Math Functions

built-in math functions v**Built-in math functions for double precision float are supported preliminarily. These functions might not conform with OpenCL Specification version 1.0.

built-in half_ andnative_ math functions

v*

*Built-in half_ and native_ math functions for double precision float are supported preliminarily. These functions might not conform with OpenCL Specification version 1.0.

6.11.5 Geometric Functions v

*Built-in geometric functions for double precision float are supported preliminarily. These functions might not conform with OpenCL Specification version 1.0.

Refer to Table 7 on page 26 for a list of built-in geometric functions supported by the SDK.

Table 6. OpenCL Programming Language Implementation (Part 3 of 5)

Section Feature Supported Notes

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Appendix B Page 25

6.11.8 Image Read and Write Functions

X

6.11.9Synchronization Functions—the barrier synchronization function

v*

*Clarifications and exceptions:

If a kernel specifies the reqd_work_group_size or max_work_group_size attribute, barrier supports the corresponding number of work-items.

If neither attribute is specified, a barrier is instantiated with a default limit of 256 work-items.

The work-item limit is the maximum supported work-group size for the kernel; this limit is enforced by the runtime.

6.11.11Async Copies from Global to Local Memory, Local to Global Memory, and Prefetch

v*

*The implementation is naive:

Work-item (0,0,0) performs the copy and the wait_group_events is implemented as a barrier.

If a kernel specifies the reqd_work_group_size or max_work_group_size attribute, wait_group_events supports the corresponding number of work-items.

If neither attribute is specified, wait_group_events is instantiated with a default limit of 256 work-items.

Additional built-in vector functions from the OpenCL Specification version 1.2 Section 6.12.12: Miscellaneous Vector Functions:

vec_step vshuffle vshuffle2 v

Table 6. OpenCL Programming Language Implementation (Part 4 of 5)

Section Feature Supported Notes

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 26 Appendix B

OpenCL Specification version 1.2 Section 6.12.13: printf v*

*Preliminary support. This feature might not conform with the OpenCL Specification version 1.0. See below for details.

The printf function in OpenCL has syntax and features similar to the printf function in C99, with a few exceptions. For details, refer to the OpenCL Specification version 1.2.

To use a printf function, there are no requirements for special compilation steps, buffers, or flags; you can compile kernels that include printf instructions with the usual aoc command.

During kernel execution, printf data is stored in a global printf buffer that is automatically allocated by the AOC. The size of this buffer is 64 kB; the total size of data arguments to a printf call should not exceed this size. When kernel execution completes, the contents of the printf buffer are printed to standard output.

Buffer overflows are handled seamlessly; printf instructions can be executed an unlimited number of times. However, if the printf buffer overflows, kernel pipeline execution stalls until the host reads the buffer and prints the buffer contents.

Because printf functions store their data into a global memory buffer, the performance of your kernel will drop if it includes such functions.

There are no usage limitations on printf functions. You can use printf instructions inside if-then-else statements, loops, etc. A kernel can contain multiple printf instructions that are executed by multiple work-items.

Format string arguments and literal string arguments of printf calls are transferred to the host system from the FPGA using a special memory region. This memory region can overflow if the total size of the printf string arguments is large (3000 characters or less is usually safe in a typical OpenCL application). An overflow causes the following error message to be printed during the host program execution:

cannot parse auto-discovery string at byte offset 4096

Output from printf is never intermixed, even though work-items may execute printf functions concurrently; however, the order of concurrent printf execution is not guaranteed. In other words, printf outputs may not appear in program order if the printf instructions are in concurrent datapaths.

Table 7. Supported Scalar and Vector Argument Built-in Geometric Functions

FunctionArgument Type

float double

cross v vdot v vdistance v vlength v vnormalize v vfast_distance v —

fast_length v —

fast_normalize v —

Table 6. OpenCL Programming Language Implementation (Part 5 of 5)

Section Feature Supported Notes

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Appendix B Page 27

Numerical Compliance ImplementationSection 7 of the OpenCL Specification version 1.0 describes features of the C99 and IEEE 754 standards that must be supported by OpenCL compliant devices. The Altera SDK for OpenCL operates on 32-bit and 64-bit floating-point values in IEEE Standard 754-2008 format, but not all floating-point operators have been implemented, as summarized in Table 8.

Image Addressing and Filtering ImplementationImage addressing and filtering are not supported. The SDK does not support images.

Table 8. Numerical Compliance Implementation

Section Feature Supported Notes

7.1 Rounding Modes v*

Conversions between integer and single and half precision floating-point types support all rounding modes.

*Conversions between integer and double precision floating-point types support all rounding modes on a preliminary basis. This feature might not conform with the OpenCL Specification version 1.0.

7.2 INF, NaN and Denormalized Numbers

v*

INF and NaN results for single precision operations are generated in a manner that conforms with the OpenCL Specification version 1.0. Most operations that handle denormalized numbers are flushed prior to and after a floating-point operation.

*Double precision floating-point operation is supported on a preliminary basis. This feature might not conform with the OpenCL Specification version 1.0.

7.3 Floating-Point Exceptions X The SDK discards floating-point exceptions.

7.4 Relative Error as ULPs v*

Single precision floating-point operations conform with the numerical accuracy requirements for an embedded profile of the OpenCL Specification version 1.0.

*Double precision floating-point operation support is preliminary. This feature might not conform with the OpenCL Specification version 1.0.

7.5 Edge Case Behavior v

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 28 Appendix B

Atomic FunctionsSection 9 of the OpenCL Specification version 1.0 describes a list of optional features that may be supported by some OpenCL implementations.

■ Section 9.5: Atomic Functions for 32-bit Integers—The SDK supports all 32-bit global and local memory atomic functions. The SDK also supports 32-bit atomic functions, described in Section 6.11.11 of the OpenCL Specification version 1.1 and Section 6.12.11 of the OpenCL Specification version 1.2. The SDK does not support 64-bit atomic functions described in Section 9.7 of the OpenCL Specification version 1.0.

1 The implementation of atomic functions may lower the performance of your design. The operating frequency of the hardware might decrease if more than one type of atomic functions are used in the kernel.

Embedded Profile ImplementationSection 10 of the OpenCL Specification version 1.0 describes the OpenCL embedded profile. The Altera SDK for OpenCL conforms with the OpenCL embedded profile with clarifications and exceptions summarized in Table 9.

Multiple Host ThreadsAppendix A.2: Multiple Host Threads of the OpenCL Specification version 1.0—The Altera SDK for OpenCL host library is not thread safe.

Table 9. Clarifications and Exceptions to OpenCL Embedded Profile

Clause Feature Supported Notes

1 64-bit integers v2 3D images X The SDK does not support images.

3Create 2D and 3D images with image_channel_data_type values

X The SDK does not support images.

4 Samplers X

5 Rounding modes —

6Restrictions listed for single precision basic floating-point operations

X

7 half type X This clause of the OpenCL Specification version 1.0 does not apply to the SDK.

8

Error bounds listed for conversions from CL_UNORM_INT8, CL_SNORM_INT8, CL_UNORM_INT16 and CL_SNORM_INT16 to float

XRefer to Table 5 on page 21 for a list of allocation limits.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Document Revision History Page 29

Document Revision HistoryTable 10 lists the revision history for this document.

Table 10. Document Revision History

Date Version Changes

June 2013 13.0 SP1.0

Added the section Setting Up Your FPGA Board.

Removed the subsection Specifying a Target FPGA Board under Kernel Programming Considerations.

Inserted the subsections Targeting a Specific FPGA Board and Generating Compilation Reports under Compiling the OpenCL Kernel Source File.

Renamed the section title File Scope __constant Address Space Qualifier to __constant Address Space Qualifiers, and inserted the following subsections:

■ Function Scope __constant Variables.

■ File Scope __constant Variables.

■ Pointers to __constant Parameters from the Host.

Inserted the subsection Passing File Scope Structures to OpenCL Kernels under Kernel Programming Considerations.

Renamed the section Modifying Your OpenCL Kernel by Specifying Kernel Attributes and Pragmas to Augmenting Your OpenCL Kernel by Specifying Kernel Attributes and Pragmas.

Updated content for the unroll pragma directive in the section Augmenting Your OpenCL Kernel by Specifying Kernel Attributes and Pragmas.

Inserted the subsections Out-of-Order Command Queues and Modifying Host Program for Structure Parameter Conversion under Host Programming Considerations.

Updated the sections Loading Kernels onto an FPGA Using clCreateProgramWithBinary and Aligned Memory Allocation.

Updated flash programming instructions.

Renamed the section Optional Extensions in Appendix B to Atomic Functions, and updated its content.

Removed the section Platform Layer and Runtime Implementation from Appendix B.

May 2013 13.0.1

Explicit memory fence functions are now supported; the entry is removed from the table OpenCL Programming Language Implementation.

Updated the Programming the Flash Memory of an FPGA section.

Added the section Modifying Your OpenCL Kernel by Specifying Kernel Attributes and Pragmas to introduce kernel attributes and pragmas that can be implemented to optimize kernel performance.

Added the section Optimizing Global Memory Accesses to discuss data partitioning.

Removed the section Programming the FPGA with the aocl program Command from Appendix A.

May 2013 13.0.0

Updated compilation flow.

Updated kernel compiler commands.

Included Altera SDK for OpenCL Utility commands.

Inserted a OpenCL Programming Considerations section.

Updated flash programming procedure and moved it to Appendix A.

Included a new clCreateProgramWithBinary FPGA hardware programming flow; moved the old clCreateProgramWithBinary hardware programming flow to Appendix A.

Moved updated information on allocation limits and OpenCL language support to Appendix B.

November 2012 12.1.0 Initial release.

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation

Page 30 Document Revision History

Altera SDK for OpenCLProgramming Guide

June 2013 Altera Corporation