Productive OpenCL Programming: An Introduction to OpenCL Libraries, with ArrayFire COO Oded Green
An Introduction to OpenCL Libraries
Productive OpenCL Programming
● We make code run faster
  ○ Started in 2007 by Georgia Tech researchers
  ○ 1000s of paying customers
● We build an acceleration library
  ○ for really cool science, engineering, and finance applications
  ○ for mobile computing
Libraries are Great!
Eliminate Hidden Costs
Library Types
● Specialized GPU Libs
  ○ Targeted at a specific set of operators (functionality)
  ○ Optimized for specific systems
  ○ C-like interface
  ○ Raw pointer interface
● General GPU Libs
  ○ Manage GPU resources using containers
  ○ Applicable to a large set of applications and domains
  ○ Portable across multiple architectures
  ○ Higher level functions
  ○ C++ interface (supports templates)
Specialized GPU Libraries
● Fast Fourier Transforms
  ○ clFFT
● Random Number Generation
  ○ Random123
● Linear Algebra
  ○ clBLAS
  ○ MAGMA
● Signal and Image Processing
  ○ OpenCLIPP
Specialized GPU Libraries
● C Interface
  ○ Use pointers to reference data
● Memory management is the programmer's responsibility
● Mimic existing libraries
  ○ clBLAS ≈ BLAS
  ○ MAGMA ≈ BLAS + LAPACK
  ○ clFFT ≈ FFTW
● Simplifies GPU integration of specialized scientific libraries
  ○ Still requires setting up the GPU
clFFT
● 1D, 2D and 3D transforms
● CPU and GPU backends
● Supports
  ○ Real and complex data types
  ○ Single and double-precision
  ○ Execution of multiple transformations concurrently
Random123
● Counter-based RNG
● Passed SmallCrush, Crush and BigCrush tests
● Four RNG families
  ○ Threefry
  ○ Philox
  ○ AESNI
  ○ ARS
● Not suitable for cryptography
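To make "counter-based" concrete: the n-th random value is a pure function of a key and a counter, with no generator state carried between calls, so every work-item can compute its own values independently. The sketch below is illustrative only. It uses the SplitMix64 finalizer as a stand-in mixing function, not Threefry, Philox, AESNI or ARS, and the names `mix64`, `counter_rng` and `to_unit_interval` are hypothetical, not part of Random123.

```cpp
#include <cassert>
#include <cstdint>

// Bijective 64-bit mixing step (the SplitMix64 finalizer).
// Assumption: a stand-in for a real Random123 round function.
inline std::uint64_t mix64(std::uint64_t z) {
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

// Counter-based generation: the output depends only on (key, counter).
// No state is updated, so calls can be made in any order and in parallel.
inline std::uint64_t counter_rng(std::uint64_t key, std::uint64_t counter) {
    return mix64(mix64(key + 0x9e3779b97f4a7c15ULL) + counter);
}

// Map the top 53 bits of a 64-bit value to a double in [0, 1).
inline double to_unit_interval(std::uint64_t x) {
    return static_cast<double>(x >> 11) * (1.0 / 9007199254740992.0);  // 2^-53
}
```

A typical use would pass a per-stream key and the work-item index as the counter, for example `to_unit_interval(counter_rng(seed, global_id))`. Like the families listed above, this kind of generator is not meant for cryptography.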
Magma & clBLAS
● Implements many popular linear algebra routines
● Supports
  ○ Real and complex data types
  ○ Single and double-precision
OpenCLIPP
● Supports multiple image types
● Similar to Intel IPP
● Primitives
  ○ Arithmetic and logic
  ○ LUT
  ○ Morphology
  ○ Transform
  ○ Resize
  ○ Histogram
  ○ Many more…
● C and C++ interface
General-Purpose GPU Libraries
● Bolt
● OpenCV
● ArrayFire
Images taken from: http://wordlesstech.com/2012/10/12/leatherman-oht-multi-tool/
Bolt
● GPU library which resembles C++ STL
  ○ STL-like data structures
  ○ Iterators
  ○ Fully interoperable with OpenCL
● Parallel vector operation methods
  ○ Reductions
  ○ Sorting
  ○ Prefix-Sum
● Customizable GPU kernels using functors
● Some functions only supported on AMD GPUs
Bolt - Data Structures
● Built around the device_vector
● Supports the same data types as C++
  ○ device_vector<float> data(2e6);
● Useful when performing multiple operations on a vector
● Can be passed into STL algorithms
  ○ Always interoperable
  ○ Data transfer will be costly
Bolt - Algorithms
● Uses a C++ STL-like interface
  ○ Pass the begin and end iterators
● Accept functors which allow you to run custom operations on OpenCL devices
● Multiple backends
  ○ OpenCL, C++AMP, and TBB
  ○ Not all algorithms implemented across all backends
● Works on vector and device_vector
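Because the interface is described as STL-like (begin/end iterators plus functors), the calling pattern can be illustrated with the standard library itself. The sketch below uses only standard C++ and runs on the CPU. It does not call Bolt, and the names `Saxpy`, `saxpy` and `prefix_sum` are hypothetical helpers chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// A functor carrying its own parameter: the custom operation handed to an algorithm.
struct Saxpy {
    float a;
    float operator()(float x, float y) const { return a * x + y; }
};

// Element-wise a * x + y, written as "algorithm + iterators + functor".
inline std::vector<float> saxpy(float a, const std::vector<float>& x,
                                const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(), Saxpy{a});
    return out;
}

// Inclusive prefix sum (scan) over a vector.
inline std::vector<int> prefix_sum(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}
```

The design point is that the algorithm owns the iteration and the functor owns the per-element work, which is what lets a library move the loop onto a parallel device without changing the calling code.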
OpenCV
● Open source computer vision library
● C++ interface with many language wrappers
● Hundreds of CV functions
OpenCV ArrayFire Interop
● Helper Functions
  ○ https://github.com/arrayfire-community/arrayfire_opencv.git
Mat R;
Rodrigues(poses(Rect(0, 0, 1, 3)), R);
af::array af_R = mat_to_array(R);
ArrayFire - Data Structures
● Built around a flexible data structure named "array"
  ○ Lightweight wrapper around the data on the compute device
  ○ Manages the data and basic metadata such as size, type and dimensions
● You can transfer data into an array using constructors
● Column major
float hA[6] = {0, 1, 2, 3, 4, 5};
array A(2, 3, hA);
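Column-major storage means the 2 x 3 array built from `hA` reads as rows `0 2 4` and `1 3 5`: element (r, c) sits at offset `r + c * rows` in the host buffer. A minimal sketch of that mapping in plain C++ follows. The helper name `at_col_major` is hypothetical and is not an ArrayFire function.

```cpp
#include <cassert>
#include <cstddef>

// Column-major lookup: consecutive elements of one column are adjacent in memory,
// so moving down a row adds 1 and moving right a column adds `rows`.
inline float at_col_major(const float* data, std::size_t rows,
                          std::size_t r, std::size_t c) {
    assert(r < rows);
    return data[r + c * rows];
}
```

With the `hA` buffer above and `rows = 2`, `at_col_major(hA, 2, 0, 1)` is 2 and `at_col_major(hA, 2, 1, 2)` is 5.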
ArrayFire - Indexing
#include <arrayfire.h>
#include <af/utils.h>
void af_example()
{
float f[8] = {1, 2, 4, 8, 16, 32, 64, 128};
array a(2, 4, f); // 2 rows x 4 col array initialized with f values
array sumSecondCol = sum(a(span, 1)); // reduce-sum over the second column
print(sumSecondCol); // 12
}
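The value 12 in the comment follows from the column-major layout: in a 2 x 4 array filled from `f`, the column with index 1 holds the third and fourth values, 4 and 8. Since a column is a contiguous run of `rows` elements, the same reduction can be sketched on the host with the standard library. This is a CPU illustration under that layout assumption, and `sum_column` is a hypothetical helper, not an ArrayFire call.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum one column of a column-major matrix: the column is the contiguous
// range [col * rows, (col + 1) * rows) of the underlying buffer.
inline float sum_column(const std::vector<float>& data, std::size_t rows,
                        std::size_t col) {
    assert(rows > 0 && (col + 1) * rows <= data.size());
    auto first = data.begin() + static_cast<std::ptrdiff_t>(col * rows);
    return std::accumulate(first, first + static_cast<std::ptrdiff_t>(rows), 0.0f);
}
```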
ArrayFire Example - swap R and B
Using ArrayFire:
array tmp = img(span,span,0); // save the R channel
img(span,span,0) = img(span,span,2); // R channel gets values of B
img(span,span,2) = tmp; // B channel gets value of R
Can also do it this way:
array swapped = join(2, img(span,span,2), // blue
                        img(span,span,1), // green
                        img(span,span,0)); // red
Or simply:
array swapped = img(span,span,seq(2,-1,0));
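With column-major storage, each `img(span,span,k)` slice of a rows x cols x 3 image is one contiguous plane, so swapping R and B amounts to exchanging the first and third planes. The sketch below shows that on a host buffer in plain C++. It assumes the planes are stored in R, G, B order, and `swap_red_blue` is a hypothetical helper rather than an ArrayFire function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// img holds rows * cols * 3 values stored plane by plane (assumed order: R, G, B).
// Swapping channels exchanges plane 0 with plane 2 and leaves plane 1 alone.
inline void swap_red_blue(std::vector<float>& img, std::size_t rows, std::size_t cols) {
    const std::size_t plane = rows * cols;
    assert(img.size() == 3 * plane);
    const auto n = static_cast<std::ptrdiff_t>(plane);
    std::swap_ranges(img.begin(), img.begin() + n, img.begin() + 2 * n);
}
```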
Using ArrayFire:
array img = loadimage("image.jpg", false); // load grayscale image from disk to device
array img_T = img.T(); // transpose
ArrayFire Functions
[Figure panels: Original, Grayscale, Box filter blur, Gaussian blur, Image Negative]
ArrayFire
// erode an image, 8-neighbor connectivity
array mask8 = constant(1, 3, 3);
array img_out8 = erode(img_in, mask8);
// erode an image, 4-neighbor connectivity
const float h_mask4[] = { 0.0, 1.0, 0.0,
                          1.0, 1.0, 1.0,
                          0.0, 1.0, 0.0 };
array mask4 = array(3, 3, h_mask4);
array img_out4 = erode(img_in, mask4);
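What the two masks change can be seen in a small CPU sketch of grayscale erosion: each output pixel is the minimum of the input pixels selected by the non-zero mask entries, so the 8-neighbor mask also looks at the diagonals. This is a plain C++ illustration, not ArrayFire's implementation. The name `erode3x3` and the choice to skip neighbors outside the image are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Erode a column-major rows x cols image with a 3x3 mask (also column-major).
// Only neighbors whose mask entry is non-zero take part in the minimum.
// Assumption: neighbors that fall outside the image are ignored.
inline std::vector<float> erode3x3(const std::vector<float>& img,
                                   std::size_t rows, std::size_t cols,
                                   const float (&mask)[9]) {
    assert(img.size() == rows * cols);
    std::vector<float> out(img.size());
    const long nr = static_cast<long>(rows);
    const long nc = static_cast<long>(cols);
    for (long c = 0; c < nc; ++c) {
        for (long r = 0; r < nr; ++r) {
            float m = std::numeric_limits<float>::infinity();
            for (long dc = -1; dc <= 1; ++dc) {
                for (long dr = -1; dr <= 1; ++dr) {
                    if (mask[(dr + 1) + (dc + 1) * 3] == 0.0f) continue;
                    const long rr = r + dr;
                    const long cc = c + dc;
                    if (rr < 0 || cc < 0 || rr >= nr || cc >= nc) continue;
                    m = std::min(m, img[static_cast<std::size_t>(rr + cc * nr)]);
                }
            }
            out[static_cast<std::size_t>(r + c * nr)] = m;
        }
    }
    return out;
}
```

On an all-ones image with a single zero pixel, the 4-neighbor mask spreads the zero into a plus shape (5 pixels), while the 8-neighbor mask spreads it into a full 3 x 3 block (9 pixels).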
Erosion
ArrayFire
array R = convolve(img, ker); // 1, 2 and 3d convolution filter
array R = convolve(fcol, frow, img); // Separable convolution
array R = filter(img, ker); // 2d correlation filter
Filtering
Histograms
ArrayFire
int nbins = 256;
array hist = histogram(img, nbins);
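For reference, a histogram with `nbins` bins counts how many pixel values land in each of `nbins` equal-width intervals. The sketch below is a serial CPU version in plain C++. The explicit `[lo, hi]` range, the rule that out-of-range values are ignored, and the name `histogram_counts` are assumptions for the example and may differ from how ArrayFire's `histogram` chooses its range.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count values into nbins equal-width bins spanning [lo, hi].
// Assumption: values outside the range (and NaNs) are skipped.
inline std::vector<unsigned> histogram_counts(const std::vector<float>& img,
                                              std::size_t nbins, float lo, float hi) {
    assert(nbins > 0 && hi > lo);
    std::vector<unsigned> bins(nbins, 0u);
    const float width = (hi - lo) / static_cast<float>(nbins);
    for (float v : img) {
        if (!(v >= lo && v <= hi)) continue;
        std::size_t b = static_cast<std::size_t>((v - lo) / width);
        if (b >= nbins) b = nbins - 1;  // v == hi lands in the last bin
        ++bins[b];
    }
    return bins;
}
```

For 8-bit image data, calling it with `nbins = 256`, `lo = 0` and `hi = 256` gives one bin per intensity level.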
Transforms
ArrayFire
array half = resize(0.5, img);
array rot90 = rotate(img, af::Pi/2);
array warped = approx2(img, xLocations, yLocations);
Image smoothing
ArrayFire
array S = bilateral(I, sigma_r, sigma_c);
array M = meanshift(I, sigma_r, sigma_c, iter);
array R = medfilt(img, 3, 3);
// Gaussian blur
array gker = gaussiankernel(ncols, ncols);
array res = convolve(img, gker);
FFT
ArrayFire
array R1 = fft2(I); // 2d fft. check fft, fft3
array R2 = fft2(I, M, N); // fft2 with padding
array R3 = ifft2(fft2(I, M, N) * fft2(K, M, N)); // convolve using fft2
ArrayFire Capabilities
● Hundreds of parallel functions for multi-disciplinary work
  ○ Image processing
  ○ Machine learning
  ○ Graphics
  ○ Sets
● Support for multiple languages
  ○ C/C++, Fortran, Java and R
● Linux, Windows, Mac OS X
ArrayFire Capabilities
● OpenGL based graphics
● JIT
  ○ Combine multiple operations into one kernel
● GFOR - data parallel loop
  ○ Allows concurrent execution over multiple data sets (for example images)
ArrayFire Functions
● Supports hundreds of parallel functions
  ○ Building blocks
    ■ Reductions
    ■ Scan
    ■ Set operations
    ■ Sorting
    ■ Statistics
    ■ Basic matrix manipulation
Images taken from:
http://technogems.blogspot.com/2011/06/sorting-included-files-by-importance.html
http://www.cmsoft.com.br/tutorialOpenCL/CLMatrixMultExplanationSubMatrixes.png
ArrayFire Functions
● Hundreds of highly-optimized parallel functions
  ○ Signal/image processing
    ■ Convolution
    ■ FFT
    ■ Histograms
    ■ Interpolation
    ■ Connected components
  ○ Linear Algebra
    ■ Matrix multiply
    ■ Linear system solving
    ■ Factorization
GFOR: What is it?
• Data-Parallel for loop, e.g.
Serial matrix-vector multiplications (3 kernel launches):
for (i = 0; i < 3; i++) C(span,span,i) = A(span,span,i) * B;
Parallel matrix-vector multiplications (1 kernel launch):
gfor (array i, 3) C(span,span,i) = A(span,span,i) * B;
Example: Matrix Multiply
• Data-Parallel for loop, e.g.
for (i = 0; i < 3; i++) C(span,span,i) = A(span,span,i) * B;
Serial matrix-vector multiplications (3 kernel launches)
[Figure: iterations i = 1, 2, 3 run one after another; each computes C(,,i) = A(,,i) * B]
Example: Matrix Multiply
gfor (array i, 3) C(span,span,i) = A(span,span,i) * B;
Parallel matrix multiplications (1 kernel launch)
[Figure: simultaneous iterations i = 1:3; C(,,1) = A(,,1) * B, C(,,2) = A(,,2) * B, C(,,3) = A(,,3) * B]
Example: Matrix Multiply
gfor (array i, 3) C(span,span,i) = A(span,span,i) * B;
Parallel matrix multiplications (1 kernel launch)
[Figure: simultaneous iterations i = 1:3; C(,,1:3) = A(,,1:3) * B]
Think of GFOR as compiling 1 stacked kernel with all iterations.
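One way to picture the "stacked" view is to treat the slice index as just another dimension of a single index space, so that one routine covers every output element of every slice instead of being invoked once per slice. The sketch below does this on the CPU in plain C++ with column-major buffers. It only illustrates the indexing idea: it says nothing about how GFOR is implemented, and `batched_matmul` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A holds `batch` slices of size m x k, B is one k x n matrix shared by all
// slices, and C receives `batch` slices of size m x n. All are column-major.
inline void batched_matmul(const std::vector<float>& A, const std::vector<float>& B,
                           std::vector<float>& C, std::size_t m, std::size_t k,
                           std::size_t n, std::size_t batch) {
    assert(A.size() == m * k * batch);
    assert(B.size() == k * n);
    assert(C.size() == m * n * batch);
    // A single flat loop over all outputs; on a GPU each idx would map to a work-item.
    for (std::size_t idx = 0; idx < m * n * batch; ++idx) {
        const std::size_t r = idx % m;        // row within the slice
        const std::size_t c = (idx / m) % n;  // column within the slice
        const std::size_t s = idx / (m * n);  // slice index (the gfor iteration)
        float acc = 0.0f;
        for (std::size_t p = 0; p < k; ++p) {
            acc += A[r + p * m + s * m * k] * B[p + c * k];
        }
        C[idx] = acc;
    }
}
```

The serial `for` version corresponds to calling a single-slice multiply three times, while this form exposes all three slices to one launch.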
JIT Code Generation
● Run time kernel generation
● Combines multiple element wise operations into one kernel
● Reduces kernel launching overhead
● Intermediate data not allocated
● Improves cache performance
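The benefit of combining element-wise operations can be sketched on the CPU by comparing two ways of computing `a * b + c`. The unfused version mimics one kernel per operation and materializes a temporary, while the fused version makes a single pass with no intermediate buffer. This is a hand-written illustration in plain C++ of the effect a JIT aims for, not generated code, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two passes over memory and one full-size temporary.
inline std::vector<float> mul_add_unfused(const std::vector<float>& a,
                                          const std::vector<float>& b,
                                          const std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<float> tmp(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) tmp[i] = a[i] * b[i];    // "kernel" 1
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = tmp[i] + c[i];  // "kernel" 2
    return out;
}

// Fused: one pass, the product stays in a register and is never stored.
inline std::vector<float> mul_add_fused(const std::vector<float>& a,
                                        const std::vector<float>& b,
                                        const std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] * b[i] + c[i];
    return out;
}
```

Both versions read `a`, `b` and `c` once, but the unfused one also writes and re-reads `tmp`, which is the extra memory traffic and allocation that fusion removes.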
Success Stories
Field                     Application                 Speedup
Academia                  Power Systems Simulations   35x
Finance                   Option Pricing              52x
Government                Radar Image Formation       45x
Life Sciences             Pathology Advances          > 100x
Manufacturing             Tomography of Vegetation    10x
Media & Computer Vision   Digital Holography          17x
Oil & Gas                 Ground Water Simulations    > 20x
Future capabilities
● We are interested in Big Data applications
● Create capabilities for
  ○ Streaming video
  ○ Large number of images
  ○ Machine learning
  ○ Data analysis
  ○ Dynamic data
● Faster rendering utilities for Big Data
Comments on Open Source
● https://github.com/arrayfire-community
Q & A
Speaker: Oded Green
Engineers: Umar Urshad, Pavan Yalamanchili
Sales: Scott Blakeslee
Look us up: www.ArrayFire.com
For language wrappers and examples: https://github.com/ArrayFire