Compute API – Past & Future
Compute API – Past & Future
Ofer Rosenberg
Visual Computing Software
2
Intro and acknowledgments
• Who am I ?
– For the past two years, leading the Intel representation in the OpenCL working group @ Khronos
– Additional background of Media, Signal Processing, etc.
– http://il.linkedin.com/in/oferrosenberg
• Acknowledgments:
– This presentation contains ideas based on talks with lots of people (who should be
mentioned here)
– Partial list:
– AMD: Mike Houston, Ben Gaster
– Apple: Aaftab Munshi
– DICE: Johan Andersson
– Intel: Aaron Lefohn, Stephen Junkins, David Blythe, Adam Lake, Yariv Aridor, Larry Seiler and more…
– And others…
Agenda
• The beginning – From Shaders to Compute
• The Past/Present: 1st Generation of Compute APIs
– Caveats of the 1st generation
• The Future: 2nd Generation of Compute APIs
4
From Shaders to Compute
• In the beginning, GPU HW was fixed & optimized for Graphics…
Slide from: “GPU Architecture: Implications & Trends”, David Luebke, NVIDIA Research, SIGGRAPH 2008
5
From Shaders to Compute
• As graphics stages became programmable, GPUs evolved…
• This led to the traditional GPGPU approach…
Slide from: “GPU Architecture: Implications & Trends”, David Luebke, NVIDIA Research, SIGGRAPH 2008
6
From Shaders to Compute: Traditional GPGPU
• Write in a graphics language and use the GPU
• Highly effective, but:
– The developer needs to learn another (non-intuitive) language
– The developer is limited by the graphics language
• Then came CUDA & CTM…
Slides from “General Purpose Computation on Graphics Processors (GPGPU)”, Mike Houston, Stanford University Graphics Lab
7
The cradle of GPU Compute APIs
Slides from “GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU”, Ian Buck, NVIDIA, SC06, & “Close to the Metal”, Justin Hensley, AMD, SIGGRAPH 2007
GeForce 8800 GTX (G80) was released in Nov. 2006
CUDA 0.8 was released in Feb. 2007 (first official Beta)
ATI x1900 (R580) was released in Jan. 2006
CTM was released in Nov. 2006
8
The 1st generation of Platform Compute API
• CUDA & CTM led the way to two compute standards: DirectCompute & OpenCL
• DirectCompute is a Microsoft standard
– Released as part of Win7/DX11, a.k.a. Compute Shaders
– Only runs under Windows, on a GPU device
• OpenCL is a cross-OS / cross-Vendor standard
– Managed by a working group in Khronos
– Apple is the spec editor & conformance owner
– Work can be scheduled on both GPUs and CPUs
Timeline:
CTM released Nov 2006
CUDA 1.0 released June 2007
StreamSDK released Dec 2007
CUDA 2.0 released Aug 2008
OpenCL 1.0 released Dec 2008
DirectX 11 released Oct 2009
CUDA 3.0 released Mar 2010
OpenCL 1.1 released June 2010
The 1st Generation was developed on GPU HW which was tuned for graphics usages – it just extended that HW for general usage
9
The 1st generation of Platform Compute API: Execution Model
• The execution model was derived directly from shader programming in graphics (“fragment processing”):
– Shader Programming : initiate one instance of the shader per vertex/pixel
– Compute : initiate one instance for each point in an N-dimensional grid
• Fits the GPU vision of an array of scalar (or stream) processors
Drawing from OpenCL 1.1 Specification, Rev. 36
10
The 1st generation of Platform Compute API: Memory Model
• Distributed Memory system:
– Abstraction: Application gets a “handle” to the memory object / resource
– Explicit transactions: API for sync between Host & Device(s): read/write, map/unmap
• Three address spaces: Global, Local (Shared) & Private
– Local/Shared Memory: the non-trivial memory space…
[Diagram: application, OpenCL runtime, and devices, with per-device copies of each memory object and explicit transfers from the Host]
11
Disclaimer
The next slides provide my opinions and thoughts on caveats of, and future improvements to, the Platform Compute API.
12
The 2nd generation of Platform Compute API
• Recap:
– The 1st generation: CUDA (until 3.0), OpenCL 1.x, DX11 CS
– Defined on HW optimized for GFX, extended to General Compute
• The “cheese” has moved for GPUs – Compute becomes an important usage scenario:
– Advanced Graphics: Physics, Advanced Lighting Effects, Irregular Shadow Mapping, Screen Space Rendering
– Media: Video Encoding & Processing, Image Processing, Image Segmentation, Face Recognition
– Throughput: Scientific Simulations, Finance, Oil Exploration
• Developer feedback based on the 1st generation enables creating better HW/API
• The Second generation of Platform Compute API: “OpenCL Next”, DirectX12?
The 2nd Generation of Compute API will run on HW which is designed with Compute in mind
13
Caveats of the 1st generation: Execution Model
• Developers input:
– Most “real world” usages for compute use fine-grain granularity (the grid is small – 100’s at best)
– “Real world” kernels have sequential parts interleaved with the parallel code (reduction, condition testing, etc.)
Battlefield 2 execution phase DAG
(Image courtesy Johan Andersson, DICE)
Using “fragment processing” for these usages results in inefficient use of the machine
__kernel void foo()
{
// code here runs for each point in the grid
barrier(CLK_LOCAL_MEM_FENCE);
if (local_id == 0)
{
// this code runs once per workgroup
}
// code here runs for each point in the grid
barrier(CLK_GLOBAL_MEM_FENCE);
if (global_id == 0)
{
// this code runs only once
}
// code here runs for each point in the grid
}
14
Caveats of the 1st generation: Execution Model
• The “array of scalar/stream processors” model is not optimal for CPUs & GPUs
• Works well for large grids (as in traditional graphics), but at finer grain there is a better model…
CPUs and GPUs are better modeled as multi-threaded vector machines
[Block diagrams: AMD R600, NVIDIA Fermi, Intel NHM]
15
The 2nd generation of Platform Compute API: Ideas for a new execution model
• Goals
– Support fine-grain task parallelism
– Support complex application execution graphs
– Better match HW evolution: target multi-threaded vector machines
– Aligned with CPU evolution, and SoC integration of CPU/GPU
• Solution: Tasking system as execution model foundation
[Diagram: Device domain – SW threads feed per-HW-compute-unit task queues from a shared task pool]
Tasking system:
• Task Q’s mapped to independent HW units (~compute cores)
• Device load balancing enabled via task stealing
• OpenCL Analogy: Tasks execute at the “work group level”
• OpenCL Task ≠ CPU Task
– More restricted: No Preemption
– Evolved: Braided Task (sequential parts & fine-grain parallel parts interleaved)
16
The 2nd generation of Platform Compute API: Ideas for a new execution model
• There are others who think along the same lines …
Slides from “Leading a new Era of Computing”, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day
17
Caveats of the 1st generation: Memory Model
• Developers input:
– A growing number of compute workloads use complex data structures (linked lists, trees, etc.)
– Performance: the cost of pointer marshaling & reconstruction on the device is high
– Porting complexity: need to add explicit transactions, marshaling, etc.
– Supporting a shared/unified address space (API & HW) is required
Shared/Unified Address Space between Host & Devices
[Diagram: per-device copies with explicit sync vs. a shared/unified address space between Host & Devices]
18
The 2nd generation of Platform Compute API: Ideas for a new memory model
Baseline: memory objects / resources will have the same starting address between Host & Devices
• Shared Address Space w. relaxed consistency
– Extend the existing OCL 1.x / DX11 Memory Model
– Use explicit API calls to sync between Host & Device
– Suitable for disjoint memory architectures (discrete GPUs, for example…)
• Shared Address Space w. full coherency
– New Model: Memory is coherent between Host & Device
– Use known “language level” mechanisms for concurrent access: atomics, volatile
– Suitable for Shared Memory architectures
[Diagram: separate Host and Device memories vs. a single coherent/shared memory]
19
Some more thoughts for the 2nd generation (and beyond)
• Promote Heterogeneous Processing – not GPU only…
– Running code depending on the problem domain:
– A 16x16 Matrix Multiply should run on the CPU
– A 1000x1000 Matrix Multiply should run on the GPU
– Where’s the decision point? Better to leave it to the Runtime… (requires API)
– Load Balancing
– Relevant especially on systems where the CPU & GPU are close in compute power
• One API to rule them all
– Compute API as the underlying infrastructure to run Media & GFX
– Extend the API to contain a flexible pipeline, fixed-function HW, etc.
[Graph: Execution Time vs. Problem Size – CPU wins for small problems, GPU for large]
Slide from “Parallel Future of a Game Engine”, Johan Andersson, DICE
20
References:
• “GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU”, Ian Buck, NVIDIA, SC06
– http://gpgpu.org/static/sc2006/workshop/presentations/Buck_NVIDIA_Cuda.pdf
• “GPU Architecture: Implications & Trends”, David Luebke, NVIDIA Research, SIGGRAPH 2008
– http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
• “General Purpose Computation on Graphics Processors (GPGPU)”, Mike Houston, Stanford University Graphics Lab
– http://www-graphics.stanford.edu/~mhouston/public_talks/R520-mhouston.pdf
• “Close to the Metal”, Justin Hensley, AMD, SIGGRAPH 2007
– http://gpgpu.org/static/s2007/slides/07-CTM-overview.pdf
• “NVIDIA’s Fermi: The First Complete GPU Computing Architecture”, Peter N. Glaskowsky
– http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIAFermi-TheFirstCompleteGPUComputingArchitecture.pdf
• “Leading a new Era of Computing”, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day
– http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9Njk3NTJ8Q2hpbGRJRD0tMXxUeXBlPTM=&t=1
• “Parallel Future of a Game Engine”, Johan Andersson, DICE
– http://www.slideshare.net/repii/parallel-futures-of-a-game-engine-2478448