Post on 03-Aug-2020
© Copyright Khronos Group 2016 - Page 1
Silicon Acceleration APIs
Embedded Technology 2016, Yokohama
Neil Trevett
Vice President Developer Ecosystem, NVIDIA | President, Khronos
ntrevett@nvidia.com | @neilt3d
November 2016
Khronos and Open Standards
Software
Silicon
Khronos is an Industry Consortium of over 100 companies
We create royalty-free, open standard APIs for hardware acceleration of
Graphics, Parallel Compute, Neural Networks and Vision
Accelerated API Landscape
Vision Frameworks
Neural Net Libraries
OpenVX Neural Net Extension
High-level Language-based Acceleration Frameworks
Explicit Kernels
3D Graphics
Hardware: GPU, FPGA, DSP, Dedicated Hardware
OpenCL – Low-level Parallel Programming
• Low level programming of heterogeneous parallel compute resources
- One code tree can be executed on CPUs, GPUs, DSPs and FPGAs
• OpenCL C language to write kernel programs to execute on any compute device
- Platform Layer API - to query, select and initialize compute devices
- Runtime API - to build and execute kernel programs on multiple devices
• New in OpenCL 2.2 - OpenCL C++ kernel language - a static subset of C++14
- Adaptable and elegant sharable code – great for building libraries
- Templates enable meta-programming for highly adaptive software
- Lambdas used to implement nested/dynamic parallelism
Diagram: OpenCL kernel code is compiled for devices (CPU, GPU, DSP, FPGA)
Host CPU: Runtime API loads and executes kernels across devices
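The kernel/host split described on this slide can be pictured with a small plain-Python model. This is only a sketch and does not use the OpenCL API: a "kernel" is a function executed once per work-item index, and a host-side loop stands in for the runtime that launches it on a device. The names `vector_add_kernel` and `enqueue_nd_range` are hypothetical and chosen for illustration.

```python
from typing import Callable, List


def vector_add_kernel(gid: int, a: List[float], b: List[float], out: List[float]) -> None:
    """Work done by a single work-item; gid plays the role of a global work-item id."""
    out[gid] = a[gid] + b[gid]


def enqueue_nd_range(kernel: Callable[..., None], global_size: int, *args) -> None:
    """Hypothetical stand-in for the runtime: run the kernel for every work-item.

    A real device would execute work-items in parallel; this loop is sequential.
    """
    for gid in range(global_size):
        kernel(gid, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
result = [0.0] * len(a)

# The host decides the problem size and passes the buffers to the kernel.
enqueue_nd_range(vector_add_kernel, len(a), a, b, result)
print(result)
```

The kernel body never loops over the data itself; the launch size does that, which is what lets the same kernel source be scheduled across very different devices.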
OpenCL Conformant Implementations
Specification releases: OpenCL 1.0 (Dec08), OpenCL 1.1 (Jun10), OpenCL 1.2 (Nov11), OpenCL 2.0 (Nov13), OpenCL 2.1 (Nov15)
Implementation segments: Desktop, Mobile, Embedded, FPGA
Vendor timelines are first implementation of each spec generation
SYCL for OpenCL
• Single-source heterogeneous programming using STANDARD C++
- Use C++ templates and lambda functions for host & device code
• Aligns the hardware acceleration of OpenCL with direction of the C++ standard
- C++14 with open source C++17 Parallel STL hosted by Khronos
C++ Kernel Language: Low Level Control
‘GPGPU’-style separation of device-side kernel source code and host code
Single-source C++: Programmer Familiarity
Approach also taken by C++ AMP and OpenMP
Developer Choice
The development of the two specifications is aligned so code can be easily shared between the two approaches
Embedded Vision Acceleration
Embedded Technology, Yokohama
Shorin Kyo, Huawei
OpenVX – Low Power Vision Acceleration
• Targeted at vision acceleration in real-time, mobile and embedded platforms
- Precisely defined API for production deployment
• Higher abstraction than OpenCL for performance portability across diverse architectures
- Multi-core CPUs, GPUs, DSPs and DSP arrays, ISPs, Dedicated hardware…
• Extends portable vision acceleration to very low power domains
- Doesn’t require high-power CPU/GPU Complex or OpenCL precision
Stack diagram: Application, Middleware, Vision Engine; hardware: GPU, DSP, Hardware
Chart: Power Efficiency vs. Computation Flexibility for Multi-core CPU, GPU Compute, Vision DSPs and Dedicated Hardware; Vision Processing Efficiency scale X1, X10, X100
OpenVX Graphs
• OpenVX developers express a graph of image operations (‘Nodes’)
- Nodes can be on any hardware or processor coded in any language
• Graphs can execute almost autonomously
- Possible to minimize host interaction during frame-rate graph execution
• Graphs are the key to run-time optimization opportunities…
Feature Extraction Example Graph (an OpenVX graph built from OpenVX nodes)
Nodes: Color Conversion, Channel Extract, Image Pyramid, Harris Track, Optical Flow
Data: Camera Input, RGB Frame, YUV Frame, Gray Frame, Pyr_t, Ftr_t-1, Array of Keypoints, Array of Features, Rendering Output
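To make the graph idea concrete, here is a minimal sketch in plain Python (not the OpenVX API). It declares nodes with their dependencies and derives a valid execution order before any frame is processed, which is the property that gives an implementation room to optimize. The node names come from the slide; the edges between them are an assumed wiring for illustration, and `build_schedule` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each entry maps a node to the set of nodes whose output it consumes.
# The wiring below is an assumption made for this example.
dependencies = {
    "Color Conversion": set(),
    "Channel Extract": {"Color Conversion"},
    "Image Pyramid": {"Channel Extract"},
    "Harris Track": {"Channel Extract"},
    "Optical Flow": {"Image Pyramid", "Harris Track"},
}


def build_schedule(deps):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


schedule = build_schedule(dependencies)
for step, node in enumerate(schedule, start=1):
    print(step, node)
```

Because the whole graph is known up front, the order is computed once and can then be replayed for every frame with little host involvement.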
OpenVX Efficiency through Graphs..
Memory Management
- Reuse pre-allocated memory for multiple intermediate data
- Less allocation overhead, more memory for other applications
Kernel Merge
- Replace a sub-graph with a single faster node
- Better memory locality, less kernel launch overhead
Graph Scheduling
- Split the graph execution across the whole system: CPU / GPU / dedicated HW
- Faster execution or lower power consumption
Data Tiling
- Execute a sub-graph at tile granularity instead of image granularity
- Better use of data cache and local memory
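The kernel merge idea can be sketched in a few lines of plain Python. This is an illustration under simple assumptions, not how any OpenVX implementation is written: two per-pixel nodes (`brighten` and `threshold`, both hypothetical) are either run as separate passes with an intermediate image, or fused into one node that makes a single pass.

```python
from typing import Callable, List

Pixel = int
PixelOp = Callable[[Pixel], Pixel]


def brighten(p: Pixel) -> Pixel:
    """Add a fixed offset, saturating at 255."""
    return min(p + 40, 255)


def threshold(p: Pixel) -> Pixel:
    """Binarize around mid-gray."""
    return 255 if p >= 128 else 0


def run_unmerged(image: List[Pixel], ops: List[PixelOp]) -> List[Pixel]:
    """One full pass over the image, and one intermediate buffer, per node."""
    for op in ops:
        image = [op(p) for p in image]
    return image


def merge(ops: List[PixelOp]) -> PixelOp:
    """Fuse a chain of per-pixel nodes into a single node."""
    def fused(p: Pixel) -> Pixel:
        for op in ops:
            p = op(p)
        return p
    return fused


def run_merged(image: List[Pixel], ops: List[PixelOp]) -> List[Pixel]:
    """Single pass over the image with no intermediate image buffer."""
    fused = merge(ops)
    return [fused(p) for p in image]


image = [0, 90, 100, 200, 250]
ops = [brighten, threshold]
print(run_unmerged(image, ops))
print(run_merged(image, ops))
```

Both paths produce the same pixels; the merged path touches each pixel once, which is where the memory locality benefit comes from.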
Example Relative Performance
Relative Performance by category (legend: OpenCV (GPU accelerated), OpenVX (GPU accelerated); axis 0 to 10)
Arithmetic: 1.1
Analysis: 2.9
Filter: 8.7
Geometric: 1.5
Overall: 2.5
NVIDIA implementation experience. Geometric mean of >2200 primitives, grouped into categories, running at different image sizes and parameter settings
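Since the slide reports geometric means, a short worked example may help show why that choice matters for speedup ratios. The sketch below assumes the four category values are listed in the same order as the category names (Arithmetic 1.1, Analysis 2.9, Filter 8.7, Geometric 1.5). The slide's Overall figure is computed over the individual primitives, not over these four numbers, so this only illustrates the calculation.

```python
import math

# Category values as read from the slide (assumed to match the category order).
category_speedups = {
    "Arithmetic": 1.1,
    "Analysis": 2.9,
    "Filter": 8.7,
    "Geometric": 1.5,
}


def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    values = list(values)
    return sum(values) / len(values)


gm = geometric_mean(category_speedups.values())
am = arithmetic_mean(category_speedups.values())
print(f"geometric mean:  {gm:.2f}")
print(f"arithmetic mean: {am:.2f}")
```

The geometric mean of the four categories is about 2.54, while the arithmetic mean is 3.55; a single large ratio such as the Filter value pulls the arithmetic mean up much more than the geometric mean.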
Layered Vision Processing Ecosystem
Application
C/C++
Programmable Vision Processors
Dedicated Vision Hardware
Implementers may use OpenCL or Compute Shaders to implement OpenVX nodes on programmable processors
Developers can then use OpenVX to easily connect those nodes into a graph
The OpenVX graph enables implementers to optimize execution across diverse hardware architectures and drive to lower power implementations
OpenVX enables the graph to be extended to include hardware architectures that don’t support programmable APIs
AMD OpenVX
- Open source, highly optimized for x86 CPU and OpenCL for GPU
- “Graph Optimizer” looks at entire processing pipeline and removes/replaces/merges functions to improve performance and bandwidth
- Scripting for rapid prototyping, without re-compiling, at production performance levels
http://gpuopen.com/compute-product/amd-openvx/
OpenVX 1.0 Shipping, OpenVX 1.1 Released!
• Multiple OpenVX 1.0 Implementations shipping – spec in October 2014
- Open source sample implementation and conformance tests available
• OpenVX 1.1 Specification released 2nd May 2016
- Expands node functionality AND enhances graph framework
• OpenVX is EXTENSIBLE
- Implementers can add their own nodes at any time to meet customer and market needs
= shipping implementations
OpenVX Neural Net Extension
• Convolutional Neural Network topologies can be represented as OpenVX graphs
- Layers are represented as OpenVX nodes
- Layers connected by multi-dimensional tensor objects
- Layer types include convolution, activation, pooling, fully-connected, soft-max
- CNN nodes can be mixed with traditional vision nodes
• Import/Export Extension
- Efficient handling of network Weights/Biases or complete networks
• The specification is provisional
- Welcome feedback from the deep learning community
Diagram: Native Camera Control, Vision Nodes, CNN Nodes, Downstream Application Processing
An OpenVX graph mixing CNN nodes with traditional vision nodes
NNEF - Neural Network Exchange Format
NN Authoring Frameworks 1, 2, 3
Inference Engines 1, 2, 3
NNEF encapsulates neural network structure, data formats, commonly used operations
(such as convolution, pooling, normalization, etc.) and formal network semantics
NN Authoring Frameworks 1, 2, 3
Neural Net Exchange Format
Inference Engines 1, 2, 3
NNEF 1.0 is currently being defined
OpenVX will import NNEF files
OpenGL ES
Device share chart (http://hwstats.unity3d.com/mobile/gpu.html): OpenGL ES 2.0, OpenGL ES 3.x
Timeline: ES 1.0 (2003), ES 1.1 (2004), ES 2.0 (2007), ES 3.0 (2012), ES 3.1 (2014), ES 3.2 (2015)
Steps between versions are marked as Driver Update or Silicon Update; AEP is marked as a Driver Update
Features shown along the timeline:
- Fixed function Pipeline
- Vertex and fragment shaders
- 32-bit integers and floats
- NPOT, 3D/depth textures
- Texture arrays
- Multiple Render Targets
- Compute Shaders
- Tessellation and geometry shaders
- ASTC Texture Compression
- Floating point render targets
- Debug and robustness for security
Epic’s Rivalry demo using full Unreal Engine 4
https://www.youtube.com/watch?v=jRr-G95GdaM
Close to 2 Billion OpenGL ES devices shipped in 2015
The Key Principle of Vulkan: Explicit Control
• Application tells the driver what it is going to do
- In enough detail that driver doesn’t have to guess
- When the driver needs to know it
• In return, driver promises to do
- What the application asks for
- When it asks for it
- Very quickly
• No driver magic – no surprises
You have to supply a
LOT of information!
And, you have to
supply it early
Efficient, predictable
performance
Vulkan Explicit GPU Control
Layered GPU Control
- Application: Single thread per context
- High-level Driver Abstraction: Context management, Memory allocation, Full GLSL compiler, Error detection
- GPU
Explicit GPU Control
- Application: Memory allocation, Thread management, Synchronization, Multi-threaded generation of command buffers
- Language Front-end Compilers: Initially GLSL
- Loadable debug and validation layers
- SPIR-V pre-compiled shaders
- Thin Driver
- GPU
Vulkan 1.0 provides access to OpenGL ES 3.1 / OpenGL 4.X-class GPU functionality but with increased performance and flexibility
Vulkan Benefits
- Simpler drivers: Improved efficiency/performance, Reduced CPU bottlenecks, Lower latency, Increased portability
- Resource management in app code: Fewer driver hitches and surprises
- Multi-threaded Command Buffers: Command creation can be multi-threaded, Multiple CPU cores increase performance
- Graphics, compute and DMA queues: Work dispatch flexibility
- SPIR-V Pre-compiled Shaders: No front-end compiler in driver, Future shading language flexibility
- Loadable Layers: No error handling overhead in production code
Vulkan Multi-threading Efficiency
Diagram: CPU Threads build Command Buffers, which are placed in a Command Queue consumed by the GPU
1. Multiple threads can construct Command Buffers in parallel
Application is responsible for thread management and synch
2. Command Buffers placed in Command Queue by separate submission thread
Applications can create graphics, compute and DMA command buffers with a general queue model that can be extended to more heterogeneous processing in the future
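The two-step pattern on this slide can be modeled in plain Python. This is a sketch of the threading structure only and does not use the Vulkan API: worker threads each record their own command buffer (here just a list of strings), and a separate submission thread places the finished buffers into a queue. The function names are hypothetical. Note that CPython threads do not run Python code truly in parallel, so this shows the ownership and hand-off pattern, not a speedup.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def record_command_buffer(thread_index: int, draw_calls: int) -> list:
    """Build one command buffer; each worker only touches its own list."""
    return [f"draw(thread={thread_index}, call={i})" for i in range(draw_calls)]


def submission_thread(buffers: list, command_queue: queue.Queue) -> None:
    """Single point of submission: place finished buffers into the queue in order."""
    for buf in buffers:
        command_queue.put(buf)


command_queue = queue.Queue()

# Step 1: several threads construct command buffers independently.
with ThreadPoolExecutor(max_workers=4) as pool:
    recorded = list(pool.map(record_command_buffer, range(4), [3] * 4))

# Step 2: one separate thread submits them to the command queue.
submitter = threading.Thread(target=submission_thread, args=(recorded, command_queue))
submitter.start()
submitter.join()

# Drain the queue the way a consumer (the GPU in the slide) would.
submitted = []
while not command_queue.empty():
    submitted.append(command_queue.get())

print(len(submitted), "command buffers submitted")
```

Because no buffer is shared between recording threads, no locking is needed during recording; synchronization is confined to the hand-off to the submission thread.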
Khronos Safety Critical APIs
New Generation APIs for
safety certifiable vision,
graphics and compute
OpenGL ES 1.0 - 2003: Fixed function graphics
OpenGL ES 2.0 - 2007: Shader programmable pipeline
OpenGL SC 1.0 - 2005: Fixed function graphics subset
OpenGL SC 2.0 - April 2016: Shader programmable pipeline subset
Experience and Guidelines
Small driver size
Advanced functionality
Graphics and compute
Safety Critical vision processing
Safety Critical heterogeneous compute
https://www.khronos.org/news/press/advisory-panel-to-create-design-guidelines-for-safety-critical-apis
ISO 26262
DO-178B/C
Please Consider Joining Khronos!
Influence how
standards evolve!
Access draft specs to
build products faster!
Understand early
industry requirements!
Ship products that conform to
international standards!
Khronos is proven to RAPIDLY generate hardware API standards
that create significant market opportunities
Any company or organization is welcome to join Khronos
for a voice and a vote in any of its standards
www.khronos.org
VeriSilicon
Computer Vision solution with OpenVX over “VIP”
Gaku Ogura
Simon Jones
18 November 2016
VeriSilicon – Who We Are
• Founded in 2001
• Over 650 employees
• 70% dedicated to R&D
VeriSilicon – What We Can Do
Idea to Spec
Spec to RTL
RTL to Netlist
Netlist to GDS
Manufacture
Package
Test
Shipping
IP Development & Licensing
System Architecture
Silicon Design
Movement of Intelligent Devices
• Natural User Interface
- VISION
- Natural Language Processing
- Sensors
• Multi-Media
- Graphics, Video, Audio, Voice
• Many applications – need flexibility
• 1000s of algorithms – new one every day
• Compute intensive – need hardware acceleration
• Algorithms keep changing – need programmability
Deep Learning: Neural Networks Return with Vengeance
• With big data and high density compute, deep neural networks can be trained to solve
“impossible” computer vision problems
% Teams using Deep Learning
The Basics of Deep Neural Networks (DNN)
• Training: 10^16 to 10^22 MACs/dataset
- Offline (mostly)
- Analogous to compiling
- Diagram: forward pass yields “Pug”, compared (=?) with “Rottweiler”; the error drives the backward pass
• Inference: 10^6 to 10^11 MACs/image
- On-the-fly
- Analogous to running s/w
- Embedded systems use this
- Diagram: forward pass yields “Rottweiler”
• 100K-100M Network Coefficients
Challenges of Real-Time Inference
Classification: 100G MAC/s
Detection: 1T MAC/s
Segmentation: 10T MAC/s
• Majority of deep neural net computation is Multiply-Accumulate (MAC) for convolutions and inner products
• Memory bandwidth is a bigger challenge
- Convolution in deep neural nets works with not just 2D images, but high-dimensional arrays (i.e. “tensors”)
- Example: AlexNet (classification) has 60M parameters (240MB FP32), DDR BW with brute force implementation: 35GB/s
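The 240MB figure for AlexNet follows directly from the parameter count and the 4-byte size of an FP32 value. The sketch below reproduces that arithmetic; `model_size_mb` is a hypothetical helper, and megabytes are taken as 10^6 bytes. The 35GB/s bandwidth figure depends on implementation details the slide does not give, so it is not derived here.

```python
def model_size_mb(num_parameters: int, bytes_per_parameter: int) -> float:
    """Storage needed for the network coefficients, in megabytes (10^6 bytes)."""
    return num_parameters * bytes_per_parameter / 1_000_000


ALEXNET_PARAMETERS = 60_000_000  # "60M parameters" from the slide
FP32_BYTES = 4                   # one 32-bit float

alexnet_fp32_mb = model_size_mb(ALEXNET_PARAMETERS, FP32_BYTES)
print(f"AlexNet FP32 coefficients: {alexnet_fp32_mb:.0f} MB")
```

A coefficient set of this size is far larger than typical on-chip memory, which is why the slide calls memory bandwidth the bigger challenge.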
Processor Architectures
Chart axes: Energy, Programmability
Custom RTL
VIP (VeriSilicon)
DSP: VeriSilicon, CEVA, Videantis, Cadence, Synopsys
GPU: VeriSilicon, ARM, Imagination
CPU
APIs: OpenCL, OpenVX, OpenCV
Efficient Processor Architecture to Enable DNN
for Embedded Vision
Mature CV Block HW Accelerator
• Custom RTL to accelerate mature CV algorithms for 24/7 low power operation
Programmable Engine
• Programmable engine to cope with new CV algorithms
• Future new NN layers can be implemented here
• Scalable architecture to support different PPA (Performance, Power, Area) requirements
Scalable MACs
• High utilization MACs for DNN
• Scalable architecture to support different PPA (Performance, Power, Area) requirements
Intelligent Data Fabrics
• Handles other DNN functions (normalization, pooling,…)
• Handles pruning, compression, batching… to increase MAC utilization and decrease DDR BW
Local Memory / Cache
• For data synchronization between threads/blocks and minimize DDR BW
Customer Defined HW Accelerator
• I/F to extend HW acceleration for application specific purpose
Software stack: Vision Applications; API; Deep Learning Frameworks, Vision Framework
Use Case Study: Faster RCNN
• One of the state-of-the-art object detection CNNs
- Network configuration
  - 300 region proposals
  - 60M parameters, 30G MACs/image
• Collaborative computing in VIP
- NN Engine: number crunching
  - Convolution layers, fully connected layers
- Tensor Processing Fabric: data rearrangement
  - LRN, max pooling, ROI pooling
- Programmable Engine: precision compute
  - RPN, NMS, softmax
• Real-time performance (30fps HD) under 0.5W @ 28nm HPM
Diagram: Programmable Engine, NN Cores, TP Fabric, Local Memory / Cache
Output of a layer is queued in DDR if next layer is on different processing engine
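The figures on this slide can be combined into a rough throughput estimate. The sketch below assumes the quoted 30G MACs/image applies to each frame at the quoted 30 fps, and treats 0.5W as an upper bound on power; both the helper names and that pairing of figures are assumptions made for illustration.

```python
def required_mac_rate(macs_per_image: float, frames_per_second: float) -> float:
    """Sustained multiply-accumulate rate needed to keep up with the frame rate."""
    return macs_per_image * frames_per_second


def mac_rate_per_watt(mac_rate: float, power_watts: float) -> float:
    """Throughput per watt at the given power budget."""
    return mac_rate / power_watts


MACS_PER_IMAGE = 30e9  # "30G MACs/image"
FRAME_RATE = 30        # "30fps HD"
POWER_BUDGET_W = 0.5   # "under 0.5W", used here as an upper bound

rate = required_mac_rate(MACS_PER_IMAGE, FRAME_RATE)
efficiency = mac_rate_per_watt(rate, POWER_BUDGET_W)
print(f"required rate: {rate / 1e9:.0f} G MAC/s")
print(f"efficiency at 0.5 W: {efficiency / 1e12:.1f} T MAC/s per watt")
```

Under these assumptions the workload needs about 900 G MAC/s, which is in line with the roughly 1T MAC/s the earlier slide associates with detection.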
Scalable Performance from VIP8000
• Scalable performance with SAME application code on different processor variants
VIP8000 Series                         | VIP Nano | VIP8000UL | VIP8000L | VIP8000
Number of Execution cores              | 1        | 2         | 4        | 8
Clock Frequency SVT @WC125C (MHz)      | 800      | 800       | 800      | 800
HD Performance @800MHz:
Perspective Warping                    | 60 fps   | 120 fps   | 240 fps  | 480 fps
Optical Flow LK                        | 30 fps   | 60 fps    | 120 fps  | 240 fps
Pedestrian Detection                   | 30 fps   | 60 fps    | 120 fps  | 240 fps
Convolutional Neural Network (AlexNet) | 125 fps  | 250 fps   | 500 fps  | 1000 fps
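The table values scale linearly with the number of execution cores at a fixed 800 MHz clock. The sketch below checks that observation against the numbers quoted on the slide; the `predict_fps` helper and the linear model are an illustration, not a statement about how the hardware is specified.

```python
CORES = {"VIP Nano": 1, "VIP8000UL": 2, "VIP8000L": 4, "VIP8000": 8}

# Frames per second per variant, in the same order as CORES (values from the slide).
HD_PERFORMANCE_FPS = {
    "Perspective Warping": [60, 120, 240, 480],
    "Optical Flow LK": [30, 60, 120, 240],
    "Pedestrian Detection": [30, 60, 120, 240],
    "Convolutional Neural Network (AlexNet)": [125, 250, 500, 1000],
}


def predict_fps(single_core_fps: float, cores: int) -> float:
    """Linear scaling model: throughput grows in proportion to core count."""
    return single_core_fps * cores


def scales_linearly(fps_by_variant, core_counts) -> bool:
    """True if every variant matches the single-core figure times its core count."""
    base = fps_by_variant[0]
    return all(
        predict_fps(base, cores) == fps
        for fps, cores in zip(fps_by_variant, core_counts)
    )


core_counts = list(CORES.values())
linear = {name: scales_linearly(fps, core_counts) for name, fps in HD_PERFORMANCE_FPS.items()}
for name, ok in linear.items():
    print(f"{name}: linear scaling = {ok}")
```

Every workload in the table fits the linear model exactly, for example 125 fps on one core and 1000 fps on eight cores for AlexNet.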
Comprehensive Multi-Media/Pixel Subsystems
Company Proprietary and
Confidential
CPU
VeriSilicon Universal Compression
VPU Core (Video)
VIP Core (Vision/Image)
IO Controller
GC Core (GPU)
DMA Engine
DC (Display)
Memory Controller
ZSP (Audio/Voice)
AXI Bus
Wireless Connectivity
Local Bus
Mixed Signal IPs
ISP
japan-sales@verisilicon.com