Silicon Acceleration APIs - Khronos Group · 2016-11-23 · •Natural User Interface - VISION -...

Silicon Acceleration APIsEmbedded Technology 2016, Yokohama

Neil TrevettVice President Developer Ecosystem, NVIDIA | President, Khronos

ntrevett@nvidia.com | @neilt3d November 2016

Khronos and Open Standards

Software

Silicon

Khronos is an Industry Consortium of over 100 companies

We create royalty-free, open standard APIs for hardware acceleration of

Graphics, Parallel Compute, Neural Networks and Vision

Accelerated API Landscape

Vision Frameworks

High-level Language-based

Acceleration Frameworks

Explicit

Kernels

GPUFPGA

DSPDedicated

Hardware

Neural Net LibrariesOpenVX Neural

Net Extension

Graphics

OpenCL – Low-level Parallel Programing• Low level programming of heterogeneous parallel compute resources

- One code tree can be executed on CPUs, GPUs, DSPs and FPGA

• OpenCL C language to write kernel programs to execute on any compute device

- Platform Layer API - to query, select and initialize compute devices

- Runtime API - to build and execute kernels programs on multiple devices

• New in OpenCL 2.2 - OpenCL C++ kernel language - a static subset of C++14

- Adaptable and elegant sharable code – great for building libraries

- Templates enable meta-programming for highly adaptive software

- Lambdas used to implement nested/dynamic parallelism

OpenCL

Kernel

OpenCL

Kernel

OpenCL

Kernel

OpenCL

Kernel

DSPCPU

CPUFPGA

Kernel code

compiled for

devicesDevices

Runtime API

loads and executes

kernels across devices

OpenCL Conformant Implementations

OpenCL 1.0Specification

Dec08 Jun10OpenCL 1.1

Specification

Nov11OpenCL 1.2

SpecificationOpenCL 2.0

Specification

1.0 | Jul13

1.0 | Aug09

1.0 | May09

1.0 | May10

1.0 | Feb11

1.0 | May09

1.0 | Jan10

1.1 | Aug10

1.1 | Jul11

1.2 | May12

1.2 | Jun12

1.1 | Feb11

1.1 |Mar11

1.1 | Jun10

1.1 | Aug12

1.1 | Nov12

1.1 | May13

1.1 | Apr12

1.2 | Apr14

1.2 | Sep13

1.2 | Dec12

Desktop

Mobile

2.0 | Jul14

OpenCL 2.1 Specification

1.2 | May15

2.0 | Dec14

1.0 | Dec14

1.2 | Dec14

1.2 | Sep14

Vendor timelines are

first implementation of

each spec generation

1.2 | May15

Embedded

1.2 | Aug15

1.2 | Mar16

2.0 | Nov15

2.1 | Jun15

SYCL for OpenCL• Single-source heterogeneous programming using STANDARD C++

- Use C++ templates and lambda functions for host & device code

• Aligns the hardware acceleration of OpenCL with direction of the C++ standard

- C++14 with open source C++17 Parallel STL hosted by Khronos

C++ Kernel LanguageLow Level Control

‘GPGPU’-style separation of

device-side kernel source

code and host code

Single-source C++Programmer FamiliarityApproach also taken by

C++ AMP and OpenMP

Developer ChoiceThe development of the two specifications are aligned so

code can be easily shared between the two approaches

Embedded Vision AccelerationEmbedded Technology, Yokohama

Shorin Kyo, Huawei

OpenVX – Low Power Vision Acceleration • Targeted at vision acceleration in real-time, mobile and embedded platforms

- Precisely defined API for production deployment

• Higher abstraction than OpenCL for performance portability across diverse architectures

- Multi-core CPUs, GPUs, DSPs and DSP arrays, ISPs, Dedicated hardware…

• Extends portable vision acceleration to very low power domains

- Doesn’t require high-power CPU/GPU Complex or OpenCL precision

Vision Engine

Middleware

Application

Hardware

Computation Flexibility

Dedicated Hardware

GPUCompute

Multi-coreCPUX1

X100Vision Processing

Efficiency

Vision DSPs

OpenVX Graphs• OpenVX developers express a graph of image operations (‘Nodes’)

- Nodes can be on any hardware or processor coded in any language

• Graphs can execute almost autonomously

- Possible to Minimize host interaction during frame-rate graph execution

• Graphs are the key to run-time optimization opportunities…

Array of

Keypoints

Camera

Rendering

Output

Color Conversion

Channel Extract

Optical Flow

Harris Track

Image Pyramid

Array of

FeaturesFtrt-1OpenVX Graph

OpenVX Nodes

Feature Extraction Example Graph

OpenVX Efficiency through Graphs..

Reuse pre-allocated memory for

multiple intermediate data

MemoryManagement

Less allocation overhead,more memory forother applications

Replace a sub-graph with a

single faster node

Kernel Merge

Better memorylocality, less kernel launch overhead

Split the graph execution across

the whole system: CPU / GPU /

dedicated HW

GraphScheduling

Faster executionor lower powerconsumption

Execute a sub-graph at tile

granularity instead of image

granularity

DataTiling

Better use of data cache andlocal memory

Example Relative Performance

Arithmetic Analysis Filter Geometric Overall

OpenCV (GPU accelerated)

OpenVX (GPU accelerated)

Relative Performance

NVIDIA

implementation

experience.Geometric mean of

>2200 primitives,

grouped into each

categories,

running at different

image sizes and

parameter settings

Dedicated Vision

Hardware

Layered Vision Processing Ecosystem

Programmable Vision

Processors

Application

Implementers may use OpenCL or Compute Shaders to

implement OpenVX nodes on programmable processors

And then developers can use OpenVX to enable a

developer to easily connect those nodes into a graph

The OpenVX graph enables implementers to optimize execution across

diverse hardware architectures an drive to lower power implementations

OpenVX enables the graph to be extended to include hardware

architectures that don’t support programmable APIs

AMD OpenVX- Open source, highly optimized

for x86 CPU and OpenCL for GPU

- “Graph Optimizer” looks at

entire processing pipeline and

removes/replaces/merges

functions to improve

performance and bandwidth

- Scripting for rapid prototyping,

without re-compiling, at

production performance levelshttp://gpuopen.com/compute-product/amd-openvx/

OpenVX 1.0 Shipping, OpenVX 1.1 Released!• Multiple OpenVX 1.0 Implementations shipping – spec in October 2014

- Open source sample implementation and conformance tests available

• OpenVX 1.1 Specification released 2nd May 2016

- Expands node functionality AND enhances graph framework

• OpenVX is EXTENSIBLE

- Implementers can add their own nodes at any time to meet customer and market needs

= shipping implementations

OpenVX Neural Net Extension• Convolution Neural Network topologies can be represented as OpenVX graphs

- Layers are represented as OpenVX nodes

- Layers connected by multi-dimensional tensors objects

- Layer types include convolution, activation, pooling, fully-connected, soft-max

- CNN nodes can be mixed with traditional vision nodes

• Import/Export Extension

- Efficient handling of network Weights/Biases or complete networks

• The specification is provisional

- Welcome feedback from the deep learning community

VisionNode

Downstream

Application

Processing

Native

Camera

Control CNN Nodes

An OpenVX graph mixing CNN nodes

with traditional vision nodes

NNEF - Neural Network Exchange Format

NN Authoring Framework 1

Inference Engine 1

Inference Engine 2

Inference Engine 3

NNEF encapsulates neural network structure, data formats, commonly used operations

(such as convolution, pooling, normalization, etc.) and formal network semantics

Inference Engine 1

Inference Engine 2

Inference Engine 3

Neural Net

Exchange

Format

NNEF 1.0 is currently being defined

OpenVX will import NNEF files

OpenGL ES 2.0

OpenGL ES 3.x

OpenGL ESCompute Shaders

32-bit integers and floats

NPOT, 3D/depth textures

Texture arrays

Multiple Render TargetsVertex and

fragment shadersFixed function

Pipeline

Tessellation and geometry shaders

ASTC Texture Compression

Floating point render targets

Debug and robustness for security

Epic’s Rivalry demo using full Unreal Engine 4

https://www.youtube.com/watch?v=jRr-G95GdaM

Vertex and Fragment Shaders Compute Shaders

ES 1.0

ES 1.1

ES 2.0

ES 3.0

ES 3.1

Driver

Update

Silicon

Update

Silicon

Update

Driver

Update2015

ES 3.2

Silicon

Update

Close to 2 Billion OpenGL ES

devices shipped in 2015

Driver

Update

The Key Principle of Vulkan: Explicit Control

• Application tells the driver what it is going to do

- In enough detail that driver doesn’t have to guess

- When the driver needs to know it

• In return, driver promises to do

- What the application asks for

- When it asks for it

- Very quickly

• No driver magic – no surprises

You have to supply a

LOT of information!

And, you have to

supply it early

Efficient, predictable

performance

Vulkan Explicit GPU Control

High-level Driver

AbstractionContext management

Memory allocation

Full GLSL compiler

Error detection

Layered GPU Control

ApplicationSingle thread per context

Thin DriverExplicit GPU Control

ApplicationMemory allocation

Thread management

Synchronization

Multi-threaded generation

of command buffers

Language Front-end

CompilersInitially GLSL

Loadable debug and

validation layers

Vulkan 1.0 provides access to

OpenGL ES 3.1 / OpenGL 4.X-class GPU functionality

but with increased performance and flexibility

Loadable LayersNo error handling overhead in

production code

SPIR-V Pre-compiled Shaders:No front-end compiler in driver

Future shading language flexibility

Simpler drivers:Improved efficiency/performance

Reduced CPU bottlenecks

Lower latency

Increased portability

Graphics, compute and DMA queues: Work dispatch flexibility

Multi-threaded Command Buffers:Command creation can be multi-threaded

Multiple CPU cores increase performance

Resource management in app code: Less driver hitches and surprises

Vulkan Benefits

SPIR-V pre-compiled shaders

Vulkan Multi-threading Efficiency

Command

Buffer

Command

Buffer

Command

Buffer

Command

Buffer

Command

Buffer

Command

Thread

1. Multiple threads can construct Command Buffers in parallel

Application is responsible for thread management and synch

2. Command Buffers placed in Command

Queue by separate submission thread

Applications can create graphics, compute

and DMA command buffers with a general

queue model that can be extended to more

heterogeneous processing in the future

Khronos Safety Critical APIs

New Generation APIs for

safety certifiable vision,

graphics and compute

OpenGL ES 1.0 - 2003Fixed function graphics

OpenGL ES 2.0 - 2007Shader programmable pipeline

OpenGL SC 1.0 - 2005Fixed function graphics subset

OpenGL SC 2.0 - April 2016Shader programmable pipeline subset

Experience and Guidelines

Small driver size

Advanced functionality

Graphics and compute

Safety Critical

vision

processing

https://www.khronos.org/news/press/advisory-panel-to-create-design-guidelines-for-safety-critical-apis

Safety Critical

heterogeneous

compute

ISO 26262

DO-178B/C

Please Consider Joining Khronos!

Influence how

standards evolve!

Access draft specs to

build products faster!

Understand early

industry requirements!

Ship products that conform to

international standards!

Khronos is proven to RAPIDLY generate hardware API standards

that create significant market opportunities

Any company or organization is welcome to join Khronos

for a voice and a vote in any of its standards

www.khronos.org

VeriSilicon

Computer Vision solution with OpenVX over “VIP”

Gaku Ogura

Simon Jones

18 November 2016

VeriSilicon – Who We Are

• Founded in 2001

• Over 650 employees

• 70% dedicated to R&D

VeriSilicon – What we can do ?

Idea to Spec

Spec to RTL

RTL to Netlist

Netlist to

GDSManufacture

Shipping

Package

IP Development & Licensing

System Architecture

Silicon Design

Movement of Intelligent Devices

• Natural User Interface- VISION- Natural Language Processing- Sensors

▲Multi-Media

► Graphics, Video, Audio, Voice

▲Many applications – need flexibility

▲1000s of algorithms – new one every day

▲Compute intensive – need hardware acceleration

▲Algorithms keep changing – need programmability

Deep Learning: Neural Networks Return with

Vengeance• With big data and high density compute, deep neural networks can be trained to solve

“impossible” computer vision problems

% Teams using Deep Learning

The Basics of Deep Neural Networks (DNN)• Training: 1016-1022 MACs/dataset

• Inference: 106-1011 MACs/image

forward

backward

“Pug”

“Rottweiler”

forward“Rottweiler”

100K-100MNetwork

Coefficients

Offline (mostly)Analogous to compiling

On-the-fly

Analogous to running s/wEmbedded systems use this

Challenges of Real-Time Inference

Classification Detection Segmentation

100G MAC/s 1T MAC/s 10T MAC/s

• Majority of deep neural net computation is Multiply-Accumulate (MAC) for convolutions and inner products

• Memory bandwidth is a bigger challenge- Convolution in deep neural nets works with not just 2D images, but high-dimensional arrays (i.e. “tensor”)- Example: AlexNet (classification) has 60M parameters (240MB FP32), DDR BW with brute force

implementation: 35GB/s

Processor Architectures

Energy

Programmability

CustomRTL

DSP• VeriSilicon• CEVA• Videantis• Cadence• Synopsys

VIPVeriSilicon

GPU• VeriSilicon• ARM• Imagination

OpenCLOpenVXOpenCV

Efficient Processor Architecture to Enable DNN

for Embedded Vision

Intelligent Data Fabrics

Mature CV BlockHW Accelerator

Local Memory / Cache

Mature CV BlockHW Accelerator

Customer DefinedHW Accelerator

Custom RTL to accelerate mature CV algorithms for 24/7 low power operation

• Programmable engine to cope with new CV algorithms

• Future new NN layers can be implemented here

• Scalable architecture to support different PPA (Performance, Power, Area) requirements

• High utilization MACs for DNN• Scalable architecture to

support different PPA (Performance, Power, Area) requirements

• Handles other DNN functions (normalization, pooling,…)

• Handles pruning, compression, batching… to increase MAC utilization and decrease DDR BW

For data synchronization between threads/blocks and minimize DDR BW

I/F to extend HW acceleration for application specific purpose

ProgrammableEngine

Scalable MACs

Deep LearningFrameworks

VisionFramework

Vision Applications

Use Case Study: Faster RCNN• One of state-of-the-art object detection CNNs

- Network configuration- 300 region proposals

- 60M parameters, 30G MACs/image

• Collaborative computing in VIP

- NN Engine: number crunching- Convolution layers, full connected layers

- Tensor Processing Fabric: data rearrangement- LRN, max pooling, ROI pooling

- Programmable Engine: precision compute- RPN, NMS, softmax

• Real-time performance (30fps HD) under 0.5W @ 28nm HPM

Programmable

Engine

NN Cores

TP Fabric

Local Memory / Cache

Output of a layer is queued in DDR if next layer is on different processing engine

Scalable Performance from VIP8000

• Scalable performance with SAME application code on different processor

variants

VIP8000 Series VIP Nano VIP8000UL VIP8000L VIP8000

Number of Execution cores 1 2 4 8

Clock Frequency SVT @WC125C (MHz) 800 800 800 800

HD Performance@800MHz

Perspective Warping 60 fps 120 fps 240 fps 480 fps

Optical Flow LK 30 fps 60 fps 120 fps 240 fps

Pedestrian Detection 30 fps 60 fps 120 fps 240 fps

Convolutional Neural Network (AlexNet) 125 fps 250 fps 500 fps 1000 fps

Comprehensive Multi-Media/Pixel Subsystems

Company Proprietary and

Confidential

VeriSilicon Universal Compression

VPUCore(Video)

VIP Core(Vision/Image)

IO Controller

GC Core(GPU)

DMAEngine

DC(Display)

Memory Controller

ZSP(Audio/Voice)

AXI Bus

WirelessConnectivity

Local Bus

MixedSignal IPs

japan-sales@verisilicon.com

Silicon Acceleration APIs - Khronos Group · 2016-11-23 · •Natural User Interface - VISION -...

Documents

Transcript of Silicon Acceleration APIs - Khronos Group · 2016-11-23 · •Natural User Interface - VISION -...

© Copyright Khronos Group, 2007 - Page 1 Khronos Overview Neil Trevett Vice President Mobile Content, NVIDIA President, Khronos Group.

Khronos Conformance Test Process document - Khronos Group

Khronos Group KITE Overview

© Copyright Khronos Group, 2006 - Page 1 Khronos and OpenGL ES Status Neil Trevett Vice President Embedded Content, NVIDIA President, Khronos.

OpenCL 3 - Khronos Group

Khronos computing and graphics APIs in smart devices · PDF fileWorkshop on Compiler Techniques and System Software for High-Performance and Embedded Computing ... OpenGL ES 3.1

Khronos Conformance Process

Khronos Update and Overview...apps innovation. Mobile APIs unlock hardware and conserve battery life Apps embrace mobility’s unique strengths and need complex, interoperating APIs

Khronos Open API Standards - NVIDIA...Mobile is the new platform for apps innovation. Mobile APIs unlock hardware and conserve battery life Apps need interoperating APIs with rich

Update & AMA - Khronos Group · Augmented and Virtual Reality Parallel Computation Vision, Inferencing, Machine Learning 3D Commerce Guidelines for creating APIs to streamline system

The Khronos Open APIs - University of Oulujiechen/Course/SIGGRAPH/The-Khronos-APIs.pdf · 2010-03-01 · Language Highlights (cont’d) zWork-item functions zQuery work-item identifiers

APIs for Accelerating Vision and Inferencing: An Overview ... · Validator. Caffe and Caffe2 Convertor. NNEF open source projects hosted on Khronos NNEF GitHub repository under Apache

Using Natural Language APIs in SEO

Building Standards - Khronos

Seoul DevU December 2010 - Khronos Group...Developers creating user experiences need efficient silicon accelerator access Standard APIs define ... SDKs, Sample Implementations, Documentation

APIS M / APIS 15 / APIS 13 / APIS WR / APIS E - skynet.beusers.skynet.be/am259776/APIS FOLDER.pdf · APIS M / APIS 15 / APIS 13 / APIS WR / APIS E APIS, CARAT, AM- 01, DG- 303 ELAN,

Khronos Overview · 2014-04-08 · Apps embrace mobility’s unique strengths and need complex, interoperating APIs with rich sensory inputs e.g. Augmented Reality Diverse platforms

Khronos Native Platform Graphics Interface (EGL Version 1.2 ...Contents 1 Overview 1 2 EGL Operation 2 2.1 Native Window System and Rendering APIs . . . . . . . . . . . . 2 2.1.1 Scalar

SC24/WG9 Liaison Meeting - Khronos Group€¦ · - Brief introduction to Khronos - Khronos Native API standards relevant to Augmented Reality - Industry Adoption of Khronos APIs -

Mobile Media APIs - Khronos Group · Shipping 2007 2008 Products OpenGL ES 1.1 with hardware acceleration remains the “Sweet Spot” at least through 2008 OpenGL ES 2.0 accelerated