高性能汎用GPUの半導体実装から システム実装までの最先端技術 · 馬路徹...
馬路 徹
技術顧問、GPUエバンジェリスト
2019年5月14日
高性能汎用GPUの半導体実装からシステム実装までの最先端技術
LSIとシステムのワークショップ 2019
2
講演目次
1. GPUとCPU性能の変遷
1) 2005年頃よりムーアの法則を享受できなくなったCPUの性能向上
2) ムーアの法則を受けてGPUは性能を向上、ムーアの法則終焉後もなお性能向上を維持
3) 性能向上実績:国際スーパーコンピュータ学会TOP500で上位を占める
4) GPUはフル・プログラマブルなプロセッサとして最も電力効率が高い
5) 電力効率実績:国際スーパーコンピュータ学会Green500で上位を占める
2. GPU, DLA(Deep Learning Accelerator)はAI実装用の最適なプロセッサ
1) AI応用の急速な拡大及びAI実装の2つの技術要件(プログラマビリティと性能)
2) Tensor Coreアクセラレータによる学習と推論の高速化
3) DLA (Deep Learning Accelerator)による高効率、高性能推論
4) 推論DNN最適化のためのTensorRTソフトウエア・エンジン
5) データセンター及びスーパーコンピュータ用のインフラ構成
3. 自動運転用AIプロセッサXavier及びEnd-to-End開発システム
1) One GPUアーキテクチャによるスーパーコンピュータから車載プロセッサまでの技術資産の共用
2) レベル5の完全自動運転プロセッサを量産ベースで提供可能なのはテスラ社とNVIDIA。NVIDIAはオープン・プラットフォーム
3) AI学習に必要な性能及びインフラ
4) シミュレーションを導入した完全な自動運転検証
5) NVIDIA DRIVEプラットフォームによる米国高速道路自動運転デモ
4
NO MORE MOORE'S LAW BENEFITS — CPU PERFORMANCE INCREASE STOPPED

[Figure: Transistor count (thousands) and CPU single-threaded performance, log scale 10^2–10^7. Single-threaded performance grew 1.5X per year until around 2005, then slowed to 1.1X per year. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010–2015 collected by K. Rupp.]
5
Amdahl's Law: Limits Multi-CPU-Core Processing Efficiency

[Figure: Multi-core efficiency a vs. number of cores N (1–10) for sequential ratios R = 0%, 10%, 20%, 50%. The curves flatten quickly as R grows.]

N:   Number of CPU cores
R:   Ratio of sequential processing
1-R: Ratio of parallel processing
a:   Multi-CPU-core efficiency (a = 1 is equivalent to a single CPU)
Ta:  Single-CPU execution time
Tb:  N-CPU-core execution time

Tb = Ta * ( R + (1 - R)/N )
a = Ta/Tb = 1 / ( R + (1 - R)/N )

If R = 20%, 8 CPU cores achieve only about 3.3 CPU cores' worth of performance.
GPU: dedicated to fully parallel processing
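The speedup formula on the slide above is easy to check numerically; a minimal sketch:

```python
def amdahl_speedup(n_cores, r_sequential):
    """Effective speedup a = Ta/Tb = 1 / (R + (1 - R)/N) from the slide."""
    return 1.0 / (r_sequential + (1.0 - r_sequential) / n_cores)

# The slide's example: with R = 20%, 8 cores behave like only ~3.3 cores.
print(round(amdahl_speedup(8, 0.20), 1))  # → 3.3
```

Even at R = 10%, efficiency saturates well below N as N grows, which is why the GPU targets workloads where R is essentially zero.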
6
FULLY PARALLEL APPLICATION EXAMPLE (3D GRAPHICS)

[Figure: Phong reflection geometry on a polygon — normal N, light source vector L, reflection vector R, observation vector V, with angle θ between N and L and angle α between R and V.]

R = 2cosθ·N - L = 2(N·L)N - L
C = Kd·Li·(N·L) + Ks·Li·(R·V)^s
  = Kd·Li·cosθ + Ks·Li·cos^s α

N:  Normal vector
L:  Light source vector
R:  Reflection vector
V:  Observation vector
Li: Light intensity
Kd: Diffuse reflection coef. (0 < Kd < 1)
Ks: Specular reflection coef. (0 < Ks < 1)
s:  Sharpness coef. (s > 0)
C:  Reflection intensity
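This per-pixel lighting computation is independent for every pixel, which is what makes it fully parallel. A minimal sketch of the formula above (the coefficient values here are illustrative, not from the slide):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phong_intensity(N, L, V, Li=1.0, Kd=0.7, Ks=0.3, s=8):
    """C = Kd*Li*(N.L) + Ks*Li*(R.V)^s with R = 2(N.L)N - L."""
    N, L, V = normalize(N), normalize(L), normalize(V)
    nl = max(dot(N, L), 0.0)                            # cos(theta)
    R = tuple(2 * nl * n - l for n, l in zip(N, L))     # reflection of L about N
    rv = max(dot(R, V), 0.0)                            # cos(alpha)
    return Kd * Li * nl + Ks * Li * rv ** s

# Light along the normal, viewer along the reflection: full diffuse + full specular.
c = phong_intensity(N=(0, 0, 1), L=(0, 0, 1), V=(0, 0, 1))
print(round(c, 2))  # → 1.0 (0.7 diffuse + 0.3 specular)
```

A GPU evaluates this same arithmetic for millions of pixels at once, one thread per pixel.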
7
NVIDIA GPU ACCELERATING FOUR INDUSTRY FIELDS

Scientific Calculation | AI/Deep Learning | Computer Graphics | Data Analysis/Database
NVIDIA CUDA (massive parallel computation platform)

Scientific calculation: AMBER (molecular dynamics), COSMO (climate/weather), ChaNGa (astrophysics), Gaussian (quantum chemistry), Schlumberger WG (seismic processing), PowerGrid (medical imaging), ANSYS Fluent (computational fluid dynamics), SIMULIA Abaqus (finite-element analysis)
Analytics/Databases: K-means clustering, gradient boosting, support vector machine, generalized linear model

❑ 645,000 GPU developers (x15 in 5 years)
❑ 1,800,000 CUDA downloads (x5 in 5 years)
CUDA: Compute Unified Device Architecture
8
GPU ENJOYED THE TRANSISTOR COUNT INCREASE BY MOORE'S LAW, INCREASING ITS NUMBER OF CORES ACCORDINGLY

[Figure: Number of GPU cores per generation (log scale, 1 to 2048+), 2001–2017. Early generations count vertex + pixel shaders separately (e.g. 1+2, 3+4, 8+16); from the TESLA architecture (GeForce 8, 2006-7) onward the cores are unified shaders.]

Recoverable data points from the chart:
- GeForce 3 through GeForce 7 (2001 to 2004-5): a handful of vertex + pixel shaders
- GeForce 8 / 9 (2006-8, TESLA architecture): up to 128 unified cores
- GeForce 200 (2008-9): up to 240 cores
- GeForce 400 / 500 (2010-11, FERMI): up to 512 cores
- GeForce 600 (2012, KEPLER): up to 1536 cores; Tesla-class KEPLER: FP32: 2880, FP64: 960
- MAXWELL (2014), Tesla M40 — FP32 cores: 3,072 / FP64 cores: 96
- PASCAL (2016), Tesla P100 — FP32 cores: 3,584 / FP64 cores: 1,792
- VOLTA (2017), Tesla V100 — FP32 cores: 5,120 / FP64 cores: 2,560
9
GPU IS A MUST-HAVE ACCELERATOR

[Figure: Same performance plot as before (original data up to 2010 by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; 2010–2015 data by K. Rupp), 1980–2020, log scale 10^2–10^7, with GPU-computing performance added. GPU-computing performance keeps growing 1.5X per year, while CPU single-threaded performance slowed from 1.5X to 1.1X per year.]

APPLICATIONS — SYSTEMS — ALGORITHMS — CUDA — ARCHITECTURE
Full-stack optimization: performance is growing even after Moore's Law saturates.
10
NVIDIA POWERS WORLD'S FASTEST SUPERCOMPUTER
~40,000 Volta Tensor Core GPUs
Summit becomes the first system to scale the 100-petaflops milestone
143 PetaFLOPS (HPC) | 3 ExaFLOPS (AI)
11
AMONG THE TOP 10 FASTEST SUPERCOMPUTERS IN THE WORLD,
5 ARE USING NVIDIA GPU ACCELERATION;
NO. 1 AND NO. 2 ARE USING NVIDIA GPUs — ISC2018 (International Supercomputing Conference), Nov. 2018
12
HOW IS POWER SPENT IN A CPU AND GPU?

High-performance CPU (out-of-order instruction execution):
[Pie chart — Natarajan [2003] (Alpha 21264): Clock + Pins 45%, RF 14%, Fetch 11%, Issue 11%, Rename 10%, Data Supply 5%, ALU 4%]
Overhead: 15 pJ vs. payload arithmetic: 15 pJ
Bill Dally, Keynote in Deep Learning Institute 2017 Tokyo, Jan. 2017

Many-Core GPU
13
Energy per operation in a many-core GPU (28 nm CMOS, 20x20 mm2 die):
- 64-bit DP arithmetic: 20 pJ
- 8 kB SRAM, 256-bit access: 50 pJ
- 256-bit on-chip buses: 26 pJ to 256 pJ, ~1 nJ across the 20 mm die
- Efficient off-chip link: 500 pJ
- DRAM Rd/Wr: 16 nJ
Bill Dally, Keynote in Deep Learning Institute 2017 Tokyo, Jan. 2017

Save every pJ (energy) in the GPU design — from architecture and circuit design to layout: "Execute arithmetic within the shortest distance."
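The slide's energy figures make the design rule concrete: moving data costs far more than computing on it. A small sketch using the numbers above:

```python
# Energy per operation figures from the slide (Bill Dally, 28 nm CMOS):
energy_pj = {
    "64-bit DP arithmetic": 20,
    "8 kB SRAM, 256-bit access": 50,
    "efficient off-chip link": 500,
    "DRAM read/write": 16_000,  # 16 nJ
}

# Relative cost of each operation vs. doing the arithmetic itself:
for op, pj in energy_pj.items():
    print(f"{op}: {pj} pJ ({pj / 20:.0f}x the arithmetic)")
```

A DRAM access costs 800x the arithmetic it feeds, which is why the architecture keeps operands in nearby SRAM and "executes arithmetic within the shortest distance."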
14
CPU VS GPU: ~5X ENERGY EFFICIENCY

CPU: 126 pJ/flop (SP) — optimized for latency, deep cache hierarchy (Broadwell E5 v4, 14 nm)
GPU: 28 pJ/flop (SP) — optimized for throughput, explicit management of on-chip memory (Pascal P100, 16 nm)

Bill Dally, Keynote in Deep Learning Institute 2017 Tokyo, Jan. 2017
15
AMONG THE TOP 25 MOST ENERGY-EFFICIENT SUPERCOMPUTERS IN THE WORLD,
22 ARE USING NVIDIA GPU ACCELERATION

Supercomputers highlighted with yellow cells are also ranked within the top 25 of the TOP500 performance ranking. This means that GPU acceleration is leading the power efficiency of large-scale supercomputers.
17
AI / Deep Learning Everywhere

Internet & Cloud: image classification, speech recognition, language translation, recommendations
Medicine & Biology: cancer cell detection, diabetic grading, drug discovery
Media & Entertainment: video captioning, video search, real-time translation
Intelligent Video Analytics: traffic analysis, retail analytics, access control
Transportation: pedestrian detection, lane tracking, traffic sign recognition
18
EXPLOSION OF NETWORK DESIGNS REQUIRES PROGRAMMABILITY

Convolution networks: ReLU, PReLU, Dropout, Pooling, Concat, BatchNorm
Recurrent networks: GRU, LSTM, Highway, Embedding, BiDirectional, Projection
Generative adversarial networks: Conditional GAN, latent-space GAN, 3D-GAN, Coupled GAN, Rank GAN, Speech Enhancement GAN
Reinforcement learning: DQN, Dueling DQN, A3C
19
REAL AI APPLICATIONS ARE REALIZED BY MANY AI/ML/AV/GRAPHICS MODULES —
A FULLY-PROGRAMMABLE AI/ML/AV/GRAPHICS PROCESSOR IS MANDATORY

EXAMPLE: AI CONVERSATIONAL SEARCH
20-30 containers end-to-end | RNN, CNN, MLP in INT8, FP16, FP32 | Latency < 300 ms

Audio modules: speech recognition, denoising, voice encoder, language model, text-to-speech
Visual modules: JPEG decode, resize, object detection, visual search, page layout
Search modules: query annotation, entity recognition, query search, auto-correct, question and answer, recommendation (web, news, social)

Q: "What are different types of lighting for a living room?"
A: "There are three main types: surface, recessed and pendant fixtures. Surface lighting is ….."
20
EXPLOSION OF NETWORK COMPLEXITY REQUIRES OPTIMIZATION / ACCELERATION

[Three charts of network complexity (GOPS * bandwidth) over time:]
- Image network complexity (2012–2016): AlexNet → GoogLeNet → ResNet-50 → Inception-v2 → Inception-v4 — 350X
- Speech network complexity (2014–2018): DeepSpeech → DeepSpeech 2 → DeepSpeech 3 — 30X
- Translation network complexity (2015–2018): OpenNMT → GNMT → MoE — 10X
21
YEAR 2013: GPU TRAINING ADVANTAGE OVER CPU
CNN (Convolutional Neural Net) Training Time

Batch Size  | Training Time (CPU) | Training Time (GPU) | GPU/CPU Acceleration
64 images   | 64 s                | 7.5 s               | 8.5X
128 images  | 124 s               | 14.5 s              | 8.5X
256 images  | 257 s               | 28.5 s              | 9.0X

CPU: dual 10-core Ivy Bridge, CPU library: Intel MKL BLAS
GPU: 1 Tesla K40, GPU library: cuBLAS
ILSVRC12 Supervision DNN, 7 layers (5 CNN, 2 FCN), Caffe framework
Training time measured for 20 iterations
Extrapolation to 1M-image training — CPU: 11.6 days, GPU: 1.3 days
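The extrapolated day counts can be reproduced from the 256-image row; a sketch assuming each 256-image batch costs one measured 20-iteration unit:

```python
def days_for_1m_images(sec_per_batch_20iter, batch_size=256, n_images=1_000_000):
    """Scale the measured 20-iteration time per 256-image batch up to 1M images."""
    n_batches = n_images / batch_size
    return n_batches * sec_per_batch_20iter / 86_400  # seconds -> days

cpu_days = days_for_1m_images(257.0)  # dual 10-core Ivy Bridge, MKL BLAS
gpu_days = days_for_1m_images(28.5)   # one Tesla K40, cuBLAS
print(f"CPU: {cpu_days:.1f} days, GPU: {gpu_days:.1f} days")
# → CPU: 11.6 days, GPU: 1.3 days
```

This reproduces the slide's figures exactly, so the extrapolation is a straight linear scaling of the measured batch times.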
22
YEAR 2016: GPU TRAINING SPEED ENHANCED BY X60
BY CUDNN LIBRARY AND GPU PERFORMANCE ENHANCEMENT

[Chart: speed-up of images/sec vs. K40 in 2013]
AlexNet training throughput on:
CPU: 1x E5-2680v3, 12 cores, 2.5 GHz, 128 GB system memory, Ubuntu 14.04
M40 bar: 8x M40 GPUs in a node. P100: 8x P100, NVLink-enabled
23
YEAR 2017: FURTHER ENHANCEMENT BY X12 — TENSOR CORE INTRODUCTION
New CUDA TensorOp instructions & data formats
4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Optimized for deep learning
Activation Inputs Weights Inputs Output Results
64 MACs/cycle * 2FLOP/MAC * 1.455GHz * 8 Tensor Core/SM * 80 SMs = 120TFLOPS
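The peak-throughput arithmetic on the slide checks out; a sketch of both the product and a toy model of the Tensor Core operation (real hardware rounds A and B to FP16 before multiplying — that rounding is omitted here):

```python
# Peak throughput arithmetic from the slide (Tesla V100):
macs_per_cycle = 64          # per Tensor Core (4x4x4 matrix multiply-accumulate)
flop_per_mac = 2             # one multiply + one add
clock_ghz = 1.455
tensor_cores_per_sm = 8
sms = 80

tflops = macs_per_cycle * flop_per_mac * clock_ghz * tensor_cores_per_sm * sms / 1000
print(round(tflops))  # → 119 (~120 TFLOPS, as on the slide)

# The Tensor Core op: D[FP32] = A[FP16] * B[FP16] + C[FP32] on 4x4 tiles.
def mma_4x4(A, B, C):
    """Toy matrix multiply-accumulate over 4x4 tiles (FP32 accumulation)."""
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(4))
             for j in range(4)] for i in range(4)]

I4 = [[float(i == j) for j in range(4)] for i in range(4)]
assert mma_4x4(I4, I4, [[0.0] * 4 for _ in range(4)]) == I4
```

Keeping the accumulator in FP32 is what lets FP16 inputs retain enough precision for deep learning training.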
24
YEAR 2018: MULTI-PRECISION TENSOR CORE IN TURING GPU — Multi-Precision for Further AI Inference Acceleration
25
WORLD’S MOST PERFORMANT INFERENCE PLATFORM
Up to 27X Faster Than CPUs | Accelerates All AI Workloads
26
LATEST TURING GPU: QUADRO RTX 8000

Tensor Core: 114 TFLOPS FP16 | 228 TOPS INT8 | 455 TOPS INT4
RT Core: 10 Giga Rays/sec (ray-triangle intersection, BVH traversal)
Turing SM: 14 TFLOPS + 14 TIPS, concurrent FP & INT execution, variable-rate shading

SM (Streaming Multiprocessor): GPU minimal scalable unit
Tensor Core: matrix multiplication accelerator
RT Core: real-time ray tracing accelerator
BVH (Bounding Volume Hierarchy): tree structure on a set of geometric objects
27
NVDLA (NVIDIA DEEP LEARNING ACCELERATOR) INTEGRATED IN XAVIER SOC —
FURTHER POWER EFFICIENCY INCREASE

[Block diagram:]
- Command interface / tensor execution micro-controller / memory interface
- Input DMA (activations and weights) → unified 512 KB input buffer
- Sparse weight decompression
- Native Winograd input transform
- MAC array: 2048 Int8, or 1024 Int16, or 1024 FP16
- Output accumulators → output post-processor (activation function, pooling etc.) → output DMA

Unique new technologies:
- Reduce memory access bandwidth by exploiting the sparseness of the weight coefficients
- Further reduce power consumption by reducing the number of multiplications (Winograd transform)
Other features are common among DL accelerators.
Reference — NVDLA: http://nvdla.org
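To illustrate why sparse weight decompression saves bandwidth, here is a toy sketch of one plausible scheme — nonzero values plus a 1-bit presence mask (the actual NVDLA compression format may differ; see nvdla.org):

```python
def compress_sparse(weights):
    """Store only nonzero weights plus a presence bitmask (1 bit per weight)."""
    mask = [1 if w != 0 else 0 for w in weights]
    values = [w for w in weights if w != 0]
    return mask, values

def decompress(mask, values):
    """Rebuild the dense weight vector from mask + nonzero values."""
    it = iter(values)
    return [next(it) if m else 0 for m in mask]

# With 6 of 8 weights pruned to zero, DMA traffic shrinks accordingly:
weights = [0.5, 0, 0, 0, -1.2, 0, 0, 0]
mask, values = compress_sparse(weights)
assert decompress(mask, values) == weights
# Int8 weights: 8*8 = 64 bits dense vs. 2*8 + 8 = 24 bits compressed
```

Since memory traffic dominates the energy budget (a DRAM access costs hundreds of times more than a MAC), skipping zero weights cuts both bandwidth and power.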
28
50-Layer High-Performance DNN used in NVIDIA Autonomous Driving
ResNet-50 based (ImageNet contest winner in 2015, exceeding human-eye performance)

- 7.72 billion operations to process one 225 x 225 image (ImageNet contest)
- 7.72 GOP x 30 = 230 GOPS for 30 fps
- 230 GOPS x (1920 x 1080)/(225 x 225) = 9.4 TOPS for an HD camera at 30 fps
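The compute-requirement arithmetic above can be replayed directly (the slide rounds 231.6 GOPS down to 230, giving 9.4 rather than 9.5 TOPS):

```python
# Compute requirement arithmetic from the slide:
ops_per_image = 7.72e9  # ResNet-50-class DNN, one 225 x 225 image
fps = 30

gops = ops_per_image * fps / 1e9                       # ≈ 232 GOPS (slide: 230)
tops_hd = gops * (1920 * 1080) / (225 * 225) / 1000    # scale to HD resolution

print(f"{gops:.0f} GOPS at 225x225, {tops_hd:.1f} TOPS for HD at 30 fps")
```

Scaling by pixel count is a rough upper bound — it assumes the per-pixel cost of the network is constant across resolutions — but it shows why a ~10 TOPS-class accelerator is needed per HD camera.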
29
Inference DNN Optimizer: TensorRT

Optimizations: kernel auto-tuning, layer & tensor fusion, dynamic tensor memory, precision calibration
Flow: trained DNN (from various DL frameworks), as is → TensorRT (optimizer + runtime) → optimized DNN
Platforms: TESLA V100, TESLA P4/T4, DRIVE AGX, JETSON AGX, NVIDIA DLA
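To make "layer & tensor fusion" concrete, here is a toy sketch (not the TensorRT API) of one classic fusion: folding a batch-normalization layer into the preceding linear layer so the intermediate tensor never touches memory:

```python
def fold_bn_into_linear(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = BN(Wx + b) into a single y = W'x + b' (one fused layer)."""
    w_f, b_f = [], []
    for i in range(len(w)):                          # per output channel i
        scale = gamma[i] / (var[i] + eps) ** 0.5
        w_f.append([scale * wij for wij in w[i]])
        b_f.append(scale * (b[i] - mean[i]) + beta[i])
    return w_f, b_f

def linear(w, b, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(w, b)]

# The fused layer gives the same output with one memory pass instead of two:
w, b = [[2.0, 0.0], [0.0, 3.0]], [1.0, -1.0]
gamma, beta, mean, var = [1.0, 2.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]
w_f, b_f = fold_bn_into_linear(w, b, gamma, beta, mean, var)
x = [1.0, 1.0]
bn_out = [gamma[i] * (linear(w, b, x)[i] - mean[i]) / (var[i] + 1e-5) ** 0.5 + beta[i]
          for i in range(2)]
assert all(abs(a - c) < 1e-4 for a, c in zip(bn_out, linear(w_f, b_f, x)))
```

TensorRT applies many such graph rewrites (e.g. conv + bias + ReLU into a single kernel) ahead of time, which is why it takes the trained DNN "as is" and emits an optimized one.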
30
READY-TO-GO PLATFORM (HW + SW) FOR SERVER LOAD DISTRIBUTION
AND EASY BRING-UP / MAINTENANCE

Software stack: DNN models → NV DL SDK → NV Docker → Kubernetes (load distribution) → TensorRT Inference Server → GPU server HWs
31
WELL-DISTRIBUTED WORKLOAD FOR IMAGE RECOGNITION
Demand: 17,000 images/sec; delivered: 16,990 images/sec
32
SPACE AND POWER REDUCTION — Game-Changing Inference Performance
200 CPU servers vs. one T4 GPU accelerator server
34
NVIDIA ONE-ARCHITECTURE: FROM SUPERCOMPUTER TO AUTONOMOUS-DRIVING SOC

Tesla — in supercomputers
Quadro — in workstations
GeForce — in PCs
Mobile GPU — in Tegra
Autonomous-driving processor Xavier
35
XAVIER: AUTONOMOUS-DRIVING PROCESSOR WITH FULL FUNCTIONAL-SAFETY FEATURES

Volta GPU: FP32 / FP16 / INT8 multi-precision, 512 CUDA cores, 1.3 CUDA TFLOPS, 20 Tensor Core TOPS
DLA: 5 TFLOPS FP16, 10 TOPS INT8
Carmel ARM64 CPU: 8 cores, 10-wide superscalar, 2700 SpecInt2000
ISP: 1.5 GPIX/s, native full-range HDR, tile-based processing
PVA: 1.6 TOPS — stereo disparity, optical flow, image processing
Video processor: 1.2 GPIX/s encode, 1.8 GPIX/s decode
I/O: 16 CSI (109 Gbps), 1 Gbps & 10 Gbps Ethernet, 256-bit LPDDR4 (137 GB/s)

Functional safety features:
▪ Diverse engines: computation with GPU/CPU, DL with GPU/DLA, CV with GPU/PVA, and more
▪ Dual execution: the Carmel ARM64 CPU has a dual execution mode (duplicate instruction streams)
▪ ECC/Parity: on-chip SRAMs, caches, registers; external DDR memories
▪ Diagnosis, BIST: SCE (Safety Cluster Engine) with lock-step ARM Cortex-R5 processor pair

DL: Deep Learning | CV: Computer Vision | DLA: Deep Learning Accelerator | PVA: Programmable Vision Accelerator | ISP: Image Signal Processor

Most complex SoC ever made: 9 billion transistors, 350 mm2, 12 nm FFN.
TÜV SÜD's team determined Xavier's architecture meets the ISO 26262 requirements to avoid unreasonable risk in situations that could result in serious injury.
36
NVIDIA DRIVE AGX VS. TESLA FSD COMPUTER
ONE AUTONOMOUS-VEHICLE ARCHITECTURE FROM L2+ TO ROBO-TAXI (~L5)

DRIVE AGX Xavier (for Level 2+, Level 3):
One Xavier SoC — 30 TOPS DL, 1.3 TFLOPS FP32

DRIVE AGX Pegasus (for Level 4, robo-taxi):
Xavier SoC x 2 + discrete GPU x 2 — 320 TOPS DL, 19 TFLOPS FP32
Also available: Xavier SoC x 1 + discrete GPU x 1 — 160 TOPS DL, 9.5 TFLOPS FP32

Tesla Platform (only for Tesla):
Tesla FSDC — FSD SoC x 2: 144 TOPS DL, 1.2 TFLOPS FP32

NVIDIA: open platform for 370+ partners — more flexibility, more GPU & DL performance
37
370+ PARTNERS USING NVIDIA DRIVE
Auto OEMs
Truck OEMs
Mobility Services
System Suppliers
Mapping
LIDAR
Camera/Radar
Startups
38
Example of 10 DNNs in an Autonomous Vehicle — External Environment Perception DNNs
39
Required computation resources:
▪ 3M labeled images / car / year
▪ 1 DGX-1 trains 3M labeled images on 1 DNN in 10 days (300K images in 1 day)
▪ 10 DNNs required for self-driving
▪ 10 parallel experiments at all times
▪ → 100 DGX-1 per car
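The 100-DGX-1 figure follows directly from the bullet points above; a minimal sketch of the arithmetic:

```python
# Training-infrastructure arithmetic from the slide:
images_per_car_per_year = 3_000_000
days_per_dnn = 10            # 1 DGX-1 trains 3M labeled images on 1 DNN in 10 days
dnns = 10                    # DNNs needed for self-driving
parallel_experiments = 10    # experiments running at all times

images_per_day = images_per_car_per_year // days_per_dnn  # throughput of one DGX-1
dgx_per_car = dnns * parallel_experiments
print(images_per_day, dgx_per_car)  # → 300000 100
```

One DGX-1 per (DNN, experiment) pair is what keeps all 10 networks iterating continuously; the fleet-level data rate scales this further by the number of cars.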
40
NVIDIA DRIVE END-TO-END PLATFORM

COLLECT & PROCESS DATA → TRAIN MODELS → SIMULATE → DRIVE
(Pedestrians, cars, lanes, path, lights, signs)
41
SIMULATION: A MEANS TO TEST AND VERIFY BILLIONS OF MILES

The world drives trillions of miles each year.
The U.S. has 770 accidents per billion miles.
A fleet of 20 test cars covers only 1 million miles per year.
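The slide's numbers imply that physical road testing alone can never accumulate statistically meaningful mileage; a quick sketch:

```python
# Why physical road testing alone cannot validate an AV (figures from the slide):
fleet_miles_per_year = 1_000_000          # a fleet of 20 test cars, per year
accidents_per_billion_miles = 770         # U.S. average

# Years for the physical fleet to drive one billion miles:
years_for_a_billion_miles = 1_000_000_000 / fleet_miles_per_year
print(years_for_a_billion_miles)  # → 1000.0
```

A thousand years per billion miles is why the next slides turn to simulation: the slide cites 10,000 Constellation systems driving 3 billion virtual miles per year.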
42
NVIDIA DRIVE SIM AND CONSTELLATION: AV VALIDATION SYSTEM

Virtual-reality AV simulator, same architecture as the DRIVE computer.
Simulate rare and difficult conditions, recreate scenarios, run regression tests, drive billions of virtual miles.
10,000 Constellations drive 3B miles per year.

HIL (hardware-in-the-loop) interfaces:
• 8 camera signals over GMSL2
• Radar and LIDAR signals over 1 Gbit Ethernet
• Autonomous vehicle responses
43
44
ANNOUNCING: DRIVE CONSTELLATION AVAILABLE NOW
Virtual AV test fleet
Bit-accurate, hardware-in-the-loop simulator | Test corner cases and rare conditions
Simulate previous failure scenarios | Cloud-based workflow | Open platform
45
46
NVIDIA DRIVE AGX Xavier: Highway Loop to NVIDIA HQ (Video Demo) — 77 miles (124 km), 0 disengagements
47