高性能汎用GPUの半導体実装から システム実装までの最先端技術 · 馬路徹...
馬路 徹
技術顧問、GPUエバンジェリスト
2019年5月14日
高性能汎用GPUの半導体実装からシステム実装までの最先端技術
LSIとシステムのワークショップ 2019
2
講演目次
1. GPUとCPU性能の変遷
1) 2005年頃よりムーアの法則を享受できなくなったCPUの性能向上
2) ムーアの法則を受けてGPUは性能を向上、ムーアの法則終焉後もなお性能向上を維持
3) 性能向上実績:国際スーパーコンピュータ学会TOP500で上位を占める
4) GPUはフル・プログラマブルなプロセッサとして最も電力効率が高い
5) 電力効率実績:国際スーパーコンピュータ学会Green500で上位を占める
2. GPU, DLA(Deep Learning Accelerator)はAI実装用の最適なプロセッサ
1) AI応用の急速な拡大及びAI実装の2つの技術要件(プログラマビリティと性能)
2) Tensor Coreアクセラレータによる学習と推論の高速化
3) DLA (Deep Learning Accelerator)による高効率、高性能推論
4) 推論DNN最適化のためのTensorRTソフトウエア・エンジン
5) データセンター及びスーパーコンピュータ用のインフラ構成
3. 自動運転用AIプロセッサXavier及びEnd-to-End開発システム
1) One GPUアーキテクチャによるスーパーコンピュータから車載プロセッサまでの技術資産の共用
2) レベル5の完全自動運転プロセッサを量産ベースで提供可能なのはテスラ社とNVIDIA。NVIDIAはオープン・プラットフォーム
3) AI学習に必要な性能及びインフラ
4) シミュレーションを導入した完全な自動運転検証
5) NVIDIA DRIVEプラットフォームによる米国高速道路自動運転デモ
4
NO MORE MOORE'S LAW BENEFITS — CPU PERFORMANCE INCREASE STOPPED

[Figure: Transistor count (thousands) and CPU single-threaded performance, log scale 10^2–10^7. Single-threaded performance grew 1.5X per year until around 2005, then slowed to 1.1X per year. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010–2015 collected by K. Rupp.]
5
Amdahl's Law: Limits Multi-CPU-Core Processing Efficiency

[Figure: Multi-core efficiency a vs. number of cores N (1–10) for sequential ratios R = 0%, 10%, 20%, 50%. The curves flatten quickly as R grows.]

N:   Number of CPU cores
R:   Ratio of sequential processing
1-R: Ratio of parallel processing
a:   Multi-CPU-core efficiency (a = 1 is equivalent to a single CPU)
Ta:  Single-CPU execution time
Tb:  N-CPU-core execution time

Tb = Ta * ( R + (1 - R)/N )
a = Ta/Tb = 1 / ( R + (1 - R)/N )

If R = 20%, 8 CPU cores achieve only about 3.3 CPU cores' worth of performance.
GPU: dedicated to fully parallel processing
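The speedup formula on the slide above is easy to check numerically; a minimal sketch:

```python
def amdahl_speedup(n_cores, r_sequential):
    """Effective speedup a = Ta/Tb = 1 / (R + (1 - R)/N) from the slide."""
    return 1.0 / (r_sequential + (1.0 - r_sequential) / n_cores)

# The slide's example: with R = 20%, 8 cores behave like only ~3.3 cores.
print(round(amdahl_speedup(8, 0.20), 1))  # → 3.3
```

Even at R = 10%, efficiency saturates well below N as N grows, which is why the GPU targets workloads where R is essentially zero.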
6
FULLY PARALLEL APPLICATION EXAMPLE (3D GRAPHICS)

[Figure: Phong reflection geometry on a polygon — normal N, light source vector L, reflection vector R, observation vector V, with angle θ between N and L and angle α between R and V.]

R = 2cosθ·N - L = 2(N·L)N - L
C = Kd·Li·(N·L) + Ks·Li·(R·V)^s
  = Kd·Li·cosθ + Ks·Li·cos^s α

N:  Normal vector
L:  Light source vector
R:  Reflection vector
V:  Observation vector
Li: Light intensity
Kd: Diffuse reflection coef. (0 < Kd < 1)
Ks: Specular reflection coef. (0 < Ks < 1)
s:  Sharpness coef. (s > 0)
C:  Reflection intensity
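This per-pixel lighting computation is independent for every pixel, which is what makes it fully parallel. A minimal sketch of the formula above (the coefficient values here are illustrative, not from the slide):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phong_intensity(N, L, V, Li=1.0, Kd=0.7, Ks=0.3, s=8):
    """C = Kd*Li*(N.L) + Ks*Li*(R.V)^s with R = 2(N.L)N - L."""
    N, L, V = normalize(N), normalize(L), normalize(V)
    nl = max(dot(N, L), 0.0)                            # cos(theta)
    R = tuple(2 * nl * n - l for n, l in zip(N, L))     # reflection of L about N
    rv = max(dot(R, V), 0.0)                            # cos(alpha)
    return Kd * Li * nl + Ks * Li * rv ** s

# Light along the normal, viewer along the reflection: full diffuse + full specular.
c = phong_intensity(N=(0, 0, 1), L=(0, 0, 1), V=(0, 0, 1))
print(round(c, 2))  # → 1.0 (0.7 diffuse + 0.3 specular)
```

A GPU evaluates this same arithmetic for millions of pixels at once, one thread per pixel.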
7
NVIDIA GPU ACCELERATING FOUR INDUSTRY FIELDS

Scientific Calculation | AI/Deep Learning | Computer Graphics | Data Analysis/Database
NVIDIA CUDA (massive parallel computation platform)

Scientific calculation: AMBER (molecular dynamics), COSMO (climate/weather), ChaNGa (astrophysics), Gaussian (quantum chemistry), Schlumberger WG (seismic processing), PowerGrid (medical imaging), ANSYS Fluent (computational fluid dynamics), SIMULIA Abaqus (finite-element analysis)
Analytics/Databases: K-means clustering, gradient boosting, support vector machine, generalized linear model

❑ 645,000 GPU developers (x15 in 5 years)
❑ 1,800,000 CUDA downloads (x5 in 5 years)
CUDA: Compute Unified Device Architecture
8
GPU ENJOYED THE TRANSISTOR COUNT INCREASE BY MOORE'S LAW, INCREASING ITS NUMBER OF CORES ACCORDINGLY

[Figure: Number of GPU cores per generation (log scale, 1 to 2048+), 2001–2017. Early generations count vertex + pixel shaders separately (e.g. 1+2, 3+4, 8+16); from the TESLA architecture (GeForce 8, 2006-7) onward the cores are unified shaders.]

Recoverable data points from the chart:
- GeForce 3 through GeForce 7 (2001 to 2004-5): a handful of vertex + pixel shaders
- GeForce 8 / 9 (2006-8, TESLA architecture): up to 128 unified cores
- GeForce 200 (2008-9): up to 240 cores
- GeForce 400 / 500 (2010-11, FERMI): up to 512 cores
- GeForce 600 (2012, KEPLER): up to 1536 cores; Tesla-class KEPLER: FP32: 2880, FP64: 960
- MAXWELL (2014), Tesla M40 — FP32 cores: 3,072 / FP64 cores: 96
- PASCAL (2016), Tesla P100 — FP32 cores: 3,584 / FP64 cores: 1,792
- VOLTA (2017), Tesla V100 — FP32 cores: 5,120 / FP64 cores: 2,560
9
GPU IS A MUST-HAVE ACCELERATOR

[Figure: Same performance plot as before (original data up to 2010 by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; 2010–2015 data by K. Rupp), 1980–2020, log scale 10^2–10^7, with GPU-computing performance added. GPU-computing performance keeps growing 1.5X per year, while CPU single-threaded performance slowed from 1.5X to 1.1X per year.]

APPLICATIONS — SYSTEMS — ALGORITHMS — CUDA — ARCHITECTURE
Full-stack optimization: performance is growing even after Moore's Law saturates.
10
NVIDIA POWERS WORLD'S FASTEST SUPERCOMPUTER
~40,000 Volta Tensor Core GPUs
Summit becomes the first system to scale the 100-petaflops milestone
143 PetaFLOPS (HPC) | 3 ExaFLOPS (AI)
11
AMONG THE TOP 10 FASTEST SUPERCOMPUTERS IN THE WORLD,
5 ARE USING NVIDIA GPU ACCELERATION;
NO. 1 AND NO. 2 ARE USING NVIDIA GPUs — ISC2018 (International Supercomputing Conference), Nov. 2018
12
HOW IS POWER SPENT IN A CPU AND GPU?

High-performance CPU (out-of-order instruction execution):
[Pie chart — Natarajan [2003] (Alpha 21264): Clock + Pins 45%, RF 14%, Fetch 11%, Issue 11%, Rename 10%, Data Supply 5%, ALU 4%]
Overhead: 15 pJ vs. payload arithmetic: 15 pJ
Bill Dally, Keynote in Deep Learning Institute 2017 Tokyo, Jan. 2017

Many-Core GPU
13
Energy per operation in a many-core GPU (28 nm CMOS, 20x20 mm2 die):
- 64-bit DP arithmetic: 20 pJ
- 8 kB SRAM, 256-bit access: 50 pJ
- 256-bit on-chip buses: 26 pJ to 256 pJ, ~1 nJ across the 20 mm die
- Efficient off-chip link: 500 pJ
- DRAM Rd/Wr: 16 nJ
Bill Dally, Keynote in Deep Learning Institute 2017 Tokyo, Jan. 2017

Save every pJ (energy) in the GPU design — from architecture and circuit design to layout: "Execute arithmetic within the shortest distance."
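The slide's energy figures make the design rule concrete: moving data costs far more than computing on it. A small sketch using the numbers above:

```python
# Energy per operation figures from the slide (Bill Dally, 28 nm CMOS):
energy_pj = {
    "64-bit DP arithmetic": 20,
    "8 kB SRAM, 256-bit access": 50,
    "efficient off-chip link": 500,
    "DRAM read/write": 16_000,  # 16 nJ
}

# Relative cost of each operation vs. doing the arithmetic itself:
for op, pj in energy_pj.items():
    print(f"{op}: {pj} pJ ({pj / 20:.0f}x the arithmetic)")
```

A DRAM access costs 800x the arithmetic it feeds, which is why the architecture keeps operands in nearby SRAM and "executes arithmetic within the shortest distance."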
14
CPU VS GPU: ~5X ENERGY EFFICIENCY

CPU: 126 pJ/flop (SP) — optimized for latency, deep cache hierarchy (Broadwell E5 v4, 14 nm)
GPU: 28 pJ/flop (SP) — optimized for throughput, explicit management of on-chip memory (Pascal P100, 16 nm)

Bill Dally, Keynote in Deep Learning Institute 2017 Tokyo, Jan. 2017
15
AMONG THE TOP 25 MOST ENERGY-EFFICIENT SUPERCOMPUTERS IN THE WORLD,
22 ARE USING NVIDIA GPU ACCELERATION

Supercomputers highlighted with yellow cells are also ranked within the top 25 of the TOP500 performance ranking. This means that GPU acceleration is leading the power efficiency of large-scale supercomputers.
17
AI / Deep Learning Everywhere

Internet & Cloud: image classification, speech recognition, language translation, recommendations
Medicine & Biology: cancer cell detection, diabetic grading, drug discovery
Media & Entertainment: video captioning, video search, real-time translation
Intelligent Video Analytics: traffic analysis, retail analytics, access control
Transportation: pedestrian detection, lane tracking, traffic sign recognition
18
EXPLOSION OF NETWORK DESIGNS REQUIRES PROGRAMMABILITY

Convolution networks: ReLU, PReLU, Dropout, Pooling, Concat, BatchNorm
Recurrent networks: GRU, LSTM, Highway, Embedding, BiDirectional, Projection
Generative adversarial networks: Conditional GAN, latent-space GAN, 3D-GAN, Coupled GAN, Rank GAN, Speech Enhancement GAN
Reinforcement learning: DQN, Dueling DQN, A3C
19
REAL AI APPLICATIONS ARE REALIZED BY MANY AI/ML/AV/GRAPHICS MODULES —
A FULLY-PROGRAMMABLE AI/ML/AV/GRAPHICS PROCESSOR IS MANDATORY

EXAMPLE: AI CONVERSATIONAL SEARCH
20-30 containers end-to-end | RNN, CNN, MLP in INT8, FP16, FP32 | Latency < 300 ms

Audio modules: speech recognition, denoising, voice encoder, language model, text-to-speech
Visual modules: JPEG decode, resize, object detection, visual search, page layout
Search modules: query annotation, entity recognition, query search, auto-correct, question and answer, recommendation (web, news, social)

Q: "What are different types of lighting for a living room?"
A: "There are three main types: surface, recessed and pendant fixtures. Surface lighting is ….."
20
EXPLOSION OF NETWORK COMPLEXITY REQUIRES OPTIMIZATION / ACCELERATION

[Three charts of network complexity (GOPS * bandwidth) over time:]
- Image network complexity (2012–2016): AlexNet → GoogLeNet → ResNet-50 → Inception-v2 → Inception-v4 — 350X
- Speech network complexity (2014–2018): DeepSpeech → DeepSpeech 2 → DeepSpeech 3 — 30X
- Translation network complexity (2015–2018): OpenNMT → GNMT → MoE — 10X
21
YEAR 2013: GPU TRAINING ADVANTAGE OVER CPU
CNN (Convolutional Neural Net) Training Time

Batch Size  | Training Time (CPU) | Training Time (GPU) | GPU/CPU Acceleration
64 images   | 64 s                | 7.5 s               | 8.5X
128 images  | 124 s               | 14.5 s              | 8.5X
256 images  | 257 s               | 28.5 s              | 9.0X

CPU: dual 10-core Ivy Bridge, CPU library: Intel MKL BLAS
GPU: 1 Tesla K40, GPU library: cuBLAS
ILSVRC12 Supervision DNN, 7 layers (5 CNN, 2 FCN), Caffe framework
Training time measured for 20 iterations
Extrapolation to 1M-image training — CPU: 11.6 days, GPU: 1.3 days
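The extrapolated day counts can be reproduced from the 256-image row; a sketch assuming each 256-image batch costs one measured 20-iteration unit:

```python
def days_for_1m_images(sec_per_batch_20iter, batch_size=256, n_images=1_000_000):
    """Scale the measured 20-iteration time per 256-image batch up to 1M images."""
    n_batches = n_images / batch_size
    return n_batches * sec_per_batch_20iter / 86_400  # seconds -> days

cpu_days = days_for_1m_images(257.0)  # dual 10-core Ivy Bridge, MKL BLAS
gpu_days = days_for_1m_images(28.5)   # one Tesla K40, cuBLAS
print(f"CPU: {cpu_days:.1f} days, GPU: {gpu_days:.1f} days")
# → CPU: 11.6 days, GPU: 1.3 days
```

This reproduces the slide's figures exactly, so the extrapolation is a straight linear scaling of the measured batch times.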
22
YEAR 2016: GPU TRAINING SPEED ENHANCED BY X60
BY CUDNN LIBRARY AND GPU PERFORMANCE ENHANCEMENT

[Chart: speed-up of images/sec vs. K40 in 2013]
AlexNet training throughput on:
CPU: 1x E5-2680v3, 12 cores, 2.5 GHz, 128 GB system memory, Ubuntu 14.04
M40 bar: 8x M40 GPUs in a node. P100: 8x P100, NVLink-enabled
23
YEAR 2017: FURTHER ENHANCEMENT BY X12 — TENSOR CORE INTRODUCTION
New CUDA TensorOp instructions & data formats
4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Optimized for deep learning
Activation Inputs Weights Inputs Output Results
64 MACs/cycle * 2FLOP/MAC * 1.455GHz * 8 Tensor Core/SM * 80 SMs = 120TFLOPS
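The peak-throughput arithmetic on the slide checks out; a sketch of both the product and a toy model of the Tensor Core operation (real hardware rounds A and B to FP16 before multiplying — that rounding is omitted here):

```python
# Peak throughput arithmetic from the slide (Tesla V100):
macs_per_cycle = 64          # per Tensor Core (4x4x4 matrix multiply-accumulate)
flop_per_mac = 2             # one multiply + one add
clock_ghz = 1.455
tensor_cores_per_sm = 8
sms = 80

tflops = macs_per_cycle * flop_per_mac * clock_ghz * tensor_cores_per_sm * sms / 1000
print(round(tflops))  # → 119 (~120 TFLOPS, as on the slide)

# The Tensor Core op: D[FP32] = A[FP16] * B[FP16] + C[FP32] on 4x4 tiles.
def mma_4x4(A, B, C):
    """Toy matrix multiply-accumulate over 4x4 tiles (FP32 accumulation)."""
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(4))
             for j in range(4)] for i in range(4)]

I4 = [[float(i == j) for j in range(4)] for i in range(4)]
assert mma_4x4(I4, I4, [[0.0] * 4 for _ in range(4)]) == I4
```

Keeping the accumulator in FP32 is what lets FP16 inputs retain enough precision for deep learning training.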
24
YEAR 2018: MULTI-PRECISION TENSOR CORE IN TURING GPU — Multi-Precision for Further AI Inference Acceleration
25
WORLD’S MOST PERFORMANT INFERENCE PLATFORM
Up to 27X Faster Than CPUs | Accelerates All AI Workloads
26
LATEST TURING GPU: QUADRO RTX 8000

Tensor Core: 114 TFLOPS FP16 | 228 TOPS INT8 | 455 TOPS INT4
RT Core: 10 Giga Rays/sec (ray-triangle intersection, BVH traversal)
Turing SM: 14 TFLOPS + 14 TIPS, concurrent FP & INT execution, variable-rate shading

SM (Streaming Multiprocessor): GPU minimal scalable unit
Tensor Core: matrix multiplication accelerator
RT Core: real-time ray tracing accelerator
BVH (Bounding Volume Hierarchy): tree structure on a set of geometric objects
27
NVDLA (NVIDIA DEEP LEARNING ACCELERATOR) INTEGRATED IN XAVIER SOC —
FURTHER POWER EFFICIENCY INCREASE

[Block diagram:]
- Command interface / tensor execution micro-controller / memory interface
- Input DMA (activations and weights) → unified 512 KB input buffer
- Sparse weight decompression
- Native Winograd input transform
- MAC array: 2048 Int8, or 1024 Int16, or 1024 FP16
- Output accumulators → output post-processor (activation function, pooling etc.) → output DMA

Unique new technologies:
- Reduce memory access bandwidth by exploiting the sparseness of the weight coefficients
- Further reduce power consumption by reducing the number of multiplications (Winograd transform)
Other features are common among DL accelerators.
Reference — NVDLA: http://nvdla.org
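To illustrate why sparse weight decompression saves bandwidth, here is a toy sketch of one plausible scheme — nonzero values plus a 1-bit presence mask (the actual NVDLA compression format may differ; see nvdla.org):

```python
def compress_sparse(weights):
    """Store only nonzero weights plus a presence bitmask (1 bit per weight)."""
    mask = [1 if w != 0 else 0 for w in weights]
    values = [w for w in weights if w != 0]
    return mask, values

def decompress(mask, values):
    """Rebuild the dense weight vector from mask + nonzero values."""
    it = iter(values)
    return [next(it) if m else 0 for m in mask]

# With 6 of 8 weights pruned to zero, DMA traffic shrinks accordingly:
weights = [0.5, 0, 0, 0, -1.2, 0, 0, 0]
mask, values = compress_sparse(weights)
assert decompress(mask, values) == weights
# Int8 weights: 8*8 = 64 bits dense vs. 2*8 + 8 = 24 bits compressed
```

Since memory traffic dominates the energy budget (a DRAM access costs hundreds of times more than a MAC), skipping zero weights cuts both bandwidth and power.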
28
50-Layer High-Performance DNN used in NVIDIA Autonomous Driving
ResNet-50 based (ImageNet contest winner in 2015, exceeding human-eye performance)

- 7.72 billion operations to process one 225 x 225 image (ImageNet contest)
- 7.72 GOP x 30 = 230 GOPS for 30 fps
- 230 GOPS x (1920 x 1080)/(225 x 225) = 9.4 TOPS for an HD camera at 30 fps
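The compute-requirement arithmetic above can be replayed directly (the slide rounds 231.6 GOPS down to 230, giving 9.4 rather than 9.5 TOPS):

```python
# Compute requirement arithmetic from the slide:
ops_per_image = 7.72e9  # ResNet-50-class DNN, one 225 x 225 image
fps = 30

gops = ops_per_image * fps / 1e9                       # ≈ 232 GOPS (slide: 230)
tops_hd = gops * (1920 * 1080) / (225 * 225) / 1000    # scale to HD resolution

print(f"{gops:.0f} GOPS at 225x225, {tops_hd:.1f} TOPS for HD at 30 fps")
```

Scaling by pixel count is a rough upper bound — it assumes the per-pixel cost of the network is constant across resolutions — but it shows why a ~10 TOPS-class accelerator is needed per HD camera.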
29
Inference DNN Optimizer: TensorRT

Optimizations: kernel auto-tuning, layer & tensor fusion, dynamic tensor memory, precision calibration
Flow: trained DNN (from various DL frameworks), as is → TensorRT (optimizer + runtime) → optimized DNN
Platforms: TESLA V100, TESLA P4/T4, DRIVE AGX, JETSON AGX, NVIDIA DLA
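To make "layer & tensor fusion" concrete, here is a toy sketch (not the TensorRT API) of one classic fusion: folding a batch-normalization layer into the preceding linear layer so the intermediate tensor never touches memory:

```python
def fold_bn_into_linear(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = BN(Wx + b) into a single y = W'x + b' (one fused layer)."""
    w_f, b_f = [], []
    for i in range(len(w)):                          # per output channel i
        scale = gamma[i] / (var[i] + eps) ** 0.5
        w_f.append([scale * wij for wij in w[i]])
        b_f.append(scale * (b[i] - mean[i]) + beta[i])
    return w_f, b_f

def linear(w, b, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(w, b)]

# The fused layer gives the same output with one memory pass instead of two:
w, b = [[2.0, 0.0], [0.0, 3.0]], [1.0, -1.0]
gamma, beta, mean, var = [1.0, 2.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]
w_f, b_f = fold_bn_into_linear(w, b, gamma, beta, mean, var)
x = [1.0, 1.0]
bn_out = [gamma[i] * (linear(w, b, x)[i] - mean[i]) / (var[i] + 1e-5) ** 0.5 + beta[i]
          for i in range(2)]
assert all(abs(a - c) < 1e-4 for a, c in zip(bn_out, linear(w_f, b_f, x)))
```

TensorRT applies many such graph rewrites (e.g. conv + bias + ReLU into a single kernel) ahead of time, which is why it takes the trained DNN "as is" and emits an optimized one.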
30
READY-TO-GO PLATFORM (HW + SW) FOR SERVER LOAD DISTRIBUTION
AND EASY BRING-UP / MAINTENANCE

Software stack: DNN models → NV DL SDK → NV Docker → Kubernetes (load distribution) → TensorRT Inference Server → GPU server HWs
31
WELL-DISTRIBUTED WORKLOAD FOR IMAGE RECOGNITION
Demand: 17,000 images/sec; delivered: 16,990 images/sec
32
SPACE AND POWER REDUCTION — Game-Changing Inference Performance
200 CPU servers vs. one T4 GPU accelerator server
34
NVIDIA ONE-ARCHITECTURE: FROM SUPERCOMPUTER TO AUTONOMOUS-DRIVING SOC

Tesla — in supercomputers
Quadro — in workstations
GeForce — in PCs
Mobile GPU — in Tegra
Autonomous-driving processor Xavier
35
XAVIER: AUTONOMOUS-DRIVING PROCESSOR WITH FULL FUNCTIONAL-SAFETY FEATURES

Volta GPU: FP32 / FP16 / INT8 multi-precision, 512 CUDA cores, 1.3 CUDA TFLOPS, 20 Tensor Core TOPS
DLA: 5 TFLOPS FP16, 10 TOPS INT8
Carmel ARM64 CPU: 8 cores, 10-wide superscalar, 2700 SpecInt2000
ISP: 1.5 GPIX/s, native full-range HDR, tile-based processing
PVA: 1.6 TOPS — stereo disparity, optical flow, image processing
Video processor: 1.2 GPIX/s encode, 1.8 GPIX/s decode
I/O: 16 CSI (109 Gbps), 1 Gbps & 10 Gbps Ethernet, 256-bit LPDDR4 (137 GB/s)

Functional safety features:
▪ Diverse engines: computation with GPU/CPU, DL with GPU/DLA, CV with GPU/PVA, and more
▪ Dual execution: the Carmel ARM64 CPU has a dual execution mode (duplicate instruction streams)
▪ ECC/Parity: on-chip SRAMs, caches, registers; external DDR memories
▪ Diagnosis, BIST: SCE (Safety Cluster Engine) with lock-step ARM Cortex-R5 processor pair

DL: Deep Learning | CV: Computer Vision | DLA: Deep Learning Accelerator | PVA: Programmable Vision Accelerator | ISP: Image Signal Processor

Most complex SoC ever made: 9 billion transistors, 350 mm2, 12 nm FFN.
TÜV SÜD's team determined Xavier's architecture meets the ISO 26262 requirements to avoid unreasonable risk in situations that could result in serious injury.
36
NVIDIA DRIVE AGX VS. TESLA FSD COMPUTER
ONE AUTONOMOUS-VEHICLE ARCHITECTURE FROM L2+ TO ROBO-TAXI (~L5)

DRIVE AGX Xavier (for Level 2+, Level 3):
One Xavier SoC — 30 TOPS DL, 1.3 TFLOPS FP32

DRIVE AGX Pegasus (for Level 4, robo-taxi):
Xavier SoC x 2 + discrete GPU x 2 — 320 TOPS DL, 19 TFLOPS FP32
Also available: Xavier SoC x 1 + discrete GPU x 1 — 160 TOPS DL, 9.5 TFLOPS FP32

Tesla Platform (only for Tesla):
Tesla FSDC — FSD SoC x 2: 144 TOPS DL, 1.2 TFLOPS FP32

NVIDIA: open platform for 370+ partners — more flexibility, more GPU & DL performance
37
370+ PARTNERS USING NVIDIA DRIVE
Auto OEMs
Truck OEMs
Mobility Services
System Suppliers
Mapping
LIDAR
Camera/Radar
Startups
38
Example of 10 DNNs in an Autonomous Vehicle — External Environment Perception DNNs
39
Required computation resources:
▪ 3M labeled images / car / year
▪ 1 DGX-1 trains 3M labeled images on 1 DNN in 10 days (300K images in 1 day)
▪ 10 DNNs required for self-driving
▪ 10 parallel experiments at all times
▪ → 100 DGX-1 per car
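The 100-DGX-1 figure follows directly from the bullet points above; a minimal sketch of the arithmetic:

```python
# Training-infrastructure arithmetic from the slide:
images_per_car_per_year = 3_000_000
days_per_dnn = 10            # 1 DGX-1 trains 3M labeled images on 1 DNN in 10 days
dnns = 10                    # DNNs needed for self-driving
parallel_experiments = 10    # experiments running at all times

images_per_day = images_per_car_per_year // days_per_dnn  # throughput of one DGX-1
dgx_per_car = dnns * parallel_experiments
print(images_per_day, dgx_per_car)  # → 300000 100
```

One DGX-1 per (DNN, experiment) pair is what keeps all 10 networks iterating continuously; the fleet-level data rate scales this further by the number of cars.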
40
NVIDIA DRIVE END-TO-END PLATFORM

COLLECT & PROCESS DATA → TRAIN MODELS → SIMULATE → DRIVE
(Pedestrians, cars, lanes, path, lights, signs)
41
SIMULATION: A MEANS TO TEST AND VERIFY BILLIONS OF MILES

The world drives trillions of miles each year.
The U.S. has 770 accidents per billion miles.
A fleet of 20 test cars covers only 1 million miles per year.
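The slide's numbers imply that physical road testing alone can never accumulate statistically meaningful mileage; a quick sketch:

```python
# Why physical road testing alone cannot validate an AV (figures from the slide):
fleet_miles_per_year = 1_000_000          # a fleet of 20 test cars, per year
accidents_per_billion_miles = 770         # U.S. average

# Years for the physical fleet to drive one billion miles:
years_for_a_billion_miles = 1_000_000_000 / fleet_miles_per_year
print(years_for_a_billion_miles)  # → 1000.0
```

A thousand years per billion miles is why the next slides turn to simulation: the slide cites 10,000 Constellation systems driving 3 billion virtual miles per year.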
42
NVIDIA DRIVE SIM AND CONSTELLATION: AV VALIDATION SYSTEM

Virtual-reality AV simulator, same architecture as the DRIVE computer.
Simulate rare and difficult conditions, recreate scenarios, run regression tests, drive billions of virtual miles.
10,000 Constellations drive 3B miles per year.

HIL (hardware-in-the-loop) interfaces:
• 8 camera signals over GMSL2
• Radar and LIDAR signals over 1 Gbit Ethernet
• Autonomous vehicle responses
43
44
ANNOUNCING: DRIVE CONSTELLATION AVAILABLE NOW
Virtual AV test fleet
Bit-accurate, hardware-in-the-loop simulator | Test corner cases and rare conditions
Simulate previous failure scenarios | Cloud-based workflow | Open platform
45
46
NVIDIA DRIVE AGX Xavier: Highway Loop to NVIDIA HQ (Video Demo) — 77 miles (124 km), 0 disengagements
47