Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr...

17
1 Deep Learning Processing Technologies for Embedded Systems October 2018

Transcript of Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr...

Page 1: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

1

Deep Learning Processing Technologies

for Embedded Systems

October 2018

Page 3: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

Popular Computer Vision Tasks

Object Classification

Semantic Segmentation

Object Detection

CAT

Page 4: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

Case Study: Rear-View Camera

Use Case I: Rear-Collision Warning (RCW)

4

Use Case II: Lane-Change Assist (LCA)

Use Case III: Parking Assist (PA)

Page 5: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

Use case Example: Rear-Collision Warning

Scenario

Environment: High-speed cruising

Hovering speed (Rear car): 120 Km/H

Max Closing speed: 80 Km/H

Typical acceleration: 3 m/s2

Viewing angle: 120 deg

System Requirements

Policy: Always-on

Strategy: Match speed using ACC

Safety margin: 2 sec

120 Km/H

80 Km/H

Page 6: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

Δ𝑉 = 11 𝑚/𝑠𝑒𝑐

𝐷 = 50𝑚

𝑊𝑐𝑎𝑟 = ~ 8 𝑝𝑖𝑥𝑒𝑙𝑠

𝑊 = 174𝑚 ~ 640 𝑝𝑖𝑥𝑒𝑙𝑠

Use case Example: Rear-Collision Warning

120 Km/H

80 Km/H

H

W

Perception Subsystem Requirements

Resolution: 640x480 pixles

2 positive detections: 0.65 sec

Page 7: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

Use case Example: Rear-Collision Warning

Neural Network Model Definition

Task: Object Detection

Model: SSD detector

Feature extractor: VGG-16

Compute per frame: ~106 GMACs

mAP (VOC2007): 75.1%

Frames for 1 detection (5σ) 8

Compute Requirements @ 20 fps

2.1 TMAC

https://arxiv.org/abs/1512.02325

Page 8: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

Theoretical Efficiency

Input (8b)

Weight (8b)

Psum (16b)

Sample

(16b)

~0.1pJ / cycle Theoretical Efficiency: ~10 TMACs/W

Page 9: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

0.001 0.01 0.1 1 10

Nvidia Volta V100 (***)

Nvidia Pascal P4 (***)

Google TPU (***)

Graphcore IPU (**, ***)

Q'comm Snapdragon 835 (*)

Huawei Kirin970 - Cambricon (*)

Nvidia Parker

Intel Movidius Myriad 2

EFFECTIVE TMACS/W

Real-world Processors Efficiency for Deep Learning Image Processing

Image classification

inference task

Based on vendors’ public

benchmarks

(*) Excluding host and memory

(**) Estimated

(***) Using batch-processing

Em

be

dd

ed

D

ata

ce

nte

r

Page 10: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

Control

Memory

Compute

• Flexibility during compile time

• Fully deterministic in runtime

• Parameters and partial sums are localized

• Layer outputs move around (but not too far)

Neural Networks - Observations

Interconnect

• Recurring operations (MACs >> Activations)

• Mostly low precision

Memory

Compute Control

Page 11: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

1

10

100

1,000

10,000

Memory to compute ratio

GoogleNet Resnet-50

Neural Networks - Resource Balance

Memory, Control and Compute balance changes dramatically along the network’s layers

1

10

100

1,000

10,000

100,000

Cumulative Memory [KB]

GoogLeNet ResNet-50

Network direction Network direction

Page 12: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

Approaches To Deep Learning Processing

Von-Neumann Architecture

Symmetric Dataflow Architecture

• Temporal resource allocation

• Common memory space

• Classical flow control programming model

• Spatial resource allocation

• Segregated memory spaces

• Balanced graph oriented

Fixed Function Accelerator

• Theoretically optimal at a specific workload

• Minimal flexibility

Page 13: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

Von-Neumann Architecture

System Bus

Control Compute

Memory

Full Addressability,

Narrow Access

Cycle by Cycle

Flexibility

Fixed Compute to

Memory BW Ratio

Fixed Control to

Compute Ratio

Page 14: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

Symmetric Dataflow Architecture

Inter-element Bus

Control Compute

D-Mem

Element size -

Utilization vs. Control

and Comms overhead

Fixed Compute to

Memory BW and Size

Ratio I-Mem

Page 15: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

0.001 0.01 0.1 1 10

Nvidia Volta V100 (***)

Nvidia Pascal P4 (***)

Google TPU (***)

Graphcore IPU (**, ***)

Q'comm Snapdragon 835 (*)

Huawei Kirin970 - Cambricon (*)

Nvidia Parker

Intel Movidius Myriad 2

EFFECTIVE TMACS/W

Real-world Processors Efficiency for Deep Learning Image Processing

Image classification

inference task

Based on vendors’ public

benchmarks

(*) Excluding host and memory

(**) Estimated

(***) Using batch-processing

Em

be

dd

ed

D

ata

ce

nte

r

Page 16: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

|

Use case Example Summary: Rear-Collision Warning

50 m Detection Distance

0.3MP Camera Resolution

20 fps Frame Rate

20% Compute utilization

(SOTA detection)

Hailo Based Solution

<2W Power consumption

(Including sensor)

Hailo Based Solution

Page 17: Deep Learning Processing Technologies for Embedded Systemsautonomoustechtlv.com/Portals/90/Orr Danon_1.pdf · Hailo Based Solution

Thank You!

17 Icons from icons8.com

[email protected]