
Accelerator Evaluation on Real Edge-Inference Applications

Vinay Mehta, Inference Technical Marketing Manager
Flex Logix Technologies, Inc.


Linley Spring Processor Conference
April 6-9, 2020, Santa Clara, CA


InferX™ X1

• 54 mm² TSMC 16FFC
• 933 MHz operation
• 4K MACs @ INT8, 2K MACs @ BF16; Winograd acceleration for INT8
• 8 MB L2 SRAM + 4 MB L3 SRAM
• x32 LPDDR4 (14.9 GB/s peak BW)
• 13.5 W (max)
• Partners: TSMC, GUC, Synopsys, Arteris, Analog Bits, Cadence, Mentor
• Available as chip and PCIe card in Q3

[Block diagram: 4K MACs, 8 MB distributed L2 SRAM, 4 MB L3 SRAM, eFPGA, x32 GPIO, x32 LPDDR4, host PCIe Gen3/4 x4]
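As a rough sanity check on the specifications above, the following sketch derives a theoretical peak operation rate and a roofline-style break-even intensity. It assumes that "4K" means 4096 MAC units, that each MAC counts as two operations and issues once per cycle, and that 1 GB is 10^9 bytes. None of these assumptions come from the slide, and all names are illustrative.

```python
# Assumptions (not stated on the slide): 4K = 4096 MAC units, one MAC issued
# per unit per cycle, 2 operations per MAC, and 1 GB = 1e9 bytes.
MAC_UNITS_INT8 = 4096
CLOCK_HZ = 933e6
OPS_PER_MAC = 2
DRAM_BYTES_PER_S = 14.9e9

# Theoretical INT8 peak, ignoring utilization and memory stalls.
peak_ops_per_s = MAC_UNITS_INT8 * CLOCK_HZ * OPS_PER_MAC
peak_tops = peak_ops_per_s / 1e12

# Operations that must be performed per DRAM byte to keep the MACs busy
# if all traffic went to LPDDR4 (on-chip SRAM reuse lowers this in practice).
break_even_ops_per_byte = peak_ops_per_s / DRAM_BYTES_PER_S

print(f"Peak INT8 rate: {peak_tops:.2f} TOPS (theoretical)")
print(f"Break-even intensity: {break_even_ops_per_byte:.0f} ops per DRAM byte")
```

A figure like this is only an upper bound; the measured latencies later in the deck are the relevant numbers for an evaluation.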

Segmenting Edge Customers by CNN Complexity

All share requirements for real-time inference (streaming: batch=1, low latency) with large input sizes.

Segments: Perception, Learned DSP

Example applications:
• Medical segmentation and classification
• Image denoising
• CCTV with shoplifting alerts
• Attention monitoring for ADAS
• Quality assurance and inspections

Icons by Freepik from www.flaticon.com


Evaluating Like a Customer

• Choose a representative workload
• Be clear with metrics (e.g., latency and throughput)
• Tie performance back to the design requirements

Customers’ Evaluation Involves More Than Performance

                    TDP         Die Size
InferX X1           7-13.5 W    54 mm²
Nvidia Xavier NX    15 W        350 mm²
Nvidia Tesla T4     75 W        545 mm²

[Figure: size comparison of the three products, labeled 21 mm, 70 mm, and 175 mm]

Right Benchmark: Characterizing Models (Actual Model Still Best!)

[Figure: Arithmetic Intensity Across Common CNNs. Bar chart of operations per input pixel-channel (axis 0 to 75000) for MobileNet, ResNet-50, Inception v4, and YOLOv3.]

[Figure: ResNet-50, Relative Memory Footprint. Megabytes (axis 0 to 40) versus input size (pixels^2: 200, 450, 700, 950, 1200, 1440); legend entries: Weights (total), Weights (max layer).]
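The two characterizations on this slide can be approximated for any convolutional model with a few lines of arithmetic. The sketch below is a minimal illustration rather than the method used for the charts: the three-layer network is hypothetical, "same" padding is assumed, one MAC is counted as two operations, and values are assumed to be one byte (INT8). It shows that weight storage is independent of input size while the largest activation buffer grows with pixel count.

```python
from dataclasses import dataclass

@dataclass
class Conv:
    """A 2D convolution layer (hypothetical description, 'same' padding assumed)."""
    c_in: int
    c_out: int
    kernel: int
    stride: int = 1

def analyze(layers, height, width, bytes_per_value=1):
    """Return ops per input pixel-channel, weight bytes, and peak activation bytes."""
    h, w = height, width
    input_elems = height * width * layers[0].c_in
    total_macs = 0
    weight_bytes = 0
    max_activation_bytes = input_elems * bytes_per_value
    for layer in layers:
        h = -(-h // layer.stride)  # ceiling division for the output height
        w = -(-w // layer.stride)  # ceiling division for the output width
        weights = layer.c_out * layer.c_in * layer.kernel ** 2
        total_macs += h * w * weights
        weight_bytes += weights * bytes_per_value
        max_activation_bytes = max(max_activation_bytes,
                                   h * w * layer.c_out * bytes_per_value)
    return {
        # Assumption: 1 MAC = 2 operations.
        "ops_per_input_pixel_channel": 2 * total_macs / input_elems,
        "weight_bytes": weight_bytes,
        "max_activation_bytes": max_activation_bytes,
    }

# Hypothetical toy network, not one of the models in the chart.
toy_net = [Conv(3, 16, 3, stride=2), Conv(16, 32, 3, stride=2), Conv(32, 64, 3)]
small = analyze(toy_net, 224, 224)
large = analyze(toy_net, 448, 448)
print(small)
print(large)
```

Doubling the input side length leaves the weight footprint unchanged and quadruples the peak activation footprint, which is why large-input edge workloads stress memory differently from small-input benchmarks.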

Reporting Benchmarks: Single Stream vs Pooling

          Latency (ms)   FPS
DLA_0     290            3.4
DLA_1     290            3.4
GPU       95             10.5
Sum                      “17.3 FPS”

YOLOv3-1440 INT8, b=1 on Nvidia Jetson NX

[Figure: timeline from 0 to 400 ms in which each engine runs a different image drawn from a pool of data. Surviving labels: DLA_0 Inference (Image 1, Image 2), DLA_1 Inference (Image 3), GPU (Images 5, 6, 7, 8). Annotation: Assumes pool of data.]
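The pooled headline number is simply the sum of the per-engine rates. The short sketch below reproduces that sum from the table and contrasts it with the latency an individual image actually experiences; the throughput-weighted mean is an illustrative metric introduced here, not one reported on the slide.

```python
# Per-engine (latency in ms, FPS) pairs taken from the table above.
engines = {
    "DLA_0": (290.0, 3.4),
    "DLA_1": (290.0, 3.4),
    "GPU": (95.0, 10.5),
}

# Pooled throughput: every engine works on a different image from a pool.
pooled_fps = sum(fps for _, fps in engines.values())

# Frame period that the pooled number seems to promise.
implied_period_ms = 1000.0 / pooled_fps

# Mean latency per image when each engine handles a share of the images
# proportional to its throughput (illustrative metric, not from the slide).
mean_latency_ms = sum(lat * fps for lat, fps in engines.values()) / pooled_fps

print(f"Pooled throughput: {pooled_fps:.1f} FPS")
print(f"Implied frame period: {implied_period_ms:.1f} ms")
print(f"Throughput-weighted mean latency: {mean_latency_ms:.1f} ms")
```

No single image completes faster than the 95 ms GPU latency, even though the pooled rate suggests a new result well inside that interval.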


Reporting Benchmarks: Input Data is a Stream

          Latency (ms)   FPS
DLA_0     290            3.4
DLA_1     290            3.4
GPU       95             10.5
Sum                      “17.3 FPS”

YOLOv3-1440 INT8, b=1 on Nvidia Jetson NX

Can I run inference on a 15 FPS sensor? (No, not as you expect.)

[Figure: timeline from 0 to 200 ms with a tick at 67 ms. Frames are assigned to DLA_0, DLA_1, and the GPU as they arrive; annotations mark "Image buffered" and "Resource idled".]
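The effect can be explored with a small discrete simulation. The sketch below is an assumption-laden illustration: frames arrive in order at a fixed sensor rate, each frame runs entirely on one engine, and a hypothetical scheduler (not described in the talk) sends each frame to whichever engine would finish it soonest. Engine latencies come from the table above.

```python
def simulate_stream(sensor_fps, engine_latency_ms, num_frames):
    """Simulate in-order frame arrival and return (frame, engine, latency_ms) tuples.

    Hypothetical policy: each frame goes to the engine with the earliest
    completion time, and one frame cannot be split across engines.
    """
    period_ms = 1000.0 / sensor_fps
    free_at = {name: 0.0 for name in engine_latency_ms}
    results = []
    for frame in range(num_frames):
        arrival = frame * period_ms

        def finish_time(name):
            # A frame starts when it has arrived and the engine is free.
            return max(arrival, free_at[name]) + engine_latency_ms[name]

        engine = min(engine_latency_ms, key=finish_time)
        done = finish_time(engine)
        free_at[engine] = done
        results.append((frame, engine, done - arrival))
    return results

latencies_ms = {"DLA_0": 290.0, "DLA_1": 290.0, "GPU": 95.0}
results = simulate_stream(15, latencies_ms, 30)
for frame, engine, latency in results[:10]:
    print(f"frame {frame:2d} -> {engine:5s} end-to-end latency {latency:6.1f} ms")
```

Under these assumptions every frame after the first waits in a buffer before it starts, so the end-to-end latency exceeds the raw engine latency even though the summed throughput is above the sensor rate.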

Throughput Does Not Correspond to Effective Latency

          Latency (ms)   FPS
DLA_0     290            3.4
DLA_1     290            3.4
GPU       95             10.5

• Cannot use all available resources on the same inference
• Difficult to schedule processing engines
• Accessible performance is demonstrated by latency


Datacenter vs Edge Benchmarks: Summary

Datacenter Inference: Out-of-Order, No Fixed FPS
[Figure: timeline from 0 to 400 ms; the GPU is labeled with Images 5, 6, 7, and 8 drawn from a pool.]

Edge Data: Single Stream, 15 FPS
[Figure: timeline from 0 to 200 ms with a tick at 67 ms; frames are assigned to DLA_0, DLA_1, and the GPU, with annotations "Image buffered" and "Resource idled".]

Real World Benchmark: Latency (If Power and Cost Didn’t Matter…)

[Figure: four bar-chart panels of latency in ms (lower is better).
 Customer: Model X (Bfloat16), axis 0 to 14: X1, NX: GPU, T4.
 Customer: Model Z (Bfloat16), axis 0 to 180: X1, NX: GPU, T4, NX: DLA.
 YOLOv3, 1440 (INT8), axis 0 to 300: X1, NX: GPU, T4, NX: DLA.
 YOLOv3, 608 (INT8), axis 0 to 60: X1, NX: GPU, T4, NX: DLA.]

InferX X1 Has Superior Performance for the Price

[Figure: bar chart of throughput / die size (higher is better, axis 0 to 1.2) for X1, NX:GPU, T4, and NX:DLA across four workloads: Customer: Model X (Bfloat16), Customer: Model Z (Bfloat16), YOLOv3, 608 (INT8), and YOLOv3, 1440 (INT8). Only three NX:DLA labels appear.]

Key to X1 Efficiency is in Data Packing

[Figure: data-packing diagram with blocks labeled A and B. "Layer 1: 3D Activation Space" and "Layer 2: Activation Space" show activation elements numbered 1 to 12 in columns of three (1-3, 4-6, 7-9, 10-12), grouped into "3x3 Window 1" and "3x3 Window 2".]
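The numbering in the figure can be used to illustrate how much data two adjacent 3x3 windows share. The sketch below only demonstrates that overlap, using the column-wise numbering that survives from the diagram; it does not describe how the X1 actually packs or routes data, and the helper name is hypothetical.

```python
def window_elements(first_col, rows=3, size=3):
    """Return the element numbers covered by a 3x3 window starting at first_col.

    Elements are numbered column by column, three per column, starting at 1,
    which matches the numbering visible in the figure.
    """
    return {
        col * rows + row + 1
        for col in range(first_col, first_col + size)
        for row in range(rows)
    }

window_1 = window_elements(0)   # columns 1-3, 4-6, 7-9
window_2 = window_elements(1)   # columns 4-6, 7-9, 10-12

shared = window_1 & window_2    # elements that both windows read
fresh = window_2 - window_1     # elements that only the second window needs

print("Window 1:", sorted(window_1))
print("Window 2:", sorted(window_2))
print("Shared:", sorted(shared))
print("New for window 2:", sorted(fresh))
```

Sliding the window by one column reuses six of the nine elements, so only one new column of three has to be fetched per step.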

X1 Flexibility Achieves Efficiency Where Others Can’t, e.g., 3D Convolutions

[Figure: 2D Conv compared with 3D Conv]


Appendix

Benchmark Data

Latency (ms)
                    Precision   NX-DLA (1)   NX-DLA (2)   NX-GPU   T4      X1*
Customer Model Z    BF/FP16     157.8        163.1        53.5     15.54   35.7
Customer Model X    BF/FP16     -            -            12.3     2.01    1.1
YOLOv3-608          INT8        50.5         53.7         18.3     4.2     18.5
YOLOv3-1440         INT8        279.2        289.8        94.9     19.3    108.6

Throughput (inf/s)
                    Precision   NX-DLA (1)   NX-DLA (2)   NX-GPU   T4      X1*
Customer Model Z    BF/FP16     6.3          12.3         18.7     64.4    28.0
Customer Model X    BF/FP16     -            -            81.3     497.5   909.1
YOLOv3-608          INT8        19.8         37.2         54.6     238.1   54.1
YOLOv3-1440         INT8        3.6          6.9          10.5     51.8    9.2

Throughput/Die Size
                    Precision   NX-DLA (1)   NX-GPU   T4
Customer Model Z    BF/FP16     0.0349       0.1030   0.2276
Customer Model X    BF/FP16     -            0.0138   0.0542
YOLOv3-608          INT8        0.0565       0.1560   0.4364
YOLOv3-1440         INT8        0.0600       0.1766   0.5575

*performance estimate
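The Throughput/Die Size columns can be cross-checked against the latency data. The sketch below assumes throughput is 1000 divided by latency in ms, uses the die sizes quoted earlier in the deck (54, 350, and 545 mm²), and divides each result by the X1 value. That normalization is an inference from the numbers rather than something the table states, but under it the recomputed values agree with the table to within rounding.

```python
# Die sizes in mm^2 as quoted earlier in the deck.
DIE_AREA_MM2 = {"X1": 54.0, "NX": 350.0, "T4": 545.0}

# Which chip each benchmark column runs on.
CHIP_OF = {"NX-DLA (1)": "NX", "NX-GPU": "NX", "T4": "T4", "X1": "X1"}

# Latencies in ms copied from the appendix table (X1 values are estimates).
LATENCY_MS = {
    "Customer Model Z": {"NX-DLA (1)": 157.8, "NX-GPU": 53.5, "T4": 15.54, "X1": 35.7},
    "YOLOv3-608": {"NX-DLA (1)": 50.5, "NX-GPU": 18.3, "T4": 4.2, "X1": 18.5},
    "YOLOv3-1440": {"NX-DLA (1)": 279.2, "NX-GPU": 94.9, "T4": 19.3, "X1": 108.6},
}

def throughput_per_mm2(latency_ms, device):
    """Inferences per second per mm^2, assuming throughput = 1000 / latency."""
    return (1000.0 / latency_ms) / DIE_AREA_MM2[CHIP_OF[device]]

def relative_to_x1(model):
    """Throughput per mm^2 for each device, normalized so that X1 = 1.0."""
    row = LATENCY_MS[model]
    baseline = throughput_per_mm2(row["X1"], "X1")
    return {dev: throughput_per_mm2(lat, dev) / baseline for dev, lat in row.items()}

for model in LATENCY_MS:
    values = relative_to_x1(model)
    summary = ", ".join(f"{dev}: {val:.4f}" for dev, val in values.items())
    print(f"{model}: {summary}")
```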
