
Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter

2

3

Case study: CNN-based Image Classification (inference)

4

[Option: CPUs only]

+ Excellent maintainability in the datacenter

+ Maximum flexibility for all workloads

- CNN/DNN performance vastly slower than on specialized HW

5

[Option: pools of specialized machines (GPUs/ASICs)]

+ CNNs/DNNs that utilize GPUs or ASICs benefit significantly

- CNNs/DNNs cannot scale beyond the limited pools

- Heterogeneity is challenging for maintainability

6

[Option: a GPU or ASIC in every server]

+ Homogeneous

- Increased cost and power per server (particularly GPUs)

- Not economical for all applications in the datacenter (GPUs and ASICs)

7

[Option: an FPGA in every server]

+ Homogeneous

+ Low overhead in power and cost per server

+ Flexibility benefits many workloads?

- Lower peak performance than GPUs or ASICs on some workloads. Overtake through scale?

???

http://www.wired.com/2014/06/microsoft-fpga/

8

http://www.wired.com/2014/06/microsoft-fpga/

9

[Figure: FPGAs hosting different services — Deep Learning, Compression Service, Physics Engine, and the Web Search Pipeline.]

10

• Altera Stratix V D5

• 172,600 ALMs, 2,014 M20Ks, 1,590 DSPs

• PCIe Gen 3 x8

• 8GB DDR3-1333

• Powered by PCIe slot

• Torus Network


11

• Two 8-core Xeon 2.1 GHz CPUs

• 64 GB DRAM

• 4 HDDs @ 2 TB, 2 SSDs @ 512 GB

• 10 Gb Ethernet

• No cable attachments to server

[Thermal image of the server, showing 68 ⁰C at the hottest point.]

12

13

[Diagram: placement of the FPGA in the server — host CPU, NIC ASIC, FPGA, and the link to the ToR switch.]

14

[Figure (Krizhevsky): CNN image-classification pipeline — INPUT image passes through 3-D convolution and max pooling layers, then dense layers, producing an OUTPUT label such as "Dog".]

15

[Figure: AlexNet topology (Krizhevsky et al.) — 224×224×3 RGB input image; 11×11 convolution with stride 4 → 55×55×96; max pooling → 27×27, then 5×5 convolution → 27×27×256; max pooling → 13×13, then 3×3 convolutions → 13×13×384 → 13×13×384 → 13×13×256; max pooling; dense layers of 4096, 4096, and 1000 units.]
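The spatial sizes in the figure follow the standard convolution output-size formula, out = (n − k + 2·pad) / stride + 1. A quick check in Python — the 227×227 input and the pad values are the usual AlexNet conventions, assumed here because the figure's 224×224 label does not divide out exactly:

```python
def conv_out(n, k, stride=1, pad=0):
    """Spatial output size of a convolution or pooling window."""
    return (n - k + 2 * pad) // stride + 1

assert conv_out(227, 11, stride=4) == 55   # conv1: 11x11, stride 4 -> 55x55x96
assert conv_out(55, 3, stride=2) == 27     # 3x3/stride-2 max pool  -> 27x27
assert conv_out(27, 5, pad=2) == 27        # conv2: 5x5, pad 2      -> 27x27x256
assert conv_out(27, 3, stride=2) == 13     # max pool               -> 13x13
assert conv_out(13, 3, pad=1) == 13        # conv3-5: 3x3, pad 1    -> 13x13x384/256
```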

[Diagram: one convolutional layer — a k×k×D kernel is convolved with each region of the N×N×D Input Feature Map to form the Convolution Output; an optional max pooling stage then takes the max value over each p×p region. With H kernels, the layer produces H output feature maps.]

N = input height and width

k = kernel height and width

D = input depth

H = # feature maps

S = kernel stride

p = pooling region height and width

*N, k, H, and p may vary across layers
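The next slides animate this computation one kernel position at a time. As a concrete reference, here is a minimal loop-based NumPy sketch of such a layer, using the notation above (written for clarity rather than speed; padding omitted):

```python
import numpy as np

def conv_layer(inp, kernels, stride=1):
    """inp: (N, N, D) input feature map; kernels: (H, k, k, D).
    Returns an (No, No, H) output, where No = (N - k) // stride + 1."""
    N, _, D = inp.shape
    H, k = kernels.shape[0], kernels.shape[1]
    No = (N - k) // stride + 1
    out = np.zeros((No, No, H))
    for h in range(H):                      # one output feature map per kernel
        for y in range(No):
            for x in range(No):
                region = inp[y*stride:y*stride+k, x*stride:x*stride+k, :]
                out[y, x, h] = np.sum(region * kernels[h])
    return out

def max_pool(fmap, p):
    """Max over non-overlapping p x p regions of each feature map."""
    No, _, H = fmap.shape
    m = No // p
    return fmap[:m*p, :m*p, :].reshape(m, p, m, p, H).max(axis=(1, 3))
```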

16

[Slides 17–28: step-by-step animation of the convolution — the kernel weights slide across the input, and each position produces one value of the output.]

29

[Diagram: accelerator organization — a grid of processing elements (PEs) under a layer controller. The input volume (x, y, z) is split into segments 0 … N−1 and broadcast to the PE array; input kernel weights 0 … M−1 are applied to produce output feature maps 0 … M−1. A top controller supplies the layer configuration and handles address generation, scan-chain access, and data re-distribution into the output volume.]
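A behavioral sketch of that organization, reusing `conv_layer` from the earlier sketch. This is only a software model of one reading of the diagram — input segments broadcast to all PE groups, one kernel per group — not the actual hardware:

```python
import numpy as np

def pe_array_layer(segments, kernels, stride=1):
    """segments: list of input-volume slices 0..N-1 (assumed split along y);
    kernels: (M, k, k, D). PE group m applies kernel m, emitting output map m."""
    inp = np.concatenate(segments, axis=0)      # layer-controller broadcast
    maps = [conv_layer(inp, kernels[m:m+1], stride)[:, :, 0]
            for m in range(kernels.shape[0])]
    return np.stack(maps, axis=-1)              # output volume: one map per group
```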

30

[Diagram: PE datapath — a multiplier forms W_ijk × D_ijk, an accumulator register FF(ACC) sums the products, and muxes (MUX0, MUX) steer either the accumulator or the neighbor's data into a shift register FF(SR); results shift from south to north, with termination at the array edge.]
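In software terms, a single PE might behave as below. The mux control and timing are guesses from the diagram labels (W_ijk, D_ijk, FF(ACC), FF(SR), from south / to north); the slide shows only the datapath:

```python
class PE:
    """Multiply-accumulate plus a shift register that drains results north."""
    def __init__(self):
        self.acc = 0.0   # FF(ACC): running sum of kernel terms
        self.sr = 0.0    # FF(SR): this PE's stage of the output shift chain

    def mac(self, w_ijk, d_ijk):
        self.acc += w_ijk * d_ijk        # one W_ijk * D_ijk term per cycle

    def shift(self, from_south, load_own):
        """Pass the southern neighbor's value north, or (when this PE's
        accumulation is done) inject its own result into the chain."""
        out_north = self.sr
        self.sr = self.acc if load_own else from_south
        if load_own:
            self.acc = 0.0               # reset for the next output value
        return out_north
```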

31

Platform                                | Library/OS                         | ImageNet 1K inference throughput | Peak TFLOPs | Effective TFLOPs | Est. peak power with server | Est. GOPs/J (at peak power)
----------------------------------------|------------------------------------|----------------------------------|-------------|------------------|-----------------------------|----------------------------
16-core, 2-socket Xeon E5-2450, 2.1 GHz | Caffe + Intel MKL, Ubuntu 14.04.1* | 53 images/s                      | 0.27 T      | 0.074 T (27%)    | ~225 W                      | ~0.3
Arria 10 GX1150                         | Windows Server 2012                | 369 images/s¹                    | 1.366 T     | 0.51 T (38%)     | ~265 W                      | ~1.9

¹ Dense layer time estimated.

32
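The last column is simply effective throughput divided by power. A quick check against the table's values (treating one FLOP as one op; `gops_per_joule` is just a name for this arithmetic, not anything from the talk):

```python
def gops_per_joule(effective_tflops, watts):
    # TFLOPs -> GOPs (x1000), divided by watts, gives GOPs per joule
    return effective_tflops * 1e3 / watts

print(gops_per_joule(0.074, 225))   # Xeon:     ~0.33  (table: ~0.3)
print(gops_per_joule(0.51, 265))    # Arria 10: ~1.92  (table: ~1.9)
```

The same division reproduces the CNN-only-power tables below: 0.51 T at ~40 W gives ~12.8 GOPs/J, and 5.75 T at ~250 W gives ~23.0.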

Platform                                | Library/OS                         | ImageNet 1K inference throughput | Peak TFLOPs | Effective TFLOPs | Est. peak power with server | Est. GOPs/J (at peak power)
----------------------------------------|------------------------------------|----------------------------------|-------------|------------------|-----------------------------|----------------------------
16-core, 2-socket Xeon E5-2450, 2.1 GHz | Caffe + Intel MKL, Ubuntu 14.04.1* | 53 images/s                      | 0.27 T      | 0.074 T (27%)    | ~225 W                      | ~0.3
Arria 10 GX1150                         | Windows Server 2012                | 369 images/s¹                    | 1.366 T     | 0.51 T (38%)     | ~265 W                      | ~1.9
NervanaSys-32 on NVIDIA Titan X         | Ubuntu 14.04                       | 4129 images/s²                   | 6.1 T       | 5.75 T (94%)     | ~475 W                      | ~12.1

Power estimates include the host server; the CPUs, however, remain available to other jobs in the datacenter.
¹ Dense layer time estimated.
² https://github.com/soumith/convnet-benchmarks

33

Platform                                | Library/OS                         | ImageNet 1K inference throughput | Peak TFLOPs | Effective TFLOPs | Est. peak power for CNN computation | Est. GOPs/J (at peak power)
----------------------------------------|------------------------------------|----------------------------------|-------------|------------------|-------------------------------------|----------------------------
16-core, 2-socket Xeon E5-2450, 2.1 GHz | Caffe + Intel MKL, Ubuntu 14.04.1* | 53 images/s                      | 0.27 T      | 0.074 T (27%)    | ~225 W                              | ~0.3
Arria 10 GX1150                         | Windows Server 2012                | 369 images/s¹                    | 1.366 T     | 0.51 T (38%)     | ~40 W                               | ~12.8
NervanaSys-32 on NVIDIA Titan X         | Ubuntu 14.04                       | 4129 images/s²                   | 6.1 T       | 5.75 T (94%)     | ~250 W                              | ~23.0

¹ Dense layer time estimated.
² https://github.com/soumith/convnet-benchmarks

34

Under-utilized FPGA vs. highly tuned GPU-friendly workload

Platform                                | Library/OS                         | ImageNet 1K inference throughput | Peak TFLOPs | Effective TFLOPs | Est. peak power for CNN computation | Est. GOPs/J (at peak power)
----------------------------------------|------------------------------------|----------------------------------|-------------|------------------|-------------------------------------|----------------------------
16-core, 2-socket Xeon E5-2450, 2.1 GHz | Caffe + Intel MKL, Ubuntu 14.04.1* | 53 images/s                      | 0.27 T      | 0.074 T (27%)    | ~225 W                              | ~0.3
Arria 10 GX1150                         | Windows Server 2012                | 369 images/s¹ → ~880 images/s    | 1.366 T     | 0.51 T (38%) → ~1.2 T (89%)         | ~40 W                       | 20.6 → ~30.6
NervanaSys-32 on NVIDIA Titan X         | Ubuntu 14.04                       | 4129 images/s²                   | 6.1 T       | 5.75 T (94%)     | ~250 W                              | ~23.0

(Arrows: measured value → value projected after floorplanning and scaling up the PEs.)
¹ Dense layer time estimated.
² https://github.com/soumith/convnet-benchmarks

35

Projected Results Assuming Floorplanning and Scaling Up PEs

36

Yes.

37

38