in Computer Engineering Recent Trends and Applications...

65
ENGRI 1210 Recent Trends and Applications in Computer Engineering Christopher Batten School of Electrical and Computer Engineering Cornell University

Transcript of in Computer Engineering Recent Trends and Applications...

Page 1: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

ENGRI 1210Recent Trends and Applications

in Computer Engineering

Christopher Batten

School of Electrical and Computer EngineeringCornell University

Page 2: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

• The Computer Systems Stack • Activity 1 Trends in Computer Engineering Activity 2 Hardware Xcel for Deep Learning

The Computer Systems Stack

Technology

Application

Gap too large to bridge in one step(but there are exceptions, e.g., a magnetic compass)

ENGRI 1210 Recent Trends and Applications in Computer Engineering 2 / 63

Page 3: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

• The Computer Systems Stack • Activity 1 Trends in Computer Engineering Activity 2 Hardware Xcel for Deep Learning

The Computer Systems Stack

In its broadest definition, computer engineering is thedevelopment of the abstraction/implementation layers that allow us to

execute information processing applications efficientlyusing available manufacturing technologies

Register-Transfer Level

Circuits

Devices

Instruction Set Architecture

Programming Language

Algorithm

Microarchitecture

Com

pute

r E

ng

ineeri

ng

Technology

Application

Operating System

Gate Level

TraditionalComputer Science

TraditionalElectrical Engineering

Computer Engineering is at theinterface between hardware and softwareand considers the entire system

ENGRI 1210 Recent Trends and Applications in Computer Engineering 2 / 63

Page 4: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

• The Computer Systems Stack • Activity 1 Trends in Computer Engineering Activity 2 Hardware Xcel for Deep Learning

Cornell Computer Engineering Curriculum

Register-Transfer Level

Circuits

Devices

Instruction Set Architecture

Programming Language

Algorithm

Microarchitecture

Com

pute

r E

ng

ineeri

ng

Technology

Application

Operating System

Gate Level ECE 2300 Digital Logic & Computer Org

ECE 3140 Embedded Systems

ECE 4750 Computer Architecture

ECE 4760 Design with Microcontrollers

ECE 2400 Computer Systems Programming

ENGRI 1210 Recent Trends and Applications in Computer Engineering 3 / 63

Page 5: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

• The Computer Systems Stack • Activity 1 Trends in Computer Engineering Activity 2 Hardware Xcel for Deep Learning

Cornell Computer Engineering Curriculum

ENGRI 1210 Recent Trends and Applications in Computer Engineering 4 / 63

Page 6: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

• The Computer Systems Stack • Activity 1 Trends in Computer Engineering Activity 2 Hardware Xcel for Deep Learning

Processors, Memories, and Networks

Register-Transfer Level

Circuits

Devices

Instruction Set Architecture

Programming Language

Algorithm

Microarchitecture

Com

pute

r E

ng

ineeri

ng

Technology

Application

Operating System

Gate Level

Computer engineering basic building blocks

• Processors for computation

• Memories for storage

• Networks for communication

Network

Processor

Memory

InputData

OutputData

Computedata

Movedata

Storedata

ENGRI 1210 Recent Trends and Applications in Computer Engineering 5 / 63

Page 7: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack • Activity 1 • Trends in Computer Engineering Activity 2 Hardware Xcel for Deep Learning

RTL

Devices

ISA

PL

Algorithm

μArch

Technology

Application

OS

Gates

Circuits

Agenda

The Computer Systems Stack

Activity 1

Trends in Computer Engineering

Activity 2

Hardware Acceleration for Deep Learning

ENGRI 1210 Recent Trends and Applications in Computer Engineering 6 / 63

Page 8: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack • Activity 1 • Trends in Computer Engineering Activity 2 Hardware Xcel for Deep Learning

Activity #1: Sorting with a Sequential Processor

Network

Processor

Memory

I Application: Sort 32 numbers

I Simulated Sequential Computing System. Processor: You!. Memory: Worksheet, read input data, write output data. Network: Passing/collecting the worksheets

I Activity Steps. 1. Discuss strategy with neighbors. 2. When instructor starts timer, flip over worksheet. 3. Sort 32 numbers as fast as possible. 4. Lookup when completed and write time on worksheet. 5. Raise hand. 6. When everyone is finished, then analyze data

ENGRI 1210 Recent Trends and Applications in Computer Engineering 7 / 63

Page 9: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

RTL

Devices

ISA

PL

Algorithm

μArch

Technology

Application

OS

Gates

Circuits

Agenda

The Computer Systems Stack

Activity 1

Trends in Computer Engineering

Activity 2

Hardware Acceleration for Deep Learning

ENGRI 1210 Recent Trends and Applications in Computer Engineering 8 / 63

Page 10: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Trend 1: Bell’s Law

Roughly every decade a new, smaller, lower priced computer class formsbased on a new programming platform resulting in entire new industries

103

104

105

106

107

108

102

Pric

e in

Dol

lars

1950 1960 1970 1980 2000 2010

G. Bell. "Bell's Law for the Birth and Death of Computer Classes."CACM, Jan 2008.

1990

Volu

me

in c

m3

2020 1950 1960 1970 1980 2000 20101990 2020

Y. Lee et al. "Modular 1mm3 Die-Stacked Sensing Platform ..."JSSC, Jan 2013.

104

106

108

102

100

107

105

103

101

Mainframes

Mainframes

Minicomputers

Minicomputers

Workstations

WorkstationsPersonalComputers

PersonalComputers

Laptops

LaptopsHandhelds

Handhelds

"Ultra"Embedded "Ultra"

Embedded

Supercomputers

ScalableClusters

CloudComputing

Internetof Things

ENGRI 1210 Recent Trends and Applications in Computer Engineering 9 / 63

Page 11: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Trend 1: Growing Diversity in Apps & Systems

Data Centers

MedicalDevices

Internet Routers

Humanoid RobotsUnmanned Vehicles

GPS Devicesand Satellites

Automobiles

DigitalCameras

Wearable Activity MonitorsSmart Home

WearableComputing

Sensor Networks

GameConsoles Computing: From Handhelds to Servers

ENGRI 1210 Recent Trends and Applications in Computer Engineering 10 / 63

Page 12: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Example: Internet of Things

$1.7 trillionMarket for IoT by 2020— IDC

25 billionConnected “things” by 2020— Gartner

ENGRI 1210 Recent Trends and Applications in Computer Engineering 11 / 63

Page 13: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

IoT Platform Startups

Particle: PhotonWiFiconnected�controllersw/ParticleCloud

PunchThrough

ENGRI 1210 Recent Trends and Applications in Computer Engineering 12 / 63

Page 14: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

IoT Chip Startups

B. Calhoun, D. Wentzloff, et al.Univ. of Virginia, Univ. of Michigan

Chip startup foundedin 2014 to useultra-low-powercircuits in energyharvesting IoTdevices

ENGRI 1210 Recent Trends and Applications in Computer Engineering 13 / 63

Page 15: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

IoT for Truly Personalized Medicine

ENGRI 1210 Recent Trends and Applications in Computer Engineering 14 / 63

Page 16: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

The Computer Systems Stack

Register-Transfer Level

Circuits

Devices

Instruction Set Architecture

Programming Language

Algorithm

Microarchitecture

Com

pute

r E

ng

ineeri

ng

Technology

Application

Operating System

Gate Level

Trend #2: Software/ArchitectureInterface Changing Radically

ENGRI 1210 Recent Trends and Applications in Computer Engineering 15 / 63

Page 17: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Trends in High-Performance Processors

Transistors(Thousands)

MIPSR2K

IntelP4

1975 1980 1985 1990 1995 2000 2005 2010 2015

100

101

102

103

104

105

106

107

DECAlpha 21264

TypicalPower (W)

Frequency(MHz)

SPECintPerformance

~9%/year

~15%/year

ENGRI 1210 Recent Trends and Applications in Computer Engineering 16 / 63

Page 18: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Transition to Multicore Processors

P

M

N

P

M

N

P

M

P

M

Intel Pentium 4Single monolithic processor

Cray XT3 Supercomputer1024 single-core processors

P

M

P

M

P

M

P

M

N

AMD Quad-Core Opteron Four cores on the same die

P

M

P

M

N

IBM Blue Gene QSupercomputer Thousands of 18-core processors

P

M

P

M

ENGRI 1210 Recent Trends and Applications in Computer Engineering 17 / 63

Page 19: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

The Multicore “Hail Mary Pass”

Transistors(Thousands)

MIPSR2K

IntelP4

1975 1980 1985 1990 1995 2000 2005 2010 2015

100

101

102

103

104

105

106

107

DECAlpha 21264

TypicalPower (W)

Frequency(MHz)

SPECintPerformance

~9%/year

~15%/year

Numberof Cores

Intel 48-Core Prototype

Parallelism?

AMD 4-Core Opteron

ENGRI 1210 Recent Trends and Applications in Computer Engineering 18 / 63

Page 20: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Inevitable End of Moore’s Law as We Know It

Adapted from R. Courtland, “Transistors Could Stop Shrinking in 2021,” IEEE Spectrum, 2016.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 19 / 63

Page 21: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Slowing Technology ScalingMeans Golden Age of Design

Technological'Fallow'Period''

CMOS&scaling&is&running&out&&&

6&

[Colwell&2012]&

Adapted from D. Brooks Keynote at NSF XPS Workshop, May 2015.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 20 / 63

Page 22: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Slowing Technology ScalingMeans Golden Age of Design…and&it’s&about&=me.&

7&

Golden'Age'Of'Design'

'Technological'Fallow'Period''

[Colwell&2012]&

7nm,&~50B&tx&

Adapted from D. Brooks Keynote at NSF XPS Workshop, May 2015.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 20 / 63

Page 23: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

The Specialization “Hail Mary Pass”

Transistors(Thousands)

MIPSR2K

IntelP4

1975 1980 1985 1990 1995 2000 2005 2010 2015

100

101

102

103

104

105

106

107

DECAlpha 21264

TypicalPower (W)

Frequency(MHz)

SPECintPerformance

~9%/year

~15%/year

Numberof Cores

Intel 48-Core Prototype

Parallelism?

AMD 4-Core Opteron

Specializa

tion?

ENGRI 1210 Recent Trends and Applications in Computer Engineering 21 / 63

Page 24: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Heterogeneous Systems-on-Chip

OMAP'4'SoC'

Today’s&SoC&

ARM&Cores& GPU&DSP& DSP&

System&Bus&

Secondary&&Bus&

Secondary&&Bus&

Ter=ary&&Bus&

DMA&

DMA& SD&USB&Audio& Video& Face& Imaging&

USB&

26&Adapted from D. Brooks Keynote at NSF XPS Workshop, May 2015.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 22 / 63

Page 25: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Heterogeneous Systems-on-Chip

OutOofOCore&Accelerators&

Mal=el&Consul=ng&&es=mates&

[www.anandtech.com/show/8562/chipworksOa8]&

[Y.&Shao,&IEEE&Micro&2015]&

Adapted from D. Brooks Keynote at NSF XPS Workshop, May 2015.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 23 / 63

Page 26: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Microsoft Catapult: FPGAs in the Data Center

Figure 2: (a) A block diagram of the FPGA board. (b) A picture of the manufactured board. (c) A diagram of the 1~U, half-width server that hosts the FPGA board. The air flows from the left to the right, leaving

the FPGA in the exhaust of both CPUs.

Figure 3: Cumulative distribution of compressed document sizes. Nearly all compressed documents are

64 kB or less.

I Custom FPGA board for accelerating Bing search and other workloads

I Accelerators developed with/by app developers

I Tightly integrated into Microsoft data center’s and cloud computingplatforms, access gradually being given to outside developers

ENGRI 1210 Recent Trends and Applications in Computer Engineering 24 / 63

Page 27: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

The Computer Systems Stack

Register-Transfer Level

Circuits

Devices

Instruction Set Architecture

Programming Language

Algorithm

Microarchitecture

Com

pute

r E

ng

ineeri

ng

Technology

Application

Operating System

Gate Level Trend #3: Technology/ArchitectureInterface Changing Radically

ENGRI 1210 Recent Trends and Applications in Computer Engineering 25 / 63

Page 28: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Trend 5: Emerging Device Technologies

`90`80`70`60`50`40`30`20`10 `10

?Vertical MOSFETsGrapheneCarbon NanotubesNanorelaysQuantum ComputingMolecular ComputingMemristersPhase-Change MemSpintronics3D IntegrationNanophotonics

Adapted from R. Kurzweil, “The Singularity is Near,” Penguin Books, 2006.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 26 / 63

Page 29: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Examples of Emerging Technologies

Intel 3D Crosspoint MemoryResistive memory enables very

high density, non-volatile storagewith fast access times

D-WaveQuantum annealing computersuitable for solving complex

optimization problems

ENGRI 1210 Recent Trends and Applications in Computer Engineering 27 / 63

Page 30: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

A Carbon Nanotube “Computer”

Adapted from M. Shulaker, et al., “Carbon Nanotube Computer,” Nature, 2013.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 28 / 63

Page 31: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 • Trends in Computer Engineering • Activity 2 Hardware Xcel for Deep Learning

Three Key Trends in Computer Engineering

RTL

Devices

ISA

PL

Algorithm

μArch

Technology

Application

OS

Gates

Circuits

Trend #2:Software/ArchInterface ChangingRadically

Trend #3:Technology/ArchInterface ChangingRadically

Trend #1: Growing Diversity inApplications and Systems

Students entering the field of computer engineeringhave a unique opportunity to shape the future of computing

and how it will impact society

ENGRI 1210 Recent Trends and Applications in Computer Engineering 29 / 63

Page 32: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering • Activity 2 • Hardware Xcel for Deep Learning

RTL

Devices

ISA

PL

Algorithm

μArch

Technology

Application

OS

Gates

Circuits

Agenda

The Computer Systems Stack

Activity 1

Trends in Computer Engineering

Activity 2

Hardware Acceleration for Deep Learning

ENGRI 1210 Recent Trends and Applications in Computer Engineering 30 / 63

Page 33: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering • Activity 2 • Hardware Xcel for Deep Learning

Activity #2: Sorting with a Parallel Processor

P

M

P

M

P

M

P

M

N

I Application: Sort 32 numbers

I Simulated Parallel Computing System. Processor: Group of 2–8 students. Memory: Worksheet, scratch paper. Network: Communicating between students

I Activity Steps. 1. Discuss strategy with group. 2. When instructor starts timer, master processor flips over worksheet. 3. Sort 32 numbers as fast as possible. 4. Lookup when completed and write time on worksheet. 5. Master processor only raises hand. 6. When everyone is finished, then analyze data

ENGRI 1210 Recent Trends and Applications in Computer Engineering 31 / 63

Page 34: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering • Activity 2 • Hardware Xcel for Deep Learning

Activity #2: Discussion

unsorted

sorted

Distribute

Sort 4 Numbers

Merge Phase 1 > merge 4+4 = 8

Merge Phase 2 > merge 8+8 = 16

Merge Phase 3 > merge 16+16 = 32

Proc/Mem

Proc/Mem

Network

Network

Network

Proc/Mem

Proc/Mem

Network

Network

AlgorithmCommunicationLoad BalancingFault Tolerance

Dataset Size

ENGRI 1210 Recent Trends and Applications in Computer Engineering 32 / 63

Page 35: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

RTL

Devices

ISA

PL

Algorithm

μArch

Technology

Application

OS

Gates

Circuits

Agenda

The Computer Systems Stack

Activity 1

Trends in Computer Engineering

Activity 2

Hardware Acceleration for Deep Learning

ENGRI 1210 Recent Trends and Applications in Computer Engineering 33 / 63

Page 36: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

ImageNet: Object Recognition CompetitionImageNet: object recognition

ENGRI 1210 Recent Trends and Applications in Computer Engineering 34 / 63

Page 37: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Machine Learning: Training vs. Inference

ENGRI 1210 Recent Trends and Applications in Computer Engineering 35 / 63

Page 38: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

An Inflection Point due to Algorithms and Hardware

Adapted from A. Gray, “NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge,” NVIDIA Blog, 2015.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 36 / 63

Page 39: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

SGI InfiniteReality GPU

RealityEngine; there are three distinct board types: the Geometry,Raster Memory, and Display Generator boards (Figure 1).

The Geometry board comprises a host computer interface, com-mand interpretation and geometry distribution logic, and fourGeometry Engine processors in a MIMD arrangement. Each Ras-ter Memory board comprises a single fragment generator with asingle copy of texture memory, 80 image engines, and enoughframebuffer memory to allocate 512 bits per pixel to a 1280x1024framebuffer. The display generator board contains hardware todrive up to eight display output channels, each with its own videotiming generator, video resize hardware, gamma correction, anddigital-to-analog conversion hardware.

Systems can be configured with one, two or four raster memoryboards, resulting in one, two, or four fragment generators and 80,160, or 320 image engines.

Figure 1: Board-level block diagram of the maximumconfiguration with 4 Geometry Engines, 4 Raster Memory boards,

and a Display Generator board with 8 output channels.

2.1 Host InterfaceThere were significant system constraints that influenced the archi-tectural design of InfiniteReality. Specifically, the graphics systemhad to be capable of working on two generations of host platforms.The Onyx2 differs significantly from the shared memory multipro-cessor Onyx in that it is a distributed shared memory multiproces-sor system with cache-coherent non-uniform memory access. Themost significant difference in the graphics system design is that theOnyx2 provides twice the host-to-graphics bandwidth (400MB/secvs. 200MB/sec) as does Onyx. Our challenge was to design a sys-tem that would be matched to the host-to-graphics data rate of theOnyx2, but still provide similar performance with the limited I/Ocapabilities of Onyx.

We addressed this problem with the design of the display list sub-system. In the RealityEngine system, display list processing hadbeen handled by the host. Compiled display list objects werestored in host memory, and one of the host processors traversed thedisplay list and transferred the data to the graphics pipeline usingprogrammed I/O (PIO).

With the InfiniteReality system, display list processing is handledin two ways. First, compiled display list objects are stored in hostmemory in such a way that leaf display objects can be “pulled”into the graphics subsystem using DMA transfers set up by theHost Interface Processor (Figure 1). Because DMA transfers arefaster and more efficient than PIO, this technique significantlyreduces the computational load on the host processor so it can bebetter utilized for application computations. However, on the origi-nal Onyx system, DMA transfers alone were not fast enough tofeed the graphics pipe at the rate at which it could consume data.The solution was to incorporate local display list processing intothe design.

Attached to the Host Interface Processor is 16MB of synchronousdynamic RAM (SDRAM). Approximately 15MB of this memoryis available to cache leaf display list objects. Locally stored displaylists are traversed and processed by an embedded RISC core.Based on a priority specified using an OpenGL extension and thesize of the display list object, the OpenGL display list managerdetermines whether or not a display list object should be cachedlocally on the Geometry board. Locally cached display lists areread at the maximum rate that can be consumed by the remainderof the InfiniteReality pipeline. As a result, the local display listprovides a mechanism to mitigate the host to graphics I/O bottle-neck of the original Onyx. Note that if the total size of leaf displaylist objects exceeds the resident 15MB limit, then some number ofobjects will be pulled from host memory at the reduced rate.

2.2 Geometry DistributionThe Geometry Distributor (Figure 1) passes incoming data andcommands from the Host Interface Processor to individual Geome-try Engines for further processing. The hardware supports bothround-robin and least-busy distribution schemes. Since geometricprocessing requirements can vary from one vertex to another, aleast-busy distribution scheme has a slight performance advantageover round-robin. With each command, an identifier is includedwhich the Geometry-Raster FIFO (Figure 1) uses to recreate theoriginal order of incoming primitives.

2.3 Geometry EnginesWhen we began the design of the InfiniteReality system, it becameapparent that no commercial off-the-shelf floating point processorswere being developed which would offer suitable price/perfor-mance. As a result, we chose to implement the Geometry EngineProcessor as a semicustom application specific integrated circuit(ASIC).

The heart of the Geometry Engine is a single instruction multipledatapath (SIMD) arrangement of three floating point cores, each ofwhich comprises an ALU and a multiplier plus a 32 word register

Fragment Generator Fragment Generator Fragment Generator Fragment Generator

Vertex Bus

ImageEngines

Geometry−Raster FIFO

GeometryEngine

GeometryEngine

GeometryEngine

GeometryEngine

Geometry Distributor

Host Interface Processor

Host System Bus

Geometry Board

Raster Memory Board Raster Memory Board Raster Memory BoardRaster Memory Board

Display GeneratorBoard

De−Interleaver

DisplayChannel

DisplayChannel

DisplayChannel

DisplayChannel

DisplayChannel

DisplayChannel

DisplayChannel

DisplayChannel

VidAdapted from J. Montrym, “InfiniteReality: A Real-Time Graphics System,” ACM SIGGRAPH, 1997.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 37 / 63

Page 40: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

NVIDIA GeForce 6800

A tour of the GeForce 6800Figure 5 is a top-level dia-

gram of the GeForce 6800.Work flows from top to bot-tom, starting with the sixidentical programmable ver-tex processors. Because allvertices are independent ofeach other, the data fetcherassigns incoming work to anyidle processor, and the paral-lel utilization is nearly perfect.The “GeForce 6800 statis-tics” sidebar provides morespecifics.

Results from the vertexstage are reassembled in theoriginal application-specifiedorder to feed the trianglesetup and rasterization units.For each primitive, the ras-

46

HOT CHIPS 16

IEEE MICRO

Command and data fetch

Triangle setup rasterizer

Shader thread dispatch

Fragment crossbar

Z-cull

Memorypartition

Memorypartition

Memorypartition

Memorypartition

Level 2texturecache

Pixel-blendingunits

Vertex processors

Fragmentprocessors

Figure 5. GeForce 6800 block diagram.

Constant RAM512 ! 128 bits

Inputregisters

16 ! 128 bits

Outputregisters

16 ! 128 bits

Temporaryregisters

32 ! 128 bits

Special-function

unit

InstructionRAM

512 ! 123 bits

Vertextexture

unit

Level 2texturecache

Multiply

Add

MemoryTexture relatedComputation unit

Figure 6. Vertex processor block diagram.

Adapted from J. Montrym, “The GeForce 6800,” IEEE Micro, Mar/Apr 2005.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 38 / 63

Page 41: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

NVIDIA G80

bution block distributes vertex work packetsto the various TPCs in the SPA. The TPCsexecute vertex shader programs, and (ifenabled) geometry shader programs. Theresulting output data is written to on-chipbuffers. These buffers then pass their resultsto the viewport/clip/setup/raster/zcull blockto be rasterized into pixel fragments. Thepixel work distribution unit distributes pixelfragments to the appropriate TPCs forpixel-fragment processing. Shaded pixel-fragments are sent across the interconnec-tion network for processing by depth andcolor ROP units. The compute workdistribution block dispatches computethread arrays to the TPCs. The SPA acceptsand processes work for multiple logicalstreams simultaneously. Multiple clockdomains for GPU units, processors,DRAM, and other units allow independentpower and performance optimizations.

Command processingThe GPU host interface unit communi-

cates with the host CPU, responds tocommands from the CPU, fetches data fromsystem memory, checks command consisten-cy, and performs context switching.

The input assembler collects geometricprimitives (points, lines, triangles, linestrips, and triangle strips) and fetchesassociated vertex input attribute data. Ithas peak rates of one primitive per clockand eight scalar attributes per clock at theGPU core clock, which is typically600 MHz.

The work distribution units forward theinput assembler’s output stream to the arrayof processors, which execute vertex, geom-etry, and pixel shader programs, as well ascomputing programs. The vertex and com-pute work distribution units deliver work toprocessors in a round-robin scheme. Pixel

Figure 1. Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming

multiprocessor; SP: streaming processor; Tex: texture, ROP: raster operation processor.

........................................................................

MARCH–APRIL 2008 41

Adapted from E. Lindholm, “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro, Mar/Apr 2008.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 39 / 63

Page 42: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

What is in a GPU?

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

What’s in a GPU?

4

ShaderCore

ShaderCore

ShaderCore

ShaderCore

ShaderCore

ShaderCore

ShaderCore

ShaderCore

Tex

Tex

Tex

Tex

Input Assembly

Rasterizer

Output Blend

Video Decode

WorkDistributor

A GPU is a heterogeneous chip multi-processor (highly tuned for graphics)

HWor

SW?

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 40 / 63

Page 43: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Compiling a Shader

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

Compile shader

6

<diffuseShader>:

sample(r0,(v4,(t0,(s0

mul((r3,(v0,(cb0[0]

madd(r3,(v1,(cb0[1],(r3

madd(r3,(v2,(cb0[2],(r3

clmp(r3,(r3,(l(0.0),(l(1.0)

mul((o0,(r0,(r3

mul((o1,(r1,(r3

mul((o2,(r2,(r3

mov((o3,(l(1.0)

1 unshaded fragment input record

1 shaded fragment output record

sampler(mySamp;

Texture2D<float3>(myTex;

float3(lightDir;

float4(diffuseShader(float3(norm,(float2(uv)

{

((float3(kd;

((kd(=(myTex.Sample(mySamp,(uv);

((kd(*=(clamp((dot(lightDir,(norm),(0.0,(1.0);

((return(float4(kd,(1.0);(((

}

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 41 / 63

Page 44: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Executing a Shader

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

Execute shader

7

<diffuseShader>:

sample(r0,(v4,(t0,(s0

mul((r3,(v0,(cb0[0]

madd(r3,(v1,(cb0[1],(r3

madd(r3,(v2,(cb0[2],(r3

clmp(r3,(r3,(l(0.0),(l(1.0)

mul((o0,(r0,(r3

mul((o1,(r1,(r3

mul((o2,(r2,(r3

mov((o3,(l(1.0)

Fetch/Decode

ExecutionContext

ALU(Execute)

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 42 / 63

Page 45: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

“CPU-Style” Cores

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

“CPU-style” cores

14

Fetch/Decode

ExecutionContext

ALU(Execute)

Data cache(a big one)

Out-of-order control logic

Fancy branch predictor

Memory pre-fetcher

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 43 / 63

Page 46: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Slimming Down

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

Slimming down

15

Fetch/Decode

ExecutionContext

ALU(Execute)

Idea #1: Remove components thathelp a single instructionstream run fast

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 44 / 63

Page 47: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Two Cores (Execute Two Fragments in Parallel)

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

Two cores (two fragments in parallel)

16

Fetch/Decode

ExecutionContext

ALU(Execute)

Fetch/Decode

ExecutionContext

ALU(Execute)

<diffuseShader>:sample(r0,(v4,(t0,(s0

mul((r3,(v0,(cb0[0]

madd(r3,(v1,(cb0[1],(r3madd(r3,(v2,(cb0[2],(r3

clmp(r3,(r3,(l(0.0),(l(1.0)mul((o0,(r0,(r3

mul((o1,(r1,(r3

mul((o2,(r2,(r3mov((o3,(l(1.0)

fragment 1

<diffuseShader>:sample(r0,(v4,(t0,(s0

mul((r3,(v0,(cb0[0]

madd(r3,(v1,(cb0[1],(r3madd(r3,(v2,(cb0[2],(r3

clmp(r3,(r3,(l(0.0),(l(1.0)mul((o0,(r0,(r3

mul((o1,(r1,(r3

mul((o2,(r2,(r3mov((o3,(l(1.0)

fragment 2

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 45 / 63

Page 48: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Four Cores (Execute Four Fragments in Parallel)

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

Four cores (four fragments in parallel)

17

Fetch/Decode

ExecutionContext

ALU(Execute)

Fetch/Decode

ExecutionContext

ALU(Execute)

Fetch/Decode

ExecutionContext

ALU(Execute)

Fetch/Decode

ExecutionContext

ALU(Execute)

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 46 / 63

Page 49: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

16 Cores (Execute 16 Fragments in Parallel)

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

Sixteen cores (sixteen fragments in parallel)

18

16 cores = 16 simultaneous instruction streams

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 47 / 63

Page 50: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Instruction Stream Sharing

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

Instruction stream sharing

19

But ... many fragments should be able to share an instruction stream!

<diffuseShader>:

sample(r0,(v4,(t0,(s0

mul((r3,(v0,(cb0[0]

madd(r3,(v1,(cb0[1],(r3

madd(r3,(v2,(cb0[2],(r3

clmp(r3,(r3,(l(0.0),(l(1.0)

mul((o0,(r0,(r3

mul((o1,(r1,(r3

mul((o2,(r2,(r3

mov((o3,(l(1.0)

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 48 / 63

Page 51: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Recall: Simple Processing Core

Fetch/Decode

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

Recall: simple processing core

20

ExecutionContext

ALU(Execute)

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 49 / 63

Page 52: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Add ALUs

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

Add ALUs

21

Fetch/Decode

Idea #2:Amortize cost/complexity of managing an instruction stream across many ALUs

SIMD processingCtx Ctx Ctx Ctx

Ctx Ctx Ctx Ctx

Shared Ctx Data

ALU 1 ALU 2 ALU 3 ALU 4

ALU 5 ALU 6 ALU 7 ALU 8

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 50 / 63

Page 53: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Execute 128 Fragments in Parallel

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

128 [ ] in parallel

26

vertices/fragmentsprimitives

OpenCL work itemsCUDA threads

fragments

vertices

primitives

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 51 / 63

Page 54: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

But What About Branches?

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

But what about branches?

30

ALU 1 ALU 2 . . . ALU 8. . . Time (clocks) 2 . . . 1 . . . 8

if((x(>(0)({

}(else({

}

<unconditional(shader(code>

<resume(unconditional(shader(code>

y(=(pow(x,(exp);

y(*=(Ks;

refl(=(y(+(Ka;((

x(=(0;(

refl(=(Ka;((

T T T F FF F F

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 52 / 63

Page 55: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

But What About Memory Stalls?

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

Hiding shader stalls

37

Time (clocks)

Frag 9 … 16 Frag 17 … 24 Frag 25 … 32Frag 1 … 8

1 2 3 4

Stall

Runnable

Stall

Stall

Stall

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 53 / 63

Page 56: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Key is Throughput!

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

Throughput!

38

Time (clocks)

Frag 9 … 16 Frag 17 … 24 Frag 25 … 32Frag 1 … 8

1 2 3 4

Stall

Runnable

Stall

Runnable

Stall

Runnable

Stall

Runnable

Done!

Done!

Done!

Done!

Start

Start

Start

Increase run time of one groupto increase throughput of many groups

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 54 / 63

Page 57: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

18 Contexts

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

Eighteen small contexts

40

Fetch/Decode

ALU 1 ALU 2 ALU 3 ALU 4

ALU 5 ALU 6 ALU 7 ALU 8

(maximal latency hiding)

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 55 / 63

Page 58: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Complete GPU

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

My chip!

44

16 cores

8 mul-add ALUs per core(128 total)

16 simultaneousinstruction streams

64 concurrent (but interleaved)instruction streams

512 concurrent fragments

= 256 GFLOPs (@ 1GHz)

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 56 / 63

Page 59: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Complete “Big” GPU

07/29/10 Beyond,Programmable,Shading,Course,,ACM,SIGGRAPH,2010

My “enthusiast” chip!

45

32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)

Thursday, July 29, 2010

Adapted from K. Fatahalian, “Beyond Programmable Shading Course,” ACM SIGGRAPH 2010.

ENGRI 1210 Recent Trends and Applications in Computer Engineering 57 / 63

Page 60: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Using GPUs for General-Purpose Computing

int main( int argc, char* argv[] )

{

// ... copy data to GPGPU ...

vvadd<<<block_count,threads_per_block >>>

( dest, src0, src1, n );

// ... copy data from GPGPU to CPU ...

}

__global__ void vvadd

( int dest[], int src0[], int src1[], int n )

{

int i = blockDim.x * blockIdx.x + threadIdx.x;

if ( i < n )

z[i] = x[i] + y[i];

}

ENGRI 1210 Recent Trends and Applications in Computer Engineering 58 / 63

Page 61: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

NVIDIA DGX-1 for Deep Learning Training

ENGRI 1210 Recent Trends and Applications in Computer Engineering 59 / 63

Page 62: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

NVIDIA PX 2 for Deep Learning Inference

ENGRI 1210 Recent Trends and Applications in Computer Engineering 60 / 63

Page 63: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 • Hardware Xcel for Deep Learning •

Hardware Xcel for ML is Significant Growth Area

Google’s TPUI Custom chip for accelerating

Google’s TensorFlow library

I Tightly integrated into Google’sdata centers

Hardware ML StartupsI GraphcoreI Wave ComputingI CerebrasI Mobileye (Intel)I Nervana (Intel)I Movidius (Intel)

ENGRI 1210 Recent Trends and Applications in Computer Engineering 61 / 63

Page 64: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 Hardware Xcel for Deep Learning

RTL

Devices

ISA

PL

Algorithm

μArch

Technology

Application

OS

Gates

Circuits

Take-Away Points

I We are entering an exciting new era of computerengineering. Growing diversity in applications & systems. Radical rethinking of software/architecture interface. Radical rethinking of technology/architecture interface

I This era offers tremendous challenges andopportunities, which makes it a wonderful time tostudy and contribute to the field of computerengineering

ENGRI 1210 Recent Trends and Applications in Computer Engineering 62 / 63

Page 65: in Computer Engineering Recent Trends and Applications ...cbatten/misc/batten-compeng-trends-apps-eng… · ENGRI 1210 Recent Trends and Applications in Computer Engineering 15

The Computer Systems Stack Activity 1 Trends in Computer Engineering Activity 2 Hardware Xcel for Deep Learning

ECE 2400 Computer Systems Programming

I Part 1: Basic Data Structures andAlgorithms with C. static typing, functions, control flow, arrays,

strings, pointers, dynamic memory management. recursion, divide-and-conquer, dynamic

programming. sorting, lists, stacks, queues, sets, maps

I Part 2: Advanced Data Structures andAlgorithms with C++. objects, inheritance, polymorphism, templates. binary search trees, priority queues, hash

tables, graphs, spanning trees

I Part 3: Systems Programming with C/C++. POSIX I/O, processes, threads

template < typename T >

T* find_max( T* array, int n )

{

if ( n == 0 )

return NULL;

T* result = &array[0];

for ( int i = 1; i < n; i++ ) {

if ( array[i] > *result )

result = &array[i];

}

}

ProgrammingAssignments

I Version controlI Test-driven designI Continuous integrationI Debugging & profilingI Code optimization

ENGRI 1210 Recent Trends and Applications in Computer Engineering 63 / 63