Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size...

25
Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture

description

Contemporary Streaming Multiprocessor (SM) 3  1,000’s of schedulable threads  Amortize front end and memory overheads by grouping threads into warps.  Size of the warp is fixed based on the architecture Tim RogersA Variable Warp-Size Architecture

Transcript of Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size...

Page 1: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

Timothy G. RogersDaniel R. JohnsonMike O’ConnorStephen W. Keckler

A Variable Warp-Size Architecture

Page 2: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture2

Contemporary GPU Massively Multithreaded 10,000’s of threads concurrently executing on

10’s of Streaming Multiprocessors (SM) GPU

Interconnect

SM SM

SM SM

...

...

MemCtrlL2$

...

... MemCtrlL2$

MemCtrlL2$

MemCtrlL2$

Tim Rogers

Page 3: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture3

ContemporaryStreaming Multiprocessor (SM) 1,000’s of schedulable threads Amortize front end and memory overheads by

grouping threads into warps. Size of the warp is fixed based on the architecture

GPU

Interconnect

SM SM

SM SM

...

...

MemCtrlL2$

...

... MemCtrlL2$

MemCtrlL2$

MemCtrlL2$

Streaming MultiprocessorFrontend

Warp Datapath

L1 I-Cache

Memory Unit

Decode

WarpControl Logic 32-wide

Tim Rogers

Page 4: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture4

Contemporary GPU Software Regular structured computation Predictable access and control flow patterns Can take advantage of HW amortization for

increased performance and energy efficiency

Execute Efficiently on a

GPU TodayGraphics Shaders

Matrix Multiply

Tim Rogers

Page 5: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture5

Forward-Looking GPU Software Still Massively Parallel Less Structured

Memory access and control flow patterns are less predictable

Execute efficiently on a

GPU todayGraphics Shaders

Matrix Multiply

Less efficient on today’s GPU

RaytracingMolecular Dynamics

Object Classificati

on…

Tim Rogers

Page 6: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture6

Divergence: Source of Inefficiency Regular hardware that amortizes front end

and overhead Irregular software with many different control

flow paths and less predicable memory accesses.Branch Divergence Memory Divergence

if (…) { …

}…

Tim Rogers

Load R1, 0(R2)

Can cut function unit utilization to 1/32.

32- Wide 32- Wide

Main Memory

Instruction may wait for 32 cache

lines

Page 7: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture7

Irregular “Divergent” Applications:Perform better with a smaller warp size

Tim Rogers

if (…) { …

}…

Allows more

threads to proceed

concurrently

Branch Divergence Memory DivergenceLoad R1, 0(R2)

Increased function

unit utilization

Main Memory

Each instruction waits on fewer

accesses

Page 8: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture8

Negative effects of smaller warp size

Tim Rogers

Less front end amortization Increase in fetch/decode energy

Negative performance effects Scheduling skew increases pressure on the

memory system

Page 9: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture9

Regular “Convergent” Applications:Perform better with a wider warp

Tim Rogers

32- Wide

Load R1, 0(R2)Load R1, 0(R2)

Main Memory

GPU memory coalescing

One memory system request can service all

32 threads

Smaller warps: Less coalescing

Main Memory

8 redundant memory accesses – no longer occurring together

Page 10: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture10

Performance vs. Warp Size

Tim Rogers

00.20.40.60.8

11.21.41.61.8

Warp Size 4

IPC

nor

mal

ized

to w

arp

size

32

Application

Convergent

Applications

Warp-Size Insensitive

Applications

Divergent Applicatio

ns

165 Applications

Page 11: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture11

Goals Convergent Applications

Maintain wide-warp performance Maintain front end efficiency

Warp Size Insensitive Applications Maintain front end efficiency

Divergent Applications Gain small warp performance

Tim Rogers

Set the warp size based on the

executing application

Page 12: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture12

Sliced Datapath + Ganged Scheduling Split the SM datapath into narrow slices.

Extensively studied 4-thread slices Gang slice execution to gain efficiencies of

wider warp.

Tim Rogers

Frontend

Warp Datapath

L1 I-Cache

Memory Unit

WarpControl Logic 32-wide

Slice

Frontend

Slice Datapath

L1 I-Cache

Memory Unit

SliceFront End 4-wide

...SliceSlice DatapathSlice

Front End 4-wide

Slices can execute

independently

Slices share an L1I-Cache and Memory Unit

GangingUnit

Page 13: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture13

Slice

Frontend

Slice Datapath

L1 I-Cache

Memory Unit

SliceFront End 4-wide

...SliceSlice DatapathSlice

Front End 4-wide

GangingUnit

Initial operation Slices begin execution in ganged mode

Mirrors the baseline 32-wide warp system Question: When to break the gang?

Tim Rogers

Ganging unit drives the

slices

Instructions are fetched and decoded

once

Page 14: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture14

Slice

Frontend

Slice Datapath

L1 I-Cache

Memory Unit

SliceFront End 4-wide

...SliceSlice DatapathSlice

Front End 4-wide

GangingUnit

Breaking Gangs on Control Flow Divergence PCs common to more than one slice form a

new gang Slices that follow a unique PC in the gang are

transferred to independent control

Tim Rogers

Unique PCs: no longer controlled by ganging unit

Observes different PCs

from each slice

Page 15: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture15

Slice

Frontend

Slice Datapath

L1 I-Cache

Memory Unit

SliceFront End 4-wide

...SliceSlice DatapathSlice

Front End 4-wide

GangingUnit

Breaking Gangs on Memory Divergence Latency of accesses from each slice can differ Evaluated several heuristics on breaking the

gang when this occurs

Tim Rogers

Hits in L1

Goes to memory

Page 16: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture16

Slice

Frontend

Slice Datapath

L1 I-Cache

Memory Unit

SliceFront End 4-wide

...SliceSlice DatapathSlice

Front End 4-wide

GangingUnit

Gang Reformation Performed opportunistically

Ganging unit checks for gangs or independent slices at the same PC

Forms them into a gang

Tim Rogers

Independent BUT at the same

PC

Ganging unit re-gangs them

More details in the paper

Page 17: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture17

Methodology In House, Cycle-Level Streaming

Multiprocessor Model 1 In-order core 64KB L1 Data cache 128KB L2 Data Cache (One SM’s worth) 48KB Shared Memory Texture memory unit Limited BW memory system Greedy-Then-Oldest (GTO) Issue Scheduler

Tim Rogers

Page 18: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture18

Configurations

Tim Rogers

Warp Size 32 (WS 32) Warp Size 4 (WS 4)

Inelastic Variable Warp Sizing (I-VWS) Gangs break on control flow divergence Are not reformed

Elastic Variable Warp Sizing (E-VWS) Like I-VWS, except gangs are opportunistically reformed

Studied 5 applications from each category in detailPaper Explores Many More Configurations

Page 19: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture19

CoM

D

Ligh

ting

Gam

ePhy

sics

Obj

Cla

ssifi

er

Rayt

raci

ng

HMEA

N-DI

V

Divergent Applications

00.20.40.60.8

11.21.41.61.8 WS 32 WS 4 I-VWS E-VWS

IPC

nor

mal

ized

to w

arp

size

32

Divergent Application Performance

Tim Rogers

Warp Size 4I-VWS:

Break on CF Only

E-VWS: Break + Reform

Page 20: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture20

CoM

D

Ligh

ting

Gam

ePhy

sics

Obj

Cla

ssifi

er

Rayt

raci

ng

AVG

-DIV

Divergent Applications

012345678 WS 32 WS 4 I-VWS E-VWS

Avg.

Fet

ches

Per

Cyc

le

Divergent Application Fetch Overhead

Tim Rogers

Warp Size 4I-VWS:

Break on CF Only

Used as a proxy for energy

consumption

E-VWS: Break + Reform

Page 21: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture21

Gam

e 1

Mat

rixM

ultip

ly

Gam

e 2

Feat

ureD

etec

t

Radi

x So

rt

HMEA

N-C

ON

Convergent Applications

0

0.2

0.4

0.6

0.8

1

1.2 WS 32 WS 4 I-VWS E-VWS

IPC

nor

mal

ized

to w

arp

size

32

Convergent Application Performance

Tim Rogers

Warp Size 4I-VWS:

Break on CF Only

Warp-Size Insensitive

Applications Unaffected

E-VWS: Break + Reform

Page 22: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture22

Imag

e Pr

oc.

Gam

e 3

Conv

olut

ion

Gam

e 4

FFT

AVG

-WSI

Gam

e 1

Mat

rixM

ultip

ly

Gam

e 2

Feat

ureD

etec

t

Radi

x So

rt

AVG

-CO

N

Warp-Size Insensitive Applications Convergent Applications

0123456789 WS 32 WS 4 I-VWS E-VWS

Avg.

Fet

ches

Per

Cyc

leConvergent/Insensitive Application Fetch Overhead

Tim Rogers

Warp Size 4

I-VWS: Break on CF

OnlyE-VWS: Break + Reform

Page 23: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture23

165 Application Performance

Tim Rogers

Convergent

Applications

Divergent Applicatio

ns

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8Warp Size 4I-VWS

IPC

nor

mal

ized

to w

arp

size

32

Application

Warp-Size Insensitive

Applications

Page 24: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture24

Related Work

Tim Rogers

Compaction/Formation Subdivision/Multipath

Improve Function

Unit Utilization

Decrease Thread Level

Parallelism Decreased Function

Unit Utilization

Increase Thread Level

Parallelism

Variable Warp Sizing

Improve Function

Unit Utilization

Increase Thread Level

Parallelism

Area Cost Area Cost

Area CostVWS Estimate:5% for 4-wide

slices2.5% for 8-wide

slices

Page 25: Timothy G. Rogers Daniel R. Johnson Mike O’Connor Stephen W. Keckler A Variable Warp-Size Architecture.

A Variable Warp-Size Architecture25

Conclusion Explored space surrounding warp size and

perfromance

Vary the size of the warp to meet the depends of the workload 35% performance improvement on divergent apps No performance degradation on convergent apps

Narrow slices with ganged execution Improves both SIMD efficiency and thread-level

parallelism

Tim Rogers

Questions?