1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan...

28
1 A GPU-Like Soft Processor for High- Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of Toronto

Transcript of 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan...

Page 1: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

1

A GPU-Like Soft Processor for High-Throughput

Acceleration

Jeffrey Kingyens and J. Gregory SteffanElectrical and Computer Engineering

University of Toronto

Page 2: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

FGPA-Based AccelerationIn-socket acceleration platforms

FPGA and CPU on same motherboardXtremedata, Nallatech, SGI RASCIntel Quick-Assist, AMD Torrenza

How to program them?HDL is for expertsBehavioural synthesis is limited

Can we provide a more familiar programming model?

2

XD1000

Page 3: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Potential Solution: Soft ProcessorAdvantages of soft processors:

Familiar, portable, customizableOur Goal: Develop a new S.P architecture that is:

Naturally capable of utilizing FPGA resourcesSuitable for high-throughput workloads

Challenges:Memory latencyPipeline latency and hazardsExploiting parallelismScaling

3

Page 4: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Inspiration: GPU ArchitectureMultithreading

Tolerate memory and pipeline latenciesVector instructions

Data-level parallelism, scalingPredication

Minimize impact of control flowMultiple processors

Scaling

We propose a GPU-like S.P. and programming model

4

Page 5: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

OverviewA GPU-based system

NVIDIA’s CgAMD’s CTM r5xx ISA

A GPU-like architectureOvercoming port limitationsAvoiding stalls

Preliminary resultsSimulation based on Xtremedata XD1000

5

Page 6: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

A GPU-Based System

6

Page 7: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

GPU Shader Processors

7

ShaderProgram

( X o , Yo )

Input Buffers

Fetch( n,x,y)

Data

Register File

ConstantRegisters

Output Buffer

Xo

Yo

Domain Size

Separate in/out buffers simplify memory coherence

Page 8: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Software Compilation

8

CTM Binary

Cg Shader Program

C/C++ Host Program

Written by Developer:

CTM Driver

GPU Processor

System Memorycgc

ps3 asm

amucomp

ctm asm

Our system behaves like a real graphics card!

Page 9: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

NVIDIA’s Cg Language (C-like)Cg Shader Program

struct data_out { float4 sum : COLOR;};data_out multadd(float2 coord : TEXCOORD0, uniform sampler2D A: TEXUNIT0, uniform sampler2D B: TEXUNIT1) { data_out r; float4 offset = {1.0f, 1.0f, 1.0f, 1.0f}; r.sum = tex2D(A,coord)*tex2D(B,coord)+offset; return r;}

9

Matrix-matrix element-wise multiplication + offset

Page 10: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

AMD’s CTM r5xx ISA (simplified)multadd:TEX r1 r0 s1TEX r0 r0 s0MAD o0 r1 r0 c0END

Loads

A B CSource regsDest regs

10

ALU op

Each register is a 4-element vector

Central regfile: need 2 writes and 4 reads per cycle

Page 11: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

A GPU-Like Architecture

11

Page 12: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Soft Processor Architecture

12

Soft Processor

Coordinate Generator

Register File

ConfigHT Slave

Constant File

HT Master

Output Register

ALU

A B C

TEX

Fifo

Fifo

A

FPGA Block-RAMshave only 2 ports!

64 Cycles!

305 cycles!

Must tolerate port limitations and latencies

Page 13: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Overcoming Port LimitationsProblem: central register file:

Needs four reads and two writes per cycleFPGA block RAMs have only two ports

Solution: exploit symmetry of threadsGroup threads into batches of 4Fetch operands across batches in lock-step

13

Only read one operand per thread per cycle

Page 14: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Transposed RegFile Access

14

3 2 1 0

C

3 2 1 0

B

3 2 1 0

A

T3RF

T2RF

T1RF

T0RF

Page 15: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Avoiding Stalls

Problem: long pipeline and memory latency Frequent long stalls lead to underutilized ALU datapath

Solution: exploit abundance of threads Store contexts for multiple batches of threads Issue instructions from different batches to hide latencies

15

Requires logic to issue-from and manage batches

Page 16: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Methodology and Results

16

Page 17: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Simulation Methodology

SystemC-based simulation Parameterized to model XD1000 Assume conservative 100Mhz soft processor clock Cycle accurate at the block interfaces Models HyperTransport (bandwidth and latency)

Benchmarksphoton: monte-carlo heat-transfer sim (ALU-intensive)matmatmult: dense matrix multiplication (mem-intensive)

17

Page 18: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

ALU Utilization

18

Number of Hardware Batch Contexts(Photon)

Utilized

Mem Wait

Not ALU

Inside ALU

100%

80%

60%

40%

20%

1 2 4 8 16

32

640%

Page 19: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor 19

ALU Utilization

19

Utilized Mem WaitNot ALU Inside ALU

32 batches is sufficient

Page 20: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

ConclusionsGPU-inspired soft processor architecture

exploits multithreading, vector operations, predicationThread symmetry and batching allows:

tolerating limited block RAM portstolerating long memory and pipeline latencies

32 batches sufficient to achieve 100% ALU utilization

Future work:customize programming model and arch. to FPGAsexploit longer vectors, multiple CPUs, custom ops

20

Page 21: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Backups

Page 22: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

ALU Datapath

ALU

FPADD FPADD FPADD FPADD

FPMULT FPMULT FPMULT FPMULT

FPADD

11

FPADD FPADD FPADD FPADD

1414

14

pre-subtract

multiply

add

dp3/dp4

64 Cycles!

Page 23: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Reducing Register File Size

M-RAM64KB

.r

M-RAM64KB

.g

M-RAM64KB

.b

M-RAM64KB

.a

R W R W R W R WW

Reg=3 Reg=5r3.r0-3 r3.g0-3 r3.b0-3 r5.a0-3

512 Bits

Possible to shrink this?Some programs use much fewer registersEx) photon uses 4, can replace below with 16

x M4k

23

Page 24: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Multi-threading

24

Next Type: ALU, TEX, ALU, TEX, ALU, TEX…

Context RAM

True False FalseTrue

ALUReady

ALUReady Executing

TEXWaiting

0 1 2 3

CurrentBatch States

Next Batch ( to decode) Instruction (to decode)

= 0Next Batch

Rotate

Priority Encode

Last Batch = 2

False True True False3 0 1 2

Highest Priority

Instruction RAM

ProgramCounter[0 ]

Page 25: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Implementation

Assume an Xtremedata XD1000 system (or similar)FPGA plugs into second socket on dual CPU boardCommunication with CPU via HyperTransport 16-bit wide, 400 Mhz DDR interface 1.6GB/sec

CPU FPGA

System Memory

HT Master

HT Slave

Accelerator

xd1000Driver

Host Program

25

Page 26: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor

Estimating Memory Latency

26

Page 27: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor 27

ALU Utilization

Utilized Mem WaitNot ALU Inside ALU

Matmatmult is bottlenecked on load latency

Page 28: 1 A GPU-Like Soft Processor for High-Throughput Acceleration Jeffrey Kingyens and J. Gregory Steffan Electrical and Computer Engineering University of.

A GPU-Like Soft Processor 28

Performance (16 Bit HT Link)

Number of Hardware Batch Contexts 2 4 8

16 32

64