
Transcript of CMPP 2012 held in conjunction with ICNC’12

Page 1: CMPP 2012 held in conjunction with ICNC’12

Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations

2012/12/07, The Third International Conference on Networking and Computing (ICNC), International Workshop on Challenges on Massively Parallel Processors (CMPP). (11:00-11:30) 25-minute presentation and 5-minute question-and-discussion time

☆Ryohei Kobayashi †1, Shinya Takamaeda-Yamazaki †1,†2, Kenji Kise †1

†1 Tokyo Institute of Technology, Japan

†2 JSPS Research Fellow, Japan

Page 2: CMPP 2012 held in conjunction with ICNC’12

Motivation (1/2)

GPU or FPGA ??


Page 3: CMPP 2012 held in conjunction with ICNC’12

FPGA-Based Accelerators

Growing demand to perform scientific computation with low power and high performance

Various accelerators have been designed to solve scientific computing kernels using FPGAs ►CUBE

◇A systolic array of 512 FPGAs

◇For encryption and pattern matching

►A stencil computation accelerator composed of 9 FPGAs

◇A scalable streaming array with constant memory bandwidth

Sano, K., IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2011.

Mencer, O., SPL 2009.

Page 4: CMPP 2012 held in conjunction with ICNC’12

v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j+1]) + (C2 * v0[i][j-1]) + (C3 * v0[i+1][j]);

2D Stencil Computation

Iterative computation that updates a data set using nearest-neighbor values, called a stencil

One method to obtain approximate solutions of partial differential equations (e.g., thermodynamics, hydrodynamics, electromagnetism, ...)

[Figure: the data set at time-step k is updated to the next time-step]

v1[i][j] is updated with the sum of four weighted neighbor values. Cx: weighting factor
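As a concrete illustration, here is a minimal C sketch of this update over a whole grid and several time-steps. The grid size (NX×NY), the number of time-steps (STEPS), the coefficient values, and the fixed-boundary handling are illustrative assumptions, not taken from the slides; only the update expression itself comes from the slide above.

#include <stdio.h>

#define NX 64      /* assumed grid height          */
#define NY 128     /* assumed grid width           */
#define STEPS 100  /* assumed number of time-steps */

static float v0[NX][NY], v1[NX][NY];
static const float C0 = 0.25f, C1 = 0.25f, C2 = 0.25f, C3 = 0.25f;

int main(void) {
    /* arbitrary initial condition: a hot top edge, everything else cold */
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            v0[i][j] = (i == 0) ? 100.0f : 0.0f;

    for (int k = 0; k < STEPS; k++) {
        /* update every interior grid-point from its four nearest neighbors */
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j+1])
                         + (C2 * v0[i][j-1]) + (C3 * v0[i+1][j]);

        /* write the updated interior back for the next time-step (boundaries stay fixed) */
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                v0[i][j] = v1[i][j];
    }
    printf("v0[1][1] after %d time-steps: %f\n", STEPS, v0[1][1]);
    return 0;
}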

Page 5: CMPP 2012 held in conjunction with ICNC’12

Motivation (2/2)

Small or Big ??


Page 6: CMPP 2012 held in conjunction with ICNC’12

ScalableCore System

A tile-architecture simulator built from multiple low-end FPGAs ►A high-speed simulation environment for research on many-core processors

►We use the hardware components of this system as an infrastructure for HPC hardware accelerators.


[Photo: one FPGA node, carrying an FPGA, SRAM, and a PROM]

Takamaeda-Yamazaki, S., ARC 2012 (2012).

Page 7: CMPP 2012 held in conjunction with ICNC’12

Our Plan


One node → 4 nodes (2×2) → 100 nodes (10×10) → Final goal

Now implementing

Page 8: CMPP 2012 held in conjunction with ICNC’12

Parallel Stencil Computation Using Multiple FPGAs


Page 9: CMPP 2012 held in conjunction with ICNC’12

Block Division and Assignment to Each FPGA


・The data set is divided into several blocks according to the number of FPGAs
・Each FPGA performs the stencil computation on its block in parallel

[Figure legend: grid-point; group of grid-points assigned to one FPGA; communication; data subset communicated with neighbor FPGAs]
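As a rough software analogue of this block division, the C sketch below computes which rows and columns of a global grid each FPGA would own; the global grid size (NX×NY), the array shape (PX×PY), and all names are illustrative assumptions.

#include <stdio.h>

#define NX 256  /* assumed global grid height                */
#define NY 512  /* assumed global grid width                 */
#define PX 4    /* assumed FPGAs in the vertical direction   */
#define PY 4    /* assumed FPGAs in the horizontal direction */

int main(void) {
    /* each FPGA (px, py) owns one block of (NX/PX) x (NY/PY) grid-points */
    int bx = NX / PX, by = NY / PY;
    for (int px = 0; px < PX; px++) {
        for (int py = 0; py < PY; py++) {
            int i0 = px * bx, j0 = py * by;
            printf("FPGA (%d,%d): rows %3d..%3d, cols %3d..%3d\n",
                   px, py, i0, i0 + bx - 1, j0, j0 + by - 1);
            /* the outermost rows/columns of each block are the data subsets
               communicated with the neighbor FPGAs                          */
        }
    }
    return 0;
}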

Page 10: CMPP 2012 held in conjunction with ICNC’12

The Computing Order of Grid-Points on an FPGA


Proposed method

Our proposed method increases the acceptable communication latency! Now, let's compare model (a) with the proposed method (b).

Page 11: CMPP 2012 held in conjunction with ICNC’12

Comparison between (a) and (b) (1/2)


・"Iteration": a sequence of operations that computes all the grid-points at one time-step.
・We suppose that updating the value of one grid-point takes exactly one cycle.
・Each FPGA updates its assigned data of sixteen grid-points (0 to 15) during every Iteration.

[Figure: two vertically adjacent FPGA pairs, each FPGA holding a 4×4 block of grid-points numbered in its computing order. (a): FPGA (A) above FPGA (B), both numbered 0-15 from the top row downward. (b) Proposed method: FPGA (C) above FPGA (D); C is numbered from its bottom row upward (C0-C3 on the row bordering D), while D is numbered from its top row downward.]
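The figure can be read as two index orderings. The C sketch below reconstructs them for a 4×4 block: in (a) every FPGA numbers its grid-points row by row from the top, while in the proposed method (b) the upper FPGA of a vertical pair numbers its rows from the bottom, as block C in the figure suggests. This is an illustrative reconstruction, not code from the slides.

#include <stdio.h>

#define N 4  /* block height (rows)    */
#define M 4  /* block width  (columns) */

/* cycle at which grid-point (row, col) of a block is computed */
static int order_same(int row, int col)     { return row * M + col; }           /* (a): top row first    */
static int order_mirrored(int row, int col) { return (N - 1 - row) * M + col; } /* (b): bottom row first */

int main(void) {
    printf("(a) order used by every FPGA:\n");
    for (int r = 0; r < N; r++) {
        for (int c = 0; c < M; c++) printf("%3d", order_same(r, c));
        printf("\n");
    }
    printf("(b) order of the upper FPGA of a pair (cf. block C above):\n");
    for (int r = 0; r < N; r++) {
        for (int c = 0; c < M; c++) printf("%3d", order_mirrored(r, c));
        printf("\n");
    }
    return 0;
}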

Page 12: CMPP 2012 held in conjunction with ICNC’12

Comparison between (a) and (b) (2/2)


[Figure: cycle timelines from 0 to 16 ("First Iteration end"). (a): FPGA (A) computes A0, A1, ..., A15 while FPGA (B) computes B0, B1, ..., B15 in the same order. (b) Proposed method: FPGA (C) computes C0, C1, ..., C15 while FPGA (D) computes D0, D1, ..., D15, with the block layouts of the previous page.]

Page 13: CMPP 2012 held in conjunction with ICNC’12

Comparison between (a) and (b) (2/2)


[Figure: the same timelines and block layouts as the previous page, highlighting A13 and B1 in model (a).]

In order not to stall the computation of B1 in model (a), the value of A13 must be communicated within the three cycles (14, 15, 16) after it is computed.

Page 14: CMPP 2012 held in conjunction with ICNC’12

Comparison between (a) and (b) (2/2)


[Figure: the same timelines and block layouts as the previous pages, now annotating both models.]

In order not to stall the computation of B1 in model (a), the value of A13 must be communicated within the three cycles (14, 15, 16) after it is computed.

In the proposed method (b), the computation of D1 in Iteration 2 starts at the 17th cycle, so the margin for sending the value of C1 (computed at the 1st cycle) is 15 cycles.

Page 15: CMPP 2012 held in conjunction with ICNC’12

Comparison between (a) and (b) (N×M grid-points)


(a): If N×M grid-points are assigned to a single FPGA, every shared value must be communicated within N-1 cycles.

(b) Proposed method: If N×M grid-points are assigned to a single FPGA, every shared value must be communicated within N×M-1 cycles.

[Figure: two FPGAs holding N×M blocks, with timelines showing a margin of N-1 cycles before the Iteration end for (a) and N×M-1 cycles for (b).]

Page 16: CMPP 2012 held in conjunction with ICNC’12

Comparison between (a) and (b) (N×M grid-points)


[Figure: the same comparison as the previous page: a margin of N-1 cycles for (a) versus N×M-1 cycles for (b), the proposed method.]

The proposed method increases the acceptable communication latency!

Page 17: CMPP 2012 held in conjunction with ICNC’12

Computing Order with the Proposed Method Applied

[Figure: the computation order of grid-points across the FPGA array]

This method ensures a margin of about one Iteration. As the number of grid-points increases, the acceptable latency scales with it.
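A small C sketch of the resulting communication margin, under the slides' one-grid-point-per-cycle assumption, for a block of M rows with N grid-points per row; the 16×16 case is purely illustrative.

#include <stdio.h>

/* communication margin (in cycles) before a boundary value is needed by
   the neighboring FPGA, assuming one grid-point update per cycle        */
static long margin_a(long n, long m) { (void)m; return n - 1; }     /* (a): same order on every FPGA    */
static long margin_b(long n, long m) { return n * m - 1; }          /* (b): proposed, mirrored ordering */

int main(void) {
    long dims[][2] = { {4, 4}, {16, 16} };  /* {N, M}: N grid-points per row, M rows */
    for (int i = 0; i < 2; i++) {
        long n = dims[i][0], m = dims[i][1];
        printf("N=%2ld, M=%2ld: (a) margin = %3ld cycles, (b) margin = %3ld cycles\n",
               n, m, margin_a(n, m), margin_b(n, m));
    }
    return 0;
}

For the 4×4 blocks of the earlier pages this reproduces the 3-cycle versus 15-cycle margins.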

Page 18: CMPP 2012 held in conjunction with ICNC’12

Architecture and Implementation


Page 19: CMPP 2012 held in conjunction with ICNC’12

System Architecture

[Block diagram: a Spartan-6 FPGA with a configuration ROM (XCF04S) and a JTAG port. Four Ser/Des links connect to/from the adjacent units to the north, south, east, and west through muxes (mux8, mux2), together with clock and reset. Inside the FPGA are eight MADD units (the computation unit), a memory unit built from BlockRAMs, and GATE[0]-GATE[3].]

Page 20: CMPP 2012 held in conjunction with ICNC’12

Relationship between the Data Subset and BlockRAM (Memory Unit)


BlockRAM: low-latency SRAM embedded in each FPGA.

The data set assigned to each FPGA is split in the vertical direction and stored across BlockRAMs 0-7.

If a 64×128 data set is assigned to one FPGA, each split data set (8×128) is stored in one of the BlockRAMs (0-7).

[Figure: a 4×4 FPGA array with the data assigned, and the BlockRAMs inside one FPGA]
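A minimal C sketch of this vertical split, assuming the 64×128 per-FPGA block and 8 BlockRAMs stated above; the mapping functions and their names are illustrative.

#include <stdio.h>

#define ROWS  64              /* per-FPGA block height          */
#define COLS  128             /* per-FPGA block width           */
#define NBRAM 8               /* BlockRAMs (and MADDs) per FPGA */
#define STRIP (ROWS / NBRAM)  /* 8 rows per BlockRAM            */

/* which BlockRAM holds block row 'row', and at which local row inside its 8x128 strip */
static int bram_of(int row)   { return row / STRIP; }
static int local_row(int row) { return row % STRIP; }

int main(void) {
    int samples[] = { 0, 7, 8, 35, 63 };
    for (int i = 0; i < 5; i++) {
        int r = samples[i];
        printf("block row %2d -> BlockRAM %d, local row %d (strip of %dx%d)\n",
               r, bram_of(r), local_row(r), STRIP, COLS);
    }
    return 0;
}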

Page 21: CMPP 2012 held in conjunction with ICNC’12

Relationship between MADD and BlockRAM (Memory Unit)


・The data set stored in each BlockRAM is computed by a corresponding MADD.
・The MADDs perform the computation in parallel.
・The computed data is stored back into BlockRAM.

Page 22: CMPP 2012 held in conjunction with ICNC’12

MADD Architecture (Computation Unit)

MADD ►Multiplier: seven pipeline stages

►Adder: seven pipeline stages

►Both the multiplier and the adder are single-precision floating-point units conforming to IEEE 754.
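Functionally, one grid-point update on a MADD is four multiplications and three additions routed through its single multiplier and single adder. The small C model below shows that sequence; the function names are illustrative and the pipelining of the real unit is ignored here.

#include <stdio.h>

/* the two primitive operations the MADD provides */
static float fmul(float a, float b) { return a * b; }
static float fadd(float a, float b) { return a + b; }

/* one grid-point update: 4 multiplications + 3 additions = 7 FLOPs */
static float madd_update(float up, float left, float right, float down,
                         float c0, float c1, float c2, float c3) {
    float p0 = fmul(c0, up);
    float p1 = fmul(c1, left);
    float p2 = fmul(c2, right);
    float p3 = fmul(c3, down);
    return fadd(fadd(fadd(p0, p1), p2), p3);
}

int main(void) {
    printf("%f\n", madd_update(1.0f, 2.0f, 3.0f, 4.0f,
                               0.25f, 0.25f, 0.25f, 0.25f));
    return 0;
}

These seven floating-point operations per grid-point are what the 7/8 utilization factor on the evaluation pages refers to.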


Page 23: CMPP 2012 held in conjunction with ICNC’12

Stencil Computation at MADD

v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j-1]) + (C2 * v0[i][j+1]) + (C3 * v0[i+1][j]);

[Figure: MADD datapath with an 8-stage multiplier and an 8-stage adder]

Page 24: CMPP 2012 held in conjunction with ICNC’12

Stencil Computation at MADD

v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j-1]) + (C2 * v0[i][j+1]) + (C3 * v0[i+1][j]);

[Figure: MADD datapath; multiplication by C0]

Page 25: CMPP 2012 held in conjunction with ICNC’12

Stencil Computation at MADD

v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j-1]) + (C2 * v0[i][j+1]) + (C3 * v0[i+1][j]);

[Figure: MADD datapath; multiplication by C1; the multiplier takes 8 cycles]

Page 26: CMPP 2012 held in conjunction with ICNC’12

Stencil Computation at MADD

v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j-1]) + (C2 * v0[i][j+1]) + (C3 * v0[i+1][j]);

[Figure: MADD datapath; multiplication by C1]

Page 27: CMPP 2012 held in conjunction with ICNC’12

Stencil Computation at MADD

v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j-1]) + (C2 * v0[i][j+1]) + (C3 * v0[i+1][j]);

[Figure: MADD datapath; multiplication by C2; the multiplier and the adder each take 8 cycles]

Page 28: CMPP 2012 held in conjunction with ICNC’12

Stencil Computation at MADD

v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j-1]) + (C2 * v0[i][j+1]) + (C3 * v0[i+1][j]);

[Figure: MADD datapath; multiplication by C2]

Page 29: CMPP 2012 held in conjunction with ICNC’12

Stencil Computation at MADD

v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j-1]) + (C2 * v0[i][j+1]) + (C3 * v0[i+1][j]);

[Figure: MADD datapath; multiplication by C3]

Page 30: CMPP 2012 held in conjunction with ICNC’12

Stencil Computation at MADD

v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j-1]) + (C2 * v0[i][j+1]) + (C3 * v0[i+1][j]);

[Figure: MADD datapath]

Page 31: CMPP 2012 held in conjunction with ICNC’12

Stencil Computation at MADD

v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j-1]) + (C2 * v0[i][j+1]) + (C3 * v0[i+1][j]);

[Figure: MADD datapath; the result v1[i][j] is output]

Page 32: CMPP 2012 held in conjunction with ICNC’12

MADD Pipeline Operation (Computation Unit)

The computation of grid-points 11~18

[Figure: MADD pipeline with an 8-stage multiplier and an 8-stage adder; the adder has inputs Input1 and Input2]

Page 33: CMPP 2012 held in conjunction with ICNC’12

MADD Pipeline Operation (in cycles 0〜7)

The computation of grid-points 11~18

The grid-points 1~8 are loaded from BlockRAM and input to the multiplier in cycles 0~7.

[Figure: grid-points 1-8 entering the 8-stage multiplier]

Page 34: CMPP 2012 held in conjunction with ICNC’12

MADD Pipeline Operation (in cycles 8〜15)

The computation of grid-points 11~18

The products of grid-points 1~8 are output from the multiplier and, at the same time, grid-points 10~17 are input to the multiplier in cycles 8~15.

[Figure: grid-points 10-17 entering the multiplier while the products of grid-points 1-8 leave it]

Page 35: CMPP 2012 held in conjunction with ICNC’12


MADD Pipeline Operation (in cycles 16〜23)

The computation of grid-points 11~18

The grid-points 12~19 are input to the multiplier and, at the same time, the values of grid-points 1~8 and 10~17, each multiplied by a weighting factor, are summed in cycles 16~23.

[Figure: grid-points 12-19 entering the multiplier; the products of grid-points 1-8 and 10-17 entering the adder]

Page 36: CMPP 2012 held in conjunction with ICNC’12

MADD Pipeline Operation (in cycles 24〜31)

The computation of grid-points 11~18

Input2 (adder): grid-points 1~8 and 10~17
Input1 (adder): grid-points 12~19
Input (multiplier): grid-points 21~28

[Figure: pipeline contents in cycles 24-31]

Page 37: CMPP 2012 held in conjunction with ICNC’12

MADD Pipeline Operation (in cycles 32〜39)

The computation of grid-points 11~18

Input2 (adder): grid-points 1~8, 10~17, and 12~19
Input1 (adder): grid-points 21~28
Input (multiplier): grid-points 11~18

[Figure: pipeline contents in cycles 32-39]

Page 38: CMPP 2012 held in conjunction with ICNC’12


MADD Pipeline Operation (in cycles 40〜48)

The computation of grid-points 11~18

The computation results, in which the values of the upper, lower, left, and right grid-points are each multiplied by a weighting factor and summed, are output in cycles 40~48.

[Figure: the results for grid-points 11-18 leaving the adder; grid-points 20-27 entering the multiplier]
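The schedule walked through on the last few pages can be summarized by when each batch of eight operands is issued and when its results come out, assuming the 8-stage multiplier and 8-stage adder shown in the figures. The C sketch below reproduces that accounting; it is a simplified model of the slides' schedule, not the hardware description itself.

#include <stdio.h>

#define LAT 8   /* pipeline latency of both the multiplier and the adder */
#define B   8   /* operands per batch, issued one per cycle              */

static void batch(const char *name, int issue) {
    printf("%-30s issued in cycles %2d-%2d, results in cycles %2d-%2d\n",
           name, issue, issue + B - 1, issue + LAT, issue + LAT + B - 1);
}

int main(void) {
    /* computing grid-points 11-18 from their four weighted neighbors */
    batch("C0 * (grid-points  1- 8)",   0);   /* upper neighbors */
    batch("C1 * (grid-points 10-17)",   8);   /* left  neighbors */
    batch("add the C0 and C1 products", 16);
    batch("C2 * (grid-points 12-19)",   16);  /* right neighbors */
    batch("add the C2 products",        24);
    batch("C3 * (grid-points 21-28)",   24);  /* lower neighbors */
    batch("add the C3 products",        32);  /* final results appear in cycles 40-47 */
    return 0;
}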

Page 39: CMPP 2012 held in conjunction with ICNC’12

MADD Pipeline Operation (Computation Unit)

The filling rate of the pipeline: ((N-8)/N) × 100%, where N is the number of cycles taken by this computation (a worked example follows after these bullets).

► This achieves high computation performance with a small circuit area.

► This scheduling is valid only when the width of the computed grid is equal to the number of pipeline stages of the multiplier and the adder.
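A small worked example of this filling rate in C; the values of N are illustrative.

#include <stdio.h>

/* filling rate of the MADD pipeline: ((N - 8) / N) x 100 %,
   where N is the number of cycles taken by the computation  */
static double filling_rate(double n) { return (n - 8.0) / n * 100.0; }

int main(void) {
    double cycles[] = { 48.0, 128.0, 1024.0 };  /* illustrative values of N */
    for (int i = 0; i < 3; i++)
        printf("N = %6.0f cycles -> filling rate = %5.1f %%\n",
               cycles[i], filling_rate(cycles[i]));
    return 0;
}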


Page 40: CMPP 2012 held in conjunction with ICNC’12

Initialization Mechanism (1/2)

[Figure: a 4×4 FPGA array with the Master at (0,0) and coordinates up to (3,3); arrows labeled "x-coordinate + 1" and "y-coordinate + 1" propagate the coordinates from the Master]

・To determine its computation order, every FPGA uses its own position coordinates in the system.
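A minimal C sketch of this coordinate assignment, assuming each FPGA takes the coordinate of its west or north neighbor and increments the corresponding component, with the Master seeded at (0,0); the propagation order used here is an assumption for illustration.

#include <stdio.h>

#define DIM 4  /* 4x4 FPGA array as on the slide */

int main(void) {
    int x[DIM][DIM], y[DIM][DIM];
    x[0][0] = 0; y[0][0] = 0;  /* Master at (0,0) */
    for (int r = 0; r < DIM; r++) {
        for (int c = 0; c < DIM; c++) {
            if (r == 0 && c == 0) continue;
            if (c > 0) { x[r][c] = x[r][c-1] + 1; y[r][c] = y[r][c-1]; }      /* x-coordinate + 1 from the west  */
            else       { x[r][c] = x[r-1][c];     y[r][c] = y[r-1][c] + 1; }  /* y-coordinate + 1 from the north */
            printf("FPGA at row %d, col %d -> (%d,%d)\n", r, c, x[r][c], y[r][c]);
        }
    }
    return 0;
}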

Page 41: CMPP 2012 held in conjunction with ICNC’12

Initialization Mechanism (2/2)


Sending the computation-start signal

・This array system must precisely synchronize the timing at which computation starts in the first Iteration.
・If there is skew, the array system cannot obtain the data of the communication region to be used for the next Iteration.

[Figure: a 4×4 FPGA array receiving the start signal]

Page 42: CMPP 2012 held in conjunction with ICNC’12

Evaluation


Page 43: CMPP 2012 held in conjunction with ICNC’12

Environment

FPGA: Xilinx Spartan-6 XC6SLX16

► BlockRAM: 72KB

Design tool: Xilinx ISE WebPACK 13.3

Hardware description language: Verilog HDL

Implementation of MADD: IP cores generated by Xilinx CORE Generator

► Implementing a single MADD consumes four of the 32 DSP blocks that a Spartan-6 FPGA has.

◇Therefore, at most eight MADDs can be implemented in a single FPGA.

[Photos: ScalableCore board; hardware configuration of the FPGA array]

SRAM is not used.

Page 44: CMPP 2012 held in conjunction with ICNC’12

Performance of Single FPGA Node (1/2)

Grid size: 64×128

Iterations: 500,000

Performance and power consumption (160 MHz)

► Performance: 2.24 GFlop/s

► Power consumption: 2.37 W

Peak = 2 × F × N_FPGA × N_MADD × 7/8
Peak: peak performance [GFlop/s]
F: operating frequency [GHz]
N_FPGA: the number of FPGAs
N_MADD: the number of MADDs per FPGA
7/8: average utilization of the MADD unit → four multiplications and three additions per grid-point

v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j+1]) + (C2 * v0[i][j-1]) + (C3 * v0[i+1][j]);
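A small C check of the peak-performance formula above, using the values stated on these slides (F = 0.16 GHz, 8 MADDs per FPGA); the intermediate node counts are simply the x-axis values of the chart on the later estimation page.

#include <stdio.h>

/* Peak [GFlop/s] = 2 x F x N_FPGA x N_MADD x 7/8 */
static double peak_gflops(double f_ghz, int n_fpga, int n_madd) {
    return 2.0 * f_ghz * n_fpga * n_madd * 7.0 / 8.0;
}

int main(void) {
    const double f = 0.16;  /* GHz            */
    const int n_madd = 8;   /* MADDs per FPGA */
    int nodes[] = { 1, 2, 4, 8, 16, 32, 64, 128, 256 };
    for (int i = 0; i < 9; i++)
        printf("%3d FPGA node(s): peak = %7.2f GFlop/s\n",
               nodes[i], peak_gflops(f, nodes[i], n_madd));
    /* 1 node    -> 2.24 GFlop/s, the figure quoted above                    */
    /* 256 nodes -> 573.44 GFlop/s, the upper limit quoted on a later slide  */
    return 0;
}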

Page 45: CMPP 2012 held in conjunction with ICNC’12

Performance of Single FPGA Node (2/2)

Performance and performance per watt (160 MHz) ► Performance: 2.24 GFlop/s

► Performance per watt: 0.95 GFlop/s/W

Hardware Resource Consumption ►LUT: 50%

►Slice: 67%

►BlockRAM: 75%

► DSP48A1: 100%

26% of the performance of an Intel Core i7-2600 (single thread, 3.4 GHz, -O3 option)

The performance/W value is about six times better than that of an Nvidia GTX 280 GPU card.

Page 46: CMPP 2012 held in conjunction with ICNC’12

Estimation of Effective Performance in 256 FPGA Nodes

Upper limit of effective performance ► 573 GFlop/s = (8 multipliers + 8 adders) × 256 FPGAs × 160 MHz × 7/8

Performance per watt ► 0.944 GFlop/s/W

[Chart: effective performance (GFlop/s, 1 to 1000, log scale) versus the number of FPGA nodes (2 to 256), at a frequency of 0.16 GHz]

Estimation of effective performance improvement rate.

Page 47: CMPP 2012 held in conjunction with ICNC’12

Conclusion

Proposal of a high-performance stencil computation method and architecture

Implementation results (one-FPGA node) ► Frequency: 160 MHz (no communication)

► Effective performance: 2.24 GFlop/s. Power consumption: 2.37 W.

► Hardware resource consumption: Slices 67%

Estimation of performance with 256 FPGA nodes ► Upper limit of effective performance: 573 GFlop/s

► Effective performance per watt: 0.944 GFlop/s/W

A low-end FPGA array system is promising! (better performance per watt than an Nvidia GTX 280 GPU card)

Future work ► Implementation and evaluation of a larger-scale FPGA array

► Implementation toward lower power