
Systolic Algorithm Design: Hardware Merge Sort and Spatial FPGA Cell Placement Case Studies

Henry Barnor
Mentor: André DeHon

October 1, 2004


Contents

1 Introduction and motivation
  1.1 Algorithms in Hardware
  1.2 Systolic Nature of Hardware Algorithms
  1.3 Hardware Merge Sort
  1.4 Spatial FPGA cell placement
2 Hardware Merge Sort
  2.1 Algorithm: Overview
  2.2 Algorithm: Detailed Description
    2.2.1 Splitter
    2.2.2 Merger
    2.2.3 Queue
  2.3 Implementation
    2.3.1 Pipelining Proof
  2.4 Results
3 Spatial FPGA Cell Placement
  3.1 Algorithm Overview
  3.2 Implementation Progress
    3.2.1 Entropy Assembly
    3.2.2 Accumulator Assembly
    3.2.3 Swap Assembly
    3.2.4 SwapMemory Assembly
    3.2.5 Memory Assembly
  3.3 PositionUpdate Assembly
    3.3.1 Control Assembly
4 Methods
  4.1 Merge Sort Implementation
5 Conclusion
6 Future Work
7 Acknowledgements


List of Figures

1 Hardware Merge Sort: Algorithm Flow Chart
2 Hardware sort data structure
3 State Diagram for Splitter
4 Pseudo-Schematic for Splitter
5 State Diagram for Merger
6 Pseudo-Schematic for Merger
7 Systolic Placer: High level block diagram of PE

List of Tables

1 Operating frequency for different data widths with block ram and distributed ram implementation of FIFO structure
2 Total Resource usage for different data widths
3 Resource usage per Component for different data widths


Abstract

The availability and increasing power of Field Programmable Gate Arrays (FPGAs) is causing a shift toward implementing algorithms in spatially programmable hardware. This suggests that, in the future, basic algorithms will be implemented in programmable hardware to achieve higher performance than is possible with software running on sequential processors. Algorithms implemented on programmable gate arrays are inherently systolic. This project takes two algorithms previously implemented in software for sequential-architecture processors and transforms them into systolic algorithms implemented in hardware. This is achieved by rethinking each algorithm as a group of cells working together to achieve one aim, with each cell designed for a specific task in the flow.


1 Introduction and motivation

1.1 Algorithms in Hardware

Computation-intensive algorithms have, for the most part, been designed to run on everyday computing machines (sequential-architecture microprocessors). This invariably leads to a software implementation of such algorithms. To achieve gains in speed and computational power, these algorithms are parallelized and run on a connected grid of multiple sequential microprocessors. This trend is spurred by the availability and low per-unit cost of sequential-architecture microprocessors. However, recent trends in the per-unit cost and per-unit power of programmable logic chips are changing the status quo: the trend now is to do as much as possible in hardware as opposed to software.

The shift in ideology is driven not by cost alone but, more importantly, by gains in speed. Algorithms in hardware have access to distributed embedded memory, eliminating the memory-access bottleneck characteristic of sequential microprocessors. There is no time-sharing of processing power; the algorithm is the processor. In addition, we can have multiple processing elements running on the same chip without paying a large speed cost for inter-process communication.

To sum up, algorithms in hardware have higher communication bandwidth, can exploit spatial parallelism, and have quicker memory-access times than the traditional software approach. With the decreasing cost of programmable logic, hardware algorithms are now practical.

1.2 Systolic Nature of Hardware Algorithms

A systolic system is defined as a “network of processors which rhythmically compute and pass data through the system” [2]. Algorithms that can be mapped to such a system are called systolic algorithms. Hardware algorithm design is best achieved by breaking the system into task-specific processing elements. Each such element computes on the data it receives and sends the results to the next element for further computation if necessary. The end result is a pipelined, multi-processor system. This is essentially a systolic system, and we can conclude that hardware algorithms are inherently systolic in nature and that systolic algorithms can be easily implemented in hardware.
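As a software analogue of this structure, each processing element can be modelled as a stream transformer. The sketch below is illustrative only (the element names and operations are made up, not from the paper); it chains two such "cells" so data flows through them one item at a time, much as values move between hardware cells on each clock:

```python
# Software analogue of a systolic pipeline: each "processing element" is a
# generator that computes on the data it receives and passes results on.

def scale(stream, factor):
    # first cell: multiply each incoming value
    for x in stream:
        yield x * factor

def accumulate(stream):
    # second cell: keep a running total of what flows through
    total = 0
    for x in stream:
        total += x
        yield total

# Chain the elements into a pipeline and pull results through it.
pipeline = accumulate(scale(iter([1, 2, 3, 4]), 10))
print(list(pipeline))  # [10, 30, 60, 100]
```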


1.3 Hardware Merge Sort

Sorting is a basic and necessary computation for most computers: it is believed that up to 25 percent of non-numerical computer time is spent sorting [4]. This overhead can be eliminated by moving the sort to a specialized hardware component.

The basic sequential methods of sorting are probably among the best understood topics in computer science. The same cannot be said of systolic methods, but with such a deep understanding of the sequential methods, do we need to develop new ones? Most sorting algorithms use a divide-and-conquer approach, and merge sort in particular is a naturally iterative divide-and-conquer algorithm. This property of the algorithm allows us to design a systolic sorting processor.

1.4 Spatial FPGA cell placement

Reconfigurable computing is a hot topic in academia and research. A large and growing community of researchers has used field programmable gate arrays (FPGAs) to accelerate computing applications, achieving performance gains of one or two orders of magnitude over general-purpose processors [1]. The IEEE organized its 11th Symposium on Field-Programmable Custom Computing Machines in April of 2003 [3], yet reconfigurable computing has not made it onto the consumer market. Before it can, a number of key drawbacks need to be overcome. One such drawback is the time required to map program logic to physical programmable resources each time the machine reconfigures itself for a task. Spatial FPGA cell placement is one possible solution to this drawback.

2 Hardware Merge Sort

2.1 Algorithm: Overview

Given n unsorted inputs, the algorithm proceeds by using the fact that each input by itself is sorted. Merging any two of these inputs in order produces a sorted array of size 2. This size-2 sorted array can be merged in order with another size-2 sorted array to produce a size-4 sorted array. By merging iteratively in this way, the algorithm produces the n outputs in sorted order.

We note that there is a simple recurring structure in this divide-and-conquer algorithm: a split followed by a merge. By pipelining an array of processing elements to split and merge the input, we create a systolic merge sort machine.
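The iterative structure described above can be sketched in software. The Python sketch below is mine, not the paper's implementation: width-1 runs are merged into width-2 runs, then width-4, and so on, sorting largest-to-smallest to match the sorter's output order:

```python
# Bottom-up merge sort: the software shape of the split/merge pipeline.

def merge(a, b):
    # merge two descending runs into one descending run
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] >= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def bottom_up_merge_sort(values):
    runs = [[v] for v in values]  # every single input is already sorted
    while len(runs) > 1:
        # one "stage": merge adjacent runs pairwise, doubling run width
        runs = [merge(runs[k], runs[k + 1]) if k + 1 < len(runs) else runs[k]
                for k in range(0, len(runs), 2)]
    return runs[0] if runs else []

print(bottom_up_merge_sort([3, 1, 4, 1, 5, 9, 2, 6]))
# [9, 6, 5, 4, 3, 2, 1, 1]
```

Each `while` iteration corresponds to one splitter/merger stage of the hardware pipeline.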


We conclude, therefore, that a queue of size n+1 is sufficient to have the system running synchronously without any handshakes.

Figure 1: Hardware Merge Sort: Algorithm Flow Chart


2.2 Algorithm: Detailed Description

Input: A sequence of k-bit values on which it is possible to do binary comparisons.

Output: An ordered permutation, from largest to smallest, of the input values.

Structure of Data

The data will be (k+2)-bit values. The two most significant bits are used as flags by the algorithm and do not contribute to the data value. These bits represent end of sorted subset (EOS) and end of input (EOI). The EOS bit signals to the algorithm the end of one sorted stream and the beginning of a new one; thus all inputs to the sorter start with their EOS bit set. The EOI bit is used to separate streams of numbers being sorted.

Figure 2: Hardware sort data structure

  | EOI | EOS | Data Word b[k−1]..b[0] |

  EOS = End of Sorted Subset
  EOI = End of Input
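The word format in Figure 2 can be modelled in software. The sketch below assumes k = 8 for concreteness and follows the bit positions implied by the figure (bit k is EOS, bit k+1 is EOI):

```python
# Model of the (k+2)-bit word from Figure 2, with k = 8 data bits.

K = 8
EOS_BIT = 1 << K        # end of sorted subset flag
EOI_BIT = 1 << (K + 1)  # end of input flag

def pack(value, eos=False, eoi=False):
    # keep only the k data bits, then OR in the flag bits
    word = value & (EOS_BIT - 1)
    if eos:
        word |= EOS_BIT
    if eoi:
        word |= EOI_BIT
    return word

def unpack(word):
    # return (data value, eos flag, eoi flag)
    return (word & (EOS_BIT - 1), bool(word & EOS_BIT), bool(word & EOI_BIT))

w = pack(0x5A, eos=True)
print(unpack(w))  # (90, True, False)
```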

Hardware Elements

There are three independent hardware logic blocks in the algorithm: the splitter, the merger, and the queue.

2.2.1 Splitter

Takes a single stream of inputs and outputs two streams of numbers.


Input:

• input - (k+2)-bit input data values
• input_da - data available signal
• clk - clock signal
• reset - reset signal

Output:

• output0 - (k+2)-bit output data value
• output1 - (k+2)-bit output data value
• rd_en - read enable signal for the input queue
• wr_en0 - write enable signal for queue 0
• wr_en1 - write enable signal for queue 1

Functional Summary: The splitter copies its input to one output bus until it sees an active EOS bit, then switches to the other output bus. It asserts the write enable signal to the queue as it outputs data, and stops demanding and outputting data when the current queue becomes full.

Pseudo-Code for State Machine: In the pseudo-code, an assignment of '1' means enable and '0' means disable.

Figure 3: State Diagram for Splitter

Figure 4: Pseudo-Schematic for Splitter

S0:
  RD_EN = '1'
  IF INPUT_DA = '1'
    OUT0_REG = INPUT
    WR_EN0 = '1'
    WR_EN1 = '0'
    IF (!EOS(INPUT))
      GOTO S0
    ELSE
      GOTO S1
    ENDIF
  ELSE
    WR_EN0 = '0'
    WR_EN1 = '0'
    OUT0_REG = 0
  ENDIF
END S0
--
S1:
  RD_EN = '1'
  IF INPUT_DA = '1'
    OUT1_REG = INPUT
    WR_EN1 = '1'
    WR_EN0 = '0'
    IF (!EOS(INPUT))
      GOTO S1
    ELSE
      GOTO S0
    ENDIF
  ELSE
    WR_EN0 = '0'
    WR_EN1 = '0'
    OUT1_REG = 0
  ENDIF
END S1
--
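A behavioural model of the splitter's run-routing can be sketched in software. This is my own simplification, not the paper's implementation: items are (value, eos) pairs rather than packed words, and queue-full back-pressure is ignored:

```python
# Behavioural sketch of the splitter: stream items to one output queue
# until a word with its EOS bit set ends the current sorted run, then
# switch to the other output queue.

def splitter(stream):
    queues = ([], [])
    sel = 0                      # which output bus is currently active
    for value, eos in stream:
        queues[sel].append((value, eos))
        if eos:                  # end of sorted run: switch output buses
            sel ^= 1
    return queues

# Four length-1 runs (every raw input has EOS set) alternate between queues.
q0, q1 = splitter([(7, True), (3, True), (9, True), (1, True)])
print(q0, q1)  # [(7, True), (9, True)] [(3, True), (1, True)]
```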

2.2.2 Merger

Takes two sorted input streams and produces a single sorted output stream containing both.

Input:

• input0, input1 - (k+2)-bit input data values
• input0_da, input1_da - data available signals for input0 and input1, respectively
• clk - clock signal
• reset - reset signal
• sys_en - system enable signal

Output:

• output - (k+2)-bit output data (sorted)
• rd_en0 - read enable signal for input queue 0
• rd_en1 - read enable signal for input queue 1
• wr_en - write enable signal for the output queue

Functional Summary: The merger compares the two numbers on its inputs and outputs the greater of the two until it sees an EOS on one of its inputs. The EOS on one input causes it to pass the other input through until it sees an EOS on that input as well.


Figure 5: State Diagram for Merger

Pseudo-Code for State Machine:

START:
  DEMAND A
  DEMAND B
  IF (NO A AND NO B)
    GOTO START
  ELSIF (A AND NO B)
    GOTO WAITB
  ELSIF (B AND NO A)
    GOTO WAITA
  ELSE
    GOTO BOTH_AVAIL
  ENDIF
END START
--
WAITA:
  DEMAND A
  IF A
    GOTO BOTH_AVAIL
  ELSE
    GOTO WAITA
  ENDIF
END WAITA
--
WAITB:
  DEMAND B
  IF B
    GOTO BOTH_AVAIL
  ELSE
    GOTO WAITB
  ENDIF
END WAITB
--
BOTH_AVAIL:
  IF (A_IN > B_IN)
    OUT_REG = A_IN
    OUT_REG.EOS = 0
    DEMAND A
    IF (EOS(A_IN))
      GOTO PASS_B
    ELSE
      GOTO M0
    ENDIF
  ELSE
    OUT_REG = B_IN
    OUT_REG.EOS = 0
    DEMAND B
    IF (EOS(B_IN))
      GOTO PASS_A
    ELSE
      GOTO M0
    ENDIF
  ENDIF
END BOTH_AVAIL
--
PASS_A:
  IF (EOS(A_IN))
    GOTO START
  ELSE
    OUT_REG = A_IN
    DEMAND A
  ENDIF
END PASS_A
--
PASS_B:
  IF (EOS(B_IN))
    GOTO START
  ELSE
    OUT_REG = B_IN
    DEMAND B
  ENDIF
END PASS_B

Figure 6: Pseudo-Schematic for Merger
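The merger's behaviour on a single pair of runs can likewise be sketched in software. The simplifications here are mine: items are (value, eos) pairs, runs are descending, and list lengths stand in for EOS detection:

```python
# Behavioural sketch of the merger: two sorted (descending) runs arrive as
# (value, eos) pairs; emit the larger head until one run ends, then pass
# the remainder of the other run through. The merged output is one sorted
# run, so EOS is set only on its final word.

def merger(run_a, run_b):
    merged, i, j = [], 0, 0
    while i < len(run_a) and j < len(run_b):
        if run_a[i][0] > run_b[j][0]:
            merged.append(run_a[i][0]); i += 1
        else:
            merged.append(run_b[j][0]); j += 1
    # one side hit EOS: pass the other side through
    merged += [v for v, _ in run_a[i:]] + [v for v, _ in run_b[j:]]
    # re-flag: EOS on the last word of the merged run
    return [(v, k == len(merged) - 1) for k, v in enumerate(merged)]

a = [(9, False), (4, True)]
b = [(7, False), (2, True)]
print(merger(a, b))  # [(9, False), (7, False), (4, False), (2, True)]
```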

2.2.3 Queue

This is basically a hardware first-in first-out (FIFO) data structure.


2.3 Implementation

The system has been implemented in an industry-standard hardware description language (HDL), VHDL,¹ and verified by simulation. Considerable effort was applied to optimizing the algorithm to output data every clock cycle. This was achieved in two ways.

First, both the splitter and the merger were designed with an extra register/buffer to hold values before they were output. This allowed data to be available for output on the rising edge of the clock.

Second, a “demand and receive” model of handshaking was used. This replaced request-and-acknowledge handshaking, which takes more than one cycle to accomplish.

A better optimization would be to remove all handshaking and have the system run synchronously. This can only be achieved if we can guarantee that there will be no stall in the system. By making the queue sizes arbitrarily large, we can ensure that no stalls occur; however, this increases the number of resources used, which is not optimal. We suggest, and prove below, that instead of making the queue size arbitrarily large, a queue size of n+1 is sufficient to ensure that no stalls occur in the system.

2.3.1 Pipelining Proof

We consider two arbitrary empty queues, A and B, each of size n. We note that 2n clocks would fully fill both queues without stalling. Without loss of generality, assume that queue A was filled first. On the (2n+1)th clock, queue A is scheduled to receive an input. Two things can happen in that case.

CASE 1: Queue A has not been popped. If queue A has not been popped, then A is full and the input can go into the spare queue register/buffer that we claim is sufficient. The fact that queue A has not been popped implies that queue A holds a complete sorted run that must now be drained; hence, for the next n clocks, A will be popped, freeing up space for more inputs on A. Thus there will be no stall, provided the queue has the extra register/buffer.

¹ VHSIC Hardware Description Language


CASE 2: Queue A has been popped. Assume m (m > 0) of A's entries have been popped. On the (2n+1)th clock we have no problem, since space is available for at least one more input. For subsequent clocks, the worst case occurs if only B's entries are popped. We note that if m entries were popped from A in the first 2n clocks, then n − m entries were popped from B, leaving m entries in B. Thus only m more pops can come from B, meaning only m inputs will be pushed onto queue A before an entry of A is popped, and we therefore will not have any stalls in the system.
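A small cycle-level simulation can sanity-check this argument. The model below is my own approximation of the timing (each cycle the merger pops at most one word, then the producer pushes exactly one word; descending runs of length n alternate between the two queues), with adversarial data chosen so the merger drains one queue completely first:

```python
from collections import deque

def simulate(n, runs_a, runs_b):
    """Cycle-level model of one merge stage; returns (output, max occupancy)."""
    qa, qb = deque(), deque()
    feed = deque()
    for r in range(len(runs_a) + len(runs_b)):
        run = runs_a[r // 2] if r % 2 == 0 else runs_b[r // 2]
        for w in run:
            feed.append((r % 2, w))          # 0 -> queue A, 1 -> queue B
    ra = rb = n                              # words left in each side's run
    out, max_occ, cycles = [], 0, 0
    while (feed or qa or qb) and cycles < 10_000:  # guard against deadlock
        cycles += 1
        if ra == 0 and rb == 0:
            ra = rb = n                      # start on the next pair of runs
        if ra and rb:                        # merging: need both heads
            if qa and qb:
                if qa[0] >= qb[0]:
                    out.append(qa.popleft()); ra -= 1
                else:
                    out.append(qb.popleft()); rb -= 1
        elif ra:                             # B's run done: pass A through
            if qa:
                out.append(qa.popleft()); ra -= 1
        elif rb:                             # A's run done: pass B through
            if qb:
                out.append(qb.popleft()); rb -= 1
        if feed:                             # producer pushes one word
            side, w = feed.popleft()
            (qa if side == 0 else qb).append(w)
        max_occ = max(max_occ, len(qa), len(qb))
    return out, max_occ

# Adversarial data: each B run outranks its A partner, so the merger drains
# B first while A sits full and still receives the next run's words.
n = 4
out, occ = simulate(n, [[4, 3, 2, 1], [14, 13, 12, 11]],
                       [[8, 7, 6, 5], [18, 17, 16, 15]])
print(occ)  # 5, i.e. n + 1
```

Under this model the peak occupancy is exactly n+1, matching the bound argued above.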

2.4 Results

A 16-input/4-stage sorter was implemented and characterized for this paper. The code was run through the full FPGA design flow except for generating a programming bit-file for the FPGA. Table 1 contains running-speed data obtained from Xilinx ISE. Total resource usage in terms of LUTs² was also obtained from Xilinx ISE and is shown in Table 2. The Synplify Pro synthesis tool was used to obtain resource usage for the merger and splitter components, whereas Xilinx Coregen provided footprint data for the queue components.

In all three tables, data is presented for two cases:

• Distributed memory: the queue component is implemented using embedded distributed memory.
• Block ram: the queue component is implemented using dedicated blocks of memory.

Data Width (bits)           8      16     32     64
Distributed Memory (MHz)  117.4  105.5   92.2   71.6
Block Ram (MHz)            85.8   86.0   61.7    n/a

Table 1: Operating frequency for different data widths with block ram and distributed ram implementations of the FIFO structure

Data is not available for the 64-bit block ram implementation because it required more block ram than was available on the Spartan3 FPGA in the current configuration.

² Basic unit of FPGA resources


Figure 7: Systolic Placer: High level block diagram of PE. The diagram shows the Control, Entropy, Memory, SwapMemory, Accumulator, Swap, and PositionUpdate assemblies, with a Neighbour Mux selecting among the Up, Left, Right, and Down neighbours; the signals shown include DeltaCost, External Delta Cost to neighbours, Swap Decision, Done, and Swap.

3.2 Implementation Progress

Implementation of the placer is still ongoing at the time of writing. VHDL is being used to implement the algorithm in hardware. The recurring structure in this algorithm is the processing element that models the crystal molecule.

Processing Element: The processing element can be broken into seven major blocks: the Entropy assembly, the Accumulator assembly, the Memory assembly, the PositionUpdate assembly, the Swap assembly, the SwapMemory assembly, and finally the Control assembly.

3.2.1 Entropy Assembly

The entropy assembly is responsible for the randomness and cooling schedule of the simulated annealing process. It consists of two main elements: a variable cooling schedule that can be set at synthesis time, and an LFSR³ used as a random number generator. The initial system will only have a linearly decreasing cooling schedule. More cooling schedules can be added using the VHDL feature that allows a designer to define more than one architecture. The possibility of changing the cooling schedule at run-time will also be explored.

3.2.2 Accumulator Assembly

This block calculates the delta cost for the PE. It breaks down further into the following blocks.

CurrentCost Accumulator

Based on the belief that the primitive for a placement engine is position, i.e., that all cost functions can be expressed as functions of position, the CurrentCost Accumulator calculates the PE's contribution to the global cost using the information it has about the positions of the logic blocks it is connected to. This block will also be implemented to allow the cost function to be changed at synthesis time or run-time.

Position: For the purposes of this implementation, a position is a four-tuple data structure consisting of the XY-position of the upper-left corner, a width, and a height.
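An illustrative cost primitive built only on this position tuple might look as follows. The Manhattan centre-to-centre distance is an assumed stand-in, since the paper deliberately leaves the cost function swappable:

```python
# Cost as a pure function of (x, y, width, height) positions.

def centre(pos):
    x, y, w, h = pos
    return (x + w / 2, y + h / 2)

def wire_cost(pos_a, pos_b):
    # Manhattan distance between block centres (an illustrative choice)
    ax, ay = centre(pos_a)
    bx, by = centre(pos_b)
    return abs(ax - bx) + abs(ay - by)

def pe_cost(my_pos, neighbour_positions):
    # the PE's contribution to global cost: sum over blocks it connects to
    return sum(wire_cost(my_pos, p) for p in neighbour_positions)

print(pe_cost((0, 0, 2, 2), [(4, 0, 2, 2), (0, 4, 2, 2)]))  # 8.0
```

A HypoCost counterpart would call `pe_cost` with the neighbour's position substituted for `my_pos`, and the Diff value is simply the difference of the two sums.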

HypoCost Accumulator

This is the counterpart to the CurrentCost Accumulator. It calculates the cost assuming a swap is made with the current neighbour under consideration.

Diff Accumulator

The Diff Accumulator calculates the delta cost for the PE if the swap is taken.

3.2.3 Swap Assembly

The swap assembly decides whether a swap is to be made. A swap is made if the entropy signal is high, irrespective of delta costs. If the entropy signal is low, it considers the total delta cost and issues a swap decision based on that.
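That decision rule can be stated compactly. Treating "based on that" as "swap only when the total delta cost is an improvement (negative)" is my reading of the text, not a detail the paper fixes:

```python
def swap_decision(entropy_bit, delta_cost):
    # simulated annealing accept rule as described for the swap assembly
    if entropy_bit:
        return True          # anneal: take the move irrespective of cost
    return delta_cost < 0    # otherwise: only cost-reducing swaps

print(swap_decision(1, +7))  # True  (entropy overrides cost)
print(swap_decision(0, -3))  # True
print(swap_decision(0, +3))  # False
```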

³ Linear Feedback Shift Register


3.2.4 SwapMemory Assembly

This block handles swapping the PE's memory with a neighbour whenever a swap is being made.

3.2.5 Memory Assembly

The processing element keeps track of the logic blocks that are connected to it in the memory assembly. This information is necessary for calculating costs.

3.3 PositionUpdate Assembly

Whenever a logic block performs a swap, it needs to alert the logic blocks connected to it so that cost-function calculations reflect the current state of the system. The PositionUpdate assembly of each processing element acts as a leaf node on an H-tree. The H-tree allows us to propagate the update information such that all processing elements of the array receive it at the same time.

3.3.1 Control Assembly

The control block will be a finite state machine responsible for initiating a cycle of swap decisions. This involves enabling the appropriate neighbour through the input decoder (mux), starting the swap-decision process, and initiating an actual swap if necessary.

Pseudo-Code for State Machine

S0:
  clock entropy
  initiate AccumulatorAssembly
  enable next neighbour in mux
  GOTO WAIT

WAIT:
  IF swapdecision = 1
    initiate SwapMemoryAssembly
    GOTO SWAP
  ELSE
    GOTO S0
  ENDIF

SWAP:
  IF swap = done
    GOTO S0
  ENDIF

4 Methods

4.1 Merge Sort Implementation

Code Generation: The majority of the code for the sorter was written by hand in a text editor. The FIFO structures, however, were generated using Xilinx Core Generator 6.2.03i⁴ with the following settings:

• Memory type: distributed memory / block ram
• Data width: varied
• FIFO depth: 16 (minimum depth possible)

Code Simulation: The VHDL code was verified by simulation using ModelSim SE PLUS 5.7G from Mentor Graphics.⁵ The verification was done by writing a testbench VHDL file that simulated inputs to the design. The post-synthesis generated VHDL file was also simulated and verified using the same testbench.

Synthesis: Synplify Pro 7.6.1⁶ was used to synthesize the code into an EDIF file for a Xilinx Spartan3 XC3S400 part with a speed grade of -4.⁷ Synthesis was done with a user-specified clock constraint of 200 MHz for the design. A mapped VHDL netlist was also generated and verified as explained above.

Placement and Routing: Placement and routing of the design onto the Spartan3 chip was done using Xilinx ISE 6.2.03i.⁸ The inputs to ISE were the Coregen project files and the Synplify-generated EDIF file.

⁴ http://www.xilinx.com/products/logicore/coregen/
⁵ http://www.model.com/products/60/se.asp
⁶ http://www.synplicity.com/products/synplifypro/
⁷ The speed grade specifies how fast the FPGA can run.
⁸ http://www.xilinx.com/products/


5 Conclusion

Frequency/Speed: We note from Table 1 that frequency decreases as data width increases for distributed memory, whereas frequency is relatively constant for block ram. This hints that the block ram is the limiting factor. However, Xilinx data sheets claim that block rams can run at speeds close to 200 MHz. We conclude that the speed limitation must be due to inadequate pipelining between the block ram and the other components.

Resource Usage: A cursory examination would seem to suggest that both block ram and distributed ram use more resources as data width increases. A closer examination shows that distributed ram has a higher rate of increase. Considering the data in Table 3, we note that up until about 64 bits of data, the block-ram FIFO adds a constant resource usage, as opposed to the increasing resource usage of the distributed memory.

We conclude from this that we can build a system that uses distributed memory for the smaller merges in the algorithm flow and switches over to block ram for the bigger merges. This would enable us to implement a bigger design on a single chip and use the available resources more efficiently.

6 Future Work

The conclusions drawn above point us in the direction of doing a better job of pipelining the data. This would enable the system to run at faster speeds than have been demonstrated in this paper.

7 Acknowledgements

I would like to thank my advisor, Professor DeHon, first and foremost for his support and guidance, and secondly for taking a chance on me and welcoming me to his lab even though I lacked the necessary background. I would also like to thank Nachiket Kapre for explaining to me innumerable times how to think about VHDL code. I am also grateful to Michael Wrighton for helping me understand his thesis work [6]. My gratitude also goes to the SURF committee for giving me the opportunity to do summer research. Thank you to everyone at the IC lab.


References

[1] André DeHon. The density advantage of configurable computing. Computer, 33(4):41–49, April 2000.

[2] David J. Evans. Systolic algorithms. In David J. Evans, editor, Systolic Algorithms, number 3 in Topics in Computer Mathematics. Gordon and Breach, 1991.

[3] IEEE. Field-Programmable Custom Computing Machines, 11th Annual IEEE Symposium on, April 2003.

[4] G. M. Megson. An Introduction to Systolic Algorithm Design. Oxford University Press, 1992.

[5] Maogang Wang, Majid Sarrafzadeh, and Xiaojian Yang. Modern Placement Techniques. Kluwer Academic Publishers, Norwell, USA, 2003.

[6] Michael Wrighton. A spatial approach to FPGA cell placement by simulated annealing. Master's thesis, California Institute of Technology, 2003.
