Datapath Designs

Datapath Designs CK Cheng CSE Department UC, San Diego


Transcript of Datapath Designs

Page 1: Datapath Designs

Datapath Designs

CK Cheng

CSE Department

UC, San Diego

Page 2: Datapath Designs

Prefix Adder – Well-known and Well-developed?

• Classic prefix networks: Sklansky, Kogge-Stone, Brent-Kung, Ladner-Fischer, Han-Carlson, Knowles etc.
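As a reminder of what these networks compute, here is a minimal Python sketch of carry computation with one of them, the Kogge-Stone network. The function name `kogge_stone_add` and the default 8-bit width are illustrative assumptions, not taken from the slides.

```python
def kogge_stone_add(a: int, b: int, width: int = 8) -> int:
    """Add two unsigned integers using a Kogge-Stone parallel-prefix carry network."""
    # Bitwise generate and propagate signals, least significant bit first.
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    sum_p = p[:]  # keep the original propagate bits for the final XOR

    # Each level combines position i with position i - d; d doubles per level.
    d = 1
    while d < width:
        new_g, new_p = g[:], p[:]
        for i in range(d, width):
            new_g[i] = g[i] | (p[i] & g[i - d])
            new_p[i] = p[i] & p[i - d]
        g, p = new_g, new_p
        d *= 2

    # After the last level, g[i] is the carry out of bit i (carry-in is 0).
    carries = [0] + g[:-1]
    result = 0
    for i in range(width):
        result |= (sum_p[i] ^ carries[i]) << i
    result |= g[width - 1] << width  # the carry out becomes the top bit
    return result


print(kogge_stone_add(200, 100))
```

The other classic networks compute the same group generate/propagate signals and differ only in which positions are combined at each level, which is what trades logic depth against fanout and wiring.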

Page 3: Datapath Designs

Prefix Adder – New Respects, New Method

• Realistic design considerations: Timing, Power and Area.

• Integer Linear Programming for prefix adders:

– Logical effort timing model (gate cap. + wire cap.)

– Activity-statistic power model

– Non-uniform signal arrival/required times

[Figure: structural metrics (logic levels, max fanouts, max wire tracks) related to design metrics (timing, power, area)]

Page 4: Datapath Designs

Prefix Adder – Optimum Prefix adders

• Uniform signal arrival/required times

[Figure: four prefix networks. Panel titles: Sklansky adder, Kogge-Stone adder, fastest depth-4 optimal prefix adder, fastest depth-3 optimal prefix adder]
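The structural difference between the Sklansky and Kogge-Stone networks can be made concrete with a small Python sketch. It builds both networks as lists of levels and reports logic levels, prefix-cell count, and a simplified fanout figure. The encoding and the function names are illustrative assumptions. The fanout here counts only the prefix cells driven by one result within a level; it ignores pass-through wires and buffers, so it is not the same figure a timing model would use.

```python
from collections import Counter


def kogge_stone(n):
    """Levels of (i, j) pairs: position i absorbs the previous-level result at j."""
    levels, d = [], 1
    while d < n:
        levels.append([(i, i - d) for i in range(d, n)])
        d *= 2
    return levels


def sklansky(n):
    """Same encoding; the upper half of each 2d-wide block reads the top of the lower half."""
    levels, d = [], 1
    while d < n:
        levels.append([(i, (i // d) * d - 1) for i in range(n) if i & d])
        d *= 2
    return levels


def covered_ranges(levels, n):
    """Track which input range (lo, hi) each position has accumulated."""
    span = [(i, i) for i in range(n)]
    for level in levels:
        prev = span[:]
        for i, j in level:
            assert prev[j][1] + 1 == prev[i][0], "ranges must be adjacent"
            span[i] = (prev[j][0], prev[i][1])
    return span


def metrics(levels):
    """(logic levels, prefix cells, max cells driven by one result within a level)."""
    fanout = max(max(Counter(j for _, j in level).values()) for level in levels)
    return len(levels), sum(len(level) for level in levels), fanout


for name, build in (("Sklansky", sklansky), ("Kogge-Stone", kogge_stone)):
    levels = build(16)
    # A valid prefix network ends with position i covering inputs 0..i.
    assert covered_ranges(levels, 16) == [(0, i) for i in range(16)]
    print(name, metrics(levels))
```

Both networks reach the minimum depth for 16 bits. Sklansky uses fewer cells but concentrates load on a few results, while Kogge-Stone keeps the load per result low at the cost of more cells and wires.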

Page 5: Datapath Designs

Prefix Adder – Optimum Prefix adders

• Uniform signal arrival/required times

[Figure: timing versus power of optimal prefix adders with depth = 3, depth = 4, and depth = 5, compared with the Brent-Kung, Kogge-Stone, and Sklansky adders]

Page 6: Datapath Designs

Prefix Adder – Optimum Prefix adders

• Non-uniform signal arrival/required times

[Figure: optimal prefix adders for three input profiles. Panel titles: increasing signal arrival times, decreasing signal arrival times, convex signal arrival times]

Page 7: Datapath Designs

Division – Iteration effort

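• Pencil-and-paper method: $A = QB + 2^{-n}R$ with $R < B$

• 1-bit partial quotient per iteration, $n$ iterations

$Q_i$: partial quotient

$R_i$: partial remainder, with $R_0 = A$ and $R_i = R_{i-1} - B\,Q_i$

Example ($n = 4$): A = 0.1001, B = 0.1010, Q = A / B. Remainders are written as in the long-division layout, shifted left by one bit per step.

R0 = A = 0.1001

Step 1: 2·R0 = 1.0010 ≥ B, so Q1 = 0.1 and R1 = 1.0010 − 0.1010 = 0.1000

Step 2: 2·R1 = 1.0000 ≥ B, so Q2 = 0.01 and R2 = 1.0000 − 0.1010 = 0.0110

Step 3: 2·R2 = 0.1100 ≥ B, so Q3 = 0.001 and R3 = 0.1100 − 0.1010 = 0.0010

Step 4: 2·R3 = 0.0100 < B, so Q4 = 0.0000 and R4 = 0.0100

Q = Q1 + Q2 + Q3 + Q4 = 0.1110, R = R4 = 0.0100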
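The one-bit-per-iteration scheme can be sketched in Python as a shift-and-subtract loop on integers. This is a minimal illustration, assuming operands are n-bit binary fractions stored as integers (the fraction 0.1001 is stored as `0b1001`); the function name `restoring_divide` is hypothetical.

```python
def restoring_divide(a: int, b: int, n: int):
    """Divide the fraction a/2**n by b/2**n, producing one quotient bit per iteration.

    Returns integers (q, r) such that a * 2**n == q * b + r with 0 <= r < b,
    which is A = Q*B + 2**-n * R for the fractions A, B, Q, R.
    """
    assert 0 <= a < b < 2**n, "the sketch assumes A < B so that Q < 1"
    q, r = 0, a
    for _ in range(n):
        r <<= 1                    # shift the partial remainder left by one bit
        bit = 1 if r >= b else 0   # does the divisor fit?
        r -= bit * b               # subtract it only when it does
        q = (q << 1) | bit         # append the new quotient bit
    return q, r


q, r = restoring_divide(0b1001, 0b1010, 4)   # the operands of the example
print(format(q, "04b"), format(r, "04b"))
```

The loop runs n times regardless of the operands, which is the iteration effort this slide refers to.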

Page 8: Datapath Designs

Division – Memory effort

• Lookup table is the simplest way to obtain multiple partial quotient bits in each iteration.

• SRT method: a lookup table stores m-bit partial quotients selected by m bits of the partial remainder and m bits of the divisor.

Table size: $2^{2m} \times m$ bits

• The SRT method is limited by the memory wall.
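To see how quickly the table grows, the size can be evaluated for a few values of m. This sketch reads the table size on this slide as $2^{2m}$ entries of m bits each; the function name `srt_table_bits` and the sample values of m are illustrative assumptions.

```python
def srt_table_bits(m: int) -> int:
    """Bits in a table addressed by m remainder bits and m divisor bits, with m-bit entries."""
    return (2 ** (2 * m)) * m


for m in (4, 8, 12, 16):
    print(f"m = {m:2d}: {srt_table_bits(m):,} bits")
```

Each additional quotient bit per iteration multiplies the number of entries by four, which is why the table size rather than the logic becomes the limiting factor.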

Page 9: Datapath Designs

Division – Arithmetic effort

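• Partial quotient is calculated by arithmetic functions.

• Prescaling: $\frac{A}{B} = \frac{EA}{EB}$ with $EB \approx 1$, and $R_i = R_{i-1} - Q_i\,EB$

• Taylor expansion: $B = B_h + B_l$, $E = \frac{1}{B_h}\left(1 - \frac{B_l}{B_h}\right)$, $Q_i = R_{i-1}E$

• Series expansion: $B = 1 + X$, $E = \frac{1}{B} = 1 - X + X^2 - X^3 + \cdots = (1 - X)(1 + X^2)(1 + X^4)\cdots$, $Q_i = R_{i-1}E$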
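The series expansion relies on the identity $\frac{1}{1+X} = (1 - X)(1 + X^2)(1 + X^4)\cdots$, in which every extra factor squares the remaining error. A minimal Python sketch with exact rational arithmetic shows this; the function name `series_reciprocal` and the sample divisor are illustrative assumptions.

```python
from fractions import Fraction


def series_reciprocal(b: Fraction, factors: int) -> Fraction:
    """Approximate 1/b for b = 1 + x with the product (1 - x)(1 + x^2)(1 + x^4)..."""
    x = b - 1
    e = 1 - x          # first factor
    power = x * x      # x^2, then x^4, x^8, ...
    for _ in range(factors - 1):
        e *= 1 + power
        power *= power
    return e


b = Fraction(15, 16)   # B = 1 + X with X = -1/16
for k in (1, 2, 3):
    e = series_reciprocal(b, k)
    # B * E equals 1 - X^(2^k), so the error shrinks quadratically with each factor.
    print(k, float(b * e), float(1 - b * e))
```

The quadratic convergence only pays off when X is already small, that is, when the divisor is close to 1.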

Page 10: Datapath Designs

Division – Solution space

• Modern FPGAs contain plenty of memory and built-in multipliers, which enable high-performance dividers.

[Figure: solution space with three axes: iteration effort, memory effort, arithmetic effort. Labeled items: memory wall, pencil-and-paper, SRT, prescaling, Taylor expansion, series expansion. Annotations: low area, low latency, our target]

Page 11: Datapath Designs

Division – PST algorithm

• Exploit the power of series expansion, which needs a good starting point.

• Prescaling provides a scaled divisor close to 1.

• A 0-order Taylor expansion iterates to reach the final quotient.

$\frac{A}{B} = \frac{EA}{EB}$, $EB \approx 1$

$B = 1 + X$, $E = \frac{1}{B} = \frac{1}{1 + X} \approx (1 - X)(1 + X^2)$

$Q_i = R_{i-1}E$

Page 12: Datapath Designs

Division – PST algorithm

E0 = Table(B(m)) ≈ 1/B

A1 = A × E0; B1 = B × E0

E1 = (2 − B1) ≈ INV(B1(2m))

Qi = Ri-1 × E1 (with R0 = A1)

Ri = Ri-1 − Qi × B1

Q = Q + Qi

Example:

A = 0.1011,0110; B = 0.1100,1011

B(m) = 0.1100; E0 = 1.0011

A1 = A × E0 = 0.1101,1000,0010; B1 = B × E0 = 0.1111,0001,0001

E1 = INV(B1(2m)) = 1.0000,1110

Q1 = A1 × E1 = 0.1110,0011; R1 = A1 − Q1 × B1 = 0.0000,0010,0101,1110,1101

Q2 = R1 × E1 = 0.1001,1111 × 2^-6; R2 = R1 − Q2 × B1 = 0.0000,0001,1111,1011,0001 × 2^-6

Q = 0.1110,0011 + 0.0000,0010,0111,11 = 0.1110,0101,0111,11
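The worked example can be replayed with exact rational arithmetic. The Python sketch below is an illustration under two stated assumptions that are inferred from the digits of the example rather than spelled out on the slide: INV is taken to be a bitwise inversion of the top 2m bits of B1 (which approximates 2 − B1), and each partial quotient keeps 2m bits, placed 2m − 2 bits lower than the previous one. The names `truncate` and `pst_divide` are hypothetical.

```python
from fractions import Fraction


def truncate(value, bits):
    """Keep `bits` fractional bits of a non-negative value and drop the rest."""
    return Fraction(int(value * 2**bits), 2**bits)


def pst_divide(a, b, e0, m=4, iterations=2):
    """Return (quotient, final remainder, E1) for the prescaled iteration."""
    a1, b1 = a * e0, b * e0                  # prescaling: b1 is close to 1
    ulp = Fraction(1, 2 ** (2 * m))
    e1 = 2 - ulp - truncate(b1, 2 * m)       # assumption: INV = bitwise inversion
    q, r = Fraction(0), a1                   # R0 = A1
    for i in range(iterations):
        # Assumption: 2m bits per pass, each pass 2m - 2 bits lower than the last.
        qi = truncate(r * e1, 2 * m + i * (2 * m - 2))
        r -= qi * b1                         # Ri = Ri-1 - Qi * B1
        q += qi                              # Q = Q + Qi
    return q, r, e1


a = Fraction(0b10110110, 2**8)   # A  = 0.1011,0110
b = Fraction(0b11001011, 2**8)   # B  = 0.1100,1011
e0 = Fraction(0b10011, 2**4)     # E0 = 1.0011, taken from the example's table lookup
q, r, e1 = pst_divide(a, b, e0)
print(format(int(q * 2**14), "014b"), float(q), float(a / b))
```

Because the remainder is updated with the exact product Qi × B1, the relation A1 = Q × B1 + R holds after every pass, so truncating a partial quotient only delays convergence and does not introduce an error that later passes cannot correct.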

Page 13: Datapath Designs

Division – FPGA Implementation

• The PST algorithm is suitable for high-performance division unit design in FPGAs.

Design | Fmax (Period) | ALUTs | Memory Bits | DSP Blocks | Power Consumption (Dynamic + Static) | Throughput

IP Core (no DSP) | 50.16 MHz (19.935 ns) | 1203 | 84 | 0 | 381 mW (52 mW + 329 mW) | 50.16 Mdiv/s

PST (DSP) | 72.8 MHz (13.737 ns) | 213 | 768 | 28 | 350 mW (23 mW + 327 mW) | 24.3 Mdiv/s

PST (no DSP) | 73.20 MHz (13.661 ns) | 1437 | 768 | 0 | 378 mW (50 mW + 328 mW) | 24.4 Mdiv/s

PST-pipelined (DSP) | 74.15 MHz (13.486 ns) | 261 | 768 | 40 | 344 mW (17 mW + 327 mW) | 74.15 Mdiv/s

PSTp (no DSP) | 76.05 MHz (13.150 ns) | 1940 | 768 | 0 | 359 mW (31 mW + 328 mW) | 76.05 Mdiv/s

32-bit division with 5-cycle latency