Datapath Designs CK Cheng CSE Department UC, San Diego.


Datapath Designs

CK Cheng

CSE Department

UC, San Diego


Prefix Adder – Well-known and Well-developed?

• Classic prefix networks: Sklansky, Kogge-Stone, Brent-Kung, Ladner-Fischer, Han-Carlson, Knowles, etc.
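
As a concrete instance of one of these networks, here is a minimal Python sketch of a Kogge-Stone prefix adder. The function name `kogge_stone_add` and the 8-bit default width are illustrative assumptions, not taken from the slides.

```python
def kogge_stone_add(a, b, n=8):
    """Add two n-bit unsigned integers with a Kogge-Stone prefix network.

    Returns (sum including carry-out, number of prefix levels used).
    """
    # Bitwise generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    G, P = g[:], p[:]
    d, levels = 1, 0
    while d < n:
        G_next, P_next = G[:], P[:]
        for i in range(d, n):
            # Combine the span ending at bit i with the span ending at bit i - d.
            G_next[i] = G[i] | (P[i] & G[i - d])
            P_next[i] = P[i] & P[i - d]
        G, P = G_next, P_next
        d *= 2
        levels += 1

    # G[i] is now the carry out of bits i..0 (carry-in assumed to be 0).
    carries = [0] + G[:n - 1]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    total |= G[n - 1] << n
    return total, levels


total, levels = kogge_stone_add(0b10110110, 0b11001011)
assert total == 0b10110110 + 0b11001011
```

The number of levels grows as log2(n), which is the "depth" that the later slides trade against power and area.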


Prefix Adder – New Aspects, New Method

• Realistic design considerations: Timing, Power and Area.

• Integer Linear Programming for prefix adders:
  – Logical effort timing model (gate cap. + wire cap.)
  – Activity-statistic power model
  – Non-uniform signal arrival/required times
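
For reference, the standard logical effort stage delay is d = g·h + p, where g is the logical effort, p the parasitic delay, and h = C_load / C_in the electrical effort. The sketch below assumes that "gate cap. + wire cap." means the load is the sum of the driven gate capacitance and the wire capacitance. The function and parameter names are hypothetical, and the slides do not give the exact model used in the ILP formulation.

```python
def stage_delay(g, p, c_in, c_gate_load, c_wire):
    """Logical effort delay of one stage, in units of tau.

    Assumption: the load seen by the stage is gate capacitance plus
    wire capacitance, as suggested by the slide's parenthetical.
    """
    h = (c_gate_load + c_wire) / c_in  # electrical effort
    return g * h + p


# Inverter (g = 1, p = 1) driving four copies of itself, no wire load.
fo4 = stage_delay(g=1.0, p=1.0, c_in=1.0, c_gate_load=4.0, c_wire=0.0)
# Same stage with a wire load equal to two input capacitances.
fo4_wire = stage_delay(g=1.0, p=1.0, c_in=1.0, c_gate_load=4.0, c_wire=2.0)
```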

[Figure: diagram labeled Logic Levels, Max Fanouts, Max Wire Tracks, Timing, Power, Area]


Prefix Adder – Optimum Prefix adders

• Uniform signal arrival/required times

[Figure panels: Sklansky Adder; Kogge-Stone Adder; Fastest depth-4 optimal prefix adder; Fastest depth-3 optimal prefix adder]


Prefix Adder – Optimum Prefix adders

• Uniform signal arrival/required times

[Figure: plot of Timing and Power with series Depth = 3, Depth = 4, Depth = 5 and reference points Brent-Kung, Kogge-Stone, Sklansky]


Prefix Adder – Optimum Prefix adders

• Non-uniform signal arrival/required times

[Figure panels: Increasing Signal Arrival Times; Decreasing Signal Arrival Times; Convex Signal Arrival Times]


Division – Iteration effort

• Pencil-and-paper method: $A = QB + 2^{-n}R$ with $R < B$

• 1 bit of partial quotient per iteration, n iterations

Example: A = 0.1001, B = 0.1010, Q = A / B = 0.1110

Qi: partial quotient

Ri: partial remainder

Ri = Ri-1 – B Qi, with R0 = A

Long division on the 4-bit mantissas (remainder shifted left one bit per step):

R0 = A = 1001
Step 1: 10010 – 1010 = 1000 = R1, Q1 = 0.1
Step 2: 10000 – 1010 = 0110 = R2, Q2 = 0.01
Step 3: 01100 – 1010 = 0010 = R3, Q3 = 0.001
Step 4: 00100 – 0000 = 0100 = R4, Q4 = 0.0000
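
The one-bit-per-iteration scheme can be sketched as a restoring division on integer mantissas. This is a minimal illustration, not the slides' implementation; the function name `restoring_divide` and the integer encoding (an n-bit fraction 0.b1...bn stored as the integer b1...bn) are assumptions made for the example.

```python
def restoring_divide(a, b, n):
    """Divide two n-bit fractions given as integer mantissas, with a < b.

    Returns (q, r) such that a * 2**n == q * b + r and r < b, which is
    the integer form of A = Q*B + 2**-n * R with R < B.
    """
    q, r = 0, a
    for _ in range(n):
        r <<= 1            # bring down the next bit position
        q <<= 1
        if r >= b:         # trial subtraction succeeds: quotient bit is 1
            r -= b
            q |= 1
    return q, r


# Operands from the slide: A = 0.1001, B = 0.1010, n = 4.
q, r = restoring_divide(0b1001, 0b1010, 4)
assert (0b1001 << 4) == q * 0b1010 + r and r < 0b1010
```

Each loop iteration produces exactly one quotient bit, so an n-bit quotient costs n iterations. That cost is what the memory-based and arithmetic-based methods on the following slides try to reduce.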


Division – Memory effort

• A lookup table is the simplest way to obtain multiple partial quotient bits in each iteration.

• SRT method: a lookup table stores m-bit partial quotients selected by m bits of the partial remainder and m bits of the divisor.

Table size: $2^{2m} \times m$ bits

• The SRT method is limited by the memory wall.
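
To see why the table hits a memory wall, the size formula can be evaluated for a few values of m. This is a small sketch; the function name `srt_table_bits` is hypothetical, and it assumes 2m index bits (m from the remainder, m from the divisor) and m-bit entries.

```python
def srt_table_bits(m):
    """Bits needed for a table indexed by 2m bits with m-bit entries."""
    return (1 << (2 * m)) * m


for m in (4, 8, 16):
    print(m, srt_table_bits(m))
# 4 1024
# 8 524288
# 16 68719476736
```

Doubling the number of quotient bits per iteration squares the number of table entries, so the table quickly becomes impractical.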


Division – Arithmetic effort

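• Partial quotient is calculated by arithmetic functions.

• Prescaling: $\frac{A}{B} = \frac{A \cdot E}{B \cdot E}$, $B \cdot E \approx 1$, $Q_i = \frac{R_i}{B} \approx R_i E$

• Taylor expansion: $E = \frac{1}{B_h} - \frac{B_l}{B_h^2} + \frac{B_l^2}{B_h^3} \approx \frac{1}{B}$, where $B = B_h + B_l$, and $Q_i = R_i E$

• Series expansion: $B = 1 + X$, $E = \frac{1}{B} = 1 - X + X^2 - X^3 + \cdots = (1 - X)(1 + X^2)(1 + X^4)\cdots$, and $Q_i = R_i E$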
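
The series-expansion factorization can be checked with exact arithmetic. With B = 1 + X, multiplying B by the k-factor product (1 − X)(1 + X²)(1 + X⁴)... gives 1 − X^(2^k), so each extra factor doubles the number of correct bits in the reciprocal. The sketch below is illustrative only, and the function name `series_reciprocal` is an assumption.

```python
from fractions import Fraction


def series_reciprocal(x, factors=3):
    """Approximate 1 / (1 + x) by (1 - x)(1 + x^2)(1 + x^4)..., using `factors` terms."""
    e = 1 - x
    power = x * x
    for _ in range(factors - 1):
        e *= 1 + power
        power *= power      # x^2 -> x^4 -> x^8 ...
    return e


x = Fraction(1, 8)          # divisor B = 1 + 1/8
b = 1 + x
e = series_reciprocal(x, factors=3)
assert b * e == 1 - x ** 8  # error term shrinks to x^8 after three factors
```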


Division – Solution space

• Modern FPGAs contain plenty of memory and built-in multipliers, which enable high-performance dividers.

[Figure: division solution space with axes Iteration Effort, Memory Effort, and Arithmetic Effort; labels: Memory Wall, Pencil-and-paper, SRT, Prescaling, Taylor Expansion, Series Expansion, Low area, Low latency, Our target]


Division – PST algorithm

• Exploit the power of series expansion, which needs a good starting point.

• Prescaling provides a scaled divisor close to 1.

• 0-order Taylor expansion iterates to reach the final quotient.

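Prescaling: $\frac{A}{B} = \frac{A \cdot E}{B \cdot E}$, $B \cdot E \approx 1$

Series expansion: $B = 1 + X$, $E = \frac{1}{B} \approx 1 - X$, $B \cdot E = (1 + X)(1 - X) = 1 - X^2$

Iteration: $Q_i = R_i E$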


Division – PST algorithm

Algorithm:

E0 = Table(B(m)) ≈ 1/B

A1 = A·E0; B1 = B·E0

E1 = (2 – B1) ≈ INV(B1(2m))

Qi = Ri-1·E1, with R0 = A1

Ri = Ri-1 – Qi·B1

Q = Q + Qi

Example (m = 4):

A = 0.1011,0110; B = 0.1100,1011

B(m) = 0.1100; E0 = 1.0011

A1 = A·E0 = 0.1101,1000,0010; B1 = B·E0 = 0.1111,0001,0001

E1 = INV(B1(2m)) = 1.0000,1110

Q1 = A1·E1 = 0.1110,0011; R1 = A1 – Q1·B1 = 0.0000,0010,0101,1110,1101

Q2 = R1·E1 = 0.1001,1111 × 2^-6; R2 = R1 – Q2·B1 = 0.0000,0001,1111,1011,0001 × 2^-6

Q = 0.1110,0011 + 0.0000,0010,0111,11 = 0.1110,0101,0111,11
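
The worked example can be replayed with exact rational arithmetic. The sketch below is one reading of the slide, not the authors' implementation. It rests on these assumptions: E0 is passed in as the table output rather than recomputed, INV is a bitwise inversion of a word with one integer bit and 2m fractional bits, and partial quotients are truncated to 8 and then 14 fractional bits, as read off the example. The names `trunc`, `inv`, and `pst_divide` are hypothetical.

```python
from fractions import Fraction


def trunc(x, bits):
    """Truncate a non-negative Fraction to `bits` fractional bits."""
    scale = 1 << bits
    return Fraction((x.numerator * scale) // x.denominator, scale)


def inv(x, bits):
    """Bitwise-invert x (0 <= x < 2), seen as 1 integer bit + `bits` fractional bits."""
    scale = 1 << bits
    word = (x.numerator * scale) // x.denominator   # truncates x to `bits` bits
    return Fraction((2 * scale - 1) - word, scale)


def pst_divide(a, b, e0, m, iterations=2):
    """Prescale by e0, then iterate Qi = Ri-1 * E1, Ri = Ri-1 - Qi * B1."""
    a1, b1 = a * e0, b * e0                 # prescaling: b1 is close to 1
    e1 = inv(b1, 2 * m)                     # approximates 2 - b1
    q, r = Fraction(0), a1
    trace = []
    for i in range(iterations):
        # Assumption from the example: 2m bits first, then 2m - 2 new bits per step.
        q_bits = 2 * m + i * (2 * m - 2)
        qi = trunc(r * e1, q_bits)
        r = r - qi * b1
        q += qi
        trace.append((qi, r))
    return q, trace


A = Fraction(0b10110110, 1 << 8)    # 0.1011,0110
B = Fraction(0b11001011, 1 << 8)    # 0.1100,1011
E0 = Fraction(0b10011, 1 << 4)      # 1.0011, taken from the slide
q, trace = pst_divide(A, B, E0, m=4)
# q == Fraction(14687, 16384), i.e. 0.1110,0101,0111,11
```

Under these assumptions the first remainder is 9709 / 2^20, which matches the slide's R1 bit pattern 0.0000,0010,0101,1110,1101.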


Division – FPGA Implementation

• The PST algorithm is suitable for high-performance division unit design in FPGAs.

Design                   Fmax (Period)          ALUTs  Memory Bits  DSP Blocks  Power (Dynamic + Static)  Throughput
IP Core (no DSP)         50.16 MHz (19.935 ns)  1203   84           0           381 mW (52 mW + 329 mW)   50.16 Mdiv/s
PST (DSP)                72.8 MHz (13.737 ns)   213    768          28          350 mW (23 mW + 327 mW)   24.3 Mdiv/s
PST (no DSP)             73.20 MHz (13.661 ns)  1437   768          0           378 mW (50 mW + 328 mW)   24.4 Mdiv/s
PST-pipelined (DSP)      74.15 MHz (13.486 ns)  261    768          40          344 mW (17 mW + 327 mW)   74.15 Mdiv/s
PST-pipelined (no DSP)   76.05 MHz (13.150 ns)  1940   768          0           359 mW (31 mW + 328 mW)   76.05 Mdiv/s

32-bit division with 5-cycle latency