EE457 Discussion Fall 2006 Final Review Brandon Franzke, Maryam Soltan, USC2006 and Wei-Jen Hsu, USC 2005

EE457 Discussion Fall 2006

Final Review

Brandon Franzke, Maryam Soltan, USC2006

and Wei-Jen Hsu, USC 2005

Review Questions

• Question 2, Fall 2004 (Multi-cycle CPU)

• Question 3, Summer 2004 (Pipeline CPU)

• Question 1, Fall 2004 (Based on lab 7 pipeline)

• Carry Look Ahead Adder

• Question 5, Summer 2004 (CLA)

Example - Multicycle CPU

• Modifications to the 2nd Edition CU (state diagram) and DPU.
  – Mr. Trojan already modified the DPU.

• Notice: Standalone registers (MDR, ALUout) are fast even though the RegFile is not.
  – Standalone = instantaneous
  – Register File = ½ clock

• So we want to skip states 4 (lw) & 7 (R-type).
  – Implement “posted-write” (next page).

Posted-Write

• We needed states 4 & 7 because register writing takes ½ clock.
  – But we already have the data stored in MDR and ALUout for these states.
  – Can we delay writing until the beginning of the next instruction (state 0)?
  – What about control signals?

• This is a “Posted-Write”.
  – A write operation “posted” (scheduled) to occur later.

Posted-Write Implementation

• Well, we just save the control signals for 1 extra clock with flip-flops!
  – RegDst, RegWrite, MemToReg

• Now the signals are available for 1 extra clock.
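
The idea can be sketched in a few lines of Python. This is only a behavioral model under assumptions: the class name `PostedWriteRegs`, the helper `posted_write_data`, and the choice to reload the flip-flops on every clock edge are hypothetical. Only the signal names RegDst, RegWrite, MemToReg and the registers MDR and ALUout come from the slides.

```python
class PostedWriteRegs:
    """Flip-flops that hold the write-back control signals for one extra clock.

    Hypothetical model: it assumes the flip-flops are loaded on every clock edge.
    """

    def __init__(self):
        # Values currently driven by the flip-flop outputs.
        self.held = {"RegDst": 0, "RegWrite": 0, "MemToReg": 0}

    def clock_edge(self, reg_dst, reg_write, mem_to_reg):
        """Capture the control unit's outputs at the end of the current state."""
        self.held = {"RegDst": reg_dst, "RegWrite": reg_write, "MemToReg": mem_to_reg}


def posted_write_data(held, mdr, alu_out):
    """Value written to the register file one clock later, or None if no write is posted."""
    if not held["RegWrite"]:
        return None
    # MemToReg selects MDR (lw); otherwise ALUout (R-type).
    return mdr if held["MemToReg"] else alu_out


ff = PostedWriteRegs()
ff.clock_edge(reg_dst=0, reg_write=1, mem_to_reg=1)    # last state of a lw
print(posted_write_data(ff.held, mdr=0xAB, alu_out=0x10))
```

In this model the write happens during the clock after the signals were asserted, which is state 0 of the next instruction, so states 4 and 7 are no longer needed.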

Questions

• DPU modifications are complete; modify the CU to implement register posted-write.
  – DPU and CU on the next pages.

• What justification did Mr. Trojan give his boss for using positive-edge-triggered flip-flops?
  – The design team says that positive-edge-triggered FFs cost extra. Can Mr. Trojan use negative-edge-triggered FFs instead?

When to load FF

• Ms. Bruin suggested a RegWrite_FF_Write signal, as shown in the figure (not reproduced in this transcript). Comment on the design and its necessity.

Posted Write for sw

• Ms. Bruin was given another chance by the lead engineer. She tried to copy Mr. Trojan and suggested saving a clock in the sw instruction by skipping state 5 and adding the following two FFs. Advice?

Example – Pipeline CPU

A new 4-stage Pipeline

• MEM before EX

• No spurious stalls

• New R-Type instr. (ex.: addm, …)
  – Use memory operand as a source operand

• Writing to RegFile takes very little time => No separate WB stage

• Memory: One read port

• Beq in EX stage

• EAC not possible => Revised lw and sw

New 4-stage Pipeline ….

• addm

• Investigate data dependencies and implement HDU and FU

• Avoid any spurious stalls (stall only when really dependent).

• No internal forwarding in memory
  – Cannot write to and read from memory simultaneously.

New 4-stage Pipeline …. (sw, lw)

• BEQ is executed after the _____ stage, in the ____ stage.

• Where should we execute sw?

• Where should we execute lw?

beq rs,rt, Target;

sw rt, (rs); MEM[(rs)]<= (rt)

New 4-stage Pipeline …. (Hazard and stalling)

Regular pipeline | 4-stage pipeline

Dependencies/RAW hazards for register operand

Dependencies/RAW hazards for register operand

Instruction to activate MemRead

Stalling instruction in ______ stage

Condition of stalling

New 4-stage Pipeline …. (Hazard and stalling…)

sw $1, ($2);
lw $4, ($2);
addm $8, ($2), $4;
subm $16, ($8), $4;
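
As a starting point for the dependency investigation, here is a small Python sketch that lists the register RAW dependencies in this sequence. It rests on assumptions not stated in the slides: the first operand of lw, addm and subm is taken to be the destination, and the operand order of subm is a guess. The function name `raw_register_hazards` is hypothetical.

```python
# Each entry: (mnemonic, registers read, register written or None).
# Assumption: the first operand of lw/addm/subm is the destination register.
program = [
    ("sw",   {1, 2}, None),  # MEM[$2] <= $1
    ("lw",   {2},    4),     # $4 <= MEM[$2]
    ("addm", {2, 4}, 8),     # $8 <= MEM[$2] + $4
    ("subm", {8, 4}, 16),    # $16 <= MEM[$8] - $4 (operand order assumed)
]


def raw_register_hazards(prog):
    """Return (producer index, consumer index, register) for every register RAW pair."""
    hazards = []
    for j, (_, reads, _) in enumerate(prog):
        for reg in sorted(reads):
            # Look backwards for the most recent instruction that writes reg.
            for i in range(j - 1, -1, -1):
                if prog[i][2] == reg:
                    hazards.append((i, j, reg))
                    break
    return hazards


print(raw_register_hazards(program))
```

Under these assumptions the sketch reports lw → addm on $4, lw → subm on $4, and addm → subm on $8. It covers register operands only; the memory dependency between sw and the following lw (same address in $2) is not modeled.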

Lab7, modified

• Now implement SUB3 and SUB6 instructions (SUB3 in EX1 and EX2).
  – Still have NOP.

• Optimize performance by performing SUB3 in EX1 or EX2 (i.e., minimize stalling).

• The new stalling policy:
  – Never stall SUB3, and stall SUB6 iff it is dependent on the preceding instruction.
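
The stalling policy can be written down as a small predicate. This is a sketch only: the `Instr` record with `op`, `src` and `dst` fields and the function name `must_stall` are hypothetical and do not reflect the actual lab 7 signal names.

```python
from collections import namedtuple

# Hypothetical instruction record: op is "SUB3", "SUB6" or "NOP";
# src/dst are register numbers (None for NOP).
Instr = namedtuple("Instr", "op src dst")


def must_stall(current, preceding):
    """Stall only a SUB6 that reads the register written by the instruction just ahead."""
    if current.op != "SUB6":
        return False  # SUB3 and NOP are never stalled
    if preceding.op == "NOP":
        return False  # a NOP writes nothing, so there is no dependency
    return current.src == preceding.dst


print(must_stall(Instr("SUB6", 5, 7), Instr("SUB3", 1, 5)))  # dependent SUB6
print(must_stall(Instr("SUB3", 5, 7), Instr("SUB6", 1, 5)))  # SUB3 is never stalled
```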

Logic Blocks

• Postponing logic
  – Assertions to perform SUB3 in EX1 or EX2.
  – Prefer EX1 so data is available to forward.

• HDU
  – Stall only dependent SUB6 instructions.

• FU1 and FU2
  – Forwarding from EX2→EX1 and WB→EX2.

Stall vs. Flushing

• When do you flush and when do you stall?
  – How many instructions do you flush at a time?
  – How many instructions in the pipe do you stall?
  – Do flushing & stalling have anything in common?
  – Which of them results in bubbles?
  – Is the penalty due to flushing / stalling more severe in deeper pipelines (say 7-10 stages)?
  – How do delay slots affect the penalty?

1-bit CLA adder

[Figure: 1-bit adder cell (+) with inputs A, B, Cin and outputs S, p, g]

• p: propagator => p = A+B (If either A or B is 1, Cin = 1 causes Cout = 1)

• g: generator => g = AB (If both A and B are 1, Cout = 1 for sure)

• p, g are generated in 1 gate delay after we have A, B.

• Note that Cin is not needed to produce p and g.

• S is generated in 2 gate delays after we get Cin (SOP).
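
A bit-level sketch of the cell in Python, for checking the definitions above. The function names `cla_cell` and `carry_out` are hypothetical; the equations for p and g are the ones on the slide, and the sum is written as the usual XOR of the three inputs.

```python
def cla_cell(a, b, cin):
    """1-bit CLA cell: returns (s, p, g) for input bits a, b and carry-in cin."""
    p = a | b        # propagate: a carry-in of 1 would be passed on
    g = a & b        # generate: a carry-out is produced regardless of cin
    s = a ^ b ^ cin  # sum bit, the only output that needs cin
    return s, p, g


def carry_out(p, g, cin):
    """Carry-out expressed with p and g: Cout = g + p*Cin."""
    return g | (p & cin)


# Compare against ordinary addition for all 8 input combinations.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, p, g = cla_cell(a, b, cin)
            assert s == (a + b + cin) & 1
            assert carry_out(p, g, cin) == (a + b + cin) >> 1
```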

4-bit CLA

[Figure: four 1-bit cells (+) with inputs A0 B0 … A3 B3 and C0, feeding p0 g0 … p3 g3 into the CLL (carry look-ahead logic), which returns C1, C2, C3; outputs S3 S2 S1 S0]

• The CLL takes p, g from all 4 bits and C0 as input to generate all Cs in 2 gate delays.

• C1 = g0 + p0C0

• C2 = g1 + p1g0 + p1p0C0

• C3 = g2 + p2g1 + p2p1g0 + p2p1p0C0

• C4 = g3 + p3g2 + p3p2g1 + p3p2p1g0 + p3p2p1p0C0 (Note: C4 looks complicated, but it is still a 2-level SOP expression.)
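
The carry equations can be checked exhaustively with a short Python sketch. The function names `cll4` and `add4` are hypothetical; the carry expressions are copied from the slide, with bit 0 as the LSB.

```python
def cll4(p, g, c0):
    """Carry look-ahead logic: p and g are lists of 4 bits (index 0 = LSB)."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return c1, c2, c3, c4


def add4(a, b, c0=0):
    """4-bit CLA: returns (4-bit sum, carry-out C4)."""
    abits = [(a >> i) & 1 for i in range(4)]
    bbits = [(b >> i) & 1 for i in range(4)]
    p = [x | y for x, y in zip(abits, bbits)]  # all p in parallel
    g = [x & y for x, y in zip(abits, bbits)]  # all g in parallel
    c1, c2, c3, c4 = cll4(p, g, c0)
    carries = [c0, c1, c2, c3]                 # carry into each bit position
    s = sum((abits[i] ^ bbits[i] ^ carries[i]) << i for i in range(4))
    return s, c4


# Exhaustive check against ordinary addition.
for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            total = a + b + c0
            assert add4(a, b, c0) == (total & 0xF, total >> 4)
```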

4-bit CLA

[Figure: 4-bit CLA block diagram (four 1-bit cells and the CLL)]

• Given A, B’s, all p, g’s are generated in 1 gate delay in parallel.

• Given all p, g’s, all C’s are generated in 2 gate delays in parallel.

• Given all C’s, all S’s are generated in 2 gate delays in parallel.

• Key virtue of CLA: sequential operation in RCA is broken into parallel operation!!

16-bit CLA

• Same as before, p, g’s are generated in parallel in 1 gate delay.

• Now, without an input carry, the first-tier CLLs cannot generate C’s yet. Instead they generate P, G’s (group propagator and group generator) in 2 gate delays.
  – P => This group will propagate the input carry through the group: P = p0p1p2p3
  – G => This group will generate an output carry: G = g3 + p3g2 + p3p2g1 + p3p2p1g0

• The second-tier CLL takes the P, G’s from the first-tier CLLs and C0 to generate “seed C’s” for the first-tier CLLs in 2 gate delays. (Note that the logic for generating “seed C’s” from P, G’s is exactly the same as for generating C’s from p, g’s!)

• With the seed C’s as input, the first-tier CLLs use Cin and p, g’s to generate C’s in 2 gate delays.

• With all C’s in place, S’s are calculated in 2 gate delays.

Therefore, in total 1+2+2+2+2 = 9 gate delays to finish the whole thing!!
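
The two-tier flow can be traced with a Python sketch. The names `group_pg`, `lookahead` and `add16` are hypothetical. For brevity, `lookahead` computes the carries with the recurrence c(i+1) = g(i) + p(i)c(i), which is logically equivalent to the flattened 2-level SOP that the hardware would use; the gate-delay figures in the comments are the ones from the slide.

```python
def group_pg(p, g):
    """Group propagate/generate of a 4-bit block (index 0 = LSB of the block)."""
    P = p[0] & p[1] & p[2] & p[3]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return P, G


def lookahead(p, g, c0):
    """Carries c1..c4 from four (p, g) pairs; the same logic serves both tiers."""
    carries, c = [], c0
    for pi, gi in zip(p, g):
        c = gi | (pi & c)
        carries.append(c)
    return carries


def add16(a, b, c0=0):
    """16-bit two-tier CLA: returns (16-bit sum, carry-out C16)."""
    abits = [(a >> i) & 1 for i in range(16)]
    bbits = [(b >> i) & 1 for i in range(16)]
    p = [x | y for x, y in zip(abits, bbits)]      # 1 gate delay
    g = [x & y for x, y in zip(abits, bbits)]
    # First tier, no carries yet: group P, G per block (2 gate delays).
    groups = [group_pg(p[i:i + 4], g[i:i + 4]) for i in range(0, 16, 4)]
    P = [pg[0] for pg in groups]
    G = [pg[1] for pg in groups]
    # Second tier: seed carries C4, C8, C12, C16 (2 gate delays).
    seeds = lookahead(P, G, c0)
    block_cin = [c0] + seeds[:3]
    # First tier again: carries inside each block (2 gate delays).
    carries = []
    for blk in range(4):
        lo = 4 * blk
        inner = lookahead(p[lo:lo + 4], g[lo:lo + 4], block_cin[blk])
        carries += [block_cin[blk]] + inner[:3]
    # Sum bits (2 gate delays).
    s = sum((abits[i] ^ bbits[i] ^ carries[i]) << i for i in range(16))
    return s, seeds[3]


TOTAL_GATE_DELAYS = 1 + 2 + 2 + 2 + 2  # p,g + P,G + seed C's + C's + S's
print(add16(0xFFFF, 0x0001), TOTAL_GATE_DELAYS)
```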

Example - 64bit-CLA

• S39 takes longer to become valid.

• List of primary and intermediate signals in producing S39 (backtracking: S39 = A39 ⊕ B39 ⊕ C39, S39 <- C39 <- C36 …):
  – Do we need P39_36* and G39_36*?
  – Primary inputs:
  – Gate delay to generate p38_0, g38_0:
  – Gate delay for second level P*, G*:
  – Gate delay for second level P**, G**:
  – Gate delay C32:
  – p38, p37, p36, and g38, g37, g36
  – C32 → C36 → C39

Delay:

Other Topics

• Usually there is a question on non-linear pipelines.

• Please make sure that you are comfortable with cache and virtual memory organization.