EE457 Final, Spring 2016 - University of Southern California


May 6, 2016 11:18 am EE457 Final - Spring 2016 1 / 11 C Copyright 2016 Gandhi Puvvada

EE457 Final (~30%)

Closed-book Closed-notes Exam; No cheat sheets;

Calculators are not needed and are not allowed. Verilog Guides are not needed and are not allowed. Smart phones, tablets (and any kind of computing/Internet devices) are not allowed.

This is a Crowdmark exam. Please do not write on margins or on backside.

Spring 2016
Instructor: Gandhi Puvvada

Saturday, 5/7/2016
10:30 AM - 01:30 PM (3 Hour 00 min. = 180 min)

Location: SGM123

Viterbi School of Engineering
University of Southern California

Ques#  Topic                               Page#   Time       Points
1      Lab 7 Part 3 modified               2-5     70 min.    116
2      FIFO and ROB                        6       15 min.    39
3      Branch Prediction                   7       25 min.    48
4      CMP, CMT, Cache Coherency, LL/SC    8-9     35 min.    91
5      Tomasulo                            10-11   20 min.    42
6      Virtual Memory                      11      15 min.    29
Total                                      11      180 min.   365
Perfect Score                                                 340

Student’s Last Name: _______________________________________

Student’s First Name: _______________________________________

Student’s DEN Bb username: ______________________________ @usc.edu


1 ( 42 + 28 + 46 = 116 points) 70 min. Pipelining (Lab 7 Part 3 modified)

This is a mix of the two past midterm questions (from Fall 2010 and Spring 2011) that you were asked to go through. Please see the block diagram on the next page. It has five stages: IF, ID, EX1, EX2, and WB. There is an ADD4 unit in each of the EX1 and EX2 stages. The rest of this page is reproduced from the Fall 2010 Midterm solution. The BZ (branch if zero) line was added from the Spring 2011 midterm. Here, in EX1 also we have an ADD4 unit. No SUB3 at all. One can perform ADD4 in each of the two stages EX1 and EX2 to do ADD8 as shown below.

An ADD4 instruction can now execute from either EX1 or EX2. ADD4 tries to execute as soon as possible so that he can provide forwarding help to his juniors. However, he will not insist on executing from EX1 if he himself is dependent on, say, an ADD8 ahead of him. In that case he will skip EX1 (SKIP1 = 1) and execute from EX2. Let us call him "ADD4_Skpd1" (ADD4 Skipped EX1). Now if the next ADD4 is dependent on this ADD4_Skpd1, then he will also skip EX1 and will become another ADD4_Skpd1! Now if an ADD8 comes after him, he needs to stall himself as he cannot get help from this ADD4_Skpd1 in time.

It is important to know if the ADD4 currently standing in EX2 has finished execution already in EX1 (and hence may activate SKIP2 here) or skipped execution in EX1 and came here to get forwarding help and then perform the addition of 4 here in EX2. To this end, Mr. Trojan carried the SKIP1 signal into EX2 using a FF in the EX1/EX2 stage register. This signal is called EX2_SKIP1 (SKIP1 carried into EX2).

Do not forget the MOV instruction. Let us call the MOV instruction a "she". She necessarily skips both EX1 and EX2 and may receive forwarding help in either EX1 or EX2 or perhaps in both EX1 and EX2. Circle all correct statements below.

A. There are occasions where she has to receive the needed forwarding help only in EX1.
B. There are occasions where she has to receive the needed forwarding help only in EX2.
C. There is no harm if she receives help in EX1 from EX2 occupied by ADD8 or ADD4_Skpd1, as this wasteful help will be replaced by correct help in the next clock.
D. Unlike ADD4 or ADD8, she will never regret (feel sorry) for receiving help in EX2.
E. In short, she can receive help in EX1 as well as EX2 without any worry.

You never need to stall (circle all correct answers): a NOP, a MOV, an ADD4, an ADD8

Instruction     Operation                    Opcode                       MSD   32-bit instruction in hex
                                             MOV  SUB3/BZ  ADD4  ADD8           (D = Destination, S = Source)
NOP                                          0    0        0     0        0     000000DS
MOV $R, $X;     ($R) <= ($X)                 1    0        0     0        8     800000DS
SUB3 $R, $X;    ($R) <= ($X) - 3             0    1        0     0        4     400000DS
BZ $X, JJJJ;    (PC) <= JJJJ if ($X) = 0     0    1        0     0        4     4JJJJ0DS
ADD4 $R, $X;    ($R) <= ($X) + 4             0    0        1     0        2     200000DS
ADD8 $R, $X;    ($R) <= ($X) + 8             0    0        0     1        1     100000DS

[Figure (Q#1): Modified LAB 7 Part 3 Block Diagram. Stages: IF, ID, EX1, EX2, WB. Labeled blocks include the PC with its +1 incrementer, I-MEM, the register file (IFRF), the comparison stations in the ID and EX1 stages, HDU_BR, FU_BR, HDU_EX1, FU_EX1, FU_EX2, an ADD4 unit in each of EX1 and EX2, and the muxes FA_Mux, FB_Mux, XA_Mux, XB_Mux, R1_Mux and R2_Mux. Labeled signals include PCSource, IF_Flush, BranchAddress, JJJJ, ID_BZ, XD_ZERO, STALL_ID, STALL_EX1, ID_Bubble, EX1_Bubble, FA_Sel, FB_Sel, FORW_1A, FORW_1B, FORW_2, SKIP1, SKIP2, EX2_SKIP1, EX2_Write, WB_RA, WB_RD, WB_Write and RESET_B.]

Comp Station in the ID Stage:
ID_XMS1 = ID_XA Matched with his S1_RA (earlier ID_XMEX1), where S1_RA = Senior #1 RA (i.e. here EX1_RA).
ID_XMS2 = ID_XA Matched with his S2_RA (earlier ID_XMEX2), where Senior #2 RA = EX2_RA.

Comp Station in the EX1 Stage:
EX1_XMS1 = EX1_XA Matched with his S1_RA (=/= EX1_XMEX1), where S1_RA = Senior #1 RA (i.e. here EX2_RA).
EX1_XMS2 = EX1_XA Matched with his S2_RA (=/= EX1_XMEX2), where Senior #2 RA = WB_RA.

Match signals shown on the diagram: ID_XMS1, ID_XMS2, EX1_XMS1, EX1_XMS2, EX2_XMS1, EX2_XMS2.

1. Cross-out unneeded comparison units and unneeded comparison signal propagating FFs.

2. Complete the 14 items marked on this page.
Notes: 14 items: These are 5 EN (enables) for the 5 registers, 5 forwarding paths for the 5 forwarding muxes, PCSource, IF_Flush, ID_Bubble and EX1_Bubble. Actually you can cross-out either ID_Bubble or EX1_Bubble.

3. Produce the 8 items marked on the next few pages.
8 items: There are 2 HDUs, 3 FUs, 2 Skips and 1 Write control.

42pts


The following paragraph is from the Spring 2011 Midterm regarding the BZ instruction.

The BZ (Branch if Zero) instruction uses the opcode previously allocated to the SUB3 instruction. The instructions are 32 bits in size, but the addresses are only 16 bits. The PC is 16 bits wide and is incremented by a "1". The JJJJ in BZ $X, JJJJ stands for a 16-bit (4-digit hex) absolute branch address. If the source register $X is zero, then we branch to JJJJ [ (PC) <= JJJJ if ($X) = 0 ]. The "D" in "4JJJJ0DS" is a random hex digit and should not be treated as a valid destination, similar to the "DS" in "000000DS" for a NOP instruction. BZ executes from the ID stage.

Specifics of this semester’s question:

1. Mr. Trojan says that unlike in the MIPS 5-stage pipeline, in this 5-stage pipeline, you can stall an instruction such as ADD8 in the EX1 stage instead of in the ID stage. Explain the difference between the two pipelines.
______________________________________________________________________
______________________________________________________________________
So, here only the BZ instruction is stalled, if required, in the ID stage. If any other instruction needs a dependency stall, it is done at the EX1 stage.
When a BZ instruction is stalled in the ID stage, you also stall the _________________.
When an ADD8 instruction is stalled in EX1, you also stall the __________________.
BZ instruction may get stalled at most for _____ (1/2/3) clocks.
ADD8 instruction may get stalled at most for _____ (1/2/3) clocks.

2. Similar to the early branch of our lab 6, we have HDU_BR and FU_BR in the ID stage and FU_EX1 in the EX1 stage. But, unlike in our lab #6, we have HDU_EX1 in the EX1 stage. Also there is FU_EX2 in the EX2 stage.

3. Notice that we started using more appropriate names for the register match signals. For example, the earlier ID_XMEX1 is now called ID_XMS1 [ID_XA Matches with the S1_RA (standing for Senior #1 RA) where Senior #1 is the closest senior]. However, Miss Trojan has crossed out two of the earlier pipeline FFs in the ID/EX1 stage register (as shown in the diagram on the previous page) and added a fresh comparison station in EX1. She says that, in this design, you can use the ID stage comparator station inferences only in the ID stage, and they should not be carried into the EX1 stage. Please explain why.
______________________________________________________________________
______________________________________________________________________
The EX1 stage comparisons were however carried into the EX2 stage. Cross-out on the diagram any unneeded comparison units and comparison signals unnecessarily propagated downstream.
In the lab 6 Part 5, we removed 2 comparison units inside the IFRF as they are duplicates. T / F
In this design, we could remove 1 comparison unit inside the IFRF as it is a duplicate. T / F

4. The last MOV instruction in the sequence of the four MOV instructions below receives forwarding help (though redundantly) _____ (2/3/4/5/6/7) times (including inside the IFRF).

MOV $5, $1;
MOV $5, $2;
MOV $5, $3;
MOV $6, $5;

Point values shown in the margin of this page: 5pts, 8pts, 8pts, 7pts
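The instruction formats described above (MSD as a one-hot opcode digit, "4JJJJ0DS" for BZ, "x00000DS" for the others) can be illustrated with a small decoder. This is only a sketch for checking one's reading of the encoding; the function name `decode` and the returned dictionary layout are hypothetical and are not part of the exam or the lab.

```python
# Hypothetical decoder for the 32-bit instruction words of this exam's
# mini ISA. Field positions follow the hex patterns in the opcode table:
#   bits 31..28 = MSD (opcode digit), bits 27..12 = JJJJ (BZ only),
#   bits 7..4 = D (destination), bits 3..0 = S (source).
OPCODES = {0x0: "NOP", 0x8: "MOV", 0x4: "BZ", 0x2: "ADD4", 0x1: "ADD8"}


def decode(word: int) -> dict:
    """Split a 32-bit instruction word into its named fields."""
    msd = (word >> 28) & 0xF
    op = OPCODES.get(msd, "UNKNOWN")
    fields = {"op": op}
    if op == "BZ":
        # BZ has a source and a 16-bit absolute target; its D digit is
        # random and must not be treated as a destination.
        fields["src"] = word & 0xF
        fields["target"] = (word >> 12) & 0xFFFF
    elif op in ("MOV", "ADD4", "ADD8"):
        fields["dest"] = (word >> 4) & 0xF
        fields["src"] = word & 0xF
    # NOP: the DS digits are ignored, so no register fields are reported.
    return fields


if __name__ == "__main__":
    print(decode(0x41234056))  # a BZ on register $6 with target 1234h
    print(decode(0x80000056))  # MOV $5, $6
```

Reporting no destination for BZ and NOP mirrors the statement that their "D" digit is not a valid destination, which matters for any hazard logic that compares register addresses.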


This page provides answer space for the following signals (46pts):

STALL_ID
FA_Sel
FB_Sel
STALL_EX1
FORW_1A
FORW_1B
SKIP_1
FORW_2
SKIP_2
EX2_Write


2 ( 20 + 19 = 39 points) 15 min. The FIFO and the ROB labs

2.1 FIFO (single-clock FIFO):

2.1.1 When we use (n+1)-bit pointers for WP[n:0] and RP[n:0] for a 2**n location FIFO, we use the ________ (lower/upper) n bits namely _____________ ([n-1:0]/[n:1]) to ________________________________ (index the array of locations/perform WP-RP to arrive at depth/both/neither).

2.1.2 For a 128-location deep FIFO, since populated depth can vary from completely empty to completely full, we have ________ (127/128/129) depth values and should be doing __ (a/b/c/d).
(a) (WP-RP) mod-64 or (b) (WP-RP) mod-128 or (c) (WP-RP) mod-256 or (d) other
If multiple choices are possible, state where you would use what? ___________________________________________________________________________________________________

2.1.3 If you are using just 4-bit pointers for a 16-location FIFO, you need to set a lower threshold RAE and upper threshold RAF. Four of your junior engineers have set the thresholds as shown below. Comment/correct/praise them.
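The pointer arithmetic behind 2.1.1 and 2.1.2 can be tried out numerically. The sketch below models the common single-clock FIFO convention with (n+1)-bit pointers, where the lower n bits index the storage and the full pointer difference gives the depth; the function name `fifo_status` and its return layout are hypothetical, and the lab's actual Verilog may be organized differently.

```python
def fifo_status(wp: int, rp: int, n: int) -> dict:
    """Model a 2**n-location FIFO that uses (n+1)-bit WP and RP pointers."""
    ptr_mask = (1 << (n + 1)) - 1   # pointers wrap modulo 2**(n+1)
    idx_mask = (1 << n) - 1         # lower n bits select a storage location
    depth = (wp - rp) & ptr_mask    # (WP - RP) mod 2**(n+1)
    return {
        "depth": depth,
        "write_index": wp & idx_mask,
        "read_index": rp & idx_mask,
        "empty": depth == 0,
        "full": depth == (1 << n),
    }


if __name__ == "__main__":
    # 128-location FIFO: n = 7, so the pointers are 8 bits wide.
    print(fifo_status(wp=128, rp=0, n=7))   # full: depth 128, both indices 0
    print(fifo_status(wp=5, rp=250, n=7))   # WP has wrapped around: depth 11
```

With the extra pointer bit, the empty and full cases are distinguishable even though the two storage indices are equal in both.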

2.2 ROB: Compare our ROB lab with the Tomasulo part 2 (IoI-OoE-IoC) taught in class:

2.2.1 In Tomasulo, an LS-buffer was provided after cache as cache is a ___________ (fixed/variable) latency unit. Here, the dividers are all ___________ (fixed/variable) latency units. Once a divider is done, it _______ (A/B).
A. remains in its DONE state until the Issue Unit selects it for transfer to ROB.
B. goes to ROB.

2.2.2 Associative search of ROB is conducted in ______ (A/B/C/D)
A. only in the Tomasulo P2, B. only in this ROB lab, C. both, D. neither

2.2.3 WP goes ahead and then RP follows it in the ROB of ______ (A/B/C/D)
A. only in the Tomasulo P2, B. only in this ROB lab, C. both, D. neither

2.2.4 In the ROB diagrams on the side for the 8-location ROB in this lab, indicate the populated locations by shading them. If you have shaded more than 4 locations, explain how it is possible as there are only 4 single-dividers. ___________ __________________________________________________________________________________________________

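[Figure for 2.2.4: two diagrams of the 8-location ROB (locations 0 to 7), each with WP and RP markers. The marker positions are not recoverable from this transcript.]

Point values shown in the margin: 2.1.1 6pts, 2.1.2 6pts, 2.1.3 8pts, 2.2.1 6pts, 2.2.2 2pts, 2.2.3 2pts, 2.2.4 9pts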


3 ( 48 points) 25 min. Branch Prediction

3.1 _________ (Early / Late) branch ___________ (is likely to / will) cause more branch penalty. _________ (Early / Late) branch ___________ (is likely to / will) cause more dependency stalls.

Branch penalty refers to the clocks lost due to __________________ (flushing by/stalling due to/forwarding to) _________________________ (a taken branch / an untaken branch / any branch).

3.2 Branch direction prediction becomes more important in __________ (deeper / shallow) pipelines.Branch direction prediction becomes more important in __________ (out-of-order / in-order) executing pipelines. Branch target address needs to predicted if branch prediction is done from the __________ (IF / ID) stage.

3.3 ________ (JAL only / JR$31 only / both / neither) cause changes to the content of the RAS.________ (JAL only / JR$31 only / both / neither) is helped by the RAS.RAS stands for _______________________ and may be __________ (4 / 4K) locations deep.

3.4 A 2-bit branch direction predictor is better than a 1-bit predictor. T / F______________ (However / Similarly) a 3-bit predictor ______________________________________________________________________________________________________________________________________________________________________________________

3.5 No aliasing if you are predicting from _______ (IF / ID) stage, but aliasing is OK if you are predicting from _______ (IF / ID) stage. Out of the two below ______________ (A/B/A and B/neither A nor B) can cause serious drop in performanceA. Predicting a non-branch instruction such as an ADD instruction as a taken branch and correcting laterB. Applying the prediction information of one branch to another branch

3.6 BPB (Branch Prediction Buffer) with depth =2K, needs K bits to index it. The K bits are correctly taken from the PC in the _________ (Left /Right)-side design.

3.7 Our Lab 6 design has no branch prediction. It is equivalent to predicting always ______ (taken/not-taken). If the real dynamic execution trace of most codes show that 60% of the conditional branches are taken, it appears that we should choose to predicting "always taken" rather than predicting "always not-taken". Mr. Trojan voiced a caution. He said ______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

3.8 In the (m, n) predictor, _______ (m / n / neither) refers to the size in bits of the global history shift register. A (m, n) predictor improves prediction accuracy if a branch behavior is correlating to the past few branches globally. T / F

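[Figure for 3.6: two ways (Left and Right) of taking K index bits from a 32-bit PC whose two least significant bits are 00, to index a BPB with 2^K entries. Left: the PC is split into K bits, then 30-K bits, then 00, and the upper K bits index the BPB. Right: the PC is split into 30-K bits, then K bits, then 00, and the K bits just above the 00 index the BPB.]

Point values shown in the margin: 3.1 8pts, 3.2 6pts, 3.3 8pts, 3.4 6pts, 3.5 6pts, 3.6 2pts, 3.7 8pts, 3.8 4pts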
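To make the vocabulary of 3.4 to 3.6 concrete, here is a minimal sketch of a BPB built from 2-bit saturating counters. It assumes word-aligned 32-bit instructions (the two low PC bits are 00) and indexes the buffer with the K bits just above them; the class name `TwoBitBPB`, the initial counter value, and the method names are hypothetical choices, not the course's design.

```python
class TwoBitBPB:
    """Branch Prediction Buffer of 2**k two-bit saturating counters."""

    def __init__(self, k: int):
        self.k = k
        # Counter values: 0, 1 predict not-taken; 2, 3 predict taken.
        # Starting at 1 (weakly not-taken) is an arbitrary assumption.
        self.counters = [1] * (1 << k)

    def index(self, pc: int) -> int:
        # Drop the two always-zero byte-offset bits, keep the next k bits.
        return (pc >> 2) & ((1 << self.k) - 1)

    def predict(self, pc: int) -> bool:
        return self.counters[self.index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


if __name__ == "__main__":
    bpb = TwoBitBPB(k=4)
    for outcome in (True, True, False):   # one not-taken after two takens
        bpb.update(0x40, outcome)
    print(bpb.predict(0x40))              # still predicts taken
    print(bpb.index(0x40) == bpb.index(0x80))  # two PCs aliasing to one entry
```

The single not-taken outcome does not flip the prediction, which is the hysteresis a 1-bit predictor lacks. The second line of output shows aliasing: with no tag stored, two different PCs share one counter.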


4 ( 60 + 31 = 91 points) 35 min. CMP, CMT, Cache Coherency, LL/SC

4.1 In the two CMP (Chip Multi Processor) organizations shown below, we discussed in EE457 the _________ (Left / Right) design. If a CROSS BAR of 8x4 is used as the memory interconnection network in the left design, the L2 cache can be divided into ___________ (8 / 4 / other) banks. The L1 cache is ____ (A / B) and L2 cache is ____ (A / B).
A. private to a core B. shared by all cores.
The L1 cache is a common resource to all threads of a core. T / F

4.1.1 Locking the BUS in order to implement an atomic RMW operation on a semaphore is _____________ (an old / a current) technique. ___________ (Both/Neither/Only LL/Only SC) requires locking the Bus. During "busy" wait, excessive traffic on the Bus is _____________ (caused / avoided) by spinning around a ____________ (local/global) copy of the lock variable using LL instruction.

4.2 The SCU(s) in ___________ (one / multiple) core(s) may be driving address and BusRdX on the Bus and the SCU(s) in ___________ (one / multiple) core(s) may be listening on the bus to invalidate that block. Division of the cache operation into CCU and SCU is a key feature of a ______________________ (Blocking/Non-Blocking) cache. In a 4-threaded core, there is/are _____ (1/4) SCU(s). In an 8-core processor, there is/are _____ (1/8) SCU(s). In the context of Non-blocking cache, MSHR stands for ____________________________________.
________ (CCU/SCU) leaves a request in MSHR and ________ (CCU/SCU) attends to it.

4.3 In the MPI (Miss rate per instruction) calculations, ____________________ (all / only memory accessing) instructions are considered. If L1 cache MPI is 5% and the L1 miss penalty is 10 clocks and if L2 cache MPI is 1% and the L2 miss penalty is 200 clocks, what is the overall CPI, assuming there are no other problems degrading the CPI? __________________________________________________________________________________________________

4.4 Assume a simple in-order pipeline like the 6-stage Oracle Niagara (T1) processor.
Switching between threads on every clock in a fine-grain multi-threaded core wastes 1 or more clocks per switch. T / F
Switching between threads because of data cache miss in the running thread wastes 1 or more clocks per switch. T / F
Now assume out of order execution in SMT (same as HTT in Intel terminology). Here do you incur loss of clocks due to switching threads? ________ (Y / N).
SMT stands for _______________________. HTT stands for ___________________________.
Since threads carry their _______ (thread ID/process ID) with them, they write to their respective register files in the WB stage.
TLB can be common or separate for the threads. T / F
If it is common, it has ASN (Address Space Number). T / F
Operating System _____ (is/isn’t) informed when a thread switch occurs in a multi-threaded core.

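[Figure for 4.1: two CMP organizations, each with cores C0, C1, ..., C7 and one L1$ per core. Left: the L1 caches connect through a Memory Interconnection Network to a shared (banked) L2 cache. Right: the L1 caches connect through a Bus to a shared L2 cache (no banks).]

Point values shown in the margin of this page: 7.5pts, 8.5pts, 16pts, 6pts, 18pts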
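The arithmetic in 4.3 follows the usual stall-cycle accounting: each miss level adds (misses per instruction) x (penalty in clocks) to the CPI. The sketch below assumes a base CPI of 1 and simply additive penalties; neither assumption is stated in the question, and the function name `overall_cpi` is hypothetical.

```python
def overall_cpi(base_cpi: float,
                l1_mpi: float, l1_penalty: float,
                l2_mpi: float, l2_penalty: float) -> float:
    """CPI = base CPI + stall clocks per instruction from L1 and L2 misses."""
    return base_cpi + l1_mpi * l1_penalty + l2_mpi * l2_penalty


if __name__ == "__main__":
    # 5% L1 misses per instruction at 10 clocks, 1% L2 misses at 200 clocks.
    # Base CPI of 1.0 is an assumption for an ideal single-issue pipeline.
    print(overall_cpi(1.0, 0.05, 10, 0.01, 200))  # 1 + 0.5 + 2.0
```

Because MPI is expressed per instruction rather than per memory access, no separate factor for the fraction of loads and stores is needed.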


4.5 MOESI state encoding: Fill-up the encoding table on the side for the 5 states.

4.6 The advantage of the "E" state is _________________________________________________________________________________________________________________________________________________________The advantage of the "O" state is _________________________________________________________________________________________________________________________________________________________

4.7 The word "Flush" in the diagrams below means helping a neighbor L1 cache. It is _________________ (wrong/wasteful) to flush to the MM also unless needed.

Mark appropriate state transitions in the EE457 design with either R/FMM (meaning replacement causing flush to main memory) or R/-- (meaning replacement causing no flush to main memory).
The R/FMM or R/-- markings for the EE557 design would be identical to EE457 markings. T / F
We ______________ (wish to / do not wish to) defer (postpone) updating the MM as far in time as possible.
An example of this is (narrate a case of transferring responsibility to update the MM to another cache):
______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

4.8 LL stands for Load _________ and SC stands for Store ___________.
In LL $2, 2000($0); $2 is a _______________________ (source / destination / both).
In SC $2, 2000($0); $2 is a _______________________ (source / destination / both).
It is ______________________ (possible / not possible) that a thread in a core executes two LL instructions with the same address without any intervening SC instruction.
It is ______________________ (possible / not possible) that a thread in a core executes two SC instructions with the same address without any intervening LL instruction.
Successful execution of a SC $2, 2000($0); by one thread in a core should have broken LL links for the address 2000 first for all threads in other cores when the block was ________________________________________________________________________________________________________ and then for all threads in that very core when the SC instruction is ______________________

Encoding table for 4.5 (to be filled in):

State           Property                                   Code
                Valid bit   Only-Copy bit   Dirty bit
I (Invalid)
S (Shared)
O (Owner)
E (Exclusive)
M (Modified)

[Figure for 4.7: two copies of the MOESI state-transition diagram, one labeled EE557 and one labeled EE457, with states M, O, E, S and I. Transition labels include PrRd/--, PrWr/--, PrWr/BusUpgr, PrWr/BusRdX, PrRd(S)/BusRd, BusRd/--, BusRd/Flush, BusRdX/--, BusRdX/Flush and BusUpgr/--.]

Figure 5.12 State-transition diagram of a MOESI protocol. Copyright 2012 Michel Dubois, Murali Annavaram and Per Stenström

Point values shown in the margin of this page: 5pts, 6pts, 10pts, 16pts. A "Canceled" stamp also appears on this page.
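The three property bits named in the 4.5 table (Valid, Only-Copy, Dirty) are enough to tell the five MOESI states apart. The sketch below shows one conventional reading of those properties; the function name `moesi_state` is hypothetical, and the binary "Code" column expected by the course may be laid out differently.

```python
def moesi_state(valid: bool, only_copy: bool, dirty: bool) -> str:
    """Map the three property bits to a MOESI state letter.

    Assumed conventional meaning of the states:
      M = valid, only copy, dirty     E = valid, only copy, clean
      O = valid, shared, dirty        S = valid, shared, clean
      I = invalid (the other two bits are don't-cares)
    """
    if not valid:
        return "I"
    if only_copy:
        return "M" if dirty else "E"
    return "O" if dirty else "S"


if __name__ == "__main__":
    for bits in [(0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]:
        print(bits, moesi_state(*map(bool, bits)))
```

Under this reading, only a dirty block (M or O) carries the responsibility of eventually updating main memory, which is the property the R/FMM markings in 4.7 track.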


5 ( 42 points) 20 min. Tomasulo

5.1 In the LHS (Left-Hand Side) out of order commitment design (IoI-OoE-OoC), the store word instructions are committed ___________ (in order / out of order). In this design the register renaming using tokens from the TAG FIFO for the ____________ (source/destination) registers avoids WAW and WAR problems associated with the ______________________ (registers only / memory locations only / both). RAW problems among registers are ___________________ (made to disappear / preserved) by the register renaming.

5.2 The memory disambiguation rules are simpler in the ________ (LHS/RHS/both) design(s) because the WAW and WAR problems are taken care of by the strict ________________________________________________ in the ________ (LHS/RHS/both) design(s).

5.3 Every register write reaches the register file in the ________ (LHS/RHS/both) design(s).

5.4 In his implementation of the RHS design, Mr. Bruin was careless and did not check to see if the 2-bit bypass counter (associated with the load instruction) in the LSQ became FULL when he was letting a junior SW instruction bypass over the senior LW instructions with matching address. Please explain to him how this could potentially cause deadlocks in his system. __________________ ________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

5.5 For an 8-location ROB using 4-bit pointers for WP and RP, some of the four situations below, namely ________ (#1, #2, #3, #4), is/are illegal. Among the remaining legal situations, any one of them could be occurring before the other(s) as ROB is a circular buffer. T / F. Circle the populated locations for the legal situations and state the depth.

[Figure: two Tomasulo block diagrams side by side. Left (LHS): IoI - OoE - OoC design. Right (RHS): IoI - OoE - IoC design, titled "OoO Execution and In-Order Committing with ROB (Re-Order Buffer)". Labeled blocks include I-Cache, BPB, I-Fetch Queue, Dispatch, Reg File, ROB (Re-order Buffer), Integer Queue, Load/Store Queue, Div Queue, Mult Queue, Issue Unit, Exe Units, Addr Buff, Load Buffer, LS_Buffer, Cache and CDB, with Front-end and Back-end regions marked.]

Point values shown in the margin: 5.1 10pts, 5.2 8pts, 5.3 2pts, 5.4 10pts, 5.5 12pts


6 ( 29 points) 15 min. Virtual Memory:


6.1 PTBR stands for _____________________________________________________________.
It is initialized by _________________________ (hardware / operating system) and is utilized by ___________ (MMU / CCU) (i.e. memory management unit or cache control unit) to look up ______________________ (TLB / Page Table / Cache Tag RAM).

6.2 Page Table: Number of A,B,C,D Tables built by the OS:

PQRST on the side represents a 20-bit (5-digit hex) VPN in a 4-level page table with upper 8 bits (PQ) indexing the A-level table, next 4 bits (R) indexing the B-level tables, next 4 bits (S) indexing the C-level tables, and the last 4 bits (T) indexing the D-level tables. Suppose the first 8 distinct virtual pages accessed by the application program had the VPNs as stated in TABLE-I (in sorted order).
How many tables of what size were built by the OS by this time?
A-level: _____________________________________________
B-level: _____________________________________________
C-level: _____________________________________________
D-level: _____________________________________________

6.3 Memory addresses: In a 32-bit virtual address system using 4KB pages, state any two consecutive 32-bit word addresses (in hex) which do not fall in the same virtual page. ______________________
I am evicting a page containing the byte with virtual address 2345B789h. What is its virtual page number (in hex)? __________.
What is the range of byte addresses residing in that page (lowest virtual byte address to highest virtual byte address)? ____________________________________
The physical page frame number in the main memory is 2 (just 2). What is the range of byte addresses residing in that page (lowest physical byte address to highest physical byte address)? ___________________________________________________________________________

6.4 ________ (VIPT/PIPT) is advantageous over ________ (VIPT/PIPT).
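The address arithmetic asked for in 6.3 reduces to shifts by the 12-bit page offset of a 4KB page. The following sketch shows that arithmetic on the numbers given in the question; the helper names `page_number` and `page_range` are hypothetical.

```python
PAGE_OFFSET_BITS = 12                 # 4KB pages: 2**12 bytes per page
PAGE_SIZE = 1 << PAGE_OFFSET_BITS


def page_number(byte_address: int) -> int:
    """Virtual (or physical) page number containing the given byte address."""
    return byte_address >> PAGE_OFFSET_BITS


def page_range(page: int) -> tuple:
    """Lowest and highest byte address residing in the given page or frame."""
    lowest = page << PAGE_OFFSET_BITS
    return lowest, lowest + PAGE_SIZE - 1


if __name__ == "__main__":
    vpn = page_number(0x2345B789)
    print(f"VPN = {vpn:05X}h")
    print("virtual range  = %08Xh to %08Xh" % page_range(vpn))
    print("physical range = %08Xh to %08Xh" % page_range(2))
    # Two consecutive word addresses that straddle a page boundary:
    print(page_number(0x00000FFC), page_number(0x00001000))
```

Any pair of word addresses of the form xxxxxFFCh and the next multiple of 1000h behaves like the last example, since the page number changes exactly when the 12-bit offset wraps.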

[Figure for 5.5: four situations, #1 to #4, each showing the 16 values (0 to 15) of the 4-bit WP and RP pointers of the 8-location ROB, with WP and RP markers. The marker positions are not recoverable from this transcript.]

#1: Depth = _______   #2: Depth = _______   #3: Depth = _______   #4: Depth = _______

TABLE-I

P Q R S T
7 2 6 4 5
7 2 6 4 7
7 3 8 6 5
7 3 8 6 7
7 4 9 6 5
7 5 9 6 5
7 6 9 6 5
7 6 9 7 5

Point values shown in the margin: 6.1 6pts, 6.2 12pts, 6.3 8pts, 6.4 3pts
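The table counting in 6.2 can be checked mechanically: a lower-level table exists for every distinct index path that leads to it. The sketch below assumes the OS allocates tables on demand, only for the prefixes actually touched by the eight VPNs of TABLE-I; the function name `count_tables` is hypothetical.

```python
# The eight VPNs of TABLE-I as 5-digit hex strings "PQRST".
TABLE_I = ["72645", "72647", "73865", "73867",
           "74965", "75965", "76965", "76975"]


def count_tables(vpns: list) -> dict:
    """Count page tables per level, assuming on-demand allocation.

    PQ (8 bits) indexes the single A-level table (256 entries).
    Each distinct PQ needs one B-level table, each distinct PQR one
    C-level table, each distinct PQRS one D-level table (16 entries each).
    """
    return {
        "A": 1,
        "B": len({v[:2] for v in vpns}),
        "C": len({v[:3] for v in vpns}),
        "D": len({v[:4] for v in vpns}),
    }


if __name__ == "__main__":
    print(count_tables(TABLE_I))
```

Only the last level grows when two pages differ solely in the S digit (as with 76965 and 76975), because they still share the same A, B and C index path.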