PH-4-Quiz

Date:

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully.

Name:
Course:

1. [15 points] Consider two different implementations, M1 and M2, of the same instruction set. There are three classes of instructions (A, B, and C) in the instruction set. M1 has a clock rate of 80 MHz and M2 has a clock rate of 100 MHz. The average number of cycles for each instruction class and their frequencies (for a typical program) are as follows:

Instruction Class   M1 Cycles/Instruction Class   M2 Cycles/Instruction Class   Frequency
A                   1                             2                             60%
B                   2                             3                             30%
C                   4                             4                             10%

(a) Calculate the average CPI for each machine, M1 and M2.
(b) Calculate the average MIPS ratings for each machine, M1 and M2.

(c) Which machine has a smaller MIPS rating? Which individual instruction class CPI do you need to change, and by how much, so that this machine has the same or better performance as the machine with the higher MIPS rating? (You can only change the CPI for one of the instruction classes on the slower machine.)
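As a reminder of the arithmetic behind parts (a) and (b), the following is a minimal sketch of a weighted-average CPI and a MIPS rating. The function names and the example machine (its cycle counts, instruction mix, and 120 MHz clock) are made up for illustration and are not M1 or M2.

```python
def average_cpi(cycles, freq):
    """Weighted average CPI: sum over classes of cycles x instruction fraction."""
    return sum(cycles[c] * freq[c] for c in cycles)


def mips_rating(clock_mhz, cpi):
    """MIPS = clock rate in Hz / (CPI x 10^6), i.e. clock rate in MHz / CPI."""
    return clock_mhz / cpi


# Hypothetical machine (assumed values, not taken from the question).
cycles = {"A": 1, "B": 3, "C": 5}
freq = {"A": 0.5, "B": 0.3, "C": 0.2}

cpi = average_cpi(cycles, freq)   # 0.5 + 0.9 + 1.0 = 2.4
mips = mips_rating(120, cpi)      # 120 / 2.4 = 50
print(round(cpi, 3), round(mips, 3))
```

The same two functions can be applied to each machine in the table by substituting its cycle counts and clock rate.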


2. [10 points] (Amdahl’s law question) Suppose you have a machine which executes a program consisting of 50% floating point multiply, 20% floating point divide, and the remaining 30% are from other instructions.

(a) Management wants the machine to run 4 times faster. You can make the divide run at most 3 times faster and the multiply run at most 8 times faster. Can you meet management’s goal by making only one improvement, and which one?
(b) Dogbert has now taken over the company, removing all the previous managers. If you make both the multiply and divide improvements, what is the speed of the improved machine relative to the original machine?
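Amdahl's law with more than one enhanced portion can be sketched as below. This is only an illustration under assumed inputs: the helper name `amdahl_speedup` and the example fractions and speedup factors are hypothetical and differ from those in the question.

```python
def amdahl_speedup(parts):
    """Overall speedup for a list of (fraction_of_original_time, local_speedup).

    The fractions must cover the whole program, so they have to sum to 1.
    """
    total = sum(fraction for fraction, _ in parts)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    new_time = sum(fraction / speedup for fraction, speedup in parts)
    return 1.0 / new_time


# Hypothetical program: 40% sped up 2x, 10% sped up 5x, 50% unchanged.
overall = amdahl_speedup([(0.4, 2), (0.1, 5), (0.5, 1)])
# New time = 0.20 + 0.02 + 0.50 = 0.72 of the original.
print(round(overall, 3))
```

Passing a speedup of 1 for a portion models the case where that portion is left unimproved.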

3. [5 points] Suppose that we can improve the floating point instruction performance of a machine by a factor of 15 (the same floating point instructions run 15 times faster on this new machine). What percent of the instructions must be floating point to achieve a speedup of at least 4?

4. [6 points] Just like we defined MIPS rating, we can also define something called the MFLOPS rating which stands for Millions of Floating Point operations per Second. If Machine A has a higher MIPS rating than that of Machine B, then does Machine A necessarily have a higher MFLOPS rating in comparison to Machine B?

5. [6 points] Consider the SPEC benchmark. Name two factors that influence the resulting performance on any particular architecture.

6. [5 points] How did the development of the transistor affect computers? What did the transistor replace?

7. [25 points] A two-part question:

(Part A) Assume that a design team is considering enhancing a machine by adding MMX (multimedia extension instruction) hardware to a processor. When a computation is run in MMX mode on the MMX hardware, it is 10 times faster than the normal mode of execution. Call the percentage of time that could be spent using the MMX mode the percentage of media enhancement.

(a) What percentage of media enhancement is needed to achieve an overall speedup of 2?
(b) What percentage of the run-time is spent in MMX mode if a speedup of 2 is achieved? (Hint: You will need to calculate the new overall time.)
(c) What percentage of the media enhancement is needed to achieve one-half the maximum speedup attainable from using the MMX mode?

(Part B) If processor A has a higher clock rate than processor B, and processor A also has a higher MIPS rating than processor B, explain whether processor A will always execute faster than processor B. Suppose that there are two implementations of the same instruction set architecture. Machine A has a clock cycle time of 20 ns and an effective CPI of 1.5 for some program, and machine B has a clock cycle time of 15 ns and an effective CPI of 1.0 for the same program. Which machine is faster for this program, and by how much?

8. [6 points] Suppose a program segment consists of a purely sequential part which takes 25 cycles to execute, and an iterated loop which takes 100 cycles per iteration. Assume the loop iterations are independent, and cannot be further parallelized. If the loop is to be executed 100 times, what is the maximum speedup possible using an infinite number of processors (compared to a single processor)?

9. [5 points] Computer A has an overall CPI of 1.3 and can be run at a clock rate of 600 MHz. Computer B has a CPI of 2.5 and can be run at a clock rate of 750 MHz. We have a particular program we wish to run. When compiled for computer A, this program has exactly 100,000 instructions. How many instructions would the program need to have when compiled for Computer B, in order for the two computers to have exactly the same execution time for this program?

10. [5 points] Imagine that you are able to perform benchmarking “races” to compare two computers you are thinking about buying. Come up with a list of 5 benchmark programs or usage scenarios you would use to create your own personalized benchmark suite. For each program you select, justify it. For the benchmark suite as a whole, discuss a method for calculating a weighted average of the different program run-times.

11. [8 points] The design team for a simple, single-issue processor is choosing between a pipelined or non-pipelined implementation. Here are some design parameters for the two possibilities:

Parameter                      Pipelined Version   Non-Pipelined Version
Clock Rate                     500 MHz             350 MHz
CPI for ALU instructions       1                   1
CPI for Control instructions   2                   1
CPI for Memory instructions    2.7                 1

(a) For a program with 20% ALU instructions, 10% control instructions and 75% memory instructions, which design will be faster? Give a quantitative CPI average for each case.
(b) For a program with 80% ALU instructions, 10% control instructions and 10% memory instructions, which design will be faster? Give a quantitative CPI average for each case.

12. [5 points] A designer wants to improve the overall performance of a given machine with respect to a target benchmark suite and is considering an enhancement X that applies to 50% of the original dynamically-executed instructions, and speeds each of them up by a factor of 3. The designer’s manager has some concerns about the complexity and the cost-effectiveness of X and suggests that the designer should consider an alternative enhancement Y. Enhancement Y, if applied only to some (as yet unknown) fraction of the original dynamically-executed instructions, would make them only 75% faster. Determine what percentage of all dynamically-executed instructions should be optimized using enhancement Y in order to achieve the same overall speedup as obtained using enhancement X.

Date:

Quiz for Chapter 2 Instructions: Language of the Computer 3.10

Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully.

Name:
Course:

1. [5 points] Prior to the early 1980s, machines were built with more and more complex instruction sets. The MIPS is a RISC machine. Why has there been a move to RISC machines away from complex instruction machines?


2. [5 points] Write the following sequence of code into MIPS assembler:

    x = x + y + z - q;

Assume that x, y, z, q are stored in registers $s1-$s4.

3. [10 points] In MIPS assembly, write an assembly language version of the following C code segment:

    int A[100], B[100];
    for (i=1; i < 100; i++) {
        A[i] = A[i-1] + B[i];
    }

At the beginning of this code segment, the only values in registers are the base address of arrays A and B in registers $a0 and $a1. Avoid the use of multiplication instructions; they are unnecessary.

4. [6 points] Some machines have a special flag register which contains status bits. These bits often include the carry and overflow bits. Describe the difference between the functionality of these two bits and give an example of an arithmetic operation that would lead to them being set to different values.

5. [6 points] The MIPS instruction set includes several shift instructions. They include logical-shift-left, logical-shift-right, and arithmetic-shift-right. Other architectures only provide an arithmetic-shift-right instruction.

(a) Why doesn’t MIPS offer an “arithmetic-shift-left” opcode?
(b) How would you implement in the assembler a logical-shift-left (LSL) pseudo-operation for a machine that didn’t have this particular instruction? Be sure your LSL instruction can shift up to W bits, where W is the machine word size in bits.

6. [6 points] Consider the following assembly code for parts (a) and (b).

          r1 = 99
    Loop: r1 = r1 - 1
          branch r1 > 0, Loop
          halt

(a) During the execution of the above code, how many dynamic instructions are executed?
(b) Assuming a standard unicycle machine running at 100 kHz, how long will the above code take to complete?

7. [15 points] Convert the C function below to MIPS assembly language. Make sure that your assembly language code could be called from a standard C program (that is to say, make sure you follow the MIPS calling conventions).

    unsigned int sum(unsigned int n)
    {
        if (n == 0) return 0;
        else return n + sum(n-1);
    }

This machine has no delay slots. The stack grows downward (toward lower memory addresses). The following registers are used in the calling convention:

Register Name   Register Number   Usage
$zero           0                 Constant 0
$at             1                 Reserved for assembler
$v0, $v1        2, 3              Function return values
$a0 - $a3       4 – 7             Function argument values
$t0 - $t7       8 – 15            Temporary (caller saved)
$s0 - $s7       16 – 23           Temporary (callee saved)
$t8, $t9        24, 25            Temporary (caller saved)
$k0, $k1        26, 27            Reserved for OS Kernel
$gp             28                Pointer to Global Area
$sp             29                Stack Pointer
$fp             30                Frame Pointer
$ra             31                Return Address

8. [5 points] In the snippet of MIPS assembler code below, how many times is instruction memory accessed? How many times is data memory accessed? (Count only accesses to memory, not registers.)

    lw   $v1, 0($a0)
    addi $v0, $v0, 1
    sw   $v1, 0($a1)
    addi $a0, $a0, 1

9. [6 points] Use the register and memory values in the table below for the next questions. Assume a 32-bit machine. Assume each of the following questions starts from the table values; that is, DO NOT use value changes from one question as propagating into future parts of the question.

Register   Value   Memory Location   Value
R1         12      12                16
R2         16      16                20
R3         20      20                24
R4         24      24                28

(a) Give the values of R1, R2, and R3 after this instruction: add R3, R2, R1
(b) What values will be in R1 and R3 after this instruction is executed: load R3, 12(R1)
(c) What values will be in the registers after this instruction is executed: addi R2, R3, #16

10. [20 points] Loop Unrolling and Fibonacci: Consider the following pseudo-C code to compute the fifth Fibonacci number (F(5)).

    int a,b,i,t;
    a=b=1;          /* Set a and b to F(2) and F(1) respectively */
    for(i=0;i<2;i++)
    {
        t=a;        /* save F(n-1) to a temporary location */
        a+=b;       /* F(n) = F(n-1) + F(n-2) */
        b=t;        /* set b to F(n-1) */
    }

One observation that a compiler might make is that the loop construction is somewhat unnecessary. Since the range of the loop indices is fixed, one can unroll the loop by simply writing three iterations of the loop one after the other without the intervening increment/comparison on i. For example, the above could be written as:

    int a,b,t;
    a=b=1;
    t=a;
    a+=b;
    b=t;
    t=a;
    a+=b;
    b=t;

(a) Convert the pseudo-C code for both of the snippets above into reasonably efficient MIPS code. Represent each variable of the pseudo-C program with a register. Try to follow the pseudo-C code as closely as possible (i.e., the first snippet should have a loop in it, while the second should not).
(b) Now suppose that instead of the fifth Fibonacci number we decided to compute the 20th. How many static instructions would there be in the first version and how many would there be in the unrolled version? What about dynamic instructions? You do not need to write out the assembly for this part.

11. [10 points] In MIPS assembly, write an assembly language version of the following C code segment:

    for (i = 0; i < 98; i++) {
        C[i] = A[i + 1] - A[i] * B[i + 2];
    }

Arrays A, B and C start at memory location A000hex, B000hex and C000hex respectively. Try to reduce the total number of instructions and the number of expensive instructions such as multiplies.

12. [6 points] Suppose that a new MIPS instruction, called bcp, was designed to copy a block of words from one address to another. Assume that this instruction requires that the starting address of the source block be in register $t1 and that the destination address be in $t2. The instruction also requires that the number of words to copy be in $t3 (which is > 0). Furthermore, assume that the values of these registers as well as register $t4 can be destroyed in executing this instruction (so that the registers can be used as temporaries to execute the instruction). Do the following:

• Write the MIPS assembly code to implement a block copy without this instruction.
• Write the MIPS assembly code to implement a block copy with this instruction.
• Estimate the total cycles necessary for each realization to copy 100 words on the multicycle machine.

Date:

Quiz for Chapter 3 Arithmetic for Computers 3.10

Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully.

Name:
Course:

1. [9 points] This problem covers 4-bit binary multiplication. Fill in the table for the Product, Multiplier and Multiplicand for each step. You need to provide the DESCRIPTION of the step being performed (shift left, shift right, add, no add). The value of M (Multiplicand) is 1011, Q (Multiplier) is initially 1010.

Step      Product     Multiplicand   Multiplier   Description
Step 0    0000 0000   0000 1011      1010         Initial Values
Step 1
Step 2
Step 3
Step 4
Step 5
Step 6
Step 7
Step 8
Step 9
Step 10
Step 11
Step 12
Step 13
Step 14
Step 15


2. [6 points] This problem covers floating-point IEEE format.

(a) List four floating-point operations that cause NaN to be created.
(b) Assuming single precision IEEE 754 format, what decimal number is represented by this word:

    1 01111101 00100000000000000000000

(Hint: remember to use the biased form of the exponent.)
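For checking hand decodings of single-precision words, a small helper along the following lines may be useful. It is a sketch only: the function name `decode_single` and the example bit pattern are illustrative choices (the pattern is deliberately not the one from part (b)), and the unbiased exponent it reports is meaningful only for normalized numbers.

```python
import struct


def decode_single(bits):
    """Split a 32-bit IEEE 754 single-precision bit string into its fields.

    Returns (sign, biased_exponent, fraction_field, value); spaces are ignored.
    """
    bits = bits.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    sign = int(bits[0], 2)
    biased_exponent = int(bits[1:9], 2)      # stored with a bias of 127
    fraction_field = int(bits[9:], 2)        # 23 explicit fraction bits
    # Let the standard library interpret the same 4 bytes as a float.
    value = struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]
    return sign, biased_exponent, fraction_field, value


# Example pattern chosen for illustration: 1.5 x 2^(128 - 127) = 3.0
sign, biased, fraction, value = decode_single("0 10000000 10000000000000000000000")
print(sign, biased - 127, fraction, value)
```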

3. [12 points] The floating-point format to be used in this problem is an 8-bit IEEE 754 normalized format with 1 sign bit, 4 exponent bits, and 3 mantissa bits. It is identical to the 32-bit and 64-bit formats in terms of the meaning of fields and special encodings. The exponent field employs an excess-7 coding. The bit fields in a number are (sign, exponent, mantissa). Assume that we use unbiased rounding to the nearest even specified in the IEEE floating point standard.

(a) Encode the following numbers in the 8-bit IEEE format:

(1) 0.0011011 (binary)

(2) 16.0 (decimal)

(b) Perform the computation 1.011 (binary) + 0.0011011 (binary), showing the correct state of the guard, round and sticky bits. There are three mantissa bits.

(c) Decode the following 8-bit IEEE number into its decimal value: 1 1010 101

(d) Decide which number in each of the following pairs is greater in value (the numbers are in 8-bit IEEE 754 format):

(1) 0 0100 100 and 0 0100 111

(2) 0 1100 100 and 1 1100 101

(e) In the 32-bit IEEE format, what is the encoding for negative zero?
(f) In the 32-bit IEEE format, what is the encoding for positive infinity?

4. [9 points] The floating-point format to be used in this problem is a normalized format with 1 sign bit, 3 exponent bits, and 4 mantissa bits. The exponent field employs an excess-4 coding. The bit fields in a number are (sign, exponent, mantissa). Assume that we use unbiased rounding to the nearest even specified in the IEEE floating point standard.

(a) Encode the following numbers in the above format:

(1) 1.0 (binary)

(2) 0.0011011 (binary)

(b) In one sentence for each, state the purpose of guard, rounding, and sticky bits for floating point arithmetic.

(c) Perform rounding on the following fractional binary numbers, use the bit positions in italics to determine rounding (use the rightmost 3 bits):

(1) Round to positive infinity: +0.100101110 (binary)

(2) Round to negative infinity: -0.001111001 (binary)

(4) Unbiased to the nearest even: +0.100101100 (binary)

(5) Unbiased to the nearest even: -0.100100110 (binary)

(d) What is the result of the square root of a negative number?

5. [6 points] Prove that Sign Magnitude and One’s Complement addition cannot be performed correctly by a single unsigned adder. Prove that a single n-bit unsigned adder performs addition correctly for all pairs of n-bit Two’s Complement numbers for n=2. You should ignore overflow concerns and the n+1 carry bit. (For an optional added challenge, prove for n=2 by first proving for all n.)

6. [4 points] Using 32-bit IEEE 754 single precision floating point with one (1) sign bit, eight (8) exponent bits and twenty-three (23) mantissa bits, show the representation of -11/16 (-0.6875).

7. [3 points] What is the smallest positive (not including +0) representable number in 32-bit IEEE 754 single precision floating point? Show the bit encoding and the value in base 10 (fraction or decimal OK).

8. [12 points] Perform the following operations by converting the operands to 2’s complement binary numbers and then doing the addition or subtraction shown. Please show all work in binary, operating on 16-bit numbers. (a) 3 + 12 (b) 13 – 2 (c) 5 – 6 (d) –7 – (-7)
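The conversion and the addition step can be checked mechanically. The sketch below is one way to do it, with hypothetical helper names (`to_twos`, `from_twos`, `add_twos`) and an example operand pair, 9 and -4, that is not one of the parts above; subtraction is modeled as adding the negated operand and dropping the carry out of bit 15.

```python
BITS = 16
MASK = (1 << BITS) - 1


def to_twos(n):
    """Return the 16-bit two's complement pattern of n as a bit string."""
    if not -(1 << (BITS - 1)) <= n < (1 << (BITS - 1)):
        raise ValueError("value does not fit in 16 bits")
    return format(n & MASK, "016b")


def from_twos(bits):
    """Interpret a 16-bit pattern as a signed two's complement integer."""
    value = int(bits, 2)
    return value - (1 << BITS) if value & (1 << (BITS - 1)) else value


def add_twos(a, b):
    """Add two patterns as unsigned numbers and discard the carry out."""
    return format((int(a, 2) + int(b, 2)) & MASK, "016b")


# 9 - 4 computed as 9 + (-4).
x = to_twos(9)           # 0000000000001001
y = to_twos(-4)          # 1111111111111100
total = add_twos(x, y)   # 0000000000000101
print(x, y, total, from_twos(total))
```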

9. [9 points] Consider 2’s complement 4-bit signed integer addition and subtraction.

(a) Since the operands can be negative or positive and the operator can be subtraction or addition, there are 8 possible combinations of inputs. For example, a positive number could be added to a negative number, or a negative number could be subtracted from a negative number, etc. For each of them, describe how the overflow can be computed from the sign of the input operands and the carry out and sign of the output. Fill in the table below:

Sign (Input 1)   Sign (Input 2)   Operation   Sign (Output)   Overflow (Y/N)
+                +                +           +
+                +                +           -
+                +                -           +
+                +                -           -
+                -                +           +
+                -                +           -
+                -                -           +
+                -                -           -
-                +                +           +
-                +                +           -
-                +                -           +
-                +                -           -
-                -                +           +
-                -                +           -
-                -                -           +
-                -                -           -

(b) Define the WiMPY precision IEEE 754 floating point format to be:

where each ’X’ represents one bit. Convert each of the following WiMPY floating point numbers to decimal: (a) 00000000 (b) 11011010 (c) 01110000

10. [8 points] This problem covers 4-bit binary unsigned division (similar to Fig. 3.11 in the text). Fill in the table for the Quotient, Divisor and Dividend for each step. You need to provide the DESCRIPTION of the step being performed (shift left, shift right, sub). The value of Divisor is 4 (0100, with additional 0000 bits shown for right shift), Dividend is 6 (initially loaded into the Remainder).

Step      Quotient   Divisor     Remainder   Description
Step 0    0000       0100 0000   0000 0110   Initial Values
Step 1
Step 2
Step 3
Step 4
Step 5
Step 6
Step 7
Step 8
Step 9
Step 10
Step 11
Step 12
Step 13
Step 14
Step 15

11. [8 points] Why is the 2’s complement representation used most often? Give an example of overflow when: (a) 2 positive numbers are added (b) 2 negative numbers are added (c) A-B where B is a negative number

12. [14 points] We’re going to look at some ways in which binary arithmetic can be unexpectedly useful. For this problem, all numbers will be 8-bit, signed, and in 2’s complement.

(a) For x = 8, compute x & (−x). (& here refers to bitwise-and, and − refers to arithmetic negation.)
(b) For x = 36, compute x & (−x).
(c) Explain what the operation x & (−x) does.
(d) In some architectures (such as the PowerPC), there is an instruction adde rX=rY,rZ, which performs the following:

    rX = rY + rZ + CA

where CA is the carry flag. There is also a negation instruction, neg rX=rY, which performs:

    rX = 0 - rY

Both adde and neg set the carry flag. gcc (the GNU C Compiler) will often use these instructions in the following sequence in order to implement a feature of the C language:

    neg  r1=r0
    adde r2=r1,r0

Explain, simply, what the relationship between r0 and r2 is (Hint: r2 has exactly two possible values), and what C operation it corresponds to. Be sure to show your reasoning.
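Because Python integers are unbounded, the 8-bit behaviour in parts (a) and (b) has to be modeled by masking. The following is a small sketch of that masking, using a hypothetical helper name and an example operand (12) that is not one of the values asked about.

```python
def and_with_negation_8bit(x):
    """Compute x & (-x) with both operands held as 8-bit two's complement patterns."""
    x &= 0xFF            # keep only the low 8 bits of x
    neg = (-x) & 0xFF    # 8-bit two's complement negation of x
    return x & neg


# Example operand chosen for illustration: 12 is 00001100, and -12 is 11110100.
result = and_with_negation_8bit(12)
print(format(12, "08b"), format((-12) & 0xFF, "08b"), format(result, "08b"))
```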

Date:

Quiz for Chapter 4 The Processor 3.10

Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully.

Name:
Course:

1. [6 points] For the MIPS datapath shown below, several lines are marked with “X”. For each one:

• Describe in words the negative consequence of cutting this line relative to the working, unmodified processor.

• Provide a snippet of code that will fail

• Provide a snippet of code that will still work


2. [3 points] Consider the following assembly language code:

    I0: ADD R4 = R1 + R0;
    I1: SUB R9 = R3 - R4;
    I2: ADD R4 = R5 + R6;
    I3: LDW R2 = MEM[R3 + 100];
    I4: LDW R2 = MEM[R2 + 0];
    I5: STW MEM[R4 + 100] = R2;
    I6: AND R2 = R2 & R1;
    I7: BEQ R9 == R1, Target;
    I8: AND R9 = R9 & R1;

Consider a pipeline with forwarding, hazard detection, and 1 delay slot for branches. The pipeline is the typical 5-stage IF, ID, EX, MEM, WB MIPS design. For the above code, complete the pipeline diagram below (instructions on the left, cycles on top). Insert the characters IF, ID, EX, MEM, WB for each instruction in the boxes. Assume that there are two levels of bypassing, that the second half of the decode stage performs a read of source registers, and that the first half of the write-back stage writes to the register file. Label all data stalls (draw an X in the box). Label all data forwards that the forwarding unit detects (arrow between the stages handing off the data and the stages receiving the data). What is the final execution time of the code?

      T0   T1   T2   T3   T4   T5   T6   T7   T8   T9   T10  T11  T12  T13  T14  T15  T16  T17
I0
I1
I2
I3
I4
I5
I6
I7
I8

3. [5 points] Structural, data and control hazards typically require a processor pipeline to stall. Listed below are a series of optimization techniques implemented in a compiler or a processor pipeline designed to reduce or eliminate stalls due to these hazards. For each of the following optimization techniques, state which pipeline hazards it addresses and how it addresses them. Some optimization techniques may address more than one hazard, so be sure to include explanations for all addressed hazards.

(a) Branch prediction
(b) Instruction scheduling
(c) Delay slots
(d) Increasing availability of functional units (ALUs, adders, etc.)
(e) Caches

4. [6 points] Branch Prediction. Consider the following sequence of actual outcomes for a single static branch. T means the branch is taken. N means the branch is not taken. For this question, assume that this is the only branch in the program.

    T T T N T N T T T N T N T T T N T N

(a) Assume that we try to predict this sequence with a BHT using one-bit counters. The counters in the BHT are initialized to the N state. Which of the branches in this sequence would be mis-predicted? You may use this table for your answer:

Predictor state before prediction   Branch outcome   Misprediction?
N                                   T
                                    T
                                    T
                                    N
                                    T
                                    N
                                    T
                                    T
                                    T
                                    N
                                    T
                                    N
                                    T
                                    T
                                    T
                                    N
                                    T
                                    N
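A one-bit counter can be simulated in a few lines, which is a convenient way to check a hand-filled table. The sketch below assumes the usual behaviour, in which the single bit simply records the most recent outcome; the function name and the short example sequence are illustrative and are not the sequence from the question.

```python
def one_bit_predictor(outcomes, initial="N"):
    """Simulate a one-bit predictor for a single branch.

    Returns one (state_before, outcome, mispredicted) tuple per branch.
    """
    state = initial
    rows = []
    for actual in outcomes:
        rows.append((state, actual, state != actual))
        state = actual  # the one-bit counter remembers the last outcome
    return rows


# Short made-up outcome sequence for illustration.
rows = one_bit_predictor("TTNT")
mispredicted = [i for i, (_, _, miss) in enumerate(rows) if miss]
for state, actual, miss in rows:
    print(state, actual, "mispredicted" if miss else "correct")
```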

(b) Now, assume a two-level branch predictor that uses one bit of branch history (i.e., a one-bit BHR). Since there is only one branch in the program, it does not matter how the BHR is concatenated with the branch PC to index the BHT. Assume that the BHT uses one-bit counters and that, again, all entries are initialized to N. Which of the branches in this sequence would be mis-predicted? Use the table below.

Predictor state before prediction   BHR   Branch outcome   Misprediction?
N                                   N     T
                                          T
                                          T
                                          N
                                          T
                                          N
                                          T
                                          T
                                          T
                                          N
                                          T
                                          N
                                          T
                                          T
                                          T
                                          N
                                          T
                                          N

(c) What is a return-address-stack? When is a return address stack updated?

5. [16 points] The classic 5-stage pipeline seen in Section 4.5 is IF, ID, EX, MEM, WB. This pipeline is designed specifically to execute the MIPS instruction set. MIPS is a load store architecture that performs one memory operation per instruction, hence a single MEM stage in the pipeline suffices. Also, its most common addressing mode is register displacement addressing. The EX stage is placed before the MEM stage to allow it to be used for address calculation. In this question we will consider a variation in the MIPS instruction set and the interactions of this variation with the pipeline structure. The particular variation we are considering involves swapping the MEM and EX stages, creating a pipeline that looks like this: IF, ID, MEM, EX, WB. This change has two effects on the instruction set. First, it prevents us from using register displacement addressing (there is no longer an EX in front of MEM to accomplish this). However, in return we can use instructions with one memory input operand, i.e., register-memory instructions. For instance: multf_m f0,f2,(r2) multiplies the contents of register f2 and the value at the memory location pointed to by r2, putting the result in f0.

(a) Dropping the register displacement addressing mode is potentially a big loss, since it is the mode most frequently used in MIPS. Why is it so frequent? Give two popular software constructs whose implementation uses register displacement addressing (i.e., uses displacement addressing with non-zero displacements).

(b) What is the difference between a dependence and a hazard?

(c) In this question we will work with the SAXPY loop.

    do I = 0,N
        Z[I] = A*X[I] + Y[I]

Here is the new assembly code.

    0: slli    r2,r1,#3      // I is in r1
    1: addi    r3,r2,#X
    2: multf_m f2,f0,(r3)    // A is in f0
    3: addi    r4,r2,#Y
    4: addf_m  f4,f2,(r4)
    5: addi    r4,r2,#Z
    6: sf      f4,(r4)
    7: addi    r1,r1,#1
    8: slei    r6,r1,r5      // N is in r5
    9: bnez    r6,#0

Using the instruction numbers, label the data and control dependences.

(d) Fill in the pipeline diagram for the code for the new SAXPY loop. Label the stalls as d* for data-hazard stalls and s* for structural stalls. What is the latency of a single iteration (the number of cycles between the completion of two successive #0 instructions)? For this question, assume that FP addition takes 2 cycles, FP multiplication takes 3 cycles and that all other operations take a single cycle. The functional units are not pipelined. The FP adder, FP multiplier and integer ALU are all separate functional units, such that there are no structural hazards between them. The register file is written by the WB stage in the first half of a clock cycle and is read by the ID stage in the second half of a clock cycle. In addition, the processor has full forwarding. The processor stalls on branches until the outcome is available, which is at the end of the EX stage. The processor has no provisions for maintaining “precise state”.

(e) In the pipeline for MIPS mentioned in the text, what is the reason for forcing non-memory operations to go through the MEM stage rather than proceeding directly to the WB stage?

(f) Aside from the direct loss of register displacement addressing and the subsequent instructions required to explicitly compute addresses, what are two other disadvantages of this sort of pipeline?

(g) Reduce the stalls by pipeline scheduling a single loop iteration. Show the resulting code and fill in the pipeline diagram. You do not need to show the optimal schedule for a correct response.

6. [8 points] A two-part question.

(Part A) Dependence detection

This question covers your understanding of dependences between instructions. Using the code below, list all of the dependence types (RAW, WAR, WAW). List the dependences in the respective table (example INST-X to INST-Y) by writing in the instruction numbers involved with the dependence.

    I0: A = B + C;
    I1: C = A - B;
    I2: D = A + C;
    I3: A = B * C * D;
    I4: C = F / D;
    I5: F = A ^ G;
    I6: G = F + D;

RAW Dependence            WAR Dependence            WAW Dependence
From Instr   To Instr     From Instr   To Instr     From Instr   To Instr

(Part B) Dependence analysis Given four instructions, how many unique comparisons (between register sources and destinations) are necessary to find all of the RAW, WAR, and WAW dependences. Answer for the case of four instructions, and then derive a general equation for N instructions. Assume that all instructions have one register destination and two register sources.
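The pairwise register comparisons described in Part B can be sketched as follows. This is an illustration under stated assumptions: the function name, the tuple encoding of an instruction as (destination, sources), and the three-instruction example program are all hypothetical, and the sketch reports every name match between an earlier and a later instruction without checking whether an intervening write hides it.

```python
def find_dependences(instrs):
    """Return (raw, war, waw) lists of (earlier, later) instruction index pairs.

    Each instruction is a (dest, sources) tuple of register names.
    """
    raw, war, waw = [], [], []
    for i, (dest_i, srcs_i) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            dest_j, srcs_j = instrs[j]
            if dest_i in srcs_j:      # later instruction reads what the earlier wrote
                raw.append((i, j))
            if dest_j in srcs_i:      # later instruction writes what the earlier read
                war.append((i, j))
            if dest_i == dest_j:      # both instructions write the same register
                waw.append((i, j))
    return raw, war, waw


# Made-up three-instruction program, one destination and two sources each.
program = [
    ("r1", ("r2", "r3")),   # I0: r1 = r2 + r3
    ("r2", ("r1", "r4")),   # I1: r2 = r1 - r4
    ("r1", ("r2", "r5")),   # I2: r1 = r2 * r5
]
raw, war, waw = find_dependences(program)
print("RAW:", raw, "WAR:", war, "WAW:", waw)
```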

7. [10 points] This is a three-part question about critical path calculation. Consider a simple single-cycle implementation of the MIPS ISA. The operation times for the major functional components for this machine are as follows:

Component                                    Latency
ALU                                          10 ns
Adder                                        8 ns
ALU Control Unit                             2 ns
Shifter                                      3 ns
Control Unit/ROM                             4 ns
Sign/zero extender                           3 ns
2-1 MUX                                      2 ns
Memory (read/write) (instruction or data)    15 ns
PC Register (read action)                    1 ns
PC Register (write action)                   1 ns
Register file (read action)                  7 ns
Register file (write action)                 5 ns
Logic (1 or more levels of gates)            1 ns

Below is a copy of the MIPS single-cycle datapath design. In this implementation the clock cycle is determined by the longest possible path in the machine. The critical paths for the different instruction types that need to be considered are: R-format, load-word, and store-word. All instructions have the same instruction fetch and decode steps. The basic register transfers of the instructions are:

    Fetch/Decode: Instruction <- IMEM[PC];
    R-type:       R[rd] <- R[rs] op R[rt]; PC <- PC + 4;
    load:         R[rt] <- DMEM[ R[rs] + signext(offset)]; PC <- PC + 4;
    store:        DMEM[ R[rs] + signext(offset)] <- R[rt]; PC <- PC + 4;

(Part A) In the table below, indicate the components that determine the critical path for the respective instruction, in the order that the critical path occurs. If a component is used, but not part of the critical path of the instruction (i.e., happens in parallel with another component), it should not be in the table. The register file is used for reading and for writing; it will appear twice for some instructions. All instructions begin by reading the PC register with a latency of 2 ns.

Instruction Type   Hardware Elements Used By Instruction
R-Format
Load
Store

(Part B) Place the latencies of the components that you have decided for the critical path of each instruction in the table below. Compute the sum of each of the component latencies for each instruction.

Instruction Type   Hardware Latencies For Respective Elements   Total
R-Format           2 ns
Load               2 ns
Store              2 ns

(Part C) Use the total latency column to derive the following critical path information:

• Given the data path latencies above, which instruction determines the overall machine critical path (latency)?

• What will be the resultant clock cycle time of the machine based on the critical path instruction?

• At what frequency will the machine run?

8. [18 points] This problem covers your knowledge of branch prediction. The figure below illustrates three possible predictors.

• Last taken predicts taken when 1
• Up-Down (saturating counter) predicts taken when 11 and 10
• Automata A3 predicts taken when 11 and 10

Fill out the tables below and on the next page for each branch predictor. The execution pattern for the branch is NTNNTTTN.

Table 1: Table for last-taken branch predictor

Execution Time   Branch Execution   State Before   Prediction   Correct or Incorrect   State After
0                N                  0
1                T
2                N
3                N
4                T
5                T
6                T
7                N



Table 2: Table for saturating counter (up-down) branch predictor

Execution Time   Branch Execution   State Before   Prediction   Correct or Incorrect   State After
0                N                  01
1                T
2                N
3                N
4                T
5                T
6                T
7                N

Table 3: Table for Automata-A3 branch predictor

Execution Time   Branch Execution   State Before   Prediction   Correct or Incorrect   State After
0                N                  01
1                T
2                N
3                N
4                T
5                T
6                T
7                N

Calculate the prediction rates of the three branch predictors:

Predictor      Prediction Accuracy
Last-taken
Up-Down
Automata-A3
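
For readers who want to check their table-filling method, here is a minimal sketch of a 2-bit saturating counter. It assumes the conventional up-down behavior (count up on taken, down on not taken, predict taken in states 10 and 11); the state diagram in the quiz figure remains authoritative, and the Automata-A3 transitions are not modeled. The function name and the input pattern are hypothetical.

```python
def simulate_two_bit_counter(outcomes, state=1):
    """Simulate a 2-bit saturating (up-down) counter.

    States 0..3 stand for 00, 01, 10, 11; states 2 and 3 predict taken.
    `outcomes` is a string of 'T' (taken) and 'N' (not taken).
    Returns the number of correct predictions and the final state.
    """
    correct = 0
    for outcome in outcomes:
        predicted_taken = state >= 2
        actually_taken = outcome == "T"
        if predicted_taken == actually_taken:
            correct += 1
        # Count up on taken, down on not taken, saturating at 0 and 3.
        state = min(state + 1, 3) if actually_taken else max(state - 1, 0)
    return correct, state


# Hypothetical pattern, deliberately different from the quiz pattern.
hits, final_state = simulate_two_bit_counter("TTNT", state=1)
print(hits, final_state)  # 2 3
```

Prediction accuracy is then the number of correct predictions divided by the number of branch executions.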



9. [3 points] Pipelining is used because it improves instruction throughput. Increasing the level of pipelining cuts the amount of work performed at each pipeline stage, allowing more instructions to exist in the processor at the same time and individual instructions to complete at a more rapid rate. However, throughput will not improve as pipelining is increased indefinitely. Give two reasons for this.



10. [3 points] Forwarding logic design. For this problem you are to design a forwarding unit for a 5-stage pipeline processor. The forwarding unit returns the value to be forwarded to the current instruction. There are three places that the values for register RS and register RT can come from: decode stage (register file), memory stage, and write-back stage.

The write-back and memory stage information consists of:

• _INDEX: which in-flight register index is to be written
• _VALUE: the value that is to be written
• _ENABLE: whether or not the instruction in the stage is writing

The decode stage simply states the register index (for RS and RT) and the corresponding register value from the register file. Generally three values could exist, one of which the forwarding unit should choose for each of the RS and RT register value requests. The memory stage has value MEM, the write-back stage has value WB, and the register file has value RS-REG or RT-REG. Using the table below, which contains information about all of the instruction stages, indicate which value should be forwarded to the current instruction: MEM, WB, RS-REG, or RT-REG. Each line represents a forwarding unit evaluation; there is no connection between evaluation lines in the table. You do not need to worry about hazard detection, only value bypassing.

             Mem Stage        Write-Back Stage   Register Stage
Evaluation   Index   Write    Index   Write      RS-Index   RT-Index   RS Value   RT Value
0            5       1        23      0          6          7
1            7       0        16      1          16         8
2            10      1        10      1          11         10
3            17      0        12      1          12         12
4            19      0        19      0          19         25
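
The selection rule described above can be sketched as a small function. This is an illustrative model under stated assumptions, not the quiz's reference design: the memory stage is given priority because it holds the younger instruction, the special treatment of register 0 in MIPS is ignored, and the name `select_forward` and the generic label "REG" (standing for RS-REG or RT-REG) are hypothetical.

```python
def select_forward(src_index, reg_value, mem, wb):
    """Pick the value source for one operand (RS or RT).

    `mem` and `wb` are (index, value, enable) tuples for the instructions
    in the memory and write-back stages. The memory stage holds the
    younger instruction, so it takes priority when both stages match.
    Returns a (label, value) pair.
    """
    mem_index, mem_value, mem_enable = mem
    wb_index, wb_value, wb_enable = wb
    if mem_enable and mem_index == src_index:
        return "MEM", mem_value
    if wb_enable and wb_index == src_index:
        return "WB", wb_value
    return "REG", reg_value


# Hypothetical inputs, unrelated to the evaluation rows of the quiz table.
print(select_forward(3, 100, mem=(3, 7, 1), wb=(3, 9, 1)))  # ('MEM', 7)
print(select_forward(3, 100, mem=(3, 7, 0), wb=(3, 9, 1)))  # ('WB', 9)
print(select_forward(3, 100, mem=(4, 7, 1), wb=(5, 9, 1)))  # ('REG', 100)
```

The function would be called once for RS and once for RT in each evaluation, since the two operands are resolved independently.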



11. [12 points] Consider a MIPS machine with a 5-stage pipeline with a cycle time of 10 ns. Assume that you are executing a program where a fraction, f, of all instructions immediately follow a load upon which they are dependent.

(a) With forwarding enabled, what is the total execution time for N instructions, in terms of f?

(b) Consider a scenario where the MEM stage, along with its pipeline registers, needs 12 ns. There are now two options: add another MEM stage so that there are MEM1 and MEM2 stages, or increase the cycle time to 12 ns so that the MEM stage fits within the new cycle time and the number of pipeline stages remains unaffected. For a program mix with the above characteristics, when is the first option better than the second? Your answer should be based on the value of f.

(c) Embedded processors have two different memory regions: a faster scratchpad memory and a slower normal memory. Assume that in the 6-stage machine (with MEM1 and MEM2 stages), there is a region of memory that is faster and for which the correct value is obtained at the end of the MEM1 stage itself, while the rest of the memory needs both MEM1 and MEM2 stages. For the sake of simplicity, assume that there are two load instructions, load.fast and load.slow, that indicate which memory region is accessed. If 40% of the fraction f mentioned above get their value from the fast memory, how does the answer to the previous question change?



12. [10 points] You are given a 4-stage pipelined processor as described below.

• IF: Instruction Fetch
• IDE: Instruction Decode, Register Fetch, ALU evaluation, branch instructions change PC, address calculation for memory access.
• MEM: memory access for load and store instructions.
• WB: Write the execution result back to the register file. The writeback occurs at the 2nd half of the cycle.

Assume the delayed branching method discussed in Section 4.3. For the following program, assume that the loop will iterate 15 times. Assume that the pipeline finishes one instruction every cycle except when a branch is taken or when an interlock takes place. An interlock prevents instructions from being executed in the wrong sequence to preserve original data dependencies. Assume register bypass from both the IDE output and the MEM output. Also assume that r2 will not be needed after the execution returns.

(a) Is there any interlock cycle in the program? If so, perform code reordering on the program and show the new program without the interlock cycle.
(b) Derive the total number of cycles required to execute all instructions before and after you eliminated the interlock cycle.
(c) Fill the delay slots. Describe the code reordering and/or duplication performed. Show the same program after delay slot filling. Recall that RETURN is also a branch instruction. Use a ’ to mark the new copy of a duplicated instruction. For example, if you duplicated D, name the new copy D’.
(d) Derive the total number of cycles required to execute all instructions after you filled the delay slots.



13. [3 points] Using any ILP optimization, double the performance of the following loop, or explain why it is not possible. The machine can only do one branch per cycle, but has infinite resources otherwise.

          r1 = ...              ; r1 is head pointer to a linked list
          r3 = 0
    LOOP: r2 = M[r1 + 8]
          r3 = r3 + r2
          r1 = M[r1]
          branch r1 != 0, LOOP
          ... = r3              ; r3 is used when loop complete



14. [10 points] Consider the datapath below. This machine does not support code with branch delay slots. (It predicts not-taken with a 1-cycle penalty on taken branches.) For each control signal listed in the table below, determine its value at cycles 3 through 9, inclusive. Also, show the instruction occupying each stage of the pipeline in all cycles. (Assume the IF/ID write-enable line is set to the inverse of the Stall signal.)

The initial state of the machine is:

    PC = 0
    All pipeline registers contain 0s
    All registers in the register file contain 0s
    The data memory contains 0s in all locations
    The instruction memory contains:

        00: addiu $3, $zero, 4
        04: lw    $4, 100($3)
        08: addu  $2, $4, $3
        0C: beq   $4, $zero, 0x14
        10: addiu $3, $3, 1
        14: addu  $2, $2, $3

    All other locations contain 0

Use data forwarding whenever possible. All mux inputs are numbered vertically from “top” to “bottom” starting at 0 as you look at the datapath in the proper landscape orientation. Also, the values for ALUOp are:

Value   Desired ALU Action
00      Add
01      Subtract
10      Determine by decoding funct field

Instruction formats:



Pipeline stage occupancy:

Time   IF          ID          EX          M    WB
0      00: addiu   --          --          --   --
1      04: lw      00: addiu   --          --   --
2      08: addu    04: lw      00: addiu   --   --
3
4
5
6
7
8
9

Control signals:

Time   PCWrite   IF.Flush   Branch   Stall   ALUSrc   ALUOp   RegDst   ForwardA   ForwardB   MemRead   MemWrite   MemtoReg   RegWrite
0      1         0          0        0       0        00      0        00         00         0         0          0          0
1      1         0          0        0       0        00      0        00         00         0         0          0          0
2      1         0          0        0       1        00      0        00         00         0         0          0          0
3
4
5
6
7
8
9



15. [6 points] Consider the following data path diagram:

(a) Discuss the functionality of the RegDst and the ALUSrc control wires. (b) Next, modify the diagram to indicate the datapath changes (and any additional multiplexing) needed to provide bypassing from EX to EX for all possible RAW hazards on arithmetic instructions. How does ALUSrc change when bypassing is added?



16. [3 points] Imagine an instruction whose function is to read four adjacent 32-bit words from memory and place them into four specified 32-bit architectural registers. Assuming the 5-stage pipeline is filled with these instructions and these instructions ONLY, what is the minimum number of register file read and write ports that would be required?



17. [12 points] Pipelining and Bypass. In this question we will explore how bypassing affects program execution performance. To begin, consider the standard MIPS 5-stage pipeline. For your reference, see the figure below. For this question, we will use the following code to evaluate the pipeline’s performance:

    1  add  $t2, $s1, $sp
    2  lw   $t1, $t1, 0
    3  addi $t2, $t1, 7
    4  add  $t1, $s2, $sp
    5  lw   $t1, $t1, 0
    6  addi $t1, $t1, 9
    7  sub  $t1, $t1, $t2

(a) What is the load-use latency for the standard MIPS 5-stage pipeline?

(b) Once again, using the standard MIPS pipeline, identify whether the value for each register operand is coming from the bypass or from the register file. For clarity, please write REG or BYPASS in each box.

Instruction   Src Operand 1   Src Operand 2
1
2                             N/A
3                             N/A
4
5                             N/A
6                             N/A
7

(c) How many cycles will the program take to execute on the standard MIPS pipeline?

(d) Assume, due to circuit constraints, that the bypass wire from the memory stage back to the execute stage is omitted from the pipeline. What is the load-use latency for this modified pipeline?

(e) Identify whether the value for each register operand is coming from the bypass or from the register file for the modified pipeline. For clarity, please write REG or BYPASS in each box.

Instruction   Src Operand 1   Src Operand 2
1
2                             N/A
3                             N/A
4
5                             N/A
6                             N/A
7

(f) How long does the program take to execute on the modified pipeline?



18. [3 points] Scheduling. “A schedule defends from chaos and whim.” –Annie Dillard

Consider this pseudo-assembly snippet:

    1: mov r1 = 0           ;; iteration count i
    2: mov r2 = 0           ;; initialize x
    loop:
    3: sll r3 = r1, 2       ;; r3 = r1*4
    4: lw r4 = A(r3)        ;; load A[i]
    5: add r2 = r2, r4      ;; x += A[i]
    6: add r1 = r1, 1       ;; increment counter
    7: bne r1, 1024, loop   ;; backwards branch

Assume you have a machine with infinite resources (i.e., infinite width, infinite instruction window, infinite number of functional units, infinite rename resources, infinite number of outstanding/overlapping loads, etc.), and that it is fully pipelined so that all normal instructions take 1 cycle. Loads which hit in the cache also take 1 cycle, but loads that miss take 10 cycles. Also assume perfect branch prediction.

(a) Show the dynamic schedule of the instructions of the first two iterations of the loop in the table below, assuming that the load misses in the cache in even iterations (r1=0,2,...) and hits in the cache in odd iterations (r1=1,3,...). Be sure to indicate the loop iteration each instruction is associated with. For example, if instruction 5 of iteration 7 can run in cycle 3, then you would write “5.7” in one of the boxes in cycle 3’s row. Note that the first cycle has been filled in for you and that the table has more cells than you need.

Cycle   Instructions
1       1   2
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18



19. [3 points] Prediction and Predication. “When a decision has to be made, make it.” –General George Patton

(a) Consider the following piece of code:

    for(i=0; i<1000000; i++){
        a = random(100);
        if(a >= 50){
            ...
        }
    }

Assume that random(N) returns a random number uniformly distributed between 0 and N-1 inclusive. Consider the branch instruction associated with the if statement. Is there a type of branch predictor that predicts this branch well? Explain your answer.

(b) Predication can eliminate all forward conditional branches in a program. The backward branches in a program are typically associated with loops and hence are mostly taken. So it is possible to eliminate all forward branches and statically predict the backward branches as Taken, and this would get rid of the complex branch predictors in a machine. But the Itanium processor, which has hardware support for predication, still retains a complex two-level branch predictor. Explain why this is the case.



20. [10 points] Compilers are useful (occasionally). A common loop in scientific programs is the so-called SAXPY loop:

    for (i=0;i<N;i++) {
        Y[i] = A*X[i] + Y[i];
    }

(a) For A=3 and N=100, convert the C code given above into MIPS assembly. You may assume that X and Y are 32-bit integer arrays. Try to avoid using pseudo-ops as they will hinder the next parts of this question. Also, avoid introducing false dependences. Please use $s0 to hold the value of i. You may also assume that $s2 and $s3 initially hold the address of the start of X and Y, respectively. Hint: Since A=3=2+1, you can easily perform the multiplication without using the mult instruction.

(b) How many static instructions (total instructions in memory) does the snippet have? How many dynamic instructions (total instructions executed) does the snippet have?

(c) Suppose we wanted to do more than one SAXPY loop:

    for (i=0;i<N;i++) {
        Y[i] = A*X[i] + Y[i];
    }
    for (i=0;i<N;i++) {
        Z[i] = A*X[i] + Z[i];
    }

Using your computations from parts (a) and (b), how many dynamic and how many static instructions will this snippet have?



(d) There are two optimizations that can be performed on this code. The first is loop fusion, which combines two similar loops into one loop:

    for (i=0;i<N;i++) {
        Y[i] = A*X[i] + Y[i];
        Z[i] = A*X[i] + Z[i];
    }

Another optimization we can do is common sub-expression elimination. This means that we can factor out repeated computation:

    for (i=0;i<N;i++) {
        T = A*X[i];
        Y[i] = T + Y[i];
        Z[i] = T + Z[i];
    }

Convert each of these snippets into MIPS assembly. $s4 will contain the address of the start of Z. Use $s1 to store the value of T. How many dynamic and how many static instructions does each snippet have? How much do you save with loop fusion (over unoptimized code)? How much do you save with common sub-expression elimination (over loop fusion)?
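
As a quick sanity check that the two transformations preserve behavior, the following sketch expresses the three loop variants in Python rather than MIPS. It is only an illustration: the function names and the small input arrays are hypothetical, and it says nothing about instruction counts, which depend on the assembly you write.

```python
def saxpy_two_loops(a, x, y, z):
    """Unoptimized form: two separate loops over the same index range."""
    y, z = list(y), list(z)
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    for i in range(len(x)):
        z[i] = a * x[i] + z[i]
    return y, z


def saxpy_fused(a, x, y, z):
    """Loop fusion: one loop body updates both arrays."""
    y, z = list(y), list(z)
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
        z[i] = a * x[i] + z[i]
    return y, z


def saxpy_fused_cse(a, x, y, z):
    """Fusion plus common sub-expression elimination: a*x[i] computed once."""
    y, z = list(y), list(z)
    for i in range(len(x)):
        t = a * x[i]
        y[i] = t + y[i]
        z[i] = t + z[i]
    return y, z


x, y, z = [1, 2, 3], [10, 20, 30], [5, 5, 5]
assert saxpy_two_loops(3, x, y, z) == saxpy_fused(3, x, y, z) == saxpy_fused_cse(3, x, y, z)
print(saxpy_fused_cse(3, x, y, z))  # ([13, 26, 39], [8, 11, 14])
```

Fusion is legal here because neither loop writes a location that the other loop reads; with overlapping arrays the equivalence would need to be re-examined.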



(e) Annotate assembly code for the loop fused version from part (d) with data dependence arcs (flow, anti, and output - not control).

Date:

Quiz for Chapter 5 Large and Fast: Exploiting Memory Hierarchy 3.10

Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully.

Name:                    Course:

1. [24 points] Caches and Address Translation. Consider a 64-byte cache with 8-byte blocks, an associativity of 2, and LRU block replacement. Virtual addresses are 16 bits. The cache is physically tagged. The processor has 16KB of physical memory.

(a) What is the total number of tag bits?
(b) Assuming there are no special provisions for avoiding synonyms, what is the minimum page size?
(c) Assume each page is 64 bytes. How large would a single-level page table be, given that each page requires 4 protection bits and entries must be an integral number of bytes?


(d) For the following sequence of references, label the cache misses. Also, label each miss as being either a compulsory miss, a capacity miss, or a conflict miss. The addresses are given in octal (each digit represents 3 bits). Assume the cache initially contains block addresses 000, 010, 020, 030, 040, 050, 060, and 070, which were accessed in that order.

Cache state prior to access          Reference address   Miss?   Which?
(00,04),(01,05),(02,06),(03,07)      024
(00,04),(01,05),(06,02),(03,07)      100
(04,10),(01,05),(06,02),(03,07)      270
(04,10),(01,05),(06,02),(07,27)      570
(04,10),(01,05),(06,02),(27,57)      074
(04,10),(01,05),(06,02),(57,07)      272
(04,10),(01,05),(06,02),(07,27)      004
(10,00),(01,05),(06,02),(07,27)      044
(00,04),(01,05),(06,02),(07,27)      640
(04,64),(01,05),(06,02),(07,27)      000
(64,00),(01,05),(06,02),(07,27)      410
(64,00),(05,41),(06,02),(07,27)      710
(64,00),(41,71),(06,02),(07,27)      550
(64,00),(71,55),(06,02),(07,27)      570
(64,00),(71,55),(06,02),(27,57)      410

(e) Which of the following techniques are aimed at reducing the cost of a miss: dividing the current block into sub-blocks, a larger block size, the addition of a second-level cache, the addition of a victim buffer, early restart with critical word first, a writeback buffer, skewed associativity, software prefetching, the use of a TLB, and multi-porting?

(f) Why are the first-level caches usually split (instructions and data are in different caches) while the L2 is usually unified (instructions and data are both in the same cache)?



2. [6 points] A two-part question.

(Part A) Assume the following 10-bit address sequence generated by the microprocessor:

Time     0          1          2          3          4          5          6          7
Access   10001101   10110010   10111111   10001100   10011100   11101001   11111110   11101001
TAG
SET
INDEX

The cache uses 4 bytes per block. Assume a 2-way set-associative cache design that uses the LRU algorithm (with a cache that can hold a total of 4 blocks). Assume that the cache is initially empty. First determine the TAG, SET, and BYTE OFFSET fields and fill in the table above. In the figure below, clearly mark the TAG, Least Recently Used (LRU), and HIT/MISS information for each access.

[Figure: blank answer grids for the initial state and for each access, each with columns Block 0 and Block 1 and rows Set 0 and Set 1.]



(Part B) Derive the hit ratio for the access sequence in Part A.



3. [6 points] A two-part question.

(a) Why is miss rate not a good metric for evaluating cache performance? What is the appropriate metric? Give its definition. What is the reason for using a combination of first- and second-level caches rather than using the same chip area for a larger first-level cache?

(b) The original motivation for using virtual memory was “compatibility”. What does that mean in this context? What are two other motivations for using virtual memory?



4. [6 points] A four-part question.

(Part A) What are the two characteristics of program memory accesses that caches exploit?

(Part B) What are three types of cache misses? Compulsory (cold) misses, capacity misses, and conflict misses.

(Part C) Design a 128KB direct-mapped data cache that uses a 32-bit address and 16 bytes per block. Calculate the following:
(a) How many bits are used for the byte offset?
(b) How many bits are used for the set (index) field?
(c) How many bits are used for the tag?



(Part D) Design an 8-way set-associative cache that has 16 blocks and 32 bytes per block. Assume a 32-bit address. Calculate the following:
(a) How many bits are used for the byte offset?
(b) How many bits are used for the set (index) field?
(c) How many bits are used for the tag?
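
The field widths asked for in Parts C and D follow from the usual address split into tag, index, and byte offset. The sketch below shows that split for a hypothetical cache that is deliberately not one of the quiz configurations; it assumes every size is a power of two, and the helper name `cache_address_fields` is introduced here only for illustration.

```python
def cache_address_fields(num_blocks, block_bytes, ways, address_bits):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache.

    Assumes every size is a power of two. A direct-mapped cache is ways=1;
    a fully associative cache has ways == num_blocks (zero index bits).
    """
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    num_sets = num_blocks // ways
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


# Hypothetical cache, not one of the quiz configurations:
# 32 KB, 4-way set associative, 64-byte blocks, 32-bit addresses.
blocks = (32 * 1024) // 64
print(cache_address_fields(blocks, 64, 4, 32))  # (19, 7, 6)
```

The number of blocks is the cache capacity divided by the block size, so a capacity given in bytes has to be converted before calling the helper.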



5. [6 points] The memory architecture of a machine X is summarized in the following table.

Virtual Address    54 bits
Page Size          16 K bytes
PTE Size           4 bytes

(a) Assume that there are 8 bits reserved for the operating system functions (protection, replacement, valid, modified, and hit/miss; all overhead bits) other than those required by the hardware translation algorithm. Derive the largest physical memory size (in bytes) allowed by this PTE format. Make sure you consider all the fields required by the translation algorithm.

(b) How large (in bytes) is the page table?

(c) Assuming 1 application exists in the system and the maximum physical memory is devoted to the process, how much physical space (in bytes) is there for the application’s data and code?



6. [6 points] This question covers virtual memory access. Assume a 5-bit virtual address and a memory system that uses 4 bytes per page. The physical memory has 16 bytes (four page frames). The page table used is a one-level scheme that can be found in memory at the PTBR location. Initially the table indicates that no virtual pages have been mapped. Implementing an LRU page replacement algorithm, show the contents of physical memory after the following virtual accesses: 10100, 01000, 00011, 01011, 01011, 11111. Show the contents of memory and the page table information after each access successfully completes in the figures that follow. Also indicate when a page fault occurs. Each page table entry (PTE) is 1 byte.
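
A minimal sketch of LRU page replacement may help when checking the figures by hand. It tracks only which virtual pages are resident and whether each access faults; frame numbers, the page table layout, and the PTBR are not modeled. The function name, the page size, the frame count, and the access trace below are all hypothetical and differ from the quiz values.

```python
from collections import OrderedDict


def simulate_lru_paging(virtual_addresses, page_bytes, num_frames):
    """Simulate LRU page replacement.

    Returns a list of (virtual_page, fault) pairs, one per access, and the
    resident pages ordered from least to most recently used.
    Frame numbers are not modeled; only residency is tracked.
    """
    resident = OrderedDict()  # keys are virtual page numbers, oldest first
    trace = []
    for addr in virtual_addresses:
        vpn = addr // page_bytes
        fault = vpn not in resident
        if fault and len(resident) == num_frames:
            resident.popitem(last=False)  # evict the least recently used page
        resident[vpn] = True
        resident.move_to_end(vpn)  # mark as most recently used
        trace.append((vpn, fault))
    return trace, list(resident)


# Hypothetical trace: 8-byte pages and 2 frames.
trace, pages = simulate_lru_paging([0b000001, 0b001000, 0b000010, 0b010000], 8, 2)
print(trace)  # [(0, True), (1, True), (0, False), (2, True)]
print(pages)  # [0, 2]
```

The virtual page number is the address with the page-offset bits dropped, which is what the integer division by the page size computes.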



7. [6 points] A multipart question.

(Part A) In what pipeline stage is the branch target buffer checked?
(a) Fetch
(b) Decode
(c) Execute
(d) Resolve

What needs to be stored in a branch target buffer in order to eliminate the branch penalty for an unconditional branch?
(a) Address of branch target
(b) Address of branch target and branch prediction
(c) Instruction at branch target

The Average Memory Access Time (AMAT) equation has three components: hit time, miss rate, and miss penalty. For each of the following cache optimizations, indicate which component of the AMAT equation is improved.

• Using a second-level cache
• Using a direct-mapped cache
• Using a 4-way set-associative cache
• Using a virtually-addressed cache
• Performing hardware pre-fetching using stream buffers
• Using a non-blocking cache
• Using larger blocks



(Part B)
(a) (True/False) A virtual cache access time is always faster than that of a physical cache.
(b) (True/False) High associativity in a cache reduces compulsory misses.
(c) (True/False) Both DRAM and SRAM must be refreshed periodically using a dummy read/write operation.
(d) (True/False) A write-through cache typically requires less bus bandwidth than a write-back cache.
(e) (True/False) Cache performance is of less importance in faster processors because the processor speed compensates for the high memory access time.
(f) (True/False) Memory interleaving is a technique for reducing memory access time through increased bandwidth utilization of the data bus.



8. [12 points] A three-part question. This question covers cache and pipeline performance analysis.

(Part A) Write the formula for the average memory access time assuming one level of cache memory:

(Part B) For a data cache with a 92% hit rate and a 2-cycle hit latency, calculate the average memory access latency. Assume that latency to memory and the cache miss penalty together is 124 cycles. Note: The cache must be accessed after memory returns the data.

(Part C) Calculate the performance of a processor taking into account stalls due to data cache and instruction cache misses. The data cache (for loads and stores) is the same as described in Part B and 30% of instructions are loads and stores. The instruction cache has a hit rate of 90% with a miss penalty of 50 cycles. Assume the base CPI using a perfect memory system is 1.0. Calculate the CPI of the pipeline, assuming everything else is working perfectly. Assume the load never stalls a dependent instruction and assume the processor must wait for stores to finish when they miss the cache. Finally, assume that instruction cache misses and data cache misses never occur at the same time. Show your work.

• Calculate the additional CPI due to the icache stalls.
• Calculate the additional CPI due to the dcache stalls.
• Calculate the overall CPI for the machine.
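
The two quantities in this question are commonly computed with the textbook forms sketched below. This is a hedged illustration with hypothetical numbers that differ from the quiz parameters; it does not account for the note in Part B about re-accessing the cache after memory returns the data, and the function names are made up for this sketch.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


def cpi_with_stalls(base_cpi, accesses_per_instr, miss_rate, miss_penalty):
    """Add memory-stall cycles per instruction to a base CPI."""
    return base_cpi + accesses_per_instr * miss_rate * miss_penalty


# Hypothetical numbers, chosen so they differ from the quiz parameters.
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))  # about 6 cycles
print(cpi_with_stalls(1.0, 0.25, 0.04, 50))                # about 1.5
```

For instruction fetches the accesses-per-instruction factor is 1, while for data accesses it is the fraction of instructions that are loads and stores; the two stall terms add when the misses never overlap.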



9. [12 points] A three-part question.

(Part A) A processor has a 32-byte memory and an 8-byte direct-mapped cache. Table 0 shows the current state of the cache. Write hit or miss under each address in the memory reference sequence below. Show the new state of the cache for each miss in a new table, label the table with the address, and circle the change:

Addr   10011   00001   00110   01010   01110   11001   00001   11100   10100
H/M



(Part B) Do the same thing as in Part A, except for a 4-way set-associative cache. Assume 00110 and 11011 were the last two addresses to be accessed. Use the Least Recently Used replacement policy.

Addr   10011   00001   00110   01010   01110   11001   00001   11100   10100
H/M



(Part C)
(a) What is the hit and miss rate of the direct-mapped cache in Part A?
(b) What is the hit and miss rate of the 4-way set-associative cache in Part B?
(c) Assume a machine with a CPI of 4 and a miss penalty of 10 cycles. Ignoring writes, calculate the ratio of the performance of the 4-way set-associative cache to the direct-mapped cache. In other words, what is the speedup when using the machine with the 4-way cache?



10. [6 points] Consider a memory system with the following parameters:

• Translation Lookaside Buffer has 512 entries and is 2-way set associative.
• 64 Kbyte L1 Data Cache has 128-byte lines and is also 2-way set associative.
• Virtual addresses are 64 bits and physical addresses are 32 bits.
• 8KB page size

Below are diagrams of the cache and TLB. Please fill in the appropriate information in the table that follows the diagrams:

L1 Cache         TLB
A = ____ bits    F = ____ bits
B = ____ bits    G = ____ bits
C = ____ bits    H = ____ bits
D = ____ bits    I = ____ bits
E = ____ bits



11. [3 points] Invalidation vs. Update-based Protocols. (a) As miss latencies increase, does an update protocol become more or less preferable to an invalidation-based protocol? Explain. (b) In a multilevel cache hierarchy, would you propagate updates all the way to the first-level cache or only to the second-level cache? Explain the trade-offs.



12. [6 points] How many total SRAM bits will be required to implement a 256KB four-way set-associative cache? The cache is physically indexed and has 64-byte blocks. Assume that there are 4 extra bits per entry: 1 valid bit, 1 dirty bit, and 2 LRU bits for the replacement policy. Assume that the physical address is 50 bits wide.



13. [6 points] TLBs are typically built to be fully associative or highly set associative. In contrast, first-level data caches are more likely to be direct-mapped or 2-way or 4-way set associative. Give two good reasons why this is so.



14. [6 points] Caches: Misses and Hits

    int i;
    int a[1024*1024];
    int x=0;
    for(i=0;i<1024;i++) {
        x+=a[i]+a[1024*i];
    }

Consider the code snippet above. Suppose that it is executed on a system with a 2-way set-associative 16KB data cache with 32-byte blocks, 32-bit words, and an LRU replacement policy. Assume that int is word-sized. Also assume that the address of a is 0x0, that i and x are in registers, and that the cache is initially empty. How many data cache misses are there? How many hits are there?



15. [6 points] Virtual Memory

(a) 32-bit Virtual Address Spaces. Consider a machine with 32-bit virtual addresses, 32-bit physical addresses, and a 4KB page size. Consider a two-level page table system where each table occupies one full page. Assume each page table entry is 32 bits long. To map the full virtual address space, how much memory will be used by the page tables?

(b) 64-bit Virtual Address Spaces. A two-part question.

(Part 1) Consider a machine with 64-bit virtual addresses, 64-bit physical addresses, and a 4MB page size. Consider a two-level page table system where each table occupies one full page. Assume each page table entry is 64 bits long. To map the full virtual address space, how much memory will be used by the page tables? (Hint: you will need more than 1 top-level page table. For this question this is okay.)

(Part 2) Rather than a two-level page table, what other page table architecture could be used to reduce the memory footprint of page tables for the 64-bit address space from the last question? Assume that you do not need to map the full address space, but some small fraction (people typically do not have 2^64 bytes of physical memory). However, you should assume that the virtual pages that are mapped are uniformly distributed across the virtual address space (i.e., it is not only the low addresses or high addresses that are mapped, but rather a few pages from all ranges of memory).



16. [6 points] Consider an architecture that uses virtual memory, a two-level page table for address translation, as well as a TLB to speed up address translations. Further assume that this machine uses caches to speed up memory accesses. Recall that all addresses used by a program are virtual addresses. Further recall that main memory in the microarchitecture is indexed using physical addresses. The virtual memory subsystem and cache memories could interact in several ways. In particular, the cache memories could be accessed using virtual addresses. We will refer to this scheme as a virtually indexed, virtually tagged cache. The cache could be indexed using virtual addresses, but the tag compare could happen with physical addresses (virtually indexed, physically tagged). Finally, the cache could be accessed using only the physical address. Describe the virtues and drawbacks for each of these systems. Be sure to consider the case where two virtual addresses map to the same physical address.

                 Virtually indexed, virtually tagged   Virtually indexed, physically tagged   Physically indexed, physically tagged
Advantages
Disadvantages



17. [6 points] Describe the general characteristics of a program that would exhibit very little temporal and spatial locality with regard to instruction fetches. Provide an example of such a program (pseudo-code is fine). Also, describe the cache effects of excessive unrolling. Use the terms static instructions and dynamic instructions in your description.



18. [6 points] You are given an empty 16K 2-way set-associative LRU-replacement cache with 32-byte blocks on a machine with 4-byte words and 32-bit addresses. Describe in mathematical terms a memory read address sequence which yields the following Hit/Miss patterns. If such a sequence is impossible, state why.

Sample sequences:
    address(N) = N mod 2^32 (= 0, 1, 2, 3, 4...)
    address = (7, 12, 14)

(a) Miss, Hit, Hit, Miss
(b) Miss, (Hit)*
(c) (Hit)*
(d) (Miss)*
(e) (Miss, Hit)*



19. [3 points] Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a machine has a CPI of 2 without any memory stalls and the miss penalty is 40 cycles for all misses, determine how much faster a machine would run with a perfect cache that never missed. Assume 36% of instructions are loads/stores.



20. [12 points] Caching. “One of the keys to happiness is a bad memory.” –Rita Mae Brown

Consider the following piece of code:

    int x = 0, y = 0;  // The compiler puts x in r1 and y in r2.
    int i;             // The compiler puts i in r3.
    int A[4096];       // A is in memory at address 0x10000
    ...
    for (i=0;i<1024;i++) {
        x += A[i];
    }
    for (i=0;i<1024;i++) {
        y += A[i+2048];
    }

(a) Assume that the system has an 8192-byte, direct-mapped data cache with 16-byte blocks. Assuming that the cache starts out empty, what is the series of data cache hits and misses for this snippet of code? Assume that ints are 32 bits.

(b) Assume that an iteration of a loop in which the load hits takes 10 cycles but that an iteration of a loop in which the load misses takes 100 cycles. What is the execution time of this snippet with the aforementioned cache?

(c) Repeat part (a), except assume that the cache is 2-way set associative with an LRU replacement policy and 16-byte sets (8-byte blocks).

(d) Repeat part (b) using the cache described in part (c). Is the direct-mapped or the set-associative cache better?

Date:

Quiz for Chapter 6 Storage and Other I/O Topics 3.10

Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully.

Name:
Course:

1. [6 points] Give a concise answer to each of the following questions. Limit your answers to 20-30 words.
(a) What is memory mapped I/O?
(b) Why is DMA an improvement over CPU programmed I/O?
(c) When would DMA transfer be a poor choice?

2. [6 points] Give two advantages and two disadvantages of using a single bus as a shared communication link between memory, processor and I/O devices.

3. [6 points] Disk Technology. Suppose we have a magnetic disk (resembling an IBM Microdrive) with the following parameters:

Average seek time       12 ms
Rotation rate           3600 RPM
Transfer rate           3.5 MB/second
Sectors per track       64
Sector size             512 bytes
Controller overhead     5.5 ms

Answer the following questions. (Note: you may leave any answer as a fraction.)
(a) What is the average time to read a single sector?
(b) What is the average time to read 8 KB in 16 consecutive sectors in the same cylinder?
(c) Now suppose we have an array of 4 of these disks. They are all synchronized such that the arms on all the disks are always on the same sector within the track. The data is striped across the 4 disks so that 4 logically consecutive sectors can be read in parallel. What is the average time to read 32 consecutive KB from the disk array?

4. [6 points] Answer the following questions: (a) What is the average time to read or write a 512-byte sector for a typical disk rotating at 7200 RPM? The advertised average seek time is 8ms, the transfer rate is 20MB/sec, and the controller overhead is 2ms. Assume that the disk is idle so that there is no waiting time. (b) A program repeatedly performs a three-step process: It reads in a 4-KB block of data from disk, does some processing on that data, and then writes out the result as another 4-KB block elsewhere on the disk. Each block is contiguous and randomly located on a single track on the disk. The disk drive rotates at 7200RPM, has an average seek time of 8ms, and has a transfer rate of 20MB/sec. The controller overhead is 2ms. No other program is using the disk or processor, and there is no overlapping of disk operation with processing. The processing step takes 20 million clock cycles, and the clock rate is 400MHz. What is the overall speed of the system in blocks processed per second assuming no other overhead?
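
Worked sketch (not part of the original quiz): the usual disk access model adds seek time, average rotational latency (half a rotation), transfer time and controller overhead. The helper `disk_access_ms` is a hypothetical name, and the sketch assumes 1 MB = 10^6 bytes and 4 KB = 4096 bytes; other unit conventions shift the results slightly.

```python
def disk_access_ms(seek_ms, rpm, transfer_mb_per_s, controller_ms, nbytes):
    """Average access time in ms: seek + half rotation + transfer + controller."""
    rotation_ms = 60_000 / rpm                 # one full rotation
    rotational_latency_ms = rotation_ms / 2    # average wait is half a rotation
    transfer_ms = nbytes / (transfer_mb_per_s * 1e6) * 1e3  # assumes MB = 10**6 bytes
    return seek_ms + rotational_latency_ms + transfer_ms + controller_ms

# Part (a): one 512-byte sector.
sector_ms = disk_access_ms(8, 7200, 20, 2, 512)

# Part (b): read 4 KB, process, write 4 KB, with no overlap.
io_ms = disk_access_ms(8, 7200, 20, 2, 4096)   # assumes 4 KB = 4096 bytes
processing_ms = 20e6 / 400e6 * 1e3             # 20 million cycles at 400 MHz
per_block_ms = 2 * io_ms + processing_ms
blocks_per_s = 1000 / per_block_ms

print(f"sector: {sector_ms:.3f} ms, per block: {per_block_ms:.3f} ms, "
      f"rate: {blocks_per_s:.2f} blocks/s")
```

Seek and rotational latency dominate both results; the transfer term is well under a millisecond in each case.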

5. [6 points] What is the bottleneck in the following system setup: the CPU, the memory bus, or the disk set?

• The user program continuously performs reads of 64KB blocks, and requires 2 million cycles to process each block.
• The operating system requires 1 million cycles of overhead for each I/O operation.
• The clock rate is 3GHz.
• The maximum sustained transfer rate of the memory bus is 640MB/sec.
• The read/write bandwidth of the disk controller and the disk drives is 64MB/sec; disk average seek plus rotational latency is 9ms.
• There are 20 disks attached to the bus, each with its own controller. (Assume that each disk can be controlled independently and ignore disk conflicts.)
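
Worked sketch (not part of the original quiz): a common way to approach question 5 is to convert each component's limit into 64KB I/O operations per second and take the smallest. The names below are illustrative, and the sketch assumes decimal units (1 KB = 10^3 bytes, 1 MB = 10^6 bytes).

```python
# Parameters from question 5 (decimal units assumed).
clock_hz = 3e9
cycles_per_io = 2e6 + 1e6      # user processing plus OS overhead
block_bytes = 64e3
bus_bytes_per_s = 640e6
disk_bytes_per_s = 64e6
disk_latency_s = 9e-3          # average seek plus rotational latency
num_disks = 20

# Maximum I/O operations per second each component can sustain.
rates = {
    "cpu": clock_hz / cycles_per_io,
    "memory bus": bus_bytes_per_s / block_bytes,
    "disks": num_disks / (disk_latency_s + block_bytes / disk_bytes_per_s),
}
bottleneck = min(rates, key=rates.get)

for name, iops in rates.items():
    print(f"{name}: {iops:.0f} I/Os per second")
print("bottleneck:", bottleneck)
```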

6. [6 points] Discuss why RAID 3 is not suited for transaction processing applications. What kind of applications is it suitable for and why?

7. [7 points] Suppose we have two different I/O systems, A and B. A has a data transfer rate of 5 KB/s and an access delay of 5 sec, while B has a data transfer rate of 3 KB/s and an access delay of 4 sec. Given a 3 MB I/O request and taking performance into consideration, which I/O system would you use? What about for a 3 KB request?
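
Worked sketch (not part of the original quiz): the time for one request can be modeled as access delay plus size divided by transfer rate. The helper `request_time_s` is a hypothetical name, and the sketch reads the large request as 3 MB = 3000 KB, which is an assumption about the units.

```python
def request_time_s(size_kb, rate_kb_per_s, delay_s):
    """Total time for one request: fixed access delay plus transfer time."""
    return delay_s + size_kb / rate_kb_per_s

# (transfer rate in KB/s, access delay in s) from question 7.
systems = {"A": (5.0, 5.0), "B": (3.0, 4.0)}

large = {n: request_time_s(3000.0, r, d) for n, (r, d) in systems.items()}  # 3 MB assumed = 3000 KB
small = {n: request_time_s(3.0, r, d) for n, (r, d) in systems.items()}     # 3 KB

# Request size at which both systems take the same time:
# 5 + s/5 = 4 + s/3  =>  s = (5 - 4) / (1/3 - 1/5)
break_even_kb = (5.0 - 4.0) / (1 / 3.0 - 1 / 5.0)

print(large, small, break_even_kb)
```

Below the break-even size the lower access delay wins; above it the higher transfer rate wins.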

8. [7 points] If a system contains 1,000 disk drives, and each of them has an 800,000-hour MTBF, how often will a drive failure occur in that disk system? Suggest a way to improve this, and explain why your idea will work.
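
Worked sketch (not part of the original quiz): if drive failures are assumed independent with a constant failure rate, the failure rates add, so the expected time between failures in the whole system is the per-drive MTBF divided by the number of drives. The variable names are illustrative.

```python
mtbf_hours_per_drive = 800_000
num_drives = 1_000

# Failure rates add across independent drives, so the system MTBF shrinks
# in proportion to the drive count.
hours_between_failures = mtbf_hours_per_drive / num_drives
days_between_failures = hours_between_failures / 24

print(f"{hours_between_failures:.0f} hours (about {days_between_failures:.1f} days)")
```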

9. [6 points] What is the average time to read a 512 byte sector for Seagate ST31000340NS in Figure 6.5? What is the minimum time? Assume that the controller overhead is 0.2 ms, and the disk is idle so that there is no waiting time.

10. [4 points] How many times can you store a 4MB song on your 1GB NOR flash memory in Figure 6.7 before the first wear-out, if wear leveling works ideally?

11. [6 points] In Figure 6.8 which fields are correlated with each other? Why do these correlations exist?

12. [6 points] In Figure 6.9, PCI-E connections are available from both the north bridge and the south bridge. What are the advantages and disadvantages to attaching devices to the PCI-E connections on the north and south bridges?

13. [6 points] Section 6.7 focuses on transaction processing as an example of a disk I/O intensive application. Give another example of a disk I/O intensive application. Compare and contrast the performance requirements of the two, and consider how different disk implementations (magnetic media, flash memory, or MEMS device) can be more or less appropriate for different applications.

14. [6 points] Imagine that you are proposing a new disk I/O benchmark for transaction processing. What sort of experiments would you perform to show that your benchmark's results are meaningful? Now imagine that you are reviewing a paper introducing a new disk I/O benchmark for transaction processing. What sort of subtle flaws would you search for?

15. [6 points] Which of the following would be an acceptable transport medium for real-time transmission of human voice data? Which would be “overkill”?

• 56.5Kbps modem
• 100 Base-T Ethernet connection
• 802.11b wireless connection

16. [10 points] A given computer system includes a hard disk with direct memory access (DMA). (a) Suppose a user application needs to change a single byte within a disk block. Sketch, in order, all communications that must take place between the processor and the hard drive to complete this operation. (b) Assume the total time required to perform a read of n blocks from a hard disk is T(n) = 250ms + 100ms * n. Further, assume that for any read or write to a hard disk block a, there is a probability p = 0.75 that hard disk block a+1 will be read soon afterwards. Given that an application has requested a read of a single disk block, the OS can expect the application to read subsequent blocks later. If the OS will pursue a strategy of reading n blocks at a time, analyze how the OS can choose this n in order to minimize the expected read time.

Date:

Quiz for Chapter 7 Multicores, Multiprocessors, and Clusters 3.10

Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully.

Name:
Course:

1. [5 points] Applying the send/receive programming model as outlined in the example in Section 7.4, describe a parallel message passing program that uses 10 processor-memory nodes to multiply a 50x50 matrix by a scalar value, c.

2. [10 points] Consider a GPU that consists of 8 multiprocessors clocked at 1.5 GHz, each of which contains 8 multithreaded single-precision floating-point units and integer processing units. It has a memory system that consists of 8 partitions of 1 GHz Graphics DDR3 DRAM, each 8 bytes wide and with 256 MB of capacity. Making reasonable assumptions (state them) and using a naive matrix multiplication algorithm, compute how much time the computation C = A * B would take. A, B, and C are n * n matrices and n is determined by the amount of memory the system has.

3. [5 points] Besides network bandwidth and bisection bandwidth, two other properties sometimes used to describe network topologies are the diameter and the nodal degree. The diameter of a network is defined as the longest minimal path possible, examining all pairs of nodes. The nodal degree is the number of links connecting to each node. If the bandwidth of each link in a network is B, find the diameter, nodal degree, network bandwidth, and bisection bandwidth for the 2D grid and n-cube tree shown in Figure 7.9.

4. [5 points] Vector architectures exploit data-level parallelism to achieve significant speedup. Programmers can usually make the problem or data set bigger. For instance, programmers ten years ago might want to model a map with a 1000 x 1000 single-precision floating-point array, but may now want to do this with a 5000 x 5000 double-precision floating-point array. Obviously, there is abundant data-level parallelism to exploit. Give some reasons why computer architects do not build a super-big vector machine (in terms of the number and the length of vector registers) to take advantage of this opportunity.

5. [5 points] Why should vector load instructions support strided access? Give an example of when this feature is especially useful.

6. [10 points] A two-part question. (a) What are the advantages and disadvantages of fine-grained multithreading, coarse-grained multithreading, and simultaneous multithreading?

(b) Given the following instruction sequences of three threads, how many clock cycles will fine-grained multithreading and coarse-grained multithreading use, respectively? Annotations are the same as in Figure 7.5.

Thread 1: [] [] [] [] [] [] [] [] [] [] [] [] [] (stall) (stall) [] [] (stall) [] [] []
Thread 2: [] [] [] [] [] (stall) [] [] [] [] [] [] [] [] (stall) []
Thread 3: [] [] [] [] [] [] (stall) [] [] [] [] [] [] [] [] [] [] []

7. [10 points] Consider a multi-core processor with 64 cores where the first shared cache is at level L3 (L1 and L2 are private to each core). Suppose an application needs to compute the sum of all the nodes in a perfectly balanced binary tree T (a tree where every node has either two or zero children and all the leaves are at the same depth/level). Assuming that an add operation takes 10 units of time, the total sum can be computed sequentially in 10*n units of time, ignoring the time to load and traverse links in the tree (assume they are factored into the add). Here n is the number of nodes in T. One way to compute the sum in parallel is to have two arrays of length 64: (a) an input array holding the roots of 64 leaf subtrees, and (b) an output array that holds the partial sums computed for the 64 leaf subtrees, one per core. Both arrays are indexed by the id (0 to 63) of the core that is responsible for that entry. Furthermore, assume the following:

• The input arrays are already filled in with the roots of the leaf subtrees and are in the L1 caches of each core.

• Core 0 is responsible for handling the internal nodes that are not a part of any of the 64 subtrees. It is also responsible for computing the final sum once the partial sums are filled in.

• Every time a partial sum is filled in the output array by a core, another core can fill in its partial sum only after a minimum delay of 50 units (due to cache invalidations).

In order to achieve a speedup of greater than 2 (over sequential code), what is the minimum number of nodes that should be present in the tree?

8. [10 points] Consider a multi-core processor with heterogeneous cores A, B, C and D, where core B runs twice as fast as A, core C runs three times as fast as A, and cores D and A run at the same speed (i.e., have the same processor frequency, microarchitecture, etc.). Suppose an application needs to compute the square of each element in an array of 256 elements. Consider the following two divisions of labor:

(a)

Core A    32 elements
Core B    128 elements
Core C    64 elements
Core D    32 elements

(b)

Core A    48 elements
Core B    128 elements
Core C    80 elements
Core D    Unused

Compute (1) the total execution time taken in the two cases and (2) cumulative processor utilization (Amount of total time the processors are not idle divided by the total execution time). For case (b), if you do not consider Core D in cumulative processor utilization (assuming we have another application to run on Core D), how would it change? Ignore cache effects by assuming that a perfect prefetcher is in operation.

9. [10 points] Consider two multiprocessor machines with the following configurations: (a) Machine 1, a NUMA machine with two processors, each with local memory of 512 MB, a local memory access latency of 20 cycles per word and a remote memory access latency of 60 cycles per word. (b) Machine 2, a UMA machine with two processors and a shared memory of 1GB with an access latency of 40 cycles per word. Suppose an application has two threads running on the two processors, each of which needs to access an entire array of 4096 words. Is it possible to partition this array across the local memories of the NUMA machine so that the application runs faster on it than on the UMA machine? If so, specify the partitioning. If not, by how many more cycles should the UMA memory latency be worsened for a partitioning on the NUMA machine to enable a faster run than on the UMA machine? Assume that the memory operations dominate the execution time.

10. [10 points] Consider the following code that adds two matrices A and B and stores the result in a matrix C:

    for (i= 0 to 15) {
      for (j= 0 to 63) {
        C[i][j] = A[i][j] + B[i][j];
      }
    }

If we had a quad-core multiprocessor, where the elements of the matrices A, B, C are stored in row major order, which one of the following two parallelizations is better and why? What about when they are stored in column major order?

(a) For each Pk in {0, 1, 2, 3}:

    for (i= 0 to 15) {
      for (j= Pk*15 + Pk to (Pk+1)*15 + Pk) { // Inner Loop Parallelization
        C[i][j] = A[i][j] + B[i][j];
      }
    }

(b) For each Pk in {0, 1, 2, 3}:

    for (i= Pk*3 + Pk to (Pk+1)*3 + Pk) { // Outer Loop Parallelization
      for (j= 0 to 63) {
        C[i][j] = A[i][j] + B[i][j];
      }
    }
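
Worked sketch (not part of the original quiz): the loop bounds in question 10 can be checked for coverage, and the memory footprint of one core can be summarized as the number of contiguous address runs it touches in each layout. The helper names are hypothetical, and the sketch assumes the pseudo-code "to" is inclusive, as in the sequential loop over 0 to 15 and 0 to 63.

```python
ROWS, COLS, CORES = 16, 64, 4

def inner_cols(pk):
    """Columns handled by core pk in (a); upper bound is inclusive."""
    return range(pk * 15 + pk, (pk + 1) * 15 + pk + 1)

def outer_rows(pk):
    """Rows handled by core pk in (b); upper bound is inclusive."""
    return range(pk * 3 + pk, (pk + 1) * 3 + pk + 1)

def count_runs(offsets):
    """Number of maximal runs of consecutive element offsets."""
    ordered = sorted(offsets)
    return 1 + sum(1 for a, b in zip(ordered, ordered[1:]) if b != a + 1)

def runs_for_core0(split, row_major):
    """Contiguous runs of one matrix touched by core 0 under a given layout."""
    if row_major:
        offset = lambda i, j: i * COLS + j
    else:
        offset = lambda i, j: j * ROWS + i
    if split == "inner":
        cells = [(i, j) for i in range(ROWS) for j in inner_cols(0)]
    else:
        cells = [(i, j) for i in outer_rows(0) for j in range(COLS)]
    return count_runs(offset(i, j) for i, j in cells)

# Every column and every row is covered exactly once across the four cores.
all_cols = sorted(j for pk in range(CORES) for j in inner_cols(pk))
all_rows = sorted(i for pk in range(CORES) for i in outer_rows(pk))

for split in ("inner", "outer"):
    for row_major in (True, False):
        layout = "row-major" if row_major else "column-major"
        print(split, layout, runs_for_core0(split, row_major))
```

Fewer, longer runs per core generally mean better spatial locality and less chance of two cores sharing a cache block, which is one angle from which to compare the two parallelizations.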

11. [10 points] How would you rewrite the following sequential code so that it can be run as two parallel threads on a dual-core processor? Try to balance the loads as much as possible between the two threads:

    int A[80], B[80], C[80], D[80];
    for (i = 0 to 40) {
      A[i] = B[i] * D[2*i];
      C[i] = C[i] + B[2*i];
      D[i] = 2*B[2*i];
      A[i+40] = C[2*i] + B[i];
    }

12. [10 points] Suppose we have a dual-core chip multiprocessor with a two-level cache hierarchy: both cores have their own private first-level cache (L1), while they share their second-level cache (L2). The first-level cache on both cores is 2-way set associative with a cache line size of 2K bytes and an access latency of 30ns per word, while the shared cache is direct mapped with a cache line size of 4K bytes and an access latency of 80ns per word. Consider a process with two threads running on these cores as follows (assume the size of an integer to be 4 bytes, which is the same as the word size):

Thread 1:

    int A[1024];
    for (i=0; i < 1024; i++) { A[i] = A[i] + 1; }

Thread 2:

    int B[1024];
    for (i=0; i< 1024; i++) { B[i] = B[i] + 1; }

Initially assume that both the arrays A and B are in main memory, whose access latency is 200ns per word. Assume that an int is word sized. Furthermore, assume that A and B, when mapped to L2, start at address 0 of a cache line. Assume a write-back policy for both L1 and L2 caches.
(a) If the main memory blocks holding arrays A and B map to different L2 cache lines, how much time would it take the process to complete its execution in the worst case? (Assume this is the only process running on the machine.)
(b) If the main memory blocks holding arrays A and B map to the same L2 cache line, how much time would it take the process to complete its execution in the worst case? (Assume this is the only process running on the machine.) In the worst case, thread 1 could access A[0], thread 2 could access B[0], then thread 1 could access A[1], followed by a B[1] access by thread 2, and so on. Every time A[i] or B[i] is accessed, it evicts the other array from the L2 cache, so a subsequent access to the other array again causes a main memory access.
