CSE530hw1 Ans
CMPEN 530
Speedup over percent of vectorization (vector-mode speedup = 20):

Vectorization   Speedup
0%              1
10%             1.10497
20%             1.23457
30%             1.3986
40%             1.6129
50%             1.90476
60%             2.32558
70%             2.98507
80%             4.16667
90%             6.89655
100%            20
CSE 530, Homework #1, Due September 26
Anthony Dotterer
For a linearly connected array of processors (i.e., an ILLIAC IV type architecture), calculate the hardware utilization when performing a maximum search operation. Assume n data elements and an organization composed of n processing elements, where n is a power of 2, and assume a one-dimensional communication pattern among the processors as defined by the architecture.
Answer:
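The answer is blank in the transcript. As a hedged sketch of one common model (a pairwise tree-style max reduction mapped onto the array, ignoring the extra routing cycles a strictly linear interconnect would add in later steps), the utilization can be computed as follows:

```python
# Sketch (assumption): in reduction step k (k = 1 .. log2(n)),
# n / 2**k processors are active, each combining two partial maxima.
# Useful work is therefore n - 1 PE-steps out of n * log2(n) available.
import math

def utilization(n):
    """Hardware utilization of an n-PE array doing a pairwise max reduction."""
    steps = int(math.log2(n))
    active = sum(n // 2**k for k in range(1, steps + 1))  # = n - 1
    return active / (n * steps)

print(utilization(8))     # 7 active PE-steps out of 24
print(utilization(1024))
```

Utilization falls as n grows, since ever fewer processors stay active in the later reduction steps.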
Discuss the advantage(s) and disadvantage(s) of the von Neumann concept.
Answer:
The von Neumann concept is a computer design model that uses a single storage model to hold both instructions and data.
Advantages:
- Reprogramming was made easier
- Programs are allowed to modify themselves
- Programs can write programs
- General flexibility

Disadvantages:
- Malfunctioning programs can damage other programs or the operating system
- von Neumann bottleneck: the CPU must wait for data to transfer to and from memory
Problems 1.1 and 1.2 page 60 (Hennessy and Patterson, 2nd edition).
Answer:
1.1) a) Speedup = 1 / ((1 - f) + f/20), where f is the fraction of vectorization; the resulting values are tabulated at the top of this document.

b) For a speedup of 2: 2 = 1 / ((1 - f) + f/20), so 1 - 0.95f = 0.5 and f = 0.526. About 52.6% vectorization is needed.

c) The maximum speedup is 20, so half the maximum is 10. If 10 = 1 / ((1 - f) + f/20), then 1 - 0.95f = 0.1 and f = 0.947, i.e., about 94.7% vectorization.

d) For the hardware improvement, the vector-mode speedup doubles from 20 to 40 and f = 0.7, therefore Speedup_hw = 1 / (0.3 + 0.7/40) = 3.15. The compiler group must match this: 1 / ((1 - f) + f/20) = 3.15 gives f = 0.71842. Therefore the compiler group must increase the percentage of vectorization by only 0.01842. Based on this, the compiler group should be given the investment.
1.2) a) If the enhancement speedup is 10 and the enhancement is used for a fraction f = 0.5 of the execution time, then Speedup = 1 / ((1 - 0.5) + 0.5/10) = 1.82.

b) If the enhancement speedup is 10 and the measured overall speedup is 2, then 2 = 1 / ((1 - f) + f/10), so 1 - 0.9f = 0.5 and f = 0.556. So about 55% of the original execution time has been converted to fast mode.
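The Amdahl's-law arithmetic in problems 1.1(d) and 1.2(b) can be checked with a short sketch:

```python
# Sketch: Amdahl's-law helpers for problems 1.1(d) and 1.2(b).
def speedup(f, s):
    """Overall speedup when a fraction f of execution is sped up by s."""
    return 1.0 / ((1.0 - f) + f / s)

def fraction_for(target, s):
    """Fraction f needed so that speedup(f, s) equals target."""
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / s)

hw = speedup(0.7, 40)            # hardware doubles vector speed: ~3.15
f_needed = fraction_for(hw, 20)  # vectorization the compiler must reach
print(hw, f_needed - 0.7)        # increase of about 0.01842

print(fraction_for(2, 10))       # 1.2(b): about 0.556 of original time
```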
A 40-MHz processor was used to execute a benchmark program with the following instruction mix and clock cycle counts:
Instruction type      Instruction count   Clock cycle count
Integer arithmetic    45000               1
Data transfer         32000               2
Floating point        15000               2
Control transfer      8000                2
Determine the effective CPI, MIPS rate, and execution time for this program.
Answer:
If Ic = 45000 + 32000 + 15000 + 8000 = 100000 and the total cycle count is 45000(1) + 32000(2) + 15000(2) + 8000(2) = 155000, then CPI = 155000 / 100000 = 1.55. If f = 40 MHz, then MIPS = f / (CPI x 10^6) = 40 x 10^6 / (1.55 x 10^6) = 25.8. If E is the execution time, then E = Ic x CPI / f = 100000 x 1.55 / (40 x 10^6) = 3.875 ms.
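The same CPI, MIPS, and execution-time computation as a sketch:

```python
# Sketch: effective CPI, MIPS rate, and execution time for the mix above.
mix = [  # (instruction count, clock cycles per instruction)
    (45000, 1),  # integer arithmetic
    (32000, 2),  # data transfer
    (15000, 2),  # floating point
    (8000, 2),   # control transfer
]
freq = 40e6  # 40-MHz clock

ic = sum(count for count, _ in mix)              # 100000 instructions
cycles = sum(count * cpi for count, cpi in mix)  # 155000 cycles
cpi = cycles / ic                                # 1.55
mips = freq / (cpi * 1e6)                        # ~25.8 MIPS
t = ic * cpi / freq                              # ~3.875e-3 s
print(cpi, mips, t)
```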
A workstation uses a 15-MHz processor with a claimed 10-MIPS rating to execute a given program mix. Assume a one-cycle delay for each memory access:
a)What is the effective CPI of this computer?
Answer: If f = 15 MHz and the rating is 10 MIPS, then, since MIPS = f / (CPI x 10^6), the effective CPI = f / (MIPS x 10^6) = 15 x 10^6 / (10 x 10^6) = 1.5.
b)Suppose the processor is being upgraded with a 30-MHz clock. However, the speed of the memory subsystem remains unchanged, and consequently two clock cycles are needed per memory access. If 30% of the instructions require one memory access and another 5% require two memory accesses per instruction, what is the performance of the upgraded processor with a compatible instruction set and equal instruction counts in the given program mix?
Answer:
If the memory subsystem is unchanged, each memory access costs one extra cycle at the 30-MHz clock. The CPI therefore grows by 0.30 x 1 + 0.05 x 2 = 0.4, giving CPI = 1.5 + 0.4 = 1.9. The performance is then MIPS = 30 x 10^6 / (1.9 x 10^6) = 15.8 MIPS, about a 58% improvement over the original 10-MIPS rating.
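A sketch of the upgrade computation, assuming the extra memory-access cycles simply add to the base CPI:

```python
# Sketch: effective CPI and MIPS before and after the clock upgrade.
# Assumption: memory-access stall cycles add directly to the base CPI.
old_cpi = 15e6 / (10 * 1e6)        # 1.5, from 15 MHz at 10 MIPS
extra = 0.30 * 1 + 0.05 * 2        # added cycles per instruction at 30 MHz
new_cpi = old_cpi + extra          # 1.9
new_mips = 30e6 / (new_cpi * 1e6)  # ~15.8 MIPS
print(new_cpi, new_mips)
```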
Problem 1.14 page 66 (Hennessy and Patterson, 2nd edition).
Answer:
1.14) a) MFLOPS = (number of floating-point operations) / (execution time x 10^6). With each program executing 100 million floating-point operations, the following table shows the MFLOPS for each program and computer in Figure 1.11:
Program   Computer A   Computer B   Computer C
P1        100          10           5
P2        0.1          1            5
b) The arithmetic mean is (100 + 0.1)/2 = 50.05 for A, (10 + 1)/2 = 5.5 for B, and (5 + 5)/2 = 5 for C.
The geometric mean is sqrt(100 x 0.1) = 3.16 for A, sqrt(10 x 1) = 3.16 for B, and sqrt(5 x 5) = 5 for C.
The harmonic mean is 2/(1/100 + 1/0.1) = 0.2 for A, 2/(1/10 + 1/1) = 1.82 for B, and 2/(1/5 + 1/5) = 5 for C.
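The three means of the MFLOPS ratings can be computed in one pass:

```python
# Sketch: arithmetic, geometric, and harmonic means of the MFLOPS table.
from math import prod

ratings = {"A": [100, 0.1], "B": [10, 1], "C": [5, 5]}

for name, r in ratings.items():
    n = len(r)
    arith = sum(r) / n
    geo = prod(r) ** (1 / n)
    harm = n / sum(1 / x for x in r)
    print(name, arith, geo, harm)
```

Note how strongly the choice of mean changes the ranking: computer A wins under the arithmetic mean but loses badly under the harmonic mean.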
c) Not sure.
Define term delayed branch, its application, and its shortcomings (if any).
Answer:
Delayed branch is a technique for reducing the effects of control dependencies by delaying the point at which a branch operation affects the program counter. This allows one or more instructions following the branch operation (the branch delay slots) to execute whether or not the branch is taken.
Advantage: Allows pipelined CPUs to reduce the clock cycles wasted on pipeline flushing during a branch or jump operation.
Disadvantage: If the compiler cannot find instructions to place after the branch due to dependencies, it must insert no-op instructions, which increases the size of the program.
Three enhancements with the following speedups are proposed: Speedup1 = 30, Speedup2 = 20, Speedup3 = 10. Only one enhancement is usable at a time. If enhancements 1 and 2 are each usable for 30% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?
Answer:
If S1 = 30, S2 = 20, S3 = 10, F1 = F2 = 0.3, and the overall speedup is 10, then 10 = 1 / ((1 - F1 - F2 - F3) + F1/30 + F2/20 + F3/10), so 0.1 = (0.4 - F3) + 0.01 + 0.015 + 0.1 F3, giving 0.9 F3 = 0.325 and F3 = 0.361. Enhancement 3 must be used about 36.1% of the time.
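Solving for F3 with Amdahl's law for three mutually exclusive enhancements:

```python
# Sketch: find the usage fraction f3 of enhancement 3 that yields an
# overall speedup of 10, given f1 = f2 = 0.3.
s1, s2, s3 = 30, 20, 10
f1 = f2 = 0.3
target = 10

# 1/target = (1 - f1 - f2 - f3) + f1/s1 + f2/s2 + f3/s3, solved for f3:
f3 = ((1 - f1 - f2) + f1 / s1 + f2 / s2 - 1 / target) / (1 - 1 / s3)
print(f3)  # ~0.361

overall = 1 / ((1 - f1 - f2 - f3) + f1 / s1 + f2 / s2 + f3 / s3)
print(overall)  # 10.0
```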
Assume the distribution of enhancement usage is 30%, 30%, and 20% for enhancements 1, 2, and 3, respectively. Assuming all three enhancements are in use, for what fraction of the reduced execution time is no enhancement in use?
Answer:
If F1 = F2 = 0.3 and F3 = 0.2, then the reduced execution time is Enew = E x (0.2 + 0.3/30 + 0.3/20 + 0.2/10) = 0.245 E. The term 0.2 E represents the usage of no enhancements. Therefore Fne = Ene / Enew = 0.2 / 0.245 = 0.816, where Ene is the execution time with no enhancements and Fne is the fraction of the reduced execution time where no enhancements are in use.
Assume for some benchmark, the fraction of use is 15% for each of the enhancements 1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one enhancement can be implemented, which should be chosen?
Answer:
If only enhancement 1 is implemented, then Speedup1 = 1 / (0.85 + 0.15/30) = 1.170. If only enhancement 2 is implemented, then Speedup2 = 1 / (0.85 + 0.15/20) = 1.166. If only enhancement 3 is implemented, then Speedup3 = 1 / (0.3 + 0.7/10) = 2.70. Therefore enhancement 3 should be implemented.
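Comparing the three single-enhancement speedups directly:

```python
# Sketch: overall speedup from implementing each enhancement alone,
# with usage fractions 0.15, 0.15, and 0.70.
def overall(f, s):
    """Amdahl speedup when one enhancement of speedup s is used fraction f."""
    return 1 / ((1 - f) + f / s)

candidates = {1: (0.15, 30), 2: (0.15, 20), 3: (0.70, 10)}
for k, (f, s) in candidates.items():
    print(k, overall(f, s))

best = max(candidates, key=lambda k: overall(*candidates[k]))
print(best)  # enhancement 3
```

A modest speedup applied to 70% of the time beats a large speedup applied to only 15% of it.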
True or false: if 10% of the operations in a program must be performed sequentially, then the maximum speedup gained is 10, no matter how much parallelism is available (prove your answer).
Answer:
If f = 0.1 is the fraction of operations performed sequentially and p is the speedup gained from parallelism, then Speedup = 1 / (f + (1 - f)/p). As p goes to infinity with unlimited processors, Speedup approaches 1/f = 1/0.1 = 10. Therefore it is true that, no matter how much parallelism is available, the maximum speedup gained is 10.
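The limit can be seen numerically:

```python
# Sketch: with a 10% sequential fraction, the overall speedup approaches
# 1/f = 10 as the parallel speedup p grows without bound.
f = 0.1
for p in (10, 100, 1000, 1_000_000):
    print(p, 1 / (f + (1 - f) / p))
```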
True or false; in general linear speed up is needed to make parallel systems (multiprocessors) cost effective (justify your answer).
Answer:
False. A parallel system is cost effective whenever its speedup exceeds its costup, the ratio of the multiprocessor's cost to a uniprocessor's cost. Because memory, interconnect, and I/O are shared, the cost of a multiprocessor usually grows more slowly than linearly with the number of processors, so even sublinear speedup can be cost effective.
CPU time (T) is defined as: T = Ic x CPI x t, where Ic stands for the instruction count, CPI stands for the average clock cycles per instruction, and t stands for the clock cycle time. A RISC computer, ideally, should be able to execute one instruction per clock cycle. Within the scope of a RISC architecture, name and discuss (briefly) distinct issues that do not allow ideal performance.
Answer:
Issues:
Memory access: Any access to memory can take longer than one cycle, stalling the pipeline.
Branching: Program branches flush instructions from the pipeline, causing those branches to take longer than one cycle.
Loop fusion allows two or more loops that are executed the same number of times and that use the same indices to be combined into one loop:
Within the scope of a RISC processor, why does it (Loop fusion) improve performance (detail explanation)?
Answer:
In the scope of a RISC processor, Loop fusion can improve performance by decreasing the need for extraneous loop control instructions. In the absence of extraneous loop control instructions, the processor can run a program faster.
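A minimal illustration of loop fusion (the arrays and loop bounds are hypothetical): two loops over the same index range are combined into one, halving the loop-control overhead (index increments, bound tests, branches).

```python
# Hypothetical example of loop fusion over the same index range.
n = 8
a = [0] * n
b = [0] * n

# Before fusion: two loops, two sets of loop-control instructions.
for i in range(n):
    a[i] = i * 2
for i in range(n):
    b[i] = a[i] + 1

# After fusion: one loop does the same work with half the loop overhead.
a2 = [0] * n
b2 = [0] * n
for i in range(n):
    a2[i] = i * 2
    b2[i] = a2[i] + 1

print(a == a2 and b == b2)  # True
```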
Within the scope of a vector processor, why does it (Loop fusion) improve performance (beyond what has been discussed in Part a) (detail explanation)?
Answer:
In the scope of a vector processor, loop fusion improves performance by keeping intermediate results in vector registers across the fused loop bodies, avoiding the stores and reloads that would otherwise occur between the separate loops, and by reducing the number of vector startup (pipeline fill) overheads, since fewer separate vector operations are issued over the same index range.
Within the scope of a superscalar processor, why does it (Loop fusion) improve performance (beyond what has been discussed in Part a) (detail explanation)?
Answer: A superscalar processor can issue instructions from the fused loop body in parallel; fusing the loops brings independent operations from what were separate loops into the same iteration, exposing more instruction-level parallelism to the multiple issue slots.
Interleaved memory
Define interleaved memory (be as clear as possible);
Answer:
Interleaved memory is an organization that divides memory into a number of independent banks and distributes consecutive addresses across them, so that accesses to consecutive locations can overlap in time.
Within the scope of interleaved memory, define mapping of the logical addresses to the physical addresses. Distinguish them from each other.
Answer: In low-order interleaved memory, the memory is divided into N banks, and a logical address i maps to a physical location in bank i mod N, at offset i/N within that bank (ignoring the remainder). The logical address is the single linear address the program uses; the physical address is the (bank number, offset) pair that selects a module and a word within it.
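The mapping can be sketched directly (N = 4 banks is an assumption for illustration):

```python
# Sketch: low-order interleaved address mapping with N banks.
N = 4  # number of banks (assumption for illustration)

def map_address(i, n_banks=N):
    """Map linear address i to (bank, offset) under low-order interleaving."""
    return i % n_banks, i // n_banks

for i in range(8):
    print(i, map_address(i))
# Consecutive addresses 0,1,2,3 land in banks 0,1,2,3, so a sequential
# burst can keep all four banks busy at once.
```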
What is the main difference between an interleaved memory and a parallel memory?
Answer:
In an interleaved memory, the 2 to N banks share a common bus, so accesses to contiguous locations are overlapped (staggered) in time; in a parallel (wide) memory, the modules are accessed simultaneously over a wider path, returning several words in a single access.
Consider a memory hierarchy using one of the three organizations for main memory as shown below. Assume that the cache block size is 16 words, the width of organization b is four words, and the number of memory modules in organization c is four. If the main memory latency for a new access is 10 cycles and the transfer time is 1 cycle, what is the cache miss penalty for each of these organizations?
Answer:
A) One-word-wide: each of the 16 words needs a 10-cycle access plus a 1-cycle transfer, so the miss penalty is 16 x (10 + 1) = 176 cycles.
B) Four-word-wide: the 16-word block is fetched in four wide accesses, so the miss penalty is 4 x (10 + 1) = 44 cycles.
C) Four-way interleaved: the memory access proceeds as follows:

Cycle  Bank 1          Bank 2          Bank 3          Bank 4
1      Access word
2      Cycle 2         Access word
3      Cycle 3         Cycle 2         Access word
4      Cycle 4         Cycle 3         Cycle 2         Access word
...
10     Cycle 10        Cycle 9         Cycle 8         Cycle 7
11     Transfer word   Cycle 10        Cycle 9         Cycle 8
12     Access word     Transfer word   Cycle 10        Cycle 9
13     Cycle 2         Access word     Transfer word   Cycle 10
14     Cycle 3         Cycle 2         Access word     Transfer word
15     Cycle 4         Cycle 3         Cycle 2         Access word

So the first 4 words arrive in 14 cycles and a further 4 words arrive every 11 cycles thereafter, giving a miss penalty of 14 + 3 x 11 = 47 cycles.
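The three penalties, under the same model (10-cycle access latency, 1-cycle transfer, 16-word blocks):

```python
# Sketch: miss penalties for the three memory organizations.
latency, transfer, block = 10, 1, 16

narrow = block * (latency + transfer)       # a) one word at a time
wide = (block // 4) * (latency + transfer)  # b) four words per access
# c) four interleaved banks: first group of 4 words after 14 cycles,
# then one more group of 4 every 11 cycles.
interleaved = 14 + (block // 4 - 1) * 11
print(narrow, wide, interleaved)  # 176 44 47
```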
Suppose a processor with a 16-word block size has an effective miss rate per instruction of 0.5%. Assume that the CPI without cache misses is 1.2. Using the memory organizations in part D, how much faster is this processor when using the wide memory than when using the narrow or interleaved memory?
Answer:
[Figure: three main memory organizations - a. one-word-wide memory (CPU, cache, bus, memory); b. wide memory (CPU, cache, multiplexer, bus, memory); c. interleaved memory (CPU, cache, bus, banks Mem0-Mem3).]
Answer: Memory stall cycles per instruction are 0.005 x miss penalty. With the penalties from part D, CPI_narrow = 1.2 + 0.005 x 176 = 2.08, CPI_wide = 1.2 + 0.005 x 44 = 1.42, and CPI_interleaved = 1.2 + 0.005 x 47 = 1.435. The wide organization is therefore 2.08 / 1.42 = 1.46 times faster than the narrow one and 1.435 / 1.42 = 1.01 times faster than the interleaved one.
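The comparison, using the miss penalties computed in part D:

```python
# Sketch: relative performance with a 0.5% miss rate per instruction
# and a base CPI of 1.2, using the part-D miss penalties.
base_cpi, miss_rate = 1.2, 0.005
penalties = {"narrow": 176, "wide": 44, "interleaved": 47}

cpi = {k: base_cpi + miss_rate * p for k, p in penalties.items()}
print(cpi)
print(cpi["narrow"] / cpi["wide"])       # ~1.46x faster than narrow
print(cpi["interleaved"] / cpi["wide"])  # ~1.01x faster than interleaved
```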