
CMPEN 530

Speedup over Percent of Vectorization

Percent of vectorization    Speedup
  0%                         1
 10%                         1.10497
 20%                         1.23457
 30%                         1.3986
 40%                         1.6129
 50%                         1.90476
 60%                         2.32558
 70%                         2.98507
 80%                         4.16667
 90%                         6.89655
100%                        20
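These values follow Amdahl's law with a 20x speedup in vector mode (the value read off the 100% row). A minimal Python sketch that reproduces the table:

```python
# Reproduce the tabulated speedups: Amdahl's law with a 20x vector mode
# (the 20x figure is read off the 100% entry in the table above).
def speedup(f, vector_speedup=20):
    """Overall speedup when a fraction f of the work is vectorized."""
    return 1.0 / ((1.0 - f) + f / vector_speedup)

for pct in range(0, 101, 10):
    print(f"{pct:3d}%  {speedup(pct / 100):.5f}")
```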

CSE 530 Homework #1, Due September 26

Anthony Dotterer

For a linearly connected array of processors - i.e., an ILLIAC IV type architecture - calculate the hardware utilization when performing a maximum search operation. Assume n data elements and an organization composed of n processing elements, where n is a power of 2, and assume a one-dimensional communication pattern among the processors as defined by this architecture.

Answer:

Discuss the advantage(s) and the disadvantage(s) of the von Neumann concept.

Answer:

The von Neumann concept is a computer design model that uses a single storage structure to hold both instructions and data.

Advantages:
- Reprogramming was made easier
- Programs are allowed to modify themselves
- Programs can write programs
- General flexibility

Disadvantages:
- Malfunctioning programs can damage other programs or the operating system
- The von Neumann bottleneck: the CPU must wait for data to transfer to and from memory

Problems 1.1 and 1.2 page 60 (Hennessy and Patterson, 2nd edition).

Answer:

1.1) a) With a vector-mode speedup of 20, Amdahl's law gives Speedup(f) = 1 / ((1 - f) + f/20), where f is the fraction of vectorization; the resulting values are plotted in the table at the beginning of this document.

b) Solving Amdahl's law for the fraction of vectorization needed to reach a given overall speedup S gives f = (1 - 1/S) / (1 - 1/20).

c) Once that fraction of vectorization is reached, the share of the run time spent in vector mode is (f/20) / ((1 - f) + f/20).

d) For the hardware improvement, the vector unit becomes twice as fast (a vector-mode speedup of 40 instead of 20), so the speedup becomes 1 / ((1 - f) + f/40). For the compiler improvement to match it, the compiler must deliver a fraction of vectorization f' satisfying 1 / ((1 - f') + f'/20) = 1 / ((1 - f) + f/40). Working this out, the compiler group must increase the percentage of vectorization by 0.01842. Since such a small increase in vectorization matches the hardware change, the compiler group should be given the investment.
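As a check on part (d), a minimal sketch assuming a measured vectorization of 70% and a hardware option that doubles the vector-mode speedup from 20 to 40; both assumptions are consistent with the 0.01842 increase quoted above:

```python
# Sketch of the part (d) comparison, assuming a 70% measured vectorization and
# a hardware option that doubles the vector-mode speedup from 20 to 40
# (these assumptions are consistent with the 0.01842 figure quoted above).
def speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

f = 0.70
hw = speedup(f, 40)                  # speedup with the faster vector unit, ~3.15
# Find the vectorization fraction f2 that matches hw with the original 20x unit:
#   1 / ((1 - f2) + f2/20) = hw  =>  f2 = (1 - 1/hw) / (1 - 1/20)
f2 = (1 - 1 / hw) / (1 - 1 / 20)
print(f"hardware speedup = {hw:.4f}")
print(f"required vectorization = {f2:.5f}, increase = {f2 - f:.5f}")  # ~0.01842
```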

1.2) a) Applying Amdahl's law to the stated enhancement gives the overall speedup.

b) Working Amdahl's law backward from the measured speedup shows that 55% of the original execution time has been converted to fast mode.

A 40-MHz processor was used to execute a benchmark program with the following instruction mix and clock cycle counts:

Instruction type      Instruction count   Clock cycle count
Integer arithmetic    45000               1
Data transfer         32000               2
Floating point        15000               2
Control transfer       8000               2

Determine the effective CPI, MIPS rate, and execution time for this program.

Answer:

The total instruction count is Ic = 45000 + 32000 + 15000 + 8000 = 100000 instructions, and the total cycle count is 45000(1) + 32000(2) + 15000(2) + 8000(2) = 155000 cycles. Therefore the effective CPI = 155000 / 100000 = 1.55. The MIPS rate = clock rate / (CPI × 10^6) = (40 × 10^6) / (1.55 × 10^6) = 25.8 MIPS. The execution time E = total cycles / clock rate = 155000 / (40 × 10^6) = 3.875 ms.
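A short sketch that verifies these figures from the instruction mix above:

```python
# Check of the effective CPI, MIPS rate, and execution time for the 40-MHz mix.
clock_hz = 40e6
mix = {                      # instruction type: (count, cycles per instruction)
    "integer arithmetic": (45000, 1),
    "data transfer":      (32000, 2),
    "floating point":     (15000, 2),
    "control transfer":   (8000, 2),
}

instructions = sum(count for count, _ in mix.values())
cycles = sum(count * cpi for count, cpi in mix.values())

cpi = cycles / instructions                  # 1.55
mips = clock_hz / (cpi * 1e6)                # ~25.8 MIPS
exec_time = cycles / clock_hz                # 3.875e-3 s
print(cpi, mips, exec_time)
```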

A workstation uses a 15-MHz processor with a claimed 10-MIPS rating to execute a given program mix. Assume a one-cycle delay for each memory access:

a) What is the effective CPI of this computer?

Answer: Since the MIPS rate = clock rate / (CPI × 10^6), the effective CPI = clock rate / (MIPS rate × 10^6) = (15 × 10^6) / (10 × 10^6) = 1.5.

b) Suppose the processor is being upgraded with a 30-MHz clock. However, the speed of the memory subsystem remains unchanged, and consequently two clock cycles are needed per memory access. If 30% of the instructions require one memory access and another 5% require two memory accesses per instruction, what is the performance of the upgraded processor with a compatible instruction set and equal instruction counts in the given program mix?

Answer:

With the 30-MHz clock, each memory access now takes two cycles instead of one, which raises the effective CPI above the 1.5 found in part (a). Not sure how to finish.
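A hedged sketch of one way to finish part (b), assuming the 1.5 CPI from part (a) already includes the one-cycle memory delays, so that at 30 MHz each memory access costs one additional cycle:

```python
# Sketch for part (b), assuming the 1.5 CPI from part (a) already includes the
# one-cycle memory delays, so doubling the memory delay adds one extra cycle
# per memory access on the 30-MHz processor.
base_cpi = 1.5
extra_cpi = 0.30 * 1 + 0.05 * 2      # 30% of instructions pay 1 extra cycle,
                                     # 5% pay 2 extra cycles
new_cpi = base_cpi + extra_cpi       # 1.9
clock_hz = 30e6
mips = clock_hz / (new_cpi * 1e6)    # ~15.8 MIPS
print(new_cpi, mips)
```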

Problem 1.14 page 66 (Hennessy and Patterson, 2nd edition).

Answer:

1.14) a) MFLOPS = (number of floating-point operations) / (execution time × 10^6), so the following table shows the MFLOPS rating for each program and computer in Figure 1.11:

Programs   Computer A   Computer B   Computer C
P1         100          10           5
P2         0.1          1            5

b) The arithmetic mean is (100 + 0.1)/2 = 50.05 MFLOPS for computer A, (10 + 1)/2 = 5.5 for B, and (5 + 5)/2 = 5 for C.

The geometric mean is sqrt(100 × 0.1) = 3.16 MFLOPS for A, sqrt(10 × 1) = 3.16 for B, and sqrt(5 × 5) = 5 for C.

The harmonic mean is 2 / (1/100 + 1/0.1) = 0.20 MFLOPS for A, 2 / (1/10 + 1/1) = 1.82 for B, and 2 / (1/5 + 1/5) = 5 for C.

c) Not sure.
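A short sketch that computes the three means of the MFLOPS ratings in part (a):

```python
# Arithmetic, geometric, and harmonic means of the MFLOPS ratings above.
from statistics import geometric_mean, harmonic_mean, mean

mflops = {"A": [100, 0.1], "B": [10, 1], "C": [5, 5]}
for computer, rates in mflops.items():
    print(computer,
          round(mean(rates), 3),             # arithmetic mean
          round(geometric_mean(rates), 3),   # geometric mean
          round(harmonic_mean(rates), 3))    # harmonic mean
```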

Define the term delayed branch, its application, and its shortcomings (if any).

Answer:

A delayed branch is a technique for reducing the effects of control dependences by delaying the point at which a branch operation affects the program counter. This allows one or more instructions following the branch operation (in its delay slots) to execute whether or not the branch is taken.

Advantage:
- Allows pipelined CPUs to reduce the clock cycles wasted on pipeline flushes during a branch or jump operation.

Disadvantage:
- If the compiler cannot find independent instructions to place after the branch, it must insert no-op instructions, which increases the size of the program.

Three enhancements with the following speedups are proposed:
Speedup1 = 30
Speedup2 = 20
Speedup3 = 10
Only one enhancement is usable at a time. If enhancements 1 and 2 are each usable for 30% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?

Answer:

With F1 = F2 = 0.3, S1 = 30, S2 = 20, and S3 = 10, Amdahl's law for mutually exclusive enhancements gives Speedup = 1 / ((1 - F1 - F2 - F3) + F1/S1 + F2/S2 + F3/S3). Setting the speedup to 10: (0.4 - F3) + 0.3/30 + 0.3/20 + F3/10 = 0.1, so 0.425 - 0.9 F3 = 0.1 and F3 = 0.325 / 0.9 = 0.361. Enhancement 3 must therefore be used about 36.1% of the time.
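A small sketch that repeats the calculation:

```python
# Solve for the fraction of time enhancement 3 must be used to reach speedup 10.
s1, s2, s3 = 30, 20, 10
f1 = f2 = 0.30
target = 10

# Amdahl's law with three mutually exclusive enhancements:
#   1 / ((1 - f1 - f2 - f3) + f1/s1 + f2/s2 + f3/s3) = target
# Solving the linear equation for f3:
f3 = ((1 - f1 - f2) + f1 / s1 + f2 / s2 - 1 / target) / (1 - 1 / s3)
print(f3)   # ~0.361
```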

Assume the distribution of enhancement usage is 30%, 30%, and 20% for enhancements 1, 2, and 3, respectively. Assuming all three enhancements are in use, for what fraction of the reduced execution time is no enhancement in use?

Answer:

The reduced execution time, relative to the original, is (1 - 0.3 - 0.3 - 0.2) + 0.3/30 + 0.3/20 + 0.2/10 = 0.2 + 0.01 + 0.015 + 0.02 = 0.245. The 0.2 term represents the time with no enhancement in use. Therefore Fne = Ene / 0.245 = 0.2 / 0.245 = 0.816, where Ene is the execution time with no enhancements and Fne is the fraction of the reduced execution time where no enhancements are in use.
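A short sketch of the same calculation:

```python
# Fraction of the reduced execution time during which no enhancement is in use.
f = [0.30, 0.30, 0.20]          # usage fractions of enhancements 1, 2, 3
s = [30, 20, 10]                # corresponding speedups

unenhanced = 1 - sum(f)                                      # 0.2 of the original time
reduced = unenhanced + sum(fi / si for fi, si in zip(f, s))  # 0.245 of the original
print(unenhanced / reduced)     # ~0.816
```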

Assume for some benchmark, the fraction of use is 15% for each of the enhancements 1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one enhancement can be implemented, which should be chosen?

Answer:

If only enhancement 1 is implemented, Speedup = 1 / ((1 - 0.15) + 0.15/30) = 1.17. If only enhancement 2 is implemented, Speedup = 1 / ((1 - 0.15) + 0.15/20) = 1.17. If only enhancement 3 is implemented, Speedup = 1 / ((1 - 0.70) + 0.70/10) = 2.70. Therefore enhancement 3 should be implemented.
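A short sketch comparing the three options:

```python
# Overall speedup if only one of the three enhancements is implemented.
def amdahl(f, s):
    return 1.0 / ((1.0 - f) + f / s)

for name, f, s in [("enhancement 1", 0.15, 30),
                   ("enhancement 2", 0.15, 20),
                   ("enhancement 3", 0.70, 10)]:
    print(name, round(amdahl(f, s), 3))   # 1.17, 1.166, 2.703 -> pick 3
```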

True or false: if 10% of the operations in a program must be performed sequentially, then the maximum speedup gained is 10, no matter how much parallelism is available (prove your answer).

Answer:

Let f = 0.1 be the fraction of operations performed sequentially and p the speedup gained from parallelism, which goes to infinity with unlimited processors. Then Speedup = 1 / (f + (1 - f)/p) = 1 / (0.1 + 0.9/p), and as p approaches infinity the speedup approaches 1 / 0.1 = 10. Therefore it is true that no matter how much parallelism is available, the maximum speedup gained is 10.

True or false: in general, linear speedup is needed to make parallel systems (multiprocessors) cost-effective (justify your answer).

Answer:

Not sure

CPU time (T) is defined as T = Ic × CPI × τ, where Ic stands for the instruction count, CPI stands for the average clock cycles per instruction, and τ stands for the clock cycle time. A RISC computer, ideally, should be able to execute one instruction per clock cycle. Within the scope of a RISC architecture, name and discuss (briefly) distinct issues that do not allow ideal performance.

Answer:

Issues:
- Memory access: any access to memory can take longer than one clock cycle, so the instruction cannot complete in a single cycle.
- Branching: program branches flush instructions from the pipeline, so a branch effectively takes longer than one cycle.

Loop fusion allows two or more loops that are executed the same number of times and that use the same indices to be combined into one loop:

Within the scope of a RISC processor, why does it (loop fusion) improve performance (detailed explanation)?

Answer:

In the scope of a RISC processor, loop fusion can improve performance by removing redundant loop-control instructions (index updates, comparisons, and branches). With fewer of these overhead instructions to execute, the fused loop completes the same work in fewer total instructions.
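An illustrative sketch of the transformation (written in Python for readability; on a RISC machine the fused form removes one set of index-update, compare, and branch instructions per iteration):

```python
# Illustrative sketch of loop fusion: the two separate loops below each pay
# their own loop-control overhead (index update, compare, branch); the fused
# version pays it only once per iteration while producing the same results.
n = 1000
a = [0] * n
b = [0] * n

# Before fusion: two loops over the same index range.
for i in range(n):
    a[i] = i * 2
for i in range(n):
    b[i] = a[i] + 1

# After fusion: one loop, same results, half the loop-control work.
for i in range(n):
    a[i] = i * 2
    b[i] = a[i] + 1
```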

Within the scope of a vector processor, why does it (loop fusion) improve performance (beyond what has been discussed in part a) (detailed explanation)?

Answer:

In the scope of a vector processor, loop fusion improves performance by allowing data-dependent loops to be pipelined together as a single stream of vector operations. Not sure?

Within the scope of a superscalar processor, why does it (loop fusion) improve performance (beyond what has been discussed in part a) (detailed explanation)?

Answer: A superscalar processor can issue instructions from the fused loop bodies in parallel, so fusing the loops exposes more independent instructions that can execute in the same cycle.

Interleaved memory

Define interleaved memory (be as clear as possible).

Answer:

Interleaved memory is a way of organizing main memory as a number of memory banks so that consecutive addresses are spread across the banks and can be accessed in an overlapped fashion.

Within the scope of interleaved memory, define mapping of the logical addresses to the physical addresses. Distinguish them from each other.

Answer: In interleaved memory, the memory is divided into N banks, and logical address i resides in bank i mod N, at word i/N (ignoring the remainder) within that bank. Consecutive logical addresses therefore map to different physical banks.
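A small sketch of this low-order interleaved mapping, with an assumed bank count of N = 4:

```python
# Low-order interleaving: logical address i maps to bank (i mod N) at
# word (i // N) within that bank, so consecutive addresses hit different banks.
N = 4   # number of memory banks (illustrative choice)

def map_address(i, banks=N):
    return i % banks, i // banks   # (bank number, word within the bank)

for i in range(8):
    bank, offset = map_address(i)
    print(f"address {i} -> bank {bank}, word {offset}")
```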

What is the main difference between an interleaved memory and a parallel memory?

Answer:

Interleaved memory requires 2 to N memory banks to look up multiple contiguous memory locations, whereas a parallel memory lookup only requires one memory bank.

Consider a memory hierarchy using one of the three organizations for main memory as shown below. Assume that the cache block size is 16 words, the width of organization b is four words, and the number of memory modules in organization c is four. If the main memory latency for a new access is 10 cycles and the transfer time is 1 cycle, what is the cache miss penalty for each of these organizations?

Answer:

A) One-word-wide memory: each of the 16 words in the block requires a 10-cycle access plus a 1-cycle transfer, so the miss penalty is 16 × (10 + 1) = 176 cycles.

B) Four-word-wide memory: the block is fetched in four 4-word accesses, so the miss penalty is 4 × (10 + 1) = 44 cycles.

C) Four-way interleaved memory: the accesses overlap across the banks as follows:

Cycle   Bank 1          Bank 2          Bank 3          Bank 4
1       Access word
2       Cycle 2         Access word
3       Cycle 3         Cycle 2         Access word
4       Cycle 4         Cycle 3         Cycle 2         Access word
10      Cycle 10        Cycle 9         Cycle 8         Cycle 7
11      Transfer word   Cycle 10        Cycle 9         Cycle 8
12      Access word     Transfer word   Cycle 10        Cycle 9
13      Cycle 2         Access word     Transfer word   Cycle 10
14      Cycle 3         Cycle 2         Access word     Transfer word
15      Cycle 4         Cycle 3         Cycle 2         Access word

After the initial 14 cycles for the first four words, another four words are accumulated every 11 cycles, so the miss penalty is 14 + 3 × 11 = 47 cycles.
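A short sketch of the three miss-penalty calculations (parts A and B use the per-access latency-plus-transfer formula assumed above; part C follows the overlap table):

```python
# Miss penalties for a 16-word block with a 10-cycle access latency and a
# 1-cycle-per-word transfer time, under the three memory organizations.
# Parts a and b assume each access pays the full latency plus one transfer.
block_words = 16
latency = 10
transfer = 1

# a. One-word-wide memory: every word pays latency + transfer.
narrow = block_words * (latency + transfer)            # 176 cycles

# b. Four-word-wide memory: four 4-word accesses.
wide = (block_words // 4) * (latency + transfer)       # 44 cycles

# c. Four-way interleaved memory, following the overlap in the table above:
#    14 cycles until the first group of 4 words arrives, then 11 cycles per group.
interleaved = 14 + ((block_words // 4) - 1) * 11       # 47 cycles

print(narrow, wide, interleaved)
```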

Suppose a processor with a 16-word block size has an effective miss rate per instruction of 0.5%. Assume that the CPI without cache misses is 1.2. Using the memory organizations in part D, how much faster is this processor when using the wide memory than when using the narrow or interleaved memory?

(Figure for the memory organizations referenced above: a. one-word-wide memory organization - CPU, cache, bus, memory; b. wide memory organization - CPU, cache, multiplexer, bus, memory; c. interleaved memory organization - CPU, cache, bus, memory banks Mem0 through Mem3.)

Answer:
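A minimal sketch of how the comparison could be worked out, assuming the miss penalties computed in the previous answer (176, 44, and 47 cycles) together with the stated 0.5% misses per instruction and base CPI of 1.2:

```python
# Sketch of the comparison, assuming the miss penalties from the previous
# answer (176, 44, and 47 cycles) plus the stated 0.5% misses per instruction
# and a base CPI of 1.2.
base_cpi = 1.2
miss_rate = 0.005   # misses per instruction
penalties = {"narrow": 176, "wide": 44, "interleaved": 47}

cpi = {org: base_cpi + miss_rate * p for org, p in penalties.items()}
print(cpi)                                                             # 2.08, 1.42, 1.435
print("wide vs narrow:     ", round(cpi["narrow"] / cpi["wide"], 3))       # ~1.46x faster
print("wide vs interleaved:", round(cpi["interleaved"] / cpi["wide"], 3))  # ~1.01x faster
```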