ACA Answer Key

Study notes for exam preparation.


Advanced Computer Architecture

Question 1 - Write notes on any 2 Program Flow mechanisms
Question 2 - Write notes on the following: Amdahl's law and efficiency of a system; utilization of system and quality of parallelism; redundancy
Question 3 - Define Clock rate, CPI, MIPS rate and Throughput rate
Question 4 - Explain superscalar processors
Question 5 - Explain non-linear pipeline processors
Question 6 - Write notes on the following: Crossbar Switch Network; Multiport Memory; Multistage Network
Question 7 - Explain message passing in detail

Question 1 - Write notes on any 2 Program Flow mechanisms

Conventional machines use a control flow mechanism, in which the order of program execution is explicitly stated in the user's program.

● Control flow machines - In this type of machine, a token of control indicates when a statement is executed.

● Data flow machines - In this type of machine, an instruction is executed as soon as its operands become available.

● Reduction machines - In this type of machine, instruction execution is triggered by the demand for its result.

Comparison of program flow mechanisms:

Control Flow

● Basic definition: Conventional computation; a token of control indicates when a statement should be executed.
● Advantages: 1. Full control. 2. Complex data and control structures are easily implemented.
● Disadvantages: 1. Less efficient. 2. Difficult to program. 3. Difficult to prevent run-time errors.

Data Flow

● Basic definition: Eager evaluation; statements are executed when all their operands are available.
● Advantages: 1. Very high potential for parallelism. 2. High throughput. 3. Free from side effects.
● Disadvantages: 1. High control overhead. 2. Difficult to manipulate data structures.

Reduction Machine

● Basic definition: Lazy evaluation; statements are executed only when their result is required for another computation.
● Advantages: 1. Only required instructions are executed. 2. High degree of parallelism. 3. Easy manipulation of data structures.
● Disadvantages: 1. Time needed to propagate demand tokens.
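The data flow versus reduction distinction above is essentially eager versus demand-driven evaluation. A toy software sketch of the demand-driven case, purely as an analogy and not a description of the hardware:

```python
# Demand-driven ("reduction"-style) evaluation via thunks: nothing is
# computed until its result is actually demanded, and each value is
# computed at most once. An eager (data flow style) evaluator would
# instead fire each operation as soon as its operands existed.

def lazy(expr):
    """Wrap a zero-argument expression so it is evaluated only on demand."""
    cache = {}
    def force():
        if "value" not in cache:
            cache["value"] = expr()      # evaluated on first demand only
        return cache["value"]
    return force

a = lazy(lambda: 2 + 3)                  # no work performed yet
b = lazy(lambda: a() * 10)               # demands a only when b is forced
print(b())                               # -> 50: evaluation triggered by demand
```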

Question 2 - Write notes on the following:

a. Amdahl's law and efficiency of a system
b. Utilization of system and quality of parallelism
c. Redundancy

a. Amdahl’s Law:
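In the notation used by the two examples below (F is the fraction of the computation affected by an improvement, S is the speedup of that fraction), Amdahl's law gives the overall speedup as:

Speedup_overall = 1 / ((1 - F) + F/S)

so the portion (1 - F) that is not improved limits the benefit no matter how large S becomes.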


Example 1: If an improvement can speed up 30% of the computation, F will be 0.3; if the improvement makes the affected portion twice as fast, S will be 2. Amdahl's law then states that the overall speedup of applying the improvement will be 1 / ((1 - 0.3) + 0.3/2) = 1/0.85 ≈ 1.18.

Example 2: We are given a task which is split up into four parts: F1 = 11%, F2 = 18%, F3 = 23%, F4 = 48%, which add up to 100%. Then we say F1 is not sped up, so S1 = 1 or 100%, F2 is sped up 5×, so S2 = 500%, F3 is sped up 20×, so S3 = 2000%, and F4 is sped up 1.6×, so S4 = 160%. Using the formula F1/S1 + F2/S2 + F3/S3 + F4/S4, we find the new running time is 0.11/1 + 0.18/5 + 0.23/20 + 0.48/1.6 = 0.4575.

This is a little less than ½ the original running time, which we know is 1. Therefore the overall speed boost is 1/0.4575 ≈ 2.186, or a little more than double the original speed, using the formula (F1/S1 + F2/S2 + F3/S3 + F4/S4)^-1. Notice how the 20× and 5× speedups don't have much effect on the overall speed boost and running time when 11% of the task is not sped up at all and 48% is sped up by only 1.6×.
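A few lines of Python, purely illustrative, reproduce both results:

```python
# Verify the two worked examples above (fractions and speedups from the text).
def amdahl_speedup(fractions, speedups):
    """Overall speedup when fraction f_i of the work is sped up by factor s_i."""
    new_time = sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / new_time

# Example 1: 30% of the computation made twice as fast, the remaining 70% unchanged.
print(amdahl_speedup([0.7, 0.3], [1, 2]))                          # ~1.176

# Example 2: the new running time is 0.4575, so the speedup is ~2.186.
print(amdahl_speedup([0.11, 0.18, 0.23, 0.48], [1, 5, 20, 1.6]))   # ~2.186
```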

System Efficiency

■ Let O(n) be the total number of unit operations performed by an n-processor system and T(n) be the execution time in unit time steps. In general, T(n) < O(n) if more than one operation is performed by the n processors per unit time, where n >= 2. Assume T(1) = O(1) in a uniprocessor system.

The speedup factor is defined as: S(n) = T(1)/T(n)

The efficiency of an n-processor system is defined as: E(n) = S(n)/n = T(1)/(n*T(n))

■ Efficiency is an indication of the actual degree of speedup achieved compared with the maximum possible value.


■ Since 1 <= S(n) <= n, we have 1/n <= E(n) <= 1, i.e. efficiency is always a fraction.

■ The lowest efficiency corresponds to the case where the entire program is executed sequentially on a single processor.

■ The maximum efficiency is achieved when all n processors are fully utilized throughout the execution period.

b. System Utilization

■ The system utilization of a parallel computation is defined as:

U(n) = R(n) * E(n) = O(n)/(n * T(n))

where R(n) is the redundancy defined in part (c) below.

■ The system utilization indicates the percentage of resources that were kept busy during the execution of a parallel program. It is interesting to note the following relationships:

1/n <= E(n) <= U(n) <= 1
1 <= R(n) <= 1/E(n) <= n

Quality of parallelism

■ The quality of a parallel computation is directly proportional to the speedup and efficiency and inversely related to the redundancy. Thus we have:

Q(n) = (S(n) * E(n))/R(n) = T(1)^3/(n * T(n)^2 * O(n))

■ Since E(n) is always a fraction and R(n) is a number between 1 and n, the quality Q(n) is always bounded by the speedup factor S(n).

c. Redundancy

■ The redundancy of a parallel computation is defined as the ratio of O(n) to O(1):

R(n) = O(n)/O(1)
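Collecting parts (a)-(c), here is a minimal sketch that computes all of these metrics from measured values; the function name, variable names and example numbers are mine:

```python
# Minimal sketch of the performance metrics defined above.
# T1, Tn: execution times on 1 and n processors; On: operation count on
# n processors. Per the assumption above, O(1) = T(1).

def parallel_metrics(n, T1, Tn, On):
    S = T1 / Tn          # speedup       S(n) = T(1)/T(n)
    E = S / n            # efficiency    E(n) = S(n)/n
    R = On / T1          # redundancy    R(n) = O(n)/O(1), with O(1) = T(1)
    U = R * E            # utilization   U(n) = O(n)/(n*T(n))
    Q = S * E / R        # quality       Q(n) = T(1)^3/(n*T(n)^2*O(n))
    return {"S": S, "E": E, "R": R, "U": U, "Q": Q}

# Hypothetical measurement: 4 processors, T(1) = 100, T(4) = 40, O(4) = 120.
# The bounds 1/n <= E <= U <= 1 and 1 <= R <= 1/E <= n hold for these values.
print(parallel_metrics(4, 100, 40, 120))
```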

Question 3 - Define Clock rate, CPI, MIPS rate and Throughput rate

i. Clock rate

■ The CPU is driven by a clock with a constant cycle time.

● The cycle time is represented by T, in nanoseconds.

■ The inverse of the cycle time is the clock rate (f = 1/T).

● The clock rate f is measured in megahertz.


ii. CPI - Cycles per instruction

iii. MIPS - Millions of instructions per second


iv. Throughput Rate

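A minimal sketch relating the four quantities, using the standard textbook definitions; the function names and argument names are mine, not taken from the original answer key:

```python
# Standard relations between cycle time, clock rate, CPI, MIPS rate and
# throughput. Ic is the instruction count of a program.

def clock_rate_mhz(cycle_time_ns):
    """f = 1/T: a 2 ns cycle time gives a 500 MHz clock."""
    return 1000.0 / cycle_time_ns

def cpi(total_cycles, Ic):
    """CPI: average number of clock cycles consumed per instruction."""
    return total_cycles / Ic

def mips_rate(clock_rate_hz, cpi_value):
    """MIPS = f / (CPI * 10^6): millions of instructions executed per second."""
    return clock_rate_hz / (cpi_value * 1e6)

def throughput(clock_rate_hz, cpi_value, Ic):
    """Wp = f / (Ic * CPI): programs completed per second."""
    return clock_rate_hz / (Ic * cpi_value)

# Example: 500 MHz clock, CPI of 2, a program of 10 million instructions.
f = clock_rate_mhz(2.0) * 1e6
print(mips_rate(f, 2), throughput(f, 2, 10_000_000))   # 250 MIPS, 25 programs/s
```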

Question 4 - Explain superscalar processors

■ A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate.

■ A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an arithmetic logic unit, a bit shifter, or a multiplier.

■ In Flynn's taxonomy, a single-core superscalar processor is classified as an SISD processor (Single Instruction stream, Single Data stream), since it still executes a single sequential instruction stream.

■ While a superscalar CPU is typically also pipelined, pipelining and superscalar architecture are considered different performance enhancement techniques.

■ The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU core):

1. Instructions are issued from a sequential instruction stream.
2. CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time).
3. The CPU accepts multiple instructions per clock cycle.


Fig: Simple superscalar pipeline. By fetching and dispatching two instructions at a time, a maximum of two instructions per cycle can be completed.

■ The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time.

■ In contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy is the difference between scalar and vector arithmetic.

■ A superscalar processor is a mixture of the above two processor types. Each instruction processes one data item, but there are multiple redundant functional units within each CPU, so multiple instructions can process separate data items concurrently.

■ Superscalar CPU design emphasizes improving the accuracy of the instruction dispatcher and allowing it to keep the multiple functional units in use at all times. This has become increasingly important as the number of units has increased. While early superscalar CPUs had two ALUs and a single FPU, a modern design such as the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will suffer.

■ A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. But merely processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that, but with different methods.

■ In a superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching them to redundant functional units contained inside a single CPU. Therefore a superscalar processor can be envisioned as having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread.
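A toy sketch of that dispatch decision, assuming a simple in-order, two-wide issue model; the instruction encoding and register names are hypothetical:

```python
# Toy model of 2-wide in-order issue: dispatch a second instruction in the
# same cycle only if it has no data dependency on the first. Purely
# illustrative, not a description of any particular CPU.

def independent(i1, i2):
    """True if i2 can issue alongside i1 (no RAW, WAR or WAW hazard)."""
    raw = i1["dst"] in i2["src"]        # i2 reads what i1 writes
    war = i2["dst"] in i1["src"]        # i2 overwrites what i1 still reads
    waw = i1["dst"] == i2["dst"]        # both write the same register
    return not (raw or war or waw)

def schedule(instrs):
    """Group a sequential instruction stream into issue cycles (max 2 wide)."""
    cycles, i = [], 0
    while i < len(instrs):
        if i + 1 < len(instrs) and independent(instrs[i], instrs[i + 1]):
            cycles.append([instrs[i], instrs[i + 1]])
            i += 2
        else:
            cycles.append([instrs[i]])
            i += 1
    return cycles

# Example stream: the first pair is independent and dual-issues; the last
# instruction has a RAW dependency on r7 and must issue alone.
program = [
    {"op": "add", "dst": "r1", "src": ["r2", "r3"]},
    {"op": "mul", "dst": "r4", "src": ["r5", "r6"]},
    {"op": "sub", "dst": "r7", "src": ["r1", "r4"]},
    {"op": "add", "dst": "r8", "src": ["r7", "r2"]},
]
print(schedule(program))
```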

Limitations:

Available performance improvement from superscalar techniques is limited by three key areas:

1. The degree of intrinsic parallelism in the instruction stream, i.e. limited amount of instruction-level parallelism.

2. The complexity and time cost of the dispatcher and associated dependency checking logic.

3. The branch instruction processing.

Examples of superscalar processors:
● The P5 Pentium was the first superscalar x86 processor.
● Nx586, P6 Pentium Pro and AMD K5
● Cyrix 6x86

Question 5 - Explain non-linear pipeline processors

■ A pipeline need not be a simple linear chain of stages. There are instances where it is useful to have a collection of functional units that can be wired into a particular pattern of flow, even with loops and skips in the chain. This may allow more than one function to be computed with the same pipeline.


■ A typical case would be built-in floating-point square root, which chains together the floating-point adder and multiplier, rather than having separate functional units for this rarely used operation. Depending upon how the square root operation operates, it might leave holes in the schedule that would admit independent floating adds or multiplies.

■ The problem with trying to utilize a non-linear pipeline is that it is difficult to keep it full, because the scheduled functions must not collide with each other or with themselves.

■ These reservation tables show the sequence in which each function utilizes each stage. (For example, think of X as being a floating square root, and Y as being a floating cosine. A simple floating multiply might occupy just S1 and S2 in sequence.) We could also denote multiple stages being used in parallel, or a stage being drawn out for more than one cycle with these diagrams.

■ We determine the next start time for one or the other of the functions by lining up the diagrams and sliding one with respect to another to see where one can fit into the open slots.

■ Once an X function has been scheduled, another X function can start after 1, 3 or 6 cycles. A Y function can start after 2 or 4 cycles.

■ Once a Y function has been scheduled, another Y function can start after 1, 3 or 5 cycles. An X function can start after 2 or 4 cycles.

■ After two functions have been scheduled, no more can start until both are complete.
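A minimal sketch of how such start times are derived, assuming a hypothetical reservation table (not the X and Y tables referred to above): a latency is forbidden whenever two initiations of the same function would occupy some stage in the same cycle.

```python
# Compute forbidden and allowed start latencies from a reservation table.
# The table maps each stage to the cycles in which one evaluation of the
# function occupies that stage (a hypothetical example, not the X/Y tables).

def forbidden_latencies(table):
    """Offsets at which a second initiation collides with the first."""
    forbidden = set()
    for cycles in table.values():
        for a in cycles:
            for b in cycles:
                if b > a:
                    forbidden.add(b - a)   # same stage needed twice, b - a cycles apart
    return forbidden

def allowed_latencies(table):
    """Start offsets (up to the table length) that do not collide."""
    length = max(max(c) for c in table.values()) + 1
    banned = forbidden_latencies(table)
    return [k for k in range(1, length + 1) if k not in banned]

# Hypothetical function occupying three stages over seven cycles.
X = {"S1": [0, 6], "S2": [1, 5], "S3": [2, 3, 4]}
print(sorted(forbidden_latencies(X)))   # [1, 2, 4, 6]
print(allowed_latencies(X))             # [3, 5, 7]
```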


Question 6 - Write notes on the following:

i. Crossbar Switch Network


■ A separate path is available for each memory unit.

■ Every processor is connected to each memory module through a crosspoint switch.

■ Obviously, hardware complexity increases.

■ All processors can send memory requests independently and asynchronously.

■ Each crosspoint switch in the crossbar network can be set open or closed, providing a point-to-point connection between the source and destination.

■ On each row of the crossbar mesh, multiple switches can be connected simultaneously.

■ In each column of the crossbar, only one switch can be connected at a time.

■ A problem arises when multiple requests are destined for the same memory module at the same time. In such cases only one request can be serviced at a time, since at any given time only one switch in that column can be connected.

■ Each crosspoint must have additional hardware capable of handling all switching and resolving all conflicts.

■ An arbitration module is used to make the selection, based on priority, whenever a conflict arises; acknowledgement signals are sent to indicate the result of the arbitration (see the sketch after this list).


■ A multiplexer module multiplexes the data, address and control signals from the processor.

■ Each crosspoint requires a large number of connection lines to accommodate the address, data and control signals.

■ The crossbar switch has the potential for the highest bandwidth and system efficiency.

■ The maximum number of simultaneous transfers is limited by the number of memory modules and the bandwidth and speed of the buses, rather than by the number of paths available.

■ Because of its complexity and cost, it may not be preferred for large multiprocessor systems.
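A toy sketch of that column arbitration, assuming a fixed-priority policy where the lowest-numbered processor wins; the processor and module names are hypothetical:

```python
# Toy model of crossbar arbitration: each memory module (a column of the
# crossbar) can close only one crosspoint switch per cycle, so conflicting
# requests for the same module are resolved by a fixed priority (lowest
# processor id wins here). Purely illustrative.

def arbitrate(requests):
    """requests: {processor_id: memory_module}. Returns (granted, rejected)."""
    granted, rejected = {}, []
    for proc in sorted(requests):              # fixed priority: lower id first
        module = requests[proc]
        if module in granted.values():         # column already in use this cycle
            rejected.append(proc)              # must retry next cycle
        else:
            granted[proc] = module
    return granted, rejected

# P0 and P2 both request module M1; only P0 is serviced this cycle.
print(arbitrate({0: "M1", 1: "M3", 2: "M1", 3: "M0"}))
```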

ii. Multiport Memory

■ A multiport memory has multiple ports, providing multiple paths between the memory and the processors.

■ Multiport memory is based on the idea of moving all crosspoint arbitration and switching functions associated with each memory module into the memory controller.

■ Thus the memory module becomes more expensive due to added access ports and associated logic.


iii. Multistage Network


Question 7 - Explain message passing in detail


Message passing in multicomputers

Message Formats


Store and forward routing


Flits and Wormhole Routing

Store and Forward Vs Wormhole
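The comparison can be made concrete with the standard first-order latency expressions; the symbols here are assumptions (L = packet length in bits, F = flit length in bits, W = channel bandwidth in bits per second, D = number of hops), not taken from the original pages:

```python
# Store-and-forward vs wormhole routing latency (standard first-order model,
# ignoring per-node switching delay and contention).

def store_and_forward_latency(L, W, D):
    """Each of the D hops must buffer the whole packet before forwarding it."""
    return D * (L / W)

def wormhole_latency(L, F, W, D):
    """The header flit reserves the path; the rest of the packet pipelines behind it."""
    return D * (F / W) + L / W

# 1024-bit packet, 32-bit flits, 1 Gb/s channels, 10 hops: wormhole latency is
# nearly independent of the distance D, while store-and-forward grows with D.
L, F, W, D = 1024, 32, 1e9, 10
print(store_and_forward_latency(L, W, D))   # ~1.02e-05 s
print(wormhole_latency(L, F, W, D))         # ~1.34e-06 s
```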


Asynchronous pipelining


Wormhole Node Handshake