6677197 Parallel Processing PART 1

Transcript of 6677197 Parallel Processing PART 1 (42 pages)

PARALLEL PROCESSING


Course Material

Course Material

Text Books:
- Computer Architecture & Parallel Processing, by Kai Hwang and Faye A. Briggs
- Advanced Computer Architecture, by Kai Hwang

Reference Book:
- Scalable Computer Architecture


Marks Distribution

Presentation : 10
Midterm : 10
Final Paper : 80
Total : 100


What is Parallel Processing?

It is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process.

Efficiency is measured in terms of processing time, speed, and accuracy.

* Note: always classify definitions first, then give properties.


Types of Concurrent Events

There are 3 types of concurrent events:-

1. Parallel Event or Synchronous Event :- (type of concurrency is parallelism)

It may occur in multiple resources during the same time interval.

Example: Array/Vector processors, where a single CU (control unit) drives multiple PEs (processing elements), each based on an ALU.


2. Simultaneous Event or Asynchronous Event :- (type of concurrency is simultaneity)

It may occur in multiple resources at the same time instant.

Example: Multiprocessing system

3. Pipelined Event or Overlapped Event :- (type of concurrency is pipelining)

It may occur in overlapped time spans.

Example: Pipelined processor


System Attributes versus Performance Factors

The ideal performance of a computer system requires a perfect match between machine capability and program behavior.

Machine capability can be enhanced with better hardware technology; however, program behavior is difficult to predict due to its dependence on application and run-time conditions.

Below are the five fundamental factors for projecting the performance of a computer.


1. Clock Rate :- The CPU is driven by a clock of constant cycle time τ (in ns):

τ = 1 / f

2. CPI :- (Cycles Per Instruction)

Since different instructions require different numbers of cycles to execute, CPI is taken as an average value for a given instruction set and a given program mix.


3. Execution Time :- Let Ic be the Instruction Count, i.e. the total number of instructions in the program. Then

Execution Time: T = Ic × CPI × τ

Now, CPI = instruction cycle = processor cycles + memory cycles, so

instruction cycle = p + m × k

where
m = number of memory references per instruction
p = number of processor cycles per instruction
k = latency factor (how much slower the memory is w.r.t. the CPU)

Now let C be the total number of cycles required to execute a program. Then

C = Ic × CPI

and the time to execute the program will be

T = C × τ
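The relations above can be sketched numerically. This is a minimal illustration; the values of Ic, p, m, k, and the clock rate below are assumptions chosen for the example, not figures from the slides.

```python
# Illustrative (assumed) system attributes
Ic = 100_000        # instruction count
p = 2               # processor cycles per instruction (average)
m = 2               # memory references per instruction (average)
k = 4               # latency factor: memory cycle / processor cycle
f = 40e6            # clock rate in Hz
tau = 1 / f         # clock cycle time in seconds

CPI = p + m * k     # CPI = p + m*k
C = Ic * CPI        # total cycles: C = Ic * CPI
T = C * tau         # execution time: T = Ic * CPI * tau

print(f"CPI = {CPI}, C = {C}, T = {T * 1e3} ms")
```

With these assumed numbers, CPI = 2 + 2×4 = 10 and T = 10^6 cycles / 40 MHz = 25 ms.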

Page 11: 6677197 Parallel Processing PART 1

11

4. MIPS Rate :-

MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6)

5. Throughput Rate :- Number of programs executed per unit time.

W = 1 / T

W = (MIPS × 10^6) / Ic

Page 12: 6677197 Parallel Processing PART 1

12

Numerical :- A benchmark program is executed on a 40 MHz processor. The benchmark program has the following statistics.

Instruction Type    Instruction Count    Clock Cycle Count
Arithmetic          45000                1
Branch              32000                2
Load/Store          15000                2
Floating Point      8000                 2

Calculate the average CPI, MIPS rate & execution time for the above benchmark program.

Page 13: 6677197 Parallel Processing PART 1

13

Average CPI = C / Ic

C = total number of cycles to execute the whole program
Ic = total instruction count

C / Ic = (45000 × 1 + 32000 × 2 + 15000 × 2 + 8000 × 2) / (45000 + 32000 + 15000 + 8000)

= 155000 / 100000

CPI = 1.55

Execution Time = C / f

Page 14: 6677197 Parallel Processing PART 1

14

T = 155000 / (40 × 10^6)

T = 3.875 ms

MIPS rate = Ic / (T × 10^6) = 100000 / (3.875 × 10^-3 × 10^6)

MIPS rate = 25.8
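The worked example can be checked end-to-end with a few lines of Python; all numbers come directly from the benchmark table above.

```python
# Benchmark statistics from the example: 40 MHz processor, four instruction classes
counts = {"arithmetic": 45000, "branch": 32000, "load/store": 15000, "float": 8000}
cycles = {"arithmetic": 1, "branch": 2, "load/store": 2, "float": 2}

C = sum(counts[t] * cycles[t] for t in counts)   # total cycles
Ic = sum(counts.values())                        # total instruction count
CPI = C / Ic                                     # average CPI
f = 40e6                                         # 40 MHz clock
T = C / f                                        # execution time in seconds
MIPS = Ic / (T * 1e6)                            # equivalently f / (CPI * 1e6)

print(f"CPI = {CPI}, T = {T * 1e3} ms, MIPS = {MIPS:.1f}")
```

This reproduces the slide's results: CPI = 1.55, T = 3.875 ms, and a MIPS rate of about 25.8.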

Page 15: 6677197 Parallel Processing PART 1

15

System Attributes versus Performance Factors (Ic; CPI = p + m × k; τ)

Instruction-set architecture : affects Ic and p
Compiler technology : affects Ic, p and m
CPU implementation & technology : affects p and τ
Memory hierarchy : affects k and τ

Page 16: 6677197 Parallel Processing PART 1

16

Practice Problems :-

1. Do problem number 1.4 from the book Advanced Computer Architecture by Kai Hwang.

2. A benchmark program containing 234,000 instructions is executed on a processor having a cycle time of 0.15 ns. The statistics of the program are given below.

Each memory reference requires 3 CPU cycles to complete. Calculate the MIPS rate & throughput for the program.

Page 17: 6677197 Parallel Processing PART 1

17

Instruction Type    Instruction Mix    Processor Cycles    Memory Cycles
Arithmetic          58 %               2                   2
Branch              33 %               3                   1
Load/Store          9 %                3                   2
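One way to set up practice problem 2 is to apply the CPI = p + m × k formula per instruction class and weight by the instruction mix. The sketch below follows that method; the dictionary names are illustrative, and the memory latency factor k = 3 comes from "each memory reference requires 3 CPU cycles".

```python
# Per-class statistics from the practice-problem table
mix = {"arithmetic": 0.58, "branch": 0.33, "load/store": 0.09}
p   = {"arithmetic": 2,    "branch": 3,    "load/store": 3}    # processor cycles
m   = {"arithmetic": 2,    "branch": 1,    "load/store": 2}    # memory references
k = 3                      # each memory reference takes 3 CPU cycles
Ic = 234_000               # instruction count
tau = 0.15e-9              # cycle time: 0.15 ns
f = 1 / tau                # clock rate

CPI = sum(mix[t] * (p[t] + m[t] * k) for t in mix)   # mix-weighted CPI
T = Ic * CPI * tau                                   # execution time
MIPS = f / (CPI * 1e6)                               # MIPS rate
W = 1 / T                                            # throughput: programs/sec

print(f"CPI = {CPI:.2f}, MIPS = {MIPS:.1f}, W = {W:.1f} programs/s")
```

Compare the printed values against your own hand calculation of the problem.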

Page 18: 6677197 Parallel Processing PART 1

18

Programmatic Levels of Parallel Processing

Parallel Processing can be challenged in 4 programmatic levels:-

1. Job / Program Level

2. Task / Procedure Level

3. Interinstruction Level

4. Intrainstruction Level

Page 19: 6677197 Parallel Processing PART 1

19

1. Job / Program Level :-

It requires the development of parallel processable algorithms. The implementation of parallel algorithms depends on the efficient allocation of limited hardware and software resources to multiple programs being used to solve a large computational problem.

Example: weather forecasting, medical consulting, oil exploration, etc.

Page 20: 6677197 Parallel Processing PART 1

20

2. Task / Procedure Level :- It is conducted among procedure/tasks within the same program. This involves the decomposition of the program into multiple tasks. ( for simultaneous execution )

3. Interinstruction Level :- The interinstruction level exploits concurrency among multiple instructions so that they can be executed simultaneously. Data dependency analysis is often performed to reveal parallelism among instructions. Vectorization may be desired for scalar operations within DO loops.

4. Intrainstruction Level :- The intrainstruction level exploits faster and concurrent operations within each instruction, e.g. use of carry-lookahead and carry-save adders instead of ripple-carry adders.

Page 22: 6677197 Parallel Processing PART 1

22

Key Points :-

1. Hardware role increases from high to low levels whereas software role increases from low to high levels.

2. While the highest (job) level is handled algorithmically, the lowest level is implemented directly by hardware means.

3. The trade-off between hardware and software approaches to solve a problem is always a very controversial issue.

Page 23: 6677197 Parallel Processing PART 1

23

4. As hardware cost declines and software cost increases, more and more hardware methods are replacing the conventional software approaches.

Conclusion :- Parallel Processing is a combined field of studies which requires a broad knowledge of and experience with all aspects of algorithms, languages, hardware, software, performance evaluation and computing alternatives.

Page 24: 6677197 Parallel Processing PART 1

24

Parallel Processing in Uniprocessor Systems

A number of parallel processing mechanisms have been developed in uniprocessor computers. We identify them in six categories which are described below.

1. Multiplicity of Functional Units :-

Different ALU functions can be distributed to multiple & specialized functional units which can operate in parallel.

Page 25: 6677197 Parallel Processing PART 1

25

The CDC-6600 has 10 functional units built into its CPU.

[Figure: multiple specialized functional units, e.g. fixed-point and floating-point add/sub and mul/div units, as in the IBM 360/91]

Page 26: 6677197 Parallel Processing PART 1

26

2. Parallelism & Pipelining within the CPU :-

Use of carry-lookahead & carry-save adders instead of ripple-carry adders.

Cascading two 4-bit parallel adders creates an 8-bit parallel adder.

Page 27: 6677197 Parallel Processing PART 1

27

Ripple-carry Adder :-

At each stage, the sum bit is not valid until after the carry bits in all the preceding stages are valid. The time required for a valid addition is directly proportional to the number of bits.

Problem :- The time required to generate each carry-out bit in the 8-bit parallel adder is 24 ns. Once all inputs to an adder are valid, there is a delay of 32 ns until the output sum bit is valid. What is the maximum number of additions per second that the adder can perform?

1 addition = 7 × 24 + 32 = 200 ns

Additions/sec = 1 / (200 × 10^-9) = 5 × 10^-3 × 10^9 = 5 × 10^6 = 5 million additions/sec
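The ripple-carry timing model above generalizes to any width: the last sum bit waits for (n − 1) carry delays plus one sum delay. A small sketch (the function name is illustrative):

```python
# Throughput of an n-bit ripple-carry adder: the final sum bit is valid only
# after (n-1) carry-out delays plus the sum delay of the last stage.
def max_additions_per_sec(n_bits, carry_delay_ns, sum_delay_ns):
    add_time_ns = (n_bits - 1) * carry_delay_ns + sum_delay_ns
    return 1e9 / add_time_ns

# The worked problem: 8-bit adder, 24 ns per carry, 32 ns for the sum bit.
rate = max_additions_per_sec(8, 24, 32)   # one addition every 200 ns
print(rate)                               # 5 million additions/sec
```

The same function can be reused for the practice problem that follows by solving for the carry delay instead.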


Practice Problem

Assuming the 32 ns delay in producing a valid sum bit in the 8-bit parallel adder, what maximum delay in generating a carry-out bit is allowed if the adder must be capable of performing 10^7 additions per second?


Carry-Lookahead Adder :-

A 4-bit parallel adder incorporating carry lookahead. Each full adder is of the type shown in the figure.


Essence & Idea :-

To determine & generate the carry input bits for all stages after examining the input bits simultaneously.

C1 = A0B0 + A0C0 + B0C0

= A0B0 + (A0 + B0)C0

C2 = A1B1 + (A1 + B1)C1

...

Cn = An-1Bn-1 + (An-1 + Bn-1)Cn-1

In each equation, the AiBi term is the carry generate and the (Ai + Bi) term is the carry propagate.


If Ai and Bi are both 1, then Ci+1 = 1. This means the input data itself generates a carry; this is called carry generate:

G0 = A0B0
G1 = A1B1
Gn-1 = An-1Bn-1

Ci+1 can also be 1 if Ci = 1 and either Ai or Bi is 1; Ai or Bi is then used to propagate the carry. This is called carry propagate, represented by Pi.


P0 = A0 + B0

P1 = A1 + B1

Pn-1 = An-1 + Bn-1

Now writing the carry equations in terms of carry generate and carry propagate.

C1 = G0 + P0C0

C2 = G1 + P1C1

= G1 + P1 (G0 + P0C0 )

C2 = G1 + P1G0 + P1P0C0


C3 = G2 + P2C2

= G2 + P2 ( G1 + P1G0 + P1P0C0 )

C3 = G2 + P2G1 + P2P1G0 + P2P1P0C0
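The generate/propagate algebra above can be checked with a short sketch. The recurrence Ci+1 = Gi + Pi·Ci is written as a loop here for clarity; in hardware the lookahead logic expands it so every carry is computed from the inputs in parallel. The function name is illustrative.

```python
# Carries computed purely from generate (Gi = Ai AND Bi) and
# propagate (Pi = Ai OR Bi) signals, per the equations above.
def cla_carries(a_bits, b_bits, c0):
    """a_bits/b_bits are lists of 0/1 bits, LSB first; returns [C1, C2, ...]."""
    carries = []
    c = c0
    for a, b in zip(a_bits, b_bits):
        g = a & b          # carry generate Gi
        p = a | b          # carry propagate Pi
        c = g | (p & c)    # Ci+1 = Gi + Pi*Ci
        carries.append(c)
    return carries

# Example: 11 + 6 = 17; bits are LSB first (1011 = 11, 0110 = 6).
print(cla_carries([1, 1, 0, 1], [0, 1, 1, 0], 0))
```

Comparing against an ordinary binary addition by hand confirms the carry chain.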

Problem :- In each full adder of a 4-bit carry-lookahead adder, there is a propagation delay of 4 ns before the carry propagate & carry generate outputs are valid. The delay in each external logic gate is 3 ns. Once all inputs to an adder are valid, there is a delay of 6 ns before the output sum bit is valid. What is the maximum number of additions/sec that the adder can perform?

1 addition = 4 ns + 3 ns + 3 ns + 6 ns = 16 ns

(The AND gates of the lookahead logic operate in parallel; the OR gate is in series.)

Additions/sec = 1 / (16 × 10^-9) = 62.5 × 10^-3 × 10^9 = 62.5 × 10^6 = 62.5 million additions/sec
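The lookahead timing can be verified and compared against the earlier ripple-carry result; the delay names below are just labels for the figures given in the problem.

```python
# 4-bit carry-lookahead timing: G/P valid after 4 ns, then one parallel AND
# level (3 ns) and one OR level (3 ns) yield every carry; the sum follows 6 ns later.
gp_delay, and_delay, or_delay, sum_delay = 4, 3, 3, 6
add_time_ns = gp_delay + and_delay + or_delay + sum_delay   # 16 ns per addition
rate = 1e9 / add_time_ns                                    # additions per second

ripple_rate = 5e6      # the 8-bit ripple-carry adder result worked out earlier
print(rate, rate / ripple_rate)   # 62.5 million additions/sec, 12.5x faster
```

Note that the addition time no longer grows with the carry-chain length, which is the whole point of lookahead.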


3. Overlapping CPU & I/O Operations :-

DMA is conducted on a cycle-stealing basis.

• The CDC-6600 has 10 peripheral (I/O) processors for I/O multiprocessing.

• Simultaneous I/O operations & CPU computations can be achieved using separate I/O controllers, channels.

4. Use of Hierarchical Memory System :-

A hierarchical memory system can be used to close the speed gap between the CPU and main memory, because the CPU is about 1000 times faster than a main-memory access.

5. Balancing of Subsystem Bandwidth :- Consider the relation

tp < tm < td

(the processor cycle tp is shorter than the memory cycle tm, which in turn is shorter than the device cycle td)

Bandwidth of a System :-

The bandwidth of a system is defined as the number of operations performed per unit time.

Bandwidth of a Memory :-

The memory bandwidth is the number of words accessed per unit time, represented by Bm. If W is the total number of words accessed per memory cycle tm, then

Bm = W / tm (words/sec)

In the case of interleaved memory, memory access conflicts may cause delayed access to some of the processor requests.

Processor Bandwidth :-

Bp : maximum CPU computation rate.
Bpu : utilized processor bandwidth, i.e. the number of output results per second:

Bpu = Rw / Tp (word results/sec)

Rw : number of word results.
Tp : total CPU time to generate the Rw results.
Bd : bandwidth of devices (assumed as provided by the vendor).

The following relationship has been observed between the bandwidths of the major subsystems in a high-performance uniprocessor:

Bm ≥ Bmu ≥ Bp ≥ Bpu ≥ Bd
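The bandwidth definitions are simple ratios and can be illustrated numerically; the figures below (a 4-way interleaved memory, a hypothetical result count and CPU time) are assumptions for the example, not from the slides.

```python
# Memory bandwidth Bm = W / tm
W = 4                    # words accessed per memory cycle (assumed 4-way interleaving)
tm = 100e-9              # assumed memory cycle time: 100 ns
Bm = W / tm              # words per second

# Utilized processor bandwidth Bpu = Rw / Tp
Rw = 2_000_000           # word results produced (assumed)
Tp = 0.5                 # total CPU time in seconds (assumed)
Bpu = Rw / Tp            # results per second

print(Bm, Bpu)           # roughly 4e7 words/s and 4e6 results/s
```

With these numbers Bm exceeds Bpu, consistent with the ordering of subsystem bandwidths stated above.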


Due to these unbalanced speeds, we need to match the processing power of the three subsystems. Two major approaches are described below :-

1. Bandwidth balancing b/w CPU & memory :-

Using a fast cache with access time tc ≈ tp.

2. Bandwidth balancing b/w memory & I/O :-

Intelligent disk controllers can be used to filter out irrelevant data off the tracks. Buffering can be performed by I/O channels.


6a. Multiprogramming :-

As we know, some computer programs are CPU-bound and some are I/O-bound. Whenever a process P1 is tied up with I/O operations, the system scheduler can switch the CPU to process P2. This allows simultaneous execution of several programs in the system. This interleaving of CPU and I/O operations among several programs is called multiprogramming, and it reduces the total execution time.


6b. Time Sharing :- In multiprogramming, a high-priority program may sometimes occupy the CPU for too long to allow others to share. This problem can be overcome by using a time-sharing operating system. The concept extends multiprogramming by assigning fixed or variable time slices to multiple programs; in other words, equal opportunities are given to all programs competing for the use of the CPU. Time sharing is particularly effective when applied to a computer system connected to many interactive terminals.