CSE 8383 - Advanced Computer Architecture Week-4 Week of Feb 2, 2004 engr.smu.edu/~rewini/8383.

Contents

- Reservation Table
- Latency Analysis
- State Diagrams
- MAL and its bounds
- Delay Insertion
- Throughput
- Group Work
- Introduction to Multiprocessors

Reservation Table

A reservation table displays the time-space flow of data through the pipeline for one function evaluation.

A static pipeline is specified by a single reservation table.

A dynamic pipeline may be specified by multiple reservation tables.

Static Pipeline

[Figure: reservation table of a static pipeline, with stages S1 to S4 as rows, time as columns, and four X marks]

Dynamic Pipeline

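Reservation table for function X:

Cycle   1  2  3  4  5  6  7  8
S1      X              X     X
S2         X     X
S3            X     X     X

Reservation table for function Y:

Cycle   1  2  3  4  5  6
S1      Y           Y
S2            Y
S3         Y     Y     Y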

Reservation Table (Cont.)

The number of columns in a reservation table is called the evaluation time of a given function.

The checkmarks in a row correspond to the time instants (cycles) in which a particular stage will be used.

Multiple checkmarks in a row: repeated usage of the same stage in different cycles.

Reservation Table (Cont.)

Contiguous checkmarks in a row: extended usage of a stage over more than one cycle.

Multiple checkmarks in one column: multiple stages are used in parallel.

A dynamic pipeline may allow different initiations to follow a mix of reservation tables.

Reservation Table

1 2 3 4 5 6 7

A X X X

B X X

C X X

D X

Latency Analysis

The number of cycles between two initiations is the latency between them.

A latency of k: two initiations are separated by k cycles.

Collision: a resource conflict between two initiations.

Forbidden latencies: latencies that cause collisions.
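Forbidden latencies can be read straight off a reservation table: latency k is forbidden when some row has two checkmarks k cycles apart, because a second initiation k cycles later would then need that stage in a cycle the first one already holds. The minimal sketch below assumes a hypothetical representation (a Python dict from stage name to the set of cycles in which it is marked) and assumes the cycle positions of the dynamic-pipeline example: X uses S1 in cycles 1, 6, 8, S2 in 2, 4, S3 in 3, 5, 7; Y uses S1 in 1, 5, S2 in 3, S3 in 2, 4, 6. These positions reproduce the forbidden latencies listed for X and Y on the collision-vector slides.

```python
# Minimal sketch (assumed layout): stage name -> set of 1-based cycles marked.
TABLE_X = {"S1": {1, 6, 8}, "S2": {2, 4}, "S3": {3, 5, 7}}
TABLE_Y = {"S1": {1, 5}, "S2": {3}, "S3": {2, 4, 6}}


def evaluation_time(table):
    """Number of columns, taken here as the last cycle in which any stage is marked."""
    return max(max(cycles) for cycles in table.values())


def forbidden_latencies(table):
    """Latency k is forbidden if some stage is marked at cycles t and t + k."""
    return sorted({b - a
                   for cycles in table.values()
                   for a in cycles
                   for b in cycles
                   if b > a})


print(evaluation_time(TABLE_X), forbidden_latencies(TABLE_X))  # 8 [2, 4, 5, 7]
print(evaluation_time(TABLE_Y), forbidden_latencies(TABLE_Y))  # 6 [2, 4]
```

Only differences within a single row matter; marks in different rows never conflict because they belong to different stages.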

Collision with latency 2 & 5 in evaluating X

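[Figure: space-time diagrams of stages S1, S2, S3 for two initiations X1 and X2 of function X, separated by latency 5 and by latency 2; in both cases X1 and X2 require the same stage in the same cycle]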

Latency Analysis (cont.)

Latency Sequence: a sequence of permissible latencies between successive initiations.

Latency Cycle: a latency sequence that repeats the same subsequence (cycle) indefinitely.

Example: the latency sequence 1, 8, 1, 8, 1, 8, … is the latency cycle (1, 8).

Latency Analysis (cont.)

Average Latency (of a latency cycle): the sum of all latencies divided by the number of latencies along the cycle.

Constant Cycle: a latency cycle that contains only one latency value.

Objective: obtain the shortest average latency between initiations without causing collisions.
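The two quantities in the objective, the average latency of a cycle and whether the cycle is collision-free, can both be checked by brute force. This sketch is an illustration rather than the course's method: it simulates a number of initiations spaced by the latency cycle and fails as soon as two of them reserve the same stage in the same cycle. `TABLE_X` assumes the function X positions S1: 1, 6, 8; S2: 2, 4; S3: 3, 5, 7, and the helper names are hypothetical.

```python
from itertools import accumulate, cycle, islice

# Assumed reservation table for function X: stage -> 1-based cycles used.
TABLE_X = {"S1": {1, 6, 8}, "S2": {2, 4}, "S3": {3, 5, 7}}


def average_latency(latency_cycle):
    """Sum of the latencies in the cycle divided by their number."""
    return sum(latency_cycle) / len(latency_cycle)


def initiation_times(latency_cycle, count):
    """Start cycles of the first `count` initiations; the first starts at cycle 1."""
    gaps = islice(cycle(latency_cycle), count - 1)
    return [1] + [1 + s for s in accumulate(gaps)]


def is_collision_free(table, latency_cycle, count=20):
    """Simulate `count` initiations and report whether any stage is double-booked.

    A finite prefix is enough for short latency cycles like these, because two
    initiations further apart than the evaluation time can never overlap.
    """
    busy = set()  # (stage, cycle) slots already reserved
    for start in initiation_times(latency_cycle, count):
        for stage, cycles in table.items():
            for c in cycles:
                slot = (stage, start + c - 1)
                if slot in busy:
                    return False
                busy.add(slot)
    return True


for lc in [(1, 8), (6,), (3,), (2,), (5,)]:
    print(lc, average_latency(lc), is_collision_free(TABLE_X, lc))
```

Running it confirms that (1, 8) and (6) are collision-free for X with average latencies 4.5 and 6, while the constant cycles (2) and (5) collide, as the latency 2 and 5 diagrams show.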

Latency Cycle (1,8)

Cycle   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
S1     X1 X2          X1 X2 X1 X2 X3 X4          X3 X4 X3 X4 X5 X6
S2        X1 X2 X1 X2                X3 X4 X3 X4                X5 X6
S3           X1 X2 X1 X2 X1 X2          X3 X4 X3 X4 X3 X4          X5

Average Latency = (1+8)/2 = 4.5
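The space-time diagram for a latency cycle can be generated mechanically from the reservation table. The sketch below is a hypothetical helper, not part of the original slides: it starts X1 at cycle 1, spaces later initiations by repeating the latency cycle, places label Xk in every cycle where initiation k uses a stage, and raises an error if two initiations claim the same slot. It assumes the function X positions S1: 1, 6, 8; S2: 2, 4; S3: 3, 5, 7.

```python
from itertools import cycle

# Assumed reservation table for function X: stage -> 1-based cycles used.
TABLE_X = {"S1": {1, 6, 8}, "S2": {2, 4}, "S3": {3, 5, 7}}


def space_time(table, latency_cycle, horizon):
    """Return {stage: {cycle: "Xk"}} for initiations X1, X2, ... up to `horizon`."""
    grid = {stage: {} for stage in table}
    gaps = cycle(latency_cycle)
    start, k = 1, 1
    while start <= horizon:
        for stage, cycles in table.items():
            for c in cycles:
                t = start + c - 1
                if t > horizon:
                    continue
                if t in grid[stage]:
                    raise ValueError(f"collision in {stage} at cycle {t}")
                grid[stage][t] = f"X{k}"
        start += next(gaps)
        k += 1
    return grid


def render(grid, horizon):
    """Format the grid as a text table with one 3-character column per cycle."""
    cols = range(1, horizon + 1)
    lines = ["Cycle " + "".join(f"{t:>3}" for t in cols)]
    for stage, row in grid.items():
        cells = "".join(f"{row.get(t, ''):>3}" for t in cols)
        lines.append((f"{stage:<6}" + cells).rstrip())
    return "\n".join(lines)


print(render(space_time(TABLE_X, (1, 8), 21), 21))
```

Passing `(6,)` instead of `(1, 8)` produces the constant-cycle diagram, and a forbidden cycle such as `(2,)` raises the collision error.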

Latency Cycle (6)

Cycle   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
S1     X1             X1 X2 X1          X2 X3 X2          X3 X4 X3
S2        X1    X1          X2    X2          X3    X3          X4
S3           X1    X1    X1    X2    X2    X2    X3    X3    X3    X4

Average Latency = 6

Collision Vector

C = (Cm, Cm-1, …, C2, C1)

Ci = 1 if latency i causes a collision (forbidden)

Ci = 0 if latency i is permissible

Cm = 1 always, where m is the maximum forbidden latency

Maximum forbidden latency: m <= n - 1, where n is the number of columns in the reservation table
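Building the collision vector from a set of forbidden latencies is a direct translation of the definition: the vector has m bits, written from Cm down to C1, and bit Ci is 1 exactly when latency i is forbidden. A minimal sketch (the function name is hypothetical), using the forbidden latencies given for X and Y:

```python
def collision_vector(forbidden):
    """Return the collision vector (Cm ... C1) as a bit string.

    m is the largest forbidden latency, so the leftmost bit Cm is always 1.
    """
    m = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))


print(collision_vector({2, 4, 5, 7}))  # 1011010
print(collision_vector({2, 4}))        # 1010
```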

Collision Vector (X after X)

Forbidden Latencies: 2, 4, 5, 7

Collision Vector = 1 0 1 1 0 1 0

Collision Vector (Y after Y)

Forbidden Latencies: 2, 4

Collision Vector = 1 0 1 0

State Diagram

It specifies the permissible state transitions among successive initiations.

The collision vector corresponds to the initial state at time t = 1 (initial collision vector).

The next state comes at time t + p, where p is a permissible latency in the range 1 <= p < m.

Right Shift Register

The next state can be obtained with the help of an m-bit shift register.

[Figure: the register shifts right; a 0 shifted out means it is safe to allow an initiation, and a 1 shifted out means a collision]

Each 1-bit shift corresponds to an increase in the latency by 1.

The next state

The next state is obtained by bitwise ORing the initial collision vector with the shifted register.

C.V. = 1 0 1 1 0 1 0   (first state)

       0 1 0 1 1 0 1   C.V. 1-bit right shifted
       1 0 1 1 0 1 0   initial C.V.
       -------------   OR
       1 1 1 1 1 1 1
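The shift-and-OR step maps naturally onto integer bit operations if the collision vector string is read as a binary number, so that C1 is the least significant bit. A minimal sketch with hypothetical helper names: a latency p is permissible when bit Cp of the current state is 0, and the next state is the current state shifted right by p, ORed with the initial collision vector.

```python
def next_state(state, initial, p):
    """Latency p: shift the register right by p bits, then OR in the initial C.V."""
    return (state >> p) | initial


def is_permissible(state, p):
    """Latency p is permissible in `state` if bit Cp (bit p-1 from the right) is 0."""
    return (state >> (p - 1)) & 1 == 0


CV_X = int("1011010", 2)  # C7 ... C1, with C1 as the least significant bit
for p in (1, 3, 6, 8):
    assert is_permissible(CV_X, p)
    print(p, format(next_state(CV_X, CV_X, p), "07b"))
```

Latency 1 reproduces the worked OR above (1111111); latencies 3 and 6 both lead to 1011010 OR a small shifted remainder, giving 1011011, and latency 8 shifts everything out and returns to the initial state.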

State Diagram for X

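States: 1011010 (initial), 1111111, 1011011

Transitions (edge labels are latencies; 8+ means a latency of 8 or more):
1011010 --1*--> 1111111
1011010 --3, 6--> 1011011
1011010 --8+--> 1011010
1111111 --8+--> 1011010
1011011 --3*, 6--> 1011011
1011011 --8+--> 1011010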
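The state diagram can be built by a breadth-first search from the initial collision vector, applying the shift-and-OR rule for every permissible latency. In this sketch (hypothetical function name, integer encoding with C1 as the least significant bit), latency m + 1 stands for every "m+ or more" edge, since any latency of at least m clears the register and returns to the initial state.

```python
from collections import deque


def state_diagram(cv_bits):
    """Build {state: [(latency, next_state), ...]} reachable from the initial C.V.

    Latency m + 1 represents the "(m+1)+" edges: every latency of m or more
    shifts all bits out, so the next state is the initial C.V. itself.
    """
    initial, m = int(cv_bits, 2), len(cv_bits)
    edges, queue = {}, deque([initial])
    while queue:
        s = queue.popleft()
        if s in edges:
            continue
        edges[s] = [(p, (s >> p) | initial)
                    for p in range(1, m + 2)
                    if not (s >> (p - 1)) & 1]  # keep only permissible latencies
        queue.extend(t for _, t in edges[s] if t not in edges)
    return edges


m = 7
for s, out in state_diagram("1011010").items():
    print(format(s, f"0{m}b"), [(p, format(t, f"0{m}b")) for p, t in out])
```

For X it finds the three states 1011010, 1111111 and 1011011 with the transitions listed on this slide.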

Cycles

Simple cycles: each state appears only once. For X: (3), (6), (8), (1, 8), (3, 8), and (6, 8).

Greedy cycles: simple cycles whose edges are all made with the minimum latencies from their respective starting states. For X: (1, 8) and (3). One of them yields the MAL; here it is (3), with an average latency of 3 versus 4.5 for (1, 8).

MAL (Minimum Average Latency)

At least one of the greedy cycles will lead to the MAL.

Consider the state diagram for Y: its MAL is 3 (see the state diagram for Y).
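For diagrams this small, the cycle analysis can be checked exhaustively. The sketch below is an illustration, not Shar's method or the course's procedure: it rebuilds the state diagram, enumerates every simple cycle by depth-first search (reporting each cycle once, from its earliest-discovered state), marks as greedy the cycles whose every edge is the smallest latency leaving its state, and takes the MAL as the smallest average over all simple cycles. That last step is valid because any longer latency cycle decomposes into simple cycles, so its average cannot beat the best simple one. Latency m + 1 stands for the "+" edges; all function names are hypothetical.

```python
from collections import deque


def state_diagram(cv_bits):
    """{state: [(latency, next_state), ...]}; latency m + 1 stands for "m+1 or more"."""
    initial, m = int(cv_bits, 2), len(cv_bits)
    edges, queue = {}, deque([initial])
    while queue:
        s = queue.popleft()
        if s in edges:
            continue
        edges[s] = [(p, (s >> p) | initial)
                    for p in range(1, m + 2)
                    if not (s >> (p - 1)) & 1]
        queue.extend(t for _, t in edges[s] if t not in edges)
    return edges


def simple_cycles(edges):
    """Each simple cycle as a list of (from_state, latency) steps, found once."""
    order = {s: i for i, s in enumerate(edges)}
    found = []

    def dfs(start, s, visited, steps):
        for p, t in edges[s]:
            if t == start:
                found.append(steps + [(s, p)])
            elif t not in visited and order[t] > order[start]:
                # only extend through later states so each cycle is reported once
                dfs(start, t, visited | {t}, steps + [(s, p)])

    for start in edges:
        dfs(start, start, {start}, [])
    return found


def greedy_cycles_and_mal(cv_bits):
    edges = state_diagram(cv_bits)
    cycles = simple_cycles(edges)
    min_latency = {s: min(p for p, _ in out) for s, out in edges.items()}
    greedy = [tuple(p for _, p in c) for c in cycles
              if all(p == min_latency[s] for s, p in c)]
    mal = min(sum(p for _, p in c) / len(c) for c in cycles)
    return greedy, mal


print(greedy_cycles_and_mal("1011010"))  # ([(1, 8), (3,)], 3.0)
print(greedy_cycles_and_mal("1010"))     # ([(1, 5), (3,)], 3.0)
```

The output matches the slides: X has greedy cycles (1, 8) and (3), and both X and Y reach an MAL of 3.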

State Diagram for Y

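States: 1010 (initial), 1111, 1011

Transitions (edge labels are latencies; 5+ means a latency of 5 or more):
1010 --1*--> 1111
1010 --3--> 1011
1010 --5+--> 1010
1111 --5+--> 1010
1011 --3*--> 1011
1011 --5+--> 1010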

Bounds on the MAL

The MAL is lower-bounded by the maximum number of checkmarks in any row of the reservation table. (Shar, 1972)

The MAL is lower than or equal to the average latency of any greedy cycle in the state diagram. (Shar, 1972)

The average latency of any greedy cycle is upper-bounded by the number of 1's in the initial collision vector plus 1. This is also an upper bound on the MAL. (Shar, 1972)
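Both of Shar's bounds are easy to compute from a reservation table: the lower bound is the largest number of checkmarks in one row, and the upper bound is the number of forbidden latencies (the 1's in the initial collision vector) plus 1. The sketch below assumes the X and Y cycle positions S1: 1, 6, 8; S2: 2, 4; S3: 3, 5, 7 for X and S1: 1, 5; S2: 3; S3: 2, 4, 6 for Y; `mal_bounds` is a hypothetical name.

```python
# Assumed reservation tables: stage -> 1-based cycles used.
TABLE_X = {"S1": {1, 6, 8}, "S2": {2, 4}, "S3": {3, 5, 7}}
TABLE_Y = {"S1": {1, 5}, "S2": {3}, "S3": {2, 4, 6}}


def mal_bounds(table):
    """Return (lower, upper) bounds on the MAL for one reservation table."""
    forbidden = {b - a
                 for cycles in table.values()
                 for a in cycles
                 for b in cycles
                 if b > a}
    lower = max(len(cycles) for cycles in table.values())  # most checkmarks in one row
    upper = len(forbidden) + 1  # number of 1's in the initial C.V., plus 1
    return lower, upper


print(mal_bounds(TABLE_X))  # (3, 5)
print(mal_bounds(TABLE_Y))  # (3, 3)
```

For X the MAL of 3 found from its state diagram sits exactly on the lower bound, and for Y the two bounds coincide, which pins its MAL at 3.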

Delay Insertion

The purpose is to modify the reservation table, yielding a new collision vector.

This may lead to a modified state diagram, which may produce greedy cycles meeting the lower bound on the MAL.

Example

[Figure: pipeline with stages S1, S2, S3 and an output]

Example (Cont.)

1 2 3 4 5

S1 X X

S2 X X

S3 X X

Forbidden Latencies: 1, 2, 4

C.V. = 1 0 1 1

Example (Cont.) State Diagram

[State diagram: a single state 1011 with two self-loops, labeled 3* and 5+]

MAL = 3

Example (Cont.)

[Figure: the pipeline with stages S1, S2, S3 and an output, with delay elements D1 and D2 inserted]

Example (Cont.)

1 2 3 4 5 6 7

S1 X X

S2 X X

S3 X X

D1 X

D2 X

Forbidden Latencies: 2, 6

C.V. = 1 0 0 0 1 0
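The effect of delay insertion can be checked from the two collision vectors given in this example (1 0 1 1 before, 1 0 0 0 1 0 after) without needing the exact table layout. This sketch, with hypothetical function names, rebuilds each state diagram and returns the smallest average latency over its simple cycles; latency m + 1 stands for the "+" edges.

```python
from collections import deque


def state_diagram(cv_bits):
    """{state: [(latency, next_state), ...]}; latency m + 1 stands for "m+1 or more"."""
    initial, m = int(cv_bits, 2), len(cv_bits)
    edges, queue = {}, deque([initial])
    while queue:
        s = queue.popleft()
        if s in edges:
            continue
        edges[s] = [(p, (s >> p) | initial)
                    for p in range(1, m + 2)
                    if not (s >> (p - 1)) & 1]
        queue.extend(t for _, t in edges[s] if t not in edges)
    return edges


def mal(cv_bits):
    """Smallest average latency over all simple cycles of the state diagram."""
    edges = state_diagram(cv_bits)
    order = {s: i for i, s in enumerate(edges)}
    best = float("inf")

    def dfs(start, s, visited, total, length):
        nonlocal best
        for p, t in edges[s]:
            if t == start:
                best = min(best, (total + p) / (length + 1))
            elif t not in visited and order[t] > order[start]:
                dfs(start, t, visited | {t}, total + p, length + 1)

    for start in edges:
        dfs(start, start, {start}, 0, 0)
    return best


print(mal("1011"))    # before delay insertion: 3.0
print(mal("100010"))  # after inserting D1 and D2: 2.0
```

With the delays, the cycle (1, 3) becomes available and the MAL drops to 2, which equals the lower bound of two checkmarks per row in the modified table.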

Group Activity 1

Find the State Diagram

Pipeline Throughput

The average number of task initiations per clock cycle.

When the pipeline is scheduled with a cycle that achieves the MAL, the throughput is the inverse of the MAL.
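Turning the MAL into a rate is a line of arithmetic: 1/MAL initiations per cycle, divided by the clock period t, gives initiations per second. This sketch assumes the pipeline runs a schedule that achieves the MAL; the MAL of 3 (function X) and the 20 ns clock are illustrative inputs, and `throughput` is a hypothetical name.

```python
def throughput(mal, clock_period_ns):
    """Return (initiations per cycle, initiations per second) at the MAL."""
    per_cycle = 1 / mal
    per_second = per_cycle / (clock_period_ns * 1e-9)
    return per_cycle, per_second


per_cycle, per_second = throughput(3, 20)  # MAL = 3, hypothetical 20 ns clock
print(f"{per_cycle:.3f} initiations/cycle, {per_second / 1e6:.1f} million initiations/s")
```

Here one initiation every 3 cycles of 20 ns means one every 60 ns, roughly 16.7 million initiations per second.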

Group Activity 2

1 2 3 4

S1 X X

S2 X

S3 X

Find: the C.V., the state diagram, the simple cycles, the greedy cycles, the MAL, and the throughput (t = 20 ns).

Multiprocessors

Introduction

Uniprocessor systems are not capable of delivering solutions to some problems in reasonable time.

Multiple processors cooperate to jointly execute a single computational task in order to speed up its execution.

Speed-up versus Quality-up

Architecture Background

Three major components:
- Processors
- Memory Modules
- Interconnection Network

Parallel and Distributed Computers
- MIMD Shared Memory
  - Bus based
  - Switch based
  - CC-NUMA
- MIMD Distributed Memory
- SIMD Computers
- Clusters
- Grid Computing

MIMD Shared Memory Systems

[Figure: processors (P) connected through interconnection networks to shared memory modules (M)]

Bus-Based & Switch-Based SM Systems

[Figure: processors (P) with caches (C) sharing a global memory; processors (P) with caches (C) connected to memory modules (M)]

Cache Coherent NUMA

[Figure: nodes, each with a processor (P), a cache (C), and a memory module (M), connected by an interconnection network]

MIMD Distributed Memory Systems

[Figure: processors (P), each with its own memory module (M), connected by interconnection networks]

SIMD Computers

[Figure: a von Neumann computer (processor and memory) and an array of 16 processor-memory (PM) elements linked by some interconnection network]

Clusters

[Figure: nodes, each with memory (M), cache (C), processor (P), I/O, and its own OS, connected by an interconnection network, with middleware and a programming environment spanning the nodes]

Grids

Grids are geographically distributed platforms for computation.

They provide dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.

Interconnection Network Taxonomy

Interconnection Network
- Static
  - 1-D
  - 2-D
  - HC (hypercube)
- Dynamic
  - Bus-based
    - Single
    - Multiple
  - Switch-based
    - SS (single-stage)
    - MS (multistage)
    - Crossbar