Page 1: CS4 Parallel Architectures - Introduction

CS4/MSc Parallel Architectures - 2009-2010

Instructor: Marcelo Cintra (mc@staffmail.ed.ac.uk – 1.03 IF)
Lectures: Tue and Fri in G0.9 WRB at 10am
Pre-requisites: CS3 Computer Architecture
Practicals:
– Practical 1 – out week 3 (26/1/10); due week 5 (09/2/10)
– Practical 2 – out week 5 (09/2/10); due week 7 (23/2/10)
– Practical 3 – out week 7 (23/2/10); due week 9 (09/3/10) (MSc only)
– Practical 4 – out week 7 (26/2/10); due week 9 (12/3/10)
Books:

– (**) Culler & Singh - Parallel Computer Architecture: A Hardware/Software Approach – Morgan Kaufmann

– (*) Hennessy & Patterson - Computer Architecture: A Quantitative Approach – Morgan Kaufmann – 3rd or 4th editions

Lecture slides (no lecture notes)
More info: www.inf.ed.ac.uk/teaching/courses/pa/

Page 2: Topics

Fundamental concepts

– Performance issues
– Parallelism in software

Uniprocessor parallelism
– Pipelining, superscalar, and VLIW processors
– Vector, SIMD processors

Interconnection networks
– Routing, static and dynamic networks
– Combining networks

Multiprocessors, Multicomputers, and Multithreading
– Shared memory and message passing systems
– Cache coherence and memory consistency

Performance and scalability

Page 3: Lect. 1: Performance Issues

Why parallel architectures?

– Performance of sequential architecture is limited (by technology and ultimately by the laws of physics)

– Relentless increase in computing resources (transistors for logic and memory) that can no longer be exploited for sequential processing

– At any point in time many important applications cannot be solved with the best existing sequential architecture

Uses of parallel architectures
– To solve a single problem faster (e.g., simulating protein folding: researchweb.watson.ibm.com/bluegene)
– To solve a larger version of a problem (e.g., weather forecast: www.jamstec.go.jp/esc)
– To solve many problems at the same time (e.g., transaction processing)

Page 4: Limits to Sequential Execution

Speed of light limit

– Computation/data flow through logic gates, memory devices, and wires

– At each of these there is a non-zero delay that is, at a minimum, the propagation delay at the speed of light

– Thus, the speed of light and the minimum physical feature sizes impose a hard limit on the speed of any sequential computation

Von Neumann’s limit
– Programs consist of an ordered sequence of instructions
– Instructions are stored in memory and must be fetched in order (same for data)
– Thus, sequential computation is ultimately limited by the memory bandwidth

Page 5: Examples of Parallel Architectures

An ARM processor in a common mobile phone has 10s of instructions in-flight in its pipeline

Pentium IV executes up to 6 microinstructions per cycle and has up to 126 microinstructions in-flight

Intel’s quad-core chips have four processors and are now in mainstream desktops and laptops

Japan’s Earth Simulator has 5120 vector processors, each with 8 vector pipelines

IBM’s largest BlueGene supercomputer has 131,072 processors

Google has about 100,000 Linux machines connected in several cluster farms


Page 6: Comparing Execution Times

Example:
  system A: TA = execution time of program P on A
  system B: TB = execution time of program P’ on B

Notes:
– For fairness, P and P’ must be the “best possible implementation” on each system
– If multiple programs are run, then report the weighted arithmetic mean
– Must report all details, such as: input set, compiler flags, command line arguments, etc.

Speedup: S = TB / TA ; we say: A is S times faster,
or A is ((TB / TA) × 100 − 100)% faster
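A quick numeric check of the two ways of reporting the same comparison, as a minimal C sketch (the example times are invented):

    #include <stdio.h>

    int main(void) {
        /* invented measurements: program takes 40s on B and 25s on A */
        double t_b = 40.0, t_a = 25.0;

        double speedup = t_b / t_a;                   /* S = TB / TA    */
        double pct     = (t_b / t_a) * 100.0 - 100.0; /* percent faster */

        printf("A is %.2f times faster, or %.0f%% faster\n", speedup, pct);
        return 0;
    }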

Page 7: Amdahl’s Law

Let:
  F = fraction of the problem that can be optimized
  Sopt = speedup obtained on the optimized fraction

e.g.: F = 0.5 (50%), with Sopt = 10 and with Sopt = ∞

Bottom-line: performance improvements must be balanced

Soverall = 1 / ((1 – F) + F/Sopt)

With Sopt = 10:  Soverall = 1 / ((1 – 0.5) + 0.5/10) = 1.8
With Sopt = ∞:   Soverall = 1 / ((1 – 0.5) + 0) = 2

Page 8: Amdahl’s Law and Efficiency

Let:
  F = fraction of the problem that can be parallelized
  Spar = speedup obtained on the parallelized fraction
  P = number of processors

e.g.: 16 processors (Spar = 16), F = 0.9 (90%)

Bottom-line: for good scalability E>50%; when resources are “free” then lower efficiencies are acceptable

Soverall = 1 / ((1 – F) + F/Spar)

Soverall = 1 / ((1 – 0.9) + 0.9/16) = 6.4

E = Soverall / P

E = 6.4 / 16 = 0.4 (40%)
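Both formulas are easy to tabulate; a minimal C sketch using the numbers from the examples above (the amdahl() helper is ours, not from the slides):

    #include <stdio.h>

    /* Amdahl's law: overall speedup when a fraction f is sped up by s */
    static double amdahl(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        double f = 0.9;                  /* 90% parallelizable          */
        int    p = 16;                   /* processors; assume Spar = P */

        double s = amdahl(f, (double)p);
        printf("Soverall = %.1f, E = %.0f%%\n", s, 100.0 * s / p); /* 6.4, 40% */
        printf("F=0.5, Sopt=10: %.1f\n", amdahl(0.5, 10.0));       /* 1.8 */
        return 0;
    }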

Page 9: Performance Trends: Computer Families

Bottom-line: microprocessors have become the building blocks of most computer systems across the whole range of price-performance

[Figure: performance (log scale, 0.1–100) vs. year for computer families; microprocessor performance overtakes minicomputers and mainframes – Culler and Singh, Fig. 1.1]

Page 10: Technological Trends: Moore’s Law

Bottom-line: overwhelming number of transistors allow for incredibly complex and highly integrated systems

[Figure: transistors (×1000, log scale from 1 to 10,000,000) vs. year (1970–2010) for Intel CPUs: 4004, 8086, 80286, 80486, Pentium, Pentium II, Pentium III, Pentium IV, Core Duo, Xeon Multi-Core]

Page 11: Tracking Technology: The Role of CA

Bottom-line: architectural innovations complement technological improvements

[Figure: SPECint rating vs. year (1984–2000), growing at 1.35x/year early on and 1.58x/year later; data points include MIPS R2000, IBM Power1, HP 9000, DEC Alpha, and Intel Pentium III – H&P, Fig. 1.1]

Page 12: The Memory Gap

Bottom-line: memory access is increasingly expensive and CA must devise new ways of hiding this cost

[Figure: relative performance (log scale) of CPU vs. memory, 1980–2005, showing the widening processor–memory gap – H&P, Fig. 5.2]

Page 13: Software Trends

Ever larger applications: memory requirements double every year
More powerful compilers and an increasing role of compilers in performance
Novel applications with different demands:
  e.g., multimedia
  – Streaming data
  – Simple fixed operations on regular and small data → MMX-like instructions
  e.g., web-based services
  – Huge data sets with little locality of access
  – Simple data lookups and processing → Transactional Memory(?) (www.cs.wisc.edu/trans-memory)

Bottom-line: architecture/compiler co-design

Page 14: Current Trends in CA

Very complex processor designs:
– Hybrid branch prediction (MIPS R14000)
– Out-of-order execution (Pentium IV)
– Multi-banked on-chip caches (Alpha 21364)
– EPIC (Explicitly Parallel Instruction Computing) (Intel Itanium)

Parallelism and integration at the chip level:
– Chip-multiprocessors (CMP) (Sun T2, IBM Power6, Intel Itanium 2)
– Multithreading (Intel Hyperthreading, IBM Power6, Sun T2)
– Embedded Systems On a Chip (SOC)

Multiprocessors:
– Servers (Sun Fire, SGI Origin)
– Supercomputers (IBM BlueGene, SGI Origin, IBM HPCx)
– Clusters of workstations (Google server farm)

Power-conscious designs

Page 15: Lect. 2: Types of Parallelism

Parallelism in Hardware (Uniprocessor)
– Parallel arithmetic
– Pipelining
– Superscalar, VLIW, SIMD, and vector execution

Parallelism in Hardware (Multiprocessor)
– Chip-multiprocessors a.k.a. multi-cores
– Shared-memory multiprocessors
– Distributed-memory multiprocessors
– Multicomputers a.k.a. clusters

Parallelism in Software
– Tasks
– Data parallelism
– Data streams

(note: a “processor” must be capable of independent control and of operating on non-trivial data types)

Page 16: Taxonomy of Parallel Computers

According to instruction and data streams (Flynn):
– Single instruction, single data (SISD): this is the standard uniprocessor
– Single instruction, multiple data streams (SIMD):
  The same instruction is executed in all processors with different data
  E.g., graphics processing
– Multiple instruction, single data streams (MISD):
  Different instructions on the same data
  Never used in practice
– Multiple instruction, multiple data streams (MIMD): the “common” multiprocessor
  Each processor uses its own data and executes its own program (or part of the program)
  Most flexible approach
  Easier/cheaper to build by putting together “off-the-shelf” processors

Page 17: Taxonomy of Parallel Computers

According to physical organization of processors and memory:
– Physically centralized memory, uniform memory access (UMA)
  All memory is allocated at the same distance from all processors
  Also called symmetric multiprocessors (SMP)
  Memory bandwidth is fixed and must accommodate all processors → does not scale to a large number of processors
  Used in most CMPs today (e.g., IBM Power5, Intel Core Duo)

[Diagram: UMA organization – four CPUs, each with its own cache, connected through an interconnection to a single main memory]

Page 18: Taxonomy of Parallel Computers

According to physical organization of processors and memory:
– Physically distributed memory, non-uniform memory access (NUMA)
  A portion of memory is allocated with each processor (node)
  Accessing local memory is much faster than accessing remote memory
  If most accesses are to local memory, then overall memory bandwidth increases linearly with the number of processors

[Diagram: NUMA organization – each node pairs a CPU and its cache with a local memory; the nodes communicate through an interconnection]

Page 19: Taxonomy of Parallel Computers

According to memory communication model:
– Shared address or shared memory
  Processes in different processors can use the same virtual address space
  Any processor can directly access memory in another processor’s node
  Communication is done through shared memory variables
  Explicit synchronization with locks and critical sections
  Arguably easier to program
– Distributed address or message passing
  Processes in different processors use different virtual address spaces
  Each processor can only directly access memory in its own node
  Communication is done through explicit messages
  Synchronization is implicit in the messages
  Arguably harder to program
  Standard message passing libraries exist (e.g., MPI)

Page 20: Shared Memory vs. Message Passing

Shared memory:

  Producer (p1)      Consumer (p2)
  flag = 0;          flag = 0;
  ...                ...
  a = 10;            while (!flag) {}
  flag = 1;          x = a * y;

Message passing:

  Producer (p1)          Consumer (p2)
  ...                    ...
  a = 10;                receive(p1, b, label);
  send(p2, a, label);    x = b * y;
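For concreteness, a minimal runnable C/pthreads version of the shared-memory exchange above (the spin-wait mirrors the slide; the C11 atomic flag is our addition so the busy-wait is well-defined, and a real program would more likely use a condition variable):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int a;                      /* shared data, as in the slide */
    static atomic_int flag = 0;        /* shared synchronization flag  */

    static void *producer(void *arg) {
        (void)arg;
        a = 10;                        /* produce the value            */
        atomic_store(&flag, 1);        /* then publish it              */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        int y = 3;
        while (!atomic_load(&flag)) {} /* spin until the value is published */
        printf("x = %d\n", a * y);     /* x = a * y                    */
        return NULL;
    }

    int main(void) {
        pthread_t p1, p2;
        pthread_create(&p1, NULL, producer, NULL);
        pthread_create(&p2, NULL, consumer, NULL);
        pthread_join(p1, NULL);
        pthread_join(p2, NULL);
        return 0;
    }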

Page 21: Types of Parallelism in Applications

Instruction-level parallelism (ILP)
– Multiple instructions from the same instruction stream can be executed concurrently
– Generated and managed by hardware (superscalar) or by the compiler (VLIW)
– Limited in practice by data and control dependences

Thread-level or task-level parallelism (TLP)
– Multiple threads or instruction sequences from the same application can be executed concurrently
– Generated by the compiler/user and managed by the compiler and hardware
– Limited in practice by communication/synchronization overheads and by algorithm characteristics

Page 22: Types of Parallelism in Applications

Data-level parallelism (DLP)
– Instructions from a single stream operate concurrently (temporally or spatially) on several data
– Limited by non-regular data manipulation patterns and by memory bandwidth

Transaction-level parallelism
– Multiple threads/processes from different transactions can be executed concurrently
– Sometimes not really considered as parallelism
– Limited by access to metadata and by interconnection bandwidth

Page 23: Example: Equation Solver Kernel

The problem:
– Operate on an (n+2)x(n+2) matrix
– Points on the rim have fixed values
– Inner points are updated as:

    A[i,j] = 0.2 x (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

– Updates are in-place, so top and left are new values and bottom and right are old ones
– Updates occur over multiple sweeps
– Keep the difference between old and new values and stop when the difference for all points is small enough

Page 24: Example: Equation Solver Kernel

Dependences:
– Computing the new value of a given point requires the new value of the point directly above and to the left
– By transitivity, it requires all points in the sub-matrix in the upper-left corner
– Points along the top-right to bottom-left diagonals can be computed independently

Page 25: Example: Equation Solver Kernel

ILP version (from sequential code):
– Machine instructions from each j iteration can occur in parallel
– Branch prediction allows overlap of multiple iterations of the j loop
– Some of the instructions from multiple j iterations can occur in parallel

while (!done) {
  diff = 0;
  for (i=1; i<=n; i++) {
    for (j=1; j<=n; j++) {
      temp = A[i,j];
      A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]
                   +A[i,j+1]+A[i+1,j]);
      diff += abs(A[i,j] - temp);
    }
  }
  if (diff/(n*n) < TOL) done=1;
}

Page 26: Example: Equation Solver Kernel

TLP version (shared-memory):

int mymin = 1 + (pid * n/P);
int mymax = mymin + n/P - 1;

while (!done) {
  diff = 0; mydiff = 0;
  for (i=mymin; i<=mymax; i++) {
    for (j=1; j<=n; j++) {
      temp = A[i,j];
      A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]
                   +A[i,j+1]+A[i+1,j]);
      mydiff += abs(A[i,j] - temp);
    }
  }
  lock(diff_lock);
  diff += mydiff;
  unlock(diff_lock);
  barrier(bar, P);
  if (diff/(n*n) < TOL) done=1;
  barrier(bar, P);
}

Page 27: Example: Equation Solver Kernel

TLP version (shared-memory) (for 2 processors):
– Each processor gets a chunk of rows
  E.g., processor 0 gets mymin=1 and mymax=2, and processor 1 gets mymin=3 and mymax=4

int mymin = 1 + (pid * n/P);
int mymax = mymin + n/P - 1;

while (!done) {
  diff = 0; mydiff = 0;
  for (i=mymin; i<=mymax; i++) {
    for (j=1; j<=n; j++) {
      temp = A[i,j];
      A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]
                   +A[i,j+1]+A[i+1,j]);
      mydiff += abs(A[i,j] - temp);
    }
  ...

Page 28: Example: Equation Solver Kernel

TLP version (shared-memory):
– All processors can freely access the same data structure A
– Access to diff, however, must be taken in turns
– All processors update their done variable together

  ...
  for (i=mymin; i<=mymax; i++) {
    for (j=1; j<=n; j++) {
      temp = A[i,j];
      A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]
                   +A[i,j+1]+A[i+1,j]);
      mydiff += abs(A[i,j] - temp);
    }
  }
  lock(diff_lock);
  diff += mydiff;
  unlock(diff_lock);
  barrier(bar, P);
  if (diff/(n*n) < TOL) done=1;
  barrier(bar, P);
}

Page 29: Types of Speedups and Scaling

Scalability: adding x times more resources to the machine yields close to x times better “performance”
– Usually the resources are processors, but they can also be memory size or interconnect bandwidth
– Usually means that with x times more processors we can get ~x times speedup for the same problem
– In other words: how does efficiency (see Lecture 1) hold up as the number of processors increases?

In reality we have different scalability models:
– Problem constrained
– Time constrained
– Memory constrained

The most appropriate scalability model depends on the user’s interests

Page 30: Types of Speedups and Scaling

Problem constrained (PC) scaling:
– Problem size is kept fixed
– Wall-clock execution time reduction is the goal
– Number of processors and memory size are increased
– “Speedup” is then defined as:

    SPC = Time(1 processor) / Time(p processors)

– Example: CAD tools that take days to run, weather simulation that does not complete in reasonable time

Page 31: Types of Speedups and Scaling

Time constrained (TC) scaling:
– Maximum allowable execution time is kept fixed
– Problem size increase is the goal
– Number of processors and memory size are increased
– “Speedup” is then defined as:

    STC = Work(p processors) / Work(1 processor)

– Example: weather simulation with a refined grid

Page 32: Types of Speedups and Scaling

Memory constrained (MC) scaling:
– Both problem size and execution time are allowed to increase
– The goal is to increase the problem size with the available memory with the smallest increase in execution time
– Number of processors and memory size are increased
– “Speedup” is then defined as:

    SMC = (Work(p processors) / Time(p processors)) x (Time(1 processor) / Work(1 processor))
        = Increase in Work / Increase in Time

– Example: astrophysics simulation with more planets and stars

Page 33: Lect. 3: Superscalar Processors I/II

Pipelining: several instructions are simultaneously at different stages of their execution
Superscalar: several instructions are simultaneously at the same stage of their execution
(Superpipelining: a very deep pipeline with very short stages to increase the amount of parallelism)
Out-of-order execution: instructions can be executed in an order different from that specified in the program

Dependences between instructions:
– Read after Write (RAW) (a.k.a. data dependence)
– Write after Read (WAR) (a.k.a. anti dependence)
– Write after Write (WAW) (a.k.a. output dependence)
– Control dependence

Speculative execution: tentative execution despite dependences

Page 34: A 5-stage Pipeline

[Diagram: IF → ID → EXE → MEM → WB, with the general registers read in ID and written in WB, and memory accessed in IF and MEM]

IF  = instruction fetch (includes PC increment)
ID  = instruction decode + fetching values from general purpose registers
EXE = arithmetic/logic operations or address computation
MEM = memory access or branch completion
WB  = write back results to general purpose registers

Page 35: A Pipelining Diagram

Start one instruction per clock cycle:

  cycle  1   2   3   4   5   6
  IF     I1  I2  I3  I4  I5  I6
  ID         I1  I2  I3  I4  I5
  EXE            I1  I2  I3  I4
  MEM                I1  I2  I3
  WB                     I1  I2

Each instruction still takes 5 cycles, but instructions now complete every cycle: CPI → 1

Page 36: Multiple-issue

Start two instructions per clock cycle:

  cycle  1      2      3      4      5       6
  IF     I1,I2  I3,I4  I5,I6  I7,I8  I9,I10  I11,I12
  ID            I1,I2  I3,I4  I5,I6  I7,I8   I9,I10
  EXE                  I1,I2  I3,I4  I5,I6   I7,I8
  MEM                         I1,I2  I3,I4   I5,I6
  WB                                 I1,I2   I3,I4

CPI → 0.5; IPC → 2

Page 37: A Pipelined Processor (DLX)

[Figure: DLX pipelined datapath – H&P, Fig. A.18]

Page 38: Advanced Superscalar Execution

Ideally: in an n-issue superscalar, n instructions are fetched, decoded, executed, and committed per cycle

In practice:
– Control flow changes spoil the fetch flow
– Data, control, and structural hazards spoil the issue flow
– Multi-cycle arithmetic operations spoil the execute flow

Buffers at issue (the issue window or issue queue) and at commit (the reorder buffer) decouple these stages from the rest of the pipeline and smooth out breaks in the flow

[Diagram: a fetch engine feeding an instruction buffer, then ID, EXE, MEM, WB, with a second instruction buffer before commit; general registers and memory are accessed as in the 5-stage pipeline]

Page 39: Problems At Instruction Fetch

Crossing instruction cache line boundaries
– e.g., 32-bit instructions and 32-byte instruction cache lines → 8 instructions per cache line; 4-wide superscalar processor
– More than one cache lookup may be required in the same cycle
– What if one of the line accesses is a cache miss?
– Words from different lines must be ordered and packed into the instruction queue

Case 1: all instructions located in the same cache line and no branch
Case 2: instructions spread over more lines and no branch

Page 40: Problems At Instruction Fetch

Control flow
– e.g., 32-bit instructions and 32-byte instruction cache lines → 8 instructions per cache line; 4-wide superscalar processor
– Branch prediction is required within the instruction fetch stage
– For wider-issue processors multiple predictions are likely required
– In practice most fetch units only fetch up to the first predicted taken branch

Case 1: single not-taken branch
Case 2: single taken branch outside the fetch range and into another cache line

Page 41: Example Frequencies of Control Flow

  benchmark   taken %   avg. BB size   # of inst. between taken branches
  eqntott     86.2      4.20           4.87
  espresso    63.8      4.24           6.65
  xlisp       64.7      4.34           6.70
  gcc         67.6      4.65           6.88
  sc          70.2      4.71           6.71
  compress    60.9      5.39           8.85

  Data from Rotenberg et al. for SPEC 92 Int

One branch/jump about every 4 to 6 instructions
One taken branch/jump about every 4 to 9 instructions

Page 42: Solutions For Instruction Fetch

Advanced fetch engines that can perform multiple cache line lookups
– E.g., interleaved I-caches, where consecutive program lines are stored in different banks that can be accessed in parallel

Very fast, albeit not very accurate, branch predictors (e.g., the next line predictor in the Alpha 21464)
– Note: usually used in conjunction with more accurate but slower predictors (see Lecture 4)

Restructuring instruction storage to keep commonly consecutive instructions together (e.g., the trace cache in the Pentium 4)

Page 43: Example Advanced Fetch Unit

[Figure from Rotenberg et al.: a 2-way interleaved I-cache feeding a final alignment unit; a mask selects instructions from each of the cache lines; control flow prediction units: i) Branch Target Buffer, ii) Return Address Stack, iii) Branch Predictor]

Page 44: Trace Caches

Traditional I-cache: instructions laid out in program order
Dynamic execution order does not always follow program order (e.g., taken branches), and the dynamic order also changes

Idea:
– Store instructions in execution order (traces)
– Traces can start with any static instruction and are identified by the starting instruction’s PC
– Traces are created dynamically as instructions are normally fetched and branches are resolved
– Traces also contain the outcomes of the implicitly predicted branches
– When the same trace is encountered again (i.e., same starting instruction and same branch predictions), instructions are obtained from the trace cache
– Note that multiple traces can be stored with the same starting instruction

Page 45: Pros/Cons of Trace Caches

+ Instructions come from a single trace cache line
+ Branches are implicitly predicted
  – The instruction that follows the branch is fixed in the trace and implies the branch’s direction (taken or not taken)
+ The I-cache is still present, so there is no need to change the cache hierarchy
+ In CISC ISAs (e.g., x86) the trace cache can keep decoded instructions (e.g., Pentium 4)
- Wasted storage, as instructions appear in both the I-cache and the trace cache, and possibly in multiple trace cache lines
- Not very good at handling indirect jumps and returns (which have multiple targets, instead of only taken/not taken) and even unconditional branches
- Not very good when there are traces with common sub-paths

Page 46: Structure of a Trace Cache

[Figure from Rotenberg et al.]

Page 47: Structure of a Trace Cache

Each line contains n instructions from up to m basic blocks

Control bits:
– Valid
– Tag
– Branch flags: m-1 bits to specify the directions of the up to m branches
– Branch mask: the number of branches in the trace
– Trace target address and fall-through address: the address of the next instruction to be fetched after the trace is exhausted

Trace cache hit:
– Tag must match
– Branch predictions must match the branch flags for all branches in the trace
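A minimal C sketch of such a trace cache line and its hit test (field names and widths are illustrative, not the Rotenberg paper's exact layout, and for simplicity one direction bit per branch is compared rather than m-1 flag bits):

    #include <stdbool.h>
    #include <stdint.h>

    #define N_INSTS 24   /* n: max instructions per trace (illustrative) */

    typedef struct {
        bool     valid;
        uint32_t tag;              /* from the starting instruction's PC        */
        uint32_t insts[N_INSTS];   /* the trace, in (predicted) execution order */
        uint8_t  n_branches;       /* branch mask: number of branches in trace  */
        uint8_t  branch_flags;     /* taken/not-taken directions, one bit each  */
        uint32_t target_addr;      /* next PC if the trace-ending branch is taken */
        uint32_t fallthru_addr;    /* next PC otherwise                         */
    } trace_line_t;

    /* Hit: tag matches and current predictions match the stored directions */
    static bool trace_hit(const trace_line_t *t, uint32_t tag, uint8_t preds) {
        uint8_t live = (uint8_t)((1u << t->n_branches) - 1);
        return t->valid && t->tag == tag &&
               (t->branch_flags & live) == (preds & live);
    }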

Page 48: Trace Creation

Starts on a trace cache miss
Instructions are fetched up to the first predicted taken branch
Instructions are collected, possibly from multiple basic blocks (when branches are predicted taken)
The trace is terminated when either n instructions or m branches have been added
The trace target/fall-through addresses are computed at the end

Page 49: Example

I-cache lines contain 8 32-bit instructions; trace cache lines contain up to 24 instructions and 3 branches
The processor can issue up to 4 instructions per cycle

Machine code:
  L1: I1 [ALU] ... I5 [Cond. Br. to L3]
  L2: I6 [ALU] ... I12 [Jump to L4]
  L3: I13 [ALU] ... I18 [ALU]
  L4: I19 [ALU] ... I24 [Cond. Br. to L1]

Basic blocks: B1 (I1-I5), B2 (I6-I12), B3 (I13-I18), B4 (I19-I24)

Layout in I-cache (8-instruction lines; I1 starts near the end of a line):
  line 0:  ...  I1  I2  I3
  line 1:  I4  I5  I6  I7  I8  I9  I10 I11
  line 2:  I12 I13 I14 I15 I16 I17 I18 I19
  line 3:  I20 I21 I22 I23 I24 ...

Page 50: Example

Common path: B1 → B3 → B4

Step 1: fetch I1-I3 (stop at end of line) → trace cache miss → start trace collection
Step 2: fetch I4-I5 (possible I-cache miss) (stop at predicted taken branch)
Step 3: fetch I13-I16 (possible I-cache miss)
Step 4: fetch I17-I19 (I18 is a predicted not-taken branch; stop at end of line)
Step 5: fetch I20-I23 (possible I-cache miss) (stop at predicted taken branch)
Step 6: fetch I24-I27
Step 7: the fetch of I1-I4 is replaced by a trace cache access

Layout in trace cache (one line, up to 24 instructions / 3 branches):
  I1 I2 I3 I4 I5 I13 I14 I15 I16 I17 I18 I19 I20 I21 I22 I23 I24

Page 51: References and Further Reading

Original hardware trace cache:
“Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching”, E. Rotenberg, S. Bennett, and J. Smith, Intl. Symp. on Microarchitecture, December 1996.

Next trace prediction for trace caches:
“Path-Based Next Trace Prediction”, Q. Jacobson, E. Rotenberg, and J. Smith, Intl. Symp. on Microarchitecture, December 1997.

A software trace cache:
“Software Trace Cache”, A. Ramirez, J.-L. Larriba-Pey, C. Navarro, J. Torrellas, and M. Valero, Intl. Conf. on Supercomputing, June 1999.

Page 52: Lect. 4: Superscalar Processors II/II

n-wide instruction issue + m-deep pipeline + d-cycle delay to resolve branches:
– Up to n*m instructions in-flight
– Up to n*d instructions must be re-executed on a branch misprediction
– Current processors have 10 to 20 cycles of branch misprediction penalty

Current branch prediction accuracy is around 80%-90% for “difficult” applications and >95% for “easy” applications
Increasing prediction accuracy usually involves increasing the size of the tables
Different predictor types are good at different types of branch behavior
Current processors have multiple branch predictors with different accuracy-delay tradeoffs

Page 53: Quantifying Prediction Accuracy

Two measures:
– Coverage: the fraction of branches for which the predictor has a prediction (Note: usually coverage is considered to be 100%, with no prediction equalling predict not-taken)
– Accuracy: the ratio of correctly predicted branches over the total number of branches predicted (Pitfall: higher accuracy is not necessarily better when coverage is lower)

Performance impact is proportional to (1 – accuracy), to the penalty, and to the number of branches in the application

Two ways of looking at accuracy improvements:
– E.g., accuracy improves from 95% to 97%:

    (97 – 95) / 95 = 0.021  → only a 2% increase in accuracy
    (5 – 3) / 5   = 0.4     → a 40% reduction in mispredictions

Page 54: 2-bit Branch Prediction

Branch prediction buffers:
– Match the branch PC during the IF or ID stages

2-bit saturating counter:
– 00: do not take
– 01: do not take
– 10: take
– 11: take

Example buffer contents:
  Branch PC    Outcome
  0x135c8      00
  0x147e0      01

  ...
  0x135c4: add r1,r2,r3
  0x135c8: bne r1,r0,n
  ...
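A minimal C sketch of such a buffer of 2-bit saturating counters (the table size and PC indexing are illustrative choices, not from the slide):

    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_BITS 10
    #define TABLE_SIZE (1 << TABLE_BITS)

    /* One 2-bit counter per entry: 0,1 = do not take; 2,3 = take */
    static uint8_t counters[TABLE_SIZE];   /* zero-init: strongly not-taken */

    static unsigned index_of(uint32_t pc) {
        return (pc >> 2) & (TABLE_SIZE - 1);  /* drop byte offset, keep low bits */
    }

    static bool predict(uint32_t pc) {
        return counters[index_of(pc)] >= 2;   /* MSB of the counter */
    }

    static void update(uint32_t pc, bool taken) {
        uint8_t *c = &counters[index_of(pc)];
        if (taken  && *c < 3) (*c)++;         /* saturate at 11 */
        if (!taken && *c > 0) (*c)--;         /* saturate at 00 */
    }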

Page 55: (2,2) Correlating Predictor

For example: if the four counter values are 00 01 10 01 and the last two branches were, respectively, taken and not taken, then we will predict the branch as not taken (01)
Organized as a table of values indexed by the sequence of past branch outcomes and by the branch PC
This is an example of a context-based branch predictor

[Diagram: each entry holds four 2-bit counters, one per outcome pattern of the last two branches (NT/NT, T/NT, NT/T, T/T); counter values 00 and 01 mean do not take, 10 and 11 mean take]

Page 56: Two Level Branch Predictors

Two types of arrangement/indexing:
– Global: information is not particular to a branch and the table/information is not directly indexed by the branch’s PC
  Good when branches are highly correlated
– Local (a.k.a. per-address): information is particular to a branch and the table/information is indexed by the branch’s PC
  Good when branches are individually highly biased
– Partially local: the table/information is indexed by part of the branch’s PC (in order to save bits in the tags for the tables)
– Note: sometimes global information may be indexed by information that was local, and is then somewhat indexed by the branch’s PC

Page 57: Two Level Branch Predictors

1st level: history of the last n branches
– If global:
  A single History Register (HR) (an n-bit shift register) with the last outcomes of all branches
– If local:
  Multiple HRs in a History Register Table (HRT) that is indexed by the branch’s PC, where each HR contains the last outcomes of the corresponding branch only

2nd level: the branch behavior of the last s occurrences of the history pattern
– If global:
  A single Pattern History Table (PHT) indexed by the resulting HR contents
– If local:
  Multiple PHTs that are indexed by the branch’s PC, where each entry is indexed by the resulting HR contents
  – Thus, 2^n entries for each HR

Page 58: Two Level Branch Predictors

Example with global history and global pattern table (GAg):
– All branches use the same HR
– All branches use the same PHT
– The 2-bit saturating counter is only an example; other schemes are possible
– Meaning: “when the outcome of the last n branches (whichever branches they were) is 11…10, then the prediction is P, regardless of which branch is being predicted”

[Diagram: the Branch History Register (e.g., 11…10) indexes the Pattern History Table of 2-bit saturating counters; the selected counter P gives the prediction, e.g., P = 01 → predict not taken]
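For concreteness, a minimal C sketch of GAg with 2-bit counters (the 12-bit history length is an arbitrary choice):

    #include <stdbool.h>
    #include <stdint.h>

    #define HIST_BITS 12
    #define PHT_SIZE  (1 << HIST_BITS)

    static uint16_t ghr;            /* global history register, HIST_BITS wide   */
    static uint8_t  pht[PHT_SIZE];  /* one 2-bit saturating counter per pattern  */

    static bool gag_predict(void) {
        /* note: no PC involved - the same HR and PHT serve every branch */
        return pht[ghr & (PHT_SIZE - 1)] >= 2;
    }

    static void gag_update(bool taken) {
        uint8_t *c = &pht[ghr & (PHT_SIZE - 1)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghr = (uint16_t)((ghr << 1) | (taken ? 1 : 0)); /* shift outcome into history */
    }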

Page 59: Two Level Branch Predictors

Example with local history and global pattern table (PAg):
– Each branch uses its own HR
– All branches use the same PHT
– Meaning: “when the outcome of the last n instances of the branch being predicted is 11…10, then the prediction is P, regardless of which branch is being predicted”

[Diagram: the branch PC selects a tagged HR from the Branch History Registers; the HR contents (e.g., 11…10) index the single shared Pattern History Table, whose entry P gives the prediction]

Page 60: Two Level Branch Predictors

Example with local history and local pattern table (PAp):
– Each branch uses its own HR
– Each branch uses its own PHT
– Meaning: “when the outcome of the last n instances of the branch being predicted is 11…10, then the prediction is P for this particular branch”

[Diagram: the branch PC selects a tagged HR from the Branch History Registers; the HR contents index that branch’s own Pattern History Table, so different branches can hold different predictions (P, P’) for the same history]

Page 61: Two Level Branch Predictors

Notes:
– When only part of the branch’s PC is used for indexing there is aliasing (i.e., multiple branches appear to be the same)
– In practice there is a finite number of entries in the tables with local information, so
  Either these only cache information for the most recently seen branches
  Or the tables are indexed by hashing the branch’s PC (usually with an XOR), which also leads to aliasing
– Aliasing also happens with global information, as multiple branches appear to have the same behavior/prediction
– Accuracy of a predictor depends on:
  Local versus global information at each level
  Size of the tables in local schemes (the number of different branches that can be tracked)
  Depth of the history (n)
  Amount of aliasing

Page 62: Two Level Branch Predictors

Updates:
– The HRs are updated with the outcome of the branch being predicted (only the corresponding HR in the case of a local scheme)
– The predictor in the selected PHT entry is updated with the outcome of the branch (e.g., a 2-bit saturating counter is incremented/decremented if the outcome is taken/not taken)

Taxonomy:
– History table type: Global: GA; Local (per address): PA
– Pattern table type: global: g; local (per address): p
– Thus:
  GAg = global history table and global pattern table
  PAg = local history table and global pattern table
– The GAp combination does not make much sense

Page 63: Local vs. Global Predictors

A simple 2-bit predictor performs best for small predictor sizes, but saturates quickly and below the other predictors
Local outperforms global for all these predictor sizes

[Figure: predictor accuracy (%) vs. predictor size (64B to 64KB) for GAg, PAg, and 2-bit predictors, with accuracies roughly in the 84%-98% range – data from McFarling for SPEC 1989 Int and FP]

Page 64: Combining Branch Predictors

Different predictors are good at different behaviors
Different predictors have different accuracy and latency

Combining predictors
– Can lead to schemes that are good at more behaviors
– Can quickly generate a reasonably accurate prediction, and with some more delay a highly accurate prediction, which corrects the previous prediction if necessary
– Usually combines a simple and a complex predictor

Choosing between multiple predictors:
– A “meta-predictor” chooses the predictor that most likely has the correct prediction
– Augment predictors with confidence estimators

Page 65: Combining Branch Predictors

Meta predictor
– Use a 2-bit saturating counter to select which predictor to use

[Diagram: the branch PC indexes a table of tagged selector counters as well as the two predictors P1 and P2; the selected counter S drives a 2:1 MUX that picks one of the two predictions as the final prediction, e.g., S = 01 → use predictor P2]

Page 66: Combining Branch Predictors

Meta predictor
– 2-bit saturating counter interpretation:
  00: use P2
  01: use P2
  10: use P1
  11: use P1
– Updating the counter:
  P1 correct and P2 correct this time: no change to the counter
  P1 correct and P2 incorrect this time: increment the counter
  P1 incorrect and P2 correct this time: decrement the counter
  P1 incorrect and P2 incorrect this time: no change to the counter

Choosing among more than 2 predictors is more involved and rarely pays off
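These rules in a minimal C sketch (a single selector counter is shown for brevity; the real selector is a PC-indexed table of such counters, as on the previous slide):

    #include <stdbool.h>
    #include <stdint.h>

    static uint8_t sel = 2;  /* 2-bit selector: 0,1 -> use P2; 2,3 -> use P1 */

    /* Pick the final prediction from the two component predictions */
    static bool choose(bool pred_p1, bool pred_p2) {
        return (sel >= 2) ? pred_p1 : pred_p2;
    }

    /* Train the selector toward whichever predictor was right */
    static void train(bool pred_p1, bool pred_p2, bool outcome) {
        bool p1_ok = (pred_p1 == outcome), p2_ok = (pred_p2 == outcome);
        if (p1_ok && !p2_ok && sel < 3) sel++;   /* favor P1 */
        if (!p1_ok && p2_ok && sel > 0) sel--;   /* favor P2 */
        /* both right or both wrong: no change */
    }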

Page 67: Example: The Alpha 21464 Predictors

8-wide out-of-order superscalar processor with a very deep pipeline and multithreading
The predictors take approximately 44KBytes of storage
Up to 16 branches predicted every cycle
Minimum misprediction penalty of 14 cycles (112 instructions); most common is 20 or 25 cycles (160 or 200 instructions)

Based on global schemes; local schemes were ruled out because:
– They would require up to 16 parallel lookups of the tables
– It is difficult to maintain per-branch information (e.g., the same branch may appear multiple times in such a deeply pipelined, wide-issue machine)

In addition to conditional branch prediction it has a jump predictor and a return address stack predictor

Page 68: Example: The Alpha 21464 Predictors

Fetch unit:
– Can fetch up to 16 instructions from 2 dynamically consecutive I-cache lines
– Instruction fetch stops at the first taken branch (predicted not-taken branches (up to 16) do not stop fetch)

1st predictor: next line predictor
– Operates within a single cycle
– Unacceptably high misprediction rate

2nd predictor: 2Bc-gskew
– Operates over 2 cycles and is pipelined
– Actually consists of 2 different predictors (a 2-bit saturating counter and an e-gskew) combined with a meta predictor selector
– Uses a “de-aliasing” approach:
  Partition the tables into multiple sets and use special hashing functions
  Shown to reduce aliasing in global schemes

Page 69: References and Further Reading

Seminal branch prediction work:
“Two-Level Adaptive Training Branch Prediction”, T.-Y. Yeh and Y. Patt, Intl. Symp. on Microarchitecture, December 1991.
“Alternative Implementations of Two-Level Adaptive Branch Prediction”, T.-Y. Yeh and Y. Patt, Intl. Symp. on Computer Architecture, June 1992.
“Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation”, S.-T. Pan, K. So, and J. T. Rahmeh, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1992.
“Combining Branch Predictors”, S. McFarling, WRL Technical Note TN-36, June 1993.

Adding confidence estimation to predictors:
“Assigning Confidence to Conditional Branch Predictions”, E. Jacobsen, E. Rotenberg, and J. Smith, Intl. Symp. on Microarchitecture, December 1996.

Page 70: References and Further Reading

Alpha 21464 predictor:
“Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor”, A. Seznec, S. Felix, V. Krishnan, and Y. Sazeides, Intl. Symp. on Computer Architecture, June 2002.
“Next Cache Line and Set Prediction”, B. Calder and D. Grunwald, Intl. Symp. on Computer Architecture, June 1995.
“Trading Conflict and Capacity Aliasing in Conditional Branch Predictors”, P. Michaud, A. Seznec, and R. Uhlig, Intl. Symp. on Computer Architecture, June 1997.

Neural-net-based branch predictors:
“Fast Path-Based Neural Branch Prediction”, D. Jimenez, Intl. Symp. on Microarchitecture, December 2003.

Championship Branch Prediction:
– www.jilp.org/cbp/
– camino.rutgers.edu/cbp2/

Page 71: Probing Further

Advanced register allocation and de-allocation:
“Late Allocation and Early Release of Physical Registers”, T. Monreal, V. Vinals, J. Gonzalez, A. Gonzalez, and M. Valero, IEEE Trans. on Computers, October 2004.

Value prediction:
“Exceeding the Dataflow Limit Via Value Prediction”, M. H. Lipasti and J. P. Shen, Intl. Symp. on Microarchitecture, December 1996.

Limitations to wide-issue processors:
“Complexity-Effective Superscalar Processors”, S. Palacharla, N. P. Jouppi, and J. Smith, Intl. Symp. on Computer Architecture, June 1997.
“Clock Rate Versus IPC: the End of the Road for Conventional Microarchitectures”, V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, Intl. Symp. on Computer Architecture, June 2000.

Recent alternatives to out-of-order execution:
“‘Flea-flicker’ Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense”, R. D. Barnes, S. Ryoo, and W. Hwu, Intl. Symp. on Microarchitecture, November 2005.

Page 72: Lect. 5: Vector Processors

Many real-world problems, especially in science and engineering, map well to computation on arrays

The RISC approach is inefficient:
– Based on loops → requires dynamic or static unrolling to overlap computations
– Indexing arrays based on arithmetic updates of induction variables
– Fetching of array elements from memory based on individual, and unrelated, loads and stores
– Small register files
– Instruction dependences must be identified for each individual instruction

Idea:
– Treat operands as whole vectors, not as individual integer or floating-point numbers
– A single machine instruction now operates on whole vectors (e.g., a vector add)
– Loads and stores to memory also operate on whole vectors
– Individual operations on vector elements are independent, and only dependences between whole vector operations must be tracked

Page 73: Execution Model

Straightforward RISC code:
– F2 contains the value of s
– R1 contains the address of the first element of a
– R2 contains the address of the first element of b
– R3 contains the address of the last element of a + 8

  for (i=0; i<64; i++)
    a[i] = b[i] + s;

  loop: L.D    F0,0(R2)    ;F0=array element of b
        ADD.D  F4,F0,F2    ;main computation
        S.D    F4,0(R1)    ;store result
        DADDUI R1,R1,8     ;increment index
        DADDUI R2,R2,8     ;increment index
        BNE    R1,R3,loop  ;next iteration

Page 74: Execution Model

Straightforward vector code:
– F2 contains the value of s
– R1 contains the address of the first element of a
– R2 contains the address of the first element of b
– Assume vector registers have 64 double precision elements

  for (i=0; i<64; i++)
    a[i] = b[i] + s;

  LV      V1,R2    ;V1=array b
  ADDVS.D V2,V1,F2 ;main computation
  SV      V2,R1    ;store result

Notes:
– Some vector operations require access to the integer and FP register files as well
– In practice vector registers are not of the exact size of the arrays
– Refer to Figure G.3 of Hennessy & Patterson for a list of the most common types of vector instructions
– Only 3 instructions are executed, compared to the 6*64 = 384 executed in the RISC version

Page 75: Execution Model (Pipelined)

With multiple vector units, I2 can execute together with I1 (as we will see later)
In practice, the vector unit takes several cycles to operate on each element, but is pipelined:

  cycle  1   2   3     4     5     6     7     8
  IF     I1
  ID         I1
  EXE            I1.1  I1.2  I1.3  I1.4  I1.5  I1.6
  MEM                  I1.1  I1.2  I1.3  I1.4  I1.5
  WB                         I1.1  I1.2  I1.3  I1.4

(I1.k denotes the operation on the k-th vector element of instruction I1; the following instruction I2 must wait for the vector unit)

Page 76: Pros of Vector Processors

Reduced pressure on instruction fetch
– Fewer instructions are necessary to specify the same amount of work

Reduced pressure on instruction issue
– The reduced number of branches alleviates branch prediction
– Much simpler hardware for checking dependences

Simpler register file
– No need for too many ports, as only one element is used per cycle (for the pipelined approach)

More streamlined memory accesses
– Vector loads and stores specify a regular access pattern
– The high latency of initiating a memory access is amortized

Page 77: Cons of Vector Processors

Requires a specialized, high-bandwidth memory system
– Caches do not usually work well with vector processors
– Usually built around heavily banked memory with data interleaving

Still requires a traditional scalar unit (integer and FP) for the non-vector operations
Difficult to maintain precise interrupts (can’t roll back all the individual operations already completed)
The compiler or programmer has to vectorize programs
Not very efficient for small vector sizes
Not suitable/efficient for many different classes of applications

Performance Issues
Performance of a vector instruction depends on the length of the operand vectors
Initiation rate
– Rate at which individual operations can start in a functional unit
– For fully pipelined units this is 1 operation per cycle
– Often more than one cycle per operation for the load/store unit
Start-up time
– Time it takes to produce the first element of the result
– Depends on how deep the pipelines of the functional units are
– Especially large for the load/store unit
With an initiation rate of 1, the time to complete a single vector instruction is equal to the vector size + the start-up time, which is approximately equal to the vector size for large vectors
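As a worked illustration (with an assumed start-up time, not a figure from the slides): if the start-up time is 12 cycles and the initiation rate is 1, a 64-element vector instruction completes in 12 + 64 = 76 cycles, i.e., about 1.19 cycles per element; as the vector length grows, the cost per element approaches 1 cycle.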

Performance Issues
Common vector processor performance metrics:
– R∞: the rate of execution of the processor with vectors of infinite size (i.e., with no overheads due to smaller vectors)
– N1/2: the vector length required for the processor to reach half of R∞
– NV: the vector length required for the processor to match the performance of scalar execution (i.e., the point at which it pays off to execute in vector mode)
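A small sanity check under the simplest assumptions (not from the slides): with a single fully pipelined unit (initiation rate 1) and start-up time s, the rate on vectors of length n is R(n) = n/(n+s) elements per cycle, so R∞ = 1. Setting n/(n+s) = 1/2 gives n = s; under these assumptions N1/2 is simply the start-up time.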

Dealing with Vector Sizes
Two new registers are used:
– vector length register (VLR) specifies (to the hardware) what length is to be assumed for the next instruction to be issued
– maximum vector length (MVL) specifies (to the programmer/compiler) what the maximum length is (i.e., the size of the registers in the particular machine)
Use strip mining for user arrays larger than MVL

for (i=0; i<n; i++)
  a[i] = b[i] + s;

low = 0;
VL = n % MVL;               /* set the length to the remainder part of the
                               array when the size is not divisible by MVL;
                               e.g., with n=144 and MVL=64 we have 2 chunks
                               of 64 and 1 remainder chunk of 16 */
for (j=0; j<=n/MVL; j++) {
  for (i=low; i<low+VL; i++)   /* this is the loop that gets vectorized */
    a[i] = b[i] + s;
  low = low + VL;
  VL = MVL;                    /* set the length back to MVL */
}

Advanced Features: Masking
What if the operations involve only some elements of the array, depending on some run-time condition?

for (i=0; i<64; i++)
  if (b[i] != 0)
    a[i] = b[i] + s;

Solution: masking
– Add a new boolean vector register (the vector mask register)
– The vector instruction then only operates on elements of the vectors whose corresponding bit in the mask register is 1
– Add new vector instructions to set the mask register
  E.g., SNEVS.D V1,F0 sets to 1 the bits in the mask register whose corresponding elements in V1 are not equal to the value in F0
  CVM instruction sets all bits of the mask register to 1

Advanced Features: Masking
Vector code:
– F2 contains the value of s and F0 contains zero
– R1 contains the address of the first element of a
– R2 contains the address of the first element of b
– Assume vector registers have 64 double precision elements

for (i=0; i<64; i++)
  if (b[i] != 0)
    a[i] = b[i] + s;

LV      V1,R2     ;V1 = array b
SNEVS.D V1,F0     ;mask bit is 1 if b[i] != 0
ADDVS.D V2,V1,F2  ;main computation (under the mask)
CVM               ;set all mask bits back to 1
SV      V2,R1     ;store result

Advanced Features: Scatter-Gather
How can we handle sparse matrices?

for (i=0; i<64; i++)
  a[K[i]] = b[K[i]] + s;

Solution: scatter-gather
– Use the contents of an auxiliary vector to select which elements of the main vector are to be used
– This is done by pointing to the address in memory of the elements to be selected
– Add a new vector instruction to load memory values based on this auxiliary vector
  E.g., LVI V1,(R1+V2) loads the elements of a user array from memory locations R1+V2(i)
  Also SVI store counterpart

Advanced Features: Scatter-Gather
Vector code:
– F2 contains the value of s
– R1 contains the address of the first element of a
– R2 contains the address of the first element of b
– V3 contains the indices of a and b that need to be used
– Assume vector registers have 64 double precision elements

for (i=0; i<64; i++)
  a[K[i]] = b[K[i]] + s;

LVI     V1,(R2+V3)  ;V1 = array b indexed by V3
ADDVS.D V2,V1,F2    ;main computation
SVI     V2,(R1+V3)  ;store result

Advanced Features: Striding

for (i=0; i<64; i++)
  a[i] = b[i][j] + s;

Assume that the 2D array b is laid out by rows
– Iterations access non-contiguous elements of b (a column of b)
– Could use scatter-gather, but this would waste a vector register
– The access pattern is very regular, and a single integer, the stride, fully defines it
– Add a new vector instruction to load values from memory based on the stride
  E.g., LVWS V1,(R1,R2) loads the elements of a user array from memory locations R1+i*R2
  Also SVWS store counterpart
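A sketch of the corresponding vector code, in the style of the earlier examples (the register assignments here are assumptions, not from the slides): R1 holds the address of a, R2 the address of b[0][j], R3 the stride (the size of one row of b in bytes), and F2 the value of s.

LVWS    V1,(R2,R3)  ;V1 = column j of b, loaded with stride R3
ADDVS.D V2,V1,F2    ;main computation
SV      V2,R1       ;store result (a is contiguous, so a plain SV suffices)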

Advanced Features: Chaining
Forwarding in pipelined RISC processors allows dependent instructions to execute as soon as the result of the previous instruction is available

ADD.D R1,R2,R3  # R1=R2+R3
MUL.D R4,R5,R1  # R4=R5*R1

[Pipeline diagram: the add and the mul advance through IF, ID, EXE, MEM, WB one cycle apart (cycles 1-6 shown); the value produced in the add's EXE stage is forwarded directly into the mul's EXE stage, with independent instructions I3-I6 following behind]

Advanced Features: Chaining
A similar idea applies to vector instructions and is called chaining
– The difference is that chaining of vector instructions requires multiple functional units, as the same unit cannot be used back-to-back

ADDV.D V1,V2,V3  # V1=V2+V3
MULV.D V4,V5,V1  # V4=V5*V1

[Pipeline diagram: the element operations of the add (A.1, A.2, A.3, A.4, ...) stream through the adder's EXE stage one per cycle; each result value is chained straight into the multiplier's EXE stage, so the mul's element operations (M.1, M.2, M.3, ...) start one cycle behind the corresponding add before proceeding to MEM and WB]
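A worked illustration of the benefit (with assumed start-up times, not figures from the slides): for 64-element vectors, a 6-cycle adder start-up, and a 7-cycle multiplier start-up, executing the two instructions without chaining takes (6 + 64) + (7 + 64) = 141 cycles, whereas with chaining the multiply starts as soon as the first add result is available: 6 + 7 + 64 = 77 cycles.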

Example: The Earth Simulator
73rd fastest supercomputer as of the Top500 list of November 2008 (was 1st from March 2002 to September 2004)
Multiprocessor Vector architecture
– 640 nodes, 8 vector processors per node → 5120 processors
– 8 pipelines per vector processor
– 10 TBytes of main memory
– Vector units contain 72 vector registers, each with 256 elements
Performance and Power consumption
– 35.9 TFLOPS on the Top500 benchmark (the closest RISC-based multiprocessor, at #72, reaches 36.6 TFLOPS using 9216 processors)
– 12800 KWatts power consumption
Designed specifically to simulate nature (e.g., weather, ocean, earthquakes) at a global scale (i.e., the whole earth)

Further Reading
The first truly successful vector supercomputer:
"The CRAY-1 Computer System", R. M. Russell, Communications of the ACM, January 1978.
A recent vector processor on a chip:
"Vector vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks", C. Kozyrakis and D. Patterson, Intl. Symp. on Microarchitecture, December 2002.
Integrating a vector unit with a state-of-the-art superscalar:
"Tarantula: A Vector Extension to the Alpha Architecture", R. Espasa, F. Ardanaz, J. Emer, S. Felix, J. Gago, R. Gramunt, I. Hernandez, T. Juan, G. Lowney, M. Mattina, and A. Seznec, Intl. Symp. on Computer Architecture, June 2002.

Lect. 6: SIMD Processors
Superscalar execution model:
– Mix of scalar ALUs
– n unrelated instructions per cycle
– 2n unrelated operands per cycle
– Results from any ALU can feed back to any ALU individually
– Operands are wide (32/64 bits)
Vector execution model:
– Vector ALU
– 1 vector instruction → multiple of the same operation
– Operands belong to an array
– Results are written back to the reg. file
– Operands are wide (32/64 bits)
[Block diagrams: an instruction sequencer and register file feeding several scalar ALUs (superscalar) vs. an instruction sequencer and register file feeding one vector ALU (vector)]

Original SIMD Idea
Network of simple processing elements (PE)
– PEs operate in lockstep under the control of a master sequencer, the array control unit (ACU) (note: masking is possible)
– PEs can exchange results with a small number of neighbors via special data-routing instructions
– Each PE has its own local memory or (less common) accesses memory via an alignment network
– PEs operate on very narrow operands (1 bit in the extreme case of the CM-1)
– Very large (up to 64K) number of PEs
– Usually operated as co-processors with a host computer to perform I/O and to handle external memory
Suitable for some scientific, AI, and visualization applications
Intended for use as supercomputers
Programmed via custom extensions of common HLLs

Original SIMD Idea
[Block diagram: a single instruction sequencer broadcasting to a grid of PEs, each with its own local memory M]

Example: Equation Solver Kernel
The problem:
– Operate on a (n+2)x(n+2) matrix

A[i,j] = 0.2 x (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

SIMD implementation (see the sketch below):
– Assign one node to each PE
– Step 1: all PEs send their data to their east neighbors and simultaneously read the data sent by their west neighbors (nodes at the right, top, and bottom rim are masked out at this step)
– Steps 2 to 4: same as step 1 for west, south, and north (again, appropriate nodes are masked out)
– Step 5: all PEs compute the new value using the equation above
– Note: strictly speaking we need some extra tricks to juggle new and old values
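A minimal C sketch of one sweep (an illustration, not the actual SIMD code: on a real array machine steps 1-4 are four lockstep shift operations with the rim PEs masked out, one grid node per PE):

#define N 6                      /* interior is N x N; grid is (N+2)x(N+2) */
double A[N+2][N+2];
double W[N+2][N+2], E[N+2][N+2], No[N+2][N+2], S[N+2][N+2];

void sweep(void)
{
    /* steps 1-4: every interior node gathers its four neighbors
       (old values), mimicking the four masked lockstep shifts */
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) {
            W[i][j]  = A[i][j-1];
            E[i][j]  = A[i][j+1];
            No[i][j] = A[i-1][j];
            S[i][j]  = A[i+1][j];
        }
    /* step 5: all nodes compute the new value; the temporary copies
       play the role of the "extra tricks" for juggling old and new values */
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            A[i][j] = 0.2 * (A[i][j] + W[i][j] + E[i][j]
                             + No[i][j] + S[i][j]);
}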

Example: MasPar MP-1
Key features
– First SIMD to use a traditional RISC IS
– ACU also performs non-SIMD operations/computation
– From 1K to 16K PEs
– PE array interconnects:
  2D mesh for 8-way (N, S, E, W, NE, SE, SW, NW) neighbor communication (X net)
  Circuit-switched 3-stage hierarchical crossbar for any-to-any communication
  Two global buses for ACU-PE lockstep control
– PEs have local memory for data (16KB) (instructions are stored in the ACU)
– PEs commonly operate on 32 bit words, but can also operate on individual bits, bytes, 16 bit words, and 64 bit words

Example: MasPar MP-1
[Figure from Blank: the PE array with its 2D mesh; the ACU and Unix host; the crossbar with routers]

A Modern SIMD Co-processor
ClearSpeed CSX600
– Intended as an accelerator for high performance technical computing
– Current implementation has 96 PEs plus a scalar unit for non-SIMD operations (including control flow)
– Each PE is in fact a VLIW core
– 1, 2, 4, and 8 byte operands
– PEs can communicate directly with right and left neighbors
– Also supports multithreading to hide I/O latency (Lecture 12)
– Uses traditional instruction and data caches in addition to memory local to each PE
– Programmed with an extension of C (see the sketch below):
  Poly variables: replicated in each PE with different values
  Mono variables: only a single instance exists (either at the host, or replicated at the PEs but with synchronized values)
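A small sketch in the style of this C extension (the per-PE input routine is hypothetical; only the poly/mono qualifiers come from the slides):

mono float s = 0.5f;   /* one instance, same value for all PEs */
poly float x;          /* one instance per PE, values may differ */

x = read_pe_input();   /* hypothetical per-PE input routine */
x = x + s;             /* executed in lockstep by all 96 PEs */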

A Modern SIMD Co-processor
[Figure from ClearSpeed: the PE array with local memories (SRAM) and registers; the RISC scalar processor and ACU; the neighbor communication infrastructure (swazzle)]

Multimedia SIMD Extensions
Key ideas:
– No network of processing elements, but an array of (narrow) ALUs
– No memories associated with ALUs, but a pool of relatively wide (64 to 128 bits) registers that store several operands
– Still narrow operands (8 bits) and instructions that use operands of different sizes
– No direct communication between ALUs, but via registers and with special shuffling/permutation instructions
– Not co-processors or supercomputers, but tightly integrated into the CPU pipeline
– Still lockstep operation of ALUs
– Special instructions to handle common media operations (e.g., saturated arithmetic)

Multimedia SIMD Extensions
SIMD ext. execution model:
[Block diagram: an instruction sequencer and a wide register file feeding a shuffling network and the SIMD ALUs. Inter-register operations apply the same operation (+) to each lane of R1 and R2, writing R3; intra-register operations combine the lanes of a single register R1 into R2]

Example: Intel SSE
Streaming SIMD Extensions introduced in 1999 with the Pentium III
Improved over the earlier MMX (1997)
– MMX re-used the FP registers
– MMX only operated on integer operands
70 new machine instructions (SSE2 added 144 more in 2001) and 8 128-bit registers
– Registers are part of the architectural state
– Includes instructions to move values between SSE and x86 registers
– Operands can be: single (32 bit) and double (64 bit) precision FP; 8, 16, and 32 bit integer
– Some instructions to support digital signal processing (DSP) and 3D
– SSE2 included instructions for handling the cache (recall that streaming data does not utilize caches efficiently)
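A minimal sketch of the recurring a[i] = b[i] + s kernel using SSE intrinsics (an illustration, not from the slides; assumes n is a multiple of 4):

#include <xmmintrin.h>   /* SSE intrinsics */

void add_scalar(float *a, const float *b, float s, int n)
{
    __m128 vs = _mm_set1_ps(s);                   /* broadcast s to all 4 lanes */
    for (int i = 0; i < n; i += 4) {
        __m128 vb = _mm_loadu_ps(&b[i]);          /* load 4 floats of b */
        _mm_storeu_ps(&a[i], _mm_add_ps(vb, vs)); /* add and store 4 results */
    }
}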

A Modern SIMD Variation: Cell
IBM/Sony/Toshiba Cell Broadband Engine:
Heterogeneous "multi-core" system with 1 PowerPC (PPE) + 8 SIMD engines (SPE - "Synergistic Processor Units")
On-chip storage based on "scratch pads" (very, very hard to program)
Used in the Playstation 3
SIMD support
– SPEs are incapable of independent control and are "slaves" to the PowerPC
– PPE already supports SIMD extensions (IBM's VMX)
– SPE supports SIMD through a specific IS
– 128 128-bit registers and a 128-bit datapath (note: no scalar registers in the SPE)
– Accessible to the programmer through HLL intrinsics (i.e., function calls, e.g., spu_add(a,b))
– Additional support for synchronization across SPEs and the PPE and for data transfer
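A minimal sketch of the intrinsics style mentioned above (an illustration; spu_add operates on the 128-bit vector types that map onto the SPE registers):

#include <spu_intrinsics.h>

vector float vadd(vector float a, vector float b)
{
    return spu_add(a, b);   /* one instruction: 4 single-precision adds */
}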

References and Further Reading
Seminal SIMD work:
"A Model of SIMD Machines and a Comparison of Various Interconnection Networks", H. Siegel, IEEE Trans. on Computers, December 1979.
"The Connection Machine", D. Hillis, Ph.D. dissertation, MIT, 1985.
Two commercial SIMD supercomputers:
"The CM-2 Technical Summary", Thinking Machines Corporation, 1990.
"The MasPar MP-1 Architecture", T. Blank, Compcon, 1990.
A modern SIMD co-processor:
"CSX Processor Architecture", ClearSpeed, Whitepaper, 2006.

Lect. 7: Shared Mem. Multiprocessors I/V
Obtained by connecting full processors together
– Processors contain normal width (32 or 64 bits) datapaths
– Processors are capable of independent execution and control
– Processors have their own connection to memory
(Thus, by this definition, Sony's Playstation 3 is not a multiprocessor, as the 8 SPEs in the Cell are not full processors)
Have a single OS for the whole system, support both processes and threads, and appear as a common multiprogrammed system
(Thus, by this definition, Beowulf clusters are not multiprocessors)
Can be used to run multiple sequential programs concurrently or parallel programs
Suitable for parallel programs where threads can follow different code

Shared Memory Multiprocessors
Recall the communication model:
– Threads in different processors can use the same virtual address space
– Communication is done through shared memory variables
– Explicit synchronization with locks (e.g., variable flag below) and critical sections

Producer (p1):          Consumer (p2):
flag = 0;               flag = 0;
...                     ...
a = 10;                 while (!flag) {}
flag = 1;               x = a * y;

Shared Memory Multiprocessors
Recall the two common organizations:
– Physically centralized memory, uniform memory access (UMA) (a.k.a. SMP)
– Physically distributed memory, non-uniform memory access (NUMA)
(Note that both organizations have caches between processors and memory)
[Diagrams: UMA - four CPUs, each with a cache, sharing one main memory; NUMA - four CPUs, each with a cache and its own memory module]

The Cache Coherence Problem
[Worked timeline (flattened figure), three CPUs with write-back caches and a shared main memory:
T0: A=1 in memory; A not cached anywhere
T1: one CPU loads A (A=1); its cache and memory hold A=1
T2: a second CPU loads A (A=1); both caches hold A=1
T3: the second CPU stores A=2 in its cache; memory and the first CPU's copy now hold the stale value A=1
T4: a third CPU loads A and gets the old value A=1 from memory
T5: the first CPU loads A, hits on its stale cached copy, and uses the stale value A=1!]

Cache Coherence Protocols
Idea:
– Keep track of what processors have copies of what data
– Enforce that at any given time a single value of every data exists:
  By getting rid of copies of the data with old values → invalidate protocols
  By updating everyone's copy of the data → update protocols
In practice:
– Guarantee that old values are eventually invalidated/updated (write propagation) (recall that without synchronization there is no guarantee that a load will return the new value anyway)
– Guarantee that a single processor is allowed to modify a certain datum at any given time (write serialization)
– Must appear as if no caches were present
Note: must fit with the cache's operation at the granularity of lines

Write-invalidate Example
[Worked timeline (flattened figure), three CPUs with caches and a shared main memory:
T1: one CPU loads A (A=1); its cache and memory hold A=1
T2: a second CPU loads A (A=1); both caches hold A=1
T3: the second CPU stores A=2: the other cached copy is invalidated; memory still holds the stale A=1
T4: the first CPU loads A, misses, and gets the new value A=2 from the owning cache
T5: a further load of A also returns the new value A=2]

Write-update Example
[Worked timeline (flattened figure), three CPUs with caches and a shared main memory:
T1: one CPU loads A (A=1)
T2: a second CPU loads A (A=1)
T3: the second CPU stores A=2: all cached copies and memory are updated to A=2
T4: the first CPU loads A and gets the new value A=2
T5: a further load of A also returns the new value A=2]

Invalidate vs. Update Protocols
Invalidate:
+ Multiple writes by the same processor to the cache block only require one invalidation
+ No need to send the new value of the data (less bandwidth)
– Caches must be able to provide up-to-date data upon request
– Must write back data to memory when evicting a modified block
Usually used with write-back caches (more popular)
Update:
+ New value can be re-used without the need to ask for it again
+ Data can always be read from memory
+ Modified blocks can be evicted from caches silently
– Possibly multiple useless updates (more bandwidth)
Usually used with write-through caches (less popular)

Cache Coherence Protocols
Implementation
– Can be in either hardware or software, but software schemes are not very practical (and will not be discussed further in this course)
Add state bits to cache lines to track the state of the line
– Most common: Invalid, Shared, Owned, Modified, Exclusive
– Protocols are usually named after the states supported
Global state of a memory line corresponds to the collection of its state in all caches
Cache lines transition between states upon load/store operations from the local processor and by remote processors
These state transitions must guarantee the invariant: no two cache copies can be simultaneously modified

Example: MSI Protocol
States:
– Modified (M): block is cached only in this cache and has been modified
– Shared (S): block is cached in this cache and possibly in other caches (no cache can modify the block)
– Invalid (I): block is not cached

Example: MSI Protocol
Transactions originated at this CPU:
[State diagram: Invalid → Shared on a CPU read miss; Invalid → Modified on a CPU write miss; Shared → Modified on a CPU write; CPU read hits leave Shared unchanged; CPU read and write hits leave Modified unchanged]

Example: MSI Protocol
Transactions originated at other CPUs (in addition to the local transitions above):
[State diagram: Shared → Invalid on a remote write miss; Modified → Invalid on a remote write miss; Modified → Shared on a remote read miss; remote read misses leave Shared unchanged]
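A compact sketch of these transitions as a lookup function (a C illustration of the two diagrams above, not an actual controller; transient states and bus actions are omitted):

enum msi_state { INVALID, SHARED, MODIFIED };
enum msi_event { CPU_READ, CPU_WRITE, REMOTE_READ_MISS, REMOTE_WRITE_MISS };

enum msi_state msi_next(enum msi_state s, enum msi_event e)
{
    switch (s) {
    case INVALID:
        return (e == CPU_READ)  ? SHARED   :  /* read miss */
               (e == CPU_WRITE) ? MODIFIED :  /* write miss */
               INVALID;                       /* remote events: no copy here */
    case SHARED:
        return (e == CPU_WRITE)         ? MODIFIED :  /* upgrade */
               (e == REMOTE_WRITE_MISS) ? INVALID  :  /* invalidated */
               SHARED;                                /* read hits, remote reads */
    case MODIFIED:
        return (e == REMOTE_WRITE_MISS) ? INVALID :   /* invalidated */
               (e == REMOTE_READ_MISS)  ? SHARED  :   /* downgrade, supply data */
               MODIFIED;                              /* local hits */
    }
    return s;
}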

Example: MESI Protocol
States:
– Modified (M): block is cached only in this cache and has been modified
– Exclusive (E): block is cached only in this cache, has not been modified, but can be modified at will
– Shared (S): block is cached in this cache and possibly in other caches
– Invalid (I): block is not cached
State E is obtained on reads when no other processor has a shared copy
– All processors must answer whether they have copies or not
  Easily done in bus-based systems with a shared-OR line
– Or some device must know if processors have copies
Advantage over MSI
– Often variables are loaded, modified in a register, and then stored
– The store in state E then does not require asking for permission to write

Example: MESI Protocol
Transactions originated at this CPU:
[State diagram: Invalid → Shared on a CPU read miss with sharing; Invalid → Exclusive on a CPU read miss with no sharing; Invalid → Modified on a CPU write miss; Shared → Modified on a CPU write (must inform everyone: an upgrade); Exclusive → Modified on a CPU write (can be done silently); read hits leave every valid state unchanged]

Example: MESI Protocol
Transactions originated at other CPUs:
[State diagram: Shared, Exclusive, and Modified all go to Invalid on a remote write miss; Exclusive and Modified go to Shared on a remote read miss; remote read misses leave Shared unchanged]

Possible Implementations
Three possible ways of implementing coherence protocols in hardware:
– Snooping: all cache controllers monitor all other caches' activities and maintain the state of their lines
  Commonly used with buses and in many CMPs today
– Directory: a central control device directly handles all cache activities and tells the caches what transitions to make
  Can be of two types: centralized and distributed
  Commonly used with scalable interconnects and in many CMPs today
– List: each cache controller keeps track of its own state and the identity and state of its neighbors in a linked list
  E.g., IEEE SCI protocol (ANSI/IEEE Std 1596-1992)
  Only used in a few machines in the late 90's

Behavior of Cache Coherence Protocols
Uniprocessor cache misses (the 3 C's):
– Cold (or compulsory) misses: when a block is accessed for the first time
– Capacity misses: when a block is not in the cache because it was evicted because the cache was full
– Conflict misses: when a block is not in the cache because it was evicted because the cache set was full
Coherence misses: when a block is not in the cache because it was invalidated by a write from another processor
– Hard to reduce: relates to intrinsic communication and sharing of data in the parallel application
– False sharing coherence misses: processors modify different words of the cache block (no real communication or sharing) but end up invalidating the complete block (see the sketch below)
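A tiny C illustration of false sharing (hypothetical, not from the slides): the two counters are logically independent but will typically land in the same cache line, so each increment invalidates the other thread's copy; padding each counter to its own line removes these misses.

struct counters {
    long a;   /* written only by thread 0 */
    long b;   /* written only by thread 1; same cache line as a */
};
struct counters c;

void thread0(void) { for (int i = 0; i < 1000000; i++) c.a++; }
void thread1(void) { for (int i = 0; i < 1000000; i++) c.b++; }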

Behavior of Cache Coherence Protocols
False sharing increases with larger cache line sizes
– Only true sharing remains with single word/byte cache lines
False sharing can be reduced with better placement of data in memory
True sharing tends to decrease with larger cache line sizes (due to locality)
Classifying misses in a multiprocessor is not straightforward
– E.g., if P0 has line A in the cache and evicts it due to capacity limitations, and later P1 writes to the same line: is this a capacity or a coherence miss?
– It is both, as fixing one problem (e.g., increasing cache size) won't fix the other (see Figure 5.20 of Culler&Singh for a complete decision chart)

Behavior of Cache Coherence Protocols
Common types of data access patterns:
– Private: data that is only accessed by a single processor
– Read-only shared: data that is accessed by multiple processors but only for reading (this includes instructions)
– Migratory: data that is used and modified by multiple processors, but in turns
– Producer-consumer: data that is updated by one processor and consumed by another
– Read-write: data that is used and modified by multiple processors simultaneously
  Falsely shared data
  Data used for synchronization (Lecture 10)
Bottom-line: threads don't usually read and write the same data indiscriminately

Lect. 8: Shared Mem. Multiprocessors II/V
Snooping coherence on a simple shared bus
– "Easy", as all processors and the memory controller can observe all transactions
– A bus-side cache controller monitors the tags of the lines involved and reacts if necessary by checking the contents and state of the local cache
– The bus provides a serialization point (i.e., every transaction A is either before or after another transaction B)
More complex with split-transaction buses
[Diagram: two processors (P1, P2), each with an L1 cache holding per-line state bits, attached to the shared bus; cache states: 00 = invalid, 01 = shared, 10 = modified]

Snooping on Simple Bus
"The devil is in the details", Classic Proverb
Problem: conflict when the processor and the bus-side controller must check the cache at the same time
Solutions:
– Use dual-ported modules for the tag and state array
– Or, duplicate the tag and state array
  Both copies must be kept consistent when one is changed, which introduces some amount of conflicts
[Diagram as before: P1 and P2 with per-line state bits, now with both the processor-side Ld/St port and the bus-side snooper accessing the tags]

Snooping on Simple Bus
Problem: even if the bus is atomic, transactions are not instantaneous and may require several steps → transactions are not atomic
– E.g., part of a transaction may be delayed by a memory response or by a bus-side controller that had to wait to access its tags
– E.g., out-of-order processors may issue cache requests that conflict with the current request being served
– E.g., an upgrade request may lose bus arbitration to another processor's and may have to be re-issued as a full write miss (due to the required invalidation)
Solution:
– Introduce transient states to cache lines and the protocol (the I, S, M, etc. states seen in Lecture 7 are then called the stable states)

Example: Extended MESI Protocol
Transactions originated at this CPU:
[State diagram with transient states added to MESI: a CPU read from Invalid enters transient state I→S,E and moves to Shared or Exclusive when the bus is granted, depending on whether sharers answered; a CPU write from Invalid enters I→M and moves to Modified when the bus is granted; a CPU write from Shared enters S→M and moves to Modified when the bus is granted and there is no conflict, but on a conflict must be re-issued as a full write miss; hits behave as in plain MESI]

Snooping with Multi-Level Hierarchies
Problems:
– The processor interacts with the L1 while the bus snooping device interacts with the L2, and propagating such operations up or down is not instantaneous
– L2 lines are usually bigger than L1 lines
[Diagram: P1 and P2, each with a private L1 and L2; loads/stores act on the L1 while the bus snooper acts on the L2; each line carries state bits (00 = invalid, 01 = shared, 10 = modified)]

Snooping with Multi-Level Hierarchies
Solution:
1. Maintain the inclusion property
– Lines in L1 must also be in L2 → no data is found solely in L1, so there is no risk of missing a relevant transaction when snooping at L2
– Lines in M state in L1 must also be in M state in L2 → the snooping controller at L2 can identify all data that is modified locally
2. Propagate coherence transactions

Snooping with Multi-Level Hierarchies
Maintaining the inclusion property
Assume: L1: associativity a1, number of sets n1, block size b1; L2: associativity a2, number of sets n2, block size b2
– Difficulty: replacement policy (e.g., LRU)
Assume: a1=a2=2; b1=b2; n2=k*n1; lines m1, m2, and m3 map to the same set in L1 and the same set in L2
[Worked example (flattened figure), both caches initially hold m1:
1. Ld m2: misses in L1 (2) and L2 (3), fills into both (4, 5)
6. Ld m1: hits in L1, so the L2 LRU information is not updated (m1 becomes MRU in L1 but stays LRU in L2)
7. Ld m3: misses in L1 (8) and L2 (9) and fills into both (10, 11); L1 evicts its LRU line m2, but L2 evicts its LRU line m1 → m1 is now in L1 but not in L2, violating inclusion]

Snooping with Multi-Level Hierarchies
Maintaining the inclusion property
Assume: L1: associativity a1, number of sets n1, block size b1; L2: associativity a2, number of sets n2, block size b2
– Difficulty: different line sizes
Assume: a1=a2=1; b1=1, b2=2; n1=4, n2=8
[Worked example (flattened figure): in L1, word w0 maps to set 0 and word w17 to set 1; in L2, the two-word blocks {w0,w1} and {w16,w17} both map to the same set]
Thus, words w0 and w17 can coexist in L1, but not in L2

Snooping with Multi-Level Hierarchies
Maintaining the inclusion property
– Most combinations of L1/L2 size, associativity, and line size do not automatically lead to inclusion
– One solution is to have a1=1, a2≥1, b1=b2, and n1≤n2
– A more common solution is to invalidate the L1 line (or lines, if b1<b2) upon replacing a line in L2
– Must also invalidate the L1 line(s) when an L2 line is invalidated due to coherence
  Propagate all invalidations from L2 to L1, whether relevant or not
  Or keep extra state in the L2 lines to tell whether the line is also present in L1 or not (inclusion bits)
– Finally, add a new state to L2 (modified-but-stale) to keep track of lines that are in state M in L1

Snooping with Split-Transaction Buses
Non-split-transaction buses are idle from when the address request is finished until the data returns from memory or another cache
In split-transaction buses, transactions are split into a request transaction and a response transaction, which can be separated
Sometimes implemented as two buses: one for requests and one for responses
[Timing diagram: on a normal bus, address 1 occupies the address lines until data 1 returns on the data lines; on a split-transaction bus, addresses 2 and 3 can be issued on the address lines while data 0 and data 1 return on the data lines]

Snooping with Split-Transaction Buses
Problems
– Multiple requests can clash (e.g., a read and a write, or two writes, to the same data) (note that this is more complicated than the case in Slide 3, as now different transactions may be at different stages of service)
– Buffers used to hold pending transactions may fill up and cause incorrect execution and even deadlock (flow control is required)
– Responses from multiple requests may appear in a different order than their respective requests
  Responses and requests must then be matched using tags for each transaction
Note: it may be necessary for snoop controllers to request more time before responding (e.g., when they can't get quick enough access to the local cache tags)
Note: snoop controllers may have to keep track themselves of which transactions are pending, in case there is a conflict

Snooping with Split-Transaction Buses
Clashing requests
– Allow only one outstanding request at a time for each line (e.g., SGI Challenge)
Flow control
– Use negative acknowledgement (NACK) when buffers are full (requests must be retried later; a bit more tricky with responses, due to the danger of deadlock) (e.g., SGI Challenge)
– Or, design the size of all queues for the worst case scenario
Ordering of transactions
– Responses can appear in any order → the interleaving of the requests fully determines the order of transactions (e.g., SGI Challenge)
– Or, enforce a FIFO order of transactions across the whole system (caches + memory) (e.g., Sun Enterprise)

Sun Enterprise (1996-2001)
Up to 30 UltraSparc processors (Enterprise 6000)
The Gigaplane bus: (3rd generation of buses from Sun)
– Peak bandwidth of 2.67GB/s at 83MHz
– Supports up to 16 nodes (either processor or I/O boards)
– 256 bits data, 43 bits address, 32 bits ECC, and 57 control lines
– Split-transaction with up to 112 outstanding transactions
Up to 30GB of main memory, 16-way interleaved
Memory is physically located in the processor boards, but it is still a UMA system
[Diagram: CPU/memory cards (two processors with L1/L2 caches plus a memory module and controller each) and I/O cards (SBus, FibreChannel) attached to the Gigaplane bus via bus interfaces]

Sun Fire (2001-present)
Up to 106 UltraSparc III processors (Fire 15K)
The Fireplane bus: (4th generation of buses from Sun)
– Peak bandwidth of 9.6GB/s at 150MHz
– Actually implemented using 4 levels of switches, not bus lines
– Consists of two snooping domains connected by the upper-level switch
Up to 576GB of main memory
Memory is physically located in the processor boards, but it is still a UMA system
[Diagram: a level-0 switch pairs each processor (with its L1/L2) and memory; higher-level data switches (a 3x3 switch at level 1, level-2 switches, and an 18x18 switch at level 3) build systems of 2, 8, 24, and up to 106 processors]

Snooping with Ring
Like a bus, rings easily support broadcasts
Snooping is implemented by all controllers checking each message as it passes by and re-injecting it into the ring
Potentially multiple transactions can be simultaneously on different stretches of the ring (harder to enforce proper ordering)
Large latency for long rings, growing linearly with the number of processors
Used to provide coherence across multiple chips in current CMP systems (e.g., IBM Power 5)
[Diagram: six nodes, each with a processor, L1 cache, and memory, connected in a ring]

References and Further Reading
Original (hardware) cache coherence works:
"Using Cache Memory to Reduce Processor Memory Traffic", J. Goodman, Intl. Symp. on Computer Architecture, June 1983.
"A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories", M. Papamarcos and J. Patel, Intl. Symp. on Computer Architecture, June 1984.
"Hierarchical Cache/Bus Architecture for Shared-Memory Multiprocessors", A. Wilson Jr., Intl. Symp. on Computer Architecture, June 1987.
An early survey of cache coherence protocols:
"Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model", J. Archibald and J.-L. Baer, ACM Trans. on Computer Systems, November 1986.
Discussion on the difficulties of maintaining inclusion:
"On the Inclusion Properties for Multi-Level Cache Hierarchies", J.-L. Baer and W.-H. Wang, Intl. Symp. on Computer Architecture, May 1988.

References and Further Reading
Modern bus-based coherent multiprocessors:
"The Sun Fireplane System Interconnect", A. Charlesworth, Supercomputing Conf., November 2001.
Some software cache coherence schemes:
"The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture", G. Pfister, W. Brantley, D. George, S. Harvey, W. Kleinfelder, K. McAuliffe, E. Melton, V. Norton, and J. Weiss, Intl. Conf. on Parallel Processing, August 1985.
"Automatic Management of Programmable Caches", R. Cytron, S. Karlowsky, and K. McAuliffe, Intl. Conf. on Parallel Processing, August 1988.

Lect. 9: Shared Mem. Multiprocessors III/V
Snooping coherence
– Global state of a memory line is the collection of its state in all caches, and there is no summary state anywhere
– All cache controllers monitor all other caches' activities and maintain the state of their lines
– Requires a broadcast shared medium (e.g., bus or ring) that also maintains a total order of all transactions
– The bus acts as a serialization point to provide ordering
Directory coherence
– Global state of a memory line is the collection of its state in all caches, but there is a summary state at the directory
– Cache controllers do not observe all activity, but interact only with the directory
– Can be implemented on scalable networks, where there is no total order and no simple broadcast, but only one-to-one communication
– The directory acts as a serialization point to provide ordering (Lecture 11)

Directory Structure
Directory information (for every memory line)
– Line state bits (e.g., not cached, shared, modified)
– Sharing bit-vector: one bit for each processor that is sharing, or for the single processor that has the modified line
– Organized as a table indexed by the memory line address
Directory controller
– Hardware logic that interacts with the cache controllers and enforces cache coherence
[Example directory for up to 3 processors (cache states: 00 = invalid, 01 = shared, 10 = modified; directory states: 00 = not cached, 01 = shared, 10 = modified):
– entry ⟨state 00, sharing vector 000, value 4⟩: the line is not cached, so the sharing vector is empty and the memory value is valid
– entry ⟨state 01, sharing vector 101, value 9⟩: the line is shared in P0 and P2 and the memory value is valid]
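A sketch of such a table entry in C (an illustration of the 3-processor example above, not an actual implementation):

#define NPROC 3

enum dir_state { NOT_CACHED, DIR_SHARED, DIR_MODIFIED };  /* 00, 01, 10 */

struct dir_entry {
    enum dir_state state;
    unsigned sharers;   /* bit i set if Pi has a copy (or is the owner) */
};

/* "shared in P0 and P2, memory valid" from the example:
   entry.state = DIR_SHARED; entry.sharers = 0x5;  (binary 101) */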

Directory Operation
Example: load with no sharers
[Message diagram: P0's load misses and a miss request is sent to the directory; the line is not cached anywhere, so the directory sets the state to shared with P0's bit set in the sharing vector and replies with the value (4) from memory; P0 caches the line in shared state]

Directory Operation
Example: load with sharers
[Message diagram: the line is already in shared state with P0 holding a copy; another processor's load misses and a miss request is sent to the directory; the directory adds the requester to the sharing vector and replies with the value (4) from memory, which is still valid]

Directory Operation
Example: store with sharers
[Message diagram: the line is in shared state with copies in P0 and P1 when another processor issues a store; the store misses and a miss request is sent to the directory; the directory sends invalidations to all sharers and each replies with an acknowledgement; once all acknowledgements are collected, the directory sets the state to modified with only the writer in the sharing vector and replies to the writer, whose cache now holds the new value (6); the memory copy (4) is now stale]

Directory Operation
Example: load with owner
[Message diagram: the line is in modified state, owned by P1 (value 6), while memory holds the stale value 4; P0's load misses and a miss request is sent to the directory; the directory forwards the request to the owner P1; P1 sends the value to P0 and an acknowledgement with the value to the directory; the directory updates memory to 6 and sets the state to shared with both P0 and P1 in the sharing vector]

Notes on Directory Operation
On a write with multiple sharers it is necessary to collect and count all the invalidation acknowledgements (ACK) before actually writing
On transactions that involve more complex state changes the directory must also receive an acknowledgement
– In case something goes wrong
– To establish the completion of the load or store (Lecture 11)
As with snooping on buses, "the devil is in the details": we actually need transient states, must deal with conflicting requests, and must handle multi-level caches
As with buses, when buffers overflow we need to introduce NACKs
Directories work well if only a small number of processors share common data at any given time (otherwise broadcasts are better)

Quantitative Motivation for Directories
Number of invalidations per store miss on MSI with infinite caches
[Bar chart (Culler and Singh, Fig. 8.9): for LU, Radix, Ocean, Raytrace, Barnes-Hut, and Radiosity, the distribution of the number of invalidations per store miss (bins from 0 up to 63); almost all store misses invalidate only a handful of copies]
Bottom-line: the number of sharers for read-write data is small

Example Implementation Difficulties
Operations have to be serialized locally:
[Message diagram]
1. P0 sends a read request for line A.
2. P1 sends a read-exclusive request for line A (waits at the directory).
3. The directory responds to (1) and sets the sharing vector (the message gets delayed).
4a/b. The directory responds to (2), to both P0 (sharer) and P1 (new owner).
5. P0 invalidates line A and sends an acknowledgement.
Problem: when (3) finally arrives at P0, the stale value of line A would be placed in the cache. Solution: P0 must serialize transactions locally so that it won't react to (4b) while it has a read pending.

Operations have to be serialized at the directory:
[Message diagram]
1. P1 sends a read-exclusive request for line A.
2. The directory forwards the request to P0 (the owner).
3a/b. P0 sends the data to P1 and an acknowledgement to the directory (the ack gets delayed).
4. P1 receives (3a) and considers the read-exclusive complete. A replacement miss then sends the updated value back to memory.
Problem: when (4) arrives, the directory accepts it and overwrites memory; when (3b) finally arrives, the directory completes the ownership transfer and thinks that P1 is the owner. Solution: the directory must serialize transactions so that it won't react to (4) while the ownership transfer is pending.

Directory Overhead
Problem: consider a system with 128 processors, 256GB of memory, 1MB of L2 cache per processor, and 64-byte cache lines
– 128 bits for the sharing vector plus 3 bits for state → ~16 bytes per line
– Per line: 16/64 = 0.25 → 25% memory overhead
– Total: 0.25*256GB = 64GB of memory overhead!
Solution: Cached Directories
– At any given point in time there can be only 128MB/64B = 2M lines actually cached in the whole system
– Lines not cached anywhere are implicitly in state "not cached" with a null sharing vector
– To maintain only the entries for the actively cached lines we also need to keep the tags → 64 bits = 8 bytes
– Overhead per cached line: (8+16)/64 = 0.375 → 37.5% overhead
– Total overhead: 24 bytes * 2M entries = 48MB of memory overhead

Scalability of Directory Information
Problem: the number of bits in the sharing vector limits the maximum number of processors in the system
– Larger machines are not possible once we decide on the size of the vector
– Smaller machines waste memory
Solution: Limited Pointer Directories
– In practice only a small number of processors share each line at any time
– To keep the ID of up to n processors we need log2(n) bits, and to remember m sharers we need m IDs → m*log2(n) bits
– For n=128 and m=4 → 4*log2(128) = 28 bits = 3.5 bytes
– Total overhead: (3.5/64)*256GB = 14GB of memory overhead
– Idea:
  Start with the pointer scheme
  If more than m processors attempt to share the same line, then trap to the OS and let the OS manage longer lists of sharers
  Maintain one extra bit per directory entry to identify the current representation

Distributed Directories
Directories can be used with UMA systems, but are more commonly used with NUMA systems
In this case the directory is actually distributed across the system
These machines are then called cc-NUMA, for cache-coherent NUMA, and DSM, for distributed shared memory
[Diagram: four nodes on an interconnection network, each containing a CPU, a cache, a memory module, and its slice of the directory]

Distributed Directories
Now each part of the directory is only responsible for the memory lines of its node
How are memory lines distributed across the nodes?
– Lines are mapped to nodes per OS page
– Pages are mapped to nodes following their physical address
– The mapping of physical pages to nodes is done statically in chunks
– E.g., 4 nodes with 1MB of memory each and 4KB pages (thus, 256 pages per node):
  Node 0 is responsible (home) for pages [0,255]
  Node 1 is responsible for pages [256,511]
  Node 2 is responsible for pages [512,767]
  Node 3 is responsible for pages [768,1023]
  A load to address 1478656 goes to page 1478656/4096 = 361, which goes to node 361/256 = 1 (see the sketch below)
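The home computation from this toy example in C (a sketch; the constants match the example above):

/* 4KB pages, 256 consecutive pages homed per node */
int home_node(unsigned long paddr)
{
    unsigned long page = paddr / 4096;   /* physical page number */
    return (int)(page / 256);            /* home node */
}
/* home_node(1478656) == 1, as in the example */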

Distributed Directories
How is data mapped to nodes?
– With a single user, the OS can map a virtual page to any physical page → the OS can place data almost anywhere, albeit at the granularity of pages
– Common mapping policies:
  First-touch: the first processor to request a particular datum has the datum's page mapped to its range of physical pages
  – Good when each processor is the first to touch the data it needs, and other nodes do not access this page often
  Round-robin: as data is requested, virtual pages are mapped to physical pages in circular order (i.e., node 0, node 1, node 2, ..., node N, node 0, ...)
  – Good when one processor manipulates most of the data at the beginning of a phase (e.g., initialization of data)
  – Good when some pages are heavily shared (hot pages)
  Note: data that is private is always mapped locally
– Advanced cc-NUMA OS functionality:
  The mapping of virtual pages to nodes can be changed on-the-fly (page migration)
  A virtual page with read-only data can be mapped to physical pages in multiple nodes (page replication)

Combined Coherence Schemes
Use bus-based snooping within nodes and directory (or bus snooping) across nodes
– Bus-based snooping coherence for a small number of processors is relatively straightforward
– Hopefully communication across processors within a node will not have to go beyond this domain
– Easier to scale the machine size up and down
– Two levels of state:
  Per-node at the higher level (e.g., a whole node owns modified data, but the directory does not know which processor in the node actually has it)
  Per-processor at the lower level (e.g., by snooping inside the node we can find the exact owner and the exact up-to-date value)
[Diagram: two bus-based SMP nodes, each with four CPUs with caches, a main memory, and a directory, connected by a bus or scalable interconnect]

References and Further Reading
Original directory coherence idea:
"A New Solution to Coherence Problems in Multicache Systems", L. Censier and P. Feautrier, IEEE Trans. on Computers, December 1978.
Seminal work on distributed directories:
"The DASH Prototype: Implementation and Performance", D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy, Intl. Symp. on Computer Architecture, June 1992.
A commercial machine with distributed directories:
"The SGI Origin: a ccNUMA Highly Scalable Server", J. Laudon and D. Lenoski, Intl. Symp. on Computer Architecture, June 1997.
A commercial machine with SCI:
"STiNG: a CC-NUMA Computer System for the Commercial Marketplace", T. Lovett and R. Clapp, Intl. Symp. on Computer Architecture, June 1996.
Adaptive full/limited pointer distributed directory protocols:
"An Evaluation of Directory Schemes for Cache Coherence", A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, Intl. Symp. on Computer Architecture, June 1988.


Probing Further

Page migration and replication for ccNUMA:
  "Operating System Support for Improving Data Locality on ccNUMA Compute Servers", B. Verghese, S. Devine, A. Gupta, and M. Rosenblum, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1996.
Cache Only Memory Architectures:
  "Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures", P. Stenstrom, T. Joe, and A. Gupta, Intl. Symp. on Computer Architecture, June 1992.
Recent alternative protocols (token, ring):
  "Token Coherence: Decoupling Performance and Correctness", M. Martin, M. Hill, and D. Wood, Intl. Symp. on Computer Architecture, June 2003.
  "Coherence Ordering for Ring-Based Chip Multiprocessors", M. Marty and M. Hill, Intl. Symp. on Microarchitecture, December 2006.


Lect. 10: Shared Mem. Multiprocessors IV/V

Synchronization is necessary to ensure that operations in a parallel program happen in the correct order

Different primitives are used at different levels of abstraction:
– High-level (e.g., critical sections, monitors, parallel sections and loops, atomic): supported in the languages themselves or in language extensions (e.g., Java threads, OpenMP)
– Middle-level (e.g., semaphores, condition variables, locks, barriers): supported in libraries (e.g., POSIX threads)
– Low-level (e.g., compare&swap, test&set, load-link & store-conditional): supported in hardware

Higher-level primitives can be constructed from lower-level ones

Things to consider: deadlock, livelock, starvation


Example: Sync. in Java Threads

Synchronized Methods
– Concurrent calls to the method on the same object have to be serialized
– All data modified during one call to the method becomes atomically visible to all calls to other methods of the object
– E.g.:

    public class SynchronizedCounter {
        private int c = 0;
        public synchronized void increment() { c++; }
    }

    SynchronizedCounter myCounter;

– Can be implemented with locks


Example: Sync. in OpenMP

Doall loops
– Iterations of the loop can be executed concurrently
– After the loop, all processors have to wait and a single one continues with the following code
– All data modified during the loop is visible after the loop
– E.g. (a runnable version is sketched below):

    #pragma omp parallel for \
            private(i,s) shared(A,B) \
            schedule(static)
    for (i = 0; i < N; i++) {
        s = …
        A[i] = B[i] + s;
    }

– Can be implemented with a barrier
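For reference, a self-contained version of the same loop (a sketch: the array size and the value assigned to s are illustrative, not from the slide; compile with -fopenmp):

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static double A[N], B[N];
        int i;
        double s;
        /* each thread gets a chunk of iterations; the implicit barrier at
           the end of the loop is what makes A fully visible afterwards */
        #pragma omp parallel for private(i, s) shared(A, B) schedule(static)
        for (i = 0; i < N; i++) {
            s = 1.0;              /* stands in for the slide's elided "s = …" */
            A[i] = B[i] + s;
        }
        printf("A[0] = %f\n", A[0]);
        return 0;
    }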


Example: Sync. in POSIX Threads

Locks
– Only one thread can own the lock at any given time
– Unlocking makes all the modified data visible to all threads, and locking forces the thread to obtain fresh copies of all data
– E.g.:

    pthread_mutex_t mylock;

    pthread_mutex_init(&mylock, NULL);
    pthread_mutex_lock(&mylock);
    count++;
    pthread_mutex_unlock(&mylock);

– Can be implemented with test&set
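A complete, runnable variant of the snippet above (the thread function and iteration count are illustrative additions, not from the slide):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;
    static int count = 0;

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&mylock);     /* entry forces fresh copies of shared data */
            count++;
            pthread_mutex_unlock(&mylock);   /* exit publishes the modification */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("count = %d\n", count);       /* always 200000 thanks to the lock */
        return 0;
    }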


Example: Building CS from Locks

Relatively straightforward
The actual implementation is encapsulated in a library function
In practice, the library may implement different policies on how to wait for a lock and how to avoid starvation

    // initialization
    int A, B, C;
    lock_t mylock;

    // parallel phase
    Processor 0:              Processor 1:
    lock(&mylock);            lock(&mylock);
    A = …;                    … = A + …;
    B = …;                    … = C + …;
    unlock(&mylock);          unlock(&mylock);


Example: Building CS from Ld/St?

E.g., Peterson's algorithm:

    // initialization
    int A, B, C;
    int mylock[2], turn;
    mylock[0] = 0; mylock[1] = 0;
    turn = 0;

    // parallel phase
    Processor 0:                      Processor 1:
    mylock[0] = 1; turn = 1;          mylock[1] = 1; turn = 0;
    while (mylock[1] && turn==1);     while (mylock[0] && turn==0);
    A = …;                            … = A + …;
    B = …;                            … = C + …;
    mylock[0] = 0;                    mylock[1] = 0;

No! This is not a safe way to implement CS in a modern multiprocessor (Lecture 11)


Hardware Primitives

The hardware's job is to provide atomic memory operations, which involves both the processors and the memory subsystem
Implemented in the IS, but usually encapsulated in library function calls by manufacturers
At a minimum, hardware must provide an atomic swap
Examples:
– Compare&Swap (e.g., Sun Sparc) and Test&Set: if the value in memory is equal to the value in register Ra, then swap the memory value with the value in Rb and return memory's original value in Ra
   Can implement more complex conditions for synchronization
   Requires a comparison operation in memory, or must block the memory location until the processor is done with the comparison

    CAS (R1),R2,R3   ; MEM[R1]==R2 ? MEM[R1]=R3 : no change; old MEM[R1] returned
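A minimal sketch of a spin lock built on compare&swap, here via GCC's legacy __sync builtin (the function names are ours, not from the slide):

    static volatile int the_lock = 0;            /* 0 = free, 1 = taken */

    static void cas_lock(volatile int *l) {
        /* atomically: if (*l == 0) { *l = 1; }; the old value is returned */
        while (__sync_val_compare_and_swap(l, 0, 1) != 0)
            ;  /* spin until we are the one to flip 0 -> 1 */
    }

    static void cas_unlock(volatile int *l) {
        __sync_lock_release(l);                  /* store 0 with release semantics */
    }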


Hardware Primitives

Examples:
– Fetch&Increment (e.g., Intel x86) (in general, Fetch&Op): increment the value in memory and return the old value in a register
   Less flexible than Compare&Swap
   Requires an arithmetic operation in memory, or must block the memory location (or the bus) until the processor is done with the operation (e.g., x86)
– Swap: swap the values in memory and in a register
   Least flexible of all
   Does not require a comparison or arithmetic operation in memory

    lock; ADD (R1),R2   ; MEM[R1]=MEM[R1]+R2
    LODSW               ; accumulator=MEM[DS:SI]


Building Locks with Hdw. Primitives

Example: Test&Set

    int lock(int *mylock) {
        int value;
        value = test&set(mylock, 1);
        if (value) return FALSE;   // lock was already taken
        else return TRUE;          // we acquired the lock
    }

    void unlock(int *mylock) {
        *mylock = 0;
        return;
    }
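The same pair can be written with GCC's builtin, which emits the machine's atomic exchange (a sketch under that assumption; TRUE/FALSE become 1/0):

    int lock(int *mylock) {
        /* atomically write 1 and get the previous value back */
        int value = __sync_lock_test_and_set(mylock, 1);
        return value == 0;     /* 1 (TRUE) if we acquired it, 0 (FALSE) otherwise */
    }

    void unlock(int *mylock) {
        __sync_lock_release(mylock);   /* atomic store of 0 */
    }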


What If the Lock is Taken?

Spin-wait lock:

    while (!lock(&mylock));
    …
    unlock(&mylock);

– Each call to lock invokes the hardware primitive, which involves an expensive memory operation and takes up network bandwidth

Spin-wait on cache: Test-and-Test&Set
– Spin on the cached value using normal loads and rely on the coherence protocol
– Still, all processors race to memory, and clash, once the lock is released

    while (TRUE) {
        if (lock(&mylock)) break;   // attempt the atomic primitive
        while (mylock);             // spin on the cached copy until the lock looks free
    }
    …
    unlock(&mylock);


What If the Lock is Taken?

Software solution: Blocking locks and Backoff
– The wait can be implemented in the application itself (backoff) or by calling the OS to be put to sleep (blocking)
– The waiting time is usually increased exponentially with the number of retries
– Similar to the backoff mechanism adopted in the Ethernet protocol

    while (TRUE) {
        if (lock(&mylock)) break;
        wait(time);                 // time grows with each retry
    }
    …
    unlock(&mylock);
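A sketch of the exponential variant, built on the lock() function from the Test&Set slide (usleep stands in for either spinning or asking the OS to sleep; the 1ms cap is an arbitrary choice of ours):

    #include <unistd.h>   /* usleep */

    void lock_with_backoff(int *mylock) {
        unsigned delay = 1;                 /* microseconds */
        while (!lock(mylock)) {
            usleep(delay);                  /* back off instead of hammering memory */
            if (delay < 1000) delay *= 2;   /* exponential increase with each retry */
        }
    }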


A Better Hardware Primitive

Load-link and Store-conditional
– Implement the atomic memory operation as two operations
– Load-link (LL):
   Registers the intention to acquire the lock
   Returns the present value of the lock
– Store-conditional (SC):
   Only stores the new value if no other processor attempted a store between our previous LL and now
   Returns 1 if it succeeds and 0 if it fails
– Relies on the coherence mechanism to detect conflicting SC's
– All operations are done locally at the cache controllers or directory; no need for a complex blocking operation in memory
– A new register is added to the L1 to remember a pending LL from the local processor (e.g., the PowerPC RESERVE register)
– Also benefits from blocking and backoff
– Introduced in the MIPS processor, now also used in PowerPC and ARM


A Better Hardware Primitive

Load-link and Store-conditional operation:

[Figure: two scenarios over the coherence substrate, each with P0 and P1, their L1 caches, and RESERVE registers. Left: P0 issues LL 0xA, the LL completes and the reservation is recorded; its later SC 0xA succeeds. Right: P0 and P1 both issue LL 0xA; one processor's SC 0xA completes first and succeeds, which clears the other's reservation, so the other's SC fails and must retry.]


Building Locks with LL/SC

E.g., spin-wait with attempted swap:
– At the end, if the SC succeeds, the (old) value of the lock variable will be in R4
– If the lock was taken, then start over again

    try:   OR   R3,R4,R0   ; move value to be exchanged into R3
           LL   R2,0(R1)   ; value of lock loaded
           SC   R3,0(R1)   ; try to store the new value
           BEQZ R3,try     ; branch if SC failed
           MOV  R4,R2      ; move old lock value into R4
    check: BNEZ R4,try     ; try again if lock was taken
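On LL/SC machines, a C11 compare-exchange loop typically compiles down to exactly this kind of retry loop (a sketch of ours, not the slide's code):

    #include <stdatomic.h>

    /* spin until we atomically change the lock from 0 (free) to 1 (taken) */
    void llsc_style_lock(atomic_int *lock) {
        int expected = 0;
        /* the _weak variant may fail spuriously, exactly like a failed SC,
           so it must sit in a retry loop */
        while (!atomic_compare_exchange_weak(lock, &expected, 1))
            expected = 0;   /* reset the expected value and try again */
    }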


An Alternative Hdw. Approach

Locks have a relatively large overhead and, thus, are suitable for guarding relatively large amounts of data
Some algorithms need to exchange only a small number of words each time
Also, a consumer thread must wait for all the data guarded by a lock to be ready before it can begin work
Better approach for fine-grain synchronization: Full/Empty Bits
– Associate one bit with every memory word (1.5% overhead for 64-bit words)
– Augment the behavior of loads/stores:
   Load: if the word is empty, trap to the OS (to wait); otherwise, return the value and set the bit to empty
   Store: if the word is full, trap to the OS (to deal with the error); otherwise, store the new value, set the bit to full, and release any threads pending on the word (with OS help)
   Reset: set the bit to empty
– Good for producer-consumer types of communication/synchronization
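Commodity hardware has no full/empty bits, but the same per-word semantics can be emulated in software; a minimal sketch with POSIX primitives (all names are ours; note that where the slide's store traps to the OS on a full word, this sketch simply waits):

    #include <pthread.h>

    typedef struct {
        int value;
        int full;               /* the emulated full/empty bit */
        pthread_mutex_t m;
        pthread_cond_t  cv;
    } fe_word_t;

    /* store: wait until empty, write the value, set the bit to full */
    void fe_store(fe_word_t *w, int v) {
        pthread_mutex_lock(&w->m);
        while (w->full)
            pthread_cond_wait(&w->cv, &w->m);
        w->value = v;
        w->full = 1;
        pthread_cond_broadcast(&w->cv);   /* release pending consumers */
        pthread_mutex_unlock(&w->m);
    }

    /* load: wait until full, return the value, set the bit to empty */
    int fe_load(fe_word_t *w) {
        pthread_mutex_lock(&w->m);
        while (!w->full)
            pthread_cond_wait(&w->cv, &w->m);
        int v = w->value;
        w->full = 0;
        pthread_cond_broadcast(&w->cv);   /* release pending producers */
        pthread_mutex_unlock(&w->m);
        return v;
    }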


Example: Using Full/Empty Bits

Compare against the example in Slide 5:

    int A, B, C;

    Processor 0:                    Processor 1:
    A = …;  // blocks if not        … = A + …;  // waits if not
            // yet used                         // ready
    B = …;  // no impact on P1      … = C + …;  // does not have to wait


References and Further Reading

A commercial machine with Full/Empty bits:
  "The Tera Computer System", R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, Intl. Conf. on Supercomputing, June 1990.
Performance evaluations of synchronization for shared memory:
  "The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors", T. Anderson, IEEE Trans. on Parallel and Distributed Systems, January 1990.
  "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", J. Mellor-Crummey and M. Scott, ACM Trans. on Computer Systems, February 1991.


Lect. 11: Shared Mem. Multiprocessors V/V

Consider the following code:

    // initialization
    A=0, B=0, C=0;
    …
    // parallel phase
    P1:           P2:               P3:
    C = 1;        while (A==0);     while (B==0);
    A = 1;        B = 1;            print A;

– What are the possible outcomes?
   A==1, C==1?  Yes. This is what one would expect.
   A==0, C==1?  Yes. If the st to B overtakes the st to A on the interconnect toward P3.
   A==0, C==0?  Yes. If the st to B overtakes both the st to A and the st to C toward P3.
   A==1, C==0?  Yes. If the st to A overtakes the st to C issued by the same processor.


Memory Consistency

Cache coherence:
– Guarantees eventual write propagation (but no guarantees on when writes propagate)
– Guarantees a single order of all writes to the same memory location (but no guarantees on the order of writes to different locations)

Memory consistency:
– Specifies the ordering of loads and stores to different memory locations
– Defined in so-called Memory Consistency Models
– This is really a "contract" between the hardware, the compiler, and the programmer:
   i.e., the hardware and compiler will not violate the ordering specified
   i.e., the programmer will not assume a stricter order than that of the model
– Hardware/compiler provide "safety net" mechanisms so the user can enforce a stricter order than that provided by the model


Sequential Consistency (SC)

Key ideas:
– The behavior of a multiprocessor should be the same as that of a time-shared uniprocessor
– Thus, memory ordering has to follow the individual order in each thread, and there can be any interleaving of such sequential segments
– The memory abstraction is that of a random switch to memory:

    [Figure: processors P0, P1, …, Pn connected through a single switch to Memory]

– Notice that in practice many orderings are still valid


Terminology

Issue: a memory operation leaves the processor and becomes visible to the memory subsystem
Performed: a memory operation appears to have taken place
– Performed w.r.t. processor X: as far as processor X can tell
   E.g., a store S by processor Y to variable A is performed w.r.t. processor X if a subsequent load by X to A returns the value of S (or the value of a store later than S, but never a value older than that of S)
   E.g., a load L is performed w.r.t. processor X if no subsequent store by any processor can affect the value returned by L to X
– Globally performed or complete: performed w.r.t. all processors
   E.g., a store S by processor Y to variable A is globally performed if any subsequent load by any processor to A returns the value of S
X-consistent execution: any execution that matches one of the possible total orders (interleavings) as defined by model X


Example: Sequential Consistency

    // initialization
    A=0, B=0, C=0;
    …
    // parallel phase
    P1:           P2:               P3:
    C = 1;        while (A==0);     while (B==0);
    A = 1;        B = 1;            print A;

Some valid SC orderings:
  (1) P1: st C (C=1); P1: st A (A=1); P2: ld A (while); P2: st B (B=1); P3: ld B (while); P3: ld A (print)
  (2) P1: st C (C=1); P2: ld A (while); …; P1: st A (A=1); P2: ld A (while); P2: st B (B=1); P3: ld B (while); P3: ld A (print)
  (3) P1: st C (C=1); P2: ld A (while); …; P1: st A (A=1); P2: ld A (while); P3: ld B (while); …; P2: st B (B=1); P3: ld B (while); P3: ld A (print)


Sequential Consistency (SC)

Sufficient conditions:
1. Threads issue memory operations in program order
2. Before issuing the next memory operation, threads wait until the last issued write completes (i.e., performs w.r.t. all other processors)
3. Before issuing the next memory operation, threads wait until the last issued read completes and until the matching write (i.e., the one whose value is returned to the read) also completes

Notes:
– Condition 3 is actually quite demanding and is the one that guarantees write atomicity
– In practice, the necessary conditions may be more relaxed
– These conditions are easily violated in real hardware and compilers (e.g., write buffers in hdw. and ld-st scheduling in the compiler)
– Program order is defined after the source code (the programmer's intention) and may be different from the assembly code due to compiler optimizations


Relaxed Memory Consistency Models

At a high level, they relax the ordering constraints between pairs of reads, writes, and read-writes (e.g., reads are allowed to bypass writes, writes are allowed to bypass each other)
In practice there are some implementation artifacts (e.g., no write atomicity in the Pentium)
Some models make synchronization explicit and different from normal loads and stores
Many models have been proposed and implemented:
– Total Store Ordering (TSO) (e.g., Sparc)
– Partial Store Ordering (PSO) (e.g., Sparc)
– Relaxed Memory Ordering (RMO) (e.g., Sparc)
– Processor Consistency (PC) (e.g., Pentium)
– Weak Ordering (WO)
– Release Consistency (RC)
– PowerPC


Relaxed Memory Consistency Models

Note that control flow and data flow dependences within a thread must still be honored regardless of the consistency model
– E.g.:

    A=0, B=0, C=0;
    …
    P1:           P2:               P3:
    C = 1;        while (A==0);     while (B==0);
    A = 1;        B = 1;            print A;

  In P2, the st to B cannot overtake the ld to A; in P3, the ld to A cannot overtake the ld to B

– E.g.:

    A = 1;
    …
    A = 2;
    …
    B = A;

  The second st to A cannot overtake the earlier st to A; the ld to A cannot overtake the earlier st to A


Example: Total Store Ordering (TSO)

Reads are allowed to bypass writes (can hide write latency)
Similar to PC
Still makes the prior example work as expected:

    P1:           P2:
    C = 1;        while (A==0);
    A = 1;        B = 1;

but breaks some intuitive assumptions, including Peterson's algorithm (Lecture 10):

    P1:           P2:
    A = 1;        B = 1;
    print B;      print A;

SC guarantees that A==0 and B==0 will never both be printed.
TSO allows it, if the ld of B (P1) overtakes the st to A (P1) and the ld of A (P2) overtakes the st to B (P2).


Example: Release Consistency (RC)

Reads and writes are allowed to bypass both reads and writes (i.e., any order that satisfies control flow and data flow is allowed)
Assumes explicit synchronization operations: acquire and release (Lecture 10). So, for correct operation, our example must become:

    P1:             P2:
    C = 1;          while (!Lock(A));
    Release(A);     B = 1;

Constraints:
– All previous writes must complete before a release can complete
– No subsequent reads can complete before a previous acquire completes
– All synchronization operations must be sequentially consistent (i.e., follow the rules of Slide 6, where an acquire is equivalent to a read and a release is equivalent to a write)


Example: Release Consistency (RC)

Example: original program order

    (1) Read/write
        …
        Read/write
        Acquire
    (2) Read/write
        …
        Read/write
        Release
    (3) Read/write
        …
        Read/write

Allowable overlaps:
– Reads and writes from block 1 can appear after the acquire (thus, initialization also requires an acquire-release pair)
– Reads and writes from block 3 can appear even before the release
– Between the acquire and the release, any order is valid within block 2 (and also for operations from blocks 1 and 3)

Note that despite the many reorderings, this still matches our intuition of critical sections


Races and Proper Synchronization

Races: unsynchronized loads and stores "race each other" through the memory hierarchy (e.g., the loads and stores to A, B, and C in the prior example)

Delay-set Analysis
– A technique that allows one to identify the races that require synchronization
– Mark all memory references in both threads and create arcs between them:
   Directed arcs that follow program order
   Undirected arcs that follow cross-thread data dependences (recall that the print implicitly contains a read)
– Cycles following the arcs indicate the problematic memory references

    P1:           P2:
    A = 1;        B = 1;
    print B;      print A;


Memory Barriers/Fences

How can I enforce some order of memory accesses?
– Ideally, use the synchronization primitives, but these can be very costly

Memory Barriers/Fences:
– New instructions in the IS, supported in the processor and memory
– Specify that previously issued memory operations must complete before the processor is allowed to proceed past the barrier:
   Write-to-read barrier: all previous writes must complete before the next read can be issued
   Write-to-write barrier: all previous writes must complete before the next write can be issued
   Full barrier: all previous loads and stores must complete before the next memory operation can be issued

Note: not to be confused with synchronization barriers (Lecture 10)

Note: stricter models can be emulated with such barriers on systems that only support less strict models
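As a concrete illustration (a C11 sketch of ours, not from the slides), the store-buffering example from the TSO slide can be fixed with full fences; with the two fences below, the outcome where both threads read 0 is forbidden, and with them removed, TSO-style reordering allows it:

    #include <stdatomic.h>

    atomic_int A, B;   /* both start at 0 */

    void p1(void) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* full barrier */
        int b = atomic_load_explicit(&B, memory_order_relaxed);
        (void)b;   /* the slide's "print B" */
    }

    void p2(void) {
        atomic_store_explicit(&B, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* full barrier */
        int a = atomic_load_explicit(&A, memory_order_relaxed);
        (void)a;   /* the slide's "print A" */
    }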


Final Notes

Many processors/systems support more than one consistency model, usually set at boot time
It is possible to decouple the consistency model presented to the programmer from that of the hardware/compiler
– E.g., the hardware may implement a relaxed model but the compiler guarantees SC via memory barriers
It is possible to allow a great degree of reordering under SC through speculative execution in hardware (with rollback when the stricter model is violated) (e.g., MIPS R10000/SGI Origin)


References and Further Reading

Original definition of sequential consistency:
  "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs", L. Lamport, IEEE Trans. on Computers, September 1979.
Original work on relaxed consistency models:
  "Correct Memory Operation of Cache-Based Multiprocessors", C. Scheurich and M. Dubois, Intl. Symp. on Computer Architecture, June 1987.
  "Weak Ordering: A New Definition", S. Adve and M. Hill, Intl. Symp. on Computer Architecture, June 1990.
  "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors", K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, Intl. Symp. on Computer Architecture, June 1990.
A very good tutorial on memory consistency models:
  "Shared Memory Consistency Models: A Tutorial", S. Adve and K. Gharachorloo, IEEE Computer, December 1996.


References and Further Reading

Problems with OO memory consistency models (e.g., Java):
  "Fixing the Java Memory Model", W. Pugh, Conf. on Java Grande, June 1999.
Delay set analysis:
  "Efficient and Correct Execution of Parallel Programs that Share Memory", D. Shasha and M. Snir, ACM Trans. on Programming Languages and Systems, February 1988.
Compiler support for SC on non-SC hardware:
  "Analyses and Optimizations for Shared Address Space Programs", A. Krishnamurthy and K. Yelick, Journal of Parallel and Distributed Computing, February 1996.
  "Hiding Relaxed Memory Consistency with Compilers", J. Lee and D. Padua, Intl. Conf. on Parallel Architectures and Compilation Techniques, October 2000.


Probing Further

Transactional Memory:
  "Transactional Memory: Architectural Support for Lock-Free Data Structures", M. Herlihy and J. Moss, Intl. Symp. on Computer Architecture, June 1993.
  "Transactional Memory Coherence and Consistency", L. Hammond, V. Wong, M. Chen, B. Carlstrom, J. Davis, B. Hertzberg, M. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, Intl. Symp. on Computer Architecture, June 2004.
  "Transactional Execution: Toward Reliable, High-Performance Multithreading", R. Rajwar and J. Goodman, IEEE Micro, November 2003.


Lect. 12: Multithreading

Memory latencies, and even the latencies to the lower-level caches, are becoming longer w.r.t. processor cycle times
There are basically 3 ways to hide/tolerate such latencies by overlapping computation with the memory access:
– Dynamic out-of-order scheduling
– Prefetching
– Multithreading
OOO execution and prefetching allow overlap of computation and memory access within the same thread (these were covered in CS3 Computer Architecture)
Multithreading allows overlap of the memory access of one thread/process with computation by another thread/process


Blocked Multithreading

Basic idea:
– Recall multi-tasking: on I/O, a process is context-switched out of the processor by the OS
– With multithreading, a thread/process is context-switched out of the pipeline by the hardware on longer-latency operations

[Figure: two timelines. Multi-tasking: process 1 runs, makes a system call for I/O, the OS interrupt handler runs and switches to process 2; on I/O completion the handler runs again and process 1 resumes. Multithreading: process 1 runs until a long-latency operation, a hardware context switch brings in process 2, which runs until its own long-latency operation, and so on.]


Blocked Multithreading

Basic idea:
– Unlike in multi-tasking, the context is still kept in the processor and the OS is not aware of any changes
– Context switch overhead is minimal (usually only a few cycles)
– Unlike in multi-tasking, the completion of the long-latency operation does not trigger a context switch (the blocked thread is simply marked as ready)
– Usually the long-latency operation is an L1 cache miss, but it can also be others, such as an fp or integer division (which takes 20 to 30 cycles and is unpipelined)

Context of a thread in the processor:
– Registers
– Program counter
– Stack pointer
– Other processor status words

Note: the term is commonly (mis)used to mean simply the fact that the system supports multiple threads


Blocked Multithreading

Latency hiding example:

[Figure: execution timelines of threads A, B, C, and D; pipeline and memory latencies are overlapped with the computation of the other threads, with context switch overheads and idle (stall) cycles marked; after Culler and Singh, Fig. 11.27]


Blocked Multithreading

Hardware mechanisms:
– Keeping multiple contexts and supporting a fast switch:
   One register file per context
   One set of special registers (including the PC) per context
– Flushing instructions from the previous context from the pipeline after a context switch:
   Note that such squashed instructions add to the context switch overhead
   Note that keeping instructions from two different threads in the pipeline increases the complexity of the interlocking mechanism and requires that instructions be tagged with a context ID throughout the pipeline
– Possibly replicating other microarchitectural structures (e.g., branch prediction tables, load-store queues, non-blocking cache queues)

Employed in the Sun T1 and T2 systems (a.k.a. Niagara)


Blocked Multithreading

Simple analytical performance model:
– Parameters:
   Number of threads (N): the number of threads supported in the hardware
   Busy time (R): time the processor spends computing between context switch points
   Switching time (C): time the processor spends with each context switch
   Latency (L): time required by the operation that triggers the switch
– To completely hide all of L we need enough threads N such that ~N*(R+C) equals L (strictly speaking, (N-1)*R + N*C = L):
   Fewer threads mean we can't hide all of L
   More threads are unnecessary
– Note: these are only average numbers, and ideally N should be bigger to accommodate variation

[Figure: timeline of R and C segments; the latency L triggered by the first thread is covered by the busy and switching times of the other threads]


Blocked Multithreading

Simple analytical performance model:
– The minimum value of N is referred to as the saturation point (Nsat):

    Nsat = (R + L) / (R + C)        Usat = R / (R + C)

– Thus, there are two regions of operation:
   Before saturation, adding more threads increases processor utilization linearly
   After saturation, processor utilization does not improve with more threads, but is limited by the switching overhead
– E.g.: for R=40, L=200, and C=10

[Figure: processor utilization (%) vs. number of threads (0 to 9) for R=40, L=200, C=10; utilization grows linearly up to Nsat = 4.8 threads and then saturates at Usat = 0.8; after Culler and Singh, Fig. 11.25]
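Plugging the slide's numbers into the model (a small C sketch):

    #include <stdio.h>

    int main(void) {
        double R = 40, L = 200, C = 10;        /* the slide's example values */
        double Nsat = (R + L) / (R + C);       /* = 4.8 threads */
        double Usat = R / (R + C);             /* = 0.8, i.e., 80% utilization */
        printf("Nsat = %.1f threads, Usat = %.0f%%\n", Nsat, Usat * 100);
        return 0;
    }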


Fine-grain or Interleaved Multithreading

Basic idea:
– Instead of waiting for a long-latency operation, context switch on every cycle
– Threads waiting for a long-latency operation are marked not ready and are not considered for execution
– With enough threads, no two instructions from the same thread are in the pipeline at the same time → no need for pipeline interlock at all

Advantages and disadvantages over blocked multithreading:
+ No context switch overhead (no pipeline flush)
+ Better at handling short pipeline latencies/bubbles
– Possibly poor single-thread performance (each thread only gets the processor once every N cycles)
– Requires more threads to completely hide long latencies
– Slightly more complex hardware than blocked multithreading

Some machines have taken this idea to the extreme and eliminated caches altogether (e.g., the Cray MTA-2, with 128 threads per processor)


Fine-grain or Interleaved Multithreading

Latency hiding example:

[Figure: execution timeline interleaving threads A through F one cycle at a time; pipeline and memory latencies are hidden, idle (stall) cycles marked; threads that are still blocked (first A, later E) are skipped in the rotation; after Culler and Singh, Fig. 11.28]


Fine-grain or Interleaved Multithreading

Simple analytical performance model (see Slide 6):
– Parameters: number of threads (N) and latency (L); busy time (R) is now 1 and switching time (C) is now 0
– To completely hide all of L we need enough threads N such that N-1 = L
– Again, these are only average numbers, and ideally N should be bigger to accommodate variation
– The minimum value of N (i.e., N = L+1) is the saturation point (Nsat)
– Again, there are two regions of operation:
   Before saturation, adding more threads increases processor utilization linearly
   After saturation, processor utilization does not improve with more threads, but is 100% (i.e., Usat = 1)

[Figure: timeline of single-cycle R slots; L is covered by the other threads' cycles]


Simultaneous Multithreading (SMT)

Basic idea:
– Don't actually context switch, but on a superscalar processor fetch and issue instructions from different threads/processes simultaneously
– E.g., a 4-issue processor:

[Figure: issue slots over cycles. With no multithreading, a cache miss leaves whole cycles idle; blocked and interleaved multithreading fill cycles with other threads; SMT fills individual issue slots within a cycle from several threads]

Advantages:
+ Can handle not only long latencies and pipeline bubbles but also unused issue slots
+ Full performance in single-thread mode
– Most complex hardware of all the multithreading schemes


Simultaneous Multithreading (SMT)

Fetch policies:
– Non-multithreaded fetch: only fetch instructions from one thread in each cycle, in a round-robin alternation
– Partitioned fetch: divide the total fetch bandwidth equally between some of the available threads (requires a more complex fetch unit to fetch from multiple I-cache lines; see Lecture 3)
– Priority fetch: fetch more instructions for specific threads (e.g., those not in control speculation, those with the least number of instructions in the issue queue)

Issue policies:
– Round-robin: select one ready instruction from each ready thread in turn until all issue slots are full or there are no more ready instructions
  (note: one should remember which thread was the last to have an instruction selected and start from there in the next cycle)
– Priority issue:
   E.g., threads with older instructions in the issue queue are tried first
   E.g., threads in control-speculative mode are tried last
   E.g., issue all pending branches first


References and Further Reading

Original work on multithreading:
  "The Tera Computer System", R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, Intl. Conf. on Supercomputing, June 1990.
  "Performance Tradeoffs in Multithreaded Processors", A. Agarwal, IEEE Trans. on Parallel and Distributed Systems, September 1992.
  "Simultaneous Multithreading: Maximizing On-Chip Parallelism", D. Tullsen, S. Eggers, and H. Levy, Intl. Symp. on Computer Architecture, June 1995.
  "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor", D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, Intl. Symp. on Computer Architecture, June 1996.
Intel's hyper-threading mechanism:
  "Hyper-Threading Technology Architecture and Microarchitecture", D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton, Intel Technology Journal, Q1 2002.


Lect. 13: Chip-Multiprocessors (CMP)

Main driving forces:
– The complexity of design and verification of a wider-issue superscalar processor would be unmanageable
– The performance gains of either wider issue width or deeper pipelines would be only marginal:
   Limited ILP in applications
   Wire delays and longer access times of larger structures
– The power consumption of the large centralized structures necessary in wider-issue superscalar processors would be unmanageable
– The increased relative importance of throughput-oriented computing as compared to latency-oriented computing
– The continuation of Moore's law, so that more transistors fit on a chip


Early (ca. 2006) CMP’s

2

Example: Intel Core Duo– 2 cores

3-issue superscalar 12-stage pipeline 2-way simultaneous

multithreading (HT) Up to 2.33GHz P6 (Pentium M)

microarchitecture– 2MB shared L2 cache– 151M transistors in

65nm technology– Power consumption

between 9W and 30W


Current (ca. 2007) CMP’s

3

Example: Sun T2– 8 cores

Single issue, statically scheduled

8-stage pipeline 8-way multithreading

(blocked) Up to 1.4GHz UltraSparc V9 IS

– 4MB shared L2 cache– 65nm technology– Power consumption

around 72W


Future CMP’s?

4

Example: Intel Polaris (2007)– 80 cores

Single issue, statically scheduled

3.2GHz (up to 5GHz)– Scalable, packet-

switched, interconnect (8x10 mesh)

– No shared L2 or L3 cache

– No cache coherence– “Tiled” approach

Core + cache + router– Stacked memory

technology– Power consumption

around 62W Example: Intel SCC

(2010)– 48 cores (full IA-32

compatible)


CMP’s vs. Multi-chip Multiprocessors

5

While conceptually similar to traditional multiprocessors, CMP’s have specific issues:– Off-chip memory bandwidth: number of pins per

package does not increase much– On-chip interconnection network: wires and metal

layers are a very scarce resource– Shared memory hierarchy: processors must share some

lower level cache (e.g., L2 or L3) and the on-chip links between these

– Wire delays: actual physical distances to be crossed for communication affect the latency of the communication

– Power consumption and heat dissipation: both are much harder to fit within the limitations of a single chip package


Shared vs. Private L2 Caches

Private caches:
+ Less chance of negative interference between processors
+ Simpler interconnections
– Possibly wasted storage in less loaded parts of the chip
– Must enforce coherence across the L2's

Shared caches:
– More chance of negative interference between processors
+ Possible positive interference between processors
+ Better utilization of storage
+ Single/few threads have access to all resources when the other cores are idle
+ No need to enforce coherence across the L2 (but coherence must still be enforced across the L1's), and the L2 can act as a coherence point (i.e., directory)
– The all-to-one interconnect takes up a large area and may become a bottleneck

Note: L1 caches are tightly integrated into the pipeline and are an inseparable part of the core


Shared vs. Private L2 Caches

Priority Inversion and Fair Sharing
– In uniprocessors and multi-chip multiprocessors: processes with higher priority are given more resources (e.g., more processors, larger scheduling quanta, more memory/caches, etc.) → faster execution
– In CMP's with shared resources (e.g., L2 caches, off-chip memory bandwidth, issue slots with multithreading):
   Dynamic allocation of resources to threads/processes is oblivious to the OS (e.g., LRU replacement policy in caches)
   Hardware policies attempt to maximize utilization across the board
   Hardware treats all threads/processes equally, and threads/processes compete dynamically for resources
– Thus, at run time, a lower-priority thread/process may grab a larger share of the resources and may execute relatively faster than a higher-priority thread/process
– One of the biggest problems is that of fair cache sharing
– In more general terms, overall quality of service should be directly proportional to priority


Shared vs. Private L2 Caches

Fair Sharing
– Example:

[Figure: normalized L2 misses per instruction and normalized IPC for gzip running alone and co-scheduled with applu, apsi, art, and swim; from Kim et al.]

– Interference in the L2 causes gzip to have 3 to 10 times more L2 misses and to run at as low as half the original speed
– The effect of the interference depends on which other application is co-scheduled with gzip


Shared vs. Private L2 Caches

Fair Sharing
– Condition for fair sharing:

    Tshr_1/Tded_1 = Tshr_2/Tded_2 = … = Tshr_n/Tded_n

  where Tded_i is the execution time of thread i when executed alone in the CMP with a dedicated L2 cache, and Tshr_i is its execution time when sharing the L2 with the other n-1 threads
– To maximize fair sharing, minimize:

    M_ij = X_i - X_j,   where X_i = Tshr_i/Tded_i

– Possible solution: partition the caches into different-sized portions, either statically or at run time
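The metric is easy to state in code (a sketch of ours; the Tshr/Tded values are inputs measured elsewhere, and we take the absolute gap):

    #include <math.h>

    /* X_i = Tshr_i / Tded_i; the fairness gap for a pair of threads is
       |X_i - X_j| (0 means perfectly fair sharing) */
    double fairness_gap(double Tshr_i, double Tded_i,
                        double Tshr_j, double Tded_j) {
        double Xi = Tshr_i / Tded_i;
        double Xj = Tshr_j / Tded_j;
        return fabs(Xi - Xj);
    }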


NUCA L2 Caches

On-chip L2 and L3 caches are expected to continue increasing in size (e.g., the Core Duo has a 2MB L2 while the Core 2 Duo has a 4MB L2)
Such caches are logically divided into a few (2 to 8) logical banks with independent access
Banks are physically divided into small (128KB to 512KB) sub-banks
Thus, future multi-megabyte L2 and L3 caches will likely have 32 or more sub-banks
Increasing wire delays mean that sub-banks closer to a given processor can be accessed more quickly than sub-banks further away
Also, some sub-banks will invariably be closer to one processor and far from another, and some sub-banks will be at similar distances from a few processors
Bottom line: uniform (worst-case) access times will be increasingly inefficient


NUCA L2 Caches

Key ideas:
– Allow, and exploit, the fact that different sub-banks have different access times
– Each sub-bank has its own wire set to the cache controller (which does increase overall area)
– Either statically or dynamically map, and migrate, the most heavily used lines to the banks closer to the processor
– By tweaking the dynamic mapping and migration mechanisms, such NUCA caches can adapt from private to shared caches
– Obviously, with such dynamic mapping and migration, searching the cache and performing replacements becomes more expensive

E.g., Sun's T2 uses a NUCA L2 cache with 8 banks spread across the chip borders, but with static mapping and no migration


Directory Coherence On-Chip?

[Figure: left, CC-NUMA nodes, each with a CPU, an L2 cache, and memory plus directory (Mem. Dir.); right, CMP tiles, each with a CPU, an L1 cache, and an L2 cache slice plus directory (L2 Cache Dir.)]

One-to-one mapping from CC-NUMA?
– L2 cache → L1 cache
– Main memory → L2 cache
– Dir. entry per memory line → Dir. entry per L2 cache line
– Mem. lines mapped to physical mem. by a first-touch policy at OS page granularity → L2 lines mapped to a physical L2 by a first-touch policy at OS page level


Directory Coherence On-Chip

The mapping problem (home node):
– OS page granularity is too coarse: many lines needed by Px might actually be used by Py, but still have to be cached at Px (OK for a large memory, but not OK for a small L2; it may also lead to imbalance in the mapping)
– Line granularity with first-touch needs a hardware/OS mapping of every individual cache line to a physical L2 (too expensive)
– Solution: map at line granularity but circularly, based on the physical address (mem. line 0 maps to L2 #0, mem. line 1 maps to L2 #1, etc.); the problem with this solution is that locality of use is lost!

The eviction problem:
– Upon eviction of an L2 (mem.) line, the corresponding dir. entry is lost and all L1 cached copies must be invalidated (OK for the rare paging case in CC-NUMA, but not OK for a small L2)
– Solution: associate dir. entries not with L2 cache lines, but with cached L1 lines (replicated tags and exclusive L1-home L2)
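The circular line-granularity mapping above is just modular arithmetic on the physical address; a sketch (the line size and slice count are illustrative):

    #define LINE_SIZE   64    /* illustrative values, not from the slide */
    #define N_L2_SLICES 8

    /* mem. line 0 -> L2 #0, line 1 -> L2 #1, ..., wrapping around circularly */
    static inline unsigned home_l2(unsigned long paddr) {
        return (paddr / LINE_SIZE) % N_L2_SLICES;
    }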


Exclusivity with Replicated Tags

The Dir. contains a copy of the L1 tags of the lines mapped to the home L2, but the L2 does not have to keep the L1 data itself
– Good: lines can be evicted from the L2 silently (by exclusivity, they are not cached in any L1) and the Dir. does not change
– Bad: the replicated tags (i.e., the Dir. information) increase with the number of L1 caches
   E.g., for 8 cores with 32KB L1's with 32B lines (i.e., 1024 lines each) and fully associative → 8x1024 = 8,192 entries per Dir.
   (In practice, associativity reduces this overhead, and alternatives exist)

[Figure: a tile with CPU, L1 cache, and L2 cache slice plus a Dir. holding the replicated L1 tags]


References and Further Reading

Early study of chip-multiprocessors:
  "The Case for a Single-Chip Multiprocessor", K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1996.
More recent study of chip-multiprocessors (throughput-oriented):
  "Maximizing CMP Throughput with Mediocre Cores", J. Davis, J. Laudon, and K. Olukotun, Intl. Conf. on Parallel Architecture and Compilation Techniques, September 2005.
First NUCA cache proposal (for uniprocessors):
  "An Adaptive, Non-uniform Cache Structure for Wire-delay Dominated On-chip Caches", C. Kim, D. Burger, and S. Keckler, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 2002.


References and Further Reading

NUCA cache study for CMP:
  "Managing Wire Delay in Large Chip-Multiprocessor Caches", B. Beckmann and D. Wood, Intl. Symp. on Microarchitecture, December 2004.
Recent fair cache sharing studies:
  "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture", S. Kim, D. Chandra, and Y. Solihin, Intl. Conf. on Parallel Architecture and Compilation Techniques, October 2004.
  "CQoS: A Framework for Enabling QoS in Shared Caches of CMP Platforms", R. Iyer, Intl. Conf. on Supercomputing, June 2004.
Other recent studies on priorities and quality of service in CMP/SMT:
  "Symbiotic Job-Scheduling with Priorities for Simultaneous Multithreading Processors", A. Snavely, D. Tullsen, and G. Voelker, Intl. Conf. on Measurement and Modeling of Computer Systems, June 2002.


Lect. 14: Interconnection Networks

Communication networks (e.g., LANs and WANs):
– Must follow industry standards
– Must support many different types of packets
– Many features, such as reliability, are handled by upper software layers
– Currently based on buses (e.g., Ethernet LAN) and optical fiber (WAN)
– Latency is high and bandwidth is low
– Topologies are highly irregular (e.g., the Internet)

Multiprocessor interconnects:
– Custom made and proprietary
– Must only support a few (3 to 4) different types of packets
– Most features are handled in hardware
– Many different topologies and technologies are commonly used
– Latency is low and bandwidth is high
– Topologies are very regular


Interconnection Networks

General organization:
– Network controller (NC): links the processor (host) to the network
– Switches (SW): link different parts of the network internally
– Note: SW's may not be present at all in some topologies

[Figure: several hosts, each a CPU with cache, memory, and a network controller (NC), connected through a fabric of switches (SW)]


Interconnection Networks

Characterizing interconnects:
– Topology: the "shape" or structure of the interconnect (e.g., buses, meshes, hypercubes, butterflies, etc.)
   Direct networks: each host+NC connects directly to other hosts+NCs
   Indirect networks: hosts+NCs connect to a subset of the switches, which are then the entry points to the network and are themselves connected to other internal switches
– Routing algorithm: the rules and mechanisms for routing messages
   Dynamic: the route from a given A to B may change at different times
   Static: the route from a given A to B is fixed
– Switching strategy: how the exchange of messages is set up
   Circuit switching: the route and connection from source to destination are established and fixed before communication (e.g., like telephone calls)
   Packet switching: each part of the communication (packet) is handled separately
– Flow control mechanism: how traffic flow under conflict and/or congestion is handled


Interconnection Networks

Terminology:
– Link: the physical connection between two hosts/switches
– Channel: a logical connection between two hosts/switches that are connected with a link (multiple channels may be multiplexed onto a single link)
– Degree of a switch: the number of input/output channels
– Simplex channel: communication can only happen in one direction
  Duplex channel: communication can happen in both directions
– Phit: the smallest physical unit of data that can be transferred in a unit of time over a link
– Flit: the smallest unit of data that can be exchanged between two hosts/switches (1 flit ≥ 1 phit)
– Hop: each step between two adjacent hosts/switches
– Permutations: a combination of pairs of hosts that can communicate simultaneously


Interconnection Networks

Important properties:
– Degree or radix: the smallest number of hosts/switches that any given host/switch connects directly to
– Diameter: the longest distance between any two hosts (in number of hops)
– Bisection: a collection of links that, if removed, would divide the network into two equal-size (disconnected) parts
   Bisection width: the minimum number of links across all bisections
   Bisection bandwidth: the minimum bandwidth across the bisections
– Total bandwidth: the maximum communication bandwidth that can be attained
– Cost: usually given as a function of the total number of links, switches, and network controllers
– Scalability: how a given property scales with the increase in the number of hosts (e.g., bandwidth, cost, diameter, etc.)
   Usually given in terms of O() (e.g., O(1) is constant and O(N) is linear)
– Fault tolerance: whether communication between any two nodes is still possible after the failure of some links


Topologies

Buses:

[Figure: four CPUs and main memory attached to a single bus]

– Degree: N-1 (i.e., fully connected)
– Diameter: 1
– Bisection width: 1
– Total bandwidth: O(1)
– Cost: O(N)
– Permutations: single pair, broadcast (one-to-all), multicast (one-to-many)


Topologies: Crossbar
– Degree: N-1 (i.e., fully connected)
– Diameter: 2 (sometimes also said to be 1)
– Bisection width: N
– Total bandwidth: O(N)
– Cost: O(N²)
– Permutations: single-pair, any pair-wise permutation

[Figure: 4×4 crossbar with CPUs 1-4 on the rows and CPUs 1-4 on the columns; each node contains a CPU and memory]


Topologies: Bidirectional Ring
– Degree: 2
– Diameter: N/2
– Bisection width: 2
– Total bandwidth: O(N) (e.g., all nodes communicate with their neighbors)
– Cost: O(N)
– Permutations: single-pair, neighbor

[Figure: ring of CPU+memory nodes, drawn as a circle and in an equivalent layout with same-size wires]


Topologies: 2-D Mesh
– Degree: 2 (maximum is 4 at internal nodes)
– Diameter: 2·(k-1) (k is the number of nodes per row/column, i.e., √N)
– Bisection width: k
– Total bandwidth: O(N)
– Cost: O(N)
– Permutations: single-pair, neighbor

[Figure: k×k grid of CPU+memory nodes]
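
Reusing the illustrative diameter() sketch from earlier (again an illustration, not from the slides), we can check the 2·(k-1) figure for a k×k mesh:

```python
# Build adjacency for a k x k mesh and reuse diameter() from the
# earlier sketch; the corner-to-corner distance is 2*(k-1).
k = 4
mesh = {(r, c): [(r + dr, c + dc)
                 for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= r + dr < k and 0 <= c + dc < k]
        for r in range(k) for c in range(k)}
print(diameter(mesh))  # 6 == 2 * (k - 1)
```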


Topologies: 2-D Torus
– Degree: 4
– Diameter: k
– Bisection width: 2·k
– Total bandwidth: O(N)
– Cost: O(N)
– Permutations: single-pair, neighbor

[Figure: k×k grid of CPU+memory nodes with wrap-around links in both dimensions]


Topologies: 4-D Cube (Hypercube)
– Degree: 4
– Diameter: 4
– Bisection width: 8
– Total bandwidth: O(N)
– Cost: O(N)
– Permutations: single-pair, neighbor

[Figure: 16 CPU+memory nodes connected as a 4-dimensional hypercube]
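
As an illustration (not from the slides): a d-dimensional hypercube numbers its 2^d nodes in binary, each node's neighbors are obtained by flipping one address bit, and fixing differing bits one at a time yields a route of at most d hops, which is why the diameter equals d. A minimal sketch:

```python
def hypercube_neighbors(node, d=4):
    """Neighbors of a node in a d-cube: flip each of the d address bits."""
    return [node ^ (1 << i) for i in range(d)]

def cube_route(src, dst, d=4):
    """Dimension-ordered routing: fix differing bits in a fixed order."""
    route, cur = [src], src
    for i in range(d):
        if (cur ^ dst) & (1 << i):   # bit i differs -> hop in dimension i
            cur ^= 1 << i
            route.append(cur)
    return route

print(hypercube_neighbors(0b0000))   # [1, 2, 4, 8]
print(cube_route(0b0000, 0b1011))    # [0, 1, 3, 11] (3 hops)
```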


Topologies: Binary Tree
– Degree: 1 for hosts and 3 for switches
– Diameter: 2·log₂N
– Bisection width: 1
– Total bandwidth: O(N)
– Cost: O(N)
– Permutations: single-pair, neighbor
– Note: in a "fat" tree, the width of the links increases as we go toward the root

[Figure: binary tree with CPU+memory hosts at the leaf nodes and switches at the intermediate nodes and root; also drawn in an H-tree configuration]


Topologies: Switched Network
– Degree: 1 for hosts and 2 for switches
– Diameter: log₂N
– Bisection width: N/2
– Total bandwidth: O(N)
– Cost: O(N·log₂N)
– Permutations: depends on the actual topology

[Figure: multistage switched network with CPUs 1-4 on the inputs and CPUs 1-4 on the outputs; each node contains a CPU and memory]


Topologies: Switched Network – e.g., Omega Network

[Figure: Omega network built from 2×2 switches (an 8-host, 3-stage version is used in the routing example below)]


Routing
Example: mesh and d-dimensional cubes
– Hosts are numbered as in a matrix
– To avoid deadlock, use dimension-ordered routing (a.k.a. X-Y routing in 2-D):
  - Follow all the steps necessary in one dimension before changing dimensions
  - Always traverse the dimensions in the same order
– E.g., from (1,1) to (3,3) and from (3,3) to (1,1) on the mesh below (a routing sketch follows the grid)

(0,0) (0,1) (0,2) (0,3)
(1,0) (1,1) (1,2) (1,3)
(2,0) (2,1) (2,2) (2,3)
(3,0) (3,1) (3,2) (3,3)
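
A minimal illustrative sketch (Python; not from the slides) of X-Y routing on this mesh. Coordinates are (row, column); the two routes show that the forward and reverse paths differ, because the dimension order is fixed:

```python
def xy_route(src, dst):
    """Dimension-ordered (X-Y) routing on a 2-D mesh:
    fully resolve the row dimension first, then the column."""
    (r, c), (dr, dc) = src, dst
    route = [(r, c)]
    while r != dr:                 # resolve the first dimension completely
        r += 1 if dr > r else -1
        route.append((r, c))
    while c != dc:                 # then resolve the second dimension
        c += 1 if dc > c else -1
        route.append((r, c))
    return route

print(xy_route((1, 1), (3, 3)))  # [(1, 1), (2, 1), (3, 1), (3, 2), (3, 3)]
print(xy_route((3, 3), (1, 1)))  # [(3, 3), (2, 3), (1, 3), (1, 2), (1, 1)]
```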


Routing
Example: Omega network
– Hosts are numbered linearly in binary (log₂N bits are required)
– The routing function is F = S XOR D, where S and D are the binary numbers of the source and destination hosts, respectively
– At each stage of the network, use the corresponding bit of the routing function to go:
  - Straight, if the bit is 0
  - Across, if the bit is 1
– Assign numbers to hosts appropriately (easy for Omega, but more complex for other networks)
– E.g., from 010 to 011: F = 001 → straight, straight, across; and from 100 to 111: F = 011 → straight, across, across (see the sketch below)

[Figure: 8-host Omega network with inputs 000-111 on the left and outputs 000-111 on the right]
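
A minimal sketch (Python; illustrative, not from the slides) of this XOR routing function, printing the per-stage decision for the two examples above:

```python
def omega_route(src, dst, bits=3):
    """XOR routing for an Omega network: bit i of F = src XOR dst
    selects straight (0) or across (1) at stage i, MSB first."""
    f = src ^ dst
    return ["straight" if (f >> i) & 1 == 0 else "across"
            for i in range(bits - 1, -1, -1)]

print(omega_route(0b010, 0b011))  # ['straight', 'straight', 'across']
print(omega_route(0b100, 0b111))  # ['straight', 'across', 'across']
```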


Packet Switching
Store-and-forward
– Enough space must be pre-allocated in the destination router's buffers for the complete packet
– The router must wait until the complete packet has been received before it can begin forwarding it
Cut-through
– Enough space must still be pre-allocated in the destination router's buffer for the complete packet
– The router may begin forwarding parts of the packet as soon as they arrive
Wormhole
– Packets are divided into small pieces called flow units (flits)
– Only the header flit contains the address of the destination and is responsible for setting up the route (the trailing flits simply follow the header)
– No need to allocate buffer space for the entire packet (the packet spreads through multiple routers and links like a "worm")
– May lead to deadlock
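
To see why cut-through and wormhole switching help, here is an illustrative sketch (not from the slides) of the standard first-order latency formulas: with packet length L, link bandwidth b, and h hops, store-and-forward pays the full serialization delay at every hop, whereas cut-through/wormhole pays it only once plus a small per-hop header delay. The numbers below are assumptions for demonstration.

```python
def store_and_forward_latency(L, b, h):
    # Each hop must receive the whole packet before forwarding it.
    return h * (L / b)

def cut_through_latency(L, b, h, header=1):
    # Only the header is delayed at each hop; the rest is pipelined.
    return h * (header / b) + L / b

# Illustrative numbers: 128-byte packet, 1 byte/cycle links, 4 hops
L, b, h = 128, 1, 4
print(store_and_forward_latency(L, b, h))  # 512.0 cycles
print(cut_through_latency(L, b, h))        # 132.0 cycles
```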


References and Further Reading
Recent books on multiprocessor interconnects:
– "Principles and Practice of Interconnection Networks", W. Dally and B. Towles, Morgan Kaufmann, 2003.
– "Interconnection Networks", J. Duato, Morgan Kaufmann, 2002.