1
Programming for Performance
2
Simulating Ocean Currents
Model as two-dimensional grids
• Discretize in space and time
• Finer spatial and temporal resolution => greater accuracy
Many different computations per time step
– set up and solve equations
• Concurrency across and within grid computations
• Static and regular
[Figure: (a) Cross sections; (b) Spatial discretization of a cross section]
3
Simulating Galaxy Evolution
Simulate the interactions of many stars evolving over time
Computing forces is expensive
• O(n²) brute force approach
• Hierarchical methods take advantage of force law: F = G·m1·m2 / r²
•Many time-steps, plenty of concurrency across stars within one
[Figure: hierarchical force calculation: the star on which forces are being computed; a star too close to approximate; a small group far enough away to approximate by its center of mass; a large group far enough away to approximate]
4
Rendering Scenes by Ray Tracing
Shoot rays into scene through pixels in image plane
Follow their paths
• they bounce around as they strike objects
• they generate new rays: ray tree per input ray
Result is color and opacity for that pixel
Parallelism across rays
How much concurrency in these examples?
5
4 Steps in Creating a Parallel Program
[Figure: the sequential computation is decomposed into tasks, tasks are assigned to processes P0–P3 (decomposition + assignment = partitioning), the parallel program is orchestrated, and processes are mapped to processors p0–p3]
1. Decomposition of computation into tasks
2. Assignment of tasks to processes
3. Orchestration of data access, comm, synch.
4. Mapping of processes to processors
6
Performance Goal => Speedup
Architect goal
• observe how the program uses the machine and improve the design to enhance performance
Programmer goal
• observe how the program uses the machine and improve the implementation to enhance performance
What do you observe?
Who fixes what?
7
Analysis Framework
Solving communication and load balance NP-hard in general case
• But simple heuristic solutions work well in practice
Fundamental tension among:
• balanced load
• minimal synchronization
• minimal communication
• minimal extra work
Good machine design mitigates the trade-offs
Speedup ≤ Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
8
Decomposition
Identify concurrency and decide level at which to exploit it
Break up computation into tasks to be divided among processes
• Tasks may become available dynamically
• No. of available tasks may vary with time
Goal: enough tasks to keep processes busy, but not too many
• Number of tasks available at a time is upper bound on achievable speedup
9
Limited Concurrency: Amdahl’s Law
Most fundamental limitation on parallel speedup
If fraction s of seq execution is inherently serial, speedup <= 1/s
Example: 2-phase calculation
• sweep over n-by-n grid and do some independent computation
• sweep again and add each value to global sum
Time for first phase = n²/p
Second phase serialized at global variable, so time = n²
Speedup ≤ 2n² / (n²/p + n²), or at most 2
Trick: divide second phase into two
• accumulate into private sum during sweep
• add per-process private sum into global sum
Parallel time is n²/p + n²/p + p, and speedup at best 2n²p / (2n² + p²)
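As a concrete illustration, here is a minimal OpenMP sketch of the private-sum trick (the grid size and per-element computation are placeholder assumptions, not from the slides):

```c
#include <omp.h>
#include <stdio.h>

#define N 1024

static double grid[N][N];

int main(void) {
    double sum = 0.0;

    /* Each thread sweeps its rows, accumulating into a private copy of
       sum; OpenMP adds the per-thread sums at the end (the p-step serial
       combination), instead of serializing every update on the global
       variable. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            grid[i][j] += 1.0;     /* independent computation         */
            sum += grid[i][j];     /* accumulate into the private sum */
        }

    printf("sum = %f\n", sum);
    return 0;
}
```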
10
Understanding Amdahl’s Law
[Figure: concurrency (1 versus p) over time for the example: (a) sequential execution, (b) parallel first phase (n²/p) with serialized second phase (n²), (c) both phases parallel with private sums (n²/p + n²/p + p)]
11
Concurrency Profiles
• Area under curve is total work done, or time with 1 processor
• Horizontal extent is lower bound on time (infinite processors)
• Speedup is the ratio: Speedup(p) = (Σ_k f_k·k) / (Σ_k f_k·⌈k/p⌉), where f_k is the fraction of work done at concurrency k; base case (Amdahl): 1 / (s + (1−s)/p)
•Amdahl’s law applies to any overhead, not just limited concurrency
[Figure: concurrency profile of a sample program: concurrency versus clock cycle number]
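A small sketch of evaluating this bound from a measured profile; the profile values in main are made-up illustrative data:

```c
#include <stdio.h>

/* Speedup(p) = sum_k f_k*k / sum_k f_k*ceil(k/p), where f[k] is the
   fraction of work done at concurrency k. */
double profile_speedup(const double *f, int kmax, int p) {
    double work = 0.0, time_p = 0.0;
    for (int k = 1; k <= kmax; k++) {
        work   += f[k] * k;                  /* area under the profile  */
        time_p += f[k] * ((k + p - 1) / p);  /* ceil(k/p) steps on p    */
    }
    return work / time_p;
}

int main(void) {
    double f[5] = {0, 0.4, 0.3, 0.2, 0.1};   /* hypothetical profile */
    for (int p = 1; p <= 8; p *= 2)
        printf("p=%d speedup=%.2f\n", p, profile_speedup(f, 4, p));
    return 0;
}
```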
12
Programming as Successive Refinement
Rich space of techniques and issues
• Trade off and interact with one another
Issues can be addressed/helped by software or hardware
• Algorithmic or programming techniques
• Architectural techniques
Not all issues in programming for performance dealt with up front
• Partitioning often independent of architecture, and done first
• Then interactions with architecture
– Extra communication due to architectural interactions
– Cost of communication depends on how it is structured
– May inspire changes in partitioning
13
Partitioning for Performance
Balancing the workload and reducing wait time at synch points
Reducing inherent communication
Reducing extra work
Even these algorithmic issues trade off:
• Minimize comm. => run on 1 processor => extreme load imbalance
• Maximize load balance => random assignment of tiny tasks => no control over communication
• Good partition may imply extra work to compute or manage it
Goal is to compromise
• Fortunately, often not difficult in practice
14
Load Balance and Synchronization
Speedup_problem(p) ≤ Sequential Work / Max Work on any Processor
Instantaneous load imbalance revealed as wait time
• at completion
• at barriers
• at receive
• at flags, even at mutex
[Figure: execution timelines for P0–P3, without and with synch wait time]
Speedup ≤ Sequential Work / Max (Work + Synch Wait Time)
15
Load Balance and Synch Wait Time
Limit on speedup: Speedup_problem(p) ≤ Sequential Work / Max Work on any Processor
• Work includes data access and other costs
• Not just equal work, but must be busy at same time
Four parts to load balance and reducing synch wait time:
1. Identify enough concurrency
2. Decide how to manage it
3. Determine the granularity at which to exploit it
4. Reduce serialization and cost of synchronization
16
Reducing Inherent Communication
Communication is expensive!
Metric: communication to computation ratio
Focus here on inherent communication
• Determined by assignment of tasks to processes
• Later see that actual communication can be greater
Assign tasks that access same data to same process
Solving communication and load balance NP-hard in general case
But simple heuristic solutions work well in practice
• Applications have structure!
17
Domain Decomposition
Works well for scientific, engineering, graphics, ... applications
Exploits local-biased nature of physical problems
• Information requirements often short-range
• Or long-range but fall off with distance
Simple example: nearest-neighbor grid computation
Perimeter-to-area comm-to-comp ratio (area to volume in 3-d)
• Depends on n, p: decreases with n, increases with p
[Figure: n×n grid partitioned into square blocks among processors P0–P15, each block (n/√p) × (n/√p)]
18
Domain Decomposition (contd)
Comm-to-comp ratio: 4√p / n for block, 2p / n for strip
Application dependent: strip may be better in other cases
• E.g. particle flow in tunnel
Best domain decomposition depends on information requirements
Nearest-neighbor example: block versus strip decomposition:
[Figure: block decomposition (n/√p × n/√p blocks) versus strip decomposition (n/p × n strips) of the n×n grid among P0–P15]
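A tiny sketch evaluating both ratios for illustrative n and p (the values are assumptions, not from the slides):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    int n = 1024;                     /* grid is n x n */
    for (int p = 4; p <= 256; p *= 4) {
        double block = 4.0 * sqrt((double)p) / n; /* perimeter/area of a block */
        double strip = 2.0 * p / n;               /* two full rows per strip   */
        printf("p=%3d  block=%.4f  strip=%.4f\n", p, block, strip);
    }
    return 0;
}
```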
19
Finding a Domain Decomposition
Static, by inspection
• Must be predictable: grid example above
Static, but not by inspection
• Input-dependent, requires analyzing input structure
• E.g. sparse matrix computations
Semi-static (periodic repartitioning)
• Characteristics change, but slowly; e.g. N-body
Static or semi-static, with dynamic task stealing
• Initial domain decomposition, but then highly unpredictable; e.g. ray tracing
20
N-body: Simulating Galaxy Evolution
•Simulate the interactions of many stars evolving over time
• Computing forces is expensive
• O(n²) brute force approach
• Hierarchical methods take advantage of force law: F = G·m1·m2 / r²
•Many time-steps, plenty of concurrency across stars within one
[Figure: hierarchical force calculation: the star on which forces are being computed; a star too close to approximate; a small group far enough away to approximate by its center of mass; a large group far enough away to approximate]
21
A Hierarchical Method: Barnes-Hut
Locality Goal: • Particles close together in space should be on same processor
Difficulties: Nonuniform, dynamically changing
[Figure: (a) The spatial domain; (b) Quadtree representation]
22
Application Structure
• Main data structures: array of bodies, of cells, and of pointers to them
– Each body/cell has several fields: mass, position, pointers to others
– pointers are assigned to processes
[Figure: time-step loop: build tree -> compute moments of cells -> traverse tree to compute forces -> update properties]
23
Partitioning
Decomposition: bodies in most phases (sometimes cells)
Challenges for assignment:
• Nonuniform body distribution => work and comm. nonuniform
– Cannot assign by inspection
• Distribution changes dynamically across time-steps
– Cannot assign statically
• Information needs fall off with distance from body
– Partitions should be spatially contiguous for locality
• Different phases have different work distributions across bodies
– No single assignment ideal for all
– Focus on force calculation phase
•Communication needs naturally fine-grained and irregular
24
Load Balancing
• Equal particles ≠ equal work.
– Solution: Assign costs to particles based on the work they do
• Work unknown and changes with time-steps
– Insight : System evolves slowly
– Solution: Count work per particle, and use as cost for next time-step.
Powerful technique for evolving physical systems
25
A Partitioning Approach: ORB
Orthogonal Recursive Bisection:
• Recursively bisect space into subspaces with equal work
– Work is associated with bodies, as before
• Continue until one partition per processor
• High overhead for large no. of processors
26
Another Approach: Costzones
Insight: Tree already contains an encoding of spatial locality.
• Costzones is low-overhead and very easy to program
[Figure: (a) ORB versus (b) Costzones partitionings across processors P1–P8]
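A minimal sketch of the costzones idea, not the actual Barnes-Hut code: bodies already ordered by a tree traversal are split into contiguous zones of roughly equal measured cost. All names and costs here are illustrative:

```c
#include <stdio.h>

#define NBODY 12

/* cost[i] = work counted for body i in the previous time-step; bodies
   are assumed already ordered by a tree traversal, so each zone is
   spatially contiguous. */
void costzones(const double *cost, int n, int nproc, int *owner) {
    double total = 0.0;
    for (int i = 0; i < n; i++) total += cost[i];

    double per_zone = total / nproc, acc = 0.0;
    int proc = 0;
    for (int i = 0; i < n; i++) {
        owner[i] = proc;
        acc += cost[i];
        if (acc >= per_zone * (proc + 1) && proc < nproc - 1)
            proc++;                       /* start the next zone */
    }
}

int main(void) {
    double cost[NBODY] = {1,1,4,2,2,1,1,3,5,1,2,1};  /* made-up costs */
    int owner[NBODY];
    costzones(cost, NBODY, 3, owner);
    for (int i = 0; i < NBODY; i++) printf("body %2d -> P%d\n", i, owner[i]);
    return 0;
}
```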
27
Space Filling Curves
[Figure: Morton order versus Peano-Hilbert order]
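A sketch of Morton (Z-order) indexing, the simpler of the two curves: interleaving the bits of the x and y cell coordinates gives keys whose sorted order follows the Z curve, so nearby cells usually get nearby indices:

```c
#include <stdint.h>
#include <stdio.h>

/* Interleave the low 16 bits of x and y: result bit 2i is x bit i,
   bit 2i+1 is y bit i. Cells sorted by this key follow the Z curve. */
uint32_t morton2d(uint16_t x, uint16_t y) {
    uint32_t key = 0;
    for (int i = 0; i < 16; i++) {
        key |= (uint32_t)((x >> i) & 1) << (2 * i);
        key |= (uint32_t)((y >> i) & 1) << (2 * i + 1);
    }
    return key;
}

int main(void) {
    for (uint16_t y = 0; y < 4; y++)
        for (uint16_t x = 0; x < 4; x++)
            printf("(%u,%u) -> %u\n", x, y, morton2d(x, y));
    return 0;
}
```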
28
Rendering Scenes by Ray Tracing
•Shoot rays into scene through pixels in image plane
• Follow their paths
– they bounce around as they strike objects
– they generate new rays: ray tree per input ray
• Result is color and opacity for that pixel
• Parallelism across rays
All case studies have abundant concurrency
29
Partitioning
Scene-oriented approach
•Partition scene cells, process rays while they are in an assigned cell
Ray-oriented approach •Partition primary rays (pixels), access scene data as needed
•Simpler; used here
Need dynamic assignment; use contiguous blocks to exploit spatial coherence among neighboring rays, plus tiles for task stealing
[Figure: the image plane divided into blocks (the unit of assignment) and tiles (the unit of decomposition and stealing)]
Could use 2-D interleaved (scatter) assignment of tiles instead
30
Other Techniques
Scatter Decomposition, e.g. initial partition in Raytrace
Preserve locality in task stealing
• Steal large tasks for locality, steal from same queues, ...
[Figure: domain decomposition (four contiguous quadrants 1–4) versus scatter decomposition (tiles 1–4 interleaved across the whole domain)]
31
Determining Task Granularity
Task granularity: amount of work associated with a task
General rule:
• Coarse-grained => often less load balance
• Fine-grained => more overhead; often more comm., contention
Comm., contention actually affected by assignment, not size
• Overhead by size itself too, particularly with task queues
32
Dynamic Tasking with Task Queues
Centralized versus distributed queues
Task stealing with distributed queues
• Can compromise comm and locality, and increase synchronization
• Whom to steal from, how many tasks to steal, ...
• Termination detection
• Maximum imbalance related to size of task
[Figure: (a) centralized task queue: all processes insert tasks, all remove tasks; (b) distributed task queues (one per process): P0–P3 each insert into and remove from their own queue Q0–Q3, others may steal]
Preserve locality in task stealing
• Steal large tasks for locality, steal from same queues, ...
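For the centralized case, a minimal pthreads sketch (illustrative only; a real stealing version would keep one queue per process and steal from others' queues): all workers pull tasks from one lock-protected queue, which is exactly where contention appears as p grows:

```c
#include <pthread.h>
#include <stdio.h>

#define NTASK   32
#define NWORKER 4

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;                 /* head of the shared queue */

static void *worker(void *arg) {
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);       /* contention point: one queue */
        int t = next_task < NTASK ? next_task++ : -1;
        pthread_mutex_unlock(&qlock);
        if (t < 0) break;                 /* queue empty: terminate */
        printf("worker %ld runs task %d\n", id, t); /* "execute" the task */
    }
    return NULL;
}

int main(void) {
    pthread_t th[NWORKER];
    for (long i = 0; i < NWORKER; i++)
        pthread_create(&th[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKER; i++)
        pthread_join(th[i], NULL);
    return 0;
}
```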
33
Assignment Summary
Specify mechanism to divide work up among processes
• E.g. which process computes forces on which stars, or which rays
• Balance workload, reduce communication and management cost
Structured approaches usually work well
• Code inspection (parallel loops) or understanding of application
• Well-known heuristics
• Static versus dynamic assignment
As programmers, we worry about partitioning first
• Usually independent of architecture or prog model
• But cost and complexity of using primitives may affect decisions
34
Parallelizing Computation vs. Data
Computation is decomposed and assigned (partitioned)
Partitioning data is often a natural view too
• Computation follows data: owner computes
• Grid example; data mining
Distinction between comp. and data stronger in many applications
• Barnes-Hut
• Raytrace
35
Reducing Extra Work
Common sources of extra work:
• Computing a good partition
– e.g. partitioning in Barnes-Hut or sparse matrix
• Using redundant computation to avoid communication
• Task, data and process management overhead
– applications, languages, runtime systems, OS
• Imposing structure on communication
– coalescing messages, allowing effective naming
Architectural implications:
• Reduce need by making communication and orchestration efficient
Speedup ≤ Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
36
It’s Not Just Partitioning
Inherent communication in parallel algorithm is not all
• artifactual communication caused by program implementation and architectural interactions can even dominate
• thus, amount of communication not dealt with adequately
Cost of communication determined not only by amount
• also how communication is structured
• and cost of communication in system
Both architecture-dependent, and addressed in orchestration step
37
Structuring Communication
Given amount of comm (inherent or artifactual), goal is to reduce cost
Cost of communication as seen by process:
C = f * (o + l + nc/(m·B) + tc - overlap)
– f = frequency of messages
– o = overhead per message (at both ends)
– l = network delay per message
– nc = total data sent
– m = number of messages
– B = bandwidth along path (determined by network, NI, assist)
– tc = cost induced by contention per message
– overlap = amount of latency hidden by overlap with comp. or comm.
• Portion in parentheses is cost of a message (as seen by processor)
• That portion, ignoring overlap, is latency of a message
• Goal: reduce terms in latency and increase overlap
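A sketch that simply evaluates the cost expression; every parameter value below is a hypothetical placeholder, not a measurement:

```c
#include <stdio.h>

/* C = f * (o + l + nc/(m*B) + tc - overlap), per the model above. */
double comm_cost(double f, double o, double l, double nc, double m,
                 double B, double tc, double overlap) {
    return f * (o + l + nc / (m * B) + tc - overlap);
}

int main(void) {
    /* hypothetical values: 1000 messages, 2us overhead, 1us delay,
       1 MB total data in 1000 messages, 100 MB/s path bandwidth,
       no contention, no overlap */
    double c = comm_cost(1000, 2e-6, 1e-6, 1e6, 1000, 100e6, 0, 0);
    printf("total comm cost = %.6f s\n", c);
    return 0;
}
```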
38
Reducing Overhead
Can reduce no. of messages m or overhead per message o
o is usually determined by hardware or system software
• Program should try to reduce m by coalescing messages
• More control when communication is explicit
Coalescing data into larger messages:
• Easy for regular, coarse-grained communication
• Can be difficult for irregular, naturally fine-grained communication
– may require changes to algorithm and extra work
– coalescing data and determining what and to whom to send
39
Reducing Network Delay
Network delay component = f · h · t_h
– h = number of hops traversed in network
– t_h = link + switch latency per hop
Reducing f: communicate less, or make messages larger
Reducing h:
• Map communication patterns to network topology
– e.g. nearest-neighbor on mesh and ring; all-to-all
• How important is this?
– used to be major focus of parallel algorithms
– depends on no. of processors, and how t_h compares with other components
– less important on modern machines
• overheads, processor count, multiprogramming
40
Reducing Contention
All resources have nonzero occupancy
• Memory, communication controller, network link, etc.
• Can only handle so many transactions per unit time
Effects of contention:
• Increased end-to-end cost for messages
• Reduced available bandwidth for other messages
• Causes imbalances across processors
Particularly insidious performance problem
• Easy to ignore when programming
• Slows down messages that don't even need that resource
– by causing other dependent resources to also congest
• Effect can be devastating: don't flood a resource!
41
Types of Contention
Network contention and end-point contention (hot-spots)
Location and Module hot-spots
• Location: e.g. accumulating into global variable, barrier
– solution: tree-structured communication
• Module: all-to-all personalized comm. in matrix transpose
– solution: stagger access by different processors to same node temporally
•In general, reduce burstiness; may conflict with making messages larger
[Figure: flat communication (contention at the destination) versus tree-structured communication (little contention)]
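A sketch of the tree-structured combining schedule, simulated sequentially over an array of per-process partial sums; a real implementation would replace the in-array add with a message from the partner:

```c
#include <stdio.h>

#define P 8

int main(void) {
    double partial[P] = {1, 2, 3, 4, 5, 6, 7, 8};  /* per-process sums */

    /* log2(P) rounds: in the round with stride s, process r (with
       r % 2s == 0) combines with r+s, so no single location receives
       more than one incoming contribution per round. */
    for (int s = 1; s < P; s *= 2)
        for (int r = 0; r + s < P; r += 2 * s) {
            partial[r] += partial[r + s];
            printf("stride %d: P%d <- P%d\n", s, r, r + s);
        }

    printf("global sum at P0: %g\n", partial[0]);
    return 0;
}
```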
42
Overlapping Communication
Cannot afford to stall for high latencies
• even on uniprocessors!
Overlap with computation or communication to hide latency
Requires extra concurrency (slackness), higher bandwidth
Techniques:
• Prefetching
• Block data transfer
• Proceeding past communication
• Multithreading
43
Communication Scaling (NPB2)
[Figure: NPB2 communication scaling: normalized messages per processor (left) and average message size (right) versus number of processors, for FT, IS, LU, MG, SP, BT]
44
Communication Scaling: Volume
[Figure: communication volume scaling: bytes per processor (left) and total bytes (right) versus number of processors, for FT, IS, LU, MG, SP, BT]
45
Mapping
Two aspects:
• Which process runs on which particular processor?
– mapping to a network topology
• Will multiple processes run on the same processor?
Space-sharing
• Machine divided into subsets, only one app at a time in a subset
• Processes can be pinned to processors, or left to OS
System allocation
Real world
• User specifies desires in some aspects, system handles some
Usually adopt the view: process <-> processor
46
Recap: Performance Trade-offs
Programmer’s View of Performance
Different goals often have conflicting demands
• Load balance
– fine-grain tasks, random or dynamic assignment
• Communication
– coarse-grain tasks, decompose to obtain locality
• Extra work
– coarse-grain tasks, simple assignment
• Communication cost:
– big transfers: amortize overhead and latency
– small transfers: reduce contention
Speedup ≤ Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
47
Recap (cont)
Architecture view
• cannot solve load imbalance or eliminate inherent communication
But can:
• reduce incentive for creating ill-behaved programs
– efficient naming, communication and synchronization
• reduce artifactual communication
• provide efficient naming for flexible assignment
• allow effective overlapping of communication
48
Uniprocessor View
[Figure: execution time (s) on one processor, broken into busy-useful and data-local components]
Performance depends heavily on memory hierarchy
Managed by hardware
Time spent by a program
• Time_prog(1) = Busy(1) + Data Access(1)
• Divide by cycles to get CPI equation
Data access time can be reduced by:
• Optimizing machine
– bigger caches, lower latency, ...
• Optimizing program
– temporal and spatial locality
49
Same Processor-Centric Perspective
[Figure: (a) sequential versus (b) parallel with four processors: execution time (s) per processor P0–P3, broken into busy-useful, busy-overhead, data-local, data-remote, and synchronization components]
50
What is a Multiprocessor?
A collection of communicating processors
• Goals: balance load, reduce inherent communication and extra work
A multi-cache, multi-memory system
• Role of these components essential regardless of programming model
• Prog. model and comm. abstr. affect specific performance tradeoffs
[Figure: generic multiprocessor: processors and memories connected by an interconnect]
51
Relationship between Perspectives
Performance issue | Parallelization step(s) | Processor time component
Load imbalance and synchronization | Decomposition/assignment/orchestration | Synch wait
Inherent communication volume | Decomposition/assignment | Data-remote
Artifactual communication and data locality | Orchestration | Data-local
Extra work | Decomposition/assignment | Busy-overhead
Communication structure | Orchestration/mapping | –

Speedup ≤ (Busy(1) + Data(1)) / (Busy-useful(p) + Data-local(p) + Synch(p) + Data-remote(p) + Busy-overhead(p))
52
Artifactual Communication
Accesses not satisfied in local portion of memory hierarchy cause "communication"
• Inherent communication, implicit or explicit, causes transfers
– determined by program
• Artifactual communication
– determined by program implementation and arch. interactions
– poor allocation of data across distributed memories
– unnecessary data in a transfer
– unnecessary transfers due to system granularities
– redundant communication of data
– finite replication capacity (in cache or main memory)
• Inherent communication is what occurs with unlimited capacity, small transfers, and perfect knowledge of what is needed.
53
Spatial Locality Example
• Repeated sweeps over 2-d grid, each time adding 1 to elements
[Figure: (a) two-dimensional array: a page straddles partition boundaries, making memory difficult to distribute well, and a cache block straddles a partition boundary; contiguity is in memory layout]
54
Spatial Locality Example
• Repeated sweeps over 2-d grid, each time adding 1 to elements
• Natural 2-d versus higher-dimensional array representation
[Figure: (a) two-dimensional array: a page straddles partition boundaries, making memory difficult to distribute well, and a cache block straddles a partition boundary; (b) four-dimensional array: a page does not straddle a partition boundary and each cache block stays within a partition; contiguity is in memory layout]
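A sketch of the four-dimensional representation (dimensions chosen arbitrarily for illustration): indexing as grid4[bi][bj][i][j] makes each block, and hence each processor's partition, contiguous in memory:

```c
#include <stdio.h>

#define NB 4    /* blocks per dimension (sqrt(p) = 4, so p = 16) */
#define B  64   /* each block is B x B elements                  */

/* 2-d view would be grid2[NB*B][NB*B]; the 4-d view keeps the
   (bi,bj) block, owned by one process, contiguous in memory, so
   cache blocks and pages do not straddle partitions. */
static double grid4[NB][NB][B][B];

int main(void) {
    for (int bi = 0; bi < NB; bi++)
        for (int bj = 0; bj < NB; bj++)   /* each (bi,bj) -> one process */
            for (int i = 0; i < B; i++)
                for (int j = 0; j < B; j++)
                    grid4[bi][bj][i][j] += 1.0;  /* block-contiguous sweep */
    printf("done: %g\n", grid4[0][0][0][0]);
    return 0;
}
```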
55
Tradeoffs with Inherent Communication
Partitioning grid solver: blocks versus rows
• Blocks still have a spatial locality problem on remote data
• Rowwise can perform better despite worse inherent c-to-c ratio
•Result depends on n and p
[Figure: good spatial locality on nonlocal accesses at a row-oriented boundary; poor spatial locality on nonlocal accesses at a column-oriented boundary]
56
Example Performance Impact
Equation solver on SGI Origin2000
[Figure: speedup versus number of processors (1–31) for the equation solver: (a) smaller problem size, (b) larger problem size, comparing 2D, 4D, and row-wise partitionings, with and without round-robin (rr) page placement]
57
Working Sets Change with P
[Figure: working sets: 8-fold reduction in miss rate from 4 to 8 processors]
59
Implications for Programming Models
Coherent shared address space and explicit message passing
Assume distributed memory in all cases
Recall any model can be supported on any architecture
• Assume both are supported efficiently
• Assume communication in SAS is only through loads and stores
• Assume communication in SAS is at cache block granularity
60
Issues to Consider
Functional issues:
• Naming: How are logically shared data and/or processes referenced?
• Operations: What operations are provided on these data?
• Ordering: How are accesses to data ordered and coordinated?
Performance issues:
• Granularity and endpoint overhead of communication
– (latency and bandwidth depend on network, so considered similar)
• Replication: How are data replicated to reduce communication?
• Ease of performance modeling
Cost issues:
• Hardware cost and design complexity
61
Sequential Programming Model
Contract
• Naming: Can name any variable (in virtual address space)
– Hardware (and perhaps compilers) does translation to physical addresses
• Operations: Loads, Stores, Arithmetic, Control
• Ordering: Sequential program order
Performance optimizations
• Compilers and hardware violate program order without getting caught
– Compiler: reordering and register allocation
– Hardware: out of order, pipeline bypassing, write buffers
• Retain dependence order on each "location"
• Transparent replication in caches
Ease of performance modeling: complicated by caching
62
SAS Programming Model
Naming: Any process can name any variable in shared space
Operations: loads and stores, plus those needed for ordering
Simplest ordering model:
• Within a process/thread: sequential program order
• Across threads: some interleaving (as in time-sharing)
• Additional ordering through explicit synchronization
• Can compilers/hardware weaken order without getting caught?
– Different, more subtle ordering models also possible (more later)
63
Synchronization
Mutual exclusion (locks)
• Ensure certain operations on certain data can be performed by only one process at a time
• Room that only one person can enter at a time
• No ordering guarantees
Event synchronization
• Ordering of events to preserve dependences
– e.g. producer -> consumer of data
• 3 main types:
– point-to-point
– global
– group
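A minimal pthreads sketch of both kinds: a mutex for mutual exclusion and a flag plus condition variable for point-to-point (producer -> consumer) event synchronization:

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int data = 0, ready = 0;           /* ready is the event flag */

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);               /* mutual exclusion on data */
    data = 42;
    ready = 1;                            /* signal the event ...     */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    pthread_mutex_lock(&m);
    while (!ready)                        /* ... consumer waits for it */
        pthread_cond_wait(&cv, &m);
    printf("consumed %d\n", data);
    pthread_mutex_unlock(&m);

    pthread_join(t, NULL);
    return 0;
}
```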
64
Message Passing Programming Model
Naming: Processes can name private data directly
• No shared address space
Operations: Explicit communication through send and receive
• Send transfers data from private address space to another process
• Receive copies data from process to private address space
• Must be able to name processes
Ordering:
• Program order within a process
• Send and receive can provide pt-to-pt synch between processes
– Complicated by asynchronous message passing
• Mutual exclusion inherent + conventional optimizations legal
Can construct global address space:
• Process number + address within process address space
• But no direct operations on these names
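A minimal MPI sketch of the model (e.g. run with mpirun -np 2): processes are named by rank, and the blocking send/receive pair both transfers the data and synchronizes the two processes:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;                       /* private data of process 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocking receive: copies into private space and synchronizes */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```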
65
Naming
Uniprocessor: Can name any variable (in virtual address space)
• Hardware (and perhaps compiler) does translation to physical addresses
SAS: similar to uniprocessor; system does it all
MP: each process can only directly name the data in its address space
• Need to specify from where to obtain or where to transfer nonlocal data
• Easy for regular applications (e.g. Ocean)
• Difficult for applications with irregular, time-varying data needs
– Barnes-Hut: where are the parts of the tree that I need? (changes with time)
– Raytrace: where are the parts of the scene that I need? (unpredictable)
• Solution methods exist
– Barnes-Hut: extra phase determines needs and transfers data before computation phase
– Raytrace: scene-oriented rather than ray-oriented approach
– both: emulate application-specific shared address space using hashing
66
Operations
Sequential: Loads, Stores, Arithmetic, Control
SAS: loads and stores, plus those needed for ordering
MP: Explicit communication through send and receive
• Send transfers data from private address space to another process
•Receive copies data from process to private address space
•Must be able to name processes
67
Replication
Who manages it (i.e. who makes local copies of data)?
• SAS: system; MP: program
Where in the local memory hierarchy is replication first done?
• SAS: cache (or memory too); MP: main memory
At what granularity is data allocated in the replication store?
• SAS: cache block; MP: program-determined
How are replicated data kept coherent?
• SAS: system; MP: program
How is replacement of replicated data managed?
• SAS: dynamically at fine spatial and temporal grain (every access)
• MP: at phase boundaries, or emulate cache in main memory in software
Of course, SAS affords many more options too (discussed later)
68
Communication Overhead and Granularity
Overhead directly related to hardware support provided
• Lower in SAS (order of magnitude or more)
Major tasks:
• Address translation and protection
– SAS uses MMU
– MP requires software protection, usually involving OS in some way
• Buffer management
– fixed-size small messages in SAS easy to do in hardware
– flexible-sized messages in MP usually need software involvement
• Type checking and matching
– MP does it in software: lots of possible message types due to flexibility
• A lot of research in reducing these costs in MP, but still much larger
Naming, replication and overhead favor SAS
• Many irregular MP applications now emulate SAS/cache in software
69
Block Data Transfer
Fine-grained communication not most efficient for long messages
• Latency and overhead as well as traffic (headers for each cache line)
SAS: can use block data transfer
• Explicit in the system we assume, but can be automated at page or object level in general (more later)
• Especially important to amortize overhead when it is high
– latency can be hidden by other techniques too
Message passing:
• Overheads are larger, so block transfer more important
• But very natural to use since messages are explicit and flexible
– Inherent in model
70
Synchronization
SAS: Separate from communication (data transfer)
• Programmer must orchestrate separately
Message passing
• Mutual exclusion by fiat
• Event synchronization already in send-receive match in synchronous
– need separate orchestration (using probes or flags) in asynchronous
71
Hardware Cost and Design Complexity
Higher in SAS, and especially cache-coherent SAS
But both are more complex issues
• Cost
– must be compared with cost of replication in memory
– depends on market factors, sales volume and other nontechnical issues
• Complexity
– must be compared with complexity of writing high-performance programs
– Reduced by increasing experience
72
Performance Model
Three components:
• Modeling cost of primitive system events of different types
• Modeling occurrence of these events in the workload
• Integrating the two in a model to predict performance
Second and third are most challenging
Second is the case where cache-coherent SAS is more difficult
• replication and communication implicit, so events of interest implicit
– similar to problems introduced by caching in uniprocessors
• MP has good guideline: messages are expensive, send infrequently
• Difficult for irregular applications in either case (but more so in SAS)
Block transfer, synchronization, cost/complexity, and performance modeling advantageous for MP
73
Summary for Programming Models
Given tradeoffs, architect must address:
• Hardware support for SAS (transparent naming) worthwhile?
• Hardware support for replication and coherence worthwhile?
• Should explicit communication support also be provided in SAS?
Current trend:
• Tightly-coupled multiprocessors: support for cache-coherent SAS in hardware
• Other major platform is clusters of workstations or multiprocessors
– currently don't support SAS in hardware, mostly use message passing
• At the highest end, clusters of cache-coherent SAS multiprocessors
74
Summary
Crucial to understand characteristics of parallel programs
• Implications for a host of architectural issues at all levels
Architectural convergence has led to:
• Greater portability of programming models and software
– Many performance issues similar across programming models too
• Clearer articulation of performance issues
– Used to use PRAM model for algorithm design
– Now models that incorporate communication cost (BSP, LogP, ...)
– Emphasis in modeling shifted to end-points, where cost is greatest
– But need techniques to model application behavior, not just machines
Performance issues trade off with one another; iterative refinement
Ready to understand using workloads to evaluate systems issues