Programming Environments for Multigrain Parallelization


Transcript of Programming Environments for Multigrain Parallelization

Page 1: Programming Environments for Multigrain Parallelization

Nikolopoulos, D. (2003, June). Programming Environments for Multigrain Parallelization. EURESCO (European Research Conference Series).


Page 2: Programming Environments for Multigrain Parallelization

Dimitris Nikolopoulos, Assistant Professor
Department of Computer Science, The College of William & Mary

Page 3: In a nutshell

• State-of-the-art
  – Deep machines
  – Multiple forms of parallelism
  – Balanced hardware
  – Unbalanced software
• Proposal:
  – Integrated management of multilevel parallelism
  – Explicitly parallel programming models
  – Compiler/runtime support for:
    • fine-grain multithreading
    • nested data placement and layout optimizations
    • novel architectural features (multithreading, SIMD, speculation)

Page 4: This talk

• A proposed programming model for deep systems
  – Balancing programming complexity with performance
  – Compiler/runtime support for managing (not discovering) parallelism and data
• Preliminary work
  – Locality optimizations for shared cache memories

Page 5: Agenda

• Multigrain parallelization (state of the art)
• GAS programming notation
• Compiler/runtime components
• Preliminary work
  – Data-centric thread formation for shared caches
  – Bandwidth-driven thread scheduling
• Future plans

Page 6: Hybrid Programming Models

• First idea: hybrid programming models
  – Combination of MPI and one of many shared-memory models (OpenMP, Pthreads, DSM, ...)
• Pros
  – Well-known parallel programming paradigms
  – Easy (?) to parallelize at multiple levels
  – Available software components

Page 7: Hybrid Programming Models

• Cons
  – Uncooperative libraries (thread safety, threading semantics, memory management)
  – Ad-hoc parallelization process
    • Usually start with efficient MPI and use OpenMP wherever possible
    • Unfortunately, the MPI/OpenMP ratio is critical (64x1, 32x2, or 16x4? see the sketch after this list)
  – Contradictory performance results
    • IBM SP cluster: probably bad
    • Pentium clusters: mostly bad, sometimes good
    • Origin2000: mostly good, sometimes bad
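To make the hybrid approach concrete, here is a minimal MPI+OpenMP sketch (an illustration, not code from the talk): the MPI rank count and the OpenMP thread count together set the ratio discussed above, e.g. 64x1, 32x2, or 16x4.

#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Coarse grain: one chunk of the data per MPI rank
    // (assumes size divides N, for brevity).
    const int N = 1 << 20;
    const int chunk = N / size;
    std::vector<double> local(chunk, 1.0);

    // Fine grain: OpenMP threads split the rank's loop.
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < chunk; i++)
        local_sum += local[i] * local[i];

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}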

Page 8: Flatten Parallelism

• MPI for everything
  – MPI over distributed memory
  – MPI over shared memory
  – Backward compatibility
  – Unbalanced code
• MPI for distributed-memory communication, threads for shared-memory communication
  – Hide the deeper levels of parallelism in the communication libraries
  – Fine-grain parallelism only for communication, instead of communication and computation

Page 9: Unification

• Unified management of multiple levels of parallelism
  – Processes, user-level threads, kernel-level threads, hardware threads
  – Share-nothing, shared memory, on-chip shared caches
• One means of expressing parallelism at all levels
• One means of expressing data distribution at all levels

Page 10: Global Address Space

• Abstraction of shared memory
  – But software always knows where data is placed
• Built around the thread and loop constructs
  – Threads and loops can be nested
• Difficulties
  – Managing communication
  – Data placement and data layout
  – Managing threads at many layers
  – Exploiting deep on-chip parallelism

Page 11: Global Address Space

• Earlier work
  – HPF
  – ZPL, NESL, and other data-parallel languages
  – UPC
  – Other threaded-C languages (Split-C, EARTH-C)
• Limitations
  – One level of parallelism (except NESL, EARTH-C)
  – If multiple levels of parallelism, limited to data (NESL) or computation (EARTH-C), not both
  – Limited exploitation of architectural innovations

Page 12: An Explicit Multigrain Parallel Programming Model

• Multiple levels of parallelism mapped to multiple levels of physical parallel execution contexts
  – Vertical exploitation of parallelism
  – Separation of the abstraction from the execution vehicle
• Utilization of deep parallelism (hardware threads) for communication and computation
• Multiple levels of implicit data distribution
  – Multilevel caches, on-chip DRAM, off-chip DRAM, ...
  – Explicit data distribution for backward compatibility
• Effective use of new processor features

Page 13: Notation for Parallelism

• Pinpoint parallelism
  – Threads (not necessarily functions; can be basic blocks)
  – Loops
• Describe parallelism in the compiler as a hierarchy
  – No automatic parallelization; the hierarchy is specified by the user
• A greedy programming strategy
  – The programmer specifies all the parallelism at all levels
  – The programmer specifies distributable data
  – The compiler/RTS decides how much parallelism should be extracted
  – Fully dynamic model, subject to resource constraints

Page 14: Simple motivating example

[Diagram: three parallel "for" loops mapped onto execution contexts 0 through 9]

Page 15: Simple motivating example

[Diagram: an alternative mapping of the same loops onto execution contexts 0 through 9]

Page 16: Vertical exploitation of parallelism

• Top-down approach
  – Optimize the first level (e.g. MPI) first
  – After optimizing MPI, optimize the second level (e.g. user-level threads)
• Vertical approach (a search sketch follows this list)
  – Assume n levels with capacities Pi, i = 1, ..., n
  – Estimate the speedup at each level as a function of the execution contexts used: fi(Ti), Ti ≤ Pi
  – Estimate the multiplicative speedup f1(T1) f2(T2) ... fn(Tn)
  – Maximize the multiplicative speedup subject to the constraints Ti ≤ Pi
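A hypothetical sketch of that maximization, assuming the per-level speedup models fi are supplied (measured or estimated) and the capacities are small enough to enumerate exhaustively:

#include <functional>
#include <vector>

using SpeedupModel = std::function<double(int)>;

// Return the thread counts T_i (1 <= T_i <= P_i) that maximize the
// multiplicative speedup f_1(T_1) * ... * f_n(T_n), by exhaustive
// enumeration over the (small) space of allocations.
std::vector<int> best_allocation(const std::vector<int>& P,
                                 const std::vector<SpeedupModel>& f) {
    const size_t n = P.size();
    std::vector<int> T(n, 1), best(n, 1);
    double best_speedup = 0.0;
    std::function<void(size_t, double)> search =
        [&](size_t level, double speedup) {
            if (level == n) {
                if (speedup > best_speedup) { best_speedup = speedup; best = T; }
                return;
            }
            for (int t = 1; t <= P[level]; t++) {
                T[level] = t;
                search(level + 1, speedup * f[level](t));
            }
        };
    search(0, 1.0);
    return best;
}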

Page 17: Data notations

v<T>;        // A simple templated parallel vector,
             // e.g. v<Double> = [1.0 ... 100.0];
v<v<T>>;     // A collection of templated parallel vectors,
             // e.g. v<v<Double>> = [[1.0 ... 50.0] [50.0 ... 100.0]]
             // for two processors; the split is derived by the
             // compiler/runtime system
v<v<v<T>>>;  // A collection of collections of templated parallel vectors

s = [100];             // A data segment descriptor for a uniprocessor
s = [50 50];           // A data segment descriptor for a dual SMP
s = [[25 25] [25 25]]; // A data segment descriptor for two dual SMPs
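A rough C++ analogue of the nested-vector idea above (purely illustrative; distribute and its segment argument are hypothetical helpers, not the model's API): a v<v<Double>> behaves like a vector of per-processor chunks, with the split taken from a segment descriptor such as [50 50].

#include <cstddef>
#include <vector>

using V  = std::vector<double>;   // plays the role of v<Double>
using VV = std::vector<V>;        // plays the role of v<v<Double>>

// Split `data` into per-processor chunks according to a segment
// descriptor, e.g. segment = {50, 50} for a dual SMP.
VV distribute(const V& data, const std::vector<int>& segment) {
    VV out;
    std::size_t pos = 0;
    for (int len : segment) {
        out.emplace_back(data.begin() + pos, data.begin() + pos + len);
        pos += len;
    }
    return out;
}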

Page 18: Runtime Data Management

• Implicit approach
  – A small set of predefined data distributions (block/cyclic)
  – Runtime support for tuning data distributions at execution time
  – Based on previous work on runtime data distribution on ccNUMA
• Novel features
  – Data distribution combined with data layout optimizations
  – Flattening/explosion of data types
  – Profile-guided optimization for irregular data structures

Page 19: Data-centric thread formation

• Conceptually (a block-formation sketch follows this list)
  – Organize data in blocks that fit levels of the memory hierarchy (L1, L2, L3, DRAM)
  – Associate blocks with threads (an affinity relationship; no hardware parallelism assumed here)
  – Group threads based on block affinity
  – Match with hardware resources
  – Runtime support: dynamic splitting and coarsening (flattening/floating parallelism)
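A minimal sketch of the first two steps, under stated assumptions (the half-cache sizing heuristic and the names are mine, not the talk's): partition an array into blocks sized for a target cache level, so one thread can later be associated with each block.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Block { std::size_t offset, bytes; };

// Partition `total` bytes into blocks that fit half of the target
// cache level, leaving room for the thread's other data.
std::vector<Block> form_blocks(std::size_t total, std::size_t cache_bytes) {
    const std::size_t block = cache_bytes / 2;
    std::vector<Block> blocks;
    for (std::size_t off = 0; off < total; off += block)
        blocks.push_back({off, std::min(block, total - off)});
    return blocks;   // next step: associate one thread per block (affinity)
}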

Page 20: Managing architectural features

• Explicit parallelism exposed to the programming model
  – SIMD
  – Hardware threads
• Prefetching controlled by the compiler
• Speculative execution controlled by the compiler/hardware
• Open problems:
  – Inner-processor parallelism vs. outer parallelism
  – When and how to use speculative execution

Page 21: Scheduling Issues

• Static and dynamic scheduling of computation is manageable
• Dynamic scheduling of computation and communication is difficult
  – How many threads for fine-grain computation?
  – How many threads for communication, and where to schedule them?
  – Optimization constrained by multiple parameters (CPU, bandwidth, cache space, possibly others)
  – Currently investigating runtime scheduling methods based on observed hardware metrics

Page 22: Preliminary Results - Tiling on SMTs

• Data-centric thread formation
• Problem: multithreaded processors use shared caches
• Shared caches may help parallelization:
  – Communication/synchronization is accelerated
• Shared caches may hurt parallelization:
  – If the code is properly tiled, the threads will conflict in the cache
  – Overall performance is poor

Page 23: SMT execution engine

[Diagram: SMT execution engine pipeline (IQ, IS, RR, EX, L1, RW) shared by the hardware threads]

Page 24: First solution: smaller tiles and copy

• Partition the cache between two tiles
• Pin the tiles in a fixed set of cache locations
• How? (a buffer-allocation sketch follows this list)
  – Copy tiles to buffers
    • Set up one buffer per thread
    • Align the buffers so that buffers from different threads do not conflict
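One way this could look in code, as a hedged sketch (assuming POSIX posix_memalign, a 256 KB shared cache, and virtual addresses that behave like cache indices, which in practice requires page coloring): each thread copies its current tile into a private buffer placed in its own half of the cache.

#include <cstdlib>   // posix_memalign (POSIX)
#include <cstring>

constexpr std::size_t CACHE_BYTES = 256 * 1024;   // assumed shared cache size
constexpr std::size_t PART = CACHE_BYTES / 2;     // one partition per thread
static_assert(64 * 64 * sizeof(double) <= PART, "tile must fit a partition");

// One region aligned to the cache size; thread t gets the sub-buffer
// at offset t * PART, so the two threads' tiles map to disjoint sets.
double* tile_buffer(int t) {
    static void* region = [] {
        void* p = nullptr;
        posix_memalign(&p, CACHE_BYTES, 2 * PART);
        return p;
    }();
    return reinterpret_cast<double*>(static_cast<char*>(region) + t * PART);
}

// Copy a T x T tile (lda = leading dimension of the source array)
// into the thread's pinned buffer.
void copy_tile(const double* src, std::size_t lda, std::size_t T, double* buf) {
    for (std::size_t i = 0; i < T; i++)
        std::memcpy(buf + i * T, src + i * lda, T * sizeof(double));
}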

Page 25: Second solution: block data layouts

Row-major array layout (memory order 11 12 13 14 21 22 ... 44):

  11 12 13 14
  21 22 23 24
  31 32 33 34
  41 42 43 44

Block array layout (2x2 blocks, each block contiguous; memory order
11 12 21 22 | 13 14 23 24 | 31 32 41 42 | 33 34 43 44):

  11 12 | 13 14
  21 22 | 23 24
  ------+------
  31 32 | 33 34
  41 42 | 43 44

Page 26: Using block data layouts

• Select the best block size and organize the array in blocks
• Store tiles contiguously in memory
• Align tiles so that (an indexing sketch follows the figure below):
  – Tiles of the same thread conflict (!)
  – Tiles of different threads do not conflict

[Figure: the 4x4 array from the previous slide stored in block order: 11 12 21 22, 13 14 23 24, 31 32 41 42, 33 34 43 44]
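For concreteness, a sketch of the address computation for such a block layout (illustrative only; assumes N is a multiple of the block size B):

#include <cstddef>

// Offset of element (i, j) in an N x N array stored as contiguous
// B x B blocks laid out in row-major block order.
std::size_t block_index(std::size_t i, std::size_t j,
                        std::size_t N, std::size_t B) {
    const std::size_t bi = i / B, bj = j / B;   // block coordinates
    const std::size_t oi = i % B, oj = j % B;   // offsets inside the block
    const std::size_t blocks_per_row = N / B;
    return (bi * blocks_per_row + bj) * B * B   // start of the block
         + oi * B + oj;                         // position within the block
}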

Page 27: Scheduling and Runtime Support

• Automatic detection of cache contention
  – Using hardware counters
  – Adaptivity to execution contexts: runs with one thread per processor or multiple threads per processor
• Interleaving for virtual cache partitioning, blocking for non-partitioned caches
  – Adaptivity to cache organization
• Multilevel cache management
  – Multilevel tiling

Page 28: Results

• Blocked matrix multiply (3 levels); see the tiling sketch after this list
• SOR 5-point stencil
• Intel C compiler (7.0)
• Manual parallelization (Linux threads; OpenMP version under construction)
• 4-CPU Xeon MP 1.4 GHz, 512 KB L3 cache, 256 KB L2 cache, 8 KB L1 cache
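The measured kernel is not shown in the slides; the following is an illustrative two-level tiling of C += A*B (a third level for the L3 cache would wrap these loops the same way), with tile sizes that are assumptions rather than the experiment's values:

constexpr int T2 = 128;   // tile sized for the L2 cache
constexpr int T1 = 32;    // tile sized for the L1 cache; divides T2

// C += A * B for row-major N x N matrices, tiled for two cache levels.
void matmul_tiled(const double* A, const double* B, double* C, int N) {
    for (int ii = 0; ii < N; ii += T2)
    for (int jj = 0; jj < N; jj += T2)
    for (int kk = 0; kk < N; kk += T2)                    // L2-level tiles
        for (int i = ii; i < ii + T2 && i < N; i += T1)
        for (int j = jj; j < jj + T2 && j < N; j += T1)
        for (int k = kk; k < kk + T2 && k < N; k += T1)   // L1-level tiles
            for (int x = i; x < i + T1 && x < N; x++)
            for (int y = j; y < j + T1 && y < N; y++) {
                double s = C[x * N + y];
                for (int z = k; z < k + T1 && z < N; z++)
                    s += A[x * N + z] * B[z * N + y];
                C[x * N + y] = s;
            }
}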

Page 29: Results: matrix multiplication

[Chart: execution time (seconds) vs. matrix size (n = 2000 to 3000) for: 1 thread; 4 threads on 2 SMTs; 4 threads on 2 SMTs with dynamic tiling; 8 threads on 4 SMTs; 8 threads on 4 SMTs with dynamic tiling]

Page 30: Results: matrix multiplication (L1 cache misses)

[Chart: L1 cache misses vs. matrix size (n = 2000 to 3000) for: 4 threads on 2 SMTs with fixed tiles; 4 threads on 2 SMTs with dynamic tiling; 8 threads on 4 SMTs with fixed tiles; 8 threads on 4 SMTs with dynamic tiling]

Page 31: Results: SOR stencil

[Chart: execution time (seconds) vs. matrix size (n = 1000 to 2000) for: 1 thread; 4 threads on 2 SMTs; 4 threads on 2 SMTs with dynamic tiling; 8 threads on 4 SMTs; 8 threads on 4 SMTs with dynamic tiling]

Page 32: Impact of bandwidth saturation

• One application running on an idle bus, using two SMTs
• One application using two SMTs, running with two bandwidth-friendly (nBBMA) communication threads
• One application using two SMTs, running with two bandwidth-unfriendly (BBMA) communication threads

[Charts: bus transactions per µsec for each benchmark (Radiosity, Volrend, Water-nsqr, Barnes, LU CB, FMM, BT, SP, MG, CG, Raytrace) under three configurations: 1 application (2 threads); 1 application (2 threads) + 2 nBBMA; 1 application (2 threads) + 2 BBMA. Additional panels show the application's slowdown with + 2 nBBMA (axis 0.95 to 1.03) and with + 2 BBMA (axis 0 to 3)]

Page 33: Goal

• Devise runtime scheduling algorithms that
  – Do not oversubscribe the memory bandwidth
  – Meet the other scheduling objectives
• Methodology
  – Move from a cache-centric to a bandwidth-centric scheduling algorithm
  – Use existing time-sharing schedulers

Page 34: A scheduling algorithm

• A gang-scheduling-like policy
  – Threads of the application are co-scheduled in rounds
• Each batch of threads is selected so that the bandwidth is not oversubscribed
• How do we measure bandwidth? (a bookkeeping sketch follows this list)
  – On-line measurement using hardware counters
  – One measurement per time quantum (100 ms)
  – Weighted averaging of the measurements, to discard obsolete history
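A minimal sketch of that bookkeeping (the weight is an assumption, not a value from the talk): fold each quantum's hardware-counter sample into an exponentially weighted average so that stale history decays.

// Per-thread bus transaction rate (BTR) estimate, updated once per
// 100 ms quantum from a hardware-counter sample.
struct BandwidthEstimate {
    double avg = 0.0;
    static constexpr double alpha = 0.7;   // weight of the newest sample
    void add_sample(double btr_this_quantum) {
        avg = alpha * btr_this_quantum + (1.0 - alpha) * avg;
    }
};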

Page 35: Last quantum

• Fitness metric:
  – Available bandwidth per unallocated thread
  – Select the thread that best utilizes the available bandwidth

ABTR = (SBTR_max − SBTR) / unallocated_threads

fitness_thread = 1000 / (1 + |ABTR − BTR_expected,thread|)

where SBTR is the measured system bus transaction rate, SBTR_max its saturation value, and BTR_expected,thread the thread's expected bus transaction rate.
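A hedged sketch of the selection step implied by the two formulas above (inputs are assumed to come from the counter bookkeeping on the previous slide):

#include <cmath>
#include <cstddef>
#include <vector>

// Pick the unallocated thread whose expected bus transaction rate
// best matches the available rate per thread (highest fitness).
// `expected_btr` holds the estimates for the unallocated threads
// and must be non-empty.
int select_thread(double sbtr_max, double sbtr,
                  const std::vector<double>& expected_btr) {
    const double abtr =
        (sbtr_max - sbtr) / static_cast<double>(expected_btr.size());
    int best = -1;
    double best_fitness = -1.0;
    for (std::size_t t = 0; t < expected_btr.size(); t++) {
        const double fitness =
            1000.0 / (1.0 + std::fabs(abtr - expected_btr[t]));
        if (fitness > best_fitness) {
            best_fitness = fitness;
            best = static_cast<int>(t);
        }
    }
    return best;
}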

Page 36: Results with synthetic workloads

[Chart: average turnaround time improvement (%) for a workload of 2 applications (2 threads each) + 2 BBMA + 2 nBBMA, across Radiosity, Volrend, Barnes, Water-nsqr, BT, FMM, LU CB, SP, MG, CG, Raytrace; two history variants are compared: Latest and Window]

Page 37: Conclusions

• Multigrain parallelization is necessary because of the multigrain nature of parallelism in deep computing systems
• A mixture of common programming notations (threads and data) suffices to express multilevel parallelism
• More powerful compiler/runtime support is required, even if we stay with MPI
• Modern microprocessors need to be utilized for HPC, and it is not all up to the compiler to do it