Download - Experiences Building a Multi-platform Compiler for Co-array Fortran

1

John Mellor-Crummey

Cristian Coarfa, Yuri Dotsenko

Department of Computer ScienceRice University

Experiences Building a Multi-platform Compiler for

Co-array Fortran

AHPCRC PGAS Workshop September, 2005

2

Goals for HPC Languages

• Expressiveness

• Ease of programming

• Portable performance

• Ubiquitous availability

3

PGAS Languages

• Global address space programming model

– one-sided communication (GET/PUT)

• Programmer has control over performance-critical factors

– data distribution and locality control

– computation partitioning

– communication placement

• Data movement and synchronization as language primitives

– amenable to compiler-based communication optimization

HPF & OpenMP compilers must get this right

simpler than msg passing

lacking in OpenMP

4

Co-array Fortran Programming Model

• SPMD process images– fixed number of images during execution– images operate asynchronously

• Both private and shared data– real x(20, 20) a private 20x20 array in each image– real y(20, 20)[*] a shared 20x20 array in each image

• Simple one-sided shared-memory communication – x(:,j:j+2) = y(:,p:p+2)[r] copy columns from image r into local columns

• Synchronization intrinsic functions– sync_all – a barrier and a memory fence– sync_mem – a memory fence– sync_team([team members to notify], [team members to wait for])

• Pointers and (perhaps asymmetric) dynamic allocation• Parallel I/O

5

integer a(10,20)[*]

if (this_image() > 1)

a(1:10,1:2) = a(1:10,19:20)[this_image()-1]

a(10,20) a(10,20) a(10,20)

image 1 image 2 image N

image 1 image 2 image N

One-sided Communication with Co-Arrays

6

CAF Compilers

• Cray compilers for X1 & T3E architectures

• Rice Co-Array Fortran Compiler (cafc)

7

Performance comparable to that of hand-tuned MPI codes

Rice cafc Compiler

• Source-to-source compiler

– source-to-source yields multi-platform portability

• Implements core language features

– core sufficient for non-trivial codes

– preliminary support for derived types

• soon support for allocatable components

• Open source

8

Implementation Strategy

• Goals

– portability

– high performance on a wide range of platforms

• Approach

– source-to-source compilation of CAF codes

• use Open64/SL Fortran 90 infrastructure

• CAF Fortran 90 + communication operations

– communication

• ARMCI and GASNet one-sided comm libraries for portability

• load/store communication on shared-memory platforms

9

Key Implementation Concerns

• Fast access to local co-array data

• Fast communication

• Overlap of communication and computation

10

Accessing Co-Array Data

Two Representations

• SAVE and COMMON co-arrays as Fortran 90 pointers– F90 pointers to memory allocated outside Fortran run-time system

– original references accessing local co-array data

• rhs(1,i,j,k,c) = … + u(1,i-1,j,k,c) - …

– transformed references

• rhs%ptr(1,i,j,k,c) = … + u%ptr(1,i-1,j,k,c) - …

• Procedure co-array arguments as F90 explicit-shape arrays– CAF language requires explicit shape for co-array arguments

real :: a(10,10,10)[*]

type CAFDesc_real_3 real, pointer:: ptr(:,:,:) ! F90 pointer to local co-array dataend Type CAFDesc_real_3type(CAFDesc_real_3):: a

11

Performance Challenges

• Problem

– Fortran 90 pointer-based representation does not convey

• the lack of co-array aliasing

• contiguity of co-array data

• co-array bounds information

– lack of knowledge inhibits important code optimizations

• Approach: procedure splitting

12

Procedure Splitting

subroutine f(…)real, save :: c(100)[*]interface subroutine f_inner(…, c_arg) real :: c_arg[*] end subroutine f_innerend interface

call f_inner(…,c(1))

end subroutine f

subroutine f_inner(…, c_arg)real :: c_arg(100)[*]

... = c_arg(50) ...

end subroutine f_inner

subroutine f(…)real, save :: c(100)[*]

... = c(50) ...

end subroutine f

CAF to CAF optimization

Benefits• better alias analysis• contiguity of co-array data• co-array bounds information• better dependence analysis

result: back-end compiler can generate better code

13

Implementing Communication

• x(1:n) = a(1:n)[p] + …

• General approach: use buffer to hold off processor data

– allocate buffer

– perform GET to fill buffer

– perform computation: x(1:n) = buffer(1:n) + …

– deallocate buffer

• Optimizations

– no buffer for co-array to co-array copies

– unbuffered load/store on shared-memory systems

14

Strided vs. Contiguous Transfers

• Problem

– CAF remote reference might induce many small data transfers

• a(i,1:n)[p] = b(j,1:n)

• Solution

– pack strided data on source and unpack it on destination

• Constraints

– can’t express both source-level packing and unpacking for a one-sided transfer

– two-sided packing/unpacking is awkward for users

• Preferred approach

– have communication layer perform packing/unpacking

15

Pragmatics of Packing

Who should implement packing?

• CAF programmer

– difficult to program

• CAF compiler

– must convert PUTs into two-sided communication to unpack

• difficult whole-program transformation

• Communication library

– most natural place

– ARMCI currently performs packing on Myrinet (at least)

16

Synchronization

• Original CAF specification: team synchronization only

– sync_all, sync_team

• Limits performance on loosely-coupled architectures

• Point-to-point extensions

– sync_notify(q)

– sync_wait(p)

Point to point synchronization semantics

Delivery of a notify to q from p

all communication from p to q issued before the notify has been delivered to q

17

Hiding Communication Latency

Goal: enable communication/computation overlap

• Impediments to generating non-blocking communication

– use of indexed subscripts in co-dimensions

– lack of whole program analysis

• Approach: support hints for non-blocking communication

– overcome conservative compiler analysis

– enable sophisticated programmers to achieve good performance today

18

Questions about PGAS Languages

• Performance

– can performance match hand-tuned msg passing programs?

– what are the obstacles to top performance?

– what should be done to overcome them?

• language modifications or extensions?

• program implementation strategies?

• compiler technology?

• run-time system enhancements?

• Programmability

– how easy is it to develop high performance programs?

19

Investigating these Issues

Evaluate CAF, UPC, and MPI versions of NAS benchmarks

• Performance

– compare CAF and UPC performance to that of MPI versions

• use hardware performance counters to pinpoint differences

– determine optimization techniques common for both languages as well as language specific optimizations

• language features

• program implementation strategies

• compiler optimizations

• runtime optimizations

• Programmability

– assess programmability of the CAF and UPC variants

20

Platforms and Benchmarks

• Platforms

– Itanium2+Myrinet 2000 (900 MHz Itanium2)

– Alpha+Quadrics QSNetI (1 GHz Alpha EV6.8CB)

– SGI Altix 3000 (1.5 GHz Itanium2)

– SGI Origin 2000 (R10000)

• Codes

– NAS Parallel Benchmarks (NPB 2.3) from NASA Ames

– MG, CG, SP, BT

– CAF and UPC versions were derived from Fortran77+MPI versions

21

MG class A (2563) on Itanium2+Myrinet2000

Intel compiler: restrict yields factor of 2.3 performance

improvement

UPCstrided comm

28% faster thanmultiple transfers

UPCpoint to point

49% faster than barriers

CAFpoint to point

35% faster than barriers

Higher is better

22

MG class C (5123) on SGI Altix 3000

Intel C compiler: scalar performance

Fortran compiler: linearized array subscripts 30% slowdown compared

to multidimensional subscripts

Higher is better

64

23

MG class B (2563) on SGI Origin 2000

Higher is better

24

CG class C (150000) on SGI Altix 3000

Intel compiler: sum reductions in C 2.6 times slower than Fortran!

point to point

19% faster than

barriers

Higher is better

25

CG class B (75000) on SGI Origin 2000

Intrepid compiler (gcc): sum reductions in C is up to 54% slower than SGI C/Fortran!

Higher is better

26

SP class C (1623) on Itanium2+Myrinet2000

restrict yields 18%

performance improvement

Higher is better

27

SP class C (1623) on Alpha+Quadrics

Higher is better

28

BT class C (1623) on Itanium2+Myrinet2000

UPC: use of restrict boosts the performance 43%

CAF: procedure splitting improves performance 42-60%

UPC: comm. packing 32%

faster

CAF: comm. packing 7%

faster

Higher is better

29

BT class B (1023) on SGI Altix 3000

use of restrict improves

performance 30%

Higher is better

30

Performance Observations

• Achieving highest performance can be difficult

– need effective optimizing compilers for PGAS languages

• Communication layer is not the problem

– CAF with ARMCI or GASNet yields equivalent performance

• Scalar code optimization of scientific code is the key!

– SP+BT: SGI Fortran: unroll+jam, SWP

– MG: SGI Fortran: loop alignment, fusion

– CG: Intel Fortran: optimized sum reduction

• Linearized subscripts for multidimensional arrays hurt!

– measured 30% performance gap with Intel Fortran

31

Performance Prescriptions

For portable high performance, we need …

• Better language support for CAF synchronization

– point-to-point synchronization is an important common case!

– currently only a Rice extension outside the CAF standard

• Better CAF & UPC compiler support

– communication vectorization

– synchronization strength reduction: important for programmability

• Compiler optimization of loops with complex dependences

• Better run-time library support

– efficient communication support for strided array sections

32

Programmability Observations

• Matching MPI performance required using bulk communication

– communicating multi-dimensional array sections is natural in CAF

– library-based primitives are cumbersome in UPC

• Strided communication is problematic for performance

– tedious programming of packing/unpacking at src level

• Wavefront computations

– MPI buffered communication easily decouples sender/receiver

– PGAS models: buffering explicitly managed by programmer