S. Benkner, University of Vienna HIPEAC CSW 2013, Paris, May 2, 2013 Performance Portability and...


Performance Portability and Programmability

for Heterogeneous Many-core Architectures

(PEPPHER)

Siegfried Benkner

(on behalf of the PEPPHER Consortium)

Research Group Scientific Computing

Faculty of Computer Science

University of Vienna

Austria


EU Project PEPPHER

Performance Portability & Programmability for

Heterogeneous Manycore Architectures

• ICT FP7, Computing Systems; 3 years; finished Feb. 2013

• 9 Partners, Coordinated by University of Vienna

• http://www.peppher.eu

Goal: Enable portable, productive and efficient programming of single-node heterogeneous many-core systems.

Holistic Approach

• Component-Based High-Level Program Development

• Auto-tuned Algorithms & Data Structures

• Compilation Strategies

• Runtime Systems

• Hardware Mechanisms


Performance.Portability.Programmability

[Figure: one application (C/C++) targeting many-core CPU, CPU+GPU, PePU (Movidius), PEPPHERSim and Intel Xeon Phi.]

Focus: Single-node/chip heterogeneous architectures

Approach

• Multi-architectural, performance-aware components: multiple implementation variants of functions, each with a performance model

• Task-based execution model & intelligent runtime system: runtime selection of best task implementation variant for given platform
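A minimal C++ sketch of the idea of picking the "best" variant from per-variant performance models. All names (Variant, selectVariant, exampleVariants) and the cost formulas are hypothetical illustrations, not the PEPPHER or StarPU API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Hypothetical illustration; none of these names come from the PEPPHER code base.
struct Variant {
    std::string name;                            // e.g. "cpu", "gpu"
    std::string requiredUnit;                    // kind of execution unit it needs
    std::function<double(std::size_t)> predict;  // performance model: size -> est. seconds
};

// Returns the index of the variant with the lowest predicted time among those
// whose required unit is available on the platform, or -1 if none can run.
int selectVariant(const std::vector<Variant>& variants,
                  const std::vector<std::string>& availableUnits,
                  std::size_t problemSize) {
    int best = -1;
    double bestTime = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < variants.size(); ++i) {
        bool runnable = false;
        for (const std::string& unit : availableUnits)
            if (unit == variants[i].requiredUnit) runnable = true;
        if (!runnable) continue;
        const double t = variants[i].predict(problemSize);
        if (t < bestTime) {
            bestTime = t;
            best = static_cast<int>(i);
        }
    }
    return best;
}

// Two made-up variants: the GPU one pays a fixed transfer cost but scales better.
std::vector<Variant> exampleVariants() {
    return {
        {"cpu", "CPU", [](std::size_t n) { return 1e-6 * n; }},
        {"gpu", "GPU", [](std::size_t n) { return 1e-3 + 1e-7 * n; }},
    };
}
```

With these made-up models the CPU variant wins for small inputs and the GPU variant for large ones; on a platform without a GPU the CPU variant is the only candidate.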

Methodology & framework for development of performance portable code.

• Execute same application efficiently on different heterogeneous architectures.

• Support multiple parallel APIs: OpenMP, OpenCL, CUDA, TBB, ...

PEPPHER Framework


PEPPHER Approach

[Figure: a mainstream programmer writes a component-based application with annotations (components C1, C2, ...); an expert programmer (or compiler/autotuner) supplies component implementation variants for different platforms, algorithms, inputs, ...; measured performance on the target platforms is fed back.]

Programmer

• Annotate calls to performance-critical functions (= components)

• Provide implementation variants for different platforms (PDL)

• Provide component meta-data

Platform Descriptors (PDL)


PEPPHER Approach


PEPPHER framework

• Management of components and implementation variants

• Transformation / Composition

• Implementation variant selection

• Dynamic, performance-aware task scheduling (StarPU runtime)

[Figure: transformation/composition produces an intermediate task-based representation; the runtime system's heterogeneous task scheduler dynamically selects the "best" implementation variant.]



PEPPHER Framework

[Figure: layers of the PEPPHER framework, from application to hardware.]

• Applications: Embedded, General Purpose, HPC

• High-Level Coordination/Patterns/Skeletons: C/C++ source code with annotated component calls; asynchronous calls, data distribution, patterns, SkePU skeletons

• Components (C/C++, OpenMP, CUDA, OpenCL, TBB, Offload): component implementation variants for different core architectures, algorithms, ...; autotuned data structures & algorithms

• Transformation Tool and Composition Tool: component glue code; static variant selection (if any)

• PEPPHER Taskgraph: component task graph with explicit data dependencies

• PEPPHER Run-time (StarPU): scheduling strategies and performance models; performance-aware, data-aware dynamic scheduling of "best" component variants onto free execution units

• Drivers (CUDA, OpenCL, OpenMP)

• Single-node heterogeneous manycore: CPU, GPU, Xeon Phi, PePU, SIM (SIM = PEPPHER simulator; PePU = PEPPHER processing unit, Movidius)


PEPPHER Components

Component Interface

• Specification of functionality

• Used by mainstream programmers

Implementation Variants

• Different architectures/platforms

• Different algorithms/data structures

• Different input characteristics

• Different performance goals

• Written by expert programmers (or generated, e.g. auto-tuning, cf. EU Autotune Project)

[Figure: an «interface» C declaring f(param-list), with interface meta-data, is implemented by «variant» C1 ... Cn, each defining f(param-list){…} and carrying its own variant meta-data.]

Features

• Different programming languages (C/C++, OpenCL, CUDA, OpenMP)

• Task & data parallelism

Constraints

• No side-effects; non-preemptive

• Stateless; composition on CPU only


Platform Description Language (PDL)

Goal: Make platform-specific information explicit for tools and users.

Processing Units (PUs)

• Master (initiates program execution)

• Worker (executes delegated tasks)

• Hybrid (master & worker)

Memory Regions

• Express key characteristics of memory hierarchy

• Can be defined for all processing units

Interconnects

• Describe communication facilities between PUs

Hardware and Software Properties

• e.g., core-count, memory sizes, available libraries

Data movement
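The PDL concepts listed above can be pictured as a small data model. The following C++ sketch is only an in-memory illustration under assumed names (Role, ProcessingUnit, Platform, taskCapableUnits) and made-up example values; it does not reproduce the actual PDL syntax or schema.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of the PDL concepts; not the real PDL format.
enum class Role { Master, Worker, Hybrid };

struct MemoryRegion {
    std::string id;
    std::size_t sizeBytes;
};

struct ProcessingUnit {
    std::string id;
    Role role;
    std::vector<MemoryRegion> memory;               // memory regions of this PU
    std::map<std::string, std::string> properties;  // HW/SW properties as key/value pairs
};

struct Interconnect {
    std::string from;
    std::string to;
    std::string kind;
};

struct Platform {
    std::vector<ProcessingUnit> units;
    std::vector<Interconnect> links;
};

// Units that can execute delegated tasks: workers and hybrids, but not pure masters.
std::vector<std::string> taskCapableUnits(const Platform& p) {
    std::vector<std::string> ids;
    for (const ProcessingUnit& pu : p.units)
        if (pu.role != Role::Master) ids.push_back(pu.id);
    return ids;
}

// Made-up example platform: one hybrid CPU and one GPU worker (values are illustrative).
Platform examplePlatform() {
    Platform p;
    p.units.push_back({"cpu0", Role::Hybrid, {{"host-ram", 24ull << 30}}, {{"core-count", "8"}}});
    p.units.push_back({"gpu0", Role::Worker, {{"device-ram", 3ull << 30}}, {{"library", "CUDA"}}});
    p.links.push_back({"cpu0", "gpu0", "PCIe"});
    return p;
}
```

A tool could query such a model to decide which implementation variants are usable at all, before any performance model is consulted.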


PEPPHER Coordination Language

Component calls

• asynchronous & synchronous

#pragma pph call        // read A, write B -> meta-data
cf1(A, N, B, M);

#pragma pph call
cf2(B, M);

#pragma pph call sync
cf(A, N);
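How read/write meta-data can turn a sequence of component calls into a task graph is sketched below in C++. This is an assumption-laden illustration: the types, the function buildTaskGraph, and the intents assumed for cf2 and cf (read-only) are hypothetical and not taken from the PEPPHER tools.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types; parameter intents would come from the component meta-data.
enum class Intent { Read, Write, ReadWrite };

struct Access {
    std::string data;
    Intent intent;
};

struct Call {
    std::string component;
    std::vector<Access> accesses;
};

using Edge = std::pair<std::size_t, std::size_t>;  // (earlier call, later call)

// Derives dependencies in program order: a call depends on the last writer of
// every datum it touches and, if it writes, on all readers since that write.
std::vector<Edge> buildTaskGraph(const std::vector<Call>& calls) {
    std::map<std::string, std::size_t> lastWriter;
    std::map<std::string, std::vector<std::size_t>> readersSinceWrite;
    std::vector<Edge> edges;
    auto addEdge = [&edges](std::size_t from, std::size_t to) {
        if (from == to) return;
        for (const Edge& e : edges)
            if (e.first == from && e.second == to) return;  // skip duplicates
        edges.emplace_back(from, to);
    };
    for (std::size_t i = 0; i < calls.size(); ++i) {
        for (const Access& a : calls[i].accesses) {
            auto w = lastWriter.find(a.data);
            if (w != lastWriter.end()) addEdge(w->second, i);
            if (a.intent != Intent::Read) {
                for (std::size_t r : readersSinceWrite[a.data]) addEdge(r, i);
                readersSinceWrite[a.data].clear();
                lastWriter[a.data] = i;
            }
            if (a.intent != Intent::Write) readersSinceWrite[a.data].push_back(i);
        }
    }
    return edges;
}

// The three calls from the slide; the intents of cf2 and cf are assumed to be read-only.
std::vector<Call> exampleCalls() {
    return {
        {"cf1", {{"A", Intent::Read}, {"B", Intent::Write}}},
        {"cf2", {{"B", Intent::Read}}},
        {"cf", {{"A", Intent::Read}}},
    };
}
```

Under these assumptions the only dependency is cf1 -> cf2 (through B); cf merely reads A and could run concurrently with the other two calls.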

Other Features:

• Specification of optimization goals (time vs. power) and execution targets

• Data partitioning; array access patterns; parameter assertions

• Memory consistency control

Patterns

• e.g. pipeline pattern

#pragma pph pipeline
while (inputstream >> file) {
    readImage(file, image);
    #pragma pph stage replicate(N)
    {
        resizeAndColorConvert(image);
        detectFace(image, outImage);
    }
    ...
}
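What a replicated pipeline stage amounts to can be sketched in plain C++ threads. The sketch models only the effect of the replicate(N) clause (N workers applying the same stage to different items); it does not overlap different stages and is not the code the PEPPHER transformation tool generates. The function name and signature are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch: applies `stage` to every item using `replicas` worker threads.
// Worker w handles items w, w + replicas, w + 2 * replicas, ...; each result goes
// to its own output slot, so the input order is preserved without locking.
std::vector<int> runReplicatedStage(const std::vector<int>& items,
                                    unsigned replicas,
                                    const std::function<int(int)>& stage) {
    assert(replicas > 0);
    std::vector<int> out(items.size());
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < replicas; ++w) {
        workers.emplace_back([&items, &out, &stage, replicas, w] {
            for (std::size_t i = w; i < items.size(); i += replicas)
                out[i] = stage(items[i]);
        });
    }
    for (std::thread& t : workers) t.join();  // wait for all replicas to finish
    return out;
}
```

The stage function must be safe to call concurrently, which matches the component constraints of being stateless and free of side-effects.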


Source-to-Source Transformation

• based on ROSE

• generates C++ with calls to coordination layer and StarPU runtime

Coordination Layer

• Support for parallel patterns (pipelining)

• Submission of tasks to StarPU

Heterogeneous Runtime System

• Based on INRIA's StarPU

• Selection of implementation variants based on available hardware resources

• Data-aware & performance-aware task scheduling onto heterogeneous PUs

Transformation System

[Figure: PEPPHER component framework. An application with annotations, the PEPPHER component repository and a platform descriptor (PDL) feed the transformation tool; the result runs on the coordination layer and the task-based heterogeneous runtime, on top of hybrid hardware (SMP, GPU, MIC).]


Performance Results

OpenCV Face Detection

• 3425 images

• Image resolution: 640x480 (VGA)

• Different implementation variants for middle stages (CPU vs. GPU)

• Comparison to plain OpenCV version and hand-coded Intel TBB (pipeline) version

• Architecture: 2 Xeon X5550 (4 core), 2 NVIDIA C2050, 1 NVIDIA C1060


Major Results of PEPPHER Component Framework

• Multi-architectural, resource- & performance-aware components;

• PDL adopted by Open Community Runtime (OCR) – US XStack program

Transformation, Composition, Compilation

• Transformation Tool (U. Vienna)

• Composition Tool & SkePU (U. Linköping)

• Offload C++ compiler used by game industry (Codeplay)

Runtime System (U. Bordeaux)

• StarPU part of Linux (Debian) distribution and MAGMA library

Superior parallel algorithms and data structures (KIT, Chalmers)

PePU Experimental Hardware Platform & Simulator (Movidius)

• PeppherSIM used in industry


Backup Slides


Example: Cholesky factorization

FOR k = 0..TILES-1
    POTRF(A[k][k])
    FOR m = k+1..TILES-1
        TRSM(A[k][k], A[m][k])
    FOR n = k+1..TILES-1
        SYRK(A[n][k], A[n][n])
        FOR m = n+1..TILES-1
            GEMM(A[m][k], A[n][k], A[m][n])

Utilize expert-written components: BLAS kernels from MAGMA and PLASMA

Implementation variants:

• multi-core CPU (PLASMA)

• GPU (MAGMA)

PEPPHER component: interface, implementation variants + meta-data
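The tiled loop nest can be made concrete with naive reference kernels. The C++ sketch below is a sequential illustration only: the tile size B, the Tile/TiledMatrix types and the hand-written kernels are assumptions standing in for the PLASMA/MAGMA kernels, and no tasks or runtime are involved.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t B = 2;                         // tile edge length (assumption)
using Tile = std::vector<double>;                    // B*B values, row-major
using TiledMatrix = std::vector<std::vector<Tile>>;  // A[row tile][column tile]

// POTRF: in-place Cholesky of a diagonal tile; the result is lower triangular.
void potrf(Tile& a) {
    for (std::size_t j = 0; j < B; ++j) {
        for (std::size_t p = 0; p < j; ++p) a[j * B + j] -= a[j * B + p] * a[j * B + p];
        a[j * B + j] = std::sqrt(a[j * B + j]);
        for (std::size_t i = j + 1; i < B; ++i) {
            for (std::size_t p = 0; p < j; ++p) a[i * B + j] -= a[i * B + p] * a[j * B + p];
            a[i * B + j] /= a[j * B + j];
            a[j * B + i] = 0.0;  // clear the strictly upper part
        }
    }
}

// TRSM: b := b * inverse(transpose(l)), with l lower triangular.
void trsm(const Tile& l, Tile& b) {
    for (std::size_t i = 0; i < B; ++i)
        for (std::size_t j = 0; j < B; ++j) {
            for (std::size_t p = 0; p < j; ++p) b[i * B + j] -= b[i * B + p] * l[j * B + p];
            b[i * B + j] /= l[j * B + j];
        }
}

// GEMM: c := c - a * transpose(b).
void gemm(const Tile& a, const Tile& b, Tile& c) {
    for (std::size_t i = 0; i < B; ++i)
        for (std::size_t j = 0; j < B; ++j)
            for (std::size_t p = 0; p < B; ++p) c[i * B + j] -= a[i * B + p] * b[j * B + p];
}

// SYRK: c := c - a * transpose(a), the special case of GEMM with b == a.
void syrk(const Tile& a, Tile& c) { gemm(a, a, c); }

// Same loop nest as on the slide; only tiles on or below the diagonal are updated.
void tiledCholesky(TiledMatrix& A) {
    const std::size_t tiles = A.size();
    for (std::size_t k = 0; k < tiles; ++k) {
        potrf(A[k][k]);
        for (std::size_t m = k + 1; m < tiles; ++m) trsm(A[k][k], A[m][k]);
        for (std::size_t n = k + 1; n < tiles; ++n) {
            syrk(A[n][k], A[n][n]);
            for (std::size_t m = n + 1; m < tiles; ++m) gemm(A[m][k], A[n][k], A[m][n]);
        }
    }
}

// A made-up 4x4 symmetric positive definite matrix, stored as 2x2 tiles.
TiledMatrix exampleMatrix() {
    TiledMatrix A(2, std::vector<Tile>(2));
    A[0][0] = {4.0, 2.0, 2.0, 10.0};
    A[0][1] = {0.0, 2.0, 3.0, 1.0};
    A[1][0] = {0.0, 3.0, 2.0, 1.0};
    A[1][1] = {5.0, 2.0, 2.0, 3.0};
    return A;
}
```

Each kernel call corresponds to one task in the task-based view; in a component setting, every kernel could have a CPU and a GPU variant behind the same interface.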


PEPPHER Approach

Transformation/Composition

• Processing of user annotations/meta-data

• Generation of task-based representation (DAG)

• Static pre-selection of variants

Multi-level parallelism

• Coarse-grained inter-component parallelism

• Fine(r)-grained intra-component parallelism

• Exploit ALL execution units

[Figure: task graph of POTRF, TRSM, SYRK and GEMM tasks; GEMM is available as a CPU variant (CPU-GEMM) and a GPU variant (GPU-GEMM).]

Task-based Execution Model

• Runtime task variant selection & scheduling

• Data/topology-aware: minimize data transfer

• Performance-aware: minimize make-span, or other objective (power, …)


Component Meta-Data

Interface Meta-Data (XML)

• Parameter intent (read/write)

• Supported performance aspects (execution-time, power)

Implementation Variant Meta-Data (XML)

• Supported target platforms (PDL)

• Performance Model

• Input data constraints (if any)

• Tunable parameters (if any)

• Required components (if any)

Key issues

• Make platform-specific optimizations/dependencies explicit.

• Make components performance and resource aware.

• Support runtime variant selection.

• Support code transformation and auto-tuning.

XML Schema for Variant Meta-Data

XML Schema for Interface Meta-Data


Performance-Aware Components

Each component is associated with an abstract performance model.

Invocation Context: captures performance-relevant information of input data (problem size, data layout, etc.)

Resource Context: specifies main HW/SW characteristics (cores, memory, …)

Performance Descriptor: usually includes (relative) runtime, power estimates

Generic performance prediction function:

PerfDsc getPrediction(InvocationContextDsc icd, ResourceContextDsc rcd)

[Figure: the component performance model maps invocation context descriptors and a resource context descriptor (PDL) to a performance descriptor.]
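One way to picture the prediction function is a simple analytic model. In the C++ sketch below only the signature of getPrediction follows the slide; the fields of the three descriptor types and the cost formula (compute time plus transfer time, energy as time times power) are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical descriptor contents; the slide does not define their fields.
struct InvocationContextDsc {
    double flops;       // floating-point operations required for this input
    double bytesMoved;  // bytes to transfer to the executing unit
};

struct ResourceContextDsc {
    int cores;                // number of cores used
    double flopsPerCore;      // sustained flop/s per core
    double bandwidth;         // transfer bandwidth in bytes/s
    double wattsPerCore;      // power drawn per active core
};

struct PerfDsc {
    double seconds;  // predicted execution time
    double joules;   // predicted energy
};

// Assumed analytic model: compute time plus transfer time; energy = time * power.
PerfDsc getPrediction(InvocationContextDsc icd, ResourceContextDsc rcd) {
    assert(rcd.cores > 0 && rcd.flopsPerCore > 0.0 && rcd.bandwidth > 0.0);
    const double compute = icd.flops / (rcd.cores * rcd.flopsPerCore);
    const double transfer = icd.bytesMoved / rcd.bandwidth;
    const double seconds = compute + transfer;
    return PerfDsc{seconds, seconds * rcd.cores * rcd.wattsPerCore};
}
```

A real model might instead be fitted from the measured performance that is fed back at runtime; the point here is only the shape of the interface.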


Basic Coordination Language

Memory Consistency

• flush; for ensuring consistency between host and workers

#pragma pph call
cf1(A, N);
...
#pragma pph flush(A)    // block until A has become available
int first = A[0];       // explicit flush required since A is accessed

Component calls

• implicit memory consistency across workers

#pragma pph call
cf1(A, N);              // A: read/write
...                     // implicit memory consistency on workers only
...                     // no explicit flush is needed here provided A
...                     // is not accessed within the master process
#pragma pph call
cf2(A, N);              // A: read; actual values of A produced by cf1()
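The blocking behaviour of flush resembles waiting on a future. The C++ sketch below is an analogy using the standard library, not the code generated for the pragmas; the component body cf1 and the function firstElementAfterFlush are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Stand-in for a component that writes A (hypothetical body).
void cf1(std::vector<int>& A) {
    for (std::size_t i = 0; i < A.size(); ++i) A[i] = static_cast<int>(i) + 1;
}

int firstElementAfterFlush() {
    std::vector<int> A(4, 0);
    // Plays the role of "#pragma pph call": cf1 starts asynchronously.
    std::future<void> pending = std::async(std::launch::async, cf1, std::ref(A));
    // ... the master must not access A here ...
    pending.wait();  // plays the role of "#pragma pph flush(A)"
    return A[0];     // safe: cf1 has completed and its writes are visible
}
```

Reading A[0] before the wait would be a data race, which is the situation the explicit flush is meant to rule out.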


Parameter Assertions

• influence component variant selection

#pragma pph call parameter(size < 1000)
cf1(A, size);

Optimization Goals

• specify optimization goals to be taken into account by runtime scheduler

#pragma pph call optimize(TIME)
cf1(A, size);
...
#pragma pph call optimize(POWER < 100 && TIME < 10)
cf2(A, size);

Execution Target

• specify pre-defined target library (e.g., OPENCL) or processing unit group from PDL platform descriptor

#pragma pph call target(OPENCL)
cf(A, size);


Data Partitioning

• generate multiple component calls, one for each partition (cf. HPF)

#pragma pph call partition(A(size:BLOCK(size/2)))
cf1(A, size);

Access to Array Sections

• specify which array section is accessed in component call (cf. Fortran array sections)

#pragma pph call access(A(size:50:size-1))
cf(A+50, size-50);
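The effect of a block partition, one component call per block, can be sketched as follows. This C++ example is a hand-written illustration under assumed names (Block, blockPartition, callPartitioned) with a made-up component scaleByTwo; it runs the calls sequentially, whereas the framework would submit them as independent tasks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Block {
    std::size_t offset;
    std::size_t length;
};

// Splits the index range [0, size) into consecutive blocks of at most blockSize.
std::vector<Block> blockPartition(std::size_t size, std::size_t blockSize) {
    assert(blockSize > 0);
    std::vector<Block> blocks;
    for (std::size_t off = 0; off < size; off += blockSize)
        blocks.push_back(Block{off, std::min(blockSize, size - off)});
    return blocks;
}

// Made-up component operating on an array section.
void scaleByTwo(double* a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) a[i] *= 2.0;
}

// One component call per partition, here issued sequentially.
void callPartitioned(std::vector<double>& A, std::size_t blockSize) {
    for (const Block& b : blockPartition(A.size(), blockSize))
        scaleByTwo(A.data() + b.offset, b.length);
}
```

With blockSize = size/2 this yields two calls, each of which could be scheduled on a different processing unit.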


Performance Results

Leukocyte Tracking

• Adapted from Rodinia benchmark suite

• Different implementation variants for Motion Gradient Vector Flow (CPU vs. GPU)

• Comparison to OpenMP version

• Architecture: 2 Xeon X5550 (4 core), 2 NVIDIA C2050, 1 NVIDIA C1060

• 4 different configurations -> PDL descriptors

[Figure: bar chart of speedup (axis 0 to 14) for PEPPHER vs. Orig. (Rodinia); x-axis labels as extracted: SEQ, OMP, 8 CPU cores, 7 CPU 1 GPU, 6 CPU 2 GPU, 5 CPU 3 GPU.]


Future Work

EU AutoTune Project

• Autotuning of high-level patterns (pipeline replication factor, …)

• Tunable parameters specified in component descriptors

Work on Energy Efficiency

• Energy-aware components; Runtime scheduling for energy efficiency

User-specified optimization goals

• Tradeoff execution time vs. energy consumption; QoS support

Extension towards Clusters

• Combine with global MPI layer across nodes
