S. Benkner, University of Vienna HIPEAC CSW 2013, Paris, May 2, 2013
Performance Portability and Programmability
for Heterogeneous Many-core Architectures
(PEPPHER)
Siegfried Benkner
(on behalf of the PEPPHER Consortium)
Research Group Scientific Computing
Faculty of Computer Science
University of Vienna
Austria
EU Project PEPPHER
Performance Portability & Programmability for
Heterogeneous Manycore Architectures
• ICT FP7, Computing Systems; 3 years; finished Feb. 2013
• 9 Partners, Coordinated by University of Vienna
• http://www.peppher.eu
Goal: Enable portable, productive and efficient programming of single-node heterogeneous many-core systems.
Holistic Approach
• Component-Based High-Level Program Development
• Auto-tuned Algorithms & Data Structures
• Compilation Strategies
• Runtime Systems
• Hardware Mechanisms
Performance. Portability. Programmability.
[Figure: one application (C/C++) mapped to many-core CPU, CPU+GPU, PePU (Movidius), PEPPHERSim, and Intel Xeon Phi]
Focus: Single-node/chip heterogeneous architectures
Approach
• Multi-architectural, performance-aware components: multiple implementation variants of functions, each with a performance model
• Task-based execution model & intelligent runtime system: runtime selection of the best task implementation variant for the given platform
Methodology & framework for development of performance portable code.
• Execute same application efficiently on different heterogeneous architectures.
• Support multiple parallel APIs: OpenMP, OpenCL, CUDA, TBB, ...
PEPPHER Framework
PEPPHER Approach
[Figure: a mainstream programmer writes a component-based application with annotations (components C1, C2, ...); expert programmers (or a compiler/autotuner) provide component implementation variants for different platforms, algorithms, inputs, ...; measured performance is fed back from the target platforms]
Programmer
• Annotate calls to performance-critical functions (= components)
• Provide implementation variants for different platforms (PDL)
• Provide component meta-data
Platform Descriptors (PDL)
PEPPHER Approach
PEPPHER framework
• Management of components and implementation variants
• Transformation / Composition
• Implementation variant selection
• Dynamic, performance-aware task scheduling (StarPU runtime)
[Figure labels: Transformation/Composition; intermediate task-based representation; runtime system with heterogeneous task scheduler; dynamic selection of the "best" implementation variant]
PEPPHER Framework
[Figure: layers of the PEPPHER framework]
• Applications: Embedded, General Purpose, HPC
• High-Level Coordination/Patterns/Skeletons: asynchronous calls, data distribution patterns, SkePU skeletons (C/C++ source code with annotated component calls)
• Components: C/C++, OpenMP, CUDA, OpenCL, TBB, Offload (component implementation variants for different core architectures, algorithms, ...)
• Autotuned Data Structures & Algorithms
• Transformation Tool, Composition Tool (component glue code; static variant selection, if any)
• PEPPHER Task graph (component task graph with explicit data dependencies)
• PEPPHER Run-time (StarPU): scheduling strategy, performance models (performance-aware, data-aware dynamic scheduling of "best" component variants onto free execution units)
• Drivers (CUDA, OpenCL, OpenMP)
• Single-node heterogeneous manycore hardware: CPU, GPU, Xeon Phi, SIM, PePU
SIM = PEPPHER simulator; PePU = PEPPHER proc. unit (Movidius)
PEPPHER Components
Component Interface
• Specification of functionality
• Used by mainstream programmers
Implementation Variants
• Different architectures/platforms
• Different algorithms/data structures
• Different input characteristics
• Different performance goals
• Written by expert programmers (or generated, e.g. by auto-tuning, cf. EU Autotune Project)
[Figure: «interface» C declares f(param-list) and carries interface meta-data; the component implementation variants «variant» C1 … Cn each define f(param-list){…} and carry variant meta-data]
Features
• Different programming languages (C/C++, OpenCL, CUDA, OpenMP)
• Task & data parallelism
Constraints
• No side-effects; non-preemptive
• Stateless; composition on CPU only
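As a rough illustration of the interface/variant split described on this slide, the following C++ sketch models one component interface with several implementation variants and a lookup by platform. All names (`Variant`, `Component`, `select_for`, the platform tags) are hypothetical and are not part of the PEPPHER tooling, which keeps this information in meta-data files; the sketch only shows the idea.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical illustration only: one component interface, many variants.
// The interface f(param-list) is fixed here as "scale a vector in place".
using ScaleFn = std::function<void(std::vector<float>&, float)>;

struct Variant {
    std::string name;      // e.g. "scale_seq"
    std::string platform;  // assumed platform tag, e.g. "cpu"
    ScaleFn impl;          // stateless; touches only its parameters
};

struct Component {
    std::string interface_name;
    std::vector<Variant> variants;

    // Return the first variant registered for the requested platform, if any.
    std::optional<Variant> select_for(const std::string& platform) const {
        for (const Variant& v : variants) {
            if (v.platform == platform) return v;
        }
        return std::nullopt;
    }
};

// Two variants that compute the same result in different ways.
inline Component make_scale_component() {
    Component c{"scale", {}};
    c.variants.push_back({"scale_seq", "cpu",
        [](std::vector<float>& x, float a) { for (float& e : x) e *= a; }});
    c.variants.push_back({"scale_unrolled", "cpu_unrolled",
        [](std::vector<float>& x, float a) {
            std::size_t i = 0;
            for (; i + 1 < x.size(); i += 2) { x[i] *= a; x[i + 1] *= a; }
            if (i < x.size()) x[i] *= a;  // odd-length tail
        }});
    return c;
}
```

A caller would ask the component for a variant matching the current platform and invoke `impl` on it; a request for a platform without a registered variant yields an empty result.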
Platform Description Language (PDL)
Goal: Make platform-specific information explicit for tools and users.
Processing Units (PUs)
• Master (initiates program execution)
• Worker (executes delegated tasks)
• Hybrid (master & worker)
Memory Regions
• Express key characteristics of memory hierarchy
• Can be defined for all processing units
Interconnects
• Describe communication facilities between PUs
Hardware and Software Properties
• e.g., core-count, memory sizes, available libraries
Data movement
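The PDL concepts listed on this slide can be pictured as a small data model. The C++ sketch below is an assumption made for illustration: the type and field names are invented, and the actual PDL syntax is not reproduced here. It only captures processing units with a master/worker/hybrid role, memory regions, interconnects, and free-form properties.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory model of a platform description.
enum class PuRole { Master, Worker, Hybrid };

struct ProcessingUnit {
    std::string id;
    PuRole role;
    std::map<std::string, std::string> properties;  // e.g. core-count
};

struct MemoryRegion {
    std::string id;
    std::string owner_pu;    // processing unit the region is attached to
    std::size_t size_bytes;
};

struct Interconnect {
    std::string from_pu;
    std::string to_pu;
};

struct Platform {
    std::vector<ProcessingUnit> pus;
    std::vector<MemoryRegion> regions;
    std::vector<Interconnect> links;

    // Units that can execute delegated tasks: workers and hybrids.
    std::size_t count_task_executors() const {
        std::size_t n = 0;
        for (const ProcessingUnit& pu : pus) {
            if (pu.role == PuRole::Worker || pu.role == PuRole::Hybrid) ++n;
        }
        return n;
    }
};

// Example platform: one master CPU delegating to two GPU workers.
inline Platform make_example_platform() {
    Platform p;
    p.pus.push_back({"cpu0", PuRole::Master, {{"core-count", "8"}}});
    p.pus.push_back({"gpu0", PuRole::Worker, {}});
    p.pus.push_back({"gpu1", PuRole::Worker, {}});
    p.regions.push_back({"host-mem", "cpu0", std::size_t{1} << 30});
    p.links.push_back({"cpu0", "gpu0"});
    p.links.push_back({"cpu0", "gpu1"});
    return p;
}
```

A tool could query such a model, for instance to count the units able to run delegated tasks before deciding which implementation variants are worth considering.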
PEPPHER Coordination Language
Component calls
• asynchronous & synchronous
#pragma pph call        // read A, write B -> meta data
cf1(A, N, B, M);
#pragma pph call
cf2(B, M);
#pragma pph call sync
cf(A, N);
Patterns
• e.g. pipeline pattern
#pragma pph pipeline
while (inputstream >> file) {
  readImage(file, image);
  #pragma pph stage replicate(N)
  {
    resizeAndColorConvert(image);
    detectFace(image, outImage);
  }
  ...
}
Other Features:
• Specification of optimization goals (time vs. power) and execution targets
• Data partitioning; array access patterns; parameter assertions
• Memory consistency control
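The comment on the first call ("read A, write B -> meta data") suggests that dependencies between asynchronous component calls follow from the read/write intent of their parameters. The C++ sketch below is a guess at how such edges could be derived, not a description of the PEPPHER transformation tool: the `Call` record and `raw_dependencies` function are hypothetical, and only read-after-write dependencies are tracked (write-after-read and write-after-write are ignored for brevity).

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of an annotated component call: which data objects it
// reads and which it writes (the parameter intent from the meta-data).
struct Call {
    std::string name;
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

// Derive read-after-write edges (producer index, consumer index) by
// remembering the most recent writer of each data object.
inline std::vector<std::pair<std::size_t, std::size_t>>
raw_dependencies(const std::vector<Call>& calls) {
    std::map<std::string, std::size_t> last_writer;
    std::vector<std::pair<std::size_t, std::size_t>> edges;
    for (std::size_t i = 0; i < calls.size(); ++i) {
        for (const std::string& d : calls[i].reads) {
            auto it = last_writer.find(d);
            if (it != last_writer.end()) edges.emplace_back(it->second, i);
        }
        for (const std::string& d : calls[i].writes) last_writer[d] = i;
    }
    return edges;
}
```

For the three calls on the slide, with cf1 reading A and writing B, and cf2 and cf assumed to only read their array argument, the only edge runs from cf1 to cf2, so cf could be scheduled independently of the other two.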
Transformation System
Source-to-Source Transformation
• Based on ROSE
• Generates C++ with calls to coordination layer and StarPU runtime
Coordination Layer
• Support for parallel patterns (pipelining)
• Submission of tasks to StarPU
Heterogeneous Runtime System
• Based on INRIA’s StarPU
• Selection of implementation variants based on available hardware resources
• Data-aware & performance-aware task scheduling onto heterogeneous PUs
[Figure labels: application with annotations; transformation tool; PEPPHER component repository; platform descriptor (PDL); PEPPHER component framework; coordination layer; task-based heterogeneous runtime; hybrid hardware (SMP, GPU, MIC)]
Performance Results
OpenCV Face Detection
• 3425 images
• Image resolution: 640x480 (VGA)
• Different implementation variants for middle stages (CPU vs. GPU)
• Comparison to plain OpenCV version and hand-coded Intel TBB (pipeline) version
• Architecture: 2 Xeon X5550 (4 core), 2 NVIDIA C2050, 1 NVIDIA C1060
Major Results of PEPPHER
Component Framework
• Multi-architectural, resource- & performance-aware components
• PDL adopted by Open Community Runtime (OCR), US XStack program
Transformation, Composition, Compilation
• Transformation Tool (U. Vienna)
• Composition Tool & SkePU (U. Linköping)
• Offload C++ compiler used by game industry (Codeplay)
Runtime System (U. Bordeaux)
• StarPU part of Linux (Debian) distribution and MAGMA library
Superior parallel algorithms and data structures (KIT, Chalmers)
PePU Experimental Hardware Platform & Simulator (Movidius)
• PeppherSIM used in industry
Backup Slides
Example: Cholesky factorization
FOR k = 0..TILES-1
  POTRF(A[k][k])
  FOR m = k+1..TILES-1
    TRSM(A[k][k], A[m][k])
  FOR n = k+1..TILES-1
    SYRK(A[n][k], A[n][n])
    FOR m = n+1..TILES-1
      GEMM(A[m][k], A[n][k], A[m][n])
Utilize expert-written components: BLAS kernels from MAGMA and PLASMA
Implementation variants:
• multi-core CPU (PLASMA)
• GPU (MAGMA)
PEPPHER component: interface, implementation variants + meta-data
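To make the task structure of the tiled Cholesky loop nest concrete, the following C++ sketch enumerates the kernel invocations in program order. It assumes the usual nesting of the tiled algorithm (TRSM loop and SYRK loop inside the k loop, GEMM loop inside the SYRK loop); the `Task` record and function names are hypothetical, and no numerical work is done.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One task of the tiled Cholesky loop nest: kernel name plus the tile
// coordinates of the block it updates (the last argument of each call).
struct Task {
    std::string kernel;
    std::size_t row;
    std::size_t col;
};

// Enumerate tasks in the program order of the loop nest.
inline std::vector<Task> cholesky_tasks(std::size_t tiles) {
    std::vector<Task> tasks;
    for (std::size_t k = 0; k < tiles; ++k) {
        tasks.push_back({"POTRF", k, k});
        for (std::size_t m = k + 1; m < tiles; ++m)
            tasks.push_back({"TRSM", m, k});
        for (std::size_t n = k + 1; n < tiles; ++n) {
            tasks.push_back({"SYRK", n, n});
            for (std::size_t m = n + 1; m < tiles; ++m)
                tasks.push_back({"GEMM", m, n});
        }
    }
    return tasks;
}

// Count how many tasks use a given kernel.
inline std::size_t count_kernel(const std::vector<Task>& tasks,
                                const std::string& kernel) {
    std::size_t n = 0;
    for (const Task& t : tasks) {
        if (t.kernel == kernel) ++n;
    }
    return n;
}
```

With 4 tiles per dimension this yields 4 POTRF, 6 TRSM, 6 SYRK and 4 GEMM tasks, 20 in total; each of them is a candidate for CPU or GPU execution when both variants are available.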
PEPPHER Approach
Transformation/Composition
• Processing of user annotations/meta data
• Generation of task-based representation (DAG)
• Static pre-selection of variants
Multi-level parallelism
• Coarse-grained inter-component parallelism
• Fine(r) grained intra-component parallelism
• Exploit ALL execution units
Task-based Execution Model
• Runtime task variant selection & scheduling
• Data/topology-aware: minimize data transfer
• Performance-aware: minimize make-span, or other objective (power, …)
[Figure: task graph with POTRF, TRSM, SYRK and GEMM tasks; GEMM is shown with CPU-GEMM and GPU-GEMM variants]
Component Meta-Data
Interface Meta-Data (XML)
• Parameter intent (read/write)
• Supported performance aspects (execution-time, power)
Implementation Variant Meta-Data (XML)
• Supported target platforms (PDL)
• Performance model
• Input data constraints (if any)
• Tunable parameters (if any)
• Required components (if any)
Key issues
• Make platform-specific optimizations/dependencies explicit.
• Make components performance- and resource-aware.
• Support runtime variant selection.
• Support code transformation and auto-tuning.
[Figures: XML schema for variant meta-data; XML schema for interface meta-data]
Performance-Aware Components
Each component is associated with an abstract performance model.
Invocation Context: captures performance-relevant information of input data
(problem size, data layout, etc.)
Resource Context: specifies main HW/SW characteristics (cores, memory, …)
Performance Descriptor: usually includes (relative) runtime, power estimates
Generic performance prediction function:
PerfDsc getPrediction(InvocationContextDsc icd, ResourceContextDsc rcd)
[Figure: the component performance model takes invocation context descriptors and resource context descriptors (PDL) and produces a performance descriptor]
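One way to read the prediction function is as the basis for variant selection: query each variant's model and pick the cheapest one that can run on the given resources. The C++ sketch below follows that reading under loud assumptions: the slide only names the three descriptor types, so their fields, the `VariantModel` record, `select_variant`, and the cost constants in `toy_variants` are all invented for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Hypothetical, simplified descriptors; only the type names come from the slide.
struct InvocationContextDsc { std::size_t problem_size; };
struct ResourceContextDsc { unsigned cores; bool has_gpu; };
struct PerfDsc { double runtime; };  // relative runtime estimate

struct VariantModel {
    std::string name;
    bool needs_gpu;
    std::function<PerfDsc(const InvocationContextDsc&,
                          const ResourceContextDsc&)> getPrediction;
};

// Pick the usable variant with the smallest predicted runtime.
inline std::string select_variant(const std::vector<VariantModel>& variants,
                                  const InvocationContextDsc& icd,
                                  const ResourceContextDsc& rcd) {
    std::string best;
    double best_time = std::numeric_limits<double>::infinity();
    for (const VariantModel& v : variants) {
        if (v.needs_gpu && !rcd.has_gpu) continue;  // not runnable here
        const double t = v.getPrediction(icd, rcd).runtime;
        if (t < best_time) { best_time = t; best = v.name; }
    }
    return best;  // empty if no variant is usable
}

// Toy models with made-up constants: the GPU variant pays a fixed overhead
// but has a lower per-element cost than the CPU variant.
inline std::vector<VariantModel> toy_variants() {
    return {
        {"cpu", false,
         [](const InvocationContextDsc& i, const ResourceContextDsc& r) {
             return PerfDsc{static_cast<double>(i.problem_size) / r.cores};
         }},
        {"gpu", true,
         [](const InvocationContextDsc& i, const ResourceContextDsc&) {
             return PerfDsc{1000.0 + static_cast<double>(i.problem_size) / 100.0};
         }},
    };
}
```

With these toy models, small inputs stay on the CPU, large inputs move to the GPU when one is present, and the CPU variant is the fallback otherwise.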
Basic Coordination Language
Memory Consistency
• flush: ensures consistency between host and workers
#pragma pph call
cf1(A, N);
...
#pragma pph flush(A)   // block until A has become available
int first = A[0];      // explicit flush required since A is accessed
Component calls
• implicit memory consistency across workers
#pragma pph call
cf1(A, N);             // A: read / write
...                    // implicit memory consistency on workers only
...                    // no explicit flush is needed here provided A
...                    // is not accessed within the master process
#pragma pph call
cf2(A, N);             // A: read; actual values of A produced by cf1()
Basic Coordination Language
Parameter Assertions
• influence component variant selection
#pragma pph call parameter(size < 1000)
cf1(A, size);
Optimization Goals
• specify optimization goals to be taken into account by runtime scheduler
#pragma pph call optimize(TIME)
cf1(A, size);
...
#pragma pph call optimize(POWER < 100 && TIME < 10)
cf2(A, size);
Execution Target
• specify pre-defined target library (e.g., OPENCL) or processing unit group from PDL platform descriptor
#pragma pph call target(OPENCL)
cf(A, size);
Basic Coordination Language
Data Partitioning
• generate multiple component calls, one for each partition (cf. HPF)
#pragma pph call partition(A(size:BLOCK(size/2)))
cf1(A, size);
Access to Array Sections
• specify which array section is accessed in component call (cf. Fortran array sections)
#pragma pph call access(A(size:50:size-1))
cf(A+50, size-50);
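The effect of the partition clause, one component call per block, can be sketched in plain C++. This is an assumption about the intended semantics rather than the code the transformation tool generates: `block_partition`, `call_partitioned` and the body of `cf1` are hypothetical, and the calls run sequentially here, whereas a runtime could treat them as independent tasks.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical component: increments every element of a contiguous range.
inline void cf1(int* A, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) A[i] += 1;
}

struct Block {
    std::size_t offset;
    std::size_t length;
};

// Split [0, size) into consecutive blocks of at most block_size elements.
inline std::vector<Block> block_partition(std::size_t size,
                                          std::size_t block_size) {
    std::vector<Block> blocks;
    if (block_size == 0) return blocks;  // nothing sensible to generate
    for (std::size_t off = 0; off < size; off += block_size)
        blocks.push_back({off, std::min(block_size, size - off)});
    return blocks;
}

// One component call per partition; returns the number of calls made.
inline std::size_t call_partitioned(int* A, std::size_t size,
                                    std::size_t block_size) {
    const std::vector<Block> blocks = block_partition(size, block_size);
    for (const Block& b : blocks) cf1(A + b.offset, b.length);
    return blocks.size();
}
```

With a block size of size/2 and an even size, this produces two calls, each covering half of A. Under this sketch's rounding, an odd size leaves a short third block.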
Performance Results
Leukocyte Tracking
• Adapted from Rodinia benchmark suite
• Different implementation variants for Motion Gradient Vector Flow (CPU vs. GPU)
• Comparison to OpenMP version
• Architecture:
• 2 Xeon X5550 (4 core)
• 2 NVIDIA C2050
• 1 NVIDIA C1060
• 4 different configurations -> PDL descriptors
[Figure: bar chart of speedup (axis 0 to 14) for SEQ, OMP, 8 CPU cores, 7 CPU + 1 GPU, 6 CPU + 2 GPU, 5 CPU + 3 GPU; series: PEPPHER and Orig. (Rodinia)]
Future Work
EU AutoTune Project
• Autotuning of high-level patterns (pipeline replication factor, …)
• Tunable parameters specified in component descriptors
Work on Energy Efficiency
• Energy-aware components; Runtime scheduling for energy efficiency
User-specified optimization goals
• Tradeoff execution time vs. energy consumption; QoS support
Extension towards Clusters
• Combine with global MPI layer across nodes