
HARPA Project ref. FP7-612069 Call ref. FP7-ICT-2013-10 Activity ICT-10-3.4

Best Practice: Use HARPA within HPC applications
Antoni Portero (IT4I), Radim Vavrik (IT4I), Stepan Kuchar (IT4I), Vit Vondrak (IT4I)
Report #: D5.4
Version: V1.0
Date: 17-Mar-2016


Revisions List

Date      Version  Author(s)            Description of the changes
04/11/15  0.1      A. Portero (IT4I)    Template creation
21/12/15  0.2      A. Portero (IT4I)    All sections, first draft
02/02/16  0.3      R. Vavrik (IT4I)     Improvements in sections 4.1.1, 4.1.2 and 5.3
25/02/16  0.4      S. Kuchar (IT4I)     Grammar and comments
26/02/16  0.5      A. Portero (IT4I)    Commits of the comments
26/02/16  0.6      F. Sassi (HENESIS)   Review
28/02/16  0.7      A. Portero (IT4I)    Integration of review comments
02/03/16  0.8      A. Portero (IT4I)    Integration of R. Vavrik, change of section 5.2 proposed by G. Massari
15/03/16  0.9      A. Portero (IT4I)    Updated section 5.1
17/03/16  1.0      A. Portero (IT4I)    Integration of G. Massari comments in section 5.2 and final reading


Table of contents
1 Executive Summary
2 Introduction
3 Analysis of application infrastructure for HARPA use
  3.1 Best Practice: Programming Models for parallelization
    3.1.1 OpenMP
    3.1.2 MPI
    3.1.3 OpenACC
    3.1.4 OpenCL
    3.1.5 Other programming models
  3.2 Best Practice: Profiling Toolset
    3.2.1 Profiling of the Math1D model: Runoff model hot-spot
    3.2.2 Part of the code to be accelerated
  3.3 Intel x86-64 HPC node (Harpa-s1)
4 Customization of application infrastructure for HARPA environment
  4.1 Integration in HARPA-OS
  4.2 Preparation of hardware and software development/deployment environment
    4.2.1 OpenMP thread integration in our solution
    4.2.2 MPI library integration in our solution
5 Configuration and finalizing of the HPC codes running within HARPA environment
  5.1 Steps to enable HARPA-OS with the heterogeneous example
  5.2 Extensions to HARPA-OS: Resource-Aware Adaptivity
  5.3 HPC code running with HARPA-OS and MPI library
  5.4 HPC code running with HARPA-OS in accelerator with OpenCL
    5.4.1 Problem Decomposition
    5.4.2 NVIDIA GPU Example
    5.4.3 Possible Future Scenarios
6 Annex A: Convolution of Runoff model in OpenCL
    6.1.1 Headers files
    6.1.2 Errors
    6.1.3 OpenCL Contexts
    6.1.4 OpenCL Devices
    6.1.5 Building and running
  6.2 gDEBugger tool to debug OpenCL binaries
    6.2.1 Debugging Runoff model with gDEBugger
    6.2.2 Graphic Memory Analysis viewer
7 Appendix B: Knobs and Monitors: Power and Temperature monitoring (tools)
    7.1.1 Power and Energy monitoring in x86-64 cores
    7.1.2 Temperature in x86-64 cores
    7.1.3 Power and Temperature monitoring in NVIDIA GPU


List of Figures
Figure 1: Hot-spot: shows the most expensive call path based on sample count. The path is math1d_cl::MatData::scsMethod
Figure 2: Code view with percentage of the time spent in the intensive instructions
Figure 3: Abstraction view of the HARPA-s1 machine
Figure 4: Graph with HARPA-OS methods and their relation with MPI for two nodes
Figure 5: Up: goal gap computation and number of resources allocated under YAMS management. Down: goal gap and number of resources in unmanaged mode
Figure 6: Power consumed in diverse parts of the system when YAMS is executed versus unmanaged mode. Up-left: power consumed in the DRAM, socket 0; up-right: power consumed in the DRAM, socket 1. Down-left: power consumed in package 0, socket 0; down-right: power consumed in package 1
Figure 7: Up: temperature per core for sockets 0 and 1, monitored with psensor. Down: similar results for managed mode
Figure 8: Up: difference in temperature, YAMS minus unmanaged mode. Down: mean temperature per CPU0 (c0..7) and CPU1 (c8..15)
Figure 9: Schema of two nodes connected with a high-speed cable
Figure 10: HARPA-OS running with MPI communication. The complete video of the demonstration
Figure 12: Schema of the server, right, with a schema of the NVIDIA 970 GTX GPU
Figure 13: Schema of how to call the model for running it on the accelerator
Figure 14: Visualization of the OpenCL kernel of the Runoff model with gDEBugger
Figure 16: Source code viewer. Displays the OpenCL line that is being debugged
Figure 17: Shaders and kernels source code editor
Figure 18: Textures, buffers and image viewer. In our example it shows the correct output results from the accelerator


1 Executive Summary
This document presents the deliverable on HARPA best practices for HPC systems. It describes an HPC environment that makes it possible to run a disaster management application effectively and with high availability. The main effect consists in trading an accepted loss of precision for lower energy consumption during standard operation, while increasing the number of allocated resources to provide higher precision levels and lower execution times during emergency operation.

The uncertainty modelling module of the disaster management application was first executed on one node of the HPC cluster, and the dependency of precision on the number of Monte-Carlo samples was measured to provide exact information about attainable precision levels. This precision is then taken into account when defining the Application Working Modes that balance the trade-offs between precision, execution time and consumed energy for different application scenarios. Additionally, the Run-Time Resource Manager is deployed on the HPC cluster to monitor the current workload of the platform and select the most appropriate working mode to reconfigure the application at run-time. The proposed process can be used to prepare an operational environment with high availability while saving energy, at the cost of lower precision, during non-critical situations. The current work focuses on extending the implementation of the application:

• to support accelerators (Phase 3 defined in D.5.3);
• to improve the support of multi-node communication (Phase 4 defined in D.5.4);
• to improve monitoring.

This will enable the creation of additional working modes with accelerators for high-performance computations, leading to even shorter execution times. Deliverable D.5.4 is composed of two parts, and this document delivers the first part (M30) about the best practices to run HPC applications with HARPA-OS. The document explains the message passing interface development for running the application on multiple nodes and the OpenCL extension to offload workload to accelerators with an execution time vs. power consumption trade-off. These two features extend the AWM possibilities.

During the development of the project, we have used different programming models, each chosen for specific reasons. The following sections (mainly section 3.1) highlight these reasons and explain why, according to our research, these are the best candidates. Section 4 focuses on the customization of the application infrastructure for the HARPA environment. Section 5 explains the configuration and finalizing of the HPC codes running within the HARPA environment. At the end of the document, Annex A describes how to create an OpenCL kernel from scratch and how to debug it and run it on the accelerator. Annex B explains the monitors installed in the x86-64 server.


2 Introduction
The deliverable "HARPA: Best practices within HPC applications" describes the experiences with the installation of the HARPA environment into the HPC application infrastructure. This document also describes the metrics and tools assessment procedure. According to task T5.3, named "Installation of HARPA into applications infrastructures", we have installed HARPA OS in an HPC environment and integrated it into application codes. The integration of the HARPA environment into these infrastructures and application codes has been realized in the following steps:

1. Analysis of application infrastructure for HARPA use.
2. Customization of application infrastructure for the HARPA environment: the first phase of the customization is described in D.5.2, while extensions are presented here, in particular the multi-node capability and the opportunity to use accelerators.
3. Installation of the HARPA OS layer into the application infrastructure: the installation process is also presented in D.5.2, while this document describes extensions of our code using the MPI library and the OpenCL framework. It also describes extensions to the runtime in order to create the final HARPA-OS runtime for HPC systems.
4. Configuration and finalizing of the HPC codes running within the HARPA environment. Deliverable D.5.2 describes how HARPA-OS runs on x86-64 cores. A more complex environment is being developed with AWMs extended through new capabilities, such as multi-node and accelerators, together with an improved resource allocator in the HARPA-OS runtime.

The document also describes the preliminary evaluation of the HARPA validation (T.5.4). The evaluation of the techniques proposed in the project was performed on research applications provided by IT4I, running on a full system simulating a complete HPC scenario (i.e. multi-core, multi-node and accelerators). The evaluation of the case studies will demonstrate the suitability of the techniques and methods developed in the project for the HPC domain. The deliverable presents best practices to implement HARPA-OS with the chosen programming models for HPC, with the profiling tools, and with monitors and knobs.


3 Analysis of application infrastructure for HARPA use

This section presents the programming models most convenient for the parallel applications developed by IT4I. During the development of the model, we have extended the example to multiple nodes through MPI communications that work jointly with the HARPA runtime. Section 4 of this document describes how the parallelization is performed at the process level. OpenMP is a set of directives following a shared-memory programming model, selected for creating threads at the node level. The programming model for accelerators is OpenCL; further details are in Annex A, which explains how we created our example from C++ and instrumented it to create the OpenCL version. The main reasons why these models were chosen are exposed in this section. Subsection 6.2 in Annex A explains the debugging procedure: the tool used for debugging and how to run the code on the accelerator.

The profiling features are needed to extract information (basically hot-spots) from the code. The profiling of the code is done with the Microsoft Visual Studio tool. We observed and describe in subsection 3.2.1 that the Runoff model (scsMethod of the rainfall-runoff model) is the most computationally demanding model, with more than 60% of the system time used; thus, it has been selected as the part to be accelerated. This part of the code has been rewritten with OpenCL primitives, a standard coding technique for heterogeneous systems. In our case, we have focused on NVIDIA GPUs, but the work can be extended to Xeon Phi and FPGAs. The following subsection presents the hardware HPC infrastructure where the scenario development is being set up.


3.1 Best Practice: Programming Models for parallelization
Several programming models have been added to the HPC prototype to make it able to run in the HPC cluster. From the set of possible models, OpenMP and OpenCL are preferred to parallelize the application at the node level. In addition, the MPI library is used to run the model on a multi-node system. A description of the possibilities and the justification of the selection follow in the next subsections.

3.1.1 OpenMP
OpenMP (Open Multi-Processing) is a multi-platform application programming interface (API) used for the development of parallel applications in C, C++, and Fortran. It makes it possible to create parallel tasks in an application and distribute computations, typically very extensive loops, among an arbitrary number of threads running concurrently. This usually improves application performance and significantly shortens execution time on multi-core processors. Threads use a shared-memory architecture in order to share their working data. An advantage is the lower runtime overhead of threads compared to regular processes. Therefore, OpenMP is used for single-node parallelization. A minimal example of this loop-level parallelism is sketched below.
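To illustrate the loop-level parallelism described above, the following minimal sketch (not project code; the array name and size are arbitrary) distributes the iterations of a loop among the available threads and combines the per-thread partial sums with a reduction:

#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<double> data(n, 1.0);
    double sum = 0.0;

    // Distribute the loop iterations among all available threads;
    // the reduction clause combines the per-thread partial sums.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += data[i] * 2.0;

    std::printf("threads=%d sum=%f\n", omp_get_max_threads(), sum);
    return 0;
}

Compiled with an OpenMP-enabled compiler (e.g. g++ -fopenmp), the loop runs on as many threads as cores are available unless restricted otherwise.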

3.1.2 MPI
MPI (Message Passing Interface) is a protocol intended for the development of parallel applications on a computational cluster. Since a cluster consists of many independent computers (computational nodes) interconnected by a computer network, an application is distributed among the nodes using many processes. Each process handles a specific part of the program or data. For communication, data transfers and synchronization purposes, processes are able to send and receive individual or collective MPI messages through the network. There are several commonly used implementations of MPI, e.g. OpenMPI, MPICH and Intel MPI, available for C, C++ and Fortran, as well as several less widespread bindings for Python, Java, Matlab, R and other programming languages. MPI used together with OpenMP creates a hybrid model of parallel programming that allows computational cluster resources to be fully utilized at the single-node and many-node levels. A minimal point-to-point example is sketched below.
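The following minimal sketch (not project code) shows the point-to-point messaging pattern mentioned above: rank 0 sends one integer to rank 1. It assumes an MPI implementation such as OpenMPI or MPICH and at least two processes, e.g. launched with mpirun -np 2.

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload = 42;   // arbitrary example value
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d from rank 0\n", payload);
    }

    MPI_Finalize();
    return 0;
}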

3.1.3 OpenACC
OpenACC (for Open Accelerators) is a programming standard for parallel computing developed by Cray, CAPS, Nvidia and PGI. The standard is designed to simplify parallel programming of heterogeneous CPU/GPU systems. OpenACC defines an extensive list of pragmas (directives) similar to OpenMP.


The main drawback is that OpenACC has a high-level interface and relies heavily on the specific implementation provided by the compiler vendor, although OpenACC is a good candidate as a programming model for future developments. The recent OpenACC standard defines a high-level interface for many-core accelerators and helps users gain early experience with directive-based interfaces. To the best of our knowledge at the time of writing, the latest OpenACC standard, version 2.0, has commercial implementations from PGI and Cray for NVIDIA GPU architectures. A sketch of the directive style is given below.
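The sketch below (not project code; array names and sizes are arbitrary) illustrates the directive style: a single OpenACC pragma asks the compiler to offload a loop to the accelerator and to generate the data transfers requested by the copy clauses. A compiler with OpenACC support (e.g. PGI) would be needed to build it.

#include <cstdio>

#define N 1000000

static float a[N], b[N], c[N];

int main() {
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Offload the loop to the accelerator; copyin/copyout describe the data movement.
    #pragma acc parallel loop copyin(a, b) copyout(c)
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

    std::printf("c[0]=%f\n", c[0]);
    return 0;
}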

3.1.4 OpenCL
Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors. OpenCL specifies a language (based on C99) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides parallel computing using task-based and data-based parallelism. OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group. It is important to remember that OpenCL is considered a low-level programming interface for recent many-core and accelerator architectures, which can be used as a user-level programming interface or as an intermediate-level interface serving as a compiler-transformation target for high-level interfaces. A sketch of the host-side setup is given below.
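As a hint of what the host-side API looks like, the sketch below (not project code; error handling omitted, and the first platform and GPU device are picked arbitrarily) obtains a platform and a device, creates a context and a command queue, and would then build the program and enqueue kernels as described in Annex A.

#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err = CL_SUCCESS;

    // Pick the first available platform and the first GPU device on it.
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    // A context groups devices; a command queue feeds work to one device.
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    // ... create buffers, build the program, enqueue kernels (see Annex A) ...

    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    std::printf("OpenCL setup %s\n", err == CL_SUCCESS ? "succeeded" : "failed");
    return 0;
}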

3.1.5 Other programming models
A study of current programming models for accelerators is presented in HPCWire [HPCWIRE15]. Node-level parallel models range from threading primitives such as pthreads and C++11 threads for CPUs/SMPs, and low-level models for many-core accelerators such as the proprietary CUDA from NVIDIA and the open standard OpenCL, to high-level models including directive-based programming models such as OpenMP and OpenACC. Other options include Microsoft Visual C++ parallel programming on Windows platforms, specifically tailored for C++, as well as Cilk Plus, TBB and vector primitives. The choice of parallel model for a particular application and/or hardware architecture depends on the programmability and portability of the model as well as the performance delivered to users by the implementations. For example, GPGPU accelerator support in high-level programming interfaces is now available in both OpenACC and OpenMP, with OpenACC having been developed earlier and having more existing compiler support. This feature is urgently needed also in parallel node-level programming. However, a wide variety of users still use the proprietary CUDA model despite its productivity challenges, because it currently delivers higher performance than the high-level programming models on the supported architectures.


In conclusion, the existence of multiple programming models, each having its own unique set of features that serve the specific needs of users and applications, and each offering a different trade-off between productivity and performance, is still necessary.

3.1.5.1 Advantages and disadvantages of OpenCL vs. CUDA

The main advantage of OpenCL versus CUDA is that OpenCL is a multi-platform programming model, allowing the designer to program several hardware platforms such as CPUs, DSPs, FPGAs and Xeon Phi with the same programming model, while CUDA is restricted to NVIDIA GPUs. The strong point of CUDA is that the developer can get better performance, since CUDA has a tighter integration between the programming model and the hardware [Matsumoto12]. The advantages of OpenCL are more interesting for our developments than its main drawback with respect to CUDA for GPU programming.

3.2 Best Practice: Profiling Toolset
This subsection focuses on the rainfall-runoff model under study in the HARPA project. The code to be accelerated is the runoff computation, which is based on a mathematical convolution. We have detected the bottleneck through profiling tools, which provide information about where in the code the execution spends most of its time. That part of the code is called the bottleneck or hot-spot. To reach the best integration of the HPC application with the HARPA-OS runtime, accurate performance analysis tools are indispensable. Such tools allow identifying performance issues by:

• collecting performance data from the system running the application;
• organizing and displaying the data in a variety of interactive views, from a system-wide perspective down to source code or processor instructions;
• identifying potential performance issues and suggesting improvements.

Such tools are able to find hot-spots in the application. The hot-spot activity can be in memory, in processes or threads running on the operating system, in an executable file or module, or in a user function (requires symbols). The tool should identify the line of source code (requires symbols with line numbers) or the processor (assembly) instruction. It should detect any significant activity, even if it occurs infrequently and probably does not have much impact on system performance, and it should detect activity related to time spent or other internal processor events. The regular method of finding hot-spots is using a sampling collector. The sampling collector periodically interrupts the processor to collect information about what is happening in the CPU cores while the application is running on top. The sampling can be time-based or event-based; event-based sampling is triggered by the occurrence of a certain number of microarchitectural events.


The sampling collector records the execution context: the execution address in memory, the operating system process and thread ID, and the executable module loaded at that address. If symbols for the module are available, post-processing can identify the function or method at the memory address. Line numbers from the symbol file can direct you to the relevant line of source code. Time-based sampling (TBS) is triggered by:

• operating system timer services;
• every n processor clock ticks.

Event-based sampling (EBS) is triggered by processor event counter overflow. These events are processor-specific, like L2 cache misses, branch mispredictions, floating-point instructions retired, and so on. Two possible profiling tools to reach a better integration of the HPC applications under study are Intel VTune [Vtune16] and Microsoft Visual Studio [MVS12]. Other profiling and debugging tools that are available at IT4I but not yet used in the project are:

• Allinea DDT [Allinea16] to debug hybrid MPI, OpenMP, CUDA and OpenACC applications on a single workstation or GPU cluster.
• TotalView [TotalView16], a GUI-based tool that allows you to debug one or many processes/threads with complete control over program execution.
• NVIDIA Nsight [Nsight16], a debugging and profiling tool for GPUs from NVIDIA that works with Eclipse and Microsoft Visual Studio.
• CUDA-GDB [CUDA-gdb16] to debug both the CPU and GPU portions of our application simultaneously. CUDA-GDB can be used on Linux or macOS, from the command line, DDD or EMACS.

A sophisticated debugger, profiler and memory analyzer is necessary for profiling OpenCL. Such a tool lets the developer trace the application activity on top of the OpenCL API and understand what is happening within the system implementation. The tool should at least:

• assist the designer in optimizing OpenCL application performance;
• discover OpenCL-related bugs;
• help improve application performance and quality;
• reduce debugging and profiling time;
• deploy on multiple platforms;
• conform to future OpenCL versions;
• optimize memory consumption.

The tool that fulfils all the proposed features is gDEBugger [Gremedy15]. The subsection of Annex A named "gDEBugger tool to debug OpenCL binaries" explains more features of the tool, qualities that helped in the debugging and development of the OpenCL code in the HARPA project. Moreover, Annex A contains an accurate description of how to profile the runoff model written in OpenCL.


The following section describes the steps to profile the Math1D code. Math1D is the model that is executed many times (Monte-Carlo samples) to estimate the uncertainties of the model. The profiling tool provides information about the hot-spots and the parts of the code suitable for acceleration. The section also details what infrastructure was used for running the HPC examples.

3.2.1 Profiling of the Math1D model: Runoff model hot-spot
The main interest in profiling the Math1D model is a dynamic analysis that measures not only space (memory) but also execution time and complexity of the program, to see which parts of the program contain more activity. This profiling information helps us optimize the program. For our models, we are not very interested in memory, since at the current stage the model is not memory bound. We have instrumented the binary executable for x86-64 CPUs to detect the bottlenecks, where the program consumes most of its time. According to the profiling tool (Microsoft Visual Studio), the program spends the longest time in SCSMethod(), with 65.32% of the time (see figure 1).

Figure 1: Hot-spot: shows the most expensive call path based on sample count. The path is math1d_cl::MatData::scsMethod

In particular, we can observe the code with a hot-spot in figure 2.


Figure 2: Code view with percentage of the time spent in the intensive instructions.

Figure 2 shows that 42.7% of the time is spent in the assignment int j1 = j. Here there is a serialization on j1: j1 is only an integer, while the j loop could in principle be executed for the whole range (m+n-1) concurrently (loop unrolling). So the generated j vector has to pass through j1 each time, which degrades performance. A future deliverable will explain what can be done to improve the performance of the system.

3.2.2 Part of the code to be accelerated
According to the profiling results, the part to be accelerated is a vector multiplication Q = f(pef, q1). The original code is presented below (C = f(A, B), where C = Q, A = pef and B = q1). This part of the code belongs to the inner parts of SCSMethod().

void ori_computation(int M, int N, float *A, float *B, float *C)
{
    int j;
    int k;
    for (j = 0; j < M; j++) {
        int j1 = j;
        C[j] = 0;
        for (k = 0; k < N; k++) {
            if (j1 >= 0 && j1 < (int)M - N + 1) {
                C[j] += (A[j1] * B[k]);
            } //if
            j1--;
        } //fork
    } //forj
} //for_ori

Code 1: Part of the model to be accelerated
The code has been accelerated with OpenMP and OpenCL. OpenMP has been tested on x86-64 CPUs. OpenCL has been tested on x86-64 CPUs and on GPU devices. A sketch of the OpenCL mapping of this loop is given below.
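For illustration only, the following sketch shows how the loop of Code 1 might map to an OpenCL kernel, with one work-item computing one output element C[j]; the OpenMP variant simply places a parallel-for pragma on the outer j loop. The kernel name and signature here are hypothetical; the actual kernel used in the project is described in Annex A.

// Hypothetical OpenCL kernel: one work-item computes one output element C[j].
__kernel void runoff_convolution(const int M, const int N,
                                 __global const float *A,
                                 __global const float *B,
                                 __global float *C)
{
    int j = get_global_id(0);   // global work-item index = output index
    if (j >= M)
        return;

    float acc = 0.0f;
    int j1 = j;
    for (int k = 0; k < N; k++) {
        if (j1 >= 0 && j1 < M - N + 1)
            acc += A[j1] * B[k];
        j1--;                   // same index walk as in the original loop
    }
    C[j] = acc;
}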


3.3 Intel x86-64 HPC node (Harpa-s1)
The experiments performed for this deliverable have been executed on two HPC nodes for the MPI validation and on one node for the OpenCL-GPU enabling. The system has 24 x86-64 cores (2 CPU sockets with 12 cores each, Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz). The GNU/Linux OS installed is CentOS 6.6 with kernel 2.6.32-573.7.1.el6.x86_64, an operating system similar to the one installed on the two clusters. The machine is equipped with an NVIDIA GTX970 GPU [GTX970]. The OpenCL implementations installed are a) OpenCL 1.2 (Build 57) from Intel(R) Corporation and b) OpenCL 1.2 CUDA from NVIDIA Corporation.

Figure 3: Abstraction view of the HARPA-s1 machine

4 Customization of application infrastructure for HARPA environment

Rainfall-runoff simulations, like any others, can be affected by a certain amount of inaccuracy, due to necessary simplifications of relations in the computational model itself compared to the real world and due to imprecision in the input data set. To provide additional information about possible situations, uncertainty simulations are built and executed to complement the results of the standard rainfall-runoff simulations. Our implementation of rainfall-runoff uncertainty simulations is based on the Monte-Carlo method. A large set of possible input parameter values is generated and then simulations with these parameter sets are computed. Using the partial simulation results, selected p-percentiles are picked, and confidence intervals representing the boundaries within which output values can occur are formed (a sketch of this step is given below). Since these partial simulations over the given parameter sets are in general independent sub-tasks and together form one computationally intensive task, a parallel approach with distributed processing naturally applies.
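As a purely illustrative sketch of the percentile step described above (not project code; the percentile values and data are arbitrary), the confidence-interval boundaries for one time step can be obtained by sorting the outputs of the partial simulations and picking the requested p-percentiles:

#include <algorithm>
#include <vector>

// Return the p-percentile (0..100) of the simulated outputs for one time step.
double percentile(std::vector<double> samples, double p)
{
    std::sort(samples.begin(), samples.end());
    size_t idx = static_cast<size_t>((p / 100.0) * (samples.size() - 1));
    return samples[idx];
}

// Example: lower and upper boundaries of a 90% confidence interval
// double lo = percentile(outputsAtStep, 5.0);
// double hi = percentile(outputsAtStep, 95.0);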


The usage of rainfall-runoff uncertainty simulations as an example of an HPC application benefiting from HARPA-OS is convenient because the distribution of sub-tasks can be easily adapted to various runtime scenarios utilizing different types and amounts of computational resources. The code is fully integrated into the HARPA-OS Abstract Execution Model (AEM) using its API and thus can be seen as a HARPA-OS managed application.

4.1 Integration in HARPA-OS
The process of integration and deployment of the rainfall-runoff uncertainty simulations as an example managed application consists of several steps:
1) Preparation of the hardware and software development/deployment environment.
2) Implementation of selected HARPA OS AEM API methods; this is partly assessed in D.5.2 for x86-64 cores. Extensions to this implementation are being developed and will be presented in D.5.4 at M36.
3) Adaptation of the internal application structure and data communication, presented in subsection 3.1 and sections 4 and 5 of this document, where modifications to the application are assessed, such as the explanation of MPI and the code extensions for creating a 2D memory layout for OpenCL.
4) Design of suitable runtime scenarios and appropriate application requirements. These were presented in D5.2; more accurate/refined scenarios will follow at M36 and M39 in D5.5 and D.5.6, where new AWMs will be presented thanks to the multi-node and accelerator extensions (M36), together with a description of how HARPA-OS provides benefits with mitigation techniques under a model of degradation at M39 (scenario described in the High Level degradation model, D.4.4).

Validation of the application requirements and compliance limits was performed as a first approximation in D.5.2, where there is a description of four uncertainty instances competing for resources in an HPC node. Moreover, the programming models and the integration for running with extended AWMs are presented in this document. A more refined approximation will come at M36 with experimentation similar to D.5.2, while at M39 the degradation model and HARPA-OS as a mitigation tool will be exposed in the final report. This deliverable D.5.4 at M30 presents the development phases, which include the installation of HARPA-OS, the required tools and familiarization with the hardware. This part is partially presented in sections 4 and 5 of this document. The HARPA-OS API provides a wrapper for application initiation, e.g. command line argument propagation. How to run the diverse instances was explained in D5.2; how to run them with extended AWMs (i.e. multi-node, with accelerators) will be presented in D5.4 at M36. In the following subsection, the API implementation example is described.

4.2 Preparation of hardware and software development/deployment environment

This subsection describes the integration of our example with the diverse programming models chosen as the best candidates for HPC. OpenMP was chosen for shared memory, as explained in subsection 3.2. MPI is selected for enabling HARPA-OS and the application on multiple nodes.


4.2.1 OpenMP thread integration in our solution
All OpenMP directives in C and C++ are indicated with #pragma omp followed by parameters, ending in a newline. The pragma usually applies only to the statement immediately following it, except for the barrier and flush commands, which do not have associated statements.

The code represents the parallel processing of a chunk of Monte-Carlo samples in one MPI process. Each parallel thread invokes a call to set (load) a set of uncertainty parameters for a particular model instance in the first critical section. A critical section is used to ensure a unique identification number for a sample and a correct order of parameters. After that, a rainfall-runoff simulation of a model is executed by each thread. In the second critical section, the computed simulation results are appended to a vector of results for the given (local) MPI process.

 1 #pragma omp parallel
 2 { // commented out: #pragma omp single
 3 #pragma omp for schedule(dynamic)
 4 for (size_t i = 0; i < threadsNumber; ++i) {
 5     int tid = omp_get_thread_num();
 6     #pragma omp critical
 7     for (std::vector<std::shared_ptr<math1d_cl::AbstractParam>>::const_iterator it = m_uncertainParameters.begin(); it != m_uncertainParameters.end(); ++it) {
 8         (*it)->setParam(m_models[tid]);
 9     } //forit
10     m_models[tid].rainfallRunoffModel();
11     #pragma omp critical
12     for (size_t ch = 0; ch < m_matData->getChannels().size(); ++ch) {
13         m_localQ.insert(m_localQ.end(),
14             m_models[tid].getChannels()[ch]->getHydrograph().getQOut().begin(),
15             m_models[tid].getChannels()[ch]->getHydrograph().getQOut().begin() + m_timeSteps);
16     } //forch
17 } //fori
18 } // OMP Parallel

Code 2: OpenMP Pragmas for one Monte-Carlo Thread

Line 1: #pragma omp parallel
The parallel pragma starts a parallel block. It creates a team of N threads (where N is determined at runtime, usually from the number of CPU cores, but may be affected by a few things), all of which execute the next statement (or the next block, if the statement is a {...} enclosure). After the statement, the threads join back into one.

Line 2: #pragma omp single
The single directive specifies that the given statement/block is executed by only one thread; it is unspecified which thread. Other threads skip the statement/block and wait at an implicit barrier at the end of the construct. In the code example, the "pragma omp single" on line 2 relates to the following line, which prints out the number of currently used OMP threads through the logger.


Since this line is commented out in the example, the pragma is not needed at this stage; it is kept only as information that was useful during the code development phases.

Line 3: #pragma omp for schedule(dynamic)
The default schedule is static: upon entering the loop, each thread independently decides which chunk of the loop it will process. In the dynamic schedule, there is no predictable order in which the loop items are assigned to different threads; each thread asks the OpenMP runtime library for an iteration number, handles it, then asks for the next one, and so on.

Line 6: #pragma omp critical
The critical directive restricts the execution of the associated statement/block to a single thread at a time. The critical directive may optionally contain a global name that identifies the type of the critical directive; no two threads can execute a critical directive of the same name at the same time. A small sketch of a named critical section is shown below.
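As a small illustration of the named critical directive (not project code; the names and counters are arbitrary), two critical sections with different names may run concurrently, while sections sharing a name are serialized:

#include <omp.h>

int counter_a = 0, counter_b = 0;

void update_counters()
{
    #pragma omp parallel
    {
        // Only one thread at a time may be inside the critical section named "a".
        #pragma omp critical(a)
        counter_a++;

        // A critical section with a different name ("b") does not block "a".
        #pragma omp critical(b)
        counter_b++;
    }
}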

4.2.2 MPI library integration in our solution
The MPI primitives have been integrated into the HARPA-OS methods and are described in the following section. A schema of how two MPI processes communicate through the HARPA methods is presented in the figure below. The methods and their relation with the MPI library are also summarized. The complete mechanisms are in the git repository (git: https://bitbucket.org/tonipat/bosp-uncertaintympi).

MpiUncertainty constructor
In the constructor, the MPI initialization phase takes place, including the detection of the number of MPI processes and their ranks. The master process is also determined. A minimal sketch of this initialization is shown below.
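The following sketch is not the project constructor but illustrates, under assumed member names, the initialization it describes: MPI is started, the number of processes and the local rank are detected, and the master rank is fixed.

#include <mpi.h>

// Hypothetical sketch of the initialization performed in the MpiUncertainty constructor.
class MpiUncertaintySketch {
public:
    MpiUncertaintySketch(int *argc, char ***argv)
    {
        MPI_Init(argc, argv);                    // start the MPI environment
        MPI_Comm_size(MPI_COMM_WORLD, &m_size);  // number of MPI processes
        MPI_Comm_rank(MPI_COMM_WORLD, &m_rank);  // rank of this process
        m_master = 0;                            // rank 0 acts as the master
    }

private:
    int m_size = 0;
    int m_rank = 0;
    int m_master = 0;
};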

onSetup method
In this method, configuration data and specific QoS (Quality of Service) parameters are loaded from configuration files. The master process performs an initial computation with the input data in order to select the QoS mode according to the current situation. The QoS mode represents a set of application runtime criteria and constraints and determines the total amount of work that must be performed. The selected QoS mode is then broadcast to all processes using MPI messages. The elapsed-time measurement used for performance monitoring is also initiated. An extract of the code is presented below:


if (m_rank == m_master)
{
    m_qosMode = matData->checkFWL();
    m_timer.start();
}
MPI_Bcast(&m_qosMode, 1, MPI_INT, m_master, MPI_COMM_WORLD);
logger->Info(RANK("QoS mode %d selected (%d MC samples, %d sec)"), m_qosMode, m_jobsNumber[m_qosMode], m_timeFrame[m_qosMode]);

Code 3: Broadcasting QoS to all ranks

Figure 4: Graph with HARPA-OS methods and their relation with MPI for two nodes.

onConfigure method
An argument of the onConfigure method is the id of the newly assigned AWM (Application Working Mode). According to this id, the appropriate parameters from the OPL (Operating Points List) are extracted. These can include the number of threads used in the current AWM, flags for controlling processes, the usage of accelerators or other resource-related parameters. The internal application configuration can be adjusted based on the new parameters at this stage:

m_threadsNumber = opList[awm_id].parameters["threads"];
m_accelerate = opList[awm_id].parameters["accelerate"];
m_uncertainty->CreateModels(m_threadsNumber);

onRun method
In every workload processing cycle, the master process must obtain information about the resources currently available to each slave process, because of the dynamic workload distribution. For example, if the total number of threads of all processes is sufficient information for further control, the MPI_Reduce function can be used:

MPI_Reduce(&m_threadsNumber, &m_threadsNumberTotal, 1, MPI_INT, MPI_SUM, m_master, MPI_COMM_WORLD);


After evaluating the gathered information, the master process broadcasts flags to the slave processes to control their activity or the usage of accelerators.

MPI_Bcast(&m_accelerate, 1, MPI_INT, m_master, MPI_COMM_WORLD);
MPI_Bcast(&m_workDone, 1, MPI_INT, m_master, MPI_COMM_WORLD);

Once the main computation is completed, the master process should verify the amount of work actually performed by all processes and compute a time-based metric, e.g. the average execution time per workload cycle.

onMonitor method
This method handles the evaluation of the ongoing progress and monitors whether the runtime criteria are satisfied. The master process computes the time required to simulate the remaining number of Monte-Carlo samples (the remaining amount of work in general) and the estimated total time of the whole simulation at the current speed of the computation:

timeRequired = (m_jobsNumber[m_qosMode] - m_jobsDone) * m_avgTimePerSimulation;
estimatedTotalTime = timeRequired + m_timer.getElapsedTimeMs();

If this remaining time plus the elapsed execution time (i.e. the estimated total time) exceeds the time slot limit defined by the selected QoS mode, a higher AWM is requested.

goalGap = (1 - (timeFrameMs / estimatedTotalTime)) * 100;
if (estimatedTotalTime > timeFrameMs)
    SetGoalGap(goalGap); // MPI version (*)

(*) SetGoalGap is used in the uncertainty model with MPI enabled, but we realized that the explicit goal-gap computation is the better approach; please read subsection 5.2.

onRelease method
When the simulations are finished, the last phase of the application process is to collect the computed results. The master process gathers these data and performs the post-processing, a final evaluation of the simulations and the storage of the results.

numQPerProcess = m_samplesProcessed * m_timeSteps * m_matData->getChannels().size();
MPI_Gather(&numQPerProcess, 1, MPI_INT, &numQ.front(), 1, MPI_INT, m_master, MPI_COMM_WORLD);
MPI_Gatherv(&m_localQ.front(), numQPerProcess, MPI_DOUBLE, &gatheredQ.front(), &numQ.front(),
            &displs.front(), MPI_DOUBLE, m_master, MPI_COMM_WORLD);

After that, the MPI execution environment is cleared and terminated, and the application run is completed.

MPI::Finalize();


5 Configuration and finalizing of the HPC codes running within HARPA environment

5.1 Steps to enable HARPA-OS with the heterogeneous example
In order to run the HPC application on the HARPA nodes, it is required to install the indispensable libraries and software. The HARPA-OS runtime has to be aware of the diverse tools needed to run the application on the heterogeneous system. The settings of the environment are performed through a CMakeLists.txt file [CMake16]. CMake is an open-source, cross-platform family of tools designed to build, test and package software. CMake is used to control the software compilation process using simple platform- and compiler-independent configuration files, and to generate native makefiles and workspaces that can be used in the compiler environment of your choice. Inside the CMakeLists file there is the path to the MPI compiler (mpic++), and it identifies which main file to link against the library. There are also the OpenCL include directory and library directory. The paths are different depending on whether you are interested in running OpenCL for x86-64 CPUs or OpenCL for accelerators such as an NVIDIA GPU or Xeon Phi. Another extension to the regular file is the path where the OpenCL kernels are saved. A hypothetical fragment is sketched below.
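The fragment below is a hypothetical sketch of such a CMakeLists.txt, not the project file; all paths and target names are placeholders, and the OpenCL directories would point either to the Intel or to the NVIDIA installation.

cmake_minimum_required(VERSION 2.8)

# Use the MPI C++ compiler wrapper (placeholder; must be set before project()).
set(CMAKE_CXX_COMPILER mpic++)
project(uncertainty_example CXX)

# OpenCL include and library directories (placeholders).
include_directories(/usr/local/cuda/include)
link_directories(/usr/local/cuda/lib64)

# Main file linked against the required library.
add_executable(uncertainty main.cpp)
target_link_libraries(uncertainty OpenCL)

# Path where the OpenCL kernels are stored, made available to the code as a define.
add_definitions(-DOPENCL_KERNEL_DIR="${CMAKE_SOURCE_DIR}/kernels")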

5.2 Extensions to HARPA-OS: Resource-Aware Adaptivity
Some extensions are being implemented in the application together with the HARPA-OS runtime to facilitate the achievement of objectives (such as constraints in terms of execution time, temperature or power). CPS (Cycles Per Second) can be defined as the number of times per second that the onRun and onMonitor methods are called. Since the onRun and onMonitor functions are sequentially called and executed in a loop until the entire computation is over, the library can estimate the current performance of the application in terms of cycles per second (CPS). Thus, the application itself can check the gap between the required performance level and the current one and notify the resource manager about it.

With the new performance-control API of the run-time, it is possible to manage applications explicitly, instead of implicitly with GetCPS and SetGoalGap, the functions that were presented in the previous deliverable D.5.2. New functions are available in the new release, namely GetAssignedResources() and SetCPSGoal(). However, as explained later, SetCPSGoal() is not useful for our HPC example.

The HARPA-OS runtime manages the application by assigning an Application Working Mode (AWM) in each iteration. The AWM value (awm_id) is set for the first time in the onConfigure method. The awm_id is an integer number that identifies the AWM assigned by the HARPA-OS. There is a function called CurrentAWM() that returns this identification number. The allocation of resources is obtained explicitly with the GetAssignedResources function:

RTLIB_ExitCode_t GetAssignedResources(RTLIB_ResourceType r_type, int32_t &r_amount);


Code 4: Call function to GetAssignedResources

The function returns the number of assigned resources of the given type:

RTLIB_ResourceType::PROC_NR       Number of CPU cores
RTLIB_ResourceType::PROC_ELEMENT  Amount of CPU bandwidth (CFS quota)
RTLIB_ResourceType::MEMORY        Amount of memory (in KB)

Code 5: Parameters of GetAssignedResources

Or implicitly, transparently performed by the RTLib:

RTLIB_ExitCode_t SetCPSGoal(float cps, uint16_t fwd_rate = FR_DEFAULT);

Code 6: Implicit Goal-Gap forwarding example

This call sets the desired performance in terms of Cycles Per Second (CPS); the library automatically computes the Goal Gap and implicitly forwards it to the HARPA-OS runtime. This second way has several advantages (available from version 1.1):
1) Less code is required, an easier approach.
2) The Goal-Gap value is a mean computed over N samples.
3) The Goal-Gap is forwarded to the HARPA-OS every fwd_rate cycles (frequency filter).
4) A Goal-Gap threshold can be specified (amplitude filter).

The problem for our HPC application with this second approach is that SetCPSGoal() is intended for applications whose HARPA-OS cycle changes when the resource allocation changes. In our example, the cycle is constant and does not change depending on the number of resources. Therefore, we experimented with SetCPSGoal but stayed with the explicit computation of the goal gap, where the goal gap is the percentage difference between the jobs performed per second and the ideal jobs per second, divided by the ideal number of jobs per second:

float goal_gap = 100.0 * (jobs_per_second - ideal_jobs_per_second) / ideal_jobs_per_second;

A more accurate description is in the paragraph where the onMonitor method is described.

HARPA-OS methods modified
The next subsection explains the modifications to the HARPA-OS methods with respect to the previous deliverable. The invocation of the onMonitor method happens every cycle, after the onRun() execution. The onMonitor implementation exploits the Performance API (further BbqueEXC member functions). The application can get an estimation of the current performance, expressed in terms of Cycles Per Second (CPS), and can forward a feedback to the HARPA-OS (Goal-Gap). The Goal-Gap provides the percentage gap between the current CPS and the expected CPS goal, in two ways: explicit forwarding, performed by the application, and implicit forwarding, transparently performed by the RTLib library.


onConfigure Method:
Once the application ends the initialization step (onSetup), the control thread waits for the resource allocation decision of the HARPA-OS. As soon as this is received, the onConfigure function is called. The application can then check the amount of assigned resources and configure itself accordingly, before starting (or continuing) the execution, as described in line 6 below:

 1 RTLIB_ExitCode_t MpiUncertainty::onConfigure(int8_t awm_id)
 2 {
 3     logger->Error("OnConfigure !!!");
 4     int32_t available_cores = 0;
 5     int8_t gars = 1;
 6     GetAssignedResources(RTLIB_ResourceType::PROC_NR, available_cores);
 7
 8     logger->Error("OnConfigure: Available cores: %d", available_cores);
 9     if (available_cores == 0) {
10         available_cores = omp_get_max_threads();
11         logger->Error("onConfigure: Unknown CPU amount. Using 1 thread");
12     }
13     if (m_threadsNumber != available_cores) {
14         logger->Notice("Changing threads number from %d to %d",
15             m_threadsNumber, available_cores);
16         m_threadsNumber = available_cores;
17         m_uncertainty->CreateModels(m_threadsNumber);
18     }
19     return RTLIB_OK;
20 } //onConfigure

Code 7: onConfigure Method description

GetAssignedResources returns available_cores; if this parameter is zero, it means that the application is running in unmanaged mode. Then, for our purposes, the number of available cores is obtained from an OpenMP call (omp_get_max_threads()), which returns the maximum number of threads, i.e. the maximum number of cores available. More precisely, when HARPA-OS is configured in unmanaged mode, the number of threads used to run the application is always the maximum number of cores available in the x86-64 node (lines 8-12). The normal execution of GetAssignedResources, however, is in managed mode: the available_cores parameter then returns the number of cores needed to minimize the Goal Gap (defined in the onMonitor method). If the available cores value is different from the number of threads of the previous cycle, then the new model is created with the new number of available cores (lines 13-18).


onRun Method The onRun method executes the model (line 16) N times, the number equal to m_treadsNumber parameter. Some conditions are added like if the remaining jobs are equal to 0 or negative, then the simulation has to stop (line 6). Or, if the remaining jobs to be run is smaller than the number of threads. Then only the remaining jobs have to be executed (line 9). 1 RTLIB_ExitCode_t MpiUncertainty::onRun() 2 3 int remaining_jobs = m_jobsNumber[m_qosMode] - m_jobsDone; 4 logger->Notice("Quality of service mode is: %d", m_qosMode); 5 // Return if there are not jobs left to do 6 if(remaining_jobs <= 0) 7 return RTLIB_EXC_WORKLOAD_NONE; 8 // Exploit less threads if less jobs remain 9 if (m_threadsNumber > remaining_jobs) 10 logger->Notice("Only %d jobs remaining. Scaling down threads number", 11 remaining_jobs); 12 m_threadsNumber = remaining_jobs; 13 m_uncertainty->CreateModels(m_threadsNumber); 14 15 // Set number of OMP threads to uncertainty and simulate given chunk of MC samples 16 m_uncertainty->RunMC(m_threadsNumber); 17 m_jobsDone += m_threadsNumber; 18 return RTLIB_OK; 19 //onRunMethod

Code 8: onRun Method description onMonitor Method If CPS is zero onMonitor returns at the beginning correctly because getCPS returns the number of cycles that is being monitored the application per second. Normally, it is a float higher than 0.0, if it is zero means that the application has ended. 1 RTLIB_ExitCode_t MpiUncertainty::onMonitor() 2 3 logger->Warn("[%.2f%%] Cycle %d done (%d threads), average Jobs Per Second = 4 %.2f", 100.0 * (float)m_jobsDone / 5(float)m_jobsNumber[m_qosMode], Cycles(), m_threadsNumber, GetCPS() * m_threadsNumber); 6 if (GetCPS() == 0.0) 7 return RTLIB_OK; 8 int remaining_jobs = m_jobsNumber[m_qosMode] - m_jobsDone; 9 float remaining_time = std::max(1.0, m_timeFrame[m_qosMode] - (m_timer.getElapsedTimeMs() / 1000.0)); 10 float jobs_per_second = (float)m_threadsNumber * GetCPS(); 11 float ideal_jobs_per_second = (float)remaining_jobs / remaining_time; 12 float goal_gap = 100.0 * (jobs_per_second - ideal_jobs_per_second) / ideal_jobs_per_second;


13     logger->Warn("Time spent: %f", m_timer.getElapsedTimeMs() / 1000.0);
14     logger->Warn("Remaining jobs: %d", remaining_jobs);
15     logger->Warn("Remaining time: %f", remaining_time);
16     logger->Warn("Current cps: %f", jobs_per_second);
17     logger->Warn("Ideal cps: %f", ideal_jobs_per_second);
18     logger->Warn("Goal Gap for Cycle %d is %.2f", Cycles(), goal_gap);
19     SetGoalGap(goal_gap);
20     return RTLIB_OK;
21 } //end onMonitor

Code 9: onMonitor Method description

The method computes the remaining jobs as the total number of jobs (m_jobsNumber[m_qosMode]) minus the jobs already done (line 8). Similarly, the remaining time is the total time frame m_timeFrame[m_qosMode] minus the elapsed time (m_timer.getElapsedTimeMs()) in seconds (line 9). The ideal jobs-per-second rate is the number of remaining jobs divided by the remaining time (line 11), and the goal gap is the relative difference between the measured and the ideal rate.

Results and Scenario Description

We executed our HPC application (computation of uncertainties in the rainfall-runoff model) with the new methods. We decided to run eight thousand samples in six minutes (8k samples, 360 s). We first ran it in managed mode using YAMS, the HARPA-OS policy selected for the experimental scenario, which adds improvements in Goal-Gap management [BOSP2016]. Figure 5 shows the execution time on the x-axis and the goal gap on the y1-axis; depending on the goal gap, GetAssignedResources provides the number of cores used to execute the application, always trying to minimize the goal gap and therefore the allocation of resources. The number of cores used per cycle is presented on the y2-axis.

Goal Gap

Figure 5 (up) shows the computation of the goal gap, while figure 5 (down) shows the best-effort execution with the maximum number of cores available; in that case the goal gap grows exponentially with time. The following figures show what this means in terms of the system monitors, such as power/energy and temperature.
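For illustration, the following self-contained sketch evaluates the Goal Gap formula used in onMonitor with hypothetical numbers (4000 remaining jobs, 200 s left before the 360 s deadline, 25 measured jobs per second); it is not part of the application code:

#include <cstdio>

int main() {
    float remaining_jobs   = 4000.0f;  // hypothetical values, for illustration only
    float remaining_time   = 200.0f;   // seconds left before the deadline
    float jobs_per_second  = 25.0f;    // threads * GetCPS() as measured by the RTLib

    float ideal_jobs_per_second = remaining_jobs / remaining_time;            // 20 jobs/s
    float goal_gap = 100.0f * (jobs_per_second - ideal_jobs_per_second)
                            / ideal_jobs_per_second;                          // +25 %
    printf("Goal gap: %.2f%%\n", goal_gap);
    return 0;
}

A positive gap means the application runs faster than needed to meet the deadline, so HARPA-OS can release cores; a negative gap triggers the allocation of additional cores.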


Figure 5: Up: goal gap computation and the number of resources allocated under YAMS management. Down: goal gap and number of resources in unmanaged mode.

Power, energy, and temperature are monitored through tools that run in parallel with the HPC application. These tools are executed on the HOST cores, which are not managed by the HARPA-OS runtime, so they do not degrade the performance of the application. The tool used to monitor temperature is psensor (Appendix B); the tool used to monitor energy and power is likwid (Appendix B), more precisely likwid-powermeter. We sampled the system monitors every 16 seconds.

Power monitoring

Figure 6 shows the power consumed in the DRAMs and in the package (PKG) for the two CPU sockets. The application does not make heavy use of memory (it is not memory bound): the amount of memory monitored with htop is about 0.1% per thread, while the CPU usage is always higher than 95% per thread. Hence, the uncertainty model is CPU bound. This behaviour is visible in figure 6, which compares the power consumption of the HPC application executed in managed mode (YAMS) versus unmanaged mode. In terms of memory (DRAM) power, both executions drain similar power; in unmanaged mode the usage period is shorter, since the execution runs with maximum resources and therefore finishes before the deadline (360 seconds), so in terms of power consumed in the DRAMs the unmanaged mode requires a smaller power budget.


Looking at the power consumed in the cores (packages zero and one), however, the maximum power is reduced by around 15 Watts in managed (YAMS) mode compared to unmanaged mode. In terms of energy, the extra energy is similar in both cases. The extra energy corresponds to the area between the power curves over time, E(J) = P(W) * t(s): for unmanaged mode it is the area between the YAMS and unmanaged curves from 0 to roughly 280 seconds, while for managed mode it is the area between the two curves from roughly 280 to 380 seconds. Summarizing, one advantage of YAMS over unmanaged mode is that the maximum power budget is lower (around 15 Watts in our system), with the trade-off that YAMS stays at its maximum power for a longer time, until the deadline. The global energy consumed is similar.
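The energy comparison above can be reproduced directly from the sampled power traces. The following sketch (an assumed helper, not part of the monitoring tools) integrates a power trace sampled every 16 seconds with the trapezoidal rule:

#include <vector>
#include <cstddef>

// E = sum over samples of P * dt, using the trapezoidal rule between consecutive samples.
double energy_joules(const std::vector<double> &power_watts, double dt_seconds) {
    double energy = 0.0;
    for (std::size_t i = 1; i < power_watts.size(); ++i)
        energy += 0.5 * (power_watts[i - 1] + power_watts[i]) * dt_seconds;
    return energy; // Joules
}

Subtracting the energies obtained for the YAMS and unmanaged traces over the same time window gives the "extra energy" areas discussed above.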

Figure 6: Power consumed in different parts of the system when YAMS is executed versus unmanaged mode. Up-left: power consumed in the DRAM of socket 0; up-right: power consumed in the DRAM of socket 1. Down-left: power consumed in package 0 (socket 0); down-right: power consumed in package 1.

Temperature monitoring

The temperature monitored through the CPU sensors is presented in figure 7, which compares the managed and unmanaged execution modes. Figure 7 presents the temperature for each core and for each CPU socket (0 and 1), sampled every 16 seconds. Figure 7 (up) shows the results for unmanaged mode, and figure 7 (down) shows the corresponding results for managed (YAMS) execution. The figures suggest that in unmanaged mode the per-core temperature is higher, but with the trade-off that the CPU cores stay at the higher temperature for a shorter time.


Figure 7: Up: temperature per core for sockets 0 and 1, monitored with psensor in unmanaged mode. Down: similar results for managed (YAMS) mode.

It is difficult to see clear differences between the two plots in figure 7. Therefore, we evaluated the temperature difference between managed (YAMS) and unmanaged execution; the results are shown in figure 8.


Figure 8: Up: temperature difference between YAMS and unmanaged mode. Down: mean temperature per CPU0 (c0..7) and CPU1 (c8..15).

Figure 8 (up) shows that during the YAMS execution the temperature can be up to 8 degrees lower than in the same execution in unmanaged mode (from 0 to 256 seconds). Figure 8 (down) shows the mean temperature per socket, for CPU0 (c0..7) and CPU1 (c8..15); while the execution temperature is stable (from approximately 56 to 256 seconds), the mean temperature is 3.7 °C lower than the respective temperature in unmanaged mode.


5.3 HPC code running with HARPA-OS and MPI library

The scenario created to verify the multi-node simulation is composed of two nodes connected with Infiniband/Ethernet. According to our preliminary experiments, the uncertainty simulation is not memory bound and the exchange of data is not very intense. More detailed experiments will be performed by M36 in the second part of this deliverable. Figure 9 depicts the two twin nodes.

Figure 9: Schema of the two nodes, connected with a high-speed interconnect.

The simulation introduces a new parameter to the HARPA-OS recipe, currently named "accelerate". If accelerate is enabled (accelerate=1), the uncertainty model runs with two processes, one per node. The integration mechanism between MPI and HARPA-OS within our HPC example is described in this subsection. Below is an XML description of one AWM (with identification id = 2) in the recipe, describing the configuration in which the slave node is enabled. HARPA-OS enables 16 cores for each node (pe qty="1600"); the number of available threads (threads=16) equals the number of cores, and the memory available during this AWM is 129088 MB.

</system_metrics>
</awm>
<awm id="2" name="Ultra High" value="34">
  <resources>
    <cpu>
      <pe qty="1600"/>
      <mem units="Mb" qty="129088"/>
    </cpu>
  </resources>
  <parameters>
    <parameter name="threads" value="16"/>
    <parameter name="accelerate" value="1"/>
  </parameters>
  <system_metrics

Code 10: Recipe with accelerator activation

Figure 10 presents the simulation with the two independent HARPA-OS instances running on two nodes (harpa-s1 and harpa-s2) in master-slave fashion.



The two instances communicate through MPI (rank 0 and rank 1). In figure 10 it is possible to observe how the two instances allocate 16 cores each, shown in the upper right and left parts of the figure. At the current stage, the HPC application is integrated with HARPA-OS and communicates through the MPI library (master-slave), but the HARPA-OS runtime engines run as separate instances and do not exchange information about how "healthy" the other system is. A video of the experiment is available at: https://youtu.be/g05HfCygWuc
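For reference, the master-slave exchange between rank 0 and rank 1 boils down to plain MPI point-to-point calls. The sketch below is a minimal, self-contained illustration; the message content and the sample count are hypothetical and not taken from the project code:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int samples_to_run = 0;
    if (rank == 0) {                        // master (harpa-s1)
        samples_to_run = 4000;              // hypothetical share of the MC samples
        MPI_Send(&samples_to_run, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                 // slave (harpa-s2)
        MPI_Recv(&samples_to_run, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Slave received %d samples to simulate\n", samples_to_run);
    }

    MPI_Finalize();
    return 0;
}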

Figure 10: HARPA-OS running with MPI communication (the complete video of the demonstration is available at the link above).

5.4 HPC code running with HARPA-OS in an accelerator with OpenCL

This subsection describes the state of work that is still under development. Subsection 6.2 (gDEBugger tool to debug OpenCL binaries) describes how the profiling tools allow us to verify which parts of the model are the hot-spots. The runoff part was identified as the most computationally intensive portion and hence the major hot-spot; therefore, it makes sense to offload this part of the code to the accelerator.



5.4.1 Problem Decomposition

In the general case, a problem decomposition paradigm is a strategy for organizing a problem as a number of parts. Usually, the aim of using a decomposition paradigm is to optimize some metric related to problem complexity, for example the modularity of the program or its maintainability. Most decomposition paradigms suggest breaking down a program into parts so as to minimize the dependencies among those parts and to maximize the consistency of each part. In our specific case, we have to generate a data set large enough to make computing the algorithm on the accelerator worthwhile. Therefore, we have to:

- First observe the graph of our execution thread. (Fig. 11a)

- Observe how many nodes can be executed in parallel. Each node has two input vectors to compute a convolution.

- Save the vectors, i.e. the contribution of each node, in a 2D memory.

- Wait until all the contributions are loaded and the data is ready.

- Call the OpenCL primitives to execute the kernel in the accelerator.

- Transform the 2D memory result back into the contribution vector of each node.

The steps presented above are explained in the following paragraphs. The runoff has two one-dimensional input vectors (pef and q1) and one output vector (Q). The size of these vectors must be large enough to make the usage of the accelerator an advantage: the vectors should fill the accelerator memory, and the number of calls to the accelerator device has to be minimized, since the cost of transferring data from L2 to device memory is not negligible. Figure 11 presents the execution of a single thread. This thread is usually executed on the x86-64 CPUs, even though our goal is to run it also on the accelerator. Figure 11 shows the dependency graph inside each thread: the nodes are catchments and the edges are the dependencies and contributions of the lower catchments to the upper ones. The graph dependencies indicate how many convolutions can be performed in parallel. Executing the convolution in the accelerator (OpenCL kernel) is convenient when the execution time per iteration (Fig. 11b) in the accelerator is much smaller than the execution time on the CPU. One of the drawbacks we detected is that the runoff model is computed for each basin, while the number of basins that can be computed in parallel depends strongly on the schematization (geographic information). Figure 11 (left) exhibits the graph tree with the dependencies among basins; it is clearly not possible to compute all the basins in parallel, and in the first iteration only the independent basins (figure 11, right part) can be computed. Vectors pef and q1 are extracted from each basin and loaded into a 2D memory.


This 2D memory contains the input data for the call to the accelerator. The output of the accelerator is another 2D memory containing the Q vectors of all basins, which must be copied back properly so that the execution of the thread can continue. In the second iteration it is again possible to call the accelerator. The number of times the device is called during the execution of the thread has to be determined at design time: when the number of vectors to be executed in the accelerator falls below a threshold, the runoff convolution model is better executed on the x86-64 CPUs. The threshold number N of vectors has to be fixed at design time. This problem decomposition is independent of the accelerator; the following subsection explains the decomposition on the NVIDIA GPU.
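As an illustration of the packing step described above, the following sketch (assumed data layout and names, not the project code) copies the per-basin pef and q1 vectors of one iteration into flat row-major buffers ready to be transferred to the device in a single copy:

#include <vector>
#include <algorithm>
#include <cstddef>

struct Basin {                 // hypothetical container for one catchment
    std::vector<float> pef;
    std::vector<float> q1;
};

// One row per basin; shorter vectors are zero-padded to row_len.
void pack_iteration(const std::vector<Basin> &independent_basins, std::size_t row_len,
                    std::vector<float> &pef_flat, std::vector<float> &q1_flat) {
    pef_flat.assign(independent_basins.size() * row_len, 0.0f);
    q1_flat.assign(independent_basins.size() * row_len, 0.0f);
    for (std::size_t b = 0; b < independent_basins.size(); ++b) {
        std::copy(independent_basins[b].pef.begin(), independent_basins[b].pef.end(),
                  pef_flat.begin() + b * row_len);
        std::copy(independent_basins[b].q1.begin(), independent_basins[b].q1.end(),
                  q1_flat.begin() + b * row_len);
    }
}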


Figure 11: a) Graph tree in which each node is a runoff convolution and the edges are the contributions of a lower catchment (node) to the upper nodes. b) Number of convolutions that can be performed per iteration.

5.4.2 NVIDIA GPU Example

The NVIDIA GTX 970 GPU [Geforce15] has 1664 graphics processing units (CUDA cores), runs with a base clock of 1050 MHz and a boost clock of 1178 MHz, and has 4 GB of GDDR5 main memory with a 256-bit interface. The best usage is obtained by filling up the memory, thus minimizing the accesses to the external memory, and by fully utilizing all the stream processors. At the current moment we have created a first OpenCL kernel. We are working on more optimized kernels with a possibly transposed input matrix, rectangular tiles, 2D register blocking and wider loads with register blocking [Nugteren15]; this new implementation will allow us to obtain more GOPS (Giga Operations Per Second). The HARPA-s1 server has a GPU installed in the PCI Express socket:



Figure 12: Schema of the HARPA-s1 server together with a schema of the NVIDIA GTX 970 GPU connected in the PCI Express socket.

The performance improvement is achieved by offloading compute-intensive portions of the application to the accelerator. Below is the code of the first version of the OpenCL kernel:

__kernel void ocl_uncer(const int M, const int N, const int K,
                        const __global float* A,
                        const __global float* B,
                        __global float* C)
{
    // Thread identifiers
    const int l = get_global_id(2); // --> K  Row ID of C (0..M)
    const int j = get_global_id(1); // --> M  Col ID of C (0..N)
    int k;
    float acc = 0;
    int j1 = j;
    for (k = 0; k < N; k++) {
        if (j1 >= 0 && j1 < (int)M - N + 1) {
            acc += (A[l*M*N + j1] * B[l*M*N + k]);
        } //if
        j1--;
    } //for
    C[l*M*N + j] = acc;
    acc = 0;
} //ori_computation

Code 11: First implementation version of the OpenCL kernel

An accurate explanation of how to integrate, debug and execute the OpenCL example on the GPU is given in Annex A. The integration of the OpenCL kernel with the uncertainty example and the HARPA runtime is explained in the following lines. The uncertainty model is called by m_model[tid].rainfallRunoffModelOCL(); where tid is the index of the sample or thread of the model. At the current stage, for OpenCL, tid is equal to one (tid = 1). Inside this function the matrix composition explained in the previous subsection takes place.



When the matrix with the input vectors pef and q1 is ready, the call to initialize the accelerator device is performed with the input and output matrices and their sizes (M, N, K). The output provided by the accelerator (GTX 970 GPU) is a matrix Q. This matrix is then transformed back into a set of vectors via the inverse matrix composition, where each vector contains the contribution of one basin (node in figure 13).
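The inverse composition can be sketched in the same spirit (again with assumed layout and names, not the project code): each row of the flat result matrix Q is copied back into the output vector of the corresponding basin:

#include <vector>
#include <algorithm>
#include <cstddef>

void unpack_results(const std::vector<float> &Q_flat, std::size_t row_len,
                    std::vector<std::vector<float>> &per_basin_Q) {
    for (std::size_t b = 0; b < per_basin_Q.size(); ++b) {
        per_basin_Q[b].resize(row_len);
        // row b of the flat matrix holds the Q vector of basin b
        std::copy(Q_flat.begin() + b * row_len,
                  Q_flat.begin() + (b + 1) * row_len,
                  per_basin_Q[b].begin());
    }
}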

Figure 13: Schema of how the model is called to run in the accelerator.

Inside the cl_dev object, init_ocl_device contains all the primitives to initialize, call, and run the GPU device; a detailed description of how to run an example on the device is given in Annex A. The activation of rainfallRunoffModelOCL is made through a flag, the ocl parameter. The recipe has a new parameter ocl that, when active, allocates one thread to be run on the GPU:

<parameter name="ocl" value="1"/>

The onRun method has been extended: when the recipe parameter ocl=1 is active, the most intensive part detected with the profiling tools runs on the GPU. In the future the Xeon Phi will also be feasible, since OpenCL supports heterogeneous architectures. At the current stage only one thread at a time runs on the GPU, but our evaluation estimates that the matrix sent to the GPU can also collect vectors (representing the contributions from different independent basins) coming from different threads. It will then be necessary to use a fork-join mechanism to synchronize the information belonging to the different threads; we will have to weigh this complexity against the benefits in the future steps of the development. In figure 14 we show the kernel in gDEBugger, used to verify the correct behaviour of the model step by step. An accurate description of how to debug is given in Annex A, subsection 6.2.
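The dispatch rule added to onRun can be summarised by the following sketch (illustrative names, not the project code): the OpenCL path is selected only when the recipe parameter ocl is active and enough independent basins are available in the current iteration to justify the device transfer:

#include <cstddef>

enum class ExecTarget { CpuOpenMP, GpuOpenCL };

ExecTarget select_target(bool ocl_param_enabled, std::size_t parallel_basins,
                         std::size_t design_time_threshold) {
    if (ocl_param_enabled && parallel_basins >= design_time_threshold)
        return ExecTarget::GpuOpenCL;   // offload the runoff convolutions to the accelerator
    return ExecTarget::CpuOpenMP;       // keep the work on the x86-64 cores
}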



Figure 14: Visualization of the OpenCL kernel of the Runoff model with gDEBugger

5.4.3 Possible Future Scenarios

The future scenarios fall under the urgent computing environment. Deliverable D5.4 M36 will present a setup similar to the one presented in D5.2, where several instances run in parallel with different QoS, with some extensions: new AWMs with the possibility to use accelerators to offload workload, and MPI primitives that allow the use of more than one node. If the resources are not sufficient to run the uncertainty instances with the required QoS, the runtime will submit a high-priority job to the HPC cluster. Furthermore, for the final scenario (M39), several HARPA partners are working on the future runtime, the application and the model of hardware degradation. UCY has carried out insightful research on monitors that detect degradation through diverse sensors and on knobs, where UCY is able to provide knowledge about the mitigation techniques (i.e. DVFS) presented in the previous subsection. A detailed description of monitors and knobs is already given in deliverables D3.2 and D3.3, Intermediate and Final Report on Novel Monitors and Knobs. The goal of using the bubble, already presented in other deliverables, is to apply pressure to an application and measure how much the application QoS is affected by different bubble pressure sizes. The QoS degradation is caused by physical effects due to aging phenomena, including NBTI, electromigration, hot carrier injection, gate oxide breakdown, temperature cycling/thermal shock and stress migration; these effects become more predominant as silicon technologies scale down. The final scenario envisioned at the end of the project is divided into three basic layers: a lower layer where the degradation model (described in deliverable D4.4) runs, simulating the degradation of the hardware, and an intermediate layer where diverse monitors (i.e. sensors) spread over the silicon provide information about the state of the hardware.


Finally, in the upper layer, the HARPA-OS receives information about the state of the hardware with the support of the monitors, takes decisions through diverse knobs, and schedules the HPC application, mitigating the effects of the degraded hardware at runtime.

Future Work

There are still several steps to develop before we have an example with all the primitives used in HPC applications. The first improvement is to develop a better resource allocator in the HARPA runtime, specifically for HPC. At the current stage, when the resources requested by the application are higher than a threshold, the deadline is missed even if there are hardware resources available; this will be fixed with the new HARPA-OS release named Steaks. Another item of future work is to perform experiments as presented in D5.2, where several uncertainty instances run in parallel with different QoS, but with the extensions described above: the possibility to use more than one node and the accelerator of each node, increasing the possible number of AWMs. These improvements are expected to be presented in the second part of this deliverable D5.4. As a final step, we are working on a high-level degradation model that will run on the HARPA servers, with the HARPA runtime acting as error mitigation. The integration of the interface between the knobs and monitors and the HARPA-OS is being carried out by both partners, UCY and POLIMI. The models, API, and management, monitoring and mitigation techniques that are valid for embedded systems can be reused for future HPC environments. The expected deadline for presenting this scenario is the end of the project, M39.

References

[Intel15] https://software.intel.com/en-us/intel-opencl (June 2015)
[Nvidia15] https://developer.nvidia.com/cuda-downloads (June 2015)
[Kronos15] https://www.khronos.org (June 2015)
[Allinea16] http://www.allinea.com/products/ddt (Feb. 2016)
[TotalView16] http://www.roguewave.com/products-services/totalview (Feb. 2016)
[Nsight16] http://www.nvidia.com/object/nsight.html (Feb. 2016)
[CUDA-gdb] https://developer.nvidia.com/cuda-gdb (Feb. 2016)
[HPCWIRE15] http://www.hpcwire.com/2015/03/02/a-comparison-of-heterogeneous-and-manycore-programming-models/ (Feb. 2016)
[Nugteren15] Tutorial: OpenCL SGEMM tuning for Kepler, http://www.cedricnugteren.nl/tutorial.php?page=1 (Oct. 2015)
[Geforce15] http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-970/specifications (Sept. 2015)
[Matsumoto12] K. Matsumoto, N. Nakasato, S. G. Sedukhin, "Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs", in SC Companion '12, IEEE
[Vtune16] https://software.intel.com/en-us/intel-vtune-amplifier-xe (Feb. 2016)
[MVS12] https://blogs.msdn.microsoft.com/visualstudioalm/2015/07/20/performance-and-diagnostic-tools-in-visual-studio-2015/ (Feb. 2016)
[Psensor16] http://wpitchoune.net/psensor/doc/README.html#_ati_amd_gpu_support (Feb. 2016)
[CMake16] https://cmake.org/cmake-tutorial/ (Feb. 2016)


[Gremedy15] http://www.gremedy.com/ (July 2015)
[Likwid15] https://code.google.com/p/likwid/wiki/LikwidPowermeter (June 2015)
[Psensor15] http://wpitchoune.net/blog/psensor/ (July 2015)
[Nvidia-smi15] http://developer.download.nvidia.com/compute/cuda/6_0/rel/gdk/nvidia-smi.331.38.pdf (July 2015)
[BOSP2016] https://bosp.dei.polimi.it/doku.php?id=news (Feb. 2016)


6 Annex A: Convolution of Runoff model in OpenCL

This annex describes the OpenCL development of the Runoff model. The Runoff is a convolution operation and is the most computationally demanding part of the model, as observed through the profiling tool (subsection 6.2). The parallelization explained here is used under the Linux OS, more specifically Ubuntu and CentOS. The OpenCL versions used for this example are Intel(R) OpenCL 1.2 for Linux and NVIDIA OpenCL (OpenCL 1.2, CUDA 7.5.18), installed following the instructions from Intel [Intel15] and NVIDIA [Nvidia15]. The following Runoff example provides the set of OpenCL steps needed to obtain code running on an accelerator. The intention of this part is to cover how to prepare code to be accelerated with OpenCL on an accelerator (i.e. CPU, GPU), but a similar methodology could be extended to Xeon Phi or FPGA platforms. The example covers the following topics:

§ Using platform and device layers to build robust OpenCL

§ Program compilation and kernel objects

§ Managing buffers

§ Kernel execution

§ Kernel programming – basics

§ Kernel programming – synchronization

Here are some notes of caution on how the OpenCL samples are written:

OpenCL defines a C-­like language for programming compute device programs. These programs are passed to the OpenCL runtime via API calls expecting values of type char *. Often, it is convenient to keep these programs in separate source files. For this and subsequent tutorials, we assume the device programs are stored in files with names of the form name_kernels.cl, where name varies, depending on the context, but the suffix _kernels.cl does not. The corresponding device programs are loaded at runtime and passed to the OpenCL API.

For this first OpenCL program, we start with the source for the host application.


6.1.1 Header files

We use the deprecated definitions because NVIDIA devices currently support only OpenCL 1.2. The clCreateCommandQueue function was deprecated as of OpenCL 2.0 and replaced with clCreateCommandQueueWithProperties. If we were only targeting devices that support OpenCL 2.0 (some recent Intel and AMD processors at the time of writing), we could safely use the new function. Since our code needs to run on devices that do not yet support OpenCL 2.0, such as NVIDIA devices, we continue using the deprecated clCreateCommandQueue function by defining the preprocessor macro that the OpenCL headers provide, as shown below:

#define CL_USE_DEPRECATED_OPENCL_2_0_APIS

Just like any other external API used in C++, we must include a header file when using the OpenCL API. Usually, this is in the directory CL within the primary include directory; for the plain C API we include cl.h:

#include <CL/cl.h>

For our program, we use a small number of additional headers, which are agnostic to OpenCL.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
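For completeness, a sketch of the command-queue creation that the deprecation macro above enables; context and device refer to the variables created later in subsections 6.1.3 and 6.1.4:

cl_int err = CL_SUCCESS;
// Deprecated in OpenCL 2.0, but still required for OpenCL 1.2 devices such as the GTX 970
cl_command_queue queue = clCreateCommandQueue(context, device[0], 0, &err);
if (err != CL_SUCCESS)
    fprintf(stderr, "clCreateCommandQueue failed: %d\n", (int)err);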

6.1.2 Errors

A common property of most OpenCL API calls is that they either return an error code (type cl_int) as the result of the function itself, or store the error code at a location passed by the user as a parameter to the call. As with any API call that can fail, it is important for the application to check the error status correctly. For the most part we will not concern ourselves with recovering from an error; for simplicity, we define two macros, CL_CHECK and CL_CHECK_ERR, to verify that each specific call has completed successfully. OpenCL returns the value CL_SUCCESS in this case; if the call fails, the macro prints a message to the user and exits, otherwise it simply returns.

#define CL_CHECK(_expr)                                                          \
    do {                                                                         \
        cl_int _err = _expr;                                                     \
        if (_err == CL_SUCCESS)                                                  \
            break;                                                               \
        fprintf(stderr, "OpenCL Error: '%s' returned %d!\n", #_expr, (int)_err); \
        abort();                                                                 \
    } while (0)


#define CL_CHECK_ERR(_expr)                                                          \
    ({                                                                               \
        cl_int _err = CL_INVALID_VALUE;                                              \
        typeof(_expr) _ret = _expr;                                                  \
        if (_err != CL_SUCCESS) {                                                    \
            fprintf(stderr, "OpenCL Error: '%s' returned %d!\n", #_expr, (int)_err); \
            abort();                                                                 \
        }                                                                            \
        _ret;                                                                        \
    })

A common paradigm for error handling in C++ is the use of exceptions, and the OpenCL C++ bindings provide such an interface. For now, let us look at the one remaining function, main, necessary for our OpenCL development.
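As a usage example (the buffer size is arbitrary and context refers to the context created in the next subsections), CL_CHECK wraps calls that return the error code directly, while CL_CHECK_ERR wraps calls that report the error through their last parameter, for which the macro itself provides the _err variable:

cl_uint num_platforms = 0;
CL_CHECK(clGetPlatformIDs(0, NULL, &num_platforms));

cl_mem tmp_buf = CL_CHECK_ERR(clCreateBuffer(context, CL_MEM_READ_ONLY,
                                             1024 * sizeof(float), NULL, &_err));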

6.1.3 OpenCL Contexts The first step to initialize and use OpenCL is to create a context. The rest of the OpenCL steps (creating devices and memory, compiling and running programs) are performed within this context.

A context can have a number of associated devices (for example CPU, GPU, Xeon Phi, and FPGA devices), and, within a context, OpenCL guarantees a relaxed memory consistency between devices. We use a single device, CL_DEVICE_TYPE_CPU, for the CPU device. We could have used CL_DEVICE_TYPE_GPU or some other supported device type, assuming that the OpenCL implementation supports that device. Before we can create a context, we must first query the OpenCL runtime to determine which platforms, i.e. the different vendors' OpenCL implementations, are present. The class cl::Platform provides the static method cl::Platform::get for this and returns a list of platforms. For now, we select the first platform and use it to create a context. The constructor cl::Context should be successful and, in this case, the value of err is CL_SUCCESS.

cl_platform_id platform[100];
cl_uint platforms_n = 0;
CL_CHECK(clGetPlatformIDs(100, platform, &platforms_n));

for (int i = 0; i < platforms_n; i++) {
    char buffer[10240];
    printf(" -- %d --\n", i);
    CL_CHECK(clGetPlatformInfo(platform[i], CL_PLATFORM_PROFILE, 10240, buffer, NULL));
    printf(" PROFILE = %s\n", buffer);
    CL_CHECK(clGetPlatformInfo(platform[i], CL_PLATFORM_VERSION, 10240, buffer, NULL));
    printf(" VERSION = %s\n", buffer);
    CL_CHECK(clGetPlatformInfo(platform[i], CL_PLATFORM_NAME, 10240, buffer, NULL));


printf(" NAME = %s\n", buffer); CL_CHECK(clGetPlatformInfo(platform[i], CL_PLATFORM_VENDOR, 10240, buffer, NULL)); printf(" VENDOR = %s\n", buffer); CL_CHECK(clGetPlatformInfo(platform[i], CL_PLATFORM_EXTENSIONS, 10240, buffer, NULL)); printf(" EXTENSIONS = %s\n", buffer); if (platforms_n == 0) return 1; cl_device_id device[100]; cl_uint devices_n = 0; if(type_==1) clGetDeviceIDs(platform[0], CL_DEVICE_TYPE_GPU, 100, device,&devices_n); else clGetDeviceIDs(platform[1], CL_DEVICE_TYPE_CPU, 100, device,&devices_n); //ifthenelse printf("=== %d OpenCL device(s) found on platform:\n", platforms_n);

6.1.4 OpenCL Devices

In OpenCL many operations are performed with respect to a given context. For example, buffers (1D regions of memory) and images (2D and 3D regions of memory) are allocated as context operations. But there are also device-specific operations: program compilation and kernel execution, for instance, are performed on a per-device basis, and for these a specific device handle is required. We obtain a handle for a device simply by querying the context for it. OpenCL provides the capability to query information about particular objects; using the C++ API this comes in the form object.getInfo<CL_OBJECT_QUERY>(). In the specific case of getting the devices from a context:

for (int i = 0; i < devices_n; i++) { //int i=0;
    char buffer[10240];
    cl_uint buf_uint;
    cl_ulong buf_ulong;
    printf(" -- %d --\n", i);
    CL_CHECK(clGetDeviceInfo(device[i], CL_DEVICE_NAME, sizeof(buffer), buffer, NULL));
    printf(" DEVICE_NAME [%d] = %s\n", i, buffer);
    CL_CHECK(clGetDeviceInfo(device[i], CL_DEVICE_VENDOR, sizeof(buffer), buffer, NULL));
    printf(" DEVICE_VENDOR = %s\n", buffer);
    CL_CHECK(clGetDeviceInfo(device[i], CL_DEVICE_VERSION, sizeof(buffer), buffer, NULL));
    printf(" DEVICE_VERSION = %s\n", buffer);
    CL_CHECK(clGetDeviceInfo(device[i], CL_DRIVER_VERSION, sizeof(buffer), buffer, NULL));
    printf(" DRIVER_VERSION = %s\n", buffer);


    CL_CHECK(clGetDeviceInfo(device[i], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(buf_uint), &buf_uint, NULL));
    printf(" DEVICE_MAX_COMPUTE_UNITS = %u\n", (unsigned int)buf_uint);
    CL_CHECK(clGetDeviceInfo(device[i], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(buf_uint), &buf_uint, NULL));
    printf(" DEVICE_MAX_CLOCK_FREQUENCY = %u\n", (unsigned int)buf_uint);
    CL_CHECK(clGetDeviceInfo(device[i], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(buf_ulong), &buf_ulong, NULL));
    printf(" DEVICE_GLOBAL_MEM_SIZE = %llu\n", (unsigned long long)buf_ulong);
} //for_devices

cl_context context;
context = CL_CHECK_ERR(clCreateContext(NULL, 1, device, &pfn_notify, NULL, &_err));

Once we have the list of devices associated with a context (in this case a single CPU and a GPU device in the HARPA-s1 and s2 servers), we need to load and build the compute program (the program we intend to run on the device or, more generally, devices). The first few lines of the following code simply load the OpenCL device program from disk and convert it to a string, from which a program object is then created:

long lFileSize;
char *kernelstring;

lFileSize = LoadOpenCLKernel("./kernels/kernel_uncertainty.cl", &kernelstring, false);
if (lFileSize < 0L) {
    perror("File read failed");
    return 1;
} //if

A program can have many entry points, called kernels, and to call one a kernel object must be built. There is presumed to exist a straightforward mapping from kernel names, represented as strings, to functions defined with the __kernel attribute in the compute program. In this case we build a kernel object, kernel. Kernel arguments are set with clSetKernelArg (kernel.setArg() in the C++ API), which takes the index and value of the particular argument. The function LoadOpenCLKernel loads the kernel source into memory and is defined as follows:

long LoadOpenCLKernel(char *path, char **buf, bool add_nul)
{
    FILE *fp;
    size_t fsz;
    long off_end;
    int rc;

    /* Open the file */
    fp = fopen(path, "r");
    if (NULL == fp)
        return -1L;


    /* Seek to the end of the file */
    rc = fseek(fp, 0L, SEEK_END);
    if (0 != rc)
        return -1L;

    /* Byte offset to the end of the file (size) */
    if (0 > (off_end = ftell(fp)))
        return -1L;
    fsz = (size_t)off_end;

    /* Allocate a buffer to hold the whole file */
    *buf = (char *) malloc(fsz + (int)add_nul);
    if (NULL == *buf)
        return -1L;

    /* Rewind file pointer to start of file */
    rewind(fp);

    /* Slurp file into buffer */
    if (fsz != fread(*buf, 1, fsz, fp)) {
        free(*buf);
        return -1L;
    }

The part detected as the most computationally demanding with the profiling tool is our kernel program. The kernel, i.e. the part of the code that must run on the accelerator device, is described below (basic version):

__kernel void ocl_uncer(const int M, const int N, const int KK,
                        const __global float* A,
                        const __global float* B,
                        __global float* C)
{
    const int j = get_global_id(0); // m+n-1 --> m:Y
    int k;
    int j1 = j;
    C[j] = 0;
    float acc = 0;
    for (k = 0; k < N; k++) {
        if (j1 >= 0 && j1 < (int)M - N + 1) {
            acc += (A[j1] * B[k]);
        } //if
        j1--;
    } //fork
    C[j] = acc;
} //ori_computation

Before considering the computing devices, we first allocate the OpenCL buffers that hold the inputs and the result of the kernel that will run on the device, i.e. the input values of the runoff convolution. In our case there are two input buffers for pef and q1, named A and B, and an output buffer with the vector of results Q, named C. The input buffers are read-only (CL_MEM_READ_ONLY), while the output buffer has both read and write properties (CL_MEM_READ_WRITE). The buffer size is dynamically allocated and depends on the accuracy of the runoff


computation. The required size is given by the formula N*M*K*sizeof(float); some memory is then simply allocated on the host, and we can request that OpenCL use this memory directly by passing the flag CL_MEM_USE_HOST_PTR when creating the buffer.

cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY,  N*M*K*sizeof(float), NULL, NULL);
cl_mem bufB = clCreateBuffer(context, CL_MEM_READ_ONLY,  N*M*K*sizeof(float), NULL, NULL);
cl_mem bufC = clCreateBuffer(context, CL_MEM_READ_WRITE, N*M*K*sizeof(float), NULL, NULL);

// Copy matrices to the GPU
clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, N*M*K*sizeof(float), A, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0, N*M*K*sizeof(float), B, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, bufC, CL_TRUE, 0, N*M*K*sizeof(float), C, 0, NULL, NULL);

The call to clSetKernelArg [Kronos15] is used to set the argument value for a specific argument of a kernel.

cl_int clSetKernelArg ( cl_kernel kernel, cl_uint arg_index, size_t arg_size, const void *arg_value)

kernel

A valid kernel object.

arg_index

The argument index. Arguments to the kernel are referred to by indices that go from 0 for the leftmost argument to n - 1, where n is the total number of arguments declared by the kernel.

arg_value

A pointer to data that should be used as the argument value for the argument specified by arg_index. The argument data pointed to by arg_value is copied, and the arg_value pointer can therefore be reused by the application after clSetKernelArg returns. The argument value specified is used by all API calls that enqueue the kernel (clEnqueueNDRangeKernel and clEnqueueTask) until it is changed by another call to clSetKernelArg for that kernel.

In our case, we pass the sizes N and M, the input buffers A and B, and finally the output buffer C.

// Configure the uncertainty kernel and set its arguments
cl_kernel kernel = clCreateKernel(program, "ocl_uncer", NULL);
clSetKernelArg(kernel, 0, sizeof(int), (void*)&M);
clSetKernelArg(kernel, 1, sizeof(int), (void*)&N);
clSetKernelArg(kernel, 2, sizeof(int), (void*)&M);


clSetKernelArg(kernel, 3, sizeof(cl_mem), (void*)&bufA);
clSetKernelArg(kernel, 4, sizeof(cl_mem), (void*)&bufB);
clSetKernelArg(kernel, 5, sizeof(cl_mem), (void*)&bufC);

At this point the most important parts of the code are completed and it is possible to compute the result; in particular, the output buffer with the runoff result will be saved in a file, "C.txt". All device computations are issued through a command queue, which is a virtual interface to the device itself. Each command queue has a one-to-one mapping with a given device and is created for the associated context. Given a command queue, kernels can be enqueued using clEnqueueNDRangeKernel (queue.enqueueNDRangeKernel in the C++ API), which enqueues the kernel for execution on the associated device. The kernel can be executed on a 1D, 2D, or 3D domain of indexes that execute in parallel, given enough resources. The total number of elements (indexes) in the launch domain is called the global work size; individual elements are known as work-items. Work-items can be grouped into work-groups when communication between work-items is required. Work-groups are defined by a sub-index range (called the local work size), in the example below TS, describing the size in each dimension corresponding to the dimensions specified for the global launch domain. There is a lot to consider with respect to kernel launches; each work-item reads the two input buffers and computes its convolution result. For our purposes the launch configuration is described as:

const size_t local[3]  = { TS, TS, 0 };
const size_t global[3] = { M, N, 0 };

To complete the program, it is required to define the external entry point for the device program (kernels_uncertainty.cl). The kernel implementation is straightforward: it calculates a unique index as a function of the launch domain using get_global_id(), uses it as an index into the input data, and writes its value to the output array. For robustness, it would make sense to check that the index is not out of range; for now, we assume that the corresponding call to clEnqueueNDRangeKernel is correct.

clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &event);

The clWaitForEvents call waits on the host thread for the commands identified by the event objects to complete.

// Wait for calculations to be finished
clWaitForEvents(1, &event);

The complete code is in the git repository: https://bitbucket.org/tonipat/bosp-uncertaintympi
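The read-back of the result buffer is not shown in the listings above; a minimal sketch using the standard clEnqueueReadBuffer call (variable names taken from the previous listings, buffer size assumed to be the same as for the writes) would be:

// Copy the result matrix C back from the device once the kernel has finished
CL_CHECK(clEnqueueReadBuffer(queue, bufC, CL_TRUE /* blocking read */,
                             0, N*M*K*sizeof(float), C, 0, NULL, NULL));

// Release the OpenCL objects when they are no longer needed
clReleaseMemObject(bufA);
clReleaseMemObject(bufB);
clReleaseMemObject(bufC);
clReleaseKernel(kernel);
clReleaseCommandQueue(queue);
clReleaseContext(context);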


6.1.5 Building and running

On Linux, a single command is enough to build the OpenCL program, for example:

gcc -o uncer_ocl -Ipath-OpenCL-include -Lpath-OpenCL-libdir uncer_ocl.cpp -lOpenCL

The equivalent command with icc from Intel is:

icc -o uncer_ocl -Ipath-OpenCL-include -Lpath-OpenCL-libdir uncer_ocl.cpp -lOpenCL

To run:

LD_LIBRARY_PATH=path-OpenCL-libdir ./uncer_ocl

On Windows, with a Visual Studio command window, an example is:

cl /Feuncer_ocl.exe /Ipath-OpenCL-include uncer_ocl.cpp path-OpenCL-libdir/OpenCL.lib

Assuming that OpenCL.dll is on the path, running

.\uncer_ocl

prints the results to stdout. The correct values are presented in figure 15, where the buffers represent the same values as in the C.txt file.

6.2 gDEBugger tool to debug OpenCL binaries

The tool gDEBugger [Gremedy15] is a sophisticated OpenCL (and OpenGL) debugger, profiler and memory analyzer. The gDEBugger tool lets you trace application activity on top of the OpenGL and OpenCL APIs and see what is happening within the system implementation.

The tool assists the performance optimization of OpenGL and OpenCL applications and saves developers time in locating and discovering OpenGL- and OpenCL-related bugs, which helps to improve application quality and robustness. gDEBugger helps to improve application performance and quality, reduce debugging and profiling time, deploy on multiple platforms, conform with future OpenGL and OpenCL versions, and optimize memory consumption. The following paragraphs explain the basic features of the tool, i.e. the qualities that helped in the debugging and development of the OpenCL code in the HARPA project.


6.2.1 Debugging Runoff model with gDEBugger The gDEBugger main window contains toolbars and views. For a specific debugging or profiling task, we can customize the application to show only the necessary views and toolbars.

Figure 15: gDEBugger Main Frame window

It is not the purpose of this part of the document to describe in detail how gDEBugger works, but only the minimum needed to illustrate the possibilities of the tool for the current OpenCL developments. The GUI is formed of several views, from left to right and top to bottom in figure 15:

• Function Calls History: shows the history of OpenCL function calls.
• Calls Stack: shows the program counter and provides information about which code line is being tracked; specific information about the line being debugged is displayed as in figure 15.
• Debugged Process Events: provides information about the debugged process events. This view is very important, since it provides crucial information about what is happening and why the program crashes, including OpenCL errors that are not visible with the usual gcc or icc debuggers.
• Properties: shows the OpenCL function we are looking at and its input arguments. Together with the previous views, it provides the information the designer needs to figure out what is happening in the code and why it does not run as expected; this information is not easy to extract, and not as visual, with the gcc or icc debuggers (i.e. gdb).
• Performance Graph view: shows the performance workload at runtime, in percentage. It contains counter graphs from gDEBugger, the operating system, and vendor-specific graphics boards (currently NVIDIA, ATI/AMD, S3 Graphics and 3DLabs), for example CPU/GPU idle, graphics memory consumption, vertex and fragment processor utilization, number of function calls per frame, frames per second, etc.


Figure 16: Source Code viewer. Displays the OpenCL line that is being debugged.

In a similar way, it is possible to view the source code of the kernel:

Figure 17: Shaders and kernels source code editor.


6.2.2 Graphic Memory Analysis viewer The Graphic Memory Analysis viewer displays information about graphic memory leaks and graphic memory allocated objects. With the Graphic Objects Application Tree view, you can browse the allocated objects quickly by their render context and type. The objects' details can be viewed in the Graphic Object Details list as well as the Properties view. Use the Object Creation Calls to see the scenario that led to each object's creation. Turning on the "Break on Memory Leaks" option will let you see which allocated objects are not cleared properly by your application. In our case, allows to visualize the input and output buffers.

Figure 18: The textures, buffers and image viewer; in our example it shows the correct output results from the accelerator.


7 Appendix B: Knobs and Monitors: Power and Temperature monitoring (tools).

The servers (harpa-s1 and harpa-s2) have several sensors in the x86-64 cores that allow temperature to be monitored, as well as counters to obtain energy and power. They also have knobs that can be used as mitigation techniques (e.g. DVFS). In a similar way, accelerators such as the GPU have a "System Management Interface" to monitor power and temperature; to the best of our knowledge there is no DVFS knob in these devices. As future work we also checked the features of the Xeon Phi (also named Intel MIC); these devices have resources similar to GPUs to monitor power and temperature, but no possibility to perform DVFS. Table 1 lists the monitors and knobs in the system and the tools used to obtain the information; X means that the feature is not available. After the table, a brief description of each tool is presented.

Monitors      CPU                     GPU          Xeon Phi
Temperature   lm-sensors              nvidia-smi   micinfo
Power         likwid-powermeter       nvidia-smi   micsmc/micras sysfs

Knobs
DVFS          likwid-setFrequencies   X            X

Table 1: Monitoring and knob tools installed in the HPC system.

7.1.1 Power and Energy monitoring in x86-64 cores

Access to the power information of the x86-64 CPU cores is performed through the likwid tool [Likwid15], which accesses the RAPL counters and queries the Turbo mode steps on Intel processors. RAPL (Running Average Power Limit) provides a set of counters with energy and power consumption information. RAPL is not an analog power meter; it rather uses a software power model that estimates energy usage from hardware performance counters and I/O models. Based on our measurements, these estimates match actual power measurements ["Power-Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge", IEEE Micro, March/April 2012].

Example:

CPU name:  Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
CPU type:  Intel Xeon Haswell EN/EP/EX processor
CPU clock: 2.60 GHz
--------------------------------------------------------------------------------
Runtime: 2.00081 s
Measure for socket 0 on CPU 1
Domain PKG:
    Energy consumed: 33.2696 Joules
    Power consumed:  16.628 Watt


Domain PP0:
    Energy consumed: 0 Joules
    Power consumed:  0 Watt
Domain PP1:
    Energy consumed: 0 Joules
    Power consumed:  0 Watt
Domain DRAM:
    Energy consumed: 79.7524 Joules
    Power consumed:  39.86 Watt
Measure for socket 1 on CPU 9
Domain PKG:
    Energy consumed: 30.2323 Joules
    Power consumed:  15.11 Watt
Domain PP0:
    Energy consumed: 0 Joules
    Power consumed:  0 Watt
Domain PP1:
    Energy consumed: 0 Joules
    Power consumed:  0 Watt
Domain DRAM:
    Energy consumed: 53.2334 Joules
    Power consumed:  26.6059 Watt

--------------------------------------------------------------------------------

7.1.2 Temperature in x86-64 cores

Information about the temperature of the x86-64 cores is obtained with the psensor tool. Psensor is a graphical hardware temperature monitor for Linux. It can monitor:

• the temperature of the motherboard and CPU sensors (using lm-sensors),
• the temperature of the NVidia GPUs (using XNVCtrl),
• the temperature of ATI/AMD GPUs (not enabled in Ubuntu PPAs or official distribution repositories, see the instructions [Psensor16] for enabling its support),
• the temperature of the Hard Disk Drives (using hddtemp or libatasmart),
• the rotation speed of the fans (using lm-sensors),
• the CPU usage (since 0.6.2.10 and using Gtop2) [Psensor15].

Example:

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +37.0°C (high = +77.0°C, crit = +87.0°C)
Core 0: +27.0°C (high = +77.0°C, crit = +87.0°C)
Core 1: +30.0°C (high = +77.0°C, crit = +87.0°C)
Core 2: +24.0°C (high = +77.0°C, crit = +87.0°C)
Core 3: +23.0°C (high = +77.0°C, crit = +87.0°C)
Core 4: +26.0°C (high = +77.0°C, crit = +87.0°C)
Core 5: +24.0°C (high = +77.0°C, crit = +87.0°C)
Core 6: +26.0°C (high = +77.0°C, crit = +87.0°C)
Core 7: +25.0°C (high = +77.0°C, crit = +87.0°C)

coretemp-isa-0008
Adapter: ISA adapter
Physical id 1: +32.0°C (high = +77.0°C, crit = +87.0°C)
Core 0: +28.0°C (high = +77.0°C, crit = +87.0°C)
Core 1: +25.0°C (high = +77.0°C, crit = +87.0°C)
Core 2: +29.0°C (high = +77.0°C, crit = +87.0°C)
Core 3: +25.0°C (high = +77.0°C, crit = +87.0°C)
Core 4: +28.0°C (high = +77.0°C, crit = +87.0°C)
Core 5: +27.0°C (high = +77.0°C, crit = +87.0°C)
Core 6: +27.0°C (high = +77.0°C, crit = +87.0°C)
Core 7: +26.0°C (high = +77.0°C, crit = +87.0°C)


7.1.3 Power and Temperature monitoring in NVIDIA GPU

The NVIDIA System Management Interface (nvidia-smi) is a command-line utility, based on top of the NVIDIA Management Library (NVML) [https://developer.nvidia.com/nvidia-management-library-nvml], intended to aid in the management and monitoring of NVIDIA GPU devices. This utility allows administrators to query the GPU device state and, with the appropriate privileges, to modify it. It is targeted at Tesla(TM) and Fermi-based Quadro(TM) devices, though limited support is also available on other NVIDIA GPUs. nvidia-smi ships with the NVIDIA GPU display drivers on Linux, and with 64-bit Windows Server 2008 R2 and Windows 7. nvidia-smi can report query information as XML or human-readable plain text, to either standard output or a file [Nvidia-smi15]. Output example:

nvidia-smi
Thu Feb 18 12:24:17 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.41     Driver Version: 352.41          |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970     Off  | 0000:03:00.0     Off |                  N/A |
| 30%   36C    P0    36W / 151W |     15MiB /  4095MiB |     28%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0.    11097        ./mpiuncertainty                           45MB       |
+-----------------------------------------------------------------------------+