Performance and Energy Monitoring Tools for Modern Processor Architectures
Luís Filipe Mataloto Taniça
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Prof. Pedro Filipe Zeferino Tomás
Prof. Leonel Augusto Pires Seabra de Sousa
Examination Committee
Chairperson: Prof. Nuno Cavaco Gomes Horta
Supervisor: Prof. Pedro Filipe Zeferino Tomás
Members of the Committee: Prof. João Nuno de Oliveira e Silva
April 2014
Acknowledgments
First of all, I would like to thank Professors Leonel Sousa and Pedro Tomás, for the support
and coordination of my work. An additional thanks to Aleksandar Ilić and Frederico Pratas, for
the patience and guidance, and to Diogo Antão, for the smooth partnership. Finally, I would like
to thank all my family and friends, for the support and motivation.
This work was supported by national funds through FCT – Fundação para a Ciência e a
Tecnologia, under the project P2HSC - Stretching the Limits of Parallel Processing on Heterogeneous
Computing Systems under the reference PTDC/EEI-ELC/3152/2012.
Abstract
Accurate on-the-fly characterization of application behavior requires assessing a set of execution-
related parameters at run-time, including performance, power and energy consumption. These
parameters can be obtained by relying on hardware measurement facilities built-in modern multi-
core architectures, such as performance and energy counters. However, current operating systems
do not provide the means to directly obtain these characterization data. Thus, the user needs to
rely on complex custom-built libraries with limited capabilities, which might introduce significant
execution and measurement overheads. In this work, we propose two different tools for efficient
performance, power and energy monitoring of systems with modern multi-core CPUs, that allow
capturing the run-time behavior of a wide range of applications at different system levels: i)
at the user-space level, and ii) at kernel-level, by using the OS scheduler to directly capture
this information. Although the importance of the proposed monitoring facilities is evident for
many purposes, we focus herein on their use for application characterization with the
Cache-aware Roofline model. The experimental results show the capabilities of the proposed
tools to deliver detailed and accurate information about the behavior of real-world applications
on the underlying architectural resources. Moreover, they allow reconstructing and identifying
the execution patterns of the profiled benchmarks from standard suites (SPEC CPU2006), while
introducing negligible overheads.
Keywords
Performance and Power Monitoring, Application Characterization, Multi-core Architectures,
Cache-aware Roofline Model
iii
Resumo
Real-time behavioral characterization of applications requires evaluating, during the execution itself, a set of execution-related parameters, such as performance, power and energy consumption. These parameters can be obtained through hardware mechanisms made available in modern multi-core architectures, such as performance and energy counters. However, current operating systems (OSs) do not provide the means required to obtain these characterization data. Hence, the user needs to resort to complex, custom-built libraries with limited capabilities, which may add significant overhead to execution measurements. In this work, two different techniques are proposed that allow efficient performance and energy monitoring for multi-core architectures. The two proposed monitoring tools allow capturing, in real time, the behavior of a wide range of applications at two distinct levels: i) at the user level, or user-space, and ii) at the system level, or kernel-space, using the OS scheduler as the means to capture this information. Although the importance of the proposed monitoring interfaces is evident for several purposes, a central focus is placed on application characterization according to the Cache-aware Roofline Model. The obtained results demonstrate the capabilities of the proposed tools to provide detailed and accurate information about the behavior of applications on the architectural resources. They also allow reconstructing and identifying patterns in the profile of standard benchmarks (SPEC CPU2006), while introducing negligible overhead.
Keywords
Performance and Energy Monitoring, Application Characterization, Multi-core Architectures, Cache-aware Roofline Model
Contents
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Main contributions
1.4 Dissertation outline
2 Background
2.1 Performance Monitoring Unit
2.1.1 Performance Model-Specific Registers
2.1.2 Performance Monitoring Event Configuration
2.2 Running Average Power Limit
2.3 Linux Kernel Modules
2.4 Performance Monitoring Challenges
2.5 State-of-Art Monitoring Tools
2.6 Cache-Aware Roofline Model
2.7 Summary
3 User-Space Monitoring Tool (SpyMon)
3.1 Architecture and Main Functionalities
3.1.1 Spatial Process Organization
3.1.2 Available Features
3.2 Implementation Details
3.2.1 Linux Kernel Module and Hardware Access Restrictions
3.2.2 Hardware Readings and Configuration
3.2.3 Main Functionality
3.3 Usage
3.3.1 Profiling Mode
3.3.2 Cache-aware Roofline Mode
3.3.3 Information Output
3.4 Summary
4 Scheduler-Based Monitoring Tool (SchedMon)
4.1 Architecture and Main Functionality
4.1.1 SchedMon's Linux Kernel Module
4.1.2 Smon: the user-space tool
4.1.3 Available Features
4.2 Implementation Details
4.2.1 Linux Kernel Module
4.2.2 User-space Tool
4.3 Usage
4.3.1 Adding Events
4.3.2 Defining Event-sets
4.3.3 Application Profiling
4.3.4 Cache-aware Roofline Mode
4.3.5 Information Output
4.4 Summary
5 Experimental Results
5.1 Experimental Environment
5.2 SpyMon Experimental Evaluation
5.2.1 System-wide Profiling
5.2.2 Cache-aware Roofline Model Analysis
5.2.3 Power/Energy Consumption Evaluation
5.3 SchedMon
5.3.1 Application Thread Hierarchy
5.3.2 Scheduling Information
5.3.3 Function Call Tracing
5.3.4 Cache-aware Roofline Model Analysis
5.3.5 Power/Energy Consumption Evaluation
5.4 Overhead Discussion
5.5 Summary
6 Conclusions
6.1 Future Work
List of Figures
2.1 Multi-core CPU architecture
2.2 MSR read and write functionality
2.3 Performance MSRs for Intel's PMU version 3. Figures obtained from [12].
2.4 Energy status MSR layout. Obtained from [12].
2.5 Performance Cache-aware Roofline Model (Intel 3770K)
3.1 Spatial perception of SpyMon while monitoring 5 threads from 3 applications.
3.2 SpyMon's components interaction and disposition in the Operating System (OS) privilege layers.
3.3 SpyMon's data structures for ioctl() communication.
3.4 SpyMon's execution flow.
3.5 Illustration of TBM for 3 defined event-sets.
3.6 SpyMon's usage information.
4.1 SchedMon's components interaction and disposition in the OS privilege layers.
4.2 SchedMon event, event-set and environment structural hierarchy.
4.3 Linux scheduler breakpoints used by SchedMon.
4.4 SchedMon sampling process illustration.
4.5 SchedMon ring-buffer implementation overview.
4.6 Example of a function's dump information.
4.7 SchedMon function call tracing data structures.
4.8 Smon event usage information.
4.9 Smon evset usage information.
4.10 Smon profile usage information.
4.11 Smon roof-run and roof-creat usage information.
5.1 SpyMon performance evaluation of SPEC CPU2006 benchmarks, for a 20ms sampling time interval.
5.2 Power consumption of four benchmarks run separately and simultaneously.
5.3 Evaluation of SPEC CPU2006 benchmarks by using the CARM. The sample time interval was set to 50ms.
5.4 Temporal representation of the CARM for Tonto.
5.5 Application CARM plot showing the floating-point SPEC CPU2006 benchmarks; the application color characterization was made according to average classification (double, SSE or AVX).
5.6 Power evaluation of SPEC CPU2006 benchmarks.
5.7 Power and energy evaluation for different floating-point SPEC CPU2006 benchmarks.
5.8 Thread hierarchy for an FDTD OpenCL application [14].
5.9 Scheduling information for OpenCL application fdtd.
5.10 Function call tracing of an application containing two processes. The child process, after being forked, switches its execution image.
5.11 Milc performance colored according to its function call tracing profile.
5.12 Evaluation of SPEC CPU2006 benchmarks using the CARM.
5.13 Application CARM plot showing the floating-point SPEC CPU2006 benchmarks; the application color characterization was made according to average classification (double, SSE or AVX).
5.14 Power evaluation of SPEC CPU2006 benchmarks.
5.15 Power and energy evaluation for different floating-point SPEC CPU2006 benchmarks.
5.16 Diagram illustrating the performed overhead evaluation tests.
5.17 SpyMon's number of instructions per sample when self-monitoring.
5.18 SchedMon's number of instructions per sample when self-monitoring.
5.19 Overhead of taking a PMU or a RAPL sample in both SpyMon and SchedMon tools.
x
List of Tables
3.1 Sets of PMEs used for performance profiling when using the cache-aware roofline model.
3.2 Sample of hardware performance events provided by SpyMon.
4.1 Available ioctl() requests to SchedMon's driver.
5.1 Median Time Counts for SpyMon self-monitoring.
5.2 Median Time Counts for SchedMon self-monitoring.
List of Acronyms
AVX Advanced Vector Extensions
CARM Cache-aware Roofline Model
CPU Central Processing Unit
DP Double Precision
DRAM Dynamic Random-Access Memory
FP Floating Point
GPU Graphics Processing Unit
LLC Last-Level Cache
LPC Logical Processor Core
MSR Model-Specific Register
ORM Original Roofline Model
OS Operating System
PFC Performance Fixed Counter
PMC Performance Monitoring Counter
PME Performance Monitoring Event
PMSR Performance Monitoring Select Register
PMU Performance Monitoring Unit
PPC Physical Processor Core
RAPL Running Average Power Limit
SSE Streaming SIMD Extensions
TBM Time-Based Multiplexing
TSC Time-Stamp Counter
1 Introduction
Contents
1.1 Motivation
1.2 Objectives
1.3 Main contributions
1.4 Dissertation outline
The constant technological advances in computing systems have led to multi-core architectures,
which contain complex internal mechanisms that are not always easy to understand or analyze.
Following this evolution, adapting and optimizing the execution of real-world applications is of
the utmost importance, in order to fully exploit the potential of the underlying architectures.
This requires a deep understanding of how the underlying infrastructures work and how one can
efficiently explore them. In order to provide insights into the micro-architectural behavior,
Central Processing Unit (CPU) manufacturers already incorporate low-level mechanisms that
provide information about the architecture behavior at application run-time. However, accessing
these mechanisms usually requires the use of complex interfaces and a deep understanding of the
functional principles behind the hardware facilities. The work proposed herein aims at exploring
different ways of exporting the full functionality of these hardware interfaces to the user in an
easy and intuitive way, by proposing several tools for performance and power/energy monitoring
at different levels of parallel processing in modern multi-core systems.
1.1 Motivation
Until recently, a computer's processing power could be increased by using power-hungry
techniques, e.g., by increasing the processor's pipeline depth and therefore its overall frequency.
However, architectural designers experienced great difficulties in sustaining this growth, due to
physical limitations (mainly regarding high power consumption), marking the end of single-core
systems. With the introduction of multi-core processors, they were able to circumvent these issues.
Multi-core processors are typically based on the replication of a number of identical cores in a
single die, where each core includes a set of private coherent caches and usually a hardware support
for multiple thread execution. The cores usually share a common higher level memory organiza-
tion, typically containing the Last-Level Cache (LLC) and the main memory. Even though these
techniques have made it possible to increase the processing power, they still present major challenges. For
instance, the widening gap between processor and memory speeds has caused processors to spend
most of their time waiting for memory data, making frequency increases ineffective. Furthermore,
higher frequencies require deeper pipelines, which makes the design and verification of already
complex processors even more challenging. From a software perspective, the ability to explore the
full performance of multiple execution cores in a single computer has proven to be difficult and,
thus, it has become indispensable for application developers to characterize and understand such
complex systems.
Hardware Performance Monitoring Units (PMUs), available in most modern processors, give
developers the ability to analyze system performance and potential execution bottlenecks. By
using several registers, often called Performance Monitoring Counters (PMCs), PMUs support the
counting or sampling of several micro-architectural events [12]. Moreover, recent architectures also
provide a similar interface for monitoring energy consumption in several architectural components.
In Intel’s architectures this interface is called Running Average Power Limit (RAPL) [12].
In order to make use of the referred performance and power interfaces, several methods have
been developed in recent years, in the form of different libraries and tools that facilitate
the access to those facilities. However, developing an accurate tool for performance and power
consumption monitoring with low overheads is not an easy task. Moreover, such tools need to
provide a simple and intuitive interface, in contrast to the common approaches in the literature,
which expose most of their functionality through complex interfaces that are sometimes hard to
use by the common user.
1.2 Objectives
Although there are several profiling tools available that allow obtaining performance or power
consumption information, only a few provide both functionalities in a single interface. In
addition, even if a full performance configuration is provided, the choice of the proper
performance events to monitor is not always trivial, nor is the proper way of evaluating them in
order to obtain a complete overview of the attainable application performance on the underlying
architecture resources. Finally, the ability to provide the full performance and power consumption
evaluation must be passed to the end-user as an easy-to-use interface. However, some of the most
powerful state-of-the-art performance interfaces are too complex [17] or not fully documented,
which hampers their usage.
According to the above needs regarding the full performance and power consumption evaluation
of applications on modern architectures, the main objectives of the herein presented work include:
• The integration of both performance and power consumption evaluation in a single interface;
• The research for efficient novel approaches that allow the complete evaluation of the
performance behavior of one or several applications on modern multi-core architectures;
• To provide a full performance and power consumption evaluation of a set of standard bench-
marks, thus allowing the analysis of their behavior in different scenarios and providing the
ability to detect possible execution bottlenecks in a modern multi-core architecture;
• The translation of the full capabilities of the hardware performance and power monitoring
resources to the end-user in an easy and intuitive interface.
1.3 Main contributions
The main contributions of the work developed through this thesis correspond to the proposed
monitoring tools:
• SpyMon - A user-space tool that aims at system-wide performance analysis. The main
functional principle behind this tool relies on spawning a monitoring process for each processor
core, which handles the profiling operations for that core. The tool integrates both performance
and power consumption monitoring, and is provided to the end-user via a simple-to-use interface.
It has proven able to provide a full system evaluation, even if several tasks are running
simultaneously. Although it has shown a significant increase in power consumption when profiling,
the tool does not introduce high performance overhead.
• SchedMon - The second tool follows a completely different approach: its core functionality
is implemented in kernel-space, by using a Linux device driver [6]. The tool makes use of the
OS internal scheduling events in order to detect context switches and to obtain more accurate
results. Similarly to SpyMon, it provides all its functionality to the end-user through an
intuitive and easy-to-use command-line interface. In addition, run-time evaluation is possible
by means of a provided user-space library, which exports the kernel-space core functionality
to user-space programs through a set of simple calls. This approach has shown some improvements
in terms of imposed overheads. Moreover, it provides additional functionalities that improve
application analysis, such as the ability to reconstruct the scheduling route of multi-threaded
applications and to assign distinct performance behaviors to specific parts of the application's
code.
Both of the herein proposed tools have shown low interference with the performance of the
monitored applications. In addition, a full performance and power analysis of a set of standard
SPEC CPU2006 [10] benchmarks is provided, which relies on the Cache-aware Roofline Model (CARM)
in order to provide a broader perspective of the attainable application performance on the
underlying multi-core architecture. These benchmarks are widely referenced and used, and there
is currently no detailed information about their performance and power/energy consumption.
Part of this work has been already published at an international conference:
• [3] Diogo Antão, Luís Taniça, Aleksandar Ilić, Frederico Pratas, Pedro Tomás and Leonel
Sousa, “Monitoring Performance and Power for Application Characterization with Cache-aware
Roofline Model”, in International Conference on Parallel Processing and Applied Mathematics
(PPAM 2013), Springer, Warsaw, Poland, September 2013.
1.4 Dissertation outline
The remainder of this dissertation is organized as follows. Chapter 2 addresses the background
information required to understand the herein proposed work. First, a general overview of a
modern computer architecture is given, which covers not only the basic description of the
available performance and power/energy monitoring infrastructures, but also how one can
configure them in order to obtain meaningful information. Since both of the herein proposed
tools interact with the Linux kernel, a brief overview of Linux kernel concepts is also provided.
Then, an overview of the most common monitoring challenges and the available state-of-the-art
tools is given. At last, a brief description of the CARM is made, since it is involved in one
of the core functionalities herein provided. Chapter 3 introduces a new, simple-to-use,
system-wide monitoring tool, which provides the means to perform a full system performance and
power consumption analysis, and is mostly implemented in user-space. An overview of the tool's
functionalities is given, as well as of the main implementation aspects that are important for
understanding the tool. In the end, the tool's usage information is provided. In a similar way
to Chapter 3, a second monitoring tool is introduced in Chapter 4. This tool follows a
different approach from the previous one, as it is mostly implemented in kernel-space. After a
complete overview of the tool's capabilities, a detailed description of its internal mechanisms
and usage is given. Chapter 5 illustrates the potential of both tools by means of experimental
results. This chapter shows a performance and power/energy consumption evaluation of several
standard benchmarks, relying on the CARM. In addition to exploring the full functionality of
both tools, a comparison between them is also made, including an overhead evaluation. Finally,
in Chapter 6, conclusions about the presented work are drawn, together with several suggestions
for future research work.
2 Background
Contents
2.1 Performance Monitoring Unit
2.2 Running Average Power Limit
2.3 Linux Kernel Modules
2.4 Performance Monitoring Challenges
2.5 State-of-Art Monitoring Tools
2.6 Cache-Aware Roofline Model
2.7 Summary
Modern computing systems have become complex heterogeneous platforms capable of sustaining
high computing power. In the past, designers were able to improve processing performance
by applying power-hungry techniques, e.g., by increasing the pipeline depth and, consequently,
the overall working frequency. However, such techniques have become unsustainable due to the
well-known power wall. To overcome this issue, while continuing to improve processing
performance, processor manufacturers turned to multi-core designs, replicating a number of
typically identical cores on a single die, where each core includes a set of private coherent
caches and dedicated execution engines, and in some cases hardware support for multiple threads.
Although these solutions are able to provide extra processing power, they also introduce
additional complexity into the design, making it harder for application developers to fully
exploit the available processing power. In particular, all cores share access to a common
higher-level memory organization, typically containing the last-level cache and the main memory.
This may, however, result in resource contention, which can drastically affect execution
efficiency.
Figure 2.1 shows an example of a modern multi-core CPU architecture composed of two
Physical Processor Cores (PPCs), each supporting the simultaneous execution of two threads
(multi-threading). As such, each PPC is divided into two Logical Processor Cores (LPCs), one
for each thread. Thus, each LPC contains a set of registers of its own, e.g., instruction pointer,
stack pointer, general registers and Model-Specific Registers (MSRs). Both LPCs in the same PPC
share the execution resources (e.g., ALU) and the first two levels of cache, which might increase
the contention on these resources. Furthermore, all the LPCs share a last-level on-chip (L3)
cache and the off-chip Dynamic Random-Access Memory (DRAM).
Figure 2.1: Multi-core CPU architecture
In order to characterize and understand the behavior of such complex computational systems,
we require accurate real-time monitoring facilities. These allow, for example, identifying application
and architectural efficiency bottlenecks for real-case scenarios, thus giving both the programmer
and the computer architect hints on potential optimization targets. The following sections describe
the concepts and hardware resources available in modern architectures that allow real-time moni-
toring and that are relevant for a better understanding of the herein presented work. Section 2.1
describes the architectural interface that allows to extract performance information at run-time.
Next, in section 2.2, a similar interface is presented, which aims at providing run-time information
about the system’s energy status. Since the herein presented work presents tools targeting perfor-
mance information extraction, section 2.4 describes the major present challenges when monitoring
performance. Further on, in section 2.5 a quick overview of the most referenced state-of-the-art
monitoring tools is made. At last, in section 2.6, an introduction to the Cache-aware Roofline
Model [11] is made, which is a major requirement to understand the main contributions of the
presented work.
2.1 Performance Monitoring Unit
The hardware Performance Monitoring Unit (PMU) is an architectural interface, available in
most modern Intel processors since Intel’s Pentium processor [12]. It gives developers the ability
to analyze system performance and potential bottlenecks. This unit is composed by a small set of
MSRs, which are hardware control registers. These registers can be configured to monitor specific
architectural Performance Monitoring Events (PMEs), such as clock cycles, retired instructions,
branch miss-predictions and cache misses.
The following subsections describe in detail different types of MSRs used by the PMU and how
to configure them to monitor specific PMEs. Although the provided information is based on Intel’s
architectural performance monitoring facilities [12], similar mechanisms exist in processors
from other vendors, such as AMD, IBM (PowerPC) and ARM.
2.1.1 Performance Model-Specific Registers
The PMU is composed of two main types of MSRs: Performance Monitoring Select Registers
(PMSRs) and Performance Monitoring Counters (PMCs). PMSRs are used for configuring the
events to monitor (count) in each PMC. Thus, PMSRs and PMCs work in pairs, which means that if one writes an event configuration word into PMSRx, the corresponding event counts will be reported in PMCx. The number of available register pairs is usually small (e.g., 4 per logical CPU in Intel Ivy Bridge), which limits the number of events that can be monitored at a time.
Later versions of the PMU provide additional functionality by adding more MSRs to the facility. These include Performance Fixed Counters (PFCs), easier monitoring control (easy toggle and overflow status) and some extra MSRs for off-core event configuration. PFCs have a functionality similar to that of PMCs. The main difference is that one cannot configure which architectural events a PFC should count: PFC events are predefined by the architecture and can
void WriteMSR(uint32_t msr_id, uint32_t d, uint32_t a) {   /* EDX:EAX -> MSR[ECX] */
    __asm__ ("wrmsr" : : "c"(msr_id), "a"(a), "d"(d));
}
void ReadMSR(uint32_t msr_id, uint32_t *d, uint32_t *a) {  /* MSR[ECX] -> EDX:EAX */
    __asm__ ("rdmsr" : "=a"(*a), "=d"(*d) : "c"(msr_id));
}
Figure 2.2: MSR read and write functionality
only be enabled.
2.1.2 Performance Monitoring Event Configuration
Configuring and reading performance MSRs can be achieved by using special assembly instruc-
tions (figure 2.2), namely: wrmsr, that allows writing the contents of the general purpose registers
EDX:EAX into the MSR specified by ECX; or the rdmsr assembly instruction, which allows reading
the MSR specified by ECX, into the EDX:EAX general purpose registers. Since the MSRs are 64 bits
long, we need to use two 32-bit general registers for holding the configuration word or the result.
As already mentioned, there are two types of performance counters: general-purpose (PMCs) and fixed (PFCs). The configuration of a PMC is done by writing the adequate word into its corresponding PMSR. The configuration words are architecture-dependent and should be consulted in the respective manual.
Figure 2.3(a) illustrates the bit field layout of a PMSR. The 16 least significant bits, event
select and unit mask, are meant for choosing the event to monitor. The event select bit field
selects the event logic unit (e.g., retired instructions) and the unit mask specifies the condition
that the selected event unit detects (e.g., retired store instructions). The unit mask values are
specific to each event logic unit. It is also possible to define at which privilege levels one wants
the selected event to count. This is done by using bits 16 (user mode) and 17 (OS mode).
When the user mode bit is set, the selected event only counts when the processor operates at
privilege levels 1, 2 or 3. In the same way, OS mode enables counting at privilege level 0. It is
mandatory to enable at least one of these modes and both of them can be set at the same time.
By default, the configured event only counts for the current LPC. However, measuring the whole PPC is possible by setting the any thread bit flag. If the APIC interrupt enable bit flag is set, an interrupt is raised every time the corresponding PMC overflows, which comes in handy for defining sampling intervals. Performance counting is enabled in the corresponding PMC by setting the enable bit flag (bit 22). More detailed information on this subject can be found in [12].
Intel's PMU version 3 (present, e.g., in the Sandy Bridge and Ivy Bridge micro-architectures) provides three PFCs, whose configuration is done using a single MSR, as described in Figure 2.3(b).
(a) Performance Monitoring Select Register (PMSR)
(b) Performance Fixed Counter (PFC) control register
Figure 2.3: Performance MSRs for Intel’s PMU version 3. Figures obtained from [12].
As already mentioned, these registers can only be toggled and count only predefined architectural performance events. In the current context: PFC0 counts the number of retired instructions; PFC1 counts core clock cycles while the clock signal on the corresponding core is running; PFC2 counts reference clock cycles while the clock signal on the corresponding core is running. The reference clock operates at a fixed frequency, irrespective of core frequency changes. The few configurations available for this type of counter (privilege level selection, any thread flag and toggle flag) work in the same way as already described for PMCs. Overflow interrupts are available as well.
2.2 Running Average Power Limit
On Intel architectures, the PMU does not provide energy information or power metering. In order to make this information available, Intel introduced the RAPL energy status interface in its most recent platforms. Energy status is a power metering interface comprising non-architectural MSRs. Using the set of registers that compose this interface, it is possible to extract energy consumption information in real-time for different domains, i.e., different regions of the processor die.
Figure 2.4: Energy status MSR layout. Obtained from [12].

The domains present in a platform may vary across product segments. Platforms targeting the client segment feature power metering support for package, PP0 and PP1. The package
domain includes the whole processor die, which means that one can obtain the power consumption of the entire chip in real-time. PP0 refers to the cores inside the chip, giving more detailed information on which parts of the processor die consume the most. Intel's manual [12] does not specify the exact target of PP1; the only given information is that it may refer to off-core devices, i.e., parts of the die that are not cores. Platforms targeting the server segment also provide package and PP0 support; however, the PP1 domain is replaced by DRAM. Although it is not described in detail what the DRAM domain really covers, it is likely to target the part of the die that connects and communicates with the computer's main memory.
Figure 2.4 represents an energy status counter register layout. These counters accumulate the consumed energy in real-time, and Intel provides one for each of the previously referred domains. They are updated approximately every millisecond and have a wraparound time of about 60 seconds.
Energy-related information (in Joules) is based on the multiplier 1/2^ESU, where ESU (energy status units) is an unsigned integer. This value can be obtained by reading bits 8 through 12 of the MSR_RAPL_POWER_UNIT register. Its default value is 10000b (i.e., 16), indicating that the energy status unit is in increments of 1/2^16 ≈ 15.3 micro-Joules.
All the registers comprising the energy status interface are read-only and can only be accessed
from privilege level 0. In Linux systems, this means the user needs to create a kernel module to
access these registers, or use any of the already available tools that provide an interface to these
registers.
2.3 Linux Kernel Modules
The hardware facilities described in Sections 2.1 and 2.2 may require special privilege permissions in order to be handled. Although PMU readings may be performed from user-space, configuring PMCs must be done from privilege level 0. RAPL energy status MSRs cannot be written at all, and reading them also requires special permissions.
In Linux systems, there are only two different permission levels: i) user-space, which comprises hardware privilege levels 1, 2 and 3; and ii) kernel-space, which operates at privilege level 0. Therefore, in order to obtain the required permissions for handling the performance and/or power monitoring infrastructures, software interfaces must contain some component that runs on the kernel-space side. Running code in the Linux kernel can be done in two ways:
• Change the kernel source - since Linux is distributed under an open-source license, it is possible to access its source code and modify it according to our needs. Therefore, changing the Linux source code is one way of being able to run code at privilege level 0. This implies, however, recompiling and re-installing the OS, which is not very practical, especially when the product is targeted at third parties.
• Linux kernel modules - a kernel module is a piece of code that, with the right permissions, can be integrated into the Linux kernel at run-time, thus becoming a part of the OS's core and running at privilege level 0. This is a simpler and more elegant way of inserting code into the Linux kernel, and it does not require the OS to be recompiled and re-installed.
The vast majority of Linux kernel modules are designated as device drivers, whether or not they are attached to a physical device [6]. The tools proposed herein make use of kernel modules which, although not connected to any peripheral device, can be logically seen as a way to access the physical hardware resources for performance and power/energy consumption monitoring; it is therefore reasonable to call them drivers.
Linux Device Driver
In Linux operating systems everything is "seen" as a file, including hardware devices, thus standardizing the communication with any physical device to be handled as a regular file. Linux device drivers are the mechanism that makes communication with a device possible, by allowing the predefined operations over the target device file (e.g., read, write, open or close) to be redefined. Both tools presented herein make use of a Linux device driver in order to overcome the hardware privilege restrictions of the performance and power/energy monitoring facilities.
2.4 Performance Monitoring Challenges
The previous sections provided an overview of the currently available performance monitoring structures. As simple as they might seem at first sight, these facilities are usually too complex for the common user. In order to make proper use of them, deep knowledge of the underlying architecture and operating system is required. Therefore, making use of these facilities for dynamic optimization purposes has proven to be challenging for a number of reasons:
• Limited Hardware Resources - The number of available PMCs is typically very small (e.g., up to 4 per logical CPU in Intel Ivy Bridge processors), which limits the number of low-level hardware events that can be measured simultaneously at any given time. Detecting performance bottlenecks in complex superscalar microprocessors often requires a broader analysis covering several architectural components and, therefore, more than 4 events. For offline analysis, one could run the same application several times while measuring different hardware events in each run. However, merging the information from several runs is not straightforward, because there might be asynchronous events (e.g., interrupts and IO events), and other architectural elements (e.g., the branch predictor) might create differences from run to run depending on the current processor state. Among the several techniques that can be used to overcome this limitation, the most common is event multiplexing, which consists of switching the configuration of the PMCs regularly, at short time intervals, thus virtually extending the number of monitored events.
• Complex Interface - The events measured by PMCs are often low-level and specific to a micro-architecture implementation. For this reason, it becomes difficult for the end-user to interpret the obtained counter readings without detailed information on the architecture specifications. Hence, it is hard to translate the counts of hardware events into their actual impact on the end performance.
• High Overhead - Since PMU resources are shared among all processes, they can only be programmed in supervisor mode. Thus, whenever a process needs to configure or change the events being monitored, it has to communicate with the underlying operating system. These expensive communications may happen very frequently, which leads to substantial overhead.
2.5 State-of-Art Monitoring Tools
There are many options in the literature that provide access to hardware performance counters.
In the case of Linux, one of the earliest was the perfctr patch [15] for x86 processors. Perfctr
provided a low latency memory-mapped interface to virtualized 64-bit counters on a per-process or
per-thread basis. Later on, the perfmon [2] interface was submitted to the kernel. When it became
apparent that perfctr would not be accepted into the Linux kernel, perfmon was rewritten and
generalized as perfmon2 [13] to support a wide range of processors under Linux. After a continuing
effort over several years by the performance community to get perfmon2 accepted into the Linux
kernel, it too was rejected and supplanted by yet another abstraction of the hardware counters,
first called perf_counters in kernel 2.6.31 and then perf_events [17] in kernel 2.6.32.
Perf_events is included in the Linux kernel, which makes it the preferable choice over the other available interfaces. The interface is built around file descriptors, allocated using the newly introduced system call sys_perf_event_open(). This system call returns a file descriptor representing a virtual performance counter. Events are specified at open time by using an elaborate
perf_event_attr structure, which contains more than 40 fields that can interact in complex ways. PMCs are enabled or disabled via ioctl() calls, and their values can be read using a call to read(). Sampling can be enabled to periodically read the counters and write the values to a circular buffer, which must be allocated with an mmap() call. Signals are sent to the process holding the referred file descriptors when new data is available.
Although perf_events has proven to be a quite powerful interface, it might be too complex for the common user. Moreover, it does not provide access to the RAPL interface: if power is to be monitored alongside performance, a different interface has to be used.
PAPI [4] is one of the available tools that uses perf_events. Its objective is to be highly
portable by reusing the available OS performance interfaces, while allowing the inclusion of plug-
ins to read other counters, such as those provided by NVIDIA Graphics Processing Units (GPUs).
PAPI provides two interfaces to the underlying counter hardware: a simple, high-level interface
and a fully-programmable low-level interface. The high-level interface only provides functions for
starting, stopping and reading the counters. The low-level interface provides much more manage-
ability and control over the available resources. Event multiplexing, multi-thread support, user
callbacks on threshold and statistical profiling are some of the available functionalities. Recent
versions of PAPI also include the possibility to measure power/energy consumption [18]. On the other hand, if deep control over the available performance resources is needed, PAPI might not be the best option, since it does not provide direct access to the performance unit but virtualizes it instead.
If one is interested in quick binary profiling, without having to write code for it, Perf [1] might be a preferable choice. This is a profiling Linux command-line tool and one of the most referenced. It can be seen as an abstraction over the perf_events interface that is much more accessible to the common user. Perf provides a set of commands which allow not only profiling applications but also reporting the results in a user-friendly way. It provides support for multi-threaded applications, event multiplexing and statistical profiling, among others. A processor-wide mode is also available, allowing the user to profile not a single application but the system itself. However, this tool lacks the possibility of power profiling, which forces the search for other tools when energy information is a requirement.
Yet another well-known resource is OProfile [5], which is composed of a Linux kernel driver, a daemon and a perf-like command-line tool. OProfile's kernel driver abstracts the performance hardware registers and dumps the sampling information at regular intervals. The daemon can be started and stopped by the user, and is responsible for consuming the profiling information provided by the kernel driver and saving it in OProfile's sampling database. This database can later be accessed by the user to extract useful profiling information using the available command-line tools, like opreport. Although this tool appears to be complete in terms of performance, it still lacks the functionality of providing energy status information.
There are several other profiling tools available, like Intel VTune Performance Analyzer [8],
LIKWID [16] or LIMIT [7]. The choice of the right tool is not always trivial and mostly depends on the user's needs. For instance, one may require higher abstraction, lower overhead, higher control or more information detail.
The work described herein proposes two distinct monitoring tools: one implemented in user-space, which provides a system-wide analysis, and another one, mostly implemented in kernel-space, which targets application monitoring. Both proposed tools comprise most of the state-of-art functionalities and, in addition, the ability to assess power/energy information at run-time alongside performance. All the functionality of the tools is exposed through an easy-to-use command-line interface, thus facilitating the usage of the underlying hardware performance and power facilities. Moreover, a predefined performance configuration is provided, which outputs the extracted profiling information into a single plot using the CARM [11], thus providing an easier yet broader perspective of the underlying architecture and the application's attainable performance.
2.6 Cache-Aware Roofline Model
As previously mentioned, to improve performance, modern multi-core architectures replicate several processing cores on a single die. Each core has its own private set of caches (L1, L2), while access to the other memory levels (L3, DRAM) is shared among the cores.
Since data accesses and computation operations are performed in parallel, execution is limited either by the in-core computation resources or by the memory subsystem capabilities. For instance, if an application contains many memory operations and only a small amount of computation over that data, the memory subsystem will stall the execution and, therefore, the in-core computation resources will not reach their peak performance. Based on this observation, the Original Roofline Model (ORM) [19] shows the attainable performance of a multi-core architecture by relating its peak Floating-Point (FP) performance Fp (in flops/s) with the theoretical bandwidth of a single memory level, usually DRAM (in DRAM bytes/s). However, since the memory is composed of several hierarchical levels, this model cannot fully describe the behavior of modern applications and architectures by simply analyzing each individual level.
In practice, accesses to different memory levels cannot be decoupled, since data must traverse the whole memory hierarchy before in-core computations are performed. The recently proposed Cache-aware Roofline Model (CARM) [11] considers these effects and the complete memory hierarchy. Thus, it models the performance upper-bounds of multi-core architectures taking the different memory levels into account, in a single plot. To achieve this, the CARM considers performance, F(φ), and bandwidth, B(β), as continuous functions of the performed flops φ and the bytes β transferred at the different memory levels. The CARM, in contrast to the ORM, perceives information in a centralized way, i.e., from the point of view of the core, thus allowing the information to be normalized. As a result, in the CARM, the operational intensity (I in flops/byte) is uniquely
Figure 2.5: Performance Cache-aware Roofline Model for a quad-core Intel 3770K (Ivy Bridge) processor, plotting performance [Gflops/s] against operational intensity [flops/byte], with rooflines given by the AVX MAD peak performance, the ADD/MUL performance, and the peak L1→C, L2→C, L3→C and DRAM→C bandwidths.
defined and the attainable performance Fa(I) of the architecture is expressed as follows:

Fa(I) = φ/T = min{B(β) × I, F(φ)},   T = max{β/B(β), φ/F(φ)},   I = φ/β.   (2.1)
Equation (2.1) states that Fa(I) is limited either by the memory bandwidth or by the in-core
performance. Indeed, since memory transfers and computations overlap, the overall execution is
dominated either by the time to transfer the data, β/B(β), or by the computation time, φ/F (φ).
Figure 2.5 illustrates the CARM for a quad-core Intel 3770K processor. As can be observed, Fa(I) is bounded by the peak FP performance (Fp) in the compute-bound region, and by the theoretical peak bandwidth of the memory level closest to the core, BL1→C, in the memory-bound region. The model's ridge point corresponds to the minimum operational intensity I required to achieve maximum performance, where computations and memory operations completely overlap. Furthermore, Fa(I) can also vary according to the characteristics of the computing units, i.e., the MAD, MUL or ADD units, and with the available memory bandwidth from the different cache levels to the core (BL2→C, BL3→C and BDRAM→C), thus creating different boundaries.
Since the CARM considers all memory operations, including accesses to the different cache
levels, it results in a single-plot model that reveals the area previously uncovered by the ORM [19].
Furthermore, these differences are also reflected in: i) how the model is constructed; ii) how it is
interpreted; and iii) the given guidelines when optimizing applications [11].
2.7 Summary
This chapter describes the main concepts regarding the hardware and software infrastructures that are relevant for a complete understanding of the presented work. An overview of a modern multi-core CPU architecture is made, introducing the concepts of physical and logical processor cores and illustrating the memory resource hierarchy. The performance and power/energy hardware monitoring facilities are explained in detail, by illustrating their most relevant structures and how to configure and access them. Further on, an overview of the state-of-art profiling tools is presented, providing a broader perspective on the most commonly provided performance and power functionalities. Finally, the CARM performance evaluation model is described, since it is considered one of the most valuable features of the proposed tools. This model provides a deep architectural performance analysis and makes it easy to identify possible hardware and/or software bottlenecks. Gathering the most common state-of-art functionalities and providing them in an easy-to-use interface is one of the main goals of the presented work. Moreover, the proposed tools provide both performance and power/energy consumption information in a single interface, and allow performance execution results to be output into the CARM.
3 User-Space Monitoring Tool (SpyMon)

Contents
3.1 Architecture and Main Functionalities . . . . . . . . 20
3.2 Implementation Details . . . . . . . . 24
3.3 Usage . . . . . . . . 29
3.4 Summary . . . . . . . . 32
The main requirements when performing a performance analysis are: i) full control of the monitored target (e.g., an application, a CPU core or even the whole system); ii) the possibility to configure and select the necessary set of performance events to be monitored; and iii) information with fine time granularity. In addition to these requirements, and to obtain a more complete picture of the system, providing energy/power consumption information is also a valuable feature.
This chapter proposes a new tool (SpyMon) for system-wide monitoring. In the first section, an overview of the tool's internal structure and design is presented, as well as the features and benefits to the end-user. Section 3.2 describes the implementation details, i.e., how SpyMon makes use of the underlying performance and power monitoring facilities in order to provide a simple-to-use interface. Finally, Section 3.3 fully covers the tool's usage and possible configurations.
3.1 Architecture and Main Functionalities
SpyMon's main goal is to provide a portable tool with an intuitive interface for the end-user, without relying on the underlying OS's monitoring facilities. Hence, most of SpyMon's implementation lies in user-space, so as not to interfere with or depend on the running system. SpyMon takes a core-oriented approach, monitoring the behavior of each Logical Processor Core (LPC) and thereby capturing the information of all running applications. As a result, SpyMon allows monitoring the whole system, regardless of what is running at a given time instant on each LPC. This means that even if an application migrates to another core, launches new threads, or its execution is constrained by contention caused by other running applications, SpyMon is able to capture it all.
3.1.1 Spatial Process Organization
The tool proposed herein is composed of a monitor and several spies. The monitor is the main process of the tool. It is responsible for handling the energy status information and controlling the whole execution flow (e.g., the user interface, monitored applications and configuration). The spies are lightweight processes, each attached to a predefined LPC, whose purpose is to configure and fetch the performance counter readings, thereby producing performance information samples.
As previously mentioned, each LPC contains its own set of performance monitoring facilities (PMU). A single process (or thread) can only access the PMU of the LPC it is currently running on; it cannot access the PMU of a different LPC. Since performance information must be gathered from different LPCs simultaneously, the proposed tool runs a performance monitoring process on each target LPC.
The typical SpyMon configuration is to launch a spy to monitor the performance of each available
LPC and to pin the monitor to the last one, as shown in Figure 3.1. In the illustrated example,
Figure 3.1: Spatial perception of SpyMon while monitoring 5 threads from 3 applications on a quad-core CPU with 8 LPCs (private L1/L2 caches per physical core and a shared L3): one spy is pinned to each LPC and the monitor to the last one.
the monitor forks 8 new processes (spies) and pins each of them to a different LPC. By default, the monitor process is pinned to the last LPC, but different configurations are possible, as will be described in the next sections. The spies are responsible for handling the communication with the PMU, in order to output the obtained PMU samples. Since this work relies on facilities for monitoring energy consumption at the level of the whole chip (RAPL), the monitor is responsible for the communication with these facilities and, therefore, for reading the energy status information. Assigning this job to each spy would only introduce additional overhead, since the same values would be read for every LPC.
3.1.2 Available Features
To facilitate the usability of the tool, SpyMon provides a command-line interface, making all its functionality available to the user in an easy-to-use set of commands. The tool also includes a set of predefined performance events, which makes it possible to run a performance analysis on the system without the need to consult the manufacturer's manual. However, it is also possible to manually extend this set, by defining different raw events before starting the tool. SpyMon provides total control over the hardware PFCs, allowing each of them to be enabled or disabled individually. In cases where more PMEs need to be monitored than there are available PMCs, event multiplexing is applied. The ability to choose which LPCs to monitor is also provided, which lowers the overhead when certain cores do not need to be monitored. When energy consumption monitoring is enabled, the reported values always refer to the whole chip and/or different power planes within the chip. A sampling mode is also available, which allows profiling the application at time intervals of finer granularity, thus providing a more precise performance analysis.
What to Monitor
Before monitoring starts, it is first required to specify the objective and the LPCs to be monitored. SpyMon provides the ability to define different monitoring targets:
• System-wide monitoring - The common scenario is to monitor the whole system, relying on either performance or energy consumption (or both). When executed with a configuration similar to the one depicted in Figure 3.1, alongside the required PME configuration, SpyMon provides the ability for a full-system performance evaluation.
• Targeted-cores monitoring - The tool allows selecting a specific set of target LPCs to monitor, as well as rearranging the spatial process organization. This reduces the tool's interference with the system's performance in cases where only a specific set of LPCs needs performance monitoring.
• Application monitoring - If monitoring of specific applications is required, the tool allows keeping track only of the LPCs where those applications run. In this particular case, the SpyMon user retains complete control over where each application runs, in order to ease the interpretation of the monitoring results.
Event Selection
After deciding the set of applications and LPCs for monitoring, the hardware performance events need to be configured. In SpyMon, PMEs are always configured in batches (event-sets). For instance, if the architecture provides 4 PMCs, then 4 PMEs can be configured at the same time, constituting a single event-set. Since there might be restrictions when configuring hardware events, it is very important to take these into account when configuring the PMU. For example, the INST_RETIRED.ALL event can only be counted in PMC1 [12]. While most state-of-art tools provide event scheduling that takes these restrictions into account, SpyMon configures the PMCs in the same order as they are provided by the user, in order to reduce the overheads imposed by increased code complexity. In fact, in modern multi-core architectures the number of PMC-related restrictions is small. In this respect, SpyMon provides PMC restriction information, when applicable, and it is the end-user's responsibility to ensure correct event ordering.
As previously mentioned, SpyMon also provides a set of predefined hardware events to facilitate configuration for the common user, thus allowing a full analysis of the system's performance without the need to consult the manufacturer's manual. Moreover, the interface also provides the possibility of setting different architecture-specific PMEs, in addition to the predefined ones.
SpyMon also provides a very flexible interface for handling performance fixed counters (PFCs). As mentioned before, PFCs work in a similar way to PMCs, yet without the possibility of configuring which hardware events to monitor. SpyMon provides a simple interface that allows each individual PFC to be enabled or disabled. Moreover, it is also possible to configure at which privilege levels to count (user or OS) [12].
3.1 Architecture and Main Functionalities
Event Multiplexing
As previously mentioned, one of the biggest limiting factors for accurate performance analysis in today's general-purpose processor architectures lies in the small number of available PMCs (usually 4 for Intel and up to 6 for AMD architectures). In fact, given the complexity of today's computer systems, this number is usually not sufficient for a full performance evaluation, and therefore event multiplexing must be applied.
In order to monitor more events than those physically provided by the PMU, SpyMon multiplexes the PMEs in time (Time-Based Multiplexing (TBM)), thus virtually expanding the number of available PMCs. PMEs are grouped in event-sets, in the same order as in the user's event configuration. TBM is then performed by switching the currently configured event-set with another one, in a round-robin manner and at regular time intervals. The exact methodology applied for TBM is explained in detail in the following text. However, it should be noticed that a large number of event-sets also implies a higher error in the event count estimation, since different event-sets refer to different time intervals, i.e., to different parts of the application's execution.
Sampling
Sampling refers to the process of extracting performance information at regular intervals, thus providing the ability to capture the behavior of the underlying system at run-time. SpyMon allows defining a sampling time interval, which is used as the duration of each sample. When the monitoring is terminated, the complete set of collected performance samples is output.
Energy Status
One of the most important features that differentiate SpyMon from most state-of-the-art tools is the ability to provide energy/power consumption information. By specifying an extra parameter at invocation, the energy consumption information is also included in the reported output. Since performance and energy/power consumption monitoring rely on different and independent interfaces, both measurements are simultaneously acquired. When sampling is enabled, SpyMon takes an energy sample at the same time interval as for performance, thus providing the same time granularity for both interfaces. The minimum sampling interval is set to 1 millisecond, which corresponds to the approximate time interval at which the energy status MSRs are updated.
Cache-aware Roofline Analysis
For a common user, defining the extensive set of performance events and fully understanding the behavior of real-world applications on a target platform is not a trivial task. To ease this process, SpyMon provides a predefined configuration which allows performing a performance analysis based on the CARM [11]. When running the tool with this configuration, and by providing a target application, the tool automatically outputs the performance information in a single, easy-to-interpret plot. The CARM plot shows the FP performance and operational intensity of each taken sample as a dot, drawn under the model's roof, making it simple to detect potential performance bottlenecks, e.g., from a memory hierarchy point of view.
When using this mode of the tool, it is also possible to define the sampling time interval as well
as to enable energy status information collection. However, energy status information is provided
separately from the model, since CARM only applies to performance.
3.2 Implementation Details
This section presents a detailed description of SpyMon's implementation. The herein proposed tool is composed of three main parts that interact in a hierarchical way, namely i) the monitor, which controls the tool's execution flow and provides all the functionality to the user; ii) a set of spies, which are responsible for the communication with the PMU interface and for handling the performance profiling information; and iii) a Linux kernel module, which provides the access to the hardware facilities, thus overcoming any privilege access restrictions. Figure 3.2 illustrates how the different tool components interact with each other and how they are arranged across the different privilege layers of the system. More detailed information on these components and how they interact is provided in the following text.
Figure 3.2: SpyMon's components interaction and disposition in the OS privilege layers.
3.2.1 Linux Kernel Module and Hardware Access Restrictions
In today's OSs, the access to the hardware performance and energy monitoring facilities is usually restricted to higher privilege levels, i.e., it is not possible to access these facilities directly from the user-space. In order to overcome these limitations, SpyMon integrates a specific Linux kernel module, or driver, which enables the communication with the underlying hardware monitoring interfaces [9] and resolves the permission restrictions. SpyMon's driver is composed of a small number of structures that allow low-level access for the user-space set of commands from the tool, i.e., the addresses of the underlying performance and energy status MSRs, and a set of functions that
operate over these data structures, including reading from and writing to the hardware counters
and configurations registers.
At the time of the module’s installation, a new device file is created in the /dev directory,
allowing the communication between the user-space processes and the driver. The module is
accessed by calling the ioctl() system call over the device file. By using this call, the tool
is not only able to send a specific command to the module, but also to specify an argument,
which is used to send the proper data structures, either for holding the sample readings or for
configuration purposes. Besides the commands for the module’s initialization and termination,
the main functionality of the driver relies on the IOC_RD_PMU and IOC_WR_PMU commands, for
reading from and writing to the PMUs, respectively. In addition, it also includes the IOC_RD_RAPL
command, for reading the RAPL energy status information.
3.2.2 Hardware Readings and Configuration
As previously mentioned, SpyMon's Linux kernel module provides a set of specific commands based on ioctl() system calls, in order to allow the spies and the monitor to access the privileged hardware monitoring facilities. According to the type of request that is made to the module, a
corresponding data structure’s address is sent as the ioctl() argument. Figure 3.3 shows how the
sample holding structures are implemented. When the IOC_RD_PMU command is passed through
the ioctl() call to the module, an address to a previously allocated sample_pmu data structure
(see Figure 3.3(a)) is passed as the argument. The module will then read both the PMU and
the time-stamp counters and copy the readings to the user-space data structure referenced by the
provided address. In brief, the readings from a set of nr_fx_ctrs fixed counters (as enabled by the
user) are stored in a fx array, while a gp array holds the values obtained from a set of nr_gp_ctrs
configured general-purpose counters. As presented in Figure 3.3(b), a similar procedure is used
to access energy consumption in the sample_rapl data structure via the ioctl() IOC_RD_RAPL
command. Alongside the tsc time-stamp readings, the energy status information is stored in the
pkg, pp0, pp1 and dram variables, corresponding to the package, power-plane 0, power-plane 1 and
DRAM domains, respectively. A similar structure is used for configuring the PMU events through
the IOC_WR_PMU command. The main difference between the latter and the sample_pmu structure
is that for configuration purposes one 64-bit variable is sufficient to configure the PFCs (see Figure
2.3(b)).
3.2.3 Main Functionality
Figure 3.4 illustrates the execution flow of the tool, from the perspective of both the monitor and the spy processes. When started, the tool first parses the input parameters (step 1). A detailed description of the available options is given in Section 3.3. If the --help sub-command is provided, the usage information is printed to the standard output (step 8). If the --list sub-command is provided, then the complete list of available hardware events is shown (step 9). On the other hand, if either the --start or the --roof argument is provided, then the monitoring parameters are configured according to the user's input specifications, and the application profiling is initiated. In brief, --start activates the most commonly used SpyMon "profiling mode", while --roof enables run-time cache-aware roofline application monitoring.

Figure 3.3: SpyMon's data structures for ioctl() communication. (a) Structure for PMU sample information; (b) structure for RAPL sample information.
Profiling Mode
When the --start command is provided, the tool first parses and verifies the input parameters. Then, the monitor process is pinned to a specific LPC (step 2) by using the sched_setaffinity() system call. This call informs the scheduler on which LPCs the calling thread is allowed to execute. By default, the monitor is pinned to the last available LPC, although its affinity can be changed by the end-user in the initial tool configuration.
Afterwards, the main process forks several new processes (spies), whose number corresponds to the number of required target monitoring cores (step 3). By default, all LPCs are monitored, i.e., SpyMon first detects the number of available LPCs and launches one spy for each LPC. The
general execution diagram for a single spy process is depicted in Figure 3.4(b) and it starts by
setting the pipe communication channel with the monitor (step a).
Following the process spatial configuration, the PMU configurations are made (step 4). After
parsing the provided PME configuration, SpyMon creates a number of event-sets by grouping the
events according to the number of available PMCs. For instance, if the architecture only supports
4 PMCs and the user provides 7 PMEs, then the tool will define 2 event-sets, where the first event-
set contains the first 4 provided PMEs and the second one contains the remaining 3 PMEs. When
the PMU configuration is done, the monitor sends the configuration structures to the spies, by
means of the previously established pipe communication channels. In the spy execution diagram
(see Figure 3.4(b)), this corresponds to step b. From this point on, each spy starts monitoring its
target LPC and producing the performance sampling information accordingly.
If more than one event-set is defined, then event multiplexing is applied. In these cases, a single performance sample corresponds to a specific predefined time interval in which the performance counter readings from different event-sets are merged together. This mechanism is performed by the spies, since they are responsible for handling the PMU.

Figure 3.4: SpyMon's execution flow. (a) Monitor execution diagram; (b) a single spy execution diagram.
When performing a system-wide evaluation, it is usually required to launch specific applications and analyze their performance, as specified by the end-user. SpyMon provides this functionality via a set of simple configuration commands, which instruct not only the launching of the target applications, but also the pinning of their execution to the required LPCs (step 5). This is achieved by using the fork() and execve() system calls.
When all the initializations and configurations are performed, SpyMon initiates profiling. At this point, the monitor process also starts reading and producing RAPL sampling information (steps 6-7), until the monitored application terminates. At the same time, each spy reads and produces PMU information samples at regular intervals (steps d-g). As can be observed, as soon as the performance counter readings are retrieved for the current event-set (step e), each spy activates event multiplexing if the number of event-sets is greater than 1 (evsets > 1), or immediately outputs the counter readings otherwise (step g). When event multiplexing is activated, the next event-set is configured (step f), i.e., a set of different events will start being counted during a multiplexing time interval (step d). When the last event-set counts are retrieved, the sample is considered finished (sample finished) and its contents are output (step g). The described process (steps d-g) is repeated once for each sampling time interval, until monitoring is completed.
The information produced by both the monitor and the spies is directly printed out to files.
The number of files corresponds to the number of processes executed inside the tool, i.e., to the
number of spies plus the monitor, where each file corresponds to only one of those processes.
Event Multiplexing
The event multiplexing functionality allows the virtual extension of the number of PMCs pro-
vided by the underlying architecture. For example, if the sampling time interval is defined as 9
milliseconds and there are 3 event-sets, each event-set runs for 3 milliseconds. Then, the mea-
surements taken from each configured event-set are merged together by relying on the following
procedure:
counts_estimate = counts_measured × (time_total / time_enabled).   (3.1)

As can be observed in Equation (3.1), event multiplexing implies extrapolating the counter values obtained for a single event-set (counts_measured) during the multiplexing time interval (time_enabled) to the overall sampling time taken to perform all event-sets (time_total). When event multiplexing is used, the obtained final sample counts (counts_estimate) represent a mere estimate of the real counts. An illustration of the above described method, when 3 event-sets are configured,
is shown in Figure 3.5. As it can be observed, the event-sets are switched at regular time intervals.
When the last event-set is measured, a complete sample is acquired and the first event-set is again
configured, thus initiating the next sampling time interval.
SpyMon’s TBM is implemented by the spies and is illustrated in Figure 3.4(b) (steps d-f).
Figure 3.5: Illustration of TBM for 3 defined event-sets.
Cache-aware Roofline Mode
  Event Set 0: FP_SSE_PACKED_SINGLE, FP_SSE_PACKED_DOUBLE, FP_AVX_PACKED_SINGLE, FP_AVX_PACKED_DOUBLE
  Event Set 1: FP_SSE_SCALAR_SINGLE, FP_SSE_SCALAR_DOUBLE, MEM_UOP_RETIRED_ALL_LOADS, MEM_UOP_RETIRED_ALL_STORES

Table 3.1: Sets of PMEs used for performance profiling when using the cache-aware roofline model.

When the --roof command is provided, a process similar to the one previously described for --start is performed. In this mode, there is no need to provide any PMU configuration, since all the required events for the CARM are predefined by the tool. The event-set configuration used
by SpyMon when in roofline mode is depicted in Table 3.1 and the corresponding event description
can be found in Table 3.2. The monitored events required by the roofline model involve detecting
both the performed floating point operations and all the corresponding memory operations (loads
and stores).
As it can be observed, it is required to monitor 6 different events in order to assess the number
of FP operations, and 2 additional events to estimate the amount of data traffic. As a result, event
multiplexing is required and the event-set configuration is made according to the information shown
in Table 3.1.
Energy status information, although not part of the performance roofline model, can also be
provided. If this is the case, the monitor process will also take RAPL samples, in the previously
explained way.
3.3 Usage
SpyMon provides an easy-to-use command-line interface, which facilitates running either a system-wide or an application-specific full performance and energy/power consumption evaluation. As previously mentioned, SpyMon provides different functionalities via a small set of command-line parameters. Figure 3.6 illustrates the four currently implemented main options, i.e., --help, --list, --start and --roof. The set of supported options can be retrieved with the --help parameter, which also provides a short summary on how to use SpyMon's interface with the different options. The --list option outputs a list of the predefined hardware events provided by the tool. Table 3.2 shows a small subset of the predefined hardware events provided by SpyMon.
3.3.1 Profiling Mode
The spymon --start command provides fully configurable execution profiling. This option allows configuring multiple execution parameters, such as the process spatial configuration, the event definition, the enabling of power metering and the sampling time interval.
$ spymon --help
Usage: spymon --help
       spymon --list
       spymon --start [-e ev0[,ev1[...]]] [-f id:mode[,id:mode...]] [-c core[,core...]] [-r domain[,domain...]] [-s stime] [-m core[,core[...]]] [-p [core,[core...]] prog [args]]
       spymon --roof [-s stime] [-r domain[,domain...]] prog [args]

Figure 3.6: SpyMon's usage information.
For event configuration, the -e option must be used. The set of required hardware events is specified as a comma-separated event list, where each event can be designated either by a predefined event name or by a raw event word. For a predefined hardware event, the required events must be chosen from the event list provided by spymon --list and passed as the input parameter. Raw hardware events can be specified by using the format r:evsel:umask:usr:os, where evsel corresponds to the event select bit field and umask refers to the unit mask field, while usr and os represent the user and OS bit flags corresponding to the different privilege modes, respectively. The fields in the raw-event specification format correspond to the previously referred bit fields for event configuration (see Section 2.1).
In order to enable the fixed architectural events (PFCs), the -f id:mode option should be
specified. The id corresponds to a specific PFC number (e.g., if the architecture provides 3 PFCs,
the id can take the value of 0, 1 or 2), while mode refers to the privilege modes to monitor (1 for
user, 2 for OS and 3 for both). Similarly to the general purpose events, the input parameters
should be provided as a comma separated list.
To allow full control over the execution and profiling environment, SpyMon provides the -c, -m and -p options. The -c core option permits specifying which LPC (core) should be monitored; a set of LPCs should be provided as a comma-separated list. Similarly, the -m option allows configuring to which LPC the monitor process is pinned. The default spatial configuration for the monitor and spy processes is shown in Figure 3.1, where the monitor is pinned to the last LPC and a spy is invoked on each LPC.
SpyMon also provides the ability to launch specific applications (including their input parameters) by using the -p option. Several applications can also be simultaneously invoked and monitored, by specifying each of them in a separate -p option. Due to SpyMon's core-oriented system-wide monitoring approach, when multi-threaded applications are analyzed, it is the user's responsibility to ensure the spatial control of the execution threads. For this purpose, besides the application's binary and input arguments, SpyMon provides an extra parameter (core) to the -p option, which allows controlling the application's CPU affinity, i.e., on which LPCs it is allowed to run.
Table 3.2: Sample of hardware performance events provided by SpyMon.

  Event                        Description
  UNHALTED_CORE_CYCLES         Unhalted core cycles.
  UNHALTED_REF_CYCLES          Unhalted reference cycles.
  INST_RETIRED_ALL             Number of instructions retired.
  UOPS_RETIRED_ALL             Number of µops retired.
  MEM_UOP_RETIRED_ALL_LOADS    Qualify any retired memory µops that are loads.
  MEM_UOP_RETIRED_ALL_STORES   Qualify any retired memory µops that are stores.
  FP_SSE_SCALAR_SINGLE         Number of SSE single-precision FP scalar µops executed.
  FP_SSE_SCALAR_DOUBLE         Number of SSE double-precision FP scalar µops executed.
  FP_SSE_PACKED_SINGLE         Number of SSE single-precision FP packed µops executed.
  FP_SSE_PACKED_DOUBLE         Number of SSE double-precision FP packed µops executed.
  FP_AVX_PACKED_SINGLE         Number of AVX 256-bit packed single-precision FP instructions executed.
  FP_AVX_PACKED_DOUBLE         Number of AVX 256-bit packed double-precision FP instructions executed.
  L1D_REPLACEMENT              Number of lines brought into the L1 data cache.
  LLC_REFERENCE                Last-level cache references.
  L2_RQSTS_CODE_RD_MISS        Number of instruction fetches that missed the L2 cache.
  OFF_CORE_MISSES_0            Number of L3 misses.

  SSE - Streaming SIMD Extensions; FP - floating-point; AVX - Advanced Vector Extensions; µops - micro-operations
In order to enable sampling, the -s option must be used, followed by the required sampling time interval in milliseconds. If this option is not enabled, SpyMon will report the sum of all the monitored events at the end of the run. The end of the run is determined by the application with the longest execution time. If no applications are provided, the tool will monitor until an interruption signal is detected (CTRL-C).
Energy consumption information is delivered when the -r option is enabled. This option requires at least one domain to be specified. Several domains can be monitored at the same time, as long as they are provided by the underlying architecture. For Intel architectures, following the supported RAPL power planes, the available domains are pkg, pp0, pp1 and dram.
3.3.2 Cache-aware Roofline Mode
One important feature SpyMon provides is the ability to run a performance analysis based on the CARM [11]. In order to make use of this functionality, the spymon --roof command must be used. When running the tool in roofline mode, one does not need to manually configure any performance counters, as the tool already contains the hard-coded set of events to use for this type of analysis. Furthermore, the user is free to define the sampling time interval, as well as to activate the RAPL energy status interface by using the -r option, in the same way as described for the profiling mode. However, since energy status information is not a part of the model, it is output separately.
3.3.3 Information Output
As previously mentioned, SpyMon outputs the profiling information to files. The number of files corresponds to the number of monitored LPCs. If energy status information is enabled, then its monitoring samples are stored in an additional file.

Both performance and energy information files contain the raw counting values extracted from the corresponding hardware interface. When more than one event-set is configured, each performance counter value is output as estimated by applying Equation (3.1). Moreover, time-stamp information is also provided. As an example, the line format of a performance file for a run with 1 fixed counter and 3 general-purpose counters would be tsc fx gp0 gp1 gp2. On the other hand, the line format of a file containing energy status information for the package, pp0 and pp1 domains becomes tsc package pp0 pp1.
When running in cache-aware roofline mode, post-processing is applied over the output performance files in order to generate the plot containing the performance information plotted under the lines representing the system's attainable performance. The number of flops contained within a sample is calculated according to the following expression:

flops = SCL_SP / 2 + SCL_DP + (SSE_SP + SSE_DP) × 2 + (AVX_SP + AVX_DP) × 4.   (3.2)

In a similar way, the calculation of the corresponding number of transferred bytes relies on the following procedure:

bytes = (8 × scl + 16 × sse + 32 × avx) × (LOADS + STORES).   (3.3)
The scl, sse and avx variables correspond to the fractions of scalar, Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) FP instructions over the total number of FP instructions, respectively. These calculations are necessary since different FP types correspond to different data widths and, therefore, different numbers of bytes. The scalar, SSE and AVX fractions are obtained by relying on the expressions:

scl = (SCL_SP + SCL_DP) / (SCL_SP + SCL_DP + SSE_SP + SSE_DP + AVX_SP + AVX_DP);   (3.4)

sse = (SSE_SP + SSE_DP) / (SCL_SP + SCL_DP + SSE_SP + SSE_DP + AVX_SP + AVX_DP);   (3.5)

avx = (AVX_SP + AVX_DP) / (SCL_SP + SCL_DP + SSE_SP + SSE_DP + AVX_SP + AVX_DP).   (3.6)
The above described expressions are applied over the PMC values obtained from the configuration
depicted in Table 3.1, where each capital variable in the equations corresponds to a specific event
counting value.
3.4 Summary
In this chapter, a new system-wide profiling tool (SpyMon) was introduced. This tool offers
an easy to use interface and is capable of delivering architectural performance and power/energy
consumption information to the user, thus providing the fundamental means for a better understanding of how the underlying resources function.
Currently, the ability to provide both performance and energy consumption information in a single interface is not easily found among state-of-the-art tools, which makes SpyMon a preferable choice. Moreover, it allows the user to run a full performance evaluation based on the CARM [11], without the need for any extra configuration. When executed in this mode, the tool provides a single plot which contains useful information about the execution, allowing the detection of possible architectural bottlenecks or even the improvement of the monitored application's execution on the underlying hardware.
Apart from the means implemented to overcome the possible hardware access restrictions, SpyMon is completely designed to run from the user-space. As a result, a great level of portability is sustained in SpyMon. However, by running in user-space, certain overheads are likely introduced when compared to traditional interfaces, such as perf_events or similar driver-based profiling interfaces.

In the latter cases, full control over system task execution and scheduling can be attained, and the communication with the PMU is performed inside the scheduler, i.e., it is invisible to the actual application's execution. On the other hand, SpyMon does not aim at establishing such control over the OS scheduling mechanisms, since its processes (monitor and spies) allow core-based system-wide application performance and energy status monitoring without interfering with the underlying OS mechanisms.
4 Scheduler-Based Monitoring Tool (SchedMon)

Contents
4.1 Architecture and Main Functionality . . . . . 36
4.2 Implementation Details . . . . . 40
4.3 Usage . . . . . 56
4.4 Summary . . . . . 60
In Chapter 3, an easy to use monitoring tool (SpyMon) was presented, which aims at higher portability, by avoiding dependency on the underlying monitoring interfaces, and provides the mechanisms for a complete system-wide performance and power evaluation. However, by being almost fully implemented in the user-space, this tool implies some limitations. For example, attaining full control of an application's execution flow becomes more challenging, since SpyMon targets the whole system, i.e., it provides monitoring based on a core-oriented approach. In fact, even if extreme efforts are made to construct a tightly controlled environment, e.g., by running only the desired applications and pinning them to predefined LPCs, monitoring at the application level is not an easy task, due to the interference introduced by the OS running tasks. As a result, it becomes extremely difficult to extract this interference from the obtained measurements, especially when monitoring is performed by means of user-space processes.
The performance and power/energy consumption monitoring tool presented in this chapter, i.e., SchedMon, adopts a completely different approach, by making use of the OS internal mechanisms in order to obtain broader and more detailed information about the monitored applications. In contrast to SpyMon, this tool is mostly implemented in the kernel-space and it aims at an application-based evaluation, thus allowing more accurate profiling of the monitored applications. SchedMon's main principles rely upon its modularity, which allows its functionality to be easily extended. In order to achieve that, the tool is designed not to depend on the available OS-specific performance interfaces (e.g., perf_events), therefore not depending on their already implemented structure and functionality.
In this chapter, the new scheduler-based monitoring tool, SchedMon, is introduced. Section 4.1 provides an overview of the tool's novel mechanisms and implemented features. The next section describes the details of the tool's implementation. Finally, Section 4.3 gives an overview of how to use the tool in order to obtain the required performance and/or power results.
4.1 Architecture and Main Functionality
SchedMon is composed of two main parts: i) a Linux kernel module, or driver, which integrates the tool's core mechanisms; and ii) a user-space tool (smon), which exposes the whole functionality of the underlying module and translates it into a simple and intuitive user interface. The communication between both components is made by means of a user-space library, which provides a set of functions for handling the tool's main functionalities.
Figure 4.1 illustrates the interaction between SchedMon's components, as well as their
disposition in the OS privilege layers. As it can be observed, the Linux kernel module is responsible
for interacting with the hardware, thus providing the necessary performance and power/energy
consumption information. The communication between the module and the user-space tool is
made through a set of system calls over the driver’s device file. These necessary communication
commands are provided by the tool’s user-space library. In addition to the command-related
communication, a shared memory area is used for exchanging the produced profiling information at run-time.

Figure 4.1: SchedMon's components interaction and disposition in the OS privilege layers.
4.1.1 SchedMon’s Linux Kernel Module
The core functionality of SchedMon lies in the OS kernel-space, and it runs as an integrated part of the OS kernel, therefore widening the possibilities for better monitoring control. Although kernel modules are executed as part of the Linux kernel code, they only allow inserting new functionality into the OS, i.e., they are not allowed to change the currently implemented system. However, the Linux kernel provides a mechanism that makes it possible for inserted modules to have broader control of the kernel execution flow. This mechanism is designated as a tracepoint. A tracepoint can be described as a breakpoint inserted into a specific place in the code, which can be enabled by providing it with a callback function. When a tracepoint is enabled, the provided callback function is called whenever the corresponding part of the code is run. This facilitates not only debugging, but also allows inserting code into the kernel itself in an elegant and easy way.
SchedMon makes use of the Linux scheduler tracepoints to keep track of the target monitored
applications. These facilities allow attaining full scheduling control over the OS running tasks,
including the ability to detect when a certain task is scheduled onto an LPC or migrated to a
different one.
The interaction with SchedMon’s kernel module (or driver) is established by using a set of
predefined system calls to the driver. The tool provides this functionality as a user-space library,
which contains the necessary functions to configure and control the monitoring environment at
run-time.
4.1.2 Smon: the user-space tool
In order to facilitate the use and control of the above mentioned kernel module, or driver,
SchedMon integrates a user-space component, smon, which implements the mechanisms for
interacting with the driver. This component makes use of the tool's user-space library, which implements
the same data structures and synchronization methods as the driver. Moreover, smon extends the
tool's overall functionality by adding a set of user-space features, as will be described next.
In brief, smon exposes this functionality through an easy-to-use command-line interface. This
interface is composed of a set of commands that give full control over the tool's execution and
parameter configuration.
4.1.3 Available Features
SchedMon follows an application-oriented monitoring methodology, which means that it is meant
for monitoring specific applications, in contrast to a core-oriented or system-wide evaluation (the
scope of the SpyMon tool). The set of features offered by SchedMon was designed according to this
concept, in order to provide very detailed information about the monitored applications.
On-line Analysis
By exposing the main functionality through a user-space interface, SchedMon provides the
possibility for run-time performance and power/energy consumption evaluation. In brief, by
using the provided user-space library, programs are able to interact with the tool and perform
the required actions based on the performance and power/energy consumption feedback. As a
result, SchedMon's functionality can be easily extended by relying only on its kernel module:
being able to use SchedMon without the user-space command-line interface, smon, is one of the
great advantages of the tool.
Multi-threaded Application Profiling
With the constantly increasing number of parallel resources provided by modern architectures,
multi-threaded applications are becoming more and more mainstream. Therefore, it is absolutely
crucial to be able to profile applications that can spawn tens or even hundreds of threads, in order
to enable architectural or application design optimizations.
The ability to profile multi-threaded applications is one of the main functionalities provided
by SchedMon. By setting the proper configuration parameters, one may attach the same
performance monitoring behavior to all child tasks descending from the main monitored task,
recursively. This means that one can profile multi-threaded or multi-process applications, by
tracing all the descendant threads or processes, as long as they descend from the target
application's main process.
Task Hierarchy
In addition to monitoring multi-threaded applications, SchedMon allows obtaining, in real-
time, information about every process or thread that is created, as long as it descends from the
targeted application's main process. By using this information, it is possible to construct, either at
run-time or after the application’s execution, the complete task hierarchy tree, therefore obtaining
a better understanding of the application’s internal structure and design.
Task Scheduling
The OS scheduler is the internal OS mechanism that is responsible for deciding when and where
a current set of active tasks are allowed to run, i.e., it controls which tasks have access to a specific
LPC at a given time. As already mentioned, SchedMon relies on the Linux scheduler execution flow
to obtain information about each monitored task. Thus, it is able to trace task movements
inside the architecture at run-time.
One of the features of the tool is the ability to dump the scheduling information of each
monitored task, at run-time. This means that, if such detail is requested, information about when
each monitored task was using a specific LPC may be provided.
Task Migration
Following a strategy similar to the one employed for the task scheduling feature described above,
SchedMon also allows extracting information about the exact time when a task is migrated by the
scheduler. Task migration refers to the action performed by the OS scheduler when the execution
context of a single task is moved from one LPC to a different one.
Performance Monitoring
The main functionality of every performance monitoring tool is the ability to provide the
necessary means to access the underlying architectural performance facilities (PMU). SchedMon
allows the complete configuration of the underlying performance interface, providing the means to
configure both PMCs and PFCs, in a simple and easy way. In addition, in order to facilitate the
tool’s performance configuration, SchedMon provides a predefined set of PMEs.
A particularity of SchedMon is that PME configuration is kept inside the driver, thus requiring
the definition of PMEs to be made a priori, i.e., before execution. The main advantage of this
method is that events can be reused in several different runs, without the need to be redefined.
Moreover, it is possible to create shell scripts to automate the reconfiguration of the tool with the
required event definitions, which facilitates the configuration across reboots or even across different
platforms.
Energy Status Information
Good power and energy consumption management is becoming increasingly important for
modern computing systems. In this context, SchedMon provides a power and energy consumption
monitoring interface that allows performing a complete power/energy consumption evaluation of
the hardware facilities, alongside performance. The energy status information can be toggled
as simply as enabling a single option and providing the required monitoring domains, when starting
the profiling execution.
Cache-aware Roofline Model
Similarly to SpyMon, the SchedMon tool provides a way to perform application performance
analysis using the CARM [11]. This functionality provides an easy way to extract and analyze the
execution behavior of an application running on modern multi-core architectures. By analyzing
the resulting single-plot diagram, it is possible to detect potential architectural and/or application
bottlenecks.
In this respect, SchedMon provides an extra functionality, which allows the automatic construction
of the model for different general-purpose multi-core architectures, in order to facilitate the
portability of the tool. By running predefined assembly benchmarks, it is possible to extract the
model's necessary parameters for different architectures and, therefore, increase the model's
precision.
Function Call Tracing
The function call tracing functionality is achieved by intercepting every function call that
the target application invokes. Therefore, SchedMon provides the ability to know which part
of the application's code is executed at a specific time. The most referenced state-of-the-art tools
usually achieve this functionality by back-tracing the program stack each time a new sample is
taken [1] [4] [5]. Hence, in certain execution scenarios, it might not even be possible to detect
exactly when a function is entered or exited, or to catch all function calls.
In contrast, SchedMon provides the ability to detect when the monitored application enters or
leaves a function. Moreover, if the target application switches its executable file at run-time, the
tool is able to detect and load the information about the new binary's functions. Finally, if new
processes or threads are created during the execution, SchedMon is also able to keep track of their
calls, independently of whether they share the same execution code or not.
4.2 Implementation Details
This section describes the implementation details of SchedMon. As already mentioned, the tool
is composed of two main parts: a Linux kernel module, implemented as a device driver, which is
the core of the tool and allows greater control over the underlying hardware resources; and a
user-space tool, which interacts with the implemented module by means of a user-space library
and exposes it to the user through an easy-to-use command-line interface.
4.2.1 Linux Kernel Module
SchedMon's kernel module, or driver, is the main component of the tool, since it contains all the
main functionality and data structure implementations. When loaded into the kernel, the driver
creates a file in the /dev directory (the device) which acts as a communication medium to the
driver, i.e., operations over this file trigger the corresponding module function to handle that
specific operation. At the moment, SchedMon's device driver defines five different operations over
the device file:
• Open - This function is called each time an open() operation is performed over the device
file. At the moment, this function is only used for initialization and debugging purposes;
• Release - Similarly to open, this function is called each time a device file descriptor is closed
and it serves mostly for debugging purposes;
• Ioctl - This function incorporates most of the user-to-kernel communication functionality.
It is triggered when an ioctl() call is made over the device file, and it allows attaining
control over the monitoring facilities. An ioctl() call permits not only sending specific
predefined commands to the driver, but also exchanging data between the kernel and
user-space if a user-space address is provided as a function argument;
• Mmap - This operation allows a user-space program to share memory with SchedMon’s
device driver, thus reducing the overall communication overhead. This function is triggered
when a mmap() call is made over the device, and it must be performed in order to obtain
profiling information from the driver, as described below;
• Poll - This operation implements the synchronization mechanisms used by SchedMon to
coordinate the read and write operations over the previously allocated shared memory. It
can be used by calling poll() or select() functions over the device.
The proper use of the above described calls is what permits the full control and configuration
of SchedMon’s driver from the user-space.
Events, Event-sets and Environments
SchedMon's infrastructure for performance configuration relies on three basic data structures:
event, event-set and environment. These structures are designed to interact in a hierarchical
way, as shown in Figure 4.2, thus allowing the reutilization of not only event but also event-set
definitions. Therefore, there is no need to re-create the same events or event-sets across
different runs.
An event data structure contains both the event tag identification (event_tag), which is
defined at the time of the event configuration, and the Performance Monitoring Select Register
[Figure: an environment (nr_evsets, evset_arr, profiling options such as sample_time and fork_info, and the number of tasks using the structure) points to event-sets (evset_tag, event_arr, fixed_ctr_ctrl, global_ctr_ctrl), which in turn point to events (event_tag, event_configuration with event_select, unit_mask and OS/user bits).]
Figure 4.2: SchedMon event, event-set and environment structural hierarchy.
(PMSR) value necessary for the desired event configuration (event_configuration). Since PMUs
usually provide more than one PMC and even several PFCs, the event-set data structure contains
a number of pointers to event structures (event_arr), the PFCs configuration (fixed_ctr_ctrl)
and an additional register variable which contains the information of which PMCs and PFCs
are configured for the event-set (global_ctr_ctrl). Moreover, since this structure contains a
full configuration of the PMU, only one event-set can be configured at a time per LPC. Both
events and event-sets need to be defined before profiling is started, and they are stored inside
the driver. On the other hand, the environment data structure is created at the time of the
run, and it is meant for maintaining the performance configuration for that specific execution,
i.e., it keeps both the pointers for the monitored event-sets (evset_arr) and the profiling options
(profiling_options), such as the sampling time interval and the flags defining the required
profiling information types. At the end of the run, environment data structures are destroyed.
There are currently six available profiling options flags, that need to be defined at the time of
the run, and which have the purpose of enabling specific configuration parameters:
• inherit - Applies for multi-threaded applications and, when set, the forked tasks will also be
monitored, by inheriting the configuration from their parent task.
• on_exec - When set, the monitoring of the target application process is started at the
time when the next execve() system call is made. This guarantees that the application
monitoring starts exactly when its execution starts. This option is actually used by smon,
since it relies on the execve() call in order to run the provided application binary.
• rapl - For all cases when RAPL energy status information is required, this flag must be set.
This enables power and energy consumption profiling during the execution and provides the
resulting sampling measurements.
• migration - When this flag is enabled, whenever one of the monitored tasks is migrated, the
corresponding information is provided as a sample.
• fork - This flag works in a similar way to the migration flag, although it delivers a sample
each time one of the monitored tasks forks a new task.
42
4.2 Implementation Details
• sched - When detailed scheduling information is required, this flag can be enabled. If so,
each time a task is scheduled in and out of a specific LPC by the OS, a sample with the
corresponding information is delivered.
The above described option flags, apart from inherit and on_exec, define what kind of
information should be included while profiling. The sample_time parameter may also be defined
when configuring the execution, and sets the performance and power/energy consumption
sampling time interval. Although the described flags allow enabling or disabling the different
types of profiling information, performance sampling information is always enabled (by default).
Sample Types
SchedMon’s driver currently provides five different types of samples, which refer to different
previously described profiling configuration parameters, namely: performance, energy status, task
migration, task creation and CPU scheduling information.
Performance samples are always enabled. They provide a complete PMU sample reading
during a specific time interval, which is defined via the sample_time parameter. Moreover, timing
information is also available by providing the time-stamps corresponding to the start and the end
of the sample, and also the duration of each sample. The sample duration might differ from the
difference between the end and start time-stamps, in cases when the sample spans different
CPU scheduling intervals. For instance, if a task is scheduled out of the CPU in the middle of a
sample, replaced by another task, and then scheduled in again later, the sample duration will not
include the foreign task's execution time. Information about the corresponding event-set and task
PID is provided as well.
RAPL samples (power/energy consumption samples), when requested, provide the energy sta-
tus counter readings for all available domains of the processor chip, for the same time interval
defined for performance samples. Since power consumption monitoring is performed at the chip
level, there is only one application task monitoring it and, thus, the PID is not necessary. On the
other hand, the sample start and duration times are still provided.
In contrast to performance and RAPL sample types, which provide hardware event counter
readings, the remaining three sample types refer to specific software events. The task migration
samples, when requested, provide information about when the task is migrated, which CPUs are involved
in that migration and the timing information (i.e., corresponding time-stamp values). Task creation
samples provide the PIDs of both involved processes (parent and child) and the corresponding
time-stamp values. The scheduling information samples are provided each time a monitored task
leaves the current LPC. In detail, each sample contains not only the task PID and the LPC
identification, but also the corresponding time-stamps of when the task entered and left that LPC.
The time-stamp information is obtained by using the rdtsc instruction, which provides the
corresponding LPC's time-stamp counter (TSC) value. Each LPC contains its own TSC register,
which measures the time since the machine was booted. Although Linux implements certain
mechanisms to synchronize the TSCs across the several available LPCs at boot time, there are no
guarantees that TSC values across different LPCs are actually synchronized. Nonetheless, for the
sake of simplicity, TSCs are assumed to be synchronized across all LPCs up to a certain accuracy
level.
Monitored Tasks
The Linux kernel uses the same data structure to represent both user-space processes and
threads, which is denominated a task. SchedMon follows the same methodology; thus, any
user-space thread or process will be referred to herein as a task.
SchedMon defines two types of tasks: leaders and children. In order to monitor an application
using the tool, the target process, or thread, must be registered into the driver. For this, an ioctl()
system call with the proper request must be performed. The task registration request requires two
distinct arguments: the target PID, which is the task identification parameter, and an environment
data structure containing the profiling configuration. Under SchedMon’s driver, every registered
task is appointed as a leader. On the other hand, a child corresponds to a task descending from
a leader. This only applies if the inherit option is enabled upon the leader task registration,
otherwise the driver will not register any children descending from that task.
Each leader task that is registered in the driver is associated with a performance environment,
i.e., a data structure containing the profiling execution configuration. Whenever a child is allocated
by the driver, it inherits its leader's performance environment and, therefore, the same configuration.
SchedMon's driver keeps track of the task organization by using Linux doubly-linked lists, thus
facilitating the process of task creation and destruction. A leader task can be contained in three
types of lists: task_list, cpu_list or wait_list. The task_list contains all the registered
leader tasks, i.e., all the tasks that were registered through the ioctl() system call, which might
be currently monitored or finished (waiting to be unregistered). For each available LPC, there
is a corresponding cpu_list. Each cpu_list contains all the monitored tasks that are currently
running, or scheduled to run, on the corresponding LPC. On the other hand, the wait_list is
reserved for tasks that are already registered but are still waiting for the next execve() system
call event in order to start their monitoring. Therefore, a leader task cannot be present in both
the wait_list and a cpu_list at the same time. Each leader task also defines a fourth list head,
children_list, which holds a linked list of every created child, if any.
Scheduling Infrastructure
SchedMon's scheduling infrastructure constitutes the core functionality of the driver, since it is
responsible for handling both the task infrastructure and the profiling operations. As already
mentioned, the profiling operations depend on a set of events triggered by the Linux scheduler. At
the moment, the tool is able to detect five different scheduling events, by means of kernel
tracepoints which,
static void sched_process_exec (struct task_struct *p, pid_t pid);
static void sched_process_fork (struct task_struct *parent, struct task_struct *child);
static void sched_switch (struct task_struct *prev, struct task_struct *next);
static void sched_migrate_task (struct task_struct *p, int dest_cpu);
static void sched_process_exit (struct task_struct *p);
Figure 4.3: Linux scheduler tracepoints used by SchedMon.
when triggered, correspond to a specific operation to be handled by the driver. Figure 4.3 depicts
the implemented tracepoint callback headers and their relevant arguments.
The first illustrated tracepoint, sched_process_exec(), is triggered whenever a task executes
an execve() system call, i.e., whenever it replaces its execution binary with another one, and the
task identification parameters are passed through the arguments. When this function is triggered
inside the driver, a sequential search over the wait_list is made. If the task is registered and
present in the wait_list, it is removed from this list and inserted into the corresponding
cpu_list, i.e., it is ready to be monitored.
In a similar way, whenever a task forks a new child, sched_process_fork() is called and
the respective pointers to parent and child task structures are provided as arguments. In this
case, since this event is triggered in the LPC running the parent process, the driver searches the
corresponding cpu_list and, if the forking task is registered in the system (and has the inherit
flag enabled), a child data structure is inserted into this same list and it inherits the behavior of
its parent.
The sched_switch() tracepoint is the most frequently called among all the tracepoints, since
it is triggered each time the scheduler replaces a task with another one on a specific LPC. Whenever
this function is called, SchedMon searches the corresponding cpu_list for the task which is marked
to be scheduled out. If found, the configured PMU counters are stopped and the readings are
saved and accumulated with the readings of the next PMU sample. If energy status information is
enabled, a similar process is performed, except for stopping the counters, since it is not possible
to stop the energy status counters. Power/energy consumption sampling is only performed
by the leader tasks, since they represent the main application processes. Furthermore, the sched
configuration flag is evaluated and, if enabled, the corresponding scheduling sample is
produced. After the above described operations are made for the scheduled-out task, while still in
the sched_switch() function, a similar search is performed for finding the next scheduled-in task. If
found, the initializations are made in order to restart performance and, if applicable, power/energy
sampling. These initializations consist mainly in setting kernel timers that, when triggered, produce
a system interrupt which allows the driver to take the required measurements or, in the case of
performance sampling, to reconfigure the PMU as needed. The sampling process will be explained
later in detail.
Each time a task is migrated, the sched_migrate_task() function is called, providing the
information of which task is migrated and to which LPC it is migrated. Since this call is made
from the LPC where the task is migrated from, a sequential search over the corresponding cpu_list
is made. If the migrated task is registered in SchedMon, it is removed from the current cpu_list
and it is inserted in the destination cpu_list. Moreover, if the migration option flag is set, a
migration sample is also produced.
Finally, the sched_process_exit() tracepoint is triggered whenever a task is terminated. If
the terminated task is one of the SchedMon’s monitored tasks, it is removed from its corresponding
cpu_list and its timers are stopped, in order to stop that task from being monitored.
Sampling
Sampling refers to the process of extracting specific information from the execution at regular
time intervals. Figures 4.4(a) and 4.4(b) illustrate the sampling process and a use-case scenario
of a task being profiled over time, respectively. For the sake of simplicity, the presented diagram
refers to an execution scenario in which:
• There is only one LPC;
• The only scheduling event being triggered is the sched_switch() call;
• The tool is exclusively profiling performance;
• The sampling time interval is 10ms;
• Two event-sets are configured.
In order to provide accurate performance sampling, several auxiliary data structures are used. The
main ones are: i) the array containing the different event-set configurations; ii) a Linux
high-resolution timer, for synchronization purposes and sampling at nanosecond granularity; and
iii) a temporary PMU sample, which holds the current sample counts.
After a task is registered for profiling in SchedMon's driver (step 1), the scheduling
infrastructure is able to detect its presence and to perform the corresponding sampling operations.
As presented in Figure 4.4(b), at start time, 0ms, the Linux scheduler assigns the CPU resources to
the target monitored task. As soon as this task is found, the driver conducts the necessary
operations for starting the sampling process.
Firstly, the PMU is configured (step 2), which is done by writing the event-set 0 MSR configu-
ration into the underlying performance facilities. Along with this process, the current PMU sample
values are written into the performance MSRs, both PFCs and PMCs, which in this specific case
[Figure: (a) flow diagram of the sampling steps — register task (1); on schedule-in: load event-set into PMU (2), start hrtimer (3), get time-stamp (4), start counting (5); on schedule-out: stop counting (6), stop hrtimer (7), get PMU readings (8), get time-stamp (9); on hrtimer interruption: close current sample (a), dispatch sample (b), reset sample (c), reconfigure PMU (d), restart hrtimer (e). (b) timeline of a task alternating between event-set 0 and event-set 1, with sched_in, sched_out and sample-taking points marked.]
(a) Sampling flow diagram.
(b) Sampling example in time.
Figure 4.4: SchedMon sampling process illustration.
are all initialized to zero. After the PMU configuration is performed, the high-resolution timer
is set (step 3), by configuring it to trigger after 10ms (the sampling time interval). When the
timer is configured, a callback function is provided, which is triggered upon the timer's expiration
and is meant for taking and dispatching a sample. Right before starting the counters, the
Time-Stamp Counter (TSC) is read (step 4), in order to keep track of time-related information.
Finally, the configuration process is concluded by enabling the PMU counters in order to start
counting (step 5).
In this hypothetical example, the monitored task is scheduled out before the timer interrupt
occurs (at 8ms). Thus, the first operation is to stop the performance counters (step 6), in order
to reduce the overheads imposed by the tool. Subsequently, the high-resolution timer is stopped
(step 7), such that it cannot be triggered during the next steps. Moreover, the timer's remaining
time (2ms) is kept for the next timer configuration. Then, the PMU counters are saved into the
current sample data structure (step 8) and the TSC is again read (step 9). Although this does
not represent the end of the current sample, the time information is still obtained in order to keep
track of the sample duration.
At 14ms of the execution time, the target task is scheduled in again, and a procedure similar
to the one already described is performed. The main difference from the previous explanation
occurs in steps 2 and 3. Since the application is still running event-set 0 (with 2ms remaining to
complete the sampling time interval), the current sample contains the counts from the previous
run. Therefore, when configuring the PMU, instead of initializing the PMCs and PFCs to zero,
the current sample values are restored into the counters, thus allowing the previous sample to
continue. In a similar way, when setting the high-resolution timer, instead of the default
sampling time interval, it is now started with the time left from the previous sample run, i.e., 2ms.
After 2ms, the timer is triggered before the task is scheduled out. As a result, at 16ms of the
illustrated execution example (see Figure 4.4(b)), the corresponding interruption occurs, allowing
the sample to be completed (step a) and the proper reconfigurations for the next sampling
interval to be performed. After closing the sample, it is dispatched to user-space (step b) by
means of shared memory (as explained in detail in the following text). With the completion of the
previous sample, the structure holding the current sample information is reset (step c), i.e., the
counter values are again initialized to zero. The PMU is reconfigured according to event-set 1
(step d) and the timer is set to count a complete sampling time interval (step e). Afterwards,
the previously described procedure is repeated until the end of the task execution (from 16ms
onwards in Figure 4.4(b)).
It is important to note that the described procedure includes several optimization techniques to
provide accurate performance sampling with minimal introduced overheads. Firstly, whenever the
task is scheduled in or out of the CPU, the corresponding steps (2-5 and 6-9 in Figure 4.4(a)) are
executed inside the scheduler. This allows reducing the visible profiling overhead, since no task is
currently running on the CPU. Secondly, when a timer interruption occurs, the counters are
stopped and started in a way that minimizes the tool's interference with the counter values.
Hence, it is important to note that the overhead referred to herein corresponds to the overhead
induced on the profiling information, and not to the overall system's overhead.
The energy status sampling procedure is similar to the one described for performance. The
main difference relates to the fact that the energy status interface cannot be configured, thus only
the readings are performed. Moreover, power/energy consumption sampling is only performed
by the leader tasks, as opposed to performance sampling, which is performed by all tasks. At
the moment, RAPL energy status samples are taken with the same granularity as performance
samples, i.e., they share the same sampling time interval. However, leaders use a distinct
high-resolution timer for performance and power/energy, which facilitates the possible
introduction of a different sampling time interval for power/energy.
Kernel-User Communication
Up until now, in the context of sampling, the action of finishing a sample was usually referred to
as "producing" or "dispatching" a sample. The mechanism used by SchedMon for exchanging
produced samples between the kernel and user-space is actually one of the most complex ones,
since it comprises a memory ring-buffer, a virtual memory area shared between the kernel
and user-space, and a synchronization mechanism. This mechanism was implemented in
order to reduce the communication overhead between the kernel and user-space, since there is no
need to replicate the produced sample data.
A ring-buffer is a memory buffer of limited size, managed in a circular way. For instance, if a ring-
buffer contains ten available slots, it can be filled from slot 0 to slot 9 and, when the end of the buffer
is reached, it starts filling the slots from the beginning again. This type of mechanism is commonly used
in producer-consumer problems, where data is frequently exchanged and temporarily stored in
memory. Following this producer-consumer methodology, SchedMon implements a ring-buffer
to facilitate the exchange of the produced samples between the kernel and the user-space.
Linux organizes memory by means of pages, which represent chunks of physical memory and
are usually 4kB in size. SchedMon's ring-buffer is therefore an abstraction over an array
containing one or several memory pages. Figure 4.5 depicts the implementation and functionality
of the tool's ring-buffer, providing information not only about the main data structures, but also
about the spatial disposition of the buffer in virtual memory, from both the user- and kernel-space
points of view.
SchedMon's driver implements the ring-buffer abstraction by allocating a number of pages
and keeping their addresses in a data_array structure. As illustrated on the left side of Figure 4.5,
the virtual addresses of the pages do not necessarily need to be ordered, since they are individually
allocated, i.e., page by page. In order to track the next writing and reading positions, the driver
declares two 32-bit variables, head and tail, which identify the next free and the next filled
positions, respectively, as a page number and an offset within that page. For example, as can be
observed in Figure 4.5, if the head is positioned at the beginning of page 3, its page_number is 3 and its offset is 0.
A virtual memory area corresponds to a chunk of memory that is shared between the kernel
and the user-space. In order to obtain profiling information when using SchedMon, the user-space
program needs to reserve this space by performing an mmap() system call, providing the
required number of ring-buffer pages, i.e., the shared memory size. The driver then creates
the buffer and obtains the start address of the virtual memory area, which is translated
into a user-space virtual address. From this point on, both kernel and user-space have a common
4. Scheduler-Based Monitoring Tool (SchedMon)
Figure 4.5: SchedMon ring-buffer implementation overview. (The figure shows the data_array holding the ring-buffer pages, the head and tail fields encoded as a page number plus an offset, and the ring-buffer abstraction with its empty sample slots as seen from both the user-space and kernel-space virtual memory views.)
shared chunk of memory, which does not necessarily correspond to the same virtual address.
In order to synchronize the communication between the kernel and the user-space, a specific
protocol must be implemented. This protocol is needed to instruct the user-space process on how to
read the information provided in the buffer. For this, an 8-bit header is added at the beginning of
each sample, which defines the type of the sample stored in the following memory positions. For example, 1
refers to PMU samples, while 2 refers to RAPL samples. SchedMon's library also provides all the
data structures that are necessary for communication purposes. Therefore, since both sides know
the size and structure of each sample type, and by following the ring-buffer implementation, it is
possible to extract meaningful information from the buffer.
Finally, a specific synchronization mechanism is implemented to alert the user-space
program whenever new data is available, since it is not possible to directly access the driver's
ring-buffer structure information from the user-space. This is done by means of the poll() system
call, which lets the user-space process listen, while in a sleep state, for a predefined set of events on
a number of file descriptors. By using this facility, the user-space process is able to detect when
new data is available. SchedMon allows the user to configure the size of the burst, i.e., the number
of samples to be consumed at a time. Therefore, the driver triggers the corresponding poll() event
whenever the required burst size is available for consumption.
Concurrency and Deadlock Avoidance
Concurrency and deadlock avoidance are two main concerns that have to be carefully taken
into account when programming a Linux device driver, especially when interacting with complex internal
mechanisms like the Linux scheduler. SchedMon's driver contains several data structures that
are shared by all tasks belonging to an application, e.g., the ring-buffer and the child_list
infrastructures. Furthermore, some data structures are even shared among all the tasks registered
in the driver, e.g., the remaining task list infrastructures. Since most of these structures are usually
handled during the Linux scheduler execution, it is important to use specific locking mechanisms
that do not sleep. In Linux, this type of mechanism is the spinlock: a lock, implemented in a few
assembly instructions, that busy-waits on a variable (the lock holder) until it is released by the
task keeping it. Therefore, in all code regions that require mutual exclusion, the appropriate
spinlock mechanisms are used.
Avoiding deadlocks when dealing with Linux internal mechanisms is more complicated, since
it requires prior knowledge of how those mechanisms are implemented. In fact, there
is a set of specific actions that SchedMon's driver is not allowed to perform at run-time. A good
illustrative example occurs when the driver detects that a burst-size number of samples is available
for the user to consume. This event occurs when a new sample is written to the buffer, either in
interruption mode or when the task is scheduled out. In both situations, the driver is not allowed to
sleep. The function used to trigger the poll() event to the user-space, i.e., to signal
that there is new information to read, is named wake_up() and, when called, it requires obtaining
the locks associated with the task being awakened. However, the corresponding locks might be
held by some other Linux infrastructure, like the scheduler, which may cause the task to enter
a sleep state, rendering the complete system unresponsive.
In Linux, this type of situation is avoided by means of the irq_work_queue
mechanism. This infrastructure allows postponing jobs, which are executed as soon as possible by
triggering a system interruption once interruptions are re-enabled. In our specific case,
this refers to the moment when the Linux scheduler finishes executing. Hence, whenever a possible
deadlock situation is detected, SchedMon's driver resorts to this mechanism in order to avoid it.
4.2.2 User-space Tool
SchedMon's user-space component, smon, is integrated in the tool in order to facilitate
access to, and handling of, the underlying driver. By making use of the driver's user-space library
for configuration purposes, and by means of the mmap() and poll() system calls, smon exposes
the whole tool's functionality through an easy-to-use command-line interface.
The main functionalities of smon include i) the creation of events, ii) the definition of event-
sets, by using the already created events, and iii) the ability to profile an application. Smon
Request       Description
SMON_IOCSEVT  Set new event. A structure holding the event configuration must be sent.
SMON_IOCGEVT  Get event information. The event id must be provided. If the event id exists, a structure containing the event description is returned.
SMON_IOCCEVT  Check if event id exists. Returns 0 if true.
SMON_IOCSEVS  Set event-set. A structure holding the event-set configuration must be provided.
SMON_IOCGEVS  Get event-set. The id must be provided. If the id exists, a structure with the event-set configuration is returned.
SMON_IOCCEVS  Check if event-set id exists. Returns 0 if true.
SMON_IOCSTSK  Register task into the driver. The task PID, along with an environment configuration, must be provided.
SMON_IOCUTSK  Unregister task. This must be used when the task no longer needs to be monitored, even if its execution has already finished.
SMON_IOCREAD  Consume N bytes from the buffer, i.e., instruct the driver that N bytes have been read.
Table 4.1: Available ioctl() requests to SchedMon's driver.
firstly parses the user's input, to detect the required command and the input configuration sets for
that command. The command-line interface is explained in detail in Section 4.3.
Apart from parsing and verifying the user's input, all three main functionalities rely on
ioctl() system calls to perform the required set of actions. Table 4.1 enumerates and
describes the available ioctl() requests provided by the driver. The first six requests serve for
handling events and event-sets, and they represent the whole mechanism behind these two func-
tionalities. The last three requests are used for profiling purposes: registering and unregistering
tasks, as well as consuming memory chunks from the ring-buffer.
Application Profiling
In contrast to event and event-set handling, profiling requires a number of different mecha-
nisms in order to work properly. Firstly, the target application execution must be handled. This
is done by forking a new process, whose execution image is then switched to the required
one. The forked child process, before proceeding to the execve() system call, waits for a shared
semaphore to be released by smon's main (parent) process. Meanwhile, on the parent
process side, after the child is forked, a request is sent to the driver for the ring-buffer
allocation. As previously referred, this allows information exchange between the driver and the
user-space process (smon).
When the memory buffer is set, the child's PID is registered into SchedMon's driver by using an
ioctl() call with the proper registration request (SMON_IOCSTSK), as shown in Table
4.1. In order to instruct the driver to start monitoring the target child task after the execve() call
is executed, the registered task is configured with the on_exec flag enabled, therefore setting the
task to start being monitored as soon as it switches its program’s execution image. This guarantees
that the user target application is fully profiled, from the moment its execution begins. At this
point, the parent process releases the shared semaphore that prevents the child from executing,
and the profiling is initiated.
Profiling is performed by using the same ring-buffer methodology used by the driver. Since the
driver is not aware of how the user-space program handles the shared memory region, SchedMon's
user-space library provides the necessary routines to handle the ring-buffer and, therefore, the
proper communication with the driver.
The process of reading samples from the buffer is initiated by looping around a poll() system
call. This puts the calling process in a sleep state until a new burst of samples is available. Each
time an available data burst is detected, the requested amount of samples is read. Since
different sample types correspond to different data structures and might even have different sizes,
an 8-bit header is injected by the driver before each sample, containing the information about
the sample type. Whenever a ring-buffer memory page is exhausted, a "stuff" header
is injected after the last sample, thus identifying the end of that page. Finally, at the end of the
profiling, an "end-of-profiling" header is inserted.
Cache-aware Roofline Model
As previously mentioned, SchedMon provides a predefined profiling configuration, which outputs a
full performance evaluation based on the CARM. Similarly to SpyMon, this is achieved by running
the predefined event-set configuration shown in Table 3.1. This application profiling is performed
according to the previously described methodology. The main difference lies in the fact that, when
running in CARM mode, there is no need to define which events to monitor.
In SchedMon, an additional functionality is provided to automatically create the CARM for
the detected general-purpose multi-core architecture. As mentioned in Section 2.6, to build
the CARM, it is necessary to assess i) the peak FP performance of the architecture; and ii) the
attainable bandwidth for the different cache levels. Although the theoretical peak FP performance
and L1 bandwidth can be derived directly from the device manufacturer's data sheets, assessing
the bandwidth for deeper cache levels must be performed by relying on specific micro-benchmarks.
In order to ease this process, the proposed tool integrates specific assembly-level tests for de-
termining these bandwidth values, as presented in Algorithm 4.1 for Double Precision (DP) FP
AVX instructions. The adopted test procedure varies the size of the transferred data to hit differ-
ent cache levels, by accessing contiguous and increasing memory addresses. To obtain accurate
bandwidth values, each test code is repeated 8192 times, in order to favor throughput over
latency, and, in each repetition, the values of the monitored performance counters are assessed. In
detail, MEM_UOP_RETIRED_ALL_LOADS and MEM_UOP_RETIRED_ALL_STORES were used to determine
the number of performed load and store operations, respectively.
Algorithm 4.1 Bandwidth test code
vmovapd 0(%rax), %ymm0;
vmovapd 32(%rax), %ymm1;
vmovapd %ymm2, 64(%rax);
vmovapd 96(%rax), %ymm3;
vmovapd 128(%rax), %ymm4;
vmovapd %ymm5, 160(%rax);
vmovapd 192(%rax), %ymm6;
vmovapd 224(%rax), %ymm7;
vmovapd %ymm8, 256(%rax);
. . . ;
Algorithm 4.2 FP MAD test code
vmulpd %ymm0, %ymm0, %ymm0;
vaddpd %ymm1, %ymm1, %ymm1;
vmulpd %ymm2, %ymm2, %ymm2;
vaddpd %ymm3, %ymm3, %ymm3;
vmulpd %ymm4, %ymm4, %ymm4;
vaddpd %ymm5, %ymm5, %ymm5;
vmulpd %ymm6, %ymm6, %ymm6;
vaddpd %ymm7, %ymm7, %ymm7;
vmulpd %ymm8, %ymm8, %ymm8;
...;
A similar procedure is adopted for determining the peak FP performance, by relying on a
set of benchmarks as depicted in Algorithm 4.2. For this particular case, the peak FP
MAD performance of DP FP AVX instructions is assessed by relying on the FP_AVX_PACKED_DOUBLE
PME. When assessing the peak FP performance for other types of FP instructions, such as SSE or scalar double,
different PMEs are used, as presented in Table 3.2.
The reported experimentally obtained bandwidth and performance values represent a median
of the counter readings from all 8192 runs.
Function Call Tracing
Function call tracing is the process of detecting whenever a target application, the
tracee, enters or leaves a function call. This is an important feature for detecting potential
execution bottlenecks in the most time-consuming parts of the application. This functionality is
introduced in smon and cannot be provided by solely using the tool's driver, since it is implemented
in the user-space.
An application's binary executable files may be dumped in order to extract useful information
about the program. Figure 4.6 shows the dumped assembly code of a simple hello function, which
prints the traditional "Hello World!" message to the screen. The squares indicate the
function's entry and return instructions, respectively.
The method used by smon to detect the entry and return points of a function requires
preprocessing the dumped assembly code of the application. The detected execution points are then
assigned to breakpoint structures, which hold the original bytes contained at those positions and
are used to inject code into those same memory addresses. For instance, in the case of the example
illustrated in Figure 4.6, two breakpoints are created and, once the memory bytes represented by
the squares are saved, each of them is replaced by the CC opcode. This is the trap instruction
(int3), used for the purpose of tracing the execution of a process.
A well-known mechanism that makes use of this instruction is the debugger.
In order to peek or inject code into a running application, the ptrace() system call must be
used. This call allows tracing a target process and provides a vast set of functionalities, such
Figure 4.6: Example of a function’s dump information.
as detecting when the tracee performs system calls or forks a new thread or process, and
controlling its execution at run-time.
Each time one of the tracee's trap instructions is executed, the program stops and the tracer process
is alerted to this event by means of a SIGTRAP signal. By taking advantage of the above-referred
mechanisms, smon is able to detect when a process enters or leaves a function. When smon
detects that one of the application's breakpoints was reached, it replaces the trap instruction byte with the
original byte, thus allowing the application to proceed with its execution. Since ptrace() allows
executing the tracee in single-step mode, smon instructs the target process to execute only one
instruction and, after that, the breakpoint is enabled again (in order to catch repeated calls to
the same function).
Figure 4.7 depicts the main structures used by SchedMon to keep track of the function
call tracing information. As already mentioned, the trace_breakpoint structure corresponds to a
point of interest in the tracee's program execution memory and contains the corresponding memory
address, the original data contained in the executable file, and the new data (trap instruction)
that replaces it. Each breakpoint is associated with a function, which is represented by the
trace_function data structure. Besides the breakpoint information, this structure contains the
name of the corresponding function and its start and end addresses. Keeping the start and end
addresses enables a binary search when looking for the hit breakpoint. For
each executable file being traced by SchedMon, there is a corresponding trace_mem_info data
structure. This structure contains a set of functions from the program and, therefore, holds all the
necessary information for tracing the target executable. The trace_task structure allows the tool
to trace multi-threaded applications, by keeping track of each forked process or thread individually.
Each task is thus associated with its execution code (trace_mem_info).
In order to keep track of a process execution flow, including when it forks or switches its
execution image, a set of options must be used when the tracing is initialized. This can be done
by calling ptrace() with the PTRACE_SETOPTIONS command parameter. SchedMon makes use of
the corresponding set of options in order to detect:
• Forks - this option enables detecting whenever a new thread or process is spawned by the
tracee. When this happens, a new trace_task structure is created and inserted into the
task list. This new task inherits the trace_mem_info structure of its parent.
Figure 4.7: SchedMon function call tracing data structures (trace_task: pid, mem_info, task_list; trace_mem_info: nr_functions, function_arr; trace_function: nr_breaks, break_arr, function_name, start_addr, end_addr; trace_breakpoint: address, original_data, new_data).
• Execution swaps - this option allows the tool to detect whenever the tracee performs an
execve() system call. For each detected call, a new trace_mem_info structure is created
and attached to the target task.
• Terminations - ptrace() also allows detecting when the monitored task finishes executing.
This functionality is used by SchedMon in order to terminate the tracing of the target task.
4.3 Usage
This section describes how to use SchedMon in order to obtain the appropriate
profiling results. As already mentioned, the tool incorporates not only a Linux kernel driver, which
contains the core functionality, but also a user-space tool, smon, which exposes that functionality
to the user as a simple-to-use command-line interface.
Currently, smon provides four commands with distinct functionalities: i) event,
which allows creating and adding new events to the tool; ii) evset, which provides the means to
define new event-sets, i.e., new PMU configurations; iii) profile, which allows the full profiling
of target applications; and, finally, iv) roof, which provides useful architectural insights based on
the CARM.
4.3.1 Adding Events
Adding an event definition to the tool is done by providing the PMSR field configuration
parameters. In order to facilitate the later recognition of a configured event by the user, a tag
identifier should be provided. Figure 4.8 shows the smon event command usage.
In order to add a new event definition, three arguments must be provided: i) the TAG, which
holds the event identification, ii) the EVSEL, which is the 8-bit value corresponding to the PMSR’s
event selector bit field, and iii) the UMASK, corresponding to the unit mask field of that same MSR.
usage: smon event --add|-a tag=TAG,evsel=EVSEL,umask=UMASK[,mode=MODE]
       smon event --list|-l

List of <event --add> parameters:
TAG    String to tag the new event.
MODE   2-bit value defining the running mode (user-1, kernel-2 or both-3).
EVSEL  8-bit event selector value.
UMASK  8-bit unit mask value.
Figure 4.8: Smon event usage information.
An additional MODE can be added to the configuration. This field defines in which mode (or modes)
the target event counts are made, and it may take the values of 1 (user-space), 2 (kernel-space)
or 3 (both). If not provided, the default value for this field is 3, meaning the event is counted
when the CPU operates in either mode.
There is a second sub-command, --list, that prints out the list of already
configured events, including the event tags and configured fields. Moreover, each event is assigned
an integer identification value, which can be used later when defining event-sets.
4.3.2 Defining Event-sets
Similarly to the described event functionality, the smon evset command allows adding new
event-set definitions and listing the already configured ones. Figure 4.9 demonstrates the usage of
this command.
usage: smon evset --add|-a tag=TAG,events=EVID[:EVID[...]][,fixed=FIXED]
       smon evset --list|-l

List of <evset --add> parameters:
TAG    String to tag the new event-set.
EVID   Event ID. Check <smon event -l> for a list of available events.
FIXED  12-bit number (4 bits for each fixed ctr): 0-Disabled 1-OS 2-User 3-Both
Figure 4.9: Smon evset usage information.
In order to create a new event-set, at least two parameters must be provided. The first param-
eter, TAG, allows an easier identification of the event-set without requiring a check of its configuration
fields. The second parameter refers to a sequential set of general-purpose events that must be
provided; for this, several event identification numbers are passed via the EVID parame-
ter. The number of events is limited to the number of underlying hardware PMCs. In addition to
PMCs, smon allows the configuration of the PFCs. This is done by providing a single hexadecimal
value through the FIXED parameter. For example, if PFC0 and PFC2 need to be enabled for both
privilege modes (user and OS), the correct parameter value would be 0x303.
4.3.3 Application Profiling
In order to profile an application with smon, the required event-sets must already be defined
in the driver, as well as the events needed to create those event-sets. Figure 4.10 illustrates the usage of
the smon profile sub-command, providing a complete list and description of each individual
option.
usage: smon profile [[options]] PROG [ARGS...]

List of available options:
-b BURST        Burst size, i.e., nr of samples transferred at a time (default is 1000)
-c CPUMASK      Bind task to specific logical CPUs (e.g., to bind to CPUs 0,1 & 6 -> CPUMASK=0x43)
-e ESID:[...]   Eventset(s) to monitor (if more than one, time-multiplexed round-robin style)
-f              Deliver information about Forking for the monitored task(s)
-i              Children (recursive) of monitored process will Inherit monitoring
-m              Deliver CPU Migration information
-o O_FILE       Output file (default is "smon.data")
-p MMAP_PAGES   Number {power of 2} of mmap Pages (default is 1024)
-r DOMAIN:[...] Deliver RAPL information for specified domains at the time granularity of STIME
-s              Deliver CPU Scheduling information (this might have big overhead)
-t STIME        Sample Time in milliseconds (default is 1000)
-x T_FILE       Activate function call tracing and output information to T_FILE (default is "smon.trace")
Figure 4.10: Smon profile usage information.
The smon profile operation provides several options that allow not only configuring the tool's
execution, but also defining what kind of sampling information is required. The majority of the
depicted options correspond to previously explained functionalities or configurations of SchedMon.
However, the interface enables an extra functionality that was not previously described. By using
the -c option, smon provides a way of binding the target task to a specific set of CPU cores, i.e.,
of restricting the task to be scheduled only onto the provided set of LPCs. To achieve this, the
value of the CPUMASK parameter must be specified. This value is a hexadecimal number, where
a bit set to one enables the LPC corresponding to that bit position in the provided bit word, as
depicted in Figure 4.10.
Another feature worth highlighting is the possibility of providing the output infor-
mation in two separate files: the T_FILE, when the function call tracing option is enabled, and the
O_FILE, which holds all the remaining profiling information. The format of the information contained
in these files is explained later.
usage: smon roof-run [-t STIME] [-r DOMAIN:[...]] [-o OUTFILE] PROG [ARGS...]
       smon roof-creat

List of parameters:
STIME    Sampling time interval in ms (default is 10)
DOMAIN   Energy status domain (pkg, pp0, pp1 or dram)
OUTFILE  Output file (default is "smon.data")
Figure 4.11: Smon roof-run and roof-creat usage information.
4.3.4 Cache-aware Roofline Mode
In order to ease performance assessment, SchedMon provides a predefined perfor-
mance configuration which outputs the information according to the CARM [11]. This
functionality is also integrated in smon's command-line interface, thus providing an easy and intu-
itive usage. Figure 4.11 illustrates the usage information of the CARM-related commands. As previously
mentioned, in addition to the traditional cache-aware roofline evaluation, SchedMon provides a way
to generate the model parameters by executing predefined micro-benchmarks. This functionality
not only improves the model parameters for the underlying architecture, but also facilitates
the tool's portability to different architectures.
The command-line usage for the referred functionalities is straightforward, since all the
configurations are already hard-coded into the tool. The only configurable parameters allow: i)
changing the sampling time interval; ii) enabling energy status profiling;
and iii) redirecting the output information to a different file. As already mentioned, power metering
information is not included in the CARM and is therefore provided as extra information.
4.3.5 Information Output
SchedMon defines two different file types that contain the profiling information output:
• smon.data - this file contains all the profiling output, with the exception of the function
call tracing information. Currently, the file is formatted in ASCII and each line contains a
single profiling sample (e.g., PMU, RAPL or scheduling information).
• smon.trace - if the -x option is enabled at profiling time, the function call tracing
information is stored in this file. Each line of the file contains the time-stamp of
when a specific application function was called (or returned).
When running in CARM mode, the performance sampling information is stored in the smon.data
file, which is processed after the application run in order to generate a third file containing the
performance counts plotted against the CARM (smon.plot). SchedMon also provides a set of
scripts that facilitate parsing the output information.
4.4 Summary
In this chapter, a new, easy and intuitive scheduler-based application profiling tool (SchedMon)
was proposed. The tool targets independence from any available performance or power interface, and
it is designed in a modular way, which not only facilitates portability, but also eases the
addition of future functionalities.
SchedMon is composed of i) a Linux kernel module, or driver, that facilitates the access to
the underlying hardware performance and power facilities and helps overcome possible privilege
restrictions; ii) a user-space library, which allows the interaction between user-space programs and
the driver and, therefore, the ability to perform run-time application profiling; and iii) a user-space
tool (smon) that not only facilitates the usage of the tool, but also provides a new function
call tracing functionality.
SchedMon gathers most of the state-of-the-art performance monitoring capabilities, energy status
information and application function tracing, and packs them into a simple and intuitive command-
line interface. Furthermore, it provides the ability not only to determine the underlying architecture's
attainable performance, according to the CARM, but also to output, in a single plot, the exe-
cution performance profiling information against this model, which facilitates the understanding
of the underlying hardware resources and allows detecting possible architectural or application
bottlenecks.
5 Experimental Results

Contents
5.1 Experimental Environment . . . 62
5.2 SpyMon Experimental Evaluation . . . 62
5.3 SchedMon . . . 68
5.4 Overhead Discussion . . . 75
5.5 Summary . . . 79
This chapter targets the evaluation and functionality demonstration of the presented
tools, by illustrating and analyzing the results obtained for different experimental scenarios. Section
5.1 describes the experimental environment, i.e., the experimental conditions under which the presented
tests were run. Section 5.2 explores and analyzes the main functionalities of the SpyMon tool.
Similarly, Section 5.3 presents the experimental evaluation of the proposed scheduler-based profiling
tool, SchedMon. In order to compare the efficiency of both tools, similar experimental scenarios
are considered; however, additional results are presented for each tool, in order to exercise the
extra functionalities not shared between them. At last, Section 5.4 discusses the overheads
introduced by both performance and power/energy consumption monitoring.
5.1 Experimental Environment
The presented results were obtained on a machine containing an Intel Core i7-3770K processor,
which is based on the Ivy Bridge micro-architecture, with 4 physical cores and hyper-threading
support, i.e., 8 LPCs. It operates at 3.5GHz, although it can attain 3.9GHz in turbo boost mode,
and its memory organization comprises 3 cache levels, namely: a 32kB L1 cache, a 256kB L2 cache
and an 8192kB L3 cache. The L1 and L2 cache levels are shared between the LPCs contained in the
same PPC, while the last-level cache, L3, is shared among all LPCs. The DRAM memory controller
supports up to two channels of DDR3 operating at 2x933MHz.
The above described architecture provides a PMU containing 3 PFCs and 4 PMCs. With respect
to the energy status interface, information regarding the package, power-plane 0 and power-plane 1
is available. The performance and power hardware facilities are configured as described in Sections
2.1 and 2.2.
The Linux kernel uses the non-maskable interrupt (NMI) watchdog to periodically detect whether a
CPU is locked up. In order to achieve this, the watchdog makes use of the underlying PMCs, which in turn
interferes with any mechanism that makes use of the hardware PMU. Therefore, before executing
the presented tests, the Linux watchdog was disabled. This can be achieved by writing the value
of 0 into the /proc/sys/kernel/nmi_watchdog system configuration file.
Since Intel’s turbo boost functionality is implemented in a complex way and may, therefore,
complicate the understanding of the obtained performance results, the processor’s clock was
set to a fixed frequency of 3.5GHz, which corresponds to the maximum non-turbo frequency.
5.2 SpyMon Experimental Evaluation
This Section presents the obtained experimental results for the SpyMon monitoring tool. The
performed experimental evaluation was conducted in order to demonstrate the tool’s capability for
a system-wide performance and power/energy consumption analysis. Moreover, a set of standard
FP benchmarks from the SPEC CPU2006 suite [10] is evaluated in terms of both performance and
power/energy consumption and their CARM analysis is also presented.
5.2.1 System-wide Profiling
Figure 5.1 illustrates a performance evaluation of four distinct SPEC CPU2006 benchmarks
(milc, namd, GemsFDTD and tonto). In order to obtain the depicted results, each benchmark
test was executed individually, without the interference of any other applications (with the excep-
tion of the OS tasks). For each execution, the benchmark process was pinned to its corresponding
LPC, as shown in Figures 5.1(a), 5.1(c), 5.1(e) and 5.1(g) for milc, namd, GemsFDTD and
tonto, respectively. Each of the shown LPCs was chosen in order to belong to a distinct PPC.
After running each of the four tests individually, a final run was performed, in which all the four
tests were run at the same time. The obtained results are presented in Figures 5.1(b), 5.1(d), 5.1(f)
and 5.1(h). In each of the runs, the sampling time interval was set to 20ms.
By analyzing Figure 5.1, several informational details can be extracted. First, all the bench-
marks achieve lower performance when run alongside each other, due to shared resource contention.
This conclusion follows from observing that i) each benchmark's duration is longer
when run alongside the others, and ii) each benchmark's performance values are significantly lower. For
example, milc takes around 210s to execute alone, in contrast to around 280s when run alongside
the others. In addition, it achieves performance values of around 3.1GFlops/s when running alone,
as opposed to a maximum value of around 2.9GFlops/s when executed with the others.
Another interesting observation concerns the shapes of the obtained plots, where different parts
of the execution can be detected. For instance, when running the milc benchmark alone (see
Figure 5.1(a)), at least three distinct execution phases can be identified, where each of them occurs
at regular time intervals and delivers a different attainable performance (in GFlops/s). However,
when run together, the shapes of each benchmark execution appear to change according to the
concurrent applications. For example, the shape of the GemsFDTD benchmark is completely
distorted when run with the other applications (see Figures 5.1(e) and 5.1(f)).
It is important to determine where in the architecture the previously described performance
interferences happen. As already referred, each benchmark was run on a different PPC and, there-
fore, does not share any in-core computational resources with the other applications. Hence, the
interference between applications can be associated to memory contentions in the shared cache
level (L3) and in DRAM.
Since the inter-task interference mainly relates to memory contention, an interesting phe-
nomenon can be observed for namd, whose shape does not seem to be affected by the other
benchmarks (see Figures 5.1(c) and 5.1(d)). This happens because namd is most likely compute-
bound, i.e., its performance is mainly limited by the predominant computations, and does not
highly depend on the memory operations. Therefore, while the other benchmarks seem to dispute
the access to the shared memory resources, namd mostly depends on the in-core computational
[Figure 5.1: eight performance plots, Performance [GFlops/s] (0-5) vs. Time [s] (0-400): (a) Milc running alone (core 0); (b) Milc running with others (core 0); (c) Namd running alone (core 1); (d) Namd running with others (core 1); (e) GemsFDTD running alone (core 2); (f) GemsFDTD running with others (core 2); (g) Tonto running alone (core 3); (h) Tonto running with others (core 3).]
Figure 5.1: SpyMon performance evaluation of SPEC CPU2006 benchmarks, for a 20ms sampling time interval.
resources, and the attained performance corresponds to the one obtained when namd was executed
without any interference of other applications.
Finally, it should be emphasized that the presented performance results might include a small
OS interference, since SpyMon monitors the core and not the application itself. Thus, the presented
results correspond to the execution of all tasks detected by the PMU during the time of the
[Figure 5.2: five power plots, Power [W] (20-40) vs. Time [s] (0-400): (a) Milc running alone (core 0); (b) Namd running alone (core 1); (c) GemsFDTD running alone (core 2); (d) Tonto running alone (core 3); (e) Four benchmarks run simultaneously, annotated with the points where tonto, milc, namd and GemsFDTD finish.]
Figure 5.2: Power consumption of four benchmarks run separately and simultaneously.
benchmarks’ execution.
Figure 5.2 depicts the experimentally obtained power consumption for the above described
test conditions. The plotted information corresponds to the package domain, i.e., it represents the
power consumption of the whole chip. When each benchmark is executed alone (see Figures 5.2(a),
5.2(b), 5.2(c) and 5.2(d)), the chip power consumption is around 25W. As it can be observed, the
power consumption depends not only on whether a core is active, but also on the
resource utilization. For instance, as shown in Figure 5.2(a), the power consumption assumes a
shape similar to the one observed in the milc performance profile (see Figure 5.1(a)). On the
other hand, Figure 5.2(e) shows the power consumption when all benchmarks were simultaneously
executed. As it can be observed, each additional activated LPC corresponds to an increment of
approximately 5W in the system’s power consumption (see Figure 5.2(e)).
5.2.2 Cache-aware Roofline Model Analysis
In order to obtain a more detailed picture on the attainable performance of the applications
from an architecture point of view, a set of benchmarks is analyzed by relying on the CARM.
Figure 5.3 illustrates the execution of calculix and tonto benchmarks, where each dot in the
CARM represents a different monitoring sample. In both cases, the sampling time interval was
set to 50ms and different colors were used to represent the predominant FP types (scalar, SSE or
AVX).
[Figure 5.3: two CARM plots, Performance [GFlops/s] vs. Operational Intensity [flops/byte], with samples colored by predominant FP type (DBL, SSE, AVX) and rooflines for (1) AVX (MAD), (2) AVX (ADD,MUL) / SSE (MAD), (3) SSE (ADD,MUL) / DBL (MAD) and (4) DBL (ADD,MUL): (a) Calculix; (b) Tonto.]
Figure 5.3: Evaluation of SPEC CPU2006 benchmarks by using the CARM. The sampling time interval was set to 50ms.
[Figure 5.4: three-dimensional plot of Performance [GFlops/s] vs. Operational Intensity [flops/byte] vs. Time [s], with samples colored by predominant FP type (DBL, SSE, AVX).]
Figure 5.4: Temporal representation of the CARM for Tonto.
Figure 5.3(a) contains the performance profiling information for calculix. As it can be noticed,
there are two predominant types of FP instructions, scalar and AVX, and each type is associated
with its corresponding roof line. For instance, by observing the blue dots, one can see that
they form a shape similar to that of the roof lines, thus confirming that the roof lines
delimit the attainable performance.
According to the CARM, it can be observed that calculix is not completely compute-bound,
since there are some parts of its execution where it is in the memory-bound area. A good example
of this can be observed in the higher performance FP scalar samples, i.e., the top DBL samples,
which even trace a ridge point similar to that of the model. Moreover, since this part of the exe-
cution reaches higher values than the scalar attainable performance for ADD or MUL operations,
simultaneous addition and multiplication operations (MAD) were likely performed.
Figure 5.3(b) contains the CARM information for tonto. Similarly to calculix, this test
presents two distinct performance parts, which contain the predominant scalar and SSE FP types,
respectively. As it can be observed in Figure 5.4, the two distinct parts of the execution alternate
in time. During the parts of the execution corresponding to the scalar instructions
(DBL), one can conclude that tonto is mainly memory-bound, since it lies to the left of the
corresponding ridge point, both for the ADD/MUL and MAD roof lines. In fact, Figure 5.1 shows
that these zones of the execution are memory-dependent and inflict changes in the performance
shapes of applications running alongside. On the other hand, when executing SSE instructions,
tonto is considered to be more compute-bound.
Figure 5.5 illustrates the performance CARM analysis for a number of SPEC CPU2006 bench-
marks. Each point corresponds to the average of the measured samples for each test and the colors
have the same meaning as previously explained. According to the CARM, the namd, calculix and
milc benchmarks are considered to be compute-bound for their average execution. On the other
hand, the soplex, povray and lbm benchmarks are considered to be memory-bound. Finally, gamess,
tonto and GemsFDTD might be considered either memory-bound or compute-bound, depending
on the usage of FP operations.
5.2.3 Power/Energy Consumption Evaluation
In order to demonstrate SpyMon’s power/energy consumption monitoring functionality, in ad-
dition to the previously illustrated results, a set of standard SPEC CPU2006 benchmarks was run
individually, with the tool’s predefined process configuration (one spy for each LPC) and a sampling
time interval of 50ms. Figure 5.6 shows, in time, the obtained power consumption measurements
for calculix and tonto. The plotted information corresponds only to the package domain, i.e., it
represents the power consumption of the whole chip. Similarly to what was previously described
for the CARM analysis, distinct types of FP correspond to different colors.
In contrast to tonto, where different execution phases are interleaved at regular intervals,
[Figure 5.5: CARM plot, Performance [GFlops/s] vs. Operational Intensity [flops/bytes], showing the average sample of each benchmark (GemsFDTD, calculix, gamess, lbm, milc, namd, povray, soplex, tonto) against the AVX, SSE and DBL MAD and (ADD,MUL) rooflines and the AVX/SSE and DBL L1 LOAD rooflines.]
Figure 5.5: Application CARM plot showing the floating-point SPEC CPU2006 benchmarks; the application color characterization was made according to the average classification (double, SSE or AVX).
[Figure 5.6: two power plots, Power [W] (20-30) vs. Time [s], with samples colored by predominant FP type (DBL, SSE, AVX): (a) Calculix; (b) Tonto.]
Figure 5.6: Power evaluation of SPEC CPU2006 benchmarks.
calculix mixes the use of both AVX and scalar FP operations. Furthermore, it can be observed
that distinct FP types correspond to different levels of power consumption, where scalar regions
indicate the lowest values and AVX the highest.
Figure 5.7 illustrates the average power consumption, as well as the total energy consumption
for a number of different SPEC CPU2006 benchmarks. Since the average power consumption does
not significantly vary across the different benchmarks, the differences in the energy consumption
mostly relate to the duration of each benchmark.
5.3 SchedMon
This section presents the obtained experimental results for the SchedMon monitoring tool. The
performed experimental tests are intended to illustrate the capabilities of the tool, in terms of
its distinct functionalities. First, a Finite-Difference Time-Domain (FDTD) OpenCL multi-threaded
application [14] is tested in order to illustrate the tool’s ability to detect multiple
[Figure 5.7: bar chart of average Power [W] and total Energy [kJ] for GemsFDTD, calculix, gamess, gromacs, lbm, milc, namd, soplex and tonto.]
Figure 5.7: Power and energy evaluation for different floating-point SPEC CPU2006 benchmarks.
thread executions at run-time. Next, the function call tracing functionality is demonstrated for
a simple multi-threaded application and a real-world SPEC CPU2006 benchmark. Finally, a set
of FP benchmarks from the SPEC CPU2006 suite is evaluated in terms of both performance, by using the
CARM, and power/energy consumption. In both cases, the function call tracing is highlighted,
instead of the predominant FP type, in order to provide a different evaluation perspective from
the one presented for SpyMon.
5.3.1 Application Thread Hierarchy
Figure 5.8 depicts the dependency process tree of the executed FDTD OpenCL application [14],
where each node contains the PID of a monitored task. The main task that was registered into
SchedMon was the one on top (786) and it corresponds to the leader task of this execution. During
the execution, whenever this task or a subsequent child forks a new task, it is also registered into
the tool and starts being monitored immediately. In order to perform a multi-task evaluation, the
–i option was enabled.
As shown in Figure 5.8, SchedMon allows profiling multi-threaded applications, regardless of
the thread dependency level, i.e., the number of levels of the dependency tree. In this specific case,
one can observe that the monitored application is composed of 9 distinct tasks, which construct a
four-level dependency tree.
5.3.2 Scheduling Information
SchedMon allows not only to detect and monitor multi-threaded applications, but it also provides
the means to analyze the scheduling route of each task’s execution. This allows obtaining more
detailed information on the system’s scheduling mechanisms, as well as extracting useful insights
about the application’s structure.
Figure 5.8: Thread hierarchy for an FDTD OpenCL application [14].
[Figure 5.9: Gantt-style plot of Time [s] (0-30) for CPUs 0-7, showing the intervals during which each task (PIDs 786-792, 795 and 796) occupies each LPC.]
Figure 5.9: Scheduling information for the OpenCL application fdtd.
Figure 5.9 shows the scheduling information corresponding to the previously referred OpenCL
application test. As it can be seen, SchedMon is capable of monitoring all the information regarding
when each of the application tasks enters or leaves a CPU (LPC). Since the underlying hardware
contains 8 LPCs and the tested application is composed of 9 tasks, it is not possible to run all the
tasks at the same time. In this specific test, the OS scheduler solves
this issue by constantly migrating task 790 from one core to another. For example, at around
5s of the execution time, task 790 is migrated from LPC 5 to LPC 6. Another interesting
phenomenon can be observed at around 9 seconds of the execution, where all tasks stop
executing for about one second, with the sole exception of the leader task. This indicates that all
the capabilities of SchedMon to provide insights on the application structure.
5.3.3 Function Call Tracing
As previously referred, SchedMon provides the ability to trace the function calls of a given applica-
tion. This is conducted by instrumenting the adequate memory locations with a trap instruction,
thus implementing breakpoints. Figure 5.10 illustrates the execution and obtained profiling in-
formation of an application containing two processes, which are associated with different binary
executable files. Figure 5.10(a) presents the detailed information obtained from SchedMon. The
first column indicates if a function is being called or returned. The next column contains the
elapsed time information, in seconds. The third column indicates the PID of the corresponding
task. In the last column, the function names and addresses are provided. Figure 5.10(b) graphically
represents the information shown in Figure 5.10(a).
[Figure 5.10: (a) SchedMon's output; (b) time diagram showing task A executing main, calling foo and bar, and forking task B, which calls bar, switches its image through execve, and then runs main, one, two and three.]
Figure 5.10: Function call tracing of an application containing two processes. The child process, after being forked, switches its execution image.
In detail, the application execution happens in the following order: i) the main process (task
A) forks a child process (task B); ii) the parent calls the foo() and bar() local functions, waits
for the child and leaves; iii) after being forked, task B calls the bar() local function and switches
its execution image (execve); iv) task B calls function_one() (one), which in turn triggers
function_two() (two); v) the child calls function_three() (three) and terminates.
[Figure 5.11: plot of Performance [GFlops/s] (0-5) vs. Time [s] (0-200), with samples colored by the currently executed high-level function: imp_gauge_force(), grsource_imp(), eo_fermion_force(), ks_congrad().]
Figure 5.11: Milc performance colored according to its function call tracing profile.

In order to further explore and demonstrate the full potential of this functionality, the performance of the milc SPEC CPU2006 benchmark was evaluated and analyzed according to its function
call trace profile. As previously observed, milc presented several distinct phases and hence it
was a preferred benchmark for this particular demonstration. Figure 5.11 depicts the performance
analysis of milc in time, and the sample colors represent the currently executed high-level function.
As it can be observed in Figure 5.11, it is possible to extract a pattern from milc execution,
where each distinct performance phase corresponds to a different high-level function. This
allows not only evaluating specific execution parts of a given application, but also detecting possible
performance bottlenecks.
Regarding SchedMon’s function call tracing functionality, the following conclusions can be made:
• the tool is able to detect other applications’ calls, without the need to change their code or
recompile them;
• it is possible to trace multi-threaded applications in a single run;
• when an execve() call is performed, the instrumentation and detection of new loaded func-
tions is possible at run-time;
• recursive function calls are also detected;
• it allows the evaluation of different parts of the execution, as well as the detection of possible
performance bottlenecks.
5.3.4 Cache-aware Roofline Model Analysis
As previously referred, SchedMon also provides an execution mode that outputs the profiling
information in order to facilitate the application analysis according to the CARM. Figure 5.12 shows
the CARM performance analysis for calculix and tonto benchmarks. Similarly to the conditions
adopted for SpyMon evaluation, both tests were performed with a sampling time interval of 50ms,
and energy status information was simultaneously obtained.
[Figure 5.12: two CARM plots, Performance [GFlops/s] vs. Operational Intensity [flops/bytes], with samples colored by high-level function: (a) Calculix (results_(), spooles(), mastruct()); (b) Tonto (make_constraint_data(), add_constraint(), make_fock_matrix()).]
Figure 5.12: Evaluation of SPEC CPU2006 benchmarks using the CARM.
In order to provide a different perspective from the already presented CARM profiles for cal-
culix and tonto (see Figures 5.3(a) and 5.3(b)), the CARM information samples are now colored
according to their function call tracing profiles. In Figure 5.12(b), it can be observed that tonto’s
high-level functions correspond to distinct CARM phases. The functions add_constraint()
and make_constraint_data() present a similar behavior and achieve higher performance, whilst
make_fock_matrix() delivers a lower performance and it is contained in the memory-bound area.
In contrast to Figure 5.3(b), additional information can be obtained by analyzing the execution
call tracing profile, which allows detecting and further optimizing the possible bottlenecks of the ap-
plication execution. On the other hand, as it can be observed in Figure 5.12(a), calculix’s high-level
functions do not demonstrate any visible patterns when plotted against the CARM.
When comparing calculix execution (see Figure 5.12(a)) against the same test performed with
SpyMon (see Figure 5.3(a)), one can notice that most of the trail disappeared, i.e., a large number
of samples is concentrated at the top. A possible explanation for this behavior can be found in the
fact that SchedMon introduces less performance interference into the monitored application. As a
result, the tested application spends most of its time in the top right area of its shape, meaning
that it is less vulnerable to possible resource contention in the memory subsystem introduced by
the tool itself.
On the other hand, when looking at tonto’s profiling information in Figure 5.12(b), one can
observe that the samples are more spread out than the ones taken with SpyMon. This might happen
because SpyMon introduces a significantly higher amount of memory operations into the profiling
(see Section 5.4), which might reduce the operational intensity of the samples. In addition, when
analyzing the top performance samples, i.e., the top rightmost dots, one can observe that a higher
performance is attained when monitoring the application with SchedMon.
Finally, Figure 5.13 provides a detailed analysis regarding the average performance of several
SPEC CPU2006 benchmarks according to the CARM. The presented results are similar to the ones observed
with SpyMon (see Figure 5.5). Minor differences might refer to the different overhead imposed
[Figure 5.13: CARM plot, Performance [GFlops/s] vs. Operational Intensity [flops/bytes], showing the average sample of each benchmark (GemsFDTD, calculix, gamess, lbm, milc, namd, povray, soplex, tonto) against the AVX, SSE and DBL MAD and (ADD,MUL) rooflines and the AVX/SSE and DBL L1 LOAD rooflines.]
Figure 5.13: Application CARM plot showing the floating-point SPEC CPU2006 benchmarks; the application color characterization was made according to the average classification (double, SSE or AVX).
by the two tools and to the fact that the tests were performed at different times, thus different
machine states may introduce slight differences in the results.
5.3.5 Power/Energy Consumption Evaluation
Figure 5.14 illustrates the obtained power consumption, in time, for both calculix and tonto
benchmarks. As already described, the energy information samples were obtained alongside the
performance samples when running in CARM mode, i.e., with a sampling time interval of 50ms.
[Figure 5.14: two power plots, Power [W] (20-30) vs. Time [s], with samples colored by high-level function: (a) Calculix (results_(), spooles(), mastruct()); (b) Tonto (make_constraint_data(), add_constraint(), make_fock_matrix()).]
Figure 5.14: Power evaluation of SPEC CPU2006 benchmarks.
In order to provide a different perspective from the one already presented, when analyzing the
power consumption profiles for calculix and tonto with SpyMon (see Figures 5.6(a) and 5.6(b)),
the samples are now colored according to their function call tracing profiles. By analyzing Figure
5.14(b), it can be observed that distinct phases of tonto’s execution in time correspond to different
functions. Thus, tonto contains clearly observable phases both in time and in the CARM. On
the other hand, Figure 5.14(a) confirms the previous CARM conclusions that calculix does not
present any visible patterns in its distinct execution parts.
In comparison to SpyMon, it can be observed that the power consumption is reduced when using
SchedMon as the monitoring tool. This can be explained by the fact that SchedMon does not create
additional tasks for monitoring, i.e., it makes use of the available system running tasks in order
to periodically read the energy status information. On the other hand, SpyMon is composed of
9 processes (the monitor plus the spies), which are actively monitoring the different LPCs at run-time,
including those that are not currently running any application.
[Figure 5.15: bar chart of average Power [W] and total Energy [kJ] for GemsFDTD, calculix, gamess, gromacs, lbm, milc, namd, soplex and tonto.]
Figure 5.15: Power and energy evaluation for different floating-point SPEC CPU2006 benchmarks.
Finally, Figure 5.15 illustrates the average power consumption and the total energy consump-
tion of several SPEC CPU2006 benchmarks. Apart from milc, for which a significantly lower
average power consumption can be observed, all the benchmarks achieve a similar average power
consumption. As a result, the energy consumption of these benchmarks relates directly to their
execution time. As it was already referred, in comparison to the results obtained with SpyMon (see
Figure 5.7), a slightly lower power consumption can be noticed when relying on the SchedMon tool.
5.4 Overhead Discussion
In order to obtain an overview of the overhead introduced by the herein proposed tools, two
distinct evaluation tests were performed for each tool. Figure 5.16 illustrates the overheads
imposed by the tools over time. As it can be observed, the tools use a timer which defines the sampling
time interval (TT ). Whenever the timer is triggered, the OS assures the execution of the tool, by
switching the currently running task with the tool. The overhead corresponding to the OS mech-
anisms triggered by the tool is depicted as TO. Finally, the tool’s execution overhead corresponds
to TH .
Figure 5.16 also shows the main scope of the two evaluation tests (EA and EB) that were per-
formed in order to analyze both tools’ overheads. In the first evaluation test (EA), each tool is set
Sample Time   1ms       2ms       5ms       10ms       25ms       50ms       100ms
PMU           1.059     2.064     5.080     10.107     25.188     50.323     100.592
              (5.89%)   (3.22%)   (1.61%)   (1.07%)    (0.75%)    (0.65%)    (0.59%)
PMU & RAPL    1.059     2.064     5.080     10.107     25.185     50.323     100.582
              (5.89%)   (3.21%)   (1.61%)   (1.07%)    (0.74%)    (0.65%)    (0.58%)
Table 5.1: Median time counts (in ms) for SpyMon self-monitoring.

Sample Time   1ms       2ms       5ms       10ms       25ms       50ms       100ms
PMU           1.004     2.009     5.025     10.051     25.130     50.261     100.523
              (0.40%)   (0.46%)   (0.50%)   (0.51%)    (0.52%)    (0.52%)    (0.52%)
PMU & RAPL    1.004     2.009     5.025     10.051     25.130     50.261     100.524
              (0.40%)   (0.46%)   (0.50%)   (0.51%)    (0.52%)    (0.52%)    (0.52%)
Table 5.2: Median time counts (in ms) for SchedMon self-monitoring.
to profile itself, i.e., it is run without the execution of any monitored task. Therefore, the obtained
PMU sampling information should correspond to the events performed by the tool, assuming that
no other task runs in the system. As a result, the obtained time should represent the maximum
overall overhead per sample. In the second evaluation test (EB), a more refined time analysis is
performed. This was achieved by instrumenting each tool with the rdtsc instruction, in order to
obtain the precise time overhead corresponding to the process of taking a single sample.
Evaluation A
The first test consists of running each of the tools individually, using the same predefined
configuration as used for the CARM analysis, but without executing any target benchmark appli-
cation. Hence, the tools should provide performance measurements that correspond to their own
execution. However, it should be taken into account that other OS tasks might introduce small
interference to the execution.
Tables 5.1 and 5.2 contain the median values of the obtained time measurements (in ms) for
SpyMon and SchedMon, respectively. The results are obtained for different sampling time inter-
vals, and for two distinct situations, namely: i) when only performance sampling information is
[Figure 5.16: timeline of repeating Timer → OS → Tool → OS → Timer phases, where TT is the timer time interval (~ sampling time interval), TO the OS time interval (mostly scheduling), TH the tool time interval (time to take a PMU sample), and EA and EB mark the scopes of Evaluation A and Evaluation B.]
Figure 5.16: Diagram illustrating the performed overhead evaluation tests.
[Figure 5.17: bar charts of instructions per sample, broken down into loads (LD), stores (ST) and other instructions (OT), vs. sampling time interval (1ms-100ms): (a) PMU; (b) PMU & RAPL.]
Figure 5.17: SpyMon’s number of instructions per sample when self-monitoring.
profiled; and ii) when performance sampling information is profiled alongside with energy/power
consumption sampling. In both situations, the time measurements are performed when taking a
PMU sample. Therefore, the obtained time values should not present any significant differences.
In addition, the percentage value corresponding to the overhead for each specific sampling time
interval is presented. This percentage is calculated according to the expression:

ovh = (time_measured − time_sample) / time_sample, (5.1)

which assumes that the sampling time interval (time_sample) is perfectly accurate.
The results presented in Tables 5.1 and 5.2 include the sampling time interval imposed by the
timer (T_T), the OS overhead of scheduling the tool in and out (T_O), and the tool's overhead
of taking a performance sample (T_H), i.e.:

T_A = T_T + 2 T_O + T_H.    (5.2)

Therefore, the overall overhead of the tool corresponds to:

T_OVH = 2 T_O + T_H.    (5.3)
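The relations in Eqs. (5.1)–(5.3) can be sketched as a short computation. The sketch below uses illustrative placeholder values, not measurements from this evaluation, and shows how the overhead percentage of Eq. (5.1) follows from the decomposition in Eqs. (5.2) and (5.3):

```python
# Sketch of the overhead model in Eqs. (5.1)-(5.3); all numeric values
# below are illustrative placeholders, not measurements from the thesis.

def overhead_fraction(time_measured_ms, time_sample_ms):
    """Eq. (5.1): relative overhead w.r.t. an ideal sampling interval."""
    return (time_measured_ms - time_sample_ms) / time_sample_ms

def measured_interval(t_t_ms, t_o_ms, t_h_ms):
    """Eq. (5.2): T_A = T_T + 2*T_O + T_H (scheduling in and out)."""
    return t_t_ms + 2 * t_o_ms + t_h_ms

def tool_overhead(t_o_ms, t_h_ms):
    """Eq. (5.3): T_OVH = 2*T_O + T_H."""
    return 2 * t_o_ms + t_h_ms

# Example: a 10 ms timer interval, 0.02 ms per scheduling event and
# 0.0014 ms (1.4 us) to take a PMU sample.
t_a = measured_interval(10.0, 0.02, 0.0014)
print(f"T_A = {t_a:.4f} ms, overhead = {overhead_fraction(t_a, 10.0):.2%}")
```

Note that, under the assumption of a perfectly accurate timer, the overhead fraction equals T_OVH / T_T.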
When comparing the results for both tools, it can be observed that SpyMon introduces signifi-
cantly higher overheads than SchedMon: SpyMon shows an overhead between 0.58% and 5.89%,
whereas SchedMon presents an overhead between 0.40% and 0.52%. Since the tools perform the
same number of instructions each time a sample is taken, the differences observed in the overheads
across samples can be attributed to OS interference (T_O).
Figures 5.17 and 5.18 show the average number of instructions performed (on a per-type ba-
sis) in the above described tests. As can be observed in Figures 5.17(a) and 5.17(b), SpyMon
introduces an overhead of about 14000 instructions per sample when monitoring performance
and an overhead of about 25000 instructions per sample when monitoring both performance
and power/energy consumption on the same LPC. On the other hand, SchedMon introduces an
5. Experimental Results
[Bar charts: instructions per sample (OT, ST, LD) versus sampling time interval (1 ms–100 ms); (a) PMU, (b) PMU & RAPL.]
Figure 5.18: SchedMon's number of instructions per sample when self-monitoring.
overhead of about 3000 instructions per sample when monitoring performance (see Figure 5.18(a))
and an overhead of about 3500 instructions per sample when monitoring both performance and
power/energy consumption (see Figure 5.18(b)).
As can be noticed, for both tools, the number of instructions per sample increases significantly
when RAPL samples are taken (in contrast to what was shown during the evaluation of the timer
overheads). This happens because, during a measured time interval (T_A), the instructions
related to both a PMU sample and a RAPL sample are counted by the performance counters. On
the other hand, the RAPL samples do not interfere with T_A itself, since they are hidden within
the timer's interval time (T_T).
Finally, it should be emphasized that the results shown above for evaluation test A do not
correspond to the overheads solely introduced by the tools, since they include the interference of
any OS tasks that ran during the experimental evaluation.
Evaluation B
For evaluation test B, both tools were run under conditions similar to those described
above for evaluation A. However, instead of presenting the obtained sampling information,
both tools were instrumented to measure only the overhead of taking a sample
(T_H).
Figure 5.19 illustrates the obtained results for both tools. Figure 5.19(a) shows the overheads
of taking a PMU sample, while Figure 5.19(b) shows the overheads of taking a RAPL sample.
As can be observed in Figure 5.19, SchedMon presents a lower overhead in both cases. The
overhead of producing a PMU sample is around 1.39 µs in SchedMon, compared to around
1.65 µs in SpyMon, while the overhead of producing a RAPL sample is around 1.25 µs for
SchedMon, compared to around 1.30 µs for SpyMon.
[Line charts: per-sample overhead (µs) versus sampling time interval (1–100 ms) for SpyMon and SchedMon; (a) PMU sample overhead, (b) RAPL sample overhead.]
Figure 5.19: Overhead of taking a PMU or a RAPL sample in both SpyMon and SchedMon tools.
5.5 Summary
This chapter presented the experimental results that illustrate the different features of the
herein presented tools. SpyMon has proven to be a good system-wide performance tool, capable
of delivering information about the performance and energy status of the whole system. It also
makes it possible to evaluate an application's performance according to the CARM. SchedMon
has also proven able not only to extract and deliver performance and power/energy con-
sumption information, but also to provide the means for a CARM analysis. Although SchedMon
targets the application rather than the whole system, it can monitor multi-threaded applications
and reconstruct the whole execution by tracing process dependencies and function calls and by
providing task scheduling information.
The results obtained with SpyMon made it possible to evaluate the interference (both in
performance and power consumption) between multiple applications running at the same time.
Moreover, SpyMon provided a complete CARM and power/energy consumption evaluation of a
set of FP SPEC CPU2006 benchmarks, grouped according to their predominant FP types.
On the other hand, SchedMon proved able to reconstruct a full multi-threaded application
execution from the scheduling point of view. SchedMon also provided a complete CARM and
power/energy consumption evaluation for a set of FP SPEC CPU2006 benchmarks. Moreover,
the performed analysis included additional insightful information about the benchmarks'
function call tracing profiles.
In terms of overheads, SchedMon has demonstrated lower overheads than SpyMon, both with and
without taking into account the OS interference. Moreover, SpyMon has been shown to introduce
a higher power consumption, which relates to the fact that it is composed of several processes that
run on different LPCs during the entire profiling, thus increasing the overall power consumption.
Despite these differences, overall, both monitoring methods allow a user/programmer to get a
clear picture of the behavior of an application and of how its execution is affected by the processor's
architectural limitations.
6 Conclusions
This thesis proposes two distinct monitoring methods that combine the advantages of the
recently proposed Cache-aware Roofline Model (CARM) [11] with accurate real-time monitoring
facilities, in a way that allows application developers to easily relate the application behavior with
the architecture characteristics, thus fostering new application optimizations. Both proposed tools
rely on the available hardware counting interfaces for obtaining the measurements, namely the
Performance Monitoring Unit (PMU), for performance, and the Running Average Power Limit
(RAPL) interface, for energy status information, and aim at providing the full underlying
functionality to the user in a simple and intuitive way.
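To illustrate the kind of processing a RAPL-based tool must perform, the sketch below converts raw energy-status counter readings into joules. It follows the conventions documented for Intel's RAPL interface (the energy unit is 0.5^ESU joules, with ESU in bits 12:8 of the MSR_RAPL_POWER_UNIT register, and the 32-bit energy-status counter wraps around); the function and variable names are illustrative, not part of SpyMon's or SchedMon's API:

```python
# Hedged sketch: converting raw RAPL energy-status readings to joules.
# Assumes the Intel RAPL conventions (energy unit = 0.5**ESU joules,
# ESU taken from bits 12:8 of MSR_RAPL_POWER_UNIT; 32-bit wrapping
# energy counter). Names are illustrative, not the tools' actual API.

COUNTER_BITS = 32  # RAPL energy-status counters are 32 bits wide

def energy_unit_joules(rapl_power_unit_msr):
    """Extract the Energy Status Unit (ESU) and return joules per tick."""
    esu = (rapl_power_unit_msr >> 8) & 0x1F
    return 0.5 ** esu

def energy_delta_joules(prev_raw, curr_raw, unit_j):
    """Energy consumed between two raw readings, handling wraparound."""
    delta = (curr_raw - prev_raw) % (1 << COUNTER_BITS)
    return delta * unit_j

# Example with a typical ESU of 16 (unit = 0.5**16 J, about 15.3 uJ);
# the register value 0x1000 is a hypothetical reading.
unit = energy_unit_joules(0x1000)
joules = energy_delta_joules(0xFFFFFF00, 0x00000100, unit)  # counter wrapped
```

The wraparound handling matters in practice: at low sampling rates the 32-bit counter can overflow between two consecutive samples.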
The first tool, SpyMon, targets system-wide monitoring and is mostly implemented in user
space, which increases its portability and independence from the OS. It launches and
pins a distinct process to each CPU core that is to be monitored, and invokes an
additional main process, which controls the whole tool's execution and performs the energy status
profiling. SpyMon's implementation provides a fully configurable performance environment
and incorporates the ability to monitor the system's power consumption in a single tool, which is
exposed through a simple-to-use command-line interface. Moreover, it is possible to perform a
predefined performance analysis using the CARM, which eases the tool's configuration and
provides additional useful performance information.
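The bound underlying such a roofline-style analysis can be sketched as follows. The sketch shows the original roofline formula (after Williams et al. [19]); the CARM [11] refines it by measuring memory traffic as seen from the core (i.e., including the cache hierarchy), and the machine constants below are illustrative placeholders:

```python
# Simplified roofline-style performance bound (after Williams et al. [19]):
# attainable performance is limited either by peak compute or by memory
# bandwidth scaled by arithmetic intensity. The CARM [11] refines this by
# counting memory traffic from the core's perspective; the constants used
# here are illustrative placeholders, not measured machine parameters.

def attainable_gflops(arithmetic_intensity, peak_gflops, peak_gbps):
    """Roofline bound: min(peak compute, AI * memory bandwidth)."""
    return min(peak_gflops, arithmetic_intensity * peak_gbps)

# Example: a hypothetical machine with 100 GFLOP/s peak compute and
# 20 GB/s bandwidth; the ridge point sits at AI = 100/20 = 5 flops/byte.
for ai in (0.5, 5.0, 50.0):
    print(f"AI = {ai:5.1f} flops/byte -> {attainable_gflops(ai, 100.0, 20.0)} GFLOP/s")
```

Applications with an arithmetic intensity below the ridge point are memory-bound; above it, they are compute-bound.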
The second tool, SchedMon, targets application profiling and is mostly implemented in kernel
space. It makes use of the OS scheduling events to keep track of the monitored
application and to reduce the interference of other running tasks. In addition to performance
and power monitoring, SchedMon can trace the complete execution of multi-threaded applica-
tions (process dependencies, function calls and task scheduling information), which makes it
possible to reconstruct the complete application execution. It also provides the ability to per-
form a predefined performance evaluation based on the CARM. All the functionality is available
to the user through a simple and intuitive command-line interface. Moreover, for detailed run-time
performance and/or power evaluation, it is also possible to interact directly with the tool's
underlying mechanisms by using the provided user-space library.
The performed experimental tests demonstrate the capabilities of the proposed tools to deliver
detailed information about how running applications perform on top of the underlying architectural
resources. A full CARM and power/energy consumption analysis was performed for several SPEC
CPU2006 benchmarks, which provided insightful information about each application's ability to
exploit the full attainable performance of the underlying resources. SpyMon has also proven able
to obtain detailed system-wide information, which made it possible to observe how different
applications interfere with each other (in both the performance and power consumption domains)
when running simultaneously on a multi-core architecture. On the other hand, SchedMon succeeded
in detecting and monitoring the execution of multi-threaded applications, thus being able to
reconstruct the whole application execution by tracing its process dependencies, function calls
and task scheduling information. SchedMon also provided a complete CARM and power/energy
consumption evaluation for a set of FP SPEC CPU2006 benchmarks. In addition, the performed
analysis includes insightful information about the benchmarks' function call tracing profiles. In
terms of introduced overheads, both tools have shown low interference with the monitored
applications: the overhead of producing a PMU sample is around 1.39 µs in SchedMon, compared
to around 1.65 µs in SpyMon, while the overhead of producing a RAPL sample is around 1.25 µs
for SchedMon, compared to around 1.30 µs for SpyMon, which makes SchedMon the tool with the
lowest overheads.
6.1 Future Work
Although the herein presented tools demonstrate great potential, they can be improved in several
aspects:
• Support for different architectures - One major improvement to the tools could be
achieved by extending their support for different architectures. At the moment, mainly
recent Intel micro-architectures are supported, but no major difficulties are expected when
porting the tools' functionality to micro-architectures from other vendors, such as AMD
or ARM.
• Reducing memory operations - Although the introduced overhead has been shown to be
low, most of it stems from memory operations and, therefore, both tools can be further
optimized to reduce the number of memory accesses. A first approach could be to compress
the output data format, thus reducing the size of the transferred information.
• Extend to different interfaces - So far, the tools provide the facilities to access two distinct
hardware interfaces, the PMU and RAPL. However, recent versions of other system components,
such as GPUs, provide similar interfaces that could be added to the monitoring tools.
• Improve call tracing - The function call tracing functionality implemented in SchedMon
can be improved to provide greater control over the target applications, for example, by
reducing the tracing interference through the detection of recursive function calls or of
functions called from inside loops.
In conclusion, as performance and power consumption optimizations become a greater
concern, so grows the need for powerful tools that can, in simple ways, integrate multiple
functionalities to extract meaningful information about applications and architectural
infrastructures. Despite the complexity of the provided hardware interfaces and of the system
mechanisms that expose the required information, we successfully implemented two different
methods for obtaining, in a simple and fully configurable way, the necessary information for a
full performance and power analysis, from both the application and the system perspectives.
Overall, both tools achieve the initial objectives while still leaving room for future improvements.
Bibliography
[1] Perf Wiki tutorial on perf. https://perf.wiki.kernel.org/index.php/Tutorial. Accessed:
2013-06-25.
[2] Perfmon2 sourceforge project page. http://perfmon2.sourceforge.net/. Accessed: 2013-
06-20.
[3] Antao, D., Taniça, L., Ilic, A., Pratas, F., Tomás, P., and Sousa, L. (2013). Monitoring perfor-
mance and power for application characterization with cache-aware roofline model. International
Conference on Parallel Processing and Applied Mathematics, page 14.
[4] Browne, S., Dongarra, J., Garner, N., Ho, G., and Mucci, P. (2000). A portable program-
ming interface for performance evaluation on modern processors. International Journal of High
Performance Computing Applications, 14(3):189–204.
[5] Cohen, W. E. (2004). Tuning programs with OProfile. Wide Open Magazine, 1:53–62.
[6] Corbet, J., Rubini, A., and Kroah-Hartman, G. (2005). Linux Device Drivers. O'Reilly Media,
Inc.
[7] Demme, J. and Sethumadhavan, S. (2011). Rapid identification of architectural bottlenecks
via precise event counting. In ACM SIGARCH Computer Architecture News, volume 39, pages
353–364. ACM.
[8] Donnell, J. (2004). Java performance profiling using the VTune Performance Analyzer.
[9] Fog, A. (2014). Software optimization resources. http://www.agner.org. Accessed: 2014-02-
10.
[10] Henning, J. L. (2006). SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer
Architecture News, 34(4):1–17.
[11] Ilic, A., Pratas, F., and Sousa, L. (2013). Cache-aware roofline model: Upgrading the loft.
Computer Architecture Letters, PP(99).
[12] Intel (2013). Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3:
System Programming Guide.
[13] Jarp, S., Jurga, R., and Nowak, A. (2008). Perfmon2: A leap forward in performance moni-
toring. In Journal of Physics: Conference Series, volume 119, page 042017. IOP Publishing.
[14] Kuan, L., Tomas, P., and Sousa, L. (2013). A comparison of computing architectures and
parallelization frameworks based on a two-dimensional FDTD. In International Conference on
High Performance Computing and Simulation (HPCS), pages 339–346. IEEE.
[15] Pettersson, M. (2009). Perfctr: Linux performance monitoring counters driver. Retrieved Dec.
[16] Treibig, J., Hager, G., and Wellein, G. (2010). LIKWID: A lightweight performance-oriented
tool suite for x86 multicore environments. In International Conference on Parallel Processing
Workshops (ICPPW), pages 207–216. IEEE.
[17] Weaver, V. M. (2013). Linux perf_event features and overhead. In International Workshop
on Performance Analysis of Workload Optimized Systems (FastPath), page 80.
[18] Weaver, V. M., Johnson, M., Kasichayanula, K., Ralph, J., Luszczek, P., Terpstra, D., and
Moore, S. (2012). Measuring energy and power with PAPI. In International Conference on
Parallel Processing Workshops (ICPPW), pages 262–268. IEEE.
[19] Williams, S., Waterman, A., and Patterson, D. (2009). Roofline: an insightful visual perfor-
mance model for multicore architectures. Communications of the ACM, 52(4):65–76.