Performance and Energy Monitoring Tools for Modern Processor Architectures
Luís Filipe Mataloto Taniça
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Prof. Pedro Filipe Zeferino Tomás
Prof. Leonel Augusto Pires Seabra de Sousa
Examination Committee
Chairperson: Prof. Nuno Cavaco Gomes Horta
Supervisor: Prof. Pedro Filipe Zeferino Tomás
Members of the Committee: Prof. João Nuno de Oliveira e Silva
April 2014
Acknowledgments
First of all, I would like to thank Professors Leonel Sousa and Pedro Tomás, for the support
and coordination of my work. An additional thanks to Aleksandar Ilić and Frederico Pratas, for
the patience and guidance, and to Diogo Antão, for the smooth partnership. Finally, I would like
to thank all my family and friends, for the support and motivation.
This work was supported by national funds through FCT – Fundação para a Ciência e a
Tecnologia, under the project P2HSC - Stretching the Limits of Parallel Processing on Heterogeneous
Computing Systems under the reference PTDC/EEI-ELC/3152/2012.
Abstract
Accurate on-the-fly characterization of application behavior requires assessing a set of execution-
related parameters at run-time, including performance, power and energy consumption. These
parameters can be obtained by relying on hardware measurement facilities built-in modern multi-
core architectures, such as performance and energy counters. However, current operating systems
do not provide the means to directly obtain these characterization data. Thus, the user needs to
rely on complex custom-built libraries with limited capabilities, which might introduce significant
execution and measurement overheads. In this work, we propose two different tools for efficient
performance, power and energy monitoring of systems with modern multi-core CPUs, that allow
capturing the run-time behavior of a wide range of applications at different system levels: i)
at the user-space level, and ii) at kernel-level, by using the OS scheduler to directly capture
this information. Although the importance of the proposed monitoring facilities is evident for
many purposes, we focus herein on their use for application characterization with the
Cache-aware Roofline model. The experimental results show the capabilities of the proposed
tools to deliver detailed and accurate information about the behavior of real-world applications
on the underlying architectural resources. Moreover, they allow reconstructing and identifying
the execution patterns of the profiled benchmarks from standard suites (SPEC CPU2006), while
introducing negligible overheads.
Keywords
Performance and Power Monitoring, Application Characterization, Multi-core Architectures,
Cache-aware Roofline Model
iii
Resumo
Real-time behavioral characterization of applications requires evaluating, during the execution itself, a set of execution-related parameters, such as performance, power and energy consumption. These parameters can be obtained through hardware mechanisms made available in modern multi-core architectures, such as performance and energy counters. However, current operating systems (OSs) do not provide the means required to obtain these characterization data. Hence, the user needs to resort to complex, custom-built libraries with limited capabilities, which may add significant overhead to execution measurements. In this work, two different techniques are proposed that allow efficient performance and energy monitoring for multi-core architectures. The two proposed monitoring tools allow capturing, in real time, the behavior of a wide range of applications at two distinct levels: i) at the user level, or user-space, and ii) at the system level, or kernel-space, using the OS scheduler as the means to capture this information. Although the importance of the proposed monitoring interfaces is evident for several purposes, a central focus is placed on application characterization according to the Cache-aware Roofline Model. The obtained results demonstrate the capabilities of the proposed tools to provide detailed and accurate information about the behavior of applications on the architectural resources. They also allow reconstructing and identifying patterns in the profile of standard benchmarks (SPEC CPU2006), while introducing negligible overhead.
Keywords
Performance and Energy Monitoring, Application Characterization, Multi-core Architectures, Cache-aware Roofline Model
Contents
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Main contributions
1.4 Dissertation outline
2 Background
2.1 Performance Monitoring Unit
2.1.1 Performance Model-Specific Registers
2.1.2 Performance Monitoring Event Configuration
2.2 Running Average Power Limit
2.3 Linux Kernel Modules
2.4 Performance Monitoring Challenges
2.5 State-of-Art Monitoring Tools
2.6 Cache-Aware Roofline Model
2.7 Summary
3 User-Space Monitoring Tool (SpyMon)
3.1 Architecture and Main Functionalities
3.1.1 Spatial Process Organization
3.1.2 Available Features
3.2 Implementation Details
3.2.1 Linux Kernel Module and Hardware Access Restrictions
3.2.2 Hardware Readings and Configuration
3.2.3 Main Functionality
3.3 Usage
3.3.1 Profiling Mode
3.3.2 Cache-aware Roofline Mode
3.3.3 Information Output
3.4 Summary
4 Scheduler-Based Monitoring Tool (SchedMon)
4.1 Architecture and Main Functionality
4.1.1 SchedMon's Linux Kernel Module
4.1.2 Smon: the user-space tool
4.1.3 Available Features
4.2 Implementation Details
4.2.1 Linux Kernel Module
4.2.2 User-space Tool
4.3 Usage
4.3.1 Adding Events
4.3.2 Defining Event-sets
4.3.3 Application Profiling
4.3.4 Cache-aware Roofline Mode
4.3.5 Information Output
4.4 Summary
5 Experimental Results
5.1 Experimental Environment
5.2 SpyMon Experimental Evaluation
5.2.1 System-wide Profiling
5.2.2 Cache-aware Roofline Model Analysis
5.2.3 Power/Energy Consumption Evaluation
5.3 SchedMon
5.3.1 Application Thread Hierarchy
5.3.2 Scheduling Information
5.3.3 Function Call Tracing
5.3.4 Cache-aware Roofline Model Analysis
5.3.5 Power/Energy Consumption Evaluation
5.4 Overhead Discussion
5.5 Summary
6 Conclusions
6.1 Future Work
List of Figures
2.1 Multi-core CPU architecture
2.2 MSR read and write functionality
2.3 Performance MSRs for Intel's PMU version 3. Figures obtained from [12].
2.4 Energy status MSR layout. Obtained from [12].
2.5 Performance Cache-aware Roofline Model (Intel 3770K)
3.1 Spatial perception of SpyMon while monitoring 5 threads from 3 applications.
3.2 SpyMon's components interaction and disposition in the Operating System (OS) privilege layers.
3.3 SpyMon's data structures for ioctl() communication.
3.4 SpyMon's execution flow.
3.5 Illustration of TBM for 3 defined event-sets.
3.6 SpyMon's usage information.
4.1 SchedMon's components interaction and disposition in the OS privilege layers.
4.2 SchedMon event, event-set and environment structural hierarchy.
4.3 Linux scheduler breakpoints used by SchedMon.
4.4 SchedMon sampling process illustration.
4.5 SchedMon ring-buffer implementation overview.
4.6 Example of a function's dump information.
4.7 SchedMon function call tracing data structures.
4.8 Smon event usage information.
4.9 Smon evset usage information.
4.10 Smon profile usage information.
4.11 Smon roof-run and roof-creat usage information.
5.1 SpyMon performance evaluation of SPEC CPU2006 benchmarks, for a 20ms sampling time interval.
5.2 Power consumption of four benchmarks run separately and simultaneously.
5.3 Evaluation of SPEC CPU2006 benchmarks by using the CARM. The sample time interval was set to 50ms.
5.4 Temporal representation of the CARM for Tonto.
5.5 Application CARM plot showing the floating-point SPEC CPU2006 benchmarks; the application color characterization was made according to average classification (double, SSE or AVX).
5.6 Power evaluation of SPEC CPU2006 benchmarks.
5.7 Power and energy evaluation for different floating-point SPEC CPU2006 benchmarks.
5.8 Thread hierarchy for an FDTD OpenCL application [14].
5.9 Scheduling information for OpenCL application fdtd.
5.10 Function call tracing of an application containing two processes. The child process, after being forked, switches its execution image.
5.11 Milc performance colored according to its function call tracing profile.
5.12 Evaluation of SPEC CPU2006 benchmarks using the CARM.
5.13 Application CARM plot showing the floating-point SPEC CPU2006 benchmarks; the application color characterization was made according to average classification (double, SSE or AVX).
5.14 Power evaluation of SPEC CPU2006 benchmarks.
5.15 Power and energy evaluation for different floating-point SPEC CPU2006 benchmarks.
5.16 Diagram illustrating the performed overhead evaluation tests.
5.17 SpyMon's number of instructions per sample when self-monitoring.
5.18 SchedMon's number of instructions per sample when self-monitoring.
5.19 Overhead of taking a PMU or a RAPL sample in both SpyMon and SchedMon tools.
x
List of Tables
3.1 Sets of PMEs used for performance profiling when using the cache-aware roofline model.
3.2 Sample of hardware performance events provided by SpyMon.
4.1 Available ioctl() requests to SchedMon's driver.
5.1 Median Time Counts for SpyMon self-monitoring.
5.2 Median Time Counts for SchedMon self-monitoring.
List of Acronyms
AVX Advanced Vector Extensions
CARM Cache-aware Roofline Model
CPU Central Processing Unit
DP Double Precision
DRAM Dynamic Random-Access Memory
FP Floating Point
GPU Graphics Processing Unit
LLC Last-Level Cache
LPC Logical Processor Core
MSR Model-Specific Register
ORM Original Roofline Model
OS Operating System
PFC Performance Fixed Counter
PMC Performance Monitoring Counter
PME Performance Monitoring Event
PMSR Performance Monitoring Select Register
PMU Performance Monitoring Unit
PPC Physical Processor Core
RAPL Running Average Power Limit
SSE Streaming SIMD Extensions
TBM Time-Based Multiplexing
TSC Time-Stamp Counter
1 Introduction
Contents
1.1 Motivation
1.2 Objectives
1.3 Main contributions
1.4 Dissertation outline
The constant technological advances in computing systems have led to multi-core architectures,
which contain complex internal mechanisms that are not always easy to understand or analyze.
Following this evolution, adapting and optimizing the execution of real-world applications is of
the utmost importance, in order to fully exploit the potential of the underlying architectures.
This requires a deep understanding of how the underlying infrastructures work and how one can
efficiently explore them. In order to provide insights into the micro-architectural behavior,
Central Processing Unit (CPU) manufacturers already incorporate low-level mechanisms that
provide information about the architecture behavior at application run-time. However, accessing
these mechanisms usually requires the use of complex interfaces and a deep understanding of the
functional principles behind the hardware facilities. The work proposed herein aims at exploring
different ways of exporting the full functionality of these hardware interfaces to the user in an
easy and intuitive way, by proposing several tools for performance and power/energy monitoring
at different levels of parallel processing in modern multi-core systems.
1.1 Motivation
Until recently, a computer's processing power could be increased by using power-hungry
techniques, e.g., by increasing the processor's pipeline depth and therefore its overall frequency.
However, architectural designers experienced great difficulties in sustaining this growth, due to
physical limitations (mainly regarding high power consumption), marking the end of single-core
systems. With the introduction of multi-core processors, they were able to circumvent these issues.
Multi-core processors are typically based on the replication of a number of identical cores in a
single die, where each core includes a set of private coherent caches and usually a hardware support
for multiple thread execution. The cores usually share a common higher level memory organiza-
tion, typically containing the Last-Level Cache (LLC) and the main memory. Even though these
techniques have made it possible to increase the processing power, they still present major challenges. For
instance, the widening gap between processor and memory speeds has caused processors to spend
most of their time waiting for memory data, making frequency increases ineffective. Furthermore,
higher frequencies require deeper pipelines, which makes the design and verification of already
complex processors even more challenging. From a software perspective, the ability to explore the
full performance of multiple execution cores in a single computer has proven to be difficult and,
thus, it has become indispensable for application developers to characterize and understand such
complex systems.
Hardware Performance Monitoring Units (PMUs), available in most modern processors, give
developers the ability to analyze system performance and potential execution bottlenecks. By
using several registers, often called Performance Monitoring Counters (PMCs), PMUs support the
counting or sampling of several micro-architectural events [12]. Moreover, recent architectures also
provide a similar interface for monitoring energy consumption in several architectural components.
In Intel’s architectures this interface is called Running Average Power Limit (RAPL) [12].
In order to make use of the referred performance and power interfaces, several methods have
been developed in recent years, in the form of different libraries and tools that facilitate
the access to those facilities. However, developing an accurate tool for performance and power
consumption monitoring with low overheads is not an easy task. Moreover, such tools need to
provide a simple and intuitive interface, in contrast to the common approaches in the literature,
which expose most of their functionality through complex interfaces that are sometimes hard to
use by the common user.
1.2 Objectives
Although there are several profiling tools available that allow obtaining performance or power
consumption information, only a few provide both functionalities in a single interface. In
addition, even if a full performance configuration is provided, the choice of the proper
performance events to monitor is not always trivial, nor is the proper way of evaluating them in
order to obtain a complete overview of the attainable application performance on the underlying
architecture resources. Finally, the ability to provide the full performance and power consumption
evaluation must be passed to the end-user as an easy-to-use interface. However, some of the most
powerful state-of-the-art performance interfaces are too complex [17] or not fully documented,
which hampers their usage.
According to the above needs regarding the full performance and power consumption evaluation
of applications on modern architectures, the main objectives of the herein presented work include:
• The integration of both performance and power consumption evaluation in a single interface;
• The research for efficient novel approaches that allow the complete evaluation of the
performance behavior of one or several applications on modern multi-core architectures;
• To provide a full performance and power consumption evaluation of a set of standard bench-
marks, thus allowing the analysis of their behavior in different scenarios and providing the
ability to detect possible execution bottlenecks in a modern multi-core architecture;
• The translation of the full capabilities of the hardware performance and power monitoring
resources to the end-user in an easy and intuitive interface.
1.3 Main contributions
The main contributions of the work developed through this thesis correspond to the proposed
monitoring tools:
• SpyMon - A user-space tool that aims at system-wide performance analysis. The main
functional principle behind this tool relies on spawning a monitoring process for each processor
core, which handles the profiling operations for that core. The tool integrates both performance
and power consumption monitoring, and is provided to the end-user via a simple-to-use interface.
It has proven able to provide a full system evaluation, even if several tasks are running
simultaneously. Although it has shown a significant increase in power consumption when profiling,
the tool does not introduce high performance overhead.
• SchedMon - The second tool follows a completely different approach: its core functionality
is implemented in kernel-space, by using a Linux device driver [6]. The tool makes use of the
OS internal scheduling events in order to detect context switches and to obtain more accurate
results. Similarly to SpyMon, it provides all its functionality to the end-user through an
intuitive and easy-to-use command-line interface. In addition, run-time evaluation is possible
by means of a provided user-space library, which exports the kernel-space core functionality
to user-space programs through a set of simple calls. This approach has shown some improvements
in terms of imposed overheads. Moreover, it provides additional functionalities that improve
application analysis, such as the ability to reconstruct the scheduling route of multi-threaded
applications and to assign distinct performance behaviors to specific parts of the application's
code.
Both of the herein proposed tools have shown low interference with the performance of the
monitored applications. In addition, a full performance and power analysis of a set of standard
SPEC CPU2006 [10] benchmarks is provided, which relies on the Cache-aware Roofline Model (CARM)
in order to provide a broader perspective of the attainable application performance on the
underlying multi-core architecture. These benchmarks are widely referenced and used, and there
is currently no detailed information about their performance and power/energy consumption.
Part of this work has been already published at an international conference:
• [3] Diogo Antão, Luís Taniça, Aleksandar Ilić, Frederico Pratas, Pedro Tomás and Leonel
Sousa, “Monitoring Performance and Power for Application Characterization with Cache-aware
Roofline Model”, in International Conference on Parallel Processing and Applied Mathematics
(PPAM 2013), Springer, Warsaw, Poland, September 2013.
1.4 Dissertation outline
The remainder of this dissertation is organized as follows. Chapter 2 addresses the background
information required to understand the herein proposed work. First, a general overview of a
modern computer architecture is given, which covers not only the basic description of the
available performance and power/energy monitoring infrastructures, but also how one can
configure them in order to obtain meaningful information. Since both of the herein proposed
tools interact with the Linux kernel, a brief overview of Linux kernel concepts is also provided.
Then, an overview of the most common monitoring challenges and the available state-of-the-art
tools is given. At last, a brief description of the CARM is made, since it is involved in one
of the core functionalities herein provided. Chapter 3 introduces a new, simple-to-use,
system-wide monitoring tool, which provides the means to perform a full system performance and
power consumption analysis, and is mostly implemented in user-space. An overview of the tool's
functionalities is given, as well as of the main implementation aspects that are important for
understanding the tool. In the end, the tool's usage information is provided. In a similar way
to Chapter 3, a second monitoring tool is introduced in Chapter 4. This tool follows a
different approach from the previous one, as it is mostly implemented in kernel-space. After a
complete overview of the tool's capabilities, a detailed description of its internal mechanisms
and usage is given. Chapter 5 illustrates the potential of both tools by means of experimental
results. This chapter shows a performance and power/energy consumption evaluation of several
standard benchmarks, relying on the CARM. In addition to exploring the full functionality of
both tools, a comparison between them is also made, including an overhead evaluation. Finally,
in Chapter 6, conclusions about the presented work are drawn, together with several suggestions
for future research work.
2 Background
Contents
2.1 Performance Monitoring Unit
2.2 Running Average Power Limit
2.3 Linux Kernel Modules
2.4 Performance Monitoring Challenges
2.5 State-of-Art Monitoring Tools
2.6 Cache-Aware Roofline Model
2.7 Summary
Modern computing systems have become complex heterogeneous platforms capable of sustaining
high computing power. In the past, designers were able to improve processing performance
by applying power-hungry techniques, e.g., by increasing the pipeline depth and, consequently,
the overall working frequency. However, such techniques have become unsustainable due to the
well-known power wall. To overcome this issue, while continuing to improve processing
performance, processor manufacturers turned to multi-core designs, replicating a number of
typically identical cores on a single die, where each core includes a set of private coherent
caches and dedicated execution engines, and in some cases hardware support for multiple threads.
Although these solutions are able to provide extra processing power, they also introduce
additional complexity into the design, making it harder for application developers to fully
exploit the available processing power. In particular, all cores share access to a common
higher-level memory organization, typically containing the last-level cache and the main memory.
This may, however, result in resource contention, which can drastically affect execution
efficiency.
Figure 2.1 shows an example of a modern multi-core CPU architecture composed of two
Physical Processor Cores (PPCs), each supporting the simultaneous execution of two threads
(multi-threading). As such, each PPC is divided into two Logical Processor Cores (LPCs), one
for each thread. Thus, each LPC contains a set of registers of its own, e.g., instruction pointer,
stack pointer, general registers and Model-Specific Registers (MSRs). Both LPCs in the same PPC
share the execution resources (e.g., ALU) and the first two levels of cache, which might increase
the contention on these resources. Furthermore, all the LPCs share a last-level on-chip (L3)
cache and the off-chip Dynamic Random-Access Memory (DRAM).
Figure 2.1: Multi-core CPU architecture
In order to characterize and understand the behavior of such complex computational systems,
we require accurate real-time monitoring facilities. These allow, for example, identifying application
and architectural efficiency bottlenecks for real-case scenarios, thus giving both the programmer
and the computer architect hints on potential optimization targets. The following sections describe
the concepts and hardware resources available in modern architectures that allow real-time moni-
toring and that are relevant for a better understanding of the herein presented work. Section 2.1
describes the architectural interface that allows to extract performance information at run-time.
Next, in section 2.2, a similar interface is presented, which aims at providing run-time information
about the system’s energy status. Since the herein presented work presents tools targeting perfor-
mance information extraction, section 2.4 describes the major present challenges when monitoring
performance. Further on, in section 2.5 a quick overview of the most referenced state-of-the-art
monitoring tools is made. At last, in section 2.6, an introduction to the Cache-aware Roofline
Model [11] is made, which is a major requirement to understand the main contributions of the
presented work.
2.1 Performance Monitoring Unit
The hardware Performance Monitoring Unit (PMU) is an architectural interface, available in
most modern Intel processors since Intel’s Pentium processor [12]. It gives developers the ability
to analyze system performance and potential bottlenecks. This unit is composed by a small set of
MSRs, which are hardware control registers. These registers can be configured to monitor specific
architectural Performance Monitoring Events (PMEs), such as clock cycles, retired instructions,
branch miss-predictions and cache misses.
The following subsections describe in detail different types of MSRs used by the PMU and how
to configure them to monitor specific PMEs. Although the provided information is based on Intel’s
architectural performance monitoring facilities [12], similar mechanisms exist in processors
from other vendors, such as AMD, IBM (PowerPC) and ARM.
2.1.1 Performance Model-Specific Registers
The PMU is composed of two main types of MSRs: Performance Monitoring Select Registers
(PMSRs) and Performance Monitoring Counters (PMCs). PMSRs are used for configuring the
events to monitor (count) in each PMC. Thus, PMSRs and PMCs work in pairs, which means that if one writes an event configuration word into PMSRx, the corresponding event counts will be reported in PMCx. The number of available register pairs is usually small (e.g., 4 per logical CPU in Intel Ivy Bridge), which limits the number of events that can be monitored at a time.
Later versions of the PMU provide additional functionality by adding more MSRs to the facility. These include Performance Fixed Counters (PFCs), easier monitoring control (easy toggle and overflow status) and some extra MSRs for off-core event configuration. PFCs have a functionality similar to that of PMCs. The main difference is that one cannot configure which architectural events a PFC should count: PFC events are predefined by the architecture and can
void WriteMSR(uint32_t msr_id, uint32_t d, uint32_t a) {   /* EDX:EAX -> MSR[ECX] */
    __asm__ ("wrmsr" : : "c"(msr_id), "a"(a), "d"(d));
}
void ReadMSR(uint32_t msr_id, uint32_t *d, uint32_t *a) {  /* MSR[ECX] -> EDX:EAX */
    __asm__ ("rdmsr" : "=a"(*a), "=d"(*d) : "c"(msr_id));
}
Figure 2.2: MSR read and write functionality
only be enabled.
2.1.2 Performance Monitoring Event Configuration
Configuring and reading performance MSRs can be achieved by using special assembly instruc-
tions (figure 2.2), namely: wrmsr, that allows writing the contents of the general purpose registers
EDX:EAX into the MSR specified by ECX; or the rdmsr assembly instruction, which allows reading
the MSR specified by ECX, into the EDX:EAX general purpose registers. Since the MSRs are 64 bits
long, we need to use two 32-bit general registers for holding the configuration word or the result.
As already mentioned, there are two types of performance counters: general-purpose (PMCs) and fixed (PFCs). The configuration of a PMC is done by writing the adequate word into its corresponding PMSR. The configuration words are architecture-dependent and should be consulted in the respective manual.
Figure 2.3(a) illustrates the bit field layout of a PMSR. The 16 least significant bits, event
select and unit mask, are meant for choosing the event to monitor. The event select bit field
selects the event logic unit (e.g., retired instructions) and the unit mask specifies the condition
that the selected event unit detects (e.g., retired store instructions). The unit mask values are
specific to each event logic unit. It is also possible to define at which privilege levels one wants
the selected event to count. This is done by using bits 16 (user mode) and 17 (OS mode).
When the user mode bit is set, the selected event only counts when the processor operates at
privilege levels 1, 2 or 3. In the same way, OS mode enables counting at privilege level 0. It is
mandatory to enable at least one of these modes and both of them can be set at the same time.
By default, the configured event only counts for the current LPC. However, measuring the whole PPC is possible by setting the any thread bit flag. If the APIC interrupt enable bit flag is set, an interrupt is raised every time the corresponding PMC overflows, which comes in handy for defining sampling intervals. Performance counting is enabled in the corresponding PMC by setting the enable bit flag (bit 22). More detailed information on this subject can be found in [12].
Intel's PMU version 3 (present, e.g., in the Sandy Bridge and Ivy Bridge micro-architectures) provides three PFCs, whose configuration is done using a single MSR, as described in Figure 2.3(b).
(a) Performance Monitoring Select Register (PMSR)
(b) Performance Fixed Counter (PFC) control register
Figure 2.3: Performance MSRs for Intel’s PMU version 3. Figures obtained from [12].
As already mentioned, these registers can only be toggled and count only predefined architectural performance events. In the current context: PFC0 counts the number of retired instructions; PFC1 counts core clock cycles while the clock signal on the corresponding core is running; PFC2 counts reference clock cycles while the clock signal on the corresponding core is running. The reference clock operates at a fixed frequency, irrespective of core frequency changes. The few configurations available for this type of counter (privilege level selection, any thread flag and toggle flag) work in the same way as already described for PMCs. Overflow interrupts are available as well.
2.2 Running Average Power Limit
On Intel architectures, the PMU does not provide energy information or power metering. In order to make this information available, Intel introduced the RAPL energy status interface in its most recent platforms. Energy status is a power metering interface comprising non-architectural MSRs. Using the set of registers that compose this interface, it is possible to extract energy consumption information in real-time for different domains, i.e., different regions of the processor die.
Figure 2.4: Energy status MSR layout. Obtained from [12].

The domains present in a platform may vary across product segments. Platforms targeting the client segment feature power metering support for package, PP0 and PP1. The package
domain includes the whole processor die, which means that one can obtain the power consumption of the entire chip in real-time. PP0 refers to the cores inside the chip, giving more detailed information on which parts of the processor die consume the most. Intel's manual [12] does not specify the exact target of PP1; the only given information is that it may refer to off-core devices, i.e., parts of the die that are not cores. Platforms targeting the server segment also provide package and PP0 support; however, the PP1 domain is replaced by DRAM. Although it is not described in detail what the DRAM domain really covers, it is likely to target the part of the die that connects and communicates with the computer's main memory.
Figure 2.4 represents an energy status counter register layout. These counters accumulate the consumed energy in real-time, and Intel provides one for each of the previously referred domains. They are updated approximately every millisecond and have a wraparound time of about 60 seconds.
Energy-related information (in Joules) is based on the multiplier 1/2^ESU, where ESU (energy status units) is an unsigned integer. This value can be obtained by reading bits 8 through 12 of the MSR_RAPL_POWER_UNIT register. Its default value is 10000b (i.e., 16), indicating that the energy status unit is in increments of 1/2^16 ≈ 15.3 micro-Joules.
All the registers comprising the energy status interface are read-only and can only be accessed
from privilege level 0. In Linux systems, this means the user needs to create a kernel module to
access these registers, or use any of the already available tools that provide an interface to these
registers.
2.3 Linux Kernel Modules
The hardware facilities described in Sections 2.1 and 2.2 may require special privilege permissions in order to be handled. Although PMU readings may be performed from user-space, configuring PMCs must be done from privilege level 0. RAPL energy status MSRs cannot be written at all, and reading them also requires special permissions.
In Linux systems, there are only two different permission levels: i) user-space, which comprises hardware privilege levels 1, 2 and 3; and ii) kernel-space, which operates at privilege level 0. Therefore, in order to obtain the required permissions for handling the performance and/or power monitoring infrastructures, software interfaces must contain some component that runs on the kernel-space side. Running code in the Linux kernel can be done in two ways:
• Change the kernel source - since Linux is distributed under an open-source license, it is possible to access its source code and modify it according to our needs. Therefore, changing the Linux source code is one way of being able to run code at privilege level 0. This implies, however, recompiling and re-installing the OS, which is not very practical, especially when the product is targeted at third parties.
• Linux kernel modules - a kernel module is a piece of code that, with the right permissions, can be integrated into the Linux kernel at run-time, thus becoming a part of the OS's core and running at privilege level 0. This is a simpler and more elegant way of inserting code into the Linux kernel, and it does not require the OS to be recompiled and re-installed.
The vast majority of Linux kernel modules are designated as device drivers, whether or not they are attached to a physical device [6]. The tools proposed herein make use of kernel modules which, although not connected to any peripheral device, can be logically seen as a way to access the physical hardware resources for performance and power/energy consumption monitoring; it is therefore reasonable to call them drivers.
Linux Device Driver
In Linux operating systems everything is "seen" as a file, including hardware devices, thus standardizing the communication with any physical device to be handled as a regular file. Linux device drivers are the mechanism that makes communication with a device possible, by allowing the predefined operations over the target device file (e.g., read, write, open or close) to be redefined. Both tools presented herein make use of a Linux device driver in order to overcome the hardware privilege restrictions of the performance and power/energy monitoring facilities.
2.4 Performance Monitoring Challenges
The previous sections provided an overview of the currently available performance monitoring structures. As simple as they might seem at first sight, these facilities are usually too complex for the common user. In order to make proper use of them, deep knowledge of the underlying architecture and operating system is required. Therefore, making use of these facilities for dynamic optimization purposes has proven to be challenging for a number of reasons:
• Limited Hardware Resources - The number of available PMCs is typically very small (e.g., up to 4 per logical CPU in Intel Ivy Bridge processors), which limits the number of low-level hardware events that can be measured simultaneously at any given time. Detecting performance bottlenecks in complex superscalar microprocessors often requires a broader analysis covering several architectural components and, therefore, more than 4 events. For offline analysis, one could run the same application several times while measuring different hardware events in each run. However, merging the information from several runs is not straightforward, because there might be asynchronous events (e.g., interrupts and IO events), and other architectural elements (e.g., the branch predictor) might create differences from run to run depending on the current processor state. Among the several techniques that can be used to overcome this limitation, the most common is event multiplexing, which consists of switching the configuration of the PMCs regularly, at short time intervals, thus virtually extending the number of monitored events.
• Complex Interface - The events measured by PMCs are often low-level and specific to a micro-architecture implementation. For this reason, it becomes difficult for the end-user to interpret the obtained counter readings without detailed information on the architecture specifications. Hence, it is hard to translate the counts of hardware events into their actual impact on the end performance.
• High Overhead - Since PMU resources are shared among all processes, they can only be programmed in supervisor mode. Thus, whenever a process needs to configure or change the events being monitored, it has to communicate with the underlying operating system. These expensive communications may happen very frequently, which leads to substantial overhead.
2.5 State-of-Art Monitoring Tools
There are many options in the literature that provide access to hardware performance counters.
In the case of Linux, one of the earliest was the perfctr patch [15] for x86 processors. Perfctr
provided a low latency memory-mapped interface to virtualized 64-bit counters on a per-process or
per-thread basis. Later on, the perfmon [2] interface was submitted to the kernel. When it became
apparent that perfctr would not be accepted into the Linux kernel, perfmon was rewritten and
generalized as perfmon2 [13] to support a wide range of processors under Linux. After a continuing
effort over several years by the performance community to get perfmon2 accepted into the Linux
kernel, it too was rejected and supplanted by yet another abstraction of the hardware counters,
first called perf_counters in kernel 2.6.31 and then perf_events [17] in kernel 2.6.32.
Perf_events is included in the Linux kernel, which makes it the preferable choice over the other available interfaces. The interface is built around file descriptors, allocated using the newly introduced system call sys_perf_event_open(). This system call returns a file descriptor representing a virtual performance counter. Events are specified at open time by using an elaborate
perf_event_attr structure, which contains more than 40 fields that can interact in complex ways. PMCs are enabled or disabled via ioctl() calls, and their values can be read using a call to read(). Sampling can be enabled to periodically read the counters and write the values to a circular buffer, which must be allocated with an mmap() call. Signals are sent to the process holding the referred file descriptors when new data is available.
Although perf_events has proven to be a quite powerful interface, it might be too complex for the common user. Moreover, it does not provide access to the RAPL interface: if power is to be monitored alongside performance, a different interface has to be used.
PAPI [4] is one of the available tools that uses perf_events. Its objective is to be highly
portable by reusing the available OS performance interfaces, while allowing the inclusion of plug-
ins to read other counters, such as those provided by NVIDIA Graphics Processing Units (GPUs).
PAPI provides two interfaces to the underlying counter hardware: a simple, high-level interface
and a fully-programmable low-level interface. The high-level interface only provides functions for
starting, stopping and reading the counters. The low-level interface provides much more manage-
ability and control over the available resources. Event multiplexing, multi-thread support, user
callbacks on threshold and statistical profiling are some of the available functionalities. Recent
versions of PAPI also include the possibility to measure power/energy consumption [18]. On the other hand, if deep control over the available performance resources is needed, PAPI might not be the best option, since it does not provide direct access to the performance unit but virtualizes it instead.
If one is interested in quick binary profiling, without having to write code for it, Perf [1] might be a preferable choice. This is a profiling Linux command-line tool and one of the most referenced. It can be seen as an abstraction over the perf_events interface that is much more accessible to the common user. Perf provides a set of commands which allow not only profiling applications but also reporting the results in a user-friendly way. It provides support for multi-threaded applications, event multiplexing and statistical profiling, among others. A processor-wide mode is also available, allowing the user to profile not a single application but the system itself. However, this tool lacks the possibility of power profiling, which forces the search for other tools when energy information is a requirement.
Yet another well-known resource is OProfile [5], which is composed of a Linux kernel driver, a daemon and a perf-like command-line tool. OProfile's kernel driver abstracts the performance hardware registers and dumps the sampling information at regular intervals. The daemon can be started and stopped by the user, and is responsible for consuming the profiling information provided by the kernel driver and saving it in OProfile's sampling database. This database can later be accessed by the user to extract useful profiling information using the available command-line tools, like opreport. Although this tool appears to be complete in terms of performance, it still lacks the functionality of providing energy status information.
There are several other profiling tools available, like Intel VTune Performance Analyzer [8],
LIKWID [16] or LIMIT [7]. The choice of the right tool is not always trivial and mostly depends on the user's needs. For instance, one may require higher abstraction, lower overhead, higher control or more information detail.
The work described herein proposes two distinct monitoring tools: one implemented in user-space, which provides a system-wide analysis, and another one, mostly implemented in kernel-space, which targets application monitoring. Both proposed tools comprise most of the state-of-art functionalities and, in addition, the ability to assess power/energy information at run-time alongside performance. All the functionality of the tools is exposed through an easy-to-use command-line interface, thus facilitating the usage of the underlying hardware performance and power facilities. Moreover, a predefined performance configuration is provided, which outputs the extracted profiling information into a single plot using the CARM [11], thus providing an easier yet broader perspective of the underlying architecture and the application's attainable performance.
2.6 Cache-Aware Roofline Model
As previously mentioned, to improve performance, modern multi-core architectures replicate several processing cores on a single die. Each core has its own private set of caches (L1, L2), while access to the other memory levels (L3, DRAM) is shared among the cores.
Since data accesses and computation operations are performed in parallel, execution is limited either by the in-core computation resources or by the memory subsystem capabilities. For instance, if an application contains many memory operations and only a small amount of computation over that data, the memory subsystem will stall the execution and, therefore, the in-core computation resources will not reach their peak performance. Based on this observation, the Original Roofline Model (ORM) [19] shows the attainable performance of a multi-core architecture by relating its peak Floating-Point (FP) performance Fp (in flops/s) with the theoretical bandwidth of a single memory level, usually DRAM (in DRAM bytes/s). However, since the memory is composed of several hierarchical levels, this model cannot fully describe the behavior of modern applications and architectures by simply analyzing each individual level.
In practice, accesses to different memory levels cannot be decoupled, since data must traverse the whole memory hierarchy before in-core computations are performed. The recently proposed Cache-aware Roofline Model (CARM) [11] considers these effects and the complete memory hierarchy. Thus, it models the performance upper-bounds of multi-core architectures taking the different memory levels into account, in a single plot. To achieve this, the CARM considers performance, F(φ), and bandwidth, B(β), as continuous functions of the performed flops φ and the bytes β transferred at the different memory levels. The CARM, in contrast to the ORM, perceives information in a centralized way, i.e., from the point of view of the core, thus allowing the information to be normalized. As a result, in the CARM, the operational intensity (I in flops/byte) is uniquely
Figure 2.5: Performance Cache-aware Roofline Model for a quad-core Intel 3770K (Ivy Bridge) processor, plotting performance [Gflops/s] against operational intensity [flops/byte], with rooflines given by the AVX MAD peak performance, the ADD/MUL performance, and the peak L1→C, L2→C, L3→C and DRAM→C bandwidths.
defined and the attainable performance Fa(I) of the architecture is expressed as follows:

Fa(I) = φ/T = min{B(β) × I, F(φ)},   T = max{β/B(β), φ/F(φ)},   I = φ/β.   (2.1)
Equation (2.1) states that Fa(I) is limited either by the memory bandwidth or by the in-core
performance. Indeed, since memory transfers and computations overlap, the overall execution is
dominated either by the time to transfer the data, β/B(β), or by the computation time, φ/F (φ).
Figure 2.5 illustrates the CARM for a quad-core Intel 3770K processor. As can be observed, Fa(I) is bounded by the peak FP performance (Fp) in the compute-bound region, and by the theoretical peak bandwidth of the memory level closest to the core, BL1→C, in the memory-bound region. The model's ridge point corresponds to the minimum operational intensity I required to achieve maximum performance, where computations and memory operations completely overlap. Furthermore, Fa(I) can also vary according to the characteristics of the computing units, i.e., the MAD, MUL or ADD units, and with the available memory bandwidth from the different cache levels to the core (BL2→C, BL3→C and BDRAM→C), thus creating different boundaries.
Since the CARM considers all memory operations, including accesses to the different cache
levels, it results in a single-plot model that reveals the area previously uncovered by the ORM [19].
Furthermore, these differences are also reflected in: i) how the model is constructed; ii) how it is
interpreted; and iii) the given guidelines when optimizing applications [11].
2.7 Summary
This chapter describes the main concepts regarding the hardware and software infrastructures that are relevant for a complete understanding of the presented work. An overview of a modern multi-core CPU architecture is made, introducing the concepts of physical and logical processor cores and illustrating the memory resource hierarchy. The performance and power/energy hardware monitoring facilities are explained in detail, by illustrating their most relevant structures and how to configure and access them. Further on, an overview of the state-of-art profiling tools is presented, providing a broader perspective on the most commonly provided performance and power functionalities. Finally, the CARM performance evaluation model is described, since it is considered one of the most valuable features of the proposed tools. This model provides a deep architectural performance analysis and makes it easy to identify possible hardware and/or software bottlenecks. Gathering the most common state-of-art functionalities and providing them in an easy-to-use interface is one of the main goals of the presented work. Moreover, the proposed tools provide both performance and power/energy consumption information in a single interface, and allow performance execution results to be output into the CARM.
3 User-Space Monitoring Tool (SpyMon)

Contents
3.1 Architecture and Main Functionalities . . . . . . . . 20
3.2 Implementation Details . . . . . . . . 24
3.3 Usage . . . . . . . . 29
3.4 Summary . . . . . . . . 32
The main requirements when performing a performance analysis are: i) full control of the monitored target (e.g., an application, a CPU core or even the whole system); ii) the possibility to configure and select the necessary set of performance events to be monitored; and iii) information with fine time granularity. In addition to these requirements, and to obtain a more complete picture of the system, providing energy/power consumption information is also a valuable feature.
This chapter proposes a new tool (SpyMon) for system-wide monitoring. In the first section, an overview of the tool's internal structure and design is presented, as well as the features and benefits to the end-user. Section 3.2 describes the implementation details, i.e., how SpyMon makes use of the underlying performance and power monitoring facilities in order to provide a simple-to-use interface. Finally, Section 3.3 fully covers the tool's usage and possible configurations.
3.1 Architecture and Main Functionalities
SpyMon's main goal is to provide a portable tool with an intuitive interface for the end-user, without relying on the underlying OS's monitoring facilities. Hence, most of SpyMon's implementation lies in user-space, so as not to interfere with or depend on the running system. SpyMon takes a core-oriented approach, monitoring the behavior of each Logical Processor Core (LPC) and thereby capturing the information of all running applications. As a result, SpyMon allows monitoring the whole system, regardless of what is running at a given time instant on each LPC. This means that even if an application migrates to another core, launches new threads, or its execution is constrained by contention caused by other running applications, SpyMon is able to capture it all.
3.1.1 Spatial Process Organization
The tool proposed herein is composed of a monitor and several spies. The monitor is the main process of the tool. It is responsible for handling the energy status information and controlling the whole execution flow (e.g., the user interface, monitored applications and configuration). The spies are lightweight processes, each attached to a predefined LPC, whose purpose is to configure and fetch the performance counter readings, thereby producing performance information samples.
As previously mentioned, each LPC contains its own set of performance monitoring facilities (PMU). A single process (or thread) can only access the PMU of the LPC it is currently running on; it cannot access the PMU of a different LPC. Since performance information must be gathered from different LPCs simultaneously, the proposed tool runs a performance monitoring process on each target LPC.
The typical SpyMon configuration is to launch a spy to monitor the performance of each available
LPC and to pin the monitor to the last one, as shown in Figure 3.1. In the illustrated example,
Figure 3.1: Spatial perception of SpyMon while monitoring 5 threads from 3 applications on a quad-core CPU with 8 LPCs (private L1/L2 caches per physical core and a shared L3): one spy is pinned to each LPC and the monitor to the last one.
the monitor forks 8 new processes (spies) and pins each of them to a different LPC. By default, the monitor process is pinned to the last LPC, but different configurations are possible, as will be described in the next sections. The spies are responsible for handling the communication with the PMU, in order to output the obtained PMU samples. Since this work relies on facilities for monitoring energy consumption at the level of the whole chip (RAPL), the monitor is responsible for the communication with these facilities and, therefore, for reading the energy status information. Assigning this job to each spy would only introduce additional overhead, since the same values would be read for every LPC.
3.1.2 Available Features
To facilitate the usability of the tool, SpyMon provides a command-line interface, making all its functionality available to the user in an easy-to-use set of commands. The tool also includes a set of predefined performance events, which makes it possible to run a performance analysis on the system without the need to consult the manufacturer's manual. However, it is also possible to manually extend this set, by defining different raw events before starting the tool. SpyMon provides total control over the hardware PFCs, allowing each of them to be enabled or disabled individually. In cases where more PMEs need to be monitored than there are available PMCs, event multiplexing is applied. The ability to choose which LPCs to monitor is also provided, which lowers the overhead when certain cores do not need to be monitored. When energy consumption monitoring is enabled, the reported values always refer to the whole chip and/or different power planes within the chip. A sampling mode is also available, which allows profiling the application at time intervals of finer granularity, thus providing a more precise performance analysis.
What to Monitor
Before monitoring starts, it is first required to specify the objective and the LPCs to be monitored. SpyMon provides the ability to define different monitoring targets:
• System-wide monitoring - The common scenario is to monitor the whole system, relying on either performance or energy consumption (or both). When executed with a configuration similar to the one depicted in Figure 3.1, alongside the required PME configuration, SpyMon provides the ability for a full-system performance evaluation.
• Targeted-cores monitoring - The tool allows selecting a specific set of target LPCs to monitor, as well as rearranging the spatial process organization. This reduces the tool's interference with the system's performance in cases where only a specific set of LPCs needs performance monitoring.
• Application monitoring - If monitoring of specific applications is required, the tool allows keeping track only of the LPCs where those applications run. In this particular case, the SpyMon user retains complete control over where each application runs, in order to ease the interpretation of the monitoring results.
Event Selection
After deciding the set of applications and LPCs for monitoring, the hardware performance events need to be configured. In SpyMon, PMEs are always configured in batches (event-sets). For instance, if the architecture provides 4 PMCs, then 4 PMEs can be configured at the same time, constituting a single event-set. Since there might be restrictions when configuring hardware events, it is very important to take these into account when configuring the PMU. For example, the INST_RETIRED.ALL event can only be counted in PMC1 [12]. While most state-of-art tools provide event scheduling that takes these restrictions into account, SpyMon configures the PMCs in the same order as they are provided by the user, in order to reduce the overheads imposed by increased code complexity. In fact, in modern multi-core architectures the number of PMC-related restrictions is small. In this respect, SpyMon provides PMC restriction information, when applicable, and it is the end-user's responsibility to ensure correct event ordering.
As previously mentioned, SpyMon also provides a set of predefined hardware events to facilitate configuration for the common user, thus allowing a full analysis of the system's performance without the need to consult the manufacturer's manual. Moreover, the interface also provides the possibility of setting different architecture-specific PMEs, in addition to the predefined ones.
SpyMon also provides a very flexible interface for handling performance fixed counters (PFCs). As mentioned before, PFCs work in a similar way to PMCs, yet without the possibility of configuring which hardware events to monitor. SpyMon provides a simple interface that allows each individual PFC to be enabled or disabled. Moreover, it is also possible to configure at which privilege levels to count (user or OS) [12].
3.1 Architecture and Main Functionalities
Event Multiplexing
As previously mentioned, one of the biggest limiting factors for accurate performance analysis in today's general-purpose processor architectures lies in the small number of available PMCs (usually 4 for Intel and up to 6 for AMD architectures). In fact, given the complexity of today's computer systems, this number is usually not sufficient for a full performance evaluation, and therefore event multiplexing must be applied.
In order to monitor more events than those physically provided by the PMU, SpyMon multiplexes the PMEs in time (Time-Based Multiplexing (TBM)), thus virtually expanding the number of available PMCs. PMEs are grouped in event-sets, in the same order as in the user's event configuration. TBM is then performed by switching the currently configured event-set with another one, in a round-robin manner and at regular time intervals. The exact methodology applied for TBM is explained in detail in the following text. However, it should be noticed that a large number of event-sets also implies a higher error in the event count estimation, since different event-sets refer to different time intervals, i.e., to different parts of the application's execution.
Sampling
Sampling refers to the process of extracting performance information at regular intervals, thus providing the ability to capture the behavior of the underlying system at run-time. SpyMon allows defining a sampling time interval, which is used as the duration of each sample. When the monitoring is terminated, the complete set of collected performance samples is output.
Energy Status
One of the most important features that differentiate SpyMon from most state-of-the-art tools is the ability to provide energy/power consumption information. By specifying an extra parameter at invocation, the energy consumption information is also included in the reported output. Since performance and energy/power consumption monitoring rely on different and independent interfaces, both measurements are simultaneously acquired. When sampling is enabled, SpyMon takes an energy sample at the same time interval as for performance, thus providing the same time granularity for both interfaces. The minimum sampling interval is set to 1 millisecond, which corresponds to the approximate time interval at which the energy status MSRs are updated.
Cache-aware Roofline Analysis
For a common user, defining the extensive set of performance events and fully understanding the behavior of real-world applications on a target platform is not a trivial task. To ease this process, SpyMon provides a predefined configuration which allows performing a performance analysis based on the CARM [11]. When running the tool with this configuration, and by providing a target application, the tool automatically outputs the performance information in a single, easy-to-interpret plot. The CARM plot shows the FP performance and operational intensity of each taken sample as a dot, drawn under the model's roof, making it simple to detect potential performance bottlenecks, e.g., from a memory hierarchy point of view.
When using this mode of the tool, it is also possible to define the sampling time interval as well
as to enable energy status information collection. However, energy status information is provided
separately from the model, since CARM only applies to performance.
3.2 Implementation Details
This section presents a detailed description of SpyMon's implementation. The herein proposed tool is composed of three main parts that interact in a hierarchical way, namely i) the monitor, which controls the tool's execution flow and provides all the functionality to the user; ii) a set of spies, which are responsible for the communication with the PMU interface and for handling the performance profiling information; and iii) a Linux kernel module, which provides the access to the hardware facilities, thus overcoming any privilege access restrictions. Figure 3.2 illustrates how the different tool components interact with each other and how they are arranged across the different privilege layers of the system. More detailed information on these components and how they interact is provided in the following text.
Figure 3.2: SpyMon's components interaction and disposition in the OS privilege layers.
3.2.1 Linux Kernel Module and Hardware Access Restrictions
In today's OSs, the access to the hardware performance and energy monitoring facilities is usually restricted to higher privilege levels, i.e., it is not possible to access these facilities directly from the user-space. In order to overcome these limitations, SpyMon integrates a specific Linux kernel module, or driver, which enables the communication with the underlying hardware monitoring interfaces [9] and resolves the permission restrictions. SpyMon's driver is composed of a small number of structures that allow low-level access for the user-space set of commands from the tool, i.e., the addresses of the underlying performance and energy status MSRs, and a set of functions that
operate over these data structures, including reading from and writing to the hardware counters
and configurations registers.
At the time of the module’s installation, a new device file is created in the /dev directory,
allowing the communication between the user-space processes and the driver. The module is
accessed by calling the ioctl() system call over the device file. By using this call, the tool
is not only able to send a specific command to the module, but also to specify an argument,
which is used to send the proper data structures, either for holding the sample readings or for
configuration purposes. Besides the commands for the module’s initialization and termination,
the main functionality of the driver relies on the IOC_RD_PMU and IOC_WR_PMU commands, for
reading from and writing to the PMUs, respectively. In addition, it also includes the IOC_RD_RAPL
command, for reading the RAPL energy status information.
3.2.2 Hardware Readings and Configuration
As previously mentioned, SpyMon's Linux kernel module provides a set of specific commands based on ioctl() system calls, in order to allow the spies and the monitor to access the privileged hardware monitoring facilities. According to the type of request that is made to the module, a
corresponding data structure’s address is sent as the ioctl() argument. Figure 3.3 shows how the
sample holding structures are implemented. When the IOC_RD_PMU command is passed through
the ioctl() call to the module, an address to a previously allocated sample_pmu data structure
(see Figure 3.3(a)) is passed as the argument. The module will then read both the PMU and
the time-stamp counters and copy the readings to the user-space data structure referenced by the
provided address. In brief, the readings from a set of nr_fx_ctrs fixed counters (as enabled by the
user) are stored in a fx array, while a gp array holds the values obtained from a set of nr_gp_ctrs
configured general-purpose counters. As presented in Figure 3.3(b), a similar procedure is used
to access energy consumption in the sample_rapl data structure via the ioctl() IOC_RD_RAPL
command. Alongside the tsc time-stamp readings, the energy status information is stored in the
pkg, pp0, pp1 and dram variables, corresponding to the package, power-plane 0, power-plane 1 and
DRAM domains, respectively. A similar structure is used for configuring the PMU events through
the IOC_WR_PMU command. The main difference between the latter and the sample_pmu structure
is that for configuration purposes one 64-bit variable is sufficient to configure the PFCs (see Figure
2.3(b)).
3.2.3 Main Functionality
Figure 3.4 illustrates the execution flow of the tool, from the perspective of both the monitor and the spy processes. When started, the tool first parses the input parameters (step 1). A detailed description of the available options is given in Section 3.3. If the --help sub-command is provided, the usage information is printed to the standard output (step 8). If the --list sub-command is provided, then the complete list of available hardware events is shown (step 9). On the other hand, if either the --start or the --roof argument is provided, then the monitoring parameters are configured according to the user's input specifications, and the application profiling is initiated. In brief, --start activates the most commonly used SpyMon "profiling mode", while --roof enables run-time cache-aware roofline application monitoring.

Figure 3.3: SpyMon's data structures for ioctl() communication. (a) Structure for PMU sample information; (b) structure for RAPL sample information.
Profiling Mode
When the --start command is provided, the tool first parses and verifies the input parameters. Then, the monitor process is pinned to a specific LPC (step 2) by using the sched_setaffinity() system call. This call informs the scheduler on which LPCs the calling thread is allowed to execute. By default, the monitor is pinned to the last available LPC, although its affinity can be changed by the end-user in the initial tool configuration.
Afterwards, the main process forks several new processes (spies), whose number corresponds to the number of required target monitoring cores (step 3). By default, all LPCs are monitored, i.e., SpyMon first detects the number of available LPCs and launches one spy for each LPC. The
general execution diagram for a single spy process is depicted in Figure 3.4(b) and it starts by
setting the pipe communication channel with the monitor (step a).
Following the process spatial configuration, the PMU configurations are made (step 4). After
parsing the provided PME configuration, SpyMon creates a number of event-sets by grouping the
events according to the number of available PMCs. For instance, if the architecture only supports
4 PMCs and the user provides 7 PMEs, then the tool will define 2 event-sets, where the first event-
set contains the first 4 provided PMEs and the second one contains the remaining 3 PMEs. When
the PMU configuration is done, the monitor sends the configuration structures to the spies, by
means of the previously established pipe communication channels. In the spy execution diagram
(see Figure 3.4(b)), this corresponds to step b. From this point on, each spy starts monitoring its
target LPC and producing the performance sampling information accordingly.
If more than one event-set is defined, then event multiplexing is applied. In these cases, a single performance sample corresponds to a specific predefined time interval in which the performance counter readings from different event-sets are merged together. This mechanism is performed by the spies, since they are responsible for handling the PMU.

Figure 3.4: SpyMon's execution flow. (a) Monitor execution diagram; (b) a single spy execution diagram.
When performing a system-wide evaluation, it is usually required to launch specific applications and analyze their performance, as specified by the end-user. SpyMon provides this functionality via a set of simple configuration commands, which instruct not only the launching of the target applications, but also the pinning of their execution to the required LPCs (step 5). This is achieved by using the fork() and execve() system calls.
When all the initializations and configurations are performed, SpyMon initiates profiling. At this point, the monitor process also starts reading and producing RAPL sampling information (steps 6-7), until the monitored application terminates. At the same time, each spy reads and produces PMU information samples at regular intervals (steps d-g). As can be observed, as soon as the performance counter readings are retrieved for the current event-set (step e), each spy activates event multiplexing if the number of event-sets is greater than 1 (evsets > 1), or immediately outputs the counter readings otherwise (step g). When event multiplexing is activated, the next event-set is configured (step f), i.e., a set of different events will start being counted during a multiplexing time interval (step d). When the last event-set counts are retrieved, the sample is considered finished (sample finished) and its contents are output (step g). The described process (steps d-g) is repeated once for each sampling time interval, until monitoring is completed.
The information produced by both the monitor and the spies is directly printed out to files.
The number of files corresponds to the number of processes executed inside the tool, i.e., to the
number of spies plus the monitor, where each file corresponds to only one of those processes.
Event Multiplexing
The event multiplexing functionality allows the virtual extension of the number of PMCs pro-
vided by the underlying architecture. For example, if the sampling time interval is defined as 9
milliseconds and there are 3 event-sets, each event-set runs for 3 milliseconds. Then, the mea-
surements taken from each configured event-set are merged together by relying on the following
procedure:
counts_estimate = counts_measured × (time_total / time_enabled).   (3.1)

As can be observed in Equation (3.1), event multiplexing implies extrapolating the counter values obtained for a single event-set (counts_measured) during the multiplexing time interval (time_enabled) to the overall sampling time taken to perform all event-sets (time_total). When event multiplexing is used, the obtained final sample counts (counts_estimate) represent a mere estimate of the real counts. An illustration of the above described method, when 3 event-sets are configured,
is shown in Figure 3.5. As it can be observed, the event-sets are switched at regular time intervals.
When the last event-set is measured, a complete sample is acquired and the first event-set is again
configured, thus initiating the next sampling time interval.
SpyMon’s TBM is implemented by the spies and is illustrated in Figure 3.4(b) (steps d-f).
Figure 3.5: Illustration of TBM for 3 defined event-sets.
Cache-aware Roofline Mode
  Event Set 0: FP_SSE_PACKED_SINGLE, FP_SSE_PACKED_DOUBLE, FP_AVX_PACKED_SINGLE, FP_AVX_PACKED_DOUBLE
  Event Set 1: FP_SSE_SCALAR_SINGLE, FP_SSE_SCALAR_DOUBLE, MEM_UOP_RETIRED_ALL_LOADS, MEM_UOP_RETIRED_ALL_STORES

Table 3.1: Sets of PMEs used for performance profiling when using the cache-aware roofline model.

When the --roof command is provided, a process similar to the one previously described for --start is performed. In this mode, there is no need to provide any PMU configuration, since all the required events for the CARM are predefined by the tool. The event-set configuration used
by SpyMon when in roofline mode is depicted in Table 3.1 and the corresponding event description
can be found in Table 3.2. The monitored events required by the roofline model involve detecting
both the performed floating point operations and all the corresponding memory operations (loads
and stores).
As it can be observed, it is required to monitor 6 different events in order to assess the number
of FP operations, and 2 additional events to estimate the amount of data traffic. As a result, event
multiplexing is required and the event-set configuration is made according to the information shown
in Table 3.1.
Energy status information, although not part of the performance roofline model, can also be
provided. If this is the case, the monitor process will also take RAPL samples, in the previously
explained way.
3.3 Usage
SpyMon provides an easy-to-use command-line interface, which facilitates running either a system-wide or an application-specific full performance and energy/power consumption evaluation. As previously mentioned, SpyMon provides different functionalities via a small set of command-line parameters. Figure 3.6 illustrates the four currently implemented main options, i.e., --help, --list, --start and --roof. The set of supported options can be retrieved with the --help parameter, which also provides a short summary on how to use SpyMon's interface with the different options. The --list option outputs a list of the predefined hardware events provided by the tool. Table 3.2 shows a small subset of the predefined hardware events provided by SpyMon.
3.3.1 Profiling Mode
The spymon --start command provides fully configurable execution profiling. This option allows configuring multiple execution parameters, such as the process spatial configuration, the event definition, the enabling of power metering and the sampling time interval.
$ spymon --help
Usage: spymon --help
       spymon --list
       spymon --start [-e ev0[,ev1[...]]] [-f id:mode[,id:mode...]] [-c core[,core...]] [-r domain[,domain...]] [-s stime] [-m core[,core[...]]] [-p [core,[core...]] prog [args]]
       spymon --roof [-s stime] [-r domain[,domain...]] prog [args]

Figure 3.6: SpyMon's usage information.
For event configuration, the -e option must be used. The set of required hardware events is specified as a comma-separated event list, where each event can be designated either by a predefined event name or by a raw event word. For a predefined hardware event, the required events must be chosen from the event list provided by spymon --list and passed as the input parameter. Raw hardware events can be specified by using the format r:evsel:umask:usr:os, where evsel corresponds to the event select bit field and umask refers to the unit mask field, while usr and os represent the user and OS bit flags corresponding to the different privilege modes, respectively. The fields in the raw-event specification format correspond to the previously referred bit fields for event configuration (see Section 2.1).
In order to enable the fixed architectural events (PFCs), the -f id:mode option should be
specified. The id corresponds to a specific PFC number (e.g., if the architecture provides 3 PFCs,
the id can take the value of 0, 1 or 2), while mode refers to the privilege modes to monitor (1 for
user, 2 for OS and 3 for both). Similarly to the general purpose events, the input parameters
should be provided as a comma separated list.
To allow full control over the execution and profiling environment, SpyMon provides the -c, -m and -p options. The -c core option permits specifying which LPC (core) should be monitored; a set of LPCs should be provided as a comma-separated list. Similarly, the -m option allows configuring to which LPC the monitor process is pinned. The default spatial configuration for the monitor and spy processes is shown in Figure 3.1, where the monitor is pinned to the last LPC and a spy is invoked on each LPC.
SpyMon also provides the ability to launch specific applications (including their input parameters) by using the -p option. Several applications can also be simultaneously invoked and monitored, by specifying each of them in a separate -p option. Due to SpyMon's core-oriented system-wide monitoring approach, when multi-threaded applications are analyzed, it is the user's responsibility to ensure the spatial control of the execution threads. For this purpose, besides the application's binary and input arguments, SpyMon provides an extra parameter (core) to the -p option, which allows controlling the application's CPU affinity, i.e., on which LPCs it is allowed to run.
Table 3.2: Sample of hardware performance events provided by SpyMon.

  Event                        Description
  UNHALTED_CORE_CYCLES         Unhalted core cycles.
  UNHALTED_REF_CYCLES          Unhalted reference cycles.
  INST_RETIRED_ALL             Number of instructions retired.
  UOPS_RETIRED_ALL             Number of µops retired.
  MEM_UOP_RETIRED_ALL_LOADS    Qualify any retired memory µops that are loads.
  MEM_UOP_RETIRED_ALL_STORES   Qualify any retired memory µops that are stores.
  FP_SSE_SCALAR_SINGLE         Number of SSE single-precision FP scalar µops executed.
  FP_SSE_SCALAR_DOUBLE         Number of SSE double-precision FP scalar µops executed.
  FP_SSE_PACKED_SINGLE         Number of SSE single-precision FP packed µops executed.
  FP_SSE_PACKED_DOUBLE         Number of SSE double-precision FP packed µops executed.
  FP_AVX_PACKED_SINGLE         Number of AVX 256-bit packed single-precision FP instructions executed.
  FP_AVX_PACKED_DOUBLE         Number of AVX 256-bit packed double-precision FP instructions executed.
  L1D_REPLACEMENT              Number of lines brought into the L1 data cache.
  LLC_REFERENCE                Last-level cache references.
  L2_RQSTS_CODE_RD_MISS        Number of instruction fetches that missed the L2 cache.
  OFF_CORE_MISSES_0            Number of L3 misses.

  SSE - Streaming SIMD Extensions; FP - floating-point; AVX - Advanced Vector Extensions; µops - micro-operations
In order to enable sampling, the -s option must be used, followed by the required sampling time interval in milliseconds. If this option is not enabled, SpyMon will report the sum of all the monitored events at the end of the run. The end of the run is determined by the application with the longest execution time. If no applications are provided, the tool will monitor until an interruption signal is detected (CTRL-C).
Energy consumption information is delivered when the -r option is enabled. This option requires at least one domain to be specified. Several domains can be monitored at the same time, as long as they are provided by the underlying architecture. For Intel architectures, following the supported RAPL power planes, the available domains are pkg, pp0, pp1 and dram.
3.3.2 Cache-aware Roofline Mode
One important feature SpyMon provides is the ability to run a performance analysis based on the CARM [11]. In order to make use of this functionality, the spymon --roof command must be used. When running the tool in roofline mode, one does not need to manually configure any performance counters, as the tool already contains the hard-coded set of events to use for this type of analysis. Furthermore, the user is free to define the sampling time interval, as well as to activate the RAPL energy status interface by using the -r option, in the same way as described for the profiling mode. However, since energy status information is not a part of the model, it is output separately.
3.3.3 Information Output
As previously mentioned, SpyMon outputs the profiling information to files. The number of files corresponds to the number of monitored LPCs. If energy status information is enabled, then its monitoring samples are stored in an additional file.

Both performance and energy information files contain the raw counting values extracted from the corresponding hardware interface. When more than one event-set is configured, each performance counter value is output as estimated by applying Equation (3.1). Moreover, time-stamp information is also provided. As an example, the line format of a performance file for a run with 1 fixed counter and 3 general-purpose counters would be tsc fx gp0 gp1 gp2. On the other hand, the line format of a file containing energy status information for the package, pp0 and pp1 domains becomes tsc package pp0 pp1.
When running in cache-aware roofline mode, post-processing is applied over the output performance files in order to generate the plot containing the performance information plotted under the lines representing the system's attainable performance. The number of flops contained within a sample is calculated according to the following expression:

flops = SCL_SP / 2 + SCL_DP + (SSE_SP + SSE_DP) × 2 + (AVX_SP + AVX_DP) × 4.   (3.2)

In a similar way, the calculation of the corresponding number of transferred bytes relies on the following procedure:

bytes = (8 × scl + 16 × sse + 32 × avx) × (LOADS + STORES).   (3.3)
The scl, sse and avx variables correspond to the fractions of scalar, Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) FP instructions over the total number of FP instructions, respectively. These calculations are necessary since different FP types correspond to different data widths and, therefore, different numbers of bytes. The scalar, SSE and AVX fractions are obtained by relying on the expressions:

scl = (SCL_SP + SCL_DP) / (SCL_SP + SCL_DP + SSE_SP + SSE_DP + AVX_SP + AVX_DP);   (3.4)

sse = (SSE_SP + SSE_DP) / (SCL_SP + SCL_DP + SSE_SP + SSE_DP + AVX_SP + AVX_DP);   (3.5)

avx = (AVX_SP + AVX_DP) / (SCL_SP + SCL_DP + SSE_SP + SSE_DP + AVX_SP + AVX_DP).   (3.6)
The above described expressions are applied over the PMC values obtained from the configuration
depicted in Table 3.1, where each capital variable in the equations corresponds to a specific event
counting value.
3.4 Summary
In this chapter, a new system-wide profiling tool (SpyMon) was introduced. This tool offers
an easy to use interface and is capable of delivering architectural performance and power/energy
consumption information to the user, thus providing the fundamental means for a better understanding of how the underlying resources function.
Currently, the ability to provide both performance and energy consumption information in a single interface is not easily found among state-of-the-art tools, which makes SpyMon a preferable choice. Moreover, it allows the user to run a full performance evaluation based on the CARM [11], without the need for any extra configuration. When executed in this mode, the tool provides a single plot which contains useful information about the execution, allowing the detection of possible architectural bottlenecks or even the improvement of the monitored application's execution on the underlying hardware.
Apart from the means implemented to overcome the possible hardware access restrictions, SpyMon is completely designed to run from the user-space. As a result, a great level of portability is sustained in SpyMon. However, by running in user-space, certain overheads are likely introduced when compared to traditional interfaces, such as perf_events or similar driver-based profiling interfaces.

In the latter cases, full control over system task execution and scheduling can be attained, and the communication with the PMU is performed inside the scheduler, i.e., it is invisible to the actual application's execution. On the other hand, SpyMon does not aim at establishing such control over the OS scheduling mechanisms, since its processes (monitor and spies) allow core-based system-wide application performance and energy status monitoring without interfering with the underlying OS mechanisms.
4 Scheduler-Based Monitoring Tool (SchedMon)

Contents
4.1 Architecture and Main Functionality . . . . . 36
4.2 Implementation Details . . . . . 40
4.3 Usage . . . . . 56
4.4 Summary . . . . . 60
In Chapter 3, an easy to use monitoring tool (SpyMon) was presented, which aims at higher portability, by avoiding dependency on the underlying monitoring interfaces, and provides the mechanisms for a complete system-wide performance and power evaluation. However, by being almost fully implemented in the user-space, this tool implies some limitations. For example, attaining full control of an application's execution flow becomes more challenging, since SpyMon targets the whole system, i.e., it provides monitoring based on a core-oriented approach. In fact, even if extreme efforts are made to construct a tightly controlled environment, e.g., by running only the desired applications and pinning them to predefined LPCs, monitoring at the application level is not an easy task, due to the interference introduced by the OS running tasks. As a result, it becomes extremely difficult to extract this interference from the obtained measurements, especially when monitoring is performed by means of user-space processes.
The performance and power/energy consumption monitoring tool presented in this chapter, i.e., SchedMon, adopts a completely different approach, by making use of the OS internal mechanisms in order to obtain broader and more detailed information about the monitored applications. In contrast to SpyMon, this tool is mostly implemented in the kernel-space and it aims at an application-based evaluation, thus allowing more accurate profiling of the monitored applications. SchedMon's main principles rely upon its modularity, which allows its functionality to be easily extended. In order to achieve that, the tool is designed not to depend on the available OS-specific performance interfaces (e.g., perf_events), therefore not depending on their already implemented structure and functionality.
In this chapter, the new scheduler-based monitoring tool, SchedMon, is introduced. Section 4.1 provides an overview of the tool's novel mechanisms and implemented features. The next section describes the details of the tool's implementation. Finally, Section 4.3 gives an overview of how to use the tool in order to obtain the required performance and/or power results.
4.1 Architecture and Main Functionality
SchedMon is composed of two main parts: i) a Linux kernel module, or driver, which integrates the tool's core mechanisms; and ii) a user-space tool (smon), which exposes the whole functionality of the underlying module and translates it into a simple and intuitive user interface. The communication between both components is made by means of a user-space library, which provides a set of functions for handling the tool's main functionalities.
Figure 4.1 illustrates the interaction between SchedMon's components, as well as their
disposition in the OS privilege layers. As it can be observed, the Linux kernel module is responsible
for interacting with the hardware, thus providing the necessary performance and power/energy
consumption information. The communication between the module and the user-space tool is
made through a set of system calls over the driver’s device file. These necessary communication
commands are provided by the tool’s user-space library. In addition to the command-related
communication, a shared memory area is used for exchanging the produced profiling information at run-time.

Figure 4.1: SchedMon's components interaction and disposition in the OS privilege layers.
4.1.1 SchedMon’s Linux Kernel Module
The core functionality of SchedMon lies in the OS kernel-space, and it runs as an integrated part of the OS kernel, therefore widening the possibilities for better monitoring control. Although kernel modules are executed as part of the Linux kernel code, they only allow inserting new functionality into the OS, i.e., they are not allowed to change the currently implemented system. However, the Linux kernel provides a mechanism that makes it possible for inserted modules to have broader control of the kernel execution flow. This mechanism is designated as a tracepoint. A tracepoint can be described as a breakpoint inserted into a specific place in the code, which can be enabled by providing it with a callback function. When a tracepoint is enabled, the provided callback function is called whenever the corresponding part of the code is run. This facilitates not only debugging, but also allows inserting code into the kernel itself in an elegant and easy way.
SchedMon makes use of the Linux scheduler tracepoints to keep track of the target monitored
applications. These facilities allow attaining full scheduling control over the OS running tasks,
including the ability to detect when a certain task is scheduled onto an LPC or migrated to a
different one.
The interaction with SchedMon’s kernel module (or driver) is established by using a set of
predefined system calls to the driver. The tool provides this functionality as a user-space library,
which contains the necessary functions to configure and control the monitoring environment at
run-time.
4.1.2 Smon: the user-space tool
In order to facilitate the use and control of the above mentioned kernel module, or driver,
SchedMon integrates a user-space component, smon, which implements the mechanisms for
interacting with the driver. This component makes use of the tool's user-space library, which implements
the same data structures and synchronization methods as the driver. Moreover, smon extends the
tool's overall functionality by adding a set of user-space features, as will be described next.
In brief, smon exposes this functionality through an easy-to-use command-line interface. This
interface is composed of a set of commands that give full control over the tool's execution and
parameter configuration.
4.1.3 Available Features
SchedMon follows an application-oriented monitoring methodology, which means that it is meant
for monitoring specific applications, in contrast to a core-oriented or system-wide evaluation (the
scope of the SpyMon tool). The set of features offered by SchedMon was designed according to this
concept, in order to provide very detailed information about the monitored applications.
On-line Analysis
By exposing the main functionality through a user-space interface, SchedMon provides the
possibility for run-time performance and power/energy consumption evaluation. In brief, by
using the provided user-space library, programs are able to interact with the tool and perform
the required actions based on the performance and power/energy consumption feedback. As a
result, SchedMon's functionality can be easily extended by relying only on its kernel module:
being able to use SchedMon without the user-space command-line interface, smon, is one of the
great advantages of the tool.
Multi-threaded Application Profiling
With the constantly increasing number of parallel resources provided by modern architectures,
multi-threaded applications are becoming more and more mainstream. Therefore, it is absolutely
crucial to be able to profile applications that can spawn tens or even hundreds of threads, in order
to enable architectural or application design optimizations.
The ability to profile multi-threaded applications is one of the main functionalities provided
by SchedMon. By setting the proper configuration parameters, one may attach the same
performance monitoring behavior to all child tasks descending from the main monitored task,
recursively. This means that one can profile multi-threaded or multi-process applications, by
tracing all the descendant threads or processes, as long as they descend from the target
application's main process.
Task Hierarchy
In addition to monitoring multi-threaded applications, SchedMon allows obtaining, in real-
time, information about every process or thread that is created, as long as it descends from the
targeted application's main process. By using this information, it is possible to construct, either at
run-time or after the application’s execution, the complete task hierarchy tree, therefore obtaining
a better understanding of the application’s internal structure and design.
Task Scheduling
The OS scheduler is the internal OS mechanism that is responsible for deciding when and where
a current set of active tasks are allowed to run, i.e., it controls which tasks have access to a specific
LPC at a given time. As already mentioned, SchedMon relies on the Linux scheduler execution flow
to obtain information about each monitored task. Thus, it is able to trace task movements
inside the architecture at run-time.
One of the features of the tool is the ability to dump the scheduling information of each
monitored task, at run-time. This means that, if such detail is requested, information about when
each monitored task was using a specific LPC may be provided.
Task Migration
Following a strategy similar to the one employed for the task scheduling feature described above,
SchedMon also allows extracting information about the exact time when a task is migrated by the
scheduler. Task migration refers to the action performed by the OS scheduler when the execution
context of a single task is moved from one LPC to a different one.
Performance Monitoring
The main functionality of every performance monitoring tool is the ability to provide the
necessary means to access the underlying architectural performance facilities (PMU). SchedMon
allows the complete configuration of the underlying performance interface, providing the means to
configure both PMCs and PFCs, in a simple and easy way. In addition, in order to facilitate the
tool’s performance configuration, SchedMon provides a predefined set of PMEs.
A particularity of SchedMon is that PME configuration is kept inside the driver, thus requiring
the definition of PMEs to be made a priori, i.e., before execution. The main advantage of this
method is that events can be reused in several different runs, without the need to be redefined.
Moreover, it is possible to create shell scripts to automate the reconfiguration of the tool with the
required event definitions, which facilitates the configuration across reboots or even across different
platforms.
Energy Status Information
Good power and energy consumption management is becoming increasingly important for
modern computing systems. In this context, SchedMon provides a power and energy consumption
monitoring interface that allows performing a complete power/energy consumption evaluation of
the hardware facilities, alongside performance. The energy status information can be toggled
as simply as enabling a single option and providing the required monitoring domains, when starting
the profiling execution.
Cache-aware Roofline Model
Similarly to SpyMon, the SchedMon tool provides a way to perform application performance
analysis using the CARM [11]. This functionality provides an easy way to extract and analyze the
execution behavior of an application running on modern multi-core architectures. By analyzing
the resulting single-plot diagram, it is possible to detect potential architectural and/or application
bottlenecks.
In this respect, SchedMon provides an extra functionality, which allows the automatic construction
of the model for different general-purpose multi-core architectures, in order to facilitate the
portability of the tool. By running predefined assembly benchmarks, it is possible to extract the
model's necessary parameters for different architectures and, therefore, increase the model's
precision.
Function Call Tracing
The function call tracing functionality is achieved by intercepting every function call that
the target application invokes. Therefore, SchedMon provides the ability to know which part
of the application's code is executed at a specific time. The most referenced state-of-the-art tools
usually achieve this functionality by back-tracing the program stack each time a new sample is
taken [1] [4] [5]. Hence, in certain execution scenarios, it might not even be possible to detect
exactly when a function is entered or exited, or to catch all function calls.
In contrast, SchedMon provides the ability to detect when the monitored application enters or
leaves a function. Moreover, if the target application switches its executable file at run-time, the
tool is able to detect and load the information about the new binary's functions. Finally, if new
processes or threads are created during the execution, SchedMon is also able to keep track of their
calls, independently of whether they share the same execution code or not.
4.2 Implementation Details
This section describes the implementation details of SchedMon. As already mentioned, the tool
is composed of two main parts: a Linux kernel module, implemented as a device driver, which is
the core of the tool and allows greater control over the underlying hardware resources; and a
user-space tool, which interacts with the implemented module by means of a user-space library
and exposes it to the user through an easy-to-use command-line interface.
4.2.1 Linux Kernel Module
SchedMon's kernel module, or driver, is the main component of the tool, since it contains all the
main functionality and data structure implementations. When loaded into the kernel, the driver
creates a file in the /dev directory (the device) which acts as a communication medium to the
driver, i.e., operations over this file trigger the corresponding module function to handle that
specific operation. At the moment, SchedMon's device driver defines five different operations over
the device file:
• Open - This function is called each time an open() operation is performed over the device
file. At the moment, this function is only used for initialization and debugging purposes;
• Release - Similarly to open, this function is called each time a device file descriptor is closed
and it serves mostly for debugging purposes;
• Ioctl - This function incorporates most of the user-to-kernel communication functionality.
It is triggered when an ioctl() call is made over the device file, and it allows attaining
control over the monitoring facilities. An ioctl() call permits not only sending specific
predefined commands to the driver, but also exchanging data between the kernel and
user-space if a user-space address is provided as a function argument;
• Mmap - This operation allows a user-space program to share memory with SchedMon’s
device driver, thus reducing the overall communication overhead. This function is triggered
when a mmap() call is made over the device, and it must be performed in order to obtain
profiling information from the driver, as described below;
• Poll - This operation implements the synchronization mechanisms used by SchedMon to
coordinate the read and write operations over the previously allocated shared memory. It
can be used by calling poll() or select() functions over the device.
The proper use of the above described calls is what permits the full control and configuration
of SchedMon’s driver from the user-space.
Events, Event-sets and Environments
SchedMon's infrastructure for performance configuration relies on three basic data structures:
event, event-set and environment. These structures are designed to interact in a hierarchical
way, as shown in Figure 4.2, thus allowing the reutilization of not only event but also event-set
definitions. Therefore, there is no need to re-create the same events or event-sets across
different runs.
An event data structure contains both the event tag identification (event_tag), which is
defined at the time of the event configuration, and the Performance Monitoring Select Register
[Figure: an environment (nr_evsets, evset_arr, profiling options such as sample_time and fork_info, and the number of tasks using the structure) points to event-sets (evset_tag, event_arr, fixed_ctr_ctrl, global_ctr_ctrl), which in turn point to events (event_tag, event_configuration with event_select, unit_mask and OS/user bits).]
Figure 4.2: SchedMon event, event-set and environment structural hierarchy.
(PMSR) value necessary for the desired event configuration (event_configuration). Since PMUs
usually provide more than one PMC and even several PFCs, the event-set data structure contains
a number of pointers to event structures (event_arr), the PFCs configuration (fixed_ctr_ctrl)
and an additional register variable which contains the information of which PMCs and PFCs
are configured for the event-set (global_ctr_ctrl). Moreover, since this structure contains a
full configuration of the PMU, only one event-set can be configured at a time per LPC. Both
events and event-sets need to be defined before profiling is started, and they are stored inside
the driver. On the other hand, the environment data structure is created at the time of the
run, and it is meant for maintaining the performance configuration for that specific execution,
i.e., it keeps both the pointers for the monitored event-sets (evset_arr) and the profiling options
(profiling_options), such as the sampling time interval and the flags defining the required
profiling information types. At the end of the run, environment data structures are destroyed.
There are currently six available profiling options flags, that need to be defined at the time of
the run, and which have the purpose of enabling specific configuration parameters:
• inherit - Applies for multi-threaded applications and, when set, the forked tasks will also be
monitored, by inheriting the configuration from their parent task.
• on_exec - When set, the monitoring of the target application process is started at the
time when the next execve() system call is made. This guarantees that the application
monitoring starts exactly when its execution starts. This option is actually used by smon,
since it relies on the execve() call in order to run the provided application binary.
• rapl - For all cases when RAPL energy status information is required, this flag must be set.
This enables power and energy consumption profiling during the execution and provides the
resulting sampling measurements.
• migration - When this flag is enabled, whenever one of the monitored tasks is migrated, the
corresponding information is provided as a sample.
• fork - This flag works in a similar way to the migration flag, although it delivers a sample
each time one of the monitored tasks forks a new task.
42
4.2 Implementation Details
• sched - When detailed scheduling information is required, this flag can be enabled. If so,
each time a task is scheduled in and out of a specific LPC by the OS, a sample with the
corresponding information is delivered.
The above described option flags, apart from inherit and on_exec, define what kind of
information should be included while profiling. The sample_time parameter may also be defined
when configuring the execution, and sets the performance and power/energy consumption
sampling time interval. Although the described flags allow enabling or disabling the different
types of profiling information, performance sampling information is always enabled (by default).
Sample Types
SchedMon’s driver currently provides five different types of samples, which refer to different
previously described profiling configuration parameters, namely: performance, energy status, task
migration, task creation and CPU scheduling information.
Performance samples are always enabled. They provide a complete PMU sample reading
during a specific time interval, which is defined via the sample_time parameter. Moreover, timing
information is also available by providing the time-stamps corresponding to the start and the end
of the sample, and also the duration of each sample. The sample duration might differ from the
difference between the end and start time-stamps, in cases when the sample spans different
CPU scheduling intervals. For instance, if a task is scheduled out of the CPU in the middle of a
sample, replaced by another task, and then scheduled in again later, the sample duration will not
include the foreign task's execution time. Information about the corresponding event-set and task
PID is provided as well.
RAPL samples (power/energy consumption samples), when requested, provide the energy sta-
tus counter readings for all available domains of the processor chip, for the same time interval
defined for performance samples. Since power consumption monitoring is performed at the chip
level, there is only one application task monitoring it and, thus, the PID is not necessary. On the
other hand, the sample start and duration times are still provided.
In contrast to performance and RAPL sample types, which provide hardware event counter
readings, the remaining three sample types refer to specific software events. The task migration
samples, when requested, provide information about when the task is migrated, which CPUs are involved
in that migration and the timing information (i.e., corresponding time-stamp values). Task creation
samples provide the PIDs of both involved processes (parent and child) and the corresponding
time-stamp values. The scheduling information samples are provided each time a monitored task
leaves the current LPC. In detail, each sample contains not only the task PID and the LPC
identification, but also the corresponding time-stamps of when the task entered and left that LPC.
The time-stamp information is obtained by using the rdtsc instruction, which provides the
corresponding LPC's time-stamp counter (TSC) value. Each LPC contains its own TSC register,
which measures the time since the machine was booted. Although Linux implements certain
mechanisms to synchronize the TSCs across the several available LPCs at boot time, there are no
guarantees that TSC values across different LPCs are actually synchronized. Nonetheless, for the
sake of simplicity, TSCs are assumed to be synchronized across all LPCs up to a certain accuracy
level.
Monitored Tasks
The Linux kernel uses the same data structure to represent both user-space processes and
threads, which is denominated a task. SchedMon follows the same methodology; thus, any
user-space thread or process will be referred to herein as a task.
SchedMon defines two types of tasks: leaders and children. In order to monitor an application
using the tool, the target process, or thread, must be registered into the driver. For this, an ioctl()
system call with the proper request must be performed. The task registration request requires two
distinct arguments: the target PID, which is the task identification parameter, and an environment
data structure containing the profiling configuration. Under SchedMon’s driver, every registered
task is appointed as a leader. On the other hand, a child corresponds to a task descending from
a leader. This only applies if the inherit option is enabled upon the leader task registration,
otherwise the driver will not register any children descending from that task.
Each leader task that is registered in the driver is associated with a performance environment,
i.e., a data structure containing the profiling execution configuration. Whenever a child is allocated
by the driver, it inherits its leader's performance environment and, therefore, the same configuration.
SchedMon's driver keeps track of the task organization by using Linux doubly-linked lists, thus
facilitating the process of task creation and destruction. A leader task can be contained in three
types of lists: task_list, cpu_list or wait_list. The task_list contains all the registered
leader tasks, i.e., all the tasks that were registered through the ioctl() system call, which might
be currently monitored or finished (waiting to be unregistered). For each available LPC, there
is a corresponding cpu_list. Each cpu_list contains all the monitored tasks that are currently
running, or scheduled to run, on the corresponding LPC. On the other hand, the wait_list is
reserved for tasks that are already registered but are still waiting for the next execve() system
call event in order to start their monitoring. Therefore, a leader task cannot be present in both
the wait_list and a cpu_list at the same time. Each leader task also defines a fourth list head,
children_list, which holds a linked list of every created child, if any.
Scheduling Infrastructure
SchedMon's scheduling infrastructure constitutes the core functionality of the driver, since it is
responsible for handling both the task infrastructure and the profiling operations. As already
mentioned, the profiling operations depend on a set of events triggered by the Linux scheduler. At
the moment, the tool is able to detect five different scheduling events, by means of kernel
tracepoints which,
static void sched_process_exec (struct task_struct *p, pid_t pid);
static void sched_process_fork (struct task_struct *parent, struct task_struct *child);
static void sched_switch (struct task_struct *prev, struct task_struct *next);
static void sched_migrate_task (struct task_struct *p, int dest_cpu);
static void sched_process_exit (struct task_struct *p);
Figure 4.3: Linux scheduler tracepoints used by SchedMon.
when triggered, correspond to a specific operation to be handled by the driver. Figure 4.3 depicts
the implemented tracepoint callback headers and their relevant arguments.
The first illustrated tracepoint, sched_process_exec(), is triggered whenever a task executes
an execve() system call, i.e., whenever it replaces its execution binary with another one, and the
task identification parameters are passed through the arguments. When this function is triggered
inside the driver, a sequential search over the wait_list is made. If the task is registered and
present in the wait_list, it is removed from this list and inserted into the corresponding
cpu_list, i.e., it is ready to be monitored.
In a similar way, whenever a task forks a new child, sched_process_fork() is called and
the respective pointers to parent and child task structures are provided as arguments. In this
case, since this event is triggered in the LPC running the parent process, the driver searches the
corresponding cpu_list and, if the forking task is registered in the system (and has the inherit
flag enabled), a child data structure is inserted into this same list and it inherits the behavior of
its parent.
The sched_switch() tracepoint is the most frequently called among all the tracepoints, since
it is triggered each time the scheduler replaces a task with another one on a specific LPC. Whenever
this function is called, SchedMon searches the corresponding cpu_list for the task which is marked
to be scheduled out. If found, the configured PMU counters are stopped and the readings are
saved and accumulated with the readings of the next PMU sample. If energy status information is
enabled, a similar process is performed, except for stopping the counters, since it is not possible
to stop the energy status counters. Power/energy consumption sampling is only performed
by the leader tasks, since they represent the main application processes. Furthermore, the sched
configuration flag is evaluated and, if enabled, the corresponding scheduling sample is
produced. After the above described operations are made for the scheduled-out task, while still in
the sched_switch() function, a similar search is performed for finding the next scheduled-in task. If
found, the initializations are made in order to restart performance and, if applicable, power/energy
sampling. These initializations consist mainly in setting kernel timers that, when triggered, produce
a system interrupt which allows the driver to take the required measurements or, in the case of
performance sampling, to reconfigure the PMU as needed. The sampling process will be explained
later in detail.
Each time a task is migrated, the sched_migrate_task() function is called, providing the
information of which task is migrated and to which LPC it is migrated. Since this call is made
from the LPC where the task is migrated from, a sequential search over the corresponding cpu_list
is made. If the migrated task is registered in SchedMon, it is removed from the current cpu_list
and it is inserted in the destination cpu_list. Moreover, if the migration option flag is set, a
migration sample is also produced.
Finally, the sched_process_exit() tracepoint is triggered whenever a task is terminated. If
the terminated task is one of the SchedMon’s monitored tasks, it is removed from its corresponding
cpu_list and its timers are stopped, in order to stop that task from being monitored.
Sampling
Sampling refers to the process of extracting specific information from the execution at regular
time intervals. Figures 4.4(a) and 4.4(b) illustrate the sampling process and a use-case scenario
of a task being profiled over time, respectively. For the sake of simplicity, the presented diagram
refers to an execution scenario in which:
• There is only one LPC;
• The only scheduling event being triggered is the sched_switch() call;
• The tool is exclusively profiling performance;
• The sampling time interval is 10ms;
• Two event-sets are configured.
In order to provide accurate performance sampling, several auxiliary data structures are used. The
main ones are: i) the array containing the different event-set configurations; ii) a Linux
high-resolution timer, for synchronization purposes and sampling at nanosecond granularity; and
iii) a temporary PMU sample, which holds the current sample counts.
After a task is registered for profiling in SchedMon's driver (step 1), the scheduling
infrastructure is able to detect its presence and to perform the corresponding sampling operations.
As presented in Figure 4.4(b), at start time, 0ms, the Linux scheduler assigns the CPU resources to
the target monitored task. As soon as this task is found, the driver conducts the necessary
operations for starting the sampling process.
Firstly, the PMU is configured (step 2), which is done by writing the event-set 0 MSR configu-
ration into the underlying performance facilities. Along with this process, the current PMU sample
values are written into the performance MSRs, both PFCs and PMCs, which in this specific case
[Figure: (a) flow diagram of the sampling steps — register task (1); on schedule-in: load event-set into PMU (2), start hrtimer (3), get time-stamp (4), start counting (5); on schedule-out: stop counting (6), stop hrtimer (7), get PMU readings (8), get time-stamp (9); on hrtimer interruption: close current sample (a), dispatch sample (b), reset sample (c), reconfigure PMU (d), restart hrtimer (e). (b) timeline of a task alternating between event-set 0 and event-set 1, with sched_in, sched_out and sample-taking points marked.]
(a) Sampling flow diagram.
(b) Sampling example in time.
Figure 4.4: SchedMon sampling process illustration.
are all initialized to zero. After the PMU configuration is performed, the high-resolution timer
is set (step 3), by configuring it to trigger after 10ms (the sampling time interval). When the
timer is configured, a callback function is provided, which is triggered upon the timer's expiration
and is meant for taking and dispatching a sample. Right before starting the counters, the
Time-Stamp Counter (TSC) is read (step 4), in order to keep track of time-related information.
Finally, the configuration process is concluded by enabling the PMU counters in order to start
counting (step 5).
In this hypothetical example, the monitored task is scheduled out before the timer interrupt
occurs (at 8ms). Thus, the first operation is to stop the performance counters (step 6), in order
to reduce the overheads imposed by the tool. Subsequently, the high-resolution timer is stopped
(step 7), such that it cannot be triggered during the next steps. Moreover, the timer's remaining
time (2ms) is kept for the next timer configuration. Then, the PMU counters are saved into the
current sample data structure (step 8) and the TSC is again read (step 9). Although this does
not represent the end of the current sample, the time information is still obtained in order to keep
track of the sample duration.
At 14ms of the execution time, the target task is scheduled in again, and a procedure similar
to the one already described is performed. The main difference from the previous explanation
occurs in steps 2 and 3. Since the application is still running event-set 0 (with 2ms remaining to
complete the sampling time interval), the current sample contains the counts from the previous
run. Therefore, when configuring the PMU, instead of initializing the PMCs and PFCs to zero,
the current sample values are restored into the counters, thus allowing the previous sample to
continue. In a similar way, when setting the high-resolution timer, instead of the default
sampling time interval, it is now started with the time left from the previous sample run, i.e., 2ms.
After 2ms, the timer is triggered before the task is scheduled out. As a result, at 16ms of the
illustrated execution example (see Figure 4.4(b)), the corresponding interruption occurs, allowing
the sample to be completed (step a) and the proper reconfigurations for the next sampling
interval to be performed. After closing the sample, it is dispatched to user-space (step b) by
means of shared memory (as explained in detail in the following text). With the completion of the
previous sample, the structure holding the current sample information is reset (step c), i.e., the
counter values are again initialized to zero. The PMU is reconfigured according to event-set 1
(step d) and the timer is set to count a complete sampling time interval (step e). Afterwards,
the previously described procedure is repeated until the end of the task execution (from 16ms
onwards in Figure 4.4(b)).
It is important to note that the described procedure includes several optimization techniques to
provide accurate performance sampling with minimal introduced overheads. Firstly, whenever the
task is scheduled in or out of the CPU, the corresponding steps (2-5 and 6-9 in Figure 4.4(a)) are
executed inside the scheduler. This allows reducing the visible profiling overhead, since no task is
currently running on the CPU. Secondly, when a timer interruption occurs, the counters are
stopped and started in a way that minimizes the tool's interference with the counter values.
Hence, it is important to note that the overhead referred to herein corresponds to the overhead
induced on the profiling information, and not to the overall system's overhead.
The energy status sampling procedure is similar to the one described for performance. The
main difference relates to the fact that the energy status interface cannot be configured, thus only
the readings are performed. Moreover, power/energy consumption sampling is only performed
by the leader tasks, as opposed to performance sampling, which is performed by all tasks. At
the moment, RAPL energy status samples are taken with the same granularity as performance
samples, i.e., they share the same sampling time interval. However, leaders use a distinct
high-resolution timer for performance and power/energy, which facilitates the possible
introduction of a different sampling time interval for power/energy.
Kernel-User Communication
Up until now, in the context of sampling, the action of finishing a sample was usually referred to
as "producing" or "dispatching" a sample. The mechanism used by SchedMon for exchanging
produced samples between the kernel and user-space is actually one of the most complex ones,
since it comprises a memory ring-buffer, a virtual memory area shared between the kernel
and user-space, and a synchronization mechanism. This mechanism was implemented in
order to reduce the communication overhead between the kernel and user-space, since there is no
need to replicate the produced sample data.
A ring-buffer is a memory buffer of limited size, managed in a circular way. For instance, if a ring-
buffer contains ten available slots, it can be filled from slot 0 to slot 9 and, when the end of the buffer
is reached, it starts filling the slots from the beginning again. This type of mechanism is commonly used
in producer-consumer problems, where data is frequently exchanged and temporarily stored in
memory. Following this producer-consumer methodology, SchedMon implements a ring-buffer
to facilitate the exchange of the produced samples between the kernel and the user-space.
Linux organizes memory by means of pages, which represent chunks of physical memory and
are usually 4kB in size. SchedMon's ring-buffer is therefore an abstraction over an array
containing one or several memory pages. Figure 4.5 depicts the implementation and functionality
of the tool's ring-buffer, providing information not only about the main data structures, but also
about the spatial disposition of the buffer in virtual memory, from both the user- and kernel-space
points of view.
SchedMon's driver implements the ring-buffer abstraction by allocating a number of pages
and keeping their addresses in a data_array structure. As illustrated on the left side of Figure 4.5,
the virtual addresses of the pages do not necessarily need to be ordered, since they are individually
allocated, i.e., page by page. In order to track the next writing and reading positions, the driver
declares two 32-bit variables, head and tail, which identify the next free and the next filled
positions, respectively, as a page number and an offset within that page. For example, as can be
observed in Figure 4.5, if the head is positioned at the beginning of page 3, its page_number is 3 and its offset is 0.
A virtual memory area corresponds to a chunk of memory that is shared between the kernel
and the user-space. In order to obtain profiling information when using SchedMon, the user-space
program needs to reserve this space by performing an mmap() system call, providing the
required number of ring-buffer pages, i.e., the shared memory size. The driver then creates
the buffer and obtains the start address of the virtual memory area, which is translated
into a user-space virtual address. From this point on, both kernel and user-space have a common
4. Scheduler-Based Monitoring Tool (SchedMon)
Figure 4.5: SchedMon ring-buffer implementation overview. (The figure shows the data_array holding the ring-buffer pages, the head and tail fields encoded as a page number plus an offset, and the ring-buffer abstraction with its empty sample slots as seen from both the user-space and kernel-space virtual memory views.)
shared chunk of memory, which does not necessarily correspond to the same virtual address.
In order to synchronize the communication between the kernel and the user-space, a specific
protocol must be implemented. This protocol is needed to instruct the user-space process on how to
read the information provided in the buffer. For this, an 8-bit header is added at the beginning of
each sample, which defines the type of the sample stored in the following memory positions. For example, 1
refers to PMU samples, while 2 refers to RAPL samples. SchedMon's library also provides all the
data structures that are necessary for communication purposes. Therefore, since both sides know
the size and structure of each sample type, and by following the ring-buffer implementation, it is
possible to extract meaningful information from the buffer.
Finally, a specific synchronization mechanism is implemented to alert the user-space
program whenever new data is available, since it is not possible to directly access the driver's
ring-buffer structure information from the user-space. This is done by means of the poll() system
call, which lets the user-space process listen, while in a sleep state, for a predefined set of events on
a number of file descriptors. By using this facility, the user-space process is able to detect when
new data is available. SchedMon allows the user to configure the size of the burst, i.e., the number
of samples to be consumed at a time. Therefore, the driver triggers the corresponding poll() event
whenever the required burst size is available for consumption.
Concurrency and Deadlock Avoidance
Concurrency and deadlock avoidance are two main concerns that have to be carefully taken
into account when programming a Linux device driver, especially when interacting with complex internal
mechanisms like the Linux scheduler. SchedMon's driver contains several data structures that
are shared by all tasks belonging to an application, e.g., the ring-buffer and the child_list
infrastructures. Furthermore, some data structures are even shared among all the tasks registered
in the driver, e.g., the remaining task list infrastructures. Since most of these structures are usually
handled during the Linux scheduler execution, it is important to use specific locking mechanisms
that do not sleep. In Linux, this type of mechanism is the spinlock: a lock, implemented in a few
assembly instructions, that busy-waits on a variable (the lock holder) until it is released by the
task keeping it. Therefore, in all code regions that require mutual exclusion, the appropriate
spinlock mechanisms are used.
Avoiding deadlocks when dealing with Linux internal mechanisms is more complicated, since
it requires prior knowledge of how those mechanisms are implemented. In fact, there
is a set of specific actions that SchedMon's driver is not allowed to perform at run-time. A good
illustrative example occurs when the driver detects that a burst-size number of samples is available
for the user to consume. This event occurs when a new sample is written to the buffer, either in
interruption mode or when the task is scheduled out. In both situations, the driver is not allowed to
sleep. The function used to trigger the poll() event to the user-space, i.e., to signal
that there is new information to read, is named wake_up() and, when called, it requires obtaining
the locks associated with the task being awakened. However, the corresponding locks might be
held by some other Linux infrastructure, like the scheduler, which may cause the task to enter
a sleep state, rendering the complete system unresponsive.
In Linux, this type of situation is avoided by means of the irq_work_queue
mechanism. This infrastructure allows postponing jobs, which are executed as soon as possible by
triggering a system interruption once interruptions are re-enabled. In our specific case,
this refers to the moment when the Linux scheduler finishes executing. Hence, whenever a possible
deadlock situation is detected, SchedMon's driver resorts to this mechanism in order to avoid it.
4.2.2 User-space Tool
SchedMon's user-space component, smon, is integrated in the tool in order to facilitate
access to, and handling of, the underlying driver. By making use of the driver's user-space library
for configuration purposes, and by means of the mmap() and poll() system calls, smon exposes
the whole tool's functionality through an easy-to-use command-line interface.
The main functionalities of smon include i) the creation of events, ii) the definition of event-
sets, by using the already created events, and iii) the ability to profile an application. Smon
Request       Description
SMON_IOCSEVT  Set new event. A structure holding the event configuration must be sent.
SMON_IOCGEVT  Get event information. The event id must be provided. If the event id exists, a structure containing the event description is returned.
SMON_IOCCEVT  Check if event id exists. Returns 0 if true.
SMON_IOCSEVS  Set event-set. A structure holding the event-set configuration must be provided.
SMON_IOCGEVS  Get event-set. The id must be provided. If the id exists, a structure with the event-set configuration is returned.
SMON_IOCCEVS  Check if event-set id exists. Returns 0 if true.
SMON_IOCSTSK  Register task into the driver. The task PID, along with an environment configuration, must be provided.
SMON_IOCUTSK  Unregister task. This must be used when the task no longer needs to be monitored, even if its execution has already finished.
SMON_IOCREAD  Consume N bytes from the buffer, i.e., instruct the driver that N bytes have been read.
Table 4.1: Available ioctl() requests to SchedMon's driver.
firstly parses the user's input, to detect the required command and the input configuration sets for
that command. The command-line interface is explained in detail in Section 4.3.
Apart from parsing and verifying the user's input, all three main functionalities rely on
ioctl() system calls to perform the required set of actions. Table 4.1 enumerates and
describes the available ioctl() requests provided by the driver. The first six requests serve for
handling events and event-sets, and they represent the whole mechanism behind these two func-
tionalities. The last three requests are used for profiling purposes: registering and unregistering
tasks, as well as consuming memory chunks from the ring-buffer.
Application Profiling
In contrast to event and event-set handling, profiling requires a number of different mecha-
nisms in order to work properly. Firstly, the target application execution must be handled. This
is done by forking a new process, whose execution image is then switched to the required
one. The forked child process, before proceeding to the execve() system call, waits for a shared
semaphore to be released by smon's main (parent) process. Meanwhile, on the parent
process side, after the child is forked, a request is sent to the driver for the ring-buffer
allocation. As previously referred, this allows information exchange between the driver and the
user-space process (smon).
When the memory buffer is set, the child's PID is registered into SchedMon's driver by using an
ioctl() call with the proper registration request (SMON_IOCSTSK), as shown in Table
4.1. In order to instruct the driver to start monitoring the target child task after the execve() call
is executed, the registered task is configured with the on_exec flag enabled, therefore setting the
task to start being monitored as soon as it switches its program’s execution image. This guarantees
that the user target application is fully profiled, from the moment its execution begins. At this
point, the parent process releases the shared semaphore that prevents the child from executing,
and the profiling is initiated.
Profiling is performed by using the same ring-buffer methodology used by the driver. Since the
driver is not aware of how the user-space program handles the shared memory region, SchedMon's
user-space library provides the necessary routines to handle the ring-buffer and, therefore, the
proper communication with the driver.
The process of reading samples from the buffer is initiated by looping around a poll() system
call. This puts the calling process in a sleep state until a new burst of samples is available. Each
time an available data burst is detected, the requested amount of samples is read. Since
different sample types correspond to different data structures and might even have different sizes,
an 8-bit header is injected by the driver before each sample, containing the information about
the sample type. Whenever a ring-buffer memory page is exhausted, a "stuff" header
is injected after the last sample, thus identifying the end of that page. Finally, at the end of the
profiling, an "end-of-profiling" header is inserted.
Cache-aware Roofline Model
As previously mentioned, SchedMon provides a predefined profiling configuration, which outputs a
full performance evaluation based on the CARM. Similarly to SpyMon, this is achieved by running
the predefined event-set configuration shown in Table 3.1. This application profiling is performed
according to the previously described methodology. The main difference lies in the fact that, when
running in CARM mode, there is no need to define which events to monitor.
In SchedMon, an additional functionality is provided to automatically create the CARM for
the detected general-purpose multi-core architecture. As mentioned in Section 2.6, to build
the CARM, it is necessary to assess i) the peak FP performance of the architecture; and ii) the
attainable bandwidth for the different cache levels. Although the theoretical peak FP performance
and L1 bandwidth can be derived directly from the device manufacturer's data sheets, assessing
the bandwidth for deeper cache levels must be performed by relying on specific micro-benchmarks.
In order to ease this process, the proposed tool integrates specific assembly-level tests for de-
termining these bandwidth values, as presented in Algorithm 4.1 for Double Precision (DP) FP
AVX instructions. The adopted test procedure varies the size of the transferred data to hit differ-
ent cache levels, by accessing contiguous and increasing memory addresses. To obtain accurate
bandwidth values, each test code is repeated 8192 times, in order to favor throughput over
latency, and, in each repetition, the values of the monitored performance counters are assessed. In
detail, MEM_UOP_RETIRED_ALL_LOADS and MEM_UOP_RETIRED_ALL_STORES were used to determine
the number of performed load and store operations, respectively.
Algorithm 4.1 Bandwidth test code
vmovapd 0(%rax), %ymm0;
vmovapd 32(%rax), %ymm1;
vmovapd %ymm2, 64(%rax);
vmovapd 96(%rax), %ymm3;
vmovapd 128(%rax), %ymm4;
vmovapd %ymm5, 160(%rax);
vmovapd 192(%rax), %ymm6;
vmovapd 224(%rax), %ymm7;
vmovapd %ymm8, 256(%rax);
. . . ;
Algorithm 4.2 FP MAD test code
vmulpd %ymm0, %ymm0, %ymm0;
vaddpd %ymm1, %ymm1, %ymm1;
vmulpd %ymm2, %ymm2, %ymm2;
vaddpd %ymm3, %ymm3, %ymm3;
vmulpd %ymm4, %ymm4, %ymm4;
vaddpd %ymm5, %ymm5, %ymm5;
vmulpd %ymm6, %ymm6, %ymm6;
vaddpd %ymm7, %ymm7, %ymm7;
vmulpd %ymm8, %ymm8, %ymm8;
...;
A similar procedure is adopted for determining the peak FP performance, by relying on a
set of benchmarks as depicted in Algorithm 4.2. For this particular case, the peak FP
MAD performance of DP FP AVX instructions is assessed by relying on the FP_AVX_PACKED_DOUBLE
PME. When assessing the peak FP performance for other types of FP instructions, such as SSE or scalar double,
different PMEs are used, as presented in Table 3.2.
The reported experimentally obtained bandwidth and performance values represent a median
of the counter readings from all 8192 runs.
Function Call Tracing
Function call tracing is the process of detecting whenever a target application, the
tracee, enters or leaves a function call. This is an important feature for detecting potential
execution bottlenecks in the most time-consuming parts of the application. This functionality is
introduced in smon and cannot be provided by solely using the tool's driver, since it is implemented
in the user-space.
An application's binary executable files may be dumped in order to extract useful information
about the program. Figure 4.6 shows the dumped assembly code of a simple hello function, which
prints the traditional "Hello World!" message to the screen. The squares indicate the
function's entry and return instructions, respectively.
The method used by smon to detect the entry and return points of a function requires
preprocessing the dumped assembly code of the application. The detected execution points are then
assigned to breakpoint structures, which hold the original bytes contained at those positions and
are used to inject code into those same memory addresses. For instance, in the case of the example
illustrated in Figure 4.6, two breakpoints are created and, once the memory bytes represented by
the squares are saved, each of them is replaced by the CC opcode. This is the trap instruction
(int3), used for the purpose of tracing the execution of a process.
A well-known mechanism that makes use of this instruction is the debugger.
In order to peek or inject code into a running application, the ptrace() system call must be
used. This call allows tracing a target process and provides a vast set of functionalities, such
Figure 4.6: Example of a function’s dump information.
as detecting when the tracee performs system calls or forks a new thread or process, and
controlling its execution at run-time.
Each time one of the tracee's trap instructions is executed, the program stops and the tracer process
is alerted to this event by means of a SIGTRAP signal. By taking advantage of the above-referred
mechanisms, smon is able to detect when a process enters or leaves a function. When smon
detects that one of the application's breakpoints was reached, it replaces the trap instruction byte with the
original byte, thus allowing the application to proceed with its execution. Since ptrace() allows
executing the tracee in single-step mode, smon instructs the target process to execute only one
instruction and, after that, the breakpoint is enabled again (in order to catch repeated calls to
the same function).
Figure 4.7 depicts the main structures used by SchedMon to keep track of the function
call tracing information. As already mentioned, the trace_breakpoint structure corresponds to a
point of interest in the tracee's program execution memory and contains the corresponding memory
address, the original data contained in the executable file, and the new data (trap instruction)
that replaces it. Each breakpoint is associated with a function, which is represented by the
trace_function data structure. Besides the breakpoint information, this structure contains the
name of the corresponding function and its start and end addresses. Keeping the start and end
addresses enables a binary search when looking for the hit breakpoint. For
each executable file being traced by SchedMon, there is a corresponding trace_mem_info data
structure. This structure contains a set of functions from the program and, therefore, holds all the
necessary information for tracing the target executable. The trace_task structure allows the tool
to trace multi-threaded applications, by keeping track of each forked process or thread individually.
Each task is thus associated with its execution code (trace_mem_info).
In order to keep track of a process execution flow, including when it forks or switches its
execution image, a set of options must be used when the tracing is initialized. This can be done
by calling ptrace() with the PTRACE_SETOPTIONS command parameter. SchedMon makes use of
the corresponding set of options in order to detect:
• Forks - this option enables detecting whenever a new thread or process is spawned by the
tracee. When this happens, a new trace_task structure is created and inserted into the
task list. This new task inherits the trace_mem_info structure of its parent.
Figure 4.7: SchedMon function call tracing data structures (trace_task: pid, mem_info, task_list; trace_mem_info: nr_functions, function_arr; trace_function: nr_breaks, break_arr, function_name, start_addr, end_addr; trace_breakpoint: address, original_data, new_data).
• Execution swaps - this option allows the tool to detect whenever the tracee performs an
execve() system call. For each detected call, a new trace_mem_info structure is created
and attached to the target task.
• Terminations - ptrace() also allows detecting when the monitored task finishes executing.
This functionality is used by SchedMon in order to terminate the tracing of the target task.
4.3 Usage
This section describes how to use SchedMon in order to obtain the appropriate
profiling results. As already mentioned, the tool incorporates not only a Linux kernel driver, which
contains the core functionality, but also a user-space tool, smon, which exposes that functionality
to the user as a simple-to-use command-line interface.
Currently, smon provides four commands with distinct functionalities: i) event,
which allows creating and adding new events to the tool; ii) evset, which provides the means to
define new event-sets, i.e., new PMU configurations; iii) profile, which allows the full profiling
of target applications; and, finally, iv) roof, which provides useful architectural insights based on
the CARM.
4.3.1 Adding Events
Adding an event definition to the tool is done by providing the PMSR field configuration
parameters. In order to facilitate the later recognition of a configured event by the user, a tag
identifier should be provided. Figure 4.8 shows the smon event command usage.
In order to add a new event definition, three arguments must be provided: i) the TAG, which
holds the event identification, ii) the EVSEL, which is the 8-bit value corresponding to the PMSR’s
event selector bit field, and iii) the UMASK, corresponding to the unit mask field of that same MSR.
usage: smon event --add|-a tag=TAG,evsel=EVSEL,umask=UMASK[,mode=MODE]
       smon event --list|-l

List of <event --add> parameters:
TAG    String to tag the new event.
MODE   2-bit value defining the running mode (user-1, kernel-2 or both-3).
EVSEL  8-bit event selector value.
UMASK  8-bit unit mask value.
Figure 4.8: Smon event usage information.
An additional MODE can be added to the configuration. This field defines in which mode (or modes)
the target event counts are made, and it may take the values of 1 (user-space), 2 (kernel-space)
or 3 (both). If not provided, the default value for this field is 3, meaning the event is counted
when the CPU operates in either mode.
There is a second sub-command, --list, that prints out the list of already
configured events, including the event tags and configured fields. Moreover, each event is assigned
an integer identification value, which can be used later when defining event-sets.
4.3.2 Defining Event-sets
Similarly to the described event functionality, the smon evset command allows adding new
event-set definitions and listing the already configured ones. Figure 4.9 demonstrates the usage of
this command.
usage: smon evset --add|-a tag=TAG,events=EVID[:EVID[...]][,fixed=FIXED]
       smon evset --list|-l

List of <evset --add> parameters:
TAG    String to tag the new event-set.
EVID   Event ID. Check <smon event -l> for a list of available events.
FIXED  12-bit number (4 bits for each fixed ctr): 0-Disabled 1-OS 2-User 3-Both
Figure 4.9: Smon evset usage information.
In order to create a new event-set, at least two parameters must be provided. The first param-
eter, TAG, allows an easier identification of the event-set without requiring a check of its configuration
fields. The second parameter refers to a sequential set of general-purpose events that must be
provided; for this, several event identification numbers are passed via the EVID parame-
ter. The number of events is limited to the number of underlying hardware PMCs. In addition to
PMCs, smon allows the configuration of the PFCs. This is done by providing a single hexadecimal
value through the FIXED parameter. For example, if PFC0 and PFC2 need to be enabled for both
privilege modes (user and OS), the correct parameter value would be 0x303.
4.3.3 Application Profiling
In order to profile an application with smon, the required event-sets must already be defined
in the driver, as well as the events needed to create those event-sets. Figure 4.10 illustrates the usage of
the smon profile sub-command, providing a complete list and description of each individual
option.
usage: smon profile [[options]] PROG [ARGS...]

List of available options:
-b BURST        Burst size, i.e., nr of samples transferred at a time (default is 1000)
-c CPUMASK      Bind task to specific logical CPUs (e.g., to bind to CPUs 0,1 & 6 -> CPUMASK=0x43)
-e ESID:[...]   Eventset(s) to monitor (if more than one, time-multiplexed round-robin style)
-f              Deliver information about Forking for the monitored task(s)
-i              Children (recursive) of monitored process will Inherit monitoring
-m              Deliver CPU Migration information
-o O_FILE       Output file (default is "smon.data")
-p MMAP_PAGES   Number {power of 2} of mmap Pages (default is 1024)
-r DOMAIN:[...] Deliver RAPL information for specified domains at the time granularity of STIME
-s              Deliver CPU Scheduling information (this might have big overhead)
-t STIME        Sample Time in milliseconds (default is 1000)
-x T_FILE       Activate function call tracing and output information to T_FILE (default is "smon.trace")
Figure 4.10: Smon profile usage information.
The smon profile operation provides several options that allow not only configuring the tool's
execution, but also defining what kind of sampling information is required. The majority of the
depicted options correspond to previously explained functionalities or configurations of SchedMon.
However, the interface enables an extra functionality that was not previously described. By using
the -c option, smon provides a way of binding the target task to a specific set of CPU cores, i.e.,
of restricting the task to be scheduled only onto the provided set of LPCs. To achieve this, the
value of the CPUMASK parameter must be specified. This value is a hexadecimal number, where
a bit set to one enables the LPC corresponding to that bit position in the provided bit word, as
depicted in Figure 4.10.
Another feature worth highlighting is the possibility of providing the output infor-
mation in two separate files: the T_FILE, when the function call tracing option is enabled, and the
O_FILE, which holds all the remaining profiling information. The format of the information contained
in these files is explained later.
usage: smon roof-run [-t STIME] [-r DOMAIN:[...]] [-o OUTFILE] PROG [ARGS...]
       smon roof-creat

List of parameters:
STIME    Sampling time interval in ms (default is 10)
DOMAIN   Energy status domain (pkg, pp0, pp1 or dram)
OUTFILE  Output file (default is "smon.data")
Figure 4.11: Smon roof-run and roof-creat usage information.
4.3.4 Cache-aware Roofline Mode
In order to ease performance assessment, SchedMon provides a predefined perfor-
mance configuration which outputs the information according to the CARM [11]. This
functionality is also integrated in smon's command-line interface, thus providing an easy and intu-
itive usage. Figure 4.11 illustrates the usage information of the CARM-related commands. As previously
mentioned, in addition to the traditional cache-aware roofline evaluation, SchedMon provides a way
to generate the model parameters by executing predefined micro-benchmarks. This functionality
not only improves the model parameters for the underlying architecture, but also facilitates
the tool's portability to different architectures.
The command-line usage for the referred functionalities is straightforward, since all the
configurations are already hard-coded into the tool. The only configurable parameters allow: i)
changing the sampling time interval; ii) enabling energy status profiling;
and iii) redirecting the output information to a different file. As already mentioned, power metering
information is not included in the CARM and is therefore provided as extra information.
4.3.5 Information Output
SchedMon defines two different file types that contain the profiling information output:
• smon.data - this file contains all the profiling output, with the exception of the function
call tracing information. Currently, the file is formatted in ASCII and each line contains a
single profiling sample (e.g., PMU, RAPL or scheduling information).
• smon.trace - if the -x option is enabled at profiling time, the function call tracing
information is stored in this file. Each line of the file contains the time-stamp of
when a specific application function was called (or returned).
When running in CARM mode, the performance sampling information is stored in the smon.data
file, which is processed after the application run in order to generate a third file containing the
performance counts plotted against the CARM (smon.plot). SchedMon also provides a set of
scripts that facilitate parsing the output information.
4.4 Summary
In this chapter, a new, easy and intuitive scheduler-based application profiling tool (SchedMon)
was proposed. The tool targets independence from any available performance or power interface, and
it is designed in a modular way, which not only facilitates portability, but also eases the
addition of future functionalities.
SchedMon is composed of i) a Linux kernel module, or driver, that facilitates the access to
the underlying hardware performance and power facilities and helps overcome possible privilege
restrictions; ii) a user-space library, which allows the interaction between user-space programs and
the driver and, therefore, the ability to perform run-time application profiling; and iii) a user-space
tool (smon) that not only facilitates the usage of the tool, but also provides a new function
call tracing functionality.
SchedMon gathers most of the state-of-the-art performance monitoring capabilities, energy status
information and application function tracing, and packs them into a simple and intuitive command-
line interface. Furthermore, it provides the ability not only to determine the underlying architecture's
attainable performance, according to the CARM, but also to output, in a single plot, the exe-
cution performance profiling information against this model, which facilitates the understanding
of the underlying hardware resources and allows detecting possible architectural or application
bottlenecks.
5 Experimental Results

Contents
5.1 Experimental Environment . . . 62
5.2 SpyMon Experimental Evaluation . . . 62
5.3 SchedMon . . . 68
5.4 Overhead Discussion . . . 75
5.5 Summary . . . 79
This chapter targets the evaluation and functionality demonstration of the presented
tools, by illustrating and analyzing the results obtained for different experimental scenarios. Section
5.1 describes the experimental environment, i.e., the experimental conditions under which the presented
tests were run. Section 5.2 explores and analyzes the main functionalities of the SpyMon tool.
Similarly, Section 5.3 presents the experimental evaluation of the proposed scheduler-based profiling
tool, SchedMon. In order to compare the efficiency of both tools, similar experimental scenarios
are considered; however, additional results are presented for each tool, in order to exercise the
extra functionalities not shared between them. At last, Section 5.4 discusses the overheads
introduced by both performance and power/energy consumption monitoring.
5.1 Experimental Environment
The presented results were obtained on a machine containing an Intel Core i7-3770K processor,
which is based on the Ivy Bridge micro-architecture, with 4 physical cores and hyper-threading
support, i.e., 8 LPCs. It operates at 3.5GHz, although it can attain 3.9GHz in turbo boost mode,
and its memory organization comprises 3 cache levels, namely: a 32kB L1 cache, a 256kB L2 cache
and an 8192kB L3 cache. The L1 and L2 cache levels are shared between the LPCs contained in the
same PPC, while the last-level cache, L3, is shared among all LPCs. The DRAM memory controller
supports up to two channels of DDR3 operating at 2x933MHz.
The above described architecture provides a PMU containing 3 PFCs and 4 PMCs. With respect
to the energy status interface, information regarding the package, power-plane 0 and power-plane 1
is available. The performance and power hardware facilities are configured as described in Sections
2.1 and 2.2.
The Linux kernel uses the non-maskable interrupt (NMI) watchdog to periodically detect whether a
CPU is locked up. In order to achieve this, the watchdog makes use of the underlying PMCs, which in turn
interferes with any mechanism that makes use of the hardware PMU. Therefore, before executing
the presented tests, the Linux watchdog was disabled. This can be achieved by writing the value
of 0 into the /proc/sys/kernel/nmi_watchdog system configuration file.
Since Intel’s turbo boost functionality is implemented in a complex way and may, therefore,
complicate the understanding of the obtained performance results, the processor’s clock was
set to a fixed frequency of 3.5GHz, which corresponds to the maximum non-turbo frequency.
5.2 SpyMon Experimental Evaluation
This Section presents the obtained experimental results for the SpyMon monitoring tool. The
performed experimental evaluation was conducted in order to demonstrate the tool’s capability for
a system-wide performance and power/energy consumption analysis. Moreover, a set of standard
FP benchmarks from the SPEC CPU2006 suite [10] is evaluated in terms of both performance and
power/energy consumption and their CARM analysis is also presented.
5.2.1 System-wide Profiling
Figure 5.1 illustrates a performance evaluation of four distinct SPEC CPU2006 benchmarks
(milc, namd, GemsFDTD and tonto). In order to obtain the depicted results, each benchmark
test was executed individually, without the interference of any other applications (with the excep-
tion of the OS tasks). For each execution, the benchmark process was pinned to its corresponding
LPC, as shown in Figures 5.1(a), 5.1(c), 5.1(e) and 5.1(g) for milc, namd, GemsFDTD and
tonto, respectively. Each of the shown LPCs was chosen in order to belong to a distinct PPC.
After running each of the four tests individually, a final run was performed, in which all the four
tests were run at the same time. The obtained results are presented in Figures 5.1(b), 5.1(d), 5.1(f)
and 5.1(h). In each of the runs, the sampling time interval was set to 20ms.
By analyzing Figure 5.1, several informational details can be extracted. First, all the bench-
marks achieve lower performance when run alongside each other, due to shared resource contention.
This conclusion follows from observing that i) each benchmark's duration is longer
when run alongside the others, and ii) each benchmark's performance values are significantly lower. For
example, milc takes around 210s to execute alone, in contrast to around 280s when run alongside
the others. In addition, it achieves performance values of around 3.1GFlops/s when running alone,
as opposed to a maximum value of around 2.9GFlops/s when executed with the others.
Another interesting observation concerns the shapes of the obtained plots, where different parts
of the execution can be detected. For instance, when running the milc benchmark alone (see
Figure 5.1(a)), at least three distinct execution phases can be identified, where each of them occurs
at regular time intervals and delivers a different attainable performance (in GFlops/s). However,
when run together, the shapes of each benchmark execution appear to change according to the
concurrent applications. For example, the shape of the GemsFDTD benchmark is completely
distorted when run with the other applications (see Figures 5.1(e) and 5.1(f)).
It is important to determine where in the architecture the previously described performance
interferences happen. As already referred, each benchmark was run on a different PPC and, there-
fore, does not share any in-core computational resources with the other applications. Hence, the
interference between applications can be associated to memory contentions in the shared cache
level (L3) and in DRAM.
Since the inter-task interference mainly relates to memory contention, an interesting phe-
nomenon can be observed for namd, whose shape does not seem to be affected by the other
benchmarks (see Figures 5.1(c) and 5.1(d)). This happens because namd is most likely compute-
bound, i.e., its performance is mainly limited by the predominant computations, and does not
highly depend on the memory operations. Therefore, while the other benchmarks seem to dispute
the access to the shared memory resources, namd mostly depends on the in-core computational
[Figure 5.1: eight performance plots, Performance [GFlops/s] (0-5) vs. Time [s] (0-400): (a) Milc running alone (core 0); (b) Milc running with others (core 0); (c) Namd running alone (core 1); (d) Namd running with others (core 1); (e) GemsFDTD running alone (core 2); (f) GemsFDTD running with others (core 2); (g) Tonto running alone (core 3); (h) Tonto running with others (core 3).]
Figure 5.1: SpyMon performance evaluation of SPEC CPU2006 benchmarks, for a 20ms sampling time interval.
resources, and the attained performance corresponds to the one obtained when namd was executed
without any interference of other applications.
Finally, it should be emphasized that the presented performance results might include a small
OS interference, since SpyMon monitors the core and not the application itself. Thus, the presented
results correspond to the execution of all tasks detected by the PMU during the time of the
[Figure 5.2: five power plots, Power [W] (20-40) vs. Time [s] (0-400): (a) Milc running alone (core 0); (b) Namd running alone (core 1); (c) GemsFDTD running alone (core 2); (d) Tonto running alone (core 3); (e) Four benchmarks run simultaneously, annotated with the points where tonto, milc, namd and GemsFDTD finish.]
Figure 5.2: Power consumption of four benchmarks run separately and simultaneously.
benchmarks’ execution.
Figure 5.2 depicts the experimentally obtained power consumption for the above described
test conditions. The plotted information corresponds to the package domain, i.e., it represents the
power consumption of the whole chip. When each benchmark is executed alone (see Figures 5.2(a),
5.2(b), 5.2(c) and 5.2(d)), the chip power consumption is around 25W. As it can be observed, the
power consumption depends not only on whether a core is active, but also on the
resource utilization. For instance, as shown in Figure 5.2(a), the power consumption assumes a
shape similar to the one observed in the milc performance profile (see Figure 5.1(a)). On the
other hand, Figure 5.2(e) shows the power consumption when all benchmarks were simultaneously
executed. As it can be observed, each additional activated LPC corresponds to an increment of
approximately 5W in the system’s power consumption (see Figure 5.2(e)).
5.2.2 Cache-aware Roofline Model Analysis
In order to obtain a more detailed picture on the attainable performance of the applications
from an architecture point of view, a set of benchmarks is analyzed by relying on the CARM.
Figure 5.3 illustrates the execution of calculix and tonto benchmarks, where each dot in the
CARM represents a different monitoring sample. In both cases, the sampling time interval was
set to 50ms and different colors were used to represent the predominant FP types (scalar, SSE or
AVX).
[Figure 5.3: two CARM plots, Performance [GFlops/s] vs. Operational Intensity [flops/byte], with samples colored by predominant FP type (DBL, SSE, AVX) and rooflines for (1) AVX (MAD), (2) AVX (ADD,MUL) / SSE (MAD), (3) SSE (ADD,MUL) / DBL (MAD) and (4) DBL (ADD,MUL): (a) Calculix; (b) Tonto.]
Figure 5.3: Evaluation of SPEC CPU2006 benchmarks by using the CARM. The sampling time interval was set to 50ms.
[Figure 5.4: three-dimensional plot of Performance [GFlops/s] vs. Operational Intensity [flops/byte] vs. Time [s], with samples colored by predominant FP type (DBL, SSE, AVX).]
Figure 5.4: Temporal representation of the CARM for Tonto.
Figure 5.3(a) contains the performance profiling information for calculix. As it can be noticed,
there are two predominant types of FP instructions, scalar and AVX, and each type is associated
with its corresponding roof line. For instance, by observing the blue dots, one can see that
they form a shape similar to that of the roof lines, thus confirming that the roof lines
delimit the attainable performance.
According to the CARM, it can be observed that calculix is not completely compute-bound,
since there are some parts of its execution where it is in the memory-bound area. A good example
of this can be observed in the higher performance FP scalar samples, i.e., the top DBL samples,
which even trace a ridge point similar to that of the model. Moreover, since this part of the exe-
cution reaches higher values than the scalar attainable performance for ADD or MUL operations,
simultaneous addition and multiplication operations (MAD) were likely performed.
Figure 5.3(b) contains the CARM information for tonto. Similarly to calculix, this test
presents two distinct performance parts, which contain the predominant scalar and SSE FP types,
respectively. As it can be observed in Figure 5.4, the two distinct parts of the execution alternate
in time. During the parts of the execution corresponding to the scalar instructions
(DBL), one can conclude that tonto is mainly memory-bound, since it lies to the left of the
corresponding ridge point, both for the ADD/MUL and MAD roof lines. In fact, Figure 5.1 shows
that these zones of the execution are memory-dependent and inflict changes in the performance
shapes of applications running alongside. On the other hand, when executing SSE instructions,
tonto is considered to be more compute-bound.
Figure 5.5 illustrates the performance CARM analysis for a number of SPEC CPU2006 bench-
marks. Each point corresponds to the average of the measured samples for each test and the colors
have the same meaning as previously explained. According to the CARM, the namd, calculix and
milc benchmarks are considered to be compute-bound for their average execution. On the other
hand, the soplex, povray and lbm benchmarks are considered to be memory-bound. Finally, gamess,
tonto and GemsFDTD might be considered either memory-bound or compute-bound, depending
on the usage of FP operations.
5.2.3 Power/Energy Consumption Evaluation
In order to demonstrate SpyMon’s power/energy consumption monitoring functionality, in ad-
dition to the previously illustrated results, a set of standard SPEC CPU2006 benchmarks was run
individually, with the tool’s predefined process configuration (one spy for each LPC) and a sampling
time interval of 50ms. Figure 5.6 shows, in time, the obtained power consumption measurements
for calculix and tonto. The plotted information corresponds only to the package domain, i.e., it
represents the power consumption of the whole chip. Similarly to what was previously described
for the CARM analysis, distinct types of FP correspond to different colors.
In contrast to tonto, where different execution phases are interleaved at regular intervals,
[Figure 5.5: CARM plot, Performance [GFlops/s] vs. Operational Intensity [flops/bytes], showing the average sample of each benchmark (GemsFDTD, calculix, gamess, lbm, milc, namd, povray, soplex, tonto) against the AVX, SSE and DBL MAD and (ADD,MUL) rooflines and the AVX/SSE and DBL L1 LOAD rooflines.]
Figure 5.5: Application CARM plot showing the floating-point SPEC CPU2006 benchmarks; the application color characterization was made according to the average classification (double, SSE or AVX).
[Figure 5.6: two power plots, Power [W] (20-30) vs. Time [s], with samples colored by predominant FP type (DBL, SSE, AVX): (a) Calculix; (b) Tonto.]
Figure 5.6: Power evaluation of SPEC CPU2006 benchmarks.
calculix mixes the use of both AVX and scalar FP operations. Furthermore, it can be observed
that distinct FP types correspond to different levels of power consumption, where scalar regions
indicate the lowest values and AVX the highest.
Figure 5.7 illustrates the average power consumption, as well as the total energy consumption
for a number of different SPEC CPU2006 benchmarks. Since the average power consumption does
not significantly vary across the different benchmarks, the differences in the energy consumption
mostly relate to the duration of each benchmark.
5.3 SchedMon
This section presents the obtained experimental results for the SchedMon monitoring tool. The
performed experimental tests are intended to illustrate the capabilities of the tool, in terms of
its distinct functionalities. First, a Finite-Difference Time-Domain (FDTD) OpenCL multi-threaded
application [14] is tested in order to illustrate the tool’s ability to detect multiple
[Figure 5.7: bar chart of average Power [W] and total Energy [kJ] for GemsFDTD, calculix, gamess, gromacs, lbm, milc, namd, soplex and tonto.]
Figure 5.7: Power and energy evaluation for different floating-point SPEC CPU2006 benchmarks.
thread executions at run-time. Next, the function call tracing functionality is demonstrated for
a simple multi-threaded application and a real-world SPEC CPU2006 benchmark. Finally, a set
of FP benchmarks from the SPEC CPU2006 suite is evaluated in terms of both performance, by using the
CARM, and power/energy consumption. In both cases, the function call tracing is highlighted,
instead of the predominant FP type, in order to provide a different evaluation perspective from
the one presented for SpyMon.
5.3.1 Application Thread Hierarchy
Figure 5.8 depicts the dependency process tree of the executed FDTD OpenCL application [14],
where each node contains the PID of a monitored task. The main task that was registered into
SchedMon was the one on top (786) and it corresponds to the leader task of this execution. During
the execution, whenever this task or a subsequent child forks a new task, it is also registered into
the tool and starts being monitored immediately. In order to perform a multi-task evaluation, the
–i option was enabled.
As shown in Figure 5.8, SchedMon allows profiling multi-threaded applications, regardless of
the thread dependency level, i.e., the number of levels of the dependency tree. In this specific case,
one can observe that the monitored application is composed of 9 distinct tasks, which construct a
four-level dependency tree.
5.3.2 Scheduling Information
SchedMon allows not only to detect and monitor multi-threaded applications, but it also provides
the means to analyze the scheduling route of each task’s execution. This allows obtaining more
detailed information on the system’s scheduling mechanisms, as well as extracting useful insights
about the application’s structure.
Figure 5.8: Thread hierarchy for an FDTD OpenCL application [14].
[Figure 5.9: Gantt-style plot of Time [s] (0-30) for CPUs 0-7, showing the intervals during which each task (PIDs 786-792, 795 and 796) occupies each LPC.]
Figure 5.9: Scheduling information for the OpenCL application fdtd.
Figure 5.9 shows the scheduling information corresponding to the previously referred OpenCL
application test. As it can be seen, SchedMon is capable of monitoring all the information regarding
when each of the application tasks enters or leaves a CPU (LPC). Since the underlying hardware
contains 8 LPCs and the tested application is composed of 9 tasks, it is not possible to run all the
tasks at the same time. In this specific test, the OS scheduler solves
this issue by constantly migrating task 790 from one core to another. For example, at around
5s of the execution time, task 790 is migrated from LPC 5 to LPC 6. Another interesting
phenomenon can be observed at around 9 seconds of the execution, where all tasks stop
executing for about one second, with the sole exception of the leader task. This indicates that all
the capabilities of SchedMon to provide insights on the application structure.
5.3.3 Function Call Tracing
As previously referred, SchedMon provides the ability to trace the function calls of a given applica-
tion. This is conducted by instrumenting the adequate memory locations with a trap instruction,
thus implementing breakpoints. Figure 5.10 illustrates the execution and obtained profiling in-
formation of an application containing two processes, which are associated with different binary
executable files. Figure 5.10(a) presents the detailed information obtained from SchedMon. The
first column indicates if a function is being called or returned. The next column contains the
elapsed time information, in seconds. The third column indicates the PID of the corresponding
task. In the last column, the function names and addresses are provided. Figure 5.10(b) graphically
represents the information shown in Figure 5.10(a).
[Figure 5.10: (a) SchedMon's output; (b) time diagram showing task A executing main, calling foo and bar, and forking task B, which calls bar, switches its image through execve, and then runs main, one, two and three.]
Figure 5.10: Function call tracing of an application containing two processes. The child process, after being forked, switches its execution image.
In detail, the application execution happens in the following order: i) the main process (task
A) forks a child process (task B); ii) the parent calls the foo() and bar() local functions, waits
for the child and leaves; iii) after being forked, task B calls the bar() local function and switches
its execution image (execve); iv) task B calls function_one() (one), which in turn triggers
function_two() (two); v) the child calls function_three() (three) and terminates.
[Figure 5.11: plot of Performance [GFlops/s] (0-5) vs. Time [s] (0-200), with samples colored by the currently executed high-level function: imp_gauge_force(), grsource_imp(), eo_fermion_force(), ks_congrad().]
Figure 5.11: Milc performance colored according to its function call tracing profile.

In order to further explore and demonstrate the full potential of this functionality, the performance of the milc SPEC CPU2006 benchmark was evaluated and analyzed according to its function
call trace profile. As previously observed, milc presented several distinct phases and hence it
was a preferred benchmark for this particular demonstration. Figure 5.11 depicts the performance
analysis of milc in time, and the sample colors represent the currently executed high-level function.
As it can be observed in Figure 5.11, it is possible to extract a pattern from milc execution,
where each distinct performance phase corresponds to a different high-level function. This
allows not only evaluating specific execution parts of a given application, but also detecting possible
performance bottlenecks.
Regarding SchedMon’s function call tracing functionality, the following conclusions can be made:
• the tool is able to detect other applications’ calls, without the need to change their code or
recompile them;
• it is possible to trace multi-threaded applications in a single run;
• when an execve() call is performed, the instrumentation and detection of new loaded func-
tions is possible at run-time;
• recursive function calls are also detected;
• it allows the evaluation of different parts of the execution, as well as the detection of possible
performance bottlenecks.
5.3.4 Cache-aware Roofline Model Analysis
As previously referred, SchedMon also provides an execution mode that outputs the profiling
information in order to facilitate the application analysis according to the CARM. Figure 5.12 shows
the CARM performance analysis for calculix and tonto benchmarks. Similarly to the conditions
adopted for SpyMon evaluation, both tests were performed with a sampling time interval of 50ms,
and energy status information was simultaneously obtained.
[Figure 5.12: two CARM plots, Performance [GFlops/s] vs. Operational Intensity [flops/bytes], with samples colored by high-level function: (a) Calculix (results_(), spooles(), mastruct()); (b) Tonto (make_constraint_data(), add_constraint(), make_fock_matrix()).]
Figure 5.12: Evaluation of SPEC CPU2006 benchmarks using the CARM.
In order to provide a different perspective from the already presented CARM profiles for cal-
culix and tonto (see Figures 5.3(a) and 5.3(b)), the CARM information samples are now colored
according to their function call tracing profiles. In Figure 5.12(b), it can be observed that tonto’s
high-level functions correspond to distinct CARM phases. The functions add_constraint()
and make_constraint_data() present a similar behavior and achieve higher performance, whilst
make_fock_matrix() delivers a lower performance and it is contained in the memory-bound area.
In contrast to Figure 5.3(b), additional information can be obtained by analyzing the execution
call tracing profile, which allows detecting and further optimizing the possible bottlenecks of the ap-
plication execution. On the other hand, as it can be observed in Figure 5.12(a), calculix’s high-level
functions do not demonstrate any visible patterns when plotted against the CARM.
When comparing calculix execution (see Figure 5.12(a)) against the same test performed with
SpyMon (see Figure 5.3(a)), one can notice that most of the trail disappeared, i.e., a large number
of samples is concentrated at the top. A possible explanation for this behavior can be found in the
fact that SchedMon introduces less performance interference into the monitored application. As a
result, the tested application spends most of its time in the top right area of its shape, meaning
that it is less vulnerable to possible resource contention in the memory subsystem introduced by
the tool itself.
On the other hand, when looking at tonto’s profiling information in Figure 5.12(b), one can
observe that the samples are more spread out than the ones taken with SpyMon. This might happen
because SpyMon introduces a significantly higher amount of memory operations into the profiling
(see Section 5.4), which might reduce the operational intensity of the samples. In addition, when
analyzing the top performance samples, i.e., the top rightmost dots, one can observe that a higher
performance is attained when monitoring the application with SchedMon.
Finally, Figure 5.13 provides a detailed analysis regarding the average performance of several
SPEC CPU2006 benchmarks according to the CARM. The presented results are similar to the ones observed
with SpyMon (see Figure 5.5). Minor differences might refer to the different overhead imposed
[Figure 5.13: CARM plot, Performance [GFlops/s] vs. Operational Intensity [flops/bytes], showing the average sample of each benchmark (GemsFDTD, calculix, gamess, lbm, milc, namd, povray, soplex, tonto) against the AVX, SSE and DBL MAD and (ADD,MUL) rooflines and the AVX/SSE and DBL L1 LOAD rooflines.]
Figure 5.13: Application CARM plot showing the floating-point SPEC CPU2006 benchmarks; the application color characterization was made according to the average classification (double, SSE or AVX).
by the two tools and to the fact that the tests were performed at different times, thus different
machine states may introduce slight differences in the results.
5.3.5 Power/Energy Consumption Evaluation
Figure 5.14 illustrates the obtained power consumption, in time, for both calculix and tonto
benchmarks. As already described, the energy information samples were obtained alongside the
performance samples when running in CARM mode, i.e., with a sampling time interval of 50ms.
[Figure 5.14: two power plots, Power [W] (20-30) vs. Time [s], with samples colored by high-level function: (a) Calculix (results_(), spooles(), mastruct()); (b) Tonto (make_constraint_data(), add_constraint(), make_fock_matrix()).]
Figure 5.14: Power evaluation of SPEC CPU2006 benchmarks.
In order to provide a different perspective from the one already presented, when analyzing the
power consumption profiles for calculix and tonto with SpyMon (see Figures 5.6(a) and 5.6(b)),
the samples are now colored according to their function call tracing profiles. By analyzing Figure
5.14(b), it can be observed that distinct phases of tonto’s execution in time correspond to different
functions. Thus, tonto contains clearly observable phases both in time and in the CARM. On
the other hand, Figure 5.14(a) confirms the previous CARM conclusions that calculix does not
present any visible patterns in its distinct execution parts.
In comparison to SpyMon, it can be observed that the power consumption is reduced when using
SchedMon as the monitoring tool. This can be explained by the fact that SchedMon does not create
additional tasks for monitoring, i.e., it makes use of the available system running tasks in order
to periodically read the energy status information. On the other hand, SpyMon is composed of
9 processes (the monitor plus the spies), which are actively monitoring the different LPCs at run-time,
including those that are not currently running any application.
[Figure 5.15: bar chart of average Power [W] and total Energy [kJ] for GemsFDTD, calculix, gamess, gromacs, lbm, milc, namd, soplex and tonto.]
Figure 5.15: Power and energy evaluation for different floating-point SPEC CPU2006 benchmarks.
Finally, Figure 5.15 illustrates the average power consumption and the total energy consump-
tion of several SPEC CPU2006 benchmarks. Apart from milc, for which a significantly lower
average power consumption can be observed, all the benchmarks achieve a similar average power
consumption. As a result, the energy consumption of these benchmarks relates directly to their
execution time. As it was already referred, in comparison to the results obtained with SpyMon (see
Figure 5.7), a slightly lower power consumption can be noticed when relying on the SchedMon tool.
5.4 Overhead Discussion
In order to obtain an overview of the overhead introduced by the herein proposed tools, two
distinct evaluation tests were performed for each tool. Figure 5.16 illustrates the overheads
imposed by the tools over time. As it can be observed, the tools use a timer which defines the sampling
time interval (TT ). Whenever the timer is triggered, the OS assures the execution of the tool, by
switching the currently running task with the tool. The overhead corresponding to the OS mech-
anisms triggered by the tool is depicted as TO. Finally, the tool’s execution overhead corresponds
to TH .
Figure 5.16 also shows the main scope of the two evaluation tests (EA and EB) that were per-
formed in order to analyze both tools’ overheads. In the first evaluation test (EA), each tool is set
Sample Time   1ms       2ms       5ms       10ms       25ms       50ms       100ms
PMU           1.059     2.064     5.080     10.107     25.188     50.323     100.592
              (5.89%)   (3.22%)   (1.61%)   (1.07%)    (0.75%)    (0.65%)    (0.59%)
PMU & RAPL    1.059     2.064     5.080     10.107     25.185     50.323     100.582
              (5.89%)   (3.21%)   (1.61%)   (1.07%)    (0.74%)    (0.65%)    (0.58%)
Table 5.1: Median time counts (in ms) for SpyMon self-monitoring.

Sample Time   1ms       2ms       5ms       10ms       25ms       50ms       100ms
PMU           1.004     2.009     5.025     10.051     25.130     50.261     100.523
              (0.40%)   (0.46%)   (0.50%)   (0.51%)    (0.52%)    (0.52%)    (0.52%)
PMU & RAPL    1.004     2.009     5.025     10.051     25.130     50.261     100.524
              (0.40%)   (0.46%)   (0.50%)   (0.51%)    (0.52%)    (0.52%)    (0.52%)
Table 5.2: Median time counts (in ms) for SchedMon self-monitoring.
to profile itself, i.e., it is run without the execution of any monitored task. Therefore, the obtained
PMU sampling information should correspond to the events performed by the tool, assuming that
no other task runs in the system. As a result, the obtained time should represent the maximum
overall overhead per sample. In the second evaluation test (EB), a more refined time analysis is
performed. This was achieved by instrumenting each tool with the rdtsc instruction, in order to
obtain the precise time overhead corresponding to the process of taking a single sample.
Evaluation A
The first test consists of running each of the tools individually, using the same predefined
configuration as used for the CARM analysis, but without executing any target benchmark appli-
cation. Hence, the tools should provide performance measurements that correspond to their own
execution. However, it should be taken into account that other OS tasks might introduce small
interference to the execution.
Tables 5.1 and 5.2 contain the median values of the obtained time measurements (in ms) for
SpyMon and SchedMon, respectively. The results are obtained for different sampling time inter-
vals, and for two distinct situations, namely: i) when only performance sampling information is
[Figure 5.16: timeline of repeating Timer → OS → Tool → OS → Timer phases, where TT is the timer time interval (~ sampling time interval), TO the OS time interval (mostly scheduling), TH the tool time interval (time to take a PMU sample), and EA and EB mark the scopes of Evaluation A and Evaluation B.]
Figure 5.16: Diagram illustrating the performed overhead evaluation tests.
[Figure 5.17: bar charts of instructions per sample, broken down into loads (LD), stores (ST) and other instructions (OT), vs. sampling time interval (1ms-100ms): (a) PMU; (b) PMU & RAPL.]
Figure 5.17: SpyMon’s number of instructions per sample when self-monitoring.
profiled; and ii) when performance sampling information is profiled alongside with energy/power
consumption sampling. In both situations, the time measurements are performed when taking a
PMU sample. Therefore, the obtained time values should not present any significant differences.
In addition, the percentage value corresponding to the overhead for each specific sampling time
interval is presented. This percentage is calculated according to the expression:

ovh = (time_measured − time_sample) / time_sample, (5.1)

which assumes that the sampling time interval (time_sample) is perfectly accurate.
The results presented in Tables 5.1 and 5.2 include the sampling time interval imposed by the
timer (T_T), the OS overhead of scheduling the tool in and out (T_O), and the tool's overhead
of taking a performance sample (T_H), i.e.:

T_A = T_T + 2 T_O + T_H.    (5.2)

Therefore, the overall overhead of the tool corresponds to:

T_OVH = 2 T_O + T_H.    (5.3)
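The relations in Eqs. (5.1)–(5.3) can be sketched as a short computation. The sketch below uses illustrative placeholder values, not measurements from this evaluation, and shows how the overhead percentage of Eq. (5.1) follows from the decomposition in Eqs. (5.2) and (5.3):

```python
# Sketch of the overhead model in Eqs. (5.1)-(5.3); all numeric values
# below are illustrative placeholders, not measurements from the thesis.

def overhead_fraction(time_measured_ms, time_sample_ms):
    """Eq. (5.1): relative overhead w.r.t. an ideal sampling interval."""
    return (time_measured_ms - time_sample_ms) / time_sample_ms

def measured_interval(t_t_ms, t_o_ms, t_h_ms):
    """Eq. (5.2): T_A = T_T + 2*T_O + T_H (scheduling in and out)."""
    return t_t_ms + 2 * t_o_ms + t_h_ms

def tool_overhead(t_o_ms, t_h_ms):
    """Eq. (5.3): T_OVH = 2*T_O + T_H."""
    return 2 * t_o_ms + t_h_ms

# Example: a 10 ms timer interval, 0.02 ms per scheduling event and
# 0.0014 ms (1.4 us) to take a PMU sample.
t_a = measured_interval(10.0, 0.02, 0.0014)
print(f"T_A = {t_a:.4f} ms, overhead = {overhead_fraction(t_a, 10.0):.2%}")
```

Note that, under the assumption of a perfectly accurate timer, the overhead fraction equals T_OVH / T_T.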
When comparing the results for both tools, it can be observed that SpyMon introduces signifi-
cantly higher overheads than SchedMon: SpyMon shows an overhead between 0.58% and 5.89%,
whereas SchedMon presents an overhead between 0.40% and 0.52%. Since the tools perform the
same number of instructions each time a sample is taken, the differences observed in the overheads
across samples can be attributed to OS interference (T_O).
Figures 5.17 and 5.18 show the average number of instructions performed (on a per-type ba-
sis) in the above described tests. As can be observed in Figures 5.17(a) and 5.17(b), SpyMon
introduces an overhead of about 14000 instructions per sample when monitoring performance
and an overhead of about 25000 instructions per sample when monitoring both performance
and power/energy consumption on the same LPC. On the other hand, SchedMon introduces an
5. Experimental Results
[Bar charts: instructions per sample (OT, ST, LD) versus sampling time interval (1 ms–100 ms); (a) PMU, (b) PMU & RAPL.]
Figure 5.18: SchedMon's number of instructions per sample when self-monitoring.
overhead of about 3000 instructions per sample when monitoring performance (see Figure 5.18(a))
and an overhead of about 3500 instructions per sample when monitoring both performance and
power/energy consumption (see Figure 5.18(b)).
As can be noticed, for both tools, the number of instructions per sample increases significantly
when RAPL samples are taken (in contrast to what was shown during the evaluation of the timer
overheads). This happens because, during a measured time interval (T_A), the instructions
related to both a PMU sample and a RAPL sample are counted by the performance counters. On
the other hand, the RAPL samples do not interfere with T_A itself, since they are hidden within
the timer's interval time (T_T).
Finally, it should be emphasized that the results shown above for evaluation test A do not
correspond to the overheads solely introduced by the tools, since they include the interference of
any OS tasks that ran during the experimental evaluation.
Evaluation B
For evaluation test B, both tools were run under conditions similar to those described
above for evaluation A. However, instead of presenting the obtained sampling information,
both tools were instrumented to measure only the overhead of taking a sample
(T_H).
Figure 5.19 illustrates the obtained results for both tools. Figure 5.19(a) shows the overheads
of taking a PMU sample, while Figure 5.19(b) shows the overheads of taking a RAPL sample.
As can be observed in Figure 5.19, SchedMon presents a lower overhead in both cases. The
overhead of producing a PMU sample is around 1.39 µs in SchedMon, compared to around
1.65 µs in SpyMon, while the overhead of producing a RAPL sample is around 1.25 µs for
SchedMon, compared to around 1.30 µs for SpyMon.
[Line charts: per-sample overhead (µs) versus sampling time interval (1–100 ms) for SpyMon and SchedMon; (a) PMU sample overhead, (b) RAPL sample overhead.]
Figure 5.19: Overhead of taking a PMU or a RAPL sample in both SpyMon and SchedMon tools.
5.5 Summary
This chapter presented the experimental results that illustrate the different features of the
herein presented tools. SpyMon has proven to be a good system-wide performance tool, capable
of delivering information about the performance and energy status of the whole system. It also
makes it possible to evaluate an application's performance according to the CARM. SchedMon
has also proven able not only to extract and deliver performance and power/energy con-
sumption information, but also to provide the means for a CARM analysis. Although SchedMon
targets the application rather than the whole system, it can monitor multi-threaded applications
and reconstruct the whole execution by tracing process dependencies and function calls and by
providing task scheduling information.
The results obtained with SpyMon made it possible to evaluate the interference (both in
performance and power consumption) between multiple applications running at the same time.
Moreover, SpyMon provided a complete CARM and power/energy consumption evaluation of a
set of FP SPEC CPU2006 benchmarks, grouped according to their predominant FP types.
On the other hand, SchedMon proved able to reconstruct a full multi-threaded application
execution from the scheduling point of view. SchedMon also provided a complete CARM and
power/energy consumption evaluation for a set of FP SPEC CPU2006 benchmarks. Moreover,
the performed analysis included additional insightful information about the benchmarks'
function call tracing profiles.
In terms of overheads, SchedMon has demonstrated lower overheads than SpyMon, both with and
without taking into account the OS interference. Moreover, SpyMon has been shown to introduce
a higher power consumption, which relates to the fact that it is composed of several processes that
run on different LPCs during the entire profiling, thus increasing the overall power consumption.
Despite these differences, overall, both monitoring methods allow a user/programmer to get a
clear picture of the behavior of an application and of how its execution is affected by the processor's
architectural limitations.
6 Conclusions
This thesis proposes two distinct monitoring methods that combine the advantages of the
recently proposed Cache-aware Roofline Model (CARM) [11] with accurate real-time monitoring
facilities, in a way that allows application developers to easily relate the application behavior with
the architecture characteristics, thus fostering new application optimizations. Both proposed tools
rely on the available hardware counting interfaces for obtaining the measurements, namely the
Performance Monitoring Unit (PMU), for performance, and the Running Average Power Limit
(RAPL) interface, for energy status information, and aim at providing the full underlying
functionality to the user in a simple and intuitive way.
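To illustrate the kind of processing a RAPL-based tool must perform, the sketch below converts raw energy-status counter readings into joules. It follows the conventions documented for Intel's RAPL interface (the energy unit is 0.5^ESU joules, with ESU in bits 12:8 of the MSR_RAPL_POWER_UNIT register, and the 32-bit energy-status counter wraps around); the function and variable names are illustrative, not part of SpyMon's or SchedMon's API:

```python
# Hedged sketch: converting raw RAPL energy-status readings to joules.
# Assumes the Intel RAPL conventions (energy unit = 0.5**ESU joules,
# ESU taken from bits 12:8 of MSR_RAPL_POWER_UNIT; 32-bit wrapping
# energy counter). Names are illustrative, not the tools' actual API.

COUNTER_BITS = 32  # RAPL energy-status counters are 32 bits wide

def energy_unit_joules(rapl_power_unit_msr):
    """Extract the Energy Status Unit (ESU) and return joules per tick."""
    esu = (rapl_power_unit_msr >> 8) & 0x1F
    return 0.5 ** esu

def energy_delta_joules(prev_raw, curr_raw, unit_j):
    """Energy consumed between two raw readings, handling wraparound."""
    delta = (curr_raw - prev_raw) % (1 << COUNTER_BITS)
    return delta * unit_j

# Example with a typical ESU of 16 (unit = 0.5**16 J, about 15.3 uJ);
# the register value 0x1000 is a hypothetical reading.
unit = energy_unit_joules(0x1000)
joules = energy_delta_joules(0xFFFFFF00, 0x00000100, unit)  # counter wrapped
```

The wraparound handling matters in practice: at low sampling rates the 32-bit counter can overflow between two consecutive samples.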
The first tool, SpyMon, targets system-wide monitoring and is mostly implemented in user
space, which increases its portability and independence from the OS. It launches and
pins a distinct process to each CPU core that is to be monitored, and invokes an
additional main process, which controls the whole tool's execution and performs the energy status
profiling. SpyMon's implementation provides a fully configurable performance environment
and incorporates the ability to monitor the system's power consumption in a single tool, which is
exposed through a simple-to-use command-line interface. Moreover, it is possible to perform a
predefined performance analysis using the CARM, which eases the tool's configuration and
provides additional useful performance information.
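The bound underlying such a roofline-style analysis can be sketched as follows. The sketch shows the original roofline formula (after Williams et al. [19]); the CARM [11] refines it by measuring memory traffic as seen from the core (i.e., including the cache hierarchy), and the machine constants below are illustrative placeholders:

```python
# Simplified roofline-style performance bound (after Williams et al. [19]):
# attainable performance is limited either by peak compute or by memory
# bandwidth scaled by arithmetic intensity. The CARM [11] refines this by
# counting memory traffic from the core's perspective; the constants used
# here are illustrative placeholders, not measured machine parameters.

def attainable_gflops(arithmetic_intensity, peak_gflops, peak_gbps):
    """Roofline bound: min(peak compute, AI * memory bandwidth)."""
    return min(peak_gflops, arithmetic_intensity * peak_gbps)

# Example: a hypothetical machine with 100 GFLOP/s peak compute and
# 20 GB/s bandwidth; the ridge point sits at AI = 100/20 = 5 flops/byte.
for ai in (0.5, 5.0, 50.0):
    print(f"AI = {ai:5.1f} flops/byte -> {attainable_gflops(ai, 100.0, 20.0)} GFLOP/s")
```

Applications with an arithmetic intensity below the ridge point are memory-bound; above it, they are compute-bound.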
The second tool, SchedMon, targets application profiling and is mostly implemented in kernel
space. It makes use of the OS scheduling events to keep track of the monitored
application and to reduce the interference of other running tasks. In addition to performance
and power monitoring, SchedMon can trace the complete execution of multi-threaded applica-
tions (process dependencies, function calls and task scheduling information), which makes it
possible to reconstruct the complete application execution. It also provides the ability to per-
form a predefined performance evaluation based on the CARM. All the functionality is available
to the user through a simple and intuitive command-line interface. Moreover, for detailed run-time
performance and/or power evaluation, it is also possible to interact directly with the tool's
underlying mechanisms by using the provided user-space library.
The performed experimental tests demonstrate the capabilities of the proposed tools to deliver
detailed information about how running applications perform on top of the underlying architectural
resources. A full CARM and power/energy consumption analysis was performed for several SPEC
CPU2006 benchmarks, which provided insightful information about each application's ability to
exploit the full attainable performance of the underlying resources. SpyMon has also proven able
to obtain detailed system-wide information, which made it possible to observe how different
applications interfere with each other (in both the performance and power consumption domains)
when running simultaneously on a multi-core architecture. On the other hand, SchedMon succeeded
in detecting and monitoring the execution of multi-threaded applications, thus being able to
reconstruct the whole application execution by tracing its process dependencies, function calls
and task scheduling information. SchedMon also provided a complete CARM and power/energy
consumption evaluation for a set of FP SPEC CPU2006 benchmarks. In addition, the performed
analysis includes insightful information about the benchmarks' function call tracing profiles. In
terms of introduced overheads, both tools have shown low interference with the monitored
applications: the overhead of producing a PMU sample is around 1.39 µs in SchedMon, compared
to around 1.65 µs in SpyMon, while the overhead of producing a RAPL sample is around 1.25 µs
for SchedMon, compared to around 1.30 µs for SpyMon, which makes SchedMon the tool with the
lowest overheads.
6.1 Future Work
Although the herein presented tools demonstrate great potential, they can be improved in several
aspects:
• Support for different architectures - One major improvement to the tools could be
achieved by extending their support for different architectures. At the moment, mainly
recent Intel micro-architectures are supported, but no major difficulties are expected when
porting the tools' functionality to micro-architectures from other vendors, such as AMD
or ARM.
• Reducing memory operations - Although the introduced overhead has been shown to be
low, most of it stems from memory operations and, therefore, both tools can be further
optimized to reduce the number of memory accesses. A first approach could be to compress
the output data format, thus reducing the size of the transferred information.
• Extend to different interfaces - So far, the tools provide the facilities to access two distinct
hardware interfaces, the PMU and RAPL. However, recent versions of other system components,
such as GPUs, provide similar interfaces that could be added to the monitoring tools.
• Improve call tracing - The function call tracing functionality implemented in SchedMon
can be improved to provide greater control over the target applications, for example, by
reducing the tracing interference through the detection of recursive function calls or of
functions called from inside loops.
In conclusion, as performance and power consumption optimizations become a greater
concern, so grows the need for powerful tools that can, in simple ways, integrate multiple
functionalities to extract meaningful information about applications and architectural
infrastructures. Despite the complexity of the provided hardware interfaces and of the system
mechanisms that expose the required information, we successfully implemented two different
methods for obtaining, in a simple and fully configurable way, the necessary information for a
full performance and power analysis, from both the application and the system perspectives.
Overall, both tools achieve the initial objectives while still leaving room for future improvements.
Bibliography
[1] Perf Wiki tutorial on perf. https://perf.wiki.kernel.org/index.php/Tutorial. Accessed:
2013-06-25.
[2] Perfmon2 sourceforge project page. http://perfmon2.sourceforge.net/. Accessed: 2013-
06-20.
[3] Antao, D., Taniça, L., Ilic, A., Pratas, F., Tomás, P., and Sousa, L. (2013). Monitoring perfor-
mance and power for application characterization with cache-aware roofline model. International
Conference on Parallel Processing and Applied Mathematics, page 14.
[4] Browne, S., Dongarra, J., Garner, N., Ho, G., and Mucci, P. (2000). A portable program-
ming interface for performance evaluation on modern processors. International Journal of High
Performance Computing Applications, 14(3):189–204.
[5] Cohen, W. E. (2004). Tuning programs with OProfile. Wide Open Magazine, 1:53–62.
[6] Corbet, J., Rubini, A., and Kroah-Hartman, G. (2005). Linux Device Drivers. O'Reilly Media,
Inc.
[7] Demme, J. and Sethumadhavan, S. (2011). Rapid identification of architectural bottlenecks
via precise event counting. In ACM SIGARCH Computer Architecture News, volume 39, pages
353–364. ACM.
[8] Donnell, J. (2004). Java performance profiling using the VTune Performance Analyzer.
[9] Fog, A. (2014). Software optimization resources. http://www.agner.org. Accessed: 2014-02-
10.
[10] Henning, J. L. (2006). SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer
Architecture News, 34(4):1–17.
[11] Ilic, A., Pratas, F., and Sousa, L. (2013). Cache-aware roofline model: Upgrading the loft.
Computer Architecture Letters, PP(99).
[12] Intel (2013). Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3:
System Programming Guide.
[13] Jarp, S., Jurga, R., and Nowak, A. (2008). Perfmon2: A leap forward in performance moni-
toring. In Journal of Physics: Conference Series, volume 119, page 042017. IOP Publishing.
[14] Kuan, L., Tomas, P., and Sousa, L. (2013). A comparison of computing architectures and
parallelization frameworks based on a two-dimensional FDTD. In International Conference on
High Performance Computing and Simulation (HPCS), pages 339–346. IEEE.
[15] Pettersson, M. (2009). Perfctr: Linux performance monitoring counters driver. Retrieved Dec.
[16] Treibig, J., Hager, G., and Wellein, G. (2010). LIKWID: A lightweight performance-oriented
tool suite for x86 multicore environments. In International Conference on Parallel Processing
Workshops (ICPPW), pages 207–216. IEEE.
[17] Weaver, V. M. (2013). Linux perf_event features and overhead. In International Workshop
on Performance Analysis of Workload Optimized Systems (FastPath), page 80.
[18] Weaver, V. M., Johnson, M., Kasichayanula, K., Ralph, J., Luszczek, P., Terpstra, D., and
Moore, S. (2012). Measuring energy and power with PAPI. In International Conference on
Parallel Processing Workshops (ICPPW), pages 262–268. IEEE.
[19] Williams, S., Waterman, A., and Patterson, D. (2009). Roofline: an insightful visual perfor-
mance model for multicore architectures. Communications of the ACM, 52(4):65–76.