POLITECNICO DI MILANO
Master in Computer Science and Engineering
Dipartimento di Elettronica, Informazione e Bioingegneria
Exploring the end-to-end compression
to optimize the power-performance
tradeoff in NoC-based multicores
Supervisor: Prof. William FORNACIARI
Assistant Supervisor: Dr. Davide Zoni
Master Thesis of:
Fabio PANCOT
Student n. 799521
Academic Year 2015-2016
Estratto in lingua italiana
The continuous evolution of the technology market, both in terms of applications and devices, has posed new challenges to hardware manufacturers, which must meet ever-growing demands for low power consumption and better performance. The Network-on-Chip (NoC) has established itself as the reference on-chip interconnect, capable of achieving better scalability and flexibility. Its impact on the performance and energy consumption of the chip, however, cannot be overlooked, and this drives the search for new optimization strategies. The literature reports several methodologies to investigate this tradeoff, ranging from changes to the router architecture to prioritized packet scheduling and the use of optimized topologies. In this scenario, data compression is a practical solution to reduce traffic injection, thus lowering the required channel bandwidth and the energy profile while increasing system performance. Compression mechanisms are divided into two categories: cache compression and end-to-end (E2E) compression. The former aims to virtually increase the capacity of the cache banks by fitting multiple compressed lines into a single physical line, while the latter compresses cache blocks before they are packetized and injected into the network. Previous analyses have also shown that the majority of flits in the network carry data transfers. Starting from these observations, we propose an analysis of the impact that E2E compression has on the NoC. This thesis investigates the power-performance tradeoff offered by E2E compression on the NoC, with particular emphasis on the power and performance overheads, as well as on the achievable compression ratio compared with the optimal ratio obtained by an oracle model. We selected this type of scheme because of its transparency with respect to the NoC, which gives more flexibility in the design of the system, and because of its contained overheads. Previous works on data compression focus only on energy savings and reduced packet latencies, without considering the introduced overheads and, moreover, neglecting the impact on system performance. In this thesis we present a qualitative analysis of the critical aspects of E2E compression, aimed at improving performance and energy consumption, and a quantitative analysis that explores the energy and latency overheads introduced by compression. We also discuss the impossibility of improving system performance by exploiting E2E compression and, finally, we analyze the benefits deriving from the use of multiple compression algorithms executed in parallel, comparing them with single compression algorithms. The compression block used for the analyses has been integrated into the GEM5 architectural simulator and compared with a reference architecture that does not integrate the compression mechanism. Results have been collected using a 16-core architecture running a subset of the SpecCPU2006 benchmark suite. Results show that the analyzed E2E compression can save up to 26% of energy while keeping performance practically unaltered. The parallel version additionally achieves a 5.7% improvement in compression ratio and increases the compression stability over time by 5% on average, compared with single compression mechanisms.
Abstract
The continuous evolution of market applications and technological devices has posed new challenges to hardware manufacturers, which must meet ever-increasing low-power and performance requirements. The Network-on-Chip (NoC) has emerged as the de-facto on-chip interconnect thanks to its scalability and flexibility. However, its non-negligible impact on the performance and power consumption of the entire chip calls for dedicated optimization strategies to cope with the market requirements. The literature reports several optimization methodologies to cope with such a tradeoff, ranging from router architecture changes to priority-based packet scheduling techniques and optimized NoC topologies. In this scenario, data compression techniques represent a viable solution to reduce the injected traffic, thus reducing the required bandwidth and the energy profile while increasing system-wide performance. Compression techniques fall into two broad categories: cache compression and end-to-end (E2E) compression. The former aims to virtually increase the cache bank capacity by fitting more compressed cache lines into the same cache bank, while the latter compresses cache blocks before their injection into the on-chip interconnect. Previous analyses on different workloads highlighted that the majority of the network traffic is due to data transfers. Starting from these observations, we propose an analysis of the impact that an E2E compression scheme has on the NoC. This thesis investigates the power-performance tradeoff offered by E2E compression in NoCs, with particular emphasis on the power and performance overheads as well as the achievable compression ratio with respect to the optimal compression ratio of a golden model. We selected this type of scheme because of its transparency with respect to the rest of the NoC, which gives more flexibility in terms of system design and Intellectual Property (IP) embedding, and because of its very small area and power overheads. Previous works on data compression focus only on power savings and reduced packet latency, without taking into account the introduced overheads or describing how the performance of the entire system is affected. We present a qualitative analysis of the most critical aspects of E2E compression, in order to improve performance and power consumption, and a quantitative analysis that explores the energy and latency overheads introduced by such a compression scheme. We also discuss why E2E compression cannot improve system performance and, finally, analyze the benefits deriving from running multiple compression mechanisms in parallel, compared to single-compression-algorithm schemes. The compression methodology has been integrated in the GEM5 full-system simulator and compared to a baseline NoC architecture that does not employ data compression. Results have been collected on a 16-core architecture running a subset of the SpecCPU2006 benchmark suite. Results show that E2E compression achieves 26% energy savings with almost the same performance as the baseline NoC. Its parallel version increases the traffic compressibility by 5.7% and increases its stability over time by 5% on average, compared to single-compression-algorithm schemes.
Acknowledgments
To my parents, Alcide and Marina, who supported me throughout this journey, patiently listening to all my insecurities and grumblings, giving me motivation in the darkest hours. Without them, none of the following pages could exist.
To Lorenzo, who, albeit far away in Berlin, supported my work and helped me smile. Thank you for always being there.
To Luca, Mauro, Andrea and Luca, without whom working on this thesis would have been much more boring.
To Claudio and Laura, for being good friends, and for the laughs we had, and will have, working together at Bar Bianco.
To Ottilia, who helped me relieve stress with her laughter and good company.
To Davide, who opened the world of research to me, and gave hard, but wise, teachings.
To professor W. Fornaciari and the rest of the HIPEAC Research Group, in which I've found a friendly group of people during this period.
Contents
1 Introduction
  1.1 Goals and Contributions
  1.2 Thesis Structure
2 State of the Art
  2.1 Cache Compression
  2.2 End-to-End Compression
3 Evaluation Methodology
  3.1 Architectural Description
    3.1.1 Compression Block
    3.1.2 Output Selection Policy
    3.1.3 Decompression Block
    3.1.4 Simulation Model
  3.2 Compression Techniques
    3.2.1 Delta Compression
    3.2.2 Zero Compression
    3.2.3 Specialized Output Selection Policy
  3.3 Methodology Evaluation Engine
    3.3.1 Power Consumptions Evaluation
    3.3.2 Performance Evaluation
    3.3.3 Compression Determinism Evaluation
4 Results
  4.1 Simulation Setup
  4.2 Energy Analysis
    4.2.1 Overheads Analysis
    4.2.2 Compression Policy Analysis
  4.3 Performance Analysis
  4.4 Compression Ratio Analysis
5 Conclusions and Future Works
  5.1 Future Works
Bibliography
List of Figures
1.1 Dot Per Inch evolution for smartphone devices over the years.
1.2 Packet type distribution on different workloads from the SpecCPU2006 benchmark suite on a 4x4 2D-mesh with a channel width of 64 bit and a cache line size of 64 B.
1.3 Flit type distribution on different workloads from the SpecCPU2006 benchmark suite considering a 4x4 2D-mesh with a channel width of 64 bit and a cache line of 64 B. Data packets are 9 flits long and control packets are single-flit packets.
1.4 Energy savings due to the introduction of the compression technique on different workloads from the SpecCPU2006 benchmark suite.
3.1 Graphic representation of the baseline NIC, its positioning inside a NoC and its internal blocks.
3.2 Graphic representation of the NIC with compression and decompression modules in place.
3.3 Internal composition of the compression block.
3.4 Detail of the output selection and filtering performed by the multiplexer. It filters out N bits from a line width of B bits, assuming N≤B.
3.5 Internal composition of the decompression block.
3.6 First execution phase of Delta compression.
3.7 Second execution phase of Delta compression.
3.8 Zero compression mechanism.
3.9 Block diagram of the Methodology Evaluation Engine workflow.
4.1 Energy consumption results on the three considered compression-enabled NoC architectures, normalized to the baseline architecture.
4.2 Total injected flits for each compression-enabled architecture, obtained from simulations on the SpecCPU2006 benchmark subsuite.
4.3 Compression ratio for each compression algorithm, obtained from simulations on the SpecCPU2006 benchmark subsuite.
4.4 Average hops/flit for the baseline NoC architecture.
4.5 Simulation execution time results on the three compression-enabled NoC architectures, normalized to the baseline architecture.
4.6 VNETs comparison results on the parallel-enabled architecture, normalized to the baseline architecture.
4.7 Total latency comparison on the three considered compression-enabled NoC architectures, normalized to the baseline architecture.
4.8 Execution time comparison between the three baseline architectures with different router pipeline latencies.
4.9 Packet distribution against the compression module used in the parallel-enabled architecture.
4.10 Standard deviation of the compression ratio of the three considered compression-enabled NoC architectures.
List of Tables
1.1 Screen characteristics for different market products, and their frame buffer bandwidth.
4.1 Experimental Setup: architecture specifications and traffic workloads used in simulations.
4.2 Experimental Setup: compression techniques specifications.
Acronyms
NoC = Network-on-Chip
QoS = Quality of Service
IP = Intellectual Property
SoC = System on Chip
CMP = Chip Multi-Processor
NIC = Network Interface Controller
VNET = Virtual Network
PE = Processing Element
CC = Compression Controller
DC = Decompression Controller
Chapter 1
Introduction
In the last decade, the market has imposed a tremendous increase in the performance requirements and low-power constraints of computing architectures, ranging from High Performance Computing (HPC) to Embedded Systems. The traditional bus-based multi-core design cannot match such requirements anymore, mainly due to the limited performance scalability of the on-chip interconnect [8]. In this scenario, the Network-on-Chip (NoC) [21] emerged as the de-facto on-chip interconnect solution for scalable multi- and many-core architectures. Thanks to its efficient wiring usage and the ability to multiplex more traffic flows on the same channels, the NoC eases the matching of the imposed Quality of Service (QoS) requirements. Moreover, it offers a scalable and flexible interconnection fabric that can be customized depending on the specific application.
Differently from a bus-based interconnection, where all the components share the same physical channel to communicate, the NoC offers a distributed approach. It consists of several nodes connected to each other through bidirectional links. Each node contains a simple core, an L1 or L2 cache bank, and a Network Interface Controller (NIC). To communicate with the rest of the system, each node is coupled with a router that routes packets arriving both from the NIC and from other nodes. The NIC instead provides the communication between its node and the rest of the system: it offers an interface capable of elaborating messages into packets and of reconstructing arriving packets into messages destined to its core. The way nodes are organized is defined by the topology that the network employs. Several network topologies exist: torus, trees, bidirectional trees, 2D mesh, 3D mesh and so on. The most popular topology in NoCs is the 2D mesh, a quadrilateral matrix-style topology in which each node is connected through its router to its neighbors in four directions: north, south, east and west.
However, NoCs provide worse end-to-end latencies compared to on-chip bus solutions. Moreover, coherence protocols cannot rely on a total message order imposed by the interconnect, since the NoC offers different paths between each source/destination pair. All in all, the emergence of Dark Silicon [10] imposes the introduction of several system-wide architectural methodologies to optimally exploit the available power budget, thus forcing an accurate power optimization of the NoC design as well. Last, the continuous increase in bandwidth requirements has imposed the introduction of bandwidth-aware optimization strategies on the NoC, even if it is traditionally considered a high-bandwidth interconnect.
Figure 1.1: Dot Per Inch evolution for smartphone devices over the years.
For example, the mobile revolution has enabled high screen resolutions on portable devices that, nowadays, are expected to deliver the same user experience as desktop and server architectures. Figure 1.1 shows how smartphone screen characteristics have evolved during the last decade, highlighting that in the last two years this trend has accelerated with respect to the rest of the observation window. In such a scenario, in which dedicated processors and high-resolution components are involved, a huge amount of data has to be computed and transmitted back and forth between processing elements and I/O devices, as fast as possible, to maintain an acceptable QoS.
Product      Screen resolution   Refresh rate   Bandwidth (Gbps)
Nexus 5X     1080x1920           60Hz           2.9
Samsung S7   1440x2560           60Hz           5.3
iPhone 7     1334x750            60Hz           1.5
OnePlus 3    1080x1920           63Hz           3.1
Table 1.1: Screen characteristics for different market products, and their frame buffer bandwidth.
For example, even the simple task of refreshing a smartphone's screen has become a data-intensive task that requires the movement of a great amount of traffic. The values in table 1.1, which reports the bandwidth requirements of different market products, give a good insight into how critical the role of the interconnect is in managing such an amount of traffic per second.
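As a sanity check on the figures in Table 1.1, the raw frame-buffer bandwidth can be estimated as width x height x bits-per-pixel x refresh rate. A minimal sketch, assuming 24-bit color (the color depth is our assumption, not stated in the table):

def framebuffer_bandwidth_gbps(width, height, refresh_hz, bits_per_pixel=24):
    """Raw frame-buffer bandwidth in Gbit/s: every pixel of every frame
    must be moved across the interconnect once per refresh."""
    return width * height * bits_per_pixel * refresh_hz / 1e9

# Nexus 5X row of Table 1.1: 1080x1920 @ 60 Hz
print(framebuffer_bandwidth_gbps(1080, 1920, 60))  # ~2.99, matching the ~2.9 Gbps in the table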
In this scenario, data compression techniques deliver a viable solution to limit the bandwidth requirements, thus increasing the energy efficiency and performance of the device. Compression techniques are organized in two broad categories. Cache Compression [1, 2, 9, 14] acts at the cache level, by storing the cache lines in a compressed fashion, with a net increase in the effective cache capacity.
On the other hand, end-to-end (E2E) compression [23, 22, 1] aims to compress (decompress) data at the injection (ejection) point of the interconnect. This implementation is transparent to the SoC design and aims at reducing the amount of data transferred through the SoC. It is also completely orthogonal to any other possible optimization.
In this work, E2E compression has been investigated as a viable means to reduce the traffic on the NoC without affecting the IP design, since it is transparent to the rest of the SoC. The on-chip network allows the transmission of messages between the CPUs and the memory system by transforming each message into a control or a data packet. Control packets are used to keep the coherence within the SoC, while data packets are used for data transfers. Each packet is further split into multiple flits, which are the atomic transmission units in the NoC. A packet has a single head flit that carries routing information as well as the request/response command issued by the sender. Multiple body flits may follow, carrying the data associated with the message. Last, a single tail flit closes the packet. The tail flit is a special body flit that signals the end of the packet to the network resources. To this extent, only data packets can be compressed, since control packets are short and usually a single flit long.
Figure 1.2: Packet type distribution on different workloads from the SpecCPU2006
benchmark suite on a 4x4 2D-mesh with a channel width of 64bit and a cache line
size of 64B.
Figure 1.2 details the distribution of data packets with respect to the overall network packets on the SpecCPU2006 benchmark suite, showing that data packets cover only 1/3 of the total. Although data packets represent 1/3 of the total packets, their flit count holds the majority of the traffic that flows into the NoC, as shown in Figure 1.3. This highlights that the traffic load is dominated by data packet flits. In particular, the number of flits that compose a data packet depends on the channel width and the cache line size, as the sketch below illustrates.
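A minimal sketch of this computation, reproducing the 9-flit data packets of Figure 1.3 (the assumption that the head flit carries no payload is ours):

import math

def data_packet_flits(cache_line_bytes, channel_width_bits):
    """Flits per data packet: one head flit carrying routing and command
    information, plus enough body flits to move the whole cache line
    (the last body flit also acts as the tail flit)."""
    body_flits = math.ceil(cache_line_bytes * 8 / channel_width_bits)
    return 1 + body_flits

# 64 B cache line over a 64-bit channel: 1 head + 8 body = 9 flits,
# as in the caption of Figure 1.3; a control packet is a single flit.
print(data_packet_flits(64, 64))  # 9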
Figure 1.3: Flit type distribution on different workloads from the SpecCPU2006 benchmark suite considering a 4x4 2D-mesh with a channel width of 64 bit and a cache line of 64 B. Data packets are 9 flits long and control packets are single-flit packets.
Figure 1.4: Energy savings due to the introduction of the compression technique on different workloads from the SpecCPU2006 benchmark suite.
To further strengthen the energy saving opportunities offered by the E2E compression techniques, Figure 1.4 reports the energy savings for 11 SpecCPU2006 benchmarks when each data packet is shortened to a single-flit packet, i.e., using a golden compression model that always perfectly compresses each data packet. The golden compression model shows an average 60% energy reduction with respect to the baseline, non-compressed NoC traffic.
1.1 Goals and Contributions
This thesis aims to provide an analysis of the impact of E2E compression on the NoC in terms of performance, power consumption and the overheads introduced by its utilization. Architectural modifications to the system will also be presented, as well as a parallel compression scheme that exploits the parallel execution of multiple compression modules to obtain better results. Our contributions account for both a qualitative and a quantitative description of the implications of introducing the E2E compression technique on the NoC.
Qualitatively, we will give the following contributions:
• we analyze the impact of E2E compression on energy consumption, describing the effectiveness of such an optimization as a power saver;
• we investigate the usefulness of a policy-based E2E compression approach, which selects a subset of packets eligible for compression, against an approach that compresses every data packet;
• we analyze the impact of E2E compression on performance, explaining why compression is not an optimization technique capable of boosting the system's speedup;
• we present a parallel compression model capable of executing multiple compression algorithms in parallel, describing how such an implementation can give better results in terms of compression ratio and compression stability, compared to the baseline architecture.
We will also quantitatively describe the impact of compression in terms of overheads:
• we analyze the energy overhead due to the utilization of the compression/decompression blocks against the benefits derived from compressing packets, discussing the presence of a break-even point;
• we discuss the increased latency experienced by data packets due to the extra cycles introduced to perform compression on data.
1.2 Thesis Structure
The rest of the thesis is structured in 4 chapters. Chapter 2 describes the state of the art in compression techniques, both cache and end-to-end, already researched in NoCs. Chapter 3 provides a detailed description of the methodology used for the analysis and of the architectural modifications made to the system. Chapter 4 presents the results obtained with a real-application benchmark suite. Finally, Chapter 5 draws the conclusions of this work.
Chapter 2
State of the Art
Data compression is a fairly new technique introduced in the Chip Multiprocessor (CMP) environment to exploit the tradeoff between power consumption and performance by lowering the amount of traffic and, consequently, the dynamic power consumption due to reduced switching activity in routers. This chapter reviews the state of the art on data compression in NoCs as a way to improve overall performance while containing energy dissipation. In particular, some of these methodologies focus on expanding the available memory capacity by applying compression on data to store more smaller-size data into the same cache line, while others apply compression between source and destination by compressing packets before injection and decompressing them on receipt, working at the NIC level.
These techniques are usually referred to as cache compression and compression at the NIC (also known as end-to-end compression), and are further characterized by the type of compression algorithm employed.
The common key concept behind the development of such techniques is the redundancy in data packets: Ekman et al. [9] show how traffic patterns, such as long runs of 0's and 1's, occur very frequently in different workloads, highlighting the possibility of reducing this redundancy by applying compression on data.
Pekhimenko et al. [14] give more details on how redundancies can appear in data blocks, and show that these occurrences are widely present in many applications and workloads. They show that, in addition to zero/one runs, redundancy can be found in two other types of patterns: repeated values, defined as a single, small value repeated multiple times in a contiguous region of memory, and narrow values, small data types stored in memory using a larger data type than required, leading to a potential waste of memory space. They also note that all the above patterns fall under the notion of low dynamic range: a set of values whose differences are much smaller than the values themselves.
Two major types of compression algorithms are used. The first, exploiting the aforementioned concept of redundancy in data, takes a part of the data itself, known as the base, and computes a set of differences from the other portions of the data. This approach depends on the content of the data itself, and is less deterministic in the achievable compression ratio. The second type relies on the comparison of the data against a well-defined set of patterns, which includes repeated symbols, runs of 0's or 1's, or frequently observed patterns, and substitutes the data with a smaller code that represents the type of pattern detected. This last type is less data dependent and thus more deterministic, but at the same time it can achieve a lower compression ratio compared with the former.
Starting from these motivations, different methodologies have been developed to address the need for more memory bandwidth and to avoid increased packet latency and high dynamic power consumption on NoCs, by exploiting the data-intrinsic characteristics described above.
2.1 Cache Compression
Cache Compression is a type of compression scheme that aims to increase the
effective cache capacity by fitting more compressed cache lines into a single
line.
Alameldeen et al. [1] describe how good results can be achieved by applying the Frequent Pattern Compression (FPC) technique on L2 caches. FPC compresses and decompresses data on a cache line basis by comparing the actual data against a fixed pattern table, derived from previous analyses of different workloads.
Cache lines are split into 32-bit words, and each of them can be encoded
with a 3-bit prefix plus remaining data. If a word matches one entry of the
Frequent Pattern table, then it is encoded and stored in cache; if no matches
are encountered in this phase, the word is then stored as it is.
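A minimal sketch of this word-level matching step. The pattern table below is an illustrative subset, not the exact published one, and the prefix values are our own assumptions:

def to_signed(word):
    """Interpret a 32-bit word as a signed integer."""
    return word - (1 << 32) if word & (1 << 31) else word

def fpc_encode_word(word):
    """Encode one 32-bit word as (3-bit prefix, payload bytes) when it
    matches a frequent pattern, or store it uncompressed otherwise."""
    if word == 0:
        return 0b000, b""                               # all-zero word
    if -128 <= to_signed(word) <= 127:
        return 0b001, bytes([word & 0xFF])              # sign-extended byte
    if word >> 16 == 0:
        return 0b010, word.to_bytes(4, "little")[:2]    # halfword, upper half zero
    return 0b111, word.to_bytes(4, "little")            # no match: stored as-is

print(fpc_encode_word(0x00000005))   # (1, b'\x05'): 4 bytes shrink to prefix + 1 byte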
In order to effectively increase the potential cache size, a cache line must be able to pack more compressed cache lines than uncompressed ones into the same space. Potentially, one cache line can be compressed to any number of bits: this leads to the necessity of maintaining a cache organization that does not break data accessibility and that keeps compressed and uncompressed lines well padded to avoid misalignment. To address this, the authors introduced a modification of their scheme, called Segmented Frequent Pattern Compression, that increases the L2 granularity to 8-byte segments (cache lines are seen as groups of segments), eventually padded with zeroes to match a multiple of a segment, adding overhead in terms of area occupation. As this methodology keeps data in L1 caches in an uncompressed form and in a compressed form when stored into L2 cache banks, this overhead remains negligible compared with the introduced optimizations, especially when weighed against the achievable compression ratio, up to 52%.
Das et al. [7] show how Frequent Pattern Compression can be practically applied to L2 caches. They highlight some management aspects to be taken into account when introducing FPC in cache design, and in particular they describe how to avoid data access issues and misalignment due to the variable-length compression scheme of FPC. More precisely, the authors define a compression phase, called Compaction, that takes place under two circumstances: on a write hit, when the data to be written is bigger than the original compressed line, and on a replacement, when the line chosen to be replaced does not have the same size as the new line that will replace it. Compaction avoids the generation of holes inside a cache line, compacting them into contiguous space and padding data with 0's to maintain alignment at the segment level. Because of this possible line fragmentation, the authors also introduced the concept of invariants to check the correctness of the data inside every compressed line. All these modifications come with an overhead in terms of area and power consumption of 0.183 mm2 and 0.287 W, respectively, with latency reductions of 21% on average, a dynamic power dissipation reduction of 7% on average and a CPI reduction of about 7%, compared with a baseline architecture without compression.
Differently from previous works, Pekhimenko et al. [14] exploited another compression mechanism, called Base Delta Immediate Compression, as a fast, simple and effective alternative for implementing Cache Compression. Since high-level data structures (such as C structs or arrays) are commonly used to represent large pieces of data and because of the nature of the computation itself, a large amount of data patterns can be stored in less than their actual size; the authors exploited this behavior to apply Base Delta Compression with the goal of diminishing both network latency and power consumption in a lightweight fashion.
In fact, Base Delta compression relies on a modest hardware module, composed of simple adders and a mask in both the compressor and decompressor, because the methodology needs only the actual cache line to start compression: the idea is to represent the initial cache line as a common base value and a sequence of deltas, defined as the difference between the base and the portion of the line being compressed. The algorithm views a cache line as a set of fixed-size values, multiples of 8 bytes. First of all, a base has to be defined: for the Base Delta case, the authors adopted a simple base selection policy in which the first 8 bytes of a line are taken as the base; after this step, B + ∆ computes the deltas with respect to the previously selected base, and it compresses the cache line if, and only if, the combined size of the base and the deltas is strictly smaller than the cache line itself.
In order to gain better results in terms of compressibility, the authors added a second base to the baseline compression scheme, to also account for immediate values such as zero runs, repeated values and narrow values, calling the scheme Base∆Immediate Cache Compression. This addition does not impact the overhead in terms of complexity and area, but increases the achievable compressibility. This implementation, compared with Zero Content Augmented Cache Compression, Frequent Value Compression and Frequent Pattern Compression, is a simple yet efficient technique that provides a high effective cache capacity increase and a system performance improvement; results show that B∆I improves system performance by about 8% for single-core workloads, and by 9.5% and 11.2% for two-core and four-core workloads respectively, compared with the three aforementioned algorithms.
Villa et al. [20] have instead proposed a cache compression mechanism called Dynamic Zero Compression (DZC), which exploits the prevalence of zero bytes in SRAM cache banks with the goal of reducing energy consumption. In order to support the DZC mechanism, each cache byte is coupled with a Zero Indicator Bit (ZIB) that indicates whether its associated byte contains all zero bits. On a cache write, the mechanism checks if all eight bits of the byte are zero, and if so it writes only the ZIB. If the byte is not zero, it is written to the cache normally. On a read access, DZC employs local byte word gating circuitry, controlled by the ZIB, to avoid swinging the bitlines; it is turned on only if the ZIB is not set, and otherwise remains off to save energy. Modifications are also made to the bus drivers connecting sub-banks to the CPU, in order to avoid driving the I/O busses for zero bytes. These introductions have their cost: the area overhead is approximately 9% over the baseline architecture, mostly due to the introduction of the ZIBs into the banks. In terms of energy breakdown, the DZC cache adds 5 pJ and 6.7 pJ of energy consumption for a read and a write on a zero byte, respectively. For a non-zero byte, those figures increase to 11.4 pJ for a read and 26.9 pJ for a write. DZC also introduces a timing delay, due to the added circuitry, corresponding to a total of 2 FO4 gate delays for a read access. Simulation results on the Mediabench benchmark suite show that Dynamic Zero Compression achieves a 26% energy reduction on data cache accesses and a 10% reduction on instruction cache accesses.
2.2 End-to-End Compression
End-to-End Compression is another way to employ data compression in NoCs to achieve low network latency and power consumption. Even if Cache Compression brings benefits in terms of lower power consumption and increased effective cache capacity, it adds design complexity and hardware modifications to the cache banks; researchers have found that moving the compression/decompression modules to the Network Interface Controller can also achieve significant results, this time with the objective of compressing data only when it has to be transmitted over the network and decompressing it when ejected. This approach avoids hardware modifications to the caches and follows a plug-and-play behaviour that makes the solution flexible and scalable. It has to be noted that, in order to work well, each tile must have at least one pair of compressor/decompressor modules, which can become a drawback if the complexity of the modules grows.
Zhu et al. [23] applied this end-to-end paradigm by exploiting a compression scheme based on the earlier-described Frequent Pattern Compression, called Frequent Value Compression (FVC). In their work, the authors rearranged FPC in such a way that compression completes prior to injecting traffic into the network, and decompression starts as the header flit arrives at the NIC, without modifying any other existing module. The FVC mechanism relies on the observation that many traffic patterns appear very frequently during execution, and that each workload has its own set of frequent values. By combining these two aspects, the authors developed an adaptive table-based compression scheme that adapts itself to the workload, leading to a more flexible compression scheme that can achieve better results compared to fixed table-based ones. To ensure that compression and decompression can be accomplished without loss of data, every pair of source and destination tiles has to maintain a so-called Frequent Value Table, synchronized on both sides: before injecting a data message, it is matched against the FV Table and, if there is a hit, the message is replaced by the index of the corresponding entry in the FV table. On the receiving side, if the corresponding compression tag is set, the compressed message is looked up in the receiver's FV table and, thanks to the synchronization, the original message is decoded without problems. In order to gain runtime adaptivity and hence better results, an LRU replacement policy is applied to adapt the FV table content to changes in the traffic patterns. Of course, every time a value is replaced in one table, that modification has to be forwarded to every other table to maintain the FV tables' coherency. Finally, with the introduction of such a compression scheme, the savings in terms of router power consumption reach a 16.7% reduction, with a CPI reduction of about 23.5% compared to the baseline model.
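A minimal sketch of the lookup on both sides, assuming the FV tables have already been synchronized as described above (the LRU replacement and the table-update broadcast are omitted):

class FVTable:
    """Frequent Value table, kept identical at sender and receiver;
    on a hit, only a short table index travels over the network."""
    def __init__(self, frequent_values):
        self.value_to_index = {v: i for i, v in enumerate(frequent_values)}
        self.index_to_value = list(frequent_values)

    def encode(self, value):
        idx = self.value_to_index.get(value)
        return ("idx", idx) if idx is not None else ("raw", value)

    def decode(self, tag, payload):
        return self.index_to_value[payload] if tag == "idx" else payload

# Both endpoints build the same table (synchronization is assumed here).
fvt = FVTable([0x00000000, 0xFFFFFFFF, 0x00000001])
tag, payload = fvt.encode(0xFFFFFFFF)        # ("idx", 1): the index replaces the value
assert fvt.decode(tag, payload) == 0xFFFFFFFF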
Besides their Cache Compression implementation, Das et al. [7] also introduced a Compression-in-the-NIC methodology that exploits the Frequent Pattern Compression algorithm to gain benefits also when implemented with an end-to-end approach. The authors focused on the analysis of the compression and decompression latencies and area overheads and on how they impact the overall performance, since the compression algorithm itself has been described earlier. The compression and decompression phases come at a cost: both introduce extra execution cycles and extra dynamic power dissipation. To face this problem, the solution given here is that, since FPC can be done in five cycles, this delay can be hidden by a pipelined router model; on the other hand, decompression does not have to wait until the tail flit has arrived at the NI: once the header flit has been received, decoding of the message can start, resulting in a visible extra latency of two cycles. This added latency does not hurt the overall execution latency; rather, the scheme reduces it by about 20% compared to the baseline execution. Finally, thanks to the simplicity of the hardware modules that implement compression, the power consumption overhead remains negligible compared to the power savings introduced by the methodology itself, about 10% improvement on average.
Differently, Zhan et al. [22] applied the end-to-end mechanism to NoCs by exploiting the Delta Compression algorithm to reduce traffic. They highlight the importance of the plug-and-play capability and, more precisely, the fact that introducing such a scheme is orthogonal and complementary to any other optimization in other blocks. No∆ is an adaptation of the Delta Compression scheme to the NoC environment: starting from the common key concept of the low dynamic range of values, the authors aim to represent a data message as a common base value of fixed size plus an array of relative differences from that base. They also describe a compression policy to determine at runtime whether a data message can be compressed. Assuming that the data portion of the message is D bytes, the base value is B bytes and the other values are represented as an array of differences, No∆ deems the message compressible if and only if

max{size(∆i)} < B, ∀i ∈ {1, 2, ..., n}.

This methodology focuses not only on how to physically implement a lightweight module for compression and decompression inside each NI of the network, but also on the calculation of the optimal base size, and it describes a multiple-bases variant that tries in parallel to compress a data message using different bases with different fixed sizes, to achieve a better compression ratio. The evaluation of this technique on the SpecCPU2006 benchmark suite highlights the potential benefits of No∆: it can achieve an average compression ratio of 21.1%, with a latency reduction of 10.1% and a network load reduction of 13.1% compared to the baseline model. Also, thanks to the fact that compressing data lightens the network load, No∆ can lower the dynamic power consumption by about 11% on average, while the power overhead of each compressor/decompressor pair accounts for only 1 mW of dynamic power and 9.6 uW of leakage power.
Pekhimenko et al. [13] instead focused on a different observation: compression, besides its capability to decrease dynamic power consumption by decreasing network traffic, increases the bit toggle count of each compressed packet, a fact that can in turn increase dynamic power consumption. The bit toggle count is defined as the number of switches from 0 to 1, or vice versa, on a communication channel or link between two consecutively transmitted packets; the higher the count, the more switching occurs on the links, and the dynamic energy increases accordingly. Starting from this observation, the authors developed a hardware mechanism to exploit the tradeoff between the savings introduced by compression and the power overhead due to the increased bit toggle count: the idea is to prevent the bit toggle count from increasing too much during consecutive data transmissions. To manage this decision policy, an Energy Control module is introduced, with the goal of correctly predicting whether the next message should be transmitted on the channel in compressed form or not. It uses an activation function that combines several metrics to determine compressibility: the bit toggle count of the uncompressed packet, the bit toggle count of the same packet once compressed, the current bandwidth utilization and the actual compression ratio. Also, to maintain data alignment in storage, a Metadata Consolidation phase gathers all the compression-related bits and moves them into contiguous regions of space. The selection policy achieves good results in terms of bit toggle count reduction (6 to 16%), but on the other side it limits the achievable compression ratio.
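A minimal sketch of such a decision, with an illustrative activation function (the thresholds and weighting are our assumptions; the actual Energy Control heuristics in [13] differ):

def bit_toggles(prev_flits, next_flits):
    """Count link toggles: bit positions that flip between two
    consecutively transmitted flit streams (flits as integers)."""
    return sum(bin(a ^ b).count("1") for a, b in zip(prev_flits, next_flits))

def should_send_compressed(toggles_raw, toggles_compressed,
                           compression_ratio, bandwidth_utilization):
    """Send compressed only when the extra switching activity does not
    outweigh the traffic reduction; under heavy bandwidth pressure the
    reduced flit count wins regardless."""
    if bandwidth_utilization > 0.8:          # link is the bottleneck
        return True
    toggle_penalty = toggles_compressed / max(toggles_raw, 1)
    return compression_ratio > toggle_penalty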
Differently from other works, this toggle-aware compression mechanism focuses on the runtime decision of whether to compress a packet, based on the data actually transmitted on the physical links; it does not prescribe a specific compression type, but uses 6 different algorithms, including the aforementioned Frequent Pattern Compression and Base∆Immediate. Because different compression algorithms compress data in different ways, with different entropies, more than one compression algorithm is employed in order to have better chances of reaching a high compression ratio with the minimum bit toggle count among the algorithms.
Chapter 3
Evaluation Methodology
In this chapter we present the evaluation methodology used to analyze the impact of the compression mechanism on the NoC. We describe how we set up the analysis environment, which tools were used and how they interact with each other in order to retrieve the desired data. An architectural description of the modifications made to the NoC is presented, to give a better understanding of where the considered modules are placed and what their impact on the system is, together with a description of the two compression techniques used in the evaluation. A simulation model of the compression and decompression blocks is given to clarify how we simulated those components and how we modeled them with the objective of capturing their behaviour in the simulated environment. The rest of the chapter is organized as follows. Section 3.1 describes the architectural modifications made to the NoC to support compression, and section 3.2 gives a detailed description of the compression techniques selected for our evaluation. Finally, section 3.3 describes the analysis flow and the tools used to retrieve the results presented in chapter 4.
3.1 Architectural Description
Figure 3.1 shows the architectural building blocks of a Network Interface Controller (NIC) and its position inside a NoC. The NIC has the role of mediating the communication between its IP core and the rest of the system through its attached router.
Figure 3.1: Graphic representation of the baseline NIC, its positioning inside a
NoC and its internal blocks.
As we can see from the figure, the NIC is composed of two main flows. The first one stores messages coming from the IP block into a message buffer, waiting to be flitisized. At every wake-up, the NIC checks if there is a ready message in the message buffer and, if so, it flitisizes it and stores it into one of the NIC's output flit buffers. On the other side, at each wake-up the NIC also checks if there is a complete sequence of flits in one of its input flit buffers and, if so, it picks them up and reconstructs the original message. The reconstructed message is then queued into an input message buffer that holds all the messages destined to the IP block. It is worth noticing that any modification done at the NIC level, in terms of insertion of hardware modules, does not affect the rest of the system, because of the NIC's intrinsic goal of maintaining transparency between the IP blocks and the rest of the network.
Encapsulating the compression modules into the NIC therefore results in a mechanism that is transparent with respect to the rest of the system: in this way its introduction is orthogonal to other applicable optimizations and gives more flexibility in the design phase.
Figure 3.2: Graphic representation of the NIC with compression and decompres-
sion modules in place.
We modified the presented baseline architecture by placing a pair of compression and decompression modules in each node's NIC, as shown in figure 3.2. In particular, the compression block is placed between the output message buffer, accepting messages from the cache controller, and the hardware block that packetizes the message into flits. If the message is compressible, i.e. if it is a data message, it advances to the compression stage. Once finished, the message goes to the following stage, which flitisizes it, sets the compression flag on the header flit, and stores the flits into one of the output flit buffers of the NIC, waiting to be injected into the NoC. The decompression block is placed between the hardware block that reconstructs the message and the input message buffer that serves the cache controller. Once the message has been reconstructed, its compression flag is checked and, if the message is compressed, it advances to the decompression stage; otherwise it is bypassed directly into the input message buffer that serves the cache controller.
3.1.1 Compression Block
The architectural representation of our compression block is shown in figure 3.3. It consists of several compressor modules placed in parallel, each one characterized by a different algorithm and a different delay to perform compression on data. Once a message arrives at the input of the block, it is forwarded simultaneously to all the compressors and, depending on their current state, they either start compressing the message or store it into their waiting queues. Compressor modules are grouped in two sets, depending on their delay characteristics: low-latency and high-latency response modules. A low-latency response module is a compressor that can give an output within a short number of clock cycles, typically close to 1, while a high-latency response module takes more cycles to give an output, typically from 5 cycles up. A Compression Controller (CC) is needed to select among the outputs of the compressors. Each module is attached to the CC and signals its completion and the compression ratio obtained on the message.
Figure 3.3: Internal composition of the compression block.
Because we need to ensure that only one compressor's output is granted among all the possible compressors, a simple selection policy is defined to grant the output to the rightmost multiplexer shown in figure 3.3.
3.1.2 Output Selection Policy
The selection policy employed by the CC is described here in its most general form, so as to be applicable to an arbitrary number of modules. If a low-latency module fires its done signal to the CC, its ratio value is checked against a 50% threshold and, if it is equal to or higher than that threshold, the CC directly grants its output to the multiplexer, aborting all the other active tasks. In doing so, the CC speculates on the achievable compression ratio, favoring the low-latency response as long as a decent compression ratio is reached. If more than one low-latency module fires at the same time, they are checked against the same threshold, and the CC grants the output of the compressor that, among the modules with a ratio above the threshold, has the highest compression ratio. For high-latency compressors, instead, the CC does not perform any threshold filtering. In fact, if the policy has to compare high-latency modules, it means that the speculation on low-latency modules failed. The CC therefore simply compares the ratio values of all the compressors after their termination and grants the output of the module with the highest ratio among all the compressors used, both low- and high-latency ones. This second rule is based on the fact that it is not worth speculating on latency when comparing high-latency modules; it is better to wait for all the compressors to finish and hope for a better result than the low-latency ones. Because it is not known whether the high-latency modules will surely achieve a better result than the low-latency ones (albeit below the 50% threshold), the outputs of all modules are checked, to be sure to grant the module with the highest compression ratio. Once the output is granted, the CC also forwards the information about which compressor has been granted, which will be embedded into the header flit to guarantee correct decompression at the receiver.
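A minimal sketch of this two-rule selection (module bookkeeping and the abort signals are abstracted away; each entry is a (module_id, ratio) pair):

def cc_select(done_low_latency, done_high_latency, threshold=0.50):
    """Rule 1: among the low-latency modules that have fired, grant the
    best one whose ratio meets the 50% threshold, aborting the rest.
    Rule 2: if speculation failed, wait for every module and grant the
    highest ratio overall, low- and high-latency alike."""
    eligible = [m for m in done_low_latency if m[1] >= threshold]
    if eligible:
        return max(eligible, key=lambda m: m[1])
    all_done = done_low_latency + done_high_latency
    return max(all_done, key=lambda m: m[1])

# Two low-latency modules fired: the one at 0.62 wins over 0.55.
print(cc_select([("zero", 0.62), ("delta", 0.55)], []))  # ('zero', 0.62)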
The output lines of each compressor have the same width as the original message, because in the worst case of no compression the message is forwarded without modification. So, in order to give the correct output to the next stage, the output of the multiplexer is filtered with a bitmask that keeps only the amount of data representing the compressed message, as shown in figure 3.4.
Figure 3.4: Detail of the output selection and filtering performed by the multi-
plexer. It filters out N bits from a line width of B bits, assuming N≤B.
3.1.3 Decompression Block
Similarly to the compression block, the decompression block consists of a vector of decompressor modules working in parallel. Once a message is reconstructed from its flits, its compression flag is checked to detect whether it needs to be decompressed before being sent to the cache controller's input buffers. As the message arrives at the decompression block, the Decompression Controller (DC) selects the correct multiplexer input line based on the compression type flag retrieved in the previous stage from the header flit. We do this to be sure that a message compressed with a specific compressor module is decompressed by the corresponding decompressor module that employs the same algorithm. The DC also selects the correct decompressor module's output, to be sure to deliver the decoded message to the following input message buffer. The decompression block, and its constitutive sub-blocks, is represented in figure 3.5.
Figure 3.5: Internal composition of the decompression block.
3.1.4 Simulation Model
We modeled our compression and decompression blocks as two delay queues that directly add D queueing delay cycles to each message, where D varies depending on the compression algorithm used.
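A minimal sketch of this abstraction (the event-queue interface is ours, not GEM5's):

import heapq
import itertools

class DelayQueue:
    """Models a (de)compression block as a queue that releases each
    message D cycles after enqueue, with D chosen per algorithm."""
    def __init__(self):
        self._heap = []                  # (release_cycle, tiebreak, message)
        self._order = itertools.count()  # keeps FIFO order on equal cycles

    def push(self, message, now, delay_cycles):
        heapq.heappush(self._heap, (now + delay_cycles, next(self._order), message))

    def pop_ready(self, now):
        """Return every message whose delay has elapsed by cycle `now`."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready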
3.2 Compression Techniques
There are plenty of compression techniques in the literature, each one employing different algorithms and hardware designs. Most of them fall into two sets, defined as table-based and content-aware algorithms: the former uses tables containing patterns associated with a code, which are matched against the content of the analyzed data [1, 23, 7]. If there is a match, the matched portion of data is substituted with its associated code from the table. Tables can be adapted to the traffic in order to better follow its behaviour by substituting table entries at runtime [23]. The latter instead exploits the content of the data, as described in chapter 2, in order to transmit fewer data that can be reconstructed without additional elements. We selected two methodologies that belong to this last set, because implementing an E2E table-based algorithm has the following major drawbacks:
• using tables requires each C/D module to have at least one table to perform the compression and decompression phases, as described in [1]. This results in extra area and power overheads, due to the fact that tables are implemented as physical registers once synthesized on chip;
• updating the entries of those tables requires a coherence mechanism to keep all the modules consistent, as described in [23]; this implies an increase in traffic injection that can be potentially counterproductive with respect to the goal of lightening the traffic.
Beyond these limitations, our selection was also driven by the desire to highlight the benefits deriving from compression without incurring the problems of complex design schemes that heavily affect power and area overheads. We have found that content-aware algorithms have very small area and power overheads, as described in [22], and hence are the most suitable options for our investigation. The following subsections detail the characteristics of the Delta compression and Zero compression mechanisms used in this thesis to obtain the results outlined in chapter 4.
3.2.1 Delta Compression
Delta compression is a method to transmit data in the form of differences rather than as the complete data set. This is possible thanks to the important observation that most memory and cache data have a low dynamic range [22], [14], for different reasons: programs usually arrange data in arrays, change register numbers or memory addresses across a large number of instructions, similar data values are often grouped together, and more in general because of the spatial and temporal locality of memory accesses. The original data can then be seen as a common base value of fixed size plus an array of relative differences, called ∆s. We selected as the base the first flit of the data, which is originally composed of N flits. Other implementations explore the possibility of using bases other than the first flit, but results show that the best achievements, in terms of compression ratio as well as power and area overheads, are obtained by selecting the first flit as the base [14].
Figure 3.6: First execution phase of Delta compression.
In order to perform compression, the algorithm compares the base flit with all the remaining flits by byte-wise subtracting each considered flit from the base, and filling a compression mask with 0 if the byte that corresponds to that position in the mask shows no difference with respect to the base, or 1 otherwise, as described in figure 3.6. This phase requires only N-1 subtractors, where N is the total number of data flits analyzed. Once the compression mask is filled, the algorithm analyzes it at fixed positions, starting from an offset that represents the i-th flit inside the whole mask.
Figure 3.7: Second execution phase of Delta compression.
By finding, for each group of B bits starting at the flit's offset, the position
of the leftmost zero that has no 1s to its right, the algorithm obtains
the resulting savings for that flit, and passes that value to a multiplexer that
filters only the relevant portion of data, as shown in figure 3.7.
Given a flit size of B bytes, and a payload dimension of N flits, the compression
mask occupies (N − 1) × B bits.
Finally, the generated ∆s are packed together as a new compressed message
and passed to the next NIC stage.
With this implementation, considering a data message represented by 1 head
flit and N-1 body flits of B bytes each, we obtain a compressed data packet
formed by 1 head flit, 1 body flit representing the base, and a variable number
of body flits containing the ∆s and the compression mask.
The entire compression phase takes D cycles to complete, which are
accounted for in simulations as additional packet queueing delay.
At the receiving side, the decompression module is triggered every time
the head flit of a data packet has the compression flag set, and it starts its
restoring phase.
Similarly to the compressor, the decompressor uses a vector of N-1 byte-wise
adders in parallel, taking as first operand the base flit and as second operand
each ∆; based on the additional information attached, it can restore
the original flits without issues. As for the compression phase, its cost in
terms of delay is D cycles.
The entire operation of compressing and decompressing a packet therefore adds
2 × D cycles to its queueing latency. Differently from other implementations,
in which compression and decompression delays are masked by pipeline
improvements [22] and account for only one extra delay cycle, we wanted to
analyze the worst-case scenario for such a technique, with no other
optimization present in the system.
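To make the mechanism concrete, the following minimal C++ sketch reproduces the byte-wise behaviour described above (mask generation by byte-wise subtraction, delta packing, and additive restoration). It models the logic only, not the hardware timing; all names are illustrative rather than taken from our implementation, and the hardware's second execution phase is simplified to a software loop that keeps only the differing bytes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t B = 8;                // bytes per flit (8-byte flits)
using Flit = std::array<std::uint8_t, B>;

struct DeltaPacket {
    Flit base;                              // first data flit, sent verbatim
    std::vector<bool> mask;                 // (N-1)*B bits: 1 = byte differs from base
    std::vector<std::uint8_t> deltas;       // byte-wise differences actually transmitted
};

// First phase: byte-wise subtraction of every remaining flit from the base,
// filling the compression mask (0 = no difference, 1 = difference).
DeltaPacket deltaCompress(const std::vector<Flit>& payload) {
    DeltaPacket out;
    out.base = payload.front();             // the base is the first flit
    for (std::size_t f = 1; f < payload.size(); ++f) {
        for (std::size_t b = 0; b < B; ++b) {
            auto delta = static_cast<std::uint8_t>(payload[f][b] - out.base[b]);
            out.mask.push_back(delta != 0);
            if (delta != 0)
                out.deltas.push_back(delta);  // only the differing bytes travel
        }
    }
    return out;
}

// Decompressor: a byte-wise addition of each transmitted delta on top of the
// base flit, guided by the mask, restores the original payload.
std::vector<Flit> deltaDecompress(const DeltaPacket& p, std::size_t nFlits) {
    std::vector<Flit> payload(nFlits, p.base);   // flit 0 is the base itself
    std::size_t d = 0;
    for (std::size_t f = 1; f < nFlits; ++f)
        for (std::size_t b = 0; b < B; ++b)
            if (p.mask[(f - 1) * B + b])
                payload[f][b] = static_cast<std::uint8_t>(p.base[b] + p.deltas[d++]);
    return payload;
}
```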
3.2.2 Zero Compression
Zero compression is a simple compression algorithm that aims to represent
every zero-content byte present in the data with a single zero-bit.
Figure 3.8: Zero compression mechanism.
The algorithm analyzes each byte of the message by feeding its bits to an
OR gate to check whether the byte is zero, as described in Figure 3.8.
If so, the compression mask is updated with a 1 in the corresponding
byte position, to keep track of the offset of the compressed byte, and the new
message is updated by appending a single zero bit. If the byte is not zero,
it is not compressible: it is passed as-is to the compressed message generation,
and a zero bit, indicating that the byte is not compressed, is set in the
compression mask. A compression flag is set in the header flit, which will
later trigger the decompression phase at the receiving side. The compressed
message is then passed to the next NIC stage.
The compression mask is N × B bits wide, assuming that a flit is composed
of B bytes and the payload of N flits; it represents the overhead that has
to be attached to the compressed packet in order to correctly restore the
original data.
With such an implementation, considering that an uncompressed data
packet is N flits long with each flit of B bytes, we can shrink it, in the best
case, to a total of 1 header flit, 1 body flit for all the zeroes and 1 body flit
for the compression mask. The compression phase accounts for D cycles on
execution, which are directly added to the packet's queueing latency.
On the receiving side, the decompressor is triggered every time a packet
has its compression flag set. Based on the information contained in the
compression mask, it expands to zero those bytes marked with a 1
in the mask, finally regenerating the original data.
Like the compression phase, decompression takes D additional execution
cycles to produce its output, accounted for as additional packet
queueing latency.
This implementation therefore adds a total of 2 × D cycles to the
packet's queueing delay.
3.2.3 Specialized Output Selection Policy
In our evaluation methodology we wanted to capture the impact of compression
by using the two aforementioned techniques, paired together as described
in section 3.1.1.
We made the following modifications to the general selection policy
presented in section 3.1.2, in order to use it with the delta and zero
compression modules.
Based on their characteristic delays, Zero compression acts as the low-latency
module and Delta compression as the high-latency one.
Our specialized output selection policy is therefore simpler than the general
one: the low-latency module's compression ratio is directly checked against the
50% threshold, without waiting for other possible low-latency modules, since it
is the only one present in the compression block. If its ratio is greater than
the threshold, its output is granted by the CC and the high-latency task is
aborted.
Otherwise, the CC also waits for the high-latency module to terminate, and
grants the output to the compressor module with the greatest compression
ratio between the two.
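The following C++ fragment sketches this specialized policy under the stated assumptions (one low-latency and one high-latency module, a 50% threshold); the names and the callback used to wait for the delta result are illustrative.

```cpp
#include <functional>

enum class Winner { Zero, Delta };

// Sketch of the specialized policy: the low-latency (zero) result is checked
// first against the 50% threshold; only if it fails does the CC wait for the
// high-latency (delta) result and grant the module with the best ratio.
Winner selectOutput(double zeroRatio,
                    const std::function<double()>& waitForDeltaRatio) {
    constexpr double threshold = 0.50;       // ratio above which zero wins outright
    if (zeroRatio > threshold)
        return Winner::Zero;                 // the delta task would be aborted here
    const double deltaRatio = waitForDeltaRatio();  // block until delta terminates
    return (zeroRatio >= deltaRatio) ? Winner::Zero : Winner::Delta;
}
```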
3.3 Methodology Evaluation Engine
Figure 3.9: Block diagram of the Methodology Evaluation Engine workflow.
Figure 3.9 depicts the workflow used to retrieve the results presented
in the next chapter.
We collected data by sampling the network's state at regular intervals
of N million instructions, for a total simulated execution time of S seconds.
A configuration file is used by the Simulator Engine to pass all the needed
parameters, both for the sampling and for the definition of the NoC structure.
Once the simulation finishes, raw statistics files are collected and passed to
the Elaboration Engine, which filters them to extract only the data useful for
the evaluation. A post-simulation power estimation tool calculates dynamic
power consumption, and three files are then generated, grouping power,
performance, latency and compression-related statistics, averaged over the
number of sampled epochs. Finally, those files are passed to a Graph Plotter
that presents the aggregated data in graphical form.
The following subsections give a detailed description of how the Elaboration
Engine performs calculations on the raw statistics data.
3.3.1 Power Consumption Evaluation
Power consumption is calculated based on the injection rate on links per
executed cycle in simulation.
To obtain power results, an injection rate value has to be passed to the
power estimation tool. The Elaboration Engine calculates the link injection
rate using the following formula:
\[
\text{injection rate} = \frac{\left(\sum_{i=0}^{vnets} hopcount_i\right) / execycles}{R \times L} \tag{3.1}
\]
where R stands for the total number of routers in the NoC, L for the
number of links per router, and execycles for the total number of cycles
executed during the simulation.
The hopcount represents the total number of link traversals that each flit
performs to reach its destination from the source node. We collected hop
count metrics by summing up every flit's hop count before it is injected
into the network.
Once the dynamic instantaneous power dissipation is obtained, the
Elaboration Engine calculates the dynamic energy dissipation by multiplying
that value by the corresponding simulated execution time, in seconds, recalling
the mathematical relation that subsists between instantaneous power and energy:
\[
Energy = \int_{t_0}^{t_1} P(t)\,dt \tag{3.2}
\]
The power consumption overheads of the compression/decompression phases
have been calculated by taking into account the number of execution cycles
spent performing those phases, their power dissipation per cycle, and the
total number of cycles executed in the simulation.
The dissipated power overhead per cycle has been obtained with the following
formula:
\[
\text{compression overhead} = \frac{P_{C/D} \times cycles_{C/D}}{execycles} \tag{3.3}
\]
where P_{C/D} is the power consumption of the compressor/decompressor
couple, cycles_{C/D} is the sum of the cycles spent performing compression
and decompression during execution, and execycles represents the total
number of cycles executed during the simulation.
Once retrieved, this overhead has been added to the link energy as additional
consumption.
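The following C++ sketch reproduces the arithmetic of equations 3.1 and 3.3 as performed by the Elaboration Engine; variable names mirror the formulas and are illustrative, not taken from the actual tool.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Link injection rate of equation 3.1: hop counts summed over the VNETs,
// divided by the executed cycles and by the R*L links of the NoC.
double injectionRate(const std::vector<std::uint64_t>& hopCountPerVnet,
                     std::uint64_t execCycles,
                     unsigned routers, unsigned linksPerRouter) {
    const auto totalHops = std::accumulate(hopCountPerVnet.begin(),
                                           hopCountPerVnet.end(),
                                           std::uint64_t{0});
    const double perCycle = static_cast<double>(totalHops) / execCycles;
    return perCycle / (routers * linksPerRouter);
}

// Per-cycle C/D power overhead of equation 3.3: module power weighted by the
// fraction of the simulation spent compressing and decompressing.
double compressionOverhead(double powerCD, std::uint64_t cyclesCD,
                           std::uint64_t execCycles) {
    return powerCD * cyclesCD / execCycles;
}
```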
3.3.2 Performance Evaluation
Performance has been analyzed by collecting statistics regarding the simulated
execution time and the network latency for every sampled epoch. These
two metrics have been selected because they are the most representative for
investigating the impact of compression in terms of overall execution speedup
and latency variations on the network.
The simulated execution time represents the time interval between the
simulation launch and its termination, expressed in seconds, as if
the workload were executed on real SoC hardware.
The network latency represents instead the total latency introduced by the
network, due to switching, traversing and dispatching packets among the IPs.
3.3.3 Compression Determinism Evaluation
Compression determinism refers to the variability of the compression
ratio calculated per packet at runtime. In order to explore the possibility
of reducing the standard deviation of the average compression ratio, we
collected, for each simulated epoch, the compression ratio achieved
by every data packet.
We then post-calculated the average compression ratio over the
epochs and its associated standard deviation, in order to better understand
how much the compression ratio shifts from packet to packet.
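A minimal sketch of this post-processing step, assuming the per-packet ratios of one epoch are available as a vector, could look as follows (C++, illustrative names):

```cpp
#include <cmath>
#include <numeric>
#include <vector>

struct RatioStats { double mean; double stddev; };

// Mean and standard deviation of the per-packet compression ratios collected
// in one epoch; assumes a non-empty sample.
RatioStats ratioStats(const std::vector<double>& ratios) {
    const double mean = std::accumulate(ratios.begin(), ratios.end(), 0.0)
                        / ratios.size();
    double sq = 0.0;
    for (double r : ratios)
        sq += (r - mean) * (r - mean);
    return { mean, std::sqrt(sq / ratios.size()) };
}
```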
Chapter 4
Results
This chapter presents a comprehensive analysis of the impact of data
compression on the NoC.
We discuss how compression can lower dynamic power consumption
by acting on the amount of data injected into the network. We then show
how this mechanism does not affect the overall speedup of the system, which
is driven by other factors not directly correlated to the traffic reduction of the
network. Finally, an analysis of the compression ratio and its determinism
is presented.
The rest of the chapter is organized as follows. Section 4.1 describes the
simulation environment setup, as well as the benchmarks used. Section 4.2
analyzes the impact that the introduction of compression/decompression has
on the dynamic energy dissipation, and Section 4.3 describes how performance
is influenced. Finally, Section 4.4 discusses how the introduction of a parallel-
compression approach impacts the achievable compression ratio and its
standard deviation, compared to single-algorithm techniques.
4.1 Simulation Setup
The compression and decompression modules presented in sections 3.1.1 and
3.1.3 have been integrated in the enhanced version [27, 26, 19, 18, 5, 6, 24, 25]
of the GEM5 cycle-accurate simulator [4].
The simulator architecture without the two blocks is considered the baseline
architecture; it consists in a 4x4 NoC with 2D-mesh topology, 16 Alpha cores
and NoC routers with 2 virtual channels per VNET, as described in table 4.1.
Processor Core: 2 GHz, in-order Alpha core, 1 cycle per execution phase
Cache line size: 64 byte
L1I Cache: 16 kB, 4-way set associative
L1D Cache: 16 kB, 4-way set associative
L2 Cache: 512 kB per bank, 8-way set associative
Coherence Protocol: MOESI (3 VNETs, 2 VCs per VNET)
Channel width: 64 bit
Flit size: 8 byte
Control packet size: 1 flit
Data packet size: 9 flits (1 header flit + 8 body flits)
Topology: 2D-mesh 4x4, 16 cores
Network frequency: 1 GHz
Technology: 45 nm at 1.0 V
Real traffic: subset of 11 benchmarks from the SpecCPU2006 benchmark suite
Table 4.1: Experimental Setup: architecture specifications and traffic workloads
used in simulations.
The channel width of the links has been set to 64 bit, resulting in a flit size of
8 bytes. In order to pack a 64-byte cache line, a data packet consists in 8
body flits carrying the payload plus one header flit, while control packets
require only 1 flit.
We collected our results using a compressor/decompressor module with
the two compression methodologies exposed in sections 3.2.1 and 3.2.2.
We then evaluated three different configurations of the module: delta-enabled,
where the compression block is forced to compress only with the delta
compressor module; zero-enabled, where the block is forced to compress only
with the zero compressor module; and finally parallel-enabled, in which the
compression block works normally.
Table 4.2 details the specific parameters used to tune our modules for
simulation. The values regarding the introduced delay and power consumption
refer to one compressor/decompressor couple. The power consumption
values for the delta module are taken from [22]. Although the dynamic power
consumption of zero compression is declared negligible in [20], we decided to
set the dynamic power consumption of its compressor/decompressor couple
to the same value as our delta compression module, in order to model a more
realistic scenario.
Compression Methodology | Max. Comp. Ratio | Introduced delay (cycles) | Power Consumption (mW)
Delta                   | 60%              | 10                        | 2
Zero                    | 75%              | 2                         | 2
Parallel                | 75%              | 10 or 2                   | 4
Table 4.2: Experimental Setup: compression techniques specifications.
A MOESI-based protocol is used as the memory coherence protocol of the
system. This protocol virtually isolates messages in three different VNETs
to guarantee deadlock-free message dispatching. Traffic isolation also guarantees
that only one VNET is employed to carry and manage data messages,
while control messages are injected in the other two VNETs. For our
simulations, the protocol assumes that VNET0 and VNET1 carry the coherence
control packets, i.e. single-flit packets, while VNET2 carries data packets.
The DSENT NoC power estimation tool [17] has been used to extract
power data from the simulated NoC architecture, and the GNU Octave plotting
tool to plot graphs from the obtained results.
A subset of 11 benchmarks from the SpecCPU2006 benchmark suite has
been used to simulate real traffic workloads on the simulator. The suite provides
integer and floating-point single-threaded benchmarks able to stress the NoC
with high traffic workloads [11].
4.2 Energy Analysis
Figure 4.1: Energy consumption results on the three considered compression-
enabled NoC architectures, normalized to the baseline architecture.
Total energy consumption results are depicted in Figure 4.1.
The three grouped columns on the X axis represent the three considered
architectures for each considered benchmark. Energy consumption is reported
on the Y axis, normalized to the baseline architecture. The lighter bars
represent the energy consumption on the links, while the darker bars on
top represent the compression energy overheads. These energy values were
obtained with DSENT [17] for each simulated epoch, and averaged over
the number of executed epochs, as described in Section 3.3.1. We adopted
the Normalized Average Epochs Energy (NAEE) metric to compare the
different architectures:
\[
AEE_k = \frac{\sum_{i=1}^{epochs} EE_i}{count_{epochs}} \tag{4.1}
\]
\[
NAEE_k = \frac{AEE_k}{AEE_{baseline}} \tag{4.2}
\]
where EE_i in equation 4.1 stands for the energy consumption of epoch i
and count_epochs for the number of considered epochs. The Average Epochs
Energy of each compression architecture is then normalized to the baseline
architecture's result, as reported in equation 4.2.
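As a worked form of the two equations, the following illustrative C++ helpers compute AEE and NAEE from per-epoch energy samples:

```cpp
#include <numeric>
#include <vector>

// AEE of equation 4.1: the per-epoch energies averaged over the epoch count.
double averageEpochsEnergy(const std::vector<double>& epochEnergy) {
    return std::accumulate(epochEnergy.begin(), epochEnergy.end(), 0.0)
           / epochEnergy.size();
}

// NAEE of equation 4.2: the architecture's AEE normalized to the baseline's.
double normalizedAEE(const std::vector<double>& archEnergy,
                     const std::vector<double>& baselineEnergy) {
    return averageEpochsEnergy(archEnergy) / averageEpochsEnergy(baselineEnergy);
}
```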
Figure 4.2: Total injected flits for each compression-enabled architecture, obtained
from simulations on the SpecCPU2006 benchmark subsuite.
Figure 4.2 shows the total number of flits injected into the network, normalized
to the baseline architecture. On the X axis, each of the three grouped columns
represents a compression-enabled architecture, while the ordinate shows the
amount of injected flits. On average, the parallel-enabled architecture
achieves 27% less flit injection than the baseline model, and outperforms
the delta-enabled and zero-enabled architectures by 6% and 6.4% on
average, respectively. Figure 4.3 shows the compression ratio achieved
by the three architectures, represented by each of the three grouped
columns on the X axis, with the Y axis showing the compression ratio value.
We adopted the Compression Ratio Value (CRV) metric [15, 16] to perform our
analysis. It is defined as:
\[
CRV_k = \frac{\text{baseline flit count}}{\text{compressed flit count}} \tag{4.3}
\]
The CRV of the parallel-enabled architecture reaches 1.42x, while delta-enabled
and zero-enabled only reach 1.29x and 1.30x on average, compared to the
baseline architecture.
The results show that energy dissipation is reduced by an average of 19%,
22% and 26% with the delta-enabled, zero-enabled and parallel-enabled
architectures respectively, across all the analyzed benchmarks.
Figure 4.3: Compression ratio for each compression algorithm, obtained from sim-
ulations on the SpecCPU2006 benchmark subsuite.
In the worst case, represented by the NAMD workload, energy consumption is
nearly equal to the baseline configuration without compression. This behaviour
is due to the fact that its achieved compression ratio varies only from 1.01x to
1.04x across architectures, so compression does not significantly reduce the
number of flits injected into the network.
On the other hand, the GOBMK and LESLIE3D workloads benefit the most
from compression, with a 68% and 60% energy reduction respectively,
considering the parallel-enabled architecture.
Since their compression ratios are 2.04x and 2.11x respectively, they actually
inject less than half of the baseline flits and also have a high hops/flit ratio
compared to the other benchmarks, thus leading to a lower link utilization
(see Figure 4.4).
A comparison between the delta-enabled and zero-enabled architectures
highlights that some benchmarks prefer one technique over the other,
while others give very similar results regardless of the employed technique.
In the former case, BZIP2 and HMMER are more sensitive to the delta
technique, as highlighted by their compression ratios. The delta ratio
outperforms zero by 8% on BZIP2 and by 10% on HMMER, and the energy
consumption reflects these differences. For these two workloads the delta
technique performs better than zero, with 7% and 10% more savings with
respect to the zero technique, without considering compression overheads.
The latter case is represented by the SOPLEX, H264REF, GROMACS and
OMNETPP benchmarks: since their compression ratios are nearly the same
for the two architectures, link energy consumption is practically identical.
4.2.1 Overheads Analysis
Figure 4.1 shows the overheads as the small portion on top of each reported
bar. The overhead represents the extra energy consumed by the
compression/decompression activity.
As we can see, their impact on energy consumption is really small: on
average, they account for 3.9%, 1.2% and 2.1% for the delta-enabled,
zero-enabled and parallel-enabled models respectively.
An interesting characteristic derived from those results is that
the energy consumption of the two introduced modules directly depends
on the execution cycles performed by the module itself, as detailed by the
calculation method in Section 3.3.1.
The delta-enabled and zero-enabled overheads show that their energy
consumptions always follow a 5:1 ratio, because of the aforementioned
dependency on execution cycles: having 10 and 2 execution cycles respectively,
their overheads will always be near or equal to a 5:1 ratio. Given this, for
equal link consumption, zero compression is preferable to delta
due to its lower module consumption during execution. The parallel-enabled
module's overheads fall between the two cases, because it mixes the execution
cycles spent, based on the technique selected for each packet at runtime: for
example, the parallel module's consumption is nearly the same as delta's on
HMMER, because 81% of the traffic is compressed with the delta
compression technique, as highlighted in Figure 4.9.
Finally, we investigated the Break-Even Point (BEP) of energy consumption.
The BEP represents the point beyond which the usage of compression
no longer introduces benefits, because its energy overhead nullifies the energy
savings due to reduced traffic injection.
Although the majority of the analyzed benchmarks does not suffer from the
introduced overheads, an extreme case is represented by the NAMD workload:
its scarce energy savings are totally nullified by the compression overhead in
both the delta-enabled and parallel-enabled architectures. In this case the
zero-enabled architecture is the only one that does not nullify the energy
savings, mostly because its module works for far fewer execution cycles than
the delta one.
4.2.2 Compression Policy Analysis
We also investigated the possibility of compressing only a subset of the data
packets, by using a selection policy based on the distance that each packet
travels from source to destination. We want to observe whether compressing
only the portion of data packets with a certain characteristic can benefit the
system. Our cost metric is therefore the distance travelled by each packet.
The cost that an uncompressed packet pays to reach its destination is
modelled as follows:
\[
COST_{baseline} = flits \times (hops \times cost_L) + cost_R \tag{4.4}
\]
where cost_L represents the cost, expressed in cycles, of traversing a link,
and cost_R the cost, expressed in cycles, of traversing a router. flits
accounts for the number of flits forming a data packet, and hops describes
the number of link traversals performed by each flit. For the compressed
packet we modelled the following equation, which also takes into account the
delay added by compression and decompression:
\[
COST_{compression} = flits_C \times (hops \times cost_L) + cost_R + C + D \tag{4.5}
\]
where C and D represent respectively the cost, expressed in cycles, of the
compression and decompression phases. flits_C accounts for the number of
flits resulting after compression, and the remaining variables represent the
same costs as in equation 4.4. The policy evaluates the following condition:
\[
COST_{compression} \leq COST_{baseline} \tag{4.6}
\]
and enables compression if the above condition is true, denying it
otherwise.
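Put together, the policy reduces to a simple cost predicate; the following C++ sketch mirrors equations 4.4-4.6 with illustrative names:

```cpp
// Cost predicate of equations 4.4-4.6: compress a packet only when the link
// traversals saved by the smaller flit count outweigh the added C/D delay.
bool shouldCompress(unsigned flits, unsigned flitsCompressed, unsigned hops,
                    unsigned costLink, unsigned costRouter,
                    unsigned cyclesC, unsigned cyclesD) {
    const unsigned costBaseline   = flits * (hops * costLink) + costRouter;
    const unsigned costCompressed = flitsCompressed * (hops * costLink)
                                    + costRouter + cyclesC + cyclesD;
    return costCompressed <= costBaseline;
}
```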
We then calculated the theoretical break-even point for this policy, in
order to know the boundary, expressed in hops, beyond which we can gain
benefits from compression.
Figure 4.4: Average hops/flit performed for the baseline NoC architecture.
Simplifying both sides of equation 4.6, and substituting the variables with
our simulation parameters, we found that:
\[
hops \geq \frac{C + D}{(flits - flits_C) \times cost_L} \;\Rightarrow\; hops \geq 2 \tag{4.7}
\]
Based on this result, the policy gives benefits only for those packets
that have to travel a minimum distance of 2 hops.
Figure 4.4 shows the average number of hops performed by each flit,
for each considered workload of the SpecCPU2006 benchmark suite, executed
on the baseline architecture.
Based on these results, every benchmark has an average hop count greater
than the break-even point calculated above, with an average of 3.6 hops/flit,
making it useless to introduce a packet compression selection policy, because
virtually all packets perform more than 2 hops.
We therefore applied compression to the whole set of data packets.
4.3 Performance Analysis
Figure 4.5: Simulation execution time results on the three compression-enabled
NoC architectures, normalized to the baseline architecture.
Results on total execution time are shown in Figure 4.5.
Each of the three grouped columns on the X axis represents one of the three
architectures used in the simulations, while the Y axis reports the execution
time of each, normalized to the baseline execution time.
We adopted the Normalized Execution Time (NET) metric to compare the
different architectures' results:
\[
NET_k = \frac{ET_k}{ET_{baseline}} \tag{4.8}
\]
where ET_k represents the execution time experienced by the considered
compression architecture, and ET_baseline the baseline execution time.
The results highlight that the introduction of the compression technique, in
all three configurations, does not give valuable improvements in the overall
speedup of the system, with delta-enabled performing 0.2% worse, and zero-
enabled and parallel-enabled 0.4% and 1% better respectively, compared to
the baseline architecture.
The best improvements, observed on both GOBMK and LESLIE3D, account
for only a 5% speedup over the baseline architecture; these are the only two
benchmarks that reach a high compression ratio. Conversely, compression
introduces a system performance penalty within 3% for the NAMD, OMNETPP
and GEMS benchmarks. These benchmarks have the lowest compression ratio
on data: although all their packets are compressed, compression does not give
benefits in terms of fewer injected flits, but only adds to the packet delay.
This can be observed in the increased execution time with respect to the
baseline architecture.
Figure 4.6: VNETs comparison results on parallel-enabled architecture, normalized
to the baseline architecture.
We also evaluated the impact of data compression on the VNET latencies.
Figure 4.6 compares the latencies of the three VNETs collected on
the parallel-enabled architecture. On the X axis, each of the three grouped
columns identifies a VNET, from left to right VNET0, VNET1 and VNET2.
The Y axis details the VNET latency, normalized to the baseline latency
value. We adopted the Normalized VNET Latency (NVL) metric in
order to compare them against the baseline model:
\[
NVL_k = \frac{VL_k}{VL_{baseline}}, \quad k = 0 \ldots 2 \tag{4.9}
\]
where VL_k represents the latency of the k-th VNET.
As we can see from the figure, the packet latency of both VNET0
and VNET1 remains practically equal to the baseline values, indicating that
coherence control packets gain no benefit from the use of compression.
Differently, VNET2 experiences a 23% latency decrease on average. GOBMK
and LESLIE3D show the best improvements, with a 45% and 46% latency
decrease, respectively. This result matches Figure 4.5, where
the two considered benchmarks are the only ones experiencing a small
performance improvement.
The NAMD workload instead maintains practically the same latency values
on the three VNETs. This directly reflects the fact that, although this
benchmark compresses all its data packets, its compression ratio is small and
gives no improvement in terms of reduced flit injection.
Figure 4.7: Total latency comparison on the three considered compression-enabled
NoC architectures, normalized to the baseline architecture.
Figure 4.7 shows the total packet network latency, computed as the mean
of the three VNET latencies for each of the three compression architectures.
We adopted the Normalized Network Latency (NNL) metric in order to compare
the different architectures:
\[
NL_k = \frac{\sum_{i=0}^{VNETS} VL_i}{VNETS} \tag{4.10}
\]
\[
NNL_k = \frac{NL_k}{NL_{baseline}} \tag{4.11}
\]
where VL_i represents the latency of each VNET, and VNETS the number
of VNETs in the system.
Results show that the total network latency is lowered, on average, by 6%,
7% and 9% with delta-enabled, zero-enabled and parallel-enabled respectively,
compared to the baseline architecture.
The best results with the parallel-enabled architecture are given by GOBMK
and LESLIE3D, which outperform the baseline architecture by 15% and
16% respectively.
On the other hand, the NAMD benchmark shows little to no improvement in
latency either, with a 0.5% latency increase with respect to the baseline
architecture.
The analysis of the impact of compression on packet network latency
explains why compression does not improve the system speedup: the key
contributor to performance speedup is in fact the single-flit coherence control
packet, not the data packet. Because compression acts only on the latter
packet type, the control packets' network latency remains equal to the baseline
case, as reported in Figure 4.6, so the system cannot experience a visible
performance improvement.
Confirming this, a recently published work [12], although not implementing
compression on the NoC, showed that the key lever for speeding up a NoC
are the coherence control packets.
In their work, the researchers analyzed the impact of the different packet
types on system performance, showing that coherence control packets are the
ones responsible for the system's performance behaviour, and defining them
latency-sensitive packets.
To verify their assumptions, they proposed a NoC architecture composed
of two physical networks paired together: one used to route and dispatch only
latency-sensitive packets, and the other employed as a standard NoC to manage
the rest of the traffic.
Using a router architecture with a pipeline latency of only 1 cycle, which
performs route computation, virtual channel allocation, switch allocation,
and switch and link traversal all in the same cycle, they increase the latency-
sensitive packets' speed, leading to faster dispatching.
Their results show that the overall system speedup reaches on average 1.08x
with respect to the baseline architecture on the GEM5 simulator, and 1.66x
with the SynFull [3] synthetic traffic modeling tool.
Figure 4.8: Execution time comparison between the three baseline architectures
with different router’s pipeline latencies.
To verify this aspect, we conducted a simulation of our baseline architecture
with the latest released GEM5, which adds a flexible router pipeline
feature to its architecture, thus allowing the pipeline depth to be set flexibly.
In particular, we applied to each coherence control packet a router pipeline
latency of one cycle, and to all the other packets a latency of 4 cycles.
The considered architecture exploits a virtual-network NoC, thus splitting the
traffic into three separate virtual networks that share the same physical
resources, i.e. links and routers.
The results of this simulation are presented in Figure 4.8. On the X axis,
each of the three grouped columns represents the considered architecture: the
first employing a fixed router latency of 4 cycles for every packet, the
second employing the adaptive latency discussed above, and the third a
fixed router latency of 1 cycle for every packet. The simulation time of each
benchmark is shown on the ordinate, normalized to the 4-cycle pipeline
router latency result.
For each of the analyzed benchmarks, the adaptive-latency results fall between
the two fixed-latency architecture results. As we can see, the execution time
speedup of the adaptive architecture is, on average, 1.05x with respect to
the baseline architecture that uses a classical router pipeline with 4-cycle
latency.
4.4 Compression Ratio Analysis
Figure 4.9: Packet distribution against the compression module used in parallel-
enabled architecture.
We also investigated how the parallelization of two different compression
algorithms affects the achievable compression ratio at runtime with
respect to the two single-module architectures, delta-enabled and zero-
enabled.
This last analysis therefore focuses only on the parallel-enabled architecture,
in which the delta module and the zero module are paired together and used
to compress each packet in parallel, with only one of the two outputs being
selected, as described in Section 3.2.3.
Figure 4.9 shows the distribution of the data packets compressed
with the delta or zero module at runtime, using the parallel-enabled
architecture. The Y axis shows the percentage of data packets compressed
with the delta or zero algorithm over the total number of compressed packets,
for each of the benchmarks on the X axis. As we can see from the
figure, data packets are well distributed between the two algorithms, with
on average 52% of them compressed with the delta technique and
48% with the zero technique.
In some cases a benchmark's data packets prefer one compression technique
over the other: BZIP2 and HMMER represent the case where data is better
compressed using delta compression instead of zero compression; in those two
benchmarks, 76% and 81% of the packets are in fact compressed using the
former technique.
OMNETPP represents instead the situation in which zero compression
performs better than delta, handling 78% of the compressed data.
These results indicate that data messages are sensitive to different compression
techniques, because of the different data patterns exploited by each technique.
Our results show that our algorithm selection is a good compromise, given
the balanced average distribution of packets across the compression algorithms.
Figure 4.10: Standard deviation of the compression ratio of the three considered
compression-enabled NoC architectures.
Figure 4.10 shows the standard deviation calculated over all packets, which
represents the shift that occurs between consecutive compressed packets
at runtime. Each of the three bars on the X axis represents a considered
architecture, and the Y axis shows the associated standard deviation.
The higher the standard deviation, the more unstable the compression
ratio across packets.
The parallel-enabled architecture shows a 5% standard deviation reduction on
average, with respect to the other two models.
Some workloads, moreover, react particularly well to this architecture,
lowering their standard deviation by 6% to 10% with MCF, GOBMK and
LESLIE3D. Differently, BZIP2 and HMMER show little to no improvement
with respect to the delta-enabled architecture. This is due to the already
discussed fact that those benchmarks reach a good compression ratio only with
the delta technique, whether alone or paired with zero compression.
GEMS, NAMD and OMNETPP, having a small compression ratio compared
to the rest of the workloads, also have a small standard deviation,
because most of their traffic is not compressed at all, and the packets that
are compressed obtain a really small compression ratio, remaining near
the uncompressed situation.
This last result confirms that by using different algorithms to compress
data, and thus being able to select the best among all the compression results,
the fluctuation of the compression ratio is reduced compared to single-
compression models.
Chapter 5
Conclusions and Future Works
Modern NoC-based multi-core SoCs are increasingly stressed by the market's
application demands, which are pushing low-latency and low-power
requirements to the limit.
Actual application traffic is dominated by data packet flits, which cover
80% of the total injected flits, limiting resource availability, increasing
contention on the interconnect and slowing packet delivery.
Moreover, the more flits traverse the network, the more energy is consumed
during execution due to the increased link activity.
Starting from these considerations, we introduced an evaluation methodology
workflow built on top of the GEM5 simulator, and a comprehensive
analysis of the introduction of an end-to-end data compression mechanism as
a simple and effective way to reduce energy consumption and packet latency.
We presented a compression/decompression block capable of fitting N different
compressor modules in parallel, with a simple output selection policy able
to select only one compressor's output among the N modules. This general
implementation has been characterized using two specific compression
algorithms derived from state-of-the-art techniques, Delta compression and
Zero compression, and by specifying a suitable output selection policy for
these two techniques in particular.
This implementation has been used to present a detailed analysis of the
impact of data compression on the NoC. Qualitatively, we analyzed
the impact of compression on energy consumption and performance
improvement, and evaluated the effectiveness of a compression selection
policy. Quantitatively, we described the impact that the introduction of the
compression modules has in terms of energy and performance overheads.
Results have been extracted using a 4x4 2D-mesh architecture running a
subset of the SpecCPU2006 benchmark suite. We compared our methodology
with the baseline architecture without compression capabilities.
Moreover, the baseline architecture has been enhanced with power-aware and
performance-aware solutions to limit the design space and to have a reference
model from both a power and a performance standpoint.
Compared to the power-aware baseline model, the parallel architecture shows
a 26% average dynamic energy reduction, with a modest 0.2% overhead
accounted for the compression/decompression modules' consumption. It also
outperforms the delta and zero models with 7% and 4% more energy savings,
respectively. Compared to the performance-aware solution, the parallel
architecture gives no improvement in terms of overall system speedup, but the
average packet latency is reduced by 12%. It also outperforms the delta and
zero models by a 4% and 3% factor, respectively.
Finally, the compression ratio is increased by 5.7% with the parallel architecture
with respect to the delta and zero models, showing that the compressed traffic
is well distributed between the compression algorithms used at runtime, with
52% of the packets compressed with delta and 48% with zero. The parallel
architecture also helps to stabilize the compression ratio over time,
as shown by the 5% decrease of its standard deviation compared to the
single-algorithm architectures.
These results show that data compression on the NoC is definitely a technique
capable of decreasing energy consumption, but the same does not hold for
performance improvements: what compression affects is in fact only the data
packets' network latency. Moreover, the possibility of tuning the
compression/decompression block by introducing other modules allows the
architecture to explore the behaviour of such techniques on different
workloads, possibly reaching even better results.
5.1 Future Works
The presented work is based on two specific data compression techniques,
and their parallelization, among all the currently existing techniques. However,
we are confident that, by exploring different combinations of algorithms,
better results could be reached both in terms of energy savings and of reduced
resource contention. Also, a deeper analysis of traffic behaviour could give
better insights, to be used to tune this methodology towards ever-increasing
optimization. Finally, a combined approach with orthogonal optimizations,
for example ones that aim to increase the system speedup, could be a good
way to both reduce energy consumption and gain better performance.
Bibliography
[1] A. Alameldeen and D. Wood. Frequent pattern compression: A
significance-based compression scheme for L2 caches. Technical Report
1500, University of Wisconsin, Madison, USA, May 2004.
[2] A. R. Alameldeen and D. A. Wood. Adaptive cache compression for
high-performance processors. In Computer Architecture, 2004. Proceed-
ings. 31st Annual International Symposium on, pages 212–223, June
2004.
[3] Mario Badr and Natalie Enright Jerger. Synfull: Synthetic traffic models
capturing cache coherent behaviour. In Proceeding of the 41st Annual
International Symposium on Computer Architecuture, ISCA ’14, pages
109–120, Piscataway, NJ, USA, 2014. IEEE Press.
[4] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Rein-
hardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower,
Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell,
Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood.
The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, Au-
gust 2011.
[5] S. Corbetta, D. Zoni, and W. Fornaciari. A temperature and re-
liability oriented simulation framework for multi-core architectures.
VLSI (ISVLSI), 2012 IEEE Computer Society Annual Symposium on,
September 2012.
[6] S. Corbetta, D. Zoni, and W. Fornaciari. A temperature and reliabil-
ity oriented simulation framework for multi-core architectures. In 2012
IEEE Computer Society Annual Symposium on VLSI, pages 51–56, Aug
2012.
[7] R. Das, A. K. Mishra, C. Nicopoulos, D. Park, V. Narayanan, R. Iyer,
M. S. Yousif, and C. R. Das. Performance and power optimization
through data compression in network-on-chip architectures. In 2008
IEEE 14th International Symposium on High Performance Computer
Architecture, pages 215–225, Feb 2008.
[8] Ashley M. DeFlumere and Sadaf R. Alam. Exploring multi-core lim-
itations through comparison of contemporary systems. In The Fifth
Richard Tapia Celebration of Diversity in Computing Conference: In-
tellect, Initiatives, Insight, and Innovations, TAPIA ’09, pages 75–80,
New York, NY, USA, 2009. ACM.
[9] M. Ekman and P. Stenstrom. A robust main-memory compression
scheme. In 32nd International Symposium on Computer Architecture
(ISCA’05), pages 74–85, June 2005.
[10] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankar-
alingam, and Doug Burger. Dark silicon and the end of multicore scal-
ing. In Proceedings of the 38th Annual International Symposium on
Computer Architecture, ISCA ’11, pages 365–376, New York, NY, USA,
2011. ACM.
[11] John L. Henning. Spec cpu2006 benchmark descriptions. SIGARCH
Comput. Archit. News, 34:1–17, September 2006.
[12] Z. Li, J. San Miguel, and N. Enright Jerger. The runahead network-on-
chip. In High Performance Computer Architecture (HPCA), 2016 Inter-
national Symposium on, March 2016.
[13] G. Pekhimenko, E. Bolotin, N. Vijaykumar, O. Mutlu, T. C. Mowry, and
S. W. Keckler. A case for toggle-aware compression for gpu systems. In
2016 IEEE International Symposium on High Performance Computer
Architecture (HPCA), pages 188–200, March 2016.
[14] Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Phillip B. Gibbons,
Michael A. Kozuch, and Todd C. Mowry. Base-delta-immediate com-
pression: Practical data compression for on-chip caches. In Proceedings
of the 21st International Conference on Parallel Architectures and Com-
pilation Techniques, PACT ’12, pages 377–388, New York, NY, USA,
2012. ACM.
[15] Charles Poynton. Digital Video and HDTV Algorithms and Interfaces.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1 edition,
2003.
[16] Iain E. Richardson. The H.264 Advanced Video Compression Standard.
Wiley Publishing, 2nd edition, 2010.
[17] C. Sun, C. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L. Peh,
and V. Stojanovic. DSENT - a tool connecting emerging photonics with
electronics for opto-electronic networks-on-chip modeling. In Networks on
Chip (NoCS), 2012 Sixth IEEE/ACM International Symposium on, May
2012.
[18] F. Terraneo, D. Zoni, and W. Fornaciari. A cycle accurate simulation
framework for asynchronous noc design. In 2013 International Sympo-
sium on System on Chip (SoC), pages 1–8, Oct 2013.
[19] Federico Terraneo, Davide Zoni, and William Fornaciari. A cycle accu-
rate simulation framework for asynchronous noc design. System on Chip
(SoC), 2013 International Symposium on, October 2013.
[20] Luis Villa, Michael Zhang, and Krste Asanovic. Dynamic zero com-
pression for cache energy reduction. In Proceedings of the 33rd Annual
ACM/IEEE International Symposium on Microarchitecture, MICRO 33,
pages 214–220, New York, NY, USA, 2000. ACM.
[21] W. J. Dally and B. Towles. Route packets, not wires: On-chip interconnec-
tion networks. In Design Automation Conference, 2001, June 2001.
[22] J. Zhan, M. Poremba, Y. Xu, and Y. Xie. Noδ: Leveraging delta
compression for end-to-end memory access in noc based multicores. In
2014 19th Asia and South Pacific Design Automation Conference (ASP-
DAC), pages 586–591, January 2014.
[23] Ping Zhou, Bo Zhao, Yu Du, Yi Xu, Youtao Zhang, J. Yang, and
Li Zhao. Frequent value compression in packet-based noc architectures.
In 2009 Asia and South Pacific Design Automation Conference, pages
13–18, Jan 2009.
[24] D. Zoni, J. Flich, and W. Fornaciari. Cutbuf: Buffer management and
router design for traffic mixing in vnet-based nocs. IEEE Transactions
on Parallel and Distributed Systems, 27(6):1603–1616, June 2016.
[25] Davide Zoni, Simone Corbetta, and William Fornaciari. Hands: Hetero-
geneous architectures and networks-on-chip design and simulation. In
Proceedings of the 2012 ACM/IEEE International Symposium on Low
Power Electronics and Design, ISLPED ’12, pages 261–266, New York,
NY, USA, 2012. ACM.
[26] Davide Zoni and William Fornaciari. Modeling dvfs and power-gating
actuators for cycle-accurate noc-based simulators. J. Emerg. Technol.
Comput. Syst., 12(3):27:1–27:24, September 2015.
[27] Davide Zoni, Federico Terraneo, and William Fornaciari. A control-
based methodology for power-performance optimization in nocs exploit-
ing dvfs. Journal of Systems Architecture, 61(5-6):197–209, 2015.