
POLITECNICO DI MILANO
Master in Computer Science and Engineering

Dipartimento di Elettronica, Informazione e Bioingegneria

Exploring the end-to-end compression

to optimize the power-performance

tradeoff in NoC-based multicores

Supervisor: Prof. William FORNACIARI

Assistant Supervisor: Dr. Davide Zoni

Master Thesis of:

Fabio PANCOT

Student n. 799521

Academic Year 2015-2016


Italian Abstract

The continuous evolution of the technology market, both in terms of applications and devices, has posed new challenges to hardware manufacturers to keep up with ever-growing demands for low power consumption and better performance. The Network-on-Chip (NoC) has established itself as the reference on-chip interconnect, able to provide better scalability and flexibility. Its impact on performance and energy consumption, however, is an aspect that cannot be neglected, and it drives the search for new optimization strategies. The literature reports several methodologies to investigate this tradeoff, ranging from modifications to the router architecture, to prioritized packet scheduling, to the use of optimized topologies. In this scenario, data compression is a practical solution to reduce the injected traffic, thus reducing the required channel bandwidth and the energy profile while increasing system performance. Compression mechanisms are divided into two categories: cache compression and end-to-end (E2E) compression. The former aims to virtually increase the capacity of the cache banks by fitting several compressed lines into a single physical line, while the latter compresses cache blocks before they are packetized and injected into the network. Previous analyses have also highlighted that the majority of the flits in the network is due to data transfers. Starting from these assumptions, we propose an analysis of the impact that E2E compression has on the NoC. This thesis investigates the power-performance tradeoff offered by E2E compression on the NoC, with particular emphasis on the power and performance overheads as well as on the achievable compression ratio with respect to the optimal ratio obtained from an oracle model. We selected this type of scheme because of its transparency with respect to the NoC, which gives more flexibility in the system design, and because of its limited overheads. Previous works on data compression focus only on energy savings and reduced packet latencies, without considering the introduced overheads and neglecting the impact on system performance. In this thesis we present a qualitative analysis of the critical aspects of E2E compression, aimed at improving performance and energy consumption, and a quantitative analysis that explores the energy and latency overheads introduced by compression. We also discuss the impossibility of improving system performance by exploiting E2E compression, and finally we analyze the benefits deriving from the use of multiple compression algorithms executed in parallel, compared with single compression algorithms. The compression block used for the analyses has been integrated into the GEM5 architectural simulator and compared with a reference architecture that does not integrate the compression mechanism. Results have been collected on a 16-core architecture running a subset of the SpecCPU2006 benchmark suite. The results show that the analyzed E2E compression can save up to 26% of energy while keeping performance practically unchanged. The parallel version additionally improves the compression ratio by 5.7% and increases the compression stability over time by 5% on average, compared with single-compression mechanisms.


Abstract

The continuous evolution of market applications and technological devices has posed new challenges to hardware manufacturers to meet ever-increasing low-power and performance requirements. The Network-on-Chip (NoC) emerged as the de-facto on-chip interconnect to meet scalability and flexibility needs. However, its non-negligible impact on the performance and power consumption of the entire chip forces the adoption of different optimization strategies to cope with the market requirements. The literature reports several optimization methodologies to cope with such a tradeoff, ranging from router architecture changes to priority-based packet scheduling techniques and optimized NoC topologies. In this scenario, data compression techniques represent a viable solution to reduce the injected traffic, thus reducing the required bandwidth and the energy profile while increasing the system-wide performance. Compression techniques fall into two broad categories: cache compression and end-to-end (E2E) compression. The former aims to virtually increase the cache bank capacity by fitting more compressed cache lines into the same cache bank, while the latter aims to compress cache blocks before their injection into the on-chip interconnect. Previous analyses on different workloads highlighted that the majority of the network traffic is due to data transfers. Starting from these assumptions, we propose an analysis of the impact that an E2E compression scheme has on the NoC. This thesis investigates the power-performance tradeoff offered by E2E compression in NoCs, with particular emphasis on the power and performance overheads as well as on the achievable compression ratio with respect to the optimal compression ratio from a golden model. We selected this type of scheme because of its transparency with respect to the rest of the NoC, which gives more flexibility in terms of system design and Intellectual Property (IP) embedding, and because of its very small area and power overheads. Previous works on data compression focus their attention only on power savings and reduced packet latency, without taking into account any introduced overhead or describing how the performance of the entire system is affected. We then present a qualitative analysis of the most critical aspects of E2E compression to cope with, in order to improve performance and power consumption, and a quantitative analysis that explores the energy and latency overheads introduced by such a compression scheme. We also discuss the impossibility for E2E compression to improve the system performance and, finally, we analyze the benefits deriving from the utilization of multiple compression mechanisms in parallel, compared to single-compression-algorithm schemes. The compression methodology has been integrated in the GEM5 full-system simulator and compared to a baseline NoC architecture that does not employ data compression. Results have been collected using a 16-core architecture running a subset of the SpecCPU2006 benchmark suite. Results show that E2E compression achieves 26% energy savings with almost the same performance as the baseline NoC. Its parallel version increases the traffic compressibility by 5.7% and increases its stability over time by 5% on average, compared to single-compression-algorithm schemes.


Acknowledgments

To my parents, Alcide and Marina, who supported me during this whole journey, patiently listening to all my insecurities and grumblings and giving me motivation in the darkest hours. Without them, none of the following pages could exist.

To Lorenzo, who, albeit far away in Berlin, supported my work and helped me smile. Thank you for always having been there.

To Luca, Mauro, Andrea and Luca, without whom working on this thesis would have been really more boring.

To Claudio and Laura, for being good friends, and for the laughter we shared, and will share, working together at Bar Bianco.

To Ottilia, who helped me relieve stress with laughter and her good company.

To Davide, who opened the world of research to me and gave hard, but wise, teachings.

To professor W. Fornaciari and the rest of the HIPEAC Research Group, in which I've found a friendly group of people during this period.


Contents

1 Introduction
1.1 Goals and Contributions
1.2 Thesis Structure

2 State of the Art
2.1 Cache Compression
2.2 End-to-End Compression

3 Evaluation Methodology
3.1 Architectural Description
3.1.1 Compression Block
3.1.2 Output Selection Policy
3.1.3 Decompression Block
3.1.4 Simulation Model
3.2 Compression Techniques
3.2.1 Delta Compression
3.2.2 Zero Compression
3.2.3 Specialized Output Selection Policy
3.3 Methodology Evaluation Engine
3.3.1 Power Consumptions Evaluation
3.3.2 Performance Evaluation
3.3.3 Compression Determinism Evaluation

4 Results
4.1 Simulation Setup
4.2 Energy Analysis
4.2.1 Overheads Analysis
4.2.2 Compression Policy Analysis
4.3 Performance Analysis
4.4 Compression Ratio Analysis

5 Conclusions and Future Works
5.1 Future Works

Bibliography


List of Figures

1.1 Dots-per-inch evolution for smartphone devices over the years.
1.2 Packet type distribution on different workloads from the SpecCPU2006 benchmark suite on a 4x4 2D-mesh with a channel width of 64 bit and a cache line size of 64 B.
1.3 Flit type distribution on different workloads from the SpecCPU2006 benchmark suite considering a 4x4 2D-mesh with a channel width of 64 bit and a cache line of 64 B. Data packets are 9 flits long and control packets are single-flit packets.
1.4 Energy savings due to the introduction of the compression technique on different workloads from the SpecCPU2006 benchmark suite.
3.1 Graphic representation of the baseline NIC, its positioning inside a NoC and its internal blocks.
3.2 Graphic representation of the NIC with compression and decompression modules in place.
3.3 Internal composition of the compression block.
3.4 Detail of the output selection and filtering performed by the multiplexer. It filters out N bits from a line width of B bits, assuming N ≤ B.
3.5 Internal composition of the decompression block.
3.6 First execution phase of Delta compression.
3.7 Second execution phase of Delta compression.
3.8 Zero compression mechanism.
3.9 Block diagram of the Methodology Evaluation Engine workflow.
4.1 Energy consumption results on the three considered compression-enabled NoC architectures, normalized to the baseline architecture.
4.2 Total injected flits for each compression-enabled architecture, obtained from simulations on the SpecCPU2006 benchmark subsuite.
4.3 Compression ratio for each compression algorithm, obtained from simulations on the SpecCPU2006 benchmark subsuite.
4.4 Average hops/flit performed for the baseline NoC architecture.
4.5 Simulation execution time results on the three compression-enabled NoC architectures, normalized to the baseline architecture.
4.6 VNETs comparison results on the parallel-enabled architecture, normalized to the baseline architecture.
4.7 Total latency comparison on the three considered compression-enabled NoC architectures, normalized to the baseline architecture.
4.8 Execution time comparison between the three baseline architectures with different router pipeline latencies.
4.9 Packet distribution against the compression module used in the parallel-enabled architecture.
4.10 Standard deviation of the compression ratio of the three considered compression-enabled NoC architectures.


List of Tables

1.1 Screen characteristics for different market products, and their frame buffer bandwidth.
4.1 Experimental Setup: architecture specifications and traffic workloads used in simulations.
4.2 Experimental Setup: compression techniques specifications.


Acronyms

NoC = Network-on-Chip

QoS = Quality of Service

IP = Intellectual Property

SoC = System on Chip

CMP = Chip Multi-Processor

NIC = Network Interface Controller

VNET = Virtual Network

PE = Processing Element

CC = Compression Controller

DC = Decompression Controller


Chapter 1

Introduction

In the last decade the market has imposed a tremendous increase in the performance requirements and low-power constraints of computing architectures, ranging from High Performance Computing (HPC) to Embedded Systems. The traditional bus-based multi-core design cannot match such requirements anymore, mainly due to the limited performance scalability of the on-chip interconnect [8]. In this scenario, the Network-on-Chip (NoC) [21] emerged as the de-facto on-chip interconnect solution for scalable multi- and many-core architectures. Thanks to its efficient wiring usage and its ability to multiplex multiple traffic flows on the same channels, the NoC eases the matching of the imposed Quality of Service (QoS) requirements. Moreover, it offers a scalable and flexible interconnection fabric that can be customized depending on the specific application.

Differently from the bus-based interconnection, where all the components share the same physical channel in order to communicate, the NoC offers a distributed approach. It is constituted by several nodes connected to one another through bidirectional links. Each node contains a simple core, an L1 or L2 cache bank, and a Network Interface Controller (NIC). In order to communicate with the rest of the system, each node is coupled with a router that provides the functionality of routing packets, both those arriving from the NIC and those coming from other nodes. The NIC instead has the important role of providing communication between its node and the rest of the system: it offers an interface capable of elaborating messages into packets and of reconstructing arriving packets into messages destined to its core. The way in which nodes are organized is defined by the topology that the network employs. Several network topologies exist: torus, trees, bidirectional trees, 2D mesh, 3D mesh and so on. The most popular topology used in NoCs is the 2D mesh, a quadrilateral matrix-style topology in which each node is connected through its router to its neighbors in four directions: north, south, east and west.

However, NoCs provide worse end-to-end latencies compared to on-chip bus solutions. Moreover, the coherence protocols cannot rely on a total message order imposed by the interconnect, because the NoC offers different paths between each source/destination pair. All in all, the emergence of Dark Silicon [10] imposes the introduction of several system-wide architectural methodologies to optimally exploit the available power budget, thus forcing an accurate power optimization of the NoC design as well. Last, the continuous increase in the bandwidth requirements imposed the introduction of bandwidth-aware optimization strategies on NoCs, even though they are traditionally considered high-bandwidth interconnects.

Figure 1.1: Dots-per-inch evolution for smartphone devices over the years.

For example, the mobile revolution enables high screen resolutions on portable devices that, nowadays, are expected to deliver the same user experience as desktop and server architectures. Figure 1.1 shows how smartphone screen characteristics have evolved during the last decade, highlighting that in the last two years this trend has accelerated with respect to the rest of the observation window.

In such a scenario, in which dedicated processors and high-resolution components are involved, a huge amount of data has to be computed and transmitted back and forth between processing elements and I/O devices, as fast as possible, to maintain an acceptable QoS.

Product      Screen resolution   Refresh rate   Bandwidth (Gbps)
Nexus 5X     1080x1920           60 Hz          2.9
Samsung S7   1440x2560           60 Hz          5.3
iPhone 7     1337x750            60 Hz          1.5
OnePlus 3    1080x1920           63 Hz          3.1

Table 1.1: Screen characteristics for different market products, and their frame buffer bandwidth.

For example, even the simple task of refreshing a smartphone's screen has become a data-intensive task that requires the movement of a great amount of traffic. The values in Table 1.1, which report the frame buffer bandwidth requirements of different market products, give a good insight into how critical the role of the interconnect is in transporting such an amount of traffic per second.
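As a back-of-the-envelope check of the figures in Table 1.1, the frame buffer bandwidth can be estimated as resolution times refresh rate times color depth. The minimal sketch below assumes an uncompressed 24-bit-per-pixel buffer fully refreshed at the nominal rate; the color depth is not stated in the table and is only an illustrative assumption.

    # Rough frame-buffer bandwidth: width * height * refresh rate * bits per pixel.
    # The 24 bit/pixel depth is an assumption, used only to show where the order of
    # magnitude of the values in Table 1.1 comes from.
    def framebuffer_bandwidth_gbps(width, height, refresh_hz, bits_per_pixel=24):
        return width * height * refresh_hz * bits_per_pixel / 1e9

    for name, w, h, hz in [("Nexus 5X", 1080, 1920, 60),
                           ("Samsung S7", 1440, 2560, 60),
                           ("OnePlus 3", 1080, 1920, 63)]:
        print(f"{name}: {framebuffer_bandwidth_gbps(w, h, hz):.1f} Gbps")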

In this scenario, data compression techniques deliver a viable solution to limit the bandwidth requirements, thus increasing the energy efficiency and the performance of the device. Compression techniques are organized in two broad categories. Cache Compression [1, 2, 9, 14] acts at the cache level by storing the cache lines in a compressed fashion, with a net increase in the cache capacity. On the other hand, end-to-end (E2E) compression [23, 22, 1] aims to compress (decompress) data at the injection (ejection) point of the interconnect. This implementation is transparent to the SoC design and aims at reducing the amount of data transferred through the SoC. It is also completely orthogonal to any other possible optimization.

In this work the E2E compression has been investigated as a viable means to reduce the traffic on the NoC without affecting the IP design, since it is transparent to the rest of the SoC. The on-chip network allows the transmission of messages between the CPUs and the memory system by transforming each message into a control or a data packet. Control packets are used to keep the coherence within the SoC, while data packets are used for data transfers. Each packet is further split into multiple flits, which are the atomic transmission unit in the NoC. A packet has a single head flit that carries routing information as well as the request/response command issued by the sender. Multiple body flits can follow to transfer the data associated with the message. Last, a single tail flit closes the packet. The tail flit is a special body flit that signals the end of the packet to the network resources. To this extent, only data packets can be compressed, since control packets are short and usually a single flit long.

Figure 1.2: Packet type distribution on different workloads from the SpecCPU2006 benchmark suite on a 4x4 2D-mesh with a channel width of 64 bit and a cache line size of 64 B. (Series: data packets vs. control packets.)

Figure 1.2 details the distribution of data packets with respect to the overall network packets on the SpecCPU2006 benchmark suite, showing that data packets account for only one third of the total.

Albeit the portion of data packets represents one third of the total traffic, their flit count holds the majority of the traffic that flows through the NoC, as shown in Figure 1.3. This highlights the fact that the traffic load is dominated by data packet flits. In particular, the number of flits that compose a data packet depends on the channel width and the cache line size.
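For instance, the 9-flit data packets of Figure 1.3 follow directly from these two parameters. A minimal sketch of the count, assuming one head flit plus enough payload flits to carry the whole cache line (the last payload flit acting as the tail flit), is shown below.

    import math

    # Flits per data packet: one head flit (routing and command information) plus
    # ceil(cache_line_bits / channel_width_bits) payload flits; the last payload flit
    # also acts as the tail flit that closes the packet.
    def data_packet_flits(cache_line_bytes=64, channel_width_bits=64):
        payload_flits = math.ceil(cache_line_bytes * 8 / channel_width_bits)
        return 1 + payload_flits

    print(data_packet_flits(64, 64))   # 9 flits for a 64 B line over a 64-bit channel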


Figure 1.3: Flit type distribution on different workloads from the SpecCPU2006 benchmark suite considering a 4x4 2D-mesh with a channel width of 64 bit and a cache line of 64 B. Data packets are 9 flits long and control packets are single-flit packets. (Series: data flits vs. control flits.)

Figure 1.4: Energy savings due to the introduction of the compression technique on different workloads from the SpecCPU2006 benchmark suite. (Series: golden model.)

To further strengthen the energy saving opportunities offered by E2E compression techniques, Figure 1.4 reports the energy savings for 11 SpecCPU2006 benchmarks when each data packet is shortened to a single-flit packet, i.e., using a golden compression model that always perfectly compresses each data packet. The golden compression model shows an average 60% energy reduction with respect to the baseline, non-compressed NoC traffic.

1.1 Goals and Contributions

This thesis aims to provide an analysis of the impact of E2E compression on the NoC in terms of performance, power consumption and the overheads introduced by its utilization. Architectural modifications to the system will also be presented, as well as a parallel compression scheme that exploits the parallel execution of multiple compression modules to obtain better results. Our contributions with this thesis account for both a qualitative and a quantitative description of the implications of introducing the E2E compression technique on the NoC.

Qualitatively, we will give the following contributions:

• we analyze the impact of E2E compression on energy consumption, describing the effectiveness of such an optimization as a power saver;

• we investigate the usefulness of a policy-based E2E compression approach, which selects a subset of packets eligible for compression, against an approach that compresses every data packet;

• we analyze the impact of E2E compression on performance, explaining why compression is not an optimization technique capable of boosting the system's speedup;

• we present a parallel compression model capable of executing multiple compression algorithms in parallel, describing how such an implementation can give better results in terms of compression ratio and compression stability, compared to the baseline architecture.

We will also quantitatively describe the impact of compression in terms

of overheads:

• We analyze the impact of the energy overhead, due to the utilization of the compression/decompression blocks, against the benefits derived from compressing packets, discussing the presence of a break-even point;

• We discuss the increased latency experienced by data packets due to the extra cycles introduced to perform compression on the data.

1.2 Thesis Structure

The rest of the thesis is structured in four chapters. Chapter 2 describes the state of the art in compression techniques, both cache and end-to-end, already researched for NoCs. Chapter 3 provides a detailed description of the methodology used for the analysis and of the architectural modifications made to the system. Chapter 4 presents the results obtained with a real-application benchmark suite. Finally, Chapter 5 draws the conclusions of this work.


Chapter 2

State of the Art

Data compression is a fairly new technique introduced in the Chip Multiprocessor (CMP) environment to exploit the tradeoff between power consumption and performance by lowering the amount of traffic and, consequently, the dynamic power consumption due to reduced switching activity in routers. This chapter reviews the state of the art on data compression in NoCs as a way to improve overall performance while limiting energy dissipation.

In particular, some of these methodologies focus on expanding the available memory capacity by applying compression on data to store more compressed lines into the same cache space, while others apply compression between source and destination by compressing packets before injection and decompressing them on reception, working at the NIC level. These techniques are usually referred to as cache compression and compression at the NIC (also known as end-to-end compression), and they are further characterized by the type of compression algorithm employed.

The common key concept behind the development of such techniques is the redundancy in data packets: Ekman et al. [9] show how traffic patterns such as long runs of 0's and 1's occur very frequently in different workloads, highlighting the possibility of reducing this redundancy by applying compression on data.

Pekhimenko et al. [14] give more details on how redundancies can appear in data blocks and show that these occurrences are widely present in many applications and workloads. They show that, in addition to zero/one runs, redundancy can be found in two other types of patterns: repeated values, defined as a single, small value repeated multiple times in a contiguous region of memory, and narrow values, small data types stored in memory using a larger data type than required, leading to a potential waste of usable memory space. They also note that all the above patterns fall under the notion of low dynamic range: a set of values where the differences between them are much smaller than the values themselves.

In practice, two major types of compression algorithms are used. The first one, exploiting the aforementioned concept of redundancy in data, takes a part of the data itself, known as the base, and computes a set of differences from the other portions of the data. This approach depends on the content of the data itself and is less deterministic in the achievable compression ratio. The second type relies on the comparison of the data against a well-defined set of patterns, which include repeated symbols, runs of 0's or 1's, or other frequently observed patterns, and substitutes the data with a smaller code that represents the type of pattern detected. This last type is less data dependent and thus more deterministic, but at the same time it can achieve a lower compression ratio compared with the former.

Starting from these motivations, different methodologies have been developed to address in different ways the need for more memory bandwidth and to avoid increased packet latency and high dynamic power consumption in NoCs by exploiting the data-intrinsic characteristics described above.

2.1 Cache Compression

Cache Compression is a type of compression scheme that aims to increase the effective cache capacity by fitting more compressed cache lines into a single physical line.

Alameldeen et al. [1] describe how good results can be achieved by applying the Frequent Pattern Compression (FPC) technique on L2 caches.

FPC compresses and decompresses data on a cache-line basis by comparing the actual data against a fixed pattern table, derived from previous analyses of different workloads. Cache lines are split into 32-bit words, and each of them can be encoded with a 3-bit prefix plus the remaining data. If a word matches one entry of the frequent pattern table, it is encoded and stored in the cache; if no match is found, the word is stored as it is.
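A minimal sketch of this per-word encoding is shown below; the pattern set and prefix codes are illustrative assumptions and not the exact frequent-pattern table used by FPC.

    # Illustrative FPC-style encoding: each 32-bit word gets a 3-bit prefix plus a
    # variable-length payload. The patterns and prefixes below are assumptions chosen
    # for illustration, not the actual table from the FPC papers.
    def encode_word(word):
        assert 0 <= word < 2**32
        if word == 0:
            return "000", 0            # all-zero word, no payload
        if word < 2**8:
            return "001", 8            # narrow value, fits in one byte
        if word < 2**16:
            return "010", 16           # narrow value, fits in a halfword
        if len(set(word.to_bytes(4, "big"))) == 1:
            return "011", 8            # single repeated byte
        return "111", 32               # no match: stored uncompressed

    def compressed_bits(words):
        return sum(3 + encode_word(w)[1] for w in words)   # 3-bit prefix per word

    line = [0x0, 0x12, 0x1234, 0xAAAAAAAA, 0xDEADBEEF] + [0x0] * 11   # 16 words = 64 B
    print(compressed_bits(line), "bits instead of", 16 * 32)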

In order to effectively increase the potential cache size, a cache line must be able to pack more compressed cache lines than uncompressed lines into the same space. Potentially, one cache line can be compressed to any number of bits: this leads to the necessity of maintaining a cache organization that does not break data accessibility and in which compressed and uncompressed lines are properly padded to avoid misalignments. To address this, the authors introduced a modification of their scheme, called Segmented Frequent Pattern Compression, that increases the L2 granularity to 8-byte segments (cache lines are seen as groups of segments), eventually padded with zeroes to reach a multiple of a segment, adding overhead in terms of area occupation. As this methodology keeps data in the L1 caches in an uncompressed form and compresses it only when stored into the L2 cache banks, this overhead remains negligible compared with the introduced optimizations and with the added benefit of the achievable compression ratio, up to 52%.

Das et al. [7] show how Frequent Pattern Compression can be practically applied to L2 caches. They highlight some management aspects to be taken into account when introducing FPC in the cache design, and in particular they describe how to avoid data access issues and misalignment due to the variable-length compression scheme of FPC. More precisely, the authors define a compression phase, called Compaction, that takes place under two circumstances: on a write hit, when the data to be written is bigger than the original compressed line, and on replacements, when the line chosen to be replaced does not have the same size as the new line that will replace it. Compaction avoids the generation of holes inside a cache line, compacting the data into contiguous space and padding it with 0's to maintain alignment at the segment level. Because of this possible line fragmentation, the authors also introduced the concept of invariants to check the correctness of the data inside every compressed line. All these modifications come with an overhead in terms of area and power consumption of 0.183 mm2 and 0.287 W, respectively, with latency reductions of 21% on average, a dynamic power dissipation reduction of 7% on average and a CPI reduction of about 7%, compared with a baseline architecture without compression.

Pekhimenko et al. [14], differently from previous works, exploited another compression mechanism, called Base Delta Immediate Compression, as a fast, simple and effective alternative for implementing cache compression. Since high-level data structures (such as C structs or arrays) are commonly used in applications to represent large pieces of data and, because of the nature of the computation itself, a large amount of data patterns can be stored in less than their actual size, the authors exploited this behavior to apply Base Delta compression with the goal of diminishing both network latency and power consumption in a lightweight fashion.

In fact, Base Delta compression relies on a modest hardware module composed of simple adders and a mask, both in the compressor and in the decompressor, because the methodology needs only the actual cache line to start compression: the idea is to represent the initial cache line as a common base value and a sequence of deltas, defined as the difference between the base and the portion of the line being compressed. The algorithm views a cache line as a set of fixed-size values, multiples of 8 bytes. First of all, a base has to be defined: for the Base Delta case, the authors adopted a simple base selection policy in which the first 8 bytes of the line are taken as the base; after this step, B+Δ computes the deltas with respect to the previously selected base, and it compresses the cache line if, and only if, the total size of the base and the deltas is strictly smaller than the cache line itself.
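A minimal sketch of this compressibility check is shown below, assuming the line is viewed as eight 8-byte values and that every delta is stored with the same, smallest sufficient width.

    # Illustrative B+Delta check on a 64 B line seen as eight 8-byte values: the line
    # is stored compressed only if base + deltas take strictly less space than the line.
    def base_delta_size(values, base_bytes=8):
        base = values[0]                            # simple policy: first value is the base
        deltas = [v - base for v in values]
        for delta_bytes in (1, 2, 4):               # smallest width that fits every signed delta
            lo, hi = -(1 << (8 * delta_bytes - 1)), (1 << (8 * delta_bytes - 1)) - 1
            if all(lo <= d <= hi for d in deltas):
                compressed = base_bytes + delta_bytes * len(values)
                original = base_bytes * len(values)
                return compressed if compressed < original else original
        return base_bytes * len(values)             # incompressible: keep the original size

    line = [0x1000, 0x1008, 0x1010, 0x1018, 0x1020, 0x1028, 0x1030, 0x1038]
    print(base_delta_size(line), "bytes instead of", 8 * len(line))    # 16 instead of 64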

In order to obtain better compressibility, the authors added a second base to the baseline compression scheme, to also account for immediate values such as zero runs, repeated and narrow values, calling the result Base-Delta-Immediate (BΔI) Cache Compression. This addition does not significantly impact complexity and area, but it increases the achievable compression level. This implementation, compared with Zero Content Augmented Cache Compression, Frequent Value Compression and Frequent Pattern Compression, provides a simple yet efficient technique that yields a high effective cache capacity increase and a system performance improvement; results show that BΔI improves system performance by about 8% for single-core workloads and by 9.5% and 11.2% for two-core and four-core workloads, respectively, compared with the three aforementioned algorithms.

Villa et al. [20] have instead proposed a cache compression mechanism called Dynamic Zero Compression (DZC), which exploits the prevalence of zero bytes in SRAM cache banks with the goal of reducing energy consumption. In order to support the DZC mechanism, each cache byte is coupled with a Zero Indicator Bit (ZIB) that indicates whether its associated byte contains all zero bits.

On a cache write, the mechanism checks whether all eight bits of the byte are zero, and if so it writes only the ZIB. If the byte is not zero, it is written normally to the cache. On a read access, DZC employs a local byte word-gating circuitry, controlled by the ZIB, to avoid swinging the bitlines: it is turned on only if the ZIB is not set, otherwise it remains off to save energy. Modifications are also made to the bus drivers connecting the sub-banks to the CPU, in order to avoid driving the I/O buses for zero bytes. These introductions have their cost: the area overhead is approximately 9% more than the baseline architecture, mostly due to the ZIBs added to the banks. In terms of energy breakdown, the DZC cache adds 5 pJ and 6.7 pJ of energy consumption for a read and a write of a zero byte, respectively. For a non-zero byte, those values increase to 11.4 pJ for a read and 26.9 pJ for a write on the DZC cache. DZC also introduces a timing delay, due to the added circuitry that supports DZC itself, corresponding to a total of 2 FO4 gate delays for a read access.

Simulation results on the MediaBench benchmark suite show that Dynamic Zero Compression achieves a 26% energy reduction on data cache accesses and a 10% reduction on instruction cache accesses.
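A minimal functional sketch of the ZIB mechanism described above is given below; it models only the read/write behaviour, not the circuit-level gating.

    # Functional model of Dynamic Zero Compression: every byte is paired with a Zero
    # Indicator Bit (ZIB). Writing 0x00 only sets the ZIB; reading a byte whose ZIB is
    # set returns zero without accessing the data array (which is what saves bitline
    # energy in the real circuit).
    class DZCLine:
        def __init__(self, size=64):
            self.data = bytearray(size)
            self.zib = [True] * size               # every byte starts as zero

        def write(self, i, byte):
            self.zib[i] = (byte == 0)
            if byte != 0:                          # non-zero bytes are written normally
                self.data[i] = byte

        def read(self, i):
            return 0 if self.zib[i] else self.data[i]

    line = DZCLine()
    line.write(3, 0xAB)
    print(line.read(3), line.read(4))              # 171 0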

2.2 End-to-End Compression

End-to-End Compression is another way to employ data compression in NoCs to achieve lower network latencies and power consumption.

Even if cache compression introduces benefits in terms of lowering power consumption and increasing the effective cache capacity, it adds design complexity and hardware modifications to the cache banks; researchers have found that moving the compression/decompression modules to the Network Interface Controller can also reach significant results, this time with the objective of compressing data only when they have to be transmitted over the network and, consequently, decompressing them when ejected from it. This approach avoids cache hardware modifications and follows a plug-and-play behaviour that makes the solution flexible and scalable. It has to be noted that, in order to work well, each tile has to host at least a pair of compressor/decompressor modules, which can become a drawback if the complexity of the modules grows.

Zhu et al. [23] applied this end-to-end paradigm by exploiting a compression scheme related to the previously described Frequent Pattern Compression, called Frequent Value Compression (FVC). In their work, the authors arranged the scheme in such a way that compression completes before the traffic is injected into the network, and decompression starts as soon as the header flit arrives at the NIC, without modifying any other existing module.

The FVC mechanism relies on the observation that many traffic patterns appear very frequently during execution and that each workload has its own set of frequent values. By combining these two aspects, the authors developed an adaptive table-based compression scheme that adapts itself to the workload, leading to a more flexible compression scheme that can achieve better results compared to fixed table-based ones.

To ensure that compression and decompression can be accomplished without loss of data, every pair of source and destination tiles has to maintain a so-called Frequent Value Table (FV table) in a synchronized way on both sides: before injecting a data message, it is matched against the FV table and, if there is a hit, the message is replaced by the index of the corresponding entry in the FV table. On reception, if the corresponding compression tag is set, the compressed message is looked up in the receiver's FV table and, thanks to the synchronization, the original message is decoded without problems. In order to gain runtime adaptivity, and hence better results, an LRU replacement policy is applied to adapt the FV table content to changes in the traffic patterns. Of course, every time a value is replaced in one table, that modification has to be forwarded to every other table to maintain FV-table coherency.

Finally, with the introduction of such a compression scheme, the router power consumption is reduced by 16.7% and the CPI by about 23.5% compared to the baseline model.

Besides their cache compression implementation, Das et al. [7] also introduced a compression-in-the-NIC methodology that exploits the Frequent Pattern Compression algorithm to gain benefits also when implemented with an end-to-end approach. Since the compression algorithm has already been described, the authors focused on the analysis of the compression and decompression latencies and of the area overheads, and on how they impact the overall performance.

Compression and decompression come at a cost: both introduce extra execution cycles and extra dynamic power dissipation. To face this problem, the adopted solution is that, since FPC can be completed in five cycles, this delay can be hidden by exploiting a pipelined router model; on the other hand, decompression does not have to wait until the tail flit has arrived at the NI: once the header flit has been received, decoding of the message can start, resulting in a visible extra latency of only two cycles. This added latency does not increase the overall execution latency; rather, the methodology reduces it by about 20% compared to the baseline execution. Finally, thanks to the simplicity of the hardware modules that implement compression, the power overhead remains negligible compared to the power savings introduced by the methodology itself, about 10% improvement on average.

Differently, Zhan et al. [22] applied the end-to-end mechanism to NoCs by exploiting the Delta Compression algorithm to reduce traffic. They highlight the importance of the plug-and-play capability and, more precisely, the fact that introducing such a scheme is orthogonal and complementary to any other optimization in other blocks.

NoΔ is an adaptation of the Delta Compression scheme for the NoC environment: starting from the common key concept of the low dynamic range of values, the authors aim to represent a data message as a common base value of fixed size plus an array of relative differences from that base.

They also describe a compression policy to determine at runtime whether a data message can be compressed. Assuming that the data portion of the message is D bytes, the base value is B bytes and the other values are represented as an array of differences Δi, NoΔ declares the message compressible if and only if

max{size(Δi)} < B, ∀i ∈ {1, 2, ..., n}.

This methodology focuses not only on how to physically implement a lightweight module for compression and decompression inside each NI of the network, but also on the calculation of the optimal base size, and it describes a multiple-bases variant that tries in parallel to compress a data message using different bases with different fixed sizes to achieve a better compression ratio.
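A minimal sketch of the NoΔ compressibility condition and of the multiple-bases idea follows; the candidate base sizes and the delta-width accounting are illustrative assumptions.

    # NoDelta-style test: with a B-byte base, a block is compressible iff every delta
    # from the base fits in strictly fewer than B bytes (max size(delta_i) < B).
    def delta_bytes(delta):
        width = 1                                  # bytes needed to store a signed delta
        while not (-(1 << (8 * width - 1)) <= delta < (1 << (8 * width - 1))):
            width += 1
        return width

    def best_base_size(block, candidate_bases=(2, 4, 8)):
        # Multiple-bases variant: try several fixed base sizes "in parallel" and keep
        # the smallest compressed size among the bases that satisfy the condition.
        best = None
        for b in candidate_bases:
            values = [int.from_bytes(block[i:i + b], "little", signed=True)
                      for i in range(0, len(block), b)]
            widths = [delta_bytes(v - values[0]) for v in values]
            if max(widths) < b:                    # the NoDelta compressibility condition
                size = b + max(widths) * len(values)
                best = size if best is None else min(best, size)
        return best                                # None: incompressible with every base

    block = (1000).to_bytes(8, "little", signed=True) * 8      # eight identical 8-byte values
    print(best_base_size(block), "bytes instead of", len(block))        # 16 instead of 64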

The evaluation of such a technique on the SpecCPU2006 benchmark suite highlights the potential benefits of NoΔ: it can achieve an average compression ratio of 21.1%, with a latency reduction of 10.1% and a network load reduction of 13.1% compared to the baseline model. Also, since compressing data lightens the network load, NoΔ can lower the dynamic power consumption by about 11% on average, while the power overhead of each compressor/decompressor pair accounts for only 1 mW of dynamic power and 9.6 uW of leakage power.

Pekhimenko et al. [13] focused instead on a different observation: compression, besides its capability to decrease dynamic power consumption by decreasing the network traffic, increases the bit toggle count of each compressed packet, a fact that can increase the dynamic power consumption. The bit toggle count is defined as the number of switches from 0 to 1, or vice versa, on a communication channel or link between two consecutively transmitted packets; the higher the count, the more switching occurs on the links and, consequently, the higher the dynamic energy.
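A minimal sketch of the toggle metric as defined above is shown below; the example values are arbitrary and only illustrate that a compressed, denser stream can toggle more than a repetitive uncompressed one.

    # Bit toggle count: number of bit positions that switch between two consecutive
    # flits on a link; summed over a stream it approximates link switching activity.
    def toggles(prev_flit, next_flit):
        return bin(prev_flit ^ next_flit).count("1")

    def stream_toggles(flits):
        return sum(toggles(a, b) for a, b in zip(flits, flits[1:]))

    uncompressed = [0x0000FFFF, 0x0000FFFF, 0x0000FFFF]        # repetitive data: 0 toggles
    compressed = [0xF0F0F0F0, 0x0F0F0F0F, 0xF0F0F0F0]          # denser data: 64 toggles
    print(stream_toggles(uncompressed), stream_toggles(compressed))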

Starting from this observation, the authors developed a hardware mechanism to exploit the tradeoff between the savings introduced by compression and the power overhead due to the increased bit toggle count: the idea is to prevent the bit toggle count from increasing too much across consecutive data transmissions. To correctly manage this decision policy, an Energy Control module is introduced with the goal of correctly predicting whether the next message should be transmitted on the channel in a compressed form or not.

It uses an activation function that includes several metrics to determine compressibility: the bit toggle count for the uncompressed packet, the bit toggle count for the same packet when compressed, the current bandwidth utilization and the actual compression ratio. Also, to maintain data alignment in the data storage, a Metadata Consolidation phase plays a big role in gathering all the compression-related bits and moving them into contiguous regions. This selection policy gives good results in terms of bit toggle count reduction (6 to 16%), but on the other hand it limits the achievable compression ratio.

Differently from other works, this toggle-aware compression mechanism focuses more on the runtime decision of whether to compress a packet, based on the data actually transmitted on the physical links; it does not describe a specific compression type, but uses six different algorithms, including the aforementioned Frequent Pattern Compression and Base-Delta-Immediate. Because different compression algorithms compress data in different ways and with different entropies, more than one compression algorithm is used in order to have better chances of reaching a high compression ratio with the minimum bit toggle count among the algorithms.


Chapter 3

Evaluation Methodology

This chapter presents the evaluation methodology that we used to analyze the impact of the compression mechanism on the NoC. It describes how we set up the analysis environment, which tools have been used and how they interact with each other in order to retrieve the desired data. An architectural description of the modifications made to the NoC is presented, to give a better understanding of where the considered modules are placed and what their impact is on the system, as well as a description of the two compression techniques used in the evaluation. A simulation model of the compression and decompression blocks is given to clarify how we simulated those components and how we modeled them with the objective of capturing their behaviour in the simulated environment.

The rest of the chapter is organized as follows. Section 3.1 describes the architectural modifications made to the NoC to support compression, Section 3.2 gives a detailed description of the compression techniques selected for our evaluation. Finally, Section 3.3 describes the analysis flow and the tools used to retrieve the results presented in Chapter 4.


3.1 Architectural Description

Figure 3.1 shows the architectural building blocks of a Network Interface Controller (NIC) and its position inside a NoC. The NIC has the role of mediating communication between its IP core and the rest of the system through its attached router.

Figure 3.1: Graphic representation of the baseline NIC, its positioning inside a NoC and its internal blocks.

As we can see from the figure, it is composed of two main flows. The first one stores messages coming from the IP block into a message buffer, waiting to be flitisized. At every wake-up, the NIC checks whether there is a ready message in the message buffer, and if so it flitisizes it and stores the flits into one of the NIC's output flit buffers. On the other side, at each wake-up the NIC also checks whether there is a complete sequence of flits in one of its input flit buffers, and if so it picks them up and reconstructs the original message. The reconstructed message is then queued into an input message buffer that holds all the messages destined to the IP block.
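A minimal behavioural sketch of these two flows, assuming one pending outgoing message and one fully received packet are handled per wake-up, is shown below; queue names are illustrative and the head flit carrying routing information is omitted.

    from collections import deque

    FLIT_BYTES = 8      # 64-bit channel

    # Sketch of the baseline NIC wake-up: (1) flitisize one pending outgoing message
    # into an output flit buffer; (2) reassemble one complete incoming packet into a
    # message for the IP block. The head flit with routing information is omitted.
    def wake_up(out_messages, out_flit_buffers, in_flit_buffers, in_messages):
        if out_messages:                                            # injection side
            msg = out_messages.popleft()
            flits = [msg[i:i + FLIT_BYTES] for i in range(0, len(msg), FLIT_BYTES)]
            out_flit_buffers.append(flits)
        if in_flit_buffers:                                         # ejection side
            flits = in_flit_buffers.popleft()                       # a complete packet only
            in_messages.append(b"".join(flits))                     # reconstructed message

    out_m, out_f = deque([b"A" * 64]), deque()
    in_f, in_m = deque([[b"B" * 8] * 8]), deque()
    wake_up(out_m, out_f, in_f, in_m)
    print(len(out_f[0]), len(in_m[0]))                              # 8 flits, 64-byte message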

It is noticeable that any modification done at the NIC level, in terms of insertion of hardware modules, does not affect the rest of the system, because of the NIC's intrinsic goal of maintaining transparency between the IP blocks and the rest of the network.

Encapsulating the compression modules into the NIC therefore results in a mechanism that is transparent with respect to the rest of the system: in this way its introduction is orthogonal to other possible optimizations and gives more flexibility in the design phase.

Figure 3.2: Graphic representation of the NIC with compression and decompression modules in place.

We modified the presented baseline architecture by placing a pair of compression and decompression modules in each node's NIC, as shown in Figure 3.2.

In particular, the compression block is placed between the output message buffer, which accepts messages from the cache controller, and the hardware block that packetizes the message into flits. If the message is compressible, i.e. if it is a data message, it advances to the compression stage. Once compression has finished, the message goes to the following stage, which flitisizes it, sets the compression flag in the header flit, and stores the flits into one of the output flit buffers of the NIC, waiting to be injected into the NoC.

The decompression block is placed between the hardware block that reconstructs the message and the input message buffer that serves the cache controller. Once the message has been reassembled, its compression flag is checked, and if it is a compressed message it advances to the decompression stage, otherwise it is bypassed directly into the input message buffer that serves the cache controller.


3.1.1 Compression Block

The architectural representation of our compression block is shown in Figure 3.3. It consists of several compressor modules placed in parallel, each one characterized by a different algorithm and a different delay to perform compression on data. Once a message arrives at the input of the block, it is forwarded simultaneously to all the compressors and, depending on their current state, they start compressing the message or store it into their waiting queues.

Compressor modules are grouped in two sets, depending on their delay characteristics: low-latency and high-latency response modules. A low-latency response module is a compressor that can produce an output in a small number of clock cycles, typically close to 1, while high-latency response modules may take more cycles to produce an output, typically 5 cycles or more.

A Compression Controller (CC) is needed to perform the output selection among the outputs of the compressors. Each module is attached to the CC and signals its completion and the compression ratio it obtained on the message.

Figure 3.3: Internal composition of the compression block.


Because we need to ensure that only one compressor's output is granted among all the possible compressors, a simple selection policy is defined to grant the output through the rightmost multiplexer shown in Figure 3.3.

3.1.2 Output Selection Policy

The selection policy employed by the CC is described here in its most general form, so that it is applicable to an arbitrary number of modules.

If a low-latency module fires its done signal to the CC, its ratio value is checked against a 50% threshold: if it is equal to or higher than that threshold, the CC directly grants its output to the multiplexer, aborting all the other active tasks. By doing so, the CC speculates on the achievable compression ratio, favouring low-latency response as long as a decent compression ratio is reached.

If more than one low-latency module fires at the same time, they are checked against the same threshold, and the CC grants the output of the compressor that, among the modules whose ratio exceeds the threshold, achieves the highest compression ratio.

For high-latency compressors, instead, the CC does not perform any threshold filtering. In fact, if the policy has to compare high-latency modules, it means that the speculation on low-latency modules failed. The CC therefore simply compares the ratio values of all the compressors after their termination and grants the output of the module with the highest ratio among all the compressors used, both low- and high-latency ones.

This second rule is based on the fact that it is not worth speculating on latency when comparing high-latency modules: it is better to wait for all the compressors to finish and hope for a better result than the low-latency ones. Since it is not known whether the high-latency modules will actually achieve a better result than the low-latency ones (whose ratio is, in this case, below 50%), the outputs of all the modules are checked, to make sure that the module with the highest compression ratio is granted.

Once the output is granted, the CC also forwards the information about which compressor has been selected; this information is embedded into the header flit to guarantee correct decompression at the receiver.
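As a reference, the following Python sketch gives a minimal behavioural model of this selection policy. It assumes all compressor results are already available and abstracts away the timing and abort signalling of the real hardware; the names used (CompressionResult, select_output) are illustrative only and do not come from the thesis implementation.

from dataclasses import dataclass
from typing import List, Optional

THRESHOLD = 0.50  # speculation threshold for low-latency modules

@dataclass
class CompressionResult:
    module_id: int      # compressor identifier, embedded in the header flit
    ratio: float        # achieved compression ratio (fraction of bytes saved)
    low_latency: bool   # True for low-latency response modules

def select_output(results: List[CompressionResult]) -> Optional[CompressionResult]:
    """Return the compressor result granted by the CC."""
    # Rule 1: speculate on low-latency modules that reach the threshold.
    fast_hits = [r for r in results if r.low_latency and r.ratio >= THRESHOLD]
    if fast_hits:
        return max(fast_hits, key=lambda r: r.ratio)   # abort the other tasks
    # Rule 2: speculation failed, so wait for every module and grant the
    # highest ratio among all compressors, low- and high-latency alike.
    return max(results, key=lambda r: r.ratio) if results else None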

The output lines of each compressor are as wide as the original message size, because in the worst case of no compression the message is forwarded without modification. Therefore, in order to pass the correct output to the next stage, the output of the multiplexer is filtered with a bitmask that keeps only the amount of data actually representing the message, as shown in figure 3.4.

Figure 3.4: Detail of the output selection and filtering performed by the multiplexer. It selects N bits out of a line width of B bits, assuming N ≤ B.

3.1.3 Decompression Block

Similarly to the compression block, the decompression block consists of a vector of decompressor modules placed in parallel.

Once a message has been reconstructed from its flits, its compression flag is checked to detect whether it needs to be decompressed before being sent to the cache controller's input buffers.

As the message arrives at the decompression block, the Decompression Controller (DC) selects the correct multiplexer input line based on the compression type flag retrieved from the header flit in the previous stage. This guarantees that a message compressed with a specific compressor module is decompressed with the corresponding decompressor module employing the same algorithm.

The DC is also employed to select the correct decompressor module's output, so that the decoded message is delivered to the following input message buffer. The decompression block and its constituent sub-blocks are represented in figure 3.5.


Figure 3.5: Internal composition of the decompression block.

3.1.4 Simulation Model

We modeled our compression and decompression blocks as two delay queues that directly add D queueing-delay cycles to each message, where D varies with the compression algorithm used.
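A minimal sketch of this delay-queue abstraction is given below; the class and method names are illustrative and do not correspond to the GEM5 sources.

from collections import deque

class DelayQueue:
    """Releases each message D cycles after it has been enqueued."""
    def __init__(self, delay_cycles: int):
        self.delay = delay_cycles
        self.pending = deque()                    # (release_cycle, message) pairs

    def push(self, message, current_cycle: int) -> None:
        self.pending.append((current_cycle + self.delay, message))

    def pop_ready(self, current_cycle: int) -> list:
        """Return the messages whose added delay has elapsed."""
        ready = []
        while self.pending and self.pending[0][0] <= current_cycle:
            ready.append(self.pending.popleft()[1])
        return ready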


3.2 Compression Techniques

There are plenty of compression techniques available, each one employing different algorithms and hardware designs. Most of them fall under two sets, defined as table-based and content-aware algorithms: the former use tables containing patterns associated with a code, which are matched against the content of the analyzed data [1, 23, 7]. If there is a match, the matched portion of data is substituted with its associated code in the table. Tables can be adapted to the traffic, in order to better follow its behaviour, by substituting table entries at runtime [23]. The latter instead exploit the content of the data, as described in chapter 2, in order to transmit less data that can then be reconstructed without the use of additional elements.

We selected two methodologies belonging to this last set, because implementing an E2E table-based algorithm has the following major drawbacks:

• using tables requires each C/D module to have at least one table to perform the compression and decompression phases, as described in [1]. This results in extra area and power overheads, because the tables become physical registers once synthesized on chip.

• updating the entries of those tables requires a coherence mechanism to keep all the modules consistent, as described in [23]; this implies an increase in traffic injection that can be potentially counterproductive with respect to the goal of lightening the traffic.

Beyond these limitations, our selection has also been influenced by the fact that we want to highlight the benefits deriving from compression without incurring the problems of complex design schemes that heavily affect power and area overheads. Content-aware algorithms have very small area and power overheads, as described in [22], and are hence the most suitable options for our investigation.

The following subsections detail the characteristics of the Delta compression and Zero compression mechanisms used in this thesis to obtain the results outlined in chapter 4.


3.2.1 Delta Compression

Delta compression is a method to transmit data in the form of differences rather than as the complete data set. This is possible thanks to the important observation that most memory and cache data have a low dynamic range [22, 14], for different reasons: programs usually arrange data in arrays, register values and memory addresses change slowly across a large number of instructions, similar data values are often grouped together, and, more in general, memory accesses exhibit spatial and temporal locality.

The original data can then be seen as a common base value of fixed size plus an array of relative differences, called ∆s. We selected as base the first flit of the data, which is originally composed of N flits.

Other implementations explore the possibility of using a base different from the first flit, but results show that the best achievements in terms of compression ratio, power and area overheads are obtained by selecting the first flit as the base [14].

Figure 3.6: First execution phase of Delta compression.

In order to perform compression, the algorithm compares the base flit with all the remaining flits by byte-wise subtracting each considered flit from the base, filling a compression mask with 0 if the byte corresponding to that position in the mask shows no difference with respect to the base, or with 1 otherwise, as described in figure 3.6. This phase requires only N−1 subtractors, where N is the total number of data flits analyzed.

Once the compression mask is filled, the algorithm analyzes it at fixed positions, starting from an offset that represents the i-th flit inside the whole mask.

Figure 3.7: Second execution phase of Delta compression.

By finding, within each group of B bits starting from the flit offset, the position of the leftmost zero that has no 1s on its right, the algorithm obtains the resulting savings for that flit and passes that value to a multiplexer that keeps only the relevant portion of data, as shown in figure 3.7.

Given a flit size of B bytes and a payload of N flits, the compression mask occupies (N − 1) × B bits.

Finally, the generated ∆s are packed together into a new compressed message and passed to the next NIC stage.

With this implementation, considering a flit size of B bytes and a data message composed of 1 head flit and N body flits, we obtain a data packet formed by 1 head flit, 1 body flit carrying the base, and a reduced number of body flits containing the ∆s and the compression mask.

The entire compression phase takes D cycles to complete, which are accounted for in the simulations as additional packet queueing delay.

At the receiving side, the decompression module is triggered every time

the head flit of a data packet has the compression flag set, and it starts its

restoring phase.


Similarly to the compressor, the decompressor uses a vector of N−1 byte-wise adders in parallel, taking as first operand the base flit and as second operand each ∆; based on the additional information attached, it can restore the original flits. As for the compression phase, its cost in terms of delay is D cycles.

The entire operation of compressing and decompressing a packet therefore adds 2 × D cycles to its queueing latency. Differently from other implementations, in which the compression and decompression delays are masked by pipeline improvements [22] and account for only one extra delay cycle, we wanted to analyze the worst-case scenario of such a technique, without any other optimization present in the system.
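The following Python sketch is a simplified software model of the scheme just described: the first flit acts as the base, every other flit is reduced to its byte-wise differences from the base, and a per-byte mask records which positions actually differ. Flit packing and the per-flit savings logic of figure 3.7 are simplified, and the function names are illustrative, not taken from the implementation.

FLIT_BYTES = 8  # B: flit size in bytes (see Table 4.1)

def delta_compress(flits):
    """flits: list of FLIT_BYTES-long byte strings; flits[0] is the base."""
    base, mask, deltas = flits[0], [], bytearray()
    for flit in flits[1:]:
        for pos in range(FLIT_BYTES):
            diff = (flit[pos] - base[pos]) & 0xFF
            if diff == 0:
                mask.append(0)          # byte equal to the base: mask bit only
            else:
                mask.append(1)          # byte differs: keep its delta
                deltas.append(diff)
    return base, mask, bytes(deltas)

def delta_decompress(base, mask, deltas, n_flits):
    """Rebuild the original n_flits flits from base, mask and deltas."""
    flits, d = [base], iter(deltas)
    for i in range(n_flits - 1):
        flit = bytearray(base)
        for pos in range(FLIT_BYTES):
            if mask[i * FLIT_BYTES + pos]:
                flit[pos] = (base[pos] + next(d)) & 0xFF
        flits.append(bytes(flit))
    return flits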

3.2.2 Zero Compression

Zero compression is a simple compression algorithm that aims to represent

every zero-content byte present in the data with a single zero-bit.

Figure 3.8: Zero compression mechanism.

The algorithm analyzes each byte of the message by feeding its bits to an OR gate, to check whether the byte is entirely zero, as described in Figure 3.8. If so, the compression mask is updated with a 1 in the corresponding byte position, to keep track of the offset of the compressed byte, and a single zero bit is appended to the new message. If the byte is not zero, it is not compressible: it is passed as it is into the compressed message, and a zero bit, indicating that the byte is not compressed, is set in the compression mask. A compression flag is set in the header flit, which will later be used to trigger the decompression phase at the receiving side. The compressed message is then passed to the next NIC stage.

The compression mask is N × B bits wide, assuming that a flit is composed of B bytes and the payload of N flits; it represents the overhead that has to be attached to the compressed packet in order to correctly restore the original data.

With such an implementation, considering that an uncompressed data packet is N flits long with each flit composed of B bytes, in the best case we can shrink it to a total of 1 header flit, 1 body flit carrying the zero bits and 1 body flit for the compression mask. The compression phase accounts for D cycles of execution, which are directly added to the packet's queueing latency.

On the receiving side, the decompressor is triggered every time a packet has its compression flag set. Based on the information contained in the compression mask, it expands back to zero the bytes that are marked with a 1 in the mask, and finally the original data is regenerated.

Like the compression phase, decompression takes D additional execution cycles to produce its output, which are accounted for as additional packet queueing latency.

In total, this implementation adds 2 × D cycles to the packet's queueing delay.

3.2.3 Specialized Output Selection Policy

In our evaluation methodology we wanted to capture the impact of compression by using the two aforementioned techniques, paired together as described in section 3.1.1.

We made the following modifications to the general selection policy presented in section 3.1.2, in order to use it with the delta and zero compression modules.

Based on their characteristic delays, Zero compression acts as a low-latency module and Delta compression as a high-latency one.

Our specialized output selection policy is therefore simpler than the general one: the low-latency module's ratio is directly checked against the 50% threshold, without waiting for other low-latency modules, because it is the only one present in the compression block. If its ratio exceeds the threshold, its output is granted by the CC and the high-latency task is aborted.

If not, the CC also waits for the high-latency module to terminate, and grants the output to the compressor module with the greatest compression ratio between the two.
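Under these assumptions, the specialized policy reduces to the simple choice sketched below (a behavioural model only, with hypothetical names for the two compressor results).

ZERO_THRESHOLD = 0.50   # ratio threshold for the low-latency (zero) module

def select_specialized(zero_ratio: float, delta_ratio: float) -> str:
    """Return which compressor output the CC grants ('zero' or 'delta')."""
    if zero_ratio >= ZERO_THRESHOLD:
        return "zero"                      # speculate: abort the delta task
    # Speculation failed: wait for delta and pick the best of the two.
    return "delta" if delta_ratio > zero_ratio else "zero"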


3.3 Methodology Evaluation Engine

Figure 3.9: Block diagram of the Methodology Evaluation Engine workflow.

Figure 3.9 depicts the workflow used to retrieve the results that we present

in the next chapter.

We collected data by sampling the network’s state at regular intervals

of N million instructions for a total simulated execution time of S seconds.

A configuration file is used by the Simulator Engine to pass all the needed

parameters, both for the sampling and for the definition of the NoC structure.

Once the simulation finishes, raw statistics files are collected and passed to the Elaboration Engine, which filters them to extract only the data useful for the evaluation. A post-simulation power estimation tool calculates the dynamic power consumption, and three files are then generated, grouping power, performance, latency and compression-related statistics, averaged over the number of sampled epochs. Finally, those files are passed to a Graph Plotter that presents the aggregated data in graphical form.

The following subsections give a detailed description of how the Elaboration Engine performs its calculations on the raw statistics data.


3.3.1 Power Consumption Evaluation

The power consumption calculation is based on the injection rate on links per executed cycle of the simulation.

To obtain power results, an injection rate value has to be passed to the power estimation tool. The Elaboration Engine calculates the link injection rate using the following formula:

\[
\text{injection rate} = \frac{\left(\sum_{i=0}^{vnets} hopcount_i\right) / execycles}{R \times L} \tag{3.1}
\]

where R stands for the total number of routers in the NoC, L for the

number of links for each router and execycles for the total cycles executed

during simulation.

The hopcount represents the total link traversals that each flit performs

in order to reach its destination from a source node. We collected hop count

metrics by summing up every flit’s hop count before being injected into the

network.

Once the dynamic instantaneous power dissipation has been obtained, the Elaboration Engine calculates the dynamic energy dissipation by multiplying that value by the corresponding simulated execution time, in seconds, recalling the mathematical relation that subsists between instantaneous power and energy:

\[
Energy = \int_{t_0}^{t_1} P(t)\,dt \tag{3.2}
\]

Power consumption overheads for the compression/decompression phases

have been calculated by taking into account the amount of execution cycles

spent on performing those phases, their power dissipation per cycle and the

total executed cycles of the simulation.

The dissipated power overhead per cycle has been obtained with the following formula:

\[
\text{compression overhead} = \frac{P_{C/D} \times cycles_{C/D}}{execycles} \tag{3.3}
\]

where P_{C/D} is the power consumption of the compressor/decompressor module pair, cycles_{C/D} is the sum of the cycles spent performing compression and decompression during execution, and execycles represents the total cycles executed during the simulation.
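A small sketch of these two calculations (equations 3.1 and 3.3) is reported below; the function names are illustrative and the example values are made up.

def link_injection_rate(hopcounts_per_vnet, execycles, routers, links_per_router):
    """Eq. 3.1: average flit link-traversals per link per executed cycle."""
    return (sum(hopcounts_per_vnet) / execycles) / (routers * links_per_router)

def compression_overhead(p_cd, cycles_cd, execycles):
    """Eq. 3.3: average power overhead of one compressor/decompressor pair."""
    return p_cd * cycles_cd / execycles

# Example on the 4x4 mesh of Table 4.1 (16 routers), assuming 4 links per
# router and arbitrary traffic figures:
rate = link_injection_rate([120_000, 95_000, 310_000], execycles=1_000_000,
                           routers=16, links_per_router=4)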


Once retrieved, this overhead has been added to the link energy as additional consumption.

3.3.2 Performance Evaluation

Performance has been analyzed by collecting statistics regarding the simulated execution time and the network latency for every sampled epoch. These two metrics have been selected because they are the most representative for investigating the impact of compression in terms of overall execution speedup and latency variations on the network.

The simulated execution time represents the time interval between the launch of the simulation and its termination, and corresponds to the elapsed time, in seconds, as if the workload were executed on a real hardware SoC.

The network latency represents instead the total latency introduced by the network when switching, traversing and dispatching packets among the IPs.

3.3.3 Compression Determinism Evaluation

The determinism of compression refers to the variability of the per-packet compression ratio measured at runtime. In order to explore the possibility of reducing the standard deviation around the average compression ratio, we collected, for each simulated epoch, the compression ratio achieved by every data packet.

We then calculated, after the simulation, the average compression ratio over the epochs and its associated standard deviation, in order to better understand how much the compression ratio varies from packet to packet.
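A minimal sketch of this post-processing step is shown below, assuming the per-packet ratios have already been grouped by epoch; the names are illustrative.

from statistics import mean, pstdev

def compression_ratio_stats(ratios_per_epoch):
    """ratios_per_epoch: one list of per-packet compression ratios per epoch."""
    epoch_means = [mean(r) for r in ratios_per_epoch if r]
    all_ratios = [x for epoch in ratios_per_epoch for x in epoch]
    return mean(epoch_means), pstdev(all_ratios)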


Chapter 4

Results

This chapter presents a comprehensive analysis of the impact of data compression on the NoC.

We discuss how compression can lower the dynamic power consumption by acting on the amount of data injected into the network. We then show that this mechanism does not affect the overall speedup of the system, since the speedup is driven by other factors not directly correlated with the traffic reduction on the network. Finally, an analysis of the compression ratio and of its determinism is presented.

The rest of the chapter is organized as follows. Section 4.1 describes the simulation environment setup, as well as the benchmarks used. Section 4.2 analyzes the impact that the introduction of compression/decompression has on the dynamic energy dissipation, and Section 4.3 describes how performance is influenced. Finally, Section 4.4 discusses how the introduction of a parallel-compression approach impacts the achievable compression ratio and its standard deviation, compared to single-algorithm techniques.

4.1 Simulation Setup

The compression and decompression modules presented in sections 3.1.1 and 3.1.3 have been integrated into the enhanced version [27, 26, 19, 18, 5, 6, 24, 25] of the GEM5 cycle-accurate simulator [4]. The simulator architecture without the two blocks is considered as the baseline architecture, and consists of a 4x4 2D-mesh NoC with 16 Alpha cores and NoC routers with 2 virtual channels per VNET, as described in table 4.1.


Processor Core          2GHz, In-Order Alpha Core, 1 cycle per execution phase
Cache line size         64 byte
L1I Cache               16kB, 4-way set associative
L1D Cache               16kB, 4-way set associative
L2 Cache                512kB per bank, 8-way set associative
Coherence Protocol      MOESI (3 VNETs, 2 VCs per VNET)
Channel width           64 bit
Flit size               8 byte
Control packet size     1 flit
Data packet size        9 flits, 1 header flit + 8 body flits
Topology                2D-Mesh 4x4, 16 cores
Network frequency       1GHz
Technology              45nm at 1.0V
Real Traffic            Subset of 11 benchmarks from the SpecCPU2006 benchmark suite

Table 4.1: Experimental Setup: architecture specifications and traffic workloads used in simulations.

The channel width for links has been set to 64 bits, with a resulting flit size of 8 bytes. In order to pack a 64-byte cache line, a data packet consists of 8 body flits carrying the payload plus one header flit, whereas control packets require only 1 flit.

We collected our results using a compressor/decompressor module with the two compression methodologies presented in sections 3.2.1 and 3.2.2.

We then evaluated three different configurations of the module: delta-enabled, where the compression block is forced to compress only with the delta compressor module; zero-enabled, where the block is forced to compress only with the zero compressor module; and finally parallel-enabled, in which the compression block works normally.

Table 4.2 details the specific parameters used to tune our modules for the simulations. The values for the introduced delay and for the power consumption refer to a single compressor/decompressor pair. The power consumption values for the delta module are taken from [22]. Although the dynamic power consumption of zero compression is declared as negligible in [20], we decided to set the dynamic power consumption of its compressor/decompressor pair to the same value as our delta compression module, in order to model a more conservative, realistic scenario.

Compression Methodology    Max. Comp. Ratio    Introduced delay (cycles)    Power Consumption (mW)
Delta                      60%                 10                           2
Zero                       75%                 2                            2
Parallel                   75%                 10 or 2                      4

Table 4.2: Experimental Setup: compression techniques specifications.

A MOESI-based protocol is used as the memory coherence protocol of the system. This protocol virtually isolates messages into three different VNETs to guarantee deadlock-free message dispatching. Traffic isolation also guarantees that only one VNET is employed to carry and manage data messages, while control messages are injected into the other two VNETs. For our simulations, this protocol assumes that VNET0 and VNET1 carry the coherence control packets, i.e. single-flit packets, while VNET2 carries the data packets.

The DSENT NoC power estimation tool [17] has been used to extract power data from the simulated NoC architecture, and the GNU Octave plotting tool has been used to plot graphs of the obtained results.

A subset of 11 benchmarks from the SpecCPU2006 benchmark suite has been used to simulate real traffic workloads on the simulator. The suite provides integer and floating-point single-threaded benchmarks able to stress the NoC with high traffic workloads [11].


4.2 Energy Analysis


Figure 4.1: Energy consumption results on the three considered compression-enabled NoC architectures, normalized to the baseline architecture.

Total energy consumption results are depicted in Figure 4.1.

The three grouped columns on the X axis represent the three considered architectures for each benchmark. Energy consumption is reported on the Y axis and normalized to the baseline architecture. The lighter bars represent the energy consumption on the links, while the darker bars on top represent the compression energy overheads. These energy values were obtained with DSENT [17] for each simulated epoch and averaged over the number of executed epochs, as described in Section 3.3.1. We adopted the Normalized Average Epochs Energy (NAEE) metric to compare the different architectures:

architectures:

\[
AEE_k = \frac{\sum_{i=1}^{epochs} EE_i}{count_{epochs}} \tag{4.1}
\]
\[
NAEE_k = \frac{AEE_k}{AEE_{baseline}} \tag{4.2}
\]

In equation 4.1, EE_i stands for the energy consumption of each epoch and count_epochs for the number of considered epochs. The Average Epochs Energy of each compression architecture is then normalized to the baseline architecture result, as reported in equation 4.2.
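A direct translation of equations 4.1 and 4.2, as a minimal sketch with illustrative names, could look as follows.

def naee(epoch_energies, baseline_epoch_energies):
    """Normalized Average Epochs Energy of one architecture (eq. 4.1 and 4.2)."""
    aee = sum(epoch_energies) / len(epoch_energies)
    aee_baseline = sum(baseline_epoch_energies) / len(baseline_epoch_energies)
    return aee / aee_baseline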



Figure 4.2: Total injected flits for each compression-enabled architecture, obtained

from simulations on the SpecCPU2006 benchmark subsuite.

Figure 4.2 shows the total injected flits on the network, normalized to the baseline architecture. On the X axis, each of the three grouped columns represents a compression-enabled architecture, while the ordinate shows the amount of injected flits. On average, the parallel-enabled architecture achieves 27% less flit injection compared to the baseline model, and outperforms the delta-enabled and zero-enabled architectures by 6% and 6.4% on average, respectively. Figure 4.3 reports the compression ratio achieved by the three different architectures, each represented by one of the three grouped columns on the X axis, while the Y axis shows the compression ratio value.

We adopted the Compression Ratio Value (CRV) metric [15, 16] to perform our analysis. It is defined as:

\[
CRV_k = \frac{\text{baseline flit count}}{\text{compressed flit count}} \tag{4.3}
\]

The CRV for the parallel-enabled architecture reaches 1.42x, while the delta-enabled and zero-enabled architectures reach only 1.29x and 1.30x on average, compared to the baseline architecture.

The results show that the energy dissipation is reduced by an average of 19%, 22% and 26% with the delta-enabled, zero-enabled and parallel-enabled architectures respectively, across all the analyzed benchmarks.



Figure 4.3: Compression ratio for each compression algorithm, obtained from simulations on the SpecCPU2006 benchmark subsuite.

In the worst case, as shown by the NAMD workload results, the energy is nearly equal to that of the baseline configuration without compression enabled. This behaviour is due to the fact that, since the achieved compression ratio varies from 1.01x to 1.04x across the architectures, compression does not significantly reduce the number of flits injected into the network.

On the other hand, the GOBMK and LESLIE3D workloads experience the largest benefits from compression, with a 68% and 60% energy reduction respectively, considering the parallel-enabled architecture. Since their compression ratios are 2.04x and 2.11x respectively, they actually inject less than half of the baseline flits and also have a high hops/flit ratio compared to the other benchmarks, thus leading to a lower link utilization (see Figure 4.4).

A comparison between the delta-enabled and zero-enabled architectures highlights that some benchmarks favour one technique over the other, while others give very similar results regardless of the employed technique.

In the former case, BZIP2 and HMMER are more sensitive to the delta technique, as highlighted by their compression ratios: the delta ratio outperforms zero by 8% on BZIP2 and by 10% on HMMER, and the energy consumption reflects these differences. For these two workloads the delta technique performs better than zero, with 7% and 10% more savings with respect to the zero technique, without considering compression overheads.

The latter case is represented by the SOPLEX, H264REF, GROMACS and OMNETPP benchmarks. Since their compression ratios are nearly the same for the two architectures, the link energy consumptions are practically identical.

4.2.1 Overheads Analysis

Figure 4.1 shows the overheads as the small portion on top of each reported bar. The overhead represents the extra energy consumed due to the compression/decompression activity.

As we can see, their impact on energy consumption is very small: on average, they account for 3.9%, 1.2% and 2.1% for the delta-enabled, zero-enabled and parallel-enabled models respectively.

An interesting characteristic derived from these results is that the energy consumption of the two introduced modules directly depends on the execution cycles spent by the modules themselves, as detailed by the calculation method in Section 3.3.1.

The delta-enabled and zero-enabled overheads show that their energy consumption always follows a 5:1 ratio, because of the aforementioned dependency on execution cycles: having 10 and 2 execution cycles respectively, their overheads will always be close or equal to a 5:1 ratio. Given this, for equal link consumption, zero compression is preferable to delta due to the lower module consumption during execution. The parallel-enabled module's overheads fall between the two cases, because it mixes the execution cycles spent, based on the technique selected for each packet at runtime: for example, the parallel module's consumption is nearly the same as delta's on HMMER, because 81% of the traffic is compressed with the delta compression technique, as highlighted in Figure 4.9.

Finally, we investigated the Break-Even Point (BEP) of the energy consumption. The BEP represents the point beyond which the usage of compression no longer introduces benefits, because its energy overhead nullifies the energy savings due to reduced traffic injection.

Although the majority of the analyzed benchmarks does not suffer from the overhead introduction, an extreme case is represented by the NAMD workload: its scarce energy savings are totally nullified by the compression overhead of both the delta-enabled and parallel-enabled architectures. In this case the zero-enabled architecture is in fact the only one that does not nullify the energy savings, mostly because its module works for far fewer execution cycles than the delta one.

4.2.2 Compression Policy Analysis

We also investigated the possibility of compressing only a subset of the data packets, by using a selection policy based on the distance that each packet travels from source to destination. We wanted to observe whether compressing only the portion of data packets with a certain characteristic could give benefits to the system. Our cost metric is therefore the distance travelled by each packet.

The cost that an uncompressed packet pays to reach its destination is modelled as follows:

\[
COST_{baseline} = flits \times (hops \times cost_L) + cost_R \tag{4.4}
\]

Here cost_L represents the cost, expressed in cycles, of traversing a link, and cost_R the cost, expressed in cycles, of traversing a router; flits accounts for the number of flits forming a data packet, and hops describes the number of link traversals performed by each flit. For the compressed-packet cost we modeled the following equation, taking into account also the delay added by compression and decompression:

\[
COST_{compression} = flits_C \times (hops \times cost_L) + cost_R + C + D \tag{4.5}
\]

C and D represent respectively the cost, expressed in cycles, of the compression and decompression phases, and flits_C accounts for the number of flits resulting after compression; the remaining variables represent the same costs as in equation 4.4. The policy evaluates the following condition:

\[
COST_{compression} \leq COST_{baseline} \tag{4.6}
\]

Compression is enabled if the above condition is true, and denied otherwise.

We then calculated the theoretical break-even point for this policy, in order to know the boundary value, expressed in hops, beyond which we can gain benefits from compression.



Figure 4.4: Average hops/flit performed for the baseline NoC architecture.

Simplifying both sides of equation 4.6 and substituting the variables with our simulation parameters, we found that:

\[
hops \geq \frac{C + D}{(flits - flits_C) \times cost_L} \tag{4.7}
\]
\[
hops \geq 2
\]

Based on this result, our policy will give benefits only for those packets

that have to travel a minimum distance of 2 hops.
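The cost check of equations 4.4 to 4.6, and the break-even condition derived from it, can be summarized with the following sketch (costs expressed in cycles, names illustrative):

def should_compress(flits, flits_c, hops, cost_link, cost_router, c_delay, d_delay):
    """True when compressing the packet is not more expensive than sending it as is."""
    cost_baseline = flits * (hops * cost_link) + cost_router
    cost_compressed = flits_c * (hops * cost_link) + cost_router + c_delay + d_delay
    return cost_compressed <= cost_baseline

# Equivalently, compression pays off when
#   hops >= (c_delay + d_delay) / ((flits - flits_c) * cost_link)
# which, with our simulation parameters, gives hops >= 2.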

Figure 4.4 reports the average number of hops performed by each flit, for each considered workload from the SpecCPU2006 benchmark suite, executed on the baseline architecture.

Based on these results, every benchmark has a number of hops greater than the break-even point calculated earlier, with an average of 3.6 hops/flit, making it pointless to introduce a packet compression selection policy, since packets essentially always perform more than 2 hops. We therefore applied compression to all data packets.


4.3 Performance Analysis


Figure 4.5: Simulation execution time results on the three compression-enabled

NoC architectures, normalized to the baseline architecture.

Results on total execution time are shown in Figure 4.5.

Each of the three grouped columns on the X axis represents one of the three architectures used in the simulations, while the Y axis reports the execution time of each of them, normalized to the baseline execution time. In this case we adopted the Normalized Execution Time (NET) metric to compare the results of the different architectures:

\[
NET_k = \frac{ET_k}{ET_{baseline}} \tag{4.8}
\]

Here ET_k represents the Execution Time experienced by the considered compression architecture and ET_baseline that of the baseline architecture.

The results highlight that the introduction of the compression technique, in all three configurations, does not give significant improvements on the overall speedup of the system execution, with delta-enabled performing 0.2% worse, and zero-enabled and parallel-enabled 0.4% and 1% better respectively, compared to the baseline architecture.

The best improvements, observed on GOBMK and LESLIE3D, account for only a 5% speedup over the baseline architecture, due to the fact that they are the only two benchmarks that reach a high compression ratio. Conversely, compression introduces a system performance penalty within 3% for the NAMD, OMNETPP and GEMS benchmarks. Those benchmarks have the lowest compression ratio on their data, and although all their packets are compressed, compression does not give benefits in terms of fewer injected flits, but only affects the packets' delay. This can be observed in the increased execution time with respect to the baseline architecture.


Figure 4.6: VNETs comparison results on parallel-enabled architecture, normalized

to the baseline architecture.

We also evaluated the impact of data compression on the VNET latencies. Figure 4.6 compares the latencies of the three VNETs collected on the parallel-enabled architecture. On the X axis, each of the three grouped columns identifies a VNET, from left to right VNET0, VNET1 and VNET2. The Y axis reports the VNET latency, normalized to the baseline latency value. We adopted the Normalized VNET Latency (NVL) metric in order to compare them against the baseline model:

\[
NVL_k = \frac{VL_k}{VL_{baseline}}, \quad k = 0..2 \tag{4.9}
\]

where VL_k represents the latency of the k-th VNET.

As we can see from the figure, the packet latency of both VNET0 and VNET1 remains practically equal to the baseline values, indicating that coherence control packets gain no benefit from the utilization of compression. Conversely, VNET2 experiences a 23% latency decrease on average. GOBMK and LESLIE3D show the best improvements, with a 45% and 46% latency decrease, respectively. This result can also be related to Figure 4.5, where the two considered benchmarks are the only ones that experience a small performance improvement.

The NAMD workload instead maintains practically the same latency values for the three VNETs. This directly reflects the fact that, although this benchmark compresses all its data packets, its compression ratio is small and gives no improvement in terms of reduced flit injection.


Figure 4.7: Total latency comparison on the three considered compression-enabled

NoC architectures, normalized to the baseline architecture.

Figure 4.7 shows the total packet network latency, computed as the mean value over the three VNETs for each of the three compression architectures. We adopted the Normalized Network Latency (NNL) metric in order to compare the different architectures:

\[
NL_k = \frac{\sum_{i=0}^{VNETS} VL_i}{VNETS} \tag{4.10}
\]
\[
NNL_k = \frac{NL_k}{NL_{baseline}} \tag{4.11}
\]

where VL_i represents the latency of each VNET and VNETS represents the number of VNETs in the system.


The results show that the total network latency can be lowered, on average, by 6%, 7% and 9% with the delta-enabled, zero-enabled and parallel-enabled architectures respectively, compared to the baseline architecture.

The best results, with the parallel-enabled architecture, are given by the GOBMK and LESLIE3D benchmarks, which outperform the baseline architecture by 15% and 16% respectively.

On the other hand, the NAMD benchmark shows little to no improvement also on latency, with a 0.5% latency increase with respect to the baseline architecture.

The analysis of the impact of compression on the packets' network latency helps to understand why compression does not improve the system speedup: the key contributor to performance speedup is in fact the single-flit coherence control packet, not the data packet. Since compression acts only on the latter packet type, the control packets' network latency remains equal to the baseline case, as reported in Figure 4.6, and the system therefore cannot experience a visible performance improvement.

Confirming this, a recently published work [12], although not implementing or using compression on the NoC, showed that the key element to act upon in order to speed up a NoC is the coherence control packet.

In their work, the researchers analyzed the impact of the different packet types on system performance, indicating that the coherence control packets are the ones responsible for the system's performance behaviour, and defining them as latency-sensitive packets.

In order to verify their assumptions, they proposed a NoC architecture composed of two physical networks paired together: one used to route and dispatch only latency-sensitive packets, and the other employed as a standard NoC to manage the rest of the traffic.

Using a router architecture with a pipeline latency of only 1 cycle, which performs Route Computation, Virtual Channel allocation, Switch allocation, and Switch and Link traversal all in the same cycle, they increase the latency-sensitive packets' speed, leading to faster dispatching.

Their results show that the performance, in terms of overall system speedup, reaches on average a 1.08x improvement with respect to the baseline architecture on the GEM5 simulator, and a 1.66x improvement with the SynFull [3] synthetic traffic modeling tool.


Figure 4.8: Execution time comparison between the three baseline architectures with different router pipeline latencies.

To verify this aspect, we conducted a simulation on our baseline architecture with the latest GEM5 release, which adds a flexible router pipeline feature to its architecture, thus allowing the pipeline depth to be set flexibly. In particular, we applied to each coherence control packet a router pipeline latency equal to one cycle, and to all the other packets a latency equal to 4 cycles. The considered architecture exploits a virtual-network NoC, so it splits the traffic into three separate virtual networks that share the same physical resources, i.e. links and routers.

The results of this simulation are presented in Figure 4.8. On the X axis, each of the three grouped columns represents a considered configuration: the first employing a fixed router latency of 4 cycles for every packet, the second employing the adaptive latency discussed above, and the third a fixed router latency of 1 cycle for every packet. The simulation time for each benchmark is shown on the ordinate and normalized to the 4-cycle router pipeline latency result.

For each of the analyzed benchmarks, the adaptive-latency results fall between the two fixed-latency results. As we can see, the execution time speedup of the adaptive architecture is, on average, 1.05x with respect to the baseline architecture that uses a classical router pipeline with 4-cycle latency.


4.4 Compression Ratio Analysis


Figure 4.9: Packet distribution against the compression module used in the parallel-enabled architecture.

We also investigated how the parallelization of two different compression algorithms affects the compression ratio achievable at runtime, with respect to the two single-module architectures, delta-enabled and zero-enabled.

This last analysis focuses only on the parallel-enabled architecture, in which the delta module and the zero module are paired together and used to compress each packet in parallel, selecting then only one of the two outputs, as described in Section 3.2.3.

Figure 4.9 shows the distribution of data packets compressed with the delta or the zero module at runtime, using the parallel-enabled architecture. The Y axis shows the percentage of data packets compressed with the delta or zero algorithm over the total number of compressed packets, for each of the benchmarks on the X axis. As we can see from the figure, data packets are well distributed between the two algorithms, with an average of 52% of them compressed with the delta technique and 48% with the zero technique.

There are some cases in which a benchmark's data packets favour one compression technique over the other: BZIP2 and HMMER represent the case where data is better compressed using delta compression instead of zero compression. In those two benchmarks, 76% and 81% of the packets are in fact compressed using the former technique.

OMNETPP instead represents the situation in which zero compression performs better than delta, with 78% of the compressed data handled by the zero module.

These results indicate that data messages are sensitive to different compression techniques, because of the different data patterns that those techniques exploit. Our results show that our algorithm selection is a good compromise, given the balanced average distribution of packets between the two compression algorithms.


Figure 4.10: Standard deviation of the compression ratio of the three considered

compression-enabled NoC architectures.

Figure 4.10 shows the standard deviation calculated over all the packets, which represents the variation occurring between consecutive compressed packets at runtime. Each of the three bars on the X axis represents one considered architecture, while the Y axis shows its associated standard deviation. The higher the standard deviation, the more unstable the compression ratio among packets.

The parallel-enabled architecture shows a 5% standard deviation reduction on average with respect to the other two models.

Some workloads, however, react well to this architecture, lowering their standard deviation by 6% to 10% in the case of MCF, GOBMK and LESLIE3D. Conversely, BZIP2 and HMMER show little to no improvement with respect to the delta-enabled architecture. This is due to the already discussed fact that those benchmarks reach a good compression ratio only with the delta technique, whether alone or paired with zero compression.

GEMS, NAMD and OMNETPP, having a small compression ratio compared to the rest of the workloads, also have a small standard deviation, because most of their traffic is not compressed at all, and the packets that are compressed obtain a very small compression ratio, remaining close to the uncompressed situation.

This last result confirms that, by using different algorithms to compress the data and thus being able to select the best among all the compression results, the compression ratio fluctuates less than with single-compression models.


Chapter 5

Conclusions and Future Works

Modern NoC-based multi-core SoCs are increasingly stressed by the market's application demands, which push low-latency and low-power requirements to the limit.

Real application traffic is dominated by data packet flits, which account for 80% of the injected flits, limiting resource availability, increasing contention on the interconnect and slowing packet delivery. Moreover, the larger the number of flits traversing the network, the more the energy consumption increases during execution, due to the increased link activity.

Starting from these considerations, we introduced an evaluation methodology workflow built on top of the GEM5 simulator, together with a comprehensive analysis of the introduction of an end-to-end data compression mechanism as a simple and effective way to reduce energy consumption and packet latency.

We presented a compression/decompression block capable of hosting N different compressor modules in parallel and of delivering the best output, based on a simple output selection policy able to select only one compressor's output among the N modules. This general implementation has been characterized by using two specific compression algorithms derived from state-of-the-art techniques, Delta compression and Zero compression, and by specifying a suitable output selection policy for these two techniques in particular.

This implementation has been used to present a detailed analysis of the impact of data compression on the NoC. Qualitatively, we analyzed the impact of compression on energy consumption and performance improvement, and we evaluated the effectiveness of a compression selection policy. Quantitatively, we described the impact that the introduction of the compression modules has in terms of energy and performance overheads.

Results have been extracted using a 4x4 2D-mesh architecture running a subsuite of the SpecCPU2006 benchmark suite. We compared our methodology with the baseline architecture without compression capabilities enabled. Moreover, the baseline architecture has been enhanced with power-aware and performance-aware solutions, to limit the design space and to have a reference model from both a power and a performance standpoint.

Compared to the power-aware baseline model, the parallel architecture shows a 26% average dynamic energy reduction, with a modest 0.2% overhead accounted for the compression/decompression module consumption. It also outperforms the delta and zero models with 7% and 4% more energy savings, respectively. Compared to the performance-aware solution, the parallel architecture gives no improvement in terms of overall system speedup, but the average packet latency is reduced by 12%. It also outperforms the delta and zero models by 4% and 3%, respectively.

Finally, the compression ratio increases by 5.7% with the parallel architecture with respect to the delta and zero models, showing that the compressed traffic is well distributed between the compression algorithms used at runtime, with 52% of the packets compressed with delta and 48% with zero. The parallel architecture also helps to stabilize the compression ratio over time, as shown by the 5% decrease of its standard deviation compared to the single-algorithm architectures.

These results show that data compression on the NoC is definitely a technique capable of decreasing energy consumption, but the same does not hold for performance improvements: compression in fact only affects the data packets' network latency. Moreover, the possibility of tuning the compression/decompression block by introducing other modules allows the architecture to explore the behaviour of such techniques on different workloads, possibly reaching even better results.

5.1 Future Works

The presented work is based on two specific data compression techniques, and their parallelization, chosen among all the currently existing techniques. However, we are confident that, by exploring different possible combinations of algorithms, better results could be reached both in terms of energy savings and of reduced resource contention. Also, a deeper analysis of the traffic behaviour could give better insights to tune this methodology towards ever increasing optimization. Finally, a combined approach with orthogonal optimizations, for example aimed at increasing the system speedup, could be a good way to both reduce energy consumption and gain better performance.


Bibliography

[1] A. Alameldeen and D. Wood. Frequent pattern compression: A significance-based compression scheme for L2 caches. Technical Report 1500, University of Wisconsin, Madison, USA, May 2004.

[2] A. R. Alameldeen and D. A. Wood. Adaptive cache compression for high-performance processors. In Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on, pages 212–223, June 2004.

[3] Mario Badr and Natalie Enright Jerger. SynFull: Synthetic traffic models capturing cache coherent behaviour. In Proceedings of the 41st Annual International Symposium on Computer Architecture, ISCA '14, pages 109–120, Piscataway, NJ, USA, 2014. IEEE Press.

[4] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, August 2011.

[5] S. Corbetta, D. Zoni, and W. Fornaciari. A temperature and reliability oriented simulation framework for multi-core architectures. VLSI (ISVLSI), 2012 IEEE Computer Society Annual Symposium on, September 2012.

[6] S. Corbetta, D. Zoni, and W. Fornaciari. A temperature and reliability oriented simulation framework for multi-core architectures. In 2012 IEEE Computer Society Annual Symposium on VLSI, pages 51–56, Aug 2012.

[7] R. Das, A. K. Mishra, C. Nicopoulos, D. Park, V. Narayanan, R. Iyer, M. S. Yousif, and C. R. Das. Performance and power optimization through data compression in network-on-chip architectures. In 2008 IEEE 14th International Symposium on High Performance Computer Architecture, pages 215–225, Feb 2008.

[8] Ashley M. DeFlumere and Sadaf R. Alam. Exploring multi-core limitations through comparison of contemporary systems. In The Fifth Richard Tapia Celebration of Diversity in Computing Conference: Intellect, Initiatives, Insight, and Innovations, TAPIA '09, pages 75–80, New York, NY, USA, 2009. ACM.

[9] M. Ekman and P. Stenstrom. A robust main-memory compression scheme. In 32nd International Symposium on Computer Architecture (ISCA'05), pages 74–85, June 2005.

[10] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, pages 365–376, New York, NY, USA, 2011. ACM.

[11] John L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34:1–17, September 2006.

[12] Z. Li, J. San Miguel, and N. Enright Jerger. The runahead network-on-chip. High Performance Computer Architecture (HPCA), 2016 International Symposium on, March 2016.

[13] G. Pekhimenko, E. Bolotin, N. Vijaykumar, O. Mutlu, T. C. Mowry, and S. W. Keckler. A case for toggle-aware compression for GPU systems. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 188–200, March 2016.

[14] Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. Base-delta-immediate compression: Practical data compression for on-chip caches. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 377–388, New York, NY, USA, 2012. ACM.

[15] Charles Poynton. Digital Video and HDTV Algorithms and Interfaces. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1 edition, 2003.

[16] Iain E. Richardson. The H.264 Advanced Video Compression Standard. Wiley Publishing, 2nd edition, 2010.

[17] C. Sun, C. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L. Peh, and V. Stojanovic. DSENT - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. Networks on Chip (NoCS), 2012 Sixth IEEE/ACM International Symposium on, May 2012.

[18] F. Terraneo, D. Zoni, and W. Fornaciari. A cycle accurate simulation framework for asynchronous NoC design. In 2013 International Symposium on System on Chip (SoC), pages 1–8, Oct 2013.

[19] Federico Terraneo, Davide Zoni, and William Fornaciari. A cycle accurate simulation framework for asynchronous NoC design. System on Chip (SoC), 2013 International Symposium on, October 2013.

[20] Luis Villa, Michael Zhang, and Krste Asanovic. Dynamic zero compression for cache energy reduction. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 33, pages 214–220, New York, NY, USA, 2000. ACM.

[21] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Design Automation Conference, 2001, June 2001.

[22] J. Zhan, M. Poremba, Y. Xu, and Y. Xie. NoΔ: Leveraging delta compression for end-to-end memory access in NoC based multicores. In 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 586–591, January 2014.

[23] Ping Zhou, Bo Zhao, Yu Du, Yi Xu, Youtao Zhang, J. Yang, and

Li Zhao. Frequent value compression in packet-based noc architectures.

In 2009 Asia and South Pacific Design Automation Conference, pages

13–18, Jan 2009.

54

Page 70: Exploring the end-to-end compression to optimize the power ... › bitstream › 10589 › 132476 › 3 › 2016_12_Pancot.pdfPOLITECNICO DI MILANO Master in Computer Science and Engineering

BIBLIOGRAPHY BIBLIOGRAPHY

[24] D. Zoni, J. Flich, and W. Fornaciari. Cutbuf: Buffer management and

router design for traffic mixing in vnet-based nocs. IEEE Transactions

on Parallel and Distributed Systems, 27(6):1603–1616, June 2016.

[25] Davide Zoni, Simone Corbetta, and William Fornaciari. Hands: Hetero-

geneous architectures and networks-on-chip design and simulation. In

Proceedings of the 2012 ACM/IEEE International Symposium on Low

Power Electronics and Design, ISLPED ’12, pages 261–266, New York,

NY, USA, 2012. ACM.

[26] Davide Zoni and William Fornaciari. Modeling dvfs and power-gating

actuators for cycle-accurate noc-based simulators. J. Emerg. Technol.

Comput. Syst., 12(3):27:1–27:24, September 2015.

[27] Davide Zoni, Federico Terraneo, and William Fornaciari. A control-

based methodology for power-performance optimization in nocs exploit-

ing dvfs. Journal of Systems Architecture, 61(5-6):197–209, 2015.

55