Efficiency of Logic Minimization Techniques for Cryptographic Hardware Implementation
Shashank Raghuraman
Thesis submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Master of Science
in
Computer Engineering
Leyla Nazhandali, Chair
Patrick R. Schaumont
Haibo Zeng
June 14, 2019
Blacksburg, Virginia
Keywords: Logic synthesis, Cryptographic hardware, Circuit minimization, Leon-3,
System-on-Chip, Authenticated encryption hardware.
Copyright 2019, Shashank Raghuraman
Efficiency of Logic Minimization Techniques for Cryptographic Hardware Implementation
Shashank Raghuraman
(ABSTRACT)
With significant research effort being directed towards designing lightweight cryptographic
primitives, logical metrics such as gate count are extensively used in estimating their hard-
ware quality. Specialized logic minimization tools have been built to make use of gate count
as the primary optimization cost function. The first part of this thesis aims to investigate
the effectiveness of such logical metrics in predicting hardware efficiency of corresponding
circuits. Mapping a logical representation onto hardware depends on the standard cell tech-
nology used, and is driven by trade-offs between area, performance, and power. This work
evaluates aforementioned parameters for circuits optimized for gate count, and compares
them with a set of benchmark designs. Extensive analysis is performed over a wide range of
frequencies at multiple levels of abstraction and system integration, to understand the dif-
ferent regions in the solution space where such logic minimization techniques are effective. A
prototype System-on-Chip (SoC) is designed to benchmark the performance of these circuits
on actual hardware. This SoC is built with an aim to include multiple other cryptographic
blocks for analysis of their hardware efficiency. The second part of this thesis analyzes
the overhead involved in integrating selected authenticated encryption ciphers onto an SoC,
and explores different design alternatives for the same. Overall, this thesis is intended to
serve as a comprehensive guideline on hardware factors that can be overlooked, but must
be considered during logical-to-physical mapping and during the integration of standalone
cryptographic blocks onto a complete system.
Efficiency of Logic Minimization Techniques for Cryptographic Hardware Implementation
Shashank Raghuraman
(GENERAL AUDIENCE ABSTRACT)
The proliferation of embedded smart devices for the Internet-of-Things necessitates a constant search for smaller and more power-efficient hardware. The need to ensure security of such
devices has been driving extensive research on lightweight cryptography, which focuses on
minimizing the logic footprint of cryptographic hardware primitives. Different designs are
optimized, evaluated, and compared based on the number of gates required to express them
at a logical level of abstraction. The expectation is that circuits requiring fewer gates to
represent their logic will be smaller and more efficient on hardware. However, converting a
logical representation into a hardware circuit, known as “synthesis”, is not trivial. The logic
is mapped to a “library” of hardware cells, and one of many possible solutions for a function
is selected - a process driven by trade-offs between area, speed, and power consumption on
hardware. Our work studies the impact of synthesis on logical circuits with minimized gate
count. We evaluate the hardware quality of such circuits by comparing them with that of
benchmark designs over a range of speeds. We wish to answer questions such as “At what
speeds do logical metrics rightly predict area- and power-efficiency?”, and “What impact
does this have after integrating cryptographic primitives onto a complete system?”. As part
of this effort, we build a System-on-Chip in order to observe the efficiency of these circuits
on actual hardware. This chip also includes recently developed ciphers for authenticated en-
cryption. The second part of this thesis explores different ways of integrating these ciphers
onto a system, to understand their effect on the ciphers’ compactness and performance. Our
overarching aim is to provide a suitable reference on how synthesis and system integration
affect the hardware quality of cryptographic blocks, for future research in this area.
Acknowledgments
First and foremost, I thank my parents, brother, and everyone else in my family back home
in India for their constant support and encouragement throughout my graduate studies. I
am indebted to my uncle, Aravind Srinivasan, and my friend, Lakshman Maalolan, both of
whose presence and positivity were an integral part of my education at Virginia Tech. I offer
my gratitude to my advisor Dr. Leyla Nazhandali, whose invaluable guidance and faith in
me were my biggest sources of motivation. I have drawn inspiration from her approachable
and good-natured mentorship, academic expertise, and skills in presentation, which, I believe,
have moulded me into a better professional. I must thank Dr. Patrick Schaumont for his
ideas, insights, and critique through the course of our collaborative work. I will always aspire
to acquire his ability to articulate complex ideas and findings in the simplest possible manner.
I also thank Dr. Nazhandali, Dr. Schaumont, and Dr. Haibo Zeng for having agreed to be part
of my Master's advisory committee. I thank NIST for having funded our project, and Dr. René
Peralta for his prompt responses to our queries. I wish to acknowledge Pantea Kiaei for her
dedication to our shared project, and the members and alumni of Secure Embedded Systems
Lab - with special mention to Archanaa S Krishnan, Tarun Kathuria, Daniel Dinu, and
Chinmay Deshpande - for their cheerful companionship and transfer of knowledge. During
my Master's, I had the opportunity to intern at Qualcomm in Boulder. I thank everyone in
that team - in particular, Curt Musfeldt, Brian Steele, and Karthi Subbiah - who made it a
rich learning experience that enhanced my confidence. Finally, I express my gratitude to my
friends Subramaniam Mahadevan, Swati Bhardwaj, and Naresh Vemishetty for their words
of encouragement, and Vamsi Chandra, Shamit Bansal, Omkar Dhande, Abhinuv Pitale,
and Akhil Ahmed for their help and the fun times I spent in Blacksburg.
Contents

1 Introduction
1.1 Motivation
1.2 Relevant Preliminaries
1.2.1 Digital Logic synthesis
1.2.2 Low Gate-Count (LGC) Synthesis Tool
1.3 Our Contributions
1.4 Attribution
1.5 Thesis Organization
2 Cryptographic Benchmark Selection and Experimental Methodology
2.1 Benchmark Designs
2.1.1 AES SBox
2.1.2 Binary Polynomial Multiplication
2.1.3 Galois Field Multiplication
2.1.4 Galois Field Inversion
2.1.5 Reed-Solomon Encoder
2.2 Experimental Setup
2.2.1 Integration of SLPs into ASIC design flow
2.2.2 Design alternatives for LGC designs
2.2.3 Standard cell library choices
3 Experimental results of logic synthesis of benchmark designs
3.1 Combinatorial logic synthesis results
3.1.1 Technology-independent evaluation
3.1.2 Post-synthesis analysis
3.2 Integrated Design Examples
3.2.1 Comparison of AES designs with different SBox circuits
3.2.2 Reed-Solomon Encoder
3.3 Effect of physical design
3.4 Connection between abstract and technology-dependent quality metrics
4 Design of prototype ASIC
4.1 Design Rationale
4.2 SoC Architecture
4.2.1 Memory-mapped coprocessors on NISTCHIP
4.2.2 NISTCOMB coprocessor: Design and Programming model
4.3 NISTCHIP ASIC Design
4.3.1 Logic synthesis of NISTCHIP
4.3.2 NISTCHIP Physical Design flow
4.3.3 The final ASIC layout
4.4 Post-layout results
4.4.1 Area
4.4.2 Power Consumption
5 Impact of SoC integration on Authenticated Encryption Ciphers
5.1 Introduction
5.2 Relevant Background
5.2.1 ACORN
5.2.2 AEGIS
5.2.3 MORUS
5.3 Design alternatives for SoC integration
5.3.1 An intuitive and convenient wrapper design - FIFOs at the input and output
5.3.2 Reducing wrapper overhead - FIFO only at the output
5.3.3 Lightweight integration - no FIFOs in the wrapper
5.3.4 Direct Memory Access (DMA) for increased throughput
5.4 Evaluation Methodology
5.4.1 Studying Area and Power
5.4.2 Performance Analysis
5.5 Observations and Results
5.5.1 ACORN-32
5.5.2 ACORN-8
5.5.3 AEGIS-128L
5.5.4 MORUS
5.6 Conclusion
6 Conclusion
Bibliography
Appendices
Appendix A Additional results for logic synthesis of LGC circuits
Appendix B NISTCHIP Memory Map
Appendix C Additional results for SoC integration of AEAD Ciphers
Chapter 1
Introduction
1.1 Motivation
The necessity for cryptographic hardware with small logic footprint, high performance, and
low power consumption has become paramount with the growing popularity and usage of
the Internet-of-Things (IoT). High-volume applications and lightweight embedded devices
primarily require reduction in silicon area. On the other hand, shared applications in the
cloud that deal with a large number of users need to support bulk data processing, and
hence require high-performance cryptography [28]. Both of these common requirements have
led to research that predominantly focuses on minimizing the logic complexity
of cryptographic designs.
Techniques to improve the quality of hardware implementations exist at all possible levels
of abstraction - from algorithms right down to semiconductor technology. The focus of this
work is to study the effects of optimization between algorithm-level and device-level, widely
characterized as logic synthesis. We consider popular implementations of well-known, standard cryptographic algorithms, making use of vendor-provided gate-level primitives for
their hardware realization. We therefore do not focus on minimization techniques at high-
level algorithm formulation or the optimization of VLSI design layout. This work focuses
on minimization performed at the level of logical representation of cryptographic primitives.
We analyze the impact of realizing these logic-minimized designs on actual hardware through
an implementation flow that is adopted as a common standard.
With regard to obtaining compactness, low power, and high performance on hardware, it is
natural to expect that logic reduction at the architectural level translates into desired hard-
ware efficiency. Fewer logic operations are intuitively assumed to directly produce smaller
hardware, and indirectly affect power consumption [27]. Similarly, a design with fewer lev-
els of logic operations is expected to yield a faster circuit on hardware. For cryptographic
logic designers at this relatively high level of abstraction, it makes sense to focus on the
best available metric that quantifies compactness - gate count. This has naturally gained
widespread usage through the years as an important indicator of the “efficiency” of cryp-
tographic logic designs, and more importantly, in comparing one design alternative with
another [5, 9, 12, 15, 20, 26]. Optimization tools driven by gate count and/or depth as
their cost functions have been developed for different classes of functions [8, 11, 17, 22, 47].
Some works discuss the expected circuit speed in terms of its logical depth before synthesis
[16, 24, 27, 37, 39], or as an estimate obtained from a library, depending on logical complexity
[32].
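Both metrics discussed above - gate count and logical depth - can be computed directly from a circuit description. The sketch below (purely illustrative; the (out, op, in1, in2) tuple format is an assumption made for this example, not the format of any cited tool) counts gates per operation type and derives the depth of a small straight-line program:

```python
# Illustrative sketch: computing gate count and logical depth from a
# straight-line program (SLP). The tuple format (out, op, in1, in2) is an
# assumption made for this example, not the format of any cited tool.

def slp_metrics(slp, inputs):
    depth = {x: 0 for x in inputs}           # primary inputs sit at depth 0
    counts = {}
    for out, op, a, b in slp:
        counts[op] = counts.get(op, 0) + 1   # one 2-input gate per SLP line
        depth[out] = 1 + max(depth[a], depth[b])
    return counts, max(depth.values())

# A toy fragment mixing linear (XOR) and non-linear (AND) gates:
slp = [
    ("t1", "XOR", "a0", "a1"),
    ("t2", "XOR", "b0", "b1"),
    ("t3", "AND", "t1", "t2"),
    ("t4", "AND", "a0", "b0"),
    ("c0", "XOR", "t3", "t4"),
]
counts, depth = slp_metrics(slp, ["a0", "a1", "b0", "b1"])
print(counts, depth)  # {'XOR': 3, 'AND': 2} 3
```

Note that these metrics say nothing about cell sizes or drive strengths - precisely the gap this thesis investigates.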
An important point that needs to be considered is that converting the logical representation
of a design into cells of a hardware library is not a trivial task. These hardware cells come
with diverse functionality and sizes. Therefore, there does not exist a unique mapping from
a design with low gate count to a larger set of library cells. Determining an optimal solution
from these is driven by constraints specified by the user. For example, a simple and direct
mapping of a logic-minimized design onto corresponding hardware cells is bound to give the
smallest solution. However, placing a tight delay requirement on such a design necessitates
searching for an alternative solution through a different logic mapping and bigger cells. This
naturally increases the area of the circuit. Figure 1.1 shows a typical plot of area against
delay of designs synthesized by a constraint-driven tool. A solution that is better than a Pareto-optimal one with respect to either speed or size cannot be achieved without compromising the other.
Figure 1.1: A typical area-delay curve depicting trade-off points.
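The Pareto-optimal points in a plot such as Figure 1.1 can be extracted mechanically from a set of synthesis runs. The sketch below (delay and area values are invented for illustration) keeps only the points that are not dominated in both delay and area:

```python
# Illustrative sketch: extracting the Pareto front from a set of
# (delay, area) synthesis results. The numbers are invented.

def pareto_front(points):
    """Keep points not dominated (i.e. strictly worse in both metrics)."""
    front = []
    for p in sorted(points):                  # sort by delay, then area
        if not front or p[1] < front[-1][1]:  # area must strictly improve
            front.append(p)
    return front

runs = [(1.0, 900), (1.2, 700), (1.5, 650), (2.0, 400), (2.5, 400), (3.0, 380)]
print(pareto_front(runs))  # [(1.0, 900), (1.2, 700), (1.5, 650), (2.0, 400), (3.0, 380)]
```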
Due to the aforementioned dependence on standard cell technology, an assessment of logic
minimization tools needs to capture different corners of the design space. Techniques that
reduce gate count might result in greater difficulty to optimize the circuit for speed, or
consume more power. This eventually brings us to the question of whether the estimate of
hardware efficiency provided by metrics such as gate count remains accurate over a range
of constraints. Many existing optimizations for specific target applications [35, 44, 45, 47]
include synthesis results obtained for a particular frequency, mainly to validate the compact-
ness of their target circuit. While this establishes their area efficiency at that frequency, we
believe that a comprehensive analysis of the area, delay, and power of a more diverse group
of circuits minimized by similar techniques would go a long way in providing designers a
clearer picture of how they are transformed along the hardware implementation flow.
Our primary task in this work is to systematically evaluate the hardware quality of logic
reduction performed by a new class of record-setting circuit-minimization techniques optimized for reducing gate count [8, 11, 41]. This Low Gate-Count (LGC) tool primarily focuses
on reducing multiplicative complexity, minimizing the number of XOR operations, and if
desired, reducing the depth of combinatorial circuits. These techniques have produced some
of the smallest known combinatorial primitives of cryptographic importance [9, 12]. Our
aim is to perform a comprehensive hardware efficiency analysis of these circuits over a range
of constraints covering various trade-off points over the design trajectory. Considering that
these tools have been optimized for a large class of combinatorial cryptographic circuits, we
believe this analysis provides significant insight into the overall hardware efficiency of such
methodologies, and helps identify specific regions in the design space where these circuits
are efficient. Specifically, we attempt to address the following points:
• Trade-off regions: Owing to the conflicting nature of hardware quality metrics, it is
conceivable that synthesis methods that are superior in one metric are inferior in another.
There is rarely a case of “one-size-fits-all” with regard to logic synthesis. It is therefore
important to identify these regions of the solution space to get a better assessment of
when LGC tools are preferable over other alternatives.
• Suitability towards wide range of functions: Similar to the previous point, there
is a possibility of one synthesis method outperforming another for a particular class of
logic functions, and not so for a different class. The same optimization strategy can
affect different functions in different ways depending on their complexity and structural
properties. Since the LGC tool is shown to be applicable to a wide range of circuits, it is
of interest to analyze the consistency of hardware efficiency over different logic functions.
• Scaling of hardware metrics: As logic synthesis is a constraint-driven process, a
circuit that is better at one operating frequency can be worse at a higher frequency. In
other words, different circuits do not scale in the same manner with respect to their
design quality. We wish to observe how area and power scale with design constraints and
complexity. This eventually produces plots similar to Figure 1.1 that help understand the
optimal regions of operation for different synthesis schemes.
To evaluate the LGC tool, we compare the quality of designs it creates, against those pro-
duced by commercial tools, as well as other existing optimization techniques for the same
logic functions. These comparisons are performed at different levels of abstraction in the
implementation flow of an Application Specific Integrated Circuit (ASIC). In addition to
evaluating the quality of combinatorial primitives as standalone blocks, we include analysis
of an overall system design incorporating these primitives. This is aimed at demonstrating
their suitability in a practical setting. We design a prototype ASIC that includes some of
these cryptographic primitives as part of our effort in evaluating their efficiency on actual
hardware.
1.2 Relevant Preliminaries
1.2.1 Digital Logic synthesis
An algorithm can be described as a logic function in multiple ways - a high-level look-up table
specifying its input-output relation, using expressions with Boolean operators representing
logic gates, or through behavioral description in a Hardware Description Language (HDL)
such as Verilog, VHDL, SystemVerilog, etc. Realizing such a description as a circuit on
hardware involves providing it to a logic synthesis tool along with a collection of standard
cells, known as a standard cell library or technology library. Such a library, generally provided
by a commercial vendor, consists of hardware primitives belonging to a particular technology
represented by dimensions and other properties of the transistors used to construct them.
These building blocks can be as simple as regular AND, OR, XOR gates, and as complex as
multiplexers, arithmetic circuits such as adders, compound gates such as ((A·B)+(C·D))′,
and so on. The synthesis tool maps a logical representation onto physical components from
the library to produce a “gate-level netlist”, which is an interconnection of standard cells.
Figure 1.2: Example of hardware realization of a logical representation.
There are technology factors to be considered during this mapping process:
• Drive strength of cells: Each standard cell has a specified strength to drive its output
to other cells. Cells with higher drive strength can support a greater fanout, i.e., drive more cells, than those of smaller drive strength. Synthesis tools take this
into consideration during mapping and logic optimization, as fanout violations can lead to chip failure. Accounting for a cell's drive strength can sometimes require minor modifications to the logic. Cells with higher drive strength are generally bigger.
• Area: Logic synthesis tools place heavy emphasis on reducing area using both technology-
independent and technology-dependent methods. There are algorithms used to first per-
form Boolean logic minimization, factoring out common sub-expressions, etc. This is fol-
lowed by technology-dependent optimization based on the standard cell library as some
libraries can offer sophisticated logic expressions that are optimized for area. At the gate level, cells with smaller drive strength can be used in places with small fanout to save
area.
Figure 1.3: Area optimization during logic synthesis
• Delay: The output of each cell incurs a non-zero delay before it settles to the correct logic
value. In a combinatorial circuit with many levels of gates, the delay of each gate and
the interconnection between gates add up to form the total delay of that combinatorial
circuit. Designers often specify the desired speed of the circuit, which places a limitation
on the maximum delay of that circuit. In addition, flip flops in the standard cell library
have a setup time requirement which again places a limit on the maximum delay of a path
between two flip flops. Synthesis tools perform logic optimization for minimizing delay.
This leads to modifications in the type(s) of cells used, since there could be cells that are
faster and hence better alternatives to a direct mapping of the circuit that is logically
described.
• Power: Cells with low power consumption are ideally preferred, but power trades off directly against delay; for instance, faster cells often consume more power.
Delay is generally prioritized over power since failure to meet timing can cause unintended
behavior.
• Design rules: These are rules specified by the technology vendor, the violation of which can lead to design failure. These rules are prioritized over area, delay, and power by
synthesis tools. For example, a cell can have a maximum fanout specified. If this is
exceeded in the circuit description, the logic has to be modified in a way that doesn’t lead
to further design rule violations, even if it entails an increase in area, power, or delay.
Similarly, there are limitations on the maximum capacitance a cell can drive, and the
maximum signal transition time allowed for it to be correctly recognized by the next cell.
Impact of standard cells: The greatest challenge to logic synthesis tools is to find a
sweet spot between what the user requires in terms of area, delay, power, and what the
technology library offers along with its design rules. Synthesis cost functions include all
these constraints, and tools constantly evaluate trade-offs between them. An important
point that needs mention is that there are variations in standard cells with respect to area,
power, and delay, that cannot be overlooked. For example, Figure 1.4 shows a simple example
of the area of commonly used standard cells from two different libraries, normalized to that
of a 2-input NAND gate of the same technology. It is easy to see that XOR and XNOR gates
are significantly bigger than other cells of an equivalent drive strength. Similar observations
can be made for delay and power consumption - they are different for different cells, and
depend on input signal transition and output load.
Figure 1.4: Area comparison of common 2-input standard cells from (a) TSMC 180 nm, and (b) Synopsys SAED 32/28 nm standard cell libraries. (Bar charts, omitted here, plot cell area in gate equivalents for XOR2, XNOR2, AND2, OR2, NAND2, NOR2, AOI, OAI, INV, and MX2 cells at drive strengths X1, X2, and X4.)
What this highlights is that a cryptographic LGC circuit dominated by XOR gates cannot
be directly assumed to be smaller in hardware than a more abstract representation, just by
virtue of having fewer gates. While differences of many hundreds of gates are likely to be
reflected on hardware, those on the order of a few tens of gates are not guaranteed to remain area-efficient after synthesis. This again points to an area-delay trade-off. The differences on
hardware depend on heuristics deployed by the synthesis tool to find an optimal mapping and
sizing of cells to meet design requirements. While the starting point could be the smallest
possible representation of the circuit, it is conceivable that the tool sees the need for certain
groups of gates to be replaced with compound gates in the library that better meet timing or
have a higher driving ability. Moreover, when integrated with a bigger design (for example,
when an AES SBox is plugged in between the other steps in an AES round), there are often
cells before and after the combinatorial block we are concerned with. This imposes further
delay constraints and can provide scope for logic optimization across logical boundaries as
shown in Figure 1.5.
Figure 1.5: Scope for cross-boundary optimization of a circuit.
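The point about XOR-heavy circuits can be made concrete with a back-of-the-envelope area estimate in gate equivalents (GE). The normalized areas below are invented, but follow the trend of Figure 1.4 that XOR cells are much larger than NAND/NOR cells; the example shows that a netlist with fewer gates can still occupy more area:

```python
# Illustrative sketch: total cell area in gate equivalents (GE) of two
# netlists. Normalized cell areas are invented, but follow the trend that
# XOR cells are much larger than NAND/NOR cells of equal drive strength.

AREA_GE = {"XOR2": 2.5, "AND2": 1.3, "OR2": 1.3, "NAND2": 1.0, "NOR2": 1.0}

def netlist_area(gate_counts):
    return sum(AREA_GE[g] * n for g, n in gate_counts.items())

lgc    = {"XOR2": 20, "AND2": 4}               # 24 gates, XOR-dominated
mapped = {"NAND2": 18, "NOR2": 6, "XOR2": 4}   # 28 gates, NAND-dominated
print(round(netlist_area(lgc), 1), netlist_area(mapped))  # 55.2 34.0
```

Here the 24-gate XOR-dominated netlist is over 60% larger than the 28-gate NAND-dominated one - fewer gates did not mean smaller hardware.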
1.2.2 Low Gate-Count (LGC) Synthesis Tool
This sub-section briefly discusses the important properties of circuit minimization techniques
proposed by Boyar and Peralta [8, 11, 41]. For an in-depth understanding of the methods
used, the interested reader can refer to the cited works.
Cryptographic logic primitives are optimized for low gate-count by partitioning the circuit
into its linear (XOR) and non-linear (AND) parts. The non-linear portion is first reduced
by techniques such as automatic theorem proving, resulting in a representation with fewer
AND gates than the original. The linear portion of the circuit is then reduced using a greedy
algorithm factoring out commonly used sub-expressions. The set of variables required to
represent the function is initially filled with all the input variables, and gradually “grows”
as it is filled in with sub-expressions that minimize the total number of XOR gates required.
This is performed repeatedly with random combinations of variables from the set, until a
target number of XOR gates or a predefined maximum time is reached. This technique was
used with the addition of greedy depth-minimization heuristics to obtain a very compact
circuit for AES SBox in [9]. These algorithms have also been used to obtain some of the
smallest known circuits for Galois Field arithmetic [12] and polynomial multiplication [10].
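A much-simplified sketch of the greedy factoring of the linear part is shown below: each output is a set of signals to be XORed together, and the loop repeatedly materializes the pair of existing signals shared by the most outputs. The actual tool [8, 11] uses randomized, distance-based heuristics; this deterministic toy version only conveys the idea of common-subexpression sharing:

```python
# Much-simplified illustration of greedy factoring of the linear (XOR)
# part of a circuit. NOT the actual LGC algorithm - a deterministic toy
# version of the common-subexpression-sharing idea.

from itertools import combinations

def greedy_xor_count(targets):
    exprs = [set(t) for t in targets]   # each output as a set of signals
    gates = 0
    while True:
        best, best_count = None, 1
        for a, b in combinations(sorted(set().union(*exprs)), 2):
            c = sum(1 for e in exprs if a in e and b in e)
            if c > best_count:          # pair shared by the most outputs
                best, best_count = (a, b), c
        if best is None:
            break
        a, b = best
        new = a + "+" + b               # new intermediate signal t = a XOR b
        gates += 1                      # one XOR gate computes it
        for e in exprs:
            if a in e and b in e:
                e -= {a, b}
                e.add(new)
    gates += sum(len(e) - 1 for e in exprs)   # chain the remaining XORs
    return gates

targets = [{"x0", "x1", "x2"}, {"x1", "x2", "x3"}, {"x0", "x1", "x2", "x3"}]
naive = sum(len(t) - 1 for t in targets)      # gate count without sharing
print(naive, greedy_xor_count(targets))       # 7 4
```

Even on this tiny linear system, sharing the sub-expression x1 ⊕ x2 across all three outputs reduces the XOR count from 7 to 4.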
In addition to their ability to provide high compactness, these algorithms are applicable
to a wide range of combinatorial circuits. While logic minimization of an individual function focuses on its computational properties, these algorithms are designed to accept a general
starting circuit as a Straight Line Program (SLP) to perform their optimization. An important distinction between the LGC tool and commercial tools is that the latter include technology-dependent factors in their cost functions, while the LGC tool optimizes purely logical metrics.
1.3 Our Contributions
The main contributions of our effort are listed as follows:
• Selected and partially developed a set of benchmark designs for evaluating the impact
of logic synthesis on circuits minimized by a new generation of low gate-count synthesis
techniques.
• Implemented a methodology for integrating Straight Line Programs (SLPs) obtained
from the low gate-count circuit minimization tool into a standard ASIC design flow.
• Developed the architecture and performed physical design of a prototype ASIC for
analysis of the area, delay, and power consumption of logic-minimized circuits on chip.
• Analyzed the impact of System-on-Chip integration on the area, power, and perfor-
mance of ciphers for authenticated encryption.
1.4 Attribution
The content presented in this thesis is the result of a collaborative project supervised by Dr.
Leyla Nazhandali and Dr. Patrick Schaumont. I developed the hardware analysis setup for
observing the impact of logic synthesis of combinatorial circuits (as standalone primitives
and as part of a complete design) under the guidance of my advisor, Dr. Nazhandali. Devel-
opment of the SoC architecture for hardware benchmarking of logic-minimized designs was
a joint effort by the two project supervisors and two graduate students at Secure Embedded
Systems Lab - Pantea Kiaei and myself. RTL design of hardware wrappers for all copro-
cessors except AES, and software programming interface for the same are the only elements
of this thesis that were not created by me. I performed logic synthesis and physical design
of our SoC, followed by analysis of system integration of authenticated encryption ciphers.
The hardware and software wrappers for various design alternatives in this analysis were
developed by me.
1.5 Thesis Organization
This chapter presented the primary motivation behind our work and relevant background
required to understand the content. The rest of this thesis is organized as follows.
• Chapter 2 presents the cryptographic benchmarks selected for comparison with LGC
circuits, followed by the evaluation methodology adopted for the same.
• Chapter 3 discusses important post-synthesis results for comparison, and the effective-
ness of LGC circuits in practical designs.
• Chapter 4 describes the architecture and design of our prototype ASIC.
• Chapter 5 analyzes the impact of integrating AEAD ciphers onto an SoC, and explores
several design alternatives for the same.
• Chapter 6 concludes the findings of this work.
Chapter 2
Cryptographic Benchmark Selection
and Experimental Methodology
This chapter presents the selection of cryptographic benchmark designs, beginning with the
primitives and design alternatives chosen for comparison.
2.1 Benchmark Designs
The first step of our analysis was to select appropriate benchmark circuits for comparing
the effectiveness of various synthesis methods. There were three main criteria that were
considered in this regard:
1. The LGC circuit optimization tool [8] has been used primarily on combinatorial designs
of cryptographic relevance [12]. We therefore narrowed our choices to circuits used in
cryptographic hardware. Further, we focus on circuits that are used as primitive blocks
in bigger cryptographic designs.
2. One of the main properties of the LGC tool is two-step minimization of non-linear and
linear portions of a circuit. Hence, we selected designs with high complexity in terms of
containing a mix of linear and non-linear components.
3. Circuit minimization can be performed both at the logic level and at the technology-mapping level. While there is little or no consideration of hardware constraints in the
former, the latter is performed by synthesis tools when all constraints have been specified.
To evaluate the effectiveness of optimization at different levels, we choose two types of
benchmark designs where possible - (i) an abstract representation of the input-output
relation with minimal external logic reduction, and (ii) a design that has been minimized
by exploiting the computational properties of the circuit.
The benchmark designs are listed as follows.
2.1.1 AES SBox
The AES SBox has been extensively studied and several implementations have been proposed
in literature [9, 14, 43, 45, 46] targeting various metrics for hardware efficiency.
1. The AES SBox at its highest level is an 8 × 8 look-up table. This simply specifies a
behavioral input-output relation for each of the 256 possible 8-bit values, leaving its gate-
level realization completely up to the logic synthesis tool. Our first reference design is
therefore a direct look-up table representation of the SBox, denoted as sbox_lut.
2. The computational properties of the SBox, i.e. Galois Field inversion followed by an affine transformation, allow for a direct mapping of the steps onto combinatorial logic. This has produced very compact SBox designs in literature. The SBox by Wolkerstorfer et al. [46] decomposes elements in GF(2^8) into two-term polynomials with coefficients in GF(2^4). Inversion is then performed in this sub-field owing to its simpler hardware implementation. Canright's design [14] further reduces the gate count by making use of representation over the composite field GF(((2^2)^2)^2), and the introduction of normal bases. These computational designs are denoted as sbox_wolkerstorfer and sbox_canright.
3. Another way of describing an SBox is using a Sum-of-Products or a Product-of-Sums
form derived from its truth table. This gives a single-stage Positive Polarity Reed-Muller
(PPRM) representation [42] of the SBox, denoted here by sbox_pprm1. Further, Morioka
and Satoh propose a 3-stage PPRM architecture [36], which restricts the PPRM represen-
tations to three different stages of the SBox. This is to take advantage of both the PPRM
structure and a composite field representation. We denote this design by sbox_pprm3.
4. The LGC version used here is the low gate-count SBox proposed by Peralta et al. [9],
denoted as sbox_lgc. This circuit was minimized by the LGC and depth-reduction tech-
niques discussed in [8, 9].
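The computational route shared by these designs — Galois Field inversion followed by an affine transformation — can be illustrated with a short behavioral sketch. This models only the mathematical structure (plain GF(2^8) arithmetic over the AES polynomial, with inversion by exponentiation), not any particular gate-level implementation; the helper names are ours:

```python
def gf_mul(a, b):
    """Multiply a and b in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by the field polynomial
    return r

def gf_inv(a):
    """Inversion by Fermat's little theorem: a^(-1) = a^254 (with inv(0) := 0)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def sbox(x):
    """AES SBox: GF(2^8) inversion followed by the affine transformation."""
    a = gf_inv(x)
    b = 0
    for i in range(8):
        bit = ((a >> i) ^ (a >> ((i + 4) % 8)) ^ (a >> ((i + 5) % 8))
               ^ (a >> ((i + 6) % 8)) ^ (a >> ((i + 7) % 8))) & 1
        b |= bit << i
    return b ^ 0x63  # affine constant from FIPS-197

print(hex(sbox(0x53)))  # 0xed, matching the standard AES SBox table
```

Composite-field designs such as sbox_wolkerstorfer and sbox_canright compute the same inversion via sub-field arithmetic rather than directly in GF(2^8); the input-output relation above is what all the listed designs realize.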
2.1.2 Binary Polynomial Multiplication
This can be viewed as multiplication of two polynomials of degree less than n over GF(2). A polynomial a(x) = a_{n-1}x^{n-1} + a_{n-2}x^{n-2} + \cdots + a_1x + a_0 is represented as an n-bit vector whose bits are the coefficients of a(x). The product c(x) of two such polynomials a(x) and b(x) is computed as

c(x) = a(x)\,b(x) = \left(\sum_{i=0}^{n-1} a_i x^i\right)\left(\sum_{i=0}^{n-1} b_i x^i\right)

Polynomial multiplication is generally performed as the first step of field multiplication,
and is followed by polynomial reduction. For multiplication in a field F_{2^n}, the arithmetic complexity of reduction is O(n), while that of multiplication is O(n^\omega), where 1 < \omega \le 2 [15].
It is therefore worthwhile to look at circuits for polynomial multiplication alone, which has
been an old and much-studied problem.
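Functionally, the product above is an XOR-accumulation of shifted partial products. A minimal behavioral sketch (a software model of the arithmetic, not a hardware description):

```python
def polymult(a, b, n):
    """GF(2) (carry-less) product of two n-bit polynomials a(x), b(x).

    Bit i of each operand is the coefficient of x^i; the result is the
    (2n-1)-bit coefficient vector of c(x) = a(x)b(x)."""
    c = 0
    for i in range(n):
        if (b >> i) & 1:
            c ^= a << i  # partial product a(x) * x^i, accumulated with XOR
    return c

# (x^2 + x + 1)(x + 1) = x^3 + 1 over GF(2)
print(bin(polymult(0b111, 0b011, 3)))  # 0b1001
```

The bit-parallel designs below compute exactly this function in one combinatorial pass; they differ in how the AND/XOR network is structured and minimized.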
A traditional bit-serial multiplier (Figure 2.1a) performs a shift-and-add procedure to first
form partial products before accumulating them to form the complete product. However, it
is too slow to be of practical cryptographic relevance, and is therefore not considered here.
Figure 2.1: (a) Example of a 4 × 4 bit-serial multiplication, (b) Representation of n × n bit-parallel multiplication.
The benchmarks used are listed below.
1. For the high-level representation, we make use of a bit-parallel multiplier (Figure 2.1b)
realized using matrix multiplication as described in [39]. It is well-suited to cryptographic
applications, and can be realized entirely as combinatorial logic employing GF (2) addition
and multiplication. This matrix-based design is referred to as polymult_mat.
2. Computational polynomial multiplier designs with low complexity were proposed by Bern-
stein in [7]. These involve splitting the polynomials into two, three, or four parts, followed
by recursive application of multiplication and elimination of common operations. The improved upper bounds and straight-line codes for 1 ≤ n ≤ 1000 are listed in [6], and are referred to as polymult_comp.
3. The LGC versions of polynomial multipliers are available at [10] for all the input lengths
considered here. Many of them are designs that used the aforementioned computa-
tional versions as starting points for further logic reduction. These are referred to as
polymult_lgc.
Since the complexity of binary multiplication grows quadratically with n, we perform comparison for a range of widths from 8 to 22 bits to evaluate how the efficiency of these designs
scales with design complexity.
2.1.3 Galois Field Multiplication
In a binary Galois extension field GF(2^n), an element A with the bit vector representation [a_{n-1} \cdots a_1 a_0] is represented using a polynomial A(x) = a_{n-1}x^{n-1} + \cdots + a_1x + a_0. Multiplication of two elements A and B in GF(2^n) is defined as the multiplication of polynomials A(x) and B(x) modulo an irreducible polynomial P(x).
We pick multipliers performing GF(2^8) and GF(2^16) multiplication, which are widely used in cryptographic applications [12]. In the reference designs, we use the AES polynomial as the field polynomial for the GF(2^8) multiplier - P_8(x) = x^8 + x^4 + x^3 + x + 1 - while the GF(2^16) multiplier uses the polynomial P_16(x) = x^16 + x^5 + x^3 + x + 1.
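Behaviorally, field multiplication is the polynomial product of Section 2.1.2 followed by reduction modulo the field polynomial. A sketch covering both reference polynomials (a software model, not a gate-level description):

```python
P8  = (1 << 8)  | 0b11011    # x^8 + x^4 + x^3 + x + 1 (AES polynomial)
P16 = (1 << 16) | 0b101011   # x^16 + x^5 + x^3 + x + 1

def gf_mult(a, b, n, P):
    """Multiply a, b in GF(2^n): carry-less product, then reduction mod P."""
    c = 0
    for i in range(n):                       # polynomial multiplication
        if (b >> i) & 1:
            c ^= a << i
    for d in range(2 * n - 2, n - 1, -1):    # cancel degrees 2n-2 down to n
        if (c >> d) & 1:
            c ^= P << (d - n)
    return c

print(hex(gf_mult(0x57, 0x83, 8, P8)))  # 0xc1 (the FIPS-197 worked example)
```

The benchmark multipliers below compute this same function; they differ in whether the reduction is folded into a single matrix (Mastrovito), kept as a separate linear map (Paar), or restructured over a tower field (LGC).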
1. The first benchmark considered is the Mastrovito multiplier first proposed in [34], and
later constructed for general irreducible polynomials in [24]. This architecture has long been one of the most popular GF multiplier designs owing to its low gate count.
Multiplication of two elements A and B modulo field polynomial P is performed by
introducing a product matrix Z = f(A(x), P(x)) as shown [34, 39]:

C = \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-1} \end{bmatrix} = ZB = \begin{bmatrix} f_0^{0} & \cdots & f_0^{n-1} \\ \vdots & \ddots & \vdots \\ f_{n-1}^{0} & \cdots & f_{n-1}^{n-1} \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{bmatrix}

where f_i^j \in GF(2) depend recursively on the coefficients of A and P. This architecture is denoted by gfmult_mastrovito.
2. Another widely used GF multiplier representation was proposed by Paar in [39], and
adopts a more high-level formulation. The pure polynomial multiplication of elements A
and B as shown in Section 2.1.2 is first computed. Modular reduction of the result is
then performed through a linear mapping represented by a reduction matrix R as shown
below [39].
C = \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-1} \end{bmatrix} = RAB = \begin{bmatrix} 1 & 0 & \cdots & 0 & r_{0,0} & \cdots & r_{0,n-2} \\ 0 & 1 & \cdots & 0 & r_{1,0} & \cdots & r_{1,n-2} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & r_{n-1,0} & \cdots & r_{n-1,n-2} \end{bmatrix} \begin{bmatrix} a_0 & 0 & \cdots & 0 \\ a_1 & a_0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n-1} & a_{n-2} & \cdots & a_0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{n-1} \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{bmatrix} \quad (2.1)
Here, r_{i,j} depend solely on the field polynomial P. This design is denoted by gfmult_paar.
3. There are two LGC versions that are considered for each of the two multipliers. The smallest versions produced by the LGC tool involve optimizations performed on top of a Tower-of-Fields (ToF) representation [12] by iteratively decomposing the problem to GF(2^2) operations. This however requires a mapping to and from the standard representation before and after multiplication, which is done by multiplication with a constant matrix [38]. We consider both a design using the ToF representation alone, excluding isomorphic mappings, denoted by gfmult_lgc_tof, and one which includes the isomorphic mapping and can fit in designs using a standard representation, denoted by gfmult_lgc.
• For GF(2^8) multiplication, the gfmult_lgc version is a circuit optimized for the AES polynomial, available at [10]. The gfmult_lgc_tof version requires fewer gates than the aforementioned design, and was obtained from the same source.
• For GF(2^16) multiplication, the circuit available to us was optimized for a ToF representation. We therefore performed the mapping to P_16(x) externally by multiplying with the appropriate constant matrices derived using the methods in [38, 40].
This allows us to analyze the impact of these mapping operations which become an
overhead in circuits where they are required.
Although both gfmult_paar and gfmult_mastrovito are matrix-based, gfmult_paar is a more
abstract and modular representation without logic minimization prior to synthesis.
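Both matrix formulations can be made concrete in a few lines and cross-checked against a plain shift-and-reduce multiplier. In this sketch (our own behavioral model for GF(2^8) with the AES polynomial, with matrix columns stored as bit vectors), column j of the Mastrovito matrix Z is a(x)·x^j mod P(x), while column j of Paar's reduction matrix R is the residue of x^j:

```python
N, P = 8, 0x11B  # GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1

def modmul(a, b):
    """Reference shift-and-reduce field multiplication."""
    r = 0
    for i in range(N):
        if (b >> i) & 1:
            r ^= a << i
    for d in range(2 * N - 2, N - 1, -1):
        if (r >> d) & 1:
            r ^= P << (d - N)
    return r

def mastrovito_product(a, b):
    """c = Z*B, where column j of the Mastrovito matrix Z is a(x)*x^j mod P(x)."""
    c, col = 0, a
    for j in range(N):
        if (b >> j) & 1:
            c ^= col          # XOR in the columns selected by the bits of b
        col <<= 1
        if (col >> N) & 1:
            col ^= P          # keep the column reduced modulo P
    return c

def paar_product(a, b):
    """c = R*(A·B): pure polynomial product first, then the linear reduction
    map R, whose column j is the residue of x^j modulo P(x)."""
    prod = 0
    for i in range(N):
        if (b >> i) & 1:
            prod ^= a << i
    c = 0
    for j in range(2 * N - 1):
        if (prod >> j) & 1:
            col = 1 << j                           # reduce x^j mod P
            for d in range(2 * N - 2, N - 1, -1):
                if (col >> d) & 1:
                    col ^= P << (d - N)
            c ^= col
    return c

assert mastrovito_product(0x57, 0x83) == paar_product(0x57, 0x83) == modmul(0x57, 0x83)
```

The sketch makes the structural difference visible: Mastrovito folds A and P into one n × n matrix, while Paar keeps the (2n−1)-term convolution explicit and reduces it with a separate matrix that depends only on P.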
2.1.4 Galois Field Inversion
Inversion in GF(2^m) is a computationally intensive component of many cryptographic algorithms [4, 25, 45]. Here, we consider circuits for GF(2^8), some of the smallest implementations of which have focused on different field representations [43] and bases used for the same [14, 37]. The benchmarks used in our work are listed below.
1. The first benchmark chosen was proposed in [43] as part of a compact AES SBox architecture. This design is based on representation in the composite field GF(((2^2)^2)^2), and is denoted as gfinv_comp.
2. Ueno et al. [45] recently proposed an even more compact GF inverter employing a combination of normal bases, Redundantly Represented Basis (RRB), and Polynomial Ring Representation (PRR) using a tower field GF((2^4)^2). This design exploits the wider variety of modular polynomials provided by the redundant representations to ultimately reduce circuit depth. This design will be denoted by gfinv_rrb.
3. The LGC GF(2^8) inversion circuit is optimized using techniques presented in [12] over a GF(((2^2)^2)^2) tower-of-field representation, and is termed gfinv_lgc.
There is significant variation in the types of field representation and bases used in the benchmarks owing to the rich design space for the same. Moreover, applications of GF(2^8) inversion involve merging the isomorphic mapping with other linear operations in the algorithm (for example, in an AES SBox, the isomorphic mapping is merged with the affine transformation [45]). We therefore do not consider these mappings in this comparison, and implement the circuits using their respective field representations.
2.1.5 Reed-Solomon Encoder
Reed-Solomon codes are a class of error-correcting codes and an important application area
of finite field arithmetic over GF (28) [33, 38]. A t-error correcting Reed-Solomon code
(represented as RS(n,k)) with m-bit symbols creates an n-symbol code word by adding 2t
parity-check symbols to a k-symbol message block. These parity-check symbols are obtained
by dividing the message polynomial by a generator polynomial of degree 2t + 1 with coef-
ficients in GF (2m). The hardware implementation of an RS(n,k) encoder is essentially a
division circuit that performs addition and multiplication of GF (2m) elements.
Figure 2.2: The reference circuit for an RS (255,223) encoder with 8-bit symbols [33].
Here, we compare the hardware implementations of the following RS(255,223) encoder de-
signs with 8-bit symbols.
1. A reference RS (255,223) design was built with the generator polynomial G(X) as specified
in [52]. The GF multiplications with coefficients of G(X) are represented as straightfor-
ward dataflow XOR operations, with their optimization left to DC. This design is termed
reedsolomon_ref.
2. The LGC version of an RS(255,223) encoder was provided to us by its designers [8]. It implements a completely XOR-based combinatorial circuit that takes as inputs a message byte m_i and the current state registers b_0, b_1, ..., b_31. For the first 223 cycles, the circuit computes a linear map from (b_0, b_1, ..., b_31, m_i) to the new values of the states, outputting b_31. The 32 parity-check symbols stored in the state registers at this stage are simply shifted out one-by-one for the next 32 cycles. This design is denoted by reedsolomon_lgc.
Figure 2.3: An RS (255,223) encoder employing logic-minimization for state update.
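The division-circuit structure of the encoder can be modeled behaviorally in a few lines. Note that the field polynomial (x^8 + x^4 + x^3 + x^2 + 1) and the generator roots (α^0 … α^31) assumed below are common conventions, not necessarily the exact parameters of [52]; the sketch only illustrates the LFSR division and the resulting 32 parity symbols:

```python
P = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1
ALPHA = 2  # x is primitive modulo 0x11D

def gmul(a, b):
    """Multiply in GF(2^8) modulo P."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= P
    return r

def gpow(a, e):
    r = 1
    for _ in range(e):
        r = gmul(r, a)
    return r

# g(x) = product of (x + alpha^i) for i = 0..31; g[d] holds the coefficient of x^d
g = [1]
for i in range(32):
    root = gpow(ALPHA, i)
    new = [0] * (len(g) + 1)
    for d, coeff in enumerate(g):
        new[d] ^= gmul(coeff, root)   # root * coeff * x^d
        new[d + 1] ^= coeff           # coeff * x^(d+1)
    g = new

def encode(msg):
    """Systematic RS(255,223) encoding: append the remainder of m(x)*x^32 mod g(x)."""
    state = [0] * 32                  # the b0..b31 parity registers
    for sym in msg:                   # highest-degree message symbol first
        fb = sym ^ state[31]          # feedback into the division circuit
        for d in range(31, 0, -1):
            state[d] = state[d - 1] ^ gmul(fb, g[d])
        state[0] = gmul(fb, g[0])
    return msg + state[::-1]          # parity appended, highest degree first

def poly_eval(cw, x):
    """Horner evaluation of a highest-degree-first coefficient list."""
    v = 0
    for coeff in cw:
        v = gmul(v, x) ^ coeff
    return v

msg = [(7 * i + 3) % 256 for i in range(223)]
cw = encode(msg)
# Every valid codeword vanishes at all 32 generator roots.
assert len(cw) == 255 and all(poly_eval(cw, gpow(ALPHA, i)) == 0 for i in range(32))
```

The 32-register feedback loop in `encode` is exactly the division circuit of Figure 2.2; the LGC version flattens one iteration of that loop into a single XOR-only linear map.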
2.2 Experimental Setup
2.2.1 Integration of SLPs into ASIC design flow
The LGC synthesis tool provides minimized circuits in SLP format. To seamlessly insert
these designs into a standard synthesis flow, these SLPs are first converted to dataflow
Verilog that can be input to Synopsys Design Compiler (DC) for logic synthesis. These
Verilog designs are parameterized for each benchmark design, and for the multipliers, they
are additionally parameterized for each input size. We obtained some of the LGC SLPs from
[10], and the rest were provided to us by the designers. The reference circuits were obtained
as listed below:
• sbox_lut is a straightforward look-up table representation. We implemented sbox_wolkerstorfer
and sbox_canright in dataflow Verilog from the expressions used in their construction
[46], [13]. The Verilog models of sbox_pprm1 and sbox_pprm3 were obtained from [3].
• The matrix multiplication in polymult_mat was implemented by employing appropriate
bitwise AND and XOR operations, and the design was parameterized for any input
length. polymult_comp circuits for each input length are available as SLPs in [6]. These
SLPs were converted into dataflow Verilog using a procedure similar to that used for
LGC SLPs.
• A VHDL representation of gfmult_mastrovito for general input sizes and field poly-
nomials was obtained from [2]. gfmult_paar was implemented using the expressions
specified in [39].
• The Verilog representation of gfinv_rrb was implemented based on logical expressions
specified in [45], and the Verilog for gfinv_comp was obtained from [3].
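The SLP-to-Verilog conversion described at the start of this section can be sketched as a small text transformation. The line format assumed here ("t0 = x0 + x1" for XOR, "y0 = t0 * x2" for AND) follows the straight-line-program style of the published LGC circuits, but the actual files may use a different syntax; this is a hypothetical converter, not the tool we used:

```python
OPS = {'+': '^', '*': '&', 'x': '&'}  # SLP operators mapped to Verilog bitwise ops

def slp_to_verilog(name, inputs, outputs, slp_lines):
    """Turn straight-line-program lines into a dataflow Verilog module."""
    temps, assigns = [], []
    for line in slp_lines:
        dest, expr = (s.strip() for s in line.split('='))
        lhs, op, rhs = expr.split()
        if dest not in inputs + outputs:
            temps.append(dest)        # intermediate signals become wires
        assigns.append(f"  assign {dest} = {lhs} {OPS[op]} {rhs};")
    body = [f"module {name}({', '.join(inputs + outputs)});"]
    body += [f"  input {p};" for p in inputs]
    body += [f"  output {p};" for p in outputs]
    body += [f"  wire {t};" for t in temps]
    body += assigns + ["endmodule"]
    return "\n".join(body)

print(slp_to_verilog("toy", ["x0", "x1", "x2"], ["y0"],
                     ["t0 = x0 + x1", "y0 = t0 * x2"]))
```

Because each SLP line is a single two-input operation, the emitted module is purely combinatorial dataflow Verilog, which DC can ingest directly.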
Figure 2.4: The complete design flow adopted for hardware analysis of LGC SLPs.
Logic synthesis of each design is performed at multiple frequencies using Synopsys Design
Compiler (DC). This is carried out to the point where the design no longer meets timing.
Area analysis makes use of elaborate reports generated by DC. Power analysis is performed
by first running a gate-level simulation of the netlists obtained at different frequencies, along
with post-synthesis delays annotated through a Standard Delay Format (SDF) file obtained
from DC. We feed 2^16 random inputs to each of the design alternatives and record the switching activity in a Value Change Dump (VCD) file using Modelsim. For combinatorial blocks with 8-bit inputs such as the SBox and GF(2^8) inverter, the test set is created in such a way that it covers all 2^16 possible 8-bit transitions. The VCD file obtained is then used to
calculate the power consumption of the circuits averaged over the simulation duration, using
Synopsys PrimeTime.
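One way to build a test set covering all 2^16 possible 8-bit transitions with exactly 2^16 + 1 vectors is an Eulerian circuit over the complete transition graph (every value has an edge to every value, self-loops included). This is a sketch of the idea; the thesis does not specify the exact generation method used:

```python
def transition_sequence(n=256):
    """Eulerian circuit over the complete directed transition graph on n values
    (Hierholzer's algorithm): every ordered pair of values appears exactly once
    as a consecutive pair, so an n*n + 1 vector stimulus covers all transitions."""
    nxt = [0] * n             # next unused successor value for each node
    stack, circuit = [0], []
    while stack:
        v = stack[-1]
        if nxt[v] < n:        # unused outgoing edge remains: follow it
            stack.append(nxt[v])
            nxt[v] += 1
        else:                 # dead end: emit the vertex
            circuit.append(stack.pop())
    return circuit[::-1]

seq = transition_sequence(256)
assert len(seq) == 256 * 256 + 1
assert len(set(zip(seq, seq[1:]))) == 65536  # all 2^16 transitions covered
```

Each element of `seq` would be driven as one input vector in the gate-level simulation, so every 8-bit-to-8-bit transition toggles the netlist exactly once per sweep.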
2.2.2 Design alternatives for LGC designs
1. DC modification: We synthesize the LGC circuits in two ways - (i) an unaltered version, where modification is allowed only in the size of gates, and not in the gate count or gate types, and (ii) a DC-modified version, where logic modification by DC is allowed to meet timing
constraints. This helps us evaluate both the direct results of the LGC tool, and whether
there is an improvement in the synthesis result from DC by providing an LGC circuit.
2. Pipelined LGC designs: While logic-minimized designs appear preferable to their
abstract counterparts owing to their compactness, higher speed achievable by the latter
presents a trade-off. The logic-minimized circuits, in general, have longer critical path delays, which can become prohibitive in high-speed designs. The idea of leveraging the
compactness of these circuits through pipelining has been introduced in literature [46],
to increase throughput without a significant increase in area. We therefore consider the
pipelined versions of the logic-minimized designs to evaluate the improvement in their
area-delay relationship, and thereby their suitability in being used in high-throughput
designs.
We utilize the register retiming feature available in DC to obtain pipelined versions of the
designs. This provides two benefits:
• Easy automation: The recommended method for retiming a design is to place flip-flops at the inputs or outputs of the RTL design before synthesis. The tools can then
push the flops into the combinatorial logic cloud to suitable positions depending on
critical path delays. This can be easily parameterized in the RTL, where registers are
placed at the inputs only if retiming is desired. In addition, this automated method
is close to the way such circuits are likely to be pipelined in an actual bigger design,
where manually placing registers in the middle of the combinatorial logic is infeasible
due to the large number of design choices possible for each desired frequency.
• Retiming is an indication of the level of branching in the circuit. The greater the branching, the more timing paths there are, and hence the higher the chances of inserting a register along a given path. Although we start with as many flip-flops as the number of design inputs, a design with higher branching can see a greater increase in the number of flops after synthesis. This could be due to longer critical path delays or the heuristics of the tool, or both. Either way, this provides a good evaluation metric to assess the benefit of pipelining LGC circuits.
2.2.3 Standard cell library choices
As described in Section 1.2.1, the quality of a post-synthesis netlist is strongly influenced by
the area, power, delay, and functionality of standard cells available in the library used. To
account for this variation, we perform our post-synthesis analysis on two different technology
nodes - (i) TSMC 180nm, and (ii) Synopsys SAED 32/28nm standard cell libraries.
Chapter 3
Experimental results of logic synthesis
of benchmark designs
In this chapter, we analyze the post-synthesis area, power, and performance of the bench-
marks selected. We then study these results to evaluate the hardware efficiency of LGC
techniques, and understand how well these results correlate with logical metrics.
3.1 Combinatorial logic synthesis results
The analysis in this section begins by comparing combinatorial benchmark designs at the
logical abstraction level before proceeding to post-synthesis comparison.
3.1.1 Technology-independent evaluation
At this level, designs are compared in terms of their logical complexity, estimated through the number of gates. A design with more gates can potentially be bigger on hardware, and a design with more levels of logic can potentially have a higher combinatorial delay. However, this estimate is inaccurate, and can often be misleading owing to the fact that standard cells come with diverse functionality, varying sizes, and drive strengths - a fact that is exploited by synthesis tools. This point will become clearer in the following two subsections.
Figure 3.1: Logical complexity of AES SBox designs. (a) Generic gate count comparison (sbox_lut: 1312, sbox_comp: 202, sbox_lgc: 125, sbox_canright: 180, sbox_pprm1: 1968, sbox_pprm3: 427); (b) Average fanout of nets and logical depth.
Figure 3.1 shows the technology-independent gate count, logical depth, and fanout for the benchmark AES SBox designs. This analysis alone suggests that the LGC design is roughly 10× and 15× smaller than sbox_lut and sbox_pprm1 respectively. In terms of delay, sbox_pprm1 has the fewest logic levels, while sbox_lut has the most, indicating that they could be the fastest and slowest respectively. Compared to the logic-minimized circuits, the PPRM and LUT designs are high-fanout structures, which is a result of their logical representations.
Figure 3.2: Technology-independent comparison of N × N polynomial multipliers, for N = 8 to 22. (a) Generic gate count vs N; (b) Number of logic levels vs N.
For polynomial multiplication, Figure 3.2 shows that the polymult_lgc and polymult_comp designs are very similar in structure, and their logical complexity scales better with N than that of polymult_mat, which requires almost 3× more gates at N = 22 bits. This is because the number of AND and XOR operations in polymult_mat grows quadratically with N. The logical depth of polymult_lgc is marginally higher, since the regular structure of polymult_mat can be leveraged to express it as a binary tree of gates with fewer logic levels.
Figure 3.3: Logical complexity of GF multipliers (gfmult_mastrovito, gfmult_paar, gfmult_lgc, gfmult_lgc_tof). (a) Generic gate count comparison for GF(2^8) and GF(2^16); (b) Average fanout per net and logical depth.
Similar to the polynomial multiplier, gfmult_paar has the highest gate count of the GF multipliers for both fields owing to its high-level matrix structure. While gfmult_lgc and gfmult_lgc_tof have very small gate counts for GF(2^8) multiplication, gfmult_lgc shows a very high gate count and logical depth for the GF(2^16) multiplier, primarily due to the mapping and inverse mapping operations that are included. This is in sharp contrast to gfmult_lgc_tof, which has the lowest gate count and logical depth for both multipliers as a result of the reduced complexity of the ToF representation. The LGC designs also have an average fanout that is smaller than gfmult_mastrovito but larger than gfmult_paar, since for the latter, the large number of constants in the matrix multiplication reduces the fanout-per-gate in the design.

Figure 3.4: Logical complexity of GF(2^8) inverters. (a) Generic gate count comparison (gfinv_comp: 128, gfinv_lgc: 104, gfinv_rrb: 118); (b) Logical depth and average fanout per net.

In summary, technology-independent analysis of LGC designs suggests that their low logical complexity stems not just from fewer gates, but also from the small average fanout and logical depth of the designs. Matrix-based multipliers have small average fanout and few logic levels due to their structure, but a higher gate count. Abstract SBox representations have a very large gate count and average fanout. The overall expectation from this analysis is that the logical compactness of LGC designs looks more suited to be exploited for smaller and faster designs. This however overlooks the fact that the abstract designs, albeit larger, have high technology-dependent optimizability, which will become clear in the next sub-section.
3.1.2 Post-synthesis analysis
Technology-independent comparison of designs is not guaranteed to be accurate due to the
flexibility in implementation offered by standard cells, and leveraged by logic synthesis tools.
Here, we focus on the technological complexity of designs, which involves the area, delay,
and power of their standard cell implementation, and is significantly affected by timing
constraints. We therefore perform logic synthesis of the benchmark designs over a range of
frequencies to analyze their performance limitations and the corresponding scaling of their
area and power. It must be mentioned that the frequencies (and delays) shown strictly
correspond to the particular technology libraries used, and can vary depending on the cells
available. However, we expect the relative differences between designs to remain the same
since they are largely a result of the structural properties of the designs.
Effect of DC Modification: During the integration of LGC circuits into a standard logic
synthesis flow, the default behavior of a synthesis tool involves logic modifications owing
to its heuristics. This can often undo the benefits of logic-minimization performed before
synthesis. An example of this is shown in Figure 3.5, for the AES SBox. Preventing any
logic modification to sbox_lgc keeps the design significantly smaller at large delays due to
its inherent low gate count.

Figure 3.5: Trade-off with DC modification

The real challenge is when the delay decreases - without any logic modification, all DC can do is make cells bigger, which causes a sharp increase in both
area and power. While the area still remains smaller than the version modified by DC,
there is a region where the power consumed by the unaltered sbox_lgc exceeds that of the
design modified by DC, presenting an interesting trade-off. At delays between 5.5-7ns, logic
modification by DC results in a design with 20-25% more area, but 10-30% less power. It
must be mentioned that DC modification does not cause drastic changes in the structure of
the circuit. The major changes involve a reduction in the number of XOR cells and cells
of higher drive strength, which results in less power in spite of higher area in the trade-off
region. Allowing DC modification therefore becomes a design decision depending on the
specific requirements of the target application. All analyses henceforth make use of the best
LGC version at each frequency.
3.1.2.1 AES SBox
The important results of our analysis are discussed in detail below.
1. sbox_lgc is the smallest of all in the minimal-area region. Figure 3.6 shows
the plot of area versus delay for the SBox at the 180nm and 32nm technology nodes.
The compactness of sbox_lgc can be clearly seen in the minimal-area region, where it is
17-55% smaller than sbox_lut, and 5-25% smaller than sbox_canright.
(a) TSMC 180nm standard cell library (b) Synopsys SAED 32nm standard cell library
Figure 3.6: Area vs Delay for SBox designs - technology-specific comparison.
2. sbox_lgc is well-suited for pipelining at the high-speed region. The effect of
timing constraints is seen around 200-240 MHz (4-5ns delay), through a sharp increase
in the area of sbox_lgc, sbox_wolkerstorfer, and sbox_canright; in the 180nm node, they
actually exceed that of sbox_lut. This is expected when the design transitions from its
minimal-area region to the high-speed region where logic modification and an increase in
cell count and drive strengths are needed to meet timing.
The compactness of sbox_lgc can be exploited in this region to add a pipeline stage, after
which its area gain over sbox_lut stays within ±20%. Increased branching in sbox_lut
results in the retiming heuristics inserting more registers into the logic cloud, causing a
greater increase in area. On the other hand, the small fanout of logic-minimized designs
keeps the area-increase after retiming to a level that is of practical utility. Among the
logic-minimized designs, sbox_lgc reaches a smaller delay (2.6 ns) compared to sbox_canright
(2.9 ns), but sbox_canright stays between 4-20% smaller than sbox_lgc after pipelining.
3. Without pipelining, sbox_lut achieves a better area-delay trade-off. Figure 3.7 gives insight into the nature of optimization performed on the SBox designs. sbox_lut has the highest average post-synthesis fanout and the least depth among all non-pipelined
designs. Despite high logical depth before synthesis, optimization during synthesis greatly
reduces the number of logic levels, along with a small increase in fanout. Similar to its
area, the fanout of sbox_lut remains flat as the frequency increases, indicating greater
flexibility to technology-dependent optimization.
On the contrary, the synthesis process increases the fanout and depth of logic-minimized
designs. Inserting a pipeline stage reduces the logical depth of sbox_lgc, which makes it
feasible to increase the fanout of each individual gate while still meeting timing. Without
pipelining, there is little or no scope for reducing logic levels, which necessitates increased
drive strength and addition of logic to meet timing at high speeds. Furthermore, logic-minimized designs are highly XOR-dominant, as opposed to sbox_lut. Increasing the drive strength of a large number of XOR gates becomes expensive due to their higher area as compared to other cells in the library (as was initially shown in Figure 1.4). sbox_lut,
in contrast, consists of zero XOR gates both before and after synthesis. This eventually
ends up having a greater impact on the area of logic-minimized designs. The number of
XOR cells after synthesis was observed to have an increased correlation to the design’s
area as we move into the high-speed region, and the plot can be found in Appendix A.
Figure 3.7: Greater branching and fewer logic levels in sbox_lut. (a) Average fanout of signal nets vs delay (180nm technology); (b) Maximum number of logic levels vs delay (180nm technology).

4. sbox_lut is more power-efficient in the 180nm technology node. This counter-intuitive result suggests that the area efficiency of sbox_lgc does not necessarily imply its power efficiency. In the minimal-area region where sbox_lgc was 25-55% smaller than sbox_lut, the former's power benefit decreases from 21% to -8%, and continues decreasing in the high-speed region.
Figure 3.8: Average power consumption of SBox designs plotted against delay. (a) TSMC 180nm standard cell library; (b) SAED 32/28nm standard cell library.
There are two major reasons for this. First, as discussed in the area analysis, reducing the
delay of sbox_lgc involves increasing the size and drive strengths of the cells. This causes
higher power consumption during every switching activity. For reference, in the TSMC
180nm technology used, dynamic power consumption of 2-input XOR and AND gates
with drive strength X2 and X4 can be anywhere between 1.6-2× and 2.8-3.9× greater
than that of their X1 counterparts. At a delay of 4.5 ns, sbox_lut has 8× more X1
cells, whereas sbox_lgc consists of 2.5× more X4 and 2× more X8 cells (Figure 3.9(a)),
explaining both the higher area and power at that speed.
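Both contributors can be tied to the standard first-order model of CMOS dynamic power, in which larger (higher-drive) cells raise the switched capacitance and more frequent toggling raises the activity factor. This is the textbook relation, not a figure from the thesis's power reports:

```latex
P_{dyn} = \alpha \, C_L \, V_{DD}^{2} \, f
```

where $\alpha$ is the switching activity, $C_L$ the switched load capacitance, $V_{DD}$ the supply voltage, and $f$ the clock frequency.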
[Figure 3.9: Drive strength and toggling of cells influence dynamic power. (a) Distribution of standard cell drive strengths in the high-speed region (4.5 ns delay); (b) toggle count per SBox computation vs. delay.]
The second reason is that greater toggling directly results in higher dynamic power,
which contributes to more than 99% of total power consumption at this technology node.
Although sbox_lut is 30-40% bigger with 2.5× more cells in the minimal-area region, it
involves just 5-10% more toggles per computation than sbox_lgc (Figure 3.9(b)). This
is reflective of the fact that by virtue of its ROM-structure, sbox_lut does consist of
more cells but very few of them are active for a particular table look-up, whereas logic-
minimized designs have a greater fraction of their cells active for each operation. The
higher toggling of logic-minimized designs is because they are dominated by dynamic
hazard-transparent XOR gates [36]. An XOR gate propagates a transition on any of its
inputs with a probability of 1, as opposed to a probability of 0.5 in other gates. The high
correlation of XOR count to the designs’ power consumption is shown in Appendix A.
Since sbox_lut is completely mapped and optimized by DC, XOR gates do not make up
a majority of the design. In summary, sbox_lgc toggles almost as much as sbox_lut, but
each toggle of the former is more expensive.
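The transition-propagation probabilities can be checked exhaustively for 2-input gates. A small sketch (illustrative helper name; one input toggles while the other is held at each possible value):

```python
def propagation_probability(gate):
    """Fraction of cases where toggling one input toggles the output,
    over all values of the toggling input and the held input."""
    flips = 0
    cases = 0
    for held in (0, 1):
        for a in (0, 1):
            before = gate(a, held)
            after = gate(1 - a, held)
            cases += 1
            flips += int(before != after)
    return flips / cases

print(propagation_probability(lambda a, b: a ^ b))  # 1.0: XOR always propagates
print(propagation_probability(lambda a, b: a & b))  # 0.5: AND propagates only if the other input is 1
print(propagation_probability(lambda a, b: a | b))  # 0.5: OR propagates only if the other input is 0
```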
5. sbox_lgc can potentially be power efficient at lower technology nodes due
to less leakage power. The contribution of leakage power to total power increases
significantly as we go to smaller technology nodes. With the 32nm technology node,
sbox_lgc is 25-45% smaller than sbox_lut. Since leakage power increases with area, it is
reasonable to expect it to have a high influence on overall power.
[Figure 3.10: SBox power consumption using 32 nm technology. (a) Leakage power vs. delay; (b) dynamic power vs. delay.]
In the minimal-area region (10 ns delay), the toggling properties of sbox_lgc result in 146%
more dynamic power (Figure 3.10(b)) as compared to sbox_lut. In the 180nm technology
node, this would have been directly reflected in the total power as well. However, in this
case, the leakage power of sbox_lut is about 190% more than that of sbox_lgc due to their
areas, which eventually keeps the total power of sbox_lgc down to just 15% more than
that of sbox_lut (Figure 3.8(b)). This is highly beneficial after pipelining. As was shown
in the area analysis, pipelining sbox_lgc results in its being smaller than sbox_lut in the
high-speed region. From Figure 3.10, at a delay of 5 ns, while the pipelined sbox_lgc
consumes 3% more dynamic power, a 40% reduction in leakage power brings about a
total power reduction of 23% (Figure 3.8(b)). Based on these observations, we believe
that at technology nodes smaller than 32nm, the effect of leakage power can potentially
result in sbox_lgc achieving an overall power reduction in the minimal-area region.
Summary: We conclude this analysis with Table 3.1, where sbox_lgc is compared with
the two best benchmark designs. In the table, - indicates smaller area (or lower power),
while + indicates higher area/power of sbox_lgc over its alternatives. The compactness of
sbox_lgc is well-reflected in hardware at low speeds. Achieving higher speeds comes at the
cost of an increase in both area and power over an abstract LUT-based design.
Comparison of sbox_lgc with    Region        Area        Power
sbox_lut                       Min-Area      - 54%       - 11-20%
sbox_lut                       High-Speed    + 2-13%     + 12-40%
sbox_canright                  Min-Area      - 17-24%    - 4-36%
sbox_canright                  High-Speed    + 4-22%     + 3-23%
Table 3.1: Summary of analysis results for sbox_lgc with TSMC 180 nm technology library.
3.1.2.2 Binary Polynomial multiplication
The analysis of polynomial multipliers is performed for multiplier sizes ranging from 8 to 22
bits, and is divided into two sets - (i) Varying the multiplier size (N) at fixed delays, and (ii)
Varying the delay for each value of N . The reason is that the space complexity of polynomial
multiplication grows quadratically with N , and hence, an architecture that scales better with
N is desired. It can be noted from Figure 3.11(a) that the area-delay curve follows a similar
pattern for different multiplier sizes.
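As context for the quadratic space complexity, schoolbook carry-less (GF(2)[x]) multiplication forms N^2 AND partial products that are folded together with XORs. A minimal sketch, not any of the benchmark implementations:

```python
def polymult(a, b, n):
    """Schoolbook carry-less multiplication of two n-bit binary polynomials:
    n*n AND partial products, combined column-wise with XORs."""
    result = 0
    partial_products = 0
    for i in range(n):
        for j in range(n):
            bit = ((a >> i) & 1) & ((b >> j) & 1)  # one AND gate
            result ^= bit << (i + j)               # folded into the XOR reduction
            partial_products += 1
    return result, partial_products

# (x^3 + x + 1) * (x + 1) = x^4 + x^3 + x^2 + 1
prod, count = polymult(0b1011, 0b0011, 4)
print(bin(prod), count)  # 0b11101 16
```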
Salient results of this analysis are listed as follows.
1. A matrix-based polynomial multiplier scales better with N at higher speeds.
Figure 3.11(a) shows that minimal-area designs of polymult_lgc and polymult_comp are
smaller than polymult_mat. This is understandable because the former two are inherently
compact, and a minimal area design doesn’t involve significant modification by DC since
the designs meet timing comfortably. Moreover, this area reduction increases with N .
[Figure 3.11: Area comparison for polynomial multipliers using TSMC 180 nm technology. (a) Area vs. delay for fixed sizes; (b) area vs. N at fixed delays.]
For instance, polymult_lgc is only 6% smaller than polymult_mat in the minimal-area
region for an 8× 8 multiplier, whereas it is 25% smaller in the same region for a 22× 22
multiplier. This trend is shown in Figure 3.13(a).
As the delay decreases, polymult_lgc and polymult_comp incur a sharp increase in area to
a point where their plots cross that of polymult_mat, beyond which they are consistently greater than polymult_mat. This rate of increase and the difference in the areas of
polymult_lgc and polymult_mat increases with N. polymult_lgc is 25% bigger than polymult_mat for an 8 × 8 multiplier, while it gets up to 40-50% bigger for larger multipliers.
This can also be intuitively understood from Figure 3.11(b), where the gap between the
areas of polymult_lgc and polymult_mat widens with the delay for each value of N . Note:
In Figure 3.11(b), the area of polymult_mat remains unchanged for both 10ns and 3ns
delay, while polymult_lgc gets bigger.
These observations are a result of the balanced tree structure of polymult_mat (Figure
3.12(a)), which offers remarkable potential to collapse the circuit into fewer levels of gates.
The logical depth of polymult_lgc, on the other hand, depends on the sub-expressions
factored out and the logic minimization heuristics at work. This is also seen in Figure
3.12(b) where polymult_mat is synthesized with significantly fewer levels (within 6-10)
whereas polymult_lgc has greater variations in logical depth, which is 2-3 levels higher than that of the corresponding polymult_mat design.

[Figure 3.12: (a) Regular and balanced structure of polymult_mat, with logical depth on the order of log2(N); (b) post-synthesis circuit depth vs. delay for polymult_lgc and polymult_mat at N = 8, 16, and 22.]
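The log2(N) depth claim can be illustrated with a small sketch: the widest output column of an N × N polynomial multiplier sums N partial products, which a balanced binary XOR tree reduces in ceil(log2(N)) levels (illustrative helper, not the thesis's netlists):

```python
import math

def balanced_xor_depth(n_terms):
    """Logic levels needed to XOR n_terms together as a balanced binary tree."""
    depth = 0
    while n_terms > 1:
        n_terms = math.ceil(n_terms / 2)  # pair up terms at each level
        depth += 1
    return depth

for n in (8, 16, 22):
    print(n, balanced_xor_depth(n))  # 3, 4, 5: matches ceil(log2(N))
```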
2. Pipelining polymult_lgc does not provide an improvement in area over polymult_mat. Unlike the SBox, where pipelining sbox_lgc significantly brings down its area
at smaller delays, Fig. 3.11(a) shows that pipelining polymult_lgc for N = 22 does not
have significant area benefit, while it does enable the design to reach smaller delays. This
is mainly because the cell count of polymult_mat is much closer to that of polymult_lgc than the cell count of sbox_lut is to that of sbox_lgc. Therefore, in spite of an improvement in speed
of polymult_lgc after pipelining, polymult_mat still scales better in the high-speed region. Figure 3.11(b) shows the area after pipelining polymult_lgc and polymult_comp for
different sizes at a delay of 2.2 ns. The area is consistently higher than that of polymult_mat,
indicating that pipelining logic-minimized multipliers does not offer a better area-delay
trade-off over a non-pipelined matrix multiplier.
3. The differences in area between polymult_lgc and polymult_comp are within
±20%, with polymult_comp marginally better at high speeds. Figure 3.13(b)
shows the percentage of area reduction obtained from polymult_lgc over polymult_comp.
[Figure 3.13: Percentage reduction in area of polymult_lgc over alternatives, 180 nm technology. (a) Reduction over polymult_mat, N = 8, 12, 16, 20, 22; (b) reduction over polymult_comp, N = 8, 11, 16, 17, 22.]
The differences between polymult_lgc and polymult_comp are largely within 15% at the
minimal-area region. Figure 3.13(b) suggests no clear pattern either with respect to N
or with respect to delay. For instance, polymult_lgc has a clear positive area reduction
percentage for N = 11 and N = 22, whereas, for N = 16 and N = 17, there are
opposing patterns of area reduction at the minimal-area and high-speed regions. We
believe that these variations are due to the fact that the logic minimization techniques
used in polymult_lgc and polymult_comp are very similar, with the former incorporating
additional techniques over many of the latter’s designs. As a result, the differences in their
logical descriptions are minimal, and the variations in post-synthesis area largely stem
from specific standard cells used and the individual circuit structures after optimization.
4. The power consumed by a matrix-based multiplier scales better with both N
and delay. This point can be observed from Figures 3.14(a) and 3.14(b), for 180nm
technology. Similar figures for 32nm technology are presented in Appendix A. The dif-
ferences in power consumption for small values of N are negligible, but become higher as
N increases to 22. In the minimal-area region, the power benefit offered by polymult_lgc
depends on the size. There is up to a 15% power reduction for most values of N < 15,
but beyond that, there is an increase in power consumed by polymult_lgc. As speed in-
creases, the benefits in power decrease as well, and in the high-speed region, there is up
to a 70-80% increase in power for certain values of N .
[Figure 3.14: Average power of polynomial multipliers using TSMC 180 nm technology. (a) Power vs. delay for N×N multipliers; (b) power vs. N at fixed delays of 10, 4, and 2.9 ns.]
5. Dynamic Power is significantly influenced by gates with unbalanced input
delays. In the case of the SBox, it was argued that the inherent structure of an LUT-based design results in fewer cells being active at any instant, which makes its power scale better at high speeds. Here, polymult_mat is a bit-parallel design where no such property is expected to exist. Moreover, both designs are dominated by similar types of gates, which means it is reasonable to expect polymult_lgc to consume less power in the minimal-area region. However, polymult_mat still consumes less power in that region than polymult_lgc, due to fewer toggles per computation. This, we found, is a
result of gates having unbalanced delays at their inputs. This point was previously mentioned with regard to the construction of an SBox [36].

[Figure 3.15: Cells with unbalanced input delays, 180 nm technology. (a) Number of XOR cells vs. delay; (b) total number of cells vs. delay, for polymult_mat and polymult_lgc.]

In spite of both designs
being XOR-dominant, polymult_lgc has more gates which have their inputs at different
logic levels (Figure 3.15). This can again be traced back to the inherent balanced struc-
ture of polymult_mat (Figure 3.12(a)). While LGC tools [8, 12] make use of gate count
and/or logical depth as their cost function, the process can potentially increase power
consumption by removing certain properties of the design such as balanced logic levels.
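The link between unbalanced input arrival times and extra toggling can be illustrated with a toy model (ideal XOR gates with unit delay, all primary inputs flipping at t = 0; an illustration only, not the synthesis tool's timing analysis):

```python
from collections import Counter

def xor_toggles(a_times, b_times, delay=1):
    """Output toggle times of an ideal 2-input XOR with unit delay.
    An XOR propagates every input transition, but transitions arriving
    at exactly the same time cancel in pairs (the output is unchanged)."""
    counts = Counter(a_times) + Counter(b_times)
    return sorted(t + delay for t, k in counts.items() if k % 2 == 1)

def chain_toggles(n):
    """Toggles at the output of an unbalanced chain ((x0^x1)^x2)^...,
    with every primary input flipping at t = 0."""
    sig = [0]
    for _ in range(n - 1):
        sig = xor_toggles(sig, [0])
    return len(sig)

def tree_toggles(n):
    """Toggles at the output of a balanced XOR tree (n a power of two):
    transitions arrive pairwise-simultaneously at every level and cancel."""
    level = [[0]] * n
    while len(level) > 1:
        level = [xor_toggles(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return len(level[0])

print(chain_toggles(8), tree_toggles(8))  # 6 0: the unbalanced chain glitches, the balanced tree does not
```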
Summary: Figure 3.16 presents a heat map depicting the high area-efficiency of LGC
designs at low speeds for all sizes, and their power-efficiency at low speeds only for small
multipliers. A matrix-based multiplier is a better choice at high speeds with regard to both
area and power-efficiency.
[Figure 3.16: Heat map illustrating the area and power-efficiency of polymult_lgc over polymult_mat. (a) Area-efficiency; (b) power-efficiency.]

3.1.2.3 Galois Field Multiplication

1. The TOF-based LGC designs are smaller. The area estimate from technology-independent analysis in this case was fairly accurate, considering the observation that gfmult_lgc and gfmult_lgc_tof are around 25% and 15% smaller than gfmult_mastrovito and gfmult_paar respectively in the minimal-area region for GF(2^8) multiplication. For GF(2^16) multiplication, gfmult_lgc_tof is the most compact at low speeds, and is 25% and 20% smaller than gfmult_mastrovito and gfmult_paar respectively.
[Figure 3.17: Area-delay plots for GF multiplier designs, GF(2^8) and GF(2^16). (a) TSMC 180 nm standard cell library; (b) SAED 32/28 nm standard cell library.]
The representations of gfmult_paar and gfmult_mastrovito are matrix-based, which makes them more optimizable. Still, this does not degrade the area benefit of LGC designs for
GF (28) multiplication. At the high-speed region, gfmult_lgc and gfmult_lgc_tof are 20%
smaller than gfmult_mastrovito, and 10% smaller than gfmult_paar. There is, however, an impact on the area of gfmult_lgc_tof in GF(2^16) multiplication, where it offers an
area benefit within ±5% of gfmult_mastrovito and gfmult_paar. The extremely compact
logical structure of gfmult_lgc_tof makes it possible for it to not significantly exceed the
areas of the other two designs in spite of logic addition during synthesis. The cell counts of
gfmult_mastrovito and gfmult_paar are about 20-30% greater than that of gfmult_lgc_tof, but that is offset by the higher drive strength (and hence bigger size) of the latter's cells, and
the small differences in area really come down to specific technology mapping operations
performed by DC.
2. Including the mapping and inverse mapping operations over the LGC designs is expensive. In the case of GF(2^16) multiplication, in contrast to the small size of
gfmult_lgc_tof, we see that adding the mapping operations makes the area of gfmult_lgc
more than double that of the other designs. We therefore believe that using gfmult_lgc_tof
offers compactness for designs where the mapping operations are either merged with other
linear transformations, or where they are performed once at the very beginning and the
end of a series of operations including GF multiplication in TOF representation. In case a
logic-minimized GF multiplier for a particular field polynomial is desired, using an LGC
design specific to that polynomial can give a compact design, as is seen from the area
plots for GF(2^8) multiplication for the AES polynomial. Another alternative is that the
GF multiplier in TOF-representation can be optimized by LGC designs together with the
mapping operations. We did not, however, have such an SLP, and hence cannot comment on its area properties.
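For reference, the operation these circuits implement, GF(2^8) multiplication modulo the AES field polynomial x^8 + x^4 + x^3 + x + 1 (0x11B), can be sketched in software as follows:

```python
def gf256_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8): carry-less multiplication interleaved with
    reduction modulo the field polynomial (0x11B is the AES polynomial)."""
    result = 0
    while b:
        if b & 1:
            result ^= a       # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1
        if a & 0x100:         # degree reached 8: reduce
            a ^= poly
    return result

print(hex(gf256_mul(0x57, 0x83)))  # 0xc1, the worked example from FIPS-197
```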
3. The TOF-based LGC GF multipliers are more power-efficient. For GF(2^8)
multiplication, gfmult_lgc_tof consumes 30-35% less power than both gfmult_paar and
gfmult_mastrovito. In addition to the extreme compactness of the LGC design, the
low complexity of a GF(2^8) multiplier means very few logic levels (7-8) as compared to
gfmult_paar (9-10). The area of gfmult_lgc_tof does not blow up due to bigger cells,
and the power therefore remains very low.
[Figure 3.18: Average power consumption of GF multipliers plotted against delay, for GF(2^8) and GF(2^16) designs. (a) TSMC 180 nm technology; (b) Synopsys 32 nm technology.]
In the case of GF(2^16) multiplication, when mapping and inverse mapping are applied externally, gfmult_lgc consumes 2× higher power as compared to the matrix-based designs.
Without mapping, gfmult_lgc_tof has 11-12 logic levels as compared to gfmult_paar’s
12, which is due to the balanced properties of the latter, making it scale better with speed.
In spite of having 150-200 fewer cells, the power of gfmult_lgc_tof remains within ±10%
of gfmult_paar due to higher toggles and bigger cells.
Summary: LGC GF multipliers based on Tower-of-Fields representation are compact and
power-efficient at most frequencies. At high speeds, their power consumption is not substan-
tially higher than that of benchmark designs. Conversion to and from a standard representa-
tion using mapping operations around the LGC multiplier results in large and power-hungry
hardware. Table 3.2 summarizes the comparison of LGC designs with gfmult_paar.
Comparison with gfmult_paar                   Minimal-area region      High-speed region
LGC Design                                    Area        Power        Area      Power
GF(2^8) multiplier, TOF-based                 - 12-25%    - 29-40%     ± 15%     ± 15%
GF(2^8) multiplier, AES polynomial            - 12-25%    - 21-31%     - 1-15%   - 6-15%
GF(2^16) multiplier, TOF-based                - 15-19%    - 2-4%       ± 8%      ± 10%
GF(2^16) multiplier, TOF, external mapping    + 36-170%   + 200%       + 200%    + 200%
Table 3.2: Summary of analysis results for gfmult_lgc with TSMC 180 nm technology library.
3.1.2.4 Galois Field Inversion
gfinv_lgc is smaller than gfinv_comp but larger than gfinv_rrb. In most cases,
gfinv_rrb is significantly smaller than gfinv_lgc. The starting design of gfinv_lgc has a logical depth of 19 compared to gfinv_rrb's 12, while having only 12 fewer gates. In addition,
gfinv_rrb has a greater parallelism in its structure, indicated by its high fanout (Figure
3.4(b)). All of these are reflected in the post-synthesis results. Due to significantly fewer
logic levels, the area of gfinv_rrb scales better with delay, and attains a minimum delay of
2.9 ns as compared to 3.8 ns by gfinv_lgc. Higher parallelism and lower logical depth also
enable a smoother transition into the high-speed region. This is because gfinv_lgc incurs a
sharp increase in cell count to meet timing, in contrast to the relatively flat cell count plot
of gfinv_rrb (Figure 3.20(a)).
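For reference, the function these circuits compute, multiplicative inversion in GF(2^8), can be sketched via Fermat's little theorem (a^(2^8 - 2) = a^(-1)). The AES field polynomial is assumed here for concreteness; the benchmark designs may use a different representation:

```python
def gf256_mul(a, b, poly=0x11B):
    """Carry-less multiply with modular reduction (AES polynomial assumed)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return result

def gf256_inv(a):
    """Inverse as a^(2^8 - 2) = a^254; 0 maps to 0, as in the AES SBox."""
    result = 1
    for _ in range(254):
        result = gf256_mul(result, a)
    return result

x = 0x53
print(hex(gf256_inv(x)))           # 0xca, the classic Rijndael worked example
print(gf256_mul(x, gf256_inv(x)))  # 1
```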
Pipelining gfinv_lgc does have an observable reduction in area at a delay of 3.8-4.1 ns with
the 180 nm technology node, and enables the design to reach delays as low as those of
gfinv_rrb. While gfinv_lgc is up to 107% larger than gfinv_rrb in the high-speed region
without pipelining, it is brought down to sizes that are 35-55% larger than gfinv_rrb by
inserting a pipeline stage. This is due to reduction in logical depth (Figure 3.20(b)), which
is, however, offset by the large cell count, making the overall area bigger.

[Figure 3.19: Area (K gate equivalents) vs. delay for GF(2^8) inverter designs, using (a) TSMC 180 nm and (b) SAED 32/28 nm standard cell libraries.]
[Figure 3.20: Post-synthesis cell count (a) and circuit depth (b) vs. delay for GF(2^8) inverter designs, 180 nm technology.]
Power comparison of the GF inverter designs (Figure 3.21) shows patterns similar to those observed in the area analysis, which are a direct result of the better parallelism of gfinv_rrb.
[Figure 3.21: Average power consumption of GF(2^8) inverters plotted against delay. (a) TSMC 180 nm technology; (b) Synopsys SAED 32 nm technology.]
3.2 Integrated Design Examples
The analyses performed in Section 3.1.2 were entirely based on the consideration of com-
binatorial blocks as standalone primitives. The primary reason for having chosen these
benchmarks in Section 2.1 was that they often have practical utility in bigger designs. We
believe that it is important to assess the effectiveness of combinatorial optimization towards
minimizing the area or power of the complete system they are a part of. For instance, a large area reduction in a combinatorial primitive is of little benefit if it leads to only a negligible (say, less than 5%) reduction in the total system area. This is affected by both
the contribution of the primitive to the bigger design’s area, as well as the optimization
performed by combining these primitives with external logic that is part of the design. This
section analyzes two such designs: an AES encryption engine and a Reed-Solomon encoder.
3.2.1 Comparison of AES designs with different SBox circuits
In this sub-section, we analyze the logic synthesis results of AES designs making use of
different SBox circuits. The aim of this study is to highlight the diversity in AES designs,
and the fact that evaluating logic-minimized combinatorial primitives highly depends on the
type of design they are plugged into. It was of interest to observe whether the area or power results varied with the number of SBox structures. Similar results for 32 nm technology can be
found in Appendix A.
Figure 3.22: Standard and high-throughput AES designs.
3.2.1.1 Standard AES
This is an AES design with one SBox for each byte of the state and the key. There are
therefore 20 SBox circuits in total, and each encryption operation is completed in 10 clock
cycles, i.e. one round per clock cycle. The AES design is synthesized with different SBox
designs, with and without pipelining for logic-minimized designs.
The choice of SBox in standard AES is critical because it makes up more than 45% of the
total design. Similar to the standalone SBox results, aes20_lgc is 7% and 32% smaller than
aes20_canright and aes20_lut respectively. Due to better scaling of aes20_lut with speed,
the area reduction of aes20_lgc comes down to about 3% as it enters the high-speed region,
beyond which pipelining keeps the area benefit within 9-16% over aes20_lut. An advantage
of pipelining the design is that although each encryption now takes 20 clock cycles instead of
10, two different encryption operations can be performed in these 20 cycles, thereby ensuring
no loss of throughput. When plugged into an AES design, aes20_lgc is marginally better-
suited to pipelining than the other logic-minimized designs, offering 6-12% area reduction.
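The no-loss-of-throughput argument is simple arithmetic. A sketch, with an illustrative (assumed) clock period:

```python
def throughput_meps(clock_ns, cycles_per_encryption, interleaved_ops=1):
    """Encryptions per second (in millions) for a round-iterative AES core."""
    encs_per_cycle = interleaved_ops / cycles_per_encryption
    return encs_per_cycle * (1e9 / clock_ns) / 1e6

clk = 5.0  # ns, assumed for illustration
print(throughput_meps(clk, 10))                     # non-pipelined: 10 cycles per encryption
print(throughput_meps(clk, 20, interleaved_ops=2))  # pipelined: 20 cycles, 2 encryptions in flight
# Both evaluate to 20.0 M encryptions/s at the same clock: pipelining costs no
# throughput, while additionally letting the pipelined design reach a shorter clock period.
```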
[Figure 3.23: Post-synthesis comparison of standard AES designs with 180 nm technology. (a) Area vs. throughput (pipelined and non-pipelined); (b) power vs. throughput.]
Turning to power, the higher toggling of logic-minimized SBoxes, along with cells of high drive strength, results in aes20_lgc consuming 15-25% more power than aes20_lut
in the minimal-area region. It is important to note the effect of pipelining on power here.
From Figure 3.23(b), even the pipelined version of aes20_lgc consumes power that is within
5% of that of aes20_lut. While pipelining makes it easier for logic minimized designs to
reduce the size of cells and meet timing, their rates of toggling do not reduce by much. In
addition, the number of cells active for each SBox computation remains high since there
are two encryption operations taking place to maintain throughput. With 32 nm cells, the
higher area of aes20_lut results in higher leakage power, but it is still an order of magnitude
smaller than the dynamic power. Both the pipelined and non-pipelined versions of aes20_lgc
are more power-efficient than aes20_canright and aes20_wolkerstorfer.
3.2.1.2 High-throughput AES
The second AES design is a high-throughput version. This design employs double the number
of SBox circuits present in standard AES, along with an intermediate set of state registers,
thereby achieving twice the throughput of standard AES. The area and power results are
largely similar to those of the standard AES.

[Figure 3.24: Post-synthesis comparison of high-throughput AES designs, 180 nm technology. (a) Area (K gate equivalents) vs. throughput; (b) average power vs. throughput.]

The area reduction of aes40_lgc remained about
30% and 15% over aes40_lut and aes40_canright respectively in the minimal-area region.
The increase in number of SBox circuits is reflected in the 32 nm technology node, where the
area gain of aes40_lgc jumps to more than 40% over aes40_lut. The power consumption of
aes40_lgc is higher by 28-40% than that of aes40_lut in the minimal-area region with 180
nm technology. This is more than that seen for standard AES due to doubling of the number
of SBox circuits, which results in higher toggling in them. This difference falls to less than 15% with 32 nm technology due to the impact of aes40_lut's larger area on its leakage power.
3.2.1.3 Lightweight AES
The final AES design is a lightweight version (Figure 3.25) consisting of only 4 SBoxes in
total. Multiplexers are inserted at the SBox inputs to send either the round key or one word
of the state.

Figure 3.25: Lightweight AES design comprising only four SBox circuits.

There is significant combinatorial logic external to the SBox circuits in this
design. This is an interesting design alternative because, with only 4 SBox circuits in total,
the benefits in area offered by logic-minimized designs are not expected to be as pronounced as in the previous versions.

[Figure 3.26: Post-synthesis comparison of lightweight AES designs with 180 nm technology. (a) Area (K gate equivalents) vs. throughput; (b) average power vs. throughput.]

This is reflected in an area benefit of only 7-12% obtained
from aes4_lgc at the minimal-area region. There are two important points regarding the
high-speed region. First, pipelining aes4_lgc comes at a throughput loss of 20%. This is
because this design computes the substitution operation for the AES state in four clock
cycles. There is an additional clock cycle for computing the substitution operation for each
word of the key. After retiming, there is a latency of one cycle per round, and this cycle
cannot be filled by another AES encryption. Second, the area of aes4_lut scales sharply in the
high-speed region. This suggests that the external multiplexing logic has a greater impact
on the area of aes4_lut when it comes to meeting timing. Optimization is now performed on
the combined cloud of combinatorial logic, which is seen to be better in the case of aes4_lgc.
This is seen in both the technology nodes, making aes4_lut 10-15% larger than pipelined
versions of logic-minimized designs for similar throughput.
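The five-cycle round schedule described above, four cycles for the sixteen state bytes plus one for the key word, can be sketched at the cycle level as below. This is an illustrative model of the sharing of four SBoxes, not the thesis RTL, and toy_sbox is a placeholder for the actual AES SBox.

```c
#include <stdint.h>

typedef uint8_t (*sbox_fn)(uint8_t);

/* Placeholder for the real AES SBox (illustration only). */
static uint8_t toy_sbox(uint8_t x) { return (uint8_t)(x ^ 0x63); }

/* Substitutes the 16 state bytes and the 4 bytes of one key word using
   only 4 shared SBoxes; returns the number of clock cycles consumed. */
static int sub_bytes_shared(sbox_fn sbox, uint8_t state[16], uint8_t key_word[4]) {
    int cycles = 0;
    for (int c = 0; c < 4; c++, cycles++)        /* 4 cycles for the state  */
        for (int s = 0; s < 4; s++)              /* 4 SBoxes in parallel    */
            state[4 * c + s] = sbox(state[4 * c + s]);
    for (int s = 0; s < 4; s++)                  /* 1 extra cycle for the   */
        key_word[s] = sbox(key_word[s]);         /* key-schedule word       */
    cycles++;
    return cycles;                               /* 5 cycles per round      */
}
```

The returned count makes the retiming argument concrete: one of the five cycles per round cannot be overlapped with another encryption, giving the 20% throughput loss noted above.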
The effect of combined combinatorial optimization is also reflected in the power consumption
of aes4_lut, which goes up to 30% higher than aes4_lgc - pipelined at high speeds. Difficulty
in meeting timing has now resulted in optimization breaking up the structural properties
of aes4_lut with logic addition and modification. This ultimately leads to greater toggling
in aes4_lut than the LGC designs. As for logic-minimized designs, the differences remain
largely within ± 10% after pipelining. With just 4 SBox structures, these differences really
boil down to the specific combined optimization moves performed by the tool.
In summary, the benefits of a smaller SBox are diminished when integrated with an AES
design, especially in its lightweight version. Table 3.3 presents a summary of the analysis of
integration of different SBox circuits with AES designs.

Table 3.3: Analysis summary of AES designs with LGC SBoxes, using 180 nm technology.

AES Type                    Comparison                     Region       Area       Power
Standard AES (20 SBoxes)    aes20_lgc vs. aes20_lut        Min-Area     -12-32%    +12-25%
                                                           High-Speed   -9-16%     -1-19%
                            aes20_lgc vs. aes20_canright   Min-Area     -9-14%     -12-21%
                                                           High-Speed   -6-14%     (-10%)-(+5%)
Lightweight AES (4 SBoxes)  aes4_lgc vs. aes4_lut          Min-Area     -3-12%     (-8%)-(+25%)
                                                           High-Speed   -3-16%     -0-30%
                            aes4_lgc vs. aes4_canright     Min-Area     -4-8%      -0-18%
                                                           High-Speed   ±5%        (-11%)-(+3%)

3.2.2 Reed-Solomon Encoder

As mentioned during benchmark selection, we perform logic synthesis with three different
types of logic modification to the LGC combinatorial logic: (i) an exact gate-level representation of the circuit obtained after LGC minimization, referred to as reedsolomon_lgc-exact,
(ii) a design which preserves hierarchy of the LGC-minimized combinatorial logic, allowing
logic modifications only within the combinatorial block, referred to as reedsolomon_lgc-hier,
and (iii) a flattened design that allows DC to merge the LGC-minimized logic blocks with
external logic, referred to as reedsolomon_lgc-flat. As seen from Figure 2.3, the combinatorial logic external to the LGC-minimized block is too minimal to cause a significant impact on the results. However, it does lead the tool's heuristics to different starting solutions, which eventually cause observable differences at higher speeds.
3.2.2.1 Area-Throughput comparison
Plugging reedsolomon_lgc in as a black box without any DC modification is clearly not the
best choice, as indicated by a 17-30% higher area (Figure 3.27(a)). As for reedsolomon_lgc-hier
and reedsolomon_lgc-flat, their areas differ by less than 1% in the minimal-area region. This
is due both to negligible external logic and the fact that significant DC optimization kicks
in only when it becomes challenging to meet timing.
Both these designs are about 9-12% larger than reedsolomon_ref in the minimal-area region. This is a surprising result, considering that reedsolomon_lgc has fewer logic levels and similar cell count, but still ends up bigger than reedsolomon_ref in the minimal-area region.

Figure 3.27: Area vs. Throughput comparison of Reed-Solomon encoder designs. (a) Area (K gate equivalents) vs. throughput (million blocks per second) at 180 nm, for reedsolomon_lgc-flat, reedsolomon_lgc-exact, reedsolomon_lgc-hier, and reedsolomon_ref; (b) the same comparison at 32 nm (without reedsolomon_lgc-exact).
The reason for this is the higher XOR-dominance of the LGC design. It is true that
reedsolomon_ref is built using GF multipliers and adders which are predominantly XOR-based,
but their behavioral representation and lack of logic minimization provide greater flexibility
to DC during logic mapping and optimization.
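As a concrete illustration of such XOR-dominated GF arithmetic, a shift-and-add GF(2^8) multiplier can be sketched in C as below. The reduction polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1, a common choice for RS codes) is an assumption here; the thesis designs may use a different polynomial. Each conditional XOR in the loop corresponds to a layer of XOR gates in a hardware realization.

```c
#include <stdint.h>

/* Dataflow-style GF(2^8) multiplication (shift-and-add with reduction).
   The polynomial 0x11D is an assumed example, not necessarily the one
   used by the thesis designs. */
static uint8_t gf256_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;               /* conditional add = XOR               */
        int carry = a & 0x80;     /* overflow out of degree 7?           */
        a <<= 1;
        if (carry)
            a ^= 0x1D;            /* reduce: x^8 = x^4 + x^3 + x^2 + 1   */
        b >>= 1;
    }
    return p;
}
```

Unrolling this loop for constant multiplicands yields exactly the kind of XOR network whose gate count the LGC tools minimize.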
Figure 3.28: Post-synthesis comparison of RS(255,223) designs at 180 nm technology. (a) XOR cell count and total standard cell count; (b) logical depth vs. throughput, for reedsolomon_lgc-flat, reedsolomon_lgc-exact, reedsolomon_lgc-hier, and reedsolomon_ref.
Moving on to the high-speed region, the differences between reedsolomon_lgc-flat and
reedsolomon_lgc-hier widen up to 8% due to the higher scope for optimization in the flattened
design. On the other hand, the difference in the areas of reedsolomon_lgc and reedsolomon_ref
narrows down to 4-7%. This is due to the higher circuit depth of reedsolomon_ref, which results
in a sharper increase in its area, unlike reedsolomon_lgc, whose fewer logic levels help in
reaching higher speeds.
3.2.2.2 Power comparison
Similar to area, the average power consumption of reedsolomon_lgc-flat scales better than
reedsolomon_lgc-hier in the high-speed region, and is about 15-28% smaller. Flattening
reedsolomon_lgc is therefore a better choice than preserving its hierarchy considering the
negligible difference in both area and power in the minimal-area region, and better scaling
of the flattened design in the high-speed region.
Figure 3.29: Average power comparison of RS(255,223) designs. (a) Average power vs. throughput (million blocks per second) at 180 nm, for reedsolomon_lgc-flat, reedsolomon_lgc-exact, reedsolomon_lgc-hier, and reedsolomon_ref; (b) the same comparison at 32 nm (without reedsolomon_lgc-exact).
The difference in power between reedsolomon_lgc-flat and reedsolomon_ref remains within
10%. The number of toggles and cell count differ by less than 12% and 15% respectively,
resulting in the small power difference. The power differs by a larger amount (10-20%) in
the 32 nm technology node due to reedsolomon_lgc’s higher leakage power.
In summary, a Reed-Solomon encoder with a logic-minimized state-update circuit reaches
higher speeds, but is not as efficient in terms of area and power as a design realizing GF
multiplication using dataflow operations.
3.3 Effect of physical design
The final phase of our analysis at multiple abstraction levels is the study of the impact of
physical placement and routing on post-synthesis netlists. Area and power after physical design
are liable to change due to physical distances between cells and routing overhead. Therefore,
these results are a more accurate reflection of their behavior on hardware. To analyze these
effects, all the benchmark designs were placed and routed at multiple frequencies on an
appropriate die size, using Synopsys IC Compiler. It must be mentioned that this physical
design flow was performed only for studying the impact of placement and routing, and not
for the purpose of actual fabrication (that will be discussed in Chapter 4).
Table 3.4: Post-layout area evaluation of LGC designs, using 180 nm technology.

                               Post-layout Area of LGC designs
Design                         Minimal-Area Region                  High-Speed Region
SBox                           12-24% bigger than sbox_canright     2-14% bigger than sbox_canright
Polynomial Multiplier, N=8     0-5% bigger than polymult_mat        9-16% bigger than polymult_mat
Polynomial Multiplier, N=16    5-14% smaller than polymult_mat      11-13% bigger than polymult_mat
Polynomial Multiplier, N=22    2-13% smaller than polymult_mat      21-27% bigger than polymult_mat
GF(2^8) Multiplier             1-3% bigger than gfmult_paar         0-2% smaller than gfmult_paar
GF(2^16) Multiplier            1-4% smaller than gfmult_paar        10-17% smaller than gfmult_mastrovito
GF(2^8) Inverter               41-52% bigger than gfinv_rrb         85-154% bigger than gfinv_rrb
Reed-Solomon Encoder           10-16% bigger than reedsolomon_ref   8-9% bigger than reedsolomon_ref
Standard AES                   1-12% smaller than aes20_canright    (-3)-7% smaller than aes20_lut
Lightweight AES                (-2)-3% smaller than aes20_lut       2-6% bigger than aes20_canright
Variations from post-synthesis results are seen for designs with minimal differences in area.
For example, sbox_lgc is smaller than sbox_canright at low speeds after synthesis, but gets
around 20% bigger after placement and routing. Similar changes are seen between
polymult_lgc and polymult_mat for small values of N. Overall, the addition of buffers and the
effects of physical placement apply to all the circuits, and hence marginally diminish the area
benefits of LGC designs. These results are summarized in Tables 3.4 and 3.5. The comparisons
reported in the tables correspond to the closest competitor to LGC designs. Complete graphs
are shown in Figures A.5 - A.10 in Appendix A.
Table 3.5: Post-layout power evaluation of LGC designs, using 180 nm technology.

                               Post-layout Power of LGC designs
Design                         Minimal-Area Region                  High-Speed Region
SBox                           0-33% higher than sbox_canright      6-37% higher than sbox_lut
Polynomial Multiplier, N=8     46-55% higher than polymult_mat      27-38% higher than polymult_mat
Polynomial Multiplier, N=16    50-60% higher than polymult_mat      0-34% higher than polymult_mat
Polynomial Multiplier, N=22    70-80% higher than polymult_mat      7-36% higher than polymult_mat
GF(2^16) Multiplier            15-17% higher than gfmult_paar       16-39% lower than gfmult_paar
Standard AES                   0-27% higher than aes20_lut          (-18)-7% lower than aes20_lut
Lightweight AES                0-4% higher than aes20_lut           0-7% higher than aes20_canright
It can be noted that while there are differences in the percentage of area/power improvement,
the overall relative behavior does not change significantly from the post-synthesis results for
circuits that have large differences in area. We round off this discussion by stating that the
major transformation of logical circuits occurs during logic synthesis, which is where
compactness properties are susceptible to being broken. Physical design has an impact on the
relative efficiency of designs only when their hardware metrics are close to each other.
3.4 Connection between abstract and technology-dependent
quality metrics
In order to complete the understanding of the impact of logic synthesis on logic-minimized
combinatorial primitives, we compute the correlation of abstract logical metrics to the area
and power obtained after synthesis. This computation is performed for each benchmark
design over the entire range of frequencies considered earlier, to identify regions where logical
metrics are more likely to be predictive of hardware efficiency.
Figure 3.30: Correlation of gate count and logical depth of SBox to hardware metrics. (a) Gate count correlation to area and power; (b) logical depth correlation to area and power.
Figure 3.30(a) shows a strong positive correlation of logical gate count to post-synthesis area
at low speeds, for an SBox. This tapers off as we move into the high-speed region before
surging up again owing to the small gate count making LGC designs suitable for pipelining.
The logical gate count, however, has only a moderate correlation to the SBox power. We
also see from Figure 3.30(b) that logical depth fails to accurately estimate both area and
power of the SBox. These observations highlight the fact that hardware aspects such as
toggling and flexibility towards optimization with a technology library are not well-captured
by logical metrics.
Figure 3.31: Correlation of logical gate count of polynomial multipliers to hardware metrics, plotted against delay (ns) for N = 8, 12, 16, 20, 22. (a) Gate count correlation to area; (b) gate count correlation to power.
For a polynomial multiplier, we look at the correlation of gate count to area and power for
different values of N . It is clear that there is a pattern of the gate count being an accurate
estimator of area and power in the minimal-area region. The correlation drop at high speeds
happens sooner as the multiplier size increases, indicating the higher efficiency of matrix-
based alternatives for larger sizes. For power, it is evident that abstract metrics are highly
inaccurate for large multipliers throughout the delay range.
Figure 3.32: Correlation of logical depth of polynomial multipliers to hardware metrics, plotted against delay (ns) for N = 8, 12, 16, 20, 22. (a) Logical depth correlation to area; (b) logical depth correlation to power.
An interesting result is the correlation of logical depth to the area and power of polynomial
multipliers, shown in Figure 3.32. We see that at high speeds, while gate count fails to
predict the hardware quality, logical depth of the designs has a very strong correlation to
area and power. When the differences in logical gate count and XOR count are not high,
hardware efficiency at high speeds is a result of other logical parameters, most notably the
logical depth. A regular and balanced matrix structure maintains few logic levels over the
8-22 bit range of N. This makes it more suited to meeting timing with fewer modifications.
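Logical depth as used here is the longest input-to-output path measured in gate levels. A minimal sketch of its computation over a topologically ordered netlist follows; the two-fanin array representation is a hypothetical simplification of a real gate-level netlist.

```c
/* Logical depth = longest path (in gates) from any primary input to any
   output. Gates are assumed listed in topological order, each with up to
   two fanin indices; a negative index denotes a primary input (depth 0). */
#define MAX_GATES 256

typedef struct { int fanin0, fanin1; } gate_t;

static int logical_depth(const gate_t *g, int ngates) {
    int depth[MAX_GATES], max = 0;
    for (int i = 0; i < ngates; i++) {
        int d0 = g[i].fanin0 >= 0 ? depth[g[i].fanin0] : 0;
        int d1 = g[i].fanin1 >= 0 ? depth[g[i].fanin1] : 0;
        depth[i] = 1 + (d0 > d1 ? d0 : d1);   /* one level deeper than fanins */
        if (depth[i] > max) max = depth[i];
    }
    return max;
}
```

A chain of gates and a balanced tree with the same gate count differ only in this measure, which is exactly why the balanced matrix structure meets timing more easily.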
Similar analysis for GF multipliers and inverter indicates a similar pattern of strong corre-
lation of gate count to area and power at low speeds, and moderate-to-strong correlation of
logical depth to hardware metrics at high speeds. These figures can be seen in Appendix A.
Figure 3.33: Correlation of gate count of SBox to hardware metrics of AES designs. (a) Gate count correlation to area; (b) gate count correlation to power.
When plugging combinatorial primitives into a complete design such as an AES encryption
engine, we see a pattern of abstract metrics correlating more with area at low speeds.
Furthermore, this is shown to depend on the contribution of the primitive to the total design
area. While gate count is only moderately correlated to area for lightweight AES at low
speeds, we do see a surge in correlation after pipelining at high speeds. Meanwhile, SBox
gate count is weakly correlated to the power of standard and high-throughput AES designs,
owing to the inability of abstract metrics to accurately represent switching activity.
In summary, our analysis thus far has made it clear that conversion of a logical circuit
representation to hardware is not trivial. Area and power efficiency of a combinatorial circuit
depend on standard cell library, how conducive the circuit structure is for optimization, delay
requirements, and surrounding logic when it is part of a bigger design. Efficiency with respect
to one metric can be achieved at the cost of that with respect to another. In this regard,
logical quality metrics do not provide an accurate estimate of how small or fast a circuit will
be in hardware. Figure 3.34 shows the distribution of correlation of logical gate count to
hardware quality metrics of benchmark designs over the entire frequency range considered.
We see that logical gate count has a high correlation (> 0.8) to area only 17% of the time,
while it is highly correlated to power less than 20% of the time.
Figure 3.34: Distribution of correlation of gate count to hardware quality metrics of all benchmarks. (a) Gate count correlation to area; (b) gate count correlation to power.
This reiterates the point that while logical metrics can provide a reasonable initial estimate,
comparing logical designs solely based on minute differences in gate count or logical depth is
not an accurate comparison of their expected hardware performance. We therefore put forth
the argument that the question of whether such logical metrics can be used for hardware
estimation does not have a definitive answer. Rather, we believe that the more prudent
question is about which performance region and type of designs show a higher reflection
of logical metrics in their hardware quality. In this regard, we conclude this chapter by
showing Table 3.6. This table summarizes the correlation analysis for the benchmark designs
considered, and is intended to serve as a reference for any future work on similar circuits.
Table 3.6: Correlation of abstract metrics to hardware quality metrics. H = High (correlation > 0.8), L = Low (correlation < 0.5), M = Moderate (0.5 ≤ correlation ≤ 0.8). "→" indicates transition of correlation from one level to another.

                                              Min-area Region    High-speed Region
Logical Metric   Design                       Area    Power      Area    Power
Gate Count       SBox                         H       M          M       L
                 Polynomial Multiplier N≤14   H       H          L       L
                 Polynomial Multiplier N>14   H       L          L       L
                 GF Multiplier                M       M          L       L
                 GF Inverter                  M       M          L       L
                 Standard AES                 H       L          L→M     H→L
                 High-throughput AES          H       L          L→M     M
                 Lightweight AES              M       L          L→H     H
Logical Depth    SBox                         M       L          L       L
                 Polynomial Multiplier N≤14   L       L          H       H
                 Polynomial Multiplier N>14   L       H          H       H
                 GF Multiplier                H       H          H       H
                 GF Inverter                  L       L          M       L→M
                 Standard AES                 H       L          L→M     H→L
                 High-throughput AES          H       L          L→M     M
                 Lightweight AES              M       L          L→H     H
Fanout per Gate  SBox                         L       L          L       L
                 Polynomial Multiplier N≤14   L       L          H       H
                 Polynomial Multiplier N>14   L       H          H       H
                 GF Multiplier                L       L          M       H
                 GF Inverter                  L       L          L       L
                 Standard AES                 H       L          L→H     M→L
                 High-throughput AES          H       L          L→H     H→L
                 Lightweight AES              H       L→M        L→M     M
Chapter 4
Design of prototype ASIC
This chapter presents the design of a prototype ASIC for benchmarking combinatorial logic-
minimized circuits on hardware. We begin with the reasons behind its architecture, followed
by integration of different hardware blocks onto this chip. We finally discuss the strategies
adopted for its physical design, ending with its post-layout evaluation.
4.1 Design Rationale
Having discussed how an SLP goes through logic synthesis in Chapter 3, the next step is to
come up with a suitable platform to benchmark the hardware performance of these designs.
While realizing these designs onto an ASIC is the obvious choice, simply grouping these
circuits together on a chip is of little or no benefit due to the following reasons:
• The design becomes pin-limited, i.e. a chip composed of simply a collection of these
combinatorial circuits would require an impractically large number of pins brought out of
it. This number grows out of control as the number of circuits increases, since each circuit
has eight or more inputs and outputs. This results in the chip boundary size growing just
to accommodate the pins, while the core itself is largely empty. This manifests as
inefficient area utilization and a lack of scalability.
• There is a lack of ease of user control for analysis. Having all the pins coming out of
the chip makes it mandatory for the circuits inside to be controlled through an external
voltage source connected to each pin of the circuit to be analyzed. This gives little or no
scope for automation that is essential to feed a large number of test vectors. Easy and
effective user control on such an ASIC becomes infeasible, and the chip itself is not very
flexible since its utility is limited to simply giving an output for a particular input sent.
The aforementioned reasons necessitate an elegant solution in the form of a central controlling
unit to access all the circuits to be benchmarked, while presenting a simple user interface for
feeding test vectors. This motivated us to opt for a processor-based System-on-Chip design.
Integrating hardware blocks onto such a design offers the following advantages:
• C programs - convenient user interface. The most attractive feature of using a processor
is the ease of running tests through simple-to-understand C programs. The user can now
work at a higher level of abstraction to access any hardware block of interest, since all of
them are connected to the processor through a common bus.
• Scalability - Hardware blocks are attached as “coprocessors” onto the system bus. Such
a hardware-software approach makes the entire design scalable in that any number of
additional independent hardware blocks can simply be attached to the bus through a
uniform interface for accessing them. We make use of this flexibility to add additional
cipher cores as hardware coprocessors for analysis.
4.2 SoC Architecture
The SoC designed in this project is referred to as “NISTCHIP”, and follows the ideas devel-
oped in a previous chip [51] that was designed in-house at the Secure Embedded Systems
Lab at Virginia Tech. Our work included expanding this architecture with additional hardware blocks and strategies for physical design.

Figure 4.1: A general bus-based architecture facilitating communication between various blocks of the SoC.

The SoC is centered around Leon3, which is
a 32-bit RISC processor compliant with the SPARC V8 instruction set architecture. Leon3
is distributed as a synthesizable VHDL model by Aeroflex Gaisler. It features a 7-stage
pipeline, configurable caches, on-chip debug support, and a software interface for efficient
communication with the processor. It is highly suitable for building SoC designs due to its
extensive configurability and availability of a vast library of on-chip peripherals including
memory controllers, UART modules, timers, GPIOs, and bus controllers that are readily
compatible with the processor core.
Figure 4.2: NISTCHIP Architecture
4.2.1 Memory-mapped coprocessors on NISTCHIP
Figure 4.2 shows the architecture of NISTCHIP. The processor is extensible through an
AMBA 2.0 bus system. It acts as a master on an Advanced High-performance Bus (AHB),
which also has a 128 KB on-chip RAM attached to it. The processor has access to all
peripherals through this system-wide bus. Hardware coprocessors are attached as “slaves”
onto an Advanced Peripheral Bus (APB) which is controlled by an APB controller (or APB
bridge) that is in turn attached as a slave to the AHB bus.
Each slave attached to a bus is treated by the processor as “memory”, and has an address
range associated with it. Any address in the range for a particular coprocessor is ignored by
all other blocks on the bus. Data to be written to a coprocessor is sent through the AHB
bus, transferred onto APB by an APB controller acting as a bridge, and finally written into
the coprocessor’s internal registers. Our chip splits the coprocessors over three APB bridges,
grouped according to functionality:
• APB Bridge 0 controls essential general-purpose peripherals including a hardware
timer, GPIOs, and UART controller.
• APB Bridge 1 controls electro-magnetic fault detection sensors [18], the LR-Keymill
cipher, and an AES coprocessor with an LUT-SBox (denoted as AES-LUT).
• APB Bridge 2 controls all other NISTCHIP-related coprocessors, listed below:
– The combinatorial modules obtained from SLPs, all grouped into one coprocessor
referred to as “NISTCOMB”.
– Reed-Solomon Encoder with an LGC circuit for state-update.
– AES coprocessors with LGC SBox and Wolkerstorfer SBox, referred to as AES-LGC
and AES-WOLK respectively.
– Direct Memory Access (DMA) controller for providing test vectors to NISTCOMB
at maximum speed.
– Three finalists of the “Competition for Authenticated Encryption: Security, Appli-
cability, and Robustness” (2014-2018), referred to as CAESAR [1]. These ciphers
include ACORN (32-bit and 8-bit datapath), AEGIS-128L, and MORUS-1280. De-
tailed discussion on the integration of these ciphers is provided in Chapter 5.
The address ranges of all components connected to the AHB and three APB busses are shown
in Tables B.1 and B.2 of Appendix B. In a memory-mapped interface, each coprocessor
is equipped with a set of data and control registers that are required for communication
with the processor. For example, an AES core attached as a coprocessor takes as input a
key followed by plaintext and a corresponding “start” signal, and returns ciphertext after
encryption. Each coprocessor core similarly has its own requirement for the type of data and
the way it is sent. To address this, there is a software-interface wrapper built around each coprocessor core, as shown in Figure 4.3.

Figure 4.3: (a) Two different types of coprocessor wrappers, one with registers and the other with FIFOs. (b) Example: AES coprocessor wrapper.

This wrapper communicates with the processor
through registers that are visible to software, and handles cycle-accurate data and control
flow to and from the core. The wrapper also performs address decoding to determine if the
read/write request is intended for that coprocessor. This address decoding runs in every
coprocessor whenever there is a read/write request on the APB bus.
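From software, such a wrapper is driven purely through loads and stores to its register addresses. The sketch below illustrates the AES example above; the register layout and word offsets are hypothetical, not the actual NISTCHIP memory map (which is given in Appendix B). On hardware, regs would point at the coprocessor's APB base address.

```c
#include <stdint.h>

/* Software view of a memory-mapped AES coprocessor wrapper. The word
   offsets below (key, data-in, control with start bit, status with done
   bit, data-out) are illustrative assumptions only. */
enum { R_KEY = 0, R_DIN = 4, R_CTRL = 8, R_STATUS = 9, R_DOUT = 10, R_WORDS = 14 };

void aes_encrypt_block(volatile uint32_t *regs,
                       const uint32_t key[4], const uint32_t pt[4], uint32_t ct[4]) {
    for (int i = 0; i < 4; i++) regs[R_KEY + i] = key[i];   /* load key     */
    for (int i = 0; i < 4; i++) regs[R_DIN + i] = pt[i];    /* load block   */
    regs[R_CTRL] = 1u;                                      /* assert start */
    while ((regs[R_STATUS] & 1u) == 0)                      /* poll done    */
        ;
    for (int i = 0; i < 4; i++) ct[i] = regs[R_DOUT + i];   /* read result  */
}
```

The processor sees only these register accesses; the wrapper translates them into cycle-accurate control of the core.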
4.2.2 NISTCOMB coprocessor: Design and Programming model
The NISTCOMB coprocessor consists of combinatorial circuits obtained from SLPs, along
with a software-interface wrapper to control the input-output flow to each circuit. The
circuits included are listed in Table 4.1. Since all of them are combinatorial, their outputs
are available in a single clock cycle after sending an input. The wrapper consists of an input
and output FIFO for data storage. Once the input FIFO is filled with test vectors, they
are sent in consecutive cycles to the intended block, and the results are stored in the output
FIFO. Once the output FIFO is full, the wrapper asserts a status bit, and the results are
read out of the output FIFO. The presence of FIFOs is beneficial for data access through a
DMA, thereby facilitating data transfer at maximum speed without processor latency. The
coprocessor’s control register consists of bits to select the circuit that is desired to be active.
Table 4.1: Combinatorial circuits included in the NISTCOMB coprocessor.

Component Name    Function
SBOX-LGC          8-bit LGC SBox
SBOX-LUT          8-bit LUT-based SBox
SBOX-WOLK         8-bit Wolkerstorfer SBox [46]
INVSBOX-LGC       8-bit LGC Inverse SBox
INVSBOX-LUT       8-bit LUT-based Inverse SBox
INVSBOX-WOLK      8-bit Wolkerstorfer Inverse SBox [46]
GF256MULT         LGC GF(2^8) Multiplier
GF65536MULT       LGC GF(2^16) Multiplier
GF256INV          LGC GF(2^8) Inverter
K3LRSBOX          16-bit SBox [30]
MULT64            64-bit LGC Polynomial multiplier

Since the combinatorial circuits in NISTCOMB are minute in area as compared to the rest
of the design, multiple "instances" of them were created for better observation of their power
consumption (Figure 4.4). This also helps to average out differences due to design heuristics
while making area comparisons between (for example) the three SBox designs. The number
of instances can be four, two, or one, depending on the input width of the circuits.
Figure 4.4: Multiple instances of combinatorial blocks, depending on input width.
The complete software-interface register space of the NISTCOMB coprocessor is shown in
Tables B.3 - B.5 of Appendix B. The number of instances of each circuit that can be active
is controllable through a "redundancy" value in the control register. For example, inputs
of 8-bit blocks can be replicated from 1-4 times, whereas for a 16-bit block, inputs can be
replicated only twice.
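The programming model above (fill the input FIFO, select the active circuit, wait for the output FIFO to fill, drain the results) can be sketched as the following driver routine. The register indices, the status bit, and the FIFO depth are assumptions for illustration; the actual register space is given in Tables B.3 - B.5.

```c
#include <stdint.h>

/* Driver sketch for the NISTCOMB FIFO-based wrapper. All offsets and
   constants here are hypothetical. */
enum { C_CTRL = 0, C_STATUS = 1, C_FIFO_IN = 2, C_FIFO_OUT = 3 };
#define OUT_FIFO_FULL 0x1u   /* assumed status bit */
#define FIFO_DEPTH    16     /* assumed FIFO depth */

void nistcomb_run(volatile uint32_t *regs, uint32_t select,
                  const uint32_t *in, uint32_t *out) {
    regs[C_CTRL] = select;                      /* choose active circuit */
    for (int i = 0; i < FIFO_DEPTH; i++)
        regs[C_FIFO_IN] = in[i];                /* fill input FIFO       */
    while ((regs[C_STATUS] & OUT_FIFO_FULL) == 0)
        ;                                       /* wait for results      */
    for (int i = 0; i < FIFO_DEPTH; i++)
        out[i] = regs[C_FIFO_OUT];              /* drain output FIFO     */
}
```

With a DMA controller in place of the processor loop, the same fill-and-drain sequence runs at maximum bus speed, as described above.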
4.3 NISTCHIP ASIC Design
Our prototype ASIC is designed using standard cells, IO pads, and bonding pads obtained
from TSMC 180nm technology library. The tools used at various stages of the design flow
are listed in Table 4.2.
Table 4.2: Software tools used at various stages of NISTCHIP design.

Software Tool                    Function
Synopsys Design Compiler (DC)    Logic synthesis and DFT insertion
Synopsys IC Compiler (ICC)       Physical design (placement, routing, clock tree synthesis)
Synopsys PrimeTime               Post-synthesis and post-route power analysis
Mentor Graphics Calibre          Chip verification and signoff
Mentor Graphics ModelSim         RTL, post-synthesis, and post-route simulation
4.3.1 Logic synthesis of NISTCHIP
The complete design is synthesized at a frequency of 80 MHz, using slow-corner libraries to
consider worst-case delays. Due to the size of this SoC, and the large number of coprocessors,
synthesis is performed as an iterative process until the design meets timing. Appropriate
dont_touch and false_path constraints are set on legacy sensor modules that are a part of
this chip, in order to ensure that DC does not optimize their logic away. A useful strategy
followed to increase optimization on critical paths of the design was to add them to a separate
path group. Optimization by DC is now performed on the worst paths of each such path
group. The outcome of this strategy is an improvement in meeting timing constraints.
Synthesis is followed by the insertion of a scan chain. All registers in the coprocessors on APB
Bridge 2 that store their state, control, and status are added to the scan chain to increase
their observability. This is followed by incremental synthesis until the design is free of setup
violations. After the design is synthesized, gate-level simulation is performed to verify the
functionality of all coprocessors, after annotating worst-case post-synthesis delays through a
Standard Delay Format (SDF) file.
4.3.2 NISTCHIP Physical Design flow
The following subsections briefly discuss the important steps followed at different stages of
the physical design of NISTCHIP.
4.3.2.1 Floorplanning
The primary focus here is to fix the locations of all hard macros, and perform an initial coarse
placement of the design. We perform “virtual flat placement”, which considers the design to
be a flat collection of cells. Standard cells of each module are generally placed together, with
some overlap to account for interconnection between modules. Extra attention was given to
the following points during this stage:
• Macro proximity to relevant standard cells - To minimize wire-length of connections to
and from macros, it is important to ensure their physical proximity to standard cells they
are directly connected to. In this design, it was necessary to place the processor register
files close to the rest of the processor logic. After an initial run of coarse placement, the
rough location of the integer unit was understood to be close to instruction trace buffers
and cache memories that the processor requires access to. Fixing register files close to
this region aided in minimizing the number of critical paths.
• EM Sensor bounds - These sensors are built as a chain of a large number of inverters and
multiplexers [18], to detect EM fault attacks that manifest as a clock glitch. As a result,
it was desired that the sensors remain compact, and the three flip flops in each such
sensor stay physically close together, with their clock signals preferably coming from the
same buffer. To accomplish this, exclusive move bounds were created for each sensor by
determining the rough area of a square they can fit in. Sensor locations were chosen
such that the full set of 64 sensors is spread all over the chip, including narrow channels
between macros. This can be seen in Figure 4.5(a).
(a) EM Fault attack detection sensors (b) Physical grouping of processor pipeline stages
Figure 4.5: Bounds created for sensors and isolation of processor pipeline stages.
• Physically separated processor pipeline registers - Another feature of this chip is that
registers belonging to each pipeline stage of the processor are physically grouped together.
This is intended to aid testing of EM attacks on specific pipeline stages as part of possible future
research efforts using our chip. This was achieved by creating group bounds for each set of
pipeline registers, and letting ICC decide their physical locations. It must be mentioned
that complete isolation of each pipeline stage was not possible due to increased complexity
in meeting timing as a result of these restrictions. This can be observed in Figure 4.5(b).
4.3.2.2 Power Network Synthesis
Before creating the power grid on chip, power rings are first created around memory macros.
Pairs of macros were provided with a common ring to minimize unusable space under these
power rings. Route guides were created to ensure that power straps across the chip do not
pass too close to macro power rings. Power planning for the rest of the chip is then performed
using ICC’s automatic power network synthesis feature, with a power budget of 1000 mW
and a target IR drop of 100 mV. The number and width of straps were determined by the tool
based on these values. Finally, standard cell rows and macro power/ground rings are hooked
up to power/ground straps to complete the connection to each standard cell and macro.
Two-pass Synthesis Flow: At this point, floorplan and power network information of the
chip are saved, and the design is re-synthesized using DC in Topographical Mode. This is a
useful step to follow when the design is constrained by area. Knowledge of macro placement
and coarse physical locations of standard cells is used in this second-pass synthesis to make
further logic optimizations. This in turn potentially reduces area.
Figure 4.6: Power grid on chip, with power rings around pairs of macros.
4.3.2.3 Placement, Clock Tree Synthesis, and Routing
Placement of the design finalizes the location of each standard cell without any overlap. We
enable scan-chain optimization, which modifies the scan-chain based on physical proximity
as opposed to logical connections. This is followed by Clock Tree Synthesis (CTS). We make
use of clock buffer cells to initially build the clock tree, in order to avoid the tool’s pruning of
clock inverters during this process. Fixing of hold violations is enabled during CTS, making
use of fast-corner libraries to estimate best-case delays. Care is taken to ensure that the tool
performs Clock Reconvergence Pessimism Removal, in order to prevent over-constraining the
design. Optimization steps are performed repeatedly at the end of CTS to ensure that the
design stays free of major setup, hold, or logical design rule violations.
Following placement and CTS, all standard cells and macros are physically interconnected.
This chip is designed with six metal layers as provided by our technology library, with the
power/ground straps routed on the top two layers. Diode insertion is enabled to fix antenna
violations, with the rules and diode mode specified as defined by the technology vendor. We
run a single route_opt flow owing to its timing driven algorithms, and aggressive incremental
optimization to fix design rule violations. Due to high utilization of our available core area,
optimization runs after initial routing allowed partial placing of cells under power/ground
straps (with a density of 25%), to provide a little extra space to move cells around.
4.3.2.4 Chip verification and signoff
Gate-level simulation is first performed to verify functionality of the netlist obtained after
placement and routing. Following this, the final step consists of verifying that vendor-
specified physical design rule constraints and antenna rules are met. Filler cells and metal
fillers are inserted to meet density requirements, followed by performing a Layout Versus
Schematic (LVS) check on the final layout using Mentor Graphics Calibre.
4.3.3 The final ASIC layout
Figure 4.7: Layout of the physically placed and routed chip.
Figure 4.7 shows the layout of the chip after physical placement and routing of the design.
The majority of the area is taken up by the on-chip RAM, with the Leon3 processor and
coprocessors making up a substantial portion of the remaining area. The processor is placed
close to the register files and cache memory, while the physical location of the coprocessors
is not heavily dependent on any other block since they are all connected to a central APB
controller.
4.4 Post-layout results
4.4.1 Area
The chip is to be fabricated on a 5mm × 5mm die, and the total core area occupied by
the design is 16.83 mm2. The 128K AHBRAM makes up 49% of this total area, while the
cache memories, processor register files, and instruction trace buffer make up an additional
8.6%. Figure 4.8 shows the area in terms of number of equivalent NAND2X1 gates of all
other blocks. While the coprocessors make up the majority of the chip area, the APB and AHB
controllers do not account for a significant portion (about 0.6% of the total area).
Figure 4.8: Area (K Gate Eq.) of individual blocks on chip
The areas of combinatorial circuits obtained from LGC SLPs are listed in Table 4.3.
4.4.2 Power Consumption
Power consumption of the chip is measured through a vector-based analysis for specific
coprocessor tests. These measurements are performed on the post-layout netlist, with the
Circuit         Area (mm2)  Num. Instances  Area (mm2)    NAND2X1 Gate Eq.
                                            per instance  per instance
SBOX-LGC        0.0138      4               0.0034        345
SBOX-LUT        0.0268      4               0.0067        672
SBOX-WOLK       0.0145      4               0.0036        364
INVSBOX-LGC     0.0158      4               0.0039        395
INVSBOX-LUT     0.0245      4               0.0061        613
INVSBOX-WOLK    0.0152      4               0.0038        381
GF256MULT       0.0121      4               0.0030        304
GF65536MULT     0.0210      2               0.0105        1050
GF256INV        0.0115      4               0.0029        290
GF65536INV      0.0259      2               0.0130        1300
K3LRSBOX        0.0380      2               0.0190        1910
MULT64          0.0922      1               0.0922        9240
Table 4.3: Post-layout area of combinatorial circuits on NISTCHIP
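The gate-equivalent numbers in Table 4.3 follow from dividing a block's area by the footprint of one NAND2X1 cell. The sketch below illustrates this arithmetic; the 10.0 um^2 cell area used in the example is a stand-in, not the actual TSMC 180 nm library value.

```python
def gate_equivalents(area_mm2: float, nand2_area_um2: float) -> float:
    """Convert block area to NAND2-equivalent gate count.

    area_mm2:        block area in mm^2 (1 mm^2 = 1e6 um^2)
    nand2_area_um2:  area of one NAND2X1 cell in um^2 (library-specific)
    """
    return (area_mm2 * 1e6) / nand2_area_um2

# With a hypothetical 10.0 um^2 NAND2X1 cell, a 0.00345 mm^2 block
# corresponds to 345 gate equivalents.
ge = gate_equivalents(0.00345, 10.0)
```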
interconnect parasitics annotated on the design through a SPEF file obtained from ICC. Both
the chip-level power and hierarchical block-level power consumed over the total simulation
duration are obtained from analysis using PrimeTime. Two different coprocessor tests are
discussed below.
4.4.2.1 NISTCOMB Test - SBox power comparison
A set of 256 random test vectors is sent to each of the three SBox circuits present in
NISTCOMB. These vectors are sent in groups of 32, as the FIFOs in NISTCOMB can hold
up to 32 vectors at a time. Table 4.4 shows the power consumption of each SBox.
                SBOX-LGC Test  SBOX-LUT Test  SBOX-WOLK Test
SBox Power      0.282 mW       0.248 mW       0.314 mW
NISTCOMB Power  15.1 mW        15.0 mW        15.1 mW
Table 4.4: Post-layout power consumption of SBox circuits in NISTCOMB - 256 test vectors fed through DMA.
An important point regarding power consumption is that the overall chip-level power is a
result of not just a single coprocessor. In a processor-based SoC, there are multiple compo-
nents on the chip that are active at the same time, and hence can contribute to total power.
For this particular SBox test, it was found that the clock network on chip contributes
56% of total chip-level power. The contribution of major active SoC components during a
NISTCOMB test is listed in Table 4.5 below.
Component on Chip     Power (mW)  % of Total Power
Top-level             331         100%
Processor Core        19.2        5.8%
DMA                   12.5        3.8%
Cache controller      2.155       0.6%
AHB Controller        0.928       0.3%
APB Bridge 2          0.796       0.2%
NISTCOMB Coprocessor  15.0-15.1   4.5-4.6%
Table 4.5: Contribution of components on SoC to total power during NISTCOMB test.
It can be seen that although there is a difference of around 20% in power among the SBoxes,
this hardly reflects in the total power or even the coprocessor power. At the coprocessor level,
power due to the communication overhead in the form of FIFOs overshadows that of the
combinatorial blocks, while at the chip level, even the coprocessor power is a small fraction
due to the clock network and other active components. This goes to show that when integrated
into a large system, individual combinatorial blocks have a negligible impact on total power.
4.4.2.2 AES Test
This test feeds 64 blocks of plaintext into each of the three AES coprocessors running
encryption in CBC mode. The power consumption of each AES coprocessor is shown in
Table 4.6, and that of the different components active during this test is listed in Table 4.7.
In summary, this chapter discussed the prototype ASIC designed for benchmarking of com-
binatorial logic-minimized circuits. A processor-based SoC was used as a platform for this
AES-LGC  AES-LUT  AES-WOLK
19.8 mW  19.2 mW  19.0 mW
Table 4.6: Post-layout power consumption of AES circuits - encryption of 64 plain-text blocks in CBC mode.
Component on Chip  Power (mW)  % of Total Power
Top-level          323         100%
Processor Core     25.2        7.8%
Cache controller   2.773       0.8%
AHB Controller     0.688       0.2%
APB Bridge 2       0.611       0.2%
AES Coprocessor    19.8        6.1%
Table 4.7: Contribution of components on SoC to total power during AES Tests.
purpose, and additional cryptographic blocks were added as coprocessors. The next chapter
presents extensive analysis on the AEAD coprocessors integrated on this SoC.
Chapter 5
Impact of SoC integration on
Authenticated Encryption Ciphers
5.1 Introduction
Authenticated Encryption has gained popularity as a hardware-efficient and secure alter-
native to two-phase algorithms employing separate encryption and message authentication
[19]. The fundamental idea is to use a single cipher that provides authenticity in addition
to confidentiality and integrity. Authenticated Encryption with Associated Data (AEAD)
schemes take a message or plaintext (PT) as input along with a key, associated data (AD),
and a public message number (Npub). Following encryption of the message, a tag is gen-
erated that is used to verify authenticity during decryption. CAESAR [1] has been driving
the development of new AEAD ciphers for lightweight and high-performance applications.
As was mentioned in Chapter 4, we picked three of the CAESAR finalists to be integrated
as hardware coprocessors onto our SoC.
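The AEAD contract described above (key, Npub, AD, and plaintext in; ciphertext and tag out) can be illustrated with a toy encrypt-then-MAC construction built from Python standard-library primitives. This is only an interface sketch; it is not one of the CAESAR ciphers studied here.

```python
import hashlib
import hmac

def _keystream(key: bytes, npub: bytes, n: int) -> bytes:
    # Toy counter-mode keystream derived from SHA-256 (illustration only).
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + npub + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def aead_encrypt(key: bytes, npub: bytes, ad: bytes, pt: bytes):
    """Encrypt-then-MAC: keystream XOR for confidentiality, then an
    HMAC tag over Npub, AD, and ciphertext for authenticity."""
    ct = bytes(p ^ k for p, k in zip(pt, _keystream(key, npub, len(pt))))
    tag = hmac.new(key, npub + ad + ct, hashlib.sha256).digest()[:16]
    return ct, tag

def aead_decrypt(key: bytes, npub: bytes, ad: bytes, ct: bytes, tag: bytes):
    """Verify the tag before releasing any plaintext, as an AEAD must."""
    expect = hmac.new(key, npub + ad + ct, hashlib.sha256).digest()[:16]
    if not hmac.compare_digest(tag, expect):
        raise ValueError("authentication failed")
    return bytes(c ^ k for c, k in zip(ct, _keystream(key, npub, len(ct))))
```

A failed verification (for example, tampered AD) must reject the message rather than return plaintext, which is the property that separates AEAD from plain encryption.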
The motivation behind this work is that such an integration has a non-negligible impact
on the area, power, and performance of the ciphers, examples of which were provided in
[23]. Extensive analysis of hardware implementations of the AEAD ciphers can be found in
literature [21, 29, 31]. However, existing works consider these ciphers as standalone hardware
blocks, and the results do not always hold when they are integrated onto a larger system.
For example, Figure 5.1 (similar to Figure 4.3 in Chapter 4) is shown here to reiterate the fact
that plugging a hardware block onto a system bus requires wrapper logic for communication.
Since the penalty in resources brought about by this additional logic is unavoidable, it is
up to the designers to choose a scheme that can appropriately minimize overhead in area,
power, and performance.
Figure 5.1: A generic wrapper structure for an AEAD coprocessor on Leon3-based SoC.
As with any hardware design, there is no golden method for building a coprocessor. The SoC
integration scheme used on our chip is only one of several methods, each resulting in a different
amount of overhead. Hence, it would not be fair to generalize our results from the post-layout
analysis of our chip. Our aim here is to explore multiple alternatives for wrapper design that
take advantage of the way the cipher cores work, and underline the benefits of each design
alternative. We provide our comments on which of those is likely to be of practical utility
in an SoC context, and the trade-offs to be considered in the process.
5.2 Relevant Background
This section provides a brief background of the AEAD ciphers chosen for analysis, covering
the aspects of their hardware functionality that are necessary to understand our work. For
further details about the algorithms and their security features, interested readers can refer
to their design documents [48, 49, 50].
5.2.1 ACORN
ACORN is a lightweight stream cipher, with a 293-bit state (S0, S1, ..., S292), arranged as
six concatenated Linear Feedback Shift Registers (LFSRs). ACORN is popular due to its
suitability to both lightweight and high-performance applications. In our work, we make use
of ACORN-128 which uses a 128-bit key, and generates a 128-bit tag after encryption. The
cipher employs simple AND, XOR, and NOT logic operations to update the state at every
step i, to generate a feedback bit fi, and a keystream bit ki.
Figure 5.2: The structure of ACORN cipher [48]
There are four main stages involved in ACORN, described as follows:
• Initialization: The initialization stage consists of loading the key and Initialization
Vector (IV) bit-by-bit to update the state. Initialization runs for 1792 steps in total.
Figure 5.3: Message stream during ACORN initialization
• Processing Associated Data: In this stage, the associated data is used to update the
state. Considering an AD of size adlen bits, this stage first runs for adlen steps. This is
followed by 256 additional steps which are mandatory even when the length of AD is 0.
Figure 5.4: ACORN message stream during AD processing
• Encryption: In addition to using plaintext to update the state, this stage generates a
ciphertext bit by XOR-ing the corresponding plaintext and keystream bits. Similar to
the previous stage, this stage also runs a mandatory 256 additional steps after processing
ptlen bits of plaintext. When ptlen is 0, there is no ciphertext generated.
Figure 5.5: ACORN message stream during PT encryption
• Finalization: The final stage involves generating the tag by running for 768 steps in
total. The last 128 keystream bits form the 128-bit tag. The message bit is set to 0
throughout this stage.
5.2.1.1 Parallelization of ACORN
The designers of ACORN proposed a parallelized implementation by choosing a datapath
that is either 8 or 32 bits wide. We will refer to these two implementations as ACORN-8
and ACORN-32 respectively, with the former processing 8 bits of the message stream to-
gether, while the latter processes 32 bits in one cycle. Both these alternatives offer increased
throughput over the basic version. ACORN-8 is especially well-suited for very lightweight
applications, as it offers a logic footprint that is about 40-50% smaller than that of ACORN-
32 [31]. ACORN-32, however, is shown to provide a throughput that is almost 4× that of
ACORN-8. Table 5.1 shows the number of clock cycles required for each of the four stages
after parallelization.
Stage \ Datapath  ACORN-32         ACORN-8
Initialization    56               224
Process AD        ⌈adlen/32⌉ + 8   ⌈adlen/8⌉ + 32
Encryption        ⌈ptlen/32⌉ + 8   ⌈ptlen/8⌉ + 32
Finalization      24               96
Table 5.1: Number of steps required for each stage of ACORN-32 and ACORN-8 ([48]).
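The formulas in Table 5.1 can be folded into a small cycle-count model (the function name is ours), which also confirms that ACORN-32 approaches a 4× speedup over ACORN-8 for long messages:

```python
from math import ceil

def acorn_cycles(adlen: int, ptlen: int, width: int) -> int:
    """Total cycles for ACORN with an 8- or 32-bit datapath (Table 5.1);
    adlen and ptlen are in bits."""
    assert width in (8, 32)
    init = 1792 // width             # 224 for ACORN-8, 56 for ACORN-32
    ad   = ceil(adlen / width) + 256 // width
    enc  = ceil(ptlen / width) + 256 // width
    fin  = 768 // width              # 96 for ACORN-8, 24 for ACORN-32
    return init + ad + enc + fin

# For a 1 Mbit plaintext, ACORN-32 needs one quarter of the cycles of ACORN-8.
speedup = acorn_cycles(0, 1 << 20, 8) / acorn_cycles(0, 1 << 20, 32)
```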
5.2.2 AEGIS
AEGIS is a family of AEAD ciphers popular for high-performance applications, and its
throughput is among the highest of the CAESAR finalists [31]. With its high security
and speed, AEGIS has been claimed to be well-suited for packet encryption in network
applications [49]. In this work, we consider AEGIS-128L, which is the fastest among AEGIS
ciphers.
AEGIS-128L takes a 256-bit message block per cycle, performs encryption using a 128-bit
key, and generates a 128-bit tag for authentication. It consists of a 1024-bit state, whose
Figure 5.6: AEGIS-128L state update - Each Si is a 128-bit word of the state [49].
update logic consists of eight AES round functions as shown in Figure 5.6. An important
distinction from ACORN is that there is no state update performed when the length of AD
or PT is 0. Padding on the data, if any, is performed externally before sending it to the core.
Moreover, the block size here is greater, which results in a reduced number of computation
steps for the same data size. It is easy to observe that the high throughput and security
come at the expense of higher area, resulting from wide parallelization and multiple AES
round functions.
Stage           Number of cycles
Initialization  10
Process AD      ⌈adlen/256⌉
Encryption      ⌈ptlen/256⌉
Finalization    7
Table 5.2: Number of clock cycles required for each stage of AEGIS-128L.
There are four stages in AEGIS-128L with similar functionality as those of ACORN, but the
algorithm differs in the time taken for each stage. A wider datapath and absence of padding
by the core also contribute to its high speed. This is summarized in Table 5.2.
5.2.3 MORUS
The MORUS family of AEAD ciphers follows the style of stream cipher design, involving
low-complexity state-update functions [50]. The design is intended to be fast in both
hardware and software, especially in the absence of AES-NI instructions. MORUS
can be viewed as a design offering some of the benefits of both AEGIS and ACORN:
• High throughput due to 256-bit messages, absence of padding steps, and small number of
steps in each stage of the algorithm, all of which are similar to AEGIS.
• State-update with small logic footprint similar to ACORN, employing simple AND, XOR,
and rotation operations.
Stage           Number of cycles
Initialization  16
Process AD      ⌈adlen/256⌉
Encryption      ⌈ptlen/256⌉
Finalization    8
Table 5.3: Number of clock cycles required for each stage of MORUS-1280.
Its hardware efficiency stems from replacing the AES round functions used for state update
with simpler logic. As a result, MORUS achieves the best throughput-to-area ratio among
the CAESAR finalists [31]. The parameters used in our analysis are those of MORUS-1280,
making use of a 128-bit key, 1280-bit state, and 256-bit message block. Table 5.3 lists the
number of computation steps required for each stage of MORUS.
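The stage counts of Tables 5.1-5.3 can be merged into a single comparison model; the function name and the 1 KiB example below are illustrative, not part of the original analysis.

```python
from math import ceil

def cipher_cycles(cipher: str, adlen: int, ptlen: int) -> int:
    """Computation cycles per Tables 5.1-5.3; lengths are in bits."""
    if cipher == "acorn32":
        return 56 + (ceil(adlen / 32) + 8) + (ceil(ptlen / 32) + 8) + 24
    if cipher == "aegis128l":
        return 10 + ceil(adlen / 256) + ceil(ptlen / 256) + 7
    if cipher == "morus1280":
        return 16 + ceil(adlen / 256) + ceil(ptlen / 256) + 8
    raise ValueError(cipher)

# Encrypting 1 KiB (8192 bits) of plaintext with no AD shows the wide
# 256-bit datapaths of AEGIS and MORUS finishing in far fewer cycles.
counts = {c: cipher_cycles(c, 0, 8192)
          for c in ("acorn32", "aegis128l", "morus1280")}
```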
5.3 Design alternatives for SoC integration
As was shown in Figure 5.1, there are two main components in the coprocessor wrapper:
• Storage elements for core signals: For the AEAD cipher coprocessors used here,
registers are required to store the key, public data, input message, and the results in the
form of cipher-text and tag. In addition, registers are also needed to store the size of AD
and PT, but these are small in size and negligible in comparison to the data registers
required.
• Control logic to send/receive data to/from the core: This logic is generally mod-
eled as a Finite State Machine (FSM) that waits for the required inputs to be received
by the wrapper before sending them to the core, ensuring appropriate handshaking as
required by the core. Similarly, it needs to monitor the results from the core to be stored
and sent out to the system bus when requested.
The control logic is fundamental to the functioning of a coprocessor wrapper, and is therefore
unavoidable. Moreover, the FSM itself contributes far less to the overhead than the storage
does. This is simply because the FSM consists of only a few bits of state and combinatorial
state-update logic, whereas the data storage consists of a large number of flip-flops, which is
likely to have a significant effect in a lightweight context. We therefore narrow down our
analysis to three design alternatives, depending on the size of the storage resources making
up the wrapper.
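As a conceptual sketch, the wrapper's control FSM can be reduced to four states; the state names and transition conditions below are illustrative, not a transcription of our RTL.

```python
from enum import Enum, auto

class WState(Enum):
    IDLE = auto()   # waiting for key/Npub/data words from the bus
    FEED = auto()   # forwarding received words to the cipher core
    WAIT = auto()   # core busy (e.g. initialization or finalization)
    DONE = auto()   # ciphertext and tag latched, readable over the bus

def next_state(s: WState, have_word: bool,
               core_ready: bool, core_done: bool) -> WState:
    """One transition of the conceptual wrapper FSM."""
    if s is WState.IDLE:
        return WState.FEED if have_word else WState.IDLE
    if s is WState.FEED:
        if core_done:
            return WState.DONE
        return WState.FEED if core_ready else WState.WAIT
    if s is WState.WAIT:
        return WState.FEED if core_ready else WState.WAIT
    return WState.DONE
```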
5.3.1 An intuitive and convenient wrapper design - FIFOs at the
input and output
This is a simple scheme where the processor continuously sends all data to the coprocessor
which stores them in a FIFO at the input side. The control FSM in the wrapper monitors
the core and reads data out of the input FIFO as and when the core is ready to accept them.
The ciphertext and tag sent out of the core are stored in a FIFO on the output side. The
software monitors the completion of encryption and tag generation before reading the result
out.
Figure 5.7: Illustration of Coprocessor wrapper and Software API for integration with input and output FIFOs.
Advantages: This method effectively decouples the cipher core and the processor by en-
abling continuous data transfer from the processor. There is little or no handshaking required
between the wrapper and the processor since data loss will be avoided by the presence of
the input FIFO. This method does not require the coprocessor designer to understand the
cipher core in great detail apart from the interface signals and handshaking mechanism.
Disadvantages: The FIFOs take up too much space. For instance, with the TSMC 180 nm
library used for this chip, even a small 64-word (2KBit) FIFO built with flip flops is about
2.5× bigger than the entire ACORN core. While it can be argued that replacing flip flops
with SRAM memory macros could be a better alternative, we found that a 64-word (2KBit)
SRAM macro for TSMC 180nm technology still takes up 1.7× more space than the core.
Another notable disadvantage of using FIFOs is that they limit the amount of data that can
be sent to the coprocessor in one go. Failure to maintain a FIFO read rate greater than
or equal to the write rate can potentially lead to data loss, depending on the size of the
data. A possible workaround is to send the data in installments, reading the results for one
group of data words before sending the next.
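The installment workaround can be sketched as a simple chunking loop; write_fn and read_fn below are stand-ins for actual bus accesses to the coprocessor.

```python
def send_in_installments(words, fifo_depth, write_fn, read_fn):
    """Write at most fifo_depth words at a time, draining the results
    before the next installment so the input FIFO cannot overflow."""
    results = []
    for i in range(0, len(words), fifo_depth):
        chunk = words[i:i + fifo_depth]
        for w in chunk:
            write_fn(w)                      # e.g. a store to the APB address
        results.extend(read_fn(len(chunk)))  # read back result words
    return results

# Usage with a queue standing in for the coprocessor (identity "cipher"):
buf = []
out = send_in_installments(list(range(100)), 32,
                           buf.append,
                           lambda n: [buf.pop(0) for _ in range(n)])
```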
5.3.2 Reducing wrapper overhead - FIFO only at the output
The area numbers in the previous subsection show that even small FIFOs manifest as a
huge overhead when added on top of a compact cipher core, so much so that the coprocessor
as a whole no longer retains the lightweight properties of the core. Removing one of the
FIFOs is therefore appealing, since it can halve the FIFO overhead. This is made possible
by the fundamental working of the AEAD cipher cores.
As was described in Section 5.2, these ciphers require only one cycle to process a particular
message (AD or PT), generate the ciphertext, and get ready to accept the next word. The
only wait periods when the core cannot accept inputs occur during initialization and final-
ization stages. What this means is that apart from the wait periods, reading data from the
input FIFO can happen as fast as the rate at which data is written. This can be exploited
by creating a scheme where there is no input FIFO, and data that is sent through software
is forwarded to the core immediately. The software needs to perform handshaking with the
wrapper during the wait periods, since no data is sent to the core in this duration.
Advantages: The basic functionality of the core lends itself well to this scheme. While
the extra handshaking is expected to result in a small performance penalty, the primary
advantage here is the huge reduction in area and power made possible by getting rid of one
entire FIFO.
Disadvantages: This method requires greater understanding of the cipher core’s hardware
implementation than the previous scheme. For instance, for the ciphers here, small mod-
ifications were required to ensure that the cipher core does not assume that AD and PT
inputs arrive in consecutive cycles. The extent of complexity of these changes depends on the
Figure 5.8: Illustration of Coprocessor wrapper and Software API for integration with only an output FIFO.
cipher used and its hardware implementation. In addition, this scheme still does not solve
the problem of a limit on the maximum size of data that can be sent at once. Since there is
still one FIFO present, data of size greater than the FIFO’s capacity needs to be sent over
installments.
5.3.3 Lightweight integration - no FIFOs in the wrapper
This scheme makes use of a no-frills wrapper with no FIFOs to hold data. It uses only registers
to hold the key and public data, along with a 128-bit (or 256-bit, for AEGIS/MORUS)
message word at the input and the output. The intention here is to consider this design as
a reference that indicates the best-case scenario, i.e. an estimate on the lower bound for the
wrapper overhead.
While this scheme minimizes wrapper overhead, it requires changes on the software side. It
is no longer possible to send more than 128 (or 256) bits of data before reading the result
out, since there is no FIFO at the output of the core. Therefore, this scheme requires the
software to send four (or eight) 32-bit words of data, followed by an immediate reading of
four (or eight) ciphertext words. This is repeated until the whole message is encrypted.
Figure 5.9: Illustration of Coprocessor wrapper and Software API for lightweight integration with no FIFOs.
Advantages: As mentioned earlier, this design presents a good example of the minimum
wrapper overhead that is unavoidable for a particular cipher. If implemented in a practical
system, this design is not expected to be significantly slower than the other schemes when all
data is sent only by the processor. This is because the number of read and write operations
is still the same, with only their order changed.
Disadvantages: When compared with designs that send data in a burst (for example,
through Direct Memory Access), this scheme is bound to have lower performance, since
it cannot support bursty data. In that sense, the practical utility of this scheme is limited.
Furthermore, for ciphers that require more than one cycle to generate the result, this
scheme will require constant polling by the software to monitor when the operation is done,
before reading the result out. This can lead to decreased throughput and increased power
consumption.
5.3.4 Direct Memory Access (DMA) for increased throughput
While presenting the performance of a coprocessor, care must be taken to show both the
best-case and worst-case scenarios. While sending inputs from the processor is a simple
method, it incurs significant loss in performance due to each data transfer going through
the processor pipeline as an individual instruction. A common method followed in practical
designs is to offload the task of transferring large data to a DMA controller that simply reads
a large chunk of data from a source and writes it to a destination.
As there is a DMA controller already included in our SoC, we consider transfers to the
coprocessors through DMA, in order to understand the best-case performance achievable
after SoC integration. This can be used with the first two FIFO schemes described earlier,
and is very beneficial in systems that already have a DMA controller as part of the SoC.
5.4 Evaluation Methodology
Separate coprocessor wrappers with an APB interface are first built for each scheme
discussed, and the design alternatives analyzed are as follows:
1. Input and Output FIFOs, with and without DMA for ACORN-8, ACORN-32, AEGIS,
and MORUS.
2. Output FIFO only, with and without DMA for ACORN-32, AEGIS, and MORUS.
3. No FIFO, without DMA for ACORN-32, AEGIS, and MORUS.
5.4.1 Studying Area and Power
For the purpose of analysis, all the coprocessor alternatives are attached to the APB bus
on the SoC, and the design is synthesized at 80 MHz using Synopsys DC with the same
constraints as those on our primary chip design. We obtain the post-synthesis area from
DC to understand the “price to pay” for SoC integration, i.e. how much additional area is
required over the standalone cipher core.
To study power efficiency, gate-level simulation is first run on the post-synthesis netlist for
each design alternative using ModelSim. The test cases used here include those provided by
the designers, as well as a set of arbitrary test vectors of different sizes. VCD files generated
from ModelSim are used for power analysis using Synopsys PrimeTime. We focus primarily
on dynamic power consumption of the top-level design, the coprocessor, and other active
components of the SoC. Static power, being three orders of magnitude smaller in the 180 nm
technology node, is not included here due to its negligible impact on total power.
5.4.2 Performance Analysis
Performance comparison is performed through RTL simulation in Modelsim with test cases
of different sizes, using the general-purpose timer present on the SoC to measure clock cycles
elapsed from the start of an encryption to its end. Following similar analysis previously pre-
sented in literature [23], we observe that total time required for an authenticated encryption
using a coprocessor on an SoC can be broken down into the following components:
• Computation Time: Time required for the hardware coprocessor to complete the entire
authenticated encryption.
• Communication Time: This refers to the total time required for sending data and
control words to the coprocessor, and reading the results back. Communication time is
composed of two types of overhead:
– Bus Overhead: Time taken for data transfer to and from coprocessor over the system
bus.
– Processor overhead: Time spent in the processor pipeline, which includes instruction
decoding, cache operations, and memory accesses.
For the AEAD ciphers considered in this work, a major part of computation time overlaps
with communication time due to their single-cycle state updates. The only non-overlapped
portion occurs when the software waits for final tag generation to be completed after sending
all data. This is illustrated in Figure 5.10. The contributions of each type of overhead will
be presented in the following section.
Figure 5.10: An example of different sources of contribution to the total time for authenticated encryption using an AEAD coprocessor on SoC.
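Because state updates overlap with data transfer, total time reduces to communication time plus the non-overlapped finalization wait. A toy model of this breakdown follows; the per-word costs are illustrative, not measured values.

```python
def total_time(n_words: int, bus_per_word: int,
               cpu_per_word: int, finalization: int) -> int:
    """Cycles for one authenticated encryption: computation overlaps
    communication, so only the tag-generation wait adds on top."""
    communication = n_words * (bus_per_word + cpu_per_word)
    return communication + finalization

# 64 words at a hypothetical 12 cycles/word, with a 24-cycle tag wait.
cycles = total_time(64, 2, 10, 24)
```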
5.5 Observations and Results
In this section, we now discuss important results from our analysis of coprocessor alternatives,
and provide our comments on the benefits and trade-offs they pose at the system-level.
5.5.1 ACORN-32
5.5.1.1 Area and Performance
Figure 5.11(a) first shows that even with the most lightweight wrapper with no FIFOs, there
is still a 1.7× increase in area over the ACORN-32 core. The storage elements needed for
the key and data are comparable in size to the internal logic of ACORN-32, making the
resultant coprocessor significantly bigger. Furthermore, adding two 1 KBit FIFOs is seen to
be highly area-inefficient, resulting in a 4× increase in area.
Figure 5.11(b) shows the performance comparison of the different coprocessor alternatives,
represented as the ratio of the time consumed by each design to the time required for
standalone hardware. First, we see that in addition to saving 30% area, having only an output
FIFO incurs only a small performance loss of less than 5% as compared to a conventional
two-FIFO design. This decrease in performance arises due to the additional wait period
required for the former as shown in Figure 5.12. However, as this duration is fixed, the loss
in performance remains small over all the test cases.
Another important observation from Figure 5.11(b) is that the DMA-based design alterna-
tives are slower for messages smaller than 32 bytes, while they provide significant speedup
for longer messages. The increased speedup for large messages is because the DMA mini-
mizes processor overhead which is the dominant component of total time consumed. The
DMA controller reads data from RAM in consecutive cycles before transferring them to the
coprocessor without any processor intervention in between (Figure 5.13).
DMA-based design alternatives perform worse for small data sizes since there is a fixed time
required to program the DMA each time it transfers a block of data. This task consumes
more time than the actual data transfer. This point is highlighted in Figure 5.14, where the
Figure 5.11: Area and performance overhead of ACORN-32 coprocessor alternatives. (a) Area overhead (K Gate Eq.) of the ACORN core versus the no-FIFO, output-FIFO, and two-FIFO coprocessors. (b) Performance overhead - lower value indicates better performance; performance is improved by DMA.
contribution of processor overhead when using DMA is seen to be close to 75% of total time
for small messages. For larger data, this contribution progressively decreases to less than
60% of total time by using DMA, whereas it remains consistently above 67% without DMA.
Considering the fact that this happens only for very small data sizes, we believe that using
DMA is beneficial in order to extract the best possible performance in a practical setting.
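The break-even behavior between DMA and processor-driven transfer can be sketched with a toy cost model. The cycle counts below are hypothetical (the thesis reports ratios, not these numbers); the point is only that a fixed DMA programming cost amortizes with message size.

```python
# Hedged sketch: DMA pays a fixed programming (setup) cost per block of data,
# while processor-driven (PIO) transfer pays a per-word loop/load/store cost.

def dma_cycles(n_words, setup=150, per_word=1):
    return setup + n_words * per_word      # one-time DMA programming + burst

def pio_cycles(n_words, per_word=8):
    return n_words * per_word              # per-word processor overhead

def dma_wins(n_words):
    return dma_cycles(n_words) < pio_cycles(n_words)

# With these assumed costs, DMA only pays off beyond ~21 words:
assert not dma_wins(8)    # small message: setup dominates
assert dma_wins(64)       # large message: setup amortizes away
```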
Finally, Figure 5.11(b) suggests that the FIFO-less no-frills design maintains appreciable
(a) Two-FIFO design
(b) Output-FIFO-only design
Figure 5.12: Illustration of communication overhead and wait periods.
Figure 5.13: Reduction of processor overhead in data transfer using DMA.
performance that is better than the designs not using DMA. The reason is that this design
sends four data words in quick succession before reading four words of ciphertext. The
processor overhead caused due to loop operations is smaller here, as opposed to FIFO-based
designs. For larger test cases, the performance of this design does not drop as sharply as
other non-DMA designs, and remains within 10% of the DMA-based alternatives. In Figure
5.11(b), it even becomes faster than the DMA-based designs for message sizes between 240
and 330 bytes, because at those sizes the data must be split across two DMA transfers owing
to FIFO capacity limitations.
To summarize this analysis of area and performance, we make use of throughput-per-area as
Figure 5.14: ACORN-32 coprocessor performance - split of processor, bus, and computation overhead as a percentage of total time versus AD + PT length (bytes). (a) Designs not employing DMA. (b) Designs using DMA.
Figure 5.15: Throughput-per-area (Mbps/KGE) of ACORN-32 coprocessor designs versus (AD+PT) length (bytes).
a metric that captures both the performance and silicon overhead together. This is shown
for all the designs in Figure 5.15. Averaged over all tests, the lightweight no-FIFO wrapper
wins with a 1.63× higher throughput-per-area over the next best design. For systems where
the use of DMA is desired, the output-FIFO wrapper provides the best trade-off.
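The throughput-per-area metric used above can be computed directly from cycle counts, clock frequency, and synthesized area. This is an illustrative sketch; the numbers in the example are hypothetical, not the measured values behind Figure 5.15.

```python
# Hedged sketch: throughput-per-area as used in Figure 5.15, from message
# size in bits, run-time in cycles, clock frequency, and area in kilo gate
# equivalents (KGE).

def throughput_per_area(msg_bits, cycles, f_hz, area_kge):
    throughput_mbps = msg_bits * f_hz / cycles / 1e6
    return throughput_mbps / area_kge      # Mbps per KGE

# e.g. a 256-byte message in 2000 cycles at 100 MHz on a 10 KGE design:
tpa = throughput_per_area(256 * 8, 2000, 100e6, 10.0)
assert abs(tpa - 10.24) < 1e-9
```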
5.5.1.2 Power and Energy efficiency
The overall power consumed by an SoC during a particular coprocessor operation is affected
by switching in the active coprocessor, as well as other active components on the chip. These
blocks, which are necessary for the SoC’s basic functionality, mainly include the processor,
memories, system bus, and the cache controller. As an illustration, Table 5.4 lists the
contribution of these major components to the total SoC power for an ACORN-32 test using
DMA.
Component on Chip       % of Total power
On-Chip Memory          25.71%
Processor               8.0%
ACORN-32 Coprocessor    7.1%
DMA Controller          3.7%
Cache controller        0.8%
AHB                     0.3%
APB                     0.3%
Table 5.4: Contribution of active blocks to total power during ACORN-32 tests.
Block                                 Clock tree power (% of block power)
Top-level                             59.8%
Processor                             62.1%
ACORN-32 Coprocessor (with FIFOs)     82.2%
ACORN-32 Coprocessor (without FIFOs)  65.04%
DMA Controller                        27.6%
Cache controller                      27.55%
Table 5.5: Contribution of clock tree to block-level power during ACORN-32 tests.
From Table 5.4, we see that apart from memory, the processor and active coprocessor have a
significant contribution. The buses and cache controller have a very small impact on the total
power due to their relatively small hardware footprint. In addition, Table 5.5 shows that for
the major blocks, a large part of their power consumption comes from their clock tree. This
is due to constant switching of the clock network and related buffers, and this becomes more
pronounced with the increase in the size of the block. We now discuss how different design
alternatives affect the power of active blocks, and their impact on total power.
Figure 5.16: Power consumption (W) resulting from different ACORN-32 coprocessor designs. (a) Top-level power consumption. (b) Block-level power consumption for the processor, ACORN-32 coprocessor (FIFO and core), and DMA.
Figure 5.16(a) shows that the DMA-based designs result in a 3.6% increase of chip-level
(top-level) power, computed over the total simulation duration. This impact is small because
clock power dominates, masking the dynamic power of logic switching during coprocessor
tests. Since the impact on top-level power greatly
depends on SoC size, number of simultaneously active components on chip, and the extent
of clock or power gating, we additionally focus on block-level power consumption (Figure
5.16(b)).
While the DMA-based designs result in 16% less processor power due to reduced activity, the
power consumption due to the DMA controller and its internal buffers offsets this difference.
As for integration overhead on the cipher core, we see that ACORN-32 is a wrapper-limited
design, with the FIFO power contributing to 59% and 42% of total coprocessor power for
the two-FIFO and one-FIFO designs respectively. As a result, the no-FIFO wrapper gives
the most power-efficient ACORN-32 coprocessor. However, the most power-efficient copro-
cessor does not necessarily result in the least power at the system level since the coprocessor
amounts to less than 5% of top-level power. In this regard, it is the one-FIFO wrapper
without DMA that results in the smallest top-level power. The reason for this is its
significantly shorter run-time compared to the other alternatives, which brings our focus to
energy efficiency as an alternate quality metric.
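The energy-per-bit metric used in the comparisons below is simply average power multiplied by run-time, normalized by message size. This sketch uses illustrative numbers, not the measured values of Figure 5.17.

```python
# Hedged sketch: energy-per-bit, the metric of Figure 5.17. Because the
# design alternatives differ significantly in run-time, comparing average
# power alone is misleading; energy = power x time normalizes this.

def energy_per_bit_nj(avg_power_w, runtime_s, msg_bits):
    return avg_power_w * runtime_s / msg_bits * 1e9   # nJ/bit

# A slower design at lower power can still cost more energy per bit:
fast = energy_per_bit_nj(0.250, 10e-6, 2048)   # 250 mW for 10 us
slow = energy_per_bit_nj(0.235, 20e-6, 2048)   # 235 mW for 20 us
assert slow > fast
```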
Figure 5.17: Energy efficiency (nJ/bit) of ACORN-32 coprocessor alternatives. (a) Top-level energy for small messages. (b) Block-level energy for small messages. (c) Top-level energy for large messages. (d) Block-level energy for large messages.
The need to compare energy efficiency is two-fold. First, embedded applications running
on battery-powered devices must consume less energy over time. Second, the design
alternatives considered differ significantly in their run-time, making power comparisons
misleading since power is averaged over time. Figures 5.17(a) and 5.17(b) show that the DMA-
based designs are the least energy-efficient for small messages. The excess time spent in
DMA programming relative to data transfer, discussed in Section 5.5.1.1, is
the reason for this inefficiency. Figures 5.17(c) and 5.17(d) indicate that for more practical
message lengths, DMA-based designs reduce top-level energy consumption per message bit
by 20%, owing to their faster completion. Another interesting result is that the no-FIFO
wrapper offers comparable energy-efficiency to the DMA-based designs at the top-level, while
reducing the ACORN-32 coprocessor energy-per-bit by more than 36% for all message sizes
considered. This is due to a combination of speed comparable to the DMA-based designs
(shown in Section 5.5.1.1) and the complete elimination of FIFO power.
We summarize this analysis by suggesting that unless the message sizes are extremely small,
either DMA with a FIFO-based wrapper or the lightweight no-FIFO wrapper is the most
energy-efficient option for ACORN-32.
5.5.2 ACORN-8
5.5.2.1 Area and Performance
Figure 5.18(a) shows the most important result here - there is a severe area penalty incurred
during the SoC integration of a lightweight cipher with a shortened datapath width. The
coprocessor wrapper needs to be able to accept 32-bit writes every cycle, while it can only
read out 8 bits at a time. Furthermore, there is a large initialization time of 224 cycles
when no data can be read out of the FIFO. With the FIFO-less lightweight wrapper not
being suitable here, the coprocessor area becomes 6.4× that of the cipher core. This clearly
negates the primary intention of making the cipher lightweight. Regarding performance,
with the ACORN-8 core being inherently much slower than the other ciphers (Table 5.1),
SoC integration makes it 1.6-2× slower. Using DMA improves performance, providing
a 1.2× speedup over the non-DMA alternative.
Figure 5.18: Illustration of area and performance overhead arising from different alternatives for SoC integration of ACORN-8. (a) Area overhead (K Gate Eq.) of the ACORN-8 core versus the two-FIFO coprocessor. (b) Performance overhead (cycles relative to standalone hardware) versus AD + PT length, with and without DMA.
5.5.2.2 Energy efficiency
As there are no wrapper alternatives to minimize area, we only compare energy efficiency
with and without DMA. Using DMA naturally results in 18% lower energy-per-bit for large
messages.
Figure 5.19: Energy efficiency (nJ/bit) of ACORN-8 coprocessor alternatives, for large messages. (a) Top-level energy. (b) Block-level energy.
In summary, the results for ACORN-8 reiterate the point that in spite of its small logic
footprint compared to ACORN-32, integration onto an SoC with a wider datapath negates
the advantages offered by the standalone core. ACORN-32 is therefore more suited to SoC
integration than ACORN-8.
5.5.3 AEGIS-128L
5.5.3.1 Area and Performance
The area of the AEGIS-128L coprocessor is heavily influenced by its core, unlike lightweight
ciphers such as ACORN, whose core is about 14× smaller in comparison. This can be understood
from Figure 5.20(a), which shows the AEGIS-128L coprocessor area to be affected more by
its core than the wrapper. Adding a small 512-bit input FIFO and a 2 KBit output FIFO
has only a 1.31× area overhead. A small input FIFO is sufficient since the FIFO is read
immediately after every eight writes to it due to the 256-bit message block size. A FIFO-less
wrapper adds 9.8% additional area to the AEGIS-128L core, while a two-FIFO design adds
31% overhead. Doing away with the input FIFO provides a negligible area reduction of 4%.
Figure 5.20: Illustration of area and performance overhead arising from different alternatives for SoC integration of AEGIS-128L. (a) Area overhead (K Gate Eq.) of the AEGIS-128L core versus the no-FIFO, output-FIFO, and two-FIFO coprocessors. (b) Performance overhead (cycles relative to standalone hardware) versus AD + PT length.
The coprocessor performance follows a similar pattern as that for ACORN, but the decrease
in speed over a standalone hardware implementation is more pronounced due to the high
speed of AEGIS-128L. As a result, the best-case performance on SoC obtained using a DMA
is still 30-35× slower than standalone AEGIS-128L hardware. The performances of the two
FIFO-based designs are almost identical due to very small wait periods between stages.
Figure 5.21: Throughput-per-area (Mbps/KGE) of AEGIS-128L coprocessor designs versus AD+PT length (bytes).
In summary, from Figure 5.21, we conclude that the no-FIFO wrapper and the FIFO-based
wrappers using DMA all provide appreciable area-performance trade-off. The output-FIFO
design has a higher throughput-per-area for larger messages, but only by a small factor of
1.07×. This suggests that the choice of wrapper does not have a significant impact
on bulky ciphers such as AEGIS.
5.5.3.2 Power and Energy efficiency
An important observation here is that the AEGIS coprocessor contributes 12.6% of the total
power, which is even higher than the processor's contribution. The contributions of other blocks to
total power are close to those seen in the ACORN tests, and are listed in Appendix C. The
large logic footprint of AEGIS leads to high clock network power, and the highly parallelized
implementation causes increased logic switching power. Unlike ACORN, the AEGIS core
contributes more to total coprocessor power than its wrapper, which results in a power
reduction of less than 1% for the one-FIFO design over the two-FIFO one. The DMA-based
designs consume 6% more power because their switching activity is compressed into a shorter
run-time, which again necessitates a comparison of energy efficiency.
Figure 5.22: Power consumption (W) resulting from different AEGIS coprocessor designs. (a) Top-level power consumption. (b) Block-level power consumption.
The DMA-based designs lead to 20% more top-level energy consumption per bit for messages
smaller than 32 bytes. Since AEGIS-128L is a high-throughput cipher intended for use in
applications with large data, we believe that the results for larger tests have greater practical
relevance. The energy-per-bit for small messages can be found in Appendix C. We see from
Figure 5.23(a) that the DMA-based designs are clearly more energy-efficient, by about 30%
and 14% as compared to non-DMA designs with and without FIFOs respectively. Unlike
ACORN, the no-FIFO wrapper does not offer a significant energy benefit due to the high
energy consumption of AEGIS core. Its small reduction of coprocessor energy is nullified by
an increase in processor energy, and it is the DMA’s faster completion that finally wins the
energy battle.
In summary, we believe that for a large cipher such as AEGIS, the use of a DMA is the best
way to achieve better system-level energy-efficiency.
Figure 5.23: Energy efficiency (nJ/bit) of AEGIS coprocessor alternatives, for large messages. (a) Top-level energy. (b) Block-level energy.
5.5.4 MORUS
5.5.4.1 Area and Performance
With the MORUS core area being 3.2× smaller than AEGIS, integration onto an SoC comes
at a significant price. Figure 5.24(a) shows a 32% additional area required for the most
lightweight wrapper, while this value jumps to 103% for a two-FIFO wrapper with 512-bit
and 2 KBit input and output FIFOs respectively. Removal of the input FIFO is a more
effective option than in the case of AEGIS, with an area reduction of 11.2% over the two-
FIFO alternative.
For large messages, integration without using DMA makes the coprocessor 50× slower than
standalone MORUS-1280. Using a DMA brings the penalty down to 28×, which is the
best-case performance possible in this system. The performances of one-FIFO and two-
FIFO designs are virtually indistinguishable due to negligible wait periods between MORUS
stages. The lightweight no-FIFO wrapper, on the other hand, provides a reasonable 1.38×
gain in performance over the alternatives not using DMA, owing to fewer looping operations.
Figure 5.24: Illustration of area and performance overhead arising from different alternatives for SoC integration of MORUS-1280. (a) Area overhead (KGE) of the MORUS-1280 core versus the no-FIFO, output-FIFO, and two-FIFO coprocessors. (b) Performance overhead (cycles relative to standalone hardware) versus AD + PT length.
Figure 5.25: Throughput-per-area (Mbps/KGE) of MORUS-1280 coprocessor designs versus AD+PT length (bytes).
5.5.4.2 Power and Energy efficiency
The power and energy efficiency of MORUS coprocessor alternatives are very similar to those
of AEGIS, due to the cipher core’s contribution being comparable to that of the FIFOs.
The MORUS coprocessor contributes 8.7% of the total power - roughly the same as the
processor. The lightweight wrapper reduces coprocessor power by more than 35%, but
increases that of the processor, thereby leading to a 3.3% increase in top-level power. As
seen in previous ciphers, DMA-based designs increase top-level power by 4-5%.
Figure 5.26: Power consumption (W) resulting from different MORUS coprocessor designs. (a) Top-level power consumption. (b) Block-level power consumption.
Energy consumed per message bit for small messages is plotted for the different MORUS
coprocessor alternatives in Appendix C. It is not discussed here since MORUS, like
AEGIS, targets high performance, which is practically more relevant for messages larger
than 32 bytes. For large messages, the DMA-based design alternatives are energy-efficient,
reducing energy per message bit by 16.4% over the no-FIFO design. While the latter
eliminates FIFO energy, the cipher core itself accounts for roughly half the coprocessor
power. In addition, the no-FIFO design increases processor energy, which leads to an
increase in overall top-level energy consumption.
5.6 Conclusion
With one-pass AEAD ciphers gaining traction due to their high speed and hardware-efficiency,
the relative lack of research on their integration onto a System-on-Chip motivated the analy-
sis in this chapter. Different coprocessor alternatives for CAESAR finalists ACORN, AEGIS,
and MORUS were studied with regard to their area, performance, and power after SoC integration.
Figure 5.27: Energy efficiency (nJ/bit) of MORUS coprocessor alternatives, for large messages. (a) Top-level energy. (b) Block-level energy.
We make the following conclusions from our analysis.
1. Area and performance overhead after SoC integration is inevitable. Lightweight copro-
cessor wrappers with no FIFO, or a single FIFO at the output were found to achieve
the best area-performance trade-off.
2. Lightweight ciphers whose datapath width is smaller than that of the system bus incur
a greater penalty on their area and performance.
3. Designs can employ Direct Memory Access to maximize their throughput-per-area and
minimize energy consumption by reducing processor activity.
To make a final comment on the AEAD coprocessors, Figures 5.28(a) and 5.28(b) plot the
average throughput and throughput-per-area respectively against the energy consumed per
bit, all averaged over identical test cases. These results are shown for the top two coprocessor
alternatives that emerged from our analysis.
From these figures, we believe that while AEGIS-128L and MORUS-1280 achieve the highest
throughput after SoC integration, ACORN-32 is the best choice for area-constrained appli-
cations. MORUS-1280 is a better choice for energy-constrained applications owing to its
Figure 5.28: Comparison of AEAD coprocessors: MORUS, AEGIS, ACORN-32, and ACORN-8 in their best configurations (no FIFO; output FIFO with DMA; two FIFOs with DMA). (a) Throughput vs Energy-per-bit. (b) Throughput-per-area vs Energy-per-bit.
better throughput-per-area than AEGIS-128L.
Chapter 6
Conclusion
This thesis presented the impact of logic synthesis on circuits minimized by a new class
of tools optimized for low gate count. The hardware efficiency of circuits was extensively
compared with that of a set of benchmark designs over a range of frequencies, after synthesis,
as well as physical design. This analysis was performed to demonstrate the importance of
trade-offs between the circuits’ area, power, and performance after synthesis. Specific regions
in the solution space were highlighted, where logic minimization provides intended benefits
on hardware. Furthermore, it was shown that these benefits can diminish when
logic-minimized combinational blocks are integrated into a complete system. The
discussion ended with a study of the overall effectiveness of using logical metrics to predict
the aforementioned circuits' hardware efficiency. A prototype ASIC was designed to observe the
performance and efficiency of these circuits on hardware. The ASIC comprises a System-on-
Chip designed to accommodate multiple hardware blocks as coprocessors around a Leon3
processor core. Apart from a coprocessor consisting of logic-minimized circuits, the SoC
also includes recently developed ciphers for authenticated encryption. The second part of
this thesis explored the implementation of these ciphers as coprocessors on an SoC, and
investigated the benefits and trade-offs associated with different design alternatives for the
same. We believe that the results presented in this thesis will serve as a useful reference for
future research in this field.
113
Bibliography
[1] CAESAR: Competition for Authenticated Encryption: Security, Applicability, and Ro-
bustness. https://competitions.cr.yp.to/caesar.html.
[2] Hardware implementation of finite-field arithmetic. http://www.
arithmetic-circuits.org/finite-field/vhdl_codes.htm.
[3] Tohoku university: Cryptographic Hardware Project, May 2015. URL http://www.
aoki.ecei.tohoku.ac.jp/crypto/.
[4] Kazumaro Aoki, Tetsuya Ichikawa, Masayuki Kanda, Mitsuru Matsui, Shiho Moriai,
Junko Nakajima, and Toshio Tokita. Camellia: A 128-bit block cipher suitable for
multiple platforms - design and analysis. In Selected Areas in Cryptography, 2000.
[5] Christof Beierle, Thorsten Kranz, and Gregor Leander. Lightweight multiplication in
GF(2n) with applications to MDS Matrices. In Matthew Robshaw and Jonathan Katz,
editors, Advances in Cryptology – CRYPTO 2016, pages 625–653, Berlin, Heidelberg,
2016. Springer Berlin Heidelberg. ISBN 978-3-662-53018-4.
[6] Daniel J. Bernstein. Minimum number of bit operations for multiplication. https:
//binary.cr.yp.to/m.html.
[7] Daniel J. Bernstein. Batch Binary Edwards. In CRYPTO, 2009.
[8] Joan Boyar and René Peralta. A New Combinational Logic Minimization Technique
with Applications to Cryptology. In Festa P. (eds) Experimental Algorithms. SEA 2010.
Lecture Notes in Computer Science, volume 6049. Springer, Berlin, Heidelberg, 2010.
114
BIBLIOGRAPHY 115
[9] Joan Boyar and René Peralta. A Small Depth-16 Circuit for the AES S-Box. In Gritzalis
D., Furnell S., Theoharidou M. (eds) Information Security and Privacy Research. SEC
2012. IFIP Advances in Information and Communication Technology. Springer, Berlin,
Heidelberg, 2012.
[10] Joan Boyar, Morris Dworkin, René Peralta, Meltem Turan, Cagdas Calik, , and Luis
Brandao. Circuit minimization work. http://cs-www.cs.yale.edu/homes/peralta/
CircuitStuff/CMT.html. Past collaborators include: Michael Bartock, Ramon Col-
lazo, Magnus Find, Michael Fischer, Murat Cenk, Christopher Wood, Andrea Visconti,
Chiara Schiavo, Holman Gao, Bruce Strackbein, Larry Bassham.
[11] Joan Boyar, Philip Matthews, and René Peralta. Logic minimization techniques with
applications to cryptology. J. Cryptol., 26(2):280–312, April 2013. ISSN 0933-2790. doi:
10.1007/s00145-012-9124-7. URL http://dx.doi.org/10.1007/s00145-012-9124-7.
[12] Joan Boyar, Magnus Find, and René Peralta. Small low-depth circuits for cryptographic
applications. Cryptography and Communications, 11:109–127, 2018.
[13] David Canright. A very compact Rijndael S-box, 2004. URL https://calhoun.nps.
edu/handle/10945/791.
[14] David Canright. A very compact s-box for aes. In CHES, 2005.
[15] Murat Cenk and M. Anwar Hasan. Some new results on binary polynomial multiplica-
tion. Cryptology ePrint Archive, Report 2015/094, 2015. https://eprint.iacr.org/
2015/094.
[16] Alessandro Cilardo. Fast parallel GF(2m) polynomial multiplication for all degrees.
IEEE Transactions on Computers, 62:929–943, 2013.
116 BIBLIOGRAPHY
[17] Nicolas Courtois, Daniel Hulme, and Theodosis Mourouzis. Solving circuit optimisation
problems in cryptography and cryptanalysis. IACR Cryptology ePrint Archive, 2011:
475, 01 2011.
[18] Chinmay Deshpande. Hardware fault attack detection methods for secure embedded
systems. Master’s Thesis, Computer Engineering, Virginia Tech, February 2018. URL
https://vtechworks.lib.vt.edu/handle/10919/82141.
[19] William Diehl, Farnoud Farahmand, Abubakr Abdulgadir, Jens-Peter Kaps, and Kris
Gaj. Face-off between the CAESAR lightweight finalists: ACORN vs. Ascon. Cryptol-
ogy ePrint Archive, Report 2019/184, 2019. https://eprint.iacr.org/2019/184.
[20] Michele Elia, M Leone, and C Visentin. Low complexity bit-parallel multipliers for
GF(2m) with generator polynomial xm + xk + 1. Electronics Letters, 35:551 – 552, 05
1999. doi: 10.1049/el:19990407.
[21] Farnoud Farahmand, William Diehl, Abubakr Abdulgadir, Jens-Peter Kaps, and Kris
Gaj. Improved lightweight implementations of CAESAR Authenticated Ciphers. 2018
IEEE 26th Annual International Symposium on Field-Programmable Custom Computing
Machines (FCCM), pages 29–36, 2018.
[22] Carsten Fuhs and Peter Schneider-Kamp. Synthesizing shortest Linear Straight-Line
Programs over gf(2) using SAT. In Ofer Strichman and Stefan Szeider, editors, Theory
and Applications of Satisfiability Testing – SAT 2010, pages 71–84, Berlin, Heidelberg,
2010. Springer Berlin Heidelberg. ISBN 978-3-642-14186-7.
[23] Xu Guo, Zhimin Chen, and Patrick Schaumont. Energy and performance evaluation
of an FPGA-Based SoC platform with AES and PRESENT coprocessors. In Embedded
Computer Systems: Architectures, Modeling, and Simulation, pages 106–115, Berlin,
Heidelberg, 2008. Springer Berlin Heidelberg. ISBN 978-3-540-70550-5.
BIBLIOGRAPHY 117
[24] Alper Halbutogullari and Cetin Koc. Mastrovito multiplier for general irreducible poly-
nomials. IEEE Trans. Computers, 49:503–518, 01 2000.
[25] Jingwei Hu, Wei Guo, Jizeng Wei, and Ray C. C. Cheung. Fast and generic inversion
architectures over gf(2m) using modified Itoh–Tsujii algorithms. IEEE Transactions
on Circuits and Systems II: Express Briefs, 62:367–371, 2015.
[26] José Luis Imaña, Román Hermida, and Francisco Tirado. Low complexity bit-parallel
multipliers based on a class of irreducible pentanomials. IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, 14:1388–1393, 2006.
[27] Jérémy Jean, Thomas Peyrin, and Siang Meng Sim. Optimizing implementations of
lightweight building blocks. IACR Trans. Symmetric Cryptol., 2017:130–168, 2017.
[28] Sanu K. Mathew, Farhana Sheikh, Michael Kounavis, Shay Gueron, Amit Agarwal,
Steven K. Hsu, Himanshu Kaul, Mark A. Anders, and Ram K. Krishnamurthy. 53
gbps native GF((24)2) composite-field AES-Encrypt/Decrypt Accelerator for Content-
Protection in 45 nm high-performance microprocessors. Solid-State Circuits, IEEE
Journal of, 46:767 – 776, 05 2011. doi: 10.1109/JSSC.2011.2108131.
[29] Maria Katsaiti and Nicolas Sklavos. Implementation efficiency and alternations, on CAESAR finalists: AEGIS approach. In 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), pages 661–665, 2018.
[30] Matthew Kelly, Alan Kaminsky, Michael Kurdziel, Marcin Lukowiak, and Stanislaw Radziszowski. Customizable sponge-based authenticated encryption using 16-bit S-boxes. In MILCOM 2015 - 2015 IEEE Military Communications Conference, pages 43–48, October 2015. doi: 10.1109/MILCOM.2015.7357416.
[31] Sachin Kumar, Jawad Haj-Yihia, Mustafa Khairallah, and Anupam Chattopadhyay. A comprehensive performance analysis of hardware implementations of CAESAR candidates. IACR Cryptology ePrint Archive, 2017:1261, 2017.
[32] Sandeep S. Kumar, Thomas J. Wollinger, and Christof Paar. Optimum digit serial GF(2^m) multipliers for curve-based cryptography. IEEE Transactions on Computers, 55:1306–1311, 2006.
[33] Shu Lin and D.J. Costello. Error Control Coding: Fundamentals and Applications.
Prentice-Hall computer applications in electrical engineering series. Prentice-Hall, 1983.
ISBN 9780132837965. URL https://books.google.com/books?id=autQAAAAMAAJ.
[34] Edoardo D. Mastrovito. VLSI designs for multiplication over finite fields GF(2^m). In AAECC, 1988.
[35] Nele Mentens, Lejla Batina, Bart Preneel, and Ingrid Verbauwhede. A systematic evaluation of compact hardware implementations for the Rijndael S-Box. In Topics in Cryptology – CT-RSA 2005, volume 3376 of Lecture Notes in Computer Science, pages 323–333, February 2005. doi: 10.1007/978-3-540-30574-3_22.
[36] Sumio Morioka and Akashi Satoh. An optimized S-Box circuit architecture for low
power AES design. In CHES, 2002.
[37] Yasuyuki Nogami, Kenta Nekado, Tetsumi Toyota, Naoto Hongo, and Yoshitaka Morikawa. Mixed bases for efficient inversion in F((2^2)^2)^2 and conversion matrices of SubBytes of AES. In CHES, 2010.
[38] Christof Paar. Efficient VLSI architectures for bit-parallel computation in Galois Fields. Ph.D. thesis, Institute for Experimental Mathematics, University of Essen, 1994. URL https://ci.nii.ac.jp/naid/10026847019/en/.
[39] Christof Paar. A new architecture for a parallel finite field multiplier with low complexity based on composite fields. IEEE Transactions on Computers, 45:856–861, August 1996. doi: 10.1109/12.508323.
[40] René Peralta. Galois Field derivations. URL http://cs-www.cs.yale.edu/homes/peralta/CircuitStuff/calc.pdf.
[41] René Peralta and Joan Boyar. Method of optimizing combinational circuits, April 22
2014. US Patent 8,707,224 B2.
[42] Tsutomu Sasao. AND-EXOR expressions and their optimization. 1993. doi: 10.1007/978-1-4615-3154-8_13.
[43] Akashi Satoh, Sumio Morioka, Kohji Takano, and Seiji Munetoh. A compact Rijndael hardware architecture with S-Box optimization. In ASIACRYPT, 2001.
[44] Leilei Song and Keshab K. Parhi. Low-complexity modified Mastrovito multipliers over finite fields GF(2^m). 1999.
[45] Rei Ueno, Naofumi Homma, Yukihiro Sugawara, Yasuyuki Nogami, and Takafumi Aoki. Highly efficient GF(2^8) inversion circuit based on redundant GF arithmetic and its application to AES design. IACR Cryptology ePrint Archive, 2015.
[46] Johannes Wolkerstorfer, Elisabeth Oswald, and Mario Lamberger. An ASIC implementation of the AES SBoxes. In CT-RSA, 2002.
[47] Christopher A. Wood. Large substitution boxes with efficient combinational implementations. Master's thesis, B. Thomas Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester, New York, 2013. URL https://scholarworks.rit.edu/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=6531&context=theses.
[48] Hongjun Wu. ACORN: A Lightweight Authenticated Cipher, September 2016. URL https://competitions.cr.yp.to/round3/acornv3.pdf.
[49] Hongjun Wu. AEGIS: A Fast Authenticated Encryption Algorithm (v1.1), September 2016. URL https://competitions.cr.yp.to/round3/aegisv11.pdf.
[50] Hongjun Wu. The Authenticated Cipher MORUS (v2), September 2016. URL https://competitions.cr.yp.to/round3/morusv2.pdf.
[51] Bilgiday Yuce. Fault attacks on embedded software: New directions in modeling, design, and mitigation. Ph.D. thesis, Computer Engineering, Virginia Tech, January 2018. URL https://vtechworks.lib.vt.edu/handle/10919/81824.
[52] Qiang Zhang. The implementation of a Reed Solomon Code Encoder/Decoder. Graduate Project, California State University, Northridge, 2014. URL http://scholarworks.csun.edu/bitstream/handle/10211.3/121217/Zhang-Qiang-thesis-2014.pdf;sequence=1.
Appendix A
Additional results for logic synthesis of LGC circuits
[Plot: correlation vs. delay (ns), series "Correlation to Area" and "Correlation to Power".]
Figure A.1: Correlation of SBox logical XOR count to area and power.
[Plot (a): average power (W) vs. N for polymult_mat, polymult_comp, and polymult_lgc at 10 ns and 2.2 ns. Plot (b): average power (W) vs. delay (ns) for the same designs at N = 12, 16, and 22.]
Figure A.2: Power of polynomial multipliers using 32nm technology.
[Plot (a): area (K Gate Eq.) vs. throughput (M encryptions per sec) for aes20_lut, aes20_lgc, aes20_wolkerstorfer, and aes20_canright, plus pipelined variants. Plot (b): average power (W) vs. throughput for the same designs.]
Figure A.3: Area and power of standard AES designs using 32nm technology.
[Plot (a): area (K Gate Eq.) vs. throughput (M encryptions per sec) for the LUT, LGC, and Canright lightweight AES designs, plus pipelined variants. Plot (b): average power (W) vs. throughput for aes4_lut, aes4_lgc, aes4_canright, and pipelined variants.]
Figure A.4: Area and power of lightweight AES designs using 32nm technology.
[Plot (a): post-layout area (K Gate Eq.) vs. delay (ns) for sbox_lut, sbox_wolkerstorfer, sbox_lgc, sbox_canright, sbox_pprm3, and pipelined variants. Plot (b): post-layout power (W) vs. delay for sbox_lut, sbox_lgc, sbox_canright, and pipelined variants.]
Figure A.5: Post-layout area and power of SBox, using 180nm technology.
[Plot (a): post-layout area (K Gate Eq.) vs. delay (ns) for polymult_mat, polymult_nist, and polymult_comp at N = 8, 16, and 22. Plot (b): post-layout power (W) vs. delay for polymult_mat and polymult_lgc at N = 8, 16, and 22.]
Figure A.6: Post-layout area and power of Polynomial Multiplier, using 180nm technology.
[Plot (a): post-layout area (K Gate Eq.) vs. delay (ns) for gfmult_mastrovito, gfmult_paar, and gfmult_lgc_tof over GF(2^8) and GF(2^16). Plot (b): post-layout power (W) vs. delay for the GF(2^16) multipliers.]
Figure A.7: Post-layout area and power of GF Multipliers, using 180nm technology.
[Plot (a): post-layout area (K Gate Eq.) vs. delay (ns) for gfinv_comp, gfinv_rrb, and gfinv_lgc. Plot (b): post-layout area vs. delay for reedsolomon_ref and reedsolomon_lgc.]
Figure A.8: Post-layout area of GF(2^8) Inverter and Reed-Solomon Encoder.
[Plot (a): post-layout area (K Gate Eq.) vs. frequency (MHz) for aes20_lut, aes20_lgc, aes20_canright, and pipelined variants. Plot (b): post-layout power (W) vs. frequency for the same designs.]
Figure A.9: Post-layout area and power of standard AES, using 180nm technology.
[Plot (a): post-layout area (K Gate Eq.) vs. frequency (MHz) for aes4_lut, aes4_lgc, aes4_canright, and pipelined variants. Plot (b): post-layout power (W) vs. frequency for the same designs.]
Figure A.10: Post-layout area and power of lightweight AES, using 180nm technology.
[Plot (a): correlation between logical gate count and area vs. delay (ns), for TSMC180 and SAED32. Plot (b): correlation between logical gate count and power vs. delay, for the same technologies.]
Figure A.11: Correlation of logical gate count to hardware quality metrics of GF multipliers.
[Plot (a): correlation between logical depth and area vs. delay (ns), for TSMC180 and SAED32. Plot (b): correlation between logical depth and power vs. delay, for the same technologies.]
Figure A.12: Correlation of logical depth to hardware quality metrics of GF multipliers.
[Plot (a): correlation between logical gate count and area vs. delay (ns), for TSMC180 and SAED32. Plot (b): correlation between logical gate count and power vs. delay, for the same technologies.]
Figure A.13: Correlation of logical gate count to hardware quality metrics of GF(2^8) inverter.
[Plot (a): correlation between logical depth and area vs. delay (ns), for TSMC180 and SAED32. Plot (b): correlation between logical depth and power vs. delay, for the same technologies.]
Figure A.14: Correlation of logical depth to hardware quality metrics of GF(2^8) inverter.
Appendix B
NISTCHIP Memory Map
Component               AHB Address                             AHB Master / Slave
Leon3 Processor         N/A                                     Master
DMA Controller          N/A                                     Master
Debug UART              N/A                                     Master
Boot ROM                0x00000000 - 0x000FFFFF                 Slave
On-Chip RAM             0x40000000 - 0x400FFFFF                 Slave
Debug Support Unit      0x90000000 - 0x9FFFFFFF                 Slave
SPI Memory Controller   0x10000000 - 0x10FFFFFF (Memory part)   Slave
                        0xFFF00200 - 0xFFF002FF (I/O part)
APB Controller 0        0x80000000 - 0x800FFFFF                 Slave
APB Controller 1        0xA0000000 - 0xA00FFFFF                 Slave
APB Controller 2        0xB0000000 - 0xB00FFFFF                 Slave

Table B.1: Address Map of components on AHB Bus
Component               APB Address               Bus
User UART               0x80000100 - 0x800001FF   APB Bus 0
Debug UART              0x80000300 - 0x800003FF   APB Bus 0
GPIO                    0x80000500 - 0x800005FF   APB Bus 0
GP Timer                0x80000600 - 0x800006FF   APB Bus 0
Standalone EM Sensors   0xA0010000 - 0xA0010FFF   APB Bus 1
AES-LUT                 0xA0020000 - 0xA0020FFF   APB Bus 1
Keymill                 0xA0030000 - 0xA0030FFF   APB Bus 1
SBox with EM Sensors    0xA0040000 - 0xA0040FFF   APB Bus 1
ACORN (32-bit)          0xB0010000 - 0xB0010FFF   APB Bus 2
Reed-Solomon Encoder    0xB0020000 - 0xB0020FFF   APB Bus 2
NISTCOMB                0xB0030000 - 0xB0030FFF   APB Bus 2
AES-LGC                 0xB0040000 - 0xB0040FFF   APB Bus 2
AES-WOLK                0xB0050000 - 0xB0050FFF   APB Bus 2
AEGIS                   0xB0060000 - 0xB0060FFF   APB Bus 2
ACORN (8-bit)           0xB0070000 - 0xB0070FFF   APB Bus 2
MORUS                   0xB0080000 - 0xB0080FFF   APB Bus 2
DMA Controller          0xB0090000 - 0xB0090FFF   APB Bus 2

Table B.2: Address Map of components on APB Busses
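The address maps above translate directly into constants for bare-metal software running on the Leon3. A minimal C sketch, using base addresses taken from Tables B.1 and B.2; the macro names and the word-indexed register helper are illustrative, not taken from the thesis:

```c
#include <stdint.h>

/* AHB slave base addresses (Table B.1) */
#define BOOT_ROM_BASE    0x00000000u
#define ONCHIP_RAM_BASE  0x40000000u
#define DSU_BASE         0x90000000u

/* APB peripheral base addresses (Table B.2); names are illustrative */
#define AES_LUT_BASE     0xA0020000u
#define NISTCOMB_BASE    0xB0030000u
#define AEGIS_BASE       0xB0060000u
#define MORUS_BASE       0xB0080000u

/* Assumption: each coprocessor exposes word-aligned 32-bit registers,
 * so register n sits at base + 4*n within its 4 KB APB window. */
static inline uint32_t apb_reg_addr(uint32_t base, uint32_t reg_index)
{
    return base + 4u * reg_index;
}
```

Under this (assumed) word-addressed scheme, register 2 of the NISTCOMB coprocessor would sit at 0xB0030008.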
Address   Register   R/W
0         Control    R/W   (b0: Soft Reset; b2,b1: Redundancy; b3: GF Mode; b7-b4: Function)
1         Input0     R/W
2         Input1     R/W
3         Input2     R/W
4         Input3     R/W
5         Output0    R
6         Output1    R
7         Output2    R
8         Output3    R

Table B.3: NISTCOMB software interface
GF Mode   Function   Purpose
0         0          OUTP = SBOX-LGC(INP)
0         1          OUTP = INVSBOX-LGC(INP)
0         2          OUTP = SBOX-LUT(INP)
0         3          OUTP = INVSBOX-LUT(INP)
0         4          OUTP = SBOX-WOLK(INP)
0         5          OUTP = INVSBOX-WOLK(INP)
0         6          OUTP = K3LRSBOX(INP)
1         0          OUTP = GF256MULT(INP)
1         1          OUTP = GF256INV(INP)
1         2          OUTP = GF65536MULT(INP)
1         3          OUTP = GF65536INV(INP)
1         4          OUTP = MULT64(INP)

Table B.4: Allowed values for GF Mode and Function in NISTCOMB coprocessor
Redundancy   Purpose
0            (0, 0, 0, byte) or (0, word16)
1            (0, 0, byte, byte) or (word16, word16)
2            (0, byte, byte, byte) or (word16, word16)
3            (byte, byte, byte, byte) or (word16, word16)

Table B.5: Redundancy settings in NISTCOMB coprocessor
Appendix C
Additional results for SoC integration of AEAD Ciphers
Component on Chip   % of Total Power
On-Chip Memory      27.92%
Processor           7.82%
AEGIS Coprocessor   12.6%
DMA Controller      5.02%
Cache controller    0.8%
AHB                 0.3%
APB                 0.3%

Table C.1: AEGIS-128L Tests - Power split
Block                               Clock Network Power
Top-level                           55.86%
Processor                           65.83%
AEGIS Coprocessor (with FIFOs)      77.06%
AEGIS Coprocessor (without FIFOs)   66.3%
DMA Controller                      21.27%
Cache controller                    32.02%

Table C.2: AEGIS-128L Tests - Clock power
[Plot (a): top-level energy-per-bit (nJ/bit) for five configurations: 2 FIFOs without DMA, output FIFO without DMA, no FIFO, 2 FIFOs with DMA, and output FIFO with DMA. Plot (b): block-level energy-per-bit for the processor, the AEGIS coprocessor (FIFO and core), and the DMA controller under the same configurations.]
Figure C.1: Energy efficiency of AEGIS coprocessor alternatives, for small messages.
[Plot (a): top-level energy-per-bit (nJ/bit) for five configurations: 2 FIFOs without DMA, output FIFO without DMA, no FIFO, 2 FIFOs with DMA, and output FIFO with DMA. Plot (b): block-level energy-per-bit for the processor, the MORUS coprocessor (FIFO and core), and the DMA controller under the same configurations.]
Figure C.2: Energy efficiency of MORUS coprocessor alternatives, for small messages.
[Plot (a): top-level energy-per-bit (nJ/bit) for two configurations: two FIFOs without DMA and two FIFOs with DMA. Plot (b): block-level energy-per-bit for the processor, the ACORN-8 coprocessor (FIFO and core), and the DMA controller under the same configurations.]
Figure C.3: Energy efficiency of ACORN-8 coprocessor alternatives, for small messages.