Tools and Techniques for Evaluating Reliability Trade-offs for Nano-Architectures

Debayan Bhaduri

Thesis submitted to the Faculty of

Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Master of Science

in

Computer Engineering

Sandeep K. Shukla, Chair

Dong S. Ha, Member

Binoy Ravindran, Member

April 30, 2004

Blacksburg, Virginia

Keywords: Nanotechnology, Gibbs distribution, TMR, CTMR, PRISM, Gaussian,

reliability, entropy, probabilistic model checking, interconnect noise, modeling,

defect-tolerant architecture, granularity

Copyright © 2004, Debayan Bhaduri


Tools and Techniques for Evaluating Reliability Trade-offs for

Nano-Architectures

Debayan Bhaduri

Abstract

It is expected that nano-scale devices and interconnections will introduce unprecedented levels of defects in the substrates, and architectural designs need to accommodate the uncertainty inherent at such scales. This consideration motivates the search for new architectural paradigms based on redundancy-based defect-tolerant designs. However, redundancy is not always a solution to the reliability problem, and often too much or too little redundancy may cause degradation in reliability. The key challenge is in determining the granularity at which defect tolerance is designed, and the level of redundancy needed to achieve a specific level of reliability. Analytical probabilistic models for evaluating such reliability-redundancy trade-offs are error prone and cumbersome, and do not scale well for complex networks of gates. In this thesis we develop different tools and techniques that can evaluate the reliability measures of combinational circuits, and that can be used to analyze reliability-redundancy trade-offs for different defect-tolerant architectural configurations. In particular, we have developed two tools, one of which is based on probabilistic model checking and is named NANOPRISM, and another, MATLAB-based tool called NANOLAB. We also illustrate the effectiveness of our reliability analysis tools by pointing out certain anomalies which are counter-intuitive but can be easily discovered by these tools, thereby providing better insight into defect-tolerant design decisions. We believe that these tools will help further research and pedagogical interests in this area, expedite the reliability analysis process, and enhance the accuracy of establishing reliability-redundancy trade-off points.


Acknowledgements

I would like to thank Dr. Sandeep K. Shukla for his guidance, motivation and support throughout this work. I also thank Dr. Dong S. Ha and Dr. Binoy Ravindran for serving as committee members for this thesis.

We acknowledge the support of NSF grant CCR-0340740, which provided the funding for the work reported in this thesis.

I extend my love and affection to Didi; without her I would not have come this far.

I would like to thank my childhood buddy Tamal Bhattacharya for being there for me through thick and thin. Growing up together with him has been a wonderful experience.

From the FERMAT Lab, I would like to thank the following:

Suhaib Syed: For "tolerating me at home and at work, and being my best friend at Tech".

Hiren D. Patel: For "motivating me with his hard work and dedication".

Deepak Abraham MathaiKutty: For "keeping me amused by his quaint dress codes".

David Berner: For "being supportive about weekend parties".

Gaurav Singh: For "helping me be in touch with Hindi".

I would also like to take this opportunity to thank Habeeb Abdullah, Animesh Patcha, Madhup Chandra and Shekhar Sharad Agrawal for being great pals at Tech. In spite of our busy schedules, we always found time to hang out together.



Dedication

I dedicate this thesis to my parents

Mrs. Bharati Bhaduri

and

Mr. Dwipendra Nath Bhaduri



Contents

1 Introduction  1
  1.1 Motivation  2
  1.2 Analytical Approaches for Reliability Analysis  3
  1.3 Main Contributions of this Thesis  11
    1.3.1 Brief Introduction to Our Tools  14
    1.3.2 Features of these Tools  16
  1.4 Organization of this Thesis  18

2 Background  19
  2.1 Defect Tolerant Architectures  19
  2.2 Majority Multiplexing  24
  2.3 Defect-Tolerance through Reconfiguration  25
  2.4 Markov Random Field based Model of Computation  28
    2.4.1 Modeling Noise at Inputs and Interconnects  31
  2.5 Granularity and Measure of Redundancy  33
  2.6 Probabilistic Model Checking and PRISM  33
  2.7 Emerging Technologies and Techniques  35
    2.7.1 Brief Introduction to Emerging Technologies  35
    2.7.2 Techniques to Build Nano-Scale Devices  39

3 NANOLAB: A MATLAB Based Tool  43
  3.1 Reliability, Entropy and Model of Computation  43
  3.2 Automation Methodology  44
    3.2.1 Detailed Example  45
  3.3 Shortcomings of NANOLAB  48
  3.4 Reliability Analysis of Boolean Networks using NANOLAB  50
    3.4.1 Reliability and Entropy Measures of a NAND Gate  50
    3.4.2 Reliability and Entropy Measures for the Logic Block  52
    3.4.3 Output Energy Distributions for the Logic Block  54
    3.4.4 Reliability Analysis of Different Circuits for Smaller KT Values  57
    3.4.5 Reliability Analysis in the Presence of Noisy Inputs and Interconnects  57

4 NANOPRISM: A Tool Based on Probabilistic Model Checking  61
  4.1 Modeling Single Gate TMR, CTMR  62
    4.1.1 Modeling Block Level TMR, CTMR  64
  4.2 Reliability Analysis of Logic Circuits with NANOPRISM  65
    4.2.1 Reliability Measures of Single Gate CTMR Configurations  66
    4.2.2 Reliability Measures for the Logic Block CTMR Configurations  68
    4.2.3 Reliability vs. Granularity Trade-offs for CTMR Configurations  69

5 Reliability Evaluation of Multiplexing Based Majority Circuits  71
  5.1 Main Results  72
  5.2 Model Construction of Majority Multiplexing System  73
  5.3 Reliability Evaluation of Multiplexing based Majority Systems  76
    5.3.1 Error Distribution at the Outputs of Different Configurations  78
    5.3.2 Small Gate Failure Probabilities  79
    5.3.3 Large Gate Failure Probabilities  81
  5.4 Comparison of Reliability Measures for NAND and Majority Multiplexing Systems  83

6 Conclusion and Future Work  86

Bibliography  89

Vita  96


List of Figures

1.1 A NAND multiplexing unit from [62]  5
1.2 The design flow of an electronic or information-processing system  12
2.1 Different TMR configurations  20
2.2 Generic Cascaded Triple Modular Redundancy with triple voters: multi-layer voting  21
2.3 Cascaded Triple Modular Redundancy with triple voters: smaller granularity, from [41]  22
2.4 A majority multiplexing unit  25
2.5 A NAND gate  27
2.6 The neighborhood of the input of a NAND gate depicted as a network node  28
3.1 Script for 1st order CTMR with discrete input distribution  47
3.2 A Boolean network having the logic function z = x2 ∧ (x0 ∨ x1′)  49
3.3 Entropies for the output of a single NAND gate and CTMR configurations of the NAND gate at different KT values  51
3.4 Entropy for different orders of CTMR configurations for the Boolean network given in Figure 3.2 at different KT values  53
3.5 Energy distributions at the output of the Boolean network for different orders of CTMR configuration at different KT values; inputs are uniformly distributed  55
3.6 Entropy for different logic networks at KT values ∈ [0.01, 0.1]  56
3.7 Entropy at the outputs of different NAND gate configurations in the presence of noisy inputs and interconnects  58
3.8 Entropy and energy distribution at the outputs of different CTMR configurations of the logic block in the presence of noisy inputs and interconnects  59
4.1 PRISM description of the TMR configuration of a single NAND gate  63
4.2 A Boolean network having the logic function z = x2 ∧ (x0 ∨ x1′)  64
4.3 Error distribution at the outputs for the different orders of CTMR configurations of single gates  66
4.4 Error distribution at the outputs of the different CTMR orders for the Boolean network given in Figure 4.2  68
4.5 Granularity analysis for the logic block shown earlier  69
5.1 A majority multiplexing unit  73
5.2 PRISM description of the von Neumann majority multiplexing system  75
5.3 Error distributions at the output with 1 restorative stage under different gate failure rates and I/O bundle sizes  77
5.4 Error distributions at the output with 3 restorative stages under different gate failure rates and I/O bundle sizes  78
5.5 Probability that at most 10% of the outputs are incorrect for different I/O bundle sizes (small probability of failure)  79
5.6 Probability that at most 10% of the outputs of the overall system are incorrect for different I/O bundle sizes (large probability of failure)  80
5.7 Expected percentage of incorrect outputs for an I/O bundle size of 10 (large probability of failure)  82
5.8 Probability of at most 10% of the outputs being incorrect for the multiplexing systems when the I/O bundle size equals 20 and gate failure probabilities are small  83
5.9 Probability of at most 10% of the outputs being incorrect for the multiplexing systems when the I/O bundle size equals 20 and gate failure probabilities are large  84


List of Tables

2.1 Logic compatibility function for a NAND gate with all possibilities  30
2.2 Estimated parameters for emerging techniques, from [18]  36
3.1 Probability of the output z of a logic gate being at different energy levels for values of KT ∈ {0.1, 0.25, 0.5, 1.0}  46


Chapter 1

Introduction

New technologies for building nanometer-scale devices are expected to provide the means for constructing much denser logic and thinner wires. These technologies provide a mechanism for constructing a useful Avogadro computer [34] that makes efficient use of an extremely large number (the Avogadro number is on the order of 10^23) of small devices computing in parallel. But the economical fabrication of complete circuits at the nanometer level remains challenging because of the difficulty of connecting nanodevices to one another. Experts also predict that these devices will have high defect densities due to their minuscule dimensions, quantum physical effects, reduced noise margins, system energy levels approaching the thermal limits of computation, manufacturing defects, aging and many other factors. In the near future we will not be able to manufacture large, defect-free integrated circuits. Thus, designing reliable system architectures that can work around these problems at run-time becomes important.

General computer architectures to date have been based on principles that differentiate between memory and processing and rely on communication over buses [10]. Nanoelectronics may change these basic principles. Processing will be cheap and copious; interconnection expensive and prevalent. This will tend to move computer architectures in the direction of locally connected, redundant and reconfigurable hardware meshes that merge processing and memory. At the same time, due to fundamental limitations at the nanoscale, micro-architects will be presented with new design challenges. For instance, the methodology of using global interconnections and assuming error-free computing may no longer be possible. Due to the small feature size, there will be a large number of nanodevices at a designer's disposal. This will lead to redundancy-based defect-tolerant architectures, and thus some conventional techniques such as Triple Modular Redundancy (TMR), Cascaded Triple Modular Redundancy (CTMR) and multistage iterations of these may be implemented to obtain high reliability. However, too much redundancy does not necessarily lead to higher reliability, since defects affect the redundant parts as well. As a result, in-depth analysis is required to find the redundancy levels that yield specific reliability measures for different architectural configurations.

1.1 Motivation

[62, 8, 9] show that redundancy-based defect-tolerant architectures (resource redundancy) for different Boolean networks have unique reliability-redundancy trade-off points. But there are certain key challenges involved in determining such trade-off points [61, 6]. These are as follows:

• The arbitrary augmentation of unreliable devices could result in a decrease in the reliability of an architecture.

• For each specific architecture and a given failure distribution of devices, once an optimal redundancy level is reached, any increase or decrease in the number of devices may lead to less reliable computation.

• Redundancy may be applied at different levels of granularity, such as the gate level, logic block level, functional unit level, etc.

• Determining the correct granularity level for a specific Boolean network is crucial in such trade-off analysis.

Theoretical models have been proposed in the past [47, 48, 60], and results from information theory [65, 36] have been used to analyze redundancy-reliability trade-offs for different defect-tolerant architectures. However, since such analytical methodologies involve complex combinatorial arguments and conditional probability computations, they often lead to erroneous results. [62] proposes a probabilistic model checking based automation approach that found such a flaw in the analytical approach of [47]. This motivates the need for automating the reliability analysis of systems. We have developed automation methodologies to evaluate reliability measures of different redundant architectural configurations for arbitrary Boolean networks. Before we explain our tools and techniques for evaluating such trade-offs, let us discuss some interesting analytical reliability computation models and hybrid defect-tolerant architectures.

1.2 Analytical Approaches for Reliability Analysis

In this section we survey some theoretical models that have been used to analyze the reliability-redundancy trade-off points of some classical defect-tolerant architectures. Several techniques have been suggested in the past to mitigate the effects of both faults and defects in designs [85, 50, 48]. One technique discussed in the fault-tolerance literature is the multiplexing-based defect-tolerance technique initiated by von Neumann [85], which is based on massive duplication of imperfect devices and randomized error-prone interconnects.

In 1952, von Neumann introduced a redundancy technique called NAND multiplexing [85] for constructing reliable computation from unreliable devices. He applied this technique to majority gates as well, illustrating the generality of this multiplexing-based defect-tolerant architectural configuration. Von Neumann also showed that such a configuration can perform computations with high reliability if the failure probabilities of the gates are sufficiently small and failures are independent. Pippenger [65] showed that von Neumann's construction works only when the probability of failure per gate is strictly less than 1/2, and that computation in the presence of noise (which can be seen as the presence of transient faults) requires more layers of redundancy. Dobrushin et al. [31] also showed that logarithmic redundancy is necessary for some Boolean function computations, and is sufficient for all Boolean functions. In [60], NAND multiplexing was compared to other fault-tolerance techniques, and theoretical calculations showed that the redundancy level must be quite high to obtain acceptable levels of reliability.

The multiplexing-based defect-tolerance scheme replaces a single logic device by a multiplexed unit with N copies of every input and output of the device. The N devices in the multiplexing unit process the copies of the inputs in parallel to give N outputs. Each element of the output set will be identical and equal to the original output of the logic device if all the copies of the inputs and devices are reliable. However, if the inputs and devices are in error, the outputs will not be identical. To tackle such an error-prone scenario, some critical level ∆ ∈ (0, 0.5) is defined. The output of the multiplexing unit is considered stimulated (taking logical value true) if at least (1−∆)·N of the outputs are stimulated, and non-stimulated (taking logical value false) if no more than ∆·N outputs are stimulated. If the number of stimulated outputs is in the interval (∆·N, (1−∆)·N), then the output is undecided, and hence a malfunction will occur.

Figure 1.1: A NAND multiplexing unit from [62]

The basic design of the von Neumann multiplexing technique consists of two stages: the executive stage, which performs the basic function of the processing unit to be replaced, and the restorative stage, which reduces the degradation in the executive stage caused by erroneous inputs and faulty devices. As shown in Figure 1.1, NAND multiplexing is an architectural configuration wherein a single NAND gate is the processing unit of a multiplexing-based defect-tolerant system. The unit U performs a random permutation of the input signals, i.e., each signal from the first input bundle is randomly paired with a signal from the second input bundle to form the input pair of one of the duplicated NAND gates.
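The executive stage described above can be sketched with a small Monte Carlo simulation. This is an illustrative sketch, not code from the thesis: the function names (`nand_stage`, `classify`) and all parameter values (N, ε, ∆) are assumptions chosen for demonstration.

```python
import random

def nand_stage(in1, in2, eps, rng):
    """One executive stage: randomly pair lines from the two input
    bundles and feed each pair to a NAND gate that, with probability
    eps, inverts its otherwise correct output (von Neumann fault)."""
    paired = list(in2)
    rng.shuffle(paired)          # unit U: random permutation of one bundle
    out = []
    for a, b in zip(in1, paired):
        v = not (a and b)        # correct NAND output
        if rng.random() < eps:   # faulty gate inverts the output
            v = not v
        out.append(v)
    return out

def classify(bundle, delta):
    """Decide the bundle value against the critical level delta."""
    frac = sum(bundle) / len(bundle)
    if frac >= 1 - delta:
        return True              # stimulated
    if frac <= delta:
        return False             # non-stimulated
    return None                  # undecided: malfunction

rng = random.Random(1)
N, eps, delta = 1000, 0.01, 0.1
in1 = [True] * N                 # both bundles fully stimulated,
in2 = [True] * N                 # so the correct NAND output is False
out = classify(nand_stage(in1, in2, eps, rng), delta)
```

With ε = 0.01 and ∆ = 0.1, roughly 1% of the output lines are (wrongly) stimulated, well below the ∆·N threshold, so the bundle classifies as non-stimulated.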

Han and Jonker [47] extend this technique to a rather low degree of redundancy, and elucidate the stochastic Markov [63] nature of the system. They also investigate a system architecture based on NAND multiplexing for SETs, where transient errors are introduced due to random background charges. A stochastic analysis of the chains of stages (Figure 1.1) is performed to evaluate optimal redundancy factors for systems, and bounds on the failure probability per gate that can be tolerated for specific reliability and redundancy levels.

For the NAND multiplexing system shown in Figure 1.1, let X be the set of lines in the first input bundle that are stimulated. Consequently, (N − X) lines are not stimulated (logic low). Y and Z are the corresponding sets for the second input bundle and the output bundle. [47] also assumes a constant probability of gate failure ε, and that faulty gates invert their outputs (the von Neumann fault model). If the sets X, Y and Z have (x, y, z) elements respectively, then these are the relative excitation levels of the two input bundles and of the output bundle, respectively. The stochastic distribution of z has to be determined in terms of the given input distributions (x and y).

In [85], von Neumann concluded that, for extremely large N, the stochastic variable z is approximately normally distributed. He also determined an upper bound of 0.0107 for the gate failure probability that can be tolerated. In other words, according to von Neumann, if the failure probability per gate is greater than this threshold, then the probability of the NAND multiplexing system failing will be larger than a fixed, positive lower bound, no matter how large a bundle size is used. Recently, it was shown that, if each NAND gate fails independently, the tolerable threshold probability per gate is 0.08856 [36].

For a single NAND gate in the multiplexing scheme (Figure 1.1), assume xN and yN input lines are stimulated in each input bundle, respectively. If the error probabilities for the two input bundles of the multiplexing configuration are independent, the probability of each gate's output being logic high is z = 1 − xy (assuming error-free NAND operation). If the gate failure probability is ε, the probability of the output being stimulated is:

z = (1 − xy) + ε(2xy − 1)    (1.1)

Equation 1.1 is valid only for the von Neumann fault model (an erroneous gate inverts the correct output). The NAND multiplexing unit constitutes a Bernoulli sequence because the gates are assumed to function independently. Therefore, the probability distribution of the stimulated outputs can be expressed as a binomial distribution. The probability of exactly k outputs being stimulated is:

P(k) = C(N, k) z^k (1 − z)^(N−k)    (1.2)
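Equations 1.1 and 1.2 can be checked numerically in a few lines. This is an illustrative sanity check, not code from the thesis; the names `z_out` and `p_exact_k` are chosen here for demonstration.

```python
from math import comb

def z_out(x, y, eps):
    """Probability that a single gate's output is stimulated
    (Equation 1.1, von Neumann fault model)."""
    return (1 - x * y) + eps * (2 * x * y - 1)

def p_exact_k(N, k, x, y, eps):
    """Probability that exactly k of the N outputs are stimulated
    (Equation 1.2, gates failing independently)."""
    z = z_out(x, y, eps)
    return comb(N, k) * z**k * (1 - z)**(N - k)

# With error-free gates and both inputs fully stimulated, the NAND
# output is never stimulated; with failure probability eps, exactly
# the eps fraction of gates invert that output.
N = 20
total = sum(p_exact_k(N, k, 0.9, 0.9, 0.01) for k in range(N + 1))
```

Summing P(k) over k = 0..N gives 1, as a binomial distribution must.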

When N is sufficiently large and z is extremely small, the probability distribution of exactly k outputs being stimulated out of the N output lines of the NAND multiplexing stage can be approximated by a Poisson distribution. If both inputs of the NAND gates have a high probability of being logic high, the stimulated outputs are then considered erroneous. The reliability of the system can then be computed as the probability of the number of stimulated outputs being below a threshold, P(k ≤ x). Since the number of stimulated outputs is a stochastic variable that is binomially distributed, the central limit (De Moivre-Laplace) theorem applies when N is sufficiently large and 0 < z < 1. In this case, Equation 1.2 can be rewritten as:

P(k ≤ x) = ∫_{−∞}^{x} 1/(√(2π)·√(Nz(1 − z))) · e^{−(1/2)((t − Nz)/√(Nz(1 − z)))^2} dt    (1.3)
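The quality of the De Moivre-Laplace approximation in Equation 1.3 can be checked against the exact binomial tail. This is an illustrative sketch with assumed parameter values (N = 1000, z = 0.1, threshold x = 110), not an experiment from the thesis.

```python
from math import comb, erf, sqrt

def binom_cdf(N, x, z):
    """Exact P(k <= x) for the binomial distribution of Equation 1.2."""
    return sum(comb(N, k) * z**k * (1 - z)**(N - k) for k in range(x + 1))

def normal_cdf_approx(N, x, z):
    """Gaussian approximation of Equation 1.3: mean Nz,
    standard deviation sqrt(Nz(1-z))."""
    mu = N * z
    sigma = sqrt(N * z * (1 - z))
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

N, z, x = 1000, 0.1, 110
exact = binom_cdf(N, x, z)
approx = normal_cdf_approx(N, x, z)
```

For this bundle size the two values agree to within a few percent, consistent with the central limit argument above.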

Thus, [47] shows analytically that, for the executive stage of NAND multiplexing, the probability distribution of the number of stimulated outputs is theoretically binomial for small values of N. The authors also show that this probability can be approximated by a Gaussian distribution with mean Nz and standard deviation √(Nz(1 − z)) when N is sufficiently large and 0 < z < 1. The authors then go on to demonstrate how additional restorative stages improve fault tolerance. The discussion above illustrates that the number of stimulated outputs of each NAND multiplexing stage is a stochastic variable with state space A = {0, 1, 2, ..., N − 1, N}. This stochastic variable is denoted ξ_n, where n is the index of the multiplexing stage. Thus, the evolution of ξ_n in the multiplexing configuration is a stochastic process, and with fixed bundle size and gate failure probability, the distribution of ξ_n for every stage n depends on the number of stimulated inputs of the nth multiplexing unit (the outputs of the (n − 1)th unit). This can be expressed as follows:

P(ξ_n ∈ A | ξ_1 = k_1, ξ_2 = k_2, ..., ξ_{n−1} = k_{n−1}) = P(ξ_n ∈ A | ξ_{n−1} = k_{n−1})    (1.4)

Equation 1.4 is the condition for a stochastic process to be a Markov process. As indicated in Figure 1.1, k_1, k_2, ..., k_{n−1} are the numbers of stimulated outputs of the stages indicated by the respective indices. The evolution of ξ_n in the NAND multiplexing system is therefore a Markov process, or a discrete Markov chain. The transition probability of a stochastic Markov chain indicates the conditional probability of moving from one specified state to another. As illustrated in [47], the transition probability matrix Ψ for each ξ_n is identical and independent of the multiplexing stage n. Thus, it can be inferred that ξ_n evolves as a homogeneous Markov chain. Therefore, an initial probability distribution and a transition probability matrix are sufficient to obtain all output distributions. For a NAND multiplexing system with n individual stages, the output distribution of the configuration is:

P_n = P_0 Ψ^n    (1.5)
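The evolution P_n = P_0 Ψ^n can be sketched with a toy homogeneous chain. The 3×3 matrix below is a stand-in chosen purely for illustration (the real Ψ would be built from the per-stage binomial output distribution of Equation 1.2); it only serves to show the stabilisation of the distribution as the number of stages grows.

```python
def step(p, Psi):
    """One multiplexing stage: row vector times transition matrix,
    i.e. P_n = P_{n-1} * Psi (Equation 1.5 unrolled one stage)."""
    n = len(Psi)
    return [sum(p[i] * Psi[i][j] for i in range(n)) for j in range(n)]

# Toy 3-state homogeneous chain standing in for Psi.
Psi = [
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]
p = [1.0, 0.0, 0.0]       # P_0: start in state 0 with certainty
for _ in range(200):      # large n: the distribution stabilises
    p = step(p, Psi)
```

After many stages p converges to the stationary distribution (0.25, 0.5, 0.25) of this toy chain, independent of P_0, mirroring the observation that Ψ^n approaches a constant matrix.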

It is also pointed out in [47] that when the number of multiplexing stages n is large, Ψ^n approaches a constant matrix π, and each row of π is identical. This indicates that as n becomes extremely large, not only will the transition probabilities in a NAND multiplexing system become stable, but the output distribution will also become stable and independent of the number of multiplexing stages. Experiments indicate that this defect-tolerance technique requires a rather large number of redundant components to give acceptable reliability levels. This makes NAND multiplexing inefficient for protection against permanent faults, which are normally compensated for by reconfiguration techniques. However, this architectural configuration may be a solution for the ultra-large-scale integration of highly unreliable nanometer-scale devices affected by dominant transient errors.

[48] reviews the NAND multiplexing and reconfiguration fault tolerant techniques,

and presents a defect and fault tolerant architecture in which multiplexing (low degree

of redundancy) is combined with a massively reconfigurable architecture. Reconfig-

urable architectures are discussed in details in Chapter 2. The system performance of

this architecture is evaluated by studying its reliability, and this shows that the sug-

gested architecture can tolerate a device error rate of up to 10−2. The architectural

configuration is seen to be efficiently robust against both permanent and transient

faults for an ultra-large integration of highly unreliable nanodevices. For the NAND

multiplexing configuration, Equation 1.2 can be used to compute the reliability of the

system if the faulty devices are independent and uniformly distributed. This scenario

may be reasonable when the dominant faults are transient in nature. However, this

binomial distribution model is not sufficient to describe the manufacturing defects and

permanent faults. The device components are not statistically independent but rather


correlated since defects tend to cluster on a chip [51]. Thus, Equation 1.2 is not appro-

priate for computing reliability measures of the multiplexing system. [48] indicates that

such manufacturing defects can be modeled with a continuous probability distribution

function f(r), where r is the component reliability. Thus, the new reliability evaluation

formula is:

R(k) = ∫₀¹ C(N, k) · r^k · (1 − r)^(N−k) · f(r) dr (1.6)
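As an illustrative numerical sketch (not code from the thesis), Equation 1.6 can be approximated by quadrature, with a Beta distribution standing in for Stappers' defect model f(r); the parameters N = 16 and Beta(8, 2) below are arbitrary choices for demonstration:

```python
import math

def beta_pdf(r, a, b):
    """Beta(a, b) density; an assumed stand-in for the defect model f(r)."""
    coeff = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coeff * r ** (a - 1) * (1 - r) ** (b - 1)

def reliability(k, n_dev, a, b, steps=4000):
    """Midpoint-rule estimate of R(k) = integral of C(N,k) r^k (1-r)^(N-k) f(r) dr."""
    total = 0.0
    h = 1.0 / steps
    for i in range(steps):
        r = (i + 0.5) * h
        total += (math.comb(n_dev, k) * r ** k * (1 - r) ** (n_dev - k)
                  * beta_pdf(r, a, b) * h)
    return total

N = 16
probs = [reliability(k, N, 8.0, 2.0) for k in range(N + 1)]
print(sum(probs))   # ≈ 1.0 — the R(k) values form a distribution over k
```

Since the binomial terms sum to one for every r, summing R(k) over k recovers the integral of f(r), which is a quick sanity check on the quadrature.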

The success of the approach depends on finding appropriate parameters for Equa-

tion 1.6, and [48] follows Stappers' beta distribution model [46]. Han and Jonker also

discuss an analytical methodology to compute the reliability of reconfigurable archi-

tectures. The basic logic circuit blocks in the processor arrays on a reconfigurable chip

are referred to as processing elements (PEs), and these are sometimes associated with

local memories. In very large chips, reliability can be enhanced by adding spare PEs to

the architectural configuration. Instead of trying to achieve complete fault tolerance,

most techniques aim at optimizing probability of survival, defined as the percentage

of defects that can be eliminated by the reconfiguration approach. Reconfiguration

approaches are categorized as local or global [37]. Global approaches usually involve

far more complex reconfiguration algorithms than local solutions. [48] assumes that all

PEs (also called modules) are identical, so that any spare module can be substituted for

a defective one, provided sufficient interconnection paths exist in the cluster. If in an

array there are r spares out of n identical modules, then at least n− r must be error-free

for proper operation. The reliability of the array is given by Rn = Σ_{m=n−r}^{n} Rmn, where

Rmn is the probability of exactly m out of n modules being fault free. Assuming the

modules have the same reliability measure R0, and are statistically independent, the

probability distribution of the number of defect free modules m can be modeled as a

binomial distribution:


Rmn = C(n, m) · R0^m · (1 − R0)^(n−m) (1.7)
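The spare-based array reliability can be sketched directly from Equation 1.7 and the tail sum above (our own illustration, with arbitrary values for n, the number of spares, and R0, and assuming statistically independent modules):

```python
import math

def module_dist(n, m, r0):
    """Equation 1.7: probability that exactly m of n i.i.d. modules work."""
    return math.comb(n, m) * r0 ** m * (1 - r0) ** (n - m)

def array_reliability(n, spares, r0):
    """R_n = sum of R_mn for m = n - spares .. n (at least n - r modules work)."""
    return sum(module_dist(n, m, r0) for m in range(n - spares, n + 1))

# Illustrative: 10 modules, 2 of them spares, each working with probability 0.95.
print(array_reliability(10, 2, 0.95))
```

With every module a spare the array trivially survives any defect pattern, so the sum collapses to 1, which is a useful boundary check.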

Again, these defective modules in an array are not uniformly distributed but rather cor-

related, and Stappers' model is used to improve the reliability calculation of correlated

modules [46]. The authors compute the reliability measures of a hierarchical approach

that uses NAND multiplexing at the lowest level and then uses redundancy (spares)

for reconfiguration at three additional implementation levels. The authors show that

for an individual gate failure probability of 10⁻² and with 10¹² devices on the chip,

this architectural approach achieves a chip-level reliability of 99.8% with a redundancy

factor of less than 10.

Nikolic et al. [60] also use theoretical models to compare the relative reliability mea-

sures of R-fold modular redundancy (details in Chapter 2), NAND multiplexing and

reconfiguration for different device failure rates. It is shown that for a chip with 10¹²

devices, and with a reliability level that specifies that the chip must work with 90%

probability, RMR is the least effective followed by NAND multiplexing and reconfigu-

ration providing the best computational reliability. It is also indicated in [60] that for

individual device failure rates of 0.01 to 0.1, the redundancy factors required to provide

high reliability of computation (90% overall working probability) are very large (10³

to 10⁵) for such defect-tolerant architectural configurations.

1.3 Main Contributions of this Thesis

Figure 1.2 shows the design flow of a digital system. It indicates the different stages

of the system design starting from the specification (higher level of abstraction) to the

netlist generation (detailed implementation/lower abstraction level). There are other


[Diagram: the front end flows from Specifications through Architectural Design, Micro-architectural Design, and RTL Design; the back end runs from Logic Synthesis to the Netlist. Probabilistic tools and techniques evaluate reliability and determine redundancy factors and granularity levels, looping back until the specific reliability level is achieved.]

Figure 1.2: The design flow of an electronic or information-processing system

back-end specific processes performed after logic synthesis from the RTL design, such

as logic optimization, physical design, layout etc, that finally lead to the fabricated

system. These have been omitted in Figure 1.2 for simplicity. The front-end design

consists of translation of the system specifications to an architectural design, which is

refined to a micro-architectural design. This is a detailed architectural description of

the system. Such a design methodology for digital and information processing systems

has to guarantee acceptable reliability levels. Therefore, quick and easy techniques are

required to measure the reliability of such micro-architectural designs. If the desired

reliability levels cannot be achieved with the architectural configuration, the design

has to be made more robust. This may involve augmenting more redundancy and at


different granularity levels (such as gate level, logic block level).

As discussed earlier, specific Boolean networks have different reliability-redundancy

trade-off points, and incorporating an arbitrary number of redundant devices in the archi-

tecture may even degrade the reliability of computation [62]. Also, for complex Boolean

networks, most of the analytical methodologies discussed earlier do not scale well, and

are error-prone and cumbersome. Such limitations necessitate the automation of such

methodologies. Figure 1.2 shows the probabilistic tools and techniques reported in this

thesis that we believe will expedite and ease the evaluation of reliability measures and

redundancy/granularity levels for specific micro-architectural descriptions. Thus, the

main contributions of this thesis are as follows:

• Development of automation methodologies for the analysis of reliability-redundancy

trade-offs of different redundancy (resource) based defect-tolerant architectural

configurations.

• Development of tools that perform quick and accurate computation of best and

worst case reliability characteristics by setting upper and lower bounds on device

and interconnect failure probabilities. The inputs to these tools are:

* Any arbitrary combinational circuit.

* Any hardware redundancy based defect-tolerant technique.

* Redundancy factor and granularity level.

* Probability distribution of signal energy levels at the primary inputs of logic

networks.

* Error distributions of the component gates.

* Signal noise at inputs and interconnects.


Given such inputs, these tools compute reliability measures of logic networks in

terms of different metrics. The outputs of these tools may include:

* Probability distribution of signal energy levels at the primary outputs of the

logic networks.

* Entropy at the primary outputs and interconnects of circuits.

* Probability of at most 10% errors at the outputs.

* Expected percentage of errors at the outputs.

1.3.1 Brief Introduction to Our Tools

In this subsection, we describe our reliability evaluation tools, and how these will

expedite the design cycle of an electronic or information processing system. These tools

are targeted for systems that are built out of nano-scale devices in particular. We

describe a MATLAB-based tool code-named NANOLAB, so called since it is based on MATLAB

[5, 7, 13], and a probabilistic model checking based tool named NANOPRISM [6,

8]. One difference between our automation techniques and the standard analytical

approaches (discussed in the previous section) is as follows: we evaluate the reliability

of specific cases as opposed to considering the general framework, and hence are not

necessarily restricted by analytical bounds.

1. Conventional digital signals are bimodal, meaning that logic low and high are

defined as discrete voltage/current levels. Due to device miniaturization at the

nanoscale, the notion of being a binary zero or one will change. A probabilistic

design methodology based on Markov Random Fields (MRF) [57] proposed in [2]

introduces a new information encoding and computation scheme where signals


are interpreted to be logic low or high over a continuous energy distribution. The

inputs and outputs of the gates in a combinational block are realized as nodes

of a Markov network and the logic function for each gate is interpreted by a

Gibbs distribution [71] based transformation. This computational scheme also

exploits the fact that maximizing the probability of energy levels at each node

of the network is equivalent to minimizing the entropy of the Gibbs energy distribution,

which depends on neighboring nodes. The probability of a logic variable

can be calculated by marginalizing over the set of possible states of the neigh-

boring logic variables and propagated to the next node in a Boolean network by

using Belief Propagation [52]. NANOLAB automates this probabilistic design

methodology by computing energy distribution and entropy at the primary/in-

termediate outputs and interconnects of Boolean networks, and by implementing

Belief Propagation. The logic compatibility functions (similar to truth table) [2]

for the different component gates of the Boolean network and the energy distribu-

tion at the primary inputs are specified to the tool. We have also augmented the

capability to model uniform and Gaussian noise at the primary inputs and inter-

connects of combinational blocks and analyze such systems in terms of entropy

at the outputs. Such modeling features in NANOLAB will expedite and enhance

the analysis of reliability measures for different defect tolerant architectures.
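The following toy sketch (our own illustration, not NANOLAB code) shows the kind of Gibbs-distribution marginalization described above, for a single NAND gate with independent inputs; the temperature parameter and the convention that valid states receive weight exp(1/T) are assumptions made for the example:

```python
import math

def nand_compat(x0, x1, x2):
    """Logic compatibility function: 1 for valid NAND states, else 0."""
    return 1.0 if x2 == (0 if (x0 and x1) else 1) else 0.0

def output_marginal(p_in1, p_in2, temperature):
    """Marginal probability of the NAND output x2 being 0 or 1 under a Gibbs
    distribution: Boltzmann weight exp(compat / T), marginalized over the
    possible states of the neighboring input variables."""
    weights = {0: 0.0, 1: 0.0}
    for x0 in (0, 1):
        for x1 in (0, 1):
            prior = (p_in1 if x0 else 1 - p_in1) * (p_in2 if x1 else 1 - p_in2)
            for x2 in (0, 1):
                weights[x2] += prior * math.exp(nand_compat(x0, x1, x2) / temperature)
    z = weights[0] + weights[1]
    return weights[0] / z, weights[1] / z

# At low temperature, invalid states are suppressed: with uniform inputs,
# p(x2 = 1) approaches 3/4 (NAND outputs 1 for 3 of the 4 input combinations).
p0, p1 = output_marginal(0.5, 0.5, 0.1)
print(p1)
```

Raising the temperature blurs the distribution toward 0.5, mimicking the loss of signal reliability that the entropy metric is meant to capture.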

2. NANOPRISM is a tool that applies probabilistic model checking techniques

to calculate the likelihood of occurrence

of transient defects in the devices and interconnections of nano architectures.

NANOPRISM is based on the conventional Boolean model of computation and

can automatically evaluate reliability at different redundancy and granularity lev-

els, and most importantly show the trade-offs and saturation points. By saturation

point we mean that the redundancy vs. reliability curve at a given granularity reaches a plateau,


meaning that there cannot be any more improvements in reliability by increasing

redundancy or granularity levels. It consists of libraries built on PRISM [54, 73]

(probabilistic model checker) for different redundancy based defect-tolerant ar-

chitectural configurations. These libraries also support modeling of redundancy

at different levels of granularity, such as gate level, logic block level, logic function

level, unit level etc. Arbitrary Boolean networks are given as inputs to these li-

braries and reliability measures of these circuits are evaluated. This methodology

also allows accurate measurement of variations in reliability under slight changes

in the behavior of the system’s components, for example the change in reliability

as the probability of gate failure varies.

1.3.2 Features of these Tools

Here we list some major advantages of our tools:

• Our tools allow fast analysis of reliability-redundancy trade-offs for alternative

defect tolerant architectures.

• The idea of a new model of computation in [2] for uncertainty based compu-

tation is extended in the direction of reliability evaluation and automated by

NANOLAB. Previous works [2, 21] only show the viability of an MRF-based computation

scheme and how it can be mapped to nanodevices such as Y CNTs [82]. These works

do not extend the model of computation in the direction of reliability-redundancy

trade-off analysis.

• NANOLAB can also compute reliability of systems in the presence of signal

noise at interconnects and inputs. Noise is modeled as uniform or Gaussian

distribution or combinations of both. This models practical situations that the circuits


are subjected to during normal operation. Analysis of reliability measures for

redundant architectural configurations of these logic circuits when exposed to

such real world noise distributions makes our methodology more effective.

• NANOPRISM is a tool that automates the evaluation of redundancy vs. reliabil-

ity vs. granularity trade-offs. In [62] a probabilistic model checking based CAD

framework is used to evaluate reliability measures of a specific redundancy tech-

nique namely NAND multiplexing, and only redundancy vs. reliability trade-offs

are computed. NANOPRISM uses such a framework too, but we have developed

a library of generic redundancy based defect tolerant architectures where different

Boolean networks can be plugged in and analyzed. NANOPRISM analyzes opti-

mal redundancy and granularity levels for specific logic networks and expedites

the process of finding the correct trade-offs.

• We are able to illustrate some anomalous counter intuitive trade-off points that

would not be possible to observe without significant and extensive theoretical

analysis, and the automation makes it easier to analyze these critical parameters.

• From our literature search, we found that the results on reliability measures for

different defect tolerant architectures were mostly analytical [77, 47, 60, 38] and

none considered granularity and entropy (discussed in detail in Chapter 2) as

parameters. However, for complex networks of gates, such analytical results may

be error-prone. As a result, we believe that scalable automation methodologies to

quickly evaluate these figures of merit are crucial for practical use by engineers.


1.4 Organization of this Thesis

Chapter 2 introduces certain basic concepts and terminologies that are used throughout

this thesis. We discuss defect-tolerant computing, a probabilistic model of computa-

tion from [2] and how to use such a computational scheme to model noise. We also

introduce granularity and measure of redundancy, and concepts in probabilistic model

checking. Chapter 3 describes NANOLAB and our approach to analyze reliability of

systems with the model of computation proposed by Bahar et al. A detailed example

along with code snippets is used to elucidate this methodology. We also report relia-

bility measures of different logic networks computed by this tool. Chapter 4 focuses

on explaining how we use the NANOPRISM framework to model defect-tolerant ar-

chitectures such as TMR, CTMR and their iterations for single gates and logic blocks

respectively. Reliability-redundancy trade-off points for specific combinational cir-

cuits and interesting anomalies are also illustrated in this chapter. We also describe

our von Neumann multiplexing based defect-tolerance library that is a part of our

NANOPRISM tool in Chapter 5, and compare different multiplexing based architec-

tural configurations. Finally, Chapter 6 concludes the thesis and summarizes plans to

extend these automation methodologies in the future.


Chapter 2

Background

2.1 Defect Tolerant Architectures

Formally, a defect-tolerant architecture is one which uses techniques to mitigate the effects

of defects in the devices that make up the architecture, and guarantees a given level of

reliability. There are a number of different canonical defect-tolerant computing schemes

most of which are based on redundancy. The techniques which we look at are highly

generic and concerned with resource redundancy and can be easily extended to nano

architectures. Some of these canonical techniques are Triple Modular Redundancy (TMR),

Cascaded Triple Modular Redundancy (CTMR) and multi-stage iterations of these [60]. We

have also discussed multiplexing based defect-tolerance in Chapter 1.

Triple Modular Redundancy as shown in Figure 2.1(a) entails three similar units work-

ing in parallel, and comparison of their outputs with a majority voting logic [60]. The

units could be gates, logic blocks, logic functions or functional units. The TMR provides

a functionality similar to one of the three parallel units but provides a better probability


Figure 2.1: Different TMR configurations: (a) Triple Modular Redundancy with a single voter; (b) Triple Modular Redundancy with three voters

of working, and is fairly easy to apply to digital logic. The tradeoff is that instead of

n devices, 3n devices and a majority gate are needed in this configuration, which results

in increased circuit area and power, and decreased circuit speed. This architectural

configuration can tolerate multiple failures in a single module. Faults or defects in

two or more of the three modules will cause the system to fail. Also, the majority

voting logic is a single failure point, and three voters can be used instead of just one

to improve the reliability of the system further (shown in Figure 2.1(b)). The R-fold

modular redundancy is a generalization of the TMR configuration where R units work

in parallel instead of 3, with R ∈ {3, 5, 7, 9, ...}. Note that R is always an odd number so as

to prevent tie votes.
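The working probability of an RMR configuration with a perfect voter can be sketched as a binomial tail sum (our own illustration; the voter itself is assumed fault-free here, unlike the more detailed models later in this chapter):

```python
import math

def rmr_reliability(r, p_unit):
    """Probability that a majority of the R parallel units work, assuming
    independent units and a perfect majority voter. R must be odd."""
    assert r % 2 == 1
    return sum(math.comb(r, i) * p_unit ** i * (1 - p_unit) ** (r - i)
               for i in range((r + 1) // 2, r + 1))

# TMR (R = 3) reduces to the familiar polynomial 3p^2 - 2p^3.
print(rmr_reliability(3, 0.9))   # ≈ 0.972
```

Note that redundancy only helps when each unit works with probability above 0.5; below that threshold, majority voting actually makes matters worse, one of the counter-intuitive trade-off points the thesis's tools expose.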

Cascaded Triple Modular Redundancy is similar to TMR, wherein the units working

in parallel are TMR units combined with a majority gate. This configuration forms a

CTMR of the first order and therefore TMR can be considered to be 0th order CTMR.

Higher orders of CTMR are obtained by multi-stage iterations of these, but this does not


Figure 2.2: Generic Cascaded Triple Modular Redundancy with triple voters: multi-layer voting

necessarily mean that the reliability of a system goes up, since increasing the redundancy level

also introduces more unreliable devices in the architectural configuration. Figure 2.2

shows a 1st order CTMR configuration where the parallel processing units in each of

the three TMR units are NAND gates. This defect-tolerant technique is also called a

Cascaded TMR configuration with triple voters (multi-layer voting scheme).

There are many variations of the generic CTMR configuration. Figure 2.3 illustrates

a cascading scheme [81, 41] where a logic circuit is divided into smaller units, and

TMR is applied to these units. The majority voters are also replicated thrice and placed

between the TMR configurations. This architectural configuration can be considered

to be a multi-level TMR configuration with triple majority gate logic. The conventional

CTMR configuration replicates the Boolean network as a whole (Figure 2.2 illustrates


Figure 2.3: Cascaded Triple Modular Redundancy with triple voters: smaller granularity, from [41]

a CTMR for a single NAND gate), and the circuit-level granularity of the redundant

units increases the probability of their failing. In contrast, the multi-level CTMR with

triple voters allows a circuit to be apportioned into smaller functional units or modules,

and can effectively allow a circuit to withstand more errors across the redundant TMR

configurations. Specifically, if the logic network is divided into N functional units, the

number of permanent or transient defects that can be tolerated is more than ⌊N/2⌋.

As discussed previously, a simple TMR configuration can provide the same system

performance if only one of the redundant units fails. However, by using the cascading

technique shown in Figure 2.3, the reliability of the system shown does not degrade

even in the presence of errors in the units X1, Y2, and Z3 (more than one erroneous

unit).

Let n be the order of the CTMR configuration. The majority logic requires four gates

i.e., if a, b and c are inputs, the logic operation is (a∧b)∨(b∧c)∨(c∧a). If Fn−1 is the total

number of devices for an (n−1)th order CTMR, then we can say:


Fn = 3Fn−1 + 5 (2.1)

For a single gate, the total number of redundant gates in an nth order CTMR configuration

(where n ∈ {0, 1, 2, ...}) is:

Fn = (21/2) · 3^n − 5/2 (2.2)
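A quick check (our own sketch) that the recurrence of Equation 2.1 matches the closed form of Equation 2.2, taking the base case F0 = 8 implied by Equation 2.2 at n = 0:

```python
def ctmr_devices_recursive(n):
    """Device count for an nth-order CTMR of a single gate via the recurrence
    Fn = 3*Fn-1 + 5, assuming the base case F0 = 8 implied by Equation 2.2."""
    f = 8
    for _ in range(n):
        f = 3 * f + 5
    return f

def ctmr_devices_closed(n):
    """Closed form of Equation 2.2: Fn = (21/2) * 3^n - 5/2 (always an integer)."""
    return (21 * 3 ** n - 5) // 2

print([ctmr_devices_recursive(n) for n in range(4)])   # 8, 29, 92, 281
```

The tripling at each order makes the device count grow exponentially in the CTMR order, which is why higher-order cascading quickly stops paying for itself.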

If the total number of gates in a logic circuit is k, then the total number of devices in

any nth order configuration will be:

Fn = (3(2k + 5)/(3k − 1)) · (3k)^n + (9k² + 6k − 20)/(3k − 1) (2.3)

Also, [38] derives the failure rate of a chip with N devices for a RMR defect-tolerant

architectural configuration (explained previously) and this can be expressed as:

P_fail^chip = (N/(R·Nc + mB)) · [C · (Nc·pf)^((R+1)/2) + mB·pf] (2.4)

In Equation 2.4, C = C(R, (R−1)/2) is the binomial factor, pf is the probability of an individual

device failing, Nc is the total number of devices in one of the R units working in parallel,

N = RNc is the total number of devices and imperfect majority gates have B outputs

and mB devices. The probability that an i-th order RMR configuration (cascaded

redundancy) works is as follows:

Pw^(i) = (1 − pfail)^mB · [(Pw^(i−1))³ + 3·(Pw^(i−1))²·(1 − Pw^(i−1))] (2.5)


where the majority gate contains mB imperfect devices and Pw = 1 − Pfail. The authors

also show that CTMR is not advantageous when the redundant units have a small number

of devices and the majority logic is also made from the same devices as the units.
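Equation 2.5 can be iterated numerically; the sketch below (our own illustration, with arbitrary parameter values) shows how majority voting amplifies a working probability above 0.5 when the voters are perfect, and how imperfect voter devices eat into that gain:

```python
def cascaded_working_prob(p_w0, p_fail, m_b, orders):
    """Iterate Equation 2.5: P_w(i) = (1 - p_fail)^mB *
    [P_w(i-1)^3 + 3 * P_w(i-1)^2 * (1 - P_w(i-1))]."""
    p = p_w0
    for _ in range(orders):
        p = (1 - p_fail) ** m_b * (p ** 3 + 3 * p ** 2 * (1 - p))
    return p

# With perfect voters (p_fail = 0), three cascade orders push an initial
# working probability of 0.9 very close to 1.
print(cascaded_working_prob(0.9, 0.0, 4, 3))
```

With even a 5% device failure probability in a four-device voter, a single cascade order can drop the working probability below the uncascaded 0.9, consistent with the authors' caution about small units built from the same unreliable devices as the voters.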

2.2 Majority Multiplexing

Creating computing systems built from devices other than silicon-based transistors is

a research area of major interest. A number of alternatives to silicon VLSI have been

proposed, including techniques based on molecular electronics, quantum mechanics,

and biological processes. One such nanostructure proposed by Lent et al. [56, 55]

is quantum-dot cellular automata (QCA) that employs arrays of coupled quantum

dots to implement Boolean networks. Due to the minuscule size of the quantum dots,

extremely high packing densities are possible in QCA-based architectures. Other

advantages include simplified interconnections and an extremely low power-delay product.

The fundamental QCA logic device is a three-input majority logic gate consisting of

an arrangement of five standard quantum cells: a central logic cell, three inputs and

an output cell. A majority gate can be programmed to act as an OR gate or an AND

gate, by using one of the three inputs as the control input that determines the AND/OR

functionality [83]. This indicates that majority gates are going to play an important role

in future architectural configurations.

This motivates us to consider multiplexing based defect-tolerance (as discussed in

Chapter 1) when the logic device is a single majority gate. We, therefore, replace the

inputs and output of the gate with N copies and duplicate the majority gate N times in

the executive stage, as shown in Figure 2.4. The unit U represents a random permutation

of the input signals, that is, for each input set to the N copies of the majority gate, three


input signals are randomly chosen from the three separate input bundles respectively.

Figure 2.4 also shows the restorative stage, which is built using the same technique

as the executive stage; the outputs of the executive stage are duplicated and used as

inputs to the restorative stage.
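The executive stage of Figure 2.4 can be sketched as a small Monte Carlo simulation (our own illustration; the bundle size, input error fraction, and output-flip fault model are assumptions, not the thesis's parameters):

```python
import random

def majority(a, b, c):
    """Three-input majority gate: (a AND b) OR (b AND c) OR (c AND a)."""
    return (a & b) | (b & c) | (c & a)

def multiplex_stage(bundle_x, bundle_y, bundle_z, p_gate_fail, rng):
    """One multiplexing stage of N majority gates: the unit U randomly
    permutes each input bundle, and each gate's output is flipped with
    probability p_gate_fail to model a transient fault."""
    n = len(bundle_x)
    xs = rng.sample(bundle_x, n)
    ys = rng.sample(bundle_y, n)
    zs = rng.sample(bundle_z, n)
    out = []
    for a, b, c in zip(xs, ys, zs):
        v = majority(a, b, c)
        if rng.random() < p_gate_fail:
            v = 1 - v
        out.append(v)
    return out

rng = random.Random(42)
N = 1000
# Illustrative bundles: lines carrying logic 1 with a 5% input error fraction.
x = [1] * 950 + [0] * 50
out = multiplex_stage(x, list(x), list(x), 0.01, rng)
print(sum(out) / N)   # fraction of stimulated (logic 1) outputs
```

Because the majority gate suppresses minority input errors, the output bundle carries a noticeably smaller error fraction than the inputs, which is exactly the restorative effect that cascading stages exploits.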

Figure 2.4: A majority multiplexing unit (U is a random permutation unit and M a majority gate; an executive stage is followed by a restorative stage)

2.3 Defect-Tolerance through Reconfiguration

A computer architecture that can be configured or programmed after fabrication to im-

plement desired computations is said to be reconfigurable. Reconfigurable fabrics such

as Field-Programmable Gate Arrays (FPGAs) are composed of programmable logic el-

ements and interconnects, and these can be programmed, or configured, to implement

any circuit. Such reconfigurable architectures may mitigate manufacturing defects

that will be rampant in nano-scale substrates. The key idea behind defect tolerance in

FPGAs is that faulty components are detected during testing and excluded during re-

configuration. It is expected [59] that reconfigurable fabrics made from next generation

manufacturing techniques (CAEN-based technologies where molecular junctions can

be made which hold their own state) will go through a post-fabrication testing phase


during which these fabrics will be configured for self-diagnosis. Testing for error-prone

devices will not incur either an area or a delay penalty because the test circuits placed on

the fabric during this self-diagnosis phase will utilize resources that will be available

later for normal fabric operation (unlike BIST structures). The defect map obtained

from this test phase will contain locations of all the defects, and this map can be used

by place-and-route tools to layout circuits on the fabric that avoid the defective devices.

Thus, the built-in defect-tolerance in such a reconfigurable digital system will ease the

requirements on the manufacturing process.

Teramac [50] is an extremely defect-tolerant reconfigurable machine built in HP lab-

oratories. The basic components in Teramac are programmable switches (memory)

and redundant interconnections. The high communication bandwidth is critical for

both parallel computation and defect tolerance. With about 10% of the logic cells and

3% of the total resources defective, Teramac could still operate 100 times faster than a

high-end single processor workstation for some of its configurations. In contrast to the

static defect discovery process used in Teramac (test vectors), [59] proposes scalable

testing methods that generate defect maps for reconfigurable fabrics in two phases,

and dynamic place-and-route techniques for configuration/programming of the fabric.

The reconfigurable architecture particularly targeted by the methodology in [59] is the

nanoFabric [39, 40]. The nanoFabric is an example of a possible implementation of

a CAEN based reconfigurable device. The post-fabrication testing suggested in [59]

comprises the probability assignment and defect location phases. The probability

assignment phase assigns each component a probability of being defective, and dis-

cards the components which have a high probability. Thus, this phase identifies and

eliminates a large fraction of the defective components. The remaining components

are now likely to have a small enough defect rate and can be tested in the defect lo-

cation phase using simple test-circuit configurations. The authors also point out the


non-adaptiveness of the test-circuit generation. This implies that the results of previous

tests are not used to generate new circuits (details in [58]). The Cell Matrix architec-

ture [34, 33] also supports dynamic defect identification and elimination. The Cell

Matrix is a fine-grained reconfigurable fabric composed of simple, homogeneous cells

and nearest-neighbor interconnect. The cells are programmable, gate-level processors

that can be configured by lookup tables (LUTs) [53]. This allows the cells to be directly

programmed to perform meaningful and extensive logic computations and data pro-

cessing [32]. The homogeneity of this architecture makes it inherently fault-tolerant.

Like Teramac, the Cell Matrix can handle large manufacturing defect rates (permanent

faults), and provide high reliability of computation. However, the Cell Matrix archi-

tecture also provides defect-tolerance against transient errors (caused by adverse

operating conditions) due to the inherent self-analyzing, self-modifying, and dynamic

local processing capabilities of the architectural configuration. Also, the embryonics ar-

chitecture [30] is a potential paradigm for future nanoscale computation systems. The

objective of developing defect-tolerant and ultra-large integrated circuits capable of

self-repair and self-replication makes this architecture viable for future nanoelectronic

system design.

Figure 2.5: A NAND gate (inputs x0 and x1, output x2)


2.4 Markov Random Field based Model of Computation

The advent of non-silicon manufacturing technologies and device miniaturization will

introduce uncertainty in logic computation. Newer models of computation [87] have

to be designed to incorporate the strong probabilistic notions inherent in nanodevices.

Such a probabilistic design methodology has been proposed in [2] that adapts to errors

by maximizing the probability of being in valid computational states. This model of

computation introduces a new information encoding and computation scheme, where

signals are interpreted to be logic low or high over a continuous energy distribution.

The basis for this architectural approach is the Markov random network which is based

on Markov random fields (MRF) [57]. The Markov random field is defined as a finite

set of random variables, Λ = {λ1, λ2, ..., λk}. Each variable λi has a neighborhood Ni,

consisting of the variables in {Λ − λi}.

Figure 2.6: The neighborhood of the input of a NAND gate depicted as a network node

The energy distribution of a given variable depends only on a (typically small) neigh-

borhood of other variables that is called a clique. These variables may represent states

of nodes in a Boolean network and we might be able to consider effects of noise and


other uncertainty factors on a node by considering the conditional probabilities of

energy values with respect to its clique. By the Hammersley-Clifford theorem [4],

the conditional probability in Equation 2.6 takes the form of a Gibbs distribution. Z is the normalizing

constant and for a given node i, C is the set of cliques affecting the node i. Uc is the clique

energy function and depends only on the neighborhood of the node whose energy state

probability is being calculated. KT is the thermal energy (K is the Boltzmann constant

and T is the temperature in Kelvin) and is expressed in normalized units relative to the

logic energy (clique energy). For example, KT = 0.1 can be interpreted as unit logic

energy being ten times the thermal energy. The logic energy at a particular node of a

Markov network depends only on its neighborhood as discussed earlier. Also, the logic

margins of the nodes in a Boolean network shrink at higher values of KT and become

significant at lower values. The logic margin in this case is the difference between

the probabilities of occurrence of a logic low and a logic high at a node; a larger margin

implies a higher probability of correct computation. This formulation also allows a

consistent analysis of entropy values.

P(λi | {Λ − λi}) = (1/Z) exp( −(1/KT) Σ_{c∈C} Uc(λ) )     (2.6)
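As a numerical illustration of Equation 2.6, the sketch below (plain Python, not part of any tool described in this thesis; the energy values and KT settings are assumptions for illustration) shows how the Gibbs distribution concentrates probability on the lower-energy configuration as KT decreases:

```python
import math

def gibbs_probability(energies, kT):
    """Equation 2.6 over a finite set of configurations:
    p(s) = exp(-U(s)/kT) / Z, with Z the normalizing constant."""
    weights = [math.exp(-u / kT) for u in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

# Valid states carry clique energy -1, invalid states 0 (Section 2.4).
# At kT = 0.1 the valid state dominates; at kT = 1.0 the logic margin shrinks.
for kT in (0.1, 1.0):
    p_valid, p_invalid = gibbs_probability([-1.0, 0.0], kT)
    print(kT, p_valid, p_invalid)
```

This mirrors the observation above that logic margins are significant at low KT and shrink as thermal energy approaches the logic energy.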

Let us take a specific example to walk through the methodology in [2]. For a two input

NAND gate as shown in Figure 2.5, there are three nodes in the assumed logic network:

the inputs x0 and x1, and the output x2. Figure 2.6 represents x0 as a network node,

and shows its neighborhood. The energy state of this node depends on its neighboring

nodes. The edges in Figure 2.6 depict the conditional probabilities with respect to the

other input x1 and the output x2 (nodes in the same clique). The operation of the gate

is designated by the logic compatibility function f(x0, x1, x2), shown as a truth table in

Table 2.1; f = 1 when x2 = (x0 ∧ x1)′ (valid logic operations). Such a function takes all


i   x0  x1  x2  f
0   0   0   1   1
1   0   0   0   0
2   0   1   1   1
3   0   1   0   0
4   1   0   1   1
5   1   0   0   0
6   1   1   0   1
7   1   1   1   0

Table 2.1: Logic compatibility function for a NAND gate with all possibilities

valid and invalid logic operation scenarios into account so as to represent an energy

based transformation similar to the NAND logic. The axioms of the Boolean ring [15]

are used to relate it to a Gibbs energy form. Also, the valid input/output states should

have lower clique energies than invalid states to maximize the probability of a valid

energy state at the nodes. Thus the clique energy (logic energy) is the summation over

the minterms of the valid states and is calculated as:

U(x0, x1, x2) = −Σ_i f_i(x0, x1, x2)    (2.7)

where the sum runs over the rows i of Table 2.1 for which f_i = 1. This clique energy

definition ensures that the energy of an invalid logic state is greater than that of a valid

state. As shown in [21], for a valid state this summation evaluates to '-1' and for any

invalid state the summation value is '0'. The clique energy for the NAND gate is:


U(x0, x1, x2) = −x2 + 2x0x1x2 − x0x1 (2.8)
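The clique energy of Equation 2.8 can be checked mechanically against Table 2.1. A short Python sketch (an illustrative aid, not part of any tool described here) confirms that every valid NAND state has energy −1 and every invalid state has energy 0:

```python
from itertools import product

def clique_energy(x0, x1, x2):
    # Equation 2.8 for the NAND gate
    return -x2 + 2 * x0 * x1 * x2 - x0 * x1

for x0, x1, x2 in product((0, 1), repeat=3):
    valid = x2 == 1 - (x0 & x1)          # x2 = (x0 AND x1)'
    assert clique_energy(x0, x1, x2) == (-1 if valid else 0)
print("valid states at -1, invalid states at 0")
```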

The probability of the different energy configurations of x2 is:

p(x2) = (1/Z) Σ_{x0∈{0,1}} p(x0) Σ_{x1∈{0,1}} p(x1) exp( −(1/KT) U(x0, x1, x2) )    (2.9)

According to Equation 2.9, the probability of different energy states at x2 is calculated

as a function of x2 after marginalizing over the distributions of x0 and x1. The probability

of the energy state configurations at the outputs of any logic gate can be calculated by the

methodology above. Given the input probability distributions and logic compatibility

functions, the Belief Propagation algorithm [52] also makes it possible to calculate entropy

and energy distributions at different nodes of the network. Thus, [2] gives a probabilistic

model of computation which we exploit to compute reliability-redundancy trade-offs

for different nanoscale architectures.
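The marginalization of Equation 2.9 can be sketched directly in Python (an illustrative example, not the NANOPRISM or NANOLAB implementation; the input probabilities and KT values are assumptions):

```python
import math
from itertools import product

def nand_output_distribution(p0, p1, kT):
    """Equation 2.9: marginalize the Gibbs distribution over x0 and x1
    (independent inputs, p0 = P(x0=1), p1 = P(x1=1)) to obtain the
    probability of the NAND output x2 being logic low or high."""
    def U(x0, x1, x2):                       # Equation 2.8
        return -x2 + 2 * x0 * x1 * x2 - x0 * x1
    raw = []
    for x2 in (0, 1):
        s = 0.0
        for x0, x1 in product((0, 1), repeat=2):
            w = (p0 if x0 else 1 - p0) * (p1 if x1 else 1 - p1)
            s += w * math.exp(-U(x0, x1, x2) / kT)
        raw.append(s)
    Z = sum(raw)
    return [r / Z for r in raw]

# Both inputs held at logic high: the correct NAND output is logic low,
# and the low state dominates more strongly as kT decreases.
for kT in (1.0, 0.5, 0.1):
    print(kT, nand_output_distribution(1.0, 1.0, kT))
```

Propagating the returned distribution to the next gate's inputs corresponds to the Belief Propagation step described above.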

2.4.1 Modeling Noise at Inputs and Interconnects

The probabilistic non-discrete computing scheme described above can be extended to

incorporate the impact caused by continuous noise at the inputs and the interconnects

of combinational circuits [2]. Let us take the same NAND gate example to illustrate

this. For the inputs being logic low or high, the Gaussian noise distribution below zero

will be filtered out because negative values are invalid [21]. To make the Gaussian

noise symmetrically distributed about the inputs of the gate, the coordinate system has

to be shifted such that x = 0 → x′ = −1, and x = 1 → x′ = 1. Thus, x0, x1 and x2 can


be rewritten as:

x′0 = 2(x0 − 1/2),  x′1 = 2(x1 − 1/2),  x′2 = 2(x2 − 1/2)

The clique energy in Equation 2.8 for the NAND gate is modified as follows:

U(x′0, x′1, x′2) = −1/2 − (1/4)x′2 + (1/4)x′0x′2 + (1/4)x′1x′2 + (1/4)x′0x′1x′2

We have modeled noise as a Gaussian process with mean µ and variance σ². The

probability distribution of x′2 being in different energy configurations ∈ {−1.0, −0.9,

−0.8, ..., 0.9, 1.0} is:

p(x′2) = (1/Z) ∫_{−1}^{1} Σ_{x′1∈{−1,1}} e^{−U/KT} · ( e^{−(x′0−µ)²/(2σ²)} / (√(2π) σ) ) p(x′1) dx′0    (2.10)

The energy distribution at x′2, if a uniform distribution is used to model signal noise, is

given by:

p(x′2) = (1/Z) ∫_{−1}^{1} Σ_{x′1∈{−1,1}} e^{−U/KT} p(x′1) dx′0    (2.11)

Equation 2.10 or 2.11 is used to compute the probability of x′2 at different energy states,

marginalizing over the distributions of x′0 (which ranges between −1 and 1) and x′1 [13]. The

energy distribution at the output of any logic gate can be calculated in the presence of


noisy inputs and interconnects by the methodology above.
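Equation 2.10 can also be evaluated numerically. The sketch below (illustrative Python, not NANOLAB code; the function name, discretization, trapezoidal integration, and noise parameters are all assumptions made for this example) marginalizes over a Gaussian-noisy x′0 and a discrete x′1:

```python
import math

def U_shifted(a, b, c):
    # Clique energy in the shifted (-1/+1) coordinates; valid NAND
    # states evaluate to -1 and invalid states to 0.
    return -0.5 - c / 4 + a * c / 4 + b * c / 4 + a * b * c / 4

def noisy_nand_output(mu, sigma, kT, p1_high=1.0, steps=200):
    """Numerically evaluate Equation 2.10 by the trapezoidal rule:
    x0' carries Gaussian noise N(mu, sigma^2) restricted to [-1, 1],
    x1' is discrete in {-1, 1} with P(x1' = 1) = p1_high."""
    levels = [round(-1.0 + 0.1 * k, 1) for k in range(21)]
    h = 2.0 / steps
    raw = []
    for c in levels:
        acc = 0.0
        for k in range(steps + 1):
            a = -1.0 + k * h
            gauss = math.exp(-(a - mu) ** 2 / (2 * sigma ** 2)) \
                    / (math.sqrt(2 * math.pi) * sigma)
            inner = sum((p1_high if b == 1 else 1 - p1_high)
                        * math.exp(-U_shifted(a, b, c) / kT) for b in (-1, 1))
            acc += (0.5 if k in (0, steps) else 1.0) * gauss * inner * h
        raw.append(acc)
    Z = sum(raw)
    return levels, [r / Z for r in raw]

# Both inputs near logic high (x' = +1): the NAND output mass should
# concentrate near x2' = -1 (logic low).
levels, dist = noisy_nand_output(mu=1.0, sigma=0.3, kT=0.1)
print(levels[dist.index(max(dist))])
```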

2.5 Granularity and Measure of Redundancy

Redundancy based defect-tolerance can be implemented for Boolean networks at dif-

ferent levels of granularity. For a specific logic circuit, all the gates could be replicated

as a particular CTMR configuration and the overall architecture can be some other

CTMR configuration. For example, each gate in the circuit could be a kth order CTMR

configuration and the overall logic block could be an nth order configuration, where k and n may differ.

Redundancy can thus be applied at different levels of granularity, such as gate level,

logic block level, logic function level etc. In Chapter 4, we discuss a few experiments

that show us that reliability is often dependent on the granularity level at which re-

dundancy is injected in a logic network. NANOPRISM indicates the correct level of

granularity beyond which the reliability measures for a system do not improve.
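To see why granularity matters, consider a toy reliability calculation (an illustrative sketch only; the per-gate reliability and gate count are assumed values, and the majority voter is taken to be perfect):

```python
def tmr(r):
    # Probability that a TMR triple with a perfect majority voter is
    # correct, given each replica is correct with probability r.
    return 3 * r**2 - 2 * r**3

r_gate = 0.9   # assumed reliability of a single gate
n_gates = 4    # assumed number of gates in a small logic block

# Gate-level granularity: triplicate every gate, then chain the block.
block_gate_level = tmr(r_gate) ** n_gates
# Block-level granularity: chain the block, then triplicate it once.
block_block_level = tmr(r_gate ** n_gates)

print(block_gate_level, block_block_level)
```

With these assumed numbers, redundancy at the finer (gate) granularity yields a more reliable block than a single block-level TMR, which is the kind of trade-off the Chapter 4 experiments quantify.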

2.6 Probabilistic Model Checking and PRISM

Probability is a measure of uncertainty or randomness. It is widely used

to design and analyze software and hardware systems characterized by incomplete

information, and unknown and unpredictable outcomes. Probabilistic model checking

is a range of techniques for calculating the likelihood of the occurrence of certain

events during the execution of unreliable or unpredictable systems. The system is

usually specified as a state transition system, with probability values attached to the

transitions. A probabilistic model checker applies algorithmic techniques to analyze

the state space and calculate performance measures.


NANOPRISM consists of libraries built on PRISM [54, 73], a probabilistic model checker

developed at the University of Birmingham. PRISM supports analysis of three types of

probabilistic models: discrete-time Markov chains (DTMCs), continuous-time Markov

chains (CTMCs) and Markov decision processes (MDPs). We use DTMCs to develop

libraries for generic defect-tolerant architectural configurations. This model of compu-

tation specifies the probability of transitions between states, such that the probabilities

of performing a transition from any given state sum to 1. DTMCs are suitable

for conventional digital circuits and the fault models considered. The fault models are

manufacturing defects in the gates and transient errors that can occur at any point of

time in a Boolean network.

The PRISM description language is a high level language based on guarded commands.

The basic components of the language are modules and variables. A system is constructed

as a number of modules which can interact with each other by means of the standard

process algebraic operations [74]. A module contains a number of variables which

represent its state. Its behavior is given by a set of guarded commands of the form:

[] <guard> → <command>;

The guard is a predicate over the variables of the system and the command describes

a transition which the module can make if the guard is true (using primed variables to

denote the next values of variables). If a transition is probabilistic, then the command

is specified as:

<prob> : <command> + · · · + <prob> : <command>

The properties of a DTMC model can be verified by PCTL model checking techniques.

PCTL [49] is a probabilistic temporal logic, an extension of CTL. It can be used to express


properties such as "termination will eventually occur with probability at least 0.98".
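As a concrete (hypothetical) illustration of this notation, a transient gate fault might be sketched in the PRISM language as a DTMC module; the module name, probabilities, and variable names here are illustrative assumptions, not taken from the NANOPRISM libraries:

// Hypothetical sketch: a gate whose output is flipped by a
// transient fault with probability 0.01, modeled as a DTMC.
dtmc

module noisy_gate
    out  : [0..1] init 0;
    done : [0..1] init 0;

    [] done = 0 -> 0.99 : (out' = 1) & (done' = 1)    // correct output
                 + 0.01 : (out' = 0) & (done' = 1);   // flipped output
endmodule

A PCTL property such as P>=0.98 [ F done=1 & out=1 ] then asks whether the correct output is eventually produced with probability at least 0.98.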

Our work is based on modeling defect tolerant architectures as state machines with

probabilistic assumptions about the occurrence of defects in the devices and intercon-

nections, and then using Markov analysis and probabilistic model checking techniques

to evaluate reliability measures.

2.7 Emerging Technologies and Techniques

This section has been included to introduce the readers to some novel techniques that

are being looked at by the research community to build devices at the nanoscale. Also,

some of the emerging technologies that may use such devices are discussed. Their

impact on conventional information processing paradigms is also briefly elucidated

in this section. These discussions are not directly related to the scope of this thesis.

2.7.1 Brief Introduction to Emerging Technologies

Table 2.2 compares CMOS and a set of emerging technologies in terms of speed, size,

power consumed and cost incurred per device. In Table 2.2, T refers to a single delay,

CD refers to critical dimension, Energy is the intrinsic operational energy (Joules/oper-

ation), and cost is defined as $ per gate. A number of assumptions are made to estimate

these parameters for these non-silicon manufacturing technologies in the absence of

firm empirical data. For emerging technologies in the concept stage with no observed

data, the underlying physical principles impose constraints on the parameters. If there

is availability of some measured data, the assumptions involve a scalability estimation

of these techniques. Also, some of these technologies are particularly effective for cer-


tain specific application areas. For instance, the applicability of theoretical quantum

computing was shown by P. Shor in 1994 [19]. He devised the first quantum algorithm

to find the prime factors of large numbers in polynomial time, considerably

faster than any known classical algorithm; it exploits the parallelism intrinsic in quantum

computing. But quantum computing is much less efficient for other applications. In

this case, the time required by a classical device to perform an operation using a classi-

cal algorithm is defined as ”effective” time per operation. The different parameters for

quantum computing (shown in Table 2.2) are determined by calculating the ”effective”

time per operation for different algorithms. A similar approach is used for neuromor-

phic and optical computing. The importance of such a discussion is to elucidate that

few of the new technologies are directly competitive with scaled CMOS and most are

highly complementary, and to point out the advantages of heterogeneous integration

of the non-silicon manufacturing technologies with silicon CMOS.

Technology          Tmin (s)  Tmax (s)  CDmin (m)  CDmax (m)  Energy (J)  Cost min  Cost max
Si CMOS             3E-11     1E-6      8E-9       5E-6       4E-18       4E-9      3E-3
RSFQ                1E-12     5E-11     3E-7       1E-6       2E-18       1E-3      1E-2
Plastic             1E-4      1E-3      1E-4       1E-3       1E-24       1E-9      1E-6
Optical (digital)   1E-16     1E-12     2E-7       2E-6       1E-12       1E-3      1E-2
NEMS                1E-7      1E-3      1E-8       1E-7       1E-21       1E-8      1E-5
Neuromorphic        1E-13     1E-4      6E-6       6E-6       3E-25       5E-4      1E-2
Quantum Computing   1E-16     1E-15     1E-8       1E-7       1E-21       1E3       1E5

Table 2.2: Estimated Parameters for Emerging Techniques from [18]

Recently, there have been significant advancements in non-silicon manufacturing tech-

niques. We will briefly discuss a few methodologies in this section to illustrate their


impact in future digital systems. [24, 25] discuss a new technique for silicon tran-

sistor fabrication on plastic substrates. Such transistors are called thin-film transis-

tor (TFT) devices, wherein the active layer of these devices can be amorphous or

polycrystalline-silicon as well as organic semiconductors. These devices can be used in

combination with organic light-emitting devices (OLED) [26, 27] for the development

of high-resolution, flexible flat-panel displays, that would enable highly portable infor-

mation devices that are rugged and lightweight. Another important criterion for the

maturity of an emerging manufacturing methodology is the cost incurred as compared

to existing techniques. Currently, the most common type of flat panel display is the

Liquid Crystal Display (LCD) made on glass substrates. A large percentage of the

manufacturing cost of such a display comes from the material cost of the glass panels

used for the front and back display surfaces, the driver electronics used to address the

display, and display breakage. High quality TFTs on plastic substrates could eliminate

costs incurred due to the glass and the driver IC’s, and thus provide highly flexible

and cost-effective displays. However, this novel technique is only in its nascency and

considerable research is being done to perfect this manufacturing technology.

Another interesting computational paradigm is optical computing [28]. The concept of

optical computing is almost 100 years old, and is based on a theory proposed by Albert

Einstein in the early 20th century about the mechanics of photons. Information

processing is dependent on light transmission and interaction with solids. An optical

logic gate is a switch that controls one light beam with another. The device is considered

to be ”on” when it transmits light, and ”off” when it blocks the light. Digital optical

computers have certain advantages, and these are due to certain characteristics of light

that is used as an information carrier. First, optical information-processing functions

can be performed in parallel, and second, optical beams do not cause any interference

with each other. Lastly, optical signals can be propagated at the speed of light in a

medium.

Nanoelectromechanical systems (NEMS) consist of integrated electromechanical actua-

tors of nanometer-scale dimensions that are driven by electrical energy. These systems

have led several researchers to explore the possibility of developing fast logic

gates, switches, and even computers that are entirely mechanical. For instance, this

methodology applied to logic gates entails mechanical digital signals represented by

displacements of solid rods, wherein signal propagation speed is limited to the speed

of sound [18]. Optimistic empirical estimates predict that the switching speed of NEMS

logic gates will be around 0.1ns, and these devices will dissipate less than 10−21 Joules.

The idea of electromechanical actuators and mechanical computers dates back to the

1820s [72], when Charles Babbage designed the first mechanical computer, viewed

as the forerunner to the modern computer. In the 1960s, electronic logic gates and

integrated circuits vastly outperformed moving elements due to which the idea of me-

chanical computers was dropped. But with the rapid developments in nanotechnology,

it may be possible to manufacture complex molecular-scale mechanical elements that

will move on timescales of a nanosecond or less.

On the other hand, quantum computing exploits physical phenomena unique to

quantum mechanics to realize a fundamentally new mode of information processing.

At the quantum level, the values of certain observable quantities are restricted to a

discrete finite set (Quantization). This is to ensure that each classical bit can be stored

as a stable state of the system. A binary bit is a fundamental unit of information,

represented as a 0 or 1 in digital systems. The fundamental unit of information in a

quantum computer is called a quantum bit or qubit [68]. A qubit can be a 1 or a 0,

or it can exist in a superposition that is simultaneously both 1 and 0 or somewhere in

between. This implies that quantum computers are not limited to only two states as

compared to classical digital systems. The massive parallelism intrinsic in quantum


computing is due to such superposition of different states [66, 67]. This allows a

quantum computer to work much faster than conventional digital systems. Quantum

computers also utilize another aspect of quantum mechanics known as entanglement.

Two spatially separated and non-interacting quantum systems are said to be entangled

if they have some information (due to previous interactions) that can only be accessed

by observing the states of the systems together, and not by performing experiments

on either of them alone. Quantum cryptography uses this characteristic of quantum

mechanics [19].

At the extreme end of the spectrum of these developing technologies are neuromorphic

systems, which are silicon implementations of sensory and neural systems. The archi-

tectural configurations and designs of such systems are inspired by neurobiology [69].

This is because the human brain is a classical neuromorphic information processing

system, and can be considered as a motivation for future technological advancements.

The different parameters (shown in Table 2.2) for the human brain are approximate

estimations. For example, the critical dimension of each neuron is computed by es-

timating the volume of the brain and the number of neurons. Similarly, the single

delay Tmax is the experimentally observed time for opening and closing of synapses.

Also, each neuron is connected to 100 to 10,000 synapses, and this high interconnect

density differentiates the human brain from any silicon-based system [18]. This devel-

oping technological area offers exciting possibilities, such as sensory systems that can

compete with human senses and pattern recognition systems that can run in real time.

2.7.2 Techniques to Build Nano-Scale Devices

The current techniques to manufacture silicon-based devices such as casting, grinding,

milling and even lithography provide only approximate control over the molecular


structures of the devices being manufactured, and form structures that are composed

of billions of atoms. The statistical nature of these large groups of atoms guarantees

the correct behavior of the device being built. Thus, these methodologies employ the

Law of Large Numbers (LLN) [45] at the device scale to provide a higher level of reliability.

LLN is said to hold when the mean of a sequence of random variables with a common

distribution converges to their common expectation as the size of the sequence goes to

infinity.
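The convergence that LLN describes is easy to demonstrate with a short simulation (an illustrative sketch, unrelated to any tool in this thesis):

```python
import random

def sample_mean(n, seed=0):
    # Mean of n fair-coin (Bernoulli(0.5)) draws; by the Law of Large
    # Numbers it converges to the common expectation 0.5 as n grows.
    rng = random.Random(seed)
    return sum(rng.random() < 0.5 for _ in range(n)) / n

for n in (10, 1000, 100000):
    print(n, sample_mean(n))
```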

In accordance with Moore's law [70], the device density on a chip is increasing, necessitat-

ing the scaling down of device size. The current techniques used in mass-production

of silicon devices provide phenomenal reliability in both manufacture and operation

of these devices; however, they put a limit on the ability to scale down their size. Thus,

such techniques are not viable for emerging computing systems that incorporate nano-

scale devices. Andre DeHon proposes the application of LLN at levels of abstraction

higher than device level [42], and suggests that a large number of such nanodevices

can be used at the architectural or micro-architectural level, for instance, to design

redundancy based defect-tolerant architectures. This may circumvent the reliability

issues associated with devices built with small number of atoms, and assembled as

well-defined molecular structures. Techniques to build devices at the nano-scale are

still in their infancy, but it is worth noting that recently there have been major advance-

ments in some of these areas. In this subsection, we discuss some of these methodologies

in brief.

Self-assembly is a spontaneous process by which atoms and molecules form organized

aggregates or networks, typically by interacting with a solution or gas phase [43]. This

methodology involves a process known as convergent synthesis [79], that allows as-

sembling atomically precise devices. Structures formed by this process are covalently

bonded, well-defined and stable. The components of such self-assembled structures


find their appropriate locations based on their structural properties (or chemical prop-

erties for atomic or molecular self-assembly). This methodology is not limited only

to the nanoscale, and is considered a very generic and powerful bottom-up device

manufacturing technique. Self-assembly is still in its nascent stages, but researchers

have succeeded in building some primitive nanodevices with this technique. This has

created a lot of interest in the industry as well as the academia.

Positional assembly [44] is not the same as self-assembly. The main difference between

the two processes is as follows: positional assembly provides a higher degree of control

over the placement of atoms and molecules to form arbitrary stable structures,

whereas self-assembly does not. Self-assembly, on the other hand, is largely a

natural process that only needs an external initiation. Positional assembly requires a

designer to observe and analyze the device manufacturing process (perform nanoanal-

ysis), and this requires the usage of precision observation equipment and the ability

to manipulate atoms. Researchers are trying to conglomerate these two processes to

form novel and cost-effective manufacturing techniques for nano-scale devices.

Lithography is the process of imprinting patterns on semiconductor materials that

are used in integrated circuits. In the past, electron-beam (e-beam) lithography [80]

has been used to fabricate nanoscale molecular devices. This process is impractical

for commercial applications because of the sequential nature of the methodology that

slows down writing speed. High-energy electron beams can also damage the active

molecules in a circuit. In contrast, a new technique called imprint lithography [17]

has been introduced, and is well suited to transfer patterns from micrometers to the

nanometers range. This methodology entails the placement of a thin polymer film on

top of the substrate, and a pre-patterned mold is brought into soft contact with the film.

The mold pattern is transferred to the polymer by applying pressure and increasing

the temperature. The residual polymer is removed from the feature bottom and further


processing may be done (e.g. metal deposition, etching). Nanoimprint Lithography has

high throughput due to parallel processing, and is cost-efficient since no sophisticated

tools are required.


Chapter 3

NANOLAB: A MATLAB Based Tool

In this chapter, we discuss how information theoretic entropy coupled with thermal

entropy can be used as a metric for reliability evaluation [7]. We also present the

automation methodology for the NANOLAB tool with a detailed example and code

snippets.

3.1 Reliability, Entropy and Model of Computation

[2] not only provides a different, non-discrete model of computation; it also relates

information-theoretic entropy and the thermal entropy of computation in a way that

connects reliability to entropy. It has been shown that the thermodynamic limit of

computation is KT ln 2 [3]. What the thermodynamic limit of computation means is

that the minimum entropy loss due to irreversible computation amounts to thermal

energy that is proportional to this value. If we consider energy levels close to these

thermal limits, the reliability of computation is likely to be affected, and if we can keep


our systems far from the temperature values that might bring the systems close to this

amount of entropy loss, the reliability is likely to improve. The model of computation

in [2] considers thermal perturbations, discrete errors and continuous signal noise as

sources of errors. The idea is to use a Gibbs distribution based technique to characterize

the logic computations by Boolean gates and represent logic networks as MRFs, and

maximize probability of being in valid energy configurations at the outputs.
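To make the entropy-reliability connection concrete, the sketch below (illustrative Python, not NANOLAB code; the energies and KT values are assumptions) computes the Shannon entropy of a two-state Gibbs output distribution at several thermal energy levels. Entropy grows, and the logic margin shrinks, as KT rises:

```python
import math

def gibbs_probs(energies, kT):
    # Gibbs distribution (Equation 2.6) over a finite configuration set.
    w = [math.exp(-u / kT) for u in energies]
    Z = sum(w)
    return [x / Z for x in w]

def entropy_bits(probs):
    # Shannon entropy (bits) of a discrete output distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Valid/invalid output configurations carry clique energies -1 and 0.
for kT in (0.1, 0.25, 0.5, 1.0):
    print(kT, round(entropy_bits(gibbs_probs([-1.0, 0.0], kT)), 4))
```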

3.2 Automation Methodology

NANOLAB consists of a library of functions implemented in MATLAB [1]. The library

consists of functions based on the probabilistic non-discrete model of computation dis-

cussed earlier (details in [2]), and can handle discrete energy distributions at the inputs

and interconnects of any specified architectural configuration. We have also developed

libraries that can compute energy distribution at the outputs given continuous distribu-

tions at the inputs and interconnects, introducing the notion of signal noise. Therefore,

this tool supports the modeling of both discrete and continuous energy distributions.

The library functions work for any generic one, two and three input logic gates and

can be extended to handle n-input logic gates. The inputs of these gates are assumed

to be independent of each other. These functions are also parameterized and take in

as inputs the logic compatibility function (Table 2.1) and the initial energy distribution

for the inputs of a gate. If the input distribution is discrete, the energy distribution at

the output of a gate is computed according to Equation 2.9, by marginalizing over the

set of possible values of the nodes that belong to the same clique. In the case of generic

gates these nodes are their inputs. These probabilities are returned as vectors by these

functions and indicate the probability of the output node being at different energy


levels between 0 and 1. These probabilities are also calculated over different values

of KT so as to analyze thermal effects on the node. The Belief Propagation algorithm

[52] is used to propagate these probability values to the next node of the network to

perform the next marginalization process. The tool can also calculate entropy values

at different nodes of the logic network. It also verifies that for each logical component

of a Boolean network, the valid states have an energy level less than the invalid states

as shown theoretically in [2].
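To make the marginalization concrete, the following sketch computes the output distribution of a NAND gate by assigning valid (x, y, z) configurations a lower clique energy than invalid ones and marginalizing over the inputs, in the spirit of Equation 2.9. Python is used purely for illustration (the actual NANOLAB library is MATLAB), and the function names are hypothetical, not NANOLAB's. With inputs distributed as [0.1 0.9], the computed values closely reproduce the z = 0.0 and z = 1.0 columns of Table 3.1.

```python
import math
from itertools import product

def nand_output_dist(px, py, kt):
    """Output distribution of a NAND gate under the Gibbs/MRF view:
    valid (x, y, z) configurations (z = NAND(x, y)) get a lower clique
    energy than invalid ones, and the output distribution is obtained
    by marginalizing over the inputs (cf. Equation 2.9)."""
    def energy(x, y, z):
        # valid configurations: energy 0; invalid configurations: energy 1
        return 0.0 if z == int(not (x and y)) else 1.0
    pz = [0.0, 0.0]
    for x, y, z in product([0, 1], repeat=3):
        pz[z] += px[x] * py[y] * math.exp(-energy(x, y, z) / kt)
    total = sum(pz)                      # normalization by the partition function Z
    return [p / total for p in pz]

def entropy(dist):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Inputs biased towards logic high, as in the example distributions.
px = py = [0.1, 0.9]
for kt in [0.1, 0.25, 0.5, 1.0]:
    pz = nand_output_dist(px, py, kt)
    print(f"KT={kt}: p(z=0)={pz[0]:.3f}, p(z=1)={pz[1]:.3f}, H={entropy(pz):.3f}")
```

As KT rises, the output distribution flattens and the entropy grows, which is exactly the thermal effect the tool tracks.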

Our tool also includes a library of functions that can model noise as uniform or Gaussian distributions, or combinations of these, depending on the user's specification. The probability of the energy levels at the output of a gate is calculated by a similar marginalization technique, but using Equation 2.10 or 2.11 (or both), depending on the characterization of the signal noise. Like the library functions discussed previously, the signal library returns output energy distributions as vectors, but the energy levels lie between -1 and 1 due to the rescaling of the logic levels for reasons discussed in Chapter 2. Entropy values at the primary outputs of Boolean networks are also computed for different thermal energy levels. Arbitrary Boolean networks in any redundancy-based defect-tolerant architectural configuration can be analyzed by writing simple MATLAB scripts that use the NANOLAB library functions. Generic fault-tolerant architectures such as TMR and CTMR are also being converted into library functions, so that they can be used in larger Boolean networks where more than one kind of defect-tolerant scheme may be applied for higher reliability of computation.

3.2.1 Detailed Example

We now discuss a detailed example to indicate the power of our methodology. Figure 2.2 shows a CTMR configuration with three TMR blocks working in parallel with a majority gate logic. The code listing shown in Figure 3.1 is a MATLAB script that uses NANOLAB functions and the Belief Propagation algorithm to evaluate the probability of the energy configurations at the output of the CTMR.

The probability distributions for x1, y1, x2, y2, x3 and y3 for the NAND gates in Figure 2.2

are specified as vectors. These vectors specify the probability of the inputs being a logic

low or high (discrete). The input probability distributions for all the TMR blocks are

the same in this case but these can be varied by having separate input vectors for each

TMR block.

          z=0.0   z=0.2   z=0.5   z=0.8   z=1.0
KT=0.10   0.809   0.109   0.006   0.025   0.190
KT=0.25   0.798   0.365   0.132   0.116   0.201
KT=0.50   0.736   0.512   0.324   0.256   0.263
KT=1.00   0.643   0.547   0.443   0.379   0.356

Table 3.1: Probability of the output z of a logic gate being at different energy levels, one row per value of KT ∈ {0.1, 0.25, 0.5, 1.0}

The NANOLAB functions return vectors similar to the one shown in Table 3.1. These

indicate the probability of the output of a logic network being at specified energy levels

for different KT values. In the CTMR configuration, for each TMR block, the energy

configurations at the outputs of each of the three NAND gates are obtained from

the function for a two input gate. Then these probabilities are used as discrete input

probability distributions to the function for a three input gate. This computes the energy

distribution at the output of the majority logic gate. Similarly, the probabilities of the


no_of_blocks = 3;                  % number of TMR blocks
prob_input1  = [0.1 0.9];          % prob distbn of input1 of NAND gate
prob_input2  = [0.1 0.9];          % prob distbn of input2 of NAND gate
BT_Values    = [0.1 0.25 0.5 1.0]; % different kbT values

for TMR_block = 1:no_of_blocks
    counter = 1;
    % energy_2_input_gates_function is a NANOLAB function and takes as
    % parameters the logic compatibility function, input prob distbns and
    % the kbT values. The output gives the state configurations of the
    % primary output of the logic gate at different kbT values.
    prob1 = energy_2_input_gates_function(input1, prob_input1, prob_input2, BT_Values);
    prob2 = energy_2_input_gates_function(input1, prob_input1, prob_input2, BT_Values);
    prob3 = energy_2_input_gates_function(input1, prob_input1, prob_input2, BT_Values);
    [a,b] = size(prob1);

    % req_pb1, req_pb2, req_pb3 are vectors which contain probabilities
    % of the output being a 0 or a 1 for a particular kbT value, for
    % Belief Propagation.
    for i = 1:a
        req_pb1 = [prob1(i,1) prob1(i,b)];
        req_pb2 = [prob2(i,1) prob2(i,b)];
        req_pb3 = [prob3(i,1) prob3(i,b)];
        BT = BT_Values(i);

        % energy_3_input_gates_function is a part of NANOLAB and takes
        % input and output parameters similar to the previous 2-input
        % gates function.
        t_p = energy_3_input_gates_function(input2, req_pb1, req_pb2, req_pb3, BT_Values(i));
        prob(TMR_block, counter) = t_p(1,1);
        counter = counter + 1;
        prob(TMR_block, counter) = t_p(1,b);
        counter = counter + 1;
    end
end

Figure 3.1: Script for 1st order CTMR with discrete input distribution


final output of the CTMR is calculated. It can be seen that our methodology provides a

very convenient way of evaluating this CTMR configuration and only requires minor

modifications to be extended to any i-th order CTMR. Additionally, it is desired that

valid input/output states should have lower clique energies than invalid states [2].

We have checked for the conformance to this property for the different clique energy

functions which are generated by NANOLAB.
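The composition described above, NAND output distributions fed via Belief Propagation into a three-input majority gate, can be sketched as follows. This is an illustrative Python re-implementation under the same Gibbs-marginalization model (the helper names are hypothetical, not the NANOLAB API):

```python
import math
from itertools import product

def gate_output_dist(valid_fn, input_dists, kt):
    """Marginalize the Gibbs distribution over a gate clique: configurations
    whose output agrees with valid_fn(inputs) get clique energy 0, all
    others energy 1 (cf. Equation 2.9)."""
    pz = [0.0, 0.0]
    for bits in product([0, 1], repeat=len(input_dists) + 1):
        *ins, z = bits
        weight = math.prod(d[b] for d, b in zip(input_dists, ins))
        e = 0.0 if z == valid_fn(ins) else 1.0
        pz[z] += weight * math.exp(-e / kt)
    total = sum(pz)
    return [p / total for p in pz]

nand = lambda ins: int(not (ins[0] and ins[1]))
maj = lambda ins: int(sum(ins) >= 2)

kt = 0.1
p_in = [0.1, 0.9]  # probability of an input being logic low / high
# One TMR block: three identical NAND gates feed a majority gate; each
# NAND's output distribution is propagated as an input distribution of
# the majority gate (the Belief Propagation step).
p_nand = gate_output_dist(nand, [p_in, p_in], kt)
p_tmr = gate_output_dist(maj, [p_nand] * 3, kt)
print("NAND:", p_nand, "TMR:", p_tmr)
```

Repeating the majority step over three such TMR outputs gives the 1st-order CTMR, which is exactly the structure of the script in Figure 3.1.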

Similarly, MATLAB scripts can be written to model signal noise and to evaluate the energy distribution and entropy at the output of the CTMR shown in Figure 2.2. The NANOLAB library functions can be used to inject uniform or Gaussian noise, or combinations of these. Our tool provides a wide range of modeling options to system architects and expedites the reliability analysis of defect-tolerant architectural configurations for combinational logic circuits.

3.3 Shortcomings of NANOLAB

NANOLAB libraries can be used to inject random noise at the inputs and interconnects of logic circuits, and signal noise can be modeled as discrete or continuous distributions. We have experimented with different such combinations at the primary inputs of a Boolean network, as well as injected such noise distributions at the interconnects of a logic circuit. When we try to specify two noise spikes at the inputs of a NAND gate as independent Gaussian distributions, the expression for the energy distribution at the output of the NAND gate turns out to be as follows [13]:


p(\text{output}) = \frac{1}{Z}\int_{-1}^{1}\!\int_{-1}^{1} e^{-\frac{U}{KT}}\,
\frac{e^{-(x_0-\mu_0)^2/2\sigma_0^2}}{\sqrt{2\pi}\,\sigma_0}\,
\frac{e^{-(x_1-\mu_1)^2/2\sigma_1^2}}{\sqrt{2\pi}\,\sigma_1}\, dx_0\, dx_1 \qquad (3.1)

The two Gaussian processes have means µ0 and µ1, which may or may not be equal, and variances σ0 and σ1. Equation 3.1 does not evaluate to an explicit integral, so the probability values for the output being at different energy levels cannot be computed directly. To tackle this scenario, we are developing a procedure in MATLAB that approximates the product of two Gaussian distributions. Note that specifying the noise as a bivariate Gaussian does yield an explicit integral, since the probability density function of a bivariate Gaussian gives a less complex integrand. Also, the accuracy of the probabilities computed by our libraries and the Belief Propagation algorithm [52] is limited by the floating-point accuracy of MATLAB.
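Since Equation 3.1 has no closed form, one way to approximate it is straightforward numerical quadrature over the [-1, 1] × [-1, 1] domain. The sketch below does this with a midpoint rule; the clique energy U used here is a hypothetical stand-in (0 when z agrees with the NAND of the thresholded inputs, 1 otherwise), so the numbers are illustrative only:

```python
import math

def p_output_numeric(z, mu, sigma, kt, n=200):
    """Midpoint-rule approximation of the double integral in Equation 3.1:
    Gibbs weight times two independent Gaussian input densities, integrated
    over [-1, 1] x [-1, 1]. The clique energy U is a hypothetical stand-in."""
    def gauss(x, m, s):
        return math.exp(-(x - m) ** 2 / (2 * s ** 2)) / (math.sqrt(2 * math.pi) * s)

    def U(x0, x1, z):
        nand = 1.0 if not (x0 > 0 and x1 > 0) else -1.0  # rescaled logic levels
        return 0.0 if z == nand else 1.0

    h = 2.0 / n
    total = 0.0
    for i in range(n):
        x0 = -1.0 + (i + 0.5) * h
        for j in range(n):
            x1 = -1.0 + (j + 0.5) * h
            total += (math.exp(-U(x0, x1, z) / kt)
                      * gauss(x0, mu[0], sigma[0])
                      * gauss(x1, mu[1], sigma[1]) * h * h)
    return total

# Two independent Gaussian noise spikes centred at logic high.
raw = {z: p_output_numeric(z, mu=(1.0, 1.0), sigma=(2.0, 0.2), kt=0.25)
       for z in (-1.0, 1.0)}
Z = sum(raw.values())                 # normalization over the two logic levels
dist = {z: r / Z for z, r in raw.items()}
print(dist)
```

With both noise spikes centred at logic high, the NAND output is pulled towards the rescaled logic low (-1), as expected.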

[Figure: a Boolean network with primary inputs x0, x1, x2, internal nodes x3, x4, x5, and primary output z]

Figure 3.2: A Boolean network having the logic function z = x2 ∧ (x0 ∨ x1′)


3.4 Reliability Analysis of Boolean Networks using NANOLAB

With NANOLAB, we analyze different defect-tolerant architectural configurations of arbitrary Boolean networks, ranging from single gates to logic blocks. The entropy values and logic margins are observed, and these reveal interesting facts [5]. The next few subsections discuss the different experimental results in detail. The combinational circuit we refer to is shown in Figure 3.2.

3.4.1 Reliability and Entropy Measures of a NAND Gate

Figure 3.3 shows the entropy curves at the outputs of a single NAND gate and different

orders of a NAND CTMR configuration at different KT values. The output probability

distribution of a NAND gate is asymmetrical. This should be expected since only one

input combination produces a logic low at the output. Figure 3.3 (a) indicates that the

entropy is lower when the inputs of the single NAND gate are uniformly distributed.

At higher KT values, in both the uniform and non-uniform probability distribution

cases, the entropy increases, resulting in the logic margins (probability of being in

valid energy configurations) being reduced. This indicates that the logic margins in

both scenarios of distribution almost approach zero as thermal energy levels increase.

When thermal energy becomes equal to logic energy (KT = 1), the entropy values in the case of the uniform distribution are lower, implying that the logic margins are better than when the inputs have a higher probability of being logic high.

Figure 3.3 (b) shows the entropy curves for different orders of a NAND CTMR configuration at different KT values. The inputs in this case are uniformly distributed. It can


[Figure: three entropy-vs-KT plots]

(a) Entropy of a NAND gate for different input distributions (uniform vs. prob(1) = 0.8)

(b) Entropy of a CTMR NAND configuration when inputs are uniformly distributed (0th, 1st and 2nd order CTMR)

(c) Entropy of a CTMR NAND configuration when inputs are logic high with 0.8 probability (0th, 1st and 2nd order CTMR)

Figure 3.3: Entropies for the output of a single NAND gate and CTMR configurations of the NAND gate at different KT values


be observed that the 0th order CTMR (TMR) has a higher entropy value, i.e., a smaller logic margin, than the 1st order CTMR at lower thermal energy levels, and in both cases the entropy values almost converge at KT = 0.5. At this point (in close proximity to the thermodynamic limit of computation), the logic margins at the output of the CTMR configurations become insignificant, there is total disorder, and computation becomes unpredictable. The interesting plot is for the 2nd order CTMR. The entropy shoots up at a KT value of 0.25, indicating that the 2nd order CTMR degrades the reliability of computation. This aptly shows that every architectural configuration has reliability/redundancy trade-off points, and even if redundancy levels are increased beyond these, the reliability of the architecture either remains the same or worsens.

Figure 3.3 (c) indicates the entropy values when the inputs are non-uniformly distributed, with a probability of 0.8 of being logic high. In this case, it is observed that the 2nd order CTMR degrades the reliability of the system, but does better than the 0th order CTMR configuration. This result demonstrates that varying the energy distributions at the inputs of a Boolean network may result in different reliability-redundancy trade-off points. Also, the entropy values are higher than in Figure 3.3 (b) at lower KT values. This is because inputs with a higher probability of being logic high imply that the output of the NAND gate has a higher probability of being non-stimulated, and only one input combination (both inputs high) can cause this to happen. Thus, the logic margins are somewhat reduced due to the convergence towards a single valid output energy state, and the entropy has higher values than in the previous experiment.

3.4.2 Reliability and Entropy Measures for the Logic Block

Figure 3.4 shows the entropy curves at the outputs of the CTMR configurations for the

logic block. Figure 3.4 (a) shows the entropy when the input distribution is uniform.


[Figure: two entropy-vs-KT plots for 0th through 4th order CTMR]

(a) Inputs are uniformly distributed

(b) Inputs are logic high with 0.8 probability

Figure 3.4: Entropy for different orders of CTMR configurations for the Boolean network given in Figure 3.2 at different KT values

It can be observed that as the CTMR order increases, the entropy decreases (the logic margin increases) at lower KT values, and the entropy converges at KT = 0.5 for all the CTMR orders. This indicates that as the system approaches the thermal limit of computation (KT ln 2) [3], an increase in the redundancy level does not improve the logic margins; the reliability measures remain the same. But at lower thermal energy levels, increasing the redundancy level (the order of the CTMR) results in an improvement of reliability: the 4th order CTMR has entropy values lower than the other orders at thermal energy levels below 0.5. In the NAND CTMR configuration, by contrast, the 2nd order CTMR has higher entropy at lower KT values, indicating a degradation in the reliability of computation. This experiment illustrates that each specific Boolean network has a unique reliability-redundancy trade-off point. It may also be inferred that the redundancy level is proportional to the device density of a logic network.

Figure 3.4 (b) indicates the entropy values when the inputs have a non-uniform probability distribution, i.e., the probability of being logic high is 0.8. The same facts are observed in this case as in Figure 3.4 (a), but the entropy is higher than in the previous experiment at lower thermal energy levels. This is because inputs with a higher probability of being logic high mean that the output of the logic block will be stimulated, and only a few input combinations can cause this to happen. Thus, the logic margins are slightly reduced compared to when the inputs are uniformly distributed. Also, the entropy curves for the 3rd and 4th order CTMR configurations are in very close proximity to each other. This indicates that the degree of reliability improvement diminishes as more redundancy is added to the architecture.

We conducted this experiment to explore the flexibility and robustness of our tool in evaluating arbitrary Boolean networks. Note that the Boolean network shown in Figure 3.2 has been used only for illustration purposes; reliability/redundancy analyses of larger and more complex combinational circuits have also been performed.

3.4.3 Output Energy Distributions for the Logic Block

NANOLAB has another perspective to it: the probability of being in different energy configurations at the primary outputs of a Boolean network can also be computed, and reliability measures of logic circuits can be analyzed from these probability distributions. Figure 3.5 shows the energy distributions at the outputs of the different CTMR orders applied to the logic circuit when the inputs have a uniform energy distribution. Note that the probability values are based on bin sizes of 0.1. As the order of the CTMR is increased, it can be seen that the logic margins for the output (z) at KT values of 0.1, 0.25 and 0.5 keep increasing. Due to the asymmetrical nature of the logic network, the probability of z (p(z)) being at energy level zero is almost always higher than that of being at one, and hence such plots are obtained. It is also observed that at

higher than being at one and hence such plots are obtained. It is also observed that at


[Figure: three output-distribution plots, p(z) vs. z, at KT = 0.10, 0.25, 0.50, 1.00]

(a) The output distributions of a TMR configuration

(b) The output distributions of a 2nd order CTMR configuration

(c) The output distributions of a 4th order CTMR configuration

Figure 3.5: Energy distributions at the output of the Boolean network for different orders of CTMR configuration at different KT values. Inputs are uniformly distributed


a KT value of one, the logic margin for any order of the CTMR configuration becomes considerably small (the output energy distribution becomes almost uniform) and remains the same even with an increase in redundancy, resulting in unreliable computation. Comparing the different orders of CTMR in Figure 3.5, we infer that for lower thermal energy levels, the probability of being in a valid energy configuration increases as more redundancy is added to the architecture. But this increase in probability slows down as higher orders of CTMR are reached. This can be understood as follows: the logic margin of the system reaches a saturation point after which reliability can no longer be improved. The experimental results for single NAND gate CTMR configurations show similar plots, with the difference that the output of the 2nd order CTMR configuration is non-stimulated (a valid state) with a higher probability at KT = 0.1. This can be attributed to the intuitive fact that if the unit being replicated has a smaller number of devices, a stable logic margin is reached at lower redundancy levels than for a unit with a higher number of components. Such observations give a clear notion of the optimal redundancy points for specific logic networks.

[Figure: two entropy-vs-KT plots for KT ∈ [0.01, 0.1]]

(a) CTMR orders of a single NAND gate (0th through 2nd order)

(b) CTMR orders of the logic circuit shown in Figure 3.2 (0th through 4th order)

Figure 3.6: Entropy for different logic networks at KT values ∈ [0.01, 0.1]


3.4.4 Reliability Analysis of Different Circuits for Smaller KT Values

Reliability analysis of real systems at thermal energy levels below 0.1 is of practical importance. Figure 3.6 presents computational entropy values for a single NAND gate and the Boolean network. These results are similar to the ones indicated by Figures 3.3 and 3.4. It is interesting to note that in this case there is a linear increase in the entropy plots for both circuits with increasing KT values, whereas the previous experimental results showed a convergence of the entropy values at around KT = 0.5, remaining steady for all thermal energy levels beyond that point. Similar reliability/redundancy trade-off points are observed in both logic networks: the 2nd order CTMR configuration for the NAND gate does worse than the lower orders, whereas increasing the redundancy level for the logic block yields better reliability.

3.4.5 Reliability Analysis in the Presence of Noisy Inputs and Interconnects

We have also analyzed the reliability of different logic networks in the presence of random noise at the inputs and interconnects. Entropy and energy distributions at the outputs are computed automatically, and reliability/redundancy trade-off points are inferred. The logic margin in the presence of signal noise is the difference between the probabilities of occurrence of -1 (logic low) and 1 (logic high), due to the rescaling of the energy levels discussed in Chapter 2. The Boolean network we consider here is the one given in Figure 3.2.

Reliability and Entropy Measures of a NAND Gate: Figure 3.7 shows the entropy

values at the outputs of different NAND gate configurations in the presence of noisy


[Figure: two entropy-vs-KT plots]

(a) Entropy at the output of a NAND gate (uniform vs. Gaussian noise distribution)

(b) Entropy for different CTMR orders of the NAND gate (0th through 4th order)

Figure 3.7: Entropy at the outputs of different NAND gate configurations in the presence of noisy inputs and interconnects

inputs and interconnects. The plots in Figure 3.7 (a) are obtained when (i) one of the inputs of the gate has equal probability of being at logic low or high and the other is subjected to a Gaussian noise spike centered around 1 with a variance of 2, and (ii) the Gaussian noise signal is changed to a uniformly distributed one ranging over [-1, 1]. The plots show that the entropy has lower values in the presence of Gaussian noise than with uniformly distributed noise at lower KT values. This is because a Gaussian noise spike centered around logic high raises the probability of the output being in a valid energy state and decreases the degree of randomness (entropy) the system is in. Normally, Gaussian noise spikes have their mean at invalid energy levels and thus prevent the system from converging to a valid logic state. Such noise signals have also been modeled for complex logic circuits.

The entropy curves for different CTMR configurations of a NAND gate are indicated in Figure 3.7 (b). The entropy values for the NAND gate are plotted up to the 4th order CTMR for different thermal energy levels. The inputs to the logic block and the interconnects are subjected to both uniformly distributed and Gaussian noise. We have assumed two Gaussian noise spikes, one centered around 1 with a variance of 2 and the other centered around 0.5 with a variance of 0.2. It can be observed that as redundancy is increased by adding more CTMR orders, the entropy decreases (the logic margin, and hence reliability, increases) at lower KT values. However, the rate of improvement in reliability decreases as the 3rd order CTMR is reached. This result emphasizes that beyond a certain redundancy level, the system's reliability for a given defect-tolerant configuration reaches a steady state and cannot be improved substantially.

[Figure: entropy-vs-KT plot and output-distribution plot]

(a) Entropy for different CTMR orders of the logic block (0th and 1st order)

(b) Output distribution for the TMR of the Boolean network, p(z) vs. z at KT = 0.10, 0.25, 0.50, 1.00

Figure 3.8: Entropy and energy distribution at the outputs of different CTMR configurations of the logic block in the presence of noisy inputs and interconnects

Entropy and Energy Distribution for the Boolean Network: The entropy and energy distribution at the output of different CTMR configurations of the logic block are presented in Figure 3.8. The entropy values for the different CTMR orders of the Boolean network are given in Figure 3.8 (a). The experimental setup is as follows: the inputs and interconnects are subjected to both uniformly distributed and Gaussian noise. Two Gaussian noise signals are considered, with equal means at the 0.5 energy level and variances of 2 and 0.2, respectively. It can be inferred from the results that TMR (0th order CTMR) has low entropy values at KT values less than 0.5, while the higher CTMR orders have very high entropy irrespective of the thermal energy. We have only indicated the entropy for the 1st order CTMR because higher orders do not improve the reliability. Considering this result, we can state that in a very noisy and error-prone environment, adding any level of redundancy beyond the TMR configuration causes a steep degradation of reliability. Comparing these results with Figure 3.7 (b), we surmise that reliability/redundancy trade-off points not only depend on the specific logic network but also vary with the normal operating scenarios (such as random noise) of the circuit.

The energy distribution at the output of the TMR configuration for the circuit at different thermal energy levels is presented in Figure 3.8 (b). As discussed earlier, these probability values can also indicate reliability measures. The noise characterizations at the inputs and interconnects are similar to the previous experiment. Due to the asymmetrical nature of the logic network, the probability of z being at energy level zero is almost always higher than that of being at one when the inputs have equal probability of being logic low or high. But due to the uncertainty injected in the form of noise, the logic margins at the output of the logic block become very sensitive to thermal fluctuations. At KT = 0.25, z has high probabilities of being in any of the valid logic states, and as the thermal energy increases, the probability of occurrence of the invalid states becomes almost equal to that of the valid states.


Chapter 4

NANOPRISM: A Tool Based on Probabilistic Model Checking

The advent of nanoscale devices necessitates architectures that are defect-tolerant through redundancy. Redundancy may be applied at different levels of granularity, such as the gate level, logic block level, logic function level, and unit level. The trade-offs between reliability, redundancy, and the level of granularity must be evaluated rapidly for computer architects to find suitable design Pareto points. This chapter explains how we use a probabilistic model checking framework, in particular PRISM [73], to model defect-tolerant architectures such as TMR, CTMR and multi-stage iterations of these. The probabilistic models that represent the different defect-tolerant architectural configurations are integrated into a library that constitutes NANOPRISM [8, 12]. We also report interesting results that indicate the effectiveness of our tool.


4.1 Modeling Single Gate TMR, CTMR

In this section, we explain the PRISM model construction of a single gate TMR, CTMR and multistage iterations of these. The first approach is to directly model the systems as given in Figures 2.1 and 2.2. For each redundant unit and the majority voting logic, we construct separate PRISM modules and combine these modules through synchronous parallel composition. However, such an approach leads to the well-known state space explosion problem. At the same time, we observed that the actual values of the inputs and outputs of each logic block are not important; instead, one needs to keep track of only the total number of stimulated (and non-stimulated) inputs and outputs. Furthermore, to compute these values without having to store all the outputs of the units, we replace the set of replicated units working in parallel with ones working in sequence. This folds space into time; in other words, it reuses the same logic unit over time rather than replicating it in space. This approach does not influence the performance of the system, since each unit works independently and the probability of each gate failing is also independent.
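A minimal Monte Carlo sketch of this "folding space into time" idea, mirroring the structure of the PRISM TMR module shown in Figure 4.1 (one NAND unit reused R times, keeping only the running count z of stimulated outputs). Python sampling is used here purely for illustration; PRISM itself computes these probabilities exactly rather than by simulation:

```python
import random

def tmr_nand_sample(p_in=0.9, p_err=0.1, R=3, R_limit=1):
    """One sampled run of the TMR NAND model with space folded into time:
    a single NAND unit is reused R times, and only the running count z of
    stimulated outputs is kept (no per-unit outputs are stored)."""
    z = 0
    for _ in range(R):
        x = random.random() < p_in       # initial choice of x and y
        y = random.random() < p_in
        correct = not (x and y)          # fault-free NAND output
        faulty = x and y                 # on error the gate emits the complement
        out = faulty if random.random() < p_err else correct
        z += int(out)
    return int(z > R_limit)              # majority logic on the count
```

Averaging many samples estimates the probability that the majority output is stimulated, the quantity the PRISM model computes symbolically.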

The different orders of CTMR configurations are built incrementally from the models of the previous iterations. Here too, two approaches emerge. The first incorporates the PRISM modules of the previous CTMR iteration as the redundant functional units for the current order and adds a majority voting logic. This causes the model to grow exponentially as higher orders of configuration are reached. The other approach is to use the already calculated probability values (probability of the output being logic low or high) at the output of the last CTMR configuration as the output probability distributions of the three redundant functional units of the current order of the CTMR configuration.
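This second, state-space-friendly approach can be illustrated with a minimal Python sketch, where the output probability of one CTMR order is fed back as the per-unit output probability of the next (the single-probability abstraction and function names are ours; NANOPRISM itself works with full distributions in PRISM):

```python
def vote(p_high, p_err):
    """P(output high) for three independent copies, each high with
    probability p_high, followed by a majority voter that suffers an
    inverting fault with probability p_err."""
    maj = 3 * p_high**2 * (1 - p_high) + p_high**3
    return maj * (1 - p_err) + (1 - maj) * p_err

def ctmr_output_prob(p_unit_high, p_err, order):
    """Iterate the recursion: each CTMR order triplicates the previous
    order's unit, whose output probability is already known."""
    p = vote(p_unit_high, p_err)   # 0th order: a single TMR stage
    for _ in range(order):
        p = vote(p, p_err)
    return p
```

The recursion keeps the model size constant per order, instead of tripling it, which is precisely why the second approach avoids exponential growth.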

The DTMC model of the TMR configuration of a NAND gate is shown in Figure 4.1


prob p_err = 0.1; // probability that gate has error
prob p_in = 0.9;  // probability an input is logic high

const R = 3;       // number of redundant processing units
const R_limit = 1; // majority threshold on the number of stimulated outputs

module TMR

  x : bool;
  y : bool;
  s : [0..3] init 0;        // local state
  z : [0..R] init 0;        // number of outputs that are stimulated
  z_output : [0..1] init 0; // output of majority logic
  c : [0..4] init 0;        // count of the redundant unit being processed

  [] s=0 & c>R -> (s'=0);

  // processed all redundant units
  [] s=0 & c=R -> (s'=3)&(c'=c+1);

  // initial choice of x and y
  [] s=0 & c<R -> p_in : (x'=1)&(s'=1)&(c'=c+1) + (1-p_in) : (x'=0)&(s'=1)&(c'=c+1);
  [] s=1 -> p_in : (y'=1)&(s'=2) + (1-p_in) : (y'=0)&(s'=2);

  // NAND operation
  [] s=2 -> p_err : (z'=z+(x&y))&(s'=0) + (1-p_err) : (z'=z+(!(x&y)))&(s'=0);

  // majority logic
  [] s=3 & z>=0 & z<=R_limit -> (s'=0) & (z_output'=0);
  [] s=3 & z>R_limit & z<=R -> (s'=0) & (z_output'=1);

endmodule

Figure 4.1: PRISM description of the TMR configuration of a single NAND gate


for illustration purposes. We assume in this case that the inputs X and Y have identical probability distributions (probability of being logic high is 0.9), and that the failure (inverted output) probability of the NAND gates is 0.1. However, the input probability distributions and the failure distribution of the NAND gates can be changed easily by modifying the constants given at the start of the description. The probabilistic state machine for this DTMC model built by PRISM has 115 states and 182 transitions. Model checking is performed to compute the probability of the TMR configuration's output being in an invalid state for different gate failure probabilities. Furthermore, since PRISM can also represent non-deterministic behavior, one can set upper and lower bounds on the probability of gate failure and then obtain (best and worst case) reliability characteristics for the system under these bounds. As discussed before, the CTMR configuration uses three TMR logic units and a majority voter. The probability distribution obtained for the TMR block of a single NAND gate can be used directly in the CTMR configuration, thus reducing the state space.

[Circuit diagram with inputs x0 through x5 omitted.]

Figure 4.2: A Boolean network having the logic function z = x2 ∧ (x0 ∨ x1′)

4.1.1 Modeling Block Level TMR, CTMR

We explain the PRISM model construction for different orders of block level CTMR

configurations. Each unit as shown in Figure 2.1 is no longer composed of a single


gate, but contains a logic circuit such as the one shown in Figure 4.2. As discussed earlier, we replace the set of replicated units working in parallel with ones working in sequence, thus folding space into time. With this modeling approach, the DTMC model for the TMR configuration of the logic block in Figure 4.2 has 1407 states. The models for different orders of CTMR configurations are built using techniques similar to those used for single gate CTMR.

We have also incorporated different levels of granularity in these redundancy-based architectural configurations. Let us walk through an example to explain this concept clearly. For the logic circuit under discussion, we model the different levels of CTMR wherein each redundant unit contains the logic block itself. We consider this redundancy to be at the logic block level of granularity. Next, within each redundant logic block, we further replicate each gate so as to form different orders of CTMR configurations at the gate level of granularity. Our experimental results confirm the intuitive fact that CTMR configurations at different levels of granularity and lower gate failure probabilities give better reliability than the conventional flat architectural configuration.

4.2 Reliability Analysis of Logic Circuits with NANOPRISM

NANOPRISM has been applied to the reliability analysis of different combinational circuits, from single gates to logic blocks. The output error distributions for different granularity levels and gate failure distributions are observed, and these demonstrate certain anomalous and counter-intuitive facts. The next subsections discuss the different experimental results in detail [8]. Figure 4.2 shows the Boolean network used to illustrate the effectiveness of NANOPRISM in evaluating


arbitrary logic circuits.

[Plots omitted: (a) Inverter CTMR Configurations; (b) NAND CTMR Configurations. Each panel plots the probability of the output being inverted against the gate failure probability for the 0th through 3rd order CTMR.]

Figure 4.3: Error Distribution at the Outputs for the different Orders of CTMR Configurations of Single Gates

4.2.1 Reliability Measures of Single Gate CTMR Configurations

Figure 4.3 shows the error distribution at the output of CTMR configurations for the Inverter and NAND gates. The input distribution in both cases is assumed to be logic high with probability 0.9. The probability of the gates being in error ranges over {10^−2, 0.1, 0.2, ..., 0.5}. Figure 4.3 (a) shows the probability of the output being in error (inverted with respect to the expected output) for the inverter CTMR configurations up to order three. It is observed that as the order increases, the probability of having the output in error goes down. But at lower gate error probabilities, it is clearly seen that the 2nd and 3rd order configurations provide the same reliability measures; increasing redundancy beyond this point does not make sense. At higher gate failure rates, there is a difference between the reliability obtained by the 2nd and 3rd order CTMR architectures. We extended the number of orders further


and observed that at the 5th CTMR order, when gate failure distributions are high, the difference between the reliability measures obtained from this iteration and the previous one shrinks but never vanishes. This is intuitive because at higher device failure rates, introducing more redundancy adds more error-prone computation; at some point the reliability cannot be improved any further and may sometimes degrade, as we will observe shortly. Also, for each CTMR configuration, a plateau is reached at lower gate failure distributions, i.e. lowering the device failure rate does not lower the probability of an erroneous output. This means that a redundancy-based architecture cannot provide a reliability level below a certain "point", and NANOPRISM can analyze such reliability measures accurately.

Figure 4.3 (b) shows the error distribution at the output of the different CTMR configurations of a NAND gate. This graph confirms observations already seen in the previous plot. In this case, the 1st order configuration provides a major improvement in reliability as compared to the 0th order CTMR. This observation is of interest and can be interpreted as follows: due to the asymmetrical nature of the NAND gate, certain probabilistic combinations of the inputs will cause the output to be in a logic state that is invalid in the context of the model. For example, if one of the inputs of the NAND gate is logic low and the gate is not in error, the output is logic high. Since the inputs have a high probability of being 1, a logic high also appears at the output when the gate is in error, thus mildly elevating the probability of errors at the output. Also, the 2nd and 3rd order CTMR configurations provide almost the same reliability measures for lower gate failure rates.


[Plot omitted: probability of the output being inverted against the failure probability of gates in the logic block, for the 0th through 3rd order CTMR.]

Figure 4.4: Error distribution at the outputs of the different CTMR orders for the Boolean network given in Figure 4.2

4.2.2 Reliability Measures for the Logic Block CTMR Configurations

We also analyze the combinational circuit given in Figure 4.2, to illustrate that our tool can compute reliability measures of any arbitrary logic circuit. Figure 4.4 shows the output error distributions for different orders of CTMR configurations for the aforesaid logic network. The probability of the inputs being stimulated (logic high) is assumed to be 0.9. Also, the component inverter and NAND gates are assumed to have the same failure distribution, with failure probability values ranging over {10^−4, 10^−3, 10^−2, 0.1, 0.2, ..., 0.5}. The plots in Figure 4.4 are similar to those of the previous experiment. As the CTMR orders increase, the probability of having the output in error goes down, and at lower gate error probabilities it is clearly seen that the 2nd and 3rd order configurations provide the same reliability measures. Interestingly, in this case, at higher gate failure rates, the rate of improvement in the reliability of the system slows down steadily as the redundancy (in terms of CTMR orders) is increased. This is due to the augmentation of unreliable devices to


the architecture. Note that this degree of slowdown is greater than what is observed in the case of single gate logic networks. This important observation can be interpreted as follows: for Boolean networks with a higher number of unreliable component gates, increasing the redundancy factor increases the probability of erroneous outputs, and these probability values are higher than for single gate logic networks.

[Plots omitted: (a) 1st Order CTMR with and w/o Granularity; (b) Comparison of different Gate Level Granularities. Each panel plots the probability of the output being inverted against the failure probability of gates in the logic block.]

Figure 4.5: Granularity Analysis for the Logic Block shown earlier

4.2.3 Reliability vs. Granularity Trade-offs for CTMR Configurations

We discuss the impact of applying redundancy at different granularity levels for specific Boolean networks. We determine interesting facts and anomalies while experimenting with this form of redundancy-based architecture. Figure 4.5 shows error distributions at the outputs of 1st order CTMR configurations with and without granularity. For illustration purposes, the granularity considered here is at the gate level, but our generic libraries can handle granularity-based redundancy at the logic block level, logic function level, unit level, etc. In Figure 4.5 (a), we compare the reliability measures of two different defect-tolerant architectural configurations of the logic circuit shown in Figure 4.2.


One of these is a 1st order CTMR configuration with redundancy at the logic block level (only the logic circuit is replicated), whereas the other has both gate level and logic block level redundancy; we call the latter a granularity-based architectural configuration. In this experiment, the logic circuit is configured as a 1st order CTMR and the component gates are individually configured as TMR configurations. It is observed that at lower device failure rates, the reliability of the granularity-based architecture is better. This is intuitive because the introduction of more redundant devices yields better reliability. But a counter-intuitive phenomenon is observed at higher gate failure distributions. At the 0.4 failure probability mark, the probability of an erroneous output for the granular architecture becomes greater than that of the one without granularity. This shows that for the CTMR configuration with gate level granularity, there is a degradation of the reliability of the system. The reason for this anomaly is that introducing redundancy at any level of granularity entails adding more unreliable devices to a specific Boolean architecture. Thus, this interesting finding confirms that there are indeed trade-off points between the reliability and granularity of redundant systems that may impact the design of a defect-tolerant architecture.

We also model different orders of CTMR for the component gates (gate level granularity) while keeping an overall 1st order CTMR configuration for the logic block used in the previous experiment. The plots in Figure 4.5 (b) show that as the CTMR orders at the gate level of granularity increase, the reliability of the system improves steadily. Our experiments show that the reliability measures of the system remain almost the same beyond the 2nd order CTMR configuration. At lower gate failure distributions, the 1st order CTMR configuration at the gate level can be considered the optimal redundancy level of the overall architecture, as increasing the level of redundancy further at the gate level does not yield much improvement in the overall reliability.


Chapter 5

Reliability Evaluation of Multiplexing Based Majority Circuits

In this emerging era of nanotechnology, non-silicon manufacturing methodologies such as quantum dots [84] and quantum cellular automata (QCA) [11, 55] use majority logic as the fundamental building block for implementing Boolean computation. In this chapter, we analyze reliability measures of multiplexing based majority circuits by building a generic multiplexing library. To obtain these trade-off figures, and to compare them against previously reported figures for NAND multiplexing [62], we have enhanced our probabilistic model checking based tool NANOPRISM (discussed in detail in Chapter 4). We outline the enhancements needed in our tool for evaluating majority multiplexing configurations. Besides the interesting reliability trade-off measures, the library based enhancement of the NANOPRISM tool is a major contribution by itself, especially to designers building QCA based architectures that work around probable defects and need a design automation tool to quickly evaluate the trade-offs in their designs. Our NANOPRISM libraries can be applied to model any


arbitrary Boolean circuit, or a portion of a large Boolean network, at different levels of granularity, such as the gate level, logic block level, logic function level, unit level, etc. Our attention was drawn to the importance of majority gates in the nanotechnology context by [76, 75], where analytical results for similar trade-off evaluations were presented.

5.1 Main Results

We have determined different reliability measures of multiplexing architectures based on majority circuits, and have compared the results of different configurations for majority and NAND multiplexing. Our observations show that:

• the probability of errors at the output is higher for large gate failure probabilities.

• the probability of errors at the output decreases if more redundancy is incorporated in the majority multiplexing architecture, provided the probability of the gates failing is small.

• for large gate failure probabilities, due to the augmentation of more redundant devices, the reliability of the system either does not improve or may even degrade.

• for the same redundancy factors, the majority multiplexing system is less reliable than NAND multiplexing when the probability of the gates failing is small.

• majority multiplexing provides better reliability than multiplexing systems based on NAND gates when the gate failure probabilities are large.

Such observations could be of significant consequence in determining redundancy

levels for systems with majority gates as the basic building blocks.


[Diagram omitted: input bundles X, Y and Z feed an executive stage of majority gates (M), followed by restorative stages, with random permutation units (U) between stages. M = Majority Gate.]

Figure 5.1: A majority multiplexing unit

5.2 Model Construction of Majority Multiplexing System

In this section, we explain the PRISM model of a majority gate multiplexing configuration. The first approach is to directly model the system as shown in Figure 5.1. A PRISM module is constructed for each multiplexing stage comprising N majority gates, and these modules are combined through synchronous parallel composition. However, such a model construction scheme leads to the well-known state space explosion problem. Due to the observations stated earlier in Chapter 4, a model construction technique similar to that of the TMR and CTMR architectural configurations is adopted. The set of N majority gates working in parallel is replaced by N majority gates working in sequence. The same methodology is applied to the multiplexing stages of the system, so as to reuse the same module for each of the stages while keeping a record of the outputs from the previous stage. This approach does not affect the results, since each majority gate works independently and the probability of each gate failing is also independent.

The unit U in Figure 5.1 performs a random permutation. Consider the case when k outputs from the previous stage are stimulated, for some 0 < k < N. Since there are k


stimulated outputs, the next stage will also have k of its inputs stimulated if U performs a random permutation. Therefore, the probability of either all or none of the inputs being stimulated is 0. This implies that the majority gates in a stage are dependent on one another; for example, if one majority gate has a stimulated input, then the probability of another having the same input stimulated decreases. It is difficult to calculate the reliability of a system by means of analytical techniques for such a scenario. Changing the number of restorative stages, bundle size, input probabilities or probability of the majority gates failing requires only a modification of parameters in the PRISM model description. Since PRISM can also represent non-deterministic behavior, one can set upper and lower bounds on the probability of gate failure and then obtain best and worst case reliability characteristics for the system under these bounds.
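The sequential without-replacement rule used in the model (a stimulated wire is drawn with probability nx/(N − c) for the c-th gate of a stage) can be mimicked in a few lines of Python; the helper name below is ours:

```python
import random

def draw_permuted(n_stimulated, bundle_size, rng):
    """Draw one input per gate from a randomly permuted bundle: the c-th
    gate sees a stimulated wire with probability nx / (N - c), and draws
    are made without replacement."""
    nx, drawn = n_stimulated, []
    for c in range(bundle_size):
        bit = 1 if rng.random() < nx / (bundle_size - c) else 0
        nx -= bit
        drawn.append(bit)
    return drawn
```

Because the draws exhaust the bundle, every permutation delivers exactly n_stimulated ones: a stage can never see all (or none) of its inputs stimulated unless k = N or k = 0, which is precisely the dependence between gates described above.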

Figure 5.2 shows the DTMC model of the von Neumann majority multiplexing system. We assume in this case that the inputs X and Y have identical probability distributions (probability of being logic high is 0.9), and that the input Z is non-stimulated with the same probability. The failure (inverted output) probability of the majority gates is 0.01. However, these parameters can be modified by changing the values of the constants (const) given at the start of the model. The probabilistic state machine built by PRISM for the multiplexing configuration shown in Figure 5.2 has 1,227,109 states and 1,846,887 transitions, which indicates the complexity of analyzing such architectural configurations. The models for the different multiplexing configurations are verified for PCTL [49] properties such as "the probability of having less than 10% errors at the primary output".
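Such a property reduces to summing the tail mass of the output distribution. A small Python sketch (the dictionary format is our assumption, mapping a stimulated-output count to its probability; the correct output here is stimulated, so an error is a non-stimulated wire):

```python
def prob_at_most_10pct_errors(dist, n):
    """P(at most 10% errors): with the correct output stimulated, an error
    is a non-stimulated wire, so at least 0.9*n outputs must be high."""
    return sum(p for k, p in dist.items() if k >= 0.9 * n)
```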


const N = 20;       // number of inputs in each bundle
const M = 3;        // number of restorative stages equals (M-1)
const low_lim = 1;  // the higher limit for logic low at the majority gate output
const high_lim = 3; // the higher limit for logic high at the majority gate output

prob p_err = 0.01; // probability that majority gate has a von Neumann fault
prob p1_in = 0.9;  // probability one of the first two inputs is logic high
prob p2_in = 0.1;  // probability the third input is logic high

module majority_multiplex
  u : [1..M];   // current stage
  c : [0..N];   // counter (number of copies of the majority gate done)
  s : [0..5];   // local state
  // number of stimulated inputs (outputs of previous stage)
  nx : [0..N]; ny : [0..N]; nz : [0..N];
  x : [0..1]; y : [0..1]; z : [0..1]; // value of inputs
  k : [0..3];   // sum of all inputs
  out : [0..N]; // number of stimulated outputs

  [] s=0 & c<N -> (s'=1); // next majority gate of current stage
  // move on to next stage
  [] s=0 & c=N & u<M -> (s'=1)&(nx'=out)&(ny'=out)&(nz'=out)&(u'=u+1)&(c'=0);
  // finished (reset variables)
  [] s=0 & c=N & u=M -> (s'=0)&(nx'=0)&(ny'=0)&(nz'=0)&(x'=0)&(y'=0)&(z'=0)&(k'=0);
  // choose x (initially random)
  [] s=1 & u=1 -> p1_in : (x'=1)&(s'=2) + (1-p1_in) : (x'=0)&(s'=2);
  // permute
  [] s=1 & u>1 -> nx/(N-c) : (x'=1)&(s'=2)&(nx'=nx-1) + (1-(nx/(N-c))) : (x'=0)&(s'=2);
  // choose y (initially random)
  [] s=2 & u=1 -> p1_in : (y'=1)&(s'=3) + (1-p1_in) : (y'=0)&(s'=3);
  // permute
  [] s=2 & u>1 -> ny/(N-c) : (y'=1)&(s'=3)&(ny'=ny-1) + (1-(ny/(N-c))) : (y'=0)&(s'=3);
  // choose z (initially random)
  [] s=3 & u=1 -> p2_in : (z'=1)&(s'=4) + (1-p2_in) : (z'=0)&(s'=4);
  // permute
  [] s=3 & u>1 -> nz/(N-c) : (z'=1)&(s'=4)&(nz'=nz-1) + (1-(nz/(N-c))) : (z'=0)&(s'=4);
  // add the input values to check for majority
  [] s=4 -> (k'=x+y+z) & (s'=5);
  // decide majority: logic low or high
  [] s=5 & k>=0 & k<=low_lim -> (1-p_err) : (out'=out+0)&(s'=0)&(c'=c+1)&(k'=0) +
                                p_err : (out'=out+1)&(s'=0)&(c'=c+1)&(k'=0);
  [] s=5 & k>low_lim & k<=high_lim -> (1-p_err) : (out'=out+1)&(s'=0)&(c'=c+1)&(k'=0) +
                                      p_err : (out'=out+0)&(s'=0)&(c'=c+1)&(k'=0);
endmodule

Figure 5.2: PRISM description of the von Neumann majority multiplexing system


5.3 Reliability Evaluation of Multiplexing based Majority Systems

In this section, we compute the reliability measures of multiplexing based majority systems for I/O bundle sizes of both 10 and 20. These bundle sizes are for illustration purposes only; we have also investigated the performance of these systems for larger bundle sizes. In all the experiments reported here, we assume that the inputs X, Y and Z are identical (this is often true in circuits containing similar devices) and that two of the inputs have a high probability of being logic high (0.9) while the third input has a 0.9 probability of being logic low. Thus the circuit's correct output should be stimulated. Also, it is assumed that the gate failure is a von Neumann fault, i.e. when a gate fails, the value of its output is inverted.
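Under this setup, the probability that a single majority gate produces a stimulated output can be worked out directly; a small sketch of the arithmetic (our illustration, not thesis code):

```python
def stimulated_prob(p_high=0.9, p_err=0.01):
    """P(majority gate output stimulated): inputs X, Y high with probability
    p_high, Z high with probability 1 - p_high, and a von Neumann
    (inverting) fault occurring with probability p_err."""
    p_z = 1 - p_high
    # majority is high iff X and Y are both high, or exactly one of them
    # is high together with Z
    maj = p_high**2 + 2 * p_high * (1 - p_high) * p_z
    return maj * (1 - p_err) + (1 - maj) * p_err
```

With the default parameters this gives a fault-free majority probability of 0.828 per gate, which the restorative stages then push toward 1.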

The PRISM model in Figure 5.2 is verified for properties such as the probability of having k non-stimulated outputs, which in terms of Figure 5.2 corresponds to the probability of reaching a state where out = N − k, u = M and c = N, for k = 0, . . . , N, where N is the I/O bundle size. This computes the error distribution at the output of the majority gate architectural configuration, and hence any measure of reliability can be calculated from these results. PRISM can also be used to directly compute other reliability measures, such as the probability of errors being less than 10% and the expected number of incorrect outputs of the system. Our analysis of reliability for the majority multiplexing system using the NANOPRISM generic multiplexing library concentrates on the effects of the failure probabilities of the majority gates and the number of restorative stages. The results we present show:

• the error distribution at the output for different gate failure probabilities (Figure 5.3).


• the error distribution at the output for different gate failure probabilities when additional restorative stages are added (Figure 5.4).

• reliability measures in terms of the probability that at most 10% of the outputs are erroneous, as the probability of gate failure varies (Figure 5.5).

• reliability almost reaches a steady state for small gate failure probabilities and can be improved only marginally once a certain number of restorative stages have been added to the architecture (Figure 5.5).

• the maximum probability of gate failure allowed for the system to function reliably, obtained by comparing the probability that at most 10% of the outputs are incorrect and the expected percentage of incorrect outputs for different numbers of restorative stages (Figures 5.6 and 5.7).

[Plots omitted: probability against number of non-stimulated outputs, for gate failure probabilities 0.2, 0.1, 0.02 and 0.0001; (a) I/O bundle size equals 10, (b) I/O bundle size equals 20.]

Figure 5.3: Error distributions at the output with 1 restorative stage under different gate failure rates and I/O bundle sizes


[Plots omitted: probability against number of non-stimulated outputs, for gate failure probabilities 0.2, 0.1, 0.02 and 0.0001; (a) I/O bundle size equals 10, (b) I/O bundle size equals 20.]

Figure 5.4: Error distributions at the output with 3 restorative stages under different gate failure rates and I/O bundle sizes

5.3.1 Error Distribution at the Outputs of Different Configurations

We consider a majority multiplexing system as shown in Figure 5.1, where the I/O bundle size equals 10 or 20. We first investigate the effect of varying the failure probabilities of the majority gates on the reliability of the system. Figure 5.3 shows the error distribution at the output of the system when the probability of gate failure equals 0.2, 0.1, 0.02 and 0.0001. The experimental setup is such that two of the inputs of each majority gate are stimulated when working correctly, and the correct output of the majority gate should be logic high. Hence, the more outputs that are non-stimulated, the lower the reliability of the system.

As expected, the distributions in Figure 5.3 show that as the probability of a gate failure decreases, the reliability of the multiplexing system increases, i.e. the probability of the system returning incorrect results diminishes. Comparing Figures 5.3 (a) and (b) shows that increasing the I/O bundle size increases the reliability of the system, as the probability of having erroneous outputs decreases.


Next, additional restorative stages are added to the architecture and the change in reliability is observed. Figure 5.4 presents the error distributions at the output of the system with 3 restorative stages; the gate failure probabilities are the same as in Figure 5.3. Comparing these distributions with those in Figure 5.3, we observe that when the gate failure probability is sufficiently small (e.g., 0.0001), adding restorative stages increases reliability, i.e., the probability of non-stimulated outputs becomes small. On the other hand, when the gate failure probability is sufficiently large, adding stages does not increase reliability and can in fact decrease the reliability of the system (compare the distributions for a failure probability of 0.2 at each bundle size).

[Plots: (a) I/O bundle size equals 10; (b) I/O bundle size equals 20. x-axis: error of individual gate (10^x); y-axis: probability of error less than 10%; curves U for 3, 4, 5, and 7 stages.]

Figure 5.5: Probability that at most 10% of the outputs are incorrect for different I/O bundle sizes (small probability of failure)

5.3.2 Small Gate Failure Probabilities

In Figure 5.5, we have plotted the probability that less than 10% of the outputs are incorrect against small gate failure probabilities, for configurations with 3, 4, 5, and 7 restorative stages. These plots show that for small gate failure probabilities, the reliability of the


[Plots: (a) I/O bundle size equals 10; (b) I/O bundle size equals 20. x-axis: number of restorative stages; y-axis: probability of error less than 10%; curves for gate failure probabilities 0.01, 0.02, 0.03, and 0.04.]

Figure 5.6: Probability that at most 10% of the outputs of the overall system are incorrect for different I/O bundle sizes (large probability of failure)

multiplexing system almost reaches a steady state. Comparing Figures 5.5(a) and (b), it can be inferred that increasing the bundle size improves the reliability of the system. We have also experimented with higher I/O bundle sizes, such as 40 and 45, and the results from these configurations show that the rate of increase in reliability diminishes as the bundle size grows.
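The diminishing benefit of larger bundles can be illustrated with an idealized binomial sketch (a simplification of our own that treats each output wire as independently incorrect with probability q, rather than using the full multiplexing model):

```python
from math import comb

def p_at_most_10pct(bundle_size, q):
    """P(at most 10% of the bundle's outputs are incorrect) when each
    output is incorrect independently with probability q."""
    k_max = bundle_size // 10  # 10% of the bundle, rounded down
    return sum(comb(bundle_size, k) * q**k * (1 - q)**(bundle_size - k)
               for k in range(k_max + 1))

if __name__ == "__main__":
    # larger bundles raise the reliability measure, but each
    # doubling of the bundle size buys less than the previous one
    for n in (10, 20, 40):
        print(n, round(p_at_most_10pct(n, 0.01), 5))
```

Under this simplification the measure rises with bundle size while the increments shrink, mirroring the trend observed for bundles of 40 and 45.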

These results also demonstrate that increasing the number of stages can greatly enhance the reliability of the system. However, the degree of improvement diminishes as more restorative stages are added. Moreover, there exists an optimal redundancy level (in terms of the number of restorative stages) beyond which the reliability improvement is marginal. For example, consider the plots in Figure 5.5 that indicate the probability of erroneous outputs being less than 10% when the number of restorative stages equals 5 and 7. These show that the probability values increase only marginally, for every gate failure probability, compared to configurations with fewer restorative stages. We should also mention that this result corresponds to the observation made in [47] that, as the number of stages increases, the output distribution


of the system will eventually become stable and independent of the number of stages.

5.3.3 Large Gate Failure Probabilities

In this subsection, we consider the case when the probability of failure of the majority gates becomes too large for the multiplexing system to function reliably. Figure 5.6 plots the probability that less than 10% of the outputs are incorrect against large gate failure probabilities (between 0.01 and 0.04) for numbers of restorative stages ranging from 1 to 11. As the results show, when the probability of gate failure equals 0.04, increasing the I/O bundle size does not improve the reliability of the system. Comparing the corresponding plots in Figures 5.6(a) and (b), it can be observed that adding restorative stages and increasing the I/O bundle size then tend to degrade the reliability of computation. This anomalous behavior can be understood as follows: when the failure rate is 0.04 (or higher), the restorative stages are themselves so affected by gate failures that they cannot reduce the degradation, and hence increasing the number of stages makes the system more prone to error.

We can infer from these observations that for gate failure probabilities of 0.04 and higher, increasing the bundle size or adding more restorative stages cannot make the system more reliable and may even degrade it. On the other hand, when the gate failure probability is less than 0.04, the results demonstrate that the system can be made reliable once a sufficient number of restorative stages have been added.

In Figure 5.7, we have plotted the expected percentage of incorrect outputs when the I/O bundle size equals 10. The gate failure probabilities and restorative stages are configured as in Figure 5.6. The plots in Figures 5.7 and 5.6 convey similar


[Plot: x-axis: number of restorative stages; y-axis: expected percentage of incorrect outputs; curves for gate failure probabilities 0.01, 0.02, 0.03, and 0.04.]

Figure 5.7: Expected percentage of incorrect outputs for I/O bundle size of 10 (large

probability of failure)

intuitive results but offer different perspectives on the reliability evaluation of the system. By similar results, we mean that for gate failure probabilities of 0.04 and higher, increasing the number of restorative stages cannot raise the reliability of the system to a very high level. In fact, for certain specific numbers of restorative stages, the reliability may even be lower than that of a system with less redundancy (fewer restorative stages). It can also be observed from Figure 5.7 that for all gate failure probabilities, the reliability of the architecture reaches a steady state once a sufficient number of restorative stages have been added.

Such reliability-redundancy trade-off points may mitigate challenges in design and

evaluation of defect-tolerant nano-architectures.
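The steady state seen in Figure 5.7 can be sketched with a simple mean-field recurrence of our own (it treats the wires of a bundle as independent, so it is only an approximation of the multiplexing model): if a fraction x of a bundle is stimulated, a majority stage whose gates suffer von Neumann faults with probability p maps x to (1 − p)m + p(1 − m), where m = 3x² − 2x³ is the chance that a majority of three randomly chosen wires is stimulated.

```python
def stage_map(x, p_fail):
    """One restorative majority stage under the independence
    approximation: m is the probability that a majority of three
    randomly chosen wires is stimulated; a von Neumann fault
    (probability p_fail) inverts the gate output."""
    m = 3 * x**2 - 2 * x**3
    return (1 - p_fail) * m + p_fail * (1 - m)

def incorrect_pct(p_fail, stages, x0=0.9):
    """Expected percentage of incorrect (non-stimulated) outputs after
    the given number of stages, starting from 90% stimulated inputs."""
    x = x0
    for _ in range(stages):
        x = stage_map(x, p_fail)
    return 100 * (1 - x)

if __name__ == "__main__":
    # the iteration converges to a fixed point, so adding stages
    # beyond a certain point changes the error level only marginally
    for p in (0.01, 0.02, 0.03, 0.04):
        print(p, [round(incorrect_pct(p, s), 2) for s in (1, 3, 7, 11)])
```

In this sketch the error percentage settles at the fixed point of the stage map, which grows with the gate failure probability; beyond a few stages the improvement is marginal, echoing the saturation observed with NANOPRISM.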

It is important to note that there is a difference between the bounds computed by our NANOPRISM tool on the probability of gate failure required for reliable computation and the theoretical bounds presented in the literature. This difference is to be


expected since our methodology evaluates the performance of systems under fixed

configurations, whereas the bounds presented in the literature correspond to scenarios

where different redundancy parameters can be increased arbitrarily in order to achieve

a reliable system.

5.4 Comparison of Reliability Measures for NAND and Majority Multiplexing Systems

[Plots: (a) NAND multiplexing [62], x-axis: error of individual NAND (10^x), curves for 1, 2, 5, and 7 restorative stages, with and without the unit UR; (b) majority multiplexing, x-axis: error of individual gate (10^x), curves U for 3, 4, 5, and 7 stages; y-axis: probability of error less than 10%.]

Figure 5.8: Probability of at most 10% of the outputs being incorrect for the multiplexing systems when the I/O bundle size equals 20 and gate failure probabilities are small

In this section, we compare reliability measures of the multiplexing architectural configurations for NAND and majority gates. The input distributions for the majority multiplexing system (shown in Figure 5.1) are identical to the ones assumed in Section 5.3, i.e., two of the inputs have a high probability (0.9) of being logic high while the third input has a 0.9 probability of being logic low. Thus, during normal operation the system should return a stimulated output. The reliability results for the NAND multiplexing configuration (Figure 1.1) reported in [62] assume that the inputs X and Y


[Plots: (a) NAND multiplexing [62], curves with and without the unit UR for gate failure probabilities 0.01 through 0.04; (b) majority multiplexing, curves for gate failure probabilities 0.01, 0.02, 0.03, and 0.04. x-axis: number of restorative stages; y-axis: probability of error less than 10%.]

Figure 5.9: Probability of at most 10% of the outputs being incorrect for the multiplexing systems when the I/O bundle size equals 20 and gate failure probabilities are large

are identical and that initially 90% of the inputs are stimulated. It is also assumed that, for these multiplexing configurations, a gate failure is a von Neumann fault (inverted output error).

Figure 5.8 compares the reliability of the majority and NAND multiplexing systems in terms of the probability of at most 10% of the outputs being incorrect, for an I/O bundle size of 20 and small gate failure probabilities. The number of restorative stages varies between 3 and 7. We only consider the plots for the unit U (random permutation of the input signals) in Figure 5.8(a). Comparing these results, we can infer that for both multiplexing systems, increasing the number of stages enhances the reliability measures, and that the rate of increase in reliability decreases as more restorative stages are added. It can also be observed that NAND multiplexing provides better reliability than its majority gate counterpart, and this observation is independent of the number of restorative stages.

Figure 5.9 also compares the majority and NAND multiplexing systems in terms of reliability, for an I/O bundle size of 20, when the probability of gate failure is large.


The gate failure probabilities vary between 0.01 and 0.04. In Figure 5.9(a), we only observe the curves for U, since our majority multiplexing model only performs random permutation of the signals. Figures 5.9(a) and (b) show that for both multiplexing systems, as the probability of gate failure increases, high reliability of computation cannot be achieved even by adding restorative stages. This is because the reliability of the system reaches a saturation point at an optimal redundancy factor (number of restorative stages). The results also indicate that, for these large failure probabilities, the majority multiplexing scheme provides better reliability than NAND multiplexing, and this observation is independent of the number of restorative stages.


Chapter 6

Conclusion and Future Work

This thesis proposes two automation methodologies that can expedite reliability evaluation of defect-tolerant architectures for nanodevices. We have developed tools that can determine optimal redundancy and granularity levels for any specific logic network, given the failure distribution of the devices and the desired reliability of the system. These tools can also inject signal noise at inputs and interconnects, recreating practical situations that circuits are subjected to during normal operation.

We have developed NANOLAB, a tool that is based on a non-discrete probabilistic design methodology proposed by Bahar et al. This computational scheme has two aspects. First, the gates are assumed to be defect free, and the model of computation, based on Markov Random Fields, correlates information-theoretic entropy with the thermal entropy of computation. At the reduced voltage levels of nano-architectures this issue can become significant, and reliability may suffer when computation is carried out close to the thermal limit. However, we show that by considering various defect-tolerant architectural techniques, such as TMR, CTMR and multi-stage


iterations of these, the effects of carrying out computation close to the thermal limit can be reduced substantially. Second, signal noise can be injected at the inputs and interconnects of a logic network. We have incorporated the effects of signal noise in our automation methodology; noise is modeled using Gaussian and uniform probability distributions, which goes beyond the thermal aspects described above. NANOLAB automatically computes reliability measures of systems in terms of energy distributions and entropy in the presence of discrete input distributions and random noise. The tool consists of MATLAB-based libraries for generic logic gates that calculate the probability of the outputs being at different energy levels, together with a Belief Propagation algorithm that computes such distributions at the primary outputs of arbitrary Boolean networks (given as inputs to the tool). It also supports various interconnect noise models, can be very useful in modeling transient faults, and provides designers a way to identify the configuration that is best in terms of reduced entropy (which in turn means higher reliability). This makes it an effective tool and technique for evaluating reliability and redundancy trade-offs in the presence of thermal perturbations and interconnect noise. Our tool can also inject error probabilities at the gates and perform similar trade-off calculations.
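As a toy illustration of the energy-distribution and entropy measures NANOLAB computes (the sketch and its energy values are our own assumptions, not output from the tool), consider a gate output whose states follow a Gibbs distribution p_i ∝ exp(−E_i/kT): as kT approaches the energy gap between the correct and erroneous states, the distribution flattens and the output entropy rises, signaling lower reliability.

```python
from math import exp, log2

def gibbs_output_probs(energies, kt):
    """Gibbs/Boltzmann distribution over output energy levels:
    p_i proportional to exp(-E_i / kT). Low-energy (correct) states
    dominate when kT is small; near the thermal limit the
    distribution flattens."""
    w = [exp(-e / kt) for e in energies]
    z = sum(w)  # partition function
    return [x / z for x in w]

def entropy_bits(probs):
    """Shannon entropy of the output distribution, in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

if __name__ == "__main__":
    # hypothetical two-level output: correct state at energy 0,
    # erroneous state at energy 1 (arbitrary units)
    for kt in (0.1, 0.5, 1.0):
        p = gibbs_output_probs([0.0, 1.0], kt)
        print(kt, round(p[0], 3), round(entropy_bits(p), 3))
```

The entropy grows with kT, which is the sense in which lower entropy at the outputs indicates a more reliable configuration.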

We have reported results for single-gate and logic-block-level architectural configurations, and investigated the performance of the system in terms of entropy and energy distribution as the CTMR order varies. Using NANOLAB, we were able to compute the energy distributions at the outputs of systems, and hence construct a complete picture of the reliability of the systems under study. We chose to analyze TMR and CTMR for illustration purposes, as these are canonical fault-tolerant architectures standard in the literature and enabled us to compare our results with others. However, as explained, the approach can equally be applied to alternative fault-tolerant architectures for nanotechnology.


This thesis also presents NANOPRISM, a reliability evaluation tool based on probabilistic model checking. The tool consists of libraries for different redundancy-based defect-tolerant architectural configurations. We have analyzed reliability measures of specific logic circuits with this tool, and shown that for a given probability of gate failure, we can find the minimum level of redundancy and granularity that enables reliable computation. The NANOPRISM libraries support implementation of redundancy at different levels of granularity, such as the gate level, logic block level, logic function level, and unit level. We illustrate the capabilities of our methodology by modeling von Neumann multiplexing systems and different orders of CTMR for arbitrary logic circuits. We also describe a methodology that reduces the effect of the well known state space explosion problem, and hence allows the analysis of larger configurations. The experimental results from our tool expose anomalous and counter-intuitive phenomena that may not be detected quickly by analytical methods.
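The kind of counter-intuitive effect such analysis exposes can be seen even in a closed-form sketch of TMR and iterated (CTMR-style) redundancy (a hand-derived formula we include only for illustration; the tool itself derives such measures by state-space analysis):

```python
def tmr(r_unit, r_voter=1.0):
    """Probability a TMR block computes correctly: at least two of
    three replicas correct, and the (possibly unreliable) majority
    voter correct."""
    at_least_two = r_unit**3 + 3 * r_unit**2 * (1 - r_unit)
    return r_voter * at_least_two

def ctmr(r_unit, orders, r_voter=1.0):
    """Iterate TMR on itself: each CTMR order treats the previous
    level's block as the replicated unit."""
    r = r_unit
    for _ in range(orders):
        r = tmr(r, r_voter)
    return r

if __name__ == "__main__":
    # redundancy helps only when the unit is already more often
    # right than wrong; below r = 0.5 it actively hurts
    print(round(tmr(0.9), 4), round(tmr(0.4), 4))
    # with an imperfect voter, higher CTMR orders stop paying off
    print([round(ctmr(0.9, k, r_voter=0.99), 4) for k in (1, 2, 3, 4)])
```

With a perfect voter, TMR improves a 0.9-reliable unit but degrades a 0.4-reliable one; with an imperfect voter, successive CTMR orders saturate, so there is again an optimal redundancy level.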

NANOLAB applies an approach to reliability analysis that is based on Markov Random Fields as the probabilistic model of computation, whereas NANOPRISM applies probabilistic model checking techniques to calculate the likelihood of occurrence of probabilistic transient defects in devices and interconnections. Thus, there is an inherent difference in the model of computation between the two approaches. Although the two methodologies are different, they offer complementary alternatives to analytical methodologies and allow system architects to obtain sharp bounds and study anomalies for specific architectures. Future work includes extending these tools to support reliability analysis of sequential circuits. We are also in the process of building libraries for other frequently used fault tolerance schemes.


Bibliography

[1] Web Page: www.mathworks.com.

[2] R. I. Bahar, J. Mundy, and J. Chen, A probability-based design methodology for nanoscale computation,

in International Conference on Computer-Aided Design (IEEE Press, San Jose, CA, November 2003).

[3] C. H. Bennett, ‘The thermodynamics of computation – a review’, International Journal of Theoretical Physics 21 (1982), 905–940.

[4] J. Besag, ‘Spatial interaction and the statistical analysis of lattice systems’, Journal of the Royal Statistical Society, Series B 36 (1974), no. 3, 192–236.

[5] D. Bhaduri and S. K. Shukla, ‘Nanolab: A tool for evaluating reliability of defect-tolerant nano

architectures’, Tech. Report (Fermat Lab, Virginia Tech, 2003). Available at http://fermat.ece.

vt.edu/Publications/pubs/techrep/techrep0309.pdf.

[6] D. Bhaduri and S. K. Shukla, ‘Nanoprism: A tool for evaluating granularity vs. reliability

trade-offs in nano architectures’, Tech. Report (Fermat Lab, Virginia Tech, 2003). Available at

http://fermat.ece.vt.edu/Publications/pubs/techrep/techrep0318.pdf.

[7] D. Bhaduri and S. K. Shukla, Nanolab: A tool for evaluating reliability of defect-tolerant nano archi-

tectures, in IEEE Computer Society Annual Symposium on VLSI (IEEE Press, Lafayette, Louisiana,

February 2004). http://fermat.ece.vt.edu.

[8] D. Bhaduri and S. K. Shukla, Nanoprism: A tool for evaluating granularity vs. reliability trade-offs in

nano architectures, in GLSVLSI (ACM, Boston, MA, April 2004).

[9] D. Bhaduri and S. K. Shukla, ‘Reliability evaluation of multiplexing based defect-tolerant majority

circuits’, to appear in the proceedings of IEEE Conference on Nanotechnology (Munich, Germany

August 2004).


[10] P. Beckett and A. Jennings. Towards nanocomputer architecture. In ACS Conferences in Research and

Practice in Information Technology (CRPIT), volume 6 of Computer Systems Architecture. Australian

Computer Society Inc, 2002. Available at http://www.cellmatrix.com/entryway/products/

pub/publications.html

[11] I. Amlani, A. O. Orlov, G. Toth, G. H. Bernstein, C. S. Lent, and G. L. Snider, ‘Digital logic gate using quantum-dot cellular automata’, Science 284 (April 1999), 289–291. Available at http://www.nd.edu/∼qcahome/reprints/Amlani2.pdf.

[12] D. Bhaduri and S. K. Shukla, Tools and techniques for evaluating reliability of defect-tolerant nano

architectures, in International Joint Conference on Neural Networks (IEEE Press, Budapest, Hungary,

July, 2004).

[13] D. Bhaduri and S. Shukla, ‘Reliability analysis in the presence of signal noise for nano archi-

tectures’, to appear in the proceedings of IEEE Conference on Nanotechnology (Munich, Germany

August 2004).

[14] F. Buot, ‘Mesoscopic physics and nanoelectronics: nanoscience and nanotechnology’, Physics

Reports (1993), 173–174.

[15] S. Burris, Boolean algebra (March 2001). Available at http://www.thoralf.uwaterloo.ca/htdocs/

WWW/PDF/boolean.pdf.

[16] J. Chen, M. Reed, and A. Rawlett, ‘Observation of a large on-off ratio and negative differential

resistance in an electronic molecular switch’, Science 286 (1999), 1550–2.

[17] Yong Chen, Douglas A. A. Ohlberg, Xuema Li, Duncan R. Stewart, R. Stanley Williams, Jan O. Jeppe-

sen, Kent A. Nielsen, J. Fraser, Deirdre L. Olynick and Erik Anderson, ‘Nanoscale molecular-switch

devices fabricated by imprint lithography’, in Applied Physics Letters 82 (2003), no. 10, 1610–1612.

[18] James A. Hutchby, George I. Bourianoff, Victor V. Zhirnov and Joe E. Brewer, ‘Extending The Road

beyond CMOS’, in IEEE Circuits and Devices Magazine (March 2002), 28–39.

[19] P. W. Shor, ‘Polynomial-time algorithms for prime factorization and discrete logarithms on a

quantum computer’, in SIAM Journal for Computing 26 (1997), no. 5, 1484–1509

[20] J. Chen, W. Wang, M. Reed, M. Rawlett, D. W. Price, and J. Tour, ‘Room-temperature negative

differential resistance in nanoscale molecular junctions’, Appl. Phys. Lett. 1224 (2000), 77.


[21] J. Chen, J. Mundy, Y. Bai, S.-M. Chan, P. Petrica, and I. Bahar, A probabilistic approach to nano-

computing, in IEEE non-silicon computer workshop (IEEE Press, San Diego, June 2003).

[22] R. Chen, A. Korotov, and K. Likharev, ‘Single-electron transistor logic’, Appl. Phys. Lett. 68 (1996), 1954.

[23] C. Collier, E. Wong, M. Belohradsky, F. Raymo, J. Stoddart, P.J.Kuekes, R. Williams, and J. Heath,

‘Electronically configurable molecular-based logic gates’, Science 285 (1999), 391–3.

[24] Y. J. Tung, P. G. Carey, P. M. Smith, S. D. Theiss, X. Meng, R. Weiss, G. A. Davis, V. Aebi, and

T. J. King, ‘An Ultra-Low-Temperature-Fabricated Poly-Si TFT with Stacked Composite ECR-

PECVD Gate Oxide’, SID Int. Symp. Digest of Technical Papers, Anaheim (CA, May 1998).

[25] S. D. Theiss, P. G. Carey, P. M. Smith, P. Wickboldt, T. W. Sigmon, Y. J. Tung, and T. J. King,

‘Polysilicon Thin Film Transistors Fabricated at 100degC on a Flexible Plastic Substrate’, Int.

Electron Devices Mtg (San Francisco, CA, December 1998).

[26] Webster E. Howard, ‘Better Displays with Organic Films’, Scientific American (Feb. 2004), p. 76.

[27] Joseph Shinar, Organic Light-Emitting Devices: A Survey (Springer-Verlag, NY 2004). ISBN 0-387-

95343-4.

[28] Broome, ‘How to Design Optical Systems Using Cost-Effective Manufacturing Technologies’, in

Optical Society of America (1990), Vol. 15, p. 204.

[29] Y. Cui and C. Lieber, ‘Functional nanoscale electronic devices assembled using silicon nanowire

building blocks’, Science 291 (2001), 851–853.

[30] D. Mange, M. Sipper, A. Stauffer, and G. Tempesti, ‘Toward robust integrated circuits: the embry-

onics approach’, in IEEE, 88, 516–41.

[31] R. Dobrushin and E. Ortyukov, ‘Upper bound on the redundancy of self-correcting arrangements

of unreliable functional elements’, Problems of Information Transmission 13 (1977), no. 3, 203–218.

[32] L. Durbeck, Underlying future technology talk on computing: Scaling up, scaling down, and scaling

back (American Nuclear Society’s International Meeting on Mathematical Methods for Nuclear

Applications, September 2001). Available at http://www.cellmatrix.com/entryway/products/

pub/ANS Abstract.html.

[33] L. Durbeck and N. Macias, ‘The cell matrix: An architecture for nanocomputing’, in Nanotechnology, 12 (Institute of Physics Publishing, 2001), 217–230.


[34] L. J. K. Durbek, An approach to designing extremely large, extremely parallel systems, in Conference on

High Speed Computing (Oregon, USA, April 2001).

[35] J. C. Ellenbogen and J. C. Love, ‘Architectures for molecular electronic computers: Logic structures

and an adder designed from molecular electronic diodes’, in IEEE, 3 88 (2000), 386–426.

[36] W. Evans and N. Pippenger, ‘On the maximum tolerable noise for reliable computation by formu-

las’, IEEE Transactions on Information Theory 44 (1998), no. 3, 1299–1305.

[37] D. F, S. M. G, and S. R, ‘Fault-tolerance and reconfigurability issues in massively parallel architec-

tures’, in Computer Architecture for Machine Perception (IEEE Computer Society Press, Los Alamitos,

CA, 1995), 3409.

[38] M. Forshaw, K. Nikolic, and A. Sadek, ‘EC answers project (Melari 28667)’, Third Year Report.

Available at http://ipga.phys.ucl.ac.uk/research/answers.

[39] S. C. Goldstein and M. Budiu, ‘Nanofabrics: Spatial computing using molecular electronics’, in

Annual International Symposium on Computer Architecture (ISCA) (July 2001), 178–191.

[40] S. C. Goldstein and D. Rosewater, ‘Digital logic using molecular electronics’, in IEEE International

Solid-State Circuits Conference (ISSCC) (San Francisco, CA, Feb 2002), 204–205.

[41] P. Graham and M. Gokhale, ‘Nanocomputing in the Presence of Defects and Faults: A Survey’,

in Nano, quantum and molecular computing: Implications to high level design and validation (Kluwer

Academic Publishers, June 2004), 39–72. To be Published.

[42] Andre DeHon, ‘Law of large numbers system design’, in Nano, quantum and molecular computing:

Implications to high level design and validation (Kluwer Academic Publishers, June 2004), 213–244.

To be Published.

[43] J. C. Huie, ‘Guided molecular self-assembly: a review of recent efforts’, Smart Materials and

Structures 12 (2003), no. 2, 264–271.

[44] R. C. Merkle, ‘Molecular manufacturing: adding positional control to chemical synthesis’, Chemical

Design Automation News 8 (1993), no. 9 and 10, p 1.

[45] Alvin W. Drake, Fundamentals of Applied Probability Theory McGraw-Hill (1988).

[46] S. C. H, ‘A new statistical approach for fault-tolerant VLSI systems’, in Int. Symp. on Fault-Tolerant Computing (IEEE Computer Society Press, Los Alamitos, CA, 1992), 356–65.


[47] J. Han and P. Jonker, ‘A system architecture solution for unreliable nanoelectronic devices’, IEEE

Transactions on Nanotechnology 1 (2002), 201–208.

[48] J. Han and P. Jonker, ‘A defect- and fault-tolerant architecture for nanocomputers’, Nanotechnology (2003), 224–230. IOP Publishing Ltd.

[49] H. Hansson and B. Jonsson, ‘A logic for reasoning about time and probability’, Formal Aspects of

Computing 6 (1994), no. 5, 512–535.

[50] J. Heath, P. Kuekes, G. Snider, and R. Williams, ‘A defect tolerant computer architecture: Oppor-

tunities for nanotechnology’, Science 80 (1998), 1716–1721.

[51] K. I and K. Z, ‘Defect tolerance in VLSI circuits: techniques and yield analysis’, in IEEE, 86 (1998), 1819–36.

[52] J.Yedidia, W.Freeman, and Y.Weiss, Understanding belief propagation and its generalizations, in Proc.

International Joint Conference on AI (2001). Distinguised Lecture.

[53] S. Knapp, Fpga lookup tables build flexible pattern matchers (EDA Expert Column, May 1998). Avail-

able at http://www.optimagic.com/acrobat/pein 0598.pdf.

[54] M. Kwiatkowska, G. Norman, and D. Parker, ‘Prism: Probabilistic symbolic model checker’, in

TOOLS 2002, LNCS 2324 (Springer-Verlag, April 2002), 200–204.

[55] C. Lent, ‘A device architecture for computing with quantum dots’, in Proceedings of the IEEE 85 (April 1997), no. 4, 541.

[56] C. S. Lent, P. D. Tougaw, W. Porod, and G. H. Bernstein, ‘Quantum cellular automata’, Nanotech-

nology 4 (1993), 49–57.

[57] S. Li, Markov random field modeling in computer vision (Springer-Verlag, 1995).

[58] M. Mishra and S. C. Goldstein, ‘Defect Tolerance at the End of the Roadmap’, in Nano, quantum and

molecular computing: Implications to high level design and validation (Kluwer Academic Publishers,

June 2004), 73–108 To be Published.

[59] M. Mishra and S. C. Goldstein, Defect tolerance at the end of the roadmap, in International Test

Conference (ITC) (Charlotte, NC, Sep 30-Oct 2 2003).

[60] K. Nikolic, A. Sadek, and M. Forshaw, ‘Architectures for reliable computing with unreliable

nanodevices’, in Proc. IEEE-NANO’01 (IEEE, 2001), 254–259.


[61] G. Norman, D. Parker, M. Kwiatkowska, and S. Shukla, ‘Evaluating reliability of defect tolerant architecture for nanotechnology using probabilistic model checking’, Tech. Report (Fermat Lab, Virginia Tech, 2003). Available at http://fermat.ece.vt.edu/Publications/pubs/techrep/techrep0314.pdf.

[62] G. Norman, D. Parker, M. Kwiatkowska, and S. Shukla, Evaluating reliability of defect tolerant

architecture for nanotechnology using probabilistic model checking, in IEEE VLSI Design Conference

(IEEE Press, Mumbai, India, January 2004).

[63] J. R. Norris, Markov chains, in Statistical and Probabilistic Mathematics (Cambridge University Press,

October 1998).

[64] C. P. Collier, G. Mattersteig, E. W. Wong, Y. Luo, K. Beverly, J. Sampaio, F. Raymo, J. Stoddart and J. R. Heath, ‘A [2]catenane-based solid state electronically reconfigurable switch’, Science 280 (2000), 1172–5.

[65] N. Pippenger, ‘Reliable computation by formulas in the presence of noise’, IEEE Transactions on

Information Theory 34 (1988), no. 2, 194–197.

[66] R. Feynman, ‘Simulating physics with computers’, International Journal of Theoretical Physics 21

(1982), no. 6/7, 467–488.

[67] R. Feynman, ‘Quantum mechanical computers’, Optics News 11 (1985), 11–20.

[68] C. P. Williams and S. H. Clearwater, Explorations in quantum computing, TELOS (Springer-Verlag

New York, 1998).

[69] L. S. Smith and A. Hamilton, Neuromorphic systems: Engineering silicon from neurobiology, in Progress in Neural Processing 10 (World Scientific, May 1998).

[70] G. E. Moore, ‘Cramming more components onto integrated circuits’, Electronics 38 (1965), no. 8.

[71] D. B. Pollard, Hammersley-Clifford theorem for Markov random fields (2004). Handouts, available at http://www.stat.yale.edu/~pollard/251.spring04/Handouts/Hammersley-Clifford.pdf.

[72] M. Roukes, ‘Nanoelectromechanical systems face the future’, Physics World (2001), 25.

[73] PRISM web page: www.cs.bham.ac.uk/~dxp/prism/.

[74] W. Roscoe, Theory and practice of concurrency (Prentice Hall, 1998).

[75] S. Roy, V. Beiu, and M. Sulieman, Majority multiplexing: Economical redundant fault-tolerant designs

for nano architectures.


[76] S. Roy, V. Beiu, and M. Sulieman, Reliability analysis of some nano architectures.

[77] N. Saxena and E. McCluskey, ‘Dependable adaptive computing systems’, in IEEE Systems, Man,

and Cybernetics (San Diego, CA, Oct 11-14 1998), 2172–2177. The Roar Project.

[78] Semiconductor Industry Association (SIA), International Technology Roadmap for Semiconductors, 2001 edition (Austin, TX: International SEMATECH, 2001). Available at http://public.itrs.net.

[79] K. Mikami, K. Takahashi, T. Nakai, and T. Uchimaru, ‘Asymmetric tandem Claisen-ene strategy for convergent synthesis of (+)-9(11)-dehydroestrone methyl ether: Stereochemical studies on the ene cyclization and cyclic enol ether Claisen rearrangement for steroid total synthesis’, Journal of the American Chemical Society 116 (1994).

[80] Henderson Research Group, School of Chemical Engineering, Georgia Tech, Web Site: http:

//dot.che.gatech.edu/henderson/

[81] D. P. Siewiorek and R. S. Swarz, Reliable computer systems: Design and evaluation (Digital Press,

Burlington, MA, 2nd ed., 1992).

[82] E. J. Siochi, P. T. Lillehei, K. E. Wise, C. Park, and J. H. Rouse, ‘Design and characterization of

carbon nanotube nanocomposites’, Tech. Report (NASA Langley Research Center, Hampton, VA,

2003). Available at http://techreports.larc.nasa.gov/ltrs/PDF/2003/mtg/NASA-2003-4ismn-ejs.pdf.

[83] G. L. Snider, A. O. Orlov, I. Amlani, G. H. Bernstein, C. S. Lent, J. L. Merz, and W. Porod, ‘Quantum-dot cellular automata: Line and majority logic gate’, Japanese Journal of Applied Physics 38 (1999), no. 12B, 7227–7229.

[84] R. Turton, The quantum dot: A journey into the future of microelectronics (Oxford University Press, U.K., 1995).

[85] J. von Neumann, ‘Probabilistic logics and the synthesis of reliable organisms from unreliable components’, Automata Studies (1956), 43–98.

[86] G. Y. Tseng and J. C. Ellenbogen, ‘Toward nanocomputers’, Science 294 (2001), 1293–1294.

[87] L. Lavagno, A. Sangiovanni-Vincentelli, and E. Sentovich, Models of Computation for Embedded

System Design, Website: citeseer.nj.nec.com/lavagno98model.html, 1998.


Vita

Debayan Bhaduri

Personal Data

Date of Birth : July 30, 1978

Marital Status : Single

Visa Status : F-1

Office Address:

FERMAT Research Lab.

340 Whittemore Hall

Blacksburg, VA 24061

Email: [email protected]

URL: http://www.fermat.ece.vt.edu/dbhaduri

Tel: (617) 519-5346

Research Interests

(a) Embedded Systems Design


(b) Formal Verification of Probabilistic Systems

(c) Defect Tolerant Computing for Nano Architectures

(d) Modeling and Validation of Network Protocols

(e) Software Design

(f) High Level Specification of Complex Control Systems

Education

(a) Ph.D., Computer Engineering, (Expected May 2007), Virginia Polytechnic Institute and State University.

(b) M.S., Computer Engineering, (Expected May 2004), Virginia Polytechnic Institute and State University.

M.S. Thesis Title: “Tools and Techniques for Evaluating Reliability Trade-offs for Nano-Architectures”

(c) B.S., Computer Engineering, June 2001, Manipal Institute of Technology, Manipal, India.

Current Work

(a) Applying Probabilistic Approaches for the Analysis of Reliability-Redundancy Trade-offs for Nano-scale Architectures

Applying probabilistic approaches to evaluate reliability measures for different redundancy-based defect-tolerant architectures for nano-scale devices. Developing a MATLAB-based tool (code-named NANOLAB) that implements a novel non-discrete probabilistic model of computation suited to unreliable nano-scale computation. Also developing a probabilistic model checking based tool called NANOPRISM that implements the conventional Boolean model of computation and can automatically evaluate reliability vs. redundancy vs. granularity trade-offs.
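The redundancy-based reliability analysis described above can be illustrated with the textbook closed-form calculation for triple modular redundancy (TMR); this is a simplified sketch for exposition only, not the NANOLAB or NANOPRISM implementation, and the function name is invented here.

```python
# Illustrative sketch only -- not the NANOLAB/NANOPRISM code. Computes the
# closed-form reliability of a triple-modular-redundancy (TMR) unit: three
# replicas, each correct with probability p, feed a majority voter that
# itself works with probability p_voter.

def tmr_reliability(p: float, p_voter: float = 1.0) -> float:
    """Probability that the TMR configuration yields the correct output:
    at least two of the three replicas are correct AND the voter works."""
    majority_ok = p ** 3 + 3 * p ** 2 * (1 - p)
    return p_voter * majority_ok

# With a perfect voter, TMR improves on a single unit only when p > 0.5:
print(tmr_reliability(0.9))   # ~0.972, better than 0.9
print(tmr_reliability(0.4))   # ~0.352, worse than 0.4
```

The crossover at p = 0.5 is the classical observation, going back to von Neumann's multiplexing analysis, that redundancy can hurt when the underlying devices are too unreliable.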

(b) Using Quantum Model of Computation for Reliability Evaluation of Defect Tolerant Nano-architectures


Applying the succinctness and parallelism of Feynman's quantum model of computation to encode multiple possible input patterns into one superposed input, encode circuit computation as unitary evolution operations, and read multiple outputs by un-entangling the superposed output. Computing the probability values of the possible output states for various input probability distributions, and evaluating the reliability of the modeled circuit in terms of the entropy at the entangled output.
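The entropy-as-reliability idea above can be mimicked classically: for a gate subject to noise, the Shannon entropy of the output distribution at each fixed input measures how far the computation is from deterministic, with 0 bits indicating full reliability. The sketch below is a hedged illustration; the bit-flip noise model and function names are assumptions, not the thesis tool's API.

```python
import math

# Hedged illustration of an entropy-based (un)reliability score for a noisy
# NAND gate whose output bit flips with probability eps. The noise model and
# names are illustrative assumptions, not the tool described in the text.

def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli(p) random bit."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def avg_output_entropy_nand(eps: float) -> float:
    """Average, over the four NAND input pairs, of the entropy of the noisy
    output distribution. For a fixed input the output equals the ideal value
    with probability 1 - eps, so each conditional entropy is H(eps);
    0 bits means a fully reliable gate."""
    return sum(binary_entropy(1 - eps) for _ in range(4)) / 4
```

For eps = 0 the score is 0 bits (perfect reliability) and it rises monotonically toward 1 bit at eps = 0.5, where the output carries no information about the computation.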

(c) High Level Specifications of Systems using XUML

Modeling high-level specifications of complex control systems and network protocols in Executable Unified Modeling Language (XUML). Simulating these systems to check for conformance to certain properties, such as deadlock freedom.

Skills

– C, C++, SystemC, MATLAB, Mathematica, Java, JDBC, Winsock API, OPNET, NS2, Ptolemy II, x86 Assembly, JSP, ASP, UML, XUML, PRISM.

– Linux and Microsoft Windows NT administration.

– Languages: English, Hindi and Bengali.

Experience

(a) January 2003 to present: Graduate Research Assistant, Fermat Lab, Virginia Tech., Blacksburg, VA.

– Investigating different probabilistic approaches to evaluate reliability/redundancy trade-offs of defect-tolerant nano architectures. Working on the development of two tools: a MATLAB-based tool called NANOLAB, and a library package for generic fault-tolerant architectures in a probabilistic symbolic model checking based framework called NANOPRISM.

– Developing a tool for evaluating reliability measures of Boolean networks using succinct and

novel quantum models of computation.


– Developed a methodology to automate the conversion of arbitrary MTG (Multi-Threaded Graph) models to SystemC models.

– Developed systematic abstraction mechanisms for RTL models of microprocessors written in SystemC to improve simulation efficiency.

– Applied UML and XUML to design and model complex systems and network protocols, and verified them for conformance to requirement specifications.

(b) May 2003 to August 2003: Summer Intern, BBN Technologies, Boston, MA.

Worked for the DARPA Next Generation MAC (XG) Project. Participated in the development of

an EM spectrum meta-language framework, modeling and design of XG network protocols.

– Aided in expanding the knowledge base for UML.

– Developed a console front-end for the DARPA query language (DQL) tool.

– Explored different XUML tools for project specific network protocol modeling and simulation

requirements.

– Modeled and simulated network protocols using Ptolemy II.

– Developed mapping specifications from Ptolemy II to UML so as to expedite conversion of Ptolemy II models to UML diagrams and vice versa.

– Worked on the different design alternatives of the XG Opportunity Information Dissemination

protocol.

(c) July 2001 to August 2002: Software Engineer, Axes Technologies (a subsidiary of Alcatel USA Inc.), Bangalore, India.

Worked on the Motorola Enhanced Trunk Auditing project for the Alcatel Electronic Mobile Exchange (EMX) 2500/5000. Participated in the implementation and testing of an enhancement to the pre-existing audit feature in the EMX.

– Implementation and testing of the call processing manager and line-trunk manager interfaces.

– Development of new man-machine interfaces (MMI).


Publications

[1] S. Sharad, D. Bhaduri, M. Chandra, H. D. Patel, and S. Syed, Systematic abstraction of microprocessor RTL models to enhance simulation efficiency, in the proceedings of Microprocessor Test and Verification 2003 (2003).

[2] D. Bhaduri and S. K. Shukla, NANOLAB: A tool for evaluating reliability of defect-tolerant nano architectures, in IEEE Computer Society Annual Symposium on VLSI (IEEE Press, Lafayette, Louisiana, February 2004). Available at http://fermat.ece.vt.edu/Publications/pubs/techrep/techrep0309.pdf.

[3] D. Bhaduri and S. K. Shukla, NANOLAB: A tool for evaluating reliability of defect-tolerant nano architectures, in IEEE Transactions on Nanotechnology (2004). Being revised.

[4] D. Bhaduri and S. K. Shukla, NANOPRISM: A tool for evaluating granularity vs. reliability trade-offs in nano architectures, in Great Lakes Symposium on VLSI (ACM, Boston, MA, April 2004). Available at http://fermat.ece.vt.edu/Publications/pubs/techrep/techrep0318.pdf.

[5] D. Bhaduri and S. K. Shukla, ‘Reliability analysis in the presence of signal noise for nano architectures’, to appear in the proceedings of IEEE Conference on Nanotechnology (Munich, Germany, August 2004).

[6] D. Bhaduri and S. K. Shukla, ‘Reliability evaluation of multiplexing based defect-tolerant majority circuits’, to appear in the proceedings of IEEE Conference on Nanotechnology (Munich, Germany, August 2004).

[7] D. Bhaduri and S. K. Shukla, Tools and techniques for evaluating reliability of defect-tolerant nano

architectures, in International Joint Conference on Neural Networks (IEEE Press, Budapest, Hungary,

July 2004).

[8] D. Bhaduri and S. K. Shukla, ‘Tools and techniques for evaluating reliability trade-offs for nano-architectures’, in Nano, Quantum and Molecular Computing: Implications to High Level Design and Validation (Kluwer Academic Publishers, June 2004), 157–212. To be published.

[9] D. Bhaduri and S. K. Shukla, ‘Using quantum model of computation for reliability evaluation of defect tolerant nano-architectures’, to appear in the proceedings of IEEE Conference on Nanotechnology (Munich, Germany, August 2004).


Selected Projects

(a) Implementation of a Markov Random Field based model of computation for evaluating reliability

measures of defect-tolerant architectures.

(b) Development of generic libraries for redundancy based architectural configurations in a probabilistic

model checking framework.

(c) Performance analysis of a wide area network model using OPNET.

(d) Implementation of an HTTP/1.1 server using Winsock libraries.

(e) Implementation of a peer-to-peer resource discovery application based on IP Multicast using Winsock libraries.

(f) Automated conversion of arbitrary MTG (Multi-Threaded Graph) models to SystemC models.

(g) Implementation of real-time scheduling algorithms such as RMA, EDF, and DASA in the uClinux kernel.

(h) Implementation of a real-time air traffic control system using POSIX system calls.

(i) Implementation of an online computer-adaptive testing agent using JDBC, JSP, and ASP.