
University of California

Santa Cruz

Comprehensive Fault Diagnosis of Combinational Circuits

A dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

Computer Engineering

by

David B. Lavo

September 2002

The Dissertation of David B. Lavo is approved:

Professor Tracy Larrabee, Chair

Professor F. Joel Ferguson

Professor David P. Helmbold

Robert C. Aitken, Ph.D.

Frank Talamantes

Vice Provost & Dean of Graduate Studies

Copyright © by

David B. Lavo

2002

Contents

List of Figures
List of Tables
Abstract
Acknowledgements
Chapter 1. Introduction
Chapter 2. Background
2.1 Types of Circuits
2.2 Diagnostic Data
2.3 Fault Models
2.4 Fault Models vs. Algorithms: A Short Tangent into a Long Debate
2.5 Diagnostic Algorithms
2.5.1 Early Approaches and Stuck-at Diagnosis
2.5.2 Waicukauski & Lindbloom
2.5.3 Stuck-At Path-Tracing Algorithms
2.5.4 Bridging fault diagnosis
2.5.5 Delay fault diagnosis
2.5.6 IDDQ diagnosis
2.5.7 Recent Approaches
2.5.8 Inductive Fault Analysis
2.5.9 System-Level Diagnosis
Chapter 3. A Deeper Understanding of the Problem: Developing a Fault Diagnosis Philosophy
3.1 The Nature of the Defect is Unknown
3.2 Fault Models are Hopelessly Unreliable
3.3 Fault Models are Practically Indispensable
3.4 With Fault Models, More is Better
3.5 Every Piece of Data is Valuable
3.6 Every Piece of Data is Possibly Bad
3.7 Accuracy Should be Assumed, but Precision Should be Accumulated
3.8 Be Practical
Chapter 4. First Stage Fault Diagnosis: Model-Independent Diagnosis
4.1 SLAT, STAT, and All That
4.2 Multiplet Scoring
4.3 Collecting and Diluting Evidence
4.4 “A Mathematical Theory of Evidence”
4.5 Turning Evidence into Scored Multiplets
4.6 Matching Simple Failing Tests: An Example
4.7 Matching Passing Tests
4.8 Matching Complex Failures
4.9 Size is an Issue
4.10 Experimental Results – Simulated Faults
4.11 Experimental Results – FIB Defects
Chapter 5. Second Stage Fault Diagnosis: Implication of Likely Fault Models
5.1 An Old, but Still Valid, Debate
5.2 Answers and Compromises
5.3 Finding Meaning (and Models) in Multiplets
5.4 Plausibility Metrics
5.5 Proximity Metrics
5.6 Experimental Results – Multiplet Classification
5.7 Analysis of Multiple Faults
5.8 The Advantages of (Multiplet) Analysis
Chapter 6. Third Stage Fault Diagnosis: Mixed-Model Probabilistic Fault Diagnosis
6.1 Drive a Little, Save a Lot: A Short Detour into Inexpensive Bridging Fault Diagnosis
6.1.1 Stuck with Stuck-at Faults
6.1.2 Composite Bridging Fault Signatures
6.1.3 Matching and (Old Style) Scoring with Composite Signature
6.1.4 Experimental Results with Composite Bridging Fault Signatures
6.2 Mixed-model Diagnosis
6.3 Scoring: Bayes decision theory
6.4 The Probability of Model Error ...
6.5 ... Vs. Acceptance Criteria
6.6 Stuck-at scoring
6.7 0th-Order Bridging Fault Scoring
6.8 1st-Order Bridging Fault Scoring
6.9 2nd-Order Bridging Fault Scoring
6.10 Expressing Uncertainty with Dempster-Shaffer
6.11 Experimental results – Hewlett-Packard ASIC
6.12 Experimental results – Texas Instruments ASIC
6.13 Conclusion
Chapter 7. IDDQ Fault Diagnosis
7.1 Probabilistic Diagnosis, Revisited
7.2 Back to Bayes (One Last Time)
7.3 Probabilistic IDDQ Diagnosis
7.4 IDDQ Diagnosis: Pre-Set Thresholds
7.5 IDDQ Diagnosis: Good-Circuit Statistical Knowledge
7.6 IDDQ Diagnosis: Zero Knowledge
7.7 A Clustering Example
7.8 Experimental Results
Chapter 8. Small Fault Dictionaries
8.1 The Unbearable Heaviness of Unabridged Dictionaries
8.2 Output-Compacted Signatures
8.3 Diagnosis with Output Signatures
8.4 Objects in Dictionary are Smaller Than They Appear
8.5 What about Unmodeled Faults?
8.6 An Alternative to Path Tracing?
8.7 Clustering Output Signatures
8.8 Clustering Vector Signatures & Low-Resolution Diagnosis
Chapter 9. Conclusions and Future Work
Bibliography

List of Figures

Figure 2.1. Example of pass-fail fault signatures.
Figure 2.2. Example of indexed and bitmapped full-response fault signatures.
Figure 4.1. Simple per-test diagnosis example.
Figure 4.2. An example belief function.
Figure 4.3. Another belief function.
Figure 4.4. The combination of two belief functions.
Figure 4.5. Example showing the combination of faults.
Figure 4.6. A third test result is combined with the results from the previous example.
Figure 4.7. Example test results with matching faults.
Figure 4.8. Combination of evidence from the first two tests.
Figure 4.9. A-sa-1 will likely fail on many more vectors than will B-sa-1.
Figure 4.10. Example of constructing a set of possibly-failing outputs for a multiplet.
Figure 4.11. Multiplets (A,B), (A,B,C) and (A,B,D) explain all test results, but (A,B) is smaller and so preferred.
Figure 4.12. The choice of best multiplet is difficult if (A) predicts additional failures but (B, C) does not.
Figure 6.1. The composite signature of X bridged to Y with match restrictions (in black) and match requirements (labeled R).
Figure 7.1. IDDQ results for 100 vectors on 1 die (Sematech experiment).
Figure 7.2. Assignment of a binary p̂(A|O) for the ideal case of a fixed IDDQ threshold.
Figure 7.3. Assignment of a linear p̂(A|O) with a fixed IDDQ threshold.
Figure 7.4. Assignment of normally-distributed p̂(O|A) and p̂Ø(O|A).
Figure 7.5. Determining a pass threshold based on an assumed distribution and the minimum-vector measured IDDQ.
Figure 7.6. The same data given in Figure 7.1, with the test vectors ordered by IDDQ magnitude.
Figure 7.7. Estimating p̂Ø(O|A) and p̂(O|A) as normal distributions of clustered values.
Figure 7.8. Full data set of 196 ordered IDDQ measurements.
Figure 7.9. Division of the ordered measurements into clusters.
Figure 8.3. A simple example of clustering by subsets of outputs.

List of Tables

Table 4.1. Results from scoring and ranking multiplets on some simulated defects.
Table 4.2. Fastscan and iSTAT results on TI FIB experiments: 2 stuck-at faults, 14 bridges.
Table 5.1. Results from correlating top-ranked multiplets to different fault models.
Table 6.1. Set of likely effects that can invalidate composite bridging fault predictions.
Table 6.2. Diagnosis results for round 1 of the experiments: twelve stuck-at faults.
Table 6.3. Diagnosis results for round 2 of the experiments: nine bridging faults.
Table 6.4. Diagnosis results for round 3 of the experiments: four open faults.
Table 6.5. Diagnosis results for TI FIB experiments: 2 stuck-at faults, 14 bridges.
Table 7.1. Results on Sematech defects.
Table 8.1. Size of top-ranked candidate set (in faults) and total number of signature bits.
Table 8.2. Size of top-ranked candidate set (in faults) and total number of signature bits.
Table 8.3. Output-compacted signature sizes adjusted for repeated output signatures.
Table 8.4. Success rate for bridging fault diagnosis using stuck-at fault candidates.
Table 8.5. Top-ranked candidate set size and signature bits for pass-fail and output-compacted (alone) signatures.
Table 8.6. Diagnostic results when output-compacted signatures are clustered down to 1000 bits each.
Table 8.7. Diagnostic results for clustering (PF+OC) signatures down to 100 bits total.

Abstract

Comprehensive Fault Diagnosis of Combinational Circuits

by

David B. Lavo

Determining the source of failure in a defective circuit is an important but difficult task. Important, since finding and fixing the root cause of defects can lead to increased product quality and greater product profitability; difficult, because the number of locations and variety of mechanisms whereby a modern circuit can fail are increasing dramatically with each new generation of circuits.

This thesis presents a method for diagnosing faults in combinational VLSI circuits. While it consists of several distinct stages and specializations, this method is designed to be consistent with three main principles: practicality, probability and precision. The proposed approach is practical, as it uses relatively simple modeling and algorithms, and limited computation, to enable diagnosis in even very large circuits. It is also probabilistic, imposing a probability-based framework to resist the inherent noise and uncertainty of fault diagnosis, and to allow the combined use of multiple fault models, algorithms, and data sets towards a single diagnostic result. Finally, it is precise, using an iterative approach to move from simple and abstract fault models to complex and specific fault behaviors.

The diagnosis system is designed to address both the initial stage of diagnosis, when nothing is known about the number or types of faults present, and end-stage diagnosis, in which multiple arbitrarily-specific fault models are applied to reach a desired level of diagnostic precision. It deals with both logic fails and quiescent current (IDDQ) test failures. Finally, this thesis addresses the problem of data size in dictionary-based diagnosis, and in doing so introduces the new concept of low-resolution fault diagnosis.

Acknowledgements

Among the people who have contributed to this work, I would first like to thank my co-authors on various publications: Ismed Hartanto, Brian Chess, Tracy Larrabee, Joel Ferguson, Jon Colburn, Jayashree Saxena, and Ken Butler. Their contributions to this work, both in its exposition and execution, have been invaluable.

I would also like to thank those people who have taken the time to provide advice, guidance, and insight into the issues involved in this research. These people include Rob Aitken, David Helmbold, Haluk Konuk, Phil Nigh, Eric Thorne, Doug Williams, Paul Imthurn and John Bruschi.

And while they have already been mentioned, two people deserve special acknowledgement for their remarkable dedication to seeing this work completed. The first is Tracy Larrabee, my advisor, who managed to provide both the constant encouragement and the extraordinary patience that this research required. The other is Rob Aitken, who believed enough in the work to encourage and sponsor it, in a variety of ways, throughout the many years it took to complete.

While many people have believed in this work, and given their time and support to help me complete it, no one has believed as strongly, helped so much, or is owed as much as my wife, Elizabeth. I am very happy to have completed this work, and even happier to be able to dedicate this dissertation to her.

Chapter 1.

Introduction

Ensuring the high quality of integrated circuits is important for many reasons, including high production yield, confidence in fault-free circuit operation, and the reliability of delivered parts. Rigorous testing of circuits can prevent the shipment of defective parts, but improving the production quality of a circuit depends upon effective failure analysis, the process of determining the cause of detected failures. Discovering the cause of failures in a circuit can often lead to improvements in circuit design or manufacturing process, with the subsequent production of higher-quality integrated circuits.

Motivating the quest for improving quality, as with many research efforts, is bottom-line economics. A better quality production process means higher yield and more usable (or sellable) die for the same wafer cost. Fewer defective chips mean lower assembly costs (more assembled boards and products actually work) and lower costs associated with repair or scrap. And a better quality chip or product means a more satisfied customer and a greater assurance of future business. Failure analysis is therefore an essential tool for improving both quality and profitability.

A useful if somewhat strained analogy for the process of failure analysis is criminal detective work: given the evidence of circuit failure, determine the cause of the failure, identifying a node or region that is the source of error. In addition to location, it is useful to identify the mechanism of failure, such as an unintentional short or open, so that remediating changes can be considered in the design or manufacturing process.

Historically, failure analysis has been a physical process; a surprising number of present-day failure analysis teams still use only physical methods to investigate chip failures. The stereotypical failure analysis lab is a team of hard-boiled engineers physically and aggressively interrogating the failing part, using scanning electron microscopes, particle beams, infrared sensors, liquid crystal films, and a variety of other high-tech and high-cost techniques to eventually force a confession out of the silicon scofflaw. The final result, if successful, is the identification of the actual cause of failure for the circuit, along with the requisite gory “crime scene” photograph of the defective region itself: an errant particle, a missing or extra conductor, a disconnected via, and so on.

The sweaty, smoke-filled scene of the failure analysis lab is only part of the story, however, and is usually referred to as root-cause identification. Given the enormous number of circuit devices in modern ICs, and the number of layers in most complex circuits, physical interrogation cannot hope to succeed without first having a reasonable list of suspect locations. Conducting a physical root-cause examination on an entire defective chip is akin to having to conduct a house-to-house search of an entire metropolis, in which every member of the populace is a possible suspect.

It is the job of the other part of failure analysis, usually called fault diagnosis, to do the logical detective work. Based on the data available about the failing part, the purpose of fault diagnosis is to produce an evaluation of the failing chip and a list of likely defect sites or regions. A lot is riding on this initial footwork: if the diagnosis is either inaccurate or imprecise (identifying either incorrect or excessively many fault candidates, respectively), the process of physical fault location will be hampered, resulting in the waste of considerable amounts of time and effort.

Previously-proposed strategies for VLSI fault diagnosis have suffered from a variety of self-imposed limitations. Some techniques are limited to a specific fault model, and many will fail in the face of any unmodeled behavior or unexpected data. Others apply ad hoc or arbitrary scoring mechanisms to rate fault candidates, making the results difficult to interpret or to compare with the results from other algorithms. This thesis presents an approach to fault diagnosis that is robust, comprehensive, extendable, and practical. By introducing a probabilistic framework for diagnostic prediction, it is designed to incorporate disparate diagnostic algorithms, different sets of data, and a mixture of fault models into a single diagnostic result.

The fundamental aspects of fault diagnosis will be discussed in Chapter 2, including fault models, fault signatures, and diagnostic algorithms. Chapter 3 indulges in an examination of the issues inherent in fault diagnosis, and presents a philosophy of diagnosis that will guide the balance of the work. Chapter 4 presents the first stage of the proposed diagnostic approach, which handles the initial condition of indeterminate fault behaviors. Chapter 5 discusses the second stage of diagnosis, in which likely fault models are inferred from the first-stage results. Chapter 6 digresses to a discussion of inexpensive bridging fault models, and introduces the third stage of diagnosis, in which multiple fault models are applied to refine the diagnostic result. Chapter 7 extends the diagnosis system to the topic of IDDQ failures, and Chapter 8 addresses the issue of small fault dictionaries. Chapter 9 presents the conclusions from this research and discusses areas of further work.

Chapter 2.

Background

Here is the problem of fault diagnosis in a nutshell: a circuit has failed one or more tests applied to it; from this failing information, determine what has gone wrong. The evidence usually consists of a description of the tests applied, and the pass-fail results of those tests. In addition, more detailed per-test failing information may be provided. The purpose of fault diagnosis is to logically analyze whatever information exists about the failures and produce a list of likely fault candidates. These candidates may be logical nodes of the circuit, physical locations, defect scenarios (such as shorted or open signal lines), or some combination thereof.

This chapter will give the background of the problem of fault diagnosis. It starts with a description of the types of circuits that will and will not be addressed by the diagnosis methods described in this thesis. It will explain the types of data that make up the raw materials of the diagnosis process, and then introduce the abstractions of defective behavior known as fault models. Finally, it will present the various algorithms and approaches that previous researchers have proposed for various instances of the fault diagnosis problem.

2.1 Types of Circuits

This thesis will only address the problem of fault diagnosis in combinational logic. While nearly all large-scale modern circuits are sequential, meaning they contain state-holding elements, most are tested in a way that transforms their operation under test from sequential to combinational. This is usually accomplished by implementing scan-based test [AbrBre90], in which all state-holding flip-flops in the circuit are modified so that they can be controlled and observed by shifting data through one or more scan chains. During scan tests, input data is scanned into the flip-flops via the scan chains and other input data is applied to the input pins (or primary inputs) of the circuit. Once these inputs are applied and the circuit has stabilized its response (now fully combinational), the circuit is clocked to capture the results back into the flip-flops, and the data values at the output pins (or primary outputs) of the circuit are recorded. The combination of values at the output pins and the values scanned out of the flip-flops make up the response of the circuit to the test, and these values are compared to the expected response of a good circuit. If there is a mismatch for any test, the circuit is considered defective, and the process of fault diagnosis can begin.

This thesis will not address the diagnosis of failures during tests that consist of multiple clock cycles and therefore involve sequential circuit behavior. So-called functional tests fall under this domain, and are extremely difficult to diagnose due to the mounting complexity of defective behavior across multiple sequential time frames. Another sequential circuit type that is not addressed here is that of memories such as RAMs and ROMs. Unlike the “random” logic of logic gates and flip-flops, however, the “structured” nature of memories makes them especially amenable to simple fault diagnosis. It is usually a simple process to control and observe any word or bit in most memories to determine the location of test failure.

2.2 Diagnostic Data

Part of the data that is involved in fault diagnosis, at least for scan tests, has already been introduced: namely, the input values applied at the circuit input pins and scanned into the flip-flops. The input data for each scan operation, including values driven at input pins, is referred to as the input pattern or test vector. The operation of scanning and applying an input to the circuit and recording its output response is formally called a test, and a collection of tests designed to exercise all or part of the circuit is called a test set. This information, along with the expected output values (determined by prior simulation of the circuit and test set), makes up the test program actually applied to the circuit.

The test program runs on a tester, which can handle either wafers or packaged die, and can apply tests and observe circuit responses. The tester records the actual responses measured at circuit outputs, and any differences between the observed responses and the expected responses are recorded in the tester data log. While it is not the usual default setting during production test, this thesis will assume that the data log information identifies all mismatched responses and not just the first failing response. It is usually a simple matter to re-program a tester from a default “stop-on-first-fail” mode to a diagnostic “record-all-fails” mode once a die or chip has been selected for failure analysis.

The response of a defective circuit to a test set is referred to as the observed faulty behavior, and its data representation is commonly known as a fault signature. For scan tests, the fault signature is usually represented in one of two common forms. The first, the pass-fail fault signature, reports the result for each test in the test set, whether a pass or a fail. Typically the fault signature consists either of the indices of the failing tests, or of a bit vector for the entire test set in which the failing tests (by convention) are represented as 1s and the passing tests by 0s. Figure 2.1, below, gives an example of a pass-fail fault signature for a simple case of 10 tests, out of which 4 failing tests are recorded.

Results for 10 total tests:

  1: Pass    2: Pass    3: Pass    4: Pass    5: Fail
  6: Pass    7: Fail    8: Fail    9: Pass   10: Fail

Pass-fail signatures:

  By index: 5, 7, 8, 10
  By bit vector: 0000101101

Figure 2.1. Example of pass-fail fault signatures.
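As a concrete illustration of these two representations, the short Python sketch below converts between the indexed and bit-vector forms of a pass-fail signature. The function names and data layout are illustrative only, not taken from any particular diagnosis tool.

def indices_to_bitvector(failing_tests, num_tests):
    """Build a pass-fail bit vector (1 = fail) from 1-based failing-test indices."""
    failing = set(failing_tests)
    return "".join("1" if t in failing else "0" for t in range(1, num_tests + 1))

def bitvector_to_indices(bitvector):
    """Recover the 1-based failing-test indices from a pass-fail bit vector."""
    return [i + 1 for i, bit in enumerate(bitvector) if bit == "1"]

# The example of Figure 2.1: tests 5, 7, 8, and 10 fail out of 10 total tests.
failing_tests = [5, 7, 8, 10]
bits = indices_to_bitvector(failing_tests, 10)
assert bits == "0000101101"
assert bitvector_to_indices(bits) == failing_tests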

The second type of fault signature is the full-response fault signature, which reports not only what tests failed but also at which outputs (flip-flops and primary outputs) the discrepancies were observed. As with test vectors, circuit outputs are usually indexed to facilitate identification. Figure 2.2 gives another simple example of indexed and bitmapped full-response fault signatures. Each failing vector number in the indexed signature is augmented with a list of failing outputs. In the bitmapped signature, a second dimension has been added for failing outputs.

Indexed full-response signature:

  5: 2, 4
  7: 3, 4
  8: 7
  10: 2, 7

Bitmapped full-response signature (rows are outputs 1-10, columns are vectors 1-10):

              Vectors
              1  2  3  4  5  6  7  8  9  10
  Outputs  1  0  0  0  0  0  0  0  0  0  0
           2  0  0  0  0  1  0  0  0  0  1
           3  0  0  0  0  0  0  1  0  0  0
           4  0  0  0  0  1  0  1  0  0  0
           5  0  0  0  0  0  0  0  0  0  0
           6  0  0  0  0  0  0  0  0  0  0
           7  0  0  0  0  0  0  0  1  0  1
           8  0  0  0  0  0  0  0  0  0  0
           9  0  0  0  0  0  0  0  0  0  0
          10  0  0  0  0  0  0  0  0  0  0

Figure 2.2. Example of indexed and bitmapped full-response fault signatures.
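To make the two full-response forms concrete, the following Python sketch stores the indexed signature of Figure 2.2 as a mapping from failing vector to failing outputs and expands it into the equivalent bitmap; the data structures are chosen for illustration only.

# Indexed full-response signature of Figure 2.2:
# failing vector -> set of failing outputs (both 1-based).
indexed_signature = {5: {2, 4}, 7: {3, 4}, 8: {7}, 10: {2, 7}}

def to_bitmap(signature, num_vectors, num_outputs):
    """Expand an indexed signature into a bitmap with one row per output
    and one column per test vector (1 = observed failure)."""
    bitmap = [[0] * num_vectors for _ in range(num_outputs)]
    for vector, outputs in signature.items():
        for output in outputs:
            bitmap[output - 1][vector - 1] = 1
    return bitmap

bitmap = to_bitmap(indexed_signature, num_vectors=10, num_outputs=10)
assert bitmap[1][4] == 1                     # output 2 fails on vector 5
assert bitmap[6][9] == 1                     # output 7 fails on vector 10
assert sum(sum(row) for row in bitmap) == 7  # seven failing (vector, output) pairs in all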

Scan tests are only a single part of the suite of tests usually applied to a production chip. In another common type of test, the IDDQ test, the circuit is put into a non-switching or static state and its quiescent current draw is measured. If an abnormally high current is measured, a defect is assumed to be the cause and the part is marked for scrap or failure analysis.

The fault signature generated by an IDDQ test set can take one of two forms. The first is the same as the pass-fail signature introduced earlier for scan tests, in which either index numbers or bits are used to represent passing (normal or low IDDQ current) and failing (high current) tests. The second type of signature records an absolute current measurement for each IDDQ test in the form of a real number.
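The two IDDQ signature forms can be illustrated with a brief sketch; the threshold and current values below are arbitrary placeholders, not recommendations.

def iddq_pass_fail_signature(measurements, threshold):
    """Reduce real-valued IDDQ measurements (one per test) to a pass-fail
    signature: the 1-based indices of tests whose current exceeds the threshold."""
    return [i + 1 for i, current in enumerate(measurements) if current > threshold]

# Real-valued signature: one quiescent-current measurement (in microamps) per IDDQ test.
measurements = [1.2, 1.3, 85.0, 1.1, 90.5, 1.4]
print(iddq_pass_fail_signature(measurements, threshold=10.0))  # -> [3, 5]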

This thesis will address fault diagnosis for both scan and IDDQ tests, as these are the two major types of comprehensive tests performed on commercial circuits. Other tests, such as those for memories, pads, or analog blocks, cover a much more limited area and require more specialized (often manual) diagnostics. Functional test failures, as mentioned, are especially difficult to diagnose, but fortunately (at least for fault diagnosis) functional tests are gradually being eclipsed by scan-based tests. Diagnosis for Built-In-Self-Test (BIST) [AbrBre90], in which on-chip circuitry is used to apply and capture test patterns, will not be directly addressed here. However, many of the diagnosis techniques presented in this thesis can be applied to BIST results if the data can be made available for off-chip processing. Finally, the issue of timing or speed test diagnostics will be addressed only briefly and remains a subject for further research.

2.3 Fault Models

The ultimate targets of both testing and diagnosis are physical defects. In the logical domain of testing and diagnostic algorithms, a defect is represented by an abstraction known as a logical fault, or simply fault. A description of the behavior and assumptions constituting a logical fault is referred to as a fault model. Test and diagnosis algorithms use a fault model to work with the entire set of fault instances in a target circuit.

The most popular fault model for both testing and diagnosis is the single stuck-at fault model, in which a node in the circuit is assumed to be unable to change its logic value. The stuck-at model is popular due to its simplicity, and because it has proved to be effective both in providing test coverage and diagnosing a limited range of faulty behaviors [JacBis86]. As an abstract representation of a class of defects, the stuck-at fault is commonly used to represent the defect of a circuit node shorted to either power or ground. It is commonly used, however, to both detect and diagnose a wide range of other defect types, as will be seen in the rest of this thesis.

Perhaps the second most popular fault model is the bridging fault model. Used to represent an electrical short between signal lines, in its most common form the model describes a short between two gate outputs. Most bridging fault models ignore bridge resistance, and instead focus on the logical behavior of the fault. These models include the wired-OR bridging fault, in which a logic 1 on either bridged node results in the propagation of a logic 1 downstream from both nodes; the wired-AND bridging fault, which propagates a 0 if either node is 0; and the dominance bridging fault, in which one gate is much stronger than the other and is assumed to always drive its logic value onto the other bridged node. Other bridging fault models have been developed of much greater sophistication [AckMil91, GrePat92, MaxAit93, Rot94, MonBru92], taking into account gate drive strengths, various bridge resistances, and even more than two bridged nodes, but they are not used as much due to their computational complexity during large-scale test generation or fault diagnosis.
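The logical behavior of these simple bridging fault models can be captured in a few lines of Python; this is a sketch of the textbook definitions only, since, as noted above, real shorts also depend on drive strengths and bridge resistance.

def wired_and(a, b):
    """Wired-AND bridge: a 0 on either node pulls both bridged nodes to 0."""
    return a & b, a & b

def wired_or(a, b):
    """Wired-OR bridge: a 1 on either node pulls both bridged nodes to 1."""
    return a | b, a | b

def dominance(a, b):
    """Dominance bridge: the stronger driver (node a here) imposes its value on node b."""
    return a, a

# Resulting (node a, node b) values for each input combination.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", wired_and(a, b), wired_or(a, b), dominance(a, b))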

Bridging fault models have become popular due to an increasing attention to defects in the interconnect of modern chips. Similarly, there has been a commensurate rise in interest in open fault models, which attempt to model electrical opens, breaks, and disconnected vias. Since opens can result in state-holding, intermittent, and pattern-dependent fault effects, these models have generally been more complex and less widely used for both testing and diagnosis.

Instead of interconnect faults, several fault models have concentrated on defects in logic gates and transistors. Among these are the transistor-stuck-on and transistor-stuck-off models, which are similar to conventional stuck-at faults. Various intra-gate short models have been proposed to model shorts between transistors in standard-cell logic gates. Many of these models have not enjoyed widespread success simply because the stuck-at model tends to work nearly as well for generating effective tests at much lower complexity.

Other fault models have been developed to represent timing-related defects, including the transition fault model and the path-delay fault model. The first assumes that a defect-induced delay is introduced at a single gate input or output, while the second spreads the total delay along a circuit path from input to output.

2.4 Fault Models vs. Algorithms: A Short Tangent into a Long Debate

The previous section briefly introduced a wide variety of fault models, from the simple and abstract stuck-at model to more complicated, specific, and realistic fault models. The stuck-at fault model has been generally dominant for several decades, and continues to be dominant today, both for its simplicity and its demonstrated utility. But the general trend, in the field of testing at least, has been a tentative shift away from sole reliance on the stuck-at model towards more realistic fault models that will facilitate the generation of better tests for more complicated defects. The question is, then, what models are best for fault diagnosis?

A paper by Aitken and Maxwell [AitMax95] identifies two main components to any fault diagnosis approach. The first is the choice of fault model, and the second is the algorithm used to apply the fault model to the diagnostic problem. As the authors explain, the effectiveness of a diagnostic technique will be compromised by the limitations of the fault model it employs. So, for example, a diagnosis tool that relies purely on the stuck-at fault model can never completely or correctly diagnose a signal-line short or open, simply because it is looking for one thing while another has occurred.

The authors go on to explain that the role of the diagnosis algorithm, then, has evolved to try to overcome the limitations of the chosen fault model. This will be illustrated in the next section of this chapter in an overview of previous diagnosis research; a common technique is to use the stuck-at model but adjust the algorithm to anticipate bridging-fault behaviors. But, the authors also opened a debate, which remains active to this day: is it better for a diagnosis technique to use more realistic fault models with a simple algorithm, or to use simple and abstract models with a more clever and robust algorithm?

As with any interesting debate, there are good arguments on both sides. The argument for simple fault models is that they are more practical to apply to large circuits and more flexible for a wide variety of defect behaviors. The argument for better models, taken by the authors in their original paper, is that good models are necessary for both diagnostic accuracy and precision. Simple models do not provide sufficient accuracy because defect behavior is often complex, more complex than even clever algorithms anticipate. They also do not result in sufficient precision because they do not provide enough specificity (e.g. “look for a short at this location”) to guide effective physical failure analysis.

This thesis will attempt to resolve this debate as it presents a new diagnostic approach. The next section outlines how previous researchers have addressed the diagnostic problem, and notes how each participant has taken their place in the model vs. algorithm debate.

2.5 Diagnostic Algorithms

This section will cover the diagnosis algorithms proposed by previous researchers, in a roughly chronological order. The general trend, as will become clear, has been from simple approaches that target simple defects, to more complex algorithms that try to address more complicated defect scenarios.

Diagnosis algorithms have traditionally been classified into two types, according to how they approach the problem. The first and by far the most popular approach is called cause-effect fault diagnosis [AbrBre90]. A cause-effect algorithm starts with a particular fault model (the “cause”), and compares the observed faulty behavior (the “effect”) to simulations of that fault in the circuit. A simulation of any fault instance produces a fault signature, or a list of all the test vectors and circuit outputs by which a fault is detected, and which can be in one of the signature formats described earlier. The process of cause-effect diagnosis is therefore one of comparing the signature of the observed faulty behavior with a set of simulated fault signatures, each representing a fault candidate. The resulting set of matches constitutes a diagnosis, with each algorithm specifying what is acceptable as a “match”.

The main job of a cause-effect algorithm is to perform this matching between simulated candidate and observed behavior. The general historical trend has been from very simple or exact matching, where the defect is assumed to correspond very closely to the fault model, to more complicated matching and scoring schemes that attempt to deal with a range of defect types and unmodeled behavior.

A cause-effect algorithm is characterized by the choice of a particular fault model before any analysis of the actual faulty behavior is performed. A cause-effect algorithm can further be classified as static, in which all fault simulation is done ahead of time and all fault signatures stored in a database called a fault dictionary; or, it can be dynamic, where simulations are performed only as needed.
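A minimal sketch of static cause-effect matching against a precomputed fault dictionary of pass-fail signatures is given below; the fault names, signatures, and mismatch tolerance are hypothetical, and real matching schemes are usually more elaborate, as described above.

def hamming_distance(sig_a, sig_b):
    """Number of tests on which two pass-fail bit-vector signatures disagree."""
    return sum(1 for a, b in zip(sig_a, sig_b) if a != b)

def cause_effect_diagnosis(observed, dictionary, max_mismatch=0):
    """Return candidate faults whose simulated signature matches the observed
    behavior within max_mismatch bits (exact matching when max_mismatch is 0),
    best matches first."""
    matches = []
    for fault, simulated in dictionary.items():
        distance = hamming_distance(observed, simulated)
        if distance <= max_mismatch:
            matches.append((distance, fault))
    return [fault for _, fault in sorted(matches)]

# Toy dictionary of simulated stuck-at signatures over 10 tests.
fault_dictionary = {
    "n1 stuck-at-0": "0000101101",
    "n2 stuck-at-1": "0000101100",
    "n3 stuck-at-0": "1110000000",
}
print(cause_effect_diagnosis("0000101101", fault_dictionary))     # exact matches only
print(cause_effect_diagnosis("0000101101", fault_dictionary, 1))  # tolerate one mismatched test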

The opposite approach, and the second classification of diagnosis algorithms, is called (not surprisingly) effect-cause fault diagnosis [AbrBre80, RajCox87]. These algorithms attempt the common-sense approach of starting from what has gone wrong on the circuit (the fault “effect”) and reasoning back through the logic to infer possible sources of failure (the “cause”). Most commonly the cause suggested by these algorithms is a logical location or area of the circuit under test, not necessarily a failure mechanism.

Most effect-cause methods have taken the form of path-tracing algorithms. They use assumptions about the propagation and sensitization of candidate faults to traverse a circuit netlist, usually identifying a set of fault-free lines and thereby implicating other logic that is possibly faulty.

Effect-cause diagnosis methods have several advantages. First, they don't incur the often-significant overhead of simulating and storing the responses of a large set of faults. Second, they can be constructed to be general enough to handle, at least implicitly, the presence of multiple faults and diffuse fault behavior. This is an advantage over most other diagnosis strategies that rely heavily on a single-fault assumption. The most common disadvantage of effect-cause diagnosis algorithms is significant inherent imprecision. Most are conservative in their inferences to avoid eliminating any candidate logic, but this usually leads to a large implicated area. Also, since a pure effect-cause algorithm doesn't use fault models, it necessarily cannot provide a candidate defect mechanism (such as a bridge or open) for consideration.

In fact, while most effect-cause algorithms claim to be “fault-model-independent”, this is a difficult claim to justify. Existing effect-cause algorithms implicitly make assumptions about fault sensitization, propagation, or behavior that are impossible to distinguish from classic fault modeling. (Usually, the implicit model is the stuck-at fault model.) This is understandable: it is the job of a diagnosis algorithm to make inferences about the underlying defect, but it is difficult to do so without some assumptions about faulty behavior, which in turn is difficult without some fault modeling.

The following sections present algorithms for VLSI diagnosis proposed by previous researchers, from the early 1980s to the present day. In general, the earliest algorithms have targeted solely stuck-at faults and associated simple defects, while the later and more sophisticated algorithms have used more detailed fault models and targeted more complicated defects.

2.5.1 Early Approaches and Stuck-at Diagnosis

Many early systems of VLSI diagnosis, such as Western Electric Company's DORA [AllErv92] and an early approach of Teradyne, Inc. [RatKea86], attempted to incorporate the concept of cause-effect diagnosis with a previous-generation physical method called guided-probe analysis. Guided-probe analysis employed a physical voltage probe and feedback from an analysis algorithm to intelligently select accessible circuit nodes for evaluation. The Teradyne and DORA techniques attempted to supplement the guided-probe analysis algorithm with information from stuck-at fault signatures.

Both systems used relatively advanced (for their time) matching algorithms. The DORA system used a nearness calculation that the authors describe as a fuzzy match. The Teradyne system employed the concept of prediction penalties: the signature of a candidate fault is considered a prediction of some faulty behavior, made up of (test vector, output) pairs. When matching with the actual observed behavior, the Teradyne algorithm scored a candidate fault by penalizing for each pair found in the stuck-at signature but not found in the observed behavior, and penalizing for each pair found in the observed behavior but not the stuck-at signature. These have commonly become known as misprediction and non-prediction penalties, respectively. A related Teradyne system [RicBow85] introduced the processing of possible-detects, or outputs in stuck-at signatures that have unknown logic values, into the matching process.

While other early and less-sophisticated algorithms applied stuck-at fault signatures directly, expecting exact matches to simulated behaviors, it became obvious to the testing community that most failures in CMOS circuits do not behave exactly like stuck-at faults. Stuck-at diagnosis algorithms responded by increasing the complexity and sophistication of their matching to account for these unmodeled effects. An algorithm proposed by Kunda [Kun93] ranked matches by the size of intersection between signature bits. This stress on minimum non-prediction (misprediction was not penalized) reflects an implicit assumption that unmodeled behavior generally leads to over-prediction: the algorithm does not expect the stuck-at model to be perfect, but any unmodeled behavior will cause fewer actual failures than predicted by simulation. This assumption likely arose from the intuitive expectation that most defects involve a single fault site with intermittent faulty behavior — a not uncommon scenario for many chips that have passed initial tests but failed scan tests, especially after burn-in or packaging. Most authors, however, do not make this assumption explicit or explore its consequences, and an unexamined preference for the fault candidate that “explains the most failures” (regardless of over-prediction) is common to many diagnosis algorithms.

A more balanced approach was proposed by De and Gunda [DeGun95], in which the user can supply relative weightings for misprediction and non-prediction. By modifying traditional scoring with these weightings, the algorithm assigns a quantitative ranking to each stuck-at fault. The authors claim that the method can be used to explicitly target defects that behave similar to but not exactly like the stuck-at model, such as some opens and multiple independent stuck-at faults, but it can diagnose bridging defects only implicitly (by user interpretation). This is perhaps the most general of the simple stuck-at algorithms and is unique for its ability to allow the user to adjust the assumptions about unmodeled behavior that other algorithms make implicitly.
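The misprediction and non-prediction penalties described above, including the user-supplied weightings of De and Gunda, can be sketched as follows; signatures are treated as sets of (vector, output) pairs, and the weights and example data are purely illustrative.

def penalty_score(observed_pairs, predicted_pairs,
                  misprediction_weight=1.0, nonprediction_weight=1.0):
    """Score a candidate by penalizing predicted-but-not-observed failures
    (mispredictions) and observed-but-not-predicted failures (non-predictions).
    Lower scores indicate better matches."""
    mispredictions = len(predicted_pairs - observed_pairs)
    nonpredictions = len(observed_pairs - predicted_pairs)
    return (misprediction_weight * mispredictions
            + nonprediction_weight * nonpredictions)

# Observed failures and one candidate's stuck-at prediction, as (vector, output) pairs.
observed = {(5, 2), (5, 4), (7, 3), (7, 4)}
candidate = {(5, 2), (5, 4), (7, 3), (9, 1)}
print(penalty_score(observed, candidate))            # one misprediction plus one non-prediction
print(penalty_score(observed, candidate, 0.5, 2.0))  # weight non-prediction more heavily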

2.5.2 Waicukauski & Lindbloom

The algorithm developed by Waicukauski and Lindbloom (W&L) [WaiLin89] deserves its own subsection because it has been so pervasive and successful — the most popular commercial tool is based on this algorithm — and also because it introduced several techniques that other algorithms have since adopted.

The W&L algorithm relies solely on stuck-at fault assumptions and simulations, and as such can be best classified as a (dynamic) cause-effect algorithm. It does, however, use limited path-tracing to implicate portions of the circuit and reduce the number of simulations it performs, so it does borrow elements from effect-cause approaches.

The W&L algorithm uses a very simple scoring mechanism, relying mainly on exact matching. But it performs this matching in an innovative way, by matching fault signatures on a per-test basis. Most fault diagnosis algorithms count the number of mismatched bits between the observed behavior and a candidate fault signature across the entire test set. Each bit is a (test vector, output) pair, as in the Teradyne algorithm described earlier, and an intersection is performed between the set of bits in the observed behavior and the set in each candidate fault signature.

In the W&L algorithm, by contrast, each test vector that actually fails on the tester is considered independently. For each failing test, the set of failing outputs is compared with each candidate fault; if a candidate predicts a fail for that test, and the outputs match exactly, then a “match” is declared. Each matching fault candidate is then simulated against the rest of the failing tests, and the candidate that matches the most failing tests (exactly) is retained. All of the matched test results for this candidate are removed from the observed faulty signature, and the process repeats until all failing tests are considered.
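A bare-bones Python rendering of this per-test, greedy-cover matching is sketched below; it is an illustration of the idea only, not the actual W&L implementation, and the candidate signatures are assumed to be available as per-test predictions.

def per_test_greedy_cover(observed, candidates):
    """observed: dict mapping each failing vector to its frozenset of failing outputs.
    candidates: dict mapping each fault to its predicted {vector: failing outputs}.
    Greedily selects candidates that exactly match the most remaining failing tests."""
    unexplained = dict(observed)
    diagnosis = []
    while unexplained:
        best_fault, best_matches = None, []
        for fault, prediction in candidates.items():
            # Tests where the candidate predicts exactly the observed failing outputs.
            matches = [v for v, outs in unexplained.items() if prediction.get(v) == outs]
            if len(matches) > len(best_matches):
                best_fault, best_matches = fault, matches
        if best_fault is None:
            break  # no candidate explains any remaining failing test
        diagnosis.append(best_fault)
        for v in best_matches:
            del unexplained[v]  # matched test results are removed, as in W&L
    return diagnosis, unexplained

observed = {5: frozenset({2, 4}), 7: frozenset({3, 4}), 8: frozenset({7})}
candidates = {
    "a stuck-at-1": {5: frozenset({2, 4}), 7: frozenset({3, 4})},
    "b stuck-at-0": {8: frozenset({7})},
}
print(per_test_greedy_cover(observed, candidates))  # -> (['a stuck-at-1', 'b stuck-at-0'], {})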

Note that this matching algorithm is really a greedy coverage algorithm over the set of failing tests. Since the tests are considered in order, the sequence in which the tests are examined could affect the contents of the final diagnosis when multiple candidates are required to match all of the tests. It should also be noted that the practice of removing test results as they are matched reflects a desire to address multiple simultaneous defects, as well as an assumption that the fault effects from such defects are non-interfering.

The algorithm also conducts a simple post-processing step, in which it classifies the diagnosis by examining the final candidate set. If the diagnosis consists of a single stuck-at fault (with any equivalent faults) that matches all failing tests, it then checks the tests that pass in the observed behavior. If all of these passing test results are also predicted by the stuck-at candidate, the diagnosis is classified as a “Class I” diagnosis, or an exact match with a single stuck-at fault. If the diagnosis consists of a single candidate that matches all failing tests but not all passing tests (e.g. there is some misprediction), then the diagnosis is classified as “Class II”. The authors explain that Class II diagnoses could indicate the presence of an open, an intermittent stuck-at defect, or a dominance bridging fault. Finally, a “Class III” diagnosis consists of multiple stuck-at candidates with possible mispredicted and non-predicted behaviors.

The two most interesting features of the W&L algorithm, the per-test approach and the post-processing analysis, will be discussed further in later sections of this thesis. Overall, the W&L algorithm is interesting not only because it is so commonly used, but also because it raises some interesting theoretical issues.

2.5.3 Stuck-At Path-Tracing Algorithms

The classic effect-cause algorithms are those that rely on path-tracing to implicate portions of the circuit. Examples of these are the approaches suggested by Abramovici and Breuer [AbrBre80] and Rajski and Cox [RajCox87]. While they claim fault-model-independence, these algorithms attempt to identify nodes in the circuit that can be demonstrated to change their logic values (or toggle) during the test set, which amounts to an implicit targeting of stuck-at faults. In fact, these algorithms maintain a stricter adherence to the stuck-at model than the cause-effect algorithms just described, as any intermittent stuck-at defect is not anticipated and would not be diagnosed correctly.

2.5.4 Bridging fault diagnosis

The first evolution of diagnosis algorithms away from the stuck-at model came when they started to address bridging faults explicitly. Some of the stuck-at diagnosis algorithms already presented claim to be able to occasionally diagnose bridging faults, but only fortuitously by addressing limited unmodeled behavior. Perhaps the simplest explicit bridging fault diagnosis algorithm is that proposed by Millman, McCluskey, and Acken (MMA) [MilMcC90], which was a direct transition from stuck-at faults to bridges. The authors introduced the idea of composite bridging-fault signatures, which are created by concatenating the four stuck-at fault signatures for the two bridged nodes. This was a novel way of creating fault signatures without relying on bridging fault simulation, which can be computationally expensive especially if electrical effects are considered. The underlying idea is that the actual behavior of a bridge, for any failing test vector, will be a subset of the behaviors predicted by the four related stuck-at faults. The matching algorithm used is simple subset matching: any candidate whose composite signature contains all the observed failing pairs is considered a match and appears in the final diagnosis.
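The composite-signature construction and subset matching can be sketched in a few lines of Python, with each signature represented here as a set of (vector, output) pairs rather than a literal concatenation; names and data are illustrative.

def composite_signature(stuck_at_signatures, node_x, node_y):
    """Union of the four stuck-at signatures for the two bridged nodes,
    per the MMA composite-signature construction."""
    return (stuck_at_signatures[(node_x, 0)] | stuck_at_signatures[(node_x, 1)]
            | stuck_at_signatures[(node_y, 0)] | stuck_at_signatures[(node_y, 1)])

def subset_match(observed_pairs, composite):
    """MMA matching: a bridge candidate matches if every observed failing
    (vector, output) pair appears somewhere in its composite signature."""
    return observed_pairs <= composite

# Hypothetical stuck-at signatures, keyed by (node, stuck-at value).
sigs = {("x", 0): {(1, 2)}, ("x", 1): {(3, 1)}, ("y", 0): {(4, 5)}, ("y", 1): set()}
comp = composite_signature(sigs, "x", "y")
print(subset_match({(1, 2), (4, 5)}, comp))  # True: every observed fail is predicted
print(subset_match({(2, 2)}, comp))          # False: an observed fail is unpredicted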

A similar approach to the MMA algorithm was taken by Chakravarty and Gong [ChaGon93], whose algorithm did not explicitly create composite signatures but used a matching technique on combinations of stuck-at signatures to create the same result. Both of these bridging-fault diagnosis methods suffer from imprecision, however: the average diagnosis sizes for both are very large, consisting of hundreds or thousands of candidates. The performance of the MMA algorithm was improved significantly by Chess, Lavo, et al. [CheLav95], by classifying vectors in the composite signatures as stronger or weaker predictions of bridging fault behavior, and refining the match scoring appropriately. Other researchers have continued to use and extend the idea of (stuck-at based) composite signatures for various fault models [VenDru00].

A more direct approach to bridging fault diagnosis was suggested by Aitken and Maxwell [AitMax95]. As opposed to the algorithms just described, in which the simple stuck-at fault model is augmented with more-complex algorithms to deal with unmodeled behavior, the authors instead chose to build dictionaries comprised of realistic bridging faults. (A realistic bridging fault is a short that is considered likely to occur in the fabricated circuit based on a signal-line proximity analysis of the circuit artwork.) This is pure cause-effect diagnosis for bridging-faults: the fault candidates are the same faults targeted for diagnosis. The authors report excellent results, both in accuracy and precision.

While there are obvious advantages to this approach, there are also significant disadvantages. The number of realistic two-line bridging faults is significantly larger than the number of single stuck-at faults for a circuit. Since the cost of simulating each of these faults can be expensive, especially if the simulation considers electrical effects, the overall time spent in fault simulation can be prohibitive. In addition, even the best bridging fault simulations may not reflect the behavior of actual shorts, requiring continual validation and refinement of the fault models [Ait95] and possibly the use of a more complex matching algorithm.

Bridging fault diagnosis in general is plagued by the so-called candidate selection problem: there are many more faults in a circuit than can be reasonably considered by any diagnosis algorithm. Even for two-line bridging faults, there are n(n-1)/2 ("n choose 2") possible candidates for a circuit with n signal lines. The Aitken and Maxwell approach got around this problem by considering only realistic bridging faults, but the analysis required for determining the set of realistic faults can itself be impractical. Other methods have been suggested, including one by Lavo et al. [LavChe97] that used a two-stage diagnosis approach, the first stage to identify likely bridges and the second stage to directly diagnose the bridging fault candidates. This thesis will explore the candidate selection problem in more detail in a subsequent chapter.

2.5.5 Delay fault diagnosis

Due to the increasing importance of timing-related defects in high-performance designs, researchers have proposed methods to diagnose timing defects with delay fault models. Due to its simplicity, the transition fault model, in which the excessive delay is lumped at one circuit node, has been preferred. Diagnosis with the path-delay fault model, which considers distributed delay along a path from circuit input to output, has been hampered by the candidate selection problem: there are an enormous number of paths through a modern circuit.

An example of fault diagnosis using the path-delay fault model is the approach suggested by Girard et al. [GirLan92]. The authors use a method called critical path tracing [AbrMen84] to traverse backwards through the circuit from the failing outputs, implicating nodes that transition for each test. In this way it is similar to the effect-cause algorithms described in section 2.5.3, but its decisions at each node are determined by the transition fault model rather than the stuck-at fault model.

2.5.6 IDDQ diagnosis

Aside from logic levels and assertion timing data, people have applied information from other types of tests to diagnose defects. One source of such information is the amount of quiescent current drawn for certain test vectors, or IDDQ diagnosis. The vectors used for IDDQ diagnosis are designed to put the circuit in a static state, in which no logic transitions are occurring, so that a high amount of measured current draw will indicate the likely presence of a defect (such as a short to a power line). An advantage to IDDQ diagnosis is that the defects should have high observability: the measurable fault effects do not have to propagate through many levels of logic to be observed, but are rather measured at the supply pin. The issue of IDDQ observability is a complicated one, however, and will be discussed later in Chapter 7.

Aitken presented a method of diagnosing faults when logic fails and IDDQ fails are measured simultaneously [Ait91], and he later generalized this approach to include fault models for intra-gate and inter-gate shorts [Ait92]. The approach presented by Chakravarty and Liu examines the logic values applied to circuit nodes during failing tests, and attempts to identify pairs of nodes with opposite logic values as possible bridging fault sites [ChaLiu93]. All of these approaches, however, rely on IDDQ measurements that can be definitively classified as either a pass or a fail, which limits their application in some situations.

This limitation is addressed by the application of current signatures [Bur89, GatMal96], in which relative measurements of current across the test set are used to infer the presence of a defect, rather than the absolute values of IDDQ. A diagnosis approach suggested by Gattiker and Maly [GatMal97, GatMal98] attempts to use the presence of certain large differences between current measurements as a sign that certain types of defects are present. This concept was further extended by Thibeault [Thi97], who applied a maximum likelihood estimator to changes in IDDQ measurements to infer defective fault types. These approaches, while more robust, stress the implication of defect type rather than location; the algorithm I propose later in this thesis targets explicit fault instances or locations. It is possible that these two strategies could be combined to further improve resolution, a topic I discuss in Chapter 7.

2.5.7 Recent Approaches

A couple of recently-published papers have suggested diagnosis algorithms that attempt to target multiple defects or fault models. The first, called the POIROT algorithm [VenDru00], diagnoses test patterns one at a time, much like the Waicukauski and Lindbloom algorithm. In addition, it employs stuck-at signatures, composite bridging fault signatures, and composite signatures for open faults on nets with fanout. Its scoring method is rather rudimentary, especially when it compares the scores of different fault models, relying on an interpretation of Occam’s Razor [Tor38] to prefer stuck-at candidates over bridging candidates, and bridging candidates over open faults.

Another algorithm, called SLAT [BarHea01], also uses a per-test diagnosis strategy, and attempts a coverage algorithm over the observed behavior using stuck-at signatures and only exact matching of failing outputs. In both of these ways it is very similar to the W&L algorithm. However, it modifies that algorithm by attempting to build multiple coverings, which it calls multiplets; each multiplet is a set of stuck-at faults that together explain all the perfectly-matched test patterns. Test results that don’t match exactly, and passing patterns, are ignored.
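The multiplet idea can be sketched as a search for small, minimal sets of stuck-at candidates that together cover all the exactly-matched failing tests; this is an illustrative rendering of the concept, not the published SLAT implementation.

from itertools import combinations

def find_multiplets(explained_tests, max_size=3):
    """explained_tests: dict mapping each stuck-at fault to the set of failing
    tests it explains exactly. Returns all minimal fault sets (multiplets) of
    size <= max_size whose union covers every such failing test."""
    all_tests = set().union(*explained_tests.values())
    faults = list(explained_tests)
    multiplets = []
    for size in range(1, max_size + 1):
        for combo in combinations(faults, size):
            covered = set().union(*(explained_tests[f] for f in combo))
            if covered == all_tests:
                # Keep only minimal multiplets: skip supersets of one already found.
                if not any(set(m) < set(combo) for m in multiplets):
                    multiplets.append(combo)
    return multiplets

explains = {"A": {1, 2}, "B": {3}, "C": {2, 3}, "D": {3}}
print(find_multiplets(explains))  # -> [('A', 'B'), ('A', 'C'), ('A', 'D')]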

Because they explicitly target multiple faults and complex fault behaviors, the SLAT and the POIROT algorithms are interesting for application to an initial pass of fault diagnosis, when little is known about the underlying defects. These algorithms, in addition to W&L, will be discussed further in Chapter 4 of this thesis, which addresses initial-stage fault diagnosis.

2.5.8 Inductive Fault Analysis

The diagnosis techniques presented so far do not use physical layout information to diagnose faults. Intuitively, however, identifying a fault as the cause of a defect has much to do with the relative likelihood of certain defects occurring in the actual circuit. Inductive Fault Analysis (IFA) [SheMal85] uses the circuit layout to determine the relative probabilities of individual physical faults in the fabricated circuit.

Inductive fault analysis uses the concept of a spot defect (or point defect), which is an area of extra or missing conducting material that creates an unintentional electrical short or break in a circuit. As these spot defects often result in bridge or open behaviors, inductive fault analysis can provide a fault diagnosis of sorts: an ordered list of physical faults (bridges or opens) that are likely to occur, in which the order is defined by the relative probability of each associated fault. The relative probability of a fault is expressed as its weighted critical area (WCA), defined as the physical area of the layout that is sensitive to the introduction of a spot defect, multiplied by the defect density for that defect type. For example, two circuit nodes that run close to one another for a relatively long distance provide a large area for the introduction of a shorting point defect; the resulting large WCA value indicates that a bridging fault between these nodes is considered relatively likely.
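The weighted-critical-area ranking just described amounts to a simple calculation, sketched below; the fault names, areas, and defect densities are made up solely for illustration.

def rank_by_wca(candidates, defect_densities):
    """Rank candidate faults by weighted critical area: the critical area for
    each fault multiplied by the defect density of its defect type."""
    scored = [(area * defect_densities[defect_type], fault)
              for fault, (defect_type, area) in candidates.items()]
    return sorted(scored, reverse=True)

# Hypothetical candidates: fault -> (defect type, critical area).
candidates = {
    "bridge(n1,n2)": ("metal1_short", 12.0),
    "bridge(n3,n4)": ("metal2_short", 30.0),
    "open(n5)": ("via_open", 2.5),
}
# Hypothetical relative defect densities per unit area.
defect_densities = {"metal1_short": 0.8, "metal2_short": 0.2, "via_open": 1.5}
for wca, fault in rank_by_wca(candidates, defect_densities):
    print(fault, "WCA =", round(wca, 2))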

One way that inductive fault analysis can be applied to fault diagnosis is through the creation of fault lists. Inductive fault analysis tools such as Carafe [JeeISTFA93, JeeVTS93] can provide a realistic fault list, useful for fault models such as the bridging fault model, in which the number of possible faults is intractable for most circuits. By limiting the candidates to only faults that can realistically occur in the fabricated circuit, a diagnosis can be obtained that is much more precise than one that results from consideration of all theoretical faults.

Another possible way to use inductive fault analysis for diagnosis is presented in Chapter 6, in which IFA can provide the a priori probabilities for a set of candidate faults. This is a generalization of the idea of creating fault lists, in which faults are not characterized as realistic or unrealistic, but instead are rated as more or less probable.

IFA has also been applied to the related field of yield analysis; a technique proposed by Ferguson and Yu [FerYu96] uses a combination of IFA and maximum likelihood estimation to perform a sort of statistical diagnosis on process monitor circuits. A similar combination of layout examination, statistical inference, and fault modeling will be applied to more traditional cause-effect fault diagnosis in Chapter 6 of this thesis.

2.5.9 System-Level Diagnosis

The area of system-level diagnosis, which deals with finding defective components in large-scale electronic systems, is outside the area of research of this dissertation. However, some interesting work has been done in this area, which predates CMOS and VLSI diagnosis and often deals with very different issues. The most comprehensive diagnosis approach has been developed by Simpson and Sheppard [SimShe94], who have presented a probabilistic approach for everything from determining the identity of failing subsystems to determining the optimal order of diagnostic tests. They have also suggested an approach for CMOS diagnosis using fault dictionaries [SheSim96]. Their methods apply the Dempster-Shaffer method of analysis, which I will use extensively and discuss further in Chapter 4.

Chapter 3. A Deeper Understanding of the Problem: Developing a Fault Diagnosis Philosophy

The previous chapter presented some of the various ways that researchers have approached the problem of VLSI fault diagnosis. These attempts have spanned a period of over 25 years, and a good deal of academic and industrial effort has gone into making fault diagnosis work in the real world. And yet, few if any academic diagnosis algorithms have made a successful transition into industrial use.

The reasons for this lack of success are no doubt many, but chief among them is probably the disparity between academic assumptions about the problem and the real-world conditions of industrial failure analysis. This chapter will examine these assumptions in some detail and, in trying to correct them, will present a philosophical framework for approaching the problem of fault diagnosis that will guide the rest of the research presented in this thesis.

3.1 The Nature of the Defect is Unknown

Several theoretical fault diagnosis systems have claimed great success in some variation of the following experiment: physically create or simulate a defect of a certain fault type, create some candidates of that fault type, and run the diagnosis algorithm to choose the correct candidate out of the list. While the accuracy of these success stories is indeed laudable, the result is a little like pulling a guilty culprit out of a police lineup: the job is made much easier if the choices are limited ahead of time.

It is an unfortunate fact of failure analysis, however, that what form a defect has taken, or what fault model could best represent the actual electrical phenomenon, is not known in advance. In the real world, a circuit simply fails some tests; it does not generally give any indication of what type of defect is present. While some algorithms have been proposed that attempt to infer a defect type from some behavior, most notably IDDQ information [GatMal97, GatMal98], these will not work on the most common failure type: there is generally little or no information about defect type that can be gleaned from standard scan failures.

Acknowledging this lack of initial information leads to a basic principle of fault diagnosis, often ignored by academic researchers but obvious to industrial failure analysis engineers:

A fault diagnosis algorithm should be designed with the assumption that the underlying defect mechanism is unknown. (i)

Given this fact, it makes little sense to design a fault diagnosis algorithm that only works when the underlying defect is a certain type or class. Or, if an algorithm is targeted to one fault type, it should be designed so that an unmodeled fault will result in either explicit or obvious failure. This leads to the next principle:

A fault diagnosis algorithm should indicate the quality of its result. (ii)

This way, if a diagnosis algorithm does encounter some behavior that violates its basic assumptions, it can let the user know that these assumptions may have been wrong.

3.2 Fault Models are Hopelessly Unreliable

Many clever diagnosis algorithms have been proposed, using a variety of fault models, and all promise great success as long as one condition holds: nothing unexpected ever happens. These expectations come from the fault model used, the diagnostic algorithm, or both. So, if the modeled defect doesn't cause a circuit failure when expected, or if a failure occurs along an unanticipated path, the algorithm will either quit or get hopelessly off the track of the correct suspect.

If the problem is unreliable fault models, then maybe the solution is to work very hard to perfect the models. If the models were perfect, then diagnosis would reduce to a simple process of finding the candidate that exactly matches the observed behavior. But, once again, the cold hard world intrudes with the cold hard facts: fault model perfection is extremely difficult, and may well be impossible.

Perhaps best documented are the problems inherent in bridging fault modeling: many simplified bridging fault models have been proposed, and each in turn has been demonstrated to be inadequate or inaccurate in one or more important respects [AckMil91, AckMil92, GrePat92, MaxAit93, MonBru92, Rot94]. Even the most complex and computationally intensive models can fail to account for the subtleties of defect characteristics and the vagaries of defective circuit behavior. And it is not only the complex models that are prone to error: even apparently simple predictions may be hard to make when real defects are involved [Ait95].

The unfortunate fact is that faulty circuits have a tendency to misbehave—they are faulty, after all—and often fail in ways not predicted by the best of fault simulators or the most carefully crafted fault models. The only realistic answer is for any diagnostic technique that hopes to be effective on real-world defective circuits to be robust enough to tolerate at least some level of noise and uncertainty. If not, the only certain thing about the process will be the resulting frustration of a sadly misguided engineer.

A fault diagnosis algorithm should make no inviolable assumptions regarding the defect behavior or its chosen fault model(s): fault models are only approximations. (iii)

3.3 Fault Models are Practically Indispensable

Given the well-documented limitations of fault models, several diagnosis algorithms have tried to minimize them or do away with them completely. Some, such as some effect-cause algorithms, claim to be “fault-model-independent”. Others attempt to use the abstract nature of the stuck-at fault model to avoid the messy and unreliable aspects of realistic fault models.

While the idea behind these approaches has merit, abstract diagnosis is not enough for real-world failure analysis. The majority of fault diagnosis algorithms that address complex defects use the stuck-at fault model to get “close enough” to the actual defect behavior to enable physical mapping. But, using the stuck-at model alone results in some well-characterized problems in both accuracy and precision. For example, even a robust stuck-at diagnosis may identify one of two shorted nodes only 60% to 90% of the time [AitMax95, LavChe97]. For situations in which a 10% to 40% failure rate is unacceptable, or such partial answers (single-node explanations) are inadequate, stuck-at diagnosis alone is not the answer.

The use of the stuck-at model is typical of a common answer to the problem of unreliable fault models: use an abstract model that makes as few assumptions as possible. But, while this approach has historically worked for testing, it is not likely to work for fault diagnosis.

Generally speaking, fault models have proved their utility for test generation. If, for example, a test is generated to detect the (abstract) situation of a circuit node stuck-at 0, there is considerable evidence to suggest that the test will, in the process, detect a wide range of related defects: the node shorted to ground, perhaps, or a broken connection to a pull-up network, or even a floating node held low by a capacitive effect. When testing a circuit for defects, the actual relation of fault model to defect is less important than whether the defect is caught or not.

But what does it mean, in the world of fault diagnosis, to explain the actual failures of a circuit with an abstract fault model? Try as one might, no failure analysis engineer is ever going to find a single stuck-at fault under the microscope; a stuck-at fault, strictly defined, is not a specific explanation, but is instead a useful fiction.

For fault diagnosis, the issue is one of resolution: the more abstract the model used, the less well the fault candidates in the final diagnosis will map to actual defects in the silicon. A stuck-at candidate, for example, may implicate a range of mechanisms or defect scenarios involving the specified stuck-at node, and the failure analysis engineer must account for this poor resolution by performing some amount of mapping to actual circuit elements. The more specific the fault model, the better the correspondence to actual defects, and the less mapping work is required: a sophisticated bridging fault candidate, with specific electrical characteristics, will usually resolve to either a single or a few defect scenarios.

A more specific fault model is always preferable for diagnosis. (iv)

This is exactly the point made by Aitken and Maxwell [AitMax95], who pointed out the perils of using abstract fault models for complex defect behaviors. While accuracy may be the most important quality of a diagnosis algorithm, the precision of a diagnosis tool is what makes it truly useful for failure analysis.

3.4 With Fault Models, More is Better

The conflicting principles of unknown fault origins and the desirability of specific fault models lead to a dilemma. If a diagnosis algorithm can make no assumptions about the nature of the underlying defect, how can it apply a specific or detailed fault model to the problem?

The answer, as with many things in life, is that more is better. Since no one fault model will ever provide both the accuracy and precision required from useful fault diagnosis, the best approach is to apply as many different fault models to the problem as possible. In this way, a wide range of possible defects can be handled with the highest possible precision for the failure analysis engineer.

The more fault models used or considered during fault diagnosis, the greater the potential for precision, accuracy, and robustness. (v)

So, perhaps a stuck-at diagnosis, a bridging diagnosis, and a delay fault diagnosis or two could be performed, and the results from this mix of algorithms examined. But apart from the time and work required, a problem remains in reconciling the different results: how can one compare the top candidates from, for example, a stuck-at fault diagnosis algorithm to the top bridging candidates from a completely different algorithm? Many diagnosis techniques employ unique scoring mechanisms to rate their candidates, and even when common techniques are used, such as Hamming distance, they are often applied in different ways or to different data: a “1-bit difference” may mean something very different for a stuck-at candidate than for an IDDQ candidate.

It is essential, then, that a diagnosis algorithm present its results in a way that enables comparison to the results of other diagnosis algorithms. A diagnosis engineer will get the best possible result by leveraging the efforts of many algorithms and different models, but only if these efforts can be effectively combined.

A fault diagnosis algorithm should produce diagnoses that allow comparison or combination with the results from other diagnosis algorithms. (vi)

3.5 Every Piece of Data is Valuable

The concept of “more is better” regarding fault models applies equally well to information: the more data that is applied to the problem of fault diagnosis, generally the higher the quality of the eventual result. This is especially true of sets of data from different sources or types of tests, such as using results from both scan and IDDQ tests. IDDQ information, for example, can often differentiate fault candidates that are essentially equivalent under voltage tests [GatMal97, GatMal98].

Therefore, the process of diagnosis should be inclusive, using every available source of information to improve the final diagnosis.

A diagnosis algorithm or set of algorithms should use every available bit of data about the defect in producing or refining a diagnosis. (vii)

3.6 Every Piece of Data is Possibly Bad

There is one problem with the “use all data” rule: any or all of the data might be unreliable, misleading, or downright corrupt. Data in the failure analysis problem is inherently noisy. As mentioned, simulations and fault models are only imperfect approximations. The failure data from the tester may not be completely reliable, and often results are not repeatable, especially for IDDQ measurements. The data files may be compressed with some data loss, and with the size and complexity of netlists and test programs, it’s always possible that some part of the test results or a simulation is missing or incorrect. In general, then, any diagnosis algorithm that hopes to be successful in the real (messy) world needs to be robust enough to handle some data error.

A diagnosis algorithm should not make any irreversible decisions based on any single piece of data. (viii)

3.7 Accuracy Should be Assumed, but Precision Should be Accumulated

The prime directive of a diagnosis algorithm is to be as accurate as possible, even at the cost of precision. It is far better to give a large answer, or even no answer, than to give a wrong or misleading one. A large or imprecise diagnosis can always be refined, but an inaccurate one will lead to physical de-processing of the wrong part of a chip, with the possible destruction of the actual defect site.

Accuracy is the most important feature of a diagnosis algorithm; a large or even empty answer is preferable to the wrong answer. (ix)

But a diagnosis methodology should be designed so that iterative applications of new data or different algorithms successively increase the precision and improve the diagnosis. Each step, however, needs to ensure that the accuracy of previous stages is not compromised or lost.

Diagnosis algorithms should be designed so that successive stages or applications increase the precision of the answer, with a minimal sacrifice of accuracy. (x)

3.8 Be Practical

Over the years there have been many diagnosis algorithms proposed, but the computational or data requirements of many of them immediately disqualify them for application to modern circuits. For instance, simulating a sophisticated fault model across an entire netlist of millions of logic gates is usually not feasible. Neither is considering all $\binom{n}{2}$ possible two-line bridging faults in a circuit with $n$ nodes.

If an algorithm does require sophisticated fault modeling, however, it may still be applicable to a much-reduced fault list resulting from a previously obtained diagnosis. The trade-off in such a case is whether the precision promised by such an algorithm is worth the initial work required to reduce the candidate space.

A diagnosis algorithm should have realistic and reasonable resource requirements, with high-resource algorithms reserved for high-precision diagnoses on a limited fault space. (xi)

Chapter 4. First Stage Fault Diagnosis: Model-Independent Diagnosis

Fault diagnosis, especially in its initial stage, can be a daunting task. Not only does the failure analysis engineer not know what kind of defect he is dealing with, but there may in fact be multiple separate defects, any number of which may interfere with each other to modify expected fault behaviors. The defect behavior may be intermittent or difficult to reproduce. Also, the size of the circuit may make application of all but the simplest diagnosis algorithms impractical.

Given these facts, a long-lived staple of fault diagnosis research has apparently outlived its usefulness. The single fault assumption – that there is one defect in the circuit under diagnosis that can be modeled by a single instance of a particular fault model – may not apply for modern fault diagnosis. While it has simplified many diagnostic approaches, some of which have worked quite well despite real-world violations of the premise, the single fault assumption has led to problems with two common defect types: multiple faults, and complex faults. As defined here, complex faults are faults in which the fault behavior involves several circuit nodes, involves multiple erroneous logic values, is pattern-dependent, or is otherwise intermittent or unpredictable.

Traditionally, the single fault assumption has led to the expectation of a certain internal consistency, or some dependence between the test results, with regard to defective circuit behavior. In cause-effect diagnosis, a fault model is selected beforehand, and the observed faulty behavior is compared, as a single collection of failing patterns and outputs, to fault signatures obtained by simulation. In effect-cause diagnosis, many algorithms look for test results that prove that certain nodes in the circuit are able to toggle, and are therefore fault-free throughout the rest of the test set. In either case, the assumption has been that individual test results are not independent, but are rather wholly determined by the presence of the single unknown defect.

From the beginning, however, a few diagnosis techniques eschewed the single fault assumption, especially those that directly addressed multiple faults. These approaches, either implicitly or explicitly, forsake inter-test dependence and instead consider each test independently. The advantage to such approaches is that pattern-dependent and intermittent faults can still be identified, as can the component faults of complex defects. The drawback is that a conclusion drawn about the defect from one test cannot be applied to any other test, and the net result is (in effect) a diagnosis for each test pattern. This can lead to large candidate sets that are difficult to understand and use, especially as guidance for physical failure analysis. Also, since these algorithms no longer implicate a single instance of a fault model, there is now the problem of constructing a plausible defect scenario to explain the observed behavior.

This chapter will attempt to address these drawbacks by improving both the process and the product of per-test fault diagnosis. First, the process will be improved by including more information to score candidates, and paring down the candidate list to a manageable number. Second, the product will be improved by suggesting a way of interpreting the candidates to infer the most likely defect type. The result is a general-purpose approach to identifying likely sources of defective behavior in a circuit despite the complexity or unpredictability of the actual defects.

4.1 SLAT, STAT, and All That

While it has grown in popularity recently, the idea of conducting fault diagnosis one test pattern at a time is a venerable one. Waicukauski and Lindbloom [WaiLin89], Eichelberger et al. [EicLin91], and, more recently, the POIROT [VenDru00] and SLAT [BarHea01] diagnostic systems all suggest or rely on per-test fault diagnosis to address multiple or complex faults. We can, without too much license, state the primary axiom of the one-test-at-a-time approach as follows:

For any single test, an exact match between the observed failures (at circuit outputs or flip-flops) and those predicted by a simulated fault is strong evidence that the fault is present in the circuit, if only during that test.

The underlying concept is uncontroversial, as it underpins both traditional fault diagnosis as well as scientific modeling and prediction: A match between model and observation supports the assumptions of the model or implicates the modeled cause. The difference here is that the traditional comparison of model to observed behavior is decomposed into comparisons on individual test vectors, with a stricter threshold of exact matching to produce stronger implications.

The statement that “the fault is present” should not be taken too broadly. It does not mean that the fault (or modeled defect) is physically present, or that any conclusions can be drawn about the defect in any other circumstance other than the specific failing test. Applied most commonly to stuck-at faults, all that can be inferred from a match is that a particular node has the wrong value for a particular test. However, that node is not implicated as the source of any other failures, nor is it actually “stuck-at” any value at all, since there is no evidence that it doesn’t toggle during other tests.

Note also that the axiom cannot claim that a match constitutes proof that a particular fault is present. A per-test diagnosis approach can be fooled by aliasing, when the fault effects from multiple or complex faults mimic the response from a simple stuck-at fault. This can happen, for instance, if the propagation from a fault site is altered by the presence of other simultaneous faults, or due to defect-induced behaviors such as the Byzantine General’s effect downstream from bridged circuit nodes [AckMil91, LamSho80]. The probability of such aliasing is impossible to determine, given the variety of ways in which it could occur. Per-test diagnosis approaches rely on the assumption that this probability is small, and on the hope that, should aliasing implicate the wrong fault, this fault is not wholly unrelated to the actual defect and is therefore not completely misleading.

A secondary axiom, implicit in the W&L paper but stated in somewhat different terms in the SLAT paper, is the following:

There will be some tests during which the defect(s) to be diagnosed will behave as a single, simple fault, which will, by application of the primary axiom, implicate something about the defect(s).

What this axiom states is that, for any defective chip, there will be some tests for which the failing outputs will exactly match the predicted failing outputs of one or more simple (generally stuck-at) faults. This assertion relies on the observation that many complex defects will, for some applied tests, behave like stuck-at faults that are in some way related to the actual defect. For example, a bridging fault will occasionally behave, on some tests, just like a stuck-at fault on one of the bridged nodes.

The way that a per-test fault diagnosis algorithm proceeds is to find these simple failing tests (referred to in the SLAT paper as SLAT patterns), and identify and collect the faults that match them. The candidate faults are arranged into sets of faults that cover all the matched tests. The SLAT authors call these collections of faults multiplets, a term adopted in this thesis. As a simple example, consider the following three tests, with the associated matching fault candidates:

Test Number    Exactly-Matching Faults
1              A
2              B
3              C, D, E

Figure 4.1: Simple per-test diagnosis example.

In this example, fault A is a match for test #1, which means that the predicted failing outputs for fault A on test #1 match exactly with the observed failing outputs for that test. Similarly, fault B matches on test #2, while for test #3 three faults match exactly: C, D, and E. The SLAT algorithm will build the following multiplets as a diagnosis: (A, B, C), (A, B, D), and (A, B, E). Each multiplet “explains”, or covers, all of the simple failing test patterns. SLAT uses a simple recursive covering algorithm to traverse all covering sets smaller than a pre-set maximum size, and then only reports minimal-sized coverings (multiplets) in its final diagnosis.
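As a purely illustrative sketch of this covering step (and not the SLAT authors' actual recursive implementation), the following Python fragment enumerates minimum-size multiplets by brute force from a per-test map of exactly-matching faults. The test numbers and fault names are those of the example in Figure 4.1; everything else is invented for illustration.

    # A brute-force sketch of the covering step: given, for each simple failing
    # test, the set of faults that exactly match it, enumerate the minimum-size
    # fault sets (multiplets) that cover every such test.
    from itertools import combinations

    def minimal_multiplets(matches):
        """matches: dict mapping test id -> set of exactly-matching faults."""
        all_faults = sorted(set().union(*matches.values()))
        # Try covers of increasing size and stop at the first size that works.
        for size in range(1, len(all_faults) + 1):
            covers = [set(combo) for combo in combinations(all_faults, size)
                      if all(set(combo) & faults for faults in matches.values())]
            if covers:
                return covers
        return []

    # The example of Figure 4.1:
    matches = {1: {"A"}, 2: {"B"}, 3: {"C", "D", "E"}}
    for multiplet in minimal_multiplets(matches):
        print(sorted(multiplet))   # -> ['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'B', 'E']

For this example the sketch prints the three multiplets (A, B, C), (A, B, D), and (A, B, E); a practical implementation would also enforce a pre-set maximum multiplet size rather than searching all subset sizes.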

For comparison, the W&L algorithm will report one set of faults – (A, B, C, D, E) – in its diagnosis on the above example, with a note that faults C, D, and E are equivalent explanations for test #3. The POIROT algorithm will produce the same result, with a score based on how many tests are explained by each fault (in this case, all faults would get the same score).

There are several advantages to the per-test fault diagnosis approach. First, it explicitly handles the pattern-dependence often seen with complex fault behaviors. It also explicitly targets multiple fault behaviors. And, by breaking up single stuck-at fault behaviors into their per-test components, it attempts to perform a model-independent or abstract fault diagnosis. (Since it still relies on stuck-at fault sensitization and propagation conditions, however, it cannot be considered truly fault-model-independent.) This sort of abstract fault diagnosis is just the thing for an initial, first-pass fault diagnosis when nothing is known about the actual defect(s) present.

This chapter will propose a new per-test algorithm. This algorithm is similar in style to the SLAT diagnosis technique, but is able to use more information and so produce a better, more quantified, diagnostic result. The SLAT technique is focused on determining fault locations, hence the name: “Single Location At a Time”. The new approach will instead focus on the faults themselves, but will, like SLAT, diagnose test patterns one at a time. Borrowing the nomenclature, however, we will refer to the process of per-test diagnosis as “STAT” – “Single Test At a Time”. For shorthand, the new algorithm will be called “iSTAT”, for “improved STAT”. Like SLAT, the iSTAT algorithm uses stuck-at faults to build multiplets, but differs from SLAT in two important ways. First, it uses a scoring mechanism to order multiplets to narrow the resulting candidate set. Second, it can use the results from both passing and complex failing tests to improve the scoring of candidate fault sets.

4.2 Multiplet Scoring

The biggest problem with a STAT-based diagnosis is that, since each test is essentially an individual diagnosis, the number of candidates can become quite large. Specifically, the number of multiplets used to explain the entire set of failing patterns can be large, and each multiplet will itself be composed of multiple individual component faults. What is needed is a way to reduce the number of multiplets, or to score and rank the multiplets to indicate a preference between them. This section will introduce a method for scoring and ranking multiplets. It will also talk about how to recover information from tests that don’t fail exactly like a stuck-at fault, and from passing tests that don’t fail at all.

4.3 Collecting and Diluting Evidence

The basic motivation of STAT-based approaches, as expressed in the first axiom above, is that an exact match between failing and predicted outputs on a single test is strong evidence for the fault. While this much seems reasonable, it seems just as obvious that the evidence provided by a failing test is diluted if there are many fault candidates that match. For instance, in the simple example given above, the evidence for fault A is much stronger than that for any of faults C, D, or E, simply because fault A is the only candidate (according to the axiom) that can explain the failures of test #1. The evidence provided by test #3 is just as significant as the evidence from test #1, it is just shared among three possible explanations.

This division of evidence can also be illustrated by imagining failures on outputs with a lot of fan-in, or a defect in an area with many equivalent faults. While there will be a number of faults that match the failure exactly, test results will not provide much compelling evidence to point to any particular fault instance.
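As a naive illustration of this dilution (and emphatically not the Dempster-Shafer mechanism that iSTAT actually uses, introduced in the next section), one could give each simple failing test a single unit of evidence and split it evenly among the faults that exactly match that test. The sketch below does this for the example of Figure 4.1.

    # Purely illustrative: give each simple failing test one unit of evidence and
    # split it evenly among the faults that exactly match that test. This is NOT
    # the Dempster-Shafer mechanism described in the next section.
    from collections import defaultdict

    matches = {1: {"A"}, 2: {"B"}, 3: {"C", "D", "E"}}   # example from Figure 4.1

    evidence = defaultdict(float)
    for test, faults in matches.items():
        share = 1.0 / len(faults)       # dilute the test's evidence over its matches
        for fault in faults:
            evidence[fault] += share

    for fault, weight in sorted(evidence.items(), key=lambda kv: -kv[1]):
        print(f"fault {fault}: accumulated evidence = {weight:.2f}")
    # -> A and B each accumulate 1.00; C, D, and E each accumulate 0.33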

The first way that iSTAT improves per-test diagnosis is to consider the weight of evidence pointing to individual faults, and to quantify and collect that evidence into multiplet scores. The mechanism that iSTAT uses to quantify diagnostic evidence is the Dempster-Shafer method of evidentiary reasoning.

4.4 “A Mathematical Theory of Evidence”

A means of quantitatively manipulating evidence was developed by Arthur Dempster in the 1960s and refined by his student Glenn Shafer in 1976 [Sha76]. At its center is a generalization of the familiar Bayes rule of conditioning, also known simply as Bayes Rule:

$$
p(C_i \mid B) \;=\; \frac{p(B \mid C_i)\,p(C_i)}{p(B)} \;=\; \frac{p(B \mid C_i)\,p(C_i)}{\sum_{j=1}^{n} p(B \mid C_j)\,p(C_j)} \qquad (1)
$$

In this formulation of Bayes Rule, $B$ represents some phenomenon or observed behavior, and each $C_i$ is a possible candidate explanation or cause for that behavior. The candidates are assumed to be mutually exclusive. Bayes Rule is commonly used for the purposes of statistical inference or prediction, which attempt to determine the most likely probability distribution or cause underlying a particular observed phenomenon.

Bayes Rule uses the prior probability (or a-priori probability) $p(C_i)$ of candidate $C_i$ and the conditional probability $p(B \mid C_i)$ of $B$ given the candidate $C_i$ to determine the posterior probability $p(C_i \mid B)$ of candidate $C_i$ given $B$. This posterior probability is central to Bayes decision theory, which states that the most likely candidate given a certain behavior is that for which

$$
p(C_i \mid B) > p(C_j \mid B) \quad \text{for all } j \neq i
$$

When applied to the problem of fault diagnosis, Bayes decision theory can be used to determine the best fault candidate ($C_i$) given a particular observed behavior ($B$).
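As a toy illustration, the following Python sketch applies Equation (1) and the decision rule above to a small candidate set. The candidate names, priors $p(C_i)$, and likelihoods $p(B \mid C_i)$ are invented; in practice they might come from layout analysis (as in Chapter 6) and from fault simulation.

    # A toy application of Bayes decision theory to candidate ranking, following
    # Equation (1). Priors and likelihoods below are invented for illustration.

    def posteriors(priors, likelihoods):
        """Return p(C_i | B) for each candidate, given p(C_i) and p(B | C_i)."""
        joint = {c: priors[c] * likelihoods[c] for c in priors}
        p_b = sum(joint.values())                  # denominator of Equation (1)
        return {c: joint[c] / p_b for c in joint}

    priors      = {"stuck-at f1": 0.5, "bridge f2": 0.3, "open f3": 0.2}
    likelihoods = {"stuck-at f1": 0.10, "bridge f2": 0.60, "open f3": 0.05}

    post = posteriors(priors, likelihoods)
    best = max(post, key=post.get)                 # the Bayes decision rule
    for c, p in sorted(post.items(), key=lambda kv: -kv[1]):
        print(f"{c}: p(C | B) = {p:.3f}")
    print("most likely candidate:", best)

In this invented example the bridging candidate wins despite a smaller prior, because its predicted behavior matches the observation far better than the alternatives.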

The Dempster-Shafer method was developed to address certain difficulties with Bayes Rule when it is applied to the conditions of epistemic probability, in which probability assignments are based on belief or personal judgement, rather than its usual application to aleatory probability, where probability values express the likelihood or frequency of outcomes determined by chance.

The conditions of epistemic probability are familiar