
Page 1: Metastability and Fatal System Errors Blendics Inc. …blendics.com/.../Metastability-and-Fatal-System-Errors-rev-16-Sept...Metastability and Fatal System Errors J. Cox1, ... Metastability


Metastability and Fatal System Errors J. Cox¹, G. Engel¹, D. Zar¹, T. Chaney¹, S. Beer²

¹Blendics Inc. and ²Technion, 17 Mar 2013

Metastability is an inescapable phenomenon in digital electronic systems, particularly those with multiple independent clock domains such as System-on-Chip (SoC) products. This phenomenon has been known to cause fatal system errors for half a century. Over the years, techniques have been developed for obtaining an arbitrarily long mean time between failures (MTBF), and these techniques have been translated into convenient rules of thumb for designers. However, as digital circuits have become more complex, smaller and faster with reduced power consumption, the old rules of thumb are beginning to fail. A new tool, MetaACE, has been developed that accurately evaluates metastability failures in contemporary SoCs. This report briefly surveys what is known and what can be done.

Background. As society’s dependence on digital systems increases, and as these systems move to lower supply voltages, operate faster and are subject to increased process variability, the chance of failure due to a metastability fault has increased dramatically. If such a failure is infrequent and merely requires a system reboot or a reset, the phenomenon is a nuisance rather than a serious problem. However, if the failure is in a mission-critical system or one for which human lives are at risk, the designers have an obligation to ensure that the chance of a catastrophic system failure within the system’s lifetime is less than that from other sources.

The system element whose specific purpose is to mitigate metastability hazards is called a synchronizer. After experiencing metastability, a synchronizer settles to a valid output voltage in a period of time that has no upper bound. This settling regime is largely exponential with a time constant τ. Throughout a number of past semiconductor process generations, τ has been proportional to the propagation delay of a fan-out-of-four (FO4) inverter circuit. FO4 is a process-dependent delay metric that is characteristic of a CMOS technology’s speed. However, a change in the relationship between τ and FO4 has emerged at process geometries of 90 nm and below. This change is particularly significant when the metastable voltage (typically about half the supply voltage) is in the vicinity of the transistor threshold voltage, an increasingly common occurrence for low-power circuits. Under these circumstances, the current flowing in a metastable complementary pair of transistors can be exceedingly small, resulting in a large value of τ. Operating conditions and process variations further aggravate the situation and can cause many orders of magnitude of variation in the MTBF of a synchronizer. No longer can the designer depend upon the rule of thumb that τ is proportional to the FO4 delay; as a result, traditional guidelines for synchronizer design are no longer adequate.

To illustrate how these traditional rules of thumb fail, Figure 1 shows the effect of supply voltage on τ and, in turn, on MTBF. The value of the FO4 delay versus supply voltage is also included in Figure 1. One observes that FO4 delay, under these operating conditions, displays much less sensitivity to supply voltage than τ.
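The exponential settling behavior described above can be made concrete with a minimal numeric sketch. This is illustrative only: the time constants below are hypothetical values chosen for the example, not measurements from this report.

```python
import math

# A minimal numeric sketch (not the MetaACE model): in the exponential
# settling regime described above, the probability that a synchronizer is
# still metastable a time t after the clock edge decays roughly as exp(-t/tau).

def unresolved_probability(t_ns: float, tau_ns: float) -> float:
    """Probability the latch has not yet settled after t_ns nanoseconds."""
    return math.exp(-t_ns / tau_ns)

# With 1 ns of settling time available, a 10x increase in tau
# (an assumed 20 ps growing to 200 ps at low supply voltage) raises the
# unresolved probability by a factor of e**45, nearly 20 orders of magnitude:
p_fast = unresolved_probability(1.0, 0.020)   # tau = 20 ps
p_slow = unresolved_probability(1.0, 0.200)   # tau = 200 ps
print(p_slow / p_fast)
```

Because τ sits in the exponent, a modest growth in τ that barely moves the FO4 delay can collapse the failure margin, which is why the FO4 rule of thumb breaks down.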



Figure 1. Settling time-constant τ, FO4 delay and MTBF as a function of the supply voltage (V) for a 65 nm CMOS synchronizer operated with a 200 MHz clock.

Note that τ grows by almost an order of magnitude more than the delay through an FO4 inverter as the supply voltage changes from 1.3 volts down to 0.825 volts. Over that 0.475-volt supply range, the metastability voltage (about half of V) decreases by almost a quarter of a volt. An equivalent increase in the transistor threshold voltage Vth produces the same difference between the FO4 delay and τ; such an increase in Vth can occur under low-temperature operation of the synchronizer. The combination of low supply voltage and low temperature can lead to sub-second values of MTBF and an almost certain system failure.

Anticipating Synchronizer Failures. The classic method of electrical testing to determine possible failure modes of a complex system employs extensive measurement at the extremes of the operating region, the “corner conditions”. This approach is inadequate for synchronizer failures because:

• failures are usually rare and extrapolating their probability throughout a service lifetime is unreliable if measurements are made over a fraction of that lifetime,

• failures may occur more frequently at a point in the region of operation that is not a corner condition,

• failures detected after fabrication are highly undesirable because of the significant cost and time lost in a re-spin, and

• unless test chips are properly instrumented when fabricated, failures that do occur during tests may leave no indication as to what went wrong or where it happened.

Synchronizer simulation can overcome these problems, but circuit issues related to asymmetry, nonlinearity and common-mode behavior, and simulator issues related to precision and false traces, must be addressed in a practical simulation tool. Blendics has developed a software system, MetaACE, that deals with all these problems through four improvements:

• a mathematical model of a synchronizer that includes circuit asymmetry, nonlinearity and common-mode effects,

• a technique that overcomes simulator precision problems often found in multistage synchronizers,

• accurate co-estimation of the intrinsic parameters that model the behavior of a synchronizer, and

• estimation of MTBF through a formula that combines extrinsic parameters obtained from a particular synchronizer application with intrinsic parameters obtained through synchronizer circuit simulation.

Accuracy of Reliability Estimates. Physical measurement of the reliability of a single-latch synchronizer is possible, but such measurements cannot be completed in a reasonable time for multistage synchronizers. However, single-latch measurements can test the accuracy of single-stage simulations and, by extension, validate the more general multistage analysis. Figure 2 compares such measurements (in blue) with simulations (in red) over a range of supply voltages (V) and temperatures (T).

Figure 2. Comparison of physical measurements and circuit simulations of the settling time-constant τ of a CMOS latch fabricated in a 65 nm process.

Most of the simulated data points fall on top of the measured ones, and all points in the comparison agree within ±5%. Note that both low voltage and low temperature cause the settling time-constant τ to increase. At the extreme of V = 0.95 V and T = −20 °C, τ increases by over an order of magnitude and the risk of failure rises dramatically.



Predicting MTBF. Simulating a synchronizer can provide the essential parameters intrinsic to a particular semiconductor process, but more information is needed to estimate the MTBF of the circuit in a particular application. Parameters such as clock period, clock duty cycle, rate of data transitions and number of stages in the synchronizer are extrinsic: they depend on the application, not on the semiconductor process. However, as can be shown by simulation, the MTBF for these various applications of a synchronizer design can be calculated given the intrinsic parameters. Figure 3 compares the calculated and simulated results for 2-, 3- and 4-stage master-slave flip-flops over a range of clock periods at a data transition rate of 200 MHz.
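As a hedged illustration of how intrinsic and extrinsic parameters combine, the sketch below uses the widely cited two-parameter MTBF model, MTBF = e^(S/τ) / (T_W · f_c · f_d). The report's own formula (17) is not reproduced here, and every parameter value below is made up for the example.

```python
import math

# Hedged sketch of the textbook two-parameter MTBF model (assumed form,
# not necessarily the report's formula (17)):
#   tau, t_w : intrinsic parameters from circuit simulation
#              (settling time constant and metastability window)
#   f_c, f_d : extrinsic clock frequency and data-transition rate
# For an n-flip-flop synchronizer we assume roughly (n - 1) clock periods
# of available settling time S:  MTBF = exp(S / tau) / (t_w * f_c * f_d)

def mtbf_seconds(n_stages: int, tau: float, t_w: float,
                 f_c: float, f_d: float) -> float:
    settle = (n_stages - 1) / f_c        # assumed available settling time S
    return math.exp(settle / tau) / (t_w * f_c * f_d)

# Illustrative (made-up) numbers: tau = 50 ps, t_w = 20 ps,
# a 1 GHz clock and a 200 MHz data-transition rate.
SECONDS_PER_YEAR = 3.156e7
for n in (2, 3, 4):
    mtbf = mtbf_seconds(n, 50e-12, 20e-12, 1e9, 200e6)
    print(n, mtbf / SECONDS_PER_YEAR)
```

Under these assumptions each added stage contributes roughly one extra clock period of settling time, multiplying the MTBF by e^(T_c/τ), which is why the curves in Figure 3 separate by many orders of magnitude.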

[Plot omitted: MTBF (years) versus clock period T (nsec) on a logarithmic scale from 10⁻⁵ to 10³⁰ years, comparing simulated points with values calculated from formula (17) for 2-, 3- and 4-flip-flop synchronizers.]

Figure 3. Comparison of calculated and simulated estimates of MTBF.

It is clear from Figure 3 that there are extrinsic conditions under which even a 2-flip-flop synchronizer at nominal supply voltage and temperature is unreliable. At a 1-nsec clock period (1 GHz), the MTBF of a typical double-ranked 90 nm synchronizer is only about a year and probably inadequate. Increasing the number of stages to four increases the MTBF at 1 GHz to about 10¹⁰ years, more than adequate reliability in most cases. These results show that a multistage synchronizer’s MTBF can be calculated from intrinsic parameters associated with the semiconductor process and extrinsic parameters associated with the application of the synchronizer. Thus, a standard-cell vendor can provide intrinsic synchronizer parameters to an SoC designer who, in turn, calculates the MTBF for the application at hand.

Conclusions. Synchronizers in low-power SoCs fabricated in contemporary semiconductor processes can, when operated at high clock rates, low supply voltages or low operating temperatures, exhibit a seriously inadequate MTBF. Using tools such as MetaACE, it is possible to estimate synchronizer MTBF before fabrication. Manufacturers of mission-critical products should carefully consider the risk of synchronizer failure and take the necessary steps, together with their engineers and their semiconductor vendors, to ensure a satisfactory MTBF over the system lifetime, particularly when human lives are at risk.



Appendix: Real-world Examples of Metastability

From the beginning, now some 50 years ago, we often ran into requests for help such as: “We need help, but this work needs to be confidential. If this became known, it would damage the reputation of our product.” It was, and is, rare that we were allowed to publicly discuss cases of metastability and its effects. Below are some of the cases that we can discuss, including a couple from the last few years that we can describe only in very general terms. Metastability failures continue to happen. Metastability now appears to be well enough known that we are beginning to see synchronizers that are over-designed: in some cases a designer has increased circuit area and complexity beyond what the application requires.

Some Cases: ARPA-NET (1971–1972)

The first ARPA-NET network used a Honeywell DDP 516 computer as the node message-switching computer. This computer was selected in part because it had an excellent track record of solid performance. But in this high-speed switching application, the machine failed after many hours or a few days of operation, never the same way twice. Per Severo Ornstein, the failure was “an extraordinarily rare, seemingly random, intermittent failure. Despite numerous attempts, we were unable to catch it in the act in order to capture some symptoms.” After some time, Severo remembered the glitch work done at Washington University while he was there; in fact, he was one of the authors of one of the papers on metastability. After recognizing the problem, the DDP 516 was modified to reduce the failure rate enough that the problem was never observed again. Severo goes on to describe the trouble they had convincing Honeywell there was a problem involving metastability.

REF: Severo M. Ornstein, “Computing in the Middle Ages”, pp. 173–174.

--------------------------------------------------------------

DEC: (1972)

DEC found out that WU was planning a retreat in the summer of 1973 on the synchronizer and metastability. DEC called and pleaded to be allowed to send two engineers to the retreat. The two engineers were allowed to attend but the “price” of admission was that the engineers had to present their problem. Even then, we would hear of synchronizer problems and fixes, but the details would typically be held confidential. This was an opportunity to get something “on the record”.

The early (about 1972) PDP-11/45 computers had occasional word errors in the main memory. Furthermore, DEC would produce a batch of machines, and some would work perfectly for weeks on end, but there would be at least one in each batch that would fail several times every day! After the retreat, DEC was able to diagnose and fix the machine design. The main memory had to be refreshed once every 10,000 clock cycles or so. The design used an asynchronous timer to tell the machine when it was time for a memory refresh. At the end of each refresh cycle (which was a clocked event and thus synchronized to a system clock edge), an R/C timer would start. When the R/C timer ran out, a system interrupt would be generated and another memory refresh cycle would start. The R/C timer ran for about 10,000 clock periods, and after that long its timing jitter was a significant fraction of a clock period. The system interrupt circuit would misbehave if the R/C timer output occurred at just the wrong time. The “fix” was to add a small trimming resistor so that the period of the R/C timer could be adjusted, placing the peak of the R/C timeout distribution as far from the system clock edge as possible.
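The hazard in this story (timer jitter drifting onto a clock edge) can be sketched with a small Monte Carlo simulation. All numbers below are illustrative assumptions, not DEC's actual values.

```python
import random

# Hedged Monte Carlo sketch of the PDP-11/45 refresh-timer hazard described
# above. CLOCK, WINDOW and JITTER are made-up illustrative values.
random.seed(1)

CLOCK = 1.0      # clock period (arbitrary units)
WINDOW = 0.01    # assumed width of the hazardous window around each edge
JITTER = 0.2     # assumed std-dev of R/C timeout jitter (fraction of a period)

def hazard_rate(mean_phase: float, trials: int = 100_000) -> float:
    """Fraction of R/C timeouts landing within WINDOW of a clock edge,
    when the timeout distribution is centered at mean_phase (0..1)."""
    hits = 0
    for _ in range(trials):
        phase = (mean_phase + random.gauss(0.0, JITTER)) % CLOCK
        if phase < WINDOW / 2 or phase > CLOCK - WINDOW / 2:
            hits += 1
    return hits / trials

# Untrimmed timer: the timeout peak drifts onto the clock edge (phase ~ 0).
# Trimmed timer (the fix): the peak is adjusted to mid-period (phase ~ 0.5).
print(hazard_rate(0.0), hazard_rate(0.5))
```

Centering the timeout distribution mid-period, as the trimming resistor allowed, makes hazardous coincidences substantially rarer under these assumptions.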

-------------------------------------------------------------------

TI:

In the late 1970s, TI introduced a TTL circuit that was advertised as a switch de-bouncer. I analyzed this circuit and found that, in addition to still exhibiting metastable behavior, it had a logic timing race that allowed its output to misbehave over an input timing region at least ten times larger than it needed to be. It seemed the designer was aware of metastability and had worked so diligently on eliminating it that he left behind a logic hazard that made things worse. This part was in the TI parts catalog for a few years and then vanished. It is a case of a circuit design making it all the way into the company catalog even though the circuit was virtually useless.

-------------------------------------------------------------------

(Confidential)

(This project is confidential, so this discussion is only in broad terms.) In the early 2000s, I received a call from a high official of a custom chip manufacturer. His company was designing chips with many synchronizers in their designs. After tape-out and fabrication, many of the synchronizer circuits were failing and had to be redesigned. He asked me to give a seminar to each of his design teams around the world, and I did. Confidentiality keeps me from saying more. But this is a case where a chip manufacturer was willing to pay me tens of thousands of dollars AND take the whole company's design teams away from their work for half a day to train his designers to improve the design yield of his company.

----------------------------------------------------------------------



(Confidential)

In 2010 Blendics was asked to analyze a synchronizer circuit destined for a communication product to be installed outside a customer’s house. Some customers were in Canada, and the product was expected to operate down to −40 °C. Through careful simulation, we found that the synchronizer failed regularly at that temperature. We do not know how the problem was resolved, but assume the synchronizer was redesigned.

-------------------------------------------------

TECHNION:

In May 2013, scientists at the Technion in Israel reported on a case of metastability in a commercial 40 nm SoC that failed randomly after fabrication. Normally there would have been no forensic evidence that metastability was the cause of these failures. However, by using infrared emission microscopy they identified a spot on the chip that correlated with the failure events in both time and location. The spot contained a synchronizer, whose transient hot state confirmed its role in the failures. Because the system employed coherent clock domains, the synchronizer MTBF was sensitive to the ratio of the frequencies of the two clock domains being synchronized. The original, unfortunate choice of this ratio led to the failures; a more favorable choice improved the MTBF by two orders of magnitude, an acceptable solution for the application at hand.