[International Journal of Quality and Reliability Management] Transients effects on the reliability...


IJQRM 13,2

Transients effects on the reliability of programmable electronics

H.K. Tang and Brian Lee

School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

Introduction

It has been said that we are living in the age of microelectronics and computers. They are present in almost every electronic product and system and are also used heavily in products which are not normally classified as electronics. These items range from washing machines to automobiles. All these products and systems have one thing in common. Their electronics are mostly based on microelectronics hardware and their operations are programmed by software. In other words, they are programmable electronics. Programmable electronics reliability depends not only on the reliability of the constituent hardware and software but also on the ambient physical environment.

Electronic hardware is inherently more reliable than most mechanical equipment due to the lack of wear and tear. From the advent of the transistor in the late 1940s to the latest million-transistor microprocessor chip, the reliability of microelectronics has improved steadily. The device failure rate model follows the Weibull distribution in its early life and is followed by a very long, useful life of constant failure rate. Typically the constant failure rate ranges from a few ppb to a few hundred ppb[1]. Thus, electronic hardware is rarely responsible for failures, even for very complex computer systems[2].

As the processing power of microprocessors (measured in million instructions per second or MIPS) increases, software complexity also increases to harness their power for better performance and to produce more functions. While the control program for a washing machine may be just a few thousand lines of instruction, it is not unusual nowadays to find software with a million lines of instruction, even in personal computers. Software of such complexity also controls the modern telephone exchanges, aeroplanes and non-stop computers for banking and finance. The proliferation of programmable electronics gives rise to concern over the risks of software. To contain the risks of software, structured programming, software quality assurance and fault-tolerance techniques are increasingly being used[1,3]. In a survey of well-debugged programs, MTTF ranging from 1.6 years to 5,000 years was reported[4].

It is well known that temperature and humidity affect the reliability of electronics. The methods to reduce their detrimental effects are also well known. One aspect of the physical environment, however, is not widely known, although it is gaining recognition as one of the most serious elements that affects electronics in general and programmable electronics in particular. This is the susceptibility of electronics to electromagnetic interference (EMI), which is also known as radio frequency interference (RFI). In short, EMI affects programmable electronics reliability through interaction with the hardware and software.

International Journal of Quality & Reliability Management, Vol. 13 No. 2, 1996, pp. 66-74, MCB University Press, 0265-671X

A transient is one particular form of EMI that is a major cause of failures for programmable electronics. It is a short burst of electromagnetic energy that enters victim equipment via conduction on cables and other forms of conductor or via electromagnetic radiation. Strong transients can cause permanent physical damage while weak transients cause only transient faults that involve no physical damage. Nevertheless, transient faults can still cause havoc to programmable electronics operations. Since there is no evidence of physical damage, failures due to transient faults are often confused with software faults and lead failure analysis in the wrong direction. It is for these reasons that transients and their effects need to be understood better by those responsible for product quality and reliability.

Susceptibility of programmable electronics

EMI is part of the physical environment. It is either natural or man-made. There are many sources of EMI. They include lightning, radio transmitters, motors, electrical circuit breakers, electrostatic discharge and personal computers[5]. EMI could be transient in duration, as produced by lightning, or continuous, as produced by broadcast radio. EMI is either conducted by cables such as power and data interface cables or radiated through the atmosphere. If suitable countermeasures against EMI are not taken, sensitive electronic equipment would be interfered with. The result of the interference could be temporary or permanent loss of performance. Due to the proliferation of electrical and electronic apparatus, particularly computing devices, man-made EMI has been on the increase. The situation became so serious that in the early 1980s regulations were imposed internationally to limit the amount of EMI that can be emitted from computing devices or information technology equipment.

Limiting EMI emission of computing devices, however, does not eliminate EMI completely. There are still the natural and other man-made EMI sources, like lightning and electrical circuit-breakers. They produce interferences that last only a short time, ranging from nanoseconds to milliseconds. They enter electronic equipment via power and interface cables or couple into the equipment as transient electromagnetic radiation. The results are transient voltages and currents, transients in short, in the electronic hardware.

While programmable electronics are not the only potential victims of transients, they are especially susceptible to this form of EMI and the possible failures could be more serious. Take, for example, a radio receiver with no programmable electronics; the effect of a lightning strike nearby may be just a clicking noise added to the received signal. On the other hand, transients caused by lightning may lead a traffic light controller, controlled by programmable electronics, into an unsafe state, such as turning all the green lights on. Similarly, a financial transaction computer may enter a piece of wrong data with serious financial consequences.

In view of the ever-increasing use of electronics, especially programmable electronics, and the concern about their safety and reliability, international standards on immunity against EMI have been adopted and will be imposed, from 1996 onward, on products sold in the European Community. The European Norm 50082-1 will cover residential, commercial and light industrial environments, while 50082-2 will cover industrial environments. In other words, mass-produced apparatus as well as industrial, scientific and medical equipment will be affected[6].

Transients and failures

Transients can cause faults, which in turn can cause errors and errors can cause failures. Following the computer community, the definitions of these terms are given below[2,7]:

A fault is an incorrect state of hardware or software resulting from failures of components, physical interference from the environment, operator error, or incorrect design.

An error is the manifestation of a fault within a program or data structure. It is a deviation from accuracy or correctness.

A failure is the non-performance of some action that is due or expected.

Faults can be classified into three types: permanent, intermittent and transient (often intermittent and transient are not differentiated in usage). A permanent fault exists indefinitely until it is corrected by repair to the hardware. An intermittent fault appears, disappears and reappears repeatedly. It is due to impaired physical conditions of the hardware and can be repaired by part replacement or correction. A transient fault appears and disappears within a very short period of time and involves no damage to the hardware.

Transients caused by EMI are a major, but not the only, cause of transient faults. Electrostatic discharge (ESD) can also cause transient faults through the concomitant electromagnetic radiation. Defective software has also been identified as another major source of transient faults in software-intensive systems. According to some case studies of mature and well-debugged systems, transient faults account for more than 80 per cent of all failures observed[2,8,9].

The mechanisms of how transients produce failures are very complicated. Simply stated, they depend first on the physical interaction between the hardware and the sources of transients and, second, on the states of the software at the times that the transient faults occur. Three more factors complicate the situation further and make transient faults and their associated failures so hard to deal with. These factors are now discussed.

Probability of failure

A transient does not always cause a failure. Transients are inherently random in nature. Their frequency of occurrence, waveforms and strength are random variables. Thus, a transient may or may not produce a fault. When it does produce a fault it could be a transient, an intermittent or a permanent fault depending on its strength and waveform (see Figure 1).

Even when a transient fault occurs an error does not always result. For instance, if a fault causes data to be 1 when it should be 0 an error will occur. On the other hand, if the data are already a 1 then the transient fault does not result in an error. Similarly, an error does not always end up in a failure. For instance, a transient may cause data error but if the data are not read and used, or are overwritten by correct data, then it cannot cause a failure. So failure due to transients is a highly complicated and random process.
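The chain from fault to error to failure described above can be illustrated with a minimal Python sketch. It assumes a deliberately simple model that is not taken from the paper: a single stored bit, a transient fault that forces that bit to 1, and one subsequent access. The names `Cell`, `inject_force_one` and `outcome`, and the access labels, are hypothetical and serve only this illustration.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One stored bit plus a flag marking whether it currently holds an error."""
    value: int
    corrupted: bool = False


def inject_force_one(cell: Cell) -> bool:
    """Model a transient fault that forces the bit to 1 (assumed fault model).

    Returns True only if the fault changed the stored value, i.e. an error.
    """
    if cell.value == 0:
        cell.value = 1
        cell.corrupted = True
    return cell.corrupted


def outcome(initial_bit: int, next_access: str) -> str:
    """Classify what a single transient fault leads to.

    next_access is the first operation on the cell after the fault:
    "read", "write" or "none" (hypothetical labels for this sketch).
    """
    cell = Cell(initial_bit)
    if not inject_force_one(cell):
        return "no error"  # the bit already held the forced value
    if next_access == "write":
        cell.value, cell.corrupted = 0, False  # correct data overwrite the error
        return "error masked"
    if next_access == "read":
        return "failure"  # corrupted data are read and used
    return "latent error"  # error stays dormant until read or overwritten
```

In this model only one of the four combinations, a stored 0 that is read before being rewritten, ends in a failure, which is the sense in which a transient fault does not always propagate.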

In the case of a very strong transient with very fast rise and fall times the induced transient voltages and currents in the hardware will also be high and distributed extensively throughout the hardware. This leads to a very high probability for transient-induced failure. This could also happen with less severe transients if the hardware design is very poor and, therefore, highly susceptible to transients. Otherwise, the probability of failure due to transients will be low and could be modelled by a Poisson random process of rare events as below.

[Figure 1. Transients and their possible effects on programmable electronics. Diagram labels: Transients; Hardware; Software; Damage: permanent faults; Partial damage: intermittent faults; Transient faults; No fault; Errors / No error; Failures / No failure]

The software which is being executed by the hardware is characterized by the presence of time intervals during which the software is susceptible to transient faults. Such intervals could be called the susceptible windows, for example, the intervals during which crucial data are being transferred between the processing unit and some memory or input/output device. Typically, these susceptible windows represent a small fraction of the total observation time, hence random transients hitting at susceptible windows can be considered as rare events[10]. The probability of developing transient failure could be calculated easily.

Let the observation time be T, within which there are m identical susceptible windows, each of duration t. The probability, f, of developing at least one failure, with the equipment in question subject to n random transient faults, occurring one at a time and with uniform probability distribution function throughout T, is given by:

$$f = 1 - e^{-nmt/T}$$

provided that the following condition is met: $mt \ll T$.
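One way to see where an expression of this form comes from is sketched below. It is not the authors' derivation (the text points to [10]); it assumes that the n transient faults are independent and that every fault landing inside a susceptible window produces a failure.

```latex
\begin{align*}
P(\text{a given fault lands in a susceptible window}) &= \frac{mt}{T},\\
P(\text{no failure after } n \text{ faults}) &= \left(1 - \frac{mt}{T}\right)^{n}
  \approx e^{-nmt/T} \qquad (mt \ll T),\\
f = 1 - P(\text{no failure after } n \text{ faults}) &\approx 1 - e^{-nmt/T}.
\end{align*}
```

The approximation in the second line is the usual rare-event limit, which is why the condition that the windows occupy only a small fraction of T is needed.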

The implications of the above expression are obvious. The more frequent are the transient faults or the occurrences of susceptible windows within a fixed period of time, the higher the probability of developing a failure. The fewer in number are the susceptible windows, the longer it will take to develop a transient-induced failure.
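The behaviour of the failure-probability expression can be explored numerically. The Python sketch below evaluates f = 1 - exp(-nmt/T) and compares it with a simple Monte Carlo estimate. The function names, the even spacing of the windows and the rule that any fault landing inside a window counts as a failure are assumptions made for this illustration only.

```python
import math
import random


def failure_probability(n: int, m: int, t: float, T: float) -> float:
    """Closed form f = 1 - exp(-n*m*t/T); meaningful only when m*t << T."""
    return 1.0 - math.exp(-n * m * t / T)


def simulate(n: int, m: int, t: float, T: float,
             trials: int = 10000, seed: int = 1) -> float:
    """Estimate the probability that at least one of n uniformly timed
    transient faults lands inside one of m susceptible windows."""
    rng = random.Random(seed)
    spacing = T / m  # assumption: one window at the start of each equal slot
    failures = 0
    for _ in range(trials):
        for _ in range(n):
            arrival = rng.uniform(0.0, T)
            if (arrival % spacing) < t:  # fault falls inside a window
                failures += 1
                break  # one hit is enough to count this trial as a failure
    return failures / trials
```

For example, with n = 100 faults, m = 1000 windows of t = 1e-6 each and T = 1, the exponent nmt/T is 0.1 and the closed form gives a failure probability of about 0.095; `simulate(100, 1000, 1e-6, 1.0)` should return a value close to that, subject to sampling noise.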

Error latency

An error does not always cause a failure immediately. In some cases it may take a long time to do so. The time period between the occurrence of an error and its associated failure is called error latency. Take, for example, a piece of data which is corrupted while it is written into a memory device: a failure will not occur until it is retrieved and used by the processor. Until then, the error is dormant and undetected. Such an error is called a latent error and can be likened to a computer virus.
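Error latency as defined above can be made concrete with a small Python sketch over a hypothetical memory-access trace. The trace format, a time-sorted list of (time, operation, address) tuples, and the function name `error_latency` are assumptions for illustration, not something specified in the paper.

```python
def error_latency(events, corrupt_time, address):
    """Return the time from a corruption to the first read of the corrupted
    address, or None if the data are overwritten first or never read.

    events: list of (time, op, address) tuples sorted by time, where op is
    "read" or "write" (hypothetical trace format).
    """
    for time, op, addr in events:
        if time < corrupt_time or addr != address:
            continue  # ignore earlier accesses and other addresses
        if op == "write":
            return None  # correct data overwrite the error: no failure
        if op == "read":
            return time - corrupt_time  # latent error becomes a failure
    return None  # error is still dormant at the end of the trace
```

A trace that is too short to contain the eventual read returns None, which mirrors how a latent error can pass unnoticed through a limited observation period.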

Elusiveness

As pointed out earlier, transients are random in nature. When a failure occurs and is detected, the source of the transients could have disappeared or become quiescent for a long time. This makes troubleshooting and tracing the origin of the failure extremely difficult. Transient faults could produce many different failures and some of them seldom repeat. During the product-development stage, transient faults could often be masked by the more dominant hardware and software faults. All these factors could lead the engineers to wrong conclusions when diagnosing failures.

Design against transients

Due to the above reasons, a defensive design strategy is preferred and needed to combat transient faults so as to achieve reliability. Such a strategy could be implemented at several levels. The first and most fundamental level is the hardware. Shielding, proper grounding of cables, filtering, good circuit board layout and installing transient absorbers are essential techniques for fault avoidance[5].

The next level is at the software and data structure. At this level, the design objective is fault tolerance. The purpose of fault tolerance is to prevent faults leading into errors and errors leading into failures. There are many techniques used to achieve fault tolerance, e.g. error-correction coding, redundancy, performance monitoring[2,7]. Some techniques employ only software while others use software and additional hardware.
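As an illustration of the redundancy technique mentioned above, the following Python sketch stores every value three times and takes a majority vote on each read. Triple redundancy with voting is one common scheme chosen here as an example; the paper does not prescribe it, and the names `vote` and `TripleStore` are hypothetical.

```python
from collections import Counter


def vote(copies):
    """Return the majority value among redundant copies.

    Raises ValueError when no value holds a strict majority, i.e. the
    error is detected but cannot be corrected.
    """
    value, count = Counter(copies).most_common(1)[0]
    if 2 * count <= len(copies):
        raise ValueError("no majority among redundant copies")
    return value


class TripleStore:
    """Keeps three copies of every value (hypothetical helper class)."""

    def __init__(self):
        self._banks = [{}, {}, {}]

    def write(self, key, value):
        for bank in self._banks:
            bank[key] = value

    def read(self, key):
        value = vote([bank[key] for bank in self._banks])
        for bank in self._banks:  # repair any copy that disagreed
            bank[key] = value
        return value
```

A single corrupted copy is outvoted and repaired on the next read, so the error does not become a failure. The cost is three times the storage and extra work on every access, and two simultaneously corrupted copies would defeat the vote.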

However, it is important to realize that no fault tolerance technique gives 100 per cent fault coverage. Some errors may not even be detectable, so failure could still occur despite fault tolerance techniques. The additional hardware and software to implement fault tolerance could also fail due to transient faults. Moreover, fault tolerance techniques often mean an additional workload which slows down system performance.

To test the adequacy of design, a transient simulator should be used in prototype testing. The objective is to force the equipment under test into failure so that its weak points can be discovered and ameliorated. The IEC standard


[Figure 2. A typical design-manufacture-operate process with possible undesirable loops due to discovery of transient problems. Panels (a) to (d); stages shown: Initial specifications; Design; Prototyping/testing; Final specifications/manufacture documents/test plans; Manufacture; Field test; Acceptance]

cost would have escalated significantly. Besides extra engineering time there would be material scrap and extensive revisions of documents. At this stage, redesign will require some or all of the following measures: the re-layout of printed circuit boards, re-routeing or changing the types of cables and wires used, addition of components for EMI suppression and sometimes modifications of the software for fault tolerance.

The transient problems remain undiscovered until field testing. At this stage the customers would be involved. Facing failures that are extremely hard to diagnose, for reasons given earlier, the supplier-customer relationship would be strained. The elusiveness of the sources of transients means many trips to the field by the engineers. Material costs and man-hour overrun would escalate further when compared to earlier discovery (above). The flexibility for a redesign is considerably reduced because the time and finance involved are not budgeted.

It is possible that even field testing does not expose the inherent susceptibility to transients. One possible reason is insufficient test duration. As explained above, a transient failure depends on the rare concurrence of the transients and the susceptibility windows, so over a short period of time failures associated with transients may not develop (another possible reason is long error latency). Thus, the error remains dormant during the entire field test. Subsequent to the field test and acceptance, maybe after a long time, the error becomes active and failure develops. For a safety critical system or a system involved with high finance, the consequence could be serious and result in societal loss.

Management's responsibility

Given the possible serious consequences that transients have on programmable electronics, management must be alert to the potential problem. It must take the necessary steps to ensure the confinement of transients' effects on reliability. Following the ISO 9001 standard on quality systems[12], management's responsibility should include at least the following:

Define all the personnel at various levels and functions who will be responsible for ensuring that the specifications, design, testing and installation do take transients into account.

Review contracts or product specifications to ensure that the intended operational electromagnetic environment is well defined. If the latter is not defined by the customers then relevant standards should be followed.

Help to set as a design objective, the immunity levels of the equipment in question towards defined transients.

Ensure that all test plans include transient susceptibility tests with defined procedures. Susceptibility tests must be performed in the development stage as well as final and field testing.

Review the design and test records and check if the immunity design objective is achieved.

Ensure that service records reflect any incidence of failures due to transient faults.

Establish a document that records the objective, plans and results pertaining to the above points.

Although implementing the above points and the entailing work represents additional cost to the supplier, it should be compared to the potential loss due to negligence. Indeed, as explained earlier, the loss to the supplier and possibly to society could be extremely high.

Conclusion

The nature of transients and the associated failure mechanism in programmable electronics have been discussed. The importance of design and management's role with regard to transients has been stressed. In view of the pending European regulation on immunity against EMI and the possible serious consequence of ignoring the issue, management must not neglect transients' effects on product reliability. It must take the lead and ensure the reliability of products in their intended operational environment.

References

1. Irland, E.A., "Assuring quality and reliability of complex electronic systems: hardware and software", Proceedings of the IEEE, Vol. 76 No. 1, January 1988, pp. 5-18.

2. Siewiorek, D.P. and Swarz, R.S., Reliable Computer Systems Design and Evaluation, 2nd ed., Digital Press, Geneva, 1992.

3. Avizienis, A. and Laprie, J., "Dependable computing: from concepts to design diversity", Proceedings of the IEEE, Vol. 74 No. 5, May 1986, pp. 629-38.

4. Littlewood, B. and Strigini, L., "The risks of software", Scientific American, November 1992, pp. 38-43.

5. Ott, H.W., Noise Reduction Techniques in Electronic Systems, Wiley, New York, NY, 1988.

6. Davies, J., "The European (CENELEC) generic immunity standards", EMC Test and Design, November-December 1992, pp. 49-50.

7. Johnson, B., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, Reading, MA, 1988.

8. Iyer, R.K. and Rossetti, D.J., "A measurement-based model for workload dependence of CPU errors", IEEE Transactions on Computers, Vol. C-35 No. 6, June 1986, pp. 511-19.

9. Duba, P. and Iyer, R.K., "Transient fault behavior in a microprocessor, a case study", Proceedings of IEEE International Conference on Computer Design, 1988, pp. 272-6.

10. Papoulis, A., Probability and Statistics, Prentice-Hall, Englewood Cliffs, NJ, 1990, Chapter 3.

11. International Electrotechnical Commission (IEC), IEC 801-4 Electromagnetic Compatibility for Industrial-process Measurement and Control Equipment, Part 4: Electrical Fast Transient/Burst Requirements, IEC, 1988.

12. International Organization for Standards (ISO), ISO 9001 Quality Systems Model for Quality Assurance in Design/Development, Production, Installation and Servicing, ISO, Geneva, 1987.

Further reading

Tang, H.K. and Er, M.H., "EMI-induced failure in microprocessor-based counting", Microprocessors and Microsystems, Vol. 17 No. 4, 1993, pp. 248-52.
