Probabilistic design for reliability (pdfr) in electronics part1of2

Probabilistic Design for Reliability (PDfR) in

El iElectronicsPart IPart I

Dr. E. Suhir©2011 ASQ & Presentation SuhirPresented live on Jan 03~06th, 2011

http://reliabilitycalendar.org/The_Reliability Calendar/Short Courses/Shliability_Calendar/Short_Courses/Short_Courses.html

ASQ Reliability DivisionASQ Reliability Division Short Course SeriesShort Course Series

One of the monthly webinarsOne of the monthly webinars on topics of interest to reliability engineers.

To view recorded webinar (available to ASQ Reliability ) /Division members only) visit asq.org/reliability

To sign up for the free and available to anyone live webinars visit reliabilitycalendar.org and select English Webinars to find links to register for upcoming events

http://reliabilitycalendar.org/The_Reliability Calendar/Short Courses/Shliability_Calendar/Short_Courses/Short_Courses.html

Dr. E. Suhir Page 1

PROBABILISTIC DESIGN for RELIABILITY (PDfR) CONCEPT,

the Roles of Failure Oriented Accelerated Testing (FOAT)

and Predictive Modeling (PM), and

a Novel Approach to Qualification Testing (QT)

“You can see a lot by observing”

Yogi Berra, American Baseball Player

“It is easy to see, it is hard to foresee”

Benjamin Franklin, American Scientist and Statesman

E. Suhir Bell Laboratories, Physical Sciences and Engineering Research Division, Murray Hill, NJ (ret),

University of California, Dept. of Electrical Engineering, Santa Cruz, CA,

University of Maryland, Dept. of Mechanical Engineering, College Park, MD, and

ERS Co. LLC, 727 Alvina Ct. Los Altos, CA, 94024, USA

Tel. 650-969-1530, cell. 408-410-0886, e-mail: [email protected]

Four hour ASQ-IEEE RS Webinar short course

January 3-6, 2011

Dr. E. Suhir Page 2

Contents

Session I

1. Introduction: background, motivation, incentive

2. Reliability engineering as part of applied probability and Probabilistic Risk

Management (PRM) bodies of knowledge

3. Failure Oriented Accelerated Testing (FOAT): its role, attributes, challenges, pitfalls

and interaction with other accelerated test categories

Session II

4. Predictive Modeling (PM): FOAT cannot do without it

5. Example of a FOAT: physics, modeling, experimentation, prediction

Session III

6. Probabilistic Design for Reliability (PDfR), its role and significance

Session IV

7. General PDfR approach using probability density functions (pdf)

8. Twelve steps to be conducted to add value to the existing practice

9. Do electronic industries need new approaches to qualify their devices into products?

10. Concluding remarks

Dr. E. Suhir Page 3

1. Introduction: background, motivation, incentive

“Vision without action is a daydream.

Action without vision is a nightmare”

Japanese saying

“The problem is not that old age comes.

The problem is that young age passes”

Common Wisdom

Session I

Dr. E. Suhir Page 4

Background

� The short-term down-to-earth and practical goal of a particular electronic or a

photonic device manufacturer is to conduct and pass the established qualification

tests, without questioning whether they are perfect or not

� The ultimate long-term and broad goal of electronic, opto-electronic and photonic

industries, regardless of a particular manufacturer or even a particular product, is to

make the industries deliverables sufficiently reliable in the field, be consistently good

in performance, and so to elicit trust of the customer

� Qualification testing (QT), such as, e.g., those prescribed by the JEDEC, Telcordia,

AEC or the MIL specs, is the major means that the electronic, opto-electronic and

photonic industries use to make their viable-and-promising devices into reliable-and-

marketable products.

Dr. E. Suhir Page 5

Motivation

�It is well known, however, that devices and systems that passed the existing qualification tests

often fail in the field. Should it be this way? Is this a problem indeed? Are the existing qualification

specifications adequate? Do electronic and photonic industries need new approaches to qualify

their devices into products?

� If they do, could the today’s qualification specifications and testing procedures be improved to

an extent that if the device passed these tests, its performance in the field would be satisfactory?

�On the other hand, there is a perception, perhaps, a rather substantiated one, that some electronic

components “never fail”. Although one should never say “never”, such a perception exists because

some products might be too robust and, as the consequence of that, are more costly than

necessary. Could the situation be changed and could the cost be brought down considerably, if one

would be able to assess the actual, most likely superfluous, probability of non-failure in the field

and come up, for a particular product, with the best compromise between reliability, cost and time-

to-market?

�Would it be possible to “prescribe” (specify), predict and, if necessary, even control the low

enough probability of failure for a product that operates under the given stress (not necessarily

mechanical, of course) conditions for the given time?

Dr. E. Suhir Page 6

Incentive

�We argue that the improvements in the QT, as well as in the existing best practices, are

indeed possible, provided that the Probabilistic Design for Reliability (PDfR) concept is

thoroughly developed and the corresponding methodologies are employed

�One effective way to improve the existing QT and specs is to

�conduct, on a wide scale, the appropriate Failure Oriented Accelerated Testing (FOAT) at both the

design stage (DFOAT) and the manufacturing stage (MFOAT), and, since DFOAT cannot do without

predictive modeling (PM),

�carry out, whenever and wherever possible, PM to understand the physics of failure, and to

predict, based on the DFOAT, the probability of failure in the field,

�revisit, review and revise, considering the DFOAT and, to a lesser extent, MFOAT data obtained

for the most vulnerable elements of the device of interest, the existing QT practices, procedures,

and specifications,

�develop and widely implement the PDfR methodologies and algorithms having in mind that

“nobody and nothing is perfect” and that probability of failure in the field is never zero, but could be

predicted and, if necessary, minimized, controlled and maintained at an acceptable low level during

product operation.

Dr. E. Suhir Page 7

2. Reliability engineering as part of applied probability

and Probabilistic Risk Management (PRM)

bodies of knowledge

“A pinch of probability is worth a pound of perhaps”

James G. Thurber, American writer and cartoonist

“In a long run we are all dead”

John Maynard Keynes, British economist

Dr. E. Suhir Page 8

Reliability engineering

� deals with failure modes and mechanisms, “root” causes of occurrence of various

failures, role of various defects, methods to estimate and prevent failures, and

probability-based designs for reliability;

� provides guidance on how to make a viable device into a reliable and marketable

product;

� in products, for which a certain level of failures is considered acceptable (such as, e.g.,

consumer products), examines ways of bringing down the failure rate to an allowable

level;

� for products, for which a failure is a catastrophe, examines and considers ways of

making the probability of failure as low as necessary or possible.

Dr. E. Suhir Page 9

Reliability engineering as part of applied probability and

probabilistic risk management (PRM) bodies of knowledge

�Reliability is part of applied probability and probabilistic risk management (PRM) bodies

of knowledge, and includes the item's (system's) dependability, durability, maintainability,

reparability, availability, testability, etc., i.e., probabilities of the corresponding events or

characteristics

� Each of these characteristics is measured as a certain probability and could be of a

greater or lesser importance depending on the particular function and operation

conditions of the item or the system, and consequences of failure

�Applied probability and Probabilistic Risk Management (PRM) approaches and

techniques put the art of Reliability Engineering on a solid “reliable” ground.

E. Suhir

““If a man will begin with certainties, he will end with doubts; bIf a man will begin with certainties, he will end with doubts; but if he will be ut if he will be content to begin with doubts, he shall end in certainties.content to begin with doubts, he shall end in certainties.””

Sir Francis Bacon, English Philosopher and StatesmanSir Francis Bacon, English Philosopher and Statesman

“We see that the theory of probability is at heart only common sense reduced to calculations; it makes us appreciate with exactitude what reasonable minds

feel by a sort of instincts, often without being able to account for it… The most important questions of life are, for the most part, really only problems of

probability.”Pierre Simon, Marquise de Laplace

““Mathematical formulas have their own life, they are smarter thanMathematical formulas have their own life, they are smarter than we, even we, even

smarter than their authors, and provide more than what has been smarter than their authors, and provide more than what has been put into put into

themthem””

Heinrich Hertz, German PhysicistHeinrich Hertz, German Physicist

Dr. E. Suhir Page 11

Reliability should be taken care of on the permanent basis

The reliability evaluation and assurance cannot be delayed until the device is made (although it is often the case in many actual industries). Reliability should be

� “conceived” at the early stages of its design (a reliability and an electronic engineers should start working together from the very beginning of the device/system development),

� implemented during manufacturing (through a high quality manufacturing process)

� qualified and evaluated by electrical, optical, environmental and mechanical testing both at the design and the manufacturing stages (the customer requirements and the general qualification requirements are to be considered),

� checked (screened) during production (by implementing an adequate burn-in process) and, if necessary and appropriate,

� maintained in the field during the product’s operation, especially at the early stages of the product’s use (by employing, e.g., technical diagnostics, prognostication and health monitoring methods and instrumentation).


Three classes of engineering products

from the reliability point of view

� Class I. The product has to be made as reliable as possible. Failure should not be

permitted. Examples are some military or space objects

� Class II. The product has to be made as reliable as possible, but only for a certain level

of demand (stress, loading). Failure is a catastrophe. Examples are civil engineering

structures, bridges, ships, aircraft, cars

� Class III. The reliability does not have to be very high. Failures are permitted, but

should be restricted. Examples are consumer products, commercial electronics,

agricultural equipment.

See E.Suhir, Applied probability for engineers and scientists, McGraw-Hill, 1997


Class I (military or similar) products

� The product (object) has to be made as reliable as possible. Failure is viewed as a

catastrophe. Examples are some warfare, military aircraft, battle-ships, spacecraft

� Cost is not a dominating factor

� The products usually have a single customer, such as the government or a big firm

� The reliability requirements are defined in the form of government standards

� The standards not only formulate the reliability requirements for the product, but also

specify the methods that are to be used to prove (demonstrate) the reliability, and

often even prescribe how the system must be manufactured, tested and screened

� It is typically the customer, not the manufacturer, who sets the reliability standards.


Class II (industrial or similar) products

�The product (system, structure) has to be made as reliable as possible, but only for a certain

specified level of loading (demand). If the actual load (waves, winds, earthquakes, etc.) happens to

be larger than the design demand, then the product might fail, although the probability of such a

failure should be determined beforehand and should/could be (made) very small

�Examples are: long-haul communication systems, civil engineering structures (bridges, tunnels,

towers), passenger elevators, ocean-going vessels, offshore structures, commercial aircraft,

railroad carriages, cars, some medical equipment

� These are highly expensive products, which are produced in large quantities, and therefore

application of Class I requirements will lead to unjustifiable, unfeasible and unacceptable

expenses. Failure is a catastrophe and might be associated with loss of human lives and with

significant economic losses

�The products are typically intended for industrial, rather than government, markets. These

markets are characterized by rather high volume of production (buildings, bridges, ships, aircraft,

automobiles, telecommunication networks, etc.), but also by fewer and more sophisticated

customers than in the commercial (Class III) market.


Class III (consumer, commercial) products

�The typical market is the consumer market. An individual consumer is a very small part of the

total consumer base. The product is inexpensive and manufactured in mass quantities

�The demand for the product is usually driven by the cost of the product and time-to-market, rather

than by its reliability. As long as the product is “sellable”, its reliability does not have to be very

high: it should only be adequate for customer acceptance and reasonable satisfaction. Simple and

innovative products, which have a high degree of customer appeal and are in significant demand,

may be able to prosper, at least for some time, even if they are not very reliable

�Failure is not a catastrophe: a certain reasonable level of failures during normal operation of the

product is acceptable, as long as the failure rate is within the anticipated/expected range

�Reliability testing is limited, and the improvements are often implemented based on the field

feedback

�It is typically the manufacturer, not the consumer, who sets the reliability standards, if any, for the

product . No special reliability standards are often followed, and it is the customer satisfaction (on

the statistical basis), which is the major criterion of the viability and quality of the product.


Reliability, cost-effectiveness, and time-to-market

� Reliability, cost effectiveness and time-to-market considerations play an important role in the design, materials selection and manufacturing decisions, and are the key issues in competing in the global market-place. A company cannot be successful, if its products are not cost effective, or do not have a worthwhile lifetime and service reliability to match the expectations of the customer. Too low a reliability can lead to a total loss of business

� Product failures have an immediate, and often dramatic, effect on the profitability and even the very existence of a company. Profits decrease as the failure rate increases. This is due not only to the increase in the cost of replacing or repairing parts, but, more importantly, to the losses due to the interruption in service, not to mention the “moral losses”. These make obvious dents in the company’s reputation and, as the consequence of that, affect its sails

� The time to develop and to produce products is rapidly decreasing. This circumstance places a significant pressure on both business people and reliability engineers, who are supposed to come up with a reliable product and to confirm its long-term reliability in a short period of time to make their device a product and to make this product successful in the marketplace

� Each business, whether small or large, should try to optimize its overall approach to reliability.

“Reliability costs money”, and therefore a business must understand the cost of reliability, both

“direct” cost (the cost of its own operations), and the “indirect” cost (the cost to its customers

and their willingness to make future purchases and to pay more for more reliable products).


3. Failure Oriented Accelerated Testing (FOAT): its role,

attributes, challenges, pitfalls and interaction with

other accelerated test categories

“Nothing is impossible. It is often merely for an excuse that we

say that things are impossible”

Francois de La Rochefoucauld, French philosofer

“Truth is really pure and never simple”

Oscar Wilde, British writer, “The Importance of Being Earnest”


Why accelerated tests?

� It is impractical and uneconomical to wait for failures, when the mean-time-to-failure for a typical today’s electronic device (equipment) is on the order of hundreds of thousands of hours

� Accelerated testing (AT) enables one to gain greater control over the reliability of a product

� AT has become a powerful means in improving reliability. This is true regardless of whether (irreversible or reversible) failures will or will not actually occur during the FOAT (“testing to fail”) or QT (“testing to pass”)

� In order to accelerate the material’s (device’s) degradation and/or failure, one has to deliberately “distort” (“skew”) one or more parameters (temperature, humidity, load, current, voltage, etc.) affecting the device functional and/or mechanical performance and/or its environmental durability.


Accelerated test categories: traditional definitions

Accelerated

test type

(category)

Product development

(verification) tests

(PDTs)

Qualification (“screening”)

tests (QTs)

Accelerated life tests

(ALTs), highly accelerated

life tests (HALTs), and

failure oriented accelerated

tests (FOATs)

Objective

Technical feedback to

ensure that the taken

design approach is

viable (acceptable)

Proof of reliability;

demonstration that the

product is qualified to

serve in the given capacity

Understand modes and

mechanisms of failure and ,

time permitting, accumulate

failure statistics

End point

Time, type, level, and/or

number of failures

Predetermined time and/or

the # of cycles, and/or the

excessive (unexpected)

number of failures

Predetermined number or

percent of failures

Follow-up

activity

Failure analysis, design

decision

Pass/fail decision Failure analysis and , time

permitting, statistical

analysis of the test data

Perfect (ideal)

test

Specific definition(s) No failure in a long time Numerous failures in a

short time

Accelerated test categories: updated definitions

Accelerated test type (category)

Product development (verification) testing (PDT)

Qualification (“screening”) testing

(QT)

Accelerated Life Testing (ALT)= =Failure Oriented Accelerated Testing (FOAT)

at the design

stage (DQT)

at the manufacturing stage (MQT)

at the design stage (DFOAT) At the manufacturing stage (MFOAT)= Hobbs’ Highly ALT (HHALT)= Accelerated burn-in

Objective Technical feedback to

ensure that the taken design

approach is viable (acceptable)

Proof of reliability; demonstration that the item is qualified into a product, i.e., is able to serve in the

given capacity

Understand physics (modes and mechanisms) of failure, failure limits, and, time permitting, accumulate

failure statistics

Assess failure limits, Weed out infant mortalities

End point Time, type, level, and/or number of

failures

Predermined time and/or number of cycles, and/or the excessive (unexpected)

number of failures

Predetermined number or percent of failures

Follow-up activity

Failure analysis, Design decision

Pass/fail decision

Failure analysis and, time permitting, also statistical analysis

of the test data

Pass/fail decision

Perfect (ideal) test

Specific definitions

No failures in a long time

Numerous failures in a short time No failures in a long time

Some most common accelerated test conditions (stimuli)� High Temperature (Steady-State) Soaking/Storage/ Baking/Aging/ Dwell,

� Low Temperature Storage,

� Temperature (Thermal) Cycling,

� Power Cycling,

� Power Input and Output,

� Thermal Shock,

� Thermal Gradients,

� Fatigue (Crack Propagation) Tests,

� Mechanical Shock,

� Drop Shock (Tests),

� Random Vibration Tests,

� Sinusoidal Vibration Tests (with the given or variable frequency),

� Creep/Stress-Relaxation Tests,

� Electrical Current Extremes,

� Voltage Extremes,

� High Humidity,

� Radiation (UV, cosmic, X-rays),

� Altitude,

� Space Vacuum

Elevated Stress

� AT uses elevated stress level and/or higher stress-cycle frequency as effective stimuli

to precipitate failures over a much shorter time frame

� The “stress” in reliability engineering does not necessarily have to be a mechanical or

a thermo-mechanical: it could be electrical current or voltage, high (or low)

temperature, high humidity, high frequency, high pressure or vacuum, cycling rate, or

any other factor (stimulus) responsible for the reliability of the device or the equipment

� AT must be specifically designed for the product under test

� The experimental design of AT should consider the anticipated failure modes and

mechanisms, typical use conditions, and the required or available test resources,

approaches and techniques.


Qualification Testing (QT) is a must

� The objective of the qualification testing (QT) is to prove that the reliability of the

product-under-test is above a specified level. This level is usually measured by the

percentage of failures per lot and/or by the number of failures per unit time (failure

rate)

�The typical requirement is no more than a few percent failing parts out of the total

lot (population)

�QT enables one to “reduce to a common denominator” different products, as well

as similar products, but produced by different manufacturers

�QT reflects the state-of-the-art in a particular field of engineering, and typical

requirements for the performance of the product. Industry cannot do without QT

�Testing is time limited and is generally non-destructive (not failure oriented).


Today’s Qualification Testing (QT): shortcomings

�The today’s qualification standards and requirements are only good for what they

are intended - to confirm that the given device is qualified into a product to serve in a

particular capacity

�If a product passed the standardized qualification tests, it is not always clear why it

was good, and if the product failed the tests, it is equally unclear what could be done

to improve its reliability

� If a product passed the qualification tests, it does not mean that there will be no

failures in the field, nor it is clear how likely or unlikely these failures might be

� Since QT is not failure oriented, it is unable to provide the most important ultimate

information about the reliability of the product - the probability of its failure after the

given time in service and under the given service (operation, stress) conditions.


Failure Oriented Accelerated Testing (FOAT)-1

�FOAT is aimed at the revealing and understanding the physics of the expected or

occurred failures. Unlike QTs, FOAT is able to detect the possible failure modes and

mechanisms

�Another objective of the FOAT is to accumulate failure statistics. Thus, FOAT deals with

the two major aspects of the Reliability Engineering – physics and statistics of failure

�Adequately planned, carefully conducted, and properly interpreted FOAT provides a

consistent basis for the prediction of the probability of failure after the given time in

service. Well-designed and thoroughly implemented FOAT can dramatically facilitate the

solutions to many engineering and business-related problems, associated with the cost

effectiveness and time-to-market

�This information can be helpful in understanding what should be changed to design a

viable and reliable product. Indeed, any structural, materials and/or technological

improvement can be “translated”, using the FOAT data, into the probability of failure for

the given duration of operation under the given service (environmental) conditions.


Failure Oriented Accelerated Testing (FOAT)-2

�FOAT should be conducted in addition to, and, preferably, long before the

qualification tests. There might be also situations, when FOAT can be used as an

effective substitution for the QT, especially for new products, when acceptable

qualification standards do not yet exist

� While it is the QT that makes a device into a product, it is the FOAT that enables one

to understand the reliability physics behind the product and, based on the appropriate

PM, to create a reliable product with the predicted probability of failure

�There is always a temptation to broaden (enhance) the stress as far as possible to

achieve the maximum “destructive effect” (FOAT effect) in a shortest period of time.

Unfortunately, sometimes, accelerated test conditions may hasten failure mechanisms

that are different from those that could be actually observed in service conditions

(“shift” in the modes and/or mechanisms of failure)


FOAT pitfalls

�Because of the existence of such “shifts”, it is always necessary to correctly identify the

expected failure modes and mechanisms, and to establish the appropriate stress limits, in

order to prevent “shifts” in the original (actual) dominant failure mechanism

�Examples are: change in materials properties at high or low temperatures, time-

dependent strain due to diffusion, creep at elevated temperatures, occurrence and

movement of dislocations caused by an elevated stress, or a situation when a bimodal

distribution of failures (a dual mechanism of failure) occurs

� Since, particularly, infant mortality (“early”) failures might occur concurrently with the

anticipated (“operational”) failures, it is imperative to make sure that the “early” and

“operational” failures are well separated in the tests

� Different failure mechanisms are characterized by different physical phenomena and

different activation energies, and therefore a simple superposition of the effects of two

mechanisms is unacceptable: it can result in erroneous reliability projections.


Burn-in testing (BIT) is a special type of FOAT-1

� Burn-in (“screening”) testing (BIT) is widely implemented to detect and eliminate infant

mortality failures. BIT could be viewed as a special type of manufacturing FOAT (MFOAT).

BIT is needed to stabilize the performance of the device in use

�BIT is supposed to stimulate failures in defective devices by accelerating the stresses

that will cause these devices to fail without damaging good items. The bathtub curve of a

device that undergone BIT is supposed to consist of a steady state and wear-out portions

only.

�The rationale behind the BIT is based on a concept that mass production of electronic

devices generates two categories of products that passed QT:

�1) robust (“strong”) components that are not expected to fail in the field and

�2) relatively unreliable (“week”) components (“freaks”) that, if shipped to the customer,

will most likely fail in the field

�BIT can be based on high temperatures, thermal cycling, voltage, current density, high

humidity, etc., and is performed by either manufacturer or by an independent test house.


Burn-ins – special type of FOAT-2

�For products that will be shipped out to the customer, BIT is nondestructive

�BIT is a costly process, and therefore its application must be thoroughly monitored. BIT

is mandatory on most high-reliability procurement contracts, such as defense, space, and

telecommunication systems. In the today’s practice BIT is often used for consumer

products as well. For military applications the BIT can last as long as a week (168 hours).

For commercial applications burn-ins typically do not last longer than two days (48 hours)

�Optimum BIT conditions can be established by assessment of the main expected failure

modes and their activation energies, and from the analysis of the failure statistics during

BIT

�Special investigations are usually required, if one wishes to ensure that cost-effective

BIT of smaller quantities is acceptable. A cost-effective simplification can be achieved, if

BIT is applied to the complete equipment (assembly or subassembly), rather than to an

individual component, unless it is a large system fabricated of several separately testable

assemblies.



�Although there is always a possibility that some defects might escape the BIT, it is more

likely that BIT will introduce some damage to the “healthy” structure and/or might

“consume” a certain portion of the useful service life of the product: BIT not only “fights”

the infant mortality, but accelerates the very degradation process that takes place in the

actual operation conditions, unless the defectives have a much shorter lifetime than the

healthy products and have a more narrow (more “deterministic”, more “delta-like”)

probability-of-failure distribution density

�Some BIT (e.g., high electric fields for dielectric breakdown screening, mechanical

stresses below the fatigue limit) are harmless to the materials and structures under test,

and do not lead to an appreciable “consumption” of the useful lifetime (field life loss).

Others, although do not trigger any new failure mechanisms, might consume some small

portions of the device lifetime.



�When planning, conducting and evaluating the BIT results, one should make sure that

the stress applied by the BIT is high enough to weed out infant mortalities, but is low

enough not to consume a significant portion of the product’s lifetime, nor to introduce a

permanent damage

� A natural concern, associated with the BIT, is that there is always a jeopardy that BIT

might trigger some failure mechanisms that would not be possible in the actual use

conditions and/or might affect the components that should not be viewed as defective

ones.

� In lasers, the “steady-state” portion is, in effect, not a horizontal, but a slowly rising curve. In addition, wear-out failures, which are characterized by the time-dependent failure rate, occupy a significant portion of the failure-rate (bath-tub) diagram. Standard production BIT should be combined for laser devices with the long-term life testing.


Wear-out failures

�For a well-designed and adequately manufactured product, the were-out failures

should occur at the late stages of operation and testing.

�If one observes that it is not the case (the steady-state portion of the “bathtub”

curve is not long enough or does not exist at all), one should revisit the design and to

choose different materials and/or different design solutions, and/or a different (more

consistent) manufacturing process, etc.

� In some electronics materials (such as BGA and PGA systems) and in some

photonics products (e.g., lasers) the wear-out part of the bathtub curve can occupy a

significant portion of the product’s lifetime, and should be carefully analyzed.


What one should/could possibly do

to prevent failures-1

� Develop an in-depth understanding of the physics of possible failures. No failure

statistics, nor the most effective ways to accommodate failures (such as redundancy,

trouble-shooting, diagnostics, prognostication, health monitoring, maintenance), can

replace good understanding of the physics of failure and good (robust) physical

design

�Assess the likelihood (the probability) that the anticipated modes and mechanisms

might occur in service conditions and minimize the likelihood of a failure by selecting

the best materials and the best physical design of your design/product

�Understand and distinguish between different aspects of reliability: operational

(functional) performance, structural/mechanical reliability (caused by mechanical

loading) and environmental durability (caused by harsh environmental conditions).


What one should/could possibly do

to prevent failures-2

�Distinguish between the materials and structural reliability and assess the effect of

the mechanical and environmental behavior of the materials and structures in his/her

design on the functional performance of the product

�Understand the difference between the requirements of the qualification

specifications and standards, and the actual operation conditions. In other words,

understand well the QT conditions and design the product not only that it would be

able to withstand the operation conditions on the short- and long-term basis, but also

to pass the QT

�Understand the role and importance of FOAT and conduct PM whenever and

wherever possible.


Session II

4. Predictive Modeling (PM): FOAT cannot do without it

“The probability of anything happening is in inverse ratio to its desirability”John W. Hazard, American attorney-at-law

“Any equation longer than three inches is most likely wrong”

Unknown Experimental Physicist


FOAT cannot do without predictive modeling (PM)

� FOAT cannot do without simple and meaningful predictive models. It is on the

basis of such models that one decides which parameter should be accelerated, how

to process the experimental data and, most importantly, how to bridge the gap

between what one “sees” as a result of the accelerated testing and what he/she will

possibly “get” in the actual operation conditions

�By considering the fundamental physics that might constrain the final design, PM

can result in significant savings of time and expense and shed additional light on the

physics of failure

�PM can be very helpful to predict reliability at conditions other than the FOAT and

can provide important information about the device performance

�Modeling can be helpful in optimizing the performance and lifetime of the device, as

well as to come up with the best compromise between reliability, cost effectiveness

and time-to-market .


Requirements for a good predictive model

�A good FOAT PM does not need to reflect all the possible situations, but should be

simple, should clearly indicate what affects what in the given phenomenon or structure,

be suitable/flexible for new applications, with new environmental conditions and

technology developments, as well as for the accumulation, on its basis, the reliability

statistics.

�The scope of the model depends on the type and the amount of information available.

� A FOAT PM does not have to be comprehensive, but has to be sufficiently generic, and

should include all the major variables affecting the phenomenon (failure mode) of interest.

It should contain all the most important parameters that are needed to describe and to

characterize the phenomenon of interest, while parameters of the second order of

importance should not be included into the model.

�FOAT PM take inputs from various theoretical analyses, test data, field data, customer

requirements, qualification spec requirements, state-of-the-art in the given field,

consequences of failure for the given failure mode, etc.


What the existing FOAT PMs predict

�Before one decides on a particular FOAT PM he/she should anticipates the predominant

failure mechanism in advance, and then applied the appropriate model

�The most widespread PMs identify the mean time-to-failure (MTTF) in steady-state-

conditions

�If one assumes a certain probability density function for the particular failure

mechanism, then, for a two-parametric distribution (like, e.g., the normal one) he/she

could construct this function based on the determined mean-time-to-failure and the

measured standard deviation (STD)

�For a single-parametric probability density distribution function, like an exponential one,

the knowledge of the MTTF is sufficient to determine the failure rate and to determine the

probability of failure for the given time in operation.


Most widespread predictive models (PMs)

�Power law (used when the PoF is unclear),

� Boltzmann-Arrhenius equation (used when there is a belief that the elevated temperature is

the major cause of failure),

�Coffin-Manson equation (inverse power law; used particularly when there is a need to

evaluate the low cycle fatigue life-time),

�Crack growth equations (used to assess the fracture toughness of brittle materials),

�Bueche-Zhurkov and Eyring equations (used to assess the MTTF when both the high

temperature and stress are viewed as the major causes of failure),

�Peck equation (used to consider the role of the combined action of the elevated temperature

and relative humidity)

�Black equation (used to consider the roles of the elevated temperature and current density),

�Miner-Palmgren rule (used to consider the role of fatigue when the yield stress is not

exceeded),

�Creep rate equations,

�Weakest link model (used to evaluate the MTTF in extremely brittle materials with defects),

�Stress-strength interference model, which is, perhaps, the most flexible and well

substantiated model.

Example: Boltzmann-Arrhenius equation

� Boltzmann-Arrhenius equation underlies many FOAT related concepts . The MTTF, τ=tau, is proportional to an exponential function, in which the argument is a fraction, where the activation energy, Ua, eV, is in the numerator, and the product of the Boltzmann’s constant, k=8.6174××××10-5eV/ºK, and the absolute temperature, T, is in the denominator:

The equation was first obtained by L. Boltzmann in the statistical theory of gases, and then applied by the S. Arrhenius to describe the inversion of sucrose. Arrhenius paid attention to the fact that the physical processes and the chemical reactions in solid bodies are also enhanced by the absolute temperature

� Boltzmann-Arrhenius equation is applicable, when the failure mechanisms are attributed to a combination of physical and chemical processes. Since the rates of many physical processes (such as, say, solid state diffusion, many semiconductor degradation mechanisms) and chemical reactions (such as, say, battery life) are temperature dependent, it is the temperature that is the acceleration parameter.

� .

( )

−=

*0 expTTk

U aττ


Boltzmann-Arrhenius Equation and the PDfR concept

�Boltzmann-Arrhenius equation addresses degradation processes and attributes

degradation and possible failures cased by degradation to elevated temperatures and

possibly to the elevated humidity as well, i.e., to the environmental factors.

�The failure rate for a system whose MTTF is given by the Boltzmann-Arrhenius

equation can be found as

�The probability of failure at the moment t of time can be found as

This formula is known as exponential formula of reliability. If the probability of failure

P is established for the given time t in operation, then the exponential formula of

reliability can be used to determine the acceptable failure rate.

( )

−−=

*

0

exp1

TTk

U a

τλ

teP λ−−= 1


Coffin-Manson Equation (Inverse Power Law)-1

� Many electronic materials and especially solder joints fail primarily because of the

elevated mechanical stresses and deformations (strains). The numerous existing

empirical and semi-empirical methods and approached that address the low-cycle-fatigue

life-time of solders are, in one way or another, based on the pioneering work of Coffin and

Manson

�It has been established that materials that experience elevated stresses and strains

within the elastic range fail because of elevated stresses, whether steady-state or variable,

while the materials that experience high stresses exceeding yield stress fail primarily

because of the inelastic deformations. Such a behavior, known as low-cycle-fatigue

conditions, is typical for solder materials, including even lead-free solders whose yield

point might be substantially higher than that for tin-lead solders

�The original Coffin-Manson equation is just an inversed power law that is applicable to

highly compliant materials exhibiting significant plastic deformations prior to failure. The

inverse power law is used also in some other, physically quite different, applications, such

as MTTF in random vibration tests (Steinberg’s formula); aging in high-power lasers, etc.



�The studies carried out in the 1990-s addressed primarily flip-chip tin-lead solder joint

interconnections. The today’s studies address primarily the thermal fatigue life of ball-

grid-array (BGA) and pad-grid-array (PGA) systems and especially lead-free solder joints

�The thermally-induced stresses and strains in the flip-chip solder joints are caused by

the CTE mismatch of the chip and the package substrate materials, as well as by the

temperature gradients because of the difference in temperature between the “hot” chip

and the “cold” substrate. In BGA and PGA systems the stresses and strains are caused

by the mismatch of the package structure and the PCB (“system’s substrate”)

�The numerous suggested phenomenological semi-empirical models are based on the

prediction and improving the solder material fatigue caused by the accumulated cyclic

inelastic strain in the solder material. This strain is due to the temperature fluctuations

resulting from the changes in the ambient temperature (temperature cycling) and/or from

heat dissipation in the package (power cycling).



�The modified Coffin-Manson model

can be used to model crack growth in solder and other metals due to temperature

cycling. In the above formula, is the number of cycles to failure, f is the cycling

frequency, ∆T is the temperature range during a cycle, Tmax is the maximum

temperature reached in each cycle, and k is Boltzmann’s constant. Typical values for

the cycling frequency exponent α and the temperature range exponent β are around -

1/3 and 2, respectively. Reduction in the cycling frequency reduces the number of

cycles to failure. The activation energy U is around 1.25.

�In recent years a visco-plastic rate dependent constitutive model, known as Anand

model, is often used in combination with the FEA simulation to predict the solder

joint reliability. In Anand’s model (that includes one flow equation and three evolution

equations) plasticity and creep phenomena are unified and described by the same set

of flow and evolution relations.

−∆= −−

max

expkT

UTAf f

βα

f


Stress-strength (“interference”) model


Fig.20. Stress-strength (“Interference”) models-1

Stress (Demand) and Strength (Capacity) Distributions


EXAMPLE OF A FOAT:

Physics, Modeling, Experimentation, Prediction

5.

“A theory without an experiment is dead. An experiment without a theory is blind”

Unknown Reliability Engineer


Finite_Element Analysis

(FEA) Data


Predicted Stresses and Strains

in a Short Cylinder


Experimental bathtub curve for the solder joint

interconnections in a flip-chip multichip module


Probability of failure of the solder joint interconnections

vs. failure rate

Probabilistic design for reliability (pdfr) in electronics part1of2

Technology

Transcript of Probabilistic design for reliability (pdfr) in electronics part1of2