Accelerated Stress Testing and Reliability Workshop October 9-11, 2013 San Diego, CA Accelerating...

Accelerated Stress Testing and Reliability Workshop, October 9-11, 2013, San Diego, CA

Accelerating Reliability into the 21st Century

Keynote Presenter Day 1: Vice Admiral Walter Massenburg

Keynote Presenter Day 2: Alain Bensoussan, Thales Avionics

CALL FOR PRESENTATIONS: We are now Accepting Abstracts. Email to: [email protected].

Guidelines on website www.ieee-astr.org

For more details, join our LinkedIn Group: IEEE/CPMT Workshop on Accelerated Stress Testing and Reliability


http://www.ieee-astr.org/

http://www.linkedin.com/groups/IEEE-CPMT-Workshop-on-Accelerated-4364303?trk=myg_ugrp_ovr

This is the third in a series of four webinars presented by Ops A La Carte, ASTR, and the ASQ Reliability Division.

Each webinar will also be presented as a full 2-hour tutorial at our ASTR Workshop, October 9-11, San Diego.

Abstracts for presentations are due April 30. See www.ieee-astr.org.

Introduction: 5 min

Accelerated Reliability Growth Testing: 45 min

Questions: 10 min

Agenda

Upcoming Reliability Webinars

Title: 40 Years of HALT: What Have We Learned

Author: Mike Silverman

Date: Sept 12, 2013, 12pm EST

http://reliabilitycalendar.org/webinars/english/40-years-of-halt-what-have-we-learned/

Location: Webinar

HALT began 40 years ago with a simple idea of testing beyond specifications in order to better understand design margins. Over the past 40 years, thousands of engineers around the world have been exposed to the concepts of HALT and have tried the techniques.

This tutorial will explore what we have learned in the past 40 Years and what the future of HALT could be.




Registration Demographics

For this webinar we have signed up:
- 250 registrants
- 17 countries
- 28 US states

Registration Question #1

Have you ever performed a Reliability Growth Test?
- Never: 45%
- All the time: 25%
- Tried once: 20%

Registration Question #2

For your last RGT, did you have a chance to plan the duration and stresses?
- Neither: 50%
- Both: 25%
- Duration only: 10%
- Stresses only: 10%

Copyright © 2012 Raytheon Company. All rights reserved. Customer Success Is Our Mission is a registered trademark of Raytheon Company.

Traditional and Accelerated Reliability Growth

The Case of Lost (and Found) Failure Rates

Milena Krasich, P.E., Raytheon IDS

• Identify shortcomings of traditional reliability growth testing and offer alternatives:
  - Reliability growth test objectives
  - Explain the traditional reliability growth test methodology along with its assumptions
  - Show shortfalls of the traditional methods:
    - Entire item failure rate is not calculated or presented in results
    - Test duration is too long for modern high-reliability items
    - Little or no relationship between reliability and the stresses on the tested item
  - Show principles of the Physics of Failure (PoF) test methodology
  - Show how the reliability growth test based on PoF is constructed
  - Show how the expected stresses are applied and accelerated
  - Show how to account for total final failure rates
  - Show the considerable test cost reduction achieved

Tutorial Objectives

• Overall test duration is determined from the initial and goal reliability measures: failure rate or Mean Time Between Failures, MTBF (or MTTF)
• Initial failure rate is estimated for the entire item and then used for calculations of reliability growth
• Reliability growth parameters and test duration are determined mathematically from the goal reliability
• Magnitude (stress level) of applied operational and environmental stresses is equal to that in use, but their duration is not:
  - Applied stress duration is determined by engineering judgment, and level by assumption of some "mean" stress
  - Overall test duration and stress application are unrelated to use profiles, required life, or mission of the product; they follow only from the mathematics
• Additional errors:
  - Mathematical

Traditional RG Test Methodology

• Goal: increase the current (existing) reliability, measured in mean time between failures
  - Goal magnitude guided by:
    - Requirement or commercial logic
• The item as designed contains design errors:
  - Those are expected to appear in test within the determined test time
  - The errors found in test are eliminated by design corrections (type B failure modes)
  - Continuation of the test evaluates the success of each fix
  - Design errors that cannot be fixed (type A failure modes) continue to be counted
  - Failures determined to be random are not counted
  - Reliability growth is measured

Principles and Assumptions

• Failure rate during the test is constant while there are no changes to the tested item
• Failure rate decreases in steps with introduced design corrections, and remains constant until the next change
• The step curve is fitted with a curve representing a Non-Homogeneous Poisson Process (NHPP)
  - The process definition: failure rate is constant until changes occur
• Facts not considered in the application of that theory:
  - The initial failure rate is just the total failure rate. There is no rationale for how much of it is attributed to:
    - Design problems that can be corrected
    - Random events (failure modes whose origin is unknown; they "just happen")
    - Design problems that cannot be corrected for one of these reasons:
      - Technically impossible
      - Economically not justifiable
      - Time-to-market constraints

Principles and Assumptions, cont.

• The expected accumulated number of failures up to test time T is given by:

$$E[N(T)] = \lambda T^{\beta}, \quad \text{with } \lambda > 0,\ \beta > 0,\ T > 0$$

where
  - λ is the scale parameter;
  - β is the shape parameter, a function of the general effectiveness of the improvements (0 < β < 1 corresponds to reliability growth; β = 1 corresponds to no reliability growth; β > 1 corresponds to negative reliability growth, that is, reliability degradation)

• The failure intensity w, when it is changing as a result of design improvements, after T hours of testing is given by:

$$w(t) = \frac{d}{dt} E[N(t)] = \lambda \beta t^{\beta - 1}, \quad \text{with } t > 0, \qquad w(T) = \lambda \beta T^{\beta - 1}$$

Mathematical Model - Refresher
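As a minimal sketch of the two power law formulas, E[N(t)] = λ·t^β and w(t) = λ·β·t^(β−1), the following Python functions evaluate the model. The function names and the parameter values are hypothetical, chosen only to show that the intensity decreases with test time when β < 1; they are not taken from the presentation.

```python
def expected_failures(t, lam, beta):
    """Expected cumulative number of failures, E[N(t)] = lam * t**beta."""
    return lam * t ** beta


def failure_intensity(t, lam, beta):
    """Failure intensity, w(t) = lam * beta * t**(beta - 1)."""
    return lam * beta * t ** (beta - 1)


# Hypothetical parameters for illustration only (beta < 1 means growth).
lam, beta = 0.05, 0.6

for t in (100.0, 1000.0, 5000.0):
    print(f"t = {t:7.0f} h  E[N] = {expected_failures(t, lam, beta):6.2f}  "
          f"w = {failure_intensity(t, lam, beta):.2e} failures/h")
```

With β = 1 the intensity is constant (no growth), and with β > 1 it increases, which matches the interpretation of the shape parameter given on the slide.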

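$$w_r(t) = \text{const.}, \qquad w_A(t) = \text{const.}$$

$$w_{Item}(t) = w_B(t) + w_A(t) + w_r(t) = \lambda \beta t^{\beta - 1} + w_A(t) + w_r(t)$$

where w_B, w_A, and w_r are the failure intensities of the type B, type A, and random failure modes.

• Failure mode types in test:
  - Systematic: corrected in test (type B) or not corrected (type A)
  - Random: constant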

Mathematics of Traditional Reliability Growth

[Figure: failure intensity/failure rate (failures/hour) versus test duration (hours), 0 to 6000 hours. Curves: S(t) = A(t) + r(t) + B(t), r(t), B(t), and A(t). Annotation on the plot: the only failure modes with decreasing failure rates (power law).]

Only the failure rates of type B failure modes are accounted for in a reliability test program: those that show growth expressed by the power law model. The type A and random failure rates remain constant.

$$w_{Item}(t) = \lambda \beta t^{\beta - 1}$$

• To plan reliability growth, the initial value of the failure rate, λ_I, or the initial mean time between failures, θ_I, was assumed to be known at some time t_I. This initial failure rate would have a value known from experience with that item or by similarity with a like item: w_I(t_I) = constant
• The thought process was then that this initial failure rate would decrease under the rules of the power law and, at the end of the test with the corrections, would assume a final value (a constant again), w_F(t_F)
• The Crow/AMSAA/Duane planning model is simple and easy to implement:

$$w(t) = w_I(t_I) \left(\frac{t}{t_I}\right)^{\beta - 1}$$

• But the initial failure rate has three components, and only one of them, the failure rate of the B failure modes, can be improved and fitted with the power law. The remaining components are constant.

Planning Reliability Growth

• The remaining two components are constant. The final failure rate as a function of time also contains three components: two constant, and only one that can be fitted with the power law:

$$w_I(t_I) = w_{BI}(t_I) + w_A(t_I) + w_r(t_I) = w_{BI}(t_I) + w_A + w_r$$

$$w_{BF}(t_F) = w_{BI}(t_I) \left(\frac{t_F}{t_I}\right)^{\beta - 1}$$

• The final B-modes failure rate is then the improved failure rate of the B-type failure modes, and the total final item or system failure rate also contains the two additional constant components:

$$w_F(t_F) = w_{BF}(t_F) + w_A(t_F) + w_r(t_F) = w_{BI}(t_I) \left(\frac{t_F}{t_I}\right)^{\beta - 1} + w_A + w_r$$

Planning Reliability Growth, cont.

• The random failure rates are not recorded or taken into account. The A-type failures are considered in the number of failures:
  - It is said that they are included in the shape parameter calculations, but there is no example in current handbooks that shows how this was done
  - It is also stated that type A failure modes are counted every time they show up, repetitions included; no example of that statement could be found
• Given that no improvement is applied, type A failure modes should be treated in the same manner as the random failure rates. They could be accounted for separately, but numerically their failure rate is added to the random failure rate.
• This means that during the test, the A-type failure modes should be counted as another group of constant failure rates
  - In that case, the methodology of fixed-duration testing should be applied to determine failure rates for both:
    - The A-type failure modes
    - All other random failure modes where the origin is not identifiable

A Failure Modes

• Test duration is mathematically determined from the reciprocal of the "failure rate" as:

$$\theta_F(t_F) = \theta_I(t_I) \left(\frac{t_F}{t_I}\right)^{1 - \beta}$$

$$\log t_F = \log t_I + \frac{\log \theta_F - \log \theta_I}{1 - \beta}, \qquad t_F = t_I \left(\frac{\theta_F}{\theta_I}\right)^{\frac{1}{1 - \beta}}$$

Where:
  - θ_F = final product MTBF (for mitigated, "fixed" failure modes only): the given goal
  - θ_I = initial product MTBF (for failure modes that will be mitigated): assumed
  - t_F = test duration needed to achieve the final MTBF for fixed failure modes
  - t_I = initial test time (has various explanations): assumed. What is it?

Example, old school: θ_I = 4,000 hours, θ_F = 10,000 hours, β = 0.6

[Figure: test duration (hours, 0 to 1×10^4) versus initial test time (hours, 0 to 800) for the example above. Curves: t_F(t_I) and t_I.]

Present Method to Determine Test Duration
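The dependence of the planned duration on the assumed initial test time can be checked numerically. The sketch below assumes the planning relation θ_F = θ_I·(t_F/t_I)^(1−β) solved for t_F, and uses the old-school example values; the initial test time of 200 hours is the value assumed on the Comparison slide, and the function name is illustrative.

```python
def growth_test_duration(theta_i, theta_f, t_i, beta):
    """Test time t_F at which the power law MTBF grows from theta_i to theta_f.

    Solves theta_f = theta_i * (t_f / t_i) ** (1 - beta) for t_f.
    """
    return t_i * (theta_f / theta_i) ** (1.0 / (1.0 - beta))


# Example values from the slide: theta_I = 4,000 h, theta_F = 10,000 h, beta = 0.6.
t_f = growth_test_duration(theta_i=4000.0, theta_f=10000.0, t_i=200.0, beta=0.6)
print(round(t_f))  # 1976

# The result scales linearly with the assumed initial test time t_I.
for t_i in (100.0, 200.0, 400.0, 800.0):
    print(t_i, round(growth_test_duration(4000.0, 10000.0, t_i, 0.6)))
```

Because t_F is directly proportional to t_I, the choice of an initial test time that "has various explanations" drives the planned duration just as strongly as the MTBF goal does.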

• In the traditional test design, the initial test MTBF is the MTBF assumed for the product, but:
  - The reciprocal of this initial MTBF is the initial failure rate, made up of three components, two of which are constant and do not follow the power law:
    - Design: correctable
    - Design: non-correctable
    - Random failure rates or failure modes
  - Only the design failure modes that can be corrected (type B) can be fitted by the power law (Weibull intensity function), thus:
    - What part of the assumed, estimated initial failure rate of the entire item could those correctable failure modes be?
    - Analytical prediction contains only the random failure rates
      - If Design Engineering is reasonably competent, type A or B failure modes could be at most 40% of the assumed initial failure rate
      - The B failure rate could be only a small fraction of the estimated product failure rate before the test

$$\theta_{BI} = \frac{1}{w_{BI}(t_I)}, \qquad w_{BI}(t_I) < w_I(t_I)$$

Initial MTBF – What is It?

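• Recorded in test are the cumulative times of occurrence of A and B failure modes
• A modes are not addressed, so they should not be a part of the power law. Handbook text suggests they are counted; if they were, it would be in error
  - From test data, the shape and scale parameters are determined:

$$E[N_B(t)] = \lambda_B t^{\beta_B}, \qquad w_B(t) = \frac{d}{dt} E[N_B(t)] = \lambda_B \beta_B t^{\beta_B - 1}, \quad \text{with } t > 0$$

$$\hat{\beta}_B = \frac{N_B}{N_B \ln T - \sum_{i=1}^{N_B} \ln t_i}, \qquad \hat{\lambda}_B = \frac{N_B}{T^{\hat{\beta}_B}}$$

$$\bar{\beta}_B = \frac{N_B - 1}{N_B \ln T - \sum_{i=1}^{N_B} \ln t_i} \quad \text{(unbiased)}$$

• The reported failure rate and MTBF are:

$$w_B(T) = \hat{\lambda}_B \hat{\beta}_B T^{\hat{\beta}_B - 1}, \qquad \theta_B(T) = \frac{1}{w_B(T)}$$

• Random and A modes do not seem to be a part of the achieved growth. Unfortunately, they are forgotten.

Parameters and Results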
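The estimation step can be sketched in Python. This uses the standard maximum likelihood estimators of the power law model for a time-terminated test, which is an assumption about what the slide shows; the failure times and the total test time below are hypothetical and serve only to exercise the formulas.

```python
import math


def fit_power_law(failure_times, total_time):
    """Estimate power law parameters from B-mode failure times.

    Returns (beta_hat, beta_unbiased, lambda_hat) for a test terminated at total_time.
    """
    n = len(failure_times)
    # Sum of ln(T / t_i), equal to N*ln(T) - sum(ln t_i).
    s = sum(math.log(total_time / t) for t in failure_times)
    beta_hat = n / s
    beta_unbiased = (n - 1) / s
    lambda_hat = n / total_time ** beta_hat
    return beta_hat, beta_unbiased, lambda_hat


# Hypothetical cumulative times (hours) of B-mode failures in a 2,000 h test.
times = [35.0, 110.0, 260.0, 540.0, 900.0, 1500.0]
T = 2000.0

beta_hat, beta_bar, lambda_hat = fit_power_law(times, T)
w_B = lambda_hat * beta_hat * T ** (beta_hat - 1)  # reported B-mode failure intensity
theta_B = 1.0 / w_B                                # reported B-mode MTBF

print(f"beta_hat = {beta_hat:.3f}, unbiased = {beta_bar:.3f}, lambda_hat = {lambda_hat:.4f}")
print(f"w_B(T) = {w_B:.2e} failures/h, theta_B(T) = {theta_B:.0f} h")
```

Note that the result describes the B modes only; the constant type A and random failure rates would still have to be added to obtain the item failure rate.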

• Assume the initial test time is 200 hours
• Traditional test (all failure rates follow the power law):
  - Initial failure rate: λ_I = 2.5×10^-4 f/hr
  - Initial MTBF: θ_I = 4,000 hours
  - Final MTBF: θ_F = 10,000 hours
  - Final test time: 1,976 hours (from the initial time)
• True status, only B-type failure modes improved (e.g. at most 40% of the old "initial" failure rate, λ_I = 2.5×10^-4 f/hr):
  - Initial failure rate for B modes: λ_IB = 0.4 × 2.5×10^-4 f/hr = 1×10^-4 f/hr
  - Initial MTBF for B modes: θ_IB = 10,000 hours
  - Possible final MTBF for B modes: θ_FB = 30,000 hours
  - Overall final failure rate, B modes plus random and A modes: 1.833×10^-4 f/hr
  - Final overall MTBF: θ_F = 5,455 hours
  - Final test time: 3,118 hours (from the initial time)
• The forgotten, unreported failure rate: 1.5×10^-4 f/hr

Comparison
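The comparison can be reproduced with a few lines of arithmetic. The sketch below assumes the planning relation t_F = t_I·(θ_F/θ_I)^(1/(1−β)) with β = 0.6 and t_I = 200 hours, and the 40% share of B modes used in the example; variable names are illustrative.

```python
BETA = 0.6   # shape parameter from the example
T_I = 200.0  # assumed initial test time, hours


def duration(theta_i, theta_f, t_i=T_I, beta=BETA):
    """Planned test time for MTBF growth from theta_i to theta_f."""
    return t_i * (theta_f / theta_i) ** (1.0 / (1.0 - beta))


lam_i = 2.5e-4                       # total initial failure rate, f/hr
lam_b_i = 0.4 * lam_i                # part attributed to correctable (B) modes
lam_const = lam_i - lam_b_i          # type A plus random: never improves

# Traditional view: the whole failure rate follows the power law.
t_traditional = duration(1.0 / lam_i, 10000.0)

# B-modes-only view: grow the B-mode MTBF from 10,000 h to 30,000 h.
theta_b_f = 30000.0
t_b_only = duration(1.0 / lam_b_i, theta_b_f)
lam_final = 1.0 / theta_b_f + lam_const
theta_final = 1.0 / lam_final

print(f"traditional test time: {t_traditional:.0f} h")
print(f"B-only test time:      {t_b_only:.0f} h")
print(f"final failure rate:    {lam_final:.3e} f/hr")
print(f"final overall MTBF:    {theta_final:.0f} h")
print(f"unreported constant failure rate: {lam_const:.1e} f/hr")
```

Even after tripling the B-mode MTBF, the item MTBF stays near 5,455 hours because the constant 1.5×10^-4 f/hr term dominates the final failure rate.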

• The possible correct solution:
  - Prepare a reliability growth test for B failure modes only
  - Count type A failure modes as if they were random
  - Count random failures
  - Calculate the final failure rate and MTBF of the B failure modes
  - Add the constant A and random failure rates to get the results
• Possible problems and difficulties:
  - The calculated mathematical test duration is unrelated to use stresses or use profile
  - The traditionally determined test duration is too short to account for the random failures; normally the required test duration for a reasonable confidence is about 10 MTBFs (in our example, about 70,000 hours)
    - The traditional RG test duration does not support this test time
  - A short reliability growth test does not disclose cumulative damage or failures with small failure rates that would start showing only after the test is complete, while the useful life of the item could be 10 or 20 years
• The proposed viable solution: an accelerated reliability growth test

The Solution – Way Forward

• Failures occur when an item is not strong enough to withstand one or more attributes of a stress: level, duration, or repetitions of its application
  - The higher the level, the shorter the duration or the fewer the repetitions needed to induce a failure
  - If the mean of strength is k times the mean of stress (load), and the standard deviations of each are a and b times their respective mean values, the reliability of an item regarding each use stress i, and the total reliability, will be:

$$R_i(k_i, \mu_{L\_i}) = \Phi\left[\frac{k_i \mu_{L\_i} - \mu_{L\_i}}{\sqrt{(a\, k_i \mu_{L\_i})^2 + (b\, \mu_{L\_i})^2}}\right] = \Phi\left[\frac{k_i - 1}{\sqrt{a^2 k_i^2 + b^2}}\right]$$

$$R_{Item}(t_0) = \prod_{i=1}^{S} R_{Stress\_i}(t_0)$$

where Φ is the standard normal cumulative distribution function. The area of overlap of the strength and stress distributions represents the probability of failure for each of the stresses.
  - μ_L, σ_L = mean and standard deviation of the load distribution, σ_L = b × μ_L
  - μ_S, σ_S = mean and standard deviation of the strength distribution, σ_S = a × μ_S

Physics of Failure and Reliability
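The stress-strength relation can be explored numerically. This sketch assumes normally distributed stress and strength, so that the reliability for one stress is Φ[(k − 1)/√(a²k² + b²)]; it uses a = 0.05 and b = 0.2, which the presentation names as the most common values. Function names are illustrative.

```python
import math


def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def stress_reliability(k, a, b):
    """Reliability regarding one stress when mean strength = k * mean stress.

    a and b are the coefficients of variation of strength and stress.
    """
    return std_normal_cdf((k - 1.0) / math.sqrt((a * k) ** 2 + b ** 2))


def item_reliability(per_stress):
    """Total reliability as the product of the per-stress reliabilities."""
    total = 1.0
    for r in per_stress:
        total *= r
    return total


for k in (1.0, 1.1, 1.25, 1.5):
    print(f"k = {k:.2f}  R = {stress_reliability(k, a=0.05, b=0.2):.4f}")

# Four stresses with the same margin, as an illustration of the product rule.
r_one = stress_reliability(1.5, a=0.05, b=0.2)
print(f"item reliability for four such stresses: {item_reliability([r_one] * 4):.4f}")
```

At k = 1 the two distributions have the same mean and the reliability is 0.5; increasing the margin k, or reducing the spreads a and b, moves the reliability toward 1.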

• Allocate reliability regarding each of the expected stresses in use
  - The cumulative damage, and ultimately failure, due to a stress is proportional to the stress level and its duration. For a stress applied at the same level as in life, the cumulative damage model is:

$$D(t) \propto \int_0^t S(t)\, dt$$

Physics of Failure Reliability – Margin k Selection

[Figure: reliability (0.50 to 1.00) versus multiplier k (1.00 to 1.50). Curves: b = 0.5, a = 0.05; b = 0.2, a = 0.05; b = 0.05, a = 0.05; b = 0.2, a = 0.02; b = 0.1, a = 0.02; b = 0.05, a = 0.02.]

• For the allocated reliability regarding each stress, select the value of margin k that multiplies its duration in use to give the duration applied in test
• Apply stresses simultaneously whenever possible
• If the same stress type is applied at different levels in use, recalculate their durations to the highest level (using acceleration factors)
• The most common values for a and b are a = 0.05, b = 0.2

• Each of the stresses is accelerated in test to allow for a shorter test duration
• The total item failure rate is the sum of its failure rates regarding each individual stress, where λ_0 is the item total failure rate in use conditions and λ_A is the accelerated item total failure rate (in reliability growth, λ is equivalent to w):

$$\lambda_{A\_Test} = \sum_{i=1}^{N_S} \lambda_{i0} \prod_{j} A_{ij}$$

• The product over j exists when the stresses 1 to j produce the same failure mode
• Stress acceleration models for different stresses, for example:
  - Inverse power law model (usually applicable to thermal cycling, vibration, shock, humidity)
  - Arrhenius model (used for temperature acceleration using absolute temperature)
  - Eyring model (used also when the thermal stress is a factor in process acceleration)
  - Step stress model, where the stress is increasing in steps
  - Fatigue model, representing the degradation due to repetitious stress

Test Acceleration

Test Example B Failure Modes – duration k×life

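Stresses:
- Thermal cycling
- Thermal exposure (thermal dwell)
- Humidity
- Vibration
- Operational cycling

Thermal cycling

One thermal cycle in test = 24 hours in life.

$$A_{TC} = \left(\frac{\Delta T_{Test}}{\Delta T_{Use}}\right)^{m}$$

$$A_{Ramp\_Rate} = \left(\frac{\text{ramp rate}_{Test}}{\text{ramp rate}_{Use}}\right)^{1/3}$$

$$N_{TC\_Test} = \frac{k \cdot N_{TC\_Use}}{A_{TC} \cdot A_{Ramp\_Rate}}$$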

Determination of factor k for major stresses:

$$k = 1.5, \qquad R_i(t_0) = \left[R_0(t_0)\right]^{1/4} = 0.946$$

[Figure: reliability (0.5 to 1) versus multiplier k (1.00 to 1.50) for a = 0.1, b = 0.1.]

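Thermal dwell (normalize exposure when OFF to duration at ON temperature):

$$t_{ON\_N} = t_{ON} + t_{OFF} \exp\left[\frac{E_a}{k_B}\left(\frac{1}{T_{ON} + 273} - \frac{1}{T_{OFF} + 273}\right)\right] = 8{,}754 \text{ hours}$$

Duration of accelerated exposure:

$$t_{T\_Test} = k \cdot t_{ON\_N} \exp\left[-\frac{E_a}{k_B}\left(\frac{1}{T_{ON} + 273} - \frac{1}{T_{Test} + 273}\right)\right] = 168.1 \text{ h}$$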

Parameter                Symbol      Value
Required life            t_0         10 years = 87,600 h
Required reliability     R_0(t_0)    0.8
Time ON                  t_ON        2 h/day = 7,300 h
Temperature ON           T_ON        65 °C
Time OFF                 t_OFF       22 h/day = 80,300 h
Temperature OFF          T_OFF       35 °C
Thermal cycling          ΔT_Use      45 °C, two times per day
Total cycles             N_Use       7,300
Temperature ramp rate                1.5 °C/min
Vibrations, random       W_Use       16.68 m/s² r.m.s.
Relative humidity        RH_Use      50 %
Activation energy        E_a         1.2 eV
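The normalization of the OFF exposure to the ON temperature can be checked with the table values. The sketch below assumes the Arrhenius model with temperatures converted to kelvin by adding 273 and a Boltzmann constant of 8.617×10^-5 eV/K; the presentation does not state which rounding of the constant it used, so the result agrees with the 8,754 hours on the slide only to within a few hours.

```python
import math

K_B = 8.617e-5  # Boltzmann constant in eV/K (rounding is an assumption)


def arrhenius_ratio(ea_ev, t_ref_c, t_other_c):
    """exp[(Ea/kB) * (1/T_ref - 1/T_other)], temperatures given in degrees C.

    Converts one hour at t_other_c into equivalent hours at t_ref_c.
    """
    t_ref = t_ref_c + 273.0
    t_other = t_other_c + 273.0
    return math.exp(ea_ev / K_B * (1.0 / t_ref - 1.0 / t_other))


# Values from the parameter table.
t_on, t_off = 7300.0, 80300.0   # hours over the 10-year life
temp_on, temp_off = 65.0, 35.0  # degrees C
e_a = 1.2                       # activation energy, eV

# Equivalent hours at the ON temperature: ON time plus derated OFF time.
t_on_n = t_on + t_off * arrhenius_ratio(e_a, temp_on, temp_off)
print(f"normalized ON exposure: {t_on_n:.0f} h")
```

With an activation energy of 1.2 eV, the 80,300 hours spent at 35 °C contribute less than 1,500 equivalent hours at 65 °C, which is why the normalized exposure is only slightly above the ON time itself.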

The thermal exposure is combined with the thermal cycling, distributed over the high temperature.

The test cycle profile:

Humidity: test at 95% RH and temperature T_RH = 85 °C (65 °C chamber + 20 °C internal temperature rise).

Vibration: 150,000 miles, 150 hours per axis of vibration at 1.7 g rms. Test level: 3.2 g rms.

To project test time to life, multiply the test time by the acceleration factor.

Data for reliability plotting:
- Initial B failure modes MTBF: 100,000 hours; final: 10^6 hours
- Initial test time: 100 hours
- Total traditional test time: 4.6×10^3 hours
- Final test reliability (B failure modes): 0.99997
- Final MTBF (improved failure modes): 1,431,964 hours
- Total accelerated test time: 526 hours

Test Example, Cont.

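Test cycle duration:

$$t_{TC} = 2 \times (\text{ramp time}) + (\text{temperature stabilization} + \text{thermal dwell}) + (\text{dwell at cold}) = 2 \times \frac{125}{10} + 22.3 + 5 = 52.3 \text{ min} \approx 0.875 \text{ h}$$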

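Humidity test duration:

$$t_{RH\_Test} = k \cdot t_{ON\_N} \left(\frac{RH_{Use}}{RH_{Test}}\right)^{h} \exp\left[-\frac{E_a}{k_B}\left(\frac{1}{T_{ON} + 273} - \frac{1}{T_{RH} + 273}\right)\right] = 300 \text{ h}, \quad \text{with } h = 2.3$$

Vibration test duration:

$$t_{Vib\_Test} = k \cdot t_{Vib\_Use} \left(\frac{W_{Use}}{W_{Test}}\right)^{w} = 18 \text{ hours per axis}, \quad \text{with } w = 4$$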
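Both durations can be reproduced from the example data. The sketch assumes an inverse power law for humidity and vibration combined with the Arrhenius term for the humidity temperature, with k = 1.5, h = 2.3, w = 4, the normalized ON exposure of 8,754 hours, and a Boltzmann constant of 8.617×10^-5 eV/K (the rounding of the constant is an assumption). Variable names are illustrative.

```python
import math

K_B = 8.617e-5     # Boltzmann constant in eV/K (rounding is an assumption)
K_MARGIN = 1.5     # margin multiplier k from the example
E_A = 1.2          # activation energy, eV
T_ON_N = 8754.0    # normalized ON exposure from the slide, hours

# Humidity: 50 % RH at 65 C in use, 95 % RH at 85 C in test, exponent h = 2.3.
rh_use, rh_test, h_exp = 50.0, 95.0, 2.3
temp_on_k, temp_rh_k = 65.0 + 273.0, 85.0 + 273.0
thermal_term = math.exp(-E_A / K_B * (1.0 / temp_on_k - 1.0 / temp_rh_k))
t_rh_test = K_MARGIN * T_ON_N * (rh_use / rh_test) ** h_exp * thermal_term

# Vibration: 150 h per axis at 1.7 g rms in use, 3.2 g rms in test, exponent w = 4.
t_vib_use, w_use, w_test, w_exp = 150.0, 1.7, 3.2, 4.0
t_vib_test = K_MARGIN * t_vib_use * (w_use / w_test) ** w_exp

print(f"humidity test duration:  {t_rh_test:.0f} h")
print(f"vibration test duration: {t_vib_test:.1f} h per axis")
```

The margin k multiplies the use duration before acceleration is applied, so the test covers k times the required life for each stress rather than the life itself.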

Failure   Time to failure (h)   Cumulative time to failure, n = 24 (h)   θ(t) (h)      log(t)   log[θ(t)]
1         3,821.33              91,711.92                                91,711.92     4.96     4.96
2         5,781.33              138,751.92                               69,375.96     5.14     4.84
3         14,016                336,384.00                               112,128       5.53     5.05
4         18,563.44             445,522.56                               111,380.64    5.65     5.05
t_0*k     131,400               3,153,600                                788,400       6.50     5.90
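The plotting data in the table can be recomputed from the failure times. The sketch assumes that n = 24 is the multiplier that converts each time to failure into cumulative time (for example, the number of units on test, which the slide does not state explicitly), that θ(t) is the cumulative time divided by the number of failures so far, and that the logarithms are base 10.

```python
import math

N_MULTIPLIER = 24  # "n = 24" from the table; its meaning is an assumption
times_to_failure = [3821.33, 5781.33, 14016.0, 18563.44]  # hours, projected to life
end_of_test = 131400.0  # t_0 * k = 1.5 * 87,600 h

rows = []
for count, t in enumerate(times_to_failure, start=1):
    cumulative = N_MULTIPLIER * t
    theta = cumulative / count  # cumulative MTBF after `count` failures
    rows.append((count, cumulative, theta, math.log10(cumulative), math.log10(theta)))

# End-of-test point: all four failures accumulated over the full duration.
cumulative_end = N_MULTIPLIER * end_of_test
theta_end = cumulative_end / len(times_to_failure)
rows.append(("t_0*k", cumulative_end, theta_end,
             math.log10(cumulative_end), math.log10(theta_end)))

for label, cumulative, theta, log_t, log_theta in rows:
    print(f"{label!s:>6}  {cumulative:12.2f}  {theta:11.2f}  {log_t:.2f}  {log_theta:.2f}")
```

Plotting log[θ(t)] against log(t) gives the Duane-style growth plot; the slope of a line through these points estimates 1 − β.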

• The test duration covers the entire product life
  - It allows detection of all design problems, not only those that appear in a small fraction of product life
  - It enables an estimate of the failure rate regarding product random events, disregarded in traditional RG testing
  - The failure rate achieved by design improvement, together with the random failure rate, provides a realistic estimate of total product reliability
• Test duration is determined from the required total reliability in view of the product's physical cumulative damage from life stresses in use
• Test acceleration allows a very reasonable test duration, shorter than traditional mathematically derived testing
  - Reliability improvement through test is no longer cost prohibitive
• Test failure times are projected to their appearance in real life, and the analysis uses this data
• Even though it covers the product's expected life (durability information), the test is still considerably shorter than the traditional reliability growth test

Why Accelerated Reliability Growth?

[email protected]

Milena Krasich is a Senior Principal Systems Engineer in Raytheon Integrated Defense Systems, Whole Life Engineering, in the RAM Engineering Group, Sudbury, MA.

Prior to joining Raytheon, she was a Senior Technical Lead of Reliability Engineering in Design Quality Engineering at Bose Corporation, Automotive Systems Division. Before joining Bose, she was a Member of Technical Staff in the Reliability Engineering Group of General Dynamics Advanced Technology Systems, formerly Lucent Technologies, after a five-year tenure at the Jet Propulsion Laboratory in Pasadena, California. While in California, she was a part-time professor at California State University, Dominguez Hills, where she taught graduate courses in System Reliability, Advanced Reliability and Maintainability, and Statistical Process Control. At that time, she was also a part-time professor at California State Polytechnic University, Pomona, teaching undergraduate courses in Engineering Statistics, Reliability, SPC, Environmental Testing, and Production Systems Design.

She holds a BS and MS in Electrical Engineering from the University of Belgrade, Yugoslavia, and is a California registered Professional Electrical Engineer. She is also a member of the IEEE and ASQC Reliability Society, and a Fellow and President Emeritus of the Institute of Environmental Sciences and Technology.

Currently, she is the Technical Advisor (Chair) to the US Technical Advisory Group (TAG) to the International Electrotechnical Commission (IEC) Technical Committee TC56, Dependability. As a part of the TC56 working groups, she works on dependability and reliability standards as a project leader for the revision of many released and current international standards, such as IEC/IEEE/ANSI Reliability Growth IEC 61014 and IEC 61164; Fault Tree Analysis, IEC/ANSI 61025; Testing for the constant failure rate and failure intensity (reliability compliance/demonstration tests), IEC/ANSI 61124; and FMEA, IEC/ANSI 60812; and for the preparation of the new IEC standard on Accelerated Testing, IEC 62506.

Biography
