Introduction to Reliability. WARWICK MANUFACTURING GROUP


  • 8/10/2019 Introduction to Reliability. WARWICK MANUFACTURING GROUP


    WARWICK MANUFACTURING GROUP

    Product Excellence using 6 Sigma (PEUSS)

    Introduction to Reliability

    Section 7

    Warwick Manufacturing Group


    AN INTRODUCTION TO RELIABILITY ENGINEERING

    Contents

    1 Introduction

    2 Measuring reliability

    3 Design for reliability

    4 Reliability management

    5 Summary

    Copyright 2007 University of Warwick


    Introduction to Reliability Engineering Page 1

    RELIABILITY ENGINEERING

    1 Introduction

    1.1 Definition

    Most people will have some concept of what reliability is from everyday life; for example, people may discuss how reliable their washing machine has been over the time they have owned it. Similarly, a car that does not often need to go to the garage for repairs during its lifetime would be said to have been reliable. Put simply, reliability is quality over time. Quality is associated with workmanship and manufacturing, so if a product does not work, or breaks as soon as you buy it, you would consider it to be of poor quality. However, if parts of the product wear out sooner than you expect, this would be termed poor reliability. The difference between quality and reliability is therefore concerned with time, and more specifically with product lifetime.

    Reliability engineering has both quantitative and qualitative aspects; measurements of reliability are necessary to demonstrate compliance with customer requirements. However, measuring reliability does not make a product reliable; only by designing in reliability can a product achieve its reliability targets. These lecture notes therefore introduce some of the terminology used in reliability engineering. They cover measuring reliability as well as designing for reliability, and emphasise the importance of good engineering principles in ensuring product reliability. Identifying the possible causes of failure and eliminating them will help to improve product reliability.

    The formal definition of reliability is: "The ability of an item to perform a required function under stated conditions for a stated period of time" (BS4778).

    Another definition reflects the probabilistic nature of measuring reliability: the probability that an item will perform a required function under specified conditions for a stated period of time. Reliability is therefore a measure of engineering uncertainty, and quantifying it involves statistics, more specifically probability theory. These notes also describe some useful probability distributions that can describe the lifetime behaviour of products.

    1.2 What is reliability?

    Reliability is associated with unexpected failures of products or services, and understanding why these failures occur is key to improving reliability. The main reasons why failures occur include:

    The product is not fit for purpose, or more specifically the design is inherently incapable.

    The item may be overstressed in some way.

    Failures can be caused by wear-out.

    Failures might be caused by variation.

    Wrong specifications may cause failures.

    Misuse of the item may cause failure.

    Items are designed for a specific operating environment; if they are then used outside this environment, failure can occur.

    There are many reasons why items fail; the list above is a generic one.

    The load and strength of an item may be generally known; however, there will always be an element of uncertainty. The actual strength values of any population of components will vary: some will be relatively strong, others relatively weak, but most will be of nearly average strength. Similarly, some loads will be greater than others, but most will be close to average. Figure 1 below shows the load-strength relationship with no overlap.

    [Figure 1: Load-strength relationship, no overlap. Curves: Load, Strength; y-axis: Probability]

    However, if the two distributions overlap, as shown in Figure 2, failures will occur. A safety margin is therefore needed to ensure that the distributions do not overlap.

    [Figure 2: Load-strength relationship with overlap; the overlapping region is labelled Failure. Curves: Load, Strength; y-axis: Probability]
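The overlap in Figure 2 can be quantified when the two distributions are known. As an illustrative assumption not made in these notes, the sketch below takes load and strength to be independent and normally distributed; the difference (strength minus load) is then also normal, and failure corresponds to that difference being negative. The function names and numerical values are hypothetical. The quantity returned by `safety_margin`, the difference of the means divided by the square root of the sum of the two variances, is one common way of expressing the safety margin referred to above.

```python
import math


def safety_margin(mu_load, sd_load, mu_strength, sd_strength):
    """Gap between mean strength and mean load, in units of the standard
    deviation of (strength - load). Assumes independent load and strength."""
    return (mu_strength - mu_load) / math.sqrt(sd_load**2 + sd_strength**2)


def interference_failure_probability(mu_load, sd_load, mu_strength, sd_strength):
    """P(load > strength), assuming (hypothetically) normal load and strength."""
    sm = safety_margin(mu_load, sd_load, mu_strength, sd_strength)
    # (strength - load) < 0 has probability Phi(-sm) = 0.5 * erfc(sm / sqrt(2)).
    return 0.5 * math.erfc(sm / math.sqrt(2.0))


# Hypothetical example: mean load 100 (sd 10), mean strength 150 (sd 20).
print(f"safety margin = {safety_margin(100.0, 10.0, 150.0, 20.0):.3f}")
print(f"P(failure)    = {interference_failure_probability(100.0, 10.0, 150.0, 20.0):.4f}")
```

Raising the mean strength, or reducing the spread of either distribution, increases the safety margin and shrinks the overlap.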


    It is clear that to ensure good reliability, the causes of failure need to be identified and eliminated. Indeed, the objectives of reliability engineering are:

    To apply engineering knowledge to prevent or reduce the likelihood or frequency of failures;

    To identify and correct the causes of failures that do occur;

    To determine ways of coping with failures that do occur;

    To apply methods of estimating the likely reliability of new designs, and of analysing reliability data.

    These notes will discuss some of the techniques that can be used to identify failures as well as

    the statistical techniques for analysing reliability data.

    1.3 Why is reliability important?

    Unreliability has a number of unfortunate consequences and is therefore a serious threat for many products and services. For example, poor reliability can have implications for:

    Safety

    Competitiveness

    Profit margins

    Cost of repair and maintenance

    Delays further up the supply chain

    Reputation

    Goodwill

    KEY POINTS

    Reliability is a measure of uncertainty, and therefore estimating reliability means using statistics and probability theory

    Reliability is quality over time

    Reliability must be designed into a product or service

    The most important aspect of reliability is to identify the causes of failure and eliminate them in design if possible; otherwise, identify ways of accommodating them

    Reliability is defined as the ability of an item to perform a required function without failure under stated conditions for a stated period of time

    The costs of unreliability can be damaging to a company


    2 Measuring reliability

    2.1 Requirements

    Many customers will produce a statement of the reliability requirements that is included in the

    specification of the product. This statement should include the following:

    A definition of failure, related to the product's function, that covers all failure modes relevant to that function;

    A full description of the environments in which the product will be stored, transported, operated and maintained;

    A statement of the reliability requirement.

    Care must be taken in defining failure to ensure that the failure criteria are unambiguous. Failure should always relate to a measurable parameter or to a clear indication; for example, a definition of failure could include failure of a function to operate. To design for the load on the product, the design team must have accurate information about the product's environment. If an item must operate fully at high altitude with extreme changes in temperature, then the design must be robust enough to withstand such environmental factors. Similarly, if a product is stored in extreme conditions prior to use, the design must accommodate the storage conditions.

    The reliability requirement should be stated in a way that can be verified and that makes sense relative to the use of the product. The simplest requirement is to state that no failure will occur under stated conditions. Reliability requirements based on life parameters (see section 2.3) must be based on the corresponding life distributions. A commonly used parameter, when a constant failure rate is assumed, is the mean time between failures (MTBF).

    Reliability and Maintainability case

    The UK MOD has recently moved away from prescriptive reliability requirements and now requests a reliability case from its suppliers. The reliability and maintainability (R&M) case is defined as "A reasoned, auditable argument created to support the contention that a defined system satisfies the R&M requirements". DEF STAN 00-42 Part 3 is a document produced by the UK MOD that gives guidance on what goes into an R&M case.

    2.2 The bath tub curve

    The bath-tub curve is a representation of the reliability performance of components or non-repaired items. It describes the reliability performance of a large sample of homogeneous items entering the field at some start time (usually zero). If we observe the items over their lifetime without replacement, three distinct shapes or periods can be seen. Figure 3 shows the bath-tub curve and these three periods. The infant mortality (early failures) portion shows that the population initially experiences a high hazard function that then decreases. This period represents the burn-in or debugging period, in which weak items are weeded out.

    After this initial phase, when the weak components have been weeded out and mistakes corrected, the remaining population reaches a period of relatively constant hazard function, known as the useful life period. As Figure 3 shows, the hazard function is constant in this period; when failures occur randomly through time, this shape can be modelled by the exponential distribution (see section 2.3). The final portion of the bath-tub curve is called the wear-out phase, in which the hazard function increases with time.

    [Figure 3: The bath-tub curve. Hazard function against time, with regions: Infant Mortality, Useful Life, Wear Out]

    2.3 Life distributions

    2.3.1 Distribution functions

    If you take a large number of measurements, you can draw a histogram to show how the measurements vary. A more useful diagram for continuous data is the probability density function. Its y-axis shows the percentage of measurements falling in each range (shown on the x-axis), rather than the frequency as in a histogram. If you reduce the ranges (or intervals), the histogram becomes a curve that describes the distribution of the measurements or values. This distribution is the probability density function, or PDF. Figure 4, below, shows an example of a PDF. The area under the curve of the distribution is equal to 1, i.e.

    $$\int_{-\infty}^{\infty} f(x)\,dx = 1$$


    The probability of a value falling between any two values x1 and x2 is the area bounded by this interval, i.e.

    $$P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f(x)\,dx$$


    In reliability engineering we are concerned with the probability that an item will survive for a stated interval of time (or cycles, distance, etc.), i.e. that there is no failure in the interval (0, t). This is known as the survival function and is denoted R(t). From the definition:

    $$R(t) = 1 - F(t) = 1 - \int_{0}^{t} f(u)\,du = \int_{t}^{\infty} f(u)\,du$$

    [Figure 6: Survival function R(t), plotted against t]

    Another important function in reliability is the hazard function h(t): h(t)dt is the conditional probability of failure in the interval t to (t + dt), given that no failure has occurred by t:

    $$h(t) = \frac{f(t)}{R(t)} = \frac{f(t)}{1 - F(t)}$$

    The bath-tub curve shown in Figure 3 is made up of three different hazard functions: one decreasing, one constant and one increasing.
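The relationships between f(t), F(t), R(t) and h(t) can be checked numerically. The sketch below integrates a PDF with the trapezoidal rule to obtain F(t), then forms R(t) = 1 - F(t) and h(t) = f(t)/R(t). The exponential PDF and the value of `LAM` are illustrative assumptions; for that PDF the computed hazard should come out essentially constant and equal to `LAM`.

```python
import math


def cdf_from_pdf(pdf, t, steps=10_000):
    """Approximate F(t), the integral of pdf from 0 to t, by the trapezoidal rule."""
    if t <= 0.0:
        return 0.0
    dx = t / steps
    total = 0.5 * (pdf(0.0) + pdf(t))
    for i in range(1, steps):
        total += pdf(i * dx)
    return total * dx


def survival(pdf, t):
    """R(t) = 1 - F(t)."""
    return 1.0 - cdf_from_pdf(pdf, t)


def hazard(pdf, t):
    """h(t) = f(t) / R(t)."""
    return pdf(t) / survival(pdf, t)


LAM = 0.002  # hypothetical constant failure rate, per hour


def exponential_pdf(t):
    return LAM * math.exp(-LAM * t)


for t in (100.0, 500.0, 1000.0):
    print(f"t={t:6.0f}  R(t)={survival(exponential_pdf, t):.4f}  "
          f"h(t)={hazard(exponential_pdf, t):.6f}")
```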

    2.3.2 Particular life distributions

    Three continuous life distributions are commonly used in reliability engineering: the exponential, Weibull and lognormal distributions. The normal distribution, as discussed in both the Six Sigma and SPC lectures, is not generally used in reliability engineering, although it is sometimes applied.


    The Exponential Distribution

    When an item is subject to failures that occur at random intervals, and the expected number of failures is the same over long periods of time, the distribution of failures is said to fit an exponential distribution. The PDF, CDF and survival function are given by:

    $$f(t) = \lambda e^{-\lambda t}, \qquad F(t) = 1 - e^{-\lambda t}, \qquad R(t) = e^{-\lambda t}$$

    The hazard function for the exponential distribution is given as:

    $$h(t) = \frac{f(t)}{R(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda$$

    Notice that the hazard function is not a function of time and is in fact a constant equal to λ. For repaired items, λ is called the failure rate and 1/λ is called the mean time between failures (MTBF). An important point to note is that 63.2% of items will have failed by time t = 1/λ, since F(1/λ) = 1 - e^(-1) ≈ 0.632.

    The failure rate can be calculated as the total number of failures divided by the total operating

    time.

    The exponential distribution is the most commonly used distribution in reliability engineering

    and models the useful life portion of the bath-tub curve.
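A worked illustration of the constant-failure-rate case, using hypothetical field data: the sketch estimates λ as total failures divided by total operating time, takes the MTBF as 1/λ, and confirms that F(t) evaluated at t = MTBF equals 1 - e^(-1), about 0.632. The function names and the data are illustrative.

```python
import math


def failure_rate(num_failures, total_operating_time):
    """Constant failure rate estimate: total failures / total operating time."""
    return num_failures / total_operating_time


def reliability(lam, t):
    """Exponential survival function R(t) = exp(-lambda * t)."""
    return math.exp(-lam * t)


def unreliability(lam, t):
    """Exponential CDF F(t) = 1 - exp(-lambda * t)."""
    return 1.0 - reliability(lam, t)


# Hypothetical field data: 4 failures in 20,000 unit-hours of operation.
lam = failure_rate(4, 20_000.0)
mtbf = 1.0 / lam

print(f"lambda = {lam:.6f} per hour, MTBF = {mtbf:.0f} hours")
print(f"R(1000 h) = {reliability(lam, 1000.0):.4f}")
print(f"F(MTBF)   = {unreliability(lam, mtbf):.4f}")  # about 63.2% failed by t = MTBF
```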

    The Weibull Distribution

    This distribution can model a non-constant hazard function. The survival function is given by:

    $$R(t) = e^{-(t/\eta)^{\beta}}$$

    where β is the shape parameter and η is the scale parameter, or characteristic life. The characteristic life is the life by which 63.2% of the population will have failed.

    When β = 1, the hazard function is constant, and the data can therefore be modelled by an exponential distribution with λ = 1/η.

    When β < 1, the hazard function decreases with time; when β > 1, it increases with time.

    Figure 7, below, shows the Weibull shape parameters superimposed on the bath-tub curve.
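The effect of the shape parameter can be seen by evaluating the Weibull hazard function h(t) = (β/η)(t/η)^(β-1), which follows from h(t) = f(t)/R(t) and the survival function above. The values of η and β below are hypothetical. The sketch also adds three Weibull hazards together as a rough picture of a bath-tub curve; treating the three periods as independent competing failure processes whose hazards add is an assumption made here for illustration, not a method given in these notes.

```python
import math


def weibull_reliability(t, beta, eta):
    """R(t) = exp(-(t / eta) ** beta)."""
    return math.exp(-((t / eta) ** beta))


def weibull_hazard(t, beta, eta):
    """h(t) = (beta / eta) * (t / eta) ** (beta - 1), valid for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


ETA = 1000.0  # hypothetical characteristic life, hours

# R(eta) = exp(-1) for any beta: 63.2% have failed by the characteristic life.
for beta in (0.5, 1.0, 3.0):
    print(f"beta={beta}: R(eta)={weibull_reliability(ETA, beta, ETA):.3f}")
    hazards = [weibull_hazard(t, beta, ETA) for t in (100.0, 500.0, 2000.0)]
    print("  h(t) at t = 100, 500, 2000:", [f"{h:.2e}" for h in hazards])


def bathtub_hazard(t):
    """Illustrative bath-tub shape: sum of three hypothetical Weibull hazards
    (infant mortality beta < 1, random failures beta = 1, wear-out beta > 1)."""
    return (weibull_hazard(t, 0.5, 5000.0)
            + weibull_hazard(t, 1.0, 2000.0)
            + weibull_hazard(t, 4.0, 3000.0))
```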


    2.4.1 Series systems

    The simplest reliability model is the series model, in which all the components must be working for the system to be successful, for example:

    [Figure 8: Example of components A, B ... Z connected in series]

    The reliability of this model is calculated by:

    $$R_S = R_A \times R_B \times \cdots \times R_Z$$

    If the components can be assumed to be exponentially distributed then the system reliability

    can be calculated as:

    $$R_S = e^{-\lambda_A t} \times e^{-\lambda_B t} \times \cdots \times e^{-\lambda_Z t}$$

    The failure rate of the system is calculated by adding the component failure rates together, i.e.

    $$\lambda_S = \lambda_A + \lambda_B + \cdots + \lambda_Z$$
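The series model can be sketched directly: multiply the component reliabilities or, for exponentially distributed components, add the failure rates and use R_S(t) = exp(-λ_S t). The component failure rates and mission time below are hypothetical; the sketch checks that both routes give the same answer.

```python
import math


def series_reliability(reliabilities):
    """R_S = R_A * R_B * ... * R_Z: every component must survive."""
    return math.prod(reliabilities)


# Hypothetical constant failure rates (failures per hour) for components A, B and Z.
failure_rates = {"A": 1e-5, "B": 2e-5, "Z": 5e-6}
t = 1000.0  # hypothetical mission time, hours

# Route 1: add the failure rates, then use the exponential survival function.
lam_system = sum(failure_rates.values())
r_from_rates = math.exp(-lam_system * t)

# Route 2: multiply the individual component reliabilities.
r_from_product = series_reliability(math.exp(-lam * t) for lam in failure_rates.values())

print(f"lambda_S = {lam_system:.2e} per hour")
print(f"R_S via summed failure rates = {r_from_rates:.6f}")
print(f"R_S via product of R_i       = {r_from_product:.6f}")
```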

    2.4.2 Active redundancy

    One of the most common forms of redundancy is the parallel reliability model, in which two independent items are operating but the system can operate successfully as long as one of them is working. Diagrammatically:


    [Figure 9: Dual redundant system, items A and B in parallel]

    The reliability of the system is equal to the probability of item A or item B (or both) surviving, i.e.

    $$R_S = R_A + R_B - R_A R_B$$

    And for the constant hazard function case:

    $$R_S = e^{-\lambda_A t} + e^{-\lambda_B t} - e^{-(\lambda_A + \lambda_B) t}$$

    This example assumes that the items are not repaired after failure, i.e. a non-maintained system.
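For the dual redundant case, the formula can be checked against the equivalent form R_S = 1 - (1 - R_A)(1 - R_B): the system fails only if both items fail. The values of R_A, R_B and the failure rates below are hypothetical, and, as in the text, both items are assumed independent and non-maintained.

```python
import math


def parallel_reliability(r_a, r_b):
    """Active redundancy, two independent items: R_S = R_A + R_B - R_A * R_B."""
    return r_a + r_b - r_a * r_b


def parallel_reliability_exp(lam_a, lam_b, t):
    """Constant-hazard case: exp(-la t) + exp(-lb t) - exp(-(la + lb) t)."""
    return math.exp(-lam_a * t) + math.exp(-lam_b * t) - math.exp(-(lam_a + lam_b) * t)


print(parallel_reliability(0.9, 0.8))        # about 0.98
print(1 - (1 - 0.9) * (1 - 0.8))             # same value via "both must fail"
print(parallel_reliability_exp(1e-4, 2e-4, 1000.0))
```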

    2.4.3 M-out-of-N redundancy

    In some active parallel redundant configurations, m out of the n items may be required to be working for the system to function. The reliability of an m-out-of-n system with n identical, independent items, each of reliability R, is given by:

    $$R_S = 1 - \sum_{i=0}^{m-1} \binom{n}{i} R^{i} (1 - R)^{n-i}$$

    There are other system reliability models, including standby redundancy, but those above are the simplest.
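The m-out-of-n formula sums the probability that fewer than m items survive and subtracts it from one. A short sketch with hypothetical values of m, n and R follows; the special cases m = 1 (pure parallel) and m = n (pure series) make useful checks.

```python
from math import comb


def m_out_of_n_reliability(m, n, r):
    """Reliability when at least m of n identical, independent items (each of
    reliability r) must work: 1 - sum_{i=0}^{m-1} C(n, i) r^i (1 - r)^(n - i)."""
    if not 1 <= m <= n:
        raise ValueError("require 1 <= m <= n")
    p_too_few = sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(m))
    return 1.0 - p_too_few


# Hypothetical example: a 2-out-of-3 system with item reliability 0.9.
print(m_out_of_n_reliability(2, 3, 0.9))  # about 0.972
```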

    2.5 Reliability prediction

    One of the most common methods for predicting the reliability of a system is called the parts count method.


    Assuming that the lifetimes of all the parts in a system are independent and exponentially distributed (i.e. the failure of one part does not cause another to fail), the overall system failure rate can be calculated using the series system model shown above. For example, the failure rate of a printed circuit board is the sum of the failure rates of each of its components.

    For example:

    Component type        Quantity   Failure rate       Quantity × failure rate
    Ceramic capacitor     30         0.00001 × 10^-6    0.0003 × 10^-6
    Tantalum capacitor    10         0.0003 × 10^-6     0.003 × 10^-6
    Carbon resistor       30         0.00001 × 10^-6    0.0003 × 10^-6
    Diodes                10         0.0002 × 10^-6     0.002 × 10^-6
    Transistors           15         0.0005 × 10^-6     0.0075 × 10^-6
    Logic IC              20         0.001 × 10^-6      0.020 × 10^-6

    PCB failure rate (sum of the last column) = 0.0331 × 10^-6

    The failure rates for components can be estimated from company in-service databases or obtained from published handbooks and other published data.
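The parts count arithmetic can be reproduced in a few lines. The sketch below re-enters the quantities and per-part failure rates from the table (kept in the table's units of 10^-6) and sums quantity × failure rate, which is exactly the series-model and independence assumption stated above. The function name is illustrative.

```python
# Rows from the parts count table: (component type, quantity, rate in units of 1e-6).
parts = [
    ("Ceramic capacitor", 30, 0.00001),
    ("Tantalum capacitor", 10, 0.0003),
    ("Carbon resistor", 30, 0.00001),
    ("Diodes", 10, 0.0002),
    ("Transistors", 15, 0.0005),
    ("Logic IC", 20, 0.001),
]


def parts_count_failure_rate(parts):
    """Series model: board failure rate = sum over rows of quantity * rate."""
    return sum(qty * rate for _name, qty, rate in parts)


for name, qty, rate in parts:
    print(f"{name:<20} {qty:>3} x {rate:.5f} = {qty * rate:.5f}  (x 1e-6)")
print(f"PCB failure rate = {parts_count_failure_rate(parts):.6f} x 1e-6")
```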

    KEY POINTS

    PDF, CDF, Reliability function and hazard function

    Bath-tub curve: infant mortality, useful life and wear-out

    Exponential distribution: the most widely used; constant hazard function

    Weibull distribution: through the shape parameter β, can model decreasing and increasing hazard functions. When β = 1 it is equal to the exponential. The characteristic life is the life by which 63.2% of the population has failed

    Series system modelling: used for estimating system reliability with the parts count method

    3 Design for reliability

    The objective of design for reliability is to design a product that meets its requirements under the specified environmental conditions. To achieve this, sound engineering design rules should be followed. In addition, there are a few general principles that should be observed, including:

    Component selection: well-established and known components should be used (companies usually have their own approved components list). If this is not the case, analysis must be done to check that the component is fit for purpose.

    Consider the load-strength relationship and ensure there is an adequate safety margin.

    Minimum complexity

    Diversity, to avoid common mode failures

    Analyse failure modes and their effects (FMEA)

    Identify any single point failures and either mitigate or design them out.

    Use lessons learned from previous products to design out any known weaknesses.

    Ultimately, the aim is to maximise reliability during service life by:

    Measurement & control of manufacturing quality / screening

    Optimised design & build process to improve intrinsic reliability

    Assure no systematic faults present in product

    Provide sufficient margin to meet life requirements

    3.1 Product life cycle

    Each product has a life cycle; Figure 10 illustrates a generic product life cycle. A number of tools and techniques are most useful at particular stages of the life cycle. For example, at the design stage it is most appropriate to use techniques that support design reviews. Testing parts for fitness for purpose using accelerated life testing is also necessary at this stage. Once the product has been built, it becomes costly to change the design, so all design reviews need to be done as early as possible in the product life cycle.


    [Figure 10: Product life cycle. Stages: Design, Development, Manufacture/Test, Use. Techniques shown: FMECA, FTA, PoF, RBD; FE, accelerated life test; Development test; SPC, ESS, burn-in; Field data analysis; FRACAS]

    Development testing is used to investigate the robustness of the product and to identify any design weaknesses with respect to the load. Development testing incorporates environmental testing and is used to confirm that the product is fit for purpose.

    When the product has been developed and the design is closed and ready for production, statistical process control (SPC) and other quality engineering tools are essential for ensuring a good-quality product.

    Environmental stress screening (ESS) or burn-in is sometimes used to test all manufactured units prior to release to the customer. The purpose of ESS is to identify any manufacturing weaknesses in individual items.

    Once the product is in service, performance data should be collected to check the product's reliability and to feed forward into new product design in the form of lessons learned.

    More discussion on some of these tools and techniques is given in later sections.

    3.2 Reliability tools and techniques

    3.2.1 Introduction

    Some of the tools that are useful during the design stage can be thought of as tools for fault avoidance. They fall into two general methods: bottom-up and top-down.


    3.2.2 Top-down method

    An undesirable single event, or system success, at the highest level of interest (the top event) should be defined.

    Contributory causes of that event at all levels are then identified and analysed.

    Starts at the highest level of interest and proceeds to successively lower levels

    Event-oriented method

    Useful during the early conceptual phase of system design

    Used for evaluating multiple failures including sequentially related failures and common-cause events

    Examples of top-down methods include fault tree analysis (FTA), reliability block diagrams (RBD) and Markov analysis.

    Fault tree analysis

    Fault tree analysis is a systematic way of identifying all possible faults that could lead to a system fail-danger failure. The FTA provides a concise description of the various combinations of possible occurrences within the system that can result in predetermined critical output events. The FTA helps identify and evaluate critical components, fault paths and possible errors. It is both a reliability and a safety engineering task, and it is a critical data item that is submitted to the customer for their approval and for use in their higher-level FTA and safety analysis. The key elements of an FTA include:

    Gates represent the outcome

    Events represent input to the gates

    Cut sets are groups of events that would cause a system to fail

    The following diagram shows the flowchart symbols that are used in fault tree analysis, to aid the correct reading of the fault tree.

    FTA can be done qualitatively by drawing the tree and identifying all the basic events. However, to determine the probability of the top event, probabilities or reliability figures must be input for the basic events. Using the logic of the gates, these probabilities are worked up the tree to give the probability that the top event will occur. Data from an FMEA are often used in conjunction with an FTA.
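Working probabilities up a fault tree uses simple gate rules when the basic events are independent: an AND gate's output probability is the product of its input probabilities, and an OR gate's is one minus the product of the input non-occurrence probabilities. The small tree and the basic-event probabilities below are hypothetical, chosen only to show the calculation.

```python
import math


def and_gate(*probs):
    """Output occurs only if all (independent) inputs occur."""
    return math.prod(probs)


def or_gate(*probs):
    """Output occurs if at least one (independent) input occurs."""
    return 1.0 - math.prod(1.0 - p for p in probs)


# Hypothetical tree: TOP = OR(power_fails, AND(pump_1_fails, pump_2_fails))
p_power = 0.001
p_pump_1 = 0.02
p_pump_2 = 0.03

p_both_pumps = and_gate(p_pump_1, p_pump_2)
p_top = or_gate(p_power, p_both_pumps)
print(f"P(both pumps fail) = {p_both_pumps:.6f}")
print(f"P(top event)       = {p_top:.6f}")
```

This gate arithmetic relies on the basic events being independent; common-cause events, which the top-down methods above are used to evaluate, break that assumption and need separate treatment.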


    For each fault mode, the corresponding effect on performance is deduced for the next higher system level

    The resulting fault effect becomes the fault mode at the next higher system level, and so on

    Successive iterations result in the eventual identification of the fault effects at all functional levels up to the system level.

    Rigorous in identifying all single fault modes

    Initially may be qualitative

    Examples of bottom-up methods include event tree analysis (ETA), FMEA and hazard and operability studies (HAZOP).

    Event tree analysis

    Considers a number of possible consequences of an initiating event or a system failure

    May be combined with a fault tree

    Used when it is essential to investigate all possible paths of consequent events and their sequence

    Analysis can become very involved and complicated when analysing larger systems

    Example:

    [Event tree diagram: nodes A, B and C. Key: A = no property damage or injury; B = property damage, no injury; C = damage to the car only, no other property damage. Branch probabilities: A: PA1 = 0.5, PA2 = 0.5; B: PB1 = 0.3, PB2 = 0.7; C: 0.4 and 0.6 following PB1, 0.2 and 0.8 following PB2]

    Outcomes:

    C1: Car came to slow stop, no damage to the car, other property or injuries; Pc1 = 0.5

    C2: Car came to slow stop, no damage to the wheel; Pc2 = 0.5 × 0.3 × 0.4 = 0.06

    C3: Car collided with the centre divider, damage to the car and the divider; Pc3 = 0.5 × 0.3 × 0.6 = 0.09

    C4: Car ran off the road, damage to the car, driver injured; Pc4 = 0.5 × 0.7 × 0.2 = 0.07

    C5: Collision with another vehicle, damage to both, both drivers injured; Pc5 = 0.5 × 0.7 × 0.8 = 0.28
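Each outcome probability in the event tree is the product of the branch probabilities along its path. The sketch below recomputes the outcome probabilities from the branch values in the example and checks that, being mutually exclusive and exhaustive, they sum to one. The dictionary layout is illustrative, not part of the original figure.

```python
import math

# Branch probabilities along each path, taken from the event tree example.
paths = {
    "C1": [0.5],
    "C2": [0.5, 0.3, 0.4],
    "C3": [0.5, 0.3, 0.6],
    "C4": [0.5, 0.7, 0.2],
    "C5": [0.5, 0.7, 0.8],
}

outcomes = {name: math.prod(branches) for name, branches in paths.items()}
for name, p in outcomes.items():
    print(f"{name}: {p:.2f}")

# Mutually exclusive, exhaustive outcomes should sum to 1.
print(f"total = {sum(outcomes.values()):.2f}")
```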


    Failure Modes and Effects Analysis (FMEA)

    Failure mode and effect analysis (FMEA) is a bottom-up, qualitative dependability analysis

    method, which is particularly suited to the study of material, component and equipment

    failures and their effects on the next higher functional system level. Iterations of this step

    (identification of single failure modes and the evaluation of their effects on the next higher system level) result in the eventual identification of all the system single failure modes. FMEA

    lends itself to the analysis of systems of different technologies (electrical, mechanical,

    hydraulic, software, etc.) with simple functional structures. FMECA extends the FMEA to

    include criticality analysis by quantifying failure effects in terms of probability of occurrence

    and the severity of any effects. The severity of effects is assessed by reference to a specified

    scale.

    FMEAs or FMECAs are generally done where a level of risk is anticipated in a program early in product or process development. Factors that may be considered are new technology, new processes, new designs, or changes in the environment, loads, or regulations. FMEAs or FMECAs can be done on components or systems that make up products, processes, or manufacturing equipment. They can also be done on software systems.

    An FMEA or FMECA analysis generally follows these steps:

    Identification of how the component or system should perform;

    Identification of potential failure modes, effects, and causes;

    Identification of risk related to failure modes and effects;

    Identification of recommended actions to eliminate or reduce the risk;

    Follow-up actions to close out the recommended actions.
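As a minimal illustration of the criticality step, the sketch below ranks failure modes by a simple score formed from an occurrence probability and a severity rating. The failure modes, the 1 to 10 severity scale and the use of their product as the score are all hypothetical assumptions; a real FMECA uses the occurrence and severity scales specified for the project.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str
    occurrence: float  # probability of occurrence over the period of interest
    severity: int      # hypothetical scale: 1 (minor) to 10 (catastrophic)

    @property
    def criticality(self) -> float:
        # Illustrative score only: occurrence probability weighted by severity.
        return self.occurrence * self.severity


modes = [
    FailureMode("Connector corrosion", 0.02, 4),
    FailureMode("Capacitor short circuit", 0.005, 9),
    FailureMode("Seal leak", 0.05, 3),
]

for mode in sorted(modes, key=lambda m: m.criticality, reverse=True):
    print(f"{mode.description:<25} criticality = {mode.criticality:.3f}")
```

Note that the most severe mode ranks last here because it is rare, which is one reason why prioritising mode criticality is complicated by competing factors.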

    Benefits include:

    Systematically identifies cause and effect relationships.

    Gives an initial indication of those failure modes that are likely to be critical, especially single failures that may propagate.

    Identifies outcomes arising from specific causes or initiating events that are believed to be important.

    Provides a framework for identification of measures to mitigate risk.

    Useful in the preliminary analysis of new or untried systems or processes.

    Limitations include:

    The output data may be large even for relatively simple systems.

    May become complicated and unmanageable unless there is a fairly direct (or "single-chain") relationship between cause and effect.

    May not easily deal with time sequences, restoration processes, environmental conditions, maintenance aspects, etc.

    Prioritising mode criticality is complicated by the competing factors involved.


    Physics of Failure (PoF)

    In simple terms, Physics of Failure is an understanding of the physical properties of the materials, processes and technologies used in the design, and of how those properties can interact with the life hazard conditions placed on the design during the product's full life cycle. The reliability engineer must understand the customer's use and misuse conditions and the component/environment interactions in order to help the design team work around limitations inherent in the selected design approach.

    This type of analysis answers most of the why, where, when and how questions about the life of a component. Such analysis methods are essential for understanding the root cause of failure and for determining and applying appropriate corrective action. Understanding the root cause of a failure is essential in today's highly competitive market for successfully manufacturing quality components.

    3.3 Reliability testing

    3.3.1 Accelerated life testing

    The concept of accelerated testing is to compress time and accelerate the failure mechanisms

    in a reasonable test period so that product reliability can be assessed. The only way to

    accelerate time is to stress potential failure modes. These include electrical and mechanical

    failures. Failure occurs when the stress exceeds the product's strength. In a product's population, the strength is generally distributed and usually degrades over time. Applying stress simply simulates aging. Increasing stress increases the unreliability and improves the chances for failure occurring in a shorter period of time. This also means that a smaller sample population of devices can be tested with an increased probability of finding failure.

    Stress testing amplifies unreliability so failure can be detected sooner. Accelerated life tests

    are also used extensively to help make predictions. Predictions can be limited when testing

    small sample sizes. Predictions can be erroneously based on the assumption that life-test

    results are representative of the entire population. Therefore, it can be difficult to design an

    efficient experiment that yields enough failures so that the measures of uncertainty in the

    predictions are not too large. Stresses can also be unrealistic. Fortunately, it is generally rare

    for an increased stress to cause anomalous failures, especially if common sense guidelines are

    observed.
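    The stress-strength argument above can be made concrete with the classical interference model. This sketch assumes that stress and strength are independent and normally distributed, which the text does not state, and uses hypothetical values; it shows how raising the applied stress raises the probability that stress exceeds strength.

```python
from math import sqrt
from statistics import NormalDist


def interference_unreliability(stress_mean, stress_sd, strength_mean, strength_sd):
    """P(stress > strength) for independent, normally distributed stress and strength."""
    # The safety margin (strength - stress) is itself normally distributed
    margin = NormalDist(mu=strength_mean - stress_mean,
                        sigma=sqrt(stress_sd ** 2 + strength_sd ** 2))
    return margin.cdf(0.0)  # probability that the margin is negative


strength_mean, strength_sd = 100.0, 10.0   # hypothetical strength, arbitrary units
stress_sd = 5.0

# Use-level stress followed by two accelerated stress levels
for stress_mean in (50.0, 70.0, 90.0):
    q = interference_unreliability(stress_mean, stress_sd, strength_mean, strength_sd)
    print(f"mean stress {stress_mean:5.1f}: unreliability {q:.2e}")
```

    Degradation of strength over time can be mimicked by lowering `strength_mean`, which increases the overlap in the same way that raising stress does.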

    Anomalous testing failures can occur when testing pushes the limits of the material out of the region of the intended design capability. The natural question to ask is: What should the

    guidelines be for designing proper accelerated tests and evaluating failures? The answer is:

    Judgment is required by management and engineering staff to make the correct decisions in

    this regard. To aid such decisions, the following guidelines are provided:

    1. Always refer to the literature to see what has been done in the area of accelerated testing.

    2. Avoid accelerated stresses that cause nonlinearities, unless such stresses are plausible in product-use conditions. Anomalous failures occur when accelerated stress

    causes nonlinearities in the product. For example, material changing phases from

    solid to liquid, as in a chemical nonlinear phase transition (e.g., solder melting,

    inter-metallic changes, etc.); an electric spark in a material is an electrical

    [...]

    This speeds up testing, and if an interactive vibration/temperature failure mode is present, this combined testing may be the only way to find it. Other stresses used may be power step-stress, power cycling, package preconditioning with infrared (IR) reflow, electrostatic-discharge (ESD) simulation, and so forth. The choice depends on the intended type of unit under test and the unit's potential failure modes.

    HALT is primarily for assemblies and subassemblies. The HALT test method utilizes a

    HALT chamber. Today, these multi-stress environmental systems are produced by a large

    number of suppliers. The chamber is unique in that it can perform both temperature and vibration step-stress testing.

    3.3.3 Demonstration testing

    Demonstration of reliability may be required as part of a development and production

    contract, or prior to release to production, to ensure that the requirements have been met. Two basic forms of reliability measurement are used:

    1. A sample of units may be subjected to a formal reliability test, with conditions specified in detail.

    2. Reliability may be monitored during development and use.

    The first method has been shown to be problematic and subject to severe limitations and practical problems. The limitations include:

    The PRST (probability ratio sequential test) assumes a constant hazard function;

    It implies that MTBF is an inherent parameter of a system;

    It is extremely costly;

    It is an acceptance test;

    Its objective is to have no or very few failures.

    It has been shown that a well-managed reliability growth programme, as discussed earlier, would avoid the need for demonstration testing, because such programmes concentrate on how to improve products. It has also been argued that the benefit to the product in terms of improved reliability is sometimes questionable when PRST methods have been used.
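    To show what the constant-hazard assumption means in practice, here is a minimal sketch of a sequential probability ratio test in Wald's form for exponentially distributed times between failures. The MTBF values and risks are hypothetical, and the sketch does not reproduce the truncation rules of any published test plan.

```python
from math import log


def prst_decision(failures, test_hours, theta0, theta1, alpha=0.1, beta=0.1):
    """Wald sequential test on MTBF, assuming a constant hazard (exponential lifetimes).

    theta0: acceptable MTBF (null hypothesis); theta1 < theta0: unacceptable MTBF.
    alpha: producer's risk; beta: consumer's risk.
    Returns "accept", "reject" or "continue".
    """
    # Log-likelihood ratio L(theta1) / L(theta0) for `failures` in `test_hours`
    log_ratio = failures * log(theta0 / theta1) - test_hours * (1 / theta1 - 1 / theta0)
    if log_ratio >= log((1 - beta) / alpha):
        return "reject"
    if log_ratio <= log(beta / (1 - alpha)):
        return "accept"
    return "continue"


# Hypothetical plan: MTBF of 1000 h acceptable, 500 h unacceptable
print(prst_decision(failures=0, test_hours=2500, theta0=1000, theta1=500))  # accept
print(prst_decision(failures=5, test_hours=100, theta0=1000, theta1=500))   # reject
```

    Only the failure count and the total test time enter the decision, which is why such a test treats MTBF as a fixed property of the system: the timing pattern of the failures (for example, early-life clustering) plays no part.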

    3.3.4 Environmental Stress Screening

    If all processes were under complete control, product screening or monitoring would be

    unnecessary. If products were perfect, there would be no field returns or infant mortality

    problems, and customers would be satisfied with product reliability and quality. However, in

    the real world, unacceptable process and material variations exist. Product flaws need to be

    anticipated before customers receive final products and use them. This is the primary reason

    that a good screening and monitoring program is needed to provide high quality products. Screening and monitoring programs are a major factor in achieving customer satisfaction.

    Parts are screened in the early production stage until the process is under control and any


    material problems have been resolved. Once this occurs, a monitoring program can ensure

    that the process has not changed and that any deviations have been stabilized. Here, the term

    screening implies 100% product testing while monitoring indicates a sample test. Screens

    are based upon a product's potential failure modes. Screening may be simple, such as on-off cycling of the unit, or it may be more involved, requiring one or more powered environmental stress screens. Usually, screens that power up the unit provide a better opportunity to precipitate failure-mode problems than non-powered screens. Screens are constantly reviewed and may be modified based on screening yield results. For example, if field returns are high while the screen yields are high (near 100 percent), the screen is not catching the field failure modes and should be changed to find all the field issues. If yields are high with acceptable parts per million (PPM) field returns, then

    a monitoring program will replace the screen. In general, monitoring is preferred for low-

    cost/high-volume jobs. A major caution for selecting the correct screening program is to

    ensure that the process of screening out early life failures does not remove too much of a

    product's useful life. Manufacturers have noted that, in the attempt to drive out early life

    failure, the useful life of some products can become reduced. If this occurs, customers will

    find wear-out failure mechanisms during early field use.
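    The trade-off between removing early-life failures and consuming useful life can be illustrated with a Weibull model, an assumption not made in the text. With a shape parameter below 1 (decreasing hazard, infant mortality) a unit that survives burn-in is more reliable over its mission; with a shape parameter above 1 (wear-out) the same burn-in makes it less reliable. The 168-hour burn-in matches the example screen quoted later in this section; the other values are hypothetical.

```python
from math import exp


def weibull_reliability(t, beta, eta):
    """Two-parameter Weibull reliability R(t) = exp(-(t / eta) ** beta)."""
    return exp(-((t / eta) ** beta))


def reliability_after_burn_in(mission, burn_in, beta, eta):
    """Reliability over `mission` hours for a unit that has already survived `burn_in` hours."""
    return weibull_reliability(burn_in + mission, beta, eta) / weibull_reliability(burn_in, beta, eta)


mission, burn_in, eta = 1000.0, 168.0, 50_000.0   # hypothetical hours

for beta in (0.5, 3.0):   # 0.5: infant mortality dominates; 3.0: wear-out dominates
    fresh = weibull_reliability(mission, beta, eta)
    screened = reliability_after_burn_in(mission, burn_in, beta, eta)
    print(f"shape {beta}: mission reliability {fresh:.6f} unscreened, {screened:.6f} after burn-in")
```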

    The information obtained when a product is first introduced to the Development Phase in the HALT process enables the development of a HASS test. HALT is a highly accelerated reliability growth Test-Analyze-And-Fix (TAAF) process. In HALT, failures are analyzed, and corrective actions are implemented. The test is repeated until the observed failure modes have been fixed, and the environmental technology limits of the part are understood. This information is used for the Production Phase. At this stage, one either develops a traditional ESS or a HASS test. A HASS test combines thermal cycling, vibration, and power stress simultaneously. The testing range is within the operating limits that are known from prior HALT testing. Similar to HALT, HASS is an aggressive screening program to help weed out failure modes and implement corrective actions as soon as possible. This process enables products to be moved quickly into a monitoring program. The HASS process

    typically helps to reduce screen time (30 percent to 80 percent) and move a product more

    quickly into the Monitoring Phase. For example, a common screen uses 168-hour burn-in, 20-

    hour thermal shock, and a 60-minute vibration test. Since this is a fairly lengthy screen, it is

    advantageous to work with a HASS program. In the HASS process, this test is quickly

    reduced to a monitoring program. Since faster test results help in implementing product improvements and moving to a monitoring test, cost savings can be passed on to the customer.

    If HASS precipitates a failure, an immediate failure analysis is performed to determine the

    root cause. A 100 percent screen is maintained until the process is in control. At that point,

    monitoring can be performed. The monitoring also includes a HALT at given intervals to

    ensure that the product safety margins have not deteriorated from those obtained in the Evaluation Phase. Thus, knowledge of the HALT and HASS environmental limits, relative to

    a customer specification, is very helpful in providing engineering confidence in the proper

    design of the screening and the subsequent monitoring test. Such sound practices are

    important for providing a highly reliable product.
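    The time saving quoted above can be checked with simple arithmetic. This sketch assumes the three parts of the example screen run one after another on each unit, which the text does not state explicitly.

```python
# Example screen from the text: 168 h burn-in + 20 h thermal shock + 60 min vibration
screen_hours = 168 + 20 + 60 / 60   # 189 hours per unit, run sequentially (assumed)

# Typical HASS reduction in screen time quoted in the text: 30 to 80 percent
for reduction in (0.30, 0.80):
    remaining = screen_hours * (1 - reduction)
    print(f"{reduction:.0%} shorter: {remaining:.1f} h per unit, saving {screen_hours - remaining:.1f} h")
```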

    3.3.5 Reliability Growth/Enhancement Planning

    Traditionally, the need for Reliability Growth planning has been for large subsystems or

    systems. This is simply because of the greater risk in new product development at that level compared to the component level. Also, in programs where one wishes to push mature

    products or complex systems to new reliability milestones, inadequate strategies will be


    costly. A program manager must know if Reliability Growth can be achieved under required

    time and cost constraints. A plan of attack is required for each major subsystem so that

    system-level reliability goals can be met. However, Reliability Growth planning is

    recommended for all new platforms, whether they are complex subsystems or simple

    components. In a commercial environment with numerous product types, the emphasis must

    be on platforms rather than products. Often there may be little time to validate, let alone assess, reliability. Yet, without some method of assessment, platforms could be jeopardized.

    Accelerated testing is, without question, the featured Reliability Growth tool for industry. It is

    important to devise reliability planning during development that incorporates the most time- and cost-effective testing techniques available.

    Reliability growth can occur at the design and development stage of a project but most of the

    growth should occur in the first accelerated testing stage, early in design. Generally, there are

    two basic kinds of Reliability Growth test methods used: constant stress testing and step-stress testing. Constant stress testing applies an elevated stress maintained at a particular

    level over time, such as isothermal aging, in which parts are subjected to the same

    temperature for the entire test (similar to a burn-in). Step-stress testing can apply to such

    stresses as temperature, shock, vibration, and Highly Accelerated Life Test (HALT). These

    tests stimulate potential failure modes, and Reliability Growth occurs when failure modes are

    fixed. No matter what the method, Reliability Growth planning is essential to avoid wasting

    time and money when accelerated testing is attempted without an organized program plan.

    Today's competitive market requires thorough planning, especially since platform complexity

    has increased dramatically as competition and technological advances have driven down size

    and costs. Traditional methodology provides us with many valuable tools, such as test

    planning, growth tracking and assessment, fix-effectiveness factor estimation, corrective action review team operations, and Test-Analyze-And-Fix (TAAF) strategies. All methods use a FRACAS-type approach to audit corrective actions.
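    For growth tracking and assessment, one widely used model is the Crow-AMSAA (power-law) model. It is not named in the text, so the sketch below is one possible approach, with hypothetical failure times from a time-terminated TAAF test. A fitted shape parameter below 1 indicates that the failure intensity is falling, i.e. reliability is growing.

```python
from math import log


def crow_amsaa_fit(failure_times, total_time):
    """Maximum-likelihood fit of the Crow-AMSAA model for a time-terminated test.

    Returns (beta, lam) for the failure intensity rho(t) = lam * beta * t ** (beta - 1).
    """
    n = len(failure_times)
    beta = n / sum(log(total_time / t) for t in failure_times)
    lam = n / total_time ** beta
    return beta, lam


def instantaneous_mtbf(beta, lam, t):
    """Reciprocal of the failure intensity at cumulative test time t."""
    return 1.0 / (lam * beta * t ** (beta - 1))


# Hypothetical cumulative failure times (hours) from a TAAF test stopped at 1000 h
times = [12.0, 45.0, 110.0, 230.0, 410.0, 700.0]
beta, lam = crow_amsaa_fit(times, 1000.0)
print(f"beta = {beta:.2f}, current MTBF = {instantaneous_mtbf(beta, lam, 1000.0):.0f} h")
```

    For a time-terminated test the current MTBF estimate simplifies to T / (n x beta), which gives a quick hand check on the result.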

    There are numerous types of accelerated tests including HALT, Step-Stress, Highly

    Accelerated Stress Screening (HASS), Environmental Stress Screening (ESS), failure-free

    accelerated testing, Reliability Growth and accelerated reliability growth. These practices are

    all important, since each is used today in both commercial and industrial applications for ensuring product reliability. The methods have not been without confusion: it is not always clear when and which test method should be used, or how much Reliability Growth can be achieved with any one method. The methods should be integrated and implemented throughout the product development cycle. Table 1 summarizes the approach and how these tests fit into the product life cycle.

    Table 1: Accelerated test methods and where they fit in the product life cycle

    Reliability Growth or Reliability Enhancement

    Stage of product life cycle: design and development

    Definitions and uses: Reliability Growth is the positive improvement in a reliability parameter over a period of time due to changes in product design or the manufacturing process. A Reliability Growth program is commonly established to help systematically plan for reliability achievement over a program's duration so that resources and reliability risks can be managed.

    Understanding Customer Requirements

    Stage of product life cycle: proposal and concept design

    Definitions and uses: Understanding Customer Requirements is a common sense topic that is often overlooked. It can be a simple exercise or include a full approach. In a full approach, tools such as FMEA, competitive benchmarking, QFD and reliability predictive modeling are used to assure the smartest approach in a product maturation program.

    HALT (Highly Accelerated Life Test)

    Stage of product life cycle: design and development

    Definitions and uses: HALT is a type of step-stress test that often combines two stresses, such as temperature and vibration. This highly accelerated stress test is used for finding failure modes as fast as possible and assessing product risks. Frequently it exceeds the equipment-specified limits.

    Step-Stress Test

    Stage of product life cycle: design and development of units or components

    Definitions and uses: Exposing small samples of product to a series of successively higher steps of a stress (like temperature), with a measurement of failures after each step. This test is used to find failures in a short period of time and to perform risk studies.

    Failure-Free Test or demonstration test

    Stage of product life cycle: post design

    Definitions and uses: This is also termed zero failure testing. This is a statistically significant reliability test used to demonstrate that a particular reliability objective can be met at a certain level of confidence. For example, the reliability objective may be 1000 FITs (1 million hours MTTF) at the 90 percent confidence level. The most efficient statistical sample size is calculated when no failures are expected during the test period. Hence the name.

    ESS (Environmental Stress Screening)

    Stage of product life cycle: production

    Definitions and uses: This is an environmental screening test or tests used in production to weed out latent and infant mortality failures.

    HASS (Highly Accelerated Stress Screen)

    Stage of product life cycle: production

    Definitions and uses: This is a screening test or tests used in production to weed out infant mortality failures. This is an aggressive test since it implements stresses that are higher than common ESS screens. When aggressive levels are used, the screening should be established in HALT testing.
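    The failure-free test example in Table 1 (1000 FITs, i.e. an MTTF of 1 million hours, at 90 percent confidence) translates directly into a required amount of testing. This sketch assumes exponential lifetimes (a constant failure rate, which is what a FIT figure implies) and pooled unit-hours; the number of units on test is hypothetical.

```python
from math import log


def zero_failure_test_hours(mttf_hours, confidence):
    """Total unit-hours with no failures needed to demonstrate MTTF >= mttf_hours.

    Assumes exponentially distributed lifetimes (constant failure rate).
    """
    return -log(1.0 - confidence) * mttf_hours


fit = 1000                     # failures per 10**9 device-hours
mttf = 1e9 / fit               # 1,000,000 hours
hours = zero_failure_test_hours(mttf, confidence=0.90)
print(f"{hours:,.0f} device-hours without a failure")   # roughly 2.3 million

units = 2500                   # hypothetical number of units on test
print(f"{hours / units:,.0f} hours each for {units} units")
```

    If a failure does occur during the test, this plan does not demonstrate the objective, and a longer test that allows for failures would be needed.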


    The basic methodology in Reliability Growth planning should, as a minimum, consist of the following:

    The design of appropriate accelerated tests to stress expected or unexpected failure modes of the subsystems. These tests should be chosen and designed to stimulate failures at a faster rate. An effective program will include a streamlined root-cause corrective action plan. Without a complete plan, accelerated testing will be wasted and Reliability Growth opportunities lost.

    The correct use of historical acceleration factors and Reliability Growth parameters. This will enable estimates of accelerated Reliability Growth over the program's testing phases to be generated.

    The accurate tracking and assessment of Reliability Growth during and after each test phase and corrective action. Correct assessment of fix effectiveness is needed to evaluate properly whether growth goals and milestones are being met. Planning Reliability Growth is not enough; periodic assessments of reliability are also essential so that management is assured that achieving the planned Reliability Growth goals is realistic.

    Good monitoring and management of corrective actions using FRACAS.
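    Fix effectiveness is often expressed as a factor between 0 and 1: the fraction of a failure mode's rate that a corrective action removes. The sketch below is a simplified projection in the spirit of Crow's method rather than anything given in the text; it ignores failure modes that have not yet appeared in test, and all counts and factors are hypothetical.

```python
def projected_failure_rate(test_hours, unfixed_failures, fixed_modes):
    """Simplified projection of the failure rate after delayed fixes.

    unfixed_failures: failures from modes that will not be corrected.
    fixed_modes: list of (failures observed, fix effectiveness factor d in [0, 1]).
    A fix with effectiveness d removes that fraction of the mode's failure rate.
    Modes not yet seen in test are ignored (a full projection adds a term for them).
    """
    residual = unfixed_failures + sum(n * (1.0 - d) for n, d in fixed_modes)
    return residual / test_hours


T = 2000.0                                   # hypothetical test hours
before = (3 + 4 + 5) / T                     # all 12 observed failures
after = projected_failure_rate(T, unfixed_failures=3, fixed_modes=[(4, 0.7), (5, 0.8)])
print(f"demonstrated MTBF {1 / before:.0f} h, projected MTBF {1 / after:.0f} h")
```

    The projection is only as good as the effectiveness factors, which is why they need to be assessed against test evidence rather than assumed.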

    3.4 FRACAS

    A Failure Reporting, Analysis and Corrective Action System (FRACAS) is a closed-loop, coordinated system that is used to manage failures throughout the product life cycle. It records information about failures of a product, including when and where they occurred, and it also ensures that corrective action details are documented.

    It is used from the beginning of a project until the product is withdrawn from service. All personnel use the system and are responsible for any FRACAS they own.

    One group has overall control of the FRACAS administration. Its function is to ensure that every FRACAS raised is acted upon and closed out, and to collect lessons learned and feed them forward as appropriate.

    For example: an engineer is working on the reliability growth programme and has

    encountered a failure of the product during the step-stress test of the unit. The engineer

    should raise a FRACAS, stating the problem and giving the unit number, the date of failure, what the unit was doing when it failed and a reference to the test conditions. This FRACAS

    would then be passed to an engineer to diagnose the fault and ultimately find a fix. The unit

    can be fixed and the testing resumed. However, the FRACAS is not closed until the fix has

    been documented and implemented in the design and ultimately drawings (assuming the fault

    is due to design). The owner of this FRACAS is responsible for seeing the problem through

    and an engineering review of FRACAS would show all outstanding open FRACAS. This

    ensures that problems are getting solved rather than stacked up.
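    The closed-loop rule in this example, that a FRACAS stays open until the fix is documented and implemented rather than merely until the unit is repaired, can be sketched as a small data structure. The field names, identifiers and closure checks are hypothetical; a real FRACAS would hold far more information.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FracasReport:
    """One FRACAS entry; fields follow the example in the text, names are illustrative."""
    report_id: str
    owner: str
    unit_number: str
    failure_date: date
    problem: str
    activity_at_failure: str
    test_conditions_ref: str
    fix_description: Optional[str] = None
    fix_in_drawings: bool = False   # assumes a design fault, so drawings must change
    closed: bool = False

    def close(self) -> None:
        # Repairing the unit is not enough: the fix must be documented and implemented
        if not self.fix_description or not self.fix_in_drawings:
            raise ValueError(f"{self.report_id}: fix not yet documented and implemented")
        self.closed = True


def open_reports(reports):
    """What an engineering review would list: every report not yet closed."""
    return [r for r in reports if not r.closed]


report = FracasReport("FR-0001", "test engineer", "SN-042", date(2020, 1, 15),
                      "no output after vibration step", "step-stress test, step 4",
                      "test procedure reference (hypothetical)")
print(len(open_reports([report])))   # 1: the report is still open
```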

    3.5 In-service data analysis

    Data can be collected from the field and analysed for a number of reasons. These broadly include checking that the product's in-service reliability is meeting requirements; looking at the


    service performance and identifying any systematic faults that can be fixed in this and future

    products.

    Exploratory Data Analysis can be used on such data to look for trends or systematic failures.

    To support such analysis, the data recorded should include the following basic information:

    Date of failure

    Reason for removal from field

    Serial number of product removed

    Product type

    Customer

    Length of time operating prior to failure

    Date entered service

    After diagnosis of the product, the following information is required:

    Root cause of failure

    Repair information: what was removed and repaired, and when

    Diagnostic information

    Category of failure, i.e. whether it was a design, component, manufacturing, misuse or diagnostic failure, or whether the unit was working (no fault found)

    Date shipped out after repair

    This type of information allows the product support organisation within a company to analyse

    the data to identify where the majority of failures are occurring.

    For example:


    [Figure: "Agree categorisation of all unit removals" splits into "Confirmed faulty" (determine real cause of confirmed removals: Design, Build, Supplier, Test) and "Confirmed not faulty" (External fault; No fault found, with "Find root cause of N.F.F. removals" and "Understand why fault isolation fails": Troubleshooting, System design, Fault isolation manual)]

    Figure 11: Example of root cause analysis

    If the analysis shows that most confirmed failures are due to build problems, the data can then be filtered and analysed to explore what types of build failures are occurring and, eventually, to investigate ways of reducing such failures.
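    The filtering step described above amounts to a Pareto count by failure category, followed by a second count within the dominant category. The records and field names below are hypothetical and cover only a few of the fields listed earlier.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Removal:
    serial_number: str
    category: str     # e.g. "design", "component", "build", "misuse", "no fault found"
    root_cause: str


removals = [  # hypothetical in-service removal records
    Removal("A100", "build", "dry solder joint"),
    Removal("A101", "no fault found", "intermittent connector"),
    Removal("A102", "build", "missing fastener"),
    Removal("A103", "component", "capacitor short"),
    Removal("A104", "build", "dry solder joint"),
]

# Pareto of confirmed failures only (no-fault-found removals are analysed separately)
confirmed = [r for r in removals if r.category != "no fault found"]
by_category = Counter(r.category for r in confirmed).most_common()
print(by_category)

# Drill into the dominant category
top_category = by_category[0][0]
causes = Counter(r.root_cause for r in confirmed if r.category == top_category).most_common()
print(top_category, causes)
```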

    3.6 Risk analysis

    Risk management applies to all new product development. Common technical risk areas

    include performance, producibility, production, scheduling and resources. Risk varies

    depending on whether customer requirements match technology performance capability

    predictions, if field experience is available on analogous assemblies, if the technology is

    revolutionary or evolutionary, if the application is new, if the intended use environment is

    harsh and different from previous field experience, and so forth. Risks are often assessed in

    categories. A technology management risk matrix is often used in industry as shown below:

                          Evolutionary      Revolutionary
    Same application      Low Risk          High Risk
    New application       Moderate Risk     Very High Risk
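    The matrix is a simple lookup, and ranking subsystems by it is one way to direct resources to the highest-risk areas first. The subsystem names and their classifications in this sketch are hypothetical.

```python
# Technology management risk matrix from the text
RISK_MATRIX = {
    ("evolutionary", "same application"): "low",
    ("revolutionary", "same application"): "high",
    ("evolutionary", "new application"): "moderate",
    ("revolutionary", "new application"): "very high",
}
RISK_ORDER = ["low", "moderate", "high", "very high"]


def technology_risk(technology: str, application: str) -> str:
    """Risk category for a (technology, application) pair, case-insensitive."""
    return RISK_MATRIX[(technology.lower(), application.lower())]


# Hypothetical subsystems: (name, technology, application)
subsystems = [
    ("power supply", "evolutionary", "same application"),
    ("new sensor", "revolutionary", "new application"),
    ("control software", "evolutionary", "new application"),
]

# Highest-risk subsystems first
ranked = sorted(subsystems, key=lambda s: RISK_ORDER.index(technology_risk(s[1], s[2])), reverse=True)
for name, tech, app in ranked:
    print(f"{name}: {technology_risk(tech, app)} risk")
```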

    Revolutionary technologies carry a higher risk. For example, when the first airplanes were

    developed in the early 1900s, flying these early machines often resulted in injury or death.

    Now that flying is a mature technology, the risks of flying are very low. Evolutionary changes

    to the aircraft having similar applications today carry low risks since the technology is mature.

    The goal of a risk management program is to make correct decisions at key points in the

    program. Technology risk management is essential to the success of any development program. Risk issues and their consequences concern everyone involved with a program's

    success. The larger and the more undeveloped a technical program, the more important it is to


    manage risks. In the case of a reasonably large and/or complex program, many technical

    details can impact the system. The purpose of risk management is to identify, assess and

    mitigate risks throughout the project.

    Since component and subsystem risks are magnified at the system level, it is important that

    program management becomes aware of issues early in the program. All potential risk areas require identification and risk handling. Management can then direct resources to prioritized

    risk areas and conserve valuable time and expenses. These benefits are best realized when

    technical risk issues can be properly identified, assessed, quantified, and finally handled both

    at the system and the subsystem level.

    3.7 REMM

    The REMM project developed a methodology that supports product reliability enhancement

    and consequently is viewed primarily as a design for reliability tool, although it could be used by management and product support to view the reliability status of products. It is viewed as a

    move away from playing the numbers game using reliability predictions towards a more

    considered approach. Its purpose is to encourage design for reliability by providing engineers

    with reliability information and lessons learned on previous designs as well as providing a

    proactive, holistic approach to design.

    Current reliability prediction techniques for electronics have been shown to be

    unrepresentative of real situations. There is a dichotomy between the way reliability is

    assessed (predicted) and the way it is actually achieved.

    Many aerospace companies continue to use MIL Handbook 217 for reliability predictions.

    Usually the predictions are made on a parts count basis and then the overall predicted failure

    rate is factored using the in-service performance of previous products. If a prediction was made for a product that is now in the field, a multiplicative factor can be obtained by comparing its in-service reliability with that prediction. Such factors are then used in

    predictions for new designs.
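    The calculation described above can be sketched as follows. The part counts and failure rates are invented rather than taken from MIL Handbook 217, the handbook's quality and environment factors are omitted, and the field factor comes from a hypothetical earlier product. Summing rates in this way already assumes constant failure rates, the weakness discussed next.

```python
# Hypothetical parts list: (part type, quantity, generic failure rate per 10**6 h)
parts = [
    ("microcircuit", 12, 0.05),
    ("resistor", 150, 0.002),
    ("capacitor", 80, 0.004),
    ("connector", 6, 0.1),
]


def parts_count_failure_rate(parts):
    """Sum of quantity x failure rate; assumes constant failure rates in series."""
    return sum(qty * rate for _, qty, rate in parts)


predicted = parts_count_failure_rate(parts)      # failures per 10**6 h

# Field factor from a previous product: observed in-service rate / its prediction
old_prediction, old_in_service = 2.0, 1.2        # hypothetical, per 10**6 h
field_factor = old_in_service / old_prediction

adjusted = predicted * field_factor
print(f"predicted MTBF {1e6 / predicted:,.0f} h, factored MTBF {1e6 / adjusted:,.0f} h")
```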

    The problem with this approach is that the product reliability is not really considered. The

    parts count prediction is based on the number of components of each type and their

    corresponding failure rates and as is widely known this is all based on the assumption of

    constant failure rates. In addition there is a temptation when given an MTBF requirement to

    use the prediction to show that it will be met without doing any engineering analysis. In other

    words the prediction is often altered rather than the design or manufacture of the product.

    The R&M case was developed by the UK MOD as a move away from prescribing reliability

    methods and tools towards a more measured approach to achieving reliable products. The

    definition is given as "A reasoned, auditable argument created to support the contention that a defined system satisfies the R&M requirement", and the R&M case is therefore concerned with providing

    progressive assurance that the product will be reliable. The impetus for this move towards the

    R&M case is partly driven by experience. In the past the MOD, in their contracts, prescribed

    specific reliability techniques and tools to be performed. They believed that using such tools

    and techniques would produce the required reliable products. However, in many cases, the

    opposite occurred. Payments were made to suppliers for producing the required documents; however, no evidence was required to show that any of the tools and techniques applied actually affected the product design and development. In other words, the focus was on

    producing documents rather than showing that the product would meet the necessary


    reliability requirements in service. This situation is similar to the discussion on prediction

    outlined earlier and highlights that there is a synergy between the aims of REMM and the

    R&M case.

    REMM Process

    The REMM process was developed by the project team based on the philosophy of

    evolutionary design of products, i.e. any new product design can be based on an existing

    product design. To design a new product the REMM philosophy urges the user to consider

    the differences between the new product design and previous product designs. These

    differences would include functionality, process (design and manufacture) and environmental

    changes to enable the project team to concentrate on the risky aspects of the new product

    design and development. Figure 12 below shows the simplified REMM process flow. As in DEF STAN 00-42 Part 3, the REMM process starts with analysing reliability requirements and

    then moves on to identifying the reliability risks. The risks are identified in REMM by

    identifying the novel aspects of the product design, manufacture, application and use.

    Once the requirements have been captured, a similar product is identified and its associated data is

    analysed in order to inform the project team of any reliability issues. The new design is

    therefore altered in light of this data analysis. For example, if a particular component used in

    a previous product and likely to be used in the new product is shown from analysis to be

    causing reliability problems in the field then by highlighting such issues the project team can

    make informed decisions regarding the new product design.

    [Figure: boxes "Capture requirements", "Capture differences between base and new design", "Select similar equipment", "Service data", "Review lessons learned through data analysis", "Modify design", "Reliability plan generated via expert system", "Reliability tasks guidance", "Reliability assessment model" and "Reliability case"]

    Figure 12: Simplified REMM process

    The REMM process therefore encourages the active use of lessons learned from previous

    product manufacture and performance.

    A REMM tool has been developed and implemented at Goodrich engine controls to

    encompass the process shown in Figure 12. The tool allows the user to design a new product based on all previous products. Obviously there will be changes to designs over time for

    numerous reasons, e.g. new technology, additional functionality, different installations, new


    environment etc. These differences between the new design and previous designs are

    captured by the tool and fed into the REMM expert system. Capturing these differences

    focuses the project team on the risky aspects of this new product.

    High Level Tasks - REMM Expert system

    The REMM expert system has been implemented at Goodrich engine controls and is a

    knowledge-based expert system. It has been populated with rules that were written by

    interviewing chief engineers across the different disciplines, i.e. electronic, mechanical,

    components, process and reliability. The rules consist of all the possible changes structured

    hierarchically. So for example, Environmental changes can be due to vibration, thermal,

    humidity and shock changes. Looking at vibration changes, the input fact could be, for

    example:

    Small change in level of vibration; Significant change in level of vibration; Small change in vibration frequency range; Significant change in vibration frequency range.

    Rules for new product design aspects have also been generated and so the differences

    captured include both actual differences and novel aspects.

    The REMM tool is used at the concept stage to design the new product using information on

    previous designs. One of the outputs of the tool is a task list generated from the expert

    system. The suggested tasks may well be ones that are already planned, but the list gives a starting point for developing the reliability plan because it focuses on the high-level risk areas in the new

    product design. The task list can therefore be used to develop the reliability programme and

    plan.
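The hierarchical rule structure described above can be illustrated with a minimal Python sketch. It assumes the rule base is a nested lookup from category to sub-category to input fact, with each fact mapped to suggested tasks. The four vibration facts come from the text; the thermal fact, all task names and the `suggest_tasks` function are hypothetical and do not reproduce the rules elicited at Goodrich.

```python
# Hypothetical rule base: category -> sub-category -> input fact -> suggested tasks.
# The vibration facts follow the text; the tasks are illustrative placeholders only.
RULES = {
    "environment": {
        "vibration": {
            "small change in level of vibration": ["review vibration analysis"],
            "significant change in level of vibration": [
                "repeat vibration analysis",
                "plan vibration qualification test",
            ],
            "small change in vibration frequency range": ["review resonance survey"],
            "significant change in vibration frequency range": [
                "repeat resonance survey",
                "plan vibration qualification test",
            ],
        },
        "thermal": {
            "significant change in operating temperature range": ["repeat thermal analysis"],
        },
    },
}


def suggest_tasks(facts):
    """Map captured (category, sub-category, fact) triples to a de-duplicated task list.

    Facts with no matching rule are returned separately (an assumption of this
    sketch) so they can be flagged for engineering review.
    """
    tasks, unmatched = [], []
    for category, sub_category, fact in facts:
        matched = RULES.get(category, {}).get(sub_category, {}).get(fact)
        if matched is None:
            unmatched.append(fact)
            continue
        for task in matched:
            if task not in tasks:  # keep first-seen order, drop duplicates
                tasks.append(task)
    return tasks, unmatched


# Hypothetical differences captured for a new design.
captured = [
    ("environment", "vibration", "significant change in level of vibration"),
    ("environment", "vibration", "significant change in vibration frequency range"),
    ("environment", "humidity", "new sealing arrangement"),
]
tasks, unmatched = suggest_tasks(captured)
print(tasks)
print(unmatched)
```

De-duplicating the list matters because several captured differences can trigger the same task, and the resulting list is only a starting point for the reliability plan.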

    The tool developed at Goodrich establishes the skeleton of the product reliability case when

    the expert system is run. The skeleton reliability case consists of the product description in

    terms of functionality, installation, use, environment and technology; it lists the hardware and

    provides a list of all the differences and novel aspects of the design. It also contains the task

    list and provides references to guidance material or procedures related to specific tasks. When

    a task has been completed, the case is updated with a reference to a company report detailing

    the outcome and solution. To close the loop, the tool ought to be linked to the FRACAS system to ensure that the solution found is implemented to improve the reliability of the

    product.
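The contents of the skeleton reliability case can be sketched as a Python data structure. The field names, the `Task` and `ReliabilityCase` classes and the document references (`PROC-VIB-01`, `RPT-0001`) are hypothetical; they mirror the elements listed in the text rather than the actual Goodrich tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    name: str
    guidance_ref: str = ""             # reference to guidance material or a procedure
    report_ref: Optional[str] = None   # company report, added when the task is completed

    @property
    def completed(self) -> bool:
        return self.report_ref is not None


@dataclass
class ReliabilityCase:
    description: dict[str, str]  # functionality, installation, use, environment, technology
    hardware: list[str]
    differences: list[str]       # differences and novel aspects of the design
    tasks: list[Task] = field(default_factory=list)

    def complete_task(self, name: str, report_ref: str) -> None:
        """Record the report detailing the outcome and solution of a task."""
        for task in self.tasks:
            if task.name == name:
                task.report_ref = report_ref
                return
        raise KeyError(f"no task named {name!r}")

    def open_tasks(self) -> list[str]:
        return [t.name for t in self.tasks if not t.completed]


# Hypothetical example case.
case = ReliabilityCase(
    description={"environment": "engine-mounted, higher vibration than predecessor"},
    hardware=["housing", "valve assembly", "control board"],
    differences=["significant change in level of vibration"],
    tasks=[Task("repeat vibration analysis", guidance_ref="PROC-VIB-01")],
)
case.complete_task("repeat vibration analysis", report_ref="RPT-0001")
print(case.open_tasks())  # []
```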

    At present the REMM tool is most useful at the beginning of the project and identifies only

    fairly high-level tasks; by also implementing the statistical model, the project team can identify

    lower-level, specific risks with which to update the skeleton reliability case document.

    Low Level Tasks - REMM statistical model

    The REMM statistical model aims to estimate the reliability of a new design at specified times in the product lifecycle, to aid engineering understanding of performance and to help inform

    downstream processes that will support enhancement. The model will support what-if


    analysis. This means that alternative scenarios can be investigated and the impact on

    reliability estimated. Therefore the REMM statistical model provides a tracking system to

    help analyse how reliability will and does evolve throughout the lifecycle by integrating the

    numerical estimates with the engineering understanding of reliability. This is a deliberate

    move to a proactive approach to design for reliability.

    The REMM statistical model requires engineering concerns about potential faults in the item

    of equipment to be identified; it then estimates when these concerns are likely to occur as failures under

    different scenarios. Because it combines engineering judgement with event data, it is a Bayesian-type statistical model.

    Two basic types of input are required to populate the model:

    Engineering concerns with the new design, and an estimate of the probability of each occurring in service assuming no action is taken;

    The reliability profile of the engineering concern, or of the fault class to which it belongs.

    The data therefore come from two sources: structured engineering judgement and historical event data.

    The basic output from the REMM statistical model will be the estimated reliability function,

    also known as the survival function. This provides an estimate of the probability of surviving

    until a specified time. Figure 13, below, shows the model formulation, in which the reliability of the

    new design is a combination of expert judgement and event data with respect to pre-defined

    categories.

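    Figure 13: Model formulation (for each failure class, i.e. build, component and design, expert judgement supplies the faults $N_B$, $N_C$, $N_D$ and event data supplies the class reliability $R_B(t)$, $R_C(t)$, $R_D(t)$; these combine into the reliability estimate for the new system design)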
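The notes do not give the equations behind figure 13, so the Python sketch below uses one simple formulation as an assumption. Each elicited concern $i$ in failure class $j$ is taken to be realised as a fault with probability $p_i$ and, if realised, to fail according to the class's time-to-failure distribution $F_j(t)$ estimated from event data; concerns are assumed independent and in series. The reliability (survival) function $R(t) = P(T > t)$ is then

$$R(t) = \prod_j R_j(t), \qquad R_j(t) = \prod_{i \in j} \bigl(1 - p_i F_j(t)\bigr).$$

The Weibull form of $F_j(t)$ and all numerical values are hypothetical.

```python
import math


def weibull_cdf(t, shape, scale):
    """Probability that a realised fault has caused a failure by time t (assumed Weibull)."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def class_reliability(t, probs, shape, scale):
    """R_j(t): probability that no concern in failure class j has failed by time t.

    Each concern is realised with probability p_i and, if realised, fails
    according to the class's Weibull profile; concerns are independent.
    """
    f = weibull_cdf(t, shape, scale)
    r = 1.0
    for p in probs:
        r *= 1.0 - p * f
    return r


def system_reliability(t, classes):
    """R(t) = R_B(t) * R_C(t) * R_D(t): series combination of the failure classes."""
    r = 1.0
    for probs, shape, scale in classes.values():
        r *= class_reliability(t, probs, shape, scale)
    return r


# Hypothetical inputs: elicited concern probabilities per class and
# assumed Weibull (shape, scale in operating hours) profiles from event data.
classes = {
    "build": ([0.2, 0.1], 1.0, 4000.0),
    "component": ([0.3], 0.8, 10000.0),
    "design": ([0.4, 0.25, 0.1], 2.0, 1500.0),
}

for t in (0, 500, 1000, 2000):
    print(t, round(system_reliability(t, classes), 3))
```

Because a concern may never be realised as a fault, $R(t)$ in this formulation levels off at $\prod_i (1 - p_i)$ rather than falling to zero.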

    The engineering concerns data, or expert judgement data, is collected by conducting one-to-one

    semi-structured interviews with members of the project team. Prior to the individual elicitation

    interviews, a group session is held to identify the changes in the new design as well as the

    novel aspects, and to brief each expert on the process.


    In the individual interviews each expert is asked to concentrate on the new product and

    consider the following questions:

    Do you have any concerns about any aspect of the product? What are they?

    What is the likelihood that your concerns will cause a failure in the field? This is done by asking the designers and specialist experts to rate risks on a scale of 0 to 1.

    What mitigation action could be taken to avoid this concern occurring as a failure?

    The resulting information is used to provide a list of concerns for the project team as well as

    an input to the statistical model.

    The concern list on its own can be used as a risk assessment, and the project team can decide

    what actions to take to reduce the risk and improve the reliability. The concern list can be

    added to the skeleton reliability case and updated as mitigating actions are taken and concerns

    resolved. This process is iterative and would be implemented at key stages in the design and development process. The reliability case would therefore be a living document

    that grows as more information is gathered.
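The concern list can be kept as a simple risk register. This Python sketch assumes each concern records its category, the elicited likelihood on the 0 to 1 scale, a mitigation action and whether it has been resolved; the `Concern` class, the `risk_register` function and the example concerns are hypothetical (the categories echo figure 14).

```python
from dataclasses import dataclass


@dataclass
class Concern:
    description: str
    category: str
    likelihood: float  # elicited probability of causing a field failure, 0 to 1
    mitigation: str = ""
    resolved: bool = False


def risk_register(concerns):
    """Open concerns ordered from most to least likely to cause a field failure."""
    for c in concerns:
        if not 0.0 <= c.likelihood <= 1.0:
            raise ValueError(f"likelihood out of range for {c.description!r}")
    open_items = [c for c in concerns if not c.resolved]
    return sorted(open_items, key=lambda c: c.likelihood, reverse=True)


# Hypothetical concerns from elicitation interviews.
concerns = [
    Concern("bearing preload may relax", "preload", 0.4, "add preload verification test"),
    Concern("seal material may corrode", "corrosion", 0.15, "change seal material"),
    Concern("grease may degrade at temperature", "lubrication", 0.25, resolved=True),
]
for c in risk_register(concerns):
    print(f"{c.likelihood:.2f}  {c.category:<12} {c.description}")
```

Resolved concerns drop out of the register but stay in the list, so the reliability case keeps a record of how each risk was closed.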

    Running the model provides more information about the reliability of the product at a specific

    time in the product development process. The reliability estimate can be compared with the

    reliability requirement. If the estimate is lower than the requirement, closer inspection of the results can help

    to identify the key tasks needed to improve the reliability. For example, in figure 13, assuming the

    overall reliability estimate is unacceptable, the data for the category contributing most to

    the shortfall could be investigated further. In figure 14, looking at the data for the preload

    concerns can help guide the project team towards mitigating actions that may improve the

    reliability. What-if analysis can then be used to identify the actions that would give the biggest

    improvement in reliability, thereby providing guidance to the project team.
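The comparison with the requirement and the what-if analysis can be sketched in Python, under the assumed formulation $R(t) = \prod_i \bigl(1 - p_i F(t)\bigr)$ with a Weibull $F(t)$ for each concern category. A mitigating action is modelled, as an assumption, by lowering the concern's probability; each action is evaluated on its own and the actions are ranked by the gain in $R(t)$ at a chosen operating time. All names, parameters and the requirement value are hypothetical.

```python
import math


def reliability(t, concerns, profiles):
    """R(t) = product over concerns of [1 - p_i * F_category(t)], assuming independence."""
    r = 1.0
    for _, category, p in concerns:
        shape, scale = profiles[category]
        f = 1.0 - math.exp(-((t / scale) ** shape))  # assumed Weibull CDF
        r *= 1.0 - p * f
    return r


def what_if(t, concerns, profiles, mitigations):
    """Rank mitigating actions by the gain in R(t) each gives when applied alone.

    `mitigations` maps a concern name to its probability after the action.
    """
    base = reliability(t, concerns, profiles)
    gains = []
    for name, new_p in mitigations.items():
        modified = [(n, c, new_p if n == name else p) for n, c, p in concerns]
        gains.append((name, reliability(t, modified, profiles) - base))
    return base, sorted(gains, key=lambda g: g[1], reverse=True)


# Hypothetical Weibull (shape, scale) per category, elicited concerns and actions.
profiles = {"preload": (2.0, 1500.0), "corrosion": (3.0, 3000.0),
            "lubrication": (1.5, 2500.0), "cracking": (2.5, 5000.0)}
concerns = [("preload A", "preload", 0.4), ("corrosion A", "corrosion", 0.2),
            ("lubrication A", "lubrication", 0.25), ("cracking A", "cracking", 0.1)]
mitigations = {"preload A": 0.1, "corrosion A": 0.05, "lubrication A": 0.1}

required, t = 0.85, 1000.0
base, ranked = what_if(t, concerns, profiles, mitigations)
print(f"R({t:.0f}) = {base:.3f} against a requirement of {required}")
for name, gain in ranked:
    status = "meets" if base + gain >= required else "below"
    print(f"  mitigate {name}: +{gain:.3f} ({status} requirement)")
```

With these invented inputs the preload action gives by far the largest gain and is the only single action that lifts the estimate above the requirement.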


    Figure 14: Example of output from REMM statistical model (chart "Estimated F-Lynx": estimated reliability, 0 to 1, against operational time, 0 to 2000, for the cracking, lubrication, corrosion and preload categories and the overall estimate)

    All the analysis undertaken can be added to the reliability case to show how the product is

    being progressively improved by identifying and subsequently reducing product risks.

    Reliability Measurements

    From the discussion above, the REMM tool is used at the concept design stage to help build the reliability programme

    plan and the skeleton reliability case. The

    statistical model is implemented at key points in the design and development process. The

    concerns are updated throughout as more data is gathered.

    The reliability estimates can therefore be plotted at each of the key stages to show how the reliability estimate changes throughout the development of the product, and therefore how

    close the estimate is to the requirement. Figure 15, below, shows an example of reliability

    changing throughout the product life cycle.


    Figure 15: Example of reliability varying over the lifecycle (current and updated reliability estimates plotted against lifecycle stage, compared with the required reliability)
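Tracking the estimate against the requirement across lifecycle stages, as in figure 15, amounts to recording each model run and its shortfall. The stage names, estimates, requirement and both functions in this Python sketch are hypothetical.

```python
def track(stage_estimates, required):
    """Shortfall of each stage's estimate below the requirement (0.0 once it is met)."""
    return [(stage, r, max(0.0, required - r)) for stage, r in stage_estimates]


def first_stage_meeting(stage_estimates, required):
    """First lifecycle stage whose estimate meets the requirement, or None."""
    for stage, r in stage_estimates:
        if r >= required:
            return stage
    return None


# Hypothetical estimates from successive runs of the statistical model.
estimates = [
    ("concept", 0.62),
    ("preliminary design", 0.71),
    ("detailed design", 0.80),
    ("qualification", 0.86),
]
required = 0.85
for stage, r, gap in track(estimates, required):
    print(f"{stage:<20} R = {r:.2f}  shortfall = {gap:.2f}")
print("requirement first met at:", first_stage_meeting(estimates, required))
```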

    4 Reliability management

    Key aspects of reliability management include:

    Corporate-level involvement

    Reliability as an integral part of product development, not a parallel activity

    Reliability procedures integrated into the design process

    Reliability built into the programme plan, with a reliability plan produced

    Ownership of the reliability plan within the design team

    A reliability plan should contain the following:

    Statement of reliability requirement

    Organisation for reliability

    Reliability activities to be performed and why

    Timing of major activities

    Management of suppliers

    Standards and company procedures to be used


    Lessons learned feedback

    Risk analysis/risk register

    Data collection and analysis procedure

    Reliability monitoring plan

    Reference

    O'Connor (2002). Chapter 15, Reliability management. Practical Reliability Engineering. John Wiley.

    5 Summary

    These lecture notes explain why reliability is important and give an overview of the

    main aspects of reliability engineering, including:

    Need for reliability requirements;

    Need for planning to achieve reliability throughout the product life cycle;

    How to measure reliability;

    Other factors that are affected by poor reliability, such as safety, competitiveness, goodwill, maintenance costs and ultimately profit;

    Design for reliability;

    The bath-tub curve and reliability (life distributions);

    Reliability methods: FTA, ETA, FMEA, RBD;

    Reliability testing: reliability growth planning, ESS, HALT, HASS, step-stress testing;

    Need for good reliability management and planning;

    Managing risk;

    Using FRACAS;

    The importance of collecting and analysing in-service data.

    This lecture is an overview of reliability engineering and is intended to give an appreciation of

    the topic. It is not intended to be a reliability engineering manual, but it conveys the

    importance of the topic and how it fits into the design, development and use of a product.